Large VCF Annotation

For large, many-sample VCF inputs,

WGS studies often use down-stream tools which require VCF input
The OpenCRAVAT SQLite output database format performs poorly with large sample counts

The vcfanno tool will directly annotate a typical BGZipped VCF file, and produce a BGZipped output that can be indexed with tabix.

Usage

        usage: oc vcfanno [-h] [-a [ANNOTATORS ...]] [-t THREADS] [--temp-dir TEMP_DIR] [-o OUTPUT_PATH]
                        [--chunk-size CHUNK_SIZE]
                        input_path

positional arguments:
        input_path

options:
        -h, --help            show this help message and exit
        -a [ANNOTATORS ...], --annotators [ANNOTATORS ...]
        -t THREADS, --threads THREADS
                              Number of CPU threads to use
        --temp-dir TEMP_DIR   Temporary directory for working files
        -o OUTPUT_PATH, --output-path OUTPUT_PATH
                              Output vcf path (gzipped). Defaults to input_path.oc.vcf.gz
        --chunk-size CHUNK_SIZE
                              Number of lines to annotate in each thread before syncing to disk. Affects
                              performance.

Output Format

Annotations are added to the INFO field of the VCF. OpenCRAVAT provided keys are in the format OC_AnnotatorName_AnnotationName, for example a gnomAD4 allele frequency call would be OC_GNOMAD4_AF. Further details of each annotation are in the vcf header.

Some annotations provided by OpenCRAVAT are either complex structured data, or free-form text. These types of data can make vcf files difficult to read or parse. To resolve this, OpenCRAVAT fields are encoded using HTTP Percent-encoding. Structured data is typically also JSON encoded.

While this makes the data less human-readable, most users do not directly read large VCFs. These encoding schemes are widely used, and most programming languages have standard tools to decode them.

For example, the OC_ALL_ANNOTATIONS field for a missense variant that affects multiple transcripts is JSON encoded to

{"RP1":[["P56715","p.Asn985Tyr","MIS","ENST00000220676.2","c.2953A>T"],["","","INT","ENST00000636932.1","c.787+4547A>T"],["","","INT","ENST00000637698.1","c.787+4547A>T"]]}

And then HTTP Percent-encoded

%7B%22RP1%22%3A%5B%5B%22P56715%22%2C%22p.Asn985Tyr%22%2C%22MIS%22%2C%22ENST00000220676.2%22%2C%22c.2953A%3ET%22%5D%2C%5B%22%22%2C%22%22%2C%22INT%22%2C%22ENST00000636932.1%22%2C%22c.787%2B4547A%3ET%22%5D%2C%5B%22%22%2C%22%22%2C%22INT%22%2C%22ENST00000637698.1%22%2C%22c.787%2B4547A%3ET%22%5D%5D%7D