Large VCF Annotation
For large, many-sample VCF inputs, annotating the VCF directly is useful for two reasons:
WGS studies often use downstream tools that require VCF input.
The OpenCRAVAT SQLite output database format performs poorly with large sample counts.
The vcfanno tool will directly annotate a typical BGZipped VCF file and produce a BGZipped output that can be indexed with tabix.
Usage
usage: oc vcfanno [-h] [-a [ANNOTATORS ...]] [-t THREADS] [--temp-dir TEMP_DIR] [-o OUTPUT_PATH]
                  [--chunk-size CHUNK_SIZE]
                  input_path

positional arguments:
  input_path

options:
  -h, --help            show this help message and exit
  -a [ANNOTATORS ...], --annotators [ANNOTATORS ...]
  -t THREADS, --threads THREADS
                        Number of CPU threads to use
  --temp-dir TEMP_DIR   Temporary directory for working files
  -o OUTPUT_PATH, --output-path OUTPUT_PATH
                        Output vcf path (gzipped). Defaults to input_path.oc.vcf.gz
  --chunk-size CHUNK_SIZE
                        Number of lines to annotate in each thread before syncing to disk. Affects
                        performance.
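As a rough illustration of the options above, the following Python sketch assembles an oc vcfanno command line. The file names, the thread count, and the annotator module name gnomad4 are assumptions made for the example, and build_vcfanno_command is a hypothetical helper, not part of OpenCRAVAT.

```python
import shlex

def build_vcfanno_command(input_path, annotators=(), threads=None, output_path=None):
    """Assemble an `oc vcfanno` argument list from the documented options."""
    cmd = ["oc", "vcfanno", input_path]
    if threads is not None:
        cmd += ["-t", str(threads)]
    if output_path is not None:
        cmd += ["-o", output_path]
    if annotators:
        # -a accepts several names, so it is placed last to keep the
        # positional input_path from being read as an annotator.
        cmd += ["-a", *annotators]
    return cmd

# Hypothetical inputs: the paths and the "gnomad4" module name are assumptions.
cmd = build_vcfanno_command(
    "cohort.vcf.gz",
    annotators=["gnomad4"],
    threads=8,
    output_path="cohort.oc.vcf.gz",
)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) on a machine where OpenCRAVAT and the chosen annotators are installed.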
Output Format
Annotations are added to the INFO field of the VCF. OpenCRAVAT-provided keys are in the format OC_AnnotatorName_AnnotationName; for example, a gnomAD4 allele frequency would be OC_GNOMAD4_AF. Further details of each annotation are in the VCF header.
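Because every OpenCRAVAT key starts with OC_, the annotations can be separated from the other INFO entries with a simple prefix check. The sketch below assumes a raw INFO string taken from one VCF record; the function name opencravat_info_fields and the sample values are illustrative only.

```python
def opencravat_info_fields(info: str) -> dict:
    """Return the OC_-prefixed key/value pairs from a raw VCF INFO string."""
    fields = {}
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        if key.startswith("OC_"):
            # Flag-style entries have no "=", so store True for them.
            fields[key] = value if sep else True
    return fields

# Illustrative INFO string; the numbers are made up for the example.
info = "AC=2;AN=200;OC_GNOMAD4_AF=0.0012"
oc_fields = opencravat_info_fields(info)
print(oc_fields)
```

Values are returned exactly as stored in the file, so any encoding applied to them is still in place at this point.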
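Some annotations provided by OpenCRAVAT are either complex structured data or free-form text. These types of data can make VCF files difficult to read or parse, because characters such as semicolons, equals signs, commas, and whitespace have special meaning in the INFO field. To resolve this, OpenCRAVAT fields are encoded using percent-encoding (URL encoding). Structured data is typically also JSON-encoded before it is percent-encoded.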
While this makes the data less human-readable, most users do not directly read large VCFs. These encoding schemes are widely used, and most programming languages have standard tools to decode them.
For example, the OC_ALL_ANNOTATIONS field for a missense variant that affects multiple transcripts is first JSON-encoded as:
{"RP1":[["P56715","p.Asn985Tyr","MIS","ENST00000220676.2","c.2953A>T"],["","","INT","ENST00000636932.1","c.787+4547A>T"],["","","INT","ENST00000637698.1","c.787+4547A>T"]]}
It is then percent-encoded as:
%7B%22RP1%22%3A%5B%5B%22P56715%22%2C%22p.Asn985Tyr%22%2C%22MIS%22%2C%22ENST00000220676.2%22%2C%22c.2953A%3ET%22%5D%2C%5B%22%22%2C%22%22%2C%22INT%22%2C%22ENST00000636932.1%22%2C%22c.787%2B4547A%3ET%22%5D%2C%5B%22%22%2C%22%22%2C%22INT%22%2C%22ENST00000637698.1%22%2C%22c.787%2B4547A%3ET%22%5D%5D%7D
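Reversing the two steps recovers the structured data. This is a minimal sketch using only the Python standard library; it assumes the encoded value has already been extracted from the INFO field, and it does not assign meaning to the individual list positions beyond what the example shows.

```python
import json
from urllib.parse import unquote

# The percent-encoded OC_ALL_ANNOTATIONS value from the example above.
encoded = (
    "%7B%22RP1%22%3A%5B%5B%22P56715%22%2C%22p.Asn985Tyr%22%2C%22MIS%22%2C"
    "%22ENST00000220676.2%22%2C%22c.2953A%3ET%22%5D%2C%5B%22%22%2C%22%22%2C"
    "%22INT%22%2C%22ENST00000636932.1%22%2C%22c.787%2B4547A%3ET%22%5D%2C"
    "%5B%22%22%2C%22%22%2C%22INT%22%2C%22ENST00000637698.1%22%2C"
    "%22c.787%2B4547A%3ET%22%5D%5D%7D"
)

# Step 1: undo the percent-encoding. Step 2: parse the resulting JSON text.
annotations = json.loads(unquote(encoded))

for gene, transcripts in annotations.items():
    for row in transcripts:
        print(gene, row)
```

Note that unquote is used rather than unquote_plus, so a literal plus sign decoded from %2B (as in c.787+4547A>T) is preserved and is not turned into a space.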