Annotator reference
This section provides detailed reference documentation for each component of an annotator. An annotator consists of a Python file, a YAML file, a data directory, and an optional markdown file. The file structure is as follows:
annotator/
|───annotator.md
|───annotator.yml
|───annotator.py
└───data/
In an actual annotator, the name ‘annotator’ is replaced by the annotator’s name. The annotator’s data source (the directory annotator/data) will contain some configuration of data files. More information on the data source can be found at data.
annotator.md
The markdown file describes the module to prospective users. It is not required to run an annotator; however, it is required when publishing an annotator to the OpenCRAVAT store.
annotator.yml
The YAML file defines the input and output interfaces between an annotator and the rest of OpenCRAVAT. It specifies what data OpenCRAVAT will feed to annotator.py, and what data OpenCRAVAT should expect annotator.py to return. Below are the valid keys a developer may use in their YAML file.
Properties
| Property | Required | Required to Publish | Description |
|---|---|---|---|
| | X | X | The name of the module that will be displayed to the user. |
| | X | X | The version number of the annotator. It is primarily used when publishing a module, but is required for all modules. |
| | X | X | The module type, in this case ‘annotator’. |
| | X | X | Either ‘variant’ or ‘gene’. |
| input_format | | | The input file type. The accepted values correspond to the .crv, .crx, and .crg columns of the input_data table under annotate. |
| output_columns | X | X | A list of the output columns from the annotator. See Output Columns sub-section for more details. |
| | | X | A short description of the module’s purpose and use. |
| | | X | Information about the developer. Subkeys: name, email, organization, website, citation. |
Output Columns
The output_columns property is a YAML list that enumerates the expected keys of the dictionary returned by annotator.py. The preparation of this dictionary is explained in greater detail in annotator.py. Each entry in the output_columns list requires three properties: name, title, and type, described in the table below.
| Property | Required | Required to Publish | Description |
|---|---|---|---|
| name | X | X | The column’s internal name. Used to identify the column in the output dictionary from annotator.py. |
| title | X | X | The column’s display name. Used in the final report. |
| type | X | X | The data type of that column: a string, integer, or float type. |
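The three required properties can be checked mechanically. The following is a minimal sketch rather than anything provided by OpenCRAVAT: `missing_column_properties` is a hypothetical helper, and the column entries are made-up examples written as the Python structure a YAML parser would produce for an output_columns list.

```python
# Hypothetical helper, not an OpenCRAVAT API: reports which of the three
# required properties are missing from each output_columns entry.
REQUIRED_PROPERTIES = ("name", "title", "type")


def missing_column_properties(output_columns):
    """Map each incomplete column (by name, or by list index) to its missing properties."""
    problems = {}
    for index, column in enumerate(output_columns):
        missing = [prop for prop in REQUIRED_PROPERTIES if prop not in column]
        if missing:
            label = column.get("name", f"entry {index}")
            problems[label] = missing
    return problems


# Made-up entries, shaped like a parsed output_columns list.
example_columns = [
    {"name": "score", "title": "Score", "type": "float"},
    {"name": "note", "title": "Note"},  # "type" is missing
]

print(missing_column_properties(example_columns))
```

A check like this could run before publishing, so that an incomplete column definition is caught early instead of at annotation time.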
annotator.py
The Python module receives input data describing a single variant/gene, and uses it to look up additional information specific to that annotator. An annotator.py works by extending a provided base class, BaseAnnotator, and implementing three instance methods: setup, annotate, and cleanup.
setup
The setup method executes once before the main loop over the input file. It is normally used to open file-handlers or database connections. More information on accessing the data source can be found in data.
annotate
The annotate method executes once per iteration of the main loop. It receives an input_data argument and, optionally, a secondary_data argument. These arguments represent the input data for a single variant/gene. Both arguments are Python dictionaries whose format (including whether secondary_data is present at all) is determined by the input_format property of annotator.yml. The following table enumerates the possible keys of input_data, and shows which keys are present for each value of input_format.
| Key | .crv | .crx | .crg | Description | Example |
|---|---|---|---|---|---|
| | X | X | | An id. | 1, 2 |
| | X | X | | The chromosome, with prefix ‘chr’. 1-based indexing. | ‘chr1’, ‘chr23’, ‘chrX’ |
| | X | X | | An integer genomic position. | 112501307, 104770363 |
| | X | X | | The reference base. | ‘A’, ‘GCC’ |
| | X | X | | The alternate base. | ‘G’, ‘AT’, ‘-’ |
| | | X | X | The gene name | TP53 |
| | | X | | The predicted primary transcript | ENST00000617185.4 |
| | | X | X | Most severe sequence ontology | MIS |
| | | X | | All affected transcripts. Details here | |
| | | | X | Number of variants on this gene | 5 |
| | | | X | Sequence ontologies and counts for this gene | STL(1),MIS(3) |
OpenCRAVAT expects annotator.py to return a Python dictionary. The keys present in this dictionary, and the data types of their values, are both determined by the output_columns property in annotator.yml.
cleanup
The cleanup method executes once after the main loop has finished. It is normally used to close any database connections or file-handlers opened in setup.
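The three methods fit together roughly as sketched below. This is an illustration under stated assumptions, not the real implementation: `StandInBaseAnnotator` is a simplified substitute for the provided base class so that the example runs on its own, its `run` method only mimics the documented call order, and the input keys (`chrom`, `pos`) and output key (`label`) are placeholders chosen for the example.

```python
class StandInBaseAnnotator:
    """Simplified substitute for the provided base class (assumption).

    It only reproduces the documented call order: setup once, annotate
    once per input row, cleanup once at the end.
    """

    def run(self, rows):
        self.setup()
        try:
            return [self.annotate(row) for row in rows]
        finally:
            self.cleanup()

    def setup(self):
        pass

    def annotate(self, input_data, secondary_data=None):
        raise NotImplementedError

    def cleanup(self):
        pass


class ExampleAnnotator(StandInBaseAnnotator):
    def setup(self):
        # Runs once: prepare whatever annotate needs.
        self.calls = ["setup"]
        self.lookup = {("chr1", 100): "hit"}  # toy in-memory data source

    def annotate(self, input_data, secondary_data=None):
        # Runs once per variant/gene: return one dictionary per input.
        self.calls.append("annotate")
        key = (input_data["chrom"], input_data["pos"])  # placeholder key names
        return {"label": self.lookup.get(key)}

    def cleanup(self):
        # Runs once: release whatever setup acquired.
        self.calls.append("cleanup")
        self.lookup = None


annotator = ExampleAnnotator()
results = annotator.run([{"chrom": "chr1", "pos": 100}, {"chrom": "chr2", "pos": 5}])
print(results)
print(annotator.calls)
```

In a real annotator the loop is driven by OpenCRAVAT, so only the three overridden methods would be written by the developer.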
data
The sub-directory data contains the data source for the annotator. This can be a flat data file, a sqlite database, or a combination of multiple data files. To access the data, the developer opens a file-handler or database connection, depending on the file type. This should be done in the setup instance method in annotator.py. The developer should then store the opened data accessor as an instance attribute on self so that it is accessible during the annotate method.
Note that there is special support for a sqlite database which shares the name of the annotator module. In this case, a database connection and cursor are automatically opened by the BaseAnnotator base class of annotator.py. The connection and cursor are stored as self.dbconn and self.cursor respectively. This functionality is intended to aid a primary use case where the data source is a single sqlite database. A developer can safely overwrite self.dbconn and self.cursor if they wish, albeit at the loss of the automatic functionality.
The developer should close any active database connections or file-handlers during the cleanup method of annotator.py. Automatically opened database connections will also be automatically closed.
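When the data source is opened manually, the open, store, and close steps described in this section might look like the sketch below. It uses only the standard sqlite3 module. The class is a plain class rather than a real BaseAnnotator subclass so that it runs on its own, and the table name, column names, and the `scores_conn` / `scores_cursor` attribute names are assumptions made for the example (chosen so that self.dbconn and self.cursor are left untouched).

```python
import sqlite3


class SqliteBackedAnnotator:
    """Illustration only; a real annotator would extend BaseAnnotator."""

    def __init__(self, db_path):
        # Hypothetical: in practice this would point at a file under data/.
        self.db_path = db_path

    def setup(self):
        # Open the data source once and keep the accessors on self.
        self.scores_conn = sqlite3.connect(self.db_path)
        self.scores_cursor = self.scores_conn.cursor()

    def annotate(self, input_data, secondary_data=None):
        # Reuse the cursor opened in setup for every lookup.
        self.scores_cursor.execute(
            "SELECT score FROM scores WHERE chrom = ? AND pos = ?",
            (input_data["chrom"], input_data["pos"]),
        )
        row = self.scores_cursor.fetchone()
        return {"score": row[0] if row else None}

    def cleanup(self):
        # Close everything that setup opened.
        self.scores_cursor.close()
        self.scores_conn.close()


annotator = SqliteBackedAnnotator(":memory:")
annotator.setup()
# Demo only: create toy data on the open connection. A real data source
# would already exist on disk before setup runs.
annotator.scores_cursor.execute(
    "CREATE TABLE scores (chrom TEXT, pos INTEGER, score REAL)"
)
annotator.scores_cursor.execute("INSERT INTO scores VALUES ('chr1', 100, 0.9)")
hit = annotator.annotate({"chrom": "chr1", "pos": 100})
miss = annotator.annotate({"chrom": "chr1", "pos": 101})
annotator.cleanup()
print(hit, miss)
```

Parameterized queries (the `?` placeholders) keep the lookup safe even when input values contain unexpected characters.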
Secondary Inputs
Annotators can be piped together so that the output of one annotator can be used in the input of another annotator. This can be useful to create annotators that summarize groups of other annotators, or to use the data from another annotator in a query.
For example, let’s say we have data that is indexed on ClinVar IDs. We can make an annotator that depends on the clinvar annotator, then use the ID to look up our values.
Edit annotator.yml and add a secondary data input.
secondary_inputs:
  clinvar: {}
Now, in the annotate method of annotator.py, the secondary_data argument will sometimes contain data from clinvar.
# Use the first clinvar record if there is one, otherwise fall back to None.
if secondary_data['clinvar']:
    clinvar_id = secondary_data['clinvar'][0]['id']
else:
    clinvar_id = None
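Because the secondary data is only sometimes present, it can help to wrap the lookup in a small defensive function. The sketch below is a hypothetical helper, not part of OpenCRAVAT. It assumes the shape implied by the snippet above: a dictionary keyed by module name whose values are lists of record dictionaries.

```python
def first_secondary_value(secondary_data, module_name, key):
    """Return `key` from the first record of `module_name`, or None if absent.

    Hypothetical helper: tolerates a missing secondary_data argument, a
    missing module entry, and an empty record list.
    """
    records = (secondary_data or {}).get(module_name) or []
    if not records:
        return None
    return records[0].get(key)


# Made-up inputs covering the cases the helper is meant to absorb.
print(first_secondary_value({"clinvar": [{"id": 1234}]}, "clinvar", "id"))
print(first_secondary_value({"clinvar": []}, "clinvar", "id"))
print(first_secondary_value(None, "clinvar", "id"))
```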
We also want to make sure that users who install our annotator have clinvar installed. To do this, we need to add an install requirement to our annotator’s config.
requires:
- clinvar
If you need to require a certain version of the secondary annotator, you can do so with comparison expressions similar to the version specifiers used by pip install.
clinvar==2.0.0
clinvar>=2.0.0
clinvar<2.0.0
Specifying a version is discouraged unless absolutely needed. OpenCRAVAT has very limited ability to resolve dependency issues between modules.
Table-in-table output
Originally, an output field of an OpenCRAVAT annotator module had to be a string, integer, or float. From OpenCRAVAT 2.2.1 onward, an output field can also contain a table of values, which makes table-in-table output possible for annotation modules. This feature is useful for organizing complex data. For example, the VEST4 annotation module’s “All transcripts” column used to contain a string such as “ENST00000612895.4(0.884:0.04118), *ENST00000614428.4(0.928:0.02102), ENST00000617649.4(0.866:0.05418)”. This string contains the VEST score and p-value for three different transcripts of a variant. Getting the score and p-value of a specific transcript required parsing the string and extracting the values. The new VEST annotation module, which works with OpenCRAVAT 2.2.1 and later, has the following data instead of the string: [[ENST00000612895.4, 0.884, 0.04118], [ENST00000614428.4, 0.928, 0.02102], [ENST00000617649.4, 0.866, 0.05418]], which shows the transcript, score, and p-value organization of the data much more clearly.
This type of data is still stored as a string in result databases, but OpenCRAVAT automatically converts between the string and the JSON object as it communicates with annotator modules. Thus, when writing an annotation module, the dictionary returned by the annotate method can hold such a table directly as the value of an output field. No conversion to a JSON string is necessary.
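To make the difference concrete, the sketch below parses the legacy string from the example above into the table form and then reads one transcript’s score from it. The regular expression and the `parse_legacy` helper are assumptions written for this illustration. They are not taken from the VEST module, and the leading `*` marker on one transcript is simply discarded. The json round trip at the end only illustrates what a string/object conversion looks like; it makes no claim about how OpenCRAVAT implements it.

```python
import json
import re

legacy = (
    "ENST00000612895.4(0.884:0.04118), *ENST00000614428.4(0.928:0.02102), "
    "ENST00000617649.4(0.866:0.05418)"
)

# transcript(score:pval), optionally preceded by a "*" marker.
ENTRY = re.compile(r"\*?(?P<transcript>[^\s(,]+)\((?P<score>[^:]+):(?P<pval>[^)]+)\)")


def parse_legacy(text):
    """Turn the legacy string into [transcript, score, pval] rows."""
    return [
        [m["transcript"], float(m["score"]), float(m["pval"])]
        for m in ENTRY.finditer(text)
    ]


table = parse_legacy(legacy)

# With table data, a specific transcript can be read without string parsing.
score_by_transcript = {row[0]: row[1] for row in table}
print(score_by_transcript["ENST00000614428.4"])

# A table survives conversion to a JSON string and back unchanged.
stored = json.dumps(table)
print(json.loads(stored) == table)
```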
To enable table-in-table output support for an output column, add the table: true property to the definition of the column in the module’s configuration yml file. There is another property, table_headers, which is optional. With these two new properties, the “All annotations” (previously “All transcripts”) column of the VEST module is defined as below.
name: all
title: All annotations
type: string
table: true
table_headers:
  - name: transcript
    title: Transcript
    type: string
  - name: score
    title: Score
    type: float
  - name: pval
    title: p-value
    type: float
…
When an output column with table data is used by a reporter module, the reporter module will receive a JSON object instead of a string, as OpenCRAVAT does the conversion automatically. In the same way, widget modules will also receive JSON objects instead of strings for output columns with table data.