Reporter tutorial
Reporters are used to generate output files from an OpenCRAVAT output database.
Example
Readers who prefer to learn by reading code should see the example reporter in open-cravat-modules-karchinlab.
The rest of this document will explain how reporters are structured, and the meaning of some low-level data.
Overview
Levels
OpenCRAVAT data is structured into four levels: variant, gene, sample, and mapping. These levels correspond directly to four tables in the output database. The first two levels, variant and gene, contain information about variants and genes that OpenCRAVAT generates, along with data from annotators. The sample level keeps track of which samples contain a given variant, and can include information about specific occurrences of a variant, such as zygosity. Finally, the mapping level keeps track of where in the original input files a variant was found.
Most reporters will be concerned with the variant and gene level data, and that will be the focus of this page. Sample and mapping level data can be handled in a similar way.
Structure of a reporter
Reporters must define four functions: setup, write_header, write_table_row, and end.
setup is called once. It is used to open file handles and perform other initialization tasks.
write_header is called once per level. It is used to write information about the data in that level, such as a description of each column.
write_table_row is called once per "row" of data. At the variant level this means once per variant. At the gene level, once per gene.
end is used to close file handles and perform other cleanup tasks.
Most of the work of a reporter is done in write_header and write_table_row. The following two sections cover the data structures used by those two functions.
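The call order described above can be sketched as follows. This is a minimal, self-contained illustration, not the actual OpenCRAVAT base class: the class name LifecycleSketchReporter, the level argument to write_header, and the run_sketch driver that stands in for the framework are all assumptions made for this example.

```python
class LifecycleSketchReporter:
    """Hypothetical reporter that records the order in which its methods run."""

    def setup(self):
        # Called once: a real reporter would open its output file here.
        self.calls = ["setup"]

    def write_header(self, level):
        # Called once per level (the `level` argument is an assumption).
        self.calls.append(f"header:{level}")

    def write_table_row(self, row):
        # Called once per row of the level whose header was just written.
        self.calls.append(f"row:{row[0]}")

    def end(self):
        # Called once at the end: a real reporter would close its file here.
        self.calls.append("end")


def run_sketch(reporter, data_by_level):
    """Stand-in for the framework: drives the reporter through every level."""
    reporter.setup()
    for level, rows in data_by_level.items():
        reporter.write_header(level)
        for row in rows:
            reporter.write_table_row(row)
    reporter.end()
    return reporter.calls


# Illustrative rows only; real rows carry many more columns.
calls = run_sketch(
    LifecycleSketchReporter(),
    {"variant": [[1, "chr3"], [2, "chr7"]], "gene": [["ULK4"]]},
)
```

Recording the calls instead of writing a file keeps the sketch focused on ordering: one setup, then a header followed by its rows for each level, then one end.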
Low level
write_table_row takes a single argument, row, which is a list of the values for that row. For example, at the variant level, the first part of row might be:
[1, 'chr3', 41266113, 'C', 'T', None, None, 'ULK4', 'ENST00000301831.9', ...]
Each of these columns is described in an attribute called self.colinfo, a multi-level dict with information about all the columns in the job. A summary of its structure is below.
    self.colinfo = {
        'variant': {
            'columns': [
                {
                    'col_name': 'base__uid',
                    'col_title': 'UID',
                    'col_type': 'int',
                    ...
                },
                {
                    'col_name': 'base__chrom',
                    'col_title': 'Chrom',
                    'col_type': 'string',
                    ...
                },
                {
                    'col_name': 'base__pos',
                    'col_title': 'Position',
                    'col_type': 'int',
                    ...
                },
                ...
            ],
            'colgroups': [
                {
                    'name': 'base',
                    'displayname': 'Variant Annotation',
                    'count': 16,
                    'lastcol': 16,
                    ...
                },
                {
                    'name': 'clinvar',
                    'displayname': 'ClinVar',
                    'count': 5,
                    'lastcol': 21,
                    ...
                }
            ]
        },
        'gene': {...},
        'sample': {...},
        'mapping': {...}
    }
At the top level, self.colinfo is divided by level, and each level is divided into 'columns' and 'colgroups'. The list self.colinfo['level']['columns'] lines up, index for index, with the row list passed to write_table_row, so the entry at a given position describes the value at the same position in row. Columns from the same annotator are located next to each other. The list colgroups contains information about each annotator, and can be used to select data from that annotator.
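Because the columns list and the row list share the same order, the two can be paired by position. The sketch below shows one way to do that, and one way to pick out a single annotator's values. It is an illustration under stated assumptions: the helper names row_as_dict and annotator_values are hypothetical, the clinvar__sig column and its value are placeholders, and selecting by the 'name__' prefix relies on the naming pattern visible in the summary above rather than on a documented rule.

```python
# Trimmed-down stand-in for self.colinfo; the clinvar column is a placeholder.
colinfo = {
    "variant": {
        "columns": [
            {"col_name": "base__uid", "col_title": "UID", "col_type": "int"},
            {"col_name": "base__chrom", "col_title": "Chrom", "col_type": "string"},
            {"col_name": "base__pos", "col_title": "Position", "col_type": "int"},
            {"col_name": "clinvar__sig", "col_title": "Significance", "col_type": "string"},
        ],
        "colgroups": [
            {"name": "base", "displayname": "Variant Annotation"},
            {"name": "clinvar", "displayname": "ClinVar"},
        ],
    }
}


def row_as_dict(colinfo, level, row):
    """Pair each value in `row` with the col_name at the same index."""
    names = [col["col_name"] for col in colinfo[level]["columns"]]
    return dict(zip(names, row))


def annotator_values(colinfo, level, row, group_name):
    """Keep only the values whose column name starts with '<group_name>__'."""
    prefix = group_name + "__"
    record = row_as_dict(colinfo, level, row)
    return {name: value for name, value in record.items() if name.startswith(prefix)}


row = [1, "chr3", 41266113, "placeholder"]  # illustrative values only
record = row_as_dict(colinfo, "variant", row)
clinvar_only = annotator_values(colinfo, "variant", row, "clinvar")
```

Building a dict per row is convenient for readability. A reporter that handles many rows could instead compute the wanted column indexes once in write_header and reuse them for every row.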
In practice
Generally, write_header reads information from self.colinfo and writes a description of each column to the output file. Then write_table_row writes the value of each cell.
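As a rough sketch of that division of labor, the class below writes a tab-separated section for one level. It is not the OpenCRAVAT implementation: the class name TsvSketchReporter, its constructor, the level argument to write_header, and the choice of '#' comment lines for column descriptions are assumptions made for this example, and the sample rows are illustrative.

```python
import io


class TsvSketchReporter:
    """Hypothetical reporter that writes one tab-separated section per level."""

    def __init__(self, colinfo, out):
        self.colinfo = colinfo
        self.out = out  # any writable text stream

    def write_header(self, level):
        columns = self.colinfo[level]["columns"]
        # One '#' line per column: internal name, display title, and type.
        for col in columns:
            self.out.write(
                f"#{col['col_name']}\t{col['col_title']}\t{col['col_type']}\n"
            )
        # Then a single line of column titles above the data rows.
        self.out.write("\t".join(col["col_title"] for col in columns) + "\n")

    def write_table_row(self, row):
        # Write missing values (None) as empty cells.
        cells = ["" if value is None else str(value) for value in row]
        self.out.write("\t".join(cells) + "\n")


colinfo = {
    "variant": {
        "columns": [
            {"col_name": "base__uid", "col_title": "UID", "col_type": "int"},
            {"col_name": "base__chrom", "col_title": "Chrom", "col_type": "string"},
            {"col_name": "base__pos", "col_title": "Position", "col_type": "int"},
        ]
    }
}

out = io.StringIO()
reporter = TsvSketchReporter(colinfo, out)
reporter.write_header("variant")
reporter.write_table_row([1, "chr3", 41266113])
reporter.write_table_row([2, "chr3", None])
text = out.getvalue()
```

An in-memory stream stands in for the output file here. In a full reporter, the stream would typically be opened in setup and closed in end.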