Reporter tutorial

Reporters are used to generate output files from an OpenCRAVAT output database.

Example

Readers who prefer to learn by reading code should see the example reporter in open-cravat-modules-karchinlab.

The rest of this document will explain how reporters are structured, and the meaning of some low-level data.

Overview

Levels

OpenCRAVAT data is structured into four levels: variant, gene, sample, and mapping. These levels correspond directly to four tables in the output database. The first two levels, variant and gene, contain information about variants and genes that OpenCRAVAT generates, as well as data from annotators. The sample level keeps track of which samples contain a certain variant, and can include information about specific occurrences of a variant, such as zygosity. Finally, the mapping level keeps track of where in the original input files a variant was found.

Most reporters will be concerned with the variant and gene level data, and that will be the focus of this page. Sample and mapping level data can be handled in a similar way.

Structure of a reporter

Reporters must define four functions: setup, write_header, write_table_row, and end.

setup is called once. It is used to open file handles and do other such initialization tasks.

write_header is called once per level. It is used to write information about the data that will be in that level.

write_table_row is called once per “row” of data: once per variant at the variant level, and once per gene at the gene level.

end is used to close file handles and do other such cleanup tasks.

Most of the work of a reporter is done in write_header and write_table_row. The following two sections will cover the data structures of those two functions.
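
The call order described above can be sketched as follows. This is a minimal, standalone illustration and not the OpenCRAVAT base class: SketchReporter and run_sketch are hypothetical names, the level argument to write_header is an assumption, and the driver only mimics the order in which the four functions are described as being called.

```python
class SketchReporter:
    """Hypothetical stand-in for a reporter; records calls instead of writing files."""

    def __init__(self):
        self.calls = []

    def setup(self):
        # Called once: a real reporter would open its file handles here.
        self.calls.append("setup")

    def write_header(self, level):
        # Called once per level (the 'level' argument is an assumption).
        self.calls.append(f"header:{level}")

    def write_table_row(self, row):
        # Called once per row: one variant, one gene, and so on.
        self.calls.append(f"row:{row[0]}")

    def end(self):
        # A real reporter would close its file handles here.
        self.calls.append("end")


def run_sketch(reporter, data_by_level):
    """Hypothetical driver that mimics the call order of the four functions."""
    reporter.setup()
    for level, rows in data_by_level.items():
        reporter.write_header(level)
        for row in rows:
            reporter.write_table_row(row)
    reporter.end()


reporter = SketchReporter()
run_sketch(reporter, {"variant": [[1, "chr3"], [2, "chr7"]], "gene": [["ULK4"]]})
print(reporter.calls)
```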

Low level

write_table_row takes a single argument, row, which is a list of the values for that row. For example, at the variant level, the first part of row might be:

[1, 'chr3', 41266113, 'C', 'T', None, None, 'ULK4', 'ENST00000301831.9', ...]

Each of these columns is identified in an attribute called self.colinfo, a multi-level dict with information about all the columns in the job. A summary of its structure is below.

self.colinfo = {
    'variant': {
        'columns': [
            {
                'col_name': 'base__uid',
                'col_title': 'UID',
                'col_type': 'int',
                ...
            },
            {
                'col_name': 'base__chrom',
                'col_title': 'Chrom',
                'col_type': 'string',
                ...
            },
            {
                'col_name': 'base__pos',
                'col_title': 'Position',
                'col_type': 'int',
                ...
            },
            ...
        ],
        'colgroups': [
            {
                'name': 'base',
                'displayname': 'Variant Annotation',
                'count': 16,
                'lastcol': 16,
                ...
            },
            {
                'name': 'clinvar',
                'displayname': 'ClinVar',
                'count': 5,
                'lastcol': 21,
                ...
            }
        ]
    },
    'gene':{...},
    'sample':{...},
    'mapping':{...}
}

At the top level, self.colinfo is keyed by level, and each level is divided into 'columns' and 'colgroups'. For a given level, the list self.colinfo[level]['columns'] lines up, position by position, with the row list passed to write_table_row. Columns from the same annotator are located next to each other. The list colgroups contains information about each annotator, and can be used to select data from that annotator.
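
As a rough sketch of how the two lists might be used, the snippet below pairs a row with its column names and slices out one annotator's cells. It runs on a small hand-made colinfo, not a real job. The helpers row_as_dict and group_slice, the column clinvar__example_col, and its value are all hypothetical. It also assumes that 'lastcol' is the index just past a group's final column, so that a group occupies positions lastcol - count up to lastcol; this is consistent with the counts shown above, but is an assumption.

```python
# Hand-made stand-in for self.colinfo, trimmed to four columns.
colinfo = {
    "variant": {
        "columns": [
            {"col_name": "base__uid", "col_title": "UID", "col_type": "int"},
            {"col_name": "base__chrom", "col_title": "Chrom", "col_type": "string"},
            {"col_name": "base__pos", "col_title": "Position", "col_type": "int"},
            # Hypothetical column, for illustration only.
            {"col_name": "clinvar__example_col", "col_title": "Example", "col_type": "string"},
        ],
        "colgroups": [
            {"name": "base", "displayname": "Variant Annotation", "count": 3, "lastcol": 3},
            {"name": "clinvar", "displayname": "ClinVar", "count": 1, "lastcol": 4},
        ],
    }
}


def row_as_dict(colinfo, level, row):
    """Pair each value in row with the col_name at the same position."""
    names = [col["col_name"] for col in colinfo[level]["columns"]]
    return dict(zip(names, row))


def group_slice(colinfo, level, group_name):
    """Return the slice of a row that belongs to one annotator.

    Assumes 'lastcol' is an exclusive end index (an assumption, see lead-in).
    """
    for group in colinfo[level]["colgroups"]:
        if group["name"] == group_name:
            return slice(group["lastcol"] - group["count"], group["lastcol"])
    raise KeyError(group_name)


row = [1, "chr3", 41266113, "example value"]
record = row_as_dict(colinfo, "variant", row)
clinvar_cells = row[group_slice(colinfo, "variant", "clinvar")]
base_cells = row[group_slice(colinfo, "variant", "base")]
print(record["base__chrom"], clinvar_cells, base_cells)
```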

In practice

Generally, write_header reads information from self.colinfo and writes descriptions of each column to the output file. Then write_table_row writes the value of each cell.
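
One way this pattern might look for a tab-separated output is sketched below. TsvSketchReporter is a hypothetical standalone class, not a subclass of the real OpenCRAVAT reporter base. It writes to an in-memory buffer instead of a file, takes colinfo as a constructor argument, and assumes write_header receives the level name. The comment-line header format is made up for illustration.

```python
import csv
import io


class TsvSketchReporter:
    """Hypothetical reporter-like class that writes tab-separated text."""

    def __init__(self, colinfo, out):
        self.colinfo = colinfo
        self.out = out
        self.writer = None

    def setup(self):
        # Prepare the writer; a real reporter would open its output file here.
        self.writer = csv.writer(self.out, delimiter="\t", lineterminator="\n")

    def write_header(self, level):
        columns = self.colinfo[level]["columns"]
        # Describe each column in a comment line, then emit the title row.
        self.out.write(f"#level={level}\n")
        for col in columns:
            self.out.write(f"#{col['col_name']}: {col['col_title']} ({col['col_type']})\n")
        self.writer.writerow([col["col_title"] for col in columns])

    def write_table_row(self, row):
        # csv.writer renders None as an empty cell.
        self.writer.writerow(row)

    def end(self):
        self.out.flush()


colinfo = {
    "variant": {
        "columns": [
            {"col_name": "base__uid", "col_title": "UID", "col_type": "int"},
            {"col_name": "base__chrom", "col_title": "Chrom", "col_type": "string"},
            {"col_name": "base__pos", "col_title": "Position", "col_type": "int"},
        ]
    }
}

out = io.StringIO()
tsv_reporter = TsvSketchReporter(colinfo, out)
tsv_reporter.setup()
tsv_reporter.write_header("variant")
tsv_reporter.write_table_row([1, "chr3", 41266113])
tsv_reporter.write_table_row([2, "chr7", None])
tsv_reporter.end()
text = out.getvalue()
print(text)
```

Writing the column descriptions as comment lines keeps the table itself easy to parse, since a reader can skip every line that starts with '#'.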