Introduction

The Categorical Variation Working Group is developing a data framework and specification for a computable model of categorical variants. This work is related to broader GA4GH efforts to streamline genomic knowledge standards across disparate genomic knowledge repositories. A categorical variation representation specification is needed to support storing, searching, and interpreting knowledge related to both individual genomic variants and the categories of variation to which they belong.

Background and Problem Statement

Genomic medicine is the discipline of interpreting genomic information about an individual as part of their clinical care for diagnosis, prognosis, or therapeutic decision-making. Integral to the practice of genome interpretation is the collection of multiple lines of evidence from disparate genomic data resources to support or refute the clinical significance of evaluated variants. However, this process is rarely as straightforward as exact pattern matching. The complication arises from a subtle but crucial difference between the information that the analyst possesses and the information to which evidence is typically attached in knowledgebases.

Suppose an analyst is interpreting a variant, NC_000007.13:g.140453136A>T, that was assayed in a patient. This assayed variant label represents one specific genomic variant. However, the evidence connected with this variant and its association with cancer is often not directly attached to that exact assayed variant. Rather, the variant NC_000007.13:g.140453136A>T belongs to a larger class of related variants, BRAF V600E variants, and the underlying evidence items are associated with this class label.

The figure depicts a stack of clinical reports, each of which represents a single assayed variant. Each assayed variant is connected to a common node labelled "BRAF V600E Variant", to indicate that they are all members of that class of variants. BRAF V600E is connected to a variety of genomic knowledge statements, such as being found in various cancers, having implications for drug sensitivity, and having the effect of gene amplification.

This class, BRAF V600E, is a categorical variant, so-called because it represents an entire category of variation. Categorical variants are sets of properties related to different dimensions of genomic and biological variation. The members of a categorical variant are individual assayed variants.

To return to our hypothetical analyst, the variant they are interpreting, NC_000007.13:g.140453136A>T, is an assayed variant. That variant exists in the genome of an individual patient. The labelled entity with which the genomic knowledge is associated, however, the categorical variant BRAF V600E, does not exist in the genome. Categorical variants exist solely within genomics knowledgebases. Therefore, one critical step in the interpretation of an assayed variant is determining the categorical variants to which it belongs, in order to connect the assayed variant to the evidence items associated with those categorical variants.

Challenges to Unifying the Representation of Categorical Variants

Categorical variants arise organically and continuously in the course of genomics research. When clinical studies are run and journal papers published, the results are typically not characterized in terms of an exhaustive list of assayed variants to which the conclusions apply. Rather, the domain of the conclusions is characterized in terms of a categorical variant: all of the individual assayed variants that fall into the same biological bucket. Like all scientific abstractions, these models have several useful properties. They describe insightful conclusions related to the biological events that underlie a function common to a class of variants. They also make useful predictions, namely that the same conclusions should apply to variants that weren’t explicitly tested but ought to function in a similar way to those explicitly tested. They thus allow us to generalize genomic knowledge.

To return to the running example, the BRAF V600E categorical variant includes as its members any of 2 single-nucleotide substitutions and 6 double-nucleotide substitutions that convert a valine codon into one coding for glutamic acid. The valine to glutamic acid amino acid substitution variant is also a member of that set. Any other variant or series of variants that would have the net effect of substituting glutamic acid for valine in the same location of the resulting polypeptide chain is also a member of the same categorical variant.
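The counts quoted above can be checked by enumeration. The following is a minimal sketch, assuming the standard genetic code (four valine codons, two glutamic acid codons); the function and variable names are illustrative only and are not part of any specification. Note that the totals span all four valine codons, so for any single reference codon only two of the eight changes apply.

```python
from __future__ import annotations

from itertools import product

# Standard genetic code entries for the two amino acids in the example.
VALINE_CODONS = ["GTT", "GTC", "GTA", "GTG"]
GLUTAMIC_ACID_CODONS = ["GAA", "GAG"]


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length codons differ."""
    return sum(x != y for x, y in zip(a, b))


def val_to_glu_substitutions() -> dict[int, list[tuple[str, str]]]:
    """Group every Val -> Glu codon change by how many nucleotides it alters."""
    groups: dict[int, list[tuple[str, str]]] = {}
    for val, glu in product(VALINE_CODONS, GLUTAMIC_ACID_CODONS):
        groups.setdefault(hamming(val, glu), []).append((val, glu))
    return groups


groups = val_to_glu_substitutions()
for n_changes, pairs in sorted(groups.items()):
    # Print each group as "number of changed bases" followed by the codon changes.
    print(n_changes, [f"{v}>{g}" for v, g in pairs])
```

Every one of these nucleotide-level changes yields the same amino acid substitution, which is why they fall into a single categorical variant.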

While a single categorical variant may have many assayed variant members, the same is true in the other direction. A single assayed variant is a member of many possible categorical variants simultaneously. While NC_000007.13:g.140453136A>T is a member of the BRAF V600E categorical variant, it is also a Change-of-Function variant, a protein missense variant, and a chromosome 7 variant, among other categorical variants.
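The many-to-many relationship can be illustrated by treating each categorical variant as a membership test. This is a sketch under stated assumptions: the `AssayedVariant` record, its pre-annotated fields, and the predicate table are hypothetical and do not reflect the Cat-VRS data model. The "KRAS gene variant" entry is included only as a non-matching example.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class AssayedVariant:
    """Hypothetical, pre-annotated record for one assayed variant."""
    hgvs_g: str
    chromosome: str
    gene: str
    protein_change: str
    consequence: str


# Each categorical variant is modeled here as a named membership predicate.
CATEGORICAL_VARIANTS: dict[str, Callable[[AssayedVariant], bool]] = {
    "BRAF V600E variant": lambda v: v.gene == "BRAF" and v.protein_change == "V600E",
    "BRAF gene variant": lambda v: v.gene == "BRAF",
    "protein missense variant": lambda v: v.consequence == "missense",
    "chromosome 7 variant": lambda v: v.chromosome == "7",
    "KRAS gene variant": lambda v: v.gene == "KRAS",
}


def matching_categories(variant: AssayedVariant) -> list[str]:
    """Return the name of every categorical variant that the assayed variant belongs to."""
    return [name for name, is_member in CATEGORICAL_VARIANTS.items() if is_member(variant)]


assayed = AssayedVariant(
    hgvs_g="NC_000007.13:g.140453136A>T",
    chromosome="7",
    gene="BRAF",
    protein_change="V600E",
    consequence="missense",
)
print(matching_categories(assayed))
```

A single assayed variant satisfies several predicates at once, and each predicate would in turn accept many other assayed variants.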

The figure depicts a single centralized assayed variant, with arrows radiating out to a number of categorical variants of which it is a member. Among these, the assayed variant NC_000007.13:g.140453136A>T is a BRAF V600E variant, a BRAF gene variant, and a chromosome 7 variant.

Because a single categorical variant may have many assayed variants as members, while a single assayed variant can be a member of many categorical variants, different categorical variants have complex hierarchical relationships with each other. The figure below depicts some of the relationships between some of the categorical variants of which NC_000007.13:g.140453136A>T is a member. For example, all BRAF V600E variants are also BRAF gene variants, and all BRAF V600E variants and BRAF gene variants are chromosome 7 variants. A BRAF V600E variant is also an inframe protein variant, which is itself a type of sequence variant.

The figure depicts the same assayed variant and categorical variants as the previous figure, but with arrows added to show subset relationships between various categorical variants. One arrow connects BRAF V600E variants to BRAF gene variants to show that all BRAF V600E variants are also members of the set of BRAF gene variants. A similar arrow shows that all BRAF gene variants are members of the set of chromosome 7 variants.
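The subset relationships described above form a directed graph, and membership in one categorical variant implies membership in everything reachable from it. The sketch below encodes only the edges named in the text; the `SUBSET_OF` table and the `ancestors` function are illustrative assumptions, not a defined interface.

```python
from __future__ import annotations

# Direct "is a subset of" edges between categorical variants, taken from the text.
SUBSET_OF: dict[str, list[str]] = {
    "BRAF V600E variant": ["BRAF gene variant", "inframe protein variant"],
    "BRAF gene variant": ["chromosome 7 variant"],
    "inframe protein variant": ["sequence variant"],
}


def ancestors(category: str) -> set[str]:
    """All categories that contain `category`, following subset edges transitively."""
    found: set[str] = set()
    stack = [category]
    while stack:
        for parent in SUBSET_OF.get(stack.pop(), []):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


print(sorted(ancestors("BRAF V600E variant")))
```

Evidence attached to any of the returned categories is potentially relevant to a member of the starting category, which is one reason the hierarchy matters for matching.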

To make categorical variant matching even more complicated, it is often the case that identical labels across different resources in fact describe different categorical variants, as seen in the figure below where an ACT sequence has been inserted directly 3’ of an ACTG sequence. While this would not be considered a duplication variant in the HGVS nomenclature due to the intervening G base pair, it could appear in other resources as a duplication of the preceding ACT sequence. This implies that the categorical variant descriptor “duplication” has different meanings across different resources.

The figure depicts a hypothetical variant where an ACT sequence has been inserted directly 3' of an ACTG sequence. While this would not be considered a duplication variant in the HGVS nomenclature due to the intervening G base pair, it could appear in other resources as a duplication of the preceding ACT sequence, or alternatively simply as an insertion of ACT. This implies that the categorical variant descriptor "duplication" has different meanings across different resources.
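The two readings of "duplication" can be written as two different membership tests. This is a simplified sketch: `is_adjacent_duplication` approximates the strict reading in which the copy must sit directly 5' of the insertion point, while `is_nearby_duplication` and its `slack` parameter are a hypothetical stand-in for a looser convention that some other resource might use. Positions are interbase offsets into the reference, and no normalization or shifting is attempted.

```python
def is_adjacent_duplication(reference: str, position: int, inserted: str) -> bool:
    """Strict reading: the inserted bases copy the bases immediately 5' of the insertion point."""
    start = position - len(inserted)
    return start >= 0 and reference[start:position] == inserted


def is_nearby_duplication(reference: str, position: int, inserted: str, slack: int = 1) -> bool:
    """Looser, hypothetical reading: a copy may sit up to `slack` bases further upstream."""
    start = max(0, position - len(inserted) - slack)
    return inserted in reference[start:position]


reference = "ACTG"
position = 4        # interbase position directly 3' of the final G
inserted = "ACT"

print(is_adjacent_duplication(reference, position, inserted))
print(is_nearby_duplication(reference, position, inserted))
```

The same assayed variant is a member of "duplication" under one test and not under the other, so the shared label does not identify a shared categorical variant.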

On the other hand, it is also often the case that spurious ambiguity exists within resources. The figure depicts a hypothetical case where, compared to a reference sequence ACT, the variant sequence is ACCCCCT. In HGVS, this variant could validly be described either as an insertion of 4 C nucleotides or as five repetitions of the single-nucleotide sequence C. This demonstrates spurious ambiguity of categorical variant descriptors, as the two descriptors describe sets with all and only the same member variants.
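That the two descriptions are equivalent can be confirmed by applying each to the reference and comparing the results. The helper functions below are illustrative assumptions that use interbase offsets; they are not HGVS parsers.

```python
def apply_insertion(reference: str, position: int, inserted: str) -> str:
    """Insert `inserted` at an interbase position (0 = before the first base)."""
    return reference[:position] + inserted + reference[position:]


def apply_repeat(reference: str, start: int, end: int, unit: str, count: int) -> str:
    """Replace reference[start:end] with `count` copies of `unit`."""
    return reference[:start] + unit * count + reference[end:]


reference = "ACT"

# Description 1: insert four C nucleotides after the existing C.
as_insertion = apply_insertion(reference, 2, "CCCC")

# Description 2: the single C becomes five repetitions of C.
as_repeat = apply_repeat(reference, 1, 2, "C", 5)

print(as_insertion, as_repeat, as_insertion == as_repeat)
```

Comparing the resulting sequences, or some other canonical form, rather than the descriptor text is one way to recognize that two labels denote the same set of variants.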


Discussion

In summary, a crucial step in the course of genomic variant interpretation is assayed-categorical variant matching, where one determines all and only those categorical variants of which the assayed variant in question is a member. Successful assayed-categorical variant matching makes it possible to connect evidence to support or refute determinations of pathogenicity and/or oncogenicity of the assayed variants. In a different but related use case, categorical-categorical variant matching is crucial to the process of data harmonization and knowledgebase curation.

To address these challenges, we introduce the Categorical Variation Representation Specification (Cat-VRS). Cat-VRS captures the semantics that are specified, implied, or missing in genomic knowledge resources, providing a computable framework for expressing how genomic knowledge may be matched to assayed variation. Much like the VRS objects used in this specification, Categorical Variation classes are designed to instantiate objects that are readily usable by genomic knowledge search engines.