CNV Annotation Formats
Because copy number variants (CNVs) are studied both in cytogenetics ("chromosome based") and in genomics ("sequencing based"), their annotation has evolved from different starting points. This page summarizes some of the common annotation schemes, terminologies and file formats that are applicable to genomic copy number variations.
Cytogenetics vs. Molecular Biology...¶
On the cytogenetic side, the use of cytogenetic bands as a coordinate system has been supplemented by the increasing use of mapping positions (e.g. for molecular-cytogenetic or hybrid analyses with known probe positions). For array- and sequencing-based CNV detection, the focus increasingly lies on determining discrete allelic copy number counts and on assigning a limited set of CNV classes that reflect commonly used concepts.
CNV Term Use Comparison in Computational (File/Schema) Formats¶
This table is maintained in parallel with the Beacon v2 documentation.
EFO | Beacon | VCF | SO | GA4GH VRS ⇒ VRS proposal¹ | Notes |
---|---|---|---|---|---|
EFO:0030070 copy number gain | DUP² or EFO:0030070 | DUP SVCLAIM=D³ | SO:0001742 copy_number_gain | low-level gain (implicit) ⇒ EFO:0030070 copy number gain | a sequence alteration whereby the copy number of a given genomic region is greater than the reference sequence |
EFO:0030071 low-level copy number gain | DUP² or EFO:0030071 | DUP SVCLAIM=D³ | SO:0001742 copy_number_gain | low-level gain ⇒ EFO:0030071 low-level copy number gain | |
EFO:0030072 high-level copy number gain | DUP² or EFO:0030072 | DUP SVCLAIM=D³ | SO:0001742 copy_number_gain | high-level gain ⇒ EFO:0030072 high-level copy number gain | commonly but not consistently used for >=5 copies on a bi-allelic genome region |
EFO:0030073 focal genome amplification | DUP² or EFO:0030073 | DUP SVCLAIM=D³ | SO:0001742 copy_number_gain | high-level gain ⇒ EFO:0030073 focal genome amplification | commonly but not consistently used for >=5 copies on a bi-allelic genome region, of limited size (operationally max. 1-5Mb) |
EFO:0030067 copy number loss | DEL² or EFO:0030067 | DEL SVCLAIM=D³ | SO:0001743 copy_number_loss | partial loss (implicit) ⇒ EFO:0030067 copy number loss | a sequence alteration whereby the copy number of a given genomic region is smaller than the reference sequence |
EFO:0030068 low-level copy number loss | DEL² or EFO:0030068 | DEL SVCLAIM=D³ | SO:0001743 copy_number_loss | partial loss ⇒ EFO:0030068 low-level copy number loss | |
EFO:0020073 high-level copy number loss | DEL² or EFO:0020073 | DEL SVCLAIM=D³ | SO:0001743 copy_number_loss | partial loss ⇒ EFO:0020073 high-level copy number loss | a loss of several copies; also used in cases where a complete genomic deletion cannot be asserted |
EFO:0030069 complete genomic deletion | DEL² or EFO:0030069 | DEL SVCLAIM=D³ | SO:0001743 copy_number_loss | complete loss ⇒ EFO:0030069 complete genomic deletion | complete genomic deletion (e.g. homozygous deletion on a bi-allelic genome region) |
Last updated 2023-03-22 by @mbaudis (EFO:0020073)¶
updated 2023-03-20 by @mbaudis (VRS proposal)¶
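The term correspondences in the table can be expressed as a small lookup structure. The following is a minimal sketch in Python, not code from Beacon or Progenetix: the names `CNV_TERMS`, `efo_to_vcf` and `expand_legacy_term` are hypothetical and only illustrate how a specific EFO term collapses to a legacy `DUP`/`DEL` value, and how a legacy value could be expanded into the EFO terms it covers.

```python
# Illustrative lookup built from the comparison table above.
# All names here are hypothetical and not part of any Beacon or Progenetix API.
CNV_TERMS = {
    # EFO id: (EFO label, VCF symbolic type, SO id)
    "EFO:0030070": ("copy number gain", "DUP", "SO:0001742"),
    "EFO:0030071": ("low-level copy number gain", "DUP", "SO:0001742"),
    "EFO:0030072": ("high-level copy number gain", "DUP", "SO:0001742"),
    "EFO:0030073": ("focal genome amplification", "DUP", "SO:0001742"),
    "EFO:0030067": ("copy number loss", "DEL", "SO:0001743"),
    "EFO:0030068": ("low-level copy number loss", "DEL", "SO:0001743"),
    "EFO:0020073": ("high-level copy number loss", "DEL", "SO:0001743"),
    "EFO:0030069": ("complete genomic deletion", "DEL", "SO:0001743"),
}


def efo_to_vcf(efo_id):
    """Return the legacy VCF-style type (DUP or DEL) for an EFO CNV term."""
    return CNV_TERMS[efo_id][1]


def expand_legacy_term(vcf_type):
    """Return all EFO ids that a legacy DUP or DEL value could stand for."""
    return sorted(
        efo_id for efo_id, term in CNV_TERMS.items() if term[1] == vcf_type
    )


if __name__ == "__main__":
    print(efo_to_vcf("EFO:0030072"))
    print(expand_legacy_term("DEL"))
```

The mapping is lossy in one direction: every specific EFO term has exactly one `DUP`/`DEL` counterpart, while a legacy term can only be expanded to the whole group of gain or loss terms.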
ISCN¶
Since 1963, the International System for Human Cytogenetic Nomenclature (ISCN) has provided standards and guidelines for annotation of human karyotypes and cytogenetic abnormalities.
Recent editions have tried to accommodate genomic variants derived from molecular and molecular-cytogenetic technologies such as FISH, genomic microarrays and DNA sequencing.
Examples (CNV)¶
46,XX,trp(8)(q21q24)
ish cgh dim(17p12p11),enh(8)(q24)
- chromosomal Comparative Genomic Hybridization (CGH)
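To illustrate how such a CGH string relates to the CNV classes used elsewhere on this page, here is a rough Python sketch that splits an `ish cgh` expression into its items and reads the class prefix. It assumes that `enh` denotes a gain, `dim` a loss and `amp` an amplification; the function name `parse_ish_cgh` is hypothetical, and the sketch makes no attempt to interpret the chromosome and band part or to cover ISCN syntax in general.

```python
import re

# Assumed reading of the ISCN CGH prefixes; only the prefix is interpreted here.
ISCN_CGH_CLASSES = {"enh": "gain", "dim": "loss", "amp": "amplification"}


def parse_ish_cgh(karyotype):
    """Split an 'ish cgh ...' string into (class, region) tuples.

    The region is returned verbatim (including parentheses), since chromosome
    and band notation is not parsed in this sketch.
    """
    prefix = "ish cgh "
    if not karyotype.startswith(prefix):
        raise ValueError("expected a string starting with 'ish cgh '")
    events = []
    for item in karyotype[len(prefix):].split(","):
        match = re.match(r"^(enh|dim|amp)(\(.+\))$", item.strip())
        if match is None:
            continue  # skip anything this sketch does not understand
        events.append((ISCN_CGH_CLASSES[match.group(1)], match.group(2)))
    return events


if __name__ == "__main__":
    print(parse_ish_cgh("ish cgh dim(17p12p11),enh(8)(q24)"))
```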
Links¶
- ISCN 2020 is the latest edition, available as a book (Karger)
HGVS¶
Links¶
- HGVS DNA Sequence Variant Nomenclature
VCF¶
While VCF is a file format originally developed (and optimised) for representing possibly recurring variants across a set of analyses, it also allows the storage and representation of CNV events³.
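As a small illustration of how the VCFv4.4 `SVCLAIM` value could be read from a record, the following Python sketch parses an INFO string and reports which kind of evidence is claimed. The function name `interpret_svclaim` and the example INFO strings are assumptions made for this illustration; the sketch is not a VCF parser and does not validate the record.

```python
def interpret_svclaim(info):
    """Report the evidence claimed by SVCLAIM in a VCF INFO string.

    D: a change in abundance / read depth was determined.
    J: an adjacency / break junction was determined.
    Both letters may be combined (e.g. DJ).
    """
    fields = dict(
        item.split("=", 1) for item in info.split(";") if "=" in item
    )
    claim = fields.get("SVCLAIM", "")
    return {"abundance": "D" in claim, "adjacency": "J" in claim}


if __name__ == "__main__":
    # Hypothetical INFO strings, for illustration only.
    print(interpret_svclaim("END=134452384;SVCLAIM=D"))
    print(interpret_svclaim("END=134452384;SVCLAIM=DJ"))
```

A CNV call derived only from array intensities or read depth would carry the `D` claim, which is why the comparison table lists `SVCLAIM=D` for all copy number classes.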
Links¶
Variant Schemas¶
GA4GH "Variant Representation" schema¶
The "Genomic Knowledge Standards" (GKS) group of the Global Alliance for Genomics and Health (GA4GH) develops a modern schema for the unambiguous representation, transmission and recovery of sequence variants (genomic and beyond).
The first release of the GA4GH Variation Representation Specification (vr-spec v1.0) does not yet include the option to represent structural variants. However, the internal roadmap of the project points towards an extension for CNV representation in 2020.
Links¶
- vr-spec repository
- documentation
Ad-Hoc & "Community" Formats¶
Progenetix Variant schema¶
The Progenetix cancer genomics resource stores its millions of CNVs as data objects in MongoDB document databases. The format of the individual variants is based on the original GA4GH schema (see above), with some extensions and modifications.
Development of the Progenetix format closely follows the work of the GA4GH GKS group and adopts core objects and concepts from the vr-spec repository.
The Progenetix data serves as the repository behind Beacon+, the forward-looking implementation of the ELIXIR Beacon project.
Progenetix CNV example¶
    {
      _id: ObjectId("5bab576a727983b2e00b8d32"),
      id: 'pgxvar-5bab576a727983b2e00b8d32',
      variant_internal_id: '11:52900000-134452384:DEL',
      callset_id: 'pgxcs-kftvldsu',
      biosample_id: 'pgxbs-kftva59y',
      individual_id: 'pgxind-kftx25eh',
      variant_state: { id: 'EFO:0030067', label: 'copy number loss' },
      type: 'RelativeCopyNumber',
      location: {
        sequence_id: 'refseq:NC_000011.10',
        type: 'SequenceLocation',
        interval: {
          start: { type: 'Number', value: 52900000 },
          end: { type: 'Number', value: 134452384 }
        }
      },
      relative_copy_class: 'partial loss',
      updated: '2022-03-29T14:36:47.454674'
    }
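The nesting of the example document can be illustrated by deriving a compact `chromosome:start-end:TYPE` label from its fields. The Python sketch below is an illustration only and does not describe how Progenetix generates `variant_internal_id`: the helper names are hypothetical, the chromosome name is derived under the assumption that human RefSeq chromosome accessions follow the `NC_0000NN` pattern (with 23 and 24 standing for X and Y), and only the fields needed here are copied from the example.

```python
import re

# EFO ids for losses, as listed in the comparison table on this page;
# any other term is treated as a gain in this sketch.
LOSS_TERMS = {"EFO:0030067", "EFO:0030068", "EFO:0020073", "EFO:0030069"}


def chromosome_from_refseq(sequence_id):
    """Derive a chromosome name from an id such as 'refseq:NC_000011.10'.

    Assumes the NC_0000NN.version pattern of human RefSeq chromosomes.
    """
    match = re.match(r"^refseq:NC_0000(\d{2})\.\d+$", sequence_id)
    if match is None:
        raise ValueError("unsupported sequence id: " + sequence_id)
    number = int(match.group(1))
    return {23: "X", 24: "Y"}.get(number, str(number))


def cnv_label(variant):
    """Build a 'chromosome:start-end:TYPE' label from a variant document."""
    location = variant["location"]
    start = location["interval"]["start"]["value"]
    end = location["interval"]["end"]["value"]
    legacy_type = "DEL" if variant["variant_state"]["id"] in LOSS_TERMS else "DUP"
    chromosome = chromosome_from_refseq(location["sequence_id"])
    return "{}:{}-{}:{}".format(chromosome, start, end, legacy_type)


# Subset of the example document above, written as a Python dictionary.
example = {
    "variant_state": {"id": "EFO:0030067", "label": "copy number loss"},
    "location": {
        "sequence_id": "refseq:NC_000011.10",
        "type": "SequenceLocation",
        "interval": {
            "start": {"type": "Number", "value": 52900000},
            "end": {"type": "Number", "value": 134452384},
        },
    },
}

if __name__ == "__main__":
    print(cnv_label(example))
```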
Links¶
- schema in progenetix/bycon code repository

1. The VRS annotations refer to the status at v1.2 (2022). The GA4GH VRS team is currently (Spring 2023) preparing an updated specification which will introduce the new class CopyNumberChange (discussion...) with the use of the EFO terms (including a new term for high level deletion (EFO:0020073) in the April 2023 EFO release).
2. While the use of VCF derived (DUP, DEL) values had been introduced with Beacon v1, usage of these terms has always been a recommendation rather than an integral part of the API. We now encourage the support of more specific terms (particularly EFO) by Beacon developers. As an example, the Progenetix Beacon API uses EFO terms but provides an internal term expansion for legacy DUP, DEL support.
3. VCFv4.4 introduces an SVCLAIM field to disambiguate between in situ events (such as tandem duplications; known adjacency / break junction: SVCLAIM=J) and events where e.g. only the change in abundance / read depth (SVCLAIM=D) has been determined. Both J and D flags can be combined.