Annotation format
This page describes the annotation file produced by digger / find_alignments.
Columns in the Annotation File
In addition to the columns in the first table, the file contains the columns in the second table, prefixed by the reference name, for each reference specified with a -ref argument.
| Column Name | Meaning |
|---|---|
| contig | ID of the sequence in which the gene or pseudogene was found |
| start | start co-ord of the coding region |
| end | end co-ord of the coding region |
| start_rev | start co-ord in the reverse-primed sequence |
| end_rev | end co-ord in the reverse-primed sequence |
| sense | sense (relative to the input sequence) |
| gene_type | gene type (e.g. IGHV) |
| gene_start | start co-ord of the entire gene including flanking regions |
| gene_end | end co-ord of the entire gene including flanking regions |
| gene_start_rev | start co-ord of the entire gene including flanking regions in the reverse-primed sequence |
| gene_end_rev | end co-ord of the entire gene including flanking regions in the reverse-primed sequence |
| likelihood | likelihood that the RSS is that of a functional gene (compared to a random sequence) |
| l_part1 | leader part 1 sequence |
| l_part2 | leader part 2 sequence |
| exon1 | exon 1 sequence |
| exon2 | exon 2 sequence |
| donor-splice | donor splice site sequence |
| acceptor-splice | acceptor splice site sequence |
| v_intron | v-intron sequence |
| v_heptamer | v-heptamer sequence |
| v_spacer | v-spacer sequence |
| v_spacer_len | length of the v-spacer |
| v_nonamer | v-nonamer sequence |
| j_heptamer | j-heptamer sequence |
| j_spacer | j-spacer sequence |
| j_spacer_len | length of the j-spacer |
| j_nonamer | j-nonamer sequence |
| j_frame | coding frame of the first nucleotide of the j region (0, 1 or 2) |
| d_3_heptamer | 3-prime d-heptamer sequence |
| d_3_spacer | 3-prime d-spacer sequence |
| d_3_spacer_len | length of the 3-prime d-spacer |
| d_3_nonamer | 3-prime d-nonamer sequence |
| d_5_heptamer | 5-prime d-heptamer sequence |
| d_5_spacer | 5-prime d-spacer sequence |
| d_5_spacer_len | length of the 5-prime d-spacer |
| d_5_nonamer | 5-prime d-nonamer sequence |
| functional | functionality (see below) |
| notes | annotation notes |
| aa | amino acid translation of the coding region |
| v-gene_aligned_aa | IMGT-gapped amino acid translation of the coding sequence (for V-genes) |
| seq | sequence of the coding region |
| seq_gapped | IMGT-gapped sequence of the coding region (V-genes only) |
| 5_rss_start | co-ordinates of the 5-prime RSS |
| 5_rss_start_rev | |
| 5_rss_end | |
| 5_rss_end_rev | |
| 3_rss_start | co-ordinates of the 3-prime RSS |
| 3_rss_start_rev | |
| 3_rss_end | |
| 3_rss_end_rev | |
| l_part1_start | co-ordinates of the leader part 1 |
| l_part1_start_rev | |
| l_part1_end | |
| l_part1_end_rev | |
| l_part2_start | co-ordinates of the leader part 2 |
| l_part2_start_rev | |
| l_part2_end | |
| l_part2_end_rev | |
| matches | number of matches to this start/end region that were produced in the BLAST analysis |
| blast_match | gene in the reference file with the highest match score in this start/end region |
| blast_score | the highest BLAST match score in this start/end region |
| blast_nt_diffs | the number of nucleotides differing from the most highly scoring reference sequence in this BLAST match |
| evalue | evalue of the most highly scoring BLAST match in this start/end region |
Columns provided for each -ref:
| Column Name | Meaning |
|---|---|
| _match | ID of the closest matching reference gene |
| _score | score of the closest match |
| _nt_diffs | number of nucleotides differing from the closest reference sequence |
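As an illustration of how the per-reference columns combine with the fixed columns, the following minimal sketch reads a small annotation table and picks out the columns belonging to one reference. It assumes the file is comma-separated with a header row, which this page does not state. The reference name `imgt`, the helper `reference_columns`, and all values in the sample row are hypothetical.

```python
import csv
import io

# Suffixes of the per-reference columns listed in the table above.
REF_SUFFIXES = ("_match", "_score", "_nt_diffs")


def reference_columns(header, ref_name):
    """Return the per-reference column names for ref_name that occur in header."""
    wanted = [ref_name + suffix for suffix in REF_SUFFIXES]
    return [name for name in wanted if name in header]


# Made-up sample standing in for a real annotation file (assumed CSV layout).
sample = io.StringIO(
    "contig,start,end,gene_type,functional,imgt_match,imgt_score,imgt_nt_diffs\n"
    "contig_1,100,395,IGHV,Functional,IGHV1-2*01,100.0,0\n"
)

reader = csv.DictReader(sample)
rows = list(reader)
ref_cols = reference_columns(reader.fieldnames, "imgt")

for row in rows:
    # Print the fixed columns alongside the columns for the chosen reference.
    print(row["contig"], row["gene_type"], {name: row[name] for name in ref_cols})
```

To read a real file, replace the `io.StringIO` object with an open file handle and adjust the delimiter if the output is not comma-separated.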
Functionality
Functionality is assigned as follows:
Functional
- RSS and leader meet or exceed position-weighted matrix threshold
- Highly-conserved nucleotides agree with the definition for the locus, if a definition has been specified
- If a V-gene, leader starts with ATG, donor splice ends GT or CT, acceptor splice ends AG, and spliced leader has no stop codons
- If a V-gene, coding region has no stop codons before the cysteine at IMGT position 104
- If a V-gene, conserved nucleotides are at the expected locations
- If a J-gene, donor splice is as expected and coding region has no stop codons

ORF
- One or more of the above conditions are not met, but no stop codon has been detected
- If a V-gene, leader starts with ATG

Pseudo
- Coding region contains stop codon(s)
- Leader does not start with ATG
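The rules above can be read as a small decision procedure. The sketch below is one simplified interpretation, not the code used by digger. It assumes the individual checks have already been reduced to booleans, and it treats the two Pseudo conditions as alternatives. The function name and its parameters are hypothetical, and the returned labels are taken from the headings above rather than from the actual values of the `functional` column.

```python
def assign_functionality(is_v_gene, all_conditions_met, has_stop_codon,
                         leader_starts_atg=True):
    """Simplified reading of the functionality rules listed above.

    all_conditions_met: True if every condition listed under 'Functional'
    that applies to this gene type holds.
    """
    # Any stop codon in the coding region makes the gene a pseudogene.
    if has_stop_codon:
        return "Pseudo"
    # For V-genes, a leader that does not start with ATG also gives Pseudo.
    if is_v_gene and not leader_starts_atg:
        return "Pseudo"
    # Every applicable condition holds.
    if all_conditions_met:
        return "Functional"
    # Some condition failed, but no stop codon was detected.
    return "ORF"


print(assign_functionality(True, True, False))          # all checks pass
print(assign_functionality(True, False, False))         # e.g. RSS below threshold
print(assign_functionality(True, False, False, False))  # leader lacks ATG
```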
V Leader Annotation
Exons 1 and 2 are annotated in accordance with the customary genomic annotation regarding splice sites. At first glance, it may be thought that the L-PART1 sequence should be the same as the EXON1 sequence, and the L-PART2 sequence the same as the 5’ end of the EXON2 sequence. However, IMGT defines L-PART1 and L-PART2 in a manner that requires each to occupy a whole number of codons. To achieve this, any extra nucleotides at the end of EXON1 that do not make up a complete codon are assigned to L-PART2. The approach is documented here. In the table on that page, the first row, in which a single nucleotide is transferred to L-PART2, is the case normally encountered in V-genes.
It should be clear from the above that L-PART1 and L-PART2 are not usually separated by the V-INTRON, as is often depicted in the literature.
We provide coordinates for EXON1 and EXON2 but not for L-PART1 and L-PART2, as the latter do not have simple ‘start’ and ‘end’ coordinates in the genomic sequence. They are best thought of as protein-based features.
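The relationship between EXON1, EXON2 and the two leader parts described above can be sketched as follows. This is an illustration of the codon-boundary rule only, not the implementation used by digger. The function `split_leader`, its parameters, and the short sequences are made up, and the leader portion at the 5’ end of EXON2 is assumed to be already known.

```python
def split_leader(exon1, exon2_leader):
    """Derive (L-PART1, L-PART2) from EXON1 and the leader portion of EXON2.

    Trailing nucleotides of exon1 that do not complete a codon are moved
    to the front of L-PART2, so that L-PART1 is a whole number of codons.
    """
    extra = len(exon1) % 3                 # nucleotides left over after whole codons
    cut = len(exon1) - extra
    l_part1 = exon1[:cut]
    l_part2 = exon1[cut:] + exon2_leader
    return l_part1, l_part2


# Made-up sequences: 10 nt in EXON1, so one nucleotide is transferred.
l_part1, l_part2 = split_leader("ATGGAGTTTG", "GTGTCCAGTGT")
print(l_part1)  # ATGGAGTTT
print(l_part2)  # GGTGTCCAGTGT
```

In this example the final G of EXON1 becomes the first nucleotide of L-PART2, so the two leader parts are not separated exactly at the intron boundary.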