Annotation format

This page describes the annotation file produced by digger / find_alignments.

Columns in the Annotation File

In addition to the columns in the first table, the file contains the columns in the second table, prefixed by the reference name, for each reference specified with a -ref argument.

Column Name        Meaning
-----------------  -------
contig             ID of the sequence in which the gene or pseudogene was found
start              start co-ord of the coding region
end                end co-ord of the coding region
start_rev          start co-ord in the reverse-primed sequence
end_rev            end co-ord in the reverse-primed sequence
sense              sense (relative to the input sequence)
gene_type          gene type (e.g. IGHV)
gene_start         start co-ord of the entire gene including flanking regions
gene_end           end co-ord of the entire gene including flanking regions
gene_start_rev     start co-ord of the entire gene including flanking regions in the reverse-primed sequence
gene_end_rev       end co-ord of the entire gene including flanking regions in the reverse-primed sequence
likelihood         likelihood that the RSS is that of a functional gene (compared to a random sequence)
l_part1            leader part 1 sequence
l_part2            leader part 2 sequence
exon1              exon 1 sequence
exon2              exon 2 sequence
donor-splice       donor splice site sequence
acceptor-splice    acceptor splice site sequence
v_intron           v-intron sequence
v_heptamer         v-heptamer sequence
v_spacer           v-spacer sequence
v_spacer_len       length of the v-spacer
v_nonamer          v-nonamer sequence
j_heptamer         j-heptamer sequence
j_spacer           j-spacer sequence
j_spacer_len       length of the j-spacer
j_nonamer          j-nonamer sequence
j_frame            coding frame of the first nucleotide of the j region (0, 1 or 2)
d_3_heptamer       3-prime d-heptamer sequence
d_3_spacer         3-prime d-spacer sequence
d_3_spacer_len     length of the 3-prime d-spacer
d_3_nonamer        3-prime d-nonamer sequence
d_5_heptamer       5-prime d-heptamer sequence
d_5_spacer         5-prime d-spacer sequence
d_5_spacer_len     length of the 5-prime d-spacer
d_5_nonamer        5-prime d-nonamer sequence
functional         functionality (see below)
notes              annotation notes
aa                 amino acid translation of the coding region
v-gene_aligned_aa  IMGT-gapped amino acid translation of the coding sequence (for V-genes)
seq                sequence of the coding region
seq_gapped         IMGT-gapped sequence of the coding region (V-genes only)
5_rss_start        start co-ord of the 5-prime RSS
5_rss_start_rev    start co-ord of the 5-prime RSS in the reverse-primed sequence
5_rss_end          end co-ord of the 5-prime RSS
5_rss_end_rev      end co-ord of the 5-prime RSS in the reverse-primed sequence
3_rss_start        start co-ord of the 3-prime RSS
3_rss_start_rev    start co-ord of the 3-prime RSS in the reverse-primed sequence
3_rss_end          end co-ord of the 3-prime RSS
3_rss_end_rev      end co-ord of the 3-prime RSS in the reverse-primed sequence
l_part1_start      start co-ord of the leader part 1
l_part1_start_rev  start co-ord of the leader part 1 in the reverse-primed sequence
l_part1_end        end co-ord of the leader part 1
l_part1_end_rev    end co-ord of the leader part 1 in the reverse-primed sequence
l_part2_start      start co-ord of the leader part 2
l_part2_start_rev  start co-ord of the leader part 2 in the reverse-primed sequence
l_part2_end        end co-ord of the leader part 2
l_part2_end_rev    end co-ord of the leader part 2 in the reverse-primed sequence
matches            number of matches to this start/end region that were produced in the BLAST analysis
blast_match        gene in the reference file with the highest match score in this start/end region
blast_score        the highest BLAST match score in this start/end region
blast_nt_diffs     the number of nucleotides differing from the most highly scoring reference sequence in this BLAST match
evalue             evalue of the most highly scoring BLAST match in this start/end region

Columns provided for each -ref:

Column Name  Meaning
-----------  -------
_match       ID of the closest matching reference gene
_score       score of the closest match
_nt_diffs    number of nucleotides differing from the closest reference sequence
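As an illustration of how the per-reference columns are named, the following minimal sketch reads a small in-memory excerpt and collects the three columns for one reference. It assumes the annotation file is comma-separated with a header row, and that a reference named imgt was supplied; the sample rows, the reference name, and the helper ref_columns are hypothetical and not part of digger.

```python
import csv
import io

# Hypothetical excerpt of an annotation file; a real file has many more columns.
# Assumption: comma-separated values with a header row, and a reference named "imgt".
SAMPLE = """contig,start,end,gene_type,functional,imgt_match,imgt_score,imgt_nt_diffs
chr14,1200,1495,IGHV,Functional,IGHV1-2*02,100.0,0
chr14,5300,5592,IGHV,Pseudo,IGHV3-7*01,97.3,8
"""

# Suffixes of the columns written once per reference.
REF_SUFFIXES = ("_match", "_score", "_nt_diffs")


def ref_columns(row, ref_name):
    """Return one row's per-reference values, keyed by suffix without the underscore."""
    return {suffix[1:]: row[ref_name + suffix] for suffix in REF_SUFFIXES}


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
for row in rows:
    print(row["contig"], row["start"], row["end"], ref_columns(row, "imgt"))
```

For a file on disk, the io.StringIO object would be replaced by a file handle opened with newline="". All values are read as strings, so numeric columns need converting before use.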

Functionality

Functionality is assigned as follows:

Functional

  • RSS and leader meet or exceed position-weighted matrix threshold

  • Highly-conserved nucleotides agree with the definition for the locus, if a definition has been specified

  • If a V-gene, leader starts with ATG, donor splice ends GT or CT, acceptor splice ends AG, and spliced leader has no stop codons

  • If a V-gene, coding region has no stop codons before the cysteine at IMGT position 104

  • If a V-gene, conserved nucleotides are at the expected locations

  • If a J-gene, donor splice is as expected and coding region has no stop codons

ORF

  • One or more of the above conditions are not met, but no stop codon has been detected

  • If a V-gene, leader starts with ATG

Pseudo

  • Coding region contains stop codon(s)

  • Leader does not start with ATG
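One possible reading of the rules above is sketched below as a small decision function. It is not digger's implementation: the function classify and its boolean parameters are hypothetical, and the individual checks (matrix thresholds, conserved nucleotides, splice sites) are assumed to have been evaluated already and summarised in all_checks_pass.

```python
def classify(all_checks_pass, has_stop_codon, is_v_gene=False, leader_starts_atg=True):
    """Return 'Functional', 'ORF' or 'Pseudo' from pre-computed check results.

    all_checks_pass   -- True if every condition listed under Functional is met
    has_stop_codon    -- True if a stop codon was detected in the coding region
    is_v_gene         -- True for V-genes, where the leader ATG rule applies
    leader_starts_atg -- True if the leader starts with ATG (V-genes only)
    """
    # Either Pseudo condition is treated as sufficient on its own.
    if has_stop_codon or (is_v_gene and not leader_starts_atg):
        return "Pseudo"
    if all_checks_pass:
        return "Functional"
    # Some condition failed, but there is no stop codon and any V-gene leader is intact.
    return "ORF"


print(classify(True, False, is_v_gene=True))                            # Functional
print(classify(False, False, is_v_gene=True))                           # ORF
print(classify(False, False, is_v_gene=True, leader_starts_atg=False))  # Pseudo
```

The Pseudo test is made first so that a stop codon or a missing leader ATG always takes precedence, whatever the other checks report.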

V Leader Annotation

Exons 1 and 2 are annotated in accordance with the customary genomic annotation regarding splice sites. At first glance, it might seem that the L-PART1 sequence should be the same as the EXON1 sequence, and the L-PART2 sequence the same as the 5’ end of the EXON2 sequence. However, IMGT defines L-PART1 and L-PART2 in a manner that requires each to occupy a whole number of codons. To achieve this, any extra nucleotides at the end of EXON1 that do not make up a complete codon are assigned to L-PART2. The approach is documented here. In the table on that page, the first row, in which a single nucleotide is transferred to L-PART2, is the case normally encountered in V-genes.

It should be clear from the above that L-PART1 and L-PART2 are not usually separated by the V-INTRON, as is often depicted in the literature.

We provide coordinates for EXON1 and EXON2 but not for L-PART1 and L-PART2, as the latter do not have simple ‘start’ and ‘end’ coordinates in the genomic sequence. They are best thought of as protein-based features.
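The codon-boundary rule described in this section can be sketched as follows. This is only an illustration under stated assumptions: split_leader is a hypothetical helper, the sequences are synthetic, and exon2_leader stands for the leader-encoding 5’ portion of EXON2, which is assumed to be known already.

```python
def split_leader(exon1, exon2_leader):
    """Derive (L-PART1, L-PART2) from EXON1 and the leader portion of EXON2.

    Nucleotides at the end of EXON1 that do not complete a codon are moved
    to the start of L-PART2, so that L-PART1 is a whole number of codons.
    """
    spare = len(exon1) % 3       # trailing nucleotides that do not fill a codon
    cut = len(exon1) - spare     # last codon boundary within EXON1
    l_part1 = exon1[:cut]
    l_part2 = exon1[cut:] + exon2_leader
    return l_part1, l_part2


# Synthetic example: EXON1 has 10 nt, so one nucleotide is transferred to L-PART2.
l_part1, l_part2 = split_leader("ATGAAACCCG", "GTGTC")
print(l_part1)  # ATGAAACCC
print(l_part2)  # GGTGTC
```

In this example the transferred G is adjacent to the rest of L-PART2 only after splicing, which is why the two parts are described above as protein-based features without simple genomic coordinates.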