Annotation format

This page describes the annotation file produced by digger / find_alignments.

Columns in the Annotation File

In addition to the columns in the first table, the file contains the columns in the second table, prefixed by the reference name, for each reference specified with a -ref argument.

Column Name	Meaning
contig	ID of the sequence in which the gene or pseudogene was found
start	start co-ord of the coding region
end	end co-ord of the coding region
start_rev	start co-ord in the reverse-primed sequence
end_rev	end co-ord in the reverse-primed sequence
sense	sense (relative to the input sequence)
gene_type	gene type (e.g. IGHV)
gene_start	start co-ord of the entire gene including flanking regions
gene_end	end co-ord of the entire gene including flanking regions
gene_start_rev	start co-ord of the entire gene including flanking regions in the reverse-primed sequence
gene_end_rev	end co-ord of the entire gene including flanking regions in the reverse-primed sequence
likelihood	likelihood that the RSS is that of a functional gene (compared to a random sequence)
l_part1	leader part 1 sequence
l_part2	leader part 2 sequence
v_heptamer	v-heptamer sequence
v_nonamer	v-nonamer sequence
j_heptamer	j-heptamer sequence
j_nonamer	j-nonamer sequence
j_frame	coding frame of the first nucleotide of the j region (0, 1 or 2)
d_3_heptamer	3-prime d-heptamer sequence
d_3_nonamer	3-prime d-nonamer sequence
d_5_heptamer	5-prime d-heptamer sequence
d_5_nonamer	5-prime d-nonamer sequence
functional	functionality (see below)
notes	annotation notes
aa	amino acid translation of the coding region
v-gene_aligned_aa	IMGT-gapped amino acid translation of the coding sequence (for V-genes)
seq	sequence of the coding region
seq_gapped	IMGT-gapped sequence of the coding region (V-genes only)
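5_rss_start	start co-ord of the 5-prime RSS
5_rss_start_rev	start co-ord of the 5-prime RSS in the reverse-primed sequence
5_rss_end	end co-ord of the 5-prime RSS
5_rss_end_rev	end co-ord of the 5-prime RSS in the reverse-primed sequence
3_rss_start	start co-ord of the 3-prime RSS
3_rss_start_rev	start co-ord of the 3-prime RSS in the reverse-primed sequence
3_rss_end	end co-ord of the 3-prime RSS
3_rss_end_rev	end co-ord of the 3-prime RSS in the reverse-primed sequence
l_part1_start	start co-ord of the leader part 1
l_part1_start_rev	start co-ord of the leader part 1 in the reverse-primed sequence
l_part1_end	end co-ord of the leader part 1
l_part1_end_rev	end co-ord of the leader part 1 in the reverse-primed sequence
l_part2_start	start co-ord of the leader part 2
l_part2_start_rev	start co-ord of the leader part 2 in the reverse-primed sequence
l_part2_end	end co-ord of the leader part 2
l_part2_end_rev	end co-ord of the leader part 2 in the reverse-primed sequence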
matches	number of matches to this start/end region that were produced in the BLAST analysis
blast_match	gene in the reference file with the highest match score in this start/end region
blast_score	the highest BLAST match score in this start/end region
blast_nt_diffs	the number of nucleotides differing from the most highly scoring reference sequence in this BLAST match
evalue	evalue of the most highly scoring BLAST match in this start/end region
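
As an illustration of how these columns might be consumed, the sketch below reads an annotation file and keeps only the rows marked as functional. It assumes the file is comma-delimited with a header row, and that the functional column holds the category names used in the Functionality section (Functional, ORF, Pseudo); neither is stated on this page, so check both against your own output. The helper name functional_rows and the sample rows are hypothetical.

```python
import csv
import io


def functional_rows(handle, delimiter=","):
    """Yield (contig, start, end, gene_type) for rows marked as functional.

    Assumption: the file has a header row using the column names listed
    above, and the 'functional' column contains the text 'Functional'.
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    for row in reader:
        if row["functional"].strip().lower() == "functional":
            yield row["contig"], int(row["start"]), int(row["end"]), row["gene_type"]


# Made-up sample with only a few of the documented columns.
sample = io.StringIO(
    "contig,start,end,gene_type,functional\n"
    "ctg1,100,395,IGHV,Functional\n"
    "ctg1,900,1190,IGHV,Pseudo\n"
)
result = list(functional_rows(sample))
```

With a real file, pass an open file handle instead of the in-memory sample, and change the delimiter argument if your file is tab-separated.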

Columns provided for each -ref:

Column Name	Meaning
_match	ID of the closest matching reference gene
_score	score of the closest match
_nt_diffs	number of nucleotides differing from the closest reference sequence
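
The per-reference column names can be built from the reference name and the three suffixes above. The following is a minimal sketch, assuming the name is joined directly to the suffix (so a reference called imgt would give imgt_match, imgt_score and imgt_nt_diffs). The reference name imgt and the helper names are hypothetical.

```python
REF_SUFFIXES = ("_match", "_score", "_nt_diffs")


def ref_columns(ref_name):
    """Return the three column names expected for one -ref reference.

    Assumption: the column name is the reference name followed
    immediately by the suffix, e.g. 'imgt' + '_match'.
    """
    return [f"{ref_name}{suffix}" for suffix in REF_SUFFIXES]


def closest_reference(row, ref_name):
    """Pick the per-reference values out of one annotation row (a dict)."""
    match_col, score_col, diffs_col = ref_columns(ref_name)
    return {
        "match": row.get(match_col),
        "score": row.get(score_col),
        "nt_diffs": row.get(diffs_col),
    }


# Made-up row for a hypothetical reference named 'imgt'.
example_row = {"imgt_match": "IGHV1-2*02", "imgt_score": "100.0", "imgt_nt_diffs": "0"}
example = closest_reference(example_row, "imgt")
```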

Functionality

Functionality is assigned as follows:

Functional

RSS and leader meet or exceed the position weight matrix (PWM) threshold
Highly-conserved nucleotides agree with the definition for the locus, if a definition has been specified
If a V-gene, leader starts with ATG, and spliced leader has no stop codons
If a V-gene, coding region has no stop codons before the cysteine at IMGT position 104
If a V-gene, conserved nucleotides are at the expected locations
If a J-gene, donor splice is as expected and coding region has no stop codons

ORF

One or more of the above conditions are not met, but no stop codon has been detected
If a V-gene, leader starts with ATG

Pseudo

Coding region contains stop codon(s)
Leader does not start with ATG
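
The rules above can be read as a small decision procedure. The sketch below is one possible reading, not digger's implementation: it treats the two Pseudo conditions as alternatives (either one is enough), applies the ATG check only to V-genes, and folds every other Functional condition into a single flag. The class GeneChecks and all of its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GeneChecks:
    """Outcome of the individual checks for one annotated gene (hypothetical)."""
    is_v_gene: bool
    has_stop_codon: bool          # stop codon detected in the coding region
    leader_starts_with_atg: bool  # only consulted for V-genes
    other_checks_pass: bool       # RSS/leader thresholds, conserved nucleotides, splice site, ...


def assign_functionality(checks):
    """Return 'Functional', 'ORF' or 'Pseudo' under the assumptions stated above."""
    # Pseudo: a stop codon, or (for V-genes) a leader that does not start with ATG.
    if checks.has_stop_codon:
        return "Pseudo"
    if checks.is_v_gene and not checks.leader_starts_with_atg:
        return "Pseudo"
    # Functional: every remaining condition is met.
    if checks.other_checks_pass:
        return "Functional"
    # ORF: some condition failed, but no stop codon was detected.
    return "ORF"
```

For example, a V-gene with no stop codon and an ATG-initiated leader, but an RSS below threshold, would be reported as ORF under this reading.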