Annotation format

This page describes the annotation file produced by digger / find_alignments.

Columns in the Annotation File

The file contains the columns listed in the first table. In addition, for each reference specified with a -ref argument, it contains the columns listed in the second table, with each column name prefixed by the reference name.

Column Name

Meaning

contig

ID of the sequence in which the gene or pseudogene was found

start

start co-ord of the coding region

end

end co-ord of the coding region

start_rev

start co-ord in the reverse-primed sequence

end_rev

end co-ord in the reverse-primed sequence

sense

sense (relative to the input sequence)

gene_type

gene type (e.g. IGHV)

gene_start

start co-ord of the entire gene including flanking regions

gene_end

end co-ord of the entire gene including flanking regions

gene_start_rev

start co-ord of the entire gene including flanking regions in the reverse-primed sequence

gene_end_rev

end co-ord of the entire gene including flanking regions in the reverse-primed sequence

likelihood

likelihood that the RSS is that of a functional gene (compared to a random sequence)

l_part1

leader part 1 sequence

l_part2

leader part 2 sequence

v_heptamer

v-heptamer sequence

v_nonamer

v-nonamer sequence

j_heptamer

j-heptamer sequence

j_nonamer

j-nonamer sequence

j_frame

coding frame of the first nucleotide of the j region (0, 1 or 2)

d_3_heptamer

3-prime d-heptamer sequence

d_3_nonamer

3-prime d-nonamer sequence

d_5_heptamer

5-prime d-heptamer sequence

d_5_nonamer

5-prime d-nonamer sequence

functional

functionality (see below)

notes

annotation notes

aa

amino acid translation of the coding region

v-gene_aligned_aa

IMGT-gapped amino acid translation of the coding sequence (for V-genes)

seq

sequence of the coding region

seq_gapped

IMGT-gapped sequence of the coding region (V-genes only)

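5_rss_start

start co-ord of the 5-prime RSS

5_rss_start_rev

start co-ord of the 5-prime RSS in the reverse-primed sequence

5_rss_end

end co-ord of the 5-prime RSS

5_rss_end_rev

end co-ord of the 5-prime RSS in the reverse-primed sequence

3_rss_start

start co-ord of the 3-prime RSS

3_rss_start_rev

start co-ord of the 3-prime RSS in the reverse-primed sequence

3_rss_end

end co-ord of the 3-prime RSS

3_rss_end_rev

end co-ord of the 3-prime RSS in the reverse-primed sequence

l_part1_start

start co-ord of the leader part 1

l_part1_start_rev

start co-ord of the leader part 1 in the reverse-primed sequence

l_part1_end

end co-ord of the leader part 1

l_part1_end_rev

end co-ord of the leader part 1 in the reverse-primed sequence

l_part2_start

start co-ord of the leader part 2

l_part2_start_rev

start co-ord of the leader part 2 in the reverse-primed sequence

l_part2_end

end co-ord of the leader part 2

l_part2_end_rev

end co-ord of the leader part 2 in the reverse-primed sequence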

matches

number of matches to this start/end region that were produced in the BLAST analysis

blast_match

gene in the reference file with the highest match score in this start/end region

blast_score

the highest BLAST match score in this start/end region

blast_nt_diffs

the number of nucleotides differing from the most highly scoring reference sequence in this BLAST match

evalue

evalue of the most highly scoring BLAST match in this start/end region
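As a rough illustration of working with these columns, the sketch below reads annotation rows and tallies them by the functional column. It assumes the annotation file is comma-separated with a header row, which this page does not state, and the sample rows are made up and carry only a handful of the columns. The helper name load_annotations is hypothetical.

```python
import csv
import io
from collections import Counter

# Illustrative rows only: invented values, and only a few of the columns.
SAMPLE = """contig,start,end,gene_type,functional
contig_1,1200,1495,IGHV,Functional
contig_1,5300,5590,IGHV,Pseudo
contig_2,410,460,IGHJ,ORF
"""


def load_annotations(handle):
    """Read annotation rows as dicts, converting co-ordinates to int.

    Assumes a comma-separated file with a header row (an assumption).
    """
    rows = []
    for row in csv.DictReader(handle):
        row["start"] = int(row["start"])
        row["end"] = int(row["end"])
        rows.append(row)
    return rows


rows = load_annotations(io.StringIO(SAMPLE))

# Compare case-insensitively, since the exact capitalisation of the
# values in the functional column is not specified on this page.
counts = Counter(row["functional"].lower() for row in rows)

functional_v = [
    row for row in rows
    if row["gene_type"] == "IGHV" and row["functional"].lower() == "functional"
]
```

For a real file, the same function could be given an open file handle, for example open(path, newline="").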

Columns provided for each -ref:

Column Name

Meaning

_match

ID of the closest matching reference gene

_score

score of the closest match

_nt_diffs

number of nucleotides differing from the closest reference sequence
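The per-reference columns can be collected by building each column name from the reference name and the suffixes in the table above. This is a minimal sketch: the reference name imgt, the sample values, and the helper reference_columns are all hypothetical, and the row is represented as a plain dict such as csv.DictReader would produce.

```python
def reference_columns(row, ref_name):
    """Return the match, score and nt_diffs values for one reference.

    Column names are formed as <ref_name>_match, <ref_name>_score and
    <ref_name>_nt_diffs. A column missing from the row yields None.
    """
    return {
        suffix: row.get(f"{ref_name}_{suffix}")
        for suffix in ("match", "score", "nt_diffs")
    }


# Invented example row for a reference that was named "imgt" (hypothetical).
row = {
    "contig": "contig_1",
    "imgt_match": "IGHV1-2*01",
    "imgt_score": "100.0",
    "imgt_nt_diffs": "0",
}

imgt = reference_columns(row, "imgt")
other = reference_columns(row, "other")
```

Values are left as strings here because the numeric format of the score column is not described on this page.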

Functionality

Functionality is assigned as follows:

Functional

  • RSS and leader meet or exceed position-weighted matrix threshold

  • Highly-conserved nucleotides agree with the definition for the locus, if a definition has been specified

  • If a V-gene, leader starts with ATG, and spliced leader has no stop codons

  • If a V-gene, coding region has no stop codons before the cysteine at IMGT position 104

  • If a V-gene, conserved nucleotides are at the expected locations

  • If a J-gene, donor splice is as expected and coding region has no stop codons

ORF

  • One or more of the above conditions are not met, but no stop codon has been detected

  • If a V-gene, leader starts with ATG

Pseudo

  • Coding region contains stop codon(s)

  • Leader does not start with ATG
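The rules above can be read as a small decision procedure. The sketch below is one interpretation, not digger's own implementation: it assumes the Pseudo conditions are checked first, that the ATG requirement applies only to V-genes, and that all remaining Functional conditions have been reduced to a single boolean. The function and parameter names are hypothetical.

```python
def assign_functionality(is_v_gene, has_stop_codon,
                         leader_starts_with_atg, other_checks_pass):
    """Classify a gene as Functional, ORF or Pseudo.

    other_checks_pass stands for the remaining Functional conditions
    (RSS and leader thresholds, conserved nucleotides, donor splice).
    """
    # Pseudo: a stop codon, or a V-gene leader that does not start with ATG.
    if has_stop_codon or (is_v_gene and not leader_starts_with_atg):
        return "Pseudo"
    # Functional: every remaining condition is met.
    if other_checks_pass:
        return "Functional"
    # ORF: some condition failed, but no stop codon was detected.
    return "ORF"


v_functional = assign_functionality(True, False, True, True)
v_orf = assign_functionality(True, False, True, False)
v_pseudo = assign_functionality(True, False, False, True)
j_pseudo = assign_functionality(False, True, False, True)
```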