Overview

There are two principal tools: digger itself, and dig-sequence. digger performs de-novo annotation, searching for all germline sequences that are similar to those in a reference set. dig_sequence searches for the single sequence that is most similar to a given reference sequence. Both tools use the same underlying code. dig_sequence can be useful for determining the full annotation of a genomic sequence you have identified, or for verifying a sequence discovered in GenBank. digger is slower, but performs the full annotation of a genomic sequence or assembly.

digger

Scope and Features

digger identifies immunoglobulin receptor germline sequences in a genomic sequence or assembly. It requires two inputs: the sequence or assembly, and a set of reference genes to use as a starting point. An existing reference set for the species under study is an ideal starting point, but if one is not available, good results can be obtained by using the reference set for another species, preferably one that is reasonably closely related. The package is intended to be straightforward to use, but some familiarity with command-line tools is expected.

digger starts by searching the sequence for approximate matches to sequences in the reference set. Where these are found, the match is extended to the full length of the matched sequence. A window at either end is then checked for the expected flanking sequences (e.g. leader, RSS). These are identified by means of position weight matrices (PWMs). The flanking sequences have a well-established ‘canonical’ form that is conserved between species. digger contains PWMs for human IG/TR and rhesus macaque IG loci, and these can be used as a starting point for other species. However, as some variation is observed between species, The toolkit also contains tools for deriving tailored PWMs for a species of interest. These can be obtained from an existing annotation of the locus, or from an initial annotation conducted with the human or rhesus PWMs. Digger’s built-in PWM’s are based on analysis of selected IMGT annotations.

Pipeline

_images/pipeline.jpg

A brief summary of the tools contained in the package is given below. In most cases, annotation can be performed with the digger tool. This will call subsidiary tools as necessary. Please refer to the Examples section for more information. The first example, Annotating the human IGH locus, explains straightforward use. The second example, Annotating the rhesus macaque IGH locus, shows how to generate PWMs for a locus or species that is not already covered by the package. It also shows how to call the underlying tools directly.

dig-sequence

dig_sequence identifies and annotates the closest match to a single specified reference sequence. It provides three modes: one in which the sequence to search is provided in a FASTA file, one in which it is specified as a GenBank accession number, and a third in which multiple GenBank accession numbers to search are provided in a file. Its use is described in dig_sequence.

Tool

Description

blastresults_to_csv

Convert the result of a BLAST search to a simple tabular format

calc_motifs

Determine PWMs from a list of gene features

compare_annotations

Compare two sets of annotations

digger

Run the annotation pipeline

dig_sequence

Targeted annotation of specific sequences

find_alignments

Produce the annotation file, given tabulated BLAST results and other inputs

parse_imgt_annotations

Download and parse an annotation file from IMGT