Overview
There are two principal tools: digger
itself, and dig-sequence
. digger
performs de-novo annotation, searching for all germline sequences that are similar
to those in a reference set. dig_sequence
searches for the single sequence that is most similar to a given reference sequence. Both tools use the same
underlying code. dig_sequence
can be useful for determining the full annotation of a genomic sequence you have identified, or for verifying a sequence
discovered in GenBank. digger
is slower, but performs the full annotation of a genomic sequence or assembly.
digger
Scope and Features
digger
identifies immunoglobulin receptor germline sequences in a genomic sequence or assembly. It requires two inputs: the sequence or assembly, and a set of reference genes to use as a starting point. An existing
reference set for the species under study is an ideal starting point, but if one is not available, good results can be obtained by using the reference set for another species, preferably one that is
reasonably closely related. The package is intended to be straightforward to use, but some familiarity with command-line tools is expected.
digger
starts by searching the sequence for approximate matches to sequences in the reference set. Where these are found, the match is extended to the full length of the matched sequence. A window
at either end is then checked for the expected flanking sequences (e.g. leader, RSS). These are identified by means of position weight matrices (PWMs). The flanking sequences have a well-established
‘canonical’ form that is conserved between species. digger
contains PWMs for human IG/TR and rhesus macaque IG loci, and these can be used as a starting point for other species. However, as some variation
is observed between species, The toolkit also contains tools for deriving tailored PWMs for a species of interest. These can be obtained from an existing annotation of the locus, or from an initial
annotation conducted with the human or rhesus PWMs. Digger’s built-in PWM’s are based on analysis of selected IMGT annotations.
Pipeline
A brief summary of the tools contained in the package is given below. In most cases, annotation can be performed with the digger tool. This will call subsidiary tools as necessary. Please refer to the Examples section for more information. The first example, Annotating the human IGH locus, explains straightforward use. The second example, Annotating the rhesus macaque IGH locus, shows how to generate PWMs for a locus or species that is not already covered by the package. It also shows how to call the underlying tools directly.
dig-sequence
dig_sequence
identifies and annotates the closest match to a single specified reference sequence. It provides three modes: one in which the sequence to search
is provided in a FASTA file, one in which it is specified as a GenBank accession number, and a third in which multiple GenBank accession numbers to search are provided
in a file. Its use is described in dig_sequence.
Tool |
Description |
---|---|
Convert the result of a BLAST search to a simple tabular format |
|
Determine PWMs from a list of gene features |
|
Compare two sets of annotations |
|
Run the annotation pipeline |
|
Targeted annotation of specific sequences |
|
Produce the annotation file, given tabulated BLAST results and other inputs |
|
Download and parse an annotation file from IMGT |