dig_sequence

dig_sequence provides targeted annotation of genomic sequences. The sequences can be as small as a single coding region, or as large as an entire locus. The tool searches the supplied sequence for the region that best matches a specified target sequence, and annotates only that best match.

Options are available to annotate a single sequence in a FASTA file, a single GenBank ID (the sequence is fetched from GenBank), or a list of sequences or GenBank IDs specified in a CSV file. For GenBank requests, an email address must be provided, as this is a requirement of the GenBank API.

Example usage is described in Targeted Annotation.

Annotate a genomic sequence representing the nominated receptor gene

usage: dig_sequence [-h] {fasta,single,multi,multi_seq} ...

Sub-commands

fasta

Annotate a single genomic sequence in a FASTA file

dig_sequence fasta [-h] [-align ALIGN] [-species SPECIES] [-motif_dir MOTIF_DIR] [-out_file OUT_FILE] [-debug] target germline_file query_file

Positional Arguments

target: Name of nominated sequence in reference set
germline_file: ungapped reference set containing the nominated sequence (FASTA)
query_file: file containing the sequence to annotate (FASTA)

Named Arguments

-align: gapped reference set to use for V gene alignments (required for V gene analysis)
-species: use motifs for the specified species provided with the package
-motif_dir: use motif probability files present in the specified directory
-out_file: output file (CSV)
-debug: produce parsing_errors file with debug information (default: False)
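
As an illustration, the `fasta` invocation can be assembled programmatically. The Python sketch below builds the argument list from the usage line above. The helper name `build_fasta_command`, the target name, the species value, and all file names are hypothetical placeholders, not names defined by the package.

```python
import shlex


def build_fasta_command(target, germline_file, query_file,
                        align=None, species=None, motif_dir=None,
                        out_file=None, debug=False):
    """Return an argv list for 'dig_sequence fasta' (illustrative helper)."""
    cmd = ["dig_sequence", "fasta"]
    # Optional named arguments, emitted only when a value is supplied.
    for flag, value in (("-align", align), ("-species", species),
                        ("-motif_dir", motif_dir), ("-out_file", out_file)):
        if value is not None:
            cmd += [flag, str(value)]
    if debug:
        cmd.append("-debug")
    # Positional arguments, in the order given by the usage line.
    cmd += [target, germline_file, query_file]
    return cmd


# Placeholder values; substitute your own target, species and files.
example = build_fasta_command(
    "IGHV1-69*01", "human_IGHV.fasta", "query.fasta",
    align="human_IGHV_gapped.fasta", species="Human",
    out_file="annotation.csv")

# shlex.join quotes arguments for a POSIX shell where needed.
print(shlex.join(example))
```

Allele names commonly contain `*`, which a shell may treat as a wildcard, so quoting the target (as `shlex.join` does here) is a sensible precaution when the command is pasted into a terminal.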

single

Annotate a single sequence given its GenBank accession number

dig_sequence single [-h] [-align ALIGN] [-species SPECIES] [-motif_dir MOTIF_DIR] [-out_file OUT_FILE] target germline_file genbank_acc email_addr

Positional Arguments

target: Name of nominated sequence
germline_file: ungapped reference set containing the nominated sequence (FASTA)
genbank_acc: GenBank accession number of the sequence to annotate
email_addr: email address to provide to GenBank

Named Arguments

-align: gapped reference set to use for V gene alignments (required for V gene analysis)
-species: use motifs for the specified species provided with the package
-motif_dir: use motif probability files present in the specified directory
-out_file: output file (CSV)
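
The `single` sub-command can also be driven from a script. The following is a minimal sketch, assuming `dig_sequence` is installed and on the PATH. The function name `annotate_accession`, the `runner` parameter, and the simple email check are illustrative additions, not part of the package.

```python
import subprocess


def annotate_accession(target, germline_file, genbank_acc, email_addr,
                       out_file=None, runner=subprocess.run):
    """Invoke 'dig_sequence single' for one accession (illustrative helper)."""
    if "@" not in email_addr:
        # Cheap sanity check only; GenBank requests need a contact address.
        raise ValueError("email_addr does not look like an email address")
    cmd = ["dig_sequence", "single"]
    if out_file is not None:
        cmd += ["-out_file", out_file]
    # Positional arguments, in the order given by the usage line.
    cmd += [target, germline_file, genbank_acc, email_addr]
    # check=True raises CalledProcessError if the tool exits with a
    # non-zero status.
    return runner(cmd, check=True)
```

Passing the command as a list means no shell is involved, so characters such as `*` in the target name are delivered to the tool unchanged. The `runner` argument exists only so the helper can be exercised without calling the real program.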

multi

Read allele names and corresponding GenBank accession numbers from a CSV file

dig_sequence multi [-h] [-align ALIGN] [-species SPECIES] [-motif_dir MOTIF_DIR] [-out_file OUT_FILE] locus germline_file query_file email_addr

Positional Arguments

locus: Locus of nominated sequences
germline_file: ungapped reference set containing the nominated sequence (FASTA)
query_file: file containing list of targets and associated GenBank accession numbers (CSV)
email_addr: email address to provide to GenBank

Named Arguments

-align: gapped reference set to use for V gene alignments (required for V gene analysis)
-species: use motifs for the specified species provided with the package
-motif_dir: use motif probability files present in the specified directory
-out_file: output file (CSV)

multi_seq

Read allele names and genomic sequences from a CSV file

dig_sequence multi_seq [-h] [-align ALIGN] [-species SPECIES] [-motif_dir MOTIF_DIR] [-out_file OUT_FILE] locus germline_file query_file

Positional Arguments

locus: Locus of nominated sequences
germline_file: ungapped reference set containing the nominated sequence (FASTA)
query_file: file containing list of targets and genomic sequences (CSV)

Named Arguments

-align: gapped reference set to use for V gene alignments (required for V gene analysis)
-species: use motifs for the specified species provided with the package
-motif_dir: use motif probability files present in the specified directory
-out_file: output file (CSV)
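
The four sub-commands above differ mainly in their positional arguments. As a compact summary, the sketch below records the positionals taken from each usage line and assembles a command from keyword values. The names `POSITIONALS`, `needs_email` and `build_command` are hypothetical, and the locus and file names in the example are placeholders.

```python
# Positional arguments per sub-command, copied from the usage lines above.
POSITIONALS = {
    "fasta": ("target", "germline_file", "query_file"),
    "single": ("target", "germline_file", "genbank_acc", "email_addr"),
    "multi": ("locus", "germline_file", "query_file", "email_addr"),
    "multi_seq": ("locus", "germline_file", "query_file"),
}


def needs_email(sub_command):
    """True for sub-commands that take an email address for GenBank."""
    return "email_addr" in POSITIONALS[sub_command]


def build_command(sub_command, **values):
    """Return an argv list, checking that every positional is supplied."""
    names = POSITIONALS[sub_command]
    missing = [name for name in names if name not in values]
    if missing:
        raise ValueError(f"missing arguments for {sub_command}: {missing}")
    return ["dig_sequence", sub_command] + [str(values[n]) for n in names]


# Placeholder values for illustration only.
cmd = build_command("multi_seq", locus="IGH",
                    germline_file="human_IGH.fasta", query_file="targets.csv")
print(cmd)
```

Only `single` and `multi` take `email_addr`, since they are the sub-commands that fetch sequences from GenBank.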