Introduction
This package contains utilities and code for working with immunoglobulin (IG) and T cell receptor (TR) sequences.
Installation
pip install receptor-utils
The module requires Biopython, which needs to be installed separately.
Naming and aligning V, D and J sequences
name_allele - Provide a name for a given sequence. The name includes the nearest germline allele (taken from a reference set supplied to the tool) and incorporates any SNPs. The naming scheme that it uses is described here
gap_sequences - Gap a supplied set of sequences according to the IMGT alignment, warning of any missing conserved residues. It employs a gapped reference set, which must be provided. For each sequence to be gapped, the tool identifies the closest reference sequence, and uses it as a template.
fix_macaque_gaps - The IMGT alignment for macaque IG sequences has inserted codons relative to the alignment used for most other species. This can cause problems for downstream tools. This utility removes the additional codons in macaque IG sequences, reverting to the standard alignment.
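The template idea behind gap_sequences can be illustrated with a minimal sketch. This is not the tool's implementation: the function names, the reference names and the sequences below are hypothetical, and the sketch assumes the query has no insertions or deletions relative to the chosen reference, which real data will not always satisfy. It picks the closest reference by counting mismatches, then copies that reference's gap positions (IMGT uses '.') onto the query.

```python
from typing import Dict


def closest_reference(seq: str, gapped_refs: Dict[str, str], gap: str = ".") -> str:
    """Return the name of the reference with the fewest mismatches to seq.

    Mismatches are counted position by position against the ungapped
    reference; any difference in length is added as a penalty.
    """
    def mismatches(name: str) -> int:
        ref = gapped_refs[name].replace(gap, "")
        diff = sum(a != b for a, b in zip(seq, ref))
        return diff + abs(len(seq) - len(ref))

    return min(gapped_refs, key=mismatches)


def apply_gap_template(ungapped: str, gapped_template: str, gap: str = ".") -> str:
    """Insert gap characters into ungapped wherever the template has them."""
    bases = iter(ungapped)
    out = []
    for ch in gapped_template:
        if ch == gap:
            out.append(gap)
        else:
            base = next(bases, None)
            if base is None:
                break  # the sequence is shorter than the template
            out.append(base)
    out.extend(bases)  # keep any bases that run past the end of the template
    return "".join(out)


# Hypothetical gapped reference set and query sequence
refs = {
    "REF1*01": "CAGGTG...CAGCTG",
    "REF2*01": "GAGGTCCAG...CTG",
}
query = "CAGGTGCAGTTG"

template_name = closest_reference(query, refs)
gapped_query = apply_gap_template(query, refs[template_name])
print(template_name, gapped_query)
```

A real gapping tool also has to check the conserved residues after gapping, which this sketch leaves out.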
Using custom databases with IgBlast
These tools are covered in detail in the next section.
make_igblast_ndm - Create a custom ndm file for IgBlast
annotate_j - Create a custom auxiliary data file for IgBlast
Convenience Tools
download_germline_set - Download reference sequences from the Open Germline Receptor Database (OGRDB)
extract_refs - Download reference sequences from IMGT
identical_seqs - Report cases where, in a FASTA file, the same sequence is listed more than once with different IDs
rev_comp - Reverse-complement a nucleotide sequence
at_coords - Find the subsequence at specific co-ordinates within a sequence in a FASTA file
merge_fasta - Merge sequences in two FASTA files, removing duplicates
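To make the behaviour of a few of these tools concrete, here is a rough sketch of the underlying operations in plain Python, working on sequences held as strings in a dict. The function names are hypothetical and this is not the tools' own code. In particular, treating sequences that differ only in case as identical, and dropping a sequence from the second set when either its ID or its sequence is already present, are assumptions on my part about what "duplicate" means.

```python
from collections import defaultdict
from typing import Dict, List

# Translation table for complementing nucleotides (N stays N)
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the result."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_identical(seqs: Dict[str, str]) -> List[List[str]]:
    """Group IDs that share the same sequence (case-insensitive)."""
    by_seq = defaultdict(list)
    for name, seq in seqs.items():
        by_seq[seq.upper()].append(name)
    return [names for names in by_seq.values() if len(names) > 1]


def merge_unique(first: Dict[str, str], second: Dict[str, str]) -> Dict[str, str]:
    """Merge two sets, skipping IDs or sequences already present."""
    merged = dict(first)
    seen = {seq.upper() for seq in merged.values()}
    for name, seq in second.items():
        if name not in merged and seq.upper() not in seen:
            merged[name] = seq
            seen.add(seq.upper())
    return merged


print(reverse_complement("ATGC"))
print(find_identical({"a": "ACGT", "b": "acgt", "c": "GGGG"}))
print(merge_unique({"a": "ACGT"}, {"b": "ACGT", "c": "TTTT"}))
```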
simple_bio_seq API
This is a set of simple wrappers around commonly-used functions in Biopython which suit my use case and are used in the tools above. They support, for example, one-line reading and writing of FASTA files and comma/tab-separated files, and manage sequences as strings in a dict, which I find simplifies code compared to the Biopython functions, at the expense of flexibility which I rarely need. They are documented under API.
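The "sequences as strings in a dict" approach can be sketched as follows. This is an illustration only and does not show the simple_bio_seq functions themselves, which are documented under API. The helper names are hypothetical, and the sketch parses FASTA in plain Python rather than through Biopython so that it runs without dependencies.

```python
import io
from typing import Dict, TextIO


def read_fasta_to_dict(handle: TextIO) -> Dict[str, str]:
    """Read FASTA records into a dict of ID -> sequence string.

    The ID is the first word of the header line; multi-line sequences
    are joined into a single string.
    """
    seqs: Dict[str, str] = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = (line[1:].split() or [""])[0]
            seqs[name] = ""
        elif name is not None:
            seqs[name] += line
    return seqs


def write_fasta_from_dict(seqs: Dict[str, str], handle: TextIO, width: int = 60) -> None:
    """Write a dict of ID -> sequence as FASTA, wrapping lines at width."""
    for name, seq in seqs.items():
        handle.write(f">{name}\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")


# Round trip through in-memory text, standing in for files on disk
fasta_text = ">seq1 example record\nACGT\nTTAA\n>seq2\nGGCC\n"
records = read_fasta_to_dict(io.StringIO(fasta_text))

buffer = io.StringIO()
write_fasta_from_dict(records, buffer)
round_trip = read_fasta_to_dict(io.StringIO(buffer.getvalue()))
print(records)
```

Holding records this way means ordinary dict and string operations (lookup by ID, slicing, comparison) work directly, at the cost of discarding the header description and any other per-record metadata.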