receptor_utils package
Submodules
receptor_utils.novel_allele_name module
- receptor_utils.novel_allele_name.closest_aligned_ref(seq: str, ref: dict)[source]
Given an input sequence, find the closest entry or entries in an IMGT or other reference set as determined by a global alignment.
- Parameters
seq (str) – the input sequence
ref (dict) – the reference set
- Returns
the name of the closest entry, or the names of all tied entries if more than one sequence in the reference set is equally close
- Return type
list
- receptor_utils.novel_allele_name.name_novel(novel_seq: str, ref_set: dict, v_gene: bool = True)[source]
Make a name for the novel allele, given its gapped or ungapped sequence. The name conforms to the description here. The sequence must be full-length at the 5 prime end, or gapped.
- Parameters
novel_seq (str) – the full-length sequence to name, which may be either gapped or ungapped in the case of a V-gene
ref_set (dict) – dict of reference genes (gapped in the case of V-genes), in the format returned by read_fasta
v_gene (bool) – True only if we are naming a V-gene
- Returns
tuple consisting of three strings: novel_name, novel_seq, notes. novel_seq is gapped in the case of a V-gene
- Return type
tuple
receptor_utils.number_v module
receptor_utils.simple_bio_seq module
Wrapper around BioSeq functions for simple applications. Principles: store sequences as strings and use dicts for collections; coerce all sequences to upper case on input; coerce iterators into lists for ease of debugging. Basically, just make things simple for cases where we don’t need to do more.
- receptor_utils.simple_bio_seq.chunks(l: str, n: int)[source]
Yield successive n-sized chunks from l.
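As an illustration of the behaviour described above, here is a minimal sketch that assumes the function slices its input at fixed offsets. The name `chunks_sketch` is hypothetical and the package's own implementation may differ.

```python
def chunks_sketch(l: str, n: int):
    """Yield successive n-sized chunks from l; the last chunk may be shorter."""
    for i in range(0, len(l), n):
        yield l[i:i + n]


# Split a short, made-up sequence into codon-sized pieces.
codons = list(chunks_sketch("ATGGCCTA", 3))
print(codons)  # ['ATG', 'GCC', 'TA']
```

Because the sketch is a generator, wrap the call in `list()` when the chunks are needed more than once.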
- receptor_utils.simple_bio_seq.closest_ref(seq: str, ref: dict)[source]
Given an input sequence, find the closest entry or entries in an IMGT or other reference set as determined by nt_diff.
- Parameters
seq (str) – the input sequence
ref (dict) – the reference set
- Returns
the name of the closest entry, or the names of all tied entries if more than one sequence in the reference set is equally close
- Return type
list
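The tie-handling described above can be sketched as follows. This is an assumption-laden illustration, not the package's code: `closest_ref_sketch` and the reference names are hypothetical, and the distance is a plain position-wise mismatch count of the kind documented for nt_diff.

```python
def closest_ref_sketch(seq: str, ref: dict) -> list:
    """Return the names of all reference entries with the fewest mismatches."""

    def mismatches(s1: str, s2: str) -> int:
        # Position-wise comparison over the shorter of the two sequences.
        return sum(1 for a, b in zip(s1, s2) if a != b)

    scores = {name: mismatches(seq, ref_seq) for name, ref_seq in ref.items()}
    if not scores:
        return []
    best = min(scores.values())
    # Keep every entry that ties for the best score, in dict order.
    return [name for name, score in scores.items() if score == best]


# Made-up reference set for illustration only.
ref = {"A*01": "ACGT", "A*02": "ACCT", "B*01": "TTTT"}
print(closest_ref_sketch("ACGA", ref))  # ['A*01']
print(closest_ref_sketch("ACAT", ref))  # ['A*01', 'A*02']
```

Returning a list even for a single hit means callers can treat the tied and untied cases the same way.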
- receptor_utils.simple_bio_seq.dumb_consensus(seqs: dict, threshold: float = 0.7)[source]
Return a dumb consensus on a dict of sequences, using Bio.Align.AlignInfo.SummaryInfo.dumb_consensus. All sequences should be the same length.
- Parameters
seqs (dict) – the sequences to analyse
threshold (float) – the threshold value that is required to add a particular atom
- Returns
the consensus sequence
- Return type
str
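To show how the threshold affects the result, the sketch below applies a simple majority rule to each column. It is not Biopython's code. It assumes that gap characters (- and .) are ignored, that ties are treated as ambiguous, and that ambiguous positions are reported as X. The name `dumb_consensus_sketch` is hypothetical.

```python
from collections import Counter


def dumb_consensus_sketch(seqs: dict, threshold: float = 0.7, ambiguous: str = "X") -> str:
    """Column-wise majority consensus over equal-length sequences."""
    consensus = []
    for column in zip(*seqs.values()):
        # Count residues in this column, ignoring gap characters.
        counts = Counter(c for c in column if c not in "-.")
        if not counts:
            consensus.append(ambiguous)
            continue
        total = sum(counts.values())
        top, top_n = counts.most_common(1)[0]
        tied = sum(1 for n in counts.values() if n == top_n) > 1
        if not tied and top_n / total >= threshold:
            consensus.append(top)
        else:
            consensus.append(ambiguous)
    return "".join(consensus)


# Made-up sequences: columns 3 and 4 each have one dissenting residue.
seqs = {"s1": "ACGT", "s2": "ACGA", "s3": "ACGT", "s4": "ACCT"}
print(dumb_consensus_sketch(seqs))                 # ACGT
print(dumb_consensus_sketch(seqs, threshold=0.8))  # ACXX
```

With four sequences a single dissenting residue gives a frequency of 0.75, which passes the default threshold of 0.7 but not a threshold of 0.8.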
- receptor_utils.simple_bio_seq.nt_diff(s1: str, s2: str)[source]
Returns a count of the positions in which the input sequences differ. If the sequences are of different lengths, the count only covers positions in the shorter sequence.
- Parameters
s1 (str) – the first sequence
s2 (str) – the second sequence
- Returns
The number of positions at which the sequences differ
- Return type
int
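The length rule above can be illustrated with a short sketch. It assumes a case-sensitive, character-by-character comparison. `nt_diff_sketch` is a hypothetical name, not the package function.

```python
def nt_diff_sketch(s1: str, s2: str) -> int:
    """Count positions at which two sequences differ."""
    # zip() stops at the end of the shorter sequence, so trailing
    # positions in the longer sequence are not counted.
    return sum(1 for a, b in zip(s1, s2) if a != b)


print(nt_diff_sketch("ACGT", "ACCT"))    # 1
print(nt_diff_sketch("ACGT", "ACCTGG"))  # 1
```

The second call shows that the two extra bases in the longer sequence do not add to the count.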
- receptor_utils.simple_bio_seq.read_csv(file: str, delimiter: Optional[str] = None)[source]
Read a delimited file into a list of dicts (as produced by DictReader)
- Parameters
file (str) – filename of the file
delimiter (str) – the delimiter (‘,’ by default)
- Returns
the list of dicts
- Return type
list
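A minimal sketch of reading a delimited file into a list of dicts with the standard library's csv.DictReader. The function name `read_csv_sketch`, the file name and the sample data are hypothetical, and the package's own handling of options may differ.

```python
import csv
import os
import tempfile


def read_csv_sketch(file: str, delimiter=None) -> list:
    """Read a delimited file into a list of dicts, one per data row."""
    with open(file, newline="") as handle:
        # Fall back to a comma when no delimiter is given.
        reader = csv.DictReader(handle, delimiter=delimiter or ",")
        return list(reader)


# Demonstration with a throwaway tab-delimited file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.tsv")
    with open(path, "w", newline="") as handle:
        handle.write("name\tlength\nseq1\t4\nseq2\t3\n")
    rows = read_csv_sketch(path, delimiter="\t")

print(rows)  # [{'name': 'seq1', 'length': '4'}, {'name': 'seq2', 'length': '3'}]
```

All values come back as strings, so numeric columns need converting by the caller.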
- receptor_utils.simple_bio_seq.read_fasta(infile: str)[source]
Read a FASTA file into a dict
- Parameters
infile (str) – Pathname of the file
- Returns
A dictionary indexed by FASTA ID, containing the sequences (in upper case)
- Return type
dict
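The sketch below shows the kind of dict the description implies, using a hand-written parser rather than the package's code. It assumes the FASTA ID is the first whitespace-delimited token of the header line. `read_fasta_sketch` and the example records are hypothetical.

```python
import os
import tempfile


def read_fasta_sketch(infile: str) -> dict:
    """Parse a FASTA file into {id: SEQUENCE}, with sequences in upper case."""
    seqs = {}
    name = None
    with open(infile) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Assumption: the ID is the first whitespace-delimited token.
                parts = line[1:].split()
                name = parts[0] if parts else ""
                seqs[name] = ""
            elif name is not None:
                # Sequences may span several lines; coerce to upper case.
                seqs[name] += line.upper()
    return seqs


# Demonstration with a throwaway file and made-up records.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.fasta")
    with open(path, "w") as handle:
        handle.write(">seq1 first record\nacgt\nACGT\n>seq2\nttga\n")
    records = read_fasta_sketch(path)

print(records)  # {'seq1': 'ACGTACGT', 'seq2': 'TTGA'}
```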
- receptor_utils.simple_bio_seq.read_imgt_fasta(infile: str, species: str, chains=('IGHV', 'IGHD', 'IGHJ', 'CH'), functional_only: bool = False, include_orphon: bool = False)[source]
Read V, D and J regions from one or more species from an IMGT reference file
- Parameters
infile (str) – The IMGT reference file
species (str) – Species as specified in the file, with spaces replaced by underscore
chains (list) – List of chains to read (e.g. [‘IGHV’, ‘IGHJ’])
functional_only (bool) – If True, returns only sequences marked as functional (F)
include_orphon (bool) – If True, includes orphons
- Returns
A dict containing the sequences, indexed by name
- Return type
dict
- receptor_utils.simple_bio_seq.read_single_fasta(infile)[source]
Read a single sequence from a FASTA file
- Parameters
infile (str) – Pathname of the file
- Returns
The first (or only) sequence in the file
- Return type
str
- receptor_utils.simple_bio_seq.reverse_complement(seq: str)[source]
Return the reverse complement of a nucleotide sequence
- Parameters
seq (str) – the nucleotide sequence
- Returns
the reverse complement
- Return type
str
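For illustration, a reverse complement can be computed with a translation table, as in the sketch below. It assumes only the four unambiguous bases need complementing and leaves any other character (such as N) unchanged. `reverse_complement_sketch` is a hypothetical name.

```python
# Map each base to its complement, in both cases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement_sketch(seq: str) -> str:
    """Complement each base, then reverse the result."""
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement_sketch("ATGC"))   # GCAT
print(reverse_complement_sketch("AANGT"))  # ACNTT
```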
- receptor_utils.simple_bio_seq.sample_fasta(seqs: dict, number: int)[source]
Return a random sample of sequences stored in a dict. Sequences are not resampled. Raises a ValueError if the dict does not contain enough sequences.
- Parameters
seqs (dict) – The sequences to sample
number (int) – The number of sequences to return
- Returns
The sampled sequences
- Return type
dict
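Sampling without replacement, with an error when too many sequences are requested, can be sketched with random.sample from the standard library. This is an assumption about how such a function might be built, not the package's code. `sample_fasta_sketch` and the example data are hypothetical.

```python
import random


def sample_fasta_sketch(seqs: dict, number: int) -> dict:
    """Return `number` randomly chosen entries from seqs, without replacement."""
    # random.sample raises ValueError if number exceeds the population size.
    names = random.sample(list(seqs), number)
    return {name: seqs[name] for name in names}


# Made-up sequences for illustration only.
pool = {"seq1": "ACGT", "seq2": "TTGA", "seq3": "GGCC"}
subset = sample_fasta_sketch(pool, 2)
print(len(subset))  # 2

try:
    sample_fasta_sketch(pool, 5)
except ValueError as err:
    print("not enough sequences:", err)
```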
- receptor_utils.simple_bio_seq.scored_consensus(seqs: dict, threshold: float = 0.7)[source]
Return a dumb consensus on a dict of sequences, using Bio.Align.AlignInfo.SummaryInfo.gap_consensus. All sequences should be the same length. The output is modified to provide, as well as the consensus string, the minimum score achieved at any position.
- Parameters
seqs (dict) – the sequences to analyse
threshold (float) – the threshold value that is required to add a particular atom
- Returns
(consensus sequence, min_score)
- Return type
tuple
- receptor_utils.simple_bio_seq.scored_gap_consensus(alignment, threshold=0.7, ambiguous='X', require_multiple=False)[source]
Adapted from Biopython's Bio.Align. This function is called by receptor_utils.simple_bio_seq.scored_consensus.
- receptor_utils.simple_bio_seq.toSeqRecords(seqs: dict)[source]
Convert a dict of sequences to a list of BioPython SeqRecords
- Parameters
seqs (dict) – the sequences to convert
- Returns
the BioPython SeqRecords
- Return type
list
- receptor_utils.simple_bio_seq.translate(seq: str, truncate: bool = True, ignore_partial_codon: bool = True)[source]
Translate a nucleotide sequence to amino acid
- Parameters
seq (str) – the sequence to translate
truncate (bool) – If True, truncate the sequence so that it terminates on a codon boundary. Otherwise pad with N if necessary
ignore_partial_codon (bool) – If True, if any position in a codon contains - or ., set the entire codon to ---, ensuring it gets translated as -
- Returns
Amino acid string
- Return type
str
- receptor_utils.simple_bio_seq.write_csv(file: str, rows: list, delimiter: Optional[str] = None, scan_all: bool = False)[source]
Write a list of dicts to a delimited file. The header row is determined from the keys of the first item, unless scan_all is set.
- Parameters
file (str) – filename of the delimited file to create
rows (list) – the rows to write
delimiter (str) – the delimiter (‘,’ by default)
scan_all (bool) – If True, scan all rows to find the full set of keys
- Returns
None
- Return type
None
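The effect of scan_all can be sketched with the standard library's csv.DictWriter. This is an illustration under stated assumptions rather than the package's code. `write_csv_sketch` and the sample rows are hypothetical, missing values are written as empty strings, and an empty list of rows produces an empty file.

```python
import csv
import os
import tempfile


def write_csv_sketch(file: str, rows: list, delimiter=None, scan_all: bool = False) -> None:
    """Write a list of dicts to a delimited file with a header row."""
    if not rows:
        # Assumption: with nothing to write, just create an empty file.
        open(file, "w").close()
        return
    fieldnames = list(rows[0].keys())
    if scan_all:
        # Collect keys from every row, preserving first-seen order.
        for row in rows[1:]:
            for key in row:
                if key not in fieldnames:
                    fieldnames.append(key)
    with open(file, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames,
                                delimiter=delimiter or ",", restval="")
        writer.writeheader()
        writer.writerows(rows)


# The second row has a key that the first row lacks.
data = [{"name": "seq1", "length": 4},
        {"name": "seq2", "length": 3, "note": "short"}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "out.csv")
    write_csv_sketch(path, data, scan_all=True)
    with open(path) as handle:
        lines = handle.read().splitlines()

print(lines)  # ['name,length,note', 'seq1,4,', 'seq2,3,short']
```

In this sketch, leaving scan_all as False with the same rows makes DictWriter raise a ValueError, because the key note is not in the header taken from the first row.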
- receptor_utils.simple_bio_seq.write_fasta(outfile: str, seqs: dict)[source]
Write a dict into a FASTA file. The dict should be indexed by sequence name
- Parameters
outfile (str) – Pathname of the file to be written
seqs (dict) – The sequences to write
- Returns
the number of records written
- Return type
int
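A minimal sketch of writing a name-indexed dict to FASTA and returning the record count. It assumes each sequence is written on a single line with no wrapping. `write_fasta_sketch` and the example records are hypothetical, and the package's own output formatting may differ.

```python
import os
import tempfile


def write_fasta_sketch(outfile: str, seqs: dict) -> int:
    """Write {name: sequence} to a FASTA file; return the number of records."""
    with open(outfile, "w") as handle:
        for name, seq in seqs.items():
            # One header line and one unwrapped sequence line per record.
            handle.write(f">{name}\n{seq}\n")
    return len(seqs)


# Demonstration with a throwaway file and made-up records.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "out.fasta")
    written = write_fasta_sketch(path, {"seq1": "ACGT", "seq2": "TTGA"})
    with open(path) as handle:
        content = handle.read()

print(written)  # 2
print(content)
```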