receptor_utils package

Submodules

receptor_utils.novel_allele_name module

receptor_utils.novel_allele_name.aligned_diff(novel_seq: str, ref_seq: str)[source]

receptor_utils.novel_allele_name.build_ambiguous_ref(full_ref, start)[source]

receptor_utils.novel_allele_name.closest_aligned_ref(seq: str, ref: dict)[source]

Given an input sequence, find the closest entry or entries in an IMGT or other reference set as determined by a global alignment.

Parameters

seq (str) – the input sequence
ref (dict) – the reference set

Returns

the names of the closest entry or entries: more than one name is returned if several sequences in the reference set were equally close

Return type

list

receptor_utils.novel_allele_name.find_indels(closest_ref_seq, novel_seq)[source]

receptor_utils.novel_allele_name.name_novel(novel_seq: str, ref_set: dict, v_gene: bool = True)[source]

Make a name for the novel allele, given its gapped or ungapped sequence. The name conforms to the description here. The sequence must be full-length at the 5 prime end, or gapped.

Parameters

novel_seq (str) – the full-length sequence to name, which may be either gapped or ungapped in the case of a V-gene
ref_set (dict) – dict of reference genes (gapped in the case of v-genes), in the format returned by read_fasta
v_gene (bool) – True only if we are naming a v_gene

Returns

tuple consisting of three strings: novel_name, novel_seq, notes. novel_seq is gapped in the case of a V-gene

Return type

tuple

receptor_utils.novel_allele_name.run_tests()[source]

receptor_utils.number_v module

receptor_utils.number_v.check_conserved_residues(aa)[source]

receptor_utils.number_v.distribute_cdr(cdr, length)[source]

receptor_utils.number_v.gap_align(seq, ref)[source]

receptor_utils.number_v.gap_align_aa(seq, ref)[source]

receptor_utils.number_v.gap_align_aa_from_nt(aa_seq, nt_gapped)[source]

receptor_utils.number_v.gap_nt_from_aa(nucleotide_seq, peptide_seq)[source]

receptor_utils.number_v.gap_sequence(seq, gapped_ref, ungapped_ref)[source]

receptor_utils.number_v.insert_space(seq, pos)[source]

receptor_utils.number_v.match_score(target, match_list, thresh)[source]

receptor_utils.number_v.nt_diff(s1, s2)[source]

receptor_utils.number_v.number_from_trp(seq, next_p, gapped, trp_pos)[source]

receptor_utils.number_v.number_ighv(seq)[source]

receptor_utils.number_v.pretty_gapped(seq)[source]

receptor_utils.number_v.run_tests()[source]

receptor_utils.simple_bio_seq module

Wrapper around BioSeq functions for simple applications.

Principles:
- store sequences as strings, use dicts for collections
- all sequences are coerced to upper case on input
- iterators are coerced into lists for ease of debugging

Basically, just make things simple for cases where we don’t need to do more.

receptor_utils.simple_bio_seq.chunks(l: str, n: int)[source]: Yield successive n-sized chunks from l.
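The behaviour described can be sketched as a small generator. This is an illustration only, not necessarily the library's code, and the name chunks_sketch is hypothetical.

```python
from typing import Iterator


def chunks_sketch(l: str, n: int) -> Iterator[str]:
    """Yield successive n-sized chunks from l; the final chunk may be shorter."""
    for i in range(0, len(l), n):
        yield l[i:i + n]


# Usage: split a sequence into codon-sized pieces.
codons = list(chunks_sketch("ATGGCCTA", 3))
```

Wrapping the generator in list() matches the package's stated preference for lists over iterators.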

receptor_utils.simple_bio_seq.closest_ref(seq: str, ref: dict)[source]

Given an input sequence, find the closest entry or entries in an IMGT or other reference set as determined by nt_diff.

Parameters

seq (str) – the input sequence
ref (dict) – the reference set

Returns

the names of the closest entry or entries: more than one name is returned if several sequences in the reference set were equally close

Return type

list
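A minimal sketch of this kind of lookup, assuming the distance is a simple count of mismatched positions and that ties are all reported. The name closest_ref_sketch and the empty-reference behaviour are assumptions made for illustration, not taken from the library.

```python
def closest_ref_sketch(seq: str, ref: dict) -> list:
    """Return the names of the reference sequences with the fewest mismatches to seq."""
    if not ref:
        return []  # assumption: an empty reference set yields an empty list

    # Count mismatches position by position; zip stops at the shorter sequence.
    scores = {
        name: sum(1 for a, b in zip(seq, ref_seq) if a != b)
        for name, ref_seq in ref.items()
    }
    best = min(scores.values())
    return [name for name, score in scores.items() if score == best]


# Usage: two alleles tie for closest, so both names are returned.
reference = {"A*01": "ACGT", "A*02": "ACGA", "B*01": "TTTT"}
closest = closest_ref_sketch("ACGC", reference)
```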

receptor_utils.simple_bio_seq.dumb_consensus(seqs: dict, threshold: float = 0.7)[source]

Return a dumb consensus on a dict of sequences, using Bio.Align.AlignInfo.SummaryInfo.dumb_consensus. All sequences should be the same length.

Parameters

seqs (dict) – the sequences to analyse
threshold (float) – the threshold value that is required to add a particular atom

Returns

the consensus sequence

Return type

str
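To illustrate how the threshold acts, the following is a simplified, pure-Python sketch of a threshold consensus. It is not Biopython's implementation: the use of X for unresolved positions, the exclusion of gap characters from the counts, and the name consensus_sketch are all assumptions made for this example.

```python
from collections import Counter


def consensus_sketch(seqs: dict, threshold: float = 0.7, ambiguous: str = "X") -> str:
    """Column-wise consensus over equal-length sequences."""
    consensus = []
    for column in zip(*seqs.values()):
        # Gap characters are not counted as residues (assumption).
        counts = Counter(c for c in column if c not in "-.")
        total = sum(counts.values())
        if total == 0:
            consensus.append(ambiguous)
            continue
        ranked = counts.most_common()
        top, top_count = ranked[0]
        tied = len(ranked) > 1 and ranked[1][1] == top_count
        # Accept the most common residue only if it is unique and frequent enough.
        if not tied and top_count / total >= threshold:
            consensus.append(top)
        else:
            consensus.append(ambiguous)
    return "".join(consensus)


seqs = {"a": "ACGT", "b": "ACGA", "c": "ACTA"}
strict = consensus_sketch(seqs, threshold=0.7)
relaxed = consensus_sketch(seqs, threshold=0.6)
```

In the last two columns the most common residue reaches only two thirds of the sequences, so it is accepted at a threshold of 0.6 but not at 0.7.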

receptor_utils.simple_bio_seq.nt_diff(s1: str, s2: str)[source]

Returns a count of the positions in which the input sequences differ. If the sequences are of different lengths, the count only covers positions in the shorter sequence.

Parameters

s1 (str) – the first sequence
s2 (str) – the second sequence

Returns

The number of positions at which the sequences differ

Return type

int
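The counting rule described above can be sketched in one line. The name nt_diff_sketch is hypothetical and the code is an illustration rather than the library's own.

```python
def nt_diff_sketch(s1: str, s2: str) -> int:
    """Count positions at which s1 and s2 differ, over the length of the shorter one."""
    # zip stops at the end of the shorter sequence, so trailing bases are ignored.
    return sum(1 for a, b in zip(s1, s2) if a != b)


same_length = nt_diff_sketch("ACGT", "ACCT")
different_length = nt_diff_sketch("ACGTTT", "ACGA")
```

Note that, under this rule, a sequence and any of its prefixes differ at zero positions.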

receptor_utils.simple_bio_seq.read_csv(file: str, delimiter: Optional[str] = None)[source]

Read a delimited file into a list of dicts (as produced by DictReader)

Parameters

file (str) – filename of the file
delimiter (str) – the delimiter (‘,’ by default)

Returns

the list of dicts

Return type

list
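A sketch of a reader with this interface, built on the standard library's csv.DictReader. The name read_csv_sketch and the example file are hypothetical.

```python
import csv
import os
import tempfile
from typing import Optional


def read_csv_sketch(file: str, delimiter: Optional[str] = None) -> list:
    """Read a delimited file into a list of dicts, one per data row."""
    with open(file, newline="") as fi:
        return list(csv.DictReader(fi, delimiter=delimiter or ","))


# Usage: write a small tab-delimited file, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "alleles.tsv")
    with open(path, "w", newline="") as fo:
        fo.write("name\tlength\nIGHV1-2*02\t296\n")
    rows = read_csv_sketch(path, delimiter="\t")
```

All values are returned as strings, so numeric columns need converting by the caller.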

receptor_utils.simple_bio_seq.read_fasta(infile: str)[source]

Read a FASTA file into a dict

Parameters: infile (str) – Pathname of the file
Returns: A dictionary indexed by FASTA ID, containing the sequences (in upper case)
Return type: dict
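The shape of the returned dict can be illustrated with a small pure-Python parser. This is a sketch, not the library's implementation: it assumes the FASTA ID is the first whitespace-delimited token of the header line, and read_fasta_sketch is a hypothetical name.

```python
import os
import tempfile


def read_fasta_sketch(infile: str) -> dict:
    """Parse a FASTA file into {id: sequence}, with sequences in upper case."""
    seqs = {}
    name = None
    with open(infile) as fi:
        for line in fi:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                header = line[1:].split()
                name = header[0] if header else ""
                seqs[name] = ""
            elif name is not None:
                # Sequence lines are concatenated and coerced to upper case.
                seqs[name] += line.upper()
    return seqs


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.fasta")
    with open(path, "w") as fo:
        fo.write(">seq1 a description\nacgt\nACGT\n>seq2\nttga\n")
    parsed = read_fasta_sketch(path)
```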

receptor_utils.simple_bio_seq.read_imgt_fasta(infile: str, species: str, chains=('IGHV', 'IGHD', 'IGHJ', 'CH'), functional_only: bool = False, include_orphon: bool = False)[source]

Read V, D, J regions from one or more species from an IMGT reference file

Parameters

infile (str) – The IMGT reference file
species (str) – Species as specified in the file, with spaces replaced by underscore
chains (list) – List of chains to read (e.g. [‘IGHV’, ‘IGHJ’])
functional_only (bool) – If True, returns only sequences marked as functional (F)
include_orphon (bool) – If True, includes orphons

Returns

A dict containing the sequences, indexed by name

Return type

dict

receptor_utils.simple_bio_seq.read_single_fasta(infile)[source]

Read a single sequence from a FASTA file

Parameters: infile (str) – Pathname of the file
Returns: The first (or only) sequence in the file
Return type: str

receptor_utils.simple_bio_seq.reverse_complement(seq: str)[source]

Return the reverse complement of a nucleotide sequence

Parameters: seq (str) – the nucleotide sequence
Returns: the reverse complement
Return type: str
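For illustration, a reverse complement of an unambiguous DNA string can be sketched with a translation table. The sketch handles only A, C, G, T and N (other characters pass through unchanged), and the name reverse_complement_sketch is hypothetical.

```python
# Map each base to its complement; N stays N.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement_sketch(seq: str) -> str:
    """Complement each base, then reverse the string."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


rc = reverse_complement_sketch("ATGC")
```

Applying the function twice returns the original (upper-cased) sequence.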

receptor_utils.simple_bio_seq.sample_fasta(seqs: dict, number: int)[source]

Return a random sample of sequences stored in a dict. Sequences are not resampled. Raises a ValueError if the dict does not contain enough sequences.

Parameters

seqs (dict) – The sequences to sample
number (int) – The number of sequences to return

Returns

The sampled sequences

Return type

dict
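Sampling without replacement, with an error when too many sequences are requested, is what the standard library's random.sample provides. The following sketch assumes that approach; sample_fasta_sketch is a hypothetical name.

```python
import random


def sample_fasta_sketch(seqs: dict, number: int) -> dict:
    """Return a dict holding `number` randomly chosen entries of seqs."""
    # random.sample never picks the same key twice, and raises ValueError
    # when `number` exceeds the number of keys available.
    names = random.sample(list(seqs), number)
    return {name: seqs[name] for name in names}


seqs = {"s1": "ACGT", "s2": "TTGA", "s3": "GGCC"}
subset = sample_fasta_sketch(seqs, 2)
```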

receptor_utils.simple_bio_seq.scored_consensus(seqs: dict, threshold: float = 0.7)[source]

Return a dumb consensus on a dict of sequences, using Bio.Align.AlignInfo.SummaryInfo.gap_consensus. All sequences should be the same length. As well as the consensus string, the output provides the minimum score achieved at any position.

Parameters

seqs (dict) – the sequences to analyse
threshold (float) – the threshold value that is required to add a particular atom

Returns

(consensus sequence, min_score)

Return type

tuple

receptor_utils.simple_bio_seq.scored_gap_consensus(alignment, threshold=0.7, ambiguous='X', require_multiple=False)[source]: Adapted from Biopython's Bio.Align. This function is called by receptor_utils.simple_bio_seq.scored_consensus.

receptor_utils.simple_bio_seq.toSeqRecords(seqs: dict)[source]

Convert a dict of sequences to a list of BioPython SeqRecords

Parameters: seqs (dict) – the sequences to convert
Returns: the BioPython SeqRecords
Return type: list

receptor_utils.simple_bio_seq.translate(seq: str, truncate: bool = True, ignore_partial_codon: bool = True)[source]

Translate a nucleotide sequence to amino acid

Parameters

seq (str) – the sequence to translate
truncate (bool) – If True, truncate the sequence so that it terminates on a codon boundary. Otherwise pad with N if necessary
ignore_partial_codon (bool) – If True, and any position in a codon contains - or ., set the entire codon to ---, ensuring it gets translated as -

Returns

Amino acid string

Return type

str
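The two flags can be illustrated with a self-contained sketch built on the standard genetic code. It is not the library's implementation, which wraps Biopython. Reporting any codon the table cannot resolve (for example one containing N) as X is a simplification of this sketch, and the name translate_sketch is hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
CODON_TABLE["---"] = "-"  # a fully gapped codon translates to a gap


def translate_sketch(seq: str, truncate: bool = True,
                     ignore_partial_codon: bool = True) -> str:
    seq = seq.upper()
    remainder = len(seq) % 3
    if remainder:
        # Either drop the trailing bases or pad with N up to a codon boundary.
        seq = seq[:-remainder] if truncate else seq + "N" * (3 - remainder)
    protein = []
    for i in range(0, len(seq), 3):
        codon = seq[i:i + 3]
        if ignore_partial_codon and ("-" in codon or "." in codon):
            codon = "---"  # treat a partly gapped codon as fully gapped
        # Simplification: unresolved codons are reported as X.
        protein.append(CODON_TABLE.get(codon, "X"))
    return "".join(protein)


plain = translate_sketch("ATGGCC")
truncated = translate_sketch("ATGGC")
gapped = translate_sketch("ATGG..GCC")
```

With truncate=False the sketch pads the final codon with N and reports it as X, whereas a translator that understands ambiguity codes may still resolve such a codon.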

receptor_utils.simple_bio_seq.write_csv(file: str, rows: list, delimiter: Optional[str] = None, scan_all: bool = False)[source]

Write a list of dicts to a delimited file. The header row is determined from the keys of the first item

Parameters

file (str) – filename of the delimited file to create
rows (list) – the rows to write
delimiter (str) – the delimiter (‘,’ by default)
scan_all (bool) – If True, scan all rows to find the full set of keys

Returns

None

Return type

None
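The effect of scan_all can be sketched with the standard library's csv.DictWriter. The code below is an illustration under stated assumptions: missing values are written as empty strings, an empty list writes nothing, and write_csv_sketch is a hypothetical name.

```python
import csv
import os
import tempfile
from typing import Optional


def write_csv_sketch(file: str, rows: list, delimiter: Optional[str] = None,
                     scan_all: bool = False) -> None:
    if not rows:
        return  # assumption: nothing is written for an empty list
    # By default the header comes from the first row only.
    fieldnames = list(rows[0].keys())
    if scan_all:
        # Extend the header with keys that first appear in later rows.
        for row in rows[1:]:
            for key in row:
                if key not in fieldnames:
                    fieldnames.append(key)
    with open(file, "w", newline="") as fo:
        writer = csv.DictWriter(fo, fieldnames=fieldnames,
                                delimiter=delimiter or ",", restval="")
        writer.writeheader()
        writer.writerows(rows)


rows = [{"name": "a"}, {"name": "b", "notes": "novel"}]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "out.csv")
    write_csv_sketch(path, rows, scan_all=True)
    with open(path, newline="") as fi:
        written = fi.read().splitlines()
```

In this sketch, leaving scan_all as False with the same rows would make DictWriter raise a ValueError, because the second row has a key that is absent from the header.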

receptor_utils.simple_bio_seq.write_fasta(outfile: str, seqs: dict)[source]

Write a dict into a FASTA file. The dict should be indexed by sequence name

Parameters

outfile (str) – Pathname of the file to be written
seqs (dict) – The sequences to write

Returns

the number of records written

Return type

int
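A minimal sketch of a writer with this interface, assuming one header line and one sequence line per record with no line wrapping. The name write_fasta_sketch is hypothetical.

```python
import os
import tempfile


def write_fasta_sketch(outfile: str, seqs: dict) -> int:
    """Write each name/sequence pair as a FASTA record; return the record count."""
    with open(outfile, "w") as fo:
        for name, seq in seqs.items():
            fo.write(f">{name}\n{seq}\n")
    return len(seqs)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "out.fasta")
    count = write_fasta_sketch(path, {"seq1": "ACGT", "seq2": "TTGA"})
    with open(path) as fi:
        contents = fi.read()
```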
