download_germline_set
This utility will download reference sequences from the Open Germline Receptor Database (OGRDB).
Download germline sets from the Open Germline Receptor Database (OGRDB)
usage: download_germline_set [-h] [-n NAME] [-v VERSION] [-f {AIRRC-JSON,SINGLE-FG,SINGLE-FU,MULTI-F,MULTI-IGBLAST}] [-u URL] [-p PREFIX] species locus
Positional Arguments
- species
Species (e.g. “Homo sapiens”)
- locus
Locus (IGH, IGK, IGL, TRA, TRB, TRD, TRD)
Named Arguments
- -n, --name
germline set name (the utility will attempt to determine the name, if none is specified)
- -v, --version
Specific version to download, otherwise the latest version will be downloaded
Default: “latest”
- -f, --format
Possible choices: AIRRC-JSON, SINGLE-FG, SINGLE-FU, MULTI-F, MULTI-IGBLAST
Format to download
Default: “AIRRC-JSON”
- -u, --url
URL to use
Default: “https://ogrdb.airr-community.org/api_v2”
- -p, --prefix
Prefix for filenames. Default prefix is species_locus (with _ substituted for space). If PREFIX is NONE, no prefix will be used for multi files, and the default prefix will be used for single files.
Format Options
JSON format: AIRR-C JSON format (default)
Single FASTA file: all V(D)J sequences in a single FASTA file
Multiple FASTA files: V, D, J, and gapped V sequences in separate FASTA files
IgBLAST format: Multiple fasta files, plus IgBLAST germline configuration files
By default, germline sets are downloaded in AIRR-C JSON format. The advantage of this format is that it contains full information on the germline set including delineation of the V-sequence CDRs and delieation of the J-sequences. This means that it can, in principle, be loaded into an annotation tool without needing any additional information to be provided to the tool. The json format is also recommended for use with annotate_j and make_igblast_ndm utilities, which create germline configuration files for IgBLAST. The AIRR-C JSON format can also be specified with the argument -f AIRRC-JSON.
Another option is to download all V(D)J sequences into a single FASTA file. This can be specified by the -f SINGLE-FG argument, which will download gapped V sequences, or -f SINGLE-FU, which will download ungapped V sequences. The filename, by default, is, respectively, <species>_<locus>_gapped.fasta, or <species>_<locus>.fasta. The -p argument can be used to replace the <species>_<locus>_ prefix with a custom prefix.
The sequences can also be downloaded into four FASTA files. These are named, by default, <species>_<locus>_V.fasta, <species>_<locus>_D.fasta, <species>_<locus>_J.fasta, <species>_<locus>_V_gapped.fasta. This is specified by the -f MULTI_F argument. The -p argument can be used to replace the <species>_<locus>_ prefix with a custom prefix.
Finally, the -f MULTI-IGBLAST option will download the four multi files as above, and also create .ndm and .aux files for use with IgBLAST. For further details on use with IgBLAST please see Using AIRR Community Reference Sets with IgBLAST.
Examples
Download the latest version of the human IGK germline set in AIRR-C JSON format:
download_germline_set -"Homo sapiens" IGK
Download the latest version of the human IGH germline set in single FASTA format, with gapped V sequences:
download_germline_set "Homo sapiens" IGH -f SINGLE-FG
Download the latest version of the mouse C57BL/6 IGH germline set in multiple FASTA format:
download_germline_set "Mus musculus" IGH -n "C57BL/6 IGH" -f MULTI-F
Note that the utility will try to provide helpful information in the event of a command error:
>download_germline_set "Mus musculus" IGH -n "C57BL/6" -f MULTI-F
https://ogrdb.airr-community.org/api_v2/germline/species
Mus musculus: 10090
Error: set C57BL/6 not found for species 10090 locus IGH. Available sets:
BALB/c IGHV, C57BL/6 IGH, C57BL/6 IGHV, CAST/EiJ IGH, LEWES/EiJ IGH, MSM/MsJ IGH, NOD/ShiLtJ IGH, PWD/PhJ IGH, BALB/c IGH
Download human IGH files for IgBLAST, using the custom filename prefix AIRRC_IGH:
download_germline_set "Homo sapiens" IGH -f MULTI-IGBLAST -p AIRRC_IGH