Using AIRR Community Reference Sets with MiXCR

For simplicity, we will use the MiXCR Docker/Singularity container. If you are not familiar with the container, you can read about it in the MiXCR documentation. Alternatively, you can install MiXCR on your local machine, using the instructions provided on the MiXCR website. In this case just skip the ‘log in to Docker’ step in the walkthrough.

We will annotate a sample set of sequences provided on the Immcantation website. The sequences have been preprocessed and quality-filtered from Illumina paired-end reads and are present in a FASTA file called HD13M.fasta. The process required to create such a file from sequencing reads, will depend on the sequencing protocol. You can consult the Mixcr documentation, or other sources, to determine a suitable approach for you sequencing data.

Prerequisites

Before you start, you will need to have the following installed on your machine:

Docker

Python 3.9 or above

The receptor-utils package (see Introduction for installation instructions)

The file HD13M.fasta, extracted from the tarball which you can download here.

The steps we shall follow are as follows:

Download the AIRR-C germline set for human IGHV genes
Log in to the MiXCR container so that we can use its installed tools.
Build a MiXCR database for the germline set.
Annotate the sample sequences using MiXCR.

1. Download the germline set for human IGHV genes

In a suitable directory to use for this test, use the download_germline_set utility to download the germline set for human IGHV genes:

$ download_germline_set "Homo sapiens" IGH -f MULTI-F
https://ogrdb.airr-community.org/api_v2/germline/species
Homo sapiens: 9606
9606.IGH_VDJ
FASTA files saved to Homo_sapiens_IGH_V.fasta, Homo_sapiens_IGH_D.fasta, Homo_sapiens_IGH_J.fasta, Homo_sapiens_IGH_V_gapped.fasta

Finally, extract the file HD13M.fasta from the downloaded tarball and copy it to the directory.

1. Log in to the MiXCR container so that we can use its installed tools

Log in to the container, mounting the current local directory as /work.

From Linux using Docker:

docker run -it -v $(pwd):/work ghcr.io/milaboratory/mixcr/mixcr:latest bash

From Windows using Docker:

docker run -it -v %cd%:/work ghcr.io/milaboratory/mixcr/mixcr:latest bash

Once in the container, cd to /work and check that the reference set files are present:

bash-4.2# cd /work
bash-4.2# ls
HD13M.fasta  Homo_sapiens_IGH_D.fasta  Homo_sapiens_IGH_J.fasta  Homo_sapiens_IGH_V.fasta  Homo_sapiens_IGH_V_gapped.fasta
bash-4.2#

3. Build MiXCR database for the germline set

Set the MiXCR license to match your key:

bash-4.2# MI_LICENSE="...your licence key here..."
bash-4.2# export MI_LICENSE

Use MiXCR to build the database:

bash-4.2# mixcr buildLibrary --debug \
    --v-genes-from-fasta Homo_sapiens_IGH_V.fasta --v-gene-feature VRegion \
    --j-genes-from-fasta Homo_sapiens_IGH_J.fasta \
    --d-genes-from-fasta Homo_sapiens_IGH_D.fasta \
    --chain IGH --taxon-id 9606 --species human \
    human-IGH.json.gz

You may see warnings during the build process that stop codons were found in some sequences. This is expected, as some pseudogenes are included in the AIRR-C set. No action is required.

After these commands have run, the database human-IGH.json.gz will be present in the directory.

4. Annotate the sample sequences using MiXCR

Annotate the sequences in HD13M.fasta with MiXCR. As the sequences are in plain FASTA format, we will use the ‘generic pacbio’ template.

bash-4.2# mixcr analyze generic-pacbio -s human \
--library human-IGH \
--assemble-clonotypes-by FR1+CDR1+FR2+CDR2+FR3+CDR3+FR4 \
HD13M.fasta \
HD13M

Once MiXCR has run, you can view HD13M.clones_IGH.tsv to see the resulting annotations.