.. _airrc_sets_with_MiXCR:

Using AIRR Community Reference Sets with MiXCR
==============================================

For simplicity, we will use the MiXCR Docker/Singularity container. If you are not familiar with the container, you can read about it in the `MiXCR documentation <https://mixcr.com/mixcr/getting-started/docker/?h=container>`_. Alternatively, you can install MiXCR on your local machine, following the instructions on the `MiXCR website <https://mixcr.com>`_. In that case, skip the 'Log in to the MiXCR container' step of the walkthrough.

We will annotate a sample set of sequences provided on the `Immcantation website <http://clip.med.yale.edu/immcantation/examples/AIRR_Example.tar.gz>`_. The sequences have been preprocessed and quality-filtered from Illumina paired-end reads and are provided in a FASTA file called ``HD13M.fasta``. The process required to create such a file from sequencing reads will depend on the sequencing protocol. Consult the MiXCR documentation, or other sources, to determine a suitable approach for your sequencing data.

Prerequisites
-------------

Before you start, you will need the following on your machine:

* Docker
* Python 3.9 or above
* The receptor-utils package (see :ref:`introduction_label` for installation instructions)
* The file ``HD13M.fasta``, extracted from the tarball which you can download `here <http://clip.med.yale.edu/immcantation/examples/AIRR_Example.tar.gz>`_.

We will follow these steps:

#. Download the AIRR-C germline set for human IGH genes.
#. Log in to the MiXCR container so that we can use its installed tools.
#. Build a MiXCR database for the germline set.
#. Annotate the sample sequences using MiXCR.

1. Download the germline set for human IGH genes
------------------------------------------------

In a suitable directory to use for this test, use the :ref:`download_germline_set` utility to download the germline set for human IGH genes:

.. code-block:: none

    $ download_germline_set "Homo sapiens" IGH -f MULTI-F
    https://ogrdb.airr-community.org/api_v2/germline/species
    Homo sapiens: 9606
    9606.IGH_VDJ
    FASTA files saved to Homo_sapiens_IGH_V.fasta, Homo_sapiens_IGH_D.fasta, Homo_sapiens_IGH_J.fasta, Homo_sapiens_IGH_V_gapped.fasta

Finally, extract the file ``HD13M.fasta`` from the downloaded tarball and copy it to the same directory.

2. Log in to the MiXCR container so that we can use its installed tools
-----------------------------------------------------------------------

Log in to the container, mounting the current local directory as ``/work``.

From Linux using Docker:

.. code-block:: none

    docker run -it -v $(pwd):/work ghcr.io/milaboratory/mixcr/mixcr:latest bash

From Windows using Docker:

.. code-block:: none

    docker run -it -v %cd%:/work ghcr.io/milaboratory/mixcr/mixcr:latest bash

Once in the container, cd to ``/work`` and check that the sample file and the reference set files are present:

.. code-block:: none

    bash-4.2# cd /work
    bash-4.2# ls
    HD13M.fasta  Homo_sapiens_IGH_D.fasta  Homo_sapiens_IGH_J.fasta  Homo_sapiens_IGH_V.fasta  Homo_sapiens_IGH_V_gapped.fasta
    bash-4.2#

3. Build a MiXCR database for the germline set
----------------------------------------------

Set the ``MI_LICENSE`` environment variable to your MiXCR license key:

.. code-block:: none

    bash-4.2# MI_LICENSE="...your license key here..."
    bash-4.2# export MI_LICENSE

Use MiXCR to build the database:

.. code-block:: none

    bash-4.2# mixcr buildLibrary --debug \
        --v-genes-from-fasta Homo_sapiens_IGH_V.fasta --v-gene-feature VRegion \
        --j-genes-from-fasta Homo_sapiens_IGH_J.fasta \
        --d-genes-from-fasta Homo_sapiens_IGH_D.fasta \
        --chain IGH --taxon-id 9606 --species human \
        human-IGH.json.gz

You may see warnings during the build process that stop codons were found in some sequences. This is expected, as some pseudogenes are included in the AIRR-C set. No action is required.

After these commands have run, the database ``human-IGH.json.gz`` will be present in the directory.

4. Annotate the sample sequences using MiXCR
--------------------------------------------

Annotate the sequences in ``HD13M.fasta`` with MiXCR. As the sequences are in plain FASTA format, we will use the ``generic-pacbio`` preset.

.. code-block:: none

    bash-4.2# mixcr analyze generic-pacbio -s human \
        --library human-IGH \
        --assemble-clonotypes-by FR1+CDR1+FR2+CDR2+FR3+CDR3+FR4 \
        HD13M.fasta \
        HD13M

Once MiXCR has run, you can view ``HD13M.clones_IGH.tsv`` to see the resulting annotations.
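
As a quick sanity check on the output, the following minimal sketch uses only the Python standard library to list the column names and count the data rows in a tab-separated file such as ``HD13M.clones_IGH.tsv``. It makes no assumptions about which columns MiXCR writes. The function name ``summarize_tsv`` is illustrative and is not part of MiXCR or receptor-utils.

.. code-block:: python

    import csv


    def summarize_tsv(path):
        """Return (column_names, row_count) for a tab-separated file with a header row."""
        with open(path, newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle, delimiter="\t")
            # The first row is treated as the header; an empty file gives no columns.
            columns = next(reader, [])
            # Count the remaining rows, ignoring completely blank lines.
            row_count = sum(1 for row in reader if row)
        return columns, row_count

For example, calling ``summarize_tsv("HD13M.clones_IGH.tsv")`` from the working directory returns the list of column headers and the number of rows in the table, which you can compare against what you expect for your sample.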