# arcasHLA

**Repository Path**: wangshun1121/arcasHLA

## Basic Information

- **Project Name**: arcasHLA
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: GPL-3.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2021-07-09
- **Last Updated**: 2025-10-21

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

### Dependencies ###

Install `arcasHLA` through bioconda with:
```
conda install arcas-hla -c bioconda -c conda-forge
```
**Important**: Please include channels `bioconda` and `conda-forge` as above.

`arcasHLA` can also be installed through the [environment.yml](./environment.yml) file in this repo:
```
conda env create -f environment.yml
conda activate arcas-hla
```

### Test ###

**(Update 2023-09-29)**: The tests below are now implemented as a pytest [suite](./test/test_arcas_hla.py). You can run it locally by building the Docker image and running pytest inside a container. From the current directory:

```
docker build -t <image-name> -f Docker/Dockerfile .
docker run --rm -v /path/to/repo:/app <image-name> pytest
```
-----

In order to test arcasHLA partial typing, we need to roll back the reference to an earlier version. First, fetch IMGT/HLA database version 3.24.0:
```
./arcasHLA reference --version 3.24.0
```
Extract reads:
```
./arcasHLA extract test/test.bam -o test/output -t 8 -v
```
Genotyping (no partial alleles):
```
./arcasHLA genotype test/output/test.extracted.1.fq.gz test/output/test.extracted.2.fq.gz -g A,B,C,DPB1,DQB1,DQA1,DRB1 -o test/output -t 8 -v
```
Expected output in `test/output/test.genotype.json`:
```
{"A": ["A*01:01:01", "A*03:01:01"], 
 "B": ["B*39:01:01", "B*07:02:01"], 
 "C": ["C*08:01:01", "C*01:02:01"], 
 "DPB1": ["DPB1*14:01:01", "DPB1*02:01:02"], 
 "DQA1": ["DQA1*02:01:01", "DQA1*05:03"], 
 "DQB1": ["DQB1*02:02:01", "DQB1*06:09:01"], 
 "DRB1": ["DRB1*10:01:01", "DRB1*14:02:01"]}
```
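
The pytest suite mentioned above is the authoritative check. For a quick manual comparison, the following is a minimal sketch that compares a genotype result with the expected output while ignoring the order of the two alleles within each gene. The helper names `normalize` and `same_genotype` are hypothetical and not part of arcasHLA, and the example values are copied from the expected output above (first two genes only).

```python
import json


def normalize(genotype):
    """Sort the alleles of each gene so that order within a pair is ignored."""
    return {gene: sorted(alleles) for gene, alleles in genotype.items()}


def same_genotype(observed, expected):
    """Return True when both results call the same alleles for the same genes."""
    return normalize(observed) == normalize(expected)


# Expected result copied from above (genes A and B only, for brevity).
expected = json.loads(
    '{"A": ["A*01:01:01", "A*03:01:01"], "B": ["B*39:01:01", "B*07:02:01"]}'
)
# The same calls, with the B alleles listed in the opposite order.
observed = {"A": ["A*01:01:01", "A*03:01:01"], "B": ["B*07:02:01", "B*39:01:01"]}

print(same_genotype(observed, expected))
```
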
Partial typing:
```
./arcasHLA partial test/output/test.extracted.1.fq.gz test/output/test.extracted.2.fq.gz -g A,B,C,DPB1,DQB1,DQA1,DRB1 -G test/output/test.genotype.json -o test/output -t 8 -v
```
Expected output in `test/output/test.partial_genotype.json`:
```
{"A": ["A*01:01:01", "A*03:01:01"], 
 "B": ["B*07:02:01", "B*39:39:01"],
 "C": ["C*08:01:01", "C*01:02:01"], 
 "DPB1": ["DPB1*14:01:01", "DPB1*02:01:02"], 
 "DQA1": ["DQA1*02:01:01", "DQA1*05:03"],
 "DQB1": ["DQB1*06:04:01", "DQB1*02:02:01"],
 "DRB1": ["DRB1*03:02:01", "DRB1*14:02:01"]}
```
After testing, remember to update the HLA reference back to the latest version:
```
./arcasHLA reference --update
```

### Usage ###

To see the list of available tools, run `arcasHLA` with no arguments. To view the required and optional arguments of a tool, run `arcasHLA [command] -h`.

- `extract` : Extracts reads mapped to chromosome 6 and any HLA decoys or chromosome 6 alternates.
- `genotype` : Genotypes HLA alleles from extracted reads (no partial alleles).
- `partial` : Genotypes partial HLA alleles from extracted reads and output from `genotype` (optional).
- `reference` : Updates the HLA reference, selects a specific version or forces a rebuild.
- `merge` : Merges genotyping output for multiple samples into a single json file.

### Extract reads ###

arcasHLA takes sorted BAM files and extracts chromosome 6 reads and related HLA sequences. If the BAM file is not indexed, this tool will run samtools index before extracting reads. By default, `extract` outputs paired FASTQ files; use the `--single` flag for single-end samples.

    arcasHLA extract [options] /path/to/sample.bam 
    
Output: `sample.extracted.1.fq.gz`, `sample.extracted.2.fq.gz`

#### Options: ####
- `--single`          : single-end reads (default: False)
- `--unmapped`        : include unmapped reads, recommended if the aligner used marks multimapping reads as unmapped (default: False)
- `--log FILE`        : log file for run summary (default: `sample.extract.log`)
- `-o, --outdir DIR`  : output directory (default: `.`)
- `--temp DIR`        : temp directory (default: `/tmp`)
- `--keep_files`      : keep intermediate files (default: False)
- `-t, --threads INT` : number of threads (default: 1)
- `-v, --verbose`     : verbosity (default: False)

### Genotype ###

#### From FASTQs ####
To predict the most likely genotype (no partial alleles), input the FASTQs produced by `extract` or the original FASTQs with all reads (experimental - use with caution).

```
arcasHLA genotype [options] /path/to/sample.1.fq.gz /path/to/sample.2.fq.gz
```

Output: `sample.alignment.p`, `sample.em.json`, `sample.genotype.json`

#### From intermediate alignment file ####  
If you have previously run `genotype` on a sample, you can run `genotype` again directly from `sample.alignment.p` to retype without aligning with Kallisto again. This is useful if you want to try different populations, genes and other parameters.
```
arcasHLA genotype [options] /path/to/sample.alignment.p
``` 
#### Example `.genotype.json` ####

```
{"A": ["A*01:01:01", "A*29:02:01"],
 "B": ["B*08:01:01", "B*44:03:01"],
 "C": ["C*07:01:01", "C*16:01:01"],
 "DQA1": ["DQA1*02:01:01", "DQA1*05:01:01"],
 "DQB1": ["DQB1*02:01:01", "DQB1*02:02:01"],
 "DRB1": ["DRB1*03:01:01", "DRB1*07:01:01"]}

#### Options ####
- `-g, --genes GENES`       : comma separated list of HLA genes (ex. A,B,C,DQA1,DQB1,DRB1)
- `-p, --population POPULATION`  : sample population, options are asian_pacific_islander, black, caucasian, hispanic, native_american and prior (default: prior)
- `--min_count INT`   : minimum gene read count required for genotyping (default: 75)
- `--tolerance FLOAT` : convergence tolerance for transcript quantification (default: 10e-7)
- `--max_iterations INT` : maximum number of iterations for transcript quantification (default: 1000)
- `--drop_iterations INT` : number of iterations before dropping low support alleles, a lower number of iterations is recommended for single-end and low read count samples (default: paired - 10, single - 4)
- `--drop_threshold FLOAT` : proportion of maximum abundance an allele needs to not be dropped (default: 0.1)
- `--zygosity_threshold FLOAT` : threshold for ratio of minor to major allele nonshared count to determine zygosity (default: 0.15)
- `--log FILE`        : log file for run summary (default: `sample.genotype.log`)
- `-o, --outdir DIR`  : output directory (default: `.`)
- `--temp DIR`        : temp directory (default: `/tmp`)
- `--keep_files`      : keep intermediate files (default: False)
- `-t, --threads INT` : number of threads (default: 1)
- `-v, --verbose`     : verbosity (default: False)
- `--single`          : single-end FASTQs (default: paired-end)
- `-l, --avg`         : Estimated average fragment length for single-end reads (default: 200)
- `-s, --std`         : Estimated standard deviation of fragment length (default: 20)
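
To make the two threshold options more concrete, here is a minimal sketch of how `--drop_threshold` and `--zygosity_threshold` read from their descriptions above. It is not arcasHLA's actual code. The function names, the inclusive comparisons and the rule that a ratio below the threshold means a homozygous call are all assumptions made for illustration.

```python
def keep_alleles(abundances, drop_threshold=0.1):
    """Keep alleles whose abundance is at least drop_threshold times the maximum.

    Assumption: the comparison is inclusive (>=).
    """
    if not abundances:
        return {}
    cutoff = drop_threshold * max(abundances.values())
    return {allele: x for allele, x in abundances.items() if x >= cutoff}


def call_zygosity(minor_nonshared, major_nonshared, zygosity_threshold=0.15):
    """Classify a gene from the minor/major ratio of nonshared read counts.

    Assumption: a ratio below the threshold is read as a homozygous call.
    """
    if major_nonshared <= 0:
        raise ValueError("major allele needs a positive nonshared count")
    ratio = minor_nonshared / major_nonshared
    return "heterozygous" if ratio >= zygosity_threshold else "homozygous"


# Made-up abundances: the third allele falls below 10% of the maximum (0.6).
abundances = {"A*01:01:01": 0.6, "A*03:01:01": 0.35, "A*02:01:01": 0.05}
print(sorted(keep_alleles(abundances)))
print(call_zygosity(10, 100), call_zygosity(40, 100))
```

With the defaults, an allele must reach 10% of the most abundant allele to survive, and the minor allele needs at least 15% as many nonshared reads as the major allele for the gene to be treated as heterozygous in this reading.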


### Genotype - partial (optional) ###
Following genotyping, partial alleles can be predicted. This requires aligning the reads to an alternate, partial allele reference. The `sample.genotype.json` file from the previous step is required.

```
arcasHLA partial [options] -G /path/to/sample.genotype.json /path/to/sample.1.fq.gz /path/to/sample.2.fq.gz
```
   
Output: `sample.partial_alignment.p`, `sample.partial_genotype.json`

The options for partial typing are the same as genotype. Partial typing can be run from the intermediate alignment file.
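
The `extract`, `genotype` and `partial` steps can be chained in a small script. The sketch below only builds the argument lists from the flags documented above and does not run anything. The function `pipeline_commands` is hypothetical. It assumes that `arcasHLA` is on the `PATH`, that the sample name is the BAM file name without `.bam`, and that this name contains no further dots, so that the output names follow the `sample.extracted.1.fq.gz` and `sample.genotype.json` pattern shown above.

```python
from pathlib import PurePosixPath


def pipeline_commands(bam, outdir, genes=None, threads=1, partial=True):
    """Return the extract, genotype and (optionally) partial commands as lists."""
    outdir = PurePosixPath(outdir)
    sample = PurePosixPath(bam).stem  # assumed sample name: file name minus ".bam"
    fq1 = str(outdir / f"{sample}.extracted.1.fq.gz")
    fq2 = str(outdir / f"{sample}.extracted.2.fq.gz")
    common = ["-o", str(outdir), "-t", str(threads)]
    gene_args = ["-g", ",".join(genes)] if genes else []

    commands = [
        ["arcasHLA", "extract", str(bam), *common],
        ["arcasHLA", "genotype", fq1, fq2, *gene_args, *common],
    ]
    if partial:
        # partial typing needs the genotype JSON written by the previous step
        genotype_json = str(outdir / f"{sample}.genotype.json")
        commands.append(
            ["arcasHLA", "partial", fq1, fq2, *gene_args, "-G", genotype_json, *common]
        )
    return commands


for cmd in pipeline_commands("test/test.bam", "test/output", genes=["A", "B", "C"], threads=8):
    print(" ".join(cmd))
```

Each list can then be passed to `subprocess.run(cmd, check=True)` so that a failing step stops the chain.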
 
### Merge jsons ###
To make analysis easier, this command will merge all jsons produced by genotyping into a single table. All `.genotype.json` files will be merged into a single `run.genotypes.tsv` file and all `.partial_genotype.json` files will be merged into `run.partial_genotypes.tsv`. In addition, HLA locus read counts and relative abundance produced by alignment will be merged into a single tsv file.
```
arcasHLA merge [options]
```
#### Options ####
- `--run RUN` : run name
- `--i, --indir DIR` : input directory (default: `.`)     
- `--o, --outdir DIR` : output directory (default: `.`)                                                                  
- `-v, --verbose`     : toggle verbosity

### Convert HLA nomenclature ###
arcasHLA convert changes alleles in a tsv file from its input form to a specified grouped nomenclature (P-group or G-group) or a specified number of fields (i.e. 1, 2 or 3 fields in resolution). This file can be produced by arcasHLA merge or any tsv following the same structure:

| subject      	| A1         	| A2         	| B1         	| B2         	| C1         	| C2         	|
|--------------	|------------	|------------	|------------	|------------	|------------	|------------	|
| subject_name 	| A*01:01:01 	| A*01:01:01 	| B*07:02:01 	| B*07:02:01 	| C*04:01:01 	| C*04:01:01 	|
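
If you need to assemble a table with this structure yourself, the following is a minimal sketch that flattens per-subject genotype dictionaries (as found in `.genotype.json` files) into tab-separated text. It is not the implementation of `arcasHLA merge`. The function names are hypothetical, and because this README does not show how a one-allele entry should be tabulated, the sketch simply leaves the second column empty in that case.

```python
import csv
import io


def genotype_row(subject, genotype, genes):
    """Flatten one genotype dict into a row with <gene>1 and <gene>2 columns."""
    row = {"subject": subject}
    for gene in genes:
        alleles = list(genotype.get(gene, []))
        # Pad with empty strings so missing alleles leave empty cells.
        first, second = (alleles + ["", ""])[:2]
        row[f"{gene}1"] = first
        row[f"{gene}2"] = second
    return row


def genotypes_to_tsv(genotypes, genes):
    """Render {subject: genotype dict} as tab-separated text with a header."""
    fields = ["subject"] + [f"{gene}{i}" for gene in genes for i in (1, 2)]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for subject, genotype in genotypes.items():
        writer.writerow(genotype_row(subject, genotype, genes))
    return buffer.getvalue()


example = {
    "subject_name": {
        "A": ["A*01:01:01", "A*01:01:01"],
        "B": ["B*07:02:01", "B*07:02:01"],
    }
}
print(genotypes_to_tsv(example, ["A", "B"]))
```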

P-group (alleles sharing the same amino acid sequence in the antigen-binding region) and G-group (alleles sharing the same base sequence in the antigen-binding region) can only be reduced to 1-field resolution as alleles with differing 2nd fields can be in the same group. By the same reasoning, P-group cannot be converted into G-group.

```
arcasHLA convert --resolution [resolution] genotypes.tsv
```
#### Options ####
- `-r, --resolution RESOLUTION` : output resolution (1, 2, 3) or grouping (g-group, p-group)
- `-o, --outfile FILE` : output file (default: `./run.resolution.tsv`)
- `-f, --force` : force conversion for grouped alleles even if it results in loss of resolution
- `-v, --verbose`     : toggle verbosity
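
To illustrate what a field-based resolution means, here is a minimal sketch that truncates an allele name to its first 1, 2 or 3 fields. The function `reduce_resolution` is hypothetical and is not how `arcasHLA convert` is implemented. It does not cover P-group or G-group conversion, which depends on the group definitions in the IMGT/HLA database, and it drops any expression suffix attached to a truncated field.

```python
def reduce_resolution(allele, fields):
    """Truncate an allele name such as 'A*01:01:01' to its first `fields` fields."""
    if fields < 1:
        raise ValueError("fields must be at least 1")
    gene, separator, rest = allele.partition("*")
    if not separator:
        raise ValueError(f"not an HLA allele name: {allele!r}")
    return gene + "*" + ":".join(rest.split(":")[:fields])


for n in (1, 2, 3):
    print(n, reduce_resolution("A*01:01:01", n))
```

An allele that already has fewer fields than requested, such as `DQA1*05:03` at 3-field resolution, is returned unchanged.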

### Change reference ###
To update the reference to the latest IMGT/HLA version, run:

```
arcasHLA reference --update
```
If you are running multiple tools to type HLAs, it can be helpful to use the same version of IMGT/HLA across all of them. You can select a specific version with a commit hash from the [IMGT/HLA GitHub](https://github.com/ANHIG/IMGTHLA/commits/Latest) repository; the test section above also passes a release number (`3.24.0`).

```
arcasHLA reference --version [commithash]
```

If you suspect there is an issue with the reference files, rebuild the reference with the following command:
```
arcasHLA reference --rebuild
```

Note: if your reference was built with arcasHLA version <= 0.1.1 and you wish to switch to IMGT/HLA version >= 3.35.0, you may need to remove the IMGTHLA folder first, because Git Large File Storage is required to download `hla.dat` properly.

```
rm -rf dat/IMGTHLA
arcasHLA reference --update
```

#### Options ####
- `--update` : update to latest IMGT/HLA version
- `--version` : check out a specific IMGT/HLA version by commit hash or, as in the test section, by release number (e.g. `3.24.0`)
- `--rebuild` : rebuild HLA database
- `-v, --verbose`     : verbosity (default: False) 

## Build Customized References ##

#### Input: arcasHLA genotypes.json ####
Customized references can be built from arcasHLA genotype outputs.
```
./arcasHLA customize genotypes.json -o ~/ref
```
#### Input: HLA tsv ####

Customized references can be built from a tab-separated file with the following structure:

| subject | A1      | A2      | B1      | B2      | C1      | C2      |
|---------|---------|---------|---------|---------|---------|---------|
| Example | A*01:01 | A*02:01 | B*07:01 | B*52:01 | C*04:01 | C*18:01 |

```
./arcasHLA customize hla.tsv -o ~/ref
```
#### Options: ####
```
usage: arcasHLA customize [options]

optional arguments:
  -h, --help            show this help message and exit

  -G , --genotype       comma-separated list of HLA alleles (e.g. A*01:01,A*11:01,...)
                        arcasHLA output genotype.json or genotypes.json
                        or tsv with format specified in README.md
  -s , --subject        subject name, only required for list of alleles
  -g , --genes          comma separated list of HLA genes
                        default: all
                        options: A, B, C, DMA, DMB, DOA, DOB, DPA1, DPB1, DQA1,
                        DQB1, DRA, DRB1, DRB3, DRB5, E, F, G, H, J, K, L

  --transcriptome TRANSCRIPTOME
                        transcripts to include besides input HLAs
                         options: full, chr6, none
                          default: full

  --resolution RESOLUTION
                        genotype resolution, only use >2 when typing performed with assay or Sanger sequencing
                          default: 2

  --grouping GROUPING   type/number of transcripts to include per allele
                         single - one 3-field resolution transcript per allele (e.g. A*01:01:01)
                        g-group - all transcripts with identical binding regions
                          default: protein group - all transcripts with identical protein types (2 fields the same)

  -o , --outdir         out directory

  --temp                temp directory

  --keep_files          keep intermediate files

  -t , --threads
  -v, --verbose
```

## Quantification ##
Note: if the reference was built with the chromosome 6 transcriptome (`--transcriptome chr6` in the `customize` options), you should run `quant` with extracted chromosome 6 FASTQs (see `extract`).

```
./arcasHLA quant --ref /path/to/ref/sample FASTQ
```

Example:
```
./arcasHLA quant --ref ~/ref/Pt23 -t 8 -o /Volumes/quant/ /Volumes/fastq/Pt23_pre.1.fq.gz /Volumes/fastq/Pt23_pre.2.fq.gz
```

#### Options: ####
```
usage: arcasHLA quant [options] FASTQs

positional arguments:
  file               list of fastq files

optional arguments:
  -h, --help         show this help message and exit

  --sample SAMPLE    sample name
  --ref              arcasHLA quant_ref path (e.g. "/path/to/ref/sample")

  -o , --outdir      out directory

  --temp             temp directory

  --keep_files       keep intermediate files

  -l AVG, --avg AVG  Estimated average fragment length for single-end reads
                       default: 200

  -s STD, --std STD  Estimated standard deviation of fragment length for single-end reads
                       default: 20

  --single           Include flag if single-end reads. Default is paired-end.

  -t , --threads
  -v, --verbose
```
 

## Citations ##
* Orenbuch R, Filip I, Comito D, et al (2019) arcasHLA: high resolution HLA typing from RNA seq. Bioinformatics doi:[10.1093/bioinformatics/btz474](http://dx.doi.org/10.1093/bioinformatics/btz474)
* Orenbuch R, Filip I, Rabadan R (2020) HLA Typing from RNA Sequencing and Applications to Cancer. Methods Mol. Biol. doi:[10.1007/978-1-0716-0327-7_5](https://link.springer.com/protocol/10.1007%2F978-1-0716-0327-7_5)
* Filip, I., Wang, A., Kravets, O. et al. Pervasiveness of HLA allele-specific expression loss across tumor types. Genome Med 15, 8 (2023). https://doi.org/10.1186/s13073-023-01154-x