# hifiasm

## Basic Information

- **Project Name**: hifiasm
- **Repository Path**: lqhhhhhh/hifiasm
- **License**: MIT
- **Default Branch**: master
- **Created**: 2020-09-26
- **Last Updated**: 2020-12-26

## README

## Getting Started

```sh
# Install hifiasm (requiring g++ and zlib)
git clone https://github.com/chhylp123/hifiasm
cd hifiasm && make

# Run on test data (use -f0 for small datasets)
wget https://github.com/chhylp123/hifiasm/releases/download/v0.7/chr11-2M.fa.gz
./hifiasm -o test -t4 -f0 chr11-2M.fa.gz 2> test.log
awk '/^S/{print ">"$2;print $3}' test.p_ctg.gfa > test.p_ctg.fa  # get primary contigs in FASTA

# Assemble inbred/homozygous genomes (-l0 disables duplication purging)
hifiasm -o CHM13.asm -t32 -l0 CHM13-HiFi.fa.gz 2> CHM13.asm.log
# Assemble heterozygous genomes with built-in duplication purging
hifiasm -o HG002.asm -t32 HG002-file1.fq.gz HG002-file2.fq.gz

# Trio binning assembly (requiring https://github.com/lh3/yak)
yak count -b37 -t16 -o pat.yak <(cat pat_1.fq.gz pat_2.fq.gz) <(cat pat_1.fq.gz pat_2.fq.gz)
yak count -b37 -t16 -o mat.yak <(cat mat_1.fq.gz mat_2.fq.gz) <(cat mat_1.fq.gz mat_2.fq.gz)
hifiasm -o HG002.asm -t32 -1 pat.yak -2 mat.yak HG002-HiFi.fa.gz
```

## Introduction

Hifiasm is a fast haplotype-resolved de novo assembler for PacBio HiFi reads.
It can assemble a human genome in several hours and works with the California
redwood genome, one of the most complex genomes sequenced so far. Hifiasm can
produce primary/alternate assemblies of quality competitive with the best
assemblers. It also introduces a new graph binning algorithm and achieves the
best haplotype-resolved assembly given trio data.

## Why Hifiasm?

* Hifiasm delivers high-quality assemblies. It tends to generate longer contigs
  and resolve more segmental duplications than other assemblers.

* Given sequence reads from the parents, hifiasm can produce overall the best
  haplotype-resolved assembly so far. It is the assembler chosen by the
  [Human Pangenome Project][hpp] for the first batch of samples.

* Hifiasm can purge duplications between haplotigs without relying on
  third-party tools such as purge\_dups. Hifiasm does not need polishing tools
  like pilon or racon, either. This simplifies the assembly pipeline and saves
  running time.

* Hifiasm is fast. It can assemble a human genome in half a day and assemble a
  ~30Gb redwood genome in three days. No genome is too large for hifiasm.

* Hifiasm is trivial to install and easy to use. It does not require Python, R
  or C++11 compilers and can be compiled into a single executable. The default
  setting works well with a variety of genomes.

[hpp]: https://humanpangenome.org

## Usage

A typical hifiasm command line looks like:

```sh
hifiasm -o NA12878.asm -t 32 NA12878.fq.gz
```

where `NA12878.fq.gz` provides the input reads, `-t` sets the number of CPUs in
use and `-o` specifies the prefix of output files. For this example, the
primary contigs are written to `NA12878.asm.p_ctg.gfa` and alternate contigs to
`NA12878.asm.a_ctg.gfa`. On the first run, hifiasm saves corrected reads and
overlaps to disk as `NA12878.asm.*.bin`. It reuses the saved results on later
runs to avoid the time-consuming all-vs-all overlap calculation. You may
specify `-i` to ignore precomputed overlaps and redo overlapping from raw
reads.

Hifiasm purges haplotig duplications by default. For inbred or homozygous
genomes, you may disable purging with option `-l0`. Old HiFi reads may contain
short adapter sequences at the ends of reads. You can specify `-z20` to trim
both ends of reads by 20bp. For small genomes, use `-f0` to disable the initial
Bloom filter, which takes 16GB of memory at the beginning.
For genomes much larger than human, applying `-f38` or even `-f39` is preferred
to save memory on k-mer counting.

When parental short reads are available, hifiasm can generate a pair of
haplotype-resolved assemblies with trio binning. To perform such an assembly,
you need to count k-mers with [yak][yak] first and then run the assembly:

```sh
yak count -k31 -b37 -t16 -o pat.yak paternal.fq.gz
yak count -k31 -b37 -t16 -o mat.yak maternal.fq.gz
hifiasm -o NA12878.asm -t 32 -1 pat.yak -2 mat.yak NA12878.fq.gz
```

Here `NA12878.asm.hap1.p_ctg.gfa` and `NA12878.asm.hap2.p_ctg.gfa` give the two
haplotype assemblies. In the binning mode, hifiasm does not purge haplotig
duplications by default. Because hifiasm reuses saved overlaps, you can
generate both primary/alternate assemblies and trio binning assemblies with

```sh
hifiasm -o NA12878.asm -t 32 NA12878.fq.gz 2> NA12878.asm.pri.log
hifiasm -o NA12878.asm -t 32 -1 pat.yak -2 mat.yak /dev/null 2> NA12878.asm.trio.log
```

The second command line will run much faster than the first.

You can also dump error-corrected reads in FASTA and/or overlaps in
[PAF][paf] with

```sh
hifiasm -o NA12878.asm -t 32 --write-paf --write-ec /dev/null
```

## Output files

For non-trio assembly, hifiasm generates the following files:

1. Haplotype-resolved raw [unitig][unitig] graph in [GFA][gfa] format
   (*prefix*.r\_utg.gfa). This graph keeps all haplotype information, including
   somatic mutations and recurrent sequencing errors.
2. Haplotype-resolved processed unitig graph without small bubbles
   (*prefix*.p\_utg.gfa). Small bubbles might be caused by somatic mutations or
   noise in the data, which do not represent real haplotype information.
3. Primary assembly [contig][unitig] graph (*prefix*.p\_ctg.gfa). This graph
   collapses different haplotypes.
4. Alternate assembly contig graph (*prefix*.a\_ctg.gfa). This graph consists
   of all contigs that are discarded from the primary contig graph.

For trio assembly, hifiasm generates the following files:

1. Haplotype-resolved raw [unitig][unitig] graph in [GFA][gfa] format
   (*prefix*.r\_utg.gfa). This graph keeps all haplotype information.
2. Phased paternal/haplotype1 contig graph (*prefix*.hap1.p\_ctg.gfa). This
   graph keeps the phased paternal/haplotype1 assembly.
3. Phased maternal/haplotype2 contig graph (*prefix*.hap2.p\_ctg.gfa). This
   graph keeps the phased maternal/haplotype2 assembly.

Hifiasm writes error-corrected reads to the *prefix*.ec.bin binary file and
writes overlaps to *prefix*.ovlp.source.bin and *prefix*.ovlp.reverse.bin.

## Results

The following table shows the statistics of several hifiasm primary assemblies:

|Dataset|Size|Cov.|Asm options|CPU time|Wall time|RAM|N50|
|:---------------|-----:|-----:|:---------------------|-------:|--------:|----:|----------------:|
|[Mouse (C57/BL6J)][mouse-data]|2.6Gb |×25|-t48 -l0 |172.9h |4.8h |76G |21.1Mb|
|[Maize (B73)][maize-data]     |2.2Gb |×22|-t48 -l0 |203.2h |5.1h |68G |36.7Mb|
|[Strawberry][strawberry-data] |0.8Gb |×36|-t48 -D10|152.7h |3.7h |91G |17.8Mb|
|[Frog][frog-data]             |9.5Gb |×29|-t48     |2834.3h|69.0h|463G|9.3Mb|
|[Redwood][redwood-data]       |35.6Gb|×28|-t80     |3890.3h|65.5h|699G|5.4Mb|
|[Human (CHM13)][CHM13-data]   |3.1Gb |×32|-t48 -l0 |310.7h |8.2h |114G|88.9Mb|
|[Human (HG00733)][HG00733-data]|3.1Gb|×33|-t48     |269.1h |6.9h |135G|69.9Mb|
|[Human (HG002)][NA24385-data] |3.1Gb |×36|-t48     |305.4h |7.7h |137G|98.7Mb|

[mouse-data]:      https://www.ncbi.nlm.nih.gov/sra/?term=SRR11606870
[maize-data]:      https://www.ncbi.nlm.nih.gov/sra/?term=SRR11606869
[strawberry-data]: https://www.ncbi.nlm.nih.gov/sra/?term=SRR11606867
[frog-data]:       https://www.ncbi.nlm.nih.gov/sra?term=(SRR11606868)%20OR%20SRR12048570
[redwood-data]:    https://www.ncbi.nlm.nih.gov/sra/?term=SRP251156
[CHM13-data]:      https://www.ncbi.nlm.nih.gov/sra?term=(((SRR11292120)%20OR%20SRR11292121)%20OR%20SRR11292122)%20OR%20SRR11292123

Hifiasm can assemble a 3.1Gb human genome in several hours or a ~30Gb hexaploid
redwood genome in a few days on a single machine.
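
The N50 column can be recomputed from any of the GFA outputs. The following is
a minimal sketch, not part of hifiasm: `gfa_n50` is a hypothetical helper name,
and it assumes that each `S` line stores the contig sequence in its third
column, as the `awk` one-liner in Getting Started implies. It uses the usual
definition of N50 (the largest length L such that contigs of length at least L
cover half of the total assembly), which may differ in detail from how the
numbers in the table were obtained.

```sh
# Hypothetical helper (not shipped with hifiasm): print the contig N50 of a GFA file.
# Assumption: S-lines look like "S <name> <sequence> ...", sequence in column 3.
gfa_n50() {
  awk '/^S/ { print length($3) }' "$1" |   # one contig length per line
    sort -rn |                             # longest contig first
    awk '{ len[NR] = $1; total += $1 }
         END {
           # Walk down the sorted lengths until half of the total is covered.
           for (i = 1; i <= NR; i++) {
             acc += len[i]
             if (acc * 2 >= total) { print len[i]; exit }
           }
         }'
}
```

For example, `gfa_n50 test.p_ctg.gfa` would print the primary-contig N50, in
bases, for the test run shown in Getting Started.
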

For trio binning assembly:

|Dataset|Cov.|CPU time|Wall time|RAM|N50|
|:---------------|-----:|-------:|--------:|----:|----------------:|
|[HG00733][HG00733-data], [\[father\]][HG00731-data], [\[mother\]][HG00732-data]|×33|269.1h|6.9h|135G|35.1Mb (paternal), 34.9Mb (maternal)|
|[HG002][NA24385-data], [\[father\]][NA24149-data], [\[mother\]][NA24143-data]|×36|305.4h|7.7h|137G|41.0Mb (paternal), 40.8Mb (maternal)|
|[NA12878][NA12878-data], [\[father\]][NA12891-data], [\[mother\]][NA12892-data]|×30|180.8h|4.9h|123G|27.7Mb (paternal), 27.0Mb (maternal)|

[HG00733-data]: https://www.ebi.ac.uk/ena/data/view/ERX3831682
[HG00731-data]: https://www.ebi.ac.uk/ena/data/view/ERR3241754
[HG00732-data]: https://www.ebi.ac.uk/ena/data/view/ERR3241755
[NA24385-data]: https://www.ncbi.nlm.nih.gov/sra?term=(((SRR10382244)%20OR%20SRR10382245)%20OR%20SRR10382248)%20OR%20SRR10382249
[NA24149-data]: https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG003_NA24149_father/NIST_HiSeq_HG003_Homogeneity-12389378/HG003Run01-13262252/
[NA24143-data]: https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG004_NA24143_mother/NIST_HiSeq_HG004_Homogeneity-14572558/HG004Run01-15133132/
[NA12878-data]: https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/PacBio_SequelII_CCS_11kb/
[NA12891-data]: https://www.ebi.ac.uk/ena/data/view/ERR194160
[NA12892-data]: https://www.ebi.ac.uk/ena/data/view/ERR194161

Except for NA12878, the assemblies above were produced by hifiasm v0.12 and can
be downloaded at

```txt
ftp://ftp.dfci.harvard.edu/pub/hli/hifiasm/submission/hifiasm-0.12/
```

NA12878 was assembled with an older version of hifiasm and is available at

```txt
ftp://ftp.dfci.harvard.edu/pub/hli/hifiasm/NA12878-r253/
```

[unitig]: http://wgs-assembler.sourceforge.net/wiki/index.php/Celera_Assembler_Terminology
[gfa]: https://github.com/pmelsted/GFA-spec/blob/master/GFA-spec.md
[paf]: https://github.com/lh3/miniasm/blob/master/PAF.md
[yak]: https://github.com/lh3/yak

## Getting Help

For a detailed description of options, please see `man ./hifiasm.1`. The `-h`
option of hifiasm also provides a brief description of options. If you have
further questions, please raise an issue at the
[issue page](https://github.com/chhylp123/hifiasm/issues).

## Limitations

1. Purging haplotig duplications may introduce misassemblies.