# AMPRIL-genomes

**Repository Path**: lqhhhhhh/AMPRIL-genomes

## Basic Information

- **Project Name**: AMPRIL-genomes
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2021-04-15
- **Last Updated**: 2021-04-15

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

## Overview
### Please be informed that all scripts here are only for reproducing the results based on our 8 thaliana genomes. You may have to modify the scripts if you would like to apply them to your projects

These scripts are used for the project of seven Arabidopsis thaliana genomes, including 
	
	Protein-coding gene annotation
	
	Pan-genome construction
	
	Structural variation calling
	
	Synteny diversity analysis

All related raw reads, assembly, annotation results can be freely accessed in 1001 Genomes Project website
https://1001genomes.org/data/MPIPZ/MPIPZJiao2020/releases/current/

### Citation:
Jiao, W., Schneeberger, K. Chromosome-level assemblies of multiple Arabidopsis genomes reveal hotspots of rearrangements with altered evolutionary dynamics. Nat Commun 11, 989 (2020). https://doi.org/10.1038/s41467-020-14779-y


## System Requirements
The scripts have been tested on Linux (4.4.0-97-generic #120-Ubuntu)

Perl (5.22.2)

Python (2.7.15)

R (3.5.2)

## Installation
git clone https://github.com/schneebergerlab/AMPRIL-genomes.git

All scripts can be run directly.

## Usage

### workflow for protein-coding gene annotation 
	working directory stucture (e.g: Cvi genome:
	/AMPRIL/annotation/Cvi
	/AMPRIL/annotation/Cvi/reference
	/AMPRIL/annotation/Cvi/abinitio
	/AMPRIL/annotation/Cvi/protein
	/AMPRIL/annotation/Cvi/RNAseq
	/AMPRIL/annotation/Cvi/EVM_PASA
	/AMPRIL/annotation/Cvi/evaluation
	/AMPRIL/annotation/Cvi/version
	/AMPRIL/annotation/scripts
	/AMPRIL/genefamily/blastpAraport11/Cvi

#### step 1: run pipeline for protein-coding gene annotation
	python ../scripts/evm.pasa.integrate.pipeline.py -f ./annotation.config
  	This script will do 
  
  	1) protein sequence alignment using exonerate  
  	2) RNAseq reads mapping using hisat2		
  	3) ab initio prediction using AUGUSTUS, GlimmerHMM, SNAP		
  	4) merge results from 1),2) and 3)
  	5) run EVM to integrate the results to get consensus gene models
  	6) get the gene, protein, CDS sequences based on the gene model gff3 file and reference fasta file
  	The results will be included in the file evm.all.gff3
  
#### step 2: Annotate repeats and TE-related genes
	RepeatMasker -species arabidopsis -gff -dir ./ -pa 20 ../../reference/chr.all.v2.0.fasta
  	perl ../../../scripts/repeat.classfied.gff3.pl ./chr.all.v2.0.fasta.out.gff ./chr.all.v2.0.fasta.out ./chr.all.v2.0.fasta.repeats.ann.gff3 repeat.ann.stats &
  	egrep -v 'Low|Simple|RNA|other|Satellite' chr.all.v2.0.fasta.repeats.ann.gff3 |cut -f 1,4,5,9 >chr.all.v2.0.TE.bed
  	perl ../../../scripts/remove.TErelated.genes.pl ../../EVM_PASA/evm.annotation.protein.fasta ../../EVM_PASA/evm.annotation.gene.fasta ../RepeatMasker/chr.all.v2.0.TE.bed ../../EVM_PASA/evm.all.gff3 ./ 

#### step 3: get the a first version of gene models and evaluate
	get the first version
	nohup python ./scripts/gene.id.update.py -i ./ -v 2.0 >log/gene.id.update.log &
	
	awk '{if ($3!="miRNA_primary_transcript" && $3!="pseudogenic_exon" && $3!="pseudogenic_transcript" && $3!="pseudogenic_tRNA" && $3!="transposon_fragment" && $3!="mRNA" && $3!="protein" && $3 !="CDS" && $3!="exon" && $3!="five_prime_UTR" && $3!="three_prime_UTR" && $3!="lnc_RNA") print}' Araport11_GFF3_genes_transposons.201606.gff |grep -v '^#' |cut -f 1,4,5,9 |grep -vP '\.\d+\;Parent' |grep -v  'ChrC|ChrM' >Araport11_gene.TE.chr1-5.bed 
	
	cd /AMPRIL/annotation/Cvi/evaluation
	mkdir blastnCol misannotation update 

##### 1) gene seq. blastn
	cd blastnCol
	ln -s ../../version/Sha.v2.0.gene.fasta ./gene.fasta
	blastn -query ./gene.fasta  -db TAIR10_chr_all.fas -num_threads 20 -evalue 1e-5 -out gene.blastn.Col.out &

	perl annotation.evaluation.by.geneBlastnCol.pl ./gene.blastn.Col.out
	awk '{print $2"\t"$7"\t"$8"\t"$1"\t"$9"\t"$10}' ../blastnCol/gene.blastn.besthit.out |sed 's/chloroplast/ChrC/g' |sed 's/mitochondria/ChrM/g' |awk '{if (!/Chr/) print "Chr"$0;else print}' |sed 's/ChrNone/None/g' >query.gene.blastn.besthit.out

	blastn -query Araport11.prot.genomic.seq.fasta -db chr.all.v2.0.fasta -out arap11.prot-gene.blastn.out -evalue 1e-5 &
	perl ../../../scripts/assembly.eval.arap11.single.pl ./arap11.prot-gene.blastn.out ../../../../tair10/Araport11/Araport11.prot.genomic.seq.fasta ./araport11.gene.blastn.assembly.out &

##### 2) prot seq. blastp and gene family clustering
	cd /AMPRIL/genefamily/blastpAraport11/Cvi
	
	mkdir prot orthomcl; awk '{if (/>/) print $1;else print}' ../../../annotation/Cvi/version/Cvi.1.0.protein.fasta  |sed 's/>/>Cvi|/g' > prot/Sha.fasta
	cd prot/; ln -s ../../Col.fasta ./ ; cd ../ ; cat prot/*.fasta > Col-Cvi.fasta ; makeblastdb -in ./Col-Cvi.fasta -dbtype prot ;sed 's/Kyo/Sha/g' ../Kyo/orthomcl/orthomcl.config  >orthomcl/orthomcl.config; 

	###run orthomcl clustering
	blastp -query Col-Cvi.fasta -db Col-Cvi.fasta -num_threads 40 -evalue 1e-10 -outfmt 6 -out blastout 
	orthomclInstallSchema ./orthomcl/orthomcl.config install.sql.log
	grep -P "^[^#]" blastout > blastresult; orthomclBlastParser blastresult prot > orthomcl/similarSequences.txt
	perl -p -i -e 's/\t(\w+)(\|.*)orthomcl/\t$1$2$1/' orthomcl/similarSequences.txt; perl -p -i -e 's/0\t0/1\t-181/' orthomcl/similarSequences.txt; cd orthomcl 
	orthomclLoadBlast ./orthomcl.config similarSequences.txt ; orthomclPairs ./orthomcl.config orthomcl_pairs.log cleanup=all  ; orthomclPairs ./orthomcl.config orthomcl_pairs.log cleanup=no; orthomclDumpPairsFiles ./orthomcl.config ;  mcl mclInput --abc -I 1.5 -o mclOutput -te 20; orthomclMclToGroups group 1 < mclOutput > groups.txt &


##### 3) prepare files
	cd /AMPRILdenovo/annotation/Cvi/evaluation/misannotation
	ln -s ../../../../tair10/Araport11/Araport11_gene.bed Araport11.protein.bed
	ln -s ../../../../genefamily/blastpAraport11/Col.fasta Araport11.protein.fasta

	cd ../../version/ ; awk '{if ($3=="gene") print}' Cvi.protein-coding.genes.v2.0.gff |cut -f 1,4,5,7,9 >Cvi.protein-coding.genes.v2.0.bed ; 
	ln -s ../../version/Cvi.protein-coding.genes.v2.0.bed query.protein.bed
	ln -s ../../../../genefamily/blastpAraport11/Cvi/prot/Cvi.fasta query.protein.fasta

	ln -s  ../../../../genefamily/blastpAraport11/Cvi/blastresult
	ln -s  ../../../../genefamily/blastpAraport11/Cvi/orthomcl/groups.txt ./

	awk '{print $2"\t"$7"\t"$8"\t"$1"\t"$9"\t"$10}' ../../../../assembly/ShaNew/evaluation/Araport11blastn/araport11.gene.besthit.out >Araport11.gene.blastn.besthit.out
	awk '{print $2"\t"$7"\t"$8"\t"$1"\t"$9"\t"$10}' ../blastnCol/gene.blastn.besthit.out >query.gene.blastn.besthit.out

##### 4) find mis-merging, mis-spliting, missing, mis-annotated genes 
	input files: 
		Araport11 and accession protein region bed files and sequences files.
		  grep gene  ../../repeat/TErelated/annotation.genes.gff|cut -f 1,4,5,9 |sed 's/TU/model/g' >query.prot.gene.bed   
		Blastp result of accession proteins against Araport11 proteins
		OrthoMCL clustering result between accession and Araport11 proteins
		  grep AT  ../../../../genefamily/AssV2tmp/An-1/prot/Results_Aug11/Orthogroups.txt |grep evm >groups.txt
		Blastn result of Araport11 gene sequences against the accession assembly (Blastn result 1)
		  awk '{print $2"\t"$7"\t"$8"\t"$1"\t"$9"\t"$10}' Col.prot.besthit.out |grep -P 'AT\d' >Col.prot.besthit.out2
		Blastn result of accession gene sequences against Col-0 genome sequences. (Blastn result 2)
		  awk '{print $2"\t"$7"\t"$8"\t"$1"\t"$9"\t"$10}' query.prot.besthit.out >query.prot.besthit.out2
	output files:
		potential.mis-merged.gene.txt (function: findMisMer, based on the result of blastp between Col and the Accession)
		potential.mis-spliting.gene.txt (function: findMisSplit. based on the result of blastp between Col and the Accession)	
		potential.query.un-assembled.gene.txt (findMissingMisann. based on the result of Col gene blastn against the accession's assembly)
		potential.missing.gene.txt (findMissingMisann. based on the result of Col gene blastn against the accession's assembly)
		potential.mis-exon-intron.gene.txt
		potential.mis-split.gene.by.blastn.txt
		potential.mis-merge.gene.by.blastn.txt
		potential.m-vs-m.toBeChecked.by.blastn.txt
		futher.check.list
		potential.mis-annotated.gene.txt
		Araport11.ungrouped.gene.analysis.stat
		Araport11.ungrouped.gene.analysis.txt
		query.ungrouped.gene.analysis.stat
		query.ungrouped.gene.analysis.txt
		query.genes.to.be.updated.added.txt (based on the potential.xxx.txt from blastn-based analysis)
		query.genes.to.be.updated.added.srt.txt	
	
	##to find genes:
		mis-merged (Blastp result, blastn)
		mis-split (Blastp result, blastn)
		wrong exon-intron structure (blastn)
		false protein-coding genes (not annotated in Araport 11 but actually they were assembled in the Col-0 genome) (Blastn result 2)
		missing genes (not annotated in accession, but actually they were assembled)  (Blastn result 1)
		
		nohup python -u ../../../scripts/annotation.evaluate.find-mis.py -g ./groups.txt -o ./run2 -n  Col.prot.besthit.out2 -c query.prot.besthit.out2 -p blastp.result -s Col.prot.gene.bed -q query.prot.gene.bed -x Col.prot.fasta -y query.prot.fasta -a Col.gene.LoF.txt -b query.gene.LoF.txt -r ../../RNAseq/hisat2/rnaseq.4evm.gff >np.run2&

#### step 6: update    
##### 1) run scipio to align the protein sequences from Araport11 annoation
	python ../../../scripts/run.scipio.py -i ../../../data/protein/Araport11-split -o ./  -r ./reference/chr.all.v2.0.fasta >run.log &
  

##### prepare the files
	ln -s ../misannotation/Cvi.genes.to.be.updated.added.srt.txt ./
	grep chr Cvi.genes.to.be.updated.added.srt.txt >genes.to.be.updated.txt
	grep -v chr Cvi.genes.to.be.updated.added.srt.txt >>genes.to.be.updated.txt 
	ln -s ../../abinitio/abinitio.4evm.gff 
	cp ../../../Cvi/evaluation/update2/srt.gff.pl ./
	perl ./srt.gff.pl ./abinitio.4evm.gff SNAP SNAP.ann.gff
	perl srt.gff.pl ./abinitio.4evm.gff GlimmerHMM GlimmerHMM.ann.gff  
	cat ../../augustus/genome.chunk.*.gff |egrep -v '#|intron|transcription|codon' > ../../augustus/augustus.ann.gff
	ln -s ../../augustus/augustus.ann.gff
	ln -s ../misannotation/Araport11.gene.blastn.besthit.out2 ./ 
	ln -s ../../../../tair10/Araport11/Araport11_genes.201606.pep.repr.fasta
	ln -s ../misannotation/Araport11.protein.bed ./
	ln -s ../../version/Cvi.protein-coding.genes.v1.0.gff ./ 
	cat ../../scipio/run2/splitOut/protein.chunk.*.gff >../../scipio/run2/splitOut/scipio.gff
	ln -s  ../../scipio/run2/splitOut/scipio.gff ./ 
	ln -s ../../../../wga/results/Cvi/Cvi.wga.snp.indel.gene.LOF.txt ./wga.snp.indel.gene.LoF.txt

##### 2) update
	input files:
		gene.to.be.updated.txt
		xx.protein-coding.genes.v1.0.gff
		scipio.gff
		wga.snp.indel.gene.LoF.txt
		augustus.ann.gff; SNPA.ann.gff; GlimmerHMM.ann.gff
		Col gene blastn best hit out
		Col Araport11 protein sequence and bed files
		ChrCM.txt (a few of organella contigs were not removed in the assembly process)
	output files:
		genes.to.be.updated.txt2
		updated.gff
		updated.rmdup.gff
		updated.highConf.gff
		updated.highConf.prot.fasta
	Method:
		1) check the LoF information resulting from WGA-based SNPs and InDel annotation, and the Col protein sequence alignment result from Scipio, add the update information :
			ChrCM: ChrC or ChrM genes
			LowConf: low confident genes
			unchange: keep the previous annotation, 
			ChangeSci: annotate based on Scipio result (checkScipio: start codon, stop codon, splice sites, frame-shift, premature stop-codon gain; check AugGenes snapGenes glimGene, check AugGenes2[ab initio], check GeneWise)
			ChangeAug: annotate based on Augustus ( check AugGenes snapGenes glimGene, check AugGenes2[ab initio], check GeneWise)
			ChangeSciAug: annotate based on scipio and augustus (checkScipio, check AugGenes snapGenes glimGene, check AugGenes2[ab initio], check GeneWise)
			not-add: not annotated	
	
		2) prepare the other annotation result from Augustus-evidence-based method, SNAP ab initio and GlimmerHMM ab initio result
	 		cat ../../abinitio/augustus/augustus.hint.chunk.*.gff |egrep -v '#|intron|transcription|codon' > ../../augustus/augustus.ann.gff
		3) update 
 	nohup python -u ../../../scripts/update.misann.genes.py -u genes.to.be.updated.txt -g annotation.genes.gff -o ./run2 -s scipio.gff -x Col.gene.LoF.txt -y query.gene.LoF.txt -c ChrCM.txt -a augustus.ann.gff -n SNAP.4evm.gff -l glimmerhmm.4evm.gff -b ./Col.gene.blastn.besthit.bed  -f ../../reference/chr.all.v2.0.fasta -p Col.prot.fasta -i ./Col.prot.gene.bed >update.run2.log  &
	python ../../scripts/annotation.gene.ID.update.py -i update2/updated.highConf.gff -n ../version/Cvi.genes.annotation.v2.0.gff -o ../version -v v2.5 -a Cvi -g ../reference/chr.all.v2.0.fasta &

 
## Workflow for pan-genome analysis
	pangenome can be built based on the whole genome sequence alignment or protein-coding genes ortholog clustering
	
	## Pan-genome: genome sequence alignment
		do all pairwise whole genome comparisons using MUMmer
		prepare all chromosome length information in a file for each genome; e.g:
			Chr1    30401407
			Chr2    19417579
			Chr3    23034411
			Chr4    18785460
			Chr5    26733864
	python -u wga.pangenome.py -w ./pairwiseAssV2 -o ./ -g ../chrBed_v2/ &

	## Pan-genome: protein-coding genes ortholog clustering
		python pangenome.build.py -g AMPRIL.Alyrata.ortholog.groups.csv -o ./


## Workflow for Structural variations calling for each assembled genome.
	All assemblies were aligned to the reference sequence (TAIR10) using nucmer from the MUMmer4 toolbox with parameter setting “-max -l 40 -g 90 -b 100 -c 200”. The resulting alignments were further filtered for alignment length (>100) and identity (>90). Structural rearrangements and local variations were identified using SyRI (https://github.com/schneebergerlab/syri).

	For details of the file format, please check https://schneebergerlab.github.io/syri/fileformat.html

	Example of commands used:
  	  nucmer --maxmatch  -l 40 -g 90 -c 100 -b 200 -t 20 Col.fasta An-1.fasta
  	  delta-filter -m -i 90 -l 100 out.delta > out_m_i90_l100.delta
  	  show-coords -THrd out_m_i90_l100.delta > out_m_i90_l100.coords
      syri -c out.chrom.coords -d out_m_i90_l100.delta -r Col.fasta -q An-1.fasta --nc 5 --all -k

## Workflow for the analysis of synteny diversity
	## step 1:
	Before caculate synteny diversity, please run all pairwise whole genome comparison using MUMmer and run SyRi to identify the syntenic and rearranged regions for each comparison.
	Let's assume all the alignments in a folder like below:
		/xxx/pairwiseWGA
		/xxx/pairwiseWGA/An-1
		/xxx/pairwiseWGA/An-1/C24
		/xxx/pairwiseWGA/An-1/Cvi
		...
		/xxx/pairwiseWGA/An-1/Sha
		...
		...
		...
		/xxx/pairwiseWGA/Sha
		/xxx/pairwiseWGA/Sha/An-1
		/xxx/pairwiseWGA/Sha/C24
		...
	## step 2: get coordinates of syntenic regions in all genomes
  		bedtools multiinter -i ../../results/pairwiseAssV2/Col/*/*.wga.syn.block.txt  -names An-1 C24 Cvi Eri Kyo Ler Sha >Col.syn.txt
  		for k in {1..5};do show-aligns -r out_m_i90_l100.delta Chr$k chr$k >out_m_i90_l100.chr$k.aligns ;done &
  		for k in {An-1,C24,Cvi,Eri,Kyo,Ler,Sha};do cat $k/out_m_i90_l100.chr*.aligns >$k/$k.aligns ;done &
  		bedtools multiinter -i ../../results/pairwiseAssV2/Col/*/*.wga.syn.block.txt  -names An-1 C24 Cvi Eri Kyo Ler Sha >Col.syn.txt
  		perl ../../scripts/get.all.syn.coord.pl ./Col.syn.all.txt ../../results/pairwiseAssV2/Col/ ./Col.syn.all.coords.txt &
		
	## step 3: caculate synteny diversity for every postion of the genome
		perl ../../scripts/calculate.syn.diversity.pl ./Col.syn.all.coords.txt2 ../../results/pairwiseAssV2/ ../../chrBed_v2 ./syn.diversity.position.Col.txt 

	## caculate synteny diversity in a sliding window
		for k in {1..5}; do   perl ../../scripts/calculate.syn.diversity.window.pl ./splitChr/Chr$k.syn.div.pos.txt 5000 1000 splitChr/Chr$k.syn.div.win50kb.step5kb.txt & done &

	## find HOR (HDR)
	awk '{if ($5>0.5)print}' syn.div.win5kb.step1kb.txt |bedtools merge -i - -d 2002 |bedtools intersect -a - -b ../../../tair10/centromere_Giraut2011.bed -wao |awk '{if ($7==0) print $0"\tA";else print $0"\tC"}' |cut -f 1-3,8  >syn.div.win5kb.step1kb.HDR.bed

	## gene arrangment in the HOR(HDR)
 	nohup perl ../../scripts/syndiv/HDR.gene.scheme.2.pl ../00_synDiv/Col.syn.all.coords.txt2 ./HDR.clu.bed ../../01_syri/pairwiseAssV2/ ../../../genefamily/AMPRIL/ver3/AMPRIL.ortholog.groups.csv ../../../genefamily/AMPRIL/ver3/geneBed2/ ../../../genefamily/AMPRIL/ver3/Rgenes/ann/ 50000 ./cluWin50kb2 > np.log2 &
	
	## R gene arrangment in the HOR (HDR)
	nohup perl ../../../scripts/Rgenes/R.gene.cluster.wga.ortho.scheme.pl ../../../../wga/07_synDiversity/00_synDiv/Col.syn.all.coords.txt2 ./R.gene.cluster.bed ../../../../wga/01_syri/pairwiseAssV2/ ../AMPRIL.ortholog.groups.csv ../geneBed2/ ./ann/ 20000 ./tmp >tmp.log&