# CAFA3 **Repository Path**: ZJX1230/CAFA3 ## Basic Information - **Project Name**: CAFA3 - **Description**: University of Turku CAFA3 project - **Primary Language**: Unknown - **License**: LGPL-3.0 - **Default Branch**: master - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2020-01-13 - **Last Updated**: 2020-12-19 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README # CAFA3 University of Turku CAFA3 project Files are in the new machine in address: /home/sukaew/CAFA3 CNN experiment can be run with python train.py You'll need to copy the data folder from /home/kahaka/CAFA3/ Running preprocessing and sequence analysis ------------------------------------------- All preprocessing steps and sequence analyses can be run within the directory 'sequence_features' using the following command line. `python3 target_process.py -o [out_folder] -s [ori_seq]` The program needs two inputs, the out_folder and the ori_seq. The out_folder should be absolute path ended with '/' where the input ori_seq file/directory and the output features directory reside. The input ori_seq can be one of these four formats (folder of non-compressed fasta files, tar.gz, gz and zip). The sequence analysese include Blast Protein, DeltaBlast, Interproscan5, NetAcet, predGPI, nucPred and Taxonomy hierarchy. All analysis results are in folder called `feature`. Running the Feature Generation, Classification and Analysis ----------------------------------------------------------- All experiments can be run using the program `run.py`. The experimental code uses a three-step system. One or more of these actions can be performed using the command line option `--action` or `--a`. By default, all three actions (`build`, `classify` and `statistics`) are performed. The run.py program can be called like this: `python run.py -e [TASK] -o [OUTPUT] --targets external` The `[TASK]` value can be one of `cafa3`, `cafa3hpo` or `cafapi`. Depending on task, different input files are used. The `--targets` option defines how to handle CAFA targets. Making predictions with the neural model ---------------------------------------- cd neural Download and extract data (data.tar.gz) and model files (features_only.tar.gz) from https://github.com/TurkuNLP/CAFA3/releases/tag/v0.0 python3 predict_new.py ./features_only/ ./data/devel_sequences.fasta.gz ./data/examples.json.gz ./devel_predictions.tsv.gz This will use the trained model from ./features_only/ directory and make predictions for the target sequences. The input fasta file should not contain linebreaks within the sequences. examples.json.gz contains the pre-generated features. The last parameter is the output path. Cross-validation ---------------- By default, the scikit-learn classification will use the train/devel/test split for the learning data. To use n-fold cross-validation instead, use the `--fold` option of `run.py`. To do 10-fold cross-validation, the program can be run 10 times using a script like this: `for FOLD in 0 1 2 3 4 5 6 7 8 9; do python run.py -o /tmp/CAFA10fold/fold$FOLD --fold $FOLD; done` Ensemble -------- The program `ensemble.py` can be used to combine predictions from different systems and the BLAST fallback baseline. To run the ensemble, use a command like: `python ensemble.py -a [PRED1_DIR] -b [PRED2_DIR] -o [OUTPUT] --baseline 4 --simple --terms 1000000 --write --cafa --clear`