# PathMiner

**Repository Path**: dllzb/pathminer

## Basic Information

- **Project Name**: PathMiner
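- **Description**: PathMiner is a bioinformatics tool for transcriptome assembly, annotation, and metabolite prediction. It supports EC number annotation from protein sequences and KEGG compound prediction.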
- **Primary Language**: Python
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 1
- **Created**: 2025-12-09
- **Last Updated**: 2025-12-09

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

PathMiner
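PathMiner is a bioinformatics tool for transcriptome annotation and metabolite prediction. It supports EC number annotation from protein sequences and KEGG compound prediction.

Installation
Installing into a conda environment is recommended: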
conda env create -f environment.yml
conda activate pathminer

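1. RNA Assembly & Protein Prediction (run_assembly_pipeline.sh)
Function: starting from raw paired-end transcriptome reads (*_1.fq.gz / *_2.fq.gz), runs the following steps in batch:
Trinity → longest isoform per gene → CD-HIT-EST redundancy removal → TransDecoder protein prediction. The pipeline supports resuming from checkpoints and parallel execution, and writes its log to <OUT>/run_pipeline.log.
Input file format
Only paired-end naming is supported: SAMPLE_1.fq.gz and SAMPLE_2.fq.gz, located in the same directory.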
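Before launching a long assembly run, it can help to check that every sample has both mates. The following is a minimal sketch of such a check, not part of PathMiner itself; the function name find_read_pairs is hypothetical, and it assumes only the naming rule stated above.

```python
from pathlib import Path

R1_SUFFIX = "_1.fq.gz"
R2_SUFFIX = "_2.fq.gz"


def find_read_pairs(in_dir):
    """Pair SAMPLE_1.fq.gz with SAMPLE_2.fq.gz inside one directory.

    Returns (pairs, missing): pairs maps sample name -> (R1 path, R2 path);
    missing lists samples whose R2 file was not found.
    """
    in_dir = Path(in_dir)
    pairs = {}
    missing = []
    for r1 in sorted(in_dir.glob("*" + R1_SUFFIX)):
        sample = r1.name[: -len(R1_SUFFIX)]
        r2 = in_dir / (sample + R2_SUFFIX)
        if r2.is_file():
            pairs[sample] = (r1, r2)
        else:
            missing.append(sample)
    return pairs, missing
```

Calling find_read_pairs on the <IN> directory and inspecting the missing list shows which samples would not satisfy the naming rule.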
Usage
 Syntax
bash run_assembly_pipeline.sh <IN> <OUT> <trinity|cdhit|transdecoder|all> <threads>
 Examples (32 threads)
Run Trinity only
bash src/pathminer/run_assembly_pipeline.sh /home/hh/PathMiner/SRA /home/hh/PathMiner/input trinity 32
Run CD-HIT only
bash src/pathminer/run_assembly_pipeline.sh /home/hh/PathMiner/SRA /home/hh/PathMiner/input cdhit 32
Run TransDecoder only
bash src/pathminer/run_assembly_pipeline.sh /home/hh/PathMiner/SRA /home/hh/PathMiner/input transdecoder 32
Run the full pipeline
bash src/pathminer/run_assembly_pipeline.sh /home/hh/PathMiner/SRA /home/hh/PathMiner/input all 32

trinity: assembly only; cdhit: redundancy removal only; transdecoder: protein prediction only; all: full pipeline.

Example output
<OUT>/
├── Trinity/
│   └── SAMPLE.Trinity.fasta
├── CD-HIT/
│   └── SAMPLE.longest_isoform.cd100.fasta
└── TransDecoder/
    └── SAMPLE/
        ├── SAMPLE.faa
        ├── SAMPLE.transdecoder.cds
        └── SAMPLE.transdecoder.gff3

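Dependencies
Trinity (including util/misc/get_longest_isoform_seq_per_trinity_gene.pl and jellyfish)
GNU parallel (sample-level parallel scheduling)
CD-HIT-EST
TransDecoder
Perl (to run Trinity's utility scripts)
Note: the path to get_longest_isoform_seq_per_trinity_gene.pl used in the script is a local installation path; edit it if yours differs.
Tip: with many samples and limited memory, run trinity first and then run cdhit / transdecoder as separate steps once it has finished, which makes resource scheduling easier.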

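2. Orthogroups & Trees (run_orthogroups_pipeline.sh)
Function: from the protein *.faa files of multiple species, runs core OG extraction → MAFFT alignment → trimAl trimming → gene trees (FastTree or IQ-TREE) → concatenated supermatrix and concatenation tree → ASTRAL coalescent tree. It supports resuming (completed stages are skipped automatically).
Input file format
The input directory must contain *.faa files for at least 2 species (the TransDecoder/<sample>/<sample>.faa layout is accepted).
Usage (batch, recommended)
 Syntax
bash run_orthogroups_pipeline.sh <IN_DIR> <OUT_DIR> <CPU> [fasttree|iqtree]

 Example: default FastTree, coverage ≥ 0.90, mean length ≥ 300 aa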
bash src/pathminer/run_orthogroups_pipeline.sh /home/hh/PathMiner/input/TransDecoder /home/hh/PathMiner/output 64

 Example: IQ-TREE with stricter orthogroup selection criteria (100% coverage, mean length ≥ 500 aa)
COVER_FRAC=1 MIN_AVGLEN=500 bash src/pathminer/run_orthogroups_pipeline.sh /home/hh/PathMiner/input/TransDecoder /home/hh/PathMiner/output 64 iqtree

Output is written to
<OUT_DIR>/Orthogroups/
├── orthogroups/
│   ├── OGxxxxx.fasta
│   └── ...
├── MSA/
│   ├── OGxxxxx.aln.fasta
│   └── ...
├── Trim/
│   ├── OGxxxxx.trim.fasta
│   └── ...
├── genetrees/
│   ├── OGxxxxx.tree
│   └── ...
├── result/
│   ├── Concatenation.fasta
│   ├── Concatenation.tree
│   ├── gene_trees.tre
│   └── Coalescent.tree
└── pipeline.log

Dependencies
mafft, trimal, FastTree/FastTreeMP or iqtree2, python3, and ASTRAL4 (default path in the script: /home/hh/PathMiner/tools/ASTRAL/ASTER-Linux/bin/astral4); orthofinder is also required on the first run.

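3. Protein Annotation (run_annotate_pipeline.py)
Function: for protein FASTA files (recursive, in batch), runs DIAMOND alignment → EC number annotation → KEGG compound prediction, creates annotate/ under the output root directory, and writes results separately per species/sample.
Input file format
.faa files (searched recursively by default; use --pattern to customize)
Dependencies
diamond and pandas; the following database files must exist in <BASE_DIR>/database/:
ec_sequences.dmnd, ec_uniprot.tsv, KEGG_Compound_Full_Info.tsv
Usage (batch, recommended)
Point the input at the top-level TransDecoder directory: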
python src/pathminer/run_annotate_pipeline.py \
  -i /home/hh/PathMiner/input/TransDecoder \
  -o /home/hh/PathMiner/output \
  -t 64 --overwrite
Example output
/home/hh/PathMiner/output/annotate/
 ├── Rhododendron_delavayi/
 │   ├── diamond_results.tsv
 │   ├── ec_prediction.tsv
 │   └── predicted_compounds.tsv
 ├── Rhododendron_moulmainense/
 │   ├── diamond_results.tsv
 │   ├── ec_prediction.tsv
 │   └── predicted_compounds.tsv
 ...

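File description
diamond_results.tsv: DIAMOND alignment results
ec_prediction.tsv: EC annotation
predicted_compounds.tsv: KEGG compounds linked through the annotated ECs
Common parameters
-i input directory (e.g. input/TransDecoder/)
-o output root directory (annotate/ is created automatically)
-t number of threads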

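4. Pathway & Module Completeness (pm_pathway_matrix.py)
Function: from the annotate results (each species' ec_prediction.tsv), computes two completeness scores and plots them:
Pathway coverage: number of ECs hit / total number of ECs in the pathway (0–1)
Module completion ratio (MCR): number of steps hit / total number of steps in the module (0–1)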
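The two ratios above can be sketched in a few lines. This is an illustrative sketch rather than the code in pm_pathway_matrix.py: the function names and the toy mappings in the usage note are hypothetical, and treating a module step as hit when at least one of its alternative ECs is annotated is an assumption, since the layout of KEGG_Module_Steps.tsv is not described here.

```python
def pathway_coverage(hit_ecs, pathway_to_ecs, min_ec_per_pathway=1):
    """Coverage per pathway = |hit ECs in pathway| / |all ECs in pathway|."""
    hit = set(hit_ecs)
    coverage = {}
    for pathway, ecs in pathway_to_ecs.items():
        ecs = set(ecs)
        # Skip small pathways, mirroring the idea of --min-ec-per-pathway.
        if not ecs or len(ecs) < min_ec_per_pathway:
            continue
        coverage[pathway] = len(hit & ecs) / len(ecs)
    return coverage


def module_completion(hit_ecs, module_steps):
    """MCR per module = steps hit / total steps.

    module_steps maps module -> list of steps, each step being a set of
    alternative ECs. ASSUMPTION: a step counts as hit if any one of its
    ECs was annotated.
    """
    hit = set(hit_ecs)
    mcr = {}
    for module, steps in module_steps.items():
        if not steps:
            continue
        done = sum(1 for step in steps if hit & set(step))
        mcr[module] = done / len(steps)
    return mcr
```

With toy data, a species annotated with two of the four ECs of a pathway gets a coverage of 0.5, and a module with one of three steps hit gets an MCR of about 0.33.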
Input file format
<ANNOT_ROOT>/<species>/ec_prediction.tsv (from run_annotate_pipeline.py)
KEGG reference tables are required (by default in <BASE_DIR>/database/):
KEGG_Pathway_EC.tsv, KEGG_Module_Steps.tsv
Usage
(Computing Pathway and MCR together is recommended)
python src/pathminer/pm_pathway_matrix.py \
  --annot-root /home/hh/PathMiner/output/annotate \
  --out        /home/hh/PathMiner/output \
  --pathway-map   /home/hh/PathMiner/database/KEGG_Pathway_EC.tsv \
  --module-steps  /home/hh/PathMiner/database/KEGG_Module_Steps.tsv \
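  --species-order /home/hh/species_order.txt \
  --min-ec-per-pathway 8 \
  --topN 20 --top-cols-by-var 60
Option notes:
--species-order: sets the row order of species in the result matrices and heatmaps. If omitted, species are sorted alphabetically; if given, the heatmaps and exported tables follow the order in the file (commonly used to match a phylogenetic tree).
--min-ec-per-pathway 8: drops pathways with fewer than 8 ECs in total. KEGG pathways that contain only a handful of ECs tend to be uninformative or noisy, so only pathways with at least 8 enzymes are kept for coverage calculation and plotting, which makes the results more stable.
--topN 20: when writing the xlsx, keeps only the 20 highest-scoring pathways/modules per species.
--top-cols-by-var 60: in the heatmap, draws only the 60 pathways/modules with the highest variance across species (those that best separate the species). A value of 0 disables the filter and draws every pathway, but the figure becomes crowded when there are many columns.
Tips:
To match the order of a phylogenetic tree, generate species_order.txt from the tree's leaf names and pass it with --species-order.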
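One way to derive the order file from a tree is sketched below. It is a rough sketch under stated assumptions: the helper names are hypothetical, species_order.txt is assumed to hold one species name per line, and the regular expression only handles plain Newick labels (no quoted names, no bracketed comments). For anything more complex, a proper Newick parser such as Bio.Phylo is the safer choice.

```python
import re
from pathlib import Path

# In Newick, a leaf label directly follows "(" or ","; internal node
# labels and support values follow ")" and are therefore not matched.
_LEAF = re.compile(r"[(,]\s*([^(),:;\s]+)")


def newick_leaf_names(newick):
    """Return leaf labels in the order they appear in a simple Newick string."""
    return _LEAF.findall(newick)


def write_species_order(tree_path, out_path):
    """Write one leaf name per line (assumed format of species_order.txt)."""
    names = newick_leaf_names(Path(tree_path).read_text())
    Path(out_path).write_text("\n".join(names) + "\n")
    return names
```

The leaf names must match the species directory names under the annotate root for the ordering to take effect.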

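5. Compound Shared/Unique (pm_compound_sharedunique.py)
Function: aggregates the predicted compounds of each species and produces a presence/absence matrix and an UpSet plot, giving a quick view of genus-core compounds, species-specific compounds, and common shared combinations.
Input file format
Results from the upstream annotate step: <annot-root>/<Species>/predicted_compounds.tsv (must contain a Compound_ID column)
Usage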
python src/pathminer/pm_compound_sharedunique.py \
  --annot-root /home/hh/PathMiner/output/annotate \
  --out        /home/hh/PathMiner/output \
  --max-subsets 50
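Example output
<OUT>/downstream/2Compound_SharedUnique/
├── compound_matrix.tsv         # rows = Compound_ID; columns = Species; values = 0/1
├── core_compounds.txt          # compounds shared by all species
├── unique_compounds.tsv        # species-specific compounds (two columns: Species, Compound_ID)
├── top_intersections.tsv       # top-N species combinations and their compound counts
└── figs/
    ├── upset.pdf               # UpSet plot (vector)
    └── upset.png               # 1200 dpi PNG
Output description
compound_matrix.tsv: the full presence/absence table, usable for downstream statistics and visualization.
core_compounds.txt: "genus-core" compounds that occur in every species.
unique_compounds.tsv: "species-specific" compounds that occur in only one species.
top_intersections.tsv: the top N species combinations, ranked from high to low by the number of compounds in each combination (matching the bars at the top of the UpSet plot).
UpSet plot: the height of each top bar is the number of compounds shared by that species combination; the dot matrix below marks the species in the combination (species names are shown in italics and _ is replaced by a space).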
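How the outputs above relate to each other can be shown with plain sets. The sketch below is illustrative and not the script's own code; the function name is hypothetical, the compound IDs in the usage note are toy values, and counting each compound under the exact set of species that contain it is an assumption about how the intersections are defined.

```python
from collections import Counter


def shared_unique(species_to_compounds):
    """Derive matrix, core, unique and combination counts from per-species compound lists."""
    sets = {sp: set(ids) for sp, ids in species_to_compounds.items()}
    species = sorted(sets)
    all_compounds = sorted(set().union(*sets.values()))

    # Presence/absence matrix: Compound_ID -> list of 0/1, one entry per species.
    matrix = {cid: [int(cid in sets[sp]) for sp in species] for cid in all_compounds}

    core = [cid for cid, row in matrix.items() if sum(row) == len(species)]
    unique = [(species[row.index(1)], cid) for cid, row in matrix.items() if sum(row) == 1]

    # ASSUMPTION: each compound is counted once, under the exact
    # combination of species in which it is present.
    combos = Counter(
        tuple(sp for sp, present in zip(species, row) if present)
        for row in matrix.values()
    )
    return species, matrix, core, unique, combos
```

For three toy species with compounds {C1, C2, C3}, {C1, C3} and {C1, C4}, C1 is core, C2 and C4 are unique to the first and third species, and C3 falls into the combination of the first two. Calling combos.most_common(N) gives a ranking analogous to top_intersections.tsv.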
Common parameters
--max-subsets <N>: the UpSet plot shows only the N species combinations with the most compounds (default 50).
Dependencies
Python: pandas, matplotlib, upsetplot
pip install pandas matplotlib upsetplot -i https://pypi.tuna.tsinghua.edu.cn/simple

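6. Species Similarity Maps (pm_similarity_maps.py)
From a compound/pathway matrix, produces three visualizations that characterize similarity between species and differences between groups:
Similarity heatmap: heatmap of pairwise species similarity (Jaccard/Bray–Curtis/Cosine)
PCA scatter: projects species onto the PC1–PC2 plane; if groups are provided, points are colored by group and convex hulls are drawn
Richness violin: distribution of compound richness per group (richness = the sum of each species' column in the matrix), with box plots and jittered points
Input file format
<MATRIX>: TSV with features as rows (e.g. Compound_ID or pathway/module) and species as columns (0/1 or continuous values); the first column holds the row names
Optional <GROUPS>: two columns, Species and Group (used for PCA coloring and violin-plot grouping)
Optional <TREE>: Newick species tree (used only for the species order in the heatmap; hierarchical clustering order is used when it is not provided)
Usage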
python src/pathminer/pm_similarity_maps.py \
  --matrix /home/hh/PathMiner/output/downstream/2Compound_SharedUnique/compound_matrix.tsv \
  --out    /home/hh/PathMiner/output \
  --groups /home/hh/PathMiner/groups.txt \
  --tree   /home/hh/PathMiner/output/Orthogroups/result/Coalescent.rerooted.ultra.nwk \
  --metric jaccard \
  --top-rows-by-var 0
Notes: --groups and --tree are optional. --metric accepts jaccard, braycurtis, or cosine. --top-rows-by-var keeps only the highest-variance features (0 = all). When --tree is omitted, the species order in the heatmap is determined automatically by hierarchical clustering.
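Example output
<OUT>/downstream/3Similarity/
├── figs/
│   ├── species_similarity_heatmap.pdf|.png     # similarity heatmap (1200 dpi, Arial, species names in italics, "_" → space)
│   ├── pca_scatter_groups.pdf|.png             # PCA scatter (colored by group, with convex hulls)
│   └── richness_violin_by_group.pdf|.png       # richness violin plot by group (width adapts to avoid label overlap)
└── tables/
    ├── similarity_matrix.tsv                   # species × species similarity matrix
    └── species_richness.tsv                    # richness per species (column sums)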
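The richness table is simply the column sums of the input matrix. A minimal sketch, assuming only the matrix layout described in this section (TSV, header row with species names, first column as row names); the function name is hypothetical:

```python
import csv
import io


def species_richness(tsv_text):
    """Sum each species column of a features × species TSV matrix."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    species = header[1:]          # first column holds the row names
    totals = [0.0] * len(species)
    for row in reader:
        if not row:
            continue              # tolerate blank lines
        for i, value in enumerate(row[1:]):
            totals[i] += float(value)
    return dict(zip(species, totals))
```

For a 0/1 compound matrix, the result is the number of predicted compounds per species.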
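Parameters
--metric:
jaccard (default): suited to 0/1 features; continuous values are first thresholded to 0/1
braycurtis: suited to count/abundance features
cosine: suited to compositional vectors (similar direction means similar)
--top-rows-by-var N: keeps only the N features with the highest variance across species for similarity and PCA (N=0 means all; can improve readability)
--groups: when provided, PCA is colored by group with convex hulls, and the violin plot is summarized by group
--tree: when provided, sets the species order in the heatmap (leaf order of the tree); otherwise hierarchical clustering order is used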
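The three metrics can be written out in pure Python to make their differences concrete. This is a sketch, not the implementation in pm_similarity_maps.py: the function names are hypothetical, the Jaccard threshold (value > 0) is an assumption, Bray–Curtis is expressed as a similarity (1 minus the dissimilarity) for non-negative values, and returning 0.0 for two all-zero vectors is a convention chosen here.

```python
import math


def jaccard(a, b):
    """Shared present features / features present in either (values > 0 count as present)."""
    a_on = [x > 0 for x in a]
    b_on = [y > 0 for y in b]
    union = sum(x or y for x, y in zip(a_on, b_on))
    both = sum(x and y for x, y in zip(a_on, b_on))
    return both / union if union else 0.0


def bray_curtis(a, b):
    """1 - sum|a_i - b_i| / sum(a_i + b_i), for non-negative abundances."""
    total = sum(a) + sum(b)
    diff = sum(abs(x - y) for x, y in zip(a, b))
    return 1.0 - diff / total if total else 0.0


def cosine(a, b):
    """Cosine of the angle between the two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


METRICS = {"jaccard": jaccard, "braycurtis": bray_curtis, "cosine": cosine}


def similarity_matrix(columns, metric="jaccard"):
    """columns maps species -> feature vector; returns a nested species × species dict."""
    func = METRICS[metric]
    names = list(columns)
    return {r: {c: func(columns[r], columns[c]) for c in names} for r in names}
```

For the toy vectors [1, 1, 0, 1] and [1, 0, 0, 1], Jaccard gives 2/3, Bray–Curtis similarity gives 0.8, and cosine gives about 0.816, which shows how the choice of --metric changes the scale of the heatmap.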
Dependencies
pip install pandas numpy matplotlib scipy scikit-learn biopython \
  -i https://pypi.tuna.tsinghua.edu.cn/simple
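To match the phylogenetic order exactly, provide --tree; it only affects the row/column order of the heatmap, and the PCA and violin plots are unaffected.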


Verify the installation
python -V
fastp -v
Trinity --version
TransDecoder.LongOrfs -h
diamond --version
mafft --version
iqtree --version
FastTreeMP -help
trimal --version
which astral4
perl -v
hmmsearch -h


python -c "import pandas; print('pandas', pandas.__version__)"
python -c "import Bio; print('biopython', Bio.__version__)"
python -c "import matplotlib; print('matplotlib', matplotlib.__version__)"
python -c "import scipy; print('scipy', scipy.__version__)"
python -c "import seaborn; print('seaborn', seaborn.__version__)"
python -c "import typer, rich, tqdm, requests; \
print('typer', typer.__version__); \
print('tqdm', tqdm.__version__); \
print('requests', requests.__version__)"
python -c "import importlib.metadata; print('rich', importlib.metadata.version('rich'))"
python -c "import openpyxl; print('openpyxl', openpyxl.__version__)"