# CLAIRE

**Repository Path**: xiaoruiwang_1_0/CLAIRE

## Basic Information

- **Project Name**: CLAIRE
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-05-12
- **Last Updated**: 2025-05-14

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README


This repository contains the code to run the model described in the paper [*CLAIRE: A Contrastive Learning-based Predictor for EC number of chemical reactions*](https://doi.org/10.1186/s13321-024-00944-8), which predicts EC numbers for chemical reactions.

# 1. Environment setup

In a terminal:
```
cd CLAIRE/
conda create -n claire python==3.10
conda activate claire
pip install -r requirements.txt
```
Install `torch`: you may install either the CPU or the GPU version.

```
# CPU version
conda install pytorch==1.11.0 cpuonly -c pytorch
# GPU version
conda install pytorch==1.11.0 cudatoolkit=11.3 -c pytorch
```

Run the following to install rxnfp:
```
bash rxnfp_env.sh
```
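
After the installation steps above, a quick import check can confirm that the packages are visible to the active environment. This is a minimal sketch, not part of the repository: the helper name `missing_packages` is hypothetical, and the module names are assumptions taken from the imports used later in this README. If rxnfp lives in a separate environment, run the check once per environment.

```python
import importlib.util


def missing_packages(names):
    """Return the module names in `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed module names, based on the imports used later in this README.
    required = ["torch", "drfp", "rxnfp", "numpy", "pandas"]
    missing = missing_packages(required)
    print("missing packages:", ", ".join(missing) if missing else "none")
```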

# 2. Data
You can download the [*data*](https://zenodo.org/records/14635841) and place it under the `CLAIRE/dev/` directory. The downloaded files and their purposes are described below.

`data/embedding`: reaction embeddings from two schemes (DRFP and rxnfp), as well as the Python scripts used to obtain them;

`data/pred_rxn_ECx`: esm_emb (a dictionary mapping reaction SMILES to embeddings) and the labels of the testing and training sets; "x" here denotes the EC level (first digit, first two digits, or first three digits).

`data/model_lookup_test.pkl`: the featurized testing set (after embedding) in a matrix;

`data/model_lookup_train.pkl`: the featurized training set (after embedding) in a matrix **[NOTE: this file is needed for predictions]**;

`data/test_augmented.csv`: testing set augmented samples in reaction SMILES format and their corresponding EC labels;

`data/train_augmented.csv`: training set augmented samples in reaction SMILES format and their corresponding EC labels;

`data/predictable_EC.csv`: EC numbers that are in the scope of our model. **[NOTE: CLAIRE cannot predict EC numbers beyond this list]**.
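
The EC levels mentioned for `data/pred_rxn_ECx` correspond to keeping the first one, two, or three fields of a dot-separated EC number. The sketch below illustrates that truncation; the helper `truncate_ec` is hypothetical and is not part of the CLAIRE code base.

```python
def truncate_ec(ec_number, level):
    """Keep the first `level` dot-separated fields of an EC number."""
    parts = ec_number.strip().split(".")
    if not 1 <= level <= len(parts):
        raise ValueError(f"level {level} is not valid for EC number {ec_number!r}")
    return ".".join(parts[:level])


if __name__ == "__main__":
    # A four-field EC number reduced to each of the three levels described above.
    for level in (1, 2, 3):
        print(level, truncate_ec("1.2.3.4", level))
```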

# 3. How to use

**(1). Run DRFP embeddings**

Suppose you have three query reactions to predict (shown below), saved in a text file ("my_rxn_smiles.txt").
Note that multiple reactants or products are separated by "."; the reactant side and the product side are separated by ">>".

```txt
NC(=O)c1ccc[n+]([C@@H]2O[C@H](COP(=O)(O)OP(=O)(O)OC[C@H]3O[C@@H](n4cnc5c(N)ncnc54)[C@H](O)[C@@H]3O)[C@@H](O)[C@H]2O)c1.NCCC=O.O>>NCCC(=O)O
C=C(C)CCOP(=O)([O-])OP(=O)([O-])[O-].CC(C)=CCOP(=O)(O)OP(=O)(O)O>>CC(C)=CCCC(C)=CCCC(C)=CCCC(C)=CCCC(C)=CCCC(C)=CCCC(C)=CCCC(C)=CCCC(C)=CCOP(=O)(O)OP(=O)(O)O
N.NC(=O)C1=CN([C@@H]2O[C@H](COP(=O)(O)OP(=O)(O)OC[C@H]3O[C@@H](n4cnc5c(N)ncnc54)[C@H](OP(=O)(O)O)[C@@H]3O)[C@@H](O)[C@H]2O)C=CC1.O=C([O-])CCC(=O)C(=O)[O-].[H+]>>N[C@@H](CCC(=O)[O-])C(=O)[O-]
```
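
Before computing embeddings, it can help to check that every line follows the format described above. The following is a minimal sketch with a hypothetical helper, `split_reaction`; it only checks the string layout (one ">>", "."-separated species) and does not validate the chemistry of the SMILES.

```python
def split_reaction(rxn_smiles):
    """Split a reaction SMILES string into lists of reactant and product SMILES."""
    rxn = rxn_smiles.strip()
    if rxn.count(">>") != 1:
        raise ValueError(f"expected exactly one '>>' in {rxn!r}")
    left, right = rxn.split(">>")
    # Multiple species on either side are separated by "."
    reactants = [s for s in left.split(".") if s]
    products = [s for s in right.split(".") if s]
    if not reactants or not products:
        raise ValueError(f"missing reactants or products in {rxn!r}")
    return reactants, products


if __name__ == "__main__":
    # A shortened reaction used only for illustration.
    reactants, products = split_reaction("NCCC=O.O>>NCCC(=O)O")
    print(reactants, products)
```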

Activate the `claire` environment:
```
cd CLAIRE/
conda activate claire
```

Run the following command to compute the DRFP embeddings and save them in "my_rxn_fps.pkl":
```
drfp my_rxn_smiles.txt my_rxn_fps.pkl -d 256
```
where `-d` is the dimension of the embeddings.
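
A quick sanity check on the output file can catch a mismatch before the later steps. This sketch assumes that the pickle holds a sequence with one fingerprint per non-empty input line, each of length `-d`; that layout is an assumption about the file, and `check_fingerprints` is a hypothetical helper.

```python
import pickle


def check_fingerprints(smiles_path, pkl_path, dim=256):
    """Check that the pickle holds one `dim`-long fingerprint per input reaction."""
    with open(smiles_path) as f:
        n_rxns = sum(1 for line in f if line.strip())
    with open(pkl_path, "rb") as f:
        fps = pickle.load(f)
    if len(fps) != n_rxns:
        raise ValueError(f"{n_rxns} reactions but {len(fps)} fingerprints")
    for i, fp in enumerate(fps):
        if len(fp) != dim:
            raise ValueError(f"fingerprint {i} has length {len(fp)}, expected {dim}")
    return n_rxns
```

For the example above, the call would be `check_fingerprints("my_rxn_smiles.txt", "my_rxn_fps.pkl", dim=256)`.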


**(2). Run rxnfp embeddings**

Activate the rxnfp environment, then import the relevant packages in Python:
```python
from dev.prediction.inference_EC import inference
import pickle
import numpy as np
import pandas as pd
from rxnfp.transformer_fingerprints import (
    RXNBERTFingerprintGenerator, get_default_model_and_tokenizer, generate_fingerprints
)
```

Compute the rxnfp embeddings:
```python
model, tokenizer = get_default_model_and_tokenizer()
rxnfp_generator = RXNBERTFingerprintGenerator(model, tokenizer)
example_rxns = ["NC(=O)c1ccc[n+]([C@@H]2O[C@H](COP(=O)(O)OP(=O)(O)OC[C@H]3O[C@@H](n4cnc5c(N)ncnc54)[C@H](O)[C@@H]3O)[C@@H](O)[C@H]2O)c1.NCCC=O.O>>NCCC(=O)O", "C=C(C)CCOP(=O)([O-])OP(=O)([O-])[O-].CC(C)=CCOP(=O)(O)OP(=O)(O)O>>CC(C)=CCCC(C)=CCCC(C)=CCCC(C)=CCCC(C)=CCCC(C)=CCCC(C)=CCCC(C)=CCCC(C)=CCOP(=O)(O)OP(=O)(O)O", "N.NC(=O)C1=CN([C@@H]2O[C@H](COP(=O)(O)OP(=O)(O)OC[C@H]3O[C@@H](n4cnc5c(N)ncnc54)[C@H](OP(=O)(O)O)[C@@H]3O)[C@@H](O)[C@H]2O)C=CC1.O=C([O-])CCC(=O)C(=O)[O-].[H+]>>N[C@@H](CCC(=O)[O-])C(=O)[O-]"]
rxnfp = rxnfp_generator.convert_batch(example_rxns)
# Save the embeddings; the with-block closes the file once writing is done
with open('rxnfp_emb.pkl', 'wb') as f:
    pickle.dump(rxnfp, f)
```

**(3). Concatenate the rxnfp and drfp embeddings**

```python
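# Load both sets of embeddings; the with-blocks close the files after reading
with open('my_rxn_fps.pkl', 'rb') as f:
    drfp = pickle.load(f)
with open('rxnfp_emb.pkl', 'rb') as f:
    rxnfp = pickle.load(f)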
test_data = []

for ind, item in enumerate(rxnfp):
    rxn_emb = np.concatenate((np.reshape(item, (1,256)), np.reshape(drfp[ind], (1,256))), axis=1)
    test_data.append(rxn_emb)

test_data = np.concatenate(test_data,axis=0)
```
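
The loop above can also be written as a single stacking operation with explicit shape checks. This is a sketch of an equivalent formulation rather than the repository's own code; `concat_embeddings` is a hypothetical helper, and it assumes both inputs can be converted to arrays of shape `(n_reactions, 256)`.

```python
import numpy as np


def concat_embeddings(rxnfp, drfp, dim=256):
    """Stack rxnfp and DRFP embeddings side by side into shape (n, 2 * dim)."""
    rxnfp = np.asarray(rxnfp)
    drfp = np.asarray(drfp)
    for name, arr in (("rxnfp", rxnfp), ("drfp", drfp)):
        if arr.ndim != 2 or arr.shape[1] != dim:
            raise ValueError(f"{name} has shape {arr.shape}, expected (n, {dim})")
    if rxnfp.shape[0] != drfp.shape[0]:
        raise ValueError("rxnfp and drfp contain different numbers of reactions")
    # rxnfp columns come first, then DRFP, matching the order used in the loop
    return np.hstack([rxnfp, drfp])
```

With the variables loaded above, `test_data = concat_embeddings(rxnfp, drfp)` would give the same column order as the loop.
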
**(4). Make predictions on the concatenated embeddings**

Activate the claire environment, then run the following in Python:
```python
train_data = pickle.load(open('data/model_lookup_train.pkl', 'rb'))
# For 1-level or 2-level EC predictions, change the path to pred_rxn_EC1/labels_trained_ec1.pkl or pred_rxn_EC12/labels_trained_ec2.pkl, respectively.
train_labels = pickle.load(open('data/pred_rxn_EC123/labels_train_ec3.pkl', 'rb'))
# input your test_labels
test_labels = None
test_tags = ['rxn_' + str(i) for i in range(len(test_data))]

# EC calling results using maximum separation
pretrained_model = '../results/model/pred_rxn_EC123/layer5_node1280_triplet2000_final.pth'
inference(train_data, test_data, train_labels, test_tags, test_labels, pretrained_model, evaluation=True, topk=3, gmm='../gmm/gmm_ensumble.pkl')
```
The prediction results are saved in `dev/test_prediction.csv`.

This project uses some code (the GMM functions) from the [*CLEAN*](https://github.com/tttianhao/CLEAN/) software developed by the Department of Chemical and Biomolecular Engineering at the University of Illinois Urbana-Champaign.