# temp **Repository Path**: wangx4616/temp ## Basic Information - **Project Name**: temp - **Description**: No description available - **Primary Language**: Unknown - **License**: Not specified - **Default Branch**: master - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2024-12-22 - **Last Updated**: 2024-12-22 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README # Unimol Model for Sequence Classification This repository implements the `Unimol` architecture for molecular sequence classification, based on the `transformers` library. It provides a flexible and extendable structure for working with the Uni-Mol framework, allowing for easy modification or extension. The implementation is a **transformers migration** of `dptech/unimol_tools`, specifically designed to work with organic molecules. It includes the corresponding tokenizer for organic molecules but does **not** support predictions for other types of materials, such as MOFs (Metal-Organic Frameworks) or OLEDs (Organic Light Emitting Diodes). The codebase follows the structure of `transformers`, making it simple to adapt for different use cases and extend for additional functionality in the future. ## Scripts Overview ### 1. `test.py` This script is used to test the model's inference capabilities on a list of SMILES strings. It demonstrates how to load a pretrained model and tokenizer, prepare molecular data, and perform inference. #### Key Functions: - **Loading the model and tokenizer**: Use the following code to load a pretrained `UnimolForSequenceClassification` model and tokenizer: ```python model = UnimolForSequenceClassification.from_pretrained('pretrained/hg_unimol') tokenizer = UnimolTokenizer.from_pretrained('pretrained/hg_unimol') ``` - **Encoding SMILES strings**: SMILES strings can be encoded into model input using the `encode` method of the tokenizer: ```python encoded = tokenizer.encode(smiles, add_special_tokens=True, use_3d=True) ``` - **Performing inference**: The encoded data is passed into the model for inference: ```python model.to("cuda") logits = model(**encoded) print(logits) ``` #### Example Usage: ```bash python test.py ``` ### 2. `run.py` This script is used for training the `UnimolForSequenceClassification` model on a dataset of SMILES strings and labels, with optional evaluation. #### Key Functions: - **Preparing the dataset**: The dataset is read from an Excel file containing SMILES strings and corresponding labels. Each SMILES string is encoded using the tokenizer: ```python encoded = tokenizer.encode(smiles, add_special_tokens=True, use_3d=True) ``` - **Training the model**: The `Trainer` class is used to train the model with the dataset: ```python trainer = Trainer(model, dataset, params) trainer.train() ``` - **Evaluation**: The model can be evaluated using the `CustomTest` class: ```python test = CustomTest(trainer) test.run() ``` Here are the training results visualized in the following chart: ![Evaluation Metrics](./evaluation_metrics_pt.png) #### Example Usage: ```bash python run.py ``` ### 3. `move_weights.py` This script is used to migrate the state dictionary (`state_dict`) from the `dptech/unimol` implementation to this implementation. It performs necessary adjustments to layer names and weights to ensure compatibility with the `Unimol` model. #### Key Functions: - **Loading the state_dict from `dptech/unimol`**: The pretrained weights from `dptech/unimol` are loaded as follows: ```python cache = torch.load("pretrained/dptech_unimol/mol_pre_all_h_220816.pt")['model'] ``` - **Renaming and reorganizing layers**: The state_dict keys are renamed and reorganized to match the new model architecture. For example: ```python key = key.replace("encoder", "unimol") ``` - **Saving the new state_dict**: After all necessary modifications, the updated state_dict is saved: ```python torch.save(new_cache, "moved_weight.pt") ``` #### Example Usage: ```bash python move_weights.py ``` ## Folder Structure ```bash ├── model │ ├── modeling_unimol.py # Model definition for Unimol │ ├── configuration_unimol.py # Configuration for the Unimol model │ ├── tokenization_unimol.py # Tokenizer for the Unimol model ├── utils │ ├── data_collator.py # Data collator to batch the data │ ├── trainer.py # Trainer class for model training ├── pretrained │ ├── hg_unimol # Pretrained model and tokenizer ├── run.py # Training script ├── test.py # Inference testing script ├── move_weights.py # Script to migrate weights └── requirements.txt # Python dependencies ``` ## Notes This work is based on **Uni-Mol tools for property prediction, representation, and downstreams**. Uni-Mol tools are easy-to-use wrappers for property prediction, representation, and downstream tasks with Uni-Mol. It includes the following tools: - Molecular property prediction with Uni-Mol. - Molecular representation with Uni-Mol. - Other downstream tasks with Uni-Mol. For more details, check the [https://github.com/deepmodeling/Uni-Mol/](unimol_tools) repository. Documentation for Uni-Mol tools is available at: [https://unimol.readthedocs.io/en/latest/](https://unimol.readthedocs.io/en/latest/).