# NPS-generation

**Repository Path**: dot23/NPS-generation

## Basic Information

- **Project Name**: NPS-generation
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2021-11-22
- **Last Updated**: 2021-11-22

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

## NPS-generation

This repository contains Python source code required to train and evaluate deep generative models of novel psychoactive substances, as used in the manuscript, "A deep generative model enables automated structure elucidation of novel psychoactive substances."

Due to its sensitivity and the potential for misuse, the data used to train the model is not publicly available for unrestricted download. However, the training dataset will be made available to all qualified researchers in the field upon request. Similarly, the model output, including all generated molecules, their sampling frequencies, and predicted tandem mass spectra, will also be provided upon request. Please contact David Wishart (david dot wishart at ualberta dot ca) to request access. 

### Usage

The scripts in the `python` directory were used in the following order to preprocess the HighResNPS dataset, train chemical language models, evaluate the quality of the generated molecules, sample SMILES strings from the trained models, and tabulate unique molecules based on their frequency.

- `clean-SMILES.py`: preprocess chemical structures from the HighResNPS database for input during model training. 
- `augment-SMILES.py`: enumerate multiple, non-canonical SMILES for each canonical SMILES in the file output by `clean-SMILES.py`, given some fixed data augmentation factor.
- `train_model.py`: train a recurrent neural network-based generative model of chemical structures.
- `calculate_outcomes.py`: calculate a suite of metrics used to benchmark different generative models, varying the amount of data augmentation and RNN architecture.
- `calculate_outcome_distributions.py`: write complete property distributions (not just summary statistics) for molecules generated by the best model in the benchmarking analysis.
- `sample_molecules.py`: sample a large number of SMILES strings (here, 1 billion) from the best generative model.
- `tabulate_molecules.py`: tabulate the frequency with which each unique molecule appears in the sample of 1 billion SMILES strings, and record its mass and molecular formula.
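
The final tabulation step amounts to counting how often each unique molecule appears among the sampled SMILES strings. The following is a minimal sketch of that idea, not the code in `tabulate_molecules.py`. The function name `tabulate_frequencies` and its `canonicalize` argument are hypothetical, and the default canonicalizer only strips whitespace. A real run would presumably pass a function that uses a cheminformatics toolkit such as RDKit to map different SMILES for the same molecule onto one canonical form (returning `None` for unparseable strings) and would also record mass and molecular formula.

```python
from collections import Counter
from typing import Callable, Iterable, List, Optional, Tuple


def tabulate_frequencies(
    smiles: Iterable[str],
    canonicalize: Optional[Callable[[str], Optional[str]]] = None,
) -> List[Tuple[str, int]]:
    """Count occurrences of each unique molecule, most frequent first."""
    if canonicalize is None:
        # Placeholder (assumption): strip whitespace and drop empty strings.
        # This does NOT merge different SMILES spellings of the same molecule.
        canonicalize = lambda s: s.strip() or None

    counts = Counter()
    for s in smiles:
        key = canonicalize(s)
        if key is not None:  # skip strings the canonicalizer rejected
            counts[key] += 1
    return counts.most_common()


# Toy input standing in for a large sample of generated SMILES.
sample = ["CCO", "CCO ", "c1ccccc1", "", "CCO"]
table = tabulate_frequencies(sample)
print(table)  # [('CCO', 3), ('c1ccccc1', 1)]
```

Keeping the canonicalizer as a parameter lets the counting logic be exercised without any chemistry dependencies.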

`datasets.py`, `functions.py`, and `models.py` contain additional classes and functions required for model training and analysis. Arguments for usage from the command line are documented within each individual script. 
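
As a convenience, the pipeline order above can be written down as a list, and a command built for each script to display its documented arguments. This sketch assumes the scripts live in a directory named `python` and respond to a `--help` flag (as scripts built on `argparse` do). That flag is an assumption not confirmed here, and the `help_commands` helper is hypothetical. The commands are only constructed and printed, not executed.

```python
import sys
from pathlib import Path
from typing import List

# Scripts in the order listed in the Usage section.
PIPELINE = [
    "clean-SMILES.py",
    "augment-SMILES.py",
    "train_model.py",
    "calculate_outcomes.py",
    "calculate_outcome_distributions.py",
    "sample_molecules.py",
    "tabulate_molecules.py",
]


def help_commands(script_dir: str = "python") -> List[List[str]]:
    """Build one `<interpreter> <script> --help` command per pipeline step."""
    # Assumption: each script accepts --help; check the script source if not.
    return [
        [sys.executable, str(Path(script_dir) / name), "--help"]
        for name in PIPELINE
    ]


for cmd in help_commands():
    print(" ".join(cmd))
```

Each command could be passed to `subprocess.run` from the repository root once the environment described below is active.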

A demonstration dataset of 2,000 SMILES for drug-like small molecules is provided to illustrate the functionality of the code. Please note, however, that these molecules were sampled at random from the ChEMBL database (version 28) and are not themselves designer drugs. Please contact David Wishart (david dot wishart at ualberta dot ca) to request access to the complete training set used in the accompanying manuscript.

### Environment

The experiments described in the manuscript were carried out in a conda environment with the following packages installed. A copy of the environment is also provided in the file `environment.yml`. 

```
# packages in environment at /home/skinnim/.conda/envs/chemenv:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main
beautifulsoup4            4.9.3              pyhb0f4dca_0
blas                      1.0                         mkl
brotlipy                  0.7.0           py36h27cfd23_1003
bzip2                     1.0.8                h7b6447c_0
ca-certificates           2021.4.13            h06a4308_1
cairo                     1.14.12              h8948797_3
certifi                   2020.12.5        py36h06a4308_0
cffi                      1.14.0           py36h2e261b9_0
chardet                   3.0.4           py36h06a4308_1003
conda                     4.9.2            py36h06a4308_0
conda-build               3.20.5                   py36_1
conda-package-handling    1.7.2            py36h03888b9_0
cryptography              3.3.1            py36h3c74f83_0
cudatoolkit               10.0.130                      0
cudnn                     7.6.5                cuda10.0_0
deepsmiles                1.0.1                    pypi_0    pypi
et_xmlfile                1.0.1                   py_1001
fcd-torch                 1.0.7                    pypi_0    pypi
filelock                  3.0.12                     py_0
fontconfig                2.13.0               h9420a91_0
freetype                  2.10.4               h5ab3b9f_0
glib                      2.63.1               h5a9c865_0
glob2                     0.7                        py_0
icu                       58.2                 he6710b0_3
idna                      2.10                       py_0
intel-openmp              2020.2                      254
jdcal                     1.4.1                      py_0
jinja2                    2.11.2                     py_0
jpeg                      9b                   h024ee3a_2
lcms2                     2.11                 h396b838_0
ld_impl_linux-64          2.33.1               h53a641e_7
libarchive                3.4.2                h62408e4_0
libboost                  1.73.0              h37e3b65_11
libedit                   3.1.20191231         h14c3975_1
libffi                    3.2.1             hf484d3e_1007
libgcc-ng                 9.1.0                hdf63c60_0
libgfortran-ng            7.3.0                hdf63c60_0
liblief                   0.10.1               he6710b0_0
libpng                    1.6.37               hbc83047_0
libstdcxx-ng              9.1.0                hdf63c60_0
libtiff                   4.1.0                h2733197_1
libuuid                   1.0.3                h1bed415_2
libxcb                    1.14                 h7b6447c_0
libxml2                   2.9.10               hb55368b_3
lz4-c                     1.9.2                heb0550a_3
markupsafe                1.1.1            py36h7b6447c_0
mkl                       2020.2                      256
mkl-service               2.3.0            py36he8ac12f_0
mkl_fft                   1.2.0            py36h23d657b_0
mkl_random                1.1.1            py36h0573a6f_0
ncurses                   6.2                  he6710b0_1
ninja                     1.10.2           py36hff7bd54_0
numpy                     1.19.2           py36h54aff64_0
numpy-base                1.19.2           py36hfa32c7d_0
olefile                   0.46                       py_0
openpyxl                  3.0.7              pyhd3eb1b0_0
openssl                   1.1.1k               h27cfd23_0
pandas                    1.1.3            py36he6710b0_0
patchelf                  0.12                 h2531618_1
pcre                      8.44                 he6710b0_0
pillow                    8.0.1            py36he98fc37_0
pip                       20.3.1           py36h06a4308_0
pixman                    0.40.0               h7b6447c_0
pkginfo                   1.6.1            py36h06a4308_0
psutil                    5.7.2            py36h7b6447c_0
py-boost                  1.73.0          py36h962f231_11
py-lief                   0.10.1           py36h403a769_0
pycosat                   0.6.3            py36h27cfd23_0
pycparser                 2.20                       py_2
pyopenssl                 20.0.0             pyhd3eb1b0_1
pysocks                   1.7.1            py36h06a4308_0
python                    3.6.10               h191fe78_1
python-dateutil           2.8.1                      py_0
python-libarchive-c       2.9                        py_0
pytorch                   1.1.0           cuda100py36he554f03_0
pytz                      2020.4             pyhd3eb1b0_0
pyyaml                    5.3.1            py36h7b6447c_1
rdkit                     2020.09.1.0      py36hd50e099_1    rdkit
readline                  7.0                  ha6073c6_4
requests                  2.25.0             pyhd3eb1b0_0
ripgrep                   12.1.1                        0
ruamel_yaml               0.15.87          py36h7b6447c_1
scipy                     1.5.2            py36h0b6359f_0
selfies                   1.0.2                    pypi_0    pypi
setuptools                51.0.0           py36h06a4308_2
six                       1.15.0           py36h06a4308_0
soupsieve                 2.0.1                      py_0
sqlite                    3.33.0               h62c20be_0
tk                        8.6.10               hbc83047_0
tqdm                      4.54.1             pyhd3eb1b0_0
urllib3                   1.25.11                    py_0
wheel                     0.36.1             pyhd3eb1b0_0
xlrd                      2.0.1              pyhd3eb1b0_0
xz                        5.2.5                h7b6447c_0
yaml                      0.2.5                h7b6447c_0
zlib                      1.2.11               h7b6447c_3
zstd                      1.4.5                h9ceee32_0
```

Note that a GPU is recommended for running the `train_model.py` and `sample_molecules.py` scripts.
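
Before starting a long training or sampling run, it can help to check whether PyTorch can see a GPU. The sketch below is not part of the repository, and the `pick_device` helper is hypothetical. It relies only on `torch.cuda.is_available()` and falls back to the CPU if PyTorch is not installed or no CUDA device is found.

```python
def pick_device() -> str:
    """Return "cuda" if PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch  # listed as `pytorch` in the environment above
    except ImportError:
        # PyTorch is not installed in the active environment.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Using device: {device}")
```

If this prints `cpu` on a machine with a GPU, the installed `cudatoolkit` and driver versions are a reasonable first thing to check.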