# NPS-generation

**Repository Path**: dot23/NPS-generation

## Basic Information

- **Project Name**: NPS-generation
- **License**: MIT
- **Default Branch**: master
- **Created**: 2021-11-22
- **Last Updated**: 2021-11-22

## README

This repository contains Python source code required to train and evaluate deep generative models of novel psychoactive substances, as used in the manuscript, "A deep generative model enables automated structure elucidation of novel psychoactive substances."

Due to its sensitivity and the potential for misuse, the data used to train the model is not publicly available for unrestricted download. However, the training dataset will be made available to all qualified researchers in the field upon request. Similarly, the model output, including all generated molecules, their sampling frequencies, and predicted tandem mass spectra, will also be provided upon request. Please contact David Wishart (david dot wishart at ualberta dot ca) to request access.

### Usage

The scripts in the `python` directory were used in the following order to preprocess the HighResNPS dataset, train chemical language models, evaluate the quality of the generated molecules, sample SMILES strings from the trained models, and tabulate unique molecules based on their frequency.

- `clean-SMILES.py`: preprocess chemical structures from the HighResNPS database for input during model training.
- `augment-SMILES.py`: enumerate multiple, non-canonical SMILES for each canonical SMILES in the file output by `clean-SMILES.py`, given some fixed data augmentation factor.
- `train_model.py`: train a recurrent neural network-based generative model of chemical structures.
- `calculate_outcomes.py`: calculate a suite of metrics used to benchmark different generative models, varying the amount of data augmentation and RNN architecture.
- `calculate_outcome_distributions.py`: write complete property distributions (not just summary statistics) for molecules generated by the best model in the benchmarking analysis.
- `sample_molecules.py`: sample a large number of SMILES strings (here, 1 billion) from the best generative model.
- `tabulate_molecules.py`: tabulate the frequency with which each unique molecule appears in the sample of 1 billion SMILES strings, and record its mass and molecular formula.

`datasets.py`, `functions.py`, and `models.py` contain additional classes and functions required for model training and analysis.

Arguments for usage from the command line are documented within each individual script.

A demonstration dataset of 2,000 SMILES for drug-like small molecules is provided to demonstrate the functionality of the code. Please note, however, that these molecules were sampled at random from the ChEMBL database (version 28) and are not themselves designer drugs. Please contact David Wishart (david dot wishart at ualberta dot ca) to request access to the complete training set used in the accompanying manuscript.

### Environment

The experiments described in the manuscript were carried out in a conda environment with the following packages installed. A copy of the environment is also provided in the file `environment.yml`.

```
# packages in environment at /home/skinnim/.conda/envs/chemenv:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main
beautifulsoup4            4.9.3              pyhb0f4dca_0
blas                      1.0                         mkl
brotlipy                  0.7.0           py36h27cfd23_1003
bzip2                     1.0.8                h7b6447c_0
ca-certificates           2021.4.13            h06a4308_1
cairo                     1.14.12              h8948797_3
certifi                   2020.12.5        py36h06a4308_0
cffi                      1.14.0           py36h2e261b9_0
chardet                   3.0.4           py36h06a4308_1003
conda                     4.9.2            py36h06a4308_0
conda-build               3.20.5                   py36_1
conda-package-handling    1.7.2            py36h03888b9_0
cryptography              3.3.1            py36h3c74f83_0
cudatoolkit               10.0.130                      0
cudnn                     7.6.5                cuda10.0_0
deepsmiles                1.0.1                    pypi_0    pypi
et_xmlfile                1.0.1                   py_1001
fcd-torch                 1.0.7                    pypi_0    pypi
filelock                  3.0.12                     py_0
fontconfig                2.13.0               h9420a91_0
freetype                  2.10.4               h5ab3b9f_0
glib                      2.63.1               h5a9c865_0
glob2                     0.7                        py_0
icu                       58.2                 he6710b0_3
idna                      2.10                       py_0
intel-openmp              2020.2                      254
jdcal                     1.4.1                      py_0
jinja2                    2.11.2                     py_0
jpeg                      9b                   h024ee3a_2
lcms2                     2.11                 h396b838_0
ld_impl_linux-64          2.33.1               h53a641e_7
libarchive                3.4.2                h62408e4_0
libboost                  1.73.0              h37e3b65_11
libedit                   3.1.20191231         h14c3975_1
libffi                    3.2.1             hf484d3e_1007
libgcc-ng                 9.1.0                hdf63c60_0
libgfortran-ng            7.3.0                hdf63c60_0
liblief                   0.10.1               he6710b0_0
libpng                    1.6.37               hbc83047_0
libstdcxx-ng              9.1.0                hdf63c60_0
libtiff                   4.1.0                h2733197_1
libuuid                   1.0.3                h1bed415_2
libxcb                    1.14                 h7b6447c_0
libxml2                   2.9.10               hb55368b_3
lz4-c                     1.9.2                heb0550a_3
markupsafe                1.1.1            py36h7b6447c_0
mkl                       2020.2                      256
mkl-service               2.3.0            py36he8ac12f_0
mkl_fft                   1.2.0            py36h23d657b_0
mkl_random                1.1.1            py36h0573a6f_0
ncurses                   6.2                  he6710b0_1
ninja                     1.10.2           py36hff7bd54_0
numpy                     1.19.2           py36h54aff64_0
numpy-base                1.19.2           py36hfa32c7d_0
olefile                   0.46                       py_0
openpyxl                  3.0.7              pyhd3eb1b0_0
openssl                   1.1.1k               h27cfd23_0
pandas                    1.1.3            py36he6710b0_0
patchelf                  0.12                 h2531618_1
pcre                      8.44                 he6710b0_0
pillow                    8.0.1            py36he98fc37_0
pip                       20.3.1           py36h06a4308_0
pixman                    0.40.0               h7b6447c_0
pkginfo                   1.6.1            py36h06a4308_0
psutil                    5.7.2            py36h7b6447c_0
py-boost                  1.73.0          py36h962f231_11
py-lief                   0.10.1           py36h403a769_0
pycosat                   0.6.3            py36h27cfd23_0
pycparser                 2.20                       py_2
pyopenssl                 20.0.0             pyhd3eb1b0_1
pysocks                   1.7.1            py36h06a4308_0
python                    3.6.10               h191fe78_1
python-dateutil           2.8.1                      py_0
python-libarchive-c       2.9                        py_0
pytorch                   1.1.0     cuda100py36he554f03_0
pytz                      2020.4             pyhd3eb1b0_0
pyyaml                    5.3.1            py36h7b6447c_1
rdkit                     2020.09.1.0      py36hd50e099_1    rdkit
readline                  7.0                  ha6073c6_4
requests                  2.25.0             pyhd3eb1b0_0
ripgrep                   12.1.1                        0
ruamel_yaml               0.15.87          py36h7b6447c_1
scipy                     1.5.2            py36h0b6359f_0
selfies                   1.0.2                    pypi_0    pypi
setuptools                51.0.0           py36h06a4308_2
six                       1.15.0           py36h06a4308_0
soupsieve                 2.0.1                      py_0
sqlite                    3.33.0               h62c20be_0
tk                        8.6.10               hbc83047_0
tqdm                      4.54.1             pyhd3eb1b0_0
urllib3                   1.25.11                    py_0
wheel                     0.36.1             pyhd3eb1b0_0
xlrd                      2.0.1              pyhd3eb1b0_0
xz                        5.2.5                h7b6447c_0
yaml                      0.2.5                h7b6447c_0
zlib                      1.2.11               h7b6447c_3
zstd                      1.4.5                h9ceee32_0
```

Note that a GPU is recommended for running the `train_model.py` and `sample_molecules.py` scripts.
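
The tabulation step described under Usage (counting how often each unique molecule appears among the sampled SMILES strings) can be illustrated with a minimal sketch. This is not the code in `tabulate_molecules.py`: the function names `rdkit_canonicalize` and `tabulate_frequencies` are hypothetical, and the sketch assumes that two SMILES strings denote the same molecule when they share an RDKit canonical SMILES and that unparseable strings are simply discarded.

```python
from collections import Counter
from typing import Callable, Iterable, List, Optional, Tuple


def rdkit_canonicalize(smiles: str) -> Optional[str]:
    """Return the RDKit canonical SMILES, or None if the string is invalid.

    Hypothetical helper; RDKit is imported lazily so that the counting
    logic below can be exercised without RDKit installed.
    """
    from rdkit import Chem

    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None


def tabulate_frequencies(
    sampled: Iterable[str],
    canonicalize: Callable[[str], Optional[str]],
) -> List[Tuple[str, int]]:
    """Count unique molecules in a stream of sampled SMILES strings.

    `canonicalize` maps a SMILES string to a canonical key, or to None
    when the string does not encode a valid molecule (assumption: such
    strings are skipped). Results are sorted by descending frequency.
    """
    counts: Counter = Counter()
    for smiles in sampled:
        key = canonicalize(smiles.strip())
        if key is not None:
            counts[key] += 1
    return counts.most_common()
```

With RDKit available, `tabulate_frequencies(lines, rdkit_canonicalize)` could be applied to an iterable of sampled strings. The mass and molecular formula mentioned above could then be computed per unique molecule, for example with RDKit's `Descriptors.ExactMolWt` and `rdMolDescriptors.CalcMolFormula`; whether the original script uses these particular functions is not stated here.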