I will be involved in two main projects during this summer internship. The first is B3DB, the Blood-Brain Barrier Database, where I will work with machine learning models to predict a molecule's blood-brain barrier (BBB) permeability. The second is the construction of molecular databases with computational data. For this project, each of us is assigned database "issues", and each issue requires several steps:

  • Obtain structures.
  • Visualize them to make sure they are sensible.
  • Optimize geometries.
  • Complete single-point UHF calculations using the def2-SVPD, def2-TZVPD, and def2-QZVPD basis sets.
  • Complete single-point UB3LYP calculations using the def2-SVPD, def2-TZVPD, and def2-QZVPD basis sets.
  • Complete single-point UωB97X-D calculations using the def2-SVPD, def2-TZVPD, and def2-QZVPD basis sets.
  • Post-process the calculations.
  • Transfer the assigned database to the group.
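The three single-point steps above amount to nine method/basis combinations per molecule. As a small sketch, the loop below only prints one command per combination. The command form is copied from the `database input` call used later in this post; whether the tool accepts exactly these method and basis spellings is my assumption.

```shell
# Enumerate the nine method/basis single-point combinations from the list above.
methods="uhf ub3lyp uwb97xd"
bases="def2-svpd def2-tzvpd def2-qzvpd"

list_single_points() {
    for method in $methods; do
        for basis in $bases; do
            # Printed only, nothing is executed; the quoted glob is left unexpanded.
            echo "database input g16 force $method $basis 00*/*.xyz"
        done
    done
}

list_single_points
```

Printing the commands first makes it easy to review them before running anything on the cluster.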

In this post I describe an example of the workflow for the construction of a molecular database.

S22 Example

Obtain Chemical Structures

Download the structures, unzip them, and move the files to a new directory:

wget 'http://www.begdb.org/moldown.php?id=4'
unzip 'moldown.php?id=4'
mkdir BEGDB_S22
mv *.xyz BEGDB_S22/

Sanitize Chemical Structure

Files downloaded from the internet are not always reliably formatted, so it is important to sanitize the structures.

cd BEGDB_S22
database sanitize *.xyz
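What `database sanitize` checks internally is not covered in this post. As an independent sanity check, an XYZ file declares its atom count on the first line, carries a comment on the second, and lists one atom per line after that, so a mismatch between the declared and actual counts is easy to spot. The helper name `check_xyz` below is my own and is not part of the database tool.

```shell
# check_xyz FILE: succeed if the atom count on line 1 matches the number of
# non-blank coordinate lines (line 3 onward).
check_xyz() {
    file=$1
    declared=$(head -n 1 "$file" | tr -d '[:space:]')
    actual=$(tail -n +3 "$file" | grep -c '[^[:space:]]')
    [ "$declared" = "$actual" ]
}

# Report every .xyz file in the current directory whose counts disagree.
for f in *.xyz; do
    [ -e "$f" ] || continue   # skip when no .xyz files are present
    check_xyz "$f" || echo "atom count mismatch: $f"
done
```

This only catches truncated or padded files; it says nothing about whether the geometry itself is chemically sensible, so visual inspection is still needed.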

Generate Input File

The folder command creates a new folder for every molecule. If the number command is also applied, the folder names also carry a four-digit number starting at 0001, which is very useful when working with big datasets. The input command generates a Gaussian input file from a geometry file, which can be in .XYZ, .GJF, or .FCHK format. Input files are specific to the type of calculation to be run: use the opt argument for a geometry optimization and the force argument for a single-point energy calculation. The level of theory and the basis set are also specified in the command.

cd BEGDB_S22

database folder *.xyz

database input g16 force uhf sto-3g 00*/*.xyz
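For orientation, a Gaussian input for `force uhf sto-3g` would look roughly like the file written below. This is a hand-written sketch, not output from the database tool: the file name, the resource lines, and the water geometry are all illustrative assumptions.

```shell
# Write a hand-made Gaussian input for a water molecule (illustrative geometry).
cat > water_force_uhf_sto3g.com <<'EOF'
%NProcShared=1
%Mem=1GB
# uhf/sto-3g force

water, single-point energy and forces

0 1
O   0.000000   0.000000   0.117790
H   0.000000   0.755453  -0.471161
H   0.000000  -0.755453  -0.471161

EOF
```

The layout is: Link 0 lines starting with `%`, the route line starting with `#`, a blank line, a title, a blank line, the charge and multiplicity (`0 1` here), the atoms, and a final blank line.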

Chemical Structure from Input File

An input file can also be converted back to a geometry file, because every input file contains the geometry from which the calculation starts.

database convert xyz 00*/*.gjf

Run Quantum Chemistry Calculation

Once the input files are ready, they need to be run on a computing cluster such as Graham, part of Compute Canada, and this requires a job scheduler. Graham uses the Slurm scheduler. To run the calculations, first create job scripts from the input files and then submit these job scripts to the scheduler.

cd BEGDB_S22
database slurm g16 00-01:00 rrg-ayers-ab 00*/00*_force_uhf_sto3g/*.com
database submit 00*/00*_force_uhf_sto3g/*.sh
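I have not inspected the scripts that `database slurm` writes, so the following is only a sketch of what a Slurm job script for one Gaussian input might contain. The account and time limit come from the command above; the memory value, the module name, and the file names `job_example.sh`, `example.com`, and `example.log` are assumptions.

```shell
# Write a minimal Slurm job script for one Gaussian input (hypothetical file names).
cat > job_example.sh <<'EOF'
#!/bin/bash
# Allocation and time limit (D-HH:MM, one hour) as used in this post.
#SBATCH --account=rrg-ayers-ab
#SBATCH --time=00-01:00
# Assumed resources; these are not set by the command in this post.
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
# Assumed module name; check `module avail` on the cluster.
module load gaussian/g16.c01
g16 < example.com > example.log
EOF

# Nothing is submitted here; print the submission command instead.
echo "submit with: sbatch job_example.sh"
```

Keeping the memory request in the job script consistent with the `%Mem` line of the Gaussian input avoids jobs being killed for exceeding their allocation.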

Check Quantum Chemistry Calculations

Once the calculations are done, use the check command to verify that they completed successfully. Depending on the type of calculation that was run, several flags can be used to avoid false error messages.

cd BEGDB_S22
database check g16 00*/00*_force_uhf_sto3g/*.log
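A quick manual cross-check is also possible, since a Gaussian log that finished cleanly ends with a "Normal termination" line. The helper below is my own sketch, not how `database check` works, and the name `report_termination` is hypothetical.

```shell
# report_termination FILE...: print OK or FAIL for each Gaussian log,
# based on whether "Normal termination" appears in its last five lines.
report_termination() {
    for log in "$@"; do
        if tail -n 5 "$log" | grep -q 'Normal termination'; then
            echo "OK   $log"
        else
            echo "FAIL $log"
        fi
    done
}
```

For the example above it could be called as `report_termination 00*/00*_force_uhf_sto3g/*.log`. It only tells whether the job ended normally, not whether the results are meaningful.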

Group 16 Example

Create JSON file from "CID:NAME:CHARGE:MULTIPLICITY"

mkdir group_16

cd group_16

database json 139605:SF2:0:1 24555:SF4 17358:SF6 24548:SOF2 -o group_16.json
# check the content of JSON file
cat group_16.json

database download group_16.json

# check downloaded input files (use space-bar to go through the inputs & use q key to exit)
cat *.gjf | less

database convert xyz 00*/*.gjf
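To make the "CID:NAME:CHARGE:MULTIPLICITY" format concrete, the sketch below splits each specification on the colons. The function name `parse_spec` is hypothetical, and treating a missing charge as 0 and a missing multiplicity as 1 is my assumption; I have not confirmed which defaults the database tool applies.

```shell
# parse_spec SPEC: split "CID:NAME[:CHARGE[:MULTIPLICITY]]" into its fields.
parse_spec() {
    IFS=: read -r cid name charge mult <<EOF
$1
EOF
    # Assumed defaults for omitted fields: charge 0, multiplicity 1.
    echo "cid=$cid name=$name charge=${charge:-0} multiplicity=${mult:-1}"
}

for spec in 139605:SF2:0:1 24555:SF4 17358:SF6 24548:SOF2; do
    parse_spec "$spec"
done
```

Comparing this output with the contents of group_16.json is a simple way to confirm that each molecule ended up with the intended charge and multiplicity.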

Other notes:

Suggested changes to the Database instructions, which are somewhat difficult to follow:

  1. Errors in scripts: the "Obtaining Geometry" section shows the command to create geometry files from .GJF files before any .GJF files have been created.
  2. Propose a friendlier structure: instead of having multiple examples per command or stage of the quantum chemistry workflow, walk through an entire workflow for each example dataset. A notebook format might be very suitable for this purpose.

Questions:

  • database slurm doesn't have a parameter for memory. How is the memory calculated?
  • From which account should I submit my calculations, rrg-ayers-ab or def-ayers? Answer: rrg-ayers-ab. For long calculations, use def-ayers.
  • Within a virtual environment, how can I change the prompt to show only the last folder of the path? Answer: change the \w in PS1 to \W, which shows only the last component of pwd.
  • After geometry optimization, why is a .JSON file with charge and multiplicity generated for nasty_anions if the previous input files already had them? Answer: so that the charge and multiplicity of the molecule are read correctly.
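For the prompt question above, a minimal bash setting might look like the line below. The `\u@\h:` part is just one common layout and is my assumption; the relevant piece is `\W`, which prints only the last component of the working directory, where `\w` prints the full path.

```shell
# Show user, host, and only the last component of the working directory.
PS1='\u@\h:\W\$ '
```

Adding this line to ~/.bashrc makes the change persist across sessions.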