June 25 - Databases
There are two main projects in which I will be involved during this summer internship. The first is B3DB, which stands for Blood-Brain Barrier Database, where I will work with machine learning models to predict a molecule's BBB permeability. The second is the construction of molecular databases with computational data. For this project, we are each assigned database "issues", and each one requires several steps:
- Obtain structures.
- Visualize them to make sure they are sensible.
- Optimize geometries.
- Complete single-point uhf calculations using def2-SVPD, def2-TZVPD, def2-QZVPD basis sets.
- Complete single-point ub3lyp calculations using def2-SVPD, def2-TZVPD, def2-QZVPD basis sets.
- Complete single-point uwb97xd calculations using def2-SVPD, def2-TZVPD, def2-QZVPD basis sets.
- Post-process the calculations.
- Transfer the assigned database to the group.
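The three single-point steps above amount to nine calculations per molecule (three methods times three basis sets). A minimal shell sketch of that enumeration follows; the variable and function names are illustrative and not part of the database tool.

```shell
# Methods and basis sets taken from the checklist above.
# def2-SVPD is the standard spelling of the smallest of the three sets.
methods="uhf ub3lyp uwb97xd"
basis_sets="def2-SVPD def2-TZVPD def2-QZVPD"

# list_single_points: print one "method/basis" line per combination.
list_single_points() {
    for method in $methods; do
        for basis in $basis_sets; do
            printf '%s/%s\n' "$method" "$basis"
        done
    done
}

list_single_points
```

Listing the combinations up front makes it easy to tick off which single-point runs are still missing for a given database.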
In this post I describe an example of the workflow for the construction of a molecular database.
S22 Example
Obtain Chemical Structures
Download the structures, unzip them, and move the files to a new directory:
wget 'http://www.begdb.org/moldown.php?id=4'
unzip 'moldown.php?id=4'
mkdir BEGDB_S22
mv *.xyz BEGDB_S22/
Sanitize Chemical Structure
Files from internet sources are not always reliable in their format, so it is important to sanitize the structures.
cd BEGDB_S22
database sanitize *.xyz
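What database sanitize checks internally is not covered here. As an illustration of the kind of format problem that is worth catching, the sketch below tests whether a file follows the plain XYZ layout (atom count, comment line, then one coordinate line per atom). The function name check_xyz and the two test files are made up for this example.

```shell
# check_xyz FILE: succeed if FILE follows the plain XYZ layout
# (line 1: atom count, line 2: comment, then one "El x y z" line per atom).
check_xyz() {
    awk '
        NR == 1 { natoms = $1 + 0; next }   # declared number of atoms
        NR == 2 { next }                    # free-form comment line
        NF == 0 { next }                    # tolerate trailing blank lines
        NF < 4  { bad = 1 }                 # coordinate line is too short
        { count++ }
        END { exit (bad || natoms == 0 || count != natoms) ? 1 : 0 }
    ' "$1"
}

# Two small test files: a complete water molecule and one with an atom missing.
tmpdir=$(mktemp -d)
printf '3\nwater\nO 0.000 0.000 0.117\nH 0.000 0.757 -0.469\nH 0.000 -0.757 -0.469\n' > "$tmpdir/good.xyz"
printf '3\nwater, one atom missing\nO 0.000 0.000 0.117\nH 0.000 0.757 -0.469\n' > "$tmpdir/bad.xyz"

for f in "$tmpdir"/*.xyz; do
    if check_xyz "$f"; then
        echo "ok:         $f"
    else
        echo "suspicious: $f"
    fi
done
```

A check like this only catches layout problems; it says nothing about whether the geometry itself is chemically sensible, which is why the visual inspection step is still needed.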
Generate Input File
With the folder command, a new folder is created for every molecule. If the number command is also applied, these folders are labelled with a four-digit number starting at 0001. This is very useful when working with big datasets. The input command generates a Gaussian input file from a geometry file, which can be in different formats such as .XYZ, .GJF or .FCHK. The input files are specific to the type of calculation that will be run. If a geometry optimization is required, use the opt argument. For a single-point energy calculation, use the force argument. The level of theory and basis set are also specified in the command.
cd BEGDB_S22
database folder *.xyz
database input g16 force uhf sto-3g 00*/*.xyz
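The four-digit labels mentioned above can be illustrated with printf. This is only a sketch of zero-padding: the folder names that the database tool actually produces may differ, and the molecule names below are placeholders.

```shell
# pad_index N: print N as a zero-padded four-digit label, e.g. 7 -> 0007.
pad_index() {
    printf '%04d\n' "$1"
}

i=1
for name in water_dimer ammonia_dimer methane_dimer; do   # placeholder names
    echo "$(pad_index "$i")_$name"
    i=$((i + 1))
done
```

Zero-padded labels sort in the same order alphabetically and numerically, and a glob such as 00* matches every label from 0001 to 0099 in one go.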
Chemical Structure from Input File
An input file can also be converted back to a geometry file, because input files contain the geometry from which the calculation starts.
database convert xyz 00*/*.gjf
Run QChem Calculation
Once the input files are ready, they need to be submitted to a computing cluster such as those of Compute Canada, which requires a job scheduler. The scheduler on Graham is SLURM. To run the calculations, job scripts are first created from the input files and then submitted to the scheduler.
cd BEGDB_S22
database slurm g16 00-01:00 rrg-ayers-ab 00*/00*_force_uhf_sto3g/*.com
database submit 00*/00*_force_uhf_sto3g/*.sh
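For orientation, this is roughly what a SLURM job script for one Gaussian input could look like. It is a generic sketch and not the script that database slurm generates: the function name write_job, the example file name, and the module load line are assumptions. In SLURM's days-hours:minutes notation, the time limit 00-01:00 used above means one hour.

```shell
# write_job INPUT TIME ACCOUNT: print a minimal SLURM script for one Gaussian input.
write_job() {
    input=$1
    walltime=$2
    account=$3
    cat <<EOF
#!/bin/bash
#SBATCH --time=$walltime
#SBATCH --account=$account
#SBATCH --job-name=$(basename "$input" .com)
# Assumption: the module name differs between clusters.
module load gaussian
g16 < $input > ${input%.com}.log
EOF
}

# Hypothetical input file name; time limit and account as in the commands above.
write_job 0001_example.com 00-01:00 rrg-ayers-ab
```

Printing the script to standard output makes it easy to inspect before redirecting it to a .sh file and handing it to the scheduler.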
Check Quantum Chemistry Calculations
Once the calculations are done, use the check command to see whether they finished successfully. Depending on the type of calculation that was run, several flags can be used to avoid false error messages.
cd BEGDB_S22
database check g16 00*/00*_force_uhf_sto3g/*.log
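A quick manual cross-check is to look for the "Normal termination of Gaussian" line that Gaussian writes at the end of a successful run. The sketch below is not what database check does internally; the function name finished_ok and the mock log files are made up for illustration.

```shell
# finished_ok LOG: succeed if the tail of LOG reports normal termination.
finished_ok() {
    tail -n 5 "$1" | grep -q 'Normal termination of Gaussian'
}

# Mock log files, for illustration only (not real Gaussian output).
logdir=$(mktemp -d)
printf ' SCF Done (mock)\n Normal termination of Gaussian 16 (mock line)\n' > "$logdir/good.log"
printf ' SCF Done (mock)\n Error termination (mock line)\n' > "$logdir/bad.log"

for log in "$logdir"/*.log; do
    if finished_ok "$log"; then
        echo "ok:     $log"
    else
        echo "failed: $log"
    fi
done
```

Only the last few lines are inspected, since a multi-step job can print the termination message more than once and the final one is the one that matters.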
Group 16 Example
Create a JSON file from "CID:NAME:CHARGE:MULTIPLICITY" entries:
mkdir group_16
cd group_16
database json 139605:SF2:0:1 24555:SF4 17358:SF6 24548:SOF2 -o group_16.json
# check the content of JSON file
cat group_16.json
database download group_16.json
# check downloaded input files (use space-bar to go through the inputs & use q key to exit)
cat *.gjf | less
database convert xyz 00*/*.gjf
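The "CID:NAME:CHARGE:MULTIPLICITY" entries used in this example can be split into their fields with plain shell. In the sketch below, the function name parse_entry is made up, and the defaults of charge 0 and multiplicity 1 for entries that omit them (such as 24555:SF4) are an assumption about how the tool fills them in.

```shell
# parse_entry "CID:NAME[:CHARGE[:MULTIPLICITY]]": print the four fields.
# Assumption: a missing charge means 0 and a missing multiplicity means 1.
parse_entry() {
    IFS=: read -r cid name charge mult <<EOF
$1
EOF
    printf '%s %s %s %s\n' "$cid" "$name" "${charge:-0}" "${mult:-1}"
}

for entry in 139605:SF2:0:1 24555:SF4 17358:SF6 24548:SOF2; do
    parse_entry "$entry"
done
```

Comparing this output with the content of group_16.json is a quick way to confirm that every molecule ended up with the intended charge and multiplicity.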
Other notes:
Add changes to the Database instructions, as they are somewhat difficult to follow:
- Errors in scripts -> the "Obtaining Geometry" section shows the command to create geometry files from .GJF files without creating those files first.
- Propose a new, friendlier structure: instead of having multiple examples per command or stage of the quantum chemistry calculation workflow, present an entire workflow for each example dataset. A notebook structure might be very suitable for this purpose.
Questions:
- database slurm doesn't have a parameter for memory. How is this calculated?
- From which account should I submit my calculations, rrg-ayers-ab or def-ayers? Answer: rrg-ayers-ab. For long calculations it's def-ayers.
- Within a virtual environment, how can I change the prompt to show only the last folder? Answer: change the \w parameter in PS1 to \W, which shows only the last component of pwd.
- After geometry optimization, why are the nasty_anions generated with a .JSON file holding charge and multiplicity if the previous input files already had them? Answer: it is needed to read the charge and multiplicity of the molecule correctly.
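For the prompt question above, a minimal bash sketch. The prompt layout and the example path are hypothetical, and the helper last_component only demonstrates what the shortened form keeps.

```shell
# In bash, \w in PS1 expands to the full working directory and \W to its
# last component only, so this prompt shows just the current folder name.
PS1='\u@\h:\W\$ '

# The same shortening done by hand, to show what \W keeps.
last_component() {
    printf '%s\n' "${1##*/}"
}

last_component /home/user/BEGDB_S22/0001_example   # hypothetical path
```

Putting the PS1 line in ~/.bashrc makes the change permanent for new shells.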