Wednesday, May 25, 2016

Automating calculations: pKa predictions

I want to compute the standard free energy difference for proton exchange between heliotridane and a molecule which I'll call compound 1 (comp1)

heliotridaneH+ + comp1 <=> heliotridane + comp1H+

To do this I need to geometry optimize in the gas phase and then do a single point calculation using a solvation model, all using GAMESS. I want to automate this so I can do it for comp2, comp3, ...

To do this I use bash and the python programs listed below

Step 1 make a file file called smiles.list which contains the name and SMILES strings, which defines both the molecule, the protonation state, and the conformation around the protonated N. I still need to automate this step.

Step 2  make so-called header files for the geometry optimization (pm6_opt.header) and solvation energy (pm6_sol.header) calculations.  These files contain the relevant keywords for the calculations.

Step 3 convert the SMILES strings to 3D coordinates (xyz files).

python smiles2coord.py smiles.list

This also creates som sdf files which I can now remove (rm *.sdf).


Step 4 convert the xyz files to GAMESS input files for geometry optimization
When naming the molecules in smiles.list I was carefull to label all the protonated forms with a "+".

for x in *.xyz; do babel -ixyz $x -ogamin ${x%.*}_opt.inp -xf pm6_opt.header; done

I could have modified smiles2coord_lis.py to generate these files directly, but I like to keep things modular.  At this point I can also get rid of the xyz files (xyz files).

Step 5 change the charge for the protonated forms.

sed -i 's/icharg=0/icharg=1/g' *+*.inp


Step 6 submit the GAMESS calculations.

for x in *_opt.inp; do submit_gamess $x; done

My GAMESS submit script is called "submit_gamess"

Step 7 wait til all the optimizations are done and that they all converged.

grep EQUILIBRIUM *.log

Step 8 transfer the optimized geometry to a new input file for the solvation energy calculations

for x in *_opt.log; do babel -igamout $x -ogamin ${x%.*}_sol.inp -xf pm6_sol.header; done

Step 9 change the charge for the protonated forms.

sed -i 's/icharg=0/icharg=1/g' *+*_sol.inp

Step 10 submit the GAMESS calculations.

for x in *_sol.inp; do submit_gamess $x; done

Step 11 wait til all the optimizations are done and that the calculations finished normally.

grep NORMALLY *_sol.log

Step 12 extract the last heat of formation in each file and put them in a file called energies

python readv2.py *._sol.log > energies

Step 13 find the lowest energy for each conformer and compute the relative energies and pKa values
python pka.py energies

Files referred to above





This work is licensed under a Creative Commons Attribution 4.0

Thursday, May 5, 2016

A brief introduction to SMILES and InChI



SMILES and InChI are two notations that that can be used to compactly describe molecules.  As molecules get complicated SMILES and InChI are are great alternative to a chemical names but there are many other uses (see the end of this post for an example).  Here is a brief intro to a few of the basics.

Ethane
SMILES: CC
InChI: InChI=1S/C2H6/c1-2/h1-2H3

SMILES is pretty straight forward: ethane is two C atoms singly bonded to each other.  Two atoms next to each other imply a single bond and programs that read SMILES strings are smart enough to know that each C atom needs 3 H atoms to fill the valence shell.

InChI requires a little more explanation: "1S" is version 1 of Standard InChi (more about that later). C2H6 is the empirical formula and tells you that the first two atoms are C atoms. c1-2 means that atom 1 and 2 are connected. h1-2H3 means that atom 1-2 each have 3 H atoms. Programs that read SMILES strings are smart enough to know that this means that the two C atoms are connected by a single bond.

So, SMILES defines the bond types from which one can infer protonation, while InChI defines protonation from which one can infer the bond types (key differences indicated in bold):

Ethene and ethyne
SMILES: C=C and C#C
InChI: InChI=1S/C2H4/c1-2/h1-2H2 and InChI=1S/C2H2/c1-2/h1-2H

SMILES: double and triple bonds are indicated by "=" and "#", respectively.
InChI: "CH2" and "CH" are indicated by "H2" and "H", respectively.

Ethanol and dimethyl ether
SMILES: CCO and COC
InChI: InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3 and  InChI=1S/C2H6O/c1-3-2/h1-2H3

SMILES: pretty self explanatory I think (if not leave a comment).
InChI: "C2H6O" means that the first and second atoms (1 and 2) are C atoms and the third (3) is an O atom.  The connectivity is 1-2-3 for ethanol and 1-3-2 for dimethyl ether.  For ethanol atom 3 has 1 H atom, atom 2 has 2 H atoms, and atom 1 has 3 H atoms.  For dimethyl ether atom 1-2 have 3 H atoms, while atom 3 has none.

Note that InChI keeps the atom ordering the same in the two molecules while SMILES changes it, i.e. the oxygen is atom number 3 for both ethanol and dimethyl ether defined by InChI, while it is atom number 3 and 2, respectively, when defined by SMILES.

Benzene and toluene
SMILES: c1ccccc1 and Cc1ccccc1
InChI: InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H and InChI=1S/C7H8/c1-7-5-3-2-4-6-7/h2-6H,1H3

SMILES: in the case of benzene atom 1 is connected to both atom 2 and atom 6, i.e.  a ring is formed. "1" is a label and does not refer to atom number 1 (see toluene). A lower case "c" is used to indicate aromatic carbons, meaning they should be singly protonated.  For toluene, the methyl group is bonded to atom number 2, which is also bonded to atom number 7.

InChI: In the case of benzene aromaticity is inferred from the fact that all 6 carbon have 1 H atom (h1-6H).  For toluene, the methyl group is bonded to atom number 7, which is also bonded to atom number 6.

Ethylamine and ethylammonium
SMILES: CCN and CC[NH3+]
InChI: InChI=1S/C2H7N/c1-2-3/h2-3H2,1H3 and InChI=1S/C2H7N/c1-2-3/h2-3H2,1H3/p+1

SMILES: pretty self explanatory I think (if not leave a comment).
InChI: The protonation (h2-3H2,1H3) is identical in both cases an corresponds to ethylamine. For ethylammonium "p+1" indicates that an extra proton is added, but it doesn't specify where (more on that later).

Acetic acid and acetate
SMILES: CC(=O)O and CC(=O)[O-]
InChI: InChI=1S/C2H4O2/c1-2(3)4/h1H3,(H,3,4) and InChI=1S/C2H4O2/c1-2(3)4/h1H3,(H,3,4)/p-1

SMILES: pretty self explanatory I think (if not leave a comment).
InChI: In the case of acetic acid InChi recognises that there are two "tautomers", i.e. that both oxygens can have the hydrogen atom.  In this particular case, there is free rotation about the CC bond so the oxygens are equivalent and we don't really think about acetic acid having tautomers, but see next case.

Formamide
SMILES: C(=O)N
InChI: InChI=1S/CH3NO/c2-1-3/h1H,(H2,2,3)

Here SMILES and InChI start to differ a bit in the chemistry.  InChI recognises that formamide can exist in two tautomeric states, HC(=O)NH2 and HC(OH)=NH, while SMILES only specifies the dominant one, HC(=O)NH2, by default.

Both SMILES and InChI can be used to specify particular tautomers.  For SMILES the other tautomer is "C(OH)=N"while for InChI it is InChI=1/CH3NO/c2-1-3/h1H,(H2,2,3)/f/h2H2 and InChI=1/CH3NO/c2-1-3/h1H,(H2,2,3)/f/h2,3H.  Notice that the S is missing in InChI=1 because the use of a fixed H layer (indicated by /f) is a "non-standard" InChI.

cis- and trans-2-butene
SMILES: C/C=C\C and C/C=C/C
InCHhI: InChI=1S/C4H8/c1-3-4-2/h3-4H,1-2H3/b4-3- and InChI=1S/C4H8/c1-3-4-2/h3-4H,1-2H3/b4-3+

L- and D-alanine
SMILES: C[C@@H](C(=O)O)N and C[C@H](C(=O)O)N
InChI: InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1 and InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1

Building molecules using SMILES and InChI
One use of SMILES and InChI is to quickly and automatically build molecules. 

Furthermore, it is not too difficult to write scripts that generate SMILES strings based on some template.  For example, you can define a template c1ccc(c(c1)X)Y and substituents X, Y = O, F, and N and generate all possible combinations using a bit of python.  The resulting list of SMILES strings can then be converted to coordinates using Cactus.  Imagine the tedium of building all these molecules in a builder.

Learning more about SMILES and InChI
Theres a lot more to SMILES and InChI than covered in this post and plenty more info online (e.g. here for SMILES and InChI).  However, personally I learned most from generating a bunch of examples using this cactus site, which can convert chemical names to SMILES or InChI as well as SMILES and InChI to a 3D structure (pick TwirlyMol in the menu).

Related posts
A simple script to get molecular coordinates from a chemical name using Open Babel
Automating calculations: pKa predictions


This work is licensed under a Creative Commons Attribution 4.0