Thursday, May 5, 2016

A brief introduction to SMILES and InChI

SMILES and InChI are two notations that that can be used to compactly describe molecules.  As molecules get complicated SMILES and InChI are are great alternative to a chemical names but there are many other uses (see the end of this post for an example).  Here is a brief intro to a few of the basics.

Ethane
SMILES: CC
InChI: InChI=1S/C2H6/c1-2/h1-2H3

SMILES is pretty straight forward: ethane is two C atoms singly bonded to each other.  Two atoms next to each other imply a single bond and programs that read SMILES strings are smart enough to know that each C atom needs 3 H atoms to fill the valence shell.

InChI requires a little more explanation: "1S" is version 1 of Standard InChi (more about that later). C2H6 is the empirical formula and tells you that the first two atoms are C atoms. c1-2 means that atom 1 and 2 are connected. h1-2H3 means that atom 1-2 each have 3 H atoms. Programs that read SMILES strings are smart enough to know that this means that the two C atoms are connected by a single bond.

So, SMILES defines the bond types from which one can infer protonation, while InChI defines protonation from which one can infer the bond types (key differences indicated in bold):

Ethene and ethyne
SMILES: C=C and C#C
InChI: InChI=1S/C2H4/c1-2/h1-2H2 and InChI=1S/C2H2/c1-2/h1-2H

SMILES: double and triple bonds are indicated by "=" and "#", respectively.
InChI: "CH2" and "CH" are indicated by "H2" and "H", respectively.

Ethanol and dimethyl ether
SMILES: CCO and COC
InChI: InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3 and  InChI=1S/C2H6O/c1-3-2/h1-2H3

SMILES: pretty self explanatory I think (if not leave a comment).
InChI: "C2H6O" means that the first and second atoms (1 and 2) are C atoms and the third (3) is an O atom.  The connectivity is 1-2-3 for ethanol and 1-3-2 for dimethyl ether.  For ethanol atom 3 has 1 H atom, atom 2 has 2 H atoms, and atom 1 has 3 H atoms.  For dimethyl ether atom 1-2 have 3 H atoms, while atom 3 has none.

Note that InChI keeps the atom ordering the same in the two molecules while SMILES changes it, i.e. the oxygen is atom number 3 for both ethanol and dimethyl ether defined by InChI, while it is atom number 3 and 2, respectively, when defined by SMILES.

Benzene and toluene
SMILES: c1ccccc1 and Cc1ccccc1
InChI: InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H and InChI=1S/C7H8/c1-7-5-3-2-4-6-7/h2-6H,1H3

SMILES: in the case of benzene atom 1 is connected to both atom 2 and atom 6, i.e.  a ring is formed. "1" is a label and does not refer to atom number 1 (see toluene). A lower case "c" is used to indicate aromatic carbons, meaning they should be singly protonated.  For toluene, the methyl group is bonded to atom number 2, which is also bonded to atom number 7.

InChI: In the case of benzene aromaticity is inferred from the fact that all 6 carbon have 1 H atom (h1-6H).  For toluene, the methyl group is bonded to atom number 7, which is also bonded to atom number 6.

Ethylamine and ethylammonium
SMILES: CCN and CC[NH3+]
InChI: InChI=1S/C2H7N/c1-2-3/h2-3H2,1H3 and InChI=1S/C2H7N/c1-2-3/h2-3H2,1H3/p+1

SMILES: pretty self explanatory I think (if not leave a comment).
InChI: The protonation (h2-3H2,1H3) is identical in both cases an corresponds to ethylamine. For ethylammonium "p+1" indicates that an extra proton is added, but it doesn't specify where (more on that later).

Acetic acid and acetate
SMILES: CC(=O)O and CC(=O)[O-]
InChI: InChI=1S/C2H4O2/c1-2(3)4/h1H3,(H,3,4) and InChI=1S/C2H4O2/c1-2(3)4/h1H3,(H,3,4)/p-1

SMILES: pretty self explanatory I think (if not leave a comment).
InChI: In the case of acetic acid InChi recognises that there are two "tautomers", i.e. that both oxygens can have the hydrogen atom.  In this particular case, there is free rotation about the CC bond so the oxygens are equivalent and we don't really think about acetic acid having tautomers, but see next case.

Formamide
SMILES: C(=O)N
InChI: InChI=1S/CH3NO/c2-1-3/h1H,(H2,2,3)

Here SMILES and InChI start to differ a bit in the chemistry.  InChI recognises that formamide can exist in two tautomeric states, HC(=O)NH2 and HC(OH)=NH, while SMILES only specifies the dominant one, HC(=O)NH2, by default.

Both SMILES and InChI can be used to specify particular tautomers.  For SMILES the other tautomer is "C(OH)=N"while for InChI it is InChI=1/CH3NO/c2-1-3/h1H,(H2,2,3)/f/h2H2 and InChI=1/CH3NO/c2-1-3/h1H,(H2,2,3)/f/h2,3H.  Notice that the S is missing in InChI=1 because the use of a fixed H layer (indicated by /f) is a "non-standard" InChI.

cis- and trans-2-butene
SMILES: C/C=C\C and C/C=C/C
InCHhI: InChI=1S/C4H8/c1-3-4-2/h3-4H,1-2H3/b4-3- and InChI=1S/C4H8/c1-3-4-2/h3-4H,1-2H3/b4-3+

L- and D-alanine
SMILES: C[C@@H](C(=O)O)N and C[C@H](C(=O)O)N
InChI: InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1 and InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1

Building molecules using SMILES and InChI
One use of SMILES and InChI is to quickly and automatically build molecules. 

Furthermore, it is not too difficult to write scripts that generate SMILES strings based on some template.  For example, you can define a template c1ccc(c(c1)X)Y and substituents X, Y = O, F, and N and generate all possible combinations using a bit of python.  The resulting list of SMILES strings can then be converted to coordinates using Cactus.  Imagine the tedium of building all these molecules in a builder.

Learning more about SMILES and InChI
Theres a lot more to SMILES and InChI than covered in this post and plenty more info online (e.g. here for SMILES and InChI).  However, personally I learned most from generating a bunch of examples using this cactus site, which can convert chemical names to SMILES or InChI as well as SMILES and InChI to a 3D structure (pick TwirlyMol in the menu).

Related posts
A simple script to get molecular coordinates from a chemical name using Open Babel



This work is licensed under a Creative Commons Attribution 4.0

Saturday, April 16, 2016

A simple script to get molecular coordinates from a chemical name using Open Babel


Here is a simple cshell script to get coordinates from a chemical name using Open Babel.

Acknowledgements (Jimmy's "friends" please read carefully): I learned about curl from Jimmy who didn't contribute to the code in any way (and hates cshell anyway).


This work is licensed under a Creative Commons Attribution 4.0

Tuesday, March 1, 2016

Computational Chemistry Highlights: February issue

The February issue of Computational Chemistry Highlights is out.

CCH is an overlay journal that identifies the most important papers in computational and theoretical chemistry published in the last 1-2 years. CCH is not affiliated with any publisher: it is a free resource run by scientists for scientists. You can read more about it here.

Table of content for this issue features contributions from CCH editors Steven Bachrach, Alán Aspuru-Guzik, and Jan Jensen:

From C60 to Infinity: Large-Scale Quantum Chemistry Calculations of the Heats of Formation of Higher Fullerenes




Saturday, February 20, 2016

Making your computational chemistry data available as supplementary material

It is fairly normal (and good practise!) to make atomic coordinates available as supplementary material when submitting a paper.  From what I can tell this is typically done by copying the coordinates into a text editor and making a pdf file.  This can be a tedious, time consuming, and error prone process and the irony is that the intended user will have to reverse the process to actually use the data. There is a better way.

Use an online digital repository
tl;dr: upload your files to an online digital repository, such as Figshare, and simply provide the link to the data in the supplementary material.

To use Figshare you simply make an account, click "Create a new project", and choose "File set". Then you can upload the files you want to share and describe what you are sharing.  Once you have everything the way you want it, you can make it public (you even get a DOI). Here I describe what I did for my latest paper (supplementary material on FigShare here)

Sharing coordinates
Most people will only be interested in the coordinates so I decided to make them available in xyz format, since this format can be read by almost any molecular visualization program.  So I copied the output files that contained the coordinates I wanted to share into a single folder and used OpenBabel to extract the coordinates and convert them into xyz format.

For this I used a bash command.  If you use another shell you can temporarily switch to bash by typing /bin/bash (to get out of bash later type exit).  First make a new folder called "coordinates". Then convert Gaussian log files to xyz files and place them in the new folder (all one line)

for i in *.out; do babel -ig09 "$i" -oxyz coordinates/"${i/.out}".xyz; done

If you want you can add "opt" to the xyz files to indicate that they are optimized by (all one line)

for i in *.out; do babel -ig09 "$i" -oxyz coordinates/"${i/.out}"opt.xyz; done

When doing the calculations I used a different naming scheme for the systems so I added a text file called README to the folder where I describe the change. Then I created a zip file of the folder

zip -r coordinates_for_supmat.zip coordinates

which I then uploaded to FigShare and made public to get the DOI.

Sharing everything else
One or two people may be interested in more than coordinates.  So I also make a zip file of the entire project folder and upload that to the same FigShare fileset.  Actually, calculations were done by three different people so I made zip files for each of their project folders separately.  This only takes a few minutes not counting compression and upload time.  These folders also contain calculations that did not make it into the manuscript but I am not willing to spend much time weeding them out because it is quite likely that no one will ever look at it.  Anyway, if there are questions they can always ask me. I also used a Google sheet for the data analysis (don't judge me!) so I included a link to that on FigShare.  If you used Excel you could just upload that file.

Sharing Gaussian output
Some of the calculations were done with Gaussian09, so I checked with Gaussian about whether they allow sharing output files.  They wrote that "yes, you may include the relevant parts of the output in the manuscript. We just ask that you not include any timing information".  I had a look at the output files and concluded that if I delete lines with the word "cpu" I should be OK.

You can do this for one file (here called "file.out") by

grep -v "cpu" file.out > temp && mv temp file.out

and all the files in a project (here located in folder "project") using bash (all one line)

for i in project/*.out project/**/*.out project/**/**/*.out; do grep -v "cpu" "$i" > temp && mv temp  "$i"; done

This  command goes three folders deep.  If you have more layers add "project/**/.../**/*.out" as needed.  There are also many other ways of going this.  Just to be safe I also deleted lines containing "terminated" because they contain a time and date stamp.



This work is licensed under a Creative Commons Attribution 4.0

Monday, February 8, 2016

DFTB3 in GAMESS

I recently did my first DFTB3 calculations in GAMESS.  The process wasn't completely obvious, hence this post.

First I got the parameter files from dftb.org.  You have to fill in and email a registration form to get a login so that might take a few days.  I downloaded the 3ob-3-1 parameter set.

The DFTB3-related input can be found below.  My molecule contains C, N, O, and H atoms.  If there are additional atoms then you need to define additional pairs.  It is OK to specify atom pairs for atoms that are not in the molecule, so the easiest might be to define all relevant pairs once and for all and use it in all input files.

The value for dampex and hubder are found in the README file that is located in the 3ob-3-1 folder I downloaded.  hubder defines parameters for N, C, O, and H - in that order.  It was not clear from the manual what the order should be.  I found the correct order by running the input file without hubder. This run prints out the default hubder values in the order  N, C, O, and H and it turns out that is the order you need to use in the input file.

2016.02.09 Update: Yoshio Nishimoto has told me that the order of the hubder parameters is determined by the order in which they appear in $DATA.  Also using MODIO=31 in $SYSTEM will speed up the calculation.

 $basis gbasis=dftb $END
 $dftb ndftb=3 DAMPXH=.t. dampex=4.0
       hubder(1)=-0.1535,-0.1492,-0.1575,-0.1857 $end
 $dftbsk
 C C "/Applications/gamess/3ob-3-1/C-C.skf"
 C H "/Applications/gamess/3ob-3-1/C-H.skf"
 C N "/Applications/gamess/3ob-3-1/C-N.skf"
 C O "/Applications/gamess/3ob-3-1/C-O.skf"
 H C "/Applications/gamess/3ob-3-1/H-C.skf"
 H H "/Applications/gamess/3ob-3-1/H-H.skf"
 H N "/Applications/gamess/3ob-3-1/H-N.skf"
 H O "/Applications/gamess/3ob-3-1/H-O.skf"
 N C "/Applications/gamess/3ob-3-1/N-C.skf"
 N H "/Applications/gamess/3ob-3-1/N-H.skf"
 N N "/Applications/gamess/3ob-3-1/N-N.skf"
 N O "/Applications/gamess/3ob-3-1/N-O.skf"
 O C "/Applications/gamess/3ob-3-1/O-C.skf"
 O H "/Applications/gamess/3ob-3-1/O-H.skf"
 O N "/Applications/gamess/3ob-3-1/O-N.skf"
 O O "/Applications/gamess/3ob-3-1/O-O.skf"
 $end

I would like to thank +Anders Steen Christensen for his help in figuring this out.


This work is licensed under a Creative Commons Attribution 4.0