Thursday, May 5, 2016

A brief introduction to SMILES and InChI



SMILES and InChI are two notations that that can be used to compactly describe molecules.  As molecules get complicated SMILES and InChI are are great alternative to a chemical names but there are many other uses (see the end of this post for an example).  Here is a brief intro to a few of the basics.

Ethane
SMILES: CC
InChI: InChI=1S/C2H6/c1-2/h1-2H3

SMILES is pretty straight forward: ethane is two C atoms singly bonded to each other.  Two atoms next to each other imply a single bond and programs that read SMILES strings are smart enough to know that each C atom needs 3 H atoms to fill the valence shell.

InChI requires a little more explanation: "1S" is version 1 of Standard InChi (more about that later). C2H6 is the empirical formula and tells you that the first two atoms are C atoms. c1-2 means that atom 1 and 2 are connected. h1-2H3 means that atom 1-2 each have 3 H atoms. Programs that read SMILES strings are smart enough to know that this means that the two C atoms are connected by a single bond.

So, SMILES defines the bond types from which one can infer protonation, while InChI defines protonation from which one can infer the bond types (key differences indicated in bold):

Ethene and ethyne
SMILES: C=C and C#C
InChI: InChI=1S/C2H4/c1-2/h1-2H2 and InChI=1S/C2H2/c1-2/h1-2H

SMILES: double and triple bonds are indicated by "=" and "#", respectively.
InChI: "CH2" and "CH" are indicated by "H2" and "H", respectively.

Ethanol and dimethyl ether
SMILES: CCO and COC
InChI: InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3 and  InChI=1S/C2H6O/c1-3-2/h1-2H3

SMILES: pretty self explanatory I think (if not leave a comment).
InChI: "C2H6O" means that the first and second atoms (1 and 2) are C atoms and the third (3) is an O atom.  The connectivity is 1-2-3 for ethanol and 1-3-2 for dimethyl ether.  For ethanol atom 3 has 1 H atom, atom 2 has 2 H atoms, and atom 1 has 3 H atoms.  For dimethyl ether atom 1-2 have 3 H atoms, while atom 3 has none.

Note that InChI keeps the atom ordering the same in the two molecules while SMILES changes it, i.e. the oxygen is atom number 3 for both ethanol and dimethyl ether defined by InChI, while it is atom number 3 and 2, respectively, when defined by SMILES.

Benzene and toluene
SMILES: c1ccccc1 and Cc1ccccc1
InChI: InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H and InChI=1S/C7H8/c1-7-5-3-2-4-6-7/h2-6H,1H3

SMILES: in the case of benzene atom 1 is connected to both atom 2 and atom 6, i.e.  a ring is formed. "1" is a label and does not refer to atom number 1 (see toluene). A lower case "c" is used to indicate aromatic carbons, meaning they should be singly protonated.  For toluene, the methyl group is bonded to atom number 2, which is also bonded to atom number 7.

InChI: In the case of benzene aromaticity is inferred from the fact that all 6 carbon have 1 H atom (h1-6H).  For toluene, the methyl group is bonded to atom number 7, which is also bonded to atom number 6.

Ethylamine and ethylammonium
SMILES: CCN and CC[NH3+]
InChI: InChI=1S/C2H7N/c1-2-3/h2-3H2,1H3 and InChI=1S/C2H7N/c1-2-3/h2-3H2,1H3/p+1

SMILES: pretty self explanatory I think (if not leave a comment).
InChI: The protonation (h2-3H2,1H3) is identical in both cases an corresponds to ethylamine. For ethylammonium "p+1" indicates that an extra proton is added, but it doesn't specify where (more on that later).

Acetic acid and acetate
SMILES: CC(=O)O and CC(=O)[O-]
InChI: InChI=1S/C2H4O2/c1-2(3)4/h1H3,(H,3,4) and InChI=1S/C2H4O2/c1-2(3)4/h1H3,(H,3,4)/p-1

SMILES: pretty self explanatory I think (if not leave a comment).
InChI: In the case of acetic acid InChi recognises that there are two "tautomers", i.e. that both oxygens can have the hydrogen atom.  In this particular case, there is free rotation about the CC bond so the oxygens are equivalent and we don't really think about acetic acid having tautomers, but see next case.

Formamide
SMILES: C(=O)N
InChI: InChI=1S/CH3NO/c2-1-3/h1H,(H2,2,3)

Here SMILES and InChI start to differ a bit in the chemistry.  InChI recognises that formamide can exist in two tautomeric states, HC(=O)NH2 and HC(OH)=NH, while SMILES only specifies the dominant one, HC(=O)NH2, by default.

Both SMILES and InChI can be used to specify particular tautomers.  For SMILES the other tautomer is "C(OH)=N"while for InChI it is InChI=1/CH3NO/c2-1-3/h1H,(H2,2,3)/f/h2H2 and InChI=1/CH3NO/c2-1-3/h1H,(H2,2,3)/f/h2,3H.  Notice that the S is missing in InChI=1 because the use of a fixed H layer (indicated by /f) is a "non-standard" InChI.

cis- and trans-2-butene
SMILES: C/C=C\C and C/C=C/C
InCHhI: InChI=1S/C4H8/c1-3-4-2/h3-4H,1-2H3/b4-3- and InChI=1S/C4H8/c1-3-4-2/h3-4H,1-2H3/b4-3+

L- and D-alanine
SMILES: C[C@@H](C(=O)O)N and C[C@H](C(=O)O)N
InChI: InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1 and InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1

Building molecules using SMILES and InChI
One use of SMILES and InChI is to quickly and automatically build molecules. 

Furthermore, it is not too difficult to write scripts that generate SMILES strings based on some template.  For example, you can define a template c1ccc(c(c1)X)Y and substituents X, Y = O, F, and N and generate all possible combinations using a bit of python.  The resulting list of SMILES strings can then be converted to coordinates using Cactus.  Imagine the tedium of building all these molecules in a builder.

Learning more about SMILES and InChI
Theres a lot more to SMILES and InChI than covered in this post and plenty more info online (e.g. here for SMILES and InChI).  However, personally I learned most from generating a bunch of examples using this cactus site, which can convert chemical names to SMILES or InChI as well as SMILES and InChI to a 3D structure (pick TwirlyMol in the menu).

Related posts
A simple script to get molecular coordinates from a chemical name using Open Babel
Automating calculations: pKa predictions


This work is licensed under a Creative Commons Attribution 4.0

No comments: