Recently, I had to search a given chemical structure into a list of structures. Using the python chemoinformatics packages pybel and rdkit, I was easily able to do so but the operation took a little too much time for my linking. Wondering how I could search faster, I immediately thought about Jean-Philippe’s previous blog post titled Put Those CPUs to Good Use. I’ve decided to follow his instructions and give it a try.

Goal

Look for a molecule (a given chemical structure) in a list of molecules.

Sequential implementation

Note that I could have used only pybel or only rdkit. That being said, I am facing a specific constraint in my current project: the molecules in my list are pybel molecule objects while the molecule in my query is a rdkit molecule object. Without this constraint, I could have written the function using only pybel or only rdkit.


def exactSearch (q, start, end) :
''' Search a given structure among a list of molecules'''
''' Returns the number of matches'''

matches = []
for mpybel in molecules[start, end]:  ## pybel molecules
fp1 = mpybel.calcfp()  ## pybel
smi2 = Chem.MolToSmiles(q, isomericSmiles=True)  # rdkit molecule
mpybel2 = pybel.readstring("smi", smi2 ) # convert rdkit molecule in pybel molecule

if mpybel.formula == mpybel2.formula :
fp2 = mpybel2.calcfp()
tanimoto = fp1 | fp2

if tanimoto == 1:
matches.append(mpybel)
return len(matches)

n_matches = exactSearch (query, 0, len(molecules))


Running this function, it took 29.8 seconds to find the three occurrences of my query among the 25791 molecules of the list. I am patient but to some extent: waiting 30 seconds every time I need to find a molecule is too long.

Multiprocessing implementation

To run on more than one CPU using the multiprocessing python package, I only need to add two modifications to my previous code: (1) I must define a wrapper function and (2) the tasks that are going to be run in parallel.


import multiprocessing
def exactSearch_wrapper (args) :
return exactSearch(*args)

def exactSearch (q, start, end) :
... # same function as above

p = multiprocessing.Pool (processes=4)
tasks = [(query, i, i+1000]) for i in range(0, len(molecules), 1000)]
p.close()
p.join()


With this code, I am still capable of identifying the three occurrences of my query but it now takes 7.8 seconds since I have divided my list of molecules into smaller lists of 1000 molecules, and I am now using 4 CPUs. This list division was possible because each comparison of the query to a molecule of the list is an independent operation.

Final remarks

Even though there are other options to speed up my search, my intents here were to present a (hopefully) simple example of the multiprocessing package. You can see that it is quite easy to alter existing code to take advantage of the available CPUs in order to reduce waiting time.

Another option would be the usage of a code profiler to identify the operations taking a lot of time and see if those can be modified to get an optimized code.

Here is the output of the profiler for the exactSearch function :

Timer unit: 1e-06 s

Total time: 28.8646 s
File:
Function: exactSearch at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
1                                           def exactSearch0 (q, mols) :
2
3         1            1      1.0      0.0      matches = []
4
5     25792        16267      0.6      0.1      for mpybel in mols:  ## pybel molecules
6     25791     13097301    507.8     45.4          fp1 = mpybel.calcfp()  ## pybel
7
8
9     25791      1043002     40.4      3.6          smi2 = Chem.MolToSmiles(q, isomericSmiles=True)
10     25791     14247872    552.4     49.4          mpybel2 = pybel.readstring("smi", smi2 )
11
12     25791       459966     17.8      1.6          if mpybel.formula == mpybel2.formula :
13
14
15         3          151     50.3      0.0              fp2 = mpybel2.calcfp()
16         3           13      4.3      0.0              tanimoto = fp1 | fp2
17
18         3            3      1.0      0.0              if tanimoto == 1:
19         3            2      0.7      0.0                  matches.append(mpybel)
20
21         1           67     67.0      0.0      print len(matches)
22         1            1      1.0      0.0      return matches



The operations that are the most time consuming are the ones generating the fingerprint fp1 for all the molecules and the conversion of the query. Both are done 25791 times and take about 94% of the 30 seconds. The conversion operation should be outside of the loop as it needs to be done only once. As for the generation of the fingerprint, I only need to compute the fingerprint if the formulas of the compared molecules are the same. Otherwise, I already know that they are different molecules. Hence, I do not need to compute the fingerprint for all the 25791 molecules. Optimizing the code following an inspection with the code profiler reduces the search time from 29.8 seconds to 0.137 seconds. In this case, there is no gain in using more CPUs: it might even take longer!

0.137 seconds, I can live with that!!

Here is the optimized search function :


def exactSearch (q, start, end) :
matches = []
smi2 = Chem.MolToSmiles(q, isomericSmiles=True)