Recently, I had to search for a given chemical structure in a list of structures. Using the Python chemoinformatics packages pybel and rdkit, I was easily able to do so, but the operation took a little too much time for my liking. Wondering how I could search faster, I immediately thought of Jean-Philippe’s previous blog post titled Put Those CPUs to Good Use. I decided to follow his instructions and give it a try.
Look for a molecule (a given chemical structure) in a list of molecules.
Note that I could have used only pybel or only rdkit. However, my current project imposes a specific constraint: the molecules in my list are pybel molecule objects, while my query molecule is an rdkit molecule object.
```python
def exactSearch(q, start, end):
    '''Search a given structure among a list of molecules.
    Returns the number of matches.'''
    matches = []
    for mpybel in molecules[start:end]:  # pybel molecules
        fp1 = mpybel.calcfp()  # pybel fingerprint
        smi2 = Chem.MolToSmiles(q, isomericSmiles=True)  # rdkit molecule -> SMILES
        mpybel2 = pybel.readstring("smi", smi2)  # convert the rdkit molecule to a pybel molecule
        if mpybel.formula == mpybel2.formula:
            fp2 = mpybel2.calcfp()
            tanimoto = fp1 | fp2
            if tanimoto == 1:
                matches.append(mpybel)
    return len(matches)

n_matches = exactSearch(query, 0, len(molecules))
```
Running this function, it took 29.8 seconds to find the three occurrences of my query among the 25791 molecules of the list. I am patient, but only to a point: waiting 30 seconds every time I need to find a molecule is too long.
To run on more than one CPU using the multiprocessing Python package, I only need to make two modifications to my previous code: (1) define a wrapper function and (2) define the tasks that are going to be run in parallel.
```python
import multiprocessing

def exactSearch_wrapper(args):
    return exactSearch(*args)

def exactSearch(q, start, end):
    ...  # same function as above

p = multiprocessing.Pool(processes=4)
tasks = [(query, i, i + 1000) for i in range(0, len(molecules), 1000)]
searches = p.map_async(exactSearch_wrapper, tasks)
p.close()
p.join()
```
With this code, I still identify the three occurrences of my query, but it now takes 7.8 seconds: I have divided my list of molecules into smaller lists of 1000 molecules, and I am now using 4 CPUs. This division was possible because each comparison of the query to a molecule of the list is an independent operation.
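The wrapper-plus-tasks pattern is independent of the chemistry. Here is a self-contained sketch of the same chunking idea, where `data` and `countMatches` are stand-ins for the molecule list and `exactSearch` (the chunk size of 1000 and the 4 processes mirror the numbers above):

```python
import multiprocessing

# Stand-in corpus: the real code iterates over pybel molecules instead.
data = list(range(100_000))

def countMatches(q, start, end):
    """Count occurrences of q in data[start:end] (one chunk of the search)."""
    return sum(1 for x in data[start:end] if x == q)

def countMatches_wrapper(args):
    # Pool.map_async passes a single argument, so unpack the task tuple here.
    return countMatches(*args)

def parallel_count(q, chunk=1000, processes=4):
    pool = multiprocessing.Pool(processes=processes)
    tasks = [(q, i, i + chunk) for i in range(0, len(data), chunk)]
    result = pool.map_async(countMatches_wrapper, tasks)
    pool.close()
    pool.join()
    # map_async returns an AsyncResult; get() yields the per-chunk counts.
    return sum(result.get())
```

Note that map_async returns an AsyncResult object: the per-chunk match counts only become available through its get() method, which is also where the code above sums the results of the parallel searches.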
Even though there are other options to speed up my search, my intent here was to present a (hopefully) simple example of the multiprocessing package. You can see that it is quite easy to alter existing code to take advantage of the available CPUs and reduce waiting time.
Another option is to use a code profiler to identify the operations that take the most time and see whether those can be modified to optimize the code.
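The per-line table below is in the style produced by the line_profiler package; for a coarser, per-function view, the standard library's cProfile module works with no extra dependency. A minimal sketch (slow_square_sum is just a toy function to profile):

```python
import cProfile
import io
import pstats

def slow_square_sum(n):
    """Toy workload: sum of squares below n."""
    return sum(i * i for i in range(n))

profiler = cProfile.Profile()
profiler.enable()
total = slow_square_sum(100_000)
profiler.disable()

# Render the statistics to a string, sorted by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
```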
Here is the output of the profiler for the exactSearch function:
```
Timer unit: 1e-06 s

Total time: 28.8646 s
File:
Function: exactSearch at line 1

Line #      Hits       Time  Per Hit  % Time  Line Contents
==============================================================
     1                                        def exactSearch0 (q, mols) :
     2
     3           1          1      1.0    0.0      matches = []
     4
     5       25792      16267      0.6    0.1      for mpybel in mols: ## pybel molecules
     6       25791   13097301    507.8   45.4          fp1 = mpybel.calcfp() ## pybel
     7
     8
     9       25791    1043002     40.4    3.6          smi2 = Chem.MolToSmiles(q, isomericSmiles=True)
    10       25791   14247872    552.4   49.4          mpybel2 = pybel.readstring("smi", smi2 )
    11
    12       25791     459966     17.8    1.6          if mpybel.formula == mpybel2.formula :
    13
    14
    15           3        151     50.3    0.0              fp2 = mpybel2.calcfp()
    16           3         13      4.3    0.0              tanimoto = fp1 | fp2
    17
    18           3          3      1.0    0.0              if tanimoto == 1:
    19           3          2      0.7    0.0                  matches.append(mpybel)
    20
    21           1         67     67.0    0.0      print len(matches)
    22           1          1      1.0    0.0      return matches
```
The most time-consuming operations are the generation of the fingerprint fp1 for every molecule and the conversion of the query. Both are done 25791 times and together take about 94% of the 30 seconds. The conversion should be moved outside of the loop, as it only needs to be done once. As for the fingerprint, I only need to compute it when the formulas of the compared molecules are the same; otherwise, I already know they are different molecules, so there is no need to compute the fingerprint for all 25791 molecules. Optimizing the code after this inspection with the profiler reduces the search time from 29.8 seconds to 0.137 seconds. In this case, there is no gain in using more CPUs: it might even take longer!
0.137 seconds, I can live with that!!
Here is the optimized search function:
```python
def exactSearch(q, start, end):
    matches = []
    smi2 = Chem.MolToSmiles(q, isomericSmiles=True)  # done once, outside the loop
    mpybel2 = pybel.readstring("smi", smi2)
    for mpybel in molecules[start:end]:
        if mpybel.formula == mpybel2.formula:  # cheap filter first
            fp1 = mpybel.calcfp()  # pybel fingerprint
            fp2 = mpybel2.calcfp()
            tanimoto = fp1 | fp2
            if tanimoto == 1:
                matches.append(mpybel)
    return len(matches)
```
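Stripped of the chemistry, the optimized function combines two general ideas: hoist loop-invariant work out of the loop, and guard an expensive comparison with a cheap filter. A toy sketch, where cheap_key and expensive_equal are hypothetical stand-ins for the formula check and the fingerprint/Tanimoto comparison:

```python
def cheap_key(s):
    """Fast, order-insensitive summary, analogous to the molecular formula."""
    return sorted(s)

def expensive_equal(a, b):
    """Stand-in for the costly fingerprint + Tanimoto comparison."""
    return a == b

def exact_search(query, items):
    matches = 0
    q_key = cheap_key(query)           # hoisted: computed once, not per item
    for item in items:
        if cheap_key(item) != q_key:   # cheap filter rejects most candidates
            continue
        if expensive_equal(item, query):
            matches += 1
    return matches
```

The cheap filter only helps if it is much faster than the full comparison and rejects most candidates, which is exactly the case for formulas versus fingerprints above.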