We often rely on old habits. We get comfortable and tend to do things the same old way. The same thing happens when you’re programming. But a day will come when you’ll ask yourself: is this the fastest way to perform this task? When that happens (and if the task is in Python), you’ll be glad that a package like timeit exists. Sure, there are other ways to organize a timing contest in Python. With the time package, for example, you can start by setting t0 = time.time(), perform your task, and then print the elapsed time with print(time.time() - t0). I use this all the time.

But timeit makes it simple to test small chunks of code and is callable from the command line.
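
To make the comparison concrete, here is a minimal sketch of both approaches side by side. The task function and its sample strings are hypothetical placeholders, not part of the original test; any small chunk of code could take their place.

```python
import time
import timeit

def task():
    # Hypothetical workload: replace a character in a short list of strings
    names = ['a=b', 'c=d', 'e=f']
    return [x.replace('=', 'eq') for x in names]

# Manual approach: record a start time, run the task, take the difference
t0 = time.time()
for _ in range(10000):
    task()
manual_elapsed = time.time() - t0
print('manual : %.3f sec for 10000 iterations' % manual_elapsed)

# timeit approach: pass a zero-argument callable and an iteration count
timeit_elapsed = timeit.timeit(task, number=10000)
print('timeit : %.3f sec for 10000 iterations' % timeit_elapsed)
```

From the command line, a one-liner such as python -m timeit "'a=b'.replace('=', 'eq')" should do a similar job without writing a script.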

Following the example in this article from Xiaonuo Gantan on PythonCentral, I’ve been able to see that list comprehension is still the fastest way to replace characters in a list of strings.

Actually, I’m using a pandas dataframe with really messy column names. I want to replace all the weird characters in them because I need to convert my pandas dataframe to an R dataframe, and R doesn’t like weird symbols in column names. I’ve always used a list comprehension to do this, but I recently saw that pandas has a map function, and I wondered whether it would be faster. So, here’s my test:


import re
import timeit

# df is the pandas dataframe whose column names are being cleaned

def wrapper(func, *args, **kwargs):
    # Bind the arguments so timeit receives a zero-argument callable
    def wrapped():
        return func(*args, **kwargs)
    return wrapped

def f1(names):
    # Using a regular expression
    return [re.sub(r'=', 'eq', x, flags=re.IGNORECASE) for x in list(names)]

def f2(names):
    # Using the map function with a type check for strings
    return names.map(lambda x: x.replace('=', 'eq') if isinstance(x, str) else x)

def f3(names):
    # Using the map function without the check
    return names.map(lambda x: x.replace('=', 'eq'))

def f4(names):
    # Using a list comprehension
    return [x.replace('=', 'eq') for x in list(names)]

def f5(names):
    # Using a for loop
    c = []
    for e in names:
        c.append(e.replace('=', 'eq'))
    return c

fs = [f1, f2, f3, f4, f5]

for f in fs:
    wrapped = wrapper(f, df.columns)
    print('%s : %.3f sec for 10000 iterations' % (f.__name__, timeit.timeit(wrapped, number=10000)))

Here’s the output:

f1 : 3.559 sec for 10000 iterations
f2 : 0.927 sec for 10000 iterations
f3 : 0.607 sec for 10000 iterations
f4 : 0.358 sec for 10000 iterations
f5 : 0.478 sec for 10000 iterations

And the winner is… f4, the list comprehension!
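
A note on the wrapper helper used in the test: it is there because timeit.timeit expects a callable that takes no arguments. As a rough sketch of an alternative, functools.partial from the standard library can bind the argument instead, and timeit.repeat can run the measurement several times to smooth out noise. The replace_eq function and sample_names list below are hypothetical stand-ins for f4 and df.columns, not the original data.

```python
import functools
import timeit

def replace_eq(names):
    # Same operation as f4: list comprehension with str.replace
    return [x.replace('=', 'eq') for x in names]

# Hypothetical stand-in for df.columns
sample_names = ['a=b', 'c=d', 'e=f']

# functools.partial binds the argument up front, giving a zero-argument callable
timed = functools.partial(replace_eq, sample_names)

# timeit.repeat performs the whole measurement several times;
# the minimum is usually the least noisy figure
runs = timeit.repeat(timed, number=10000, repeat=5)
best = min(runs)
print('replace_eq : best of %d runs, %.3f sec for 10000 iterations' % (len(runs), best))
```

Taking the minimum rather than the mean is a common choice here, since slower runs mostly reflect other activity on the machine rather than the code itself.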