[Python] Iterators vs Generators

In Python, there are iterators and generators. You probably already use iterators without knowing it. Understanding the difference between the two matters because choosing one over the other can have a large impact on memory usage. With small datasets, memory usage might not be your first concern; with big datasets, it is another story. So what exactly are iterators and generators?

Iterators

The process of going through a list, element by element, is called iteration:

 
>>> lst = [1,2,3]
>>> for i in lst:
...  print(i*i)

1
4
9

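When you use a list comprehension, you create a list. A list is an iterable: the for loop obtains an iterator from it, and because every element is stored in memory, you can loop over it as many times as you like.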

 
>>> lst = [x*x for x in [1,2,3]]  # [1, 4, 9]
>>> print(lst)

[1, 4, 9]
 
>>> for i in lst:
...  print(i)

1
4
9
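
Strictly speaking, a list is an iterable rather than an iterator: each for loop asks it for a fresh iterator. Here is a minimal sketch of roughly what the loop does behind the scenes, using only the built-ins iter() and next(). The variable names are just for illustration.

```python
lst = [1, 2, 3]

it = iter(lst)      # ask the list for an iterator
first = next(it)    # 1
second = next(it)   # 2
third = next(it)    # 3

# One more call to next() raises StopIteration, which is how
# a for loop knows it has reached the end.
try:
    next(it)
    exhausted = False
except StopIteration:
    exhausted = True

# The list itself is untouched, so a new iterator starts over.
again = list(iter(lst))
print(first, second, third, exhausted, again)
```

The iterator it is used up after three values, but the list still holds its data, which is why you can loop over a list again and again.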

Generators

By replacing [] with (), you create a generator expression rather than a list. With a generator, instead of storing the data in the variable lst (and therefore in memory), you generate each value on the spot (i.e. only when you need it).
Careful: because the values are generated on the spot, they can be read only once. If you try to read them a second time, no error is raised to warn you; you simply get nothing.

 
>>> lst = (x*x for x in [1,2,3])  # [1, 4, 9]
>>> print(lst)

<generator object <genexpr> at 0x1933640>

Here, “genexpr” means generator expression (and not gene expression as biologists could assume).

 
>>> for i in lst:
...  print(i)

1
4
9
 
>>> for i in lst:
...  print(i)


# Nothing is displayed

You can easily see that even though we tried to print the contents of lst twice, they are printed only once.
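
If you do need the values more than once, there are two simple ways around this. The sketch below shows both; the helper function squares is a hypothetical name introduced only for this example.

```python
def squares(values):
    """Return a fresh generator each time it is called."""
    return (x * x for x in values)

gen = squares([1, 2, 3])
first_pass = list(gen)    # [1, 4, 9]
second_pass = list(gen)   # [] because the generator is exhausted

# Option 1: keep the values in a list (costs memory).
stored = list(squares([1, 2, 3]))

# Option 2: build a new generator for each pass (costs time).
total = sum(squares([1, 2, 3]))
largest = max(squares([1, 2, 3]))
print(first_pass, second_pass, stored, total, largest)
```

Which option is better depends on whether memory or computing time is your bottleneck.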

Tip: Once you are familiar with generators, it is interesting to nest them (for performance or code readability, for example). But be aware that reading the second generator lst2 also consumes the first one, lst1!

 
>>> lst1 = (x*x for x in [1,2,3])   # [1, 4, 9]
>>> lst2 = (x+x for x in lst1)  # [2, 8, 18]
>>> for i in lst2:
...  print(i)

2
8
18
 
>>> for i in lst1:
...  print(i)


# Nothing is displayed
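
To see why this happens, it helps to pull values one at a time. A small sketch, reusing the names lst1 and lst2 from the example above: lst2 holds no data of its own, it just asks lst1 for the next value whenever it is asked for one.

```python
lst1 = (x * x for x in [1, 2, 3])
lst2 = (x + x for x in lst1)

first = next(lst2)      # pulls 1 from lst1 and yields 1 + 1
leftover = list(lst1)   # lst1 only has 4 and 9 left
rest = list(lst2)       # lst1 is now empty, so lst2 is empty too
print(first, leftover, rest)
```

In other words, the two generators share a single stream of values: whatever one of them reads is gone for the other.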

Advantages/Disadvantages of generators

If you need to access your data only once, using generators lets you decrease your memory usage (because the data are generated on the spot), and your program will often run faster as well since no intermediate list has to be built.

Now, if you need to access your data several times, you still decrease your memory usage, but your program will be slower since the generator has to be re-created and the data regenerated each time. Therefore, it is generally not good practice to use a generator in this last scenario.
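
For a quick feel of the memory difference without any third-party package, here is a rough sketch based on sys.getsizeof. Keep in mind that it only reports the size of the container object itself (for the list, the array of references, not the integers it points to), so treat the numbers as an indication rather than a precise measurement. The value of n is arbitrary.

```python
import sys

n = 1000000
as_list = [x * x for x in range(n)]
as_gen = (x * x for x in range(n))

list_bytes = sys.getsizeof(as_list)  # grows with n
gen_bytes = sys.getsizeof(as_gen)    # small and independent of n

print("list:", list_bytes, "bytes")
print("generator:", gen_bytes, "bytes")
```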

Finally, here is a quick example to give you an idea of how generators perform when they are used properly.

 
import os
import gc
import psutil

num = 10000000
rep = 500

def mem_usage_in_MB(proc):
  # Resident set size (RSS) of the process, converted from bytes to MB
  return proc.memory_info().rss / float(2 ** 20)

proc = psutil.Process(os.getpid())
mem0 = mem_usage_in_MB(proc)
toto = (x*x for x in range(num))
tata = (x+x for x in toto)
tutu = (x-1 for x in tata)
print("mem generator: " + str(mem_usage_in_MB(proc) - mem0) + "MB")
mem0 = mem_usage_in_MB(proc)
toto = [x*x for x in range(num)]
toto = [x+x for x in toto]
toto = [x-1 for x in toto]
print("mem iterator: " + str(mem_usage_in_MB(proc) - mem0) + "MB")


import timeit

# One pass over the data: generator expression vs list comprehension
def test(t, num):
  toto = (x*x for x in range(num)) if t == "gen" else [x*x for x in range(num)]
  sum(toto)

# Three chained steps: generators stay lazy, lists build an intermediate copy at each step
def test2(t, num):
  toto = (x*x for x in range(num)) if t == "gen" else [x*x for x in range(num)]
  toto = (x+x for x in toto) if t == "gen" else [x+x for x in toto]
  toto = (x-1 for x in toto) if t == "gen" else [x-1 for x in toto]
  sum(toto)

print("test time generator:" + str(timeit.timeit("test(\"gen\"," + str(num) + ")", setup="from __main__ import test", number=rep)))
print("test time iterator:" + str(timeit.timeit("test(\"iter\"," + str(num) + ")", setup="from __main__ import test", number=rep)))
print("test2 time generator:" + str(timeit.timeit("test2(\"gen\"," + str(num) + ")", setup="from __main__ import test2", number=rep)))
print("test2 time iterator:" + str(timeit.timeit("test2(\"iter\"," + str(num) + ")", setup="from __main__ import test2", number=rep)))
# with python3

mem generator: 0.00390625MB
mem iterator: 387.8984375MB
test time generator:730.7246094942093
test time iterator:765.0462868176401
test2 time generator:727.7452643960714
test2 time iterator:768.4699434302747

# with python2

mem generator: 310.72265625MB
mem iterator: 545.578125MB
test time generator:801.186733007
test time iterator:757.989295006
test2 time generator:810.537645102
test2 time iterator:939.240092993

 

Footnote: Of course, this post is only a brief introduction to generators, seen through the lens of list comprehensions (often used by Python programmers). To learn more about them, you can check this presentation and also read up on the keyword yield.
Footnote 2: With Python 2, you need to replace “range” with “xrange” to get the best performance, but your code will then no longer be Python 3 compatible.
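
As a small taste of the yield keyword mentioned in the first footnote, here is a minimal sketch of a generator function. The name squares and its argument are only an illustration; the behaviour is the same as the generator expressions used throughout this post.

```python
def squares(n):
    """Yield the squares of 0..n-1 one at a time."""
    for x in range(n):
        # Execution pauses here until the next value is requested.
        yield x * x

gen = squares(4)
values = list(gen)   # [0, 1, 4, 9]
again = list(gen)    # [] since it is exhausted like any generator
print(values, again)
```

A generator function is handy when the logic is too long to fit comfortably in a single generator expression.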

September 18, 2015 | Categories: Bioinformatics, Performance, Python

About the Author:

I started as a computer scientist, then quickly realised that bioinformatics is full of puzzles to solve. As in "The Summit of the Gods" (Jirō Taniguchi), there is always a new mountain to climb or a more straightforward path to find.
