Python has iterators and generators, and you probably already use iterators without realizing it. Understanding the difference between the two matters because choosing one over the other has a large impact on memory usage. With small datasets, memory usage might not be your first concern; with big datasets, it is another story. So what exactly are iterators and generators?
Iterators
The process of going through a list, element by element, is called iteration:
>>> lst = [1,2,3]
>>> for i in lst:
...     print(i*i)
1
4
9
When you use a list comprehension, you create a list: an iterable object whose elements are all stored in memory and which you can iterate over as many times as you like.
>>> lst = [x*x for x in [1,2,3]] # [1, 4, 9]
>>> print(lst)
[1, 4, 9]
>>> for i in lst:
...     print(i)
1
4
9
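Strictly speaking, a list is an iterable, and the for loop asks it for an iterator behind the scenes. Here is a minimal sketch of roughly what the loop does, using only the built-ins iter() and next(); the variable names are illustrative.

```python
lst = [1, 2, 3]

# A for loop first asks the list for an iterator...
it = iter(lst)

# ...then calls next() on it until StopIteration is raised.
squares = []
while True:
    try:
        i = next(it)
    except StopIteration:
        break
    squares.append(i * i)

print(squares)  # [1, 4, 9]

# The list itself is untouched, so a fresh iterator can walk it again.
again = [i * i for i in lst]
print(again)  # [1, 4, 9]
```

The point to remember is that the list keeps its data, so every new loop starts from the beginning.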
Generators
By replacing [] with (), you create a generator expression rather than a list. With a generator, instead of storing the data in the variable lst (and therefore in memory), you generate each value on the spot (i.e. only when you need it).
Careful: data generated on the spot cannot be read several times. If you try to do so, no error is raised to warn you.
>>> lst = (x*x for x in [1,2,3]) # [1, 4, 9]
>>> print(lst)
<generator object <genexpr> at 0x1933640>
Here, “genexpr” means generator expression (and not gene expression as biologists could assume).
>>> for i in lst:
...     print(i)
1
4
9
>>> for i in lst:
...     print(i)
# Nothing is displayed
You can see that even though we tried to print the contents of lst twice, they were printed only once.
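Because an exhausted generator fails silently, it can help to check for exhaustion explicitly. The sketch below uses only built-ins: list() to consume a generator, and next() with a default value to probe it. The names gen, first_pass, second_pass, sentinel and values are illustrative.

```python
gen = (x * x for x in [1, 2, 3])

first_pass = list(gen)    # consumes the generator
second_pass = list(gen)   # already exhausted: empty, but no error

print(first_pass)   # [1, 4, 9]
print(second_pass)  # []

# next() with a default returns the default instead of raising
# StopIteration. Note that on a generator that is NOT exhausted,
# this probe would consume one value.
sentinel = object()
is_exhausted = next(gen, sentinel) is sentinel
print(is_exhausted)  # True

# To read the values several times, store them once in a list.
values = list(x * x for x in [1, 2, 3])
print(values, values)
```

Storing the values in a list gives up the memory benefit, so it only makes sense when several passes are really needed.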
Tip: Once you are familiar with generators, it becomes interesting to nest them (for performance or code readability, for example). But be aware that reading the second generator lst2 will exhaust the first one, lst1!
>>> lst1 = (x*x for x in [1,2,3]) # [1, 4, 9]
>>> lst2 = (x+x for x in lst1) # [2, 8, 18]
>>> for i in lst2:
...     print(i)
2
8
18
>>> for i in lst1:
...     print(i)
# Nothing is displayed
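To see why reading lst2 drains lst1, it helps to trace when each stage actually runs. In this sketch, traced() and log are hypothetical helpers added only for illustration; they record the order in which the two nested generators do their work.

```python
log = []

def traced(label, value):
    # Hypothetical helper: records when each stage actually runs.
    log.append(label)
    return value

lst1 = (traced("square", x * x) for x in [1, 2, 3])
lst2 = (traced("double", x + x) for x in lst1)

# Nothing has run yet: generators are lazy.
assert log == []

result = list(lst2)
print(result)  # [2, 8, 18]

# Each element goes through both stages before the next one starts.
print(log)  # ['square', 'double', 'square', 'double', 'square', 'double']

# lst1 was drained while lst2 was being read.
leftover = list(lst1)
print(leftover)  # []
```

Each value is pulled through the whole chain one at a time, which is why no intermediate list is ever built.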
Advantages/Disadvantages of generators
If you need to access your data only once, generators decrease your memory usage (because the data are generated on the spot), and your program may also run faster.
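A rough way to see the memory difference without any third-party package is sys.getsizeof from the standard library. This is only a sketch: getsizeof measures the container object itself, not the integers it refers to, and n is an arbitrary size chosen for the example.

```python
import sys

n = 100000

as_list = [x * x for x in range(n)]
as_gen = (x * x for x in range(n))

# getsizeof reports the container only (for the list, its array of
# references, not the integers themselves), so the real gap is larger.
list_bytes = sys.getsizeof(as_list)
gen_bytes = sys.getsizeof(as_gen)

print("list:", list_bytes, "bytes")
print("generator:", gen_bytes, "bytes")

# Consumed once, both give the same total.
same_total = sum(as_list) == sum(as_gen)
print(same_total)  # True
```

The size of the generator object does not depend on n, whereas the list grows with it.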
Now, if you need to access your data several times, you still decrease your memory usage, but your program will be slower because the generator has to be recreated and the data regenerated on each pass. It is therefore generally not good practice to use generators in this scenario.
Finally, here is a quick example to give you an idea of how generators perform when they are used properly.
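The cost of several passes can be made visible by counting how often the underlying computation runs. In this sketch, square(), make_gen() and the calls counter are hypothetical names; square() stands in for any computation expensive enough to matter.

```python
calls = {"count": 0}

def square(x):
    # Hypothetical "expensive" computation; we only count how often it runs.
    calls["count"] += 1
    return x * x

def make_gen():
    # A generator cannot be rewound, so each pass needs a new one.
    return (square(x) for x in range(5))

# Two passes with generators: every value is computed twice.
total_gen = sum(make_gen()) + sum(make_gen())
gen_calls = calls["count"]

# Two passes over a list: values are computed once, then kept in memory.
calls["count"] = 0
data = [square(x) for x in range(5)]
total_list = sum(data) + sum(data)
list_calls = calls["count"]

print(total_gen, total_list)  # 60 60
print(gen_calls, list_calls)  # 10 5
```

Both approaches give the same result; the list pays in memory, the generators pay in repeated computation.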
import os
import gc
import timeit

import psutil

num = 10000000
rep = 500

def mem_usage_in_MB(proc):
    # Resident set size (RSS) of the process, converted from bytes to MB
    return proc.memory_info()[0] / float(2 ** 20)

proc = psutil.Process(os.getpid())

# Memory used by three chained generator expressions (nothing is computed yet)
mem0 = mem_usage_in_MB(proc)
toto = (x*x for x in range(num))
tata = (x+x for x in toto)
tutu = (x-1 for x in tata)
print("mem generator: " + str(mem_usage_in_MB(proc) - mem0) + "MB")

# Memory used by the equivalent list comprehensions
mem0 = mem_usage_in_MB(proc)
toto = [x*x for x in range(num)]
toto = [x+x for x in toto]
toto = [x-1 for x in toto]
print("mem iterator: " + str(mem_usage_in_MB(proc) - mem0) + "MB")

def test(t, num):
    # One transformation, consumed once by sum()
    toto = (x*x for x in range(num)) if t == "gen" else [x*x for x in range(num)]
    sum(toto)

def test2(t, num):
    # Three chained transformations, consumed once by sum()
    toto = (x*x for x in range(num)) if t == "gen" else [x*x for x in range(num)]
    toto = (x+x for x in toto) if t == "gen" else [x+x for x in toto]
    toto = (x-1 for x in toto) if t == "gen" else [x-1 for x in toto]
    sum(toto)

print("test time generator:" + str(timeit.timeit("test(\"gen\"," + str(num) + ")", setup="from __main__ import test", number=rep)))
print("test time iterator:" + str(timeit.timeit("test(\"iter\"," + str(num) + ")", setup="from __main__ import test", number=rep)))
print("test2 time generator:" + str(timeit.timeit("test2(\"gen\"," + str(num) + ")", setup="from __main__ import test2", number=rep)))
print("test2 time iterator:" + str(timeit.timeit("test2(\"iter\"," + str(num) + ")", setup="from __main__ import test2", number=rep)))

# with python3
mem generator: 0.00390625MB
mem iterator: 387.8984375MB
test time generator:730.7246094942093
test time iterator:765.0462868176401
test2 time generator:727.7452643960714
test2 time iterator:768.4699434302747

# with python2
mem generator: 310.72265625MB
mem iterator: 545.578125MB
test time generator:801.186733007
test time iterator:757.989295006
test2 time generator:810.537645102
test2 time iterator:939.240092993
Footnote: Of course, this post is only a brief introduction to generators, seen through the prism of list comprehensions (often used by Python programmers). To learn more, you can check this presentation and read up on the yield keyword.
Footnote 2: With Python 2, you need to replace “range” with “xrange” to get optimal performance, but your code will then no longer be compatible with Python 3.
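As a first step toward the yield keyword mentioned in the first footnote: a generator expression can also be written as a generator function. This is a minimal sketch, and squares is a hypothetical name chosen for the example.

```python
def squares(values):
    # Each yield hands one value to the caller, then pauses here
    # until the next value is requested.
    for x in values:
        yield x * x

gen = squares([1, 2, 3])

first = next(gen)  # only the first value is computed at this point
rest = list(gen)   # the remaining values

print(first)  # 1
print(rest)   # [4, 9]
```

Like a generator expression, the object returned by squares() can be read only once.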