====== When parallelization does not help: the starving CPUs problem ======
== Executive summary ==
Computational scientists should know that most of the time
their CPUs are waiting for data to arrive. Knowing where
the low-level bottlenecks are, and knowing what can be
done to ameliorate them, may save hours of frustration
when trying to understand why apparently well-written
programs perform poorly.
I'll talk about why current CPUs are starving for data,
and how to address this issue in modern computers by using
different techniques.
== Contents ==
1. Motivation
2. The Data Access Issue
* Why Modern CPUs Are Starving?
* Caches And The Hierarchical Memory Model
* Techniques For Fighting Data Starvation
3. High Performance Libraries
{{:materials:starving_cpus:StarvingCPUs.pdf|Preliminary slides here}}.
The multiprocessing script for NumPy is {{:materials:starving_cpus:poly-mp.py|here}}.
== Exercises ==
Fetch the guidelines in {{:materials:starving_cpus:guidelines.txt|TXT}} format or {{:materials:starving_cpus:guidelines.pdf|PDF}}.
You will need these files:
* {{:materials:starving_cpus:poly1.py|poly1.py}}
* {{:materials:starving_cpus:poly2.py|poly2.py}}
* {{:materials:starving_cpus:poly.c|poly.c}}
Obviously, the solutions are not provided (yet!).
== Server for exercises ==
You should see the following
Host key fingerprint is 08:43:b6:f6:ca:fb:7d:de:4a:19:9a:31:70:fd:1b:eb
+--[ RSA 2048]----+
| o |
| o . . |
| = . . . |
| . + + . |
| o S . o |
| . . = o + |
| o o o o |
| . . ..o |
| ... .oo.E |
+-----------------+
The exercises need to be performed on a machine with more power than a normal notebook.
To access the server you need to use a different port (to bypass the firewall :( ):
ssh -p 443 student@escher.fuw.edu.pl -o VisualHostKey=yes
== CPU vs Memory Benchmark ==
Save the following as cpu_vs_mem.py and execute. Thanks to Stéfan van der Walt for contributing this.
import numpy as np
import timeit
import sys
N = 10*1000*1000
x = np.linspace(1, 10, N)
def raw():
"""Straight-forward NumPy evaluation of polynomial.
"""
return (((.25 * x) + .75) * x - 1.5) * x - 2
def inplace(block_size=20000):
"""Blocked evaluation of polynomial.
"""
y = np.empty(len(x))
for k in range(len(x) // block_size + 1):
b, e = k * block_size, (k+1) * block_size
y[b:e] = x[b:e]
y[b:e] *= .25
y[b:e] += .75
y[b:e] *= x[b:e]
y[b:e] -= 1.5
y[b:e] *= x[b:e]
y[b:e] -= 2
return y
def bench():
"""Illustrate CPU vs memory trade-off.
Break up a computation in chunks and benchmark. Small blocks fit into
cache easily, but the NumPy overhead and the outer Python for-loop
takes long to execute. With large blocks, the overhead for NumPy and
the for-loop is negligible, but the blocks no longer fit into cache,
resulting in delays.
Returns
-------
block_sizes : list
Size of the different data chunks.
times : list
Execution times.
"""
times = []
blocks = np.round(np.logspace(3, 6, num=50))
for b in blocks:
times.append(timeit.timeit('cpu_vs_mem.inplace(block_size=%d)' % b,
'import cpu_vs_mem', number=1))
print 'Block size: %d Execution time: %.2f' % (b, times[-1])
sys.stdout.flush()
return blocks, times
if __name__ == "__main__":
import matplotlib.pyplot as plt
blocks, times = bench()
plt.semilogx(blocks, times, 'o-')
plt.xlabel('Block size [b]')
plt.ylabel('Execution time [s]')
plt.title('CPU vs Memory Benchmark')
plt.show()