Computational scientists should know that most of the time their CPUs are waiting for data to arrive. Knowing where the low-level bottlenecks are, and knowing what can be done to ameliorate them, may save hours of frustration when trying to understand why apparently well-written programs perform poorly.
I'll talk about why current CPUs are starving for data, and about the different techniques available for addressing this issue on modern computers.
1. Motivation
2. The Data Access Issue
3. High Performance Libraries
Preliminary slides here. The multiprocessing script for NumPy is here.
Fetch the guidelines in TXT format or PDF.
You will need these files:
Obviously, the solutions are not provided (yet!).
Host key fingerprint is 08:43:b6:f6:ca:fb:7d:de:4a:19:9a:31:70:fd:1b:eb

+--[ RSA 2048]----+
|        o        |
|       o . .     |
|      = . . .    |
|     . + + .     |
|      o S . o    |
|     . . = o +   |
|      o o o o    |
|       . . ..o   |
|      ... .oo.E  |
+-----------------+
The exercises need to be run on a machine more powerful than a typical notebook. To access the server you need to use a non-standard port (to get around the firewall :( ):
ssh -p 443 student@escher.fuw.edu.pl -o VisualHostKey=yes
Save the following as cpu_vs_mem.py and execute it. Thanks to Stéfan van der Walt for contributing this script.
import numpy as np
import timeit
import sys

N = 10*1000*1000

x = np.linspace(1, 10, N)

def raw():
    """Straight-forward NumPy evaluation of polynomial."""
    return (((.25 * x) + .75) * x - 1.5) * x - 2

def inplace(block_size=20000):
    """Blocked evaluation of polynomial."""
    y = np.empty(len(x))
    for k in range(len(x) // block_size + 1):
        b, e = k * block_size, (k + 1) * block_size
        y[b:e] = x[b:e]
        y[b:e] *= .25
        y[b:e] += .75
        y[b:e] *= x[b:e]
        y[b:e] -= 1.5
        y[b:e] *= x[b:e]
        y[b:e] -= 2
    return y

def bench():
    """Illustrate CPU vs memory trade-off.

    Break up a computation in chunks and benchmark.  Small blocks
    fit into cache easily, but the NumPy overhead and the outer
    Python for-loop take long to execute.  With large blocks, the
    overhead for NumPy and the for-loop is negligible, but the
    blocks no longer fit into cache, resulting in delays.

    Returns
    -------
    block_sizes : list
        Size of the different data chunks.
    times : list
        Execution times.

    """
    times = []
    blocks = np.round(np.logspace(3, 6, num=50))
    for b in blocks:
        times.append(timeit.timeit('cpu_vs_mem.inplace(block_size=%d)' % b,
                                   'import cpu_vs_mem', number=1))
        print('Block size: %d  Execution time: %.2f' % (b, times[-1]))
        sys.stdout.flush()
    return blocks, times

if __name__ == "__main__":
    import matplotlib.pyplot as plt

    blocks, times = bench()
    plt.semilogx(blocks, times, 'o-')
    plt.xlabel('Block size [b]')
    plt.ylabel('Execution time [s]')
    plt.title('CPU vs Memory Benchmark')
    plt.show()
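As a quick sanity check (not part of the course files — the function names here are illustrative), you can verify that the blocked in-place evaluation computes exactly the same polynomial as the one-shot NumPy expression. The one-shot version allocates a full-size temporary array for every intermediate result, while the blocked version only ever touches one cache-sized chunk at a time:

```python
import numpy as np

x = np.linspace(1, 10, 100000)

def poly_raw(x):
    # One-shot evaluation: each intermediate is a full-size temporary array.
    return (((.25 * x) + .75) * x - 1.5) * x - 2

def poly_blocked(x, block_size=8192):
    # Chunked in-place evaluation: each block fits in cache,
    # so no large temporaries are created.
    y = np.empty_like(x)
    for b in range(0, len(x), block_size):
        e = b + block_size
        y[b:e] = x[b:e]
        y[b:e] *= .25
        y[b:e] += .75
        y[b:e] *= x[b:e]
        y[b:e] -= 1.5
        y[b:e] *= x[b:e]
        y[b:e] -= 2
    return y

print(np.allclose(poly_raw(x), poly_blocked(x)))
```

If the two disagree, the blocking logic (block boundaries, in-place operation order) is wrong; this check is cheap insurance before benchmarking.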