====== When parallelization does not help: the starving CPUs problem ====== == Executive summary == Computational scientists should know that most of the time their CPUs are waiting for data to arrive. Knowing where the low-level bottlenecks are, and knowing what can be done to ameliorate them, may save hours of frustration when trying to understand why apparently well-written programs perform poorly. I'll talk about why current CPUs are starving for data, and how to address this issue in modern computers by using different techniques. == Contents == 1. Motivation 2. The Data Access Issue * Why Modern CPUs Are Starving? * Caches And The Hierarchical Memory Model * Techniques For Fighting Data Starvation 3. High Performance Libraries {{:materials:starving_cpus:StarvingCPUs.pdf|Preliminary slides here}}. The multiprocessing script for NumPy is {{:materials:starving_cpus:poly-mp.py|here}}. == Exercises == Fetch the guidelines in {{:materials:starving_cpus:guidelines.txt|TXT}} format or {{:materials:starving_cpus:guidelines.pdf|PDF}}. You will need these files: * {{:materials:starving_cpus:poly1.py|poly1.py}} * {{:materials:starving_cpus:poly2.py|poly2.py}} * {{:materials:starving_cpus:poly.c|poly.c}} Obviously, the solutions are not provided (yet!). == Server for exercises == You should see the following Host key fingerprint is 08:43:b6:f6:ca:fb:7d:de:4a:19:9a:31:70:fd:1b:eb +--[ RSA 2048]----+ | o | | o . . | | = . . . | | . + + . | | o S . o | | . . = o + | | o o o o | | . . ..o | | ... .oo.E | +-----------------+ The exercises need to be performed on a machine with more power than a normal notebook. To access the server you need to use a different port (to bypass the firewall :( ): ssh -p 443 student@escher.fuw.edu.pl -o VisualHostKey=yes == CPU vs Memory Benchmark == Save the following as cpu_vs_mem.py and execute. Thanks to Stéfan van der Walt for contributing this. import numpy as np import timeit import sys N = 10*1000*1000 x = np.linspace(1, 10, N) def raw(): """Straight-forward NumPy evaluation of polynomial. """ return (((.25 * x) + .75) * x - 1.5) * x - 2 def inplace(block_size=20000): """Blocked evaluation of polynomial. """ y = np.empty(len(x)) for k in range(len(x) // block_size + 1): b, e = k * block_size, (k+1) * block_size y[b:e] = x[b:e] y[b:e] *= .25 y[b:e] += .75 y[b:e] *= x[b:e] y[b:e] -= 1.5 y[b:e] *= x[b:e] y[b:e] -= 2 return y def bench(): """Illustrate CPU vs memory trade-off. Break up a computation in chunks and benchmark. Small blocks fit into cache easily, but the NumPy overhead and the outer Python for-loop takes long to execute. With large blocks, the overhead for NumPy and the for-loop is negligible, but the blocks no longer fit into cache, resulting in delays. Returns ------- block_sizes : list Size of the different data chunks. times : list Execution times. """ times = [] blocks = np.round(np.logspace(3, 6, num=50)) for b in blocks: times.append(timeit.timeit('cpu_vs_mem.inplace(block_size=%d)' % b, 'import cpu_vs_mem', number=1)) print 'Block size: %d Execution time: %.2f' % (b, times[-1]) sys.stdout.flush() return blocks, times if __name__ == "__main__": import matplotlib.pyplot as plt blocks, times = bench() plt.semilogx(blocks, times, 'o-') plt.xlabel('Block size [b]') plt.ylabel('Execution time [s]') plt.title('CPU vs Memory Benchmark') plt.show()