Computational scientists should know that most of the time their CPUs are waiting for data to arrive. Knowing where the low-level bottlenecks are, and knowing what can be done to ameliorate them, may save hours of frustration when trying to understand why apparently well-written programs perform poorly.
I'll talk about why current CPUs are starving for data, and about the different techniques available for addressing this issue on modern computers.
1. Motivation
2. The Data Access Issue
3. High Performance Libraries
Preliminary slides here. The multiprocessing script for NumPy is here.
Fetch the guidelines in TXT format or PDF.
You will need these files:
Obviously, the solutions are not provided (yet!).
Host key fingerprint is 08:43:b6:f6:ca:fb:7d:de:4a:19:9a:31:70:fd:1b:eb

+--[ RSA 2048]----+
|        o        |
|       o . .     |
|      = . . .    |
|     . + + .     |
|      o S . o    |
|     . . = o +   |
|      o o o o    |
|       . . ..o   |
|      ... .oo.E  |
+-----------------+
The exercises need to be run on a machine more powerful than a typical notebook. To access the server you need to use a non-standard port (to get around the firewall :( ):
ssh -p 443 student@escher.fuw.edu.pl -o VisualHostKey=yes
Save the following as cpu_vs_mem.py and execute it. Thanks to Stéfan van der Walt for contributing this script.
import numpy as np
import timeit
import sys

N = 10*1000*1000

x = np.linspace(1, 10, N)

def raw():
    """Straight-forward NumPy evaluation of polynomial."""
    return (((.25 * x) + .75) * x - 1.5) * x - 2

def inplace(block_size=20000):
    """Blocked evaluation of polynomial."""
    y = np.empty(len(x))
    for k in range(len(x) // block_size + 1):
        b, e = k * block_size, (k + 1) * block_size
        y[b:e] = x[b:e]
        y[b:e] *= .25
        y[b:e] += .75
        y[b:e] *= x[b:e]
        y[b:e] -= 1.5
        y[b:e] *= x[b:e]
        y[b:e] -= 2
    return y

def bench():
    """Illustrate CPU vs memory trade-off.

    Break up a computation in chunks and benchmark.  Small blocks
    fit into cache easily, but the NumPy overhead and the outer
    Python for-loop take long to execute.  With large blocks, the
    overhead for NumPy and the for-loop is negligible, but the
    blocks no longer fit into cache, resulting in delays.

    Returns
    -------
    block_sizes : list
        Size of the different data chunks.
    times : list
        Execution times.

    """
    times = []
    blocks = np.round(np.logspace(3, 6, num=50))
    for b in blocks:
        times.append(timeit.timeit('cpu_vs_mem.inplace(block_size=%d)' % b,
                                   'import cpu_vs_mem', number=1))
        print('Block size: %d  Execution time: %.2f' % (b, times[-1]))
        sys.stdout.flush()
    return blocks, times

if __name__ == "__main__":
    import matplotlib.pyplot as plt

    blocks, times = bench()
    plt.semilogx(blocks, times, 'o-')
    plt.xlabel('Block size [b]')
    plt.ylabel('Execution time [s]')
    plt.title('CPU vs Memory Benchmark')
    plt.show()
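As a quick sanity check (not part of the course files — the function names here are illustrative), you can verify that the blocked in-place evaluation computes exactly the same polynomial as the one-shot NumPy expression. The one-shot version allocates a full-size temporary array for every intermediate result, while the blocked version only ever touches one cache-sized chunk at a time:

```python
import numpy as np

x = np.linspace(1, 10, 100000)

def poly_raw(x):
    # One-shot evaluation: each intermediate is a full-size temporary array.
    return (((.25 * x) + .75) * x - 1.5) * x - 2

def poly_blocked(x, block_size=8192):
    # Chunked in-place evaluation: each block fits in cache,
    # so no large temporaries are created.
    y = np.empty_like(x)
    for b in range(0, len(x), block_size):
        e = b + block_size
        y[b:e] = x[b:e]
        y[b:e] *= .25
        y[b:e] += .75
        y[b:e] *= x[b:e]
        y[b:e] -= 1.5
        y[b:e] *= x[b:e]
        y[b:e] -= 2
    return y

print(np.allclose(poly_raw(x), poly_blocked(x)))
```

If the two disagree, the blocking logic (block boundaries, in-place operation order) is wrong; this check is cheap insurance before benchmarking.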