====== When parallelization does not help: the starving CPUs problem ======

== Executive summary ==

Computational scientists should know that most of the time
their CPUs are waiting for data to arrive.  Knowing where
the low-level bottlenecks are, and knowing what can be
done to ameliorate them, may save hours of frustration
when trying to understand why apparently well-written
programs perform poorly.

I'll talk about why current CPUs are starving for data,
and how to address this issue in modern computers by using
different techniques.

== Contents ==

1.  Motivation

2.  The Data Access Issue
  * Why Modern CPUs Are Starving?
  * Caches And The Hierarchical Memory Model
  * Techniques For Fighting Data Starvation

3. High Performance Libraries


{{:materials:starving_cpus:StarvingCPUs.pdf|Preliminary slides here}}.
The multiprocessing script for NumPy is {{:materials:starving_cpus:poly-mp.py|here}}.

== Exercises ==

Fetch the guidelines in {{:materials:starving_cpus:guidelines.txt|TXT}} format or {{:materials:starving_cpus:guidelines.pdf|PDF}}.

You will need these files:

  * {{:materials:starving_cpus:poly1.py|poly1.py}}
  * {{:materials:starving_cpus:poly2.py|poly2.py}}
  * {{:materials:starving_cpus:poly.c|poly.c}}

Obviously, the solutions are not provided (yet!).

== Server for exercises ==

<box right grey 30%>
You should see the following
  Host key fingerprint is 08:43:b6:f6:ca:fb:7d:de:4a:19:9a:31:70:fd:1b:eb
  +--[ RSA 2048]----+
  |    o            |
  |   o .   .       |
  |    = . . .      |
  |   . + +   .     |
  |      o S . o    |
  |   . .   = o +   |
  |    o   o o o    |
  |     . . ..o     |
  |    ... .oo.E    |
  +-----------------+
</box>
The exercises need to be performed on a machine with more power than a normal notebook.
To access the server you need to use a different port (to bypass the firewall :( ):
  ssh -p 443 student@escher.fuw.edu.pl -o VisualHostKey=yes

== CPU vs Memory Benchmark ==

Save the following as cpu_vs_mem.py and execute.  Thanks to Stéfan van der Walt for contributing this.

<code python cpu_vs_mem.py>
import numpy as np
import timeit
import sys

N = 10*1000*1000
x = np.linspace(1, 10, N)

def raw():
    """Straight-forward NumPy evaluation of polynomial.

    """
    return (((.25 * x) + .75) * x - 1.5) * x - 2

def inplace(block_size=20000):
    """Blocked evaluation of polynomial.

    """
    y = np.empty(len(x))
    for k in range(len(x) // block_size + 1):
        b, e = k * block_size, (k+1) * block_size
        y[b:e] = x[b:e]
        y[b:e] *= .25
        y[b:e] += .75
        y[b:e] *= x[b:e]
        y[b:e] -= 1.5
        y[b:e] *= x[b:e]
        y[b:e] -= 2

    return y

def bench():
    """Illustrate CPU vs memory trade-off.

    Break up a computation in chunks and benchmark. Small blocks fit into
    cache easily, but the NumPy overhead and the outer Python for-loop
    takes long to execute.  With large blocks, the overhead for NumPy and
    the for-loop is negligible, but the blocks no longer fit into cache,
    resulting in delays.

    Returns
    -------
    block_sizes : list
        Size of the different data chunks.
    times : list
        Execution times.

    """
    times = []
    blocks = np.round(np.logspace(3, 6, num=50))
    for b in blocks:
        times.append(timeit.timeit('cpu_vs_mem.inplace(block_size=%d)' % b,
                                   'import cpu_vs_mem', number=1))
        print 'Block size: %d  Execution time: %.2f' % (b, times[-1])
        sys.stdout.flush()

    return blocks, times

if __name__ == "__main__":
    import matplotlib.pyplot as plt

    blocks, times = bench()
    plt.semilogx(blocks, times, 'o-')
    plt.xlabel('Block size [b]')
    plt.ylabel('Execution time [s]')
    plt.title('CPU vs Memory Benchmark')
    plt.show()


</code>