===== Pragmatic Concurrency for Python =====

=== Tutors ===

  - Eilif Muller (eilif dot mueller at epfl dot ch)
  - Zbigniew Jędrzejewski-Szmek

==== Topics covered ====

  * Parallel programming concepts
  * ipython (ipcluster)
  * mpi4py (Message Passing Interface for Python)
  * multiprocessing

==== Exercises ==== 

The purpose of these exercises is not to achieve killer speed-ups (a laptop is not the right hardware for that), but rather to run and modify a few examples, become comfortable with the APIs, and implement some simple parallel programs.

  * To retrieve the source code for the exercises: 

    git clone <username>@python.g-node.org:/git/parallel

=== 1) Running MPI programs ===

Write a simple Python program using the mpi4py module which imports
mpi4py.MPI and displays COMM_WORLD.rank, COMM_WORLD.size and
MPI.Get_processor_name() on each process.  It is always handy to
have such a program around to verify that the MPI environment is
working as expected.  In a distributed environment, the processor
name will further tell you whether your MPI execution was spawned
across machine boundaries, and how many processes are allocated per
machine.

Note: To run an mpi4py program, start it like any other MPI program, i.e. as follows (''X'' is the number of processes to start):

<code bash>
$ mpiexec -n X python <program.py>
</code>
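One possible starting point for this exercise is sketched below. It is only a sketch: the helper name ''format_report'' and the decision to import mpi4py inside ''main()'' (so that a missing installation produces a readable message) are choices made for this illustration, not part of the provided examples.

```python
import sys


def format_report(rank, size, host):
    """Build the one-line report printed by each process."""
    return "rank %d of %d on %s" % (rank, size, host)


def main():
    # Import inside main() so that a missing or broken installation is
    # reported as a readable message rather than a bare traceback.
    try:
        from mpi4py import MPI
    except ImportError as exc:
        print("mpi4py could not be imported: %s" % exc, file=sys.stderr)
        return

    comm = MPI.COMM_WORLD
    print(format_report(comm.Get_rank(), comm.Get_size(),
                        MPI.Get_processor_name()))


if __name__ == "__main__":
    main()
```

Started through ''mpiexec'' as shown above, each process prints one line. If every line reports a size of 1, the processes were most likely started without a common MPI environment.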


=== 2) Matrix Multiplication ===

  * Configure the matrix multiplication examples from the repository (e.g. ''parallel/matmul/mpi_matmul.py'') to use the same matrix sizes, and compare their speeds. It is also worthwhile to look at speedup and scaling if you have (remote) access to machines with more than 4 cores (true cores, not hyper-threads) and the appropriate software installed.
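Speedup and scaling can be summarised with two numbers per run: the speedup (serial time divided by parallel time) and the efficiency (speedup divided by the number of processes). The sketch below shows one way to tabulate them. The function name ''scaling_table'' is hypothetical, and the timings are made-up placeholders to be replaced with your own measurements.

```python
def scaling_table(timings):
    """Return (n_procs, seconds, speedup, efficiency) rows.

    `timings` maps a process count to the measured wall-clock time in
    seconds and must contain an entry for 1 process (the serial baseline).
    """
    t_serial = timings[1]
    rows = []
    for n_procs in sorted(timings):
        seconds = timings[n_procs]
        speedup = t_serial / seconds
        efficiency = speedup / n_procs
        rows.append((n_procs, seconds, speedup, efficiency))
    return rows


# Made-up timings, for illustration only; replace with your own measurements.
example = {1: 8.0, 2: 4.0, 4: 2.5}
for n_procs, seconds, speedup, efficiency in scaling_table(example):
    print("%2d procs  %6.2f s  speedup %5.2f  efficiency %4.2f"
          % (n_procs, seconds, speedup, efficiency))
```

An efficiency that drops quickly as processes are added usually points to communication overhead or to more processes than true cores.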

For ipython, you need to start an ipcluster: 

<code bash>
$ ipcluster start -n X  # or similar for your ipython version, see lecture notes
</code>
where ''-n X'' is the number of engine (worker) processes to start.

=== 3) Parallelization of mandelbrot ===

  - Using decomposition techniques similar to those in the Matrix Multiplication example, parallelize the serial implementation of the Mandelbrot plotter provided in the examples, using mpi4py, ipython and multiprocessing.
  - Load balancing: the Mandelbrot computation has the property that computing some pixels takes much longer than others.

First, quantify the degree of imbalance by gathering and plotting the distribution of execution times per pixel.  Assuming you used chunked decomposition as for matrix multiplication, how does this per-pixel imbalance translate into a per-chunk imbalance?

Second, can you modify the decomposition of the problem to give each worker a more equal workload?
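To get a feel for the two questions above without starting any processes, the sketch below uses the iteration count per image row as a stand-in for execution time and compares two static decompositions: contiguous blocks of rows and interleaved (cyclic) rows. The grid size, the plotted region and all function names are assumptions made for this illustration and are not taken from the provided plotter.

```python
def escape_count(c, max_iter=100):
    """Number of iterations before z -> z*z + c leaves the radius-2 disc."""
    z = 0j
    for n in range(max_iter):
        if abs(z) > 2.0:
            return n
        z = z * z + c
    return max_iter


def row_costs(width=60, height=40, max_iter=100):
    """Total iteration count per image row, used as a proxy for run time."""
    costs = []
    for j in range(height):
        y = -1.25 + 2.5 * j / (height - 1)
        cost = 0
        for i in range(width):
            x = -2.0 + 3.0 * i / (width - 1)
            cost += escape_count(complex(x, y), max_iter)
        costs.append(cost)
    return costs


def block_split(costs, n_workers):
    """Contiguous chunks: worker k gets rows k*size .. (k+1)*size - 1."""
    size = -(-len(costs) // n_workers)  # ceiling division
    return [sum(costs[k * size:(k + 1) * size]) for k in range(n_workers)]


def cyclic_split(costs, n_workers):
    """Interleaved rows: worker k gets rows k, k + n_workers, ..."""
    return [sum(costs[k::n_workers]) for k in range(n_workers)]


def imbalance(loads):
    """Ratio of the heaviest load to the mean load (1.0 is perfect)."""
    return max(loads) / (sum(loads) / len(loads))


costs = row_costs()
for name, split in (("block", block_split), ("cyclic", cyclic_split)):
    loads = split(costs, 4)
    print("%-6s loads=%s imbalance=%.2f" % (name, loads, imbalance(loads)))
```

Iteration counts are only a proxy. For the exercise itself, measure real execution times, since per-task overhead and communication also contribute.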
  
== Hint ==

  * ipython: read up on the LoadBalancedView here: http://ipython.org/ipython-doc/rel-0.13/parallel/parallel_task.html
  * mpi4py: a pure mpi4py approach is trickier.  One method might be to use asynchronous messaging (Isend, Irecv) to a set of workers and let a master (e.g. rank 0) re-assign work as workers complete.
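The idea of a master handing out work as workers become free can be explored before writing any messaging code. The sketch below only simulates the master's bookkeeping with a heap: no processes are started and communication cost is ignored. The task costs and the function names are hypothetical.

```python
import heapq


def static_makespan(costs, n_workers):
    """Finish time when tasks are pre-assigned in contiguous chunks."""
    size = -(-len(costs) // n_workers)  # ceiling division
    return max(sum(costs[k * size:(k + 1) * size]) for k in range(n_workers))


def dynamic_makespan(costs, n_workers):
    """Finish time when each task goes to whichever worker is free first.

    This only simulates the master's bookkeeping; no processes are started
    and communication overhead is ignored.
    """
    # Heap of (time at which the worker becomes free, worker id).
    free_at = [(0, worker) for worker in range(n_workers)]
    heapq.heapify(free_at)
    for cost in costs:
        t_free, worker = heapq.heappop(free_at)
        heapq.heappush(free_at, (t_free + cost, worker))
    return max(t for t, _ in free_at)


# Hypothetical task costs: a few expensive tasks clustered together.
costs = [1, 1, 1, 1, 9, 9, 9, 9, 1, 1, 1, 1]
print("static :", static_makespan(costs, 3))
print("dynamic:", dynamic_makespan(costs, 3))
```

With clustered expensive tasks, the static split leaves one worker with nearly all of the work, while the dynamic scheme spreads it out. In a real implementation the gain has to be weighed against the cost of the extra messages.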
  
 
=== 4) IPython map-reduce ===

Using the ipython approach, get a collection of processes to count the occurrences of each word in a collection of documents, and then reduce the results to a total count per word on the master process.

See also: http://en.wikipedia.org/wiki/MapReduce, http://labs.google.com/papers/mapreduce.html
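The structure of the map and reduce steps can be tried out serially first. In the sketch below, the built-in ''map()'' is a stand-in for a parallel map over the ipython engines, so it shows the pattern and is not a solution to the exercise. The documents are tiny made-up strings and the function names are hypothetical.

```python
from collections import Counter
from functools import reduce


def count_words(text):
    """Map step: word counts for a single document."""
    return Counter(text.lower().split())


def merge_counts(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right


# Tiny made-up documents, standing in for files read from disk.
documents = [
    "the quick brown fox",
    "the lazy dog",
    "The fox jumps over the dog",
]

# Built-in map() is a serial stand-in for a parallel map over the engines.
partial_counts = list(map(count_words, documents))
total = reduce(merge_counts, partial_counts, Counter())

print(total.most_common(3))
```

When the map step is moved to real engines, anything ''count_words'' relies on (here the ''Counter'' import) must also be available on each engine.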




==== Lecture material ====

{{:parallel:talk.pdf|}}

==== Running on the cluster ====

Start:

  ssh <username>@login.s3it.uzh.ch
  module load cluster/gnode
  module load OpenMPI
  export PATH=/home/gc3/zbyszek/usr/bin:$PATH

You are running on the frontend machine ''login-p30-40''.
Now you should be able to launch ''python3'', ''ipcluster'', ''ipython''.

To run tests on the "big iron", launch a shell:

  srun -n 2 -A gnode -p largemem --pty --time=0:30:0 --mem=16g bash -l

Please note that this reserves (blocks) two CPUs as long as the shell is active.

To launch actual jobs:

  srun -n 8 -A gnode -p largemem --time=0:10:00 --mem=8g python3 script.py
  
Example using OpenMPI:

  srun --mpi=pmi2 -n 8 -A gnode -p largemem --time=0:10:00 --mem=8g mpiexec -n 8 python3 parallel/matmul/mpi_matmul.py
  
Common problems:

  * ''Required node not available (down, drained or reserved)'' — ''cluster/gnode'' module is not loaded
  * ''mpiexec: No such file or directory'' — ''OpenMPI'' module is not loaded
  * ''ImportError: No module named 'numpy''' — you are running ''/usr/bin/python3'' not ''~zbyszek/usr/bin/python3'', ''$PATH'' is not set
  * ''These module(s) exist but cannot be loaded as requested: "OpenMPI"'' — ''cluster/gnode'' module is not loaded
  * Job is scheduled but seems to wait forever for resources — someone else has reserved too many CPUs
  * ''Unable to create job step: More processors requested than permitted'' — most likely you are trying to run the job from ''sgi-p30-11'' instead of ''login-p30-40''
  * ''ImportError: No module named 'matplotlib''' — matplotlib is not installed, sorry!