The purpose of these exercises is not to amount to killer speed-ups (a laptop is not the right hardware for that), but rather to run and modify a few examples, become comfortable with APIs, and implement some simple parallel programs.
git clone <username>@python.g-node.org:/git/parallel
Write a simple python program using the mpi4py module which imports mpi4py.MPI and displays the COMM_WORLD.rank, size and MPI.Get_processor_name() on each process. It is always handy to have such a program around to verify that the MPI environment is working as expected. In a distributed environment, the processor name will further inform you that your MPI execution was spawned accross machine boundaries, and how many processes are allocated per machine.
Note: To run your program mpi4py, it must be started as if it was any MPI program, i.e. as follows:
$ mpiexec -n X python <program.py>
For ipython, you need to start an ipcluster:
NB:
$ ipcluster start -n X # or similar for your ipython version, see lecture notes
Where -n X is the number of slave processes to start.
First, quantify the degree of inbalance by gathering and plotting the distribution of execution times per pixel. Assuming you used chunked decomposition as for matrix multiplication, how does this per-pixel imbalance translate into a per-chunk inbalance?
Second, Can you modify the decomposition of the problem to provide each worker with work-loads which are more equal?
ipython: read-up on the LoadBalancedView here: http://ipython.org/ipython-doc/rel-0.13/parallel/parallel_task.html mpi4py: a pure mpi4py approach is more tricky. One method might be to use asynchronous messaging (Isend, Irecv) to a set of workers and let a master (e.g. rank 0) re-assign work as workers complete.
Using the ipython approach, get a collection of processes to count the occurrences of a word in a collection of documents, and then reduce the results to a total count per word on the master process.
See also: http://en.wikipedia.org/wiki/MapReduce, http://labs.google.com/papers/mapreduce.html
Start:
ssh <username>@login.s3it.uzh.ch module load cluster/gnode module load OpenMPI export PATH=/home/gc3/zbyszek/usr/bin:$PATH
You are running on the frontend machine login-p30-40
.
Now you should be able to launch python3
, ipcluster
, ipython
.
To run tests on the “big iron”, launch a shell:
srun -n 2 -A gnode -p largemem --pty --time=0:30:0 --mem=16g bash -l
Please note that this reserves (blocks) two CPUs as long as the shell is active.
To launch actual jobs:
srun -n 8 -A gnode -p largemem --time=0:10:00 --mem=8g python3 script.py
Example using openmpi:
srun --mpi=pmi2 -n 8 -A gnode -p largemem --time=0:10:00 --mem=8g mpiexec -n 8 python3 parallel/matmul/mpi_matmul.py
Common problems:
Required node not available (down, drained or reserved)
— cluster/gnode
module is not loadedmpiexec: No such file or directory
— OpenBLAS
module is not loadedImportError: No module named 'numpy
' — you are running /usr/bin/python3
not ~zbyszek/usr/bin/python3
, $PATH
is not setThese module(s) exist but cannot be loaded as requested: “OpenMPI”
— cluster/gnode
module is not loadedUnable to create job step: More processors requested than permitted
— most likely you are trying to run the job from sgi-p30-11
instead of login-p30-40
ImportError: No module named 'matplotlib
' — matplotlib is not installed, sorry!