a Summer School by the G-Node and the Physik-Institut, University of Zurich
When we are dealing with big data, it is often impossible to fit the whole dataset in memory. If we can read the data in “chunks” and process each chunk on its own, we can create an efficient pipeline.
In this exercise, the goal is to solve the first part of the problem: write an iterator which yields single words from a file, ignoring all punctuation and whitespace.
If the file weren't too big, one could simply use
open('filename').read().translate(None, ';.!?').split()
But imagine what would happen if the file were bigger than the available RAM… So the requirement is that the iterator must not “slurp” the whole file in, but should read it in parts and yield words without ever using an unbounded amount of memory.
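A minimal sketch of one possible approach, reading the file line by line (assuming that single lines fit comfortably in memory) rather than all at once; the function name and punctuation set are only illustrative:

def words(filename, punctuation=';.!?'):
    # Read one line at a time instead of slurping the whole file.
    with open(filename) as f:
        for line in f:
            # Drop punctuation characters, then split on whitespace.
            cleaned = ''.join(c for c in line if c not in punctuation)
            for word in cleaned.split():
                yield word

Because the function only ever holds one line and one word at a time, memory use stays bounded regardless of the file size.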
Using the iterator from Exercise 1, we can easily access the words in a file. In this exercise the second part of the pipeline is implemented: the words generated by the word iterator are used to calculate some statistics on the text. Since we are operating in a pipeline, we again have to watch the total memory use, and cannot simply store all the words in a list.
Write a function which prints out the mean length of words in the file.
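A minimal sketch, assuming a word iterator like the one from Exercise 1 (here called words(); the name is illustrative). Only a running total and a count are kept, so memory use stays constant:

def mean_word_length(filename):
    total = 0
    count = 0
    for word in words(filename):
        total += len(word)
        count += 1
    # float() avoids integer division; the conditional guards against
    # an empty file.
    print(float(total) / count if count else 0.0)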
Sometimes finding the right data structure makes for a really simple solution to a seemingly complicated problem…
Write a program to find all anagrams in a list of words. Ignore the case of letters.
E.g., for the list
['now', 'elapses', 'house', 'pleases', 'won']
the program might say
elapses <~> pleases
won <~> now
For development, write the list directly in the program. This way you can check that the results are sensible without reading pages of output.
For testing use the following generator expression:
(line.strip() for line in open('/usr/share/dict/words'))
A word is a sequence of characters. Two words are anagrams if their sequences of letters are the same after sorting.
anagrams.py (fixed)
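A minimal sketch of this idea: since all anagrams share the same letters once sorted, a dictionary keyed by the sorted, lower-cased letters groups them together (the function name is illustrative):

def print_anagrams(words):
    groups = {}
    for word in words:
        # All anagrams of a word map to the same key.
        key = ''.join(sorted(word.lower()))
        groups.setdefault(key, []).append(word)
    for group in groups.values():
        if len(group) > 1:
            print(' <~> '.join(group))

For development, call it as print_anagrams(['now', 'elapses', 'house', 'pleases', 'won']); for testing, pass in the generator expression above.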
This exercise wasn't anticipated, and requires the installation of an additional package :(
easy_install --user https://bitbucket.org/bos/python-inotify/get/aca0c5246c46.zip
Imagine that we want to react whenever lines are written to a file, e.g. the file where the measurements of some hardware device are logged. The following iterator first yields all existing lines of a file, and then waits until new lines are appended and yields them too, ad infinitum.
import os, time

def follow(file):
    while True:
        # readline returns '' when nothing more can be read
        line = file.readline()
        if line:
            yield line
        else:
            time.sleep(0.1)
The problem is that this is quite inefficient: it uses what is called “busy polling”, repeatedly checking for new data and chewing up the CPU.
A general solution to “following” files in this way is to use inotify. This is a Linux kernel interface through which the kernel notifies a program whenever a file or directory is modified.
In the following snippet, the read() method blocks until an event (a modification to the file under watch) happens:
import inotify, inotify.watcher

watcher = inotify.watcher.Watcher()
watcher.add('/tmp/file', inotify.IN_MODIFY)
watcher.read()
Use a watcher to rewrite the iterator above so that it does not busy-poll: it should not attempt to read another line until the kernel notifies it that the file has been modified.
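One possible sketch, assuming the python-inotify API shown above (with watcher.read() blocking until the kernel reports an event); the function takes a filename rather than an open file so that it can also set up the watch itself:

import inotify
import inotify.watcher

def follow(filename):
    watcher = inotify.watcher.Watcher()
    watcher.add(filename, inotify.IN_MODIFY)
    with open(filename) as f:
        while True:
            line = f.readline()
            if line:
                yield line
            else:
                # Nothing left to read: block in the kernel until the
                # file is modified, instead of busy polling.
                watcher.read()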