December 30, 2010

An Evening with Python's itertools module

Why do I love Python? Well, have you been to the Himalayas and watched the morning sunrise? Some feelings simply cannot be explained, and the fun of programming in Python is one of them. Anywayz... more on Python and the associated 'joy factor' in a later post. :)

Often, while working with large datasets in Python, one needs to take extra care with memory: even the simplest of programs can slow the system down and eat up all available memory. Python's itertools module has some nifty functions that you will end up using frequently when working with large datasets, especially text. I spent some time playing around with a few basic itertools functions that are simple to use and show up across all kinds of tasks. Though the Python docs do a pretty fine job of explaining the individual functions, this post is just an enumeration of a few handpicked ones that I often use.

The following snippet does a quick bigram and trigram generation of a given line:
from itertools import izip, izip_longest, imap

def bigram(line):
  words = line.split()
  # pair each word with its successor; izip is lazy, unlike zip
  for i in izip(words, words[1:]):
    print i

def trigram(line):
  words = line.split()
  for i in izip(words, words[1:], words[2:]):
    print i

sentence = "Python is the coolest language"
bigram(sentence)
trigram(sentence)
If you notice, izip stops as soon as the shortest sequence is exhausted, so 'language' never starts a tuple of its own. If you want to pad the last tuples with a default value, use 'izip_longest':
from itertools import izip_longest
for i in izip_longest(words, words[1:], fillvalue='-'):
  print i
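For the curious, here is what the padded trigram output looks like on the example sentence (sketched with the Python 3 spelling, zip_longest; on Python 2 the same function is izip_longest):

```python
from itertools import zip_longest   # named izip_longest on Python 2

words = "Python is the coolest language".split()
# every word now starts a tuple; missing slots are filled with '-'
for t in zip_longest(words, words[1:], words[2:], fillvalue='-'):
    print(t)
```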
A sentence can contain many non-alphabetic characters, and the built-in 'filter' does a quick job of removing them. It takes a function and a list, applies the function to each element, and keeps only the elements for which it returns true.
print filter(str.isalpha, words)
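A quick sketch of this on a made-up sentence (written with a list() call, since on Python 3 filter returns a lazy iterator rather than a list):

```python
words = "Hello, world - itertools rocks!".split()
# str.isalpha is False for any token containing punctuation
clean = list(filter(str.isalpha, words))
print(clean)
```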
'imap' is probably the jazziest and coolest of the itertools functions. Let's see its usage in the following example. Assume you want to find the longest word in a file containing a word list. What is the 'conventional' way of doing this?
infile = open('words.txt', 'r')
len_longest_word = 0
while 1:
  line = infile.readline()
  if not line:
    break
  # strip the trailing newline before measuring the word
  tmp_len = len(line.strip())
  if tmp_len > len_longest_word:
    len_longest_word = tmp_len
print 'len_longest_word:', len_longest_word
infile.close()
The same thing done via imap is just one line :) as follows. (Note: here we read the entire file in one go.)
infile = open('words.txt', 'r')
contents = infile.read()
words = contents.split()
print "len_longest_word:",max(imap(len, words))
infile.close()
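Since the intro was about memory, it is worth noting that the laziness is wasted if we slurp the whole file first. A streaming sketch instead lets max() pull lines one at a time (a tiny sample words.txt is created up front just so the snippet is self-contained):

```python
# create a small sample file so this sketch is self-contained
with open('words.txt', 'w') as f:
    f.write("ant\nhippopotamus\ncat\n")

# max() consumes the file line by line, so the whole file
# never needs to sit in memory at once
with open('words.txt') as infile:
    len_longest_word = max(len(line.strip()) for line in infile)
print('len_longest_word:', len_longest_word)
```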
Now, let's say we have to analyse the frequency distribution of a few lists, or we have to process a group of lists by accessing their elements in succession; 'chain' is a very simple and neat way of achieving this. (Try doing a frequency distribution of n lists of numbers using 'chain'.)
from itertools import chain
a=[10,20,30]
b=[100,200,300]
for i in chain(a,b):
  print i
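As a sketch of the exercise above, here is a frequency count over two lists treated as a single stream (a plain dict, nothing beyond the standard library):

```python
from itertools import chain

a = [10, 20, 30, 10]
b = [10, 30, 30]
freq = {}
for x in chain(a, b):   # chain walks a, then b, as one stream
    freq[x] = freq.get(x, 0) + 1
print(freq)
```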
Often, we want to group the elements of a dictionary by value; instead of iterating through the dictionary and writing redundant code, itertools comes with a cool 'groupby' that lets us specify the key to group on.
from itertools import groupby
from operator import itemgetter

d = dict(a=1, b=2, c=1, d=2, e=1, f=2, g=3)
# groupby needs its input sorted on the grouping key
di = sorted(d.iteritems(), key=itemgetter(1))
for k, g in groupby(di, key=itemgetter(1)):
    print k, map(itemgetter(0), g)
The above example on groupby was obtained from here.
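One caveat worth spelling out: groupby only merges runs of consecutive equal keys, which is exactly why the dictionary items are sorted first. A small sketch of the difference:

```python
from itertools import groupby

data = [1, 1, 2, 1]
# without sorting, the trailing 1 forms a group of its own
print([(k, len(list(g))) for k, g in groupby(data)])
print([(k, len(list(g))) for k, g in groupby(sorted(data))])
```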
