When working with large datasets in Python, memory needs extra care: even the simplest program can slow the system down and consume all available memory. Python's itertools module has some nifty functions that you will end up using again and again with large datasets, especially text. I spent some time playing around with a few basic functions in the module that are simple to use and turn up in all sorts of tasks. The Python docs do a pretty fine job of explaining the individual itertools functions, so this post is just an enumeration of a few handpicked ones that I often use.
The following snippet does a quick bigram and trigram generation of a given line:
from itertools import *

def bigram(line):
    # Pair each word with the word that follows it.
    words = line.split()
    for i in izip(words, words[1:]):
        print i

def trigram(line):
    # Group each word with the two words that follow it.
    words = line.split()
    for i in izip(words, words[1:], words[2:]):
        print i

sentence = "Python is the coolest language"
bigram(sentence)
trigram(sentence)

Notice that 'izip' stops as soon as its shortest input runs out, so 'language' never starts a tuple of its own. If you want a final tuple padded with a default value, use 'izip_longest'.
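The bigram and trigram functions differ only in how many shifted copies of the word list they zip together, so both generalize to n-grams, with optional padding. Here is a minimal Python 3 sketch, where the built-in zip and itertools.zip_longest stand in for izip and izip_longest. The ngrams helper and its parameters are hypothetical and not part of the original snippets:

```python
from itertools import zip_longest

def ngrams(line, n, fillvalue=None):
    """Return the n-grams of a line as a list of tuples.

    With fillvalue=None, incomplete trailing tuples are dropped (zip).
    Otherwise trailing tuples are padded with fillvalue (zip_longest).
    """
    words = line.split()
    # words, words[1:], words[2:], ... are the shifted copies to zip together.
    shifted = [words[i:] for i in range(n)]
    if fillvalue is None:
        return list(zip(*shifted))
    return list(zip_longest(*shifted, fillvalue=fillvalue))

sentence = "Python is the coolest language"
for gram in ngrams(sentence, 2, fillvalue='-'):
    print(gram)
```

Calling ngrams(sentence, 2) without a fill value reproduces the plain bigram behaviour, where the last word never leads a tuple.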
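# Pad the last bigram with '-' instead of dropping it.
words = sentence.split()
for i in izip_longest(words, words[1:], fillvalue='-'):
    print i

A sentence can contain many tokens with non-alphabetic characters (numbers, punctuation), and the built-in 'filter' does a quick job of removing them (itertools has a lazy counterpart, 'ifilter'). It takes a function and a list, applies the function to each element of the list, and keeps only the elements for which it returns true.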
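In Python 3 the built-in filter returns a lazy iterator instead of a list, so it has to be wrapped in list() to see the result. The sketch below is illustrative only (the sample sentence and both helper names are made up) and shows the difference between dropping whole tokens and stripping individual characters:

```python
def alphabetic_words(words):
    """Keep only the tokens that consist entirely of letters."""
    return list(filter(str.isalpha, words))

def strip_non_alpha(word):
    """Remove every non-alphabetic character from a single token."""
    # A string is an iterable of characters, so filter works on it too.
    return ''.join(filter(str.isalpha, word))

tokens = "Python 2.7 is the coolest language !".split()
print(alphabetic_words(tokens))
print([strip_non_alpha(t) for t in tokens])
```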
# Keep only the purely alphabetic words.
print filter(str.isalpha, words)

'imap' is probably one of the jazziest and coolest of the itertools functions. Let's see it in action. Assume you want to find the longest word in a file that contains a word list. What is the 'conventional' way of doing this?
# Conventional approach: read one line at a time and track the maximum.
infile = open('words.txt', 'r')
len_longest_word = 0
while 1:
    word = infile.readline()
    if not word:
        break
    tmp_len = len(word.strip())  # strip the newline so it is not counted
    if tmp_len > len_longest_word:
        len_longest_word = tmp_len
print 'len_longest_word :', len_longest_word
infile.close()

Done via 'imap', the same thing takes just one expression :) (Note: the imap snippet reads the entire file in one go.)
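If reading the whole file at once is a concern, the lookup can also stay lazy. The following is a minimal Python 3 sketch, not part of the original snippets. It assumes one word per line and relies on the built-in map being lazy in Python 3 (as imap is in Python 2). The name longest_word_length and the sample contents are made up for illustration:

```python
import io

def longest_word_length(lines):
    """Return the length of the longest line, ignoring surrounding whitespace.

    `lines` is any iterable of strings, such as an open file; items are
    consumed one at a time, so the whole input is never held in memory.
    """
    return max(map(len, (line.strip() for line in lines)), default=0)

# An in-memory stand-in for words.txt (hypothetical contents).
sample = io.StringIO("python\nitertools\nzip\n")
print("len_longest_word:", longest_word_length(sample))
```

With a real file, the same call would be longest_word_length(infile) inside a with open(...) block.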
# One expression with imap; the whole file is read into memory first.
infile = open('words.txt', 'r')
contents = infile.read()
words = contents.split()
print "len_longest_word:", max(imap(len, words))
infile.close()

Now, let's say we have to analyse the frequency distribution of a few lists, or process a group of lists by accessing their elements one after another. 'chain' is a very simple and neat way of achieving this. (Try doing a frequency distribution of n lists containing numbers using 'chain'.)
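One possible take on that exercise, as a minimal Python 3 sketch. Pairing chain.from_iterable with collections.Counter is an assumption of this sketch, not something the post prescribes, and the function name and sample lists are made up:

```python
from collections import Counter
from itertools import chain

def frequency_distribution(*lists):
    """Count how often each value occurs across all the given lists."""
    # chain.from_iterable walks the lists one after another without
    # building a combined list first.
    return Counter(chain.from_iterable(lists))

freq = frequency_distribution([10, 20, 30], [10, 200, 30], [30])
for value, count in sorted(freq.items()):
    print(value, count)
```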
# Iterate over both lists as if they were a single sequence.
from itertools import chain
a = [10, 20, 30]
b = [100, 200, 300]
for i in chain(a, b):
    print i

Often we want to group the keys of a dictionary by their values. Instead of iterating through the dictionary and writing redundant code, we can use the cool 'groupby' from itertools, which lets us specify the key on which to group.
# Sort by value first: groupby only groups consecutive items with equal keys.
from itertools import groupby
from operator import itemgetter
d = dict(a=1, b=2, c=1, d=2, e=1, f=2, g=3)
di = sorted(d.iteritems(), key=itemgetter(1))
for k, g in groupby(di, key=itemgetter(1)):
    print k, map(itemgetter(0), g)

The above example on groupby was obtained from here.
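For Python 3 readers, here is a minimal sketch of the same grouping, assuming d.items() takes the place of d.iteritems(). The helper name group_keys_by_value is invented for illustration, and returning a dictionary instead of printing is a choice made for this sketch:

```python
from itertools import groupby
from operator import itemgetter

def group_keys_by_value(d):
    """Map each distinct value in d to the sorted list of keys that carry it."""
    # Sort by (value, key) so equal values sit next to each other;
    # groupby only merges neighbouring items.
    pairs = sorted(d.items(), key=itemgetter(1, 0))
    return {value: [key for key, _ in group]
            for value, group in groupby(pairs, key=itemgetter(1))}

d = dict(a=1, b=2, c=1, d=2, e=1, f=2, g=3)
print(group_keys_by_value(d))
```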