Though I have included comments as part of the code, the following is a quick HOWTO on making modifications to this article extractor:
1) To extract meta information like author, title, description, keywords etc., pull out the meta tags right after the soup object is constructed in strip_tags, but before the tags are stripped. Also, have strip_tags return a tuple instead of the text alone (a rough sketch follows this list).
2) Understand how 'unwanted_tags' works; feel free to add the ids/class names that you encounter. I have mentioned only a few, but more names like "print", "popup", "tools", "socialtools" can be added.
3) Feel free to suggest any other improvements.
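For point 1, here is a rough sketch of how the top of strip_tags (defined in the full script below) could be changed to collect meta information. The attribute names checked ('author', 'description', 'keywords') and the tuple layout are illustrative choices, not part of the original script:

def strip_tags(html):
    soup = BeautifulSoup(html)
    # grab the meta information before any tags are stripped (illustrative addition)
    title = soup.title.string if soup.title and soup.title.string else ""
    meta = {}
    for m in soup.findAll('meta'):
        name = m.get('name', '').lower()
        if name in ('author', 'description', 'keywords'):
            meta[name] = m.get('content', '')
    # ... the existing tag-stripping logic goes here ...
    return soup, title, meta    # return a tuple instead of the text alone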
from BeautifulSoup import BeautifulSoup, Comment
import re
invalid_tags = ['b', 'i', 'u', 'link', 'em', 'small', 'span', 'blockquote', 'strong', 'abbr', 'ol', 'h1', 'h2', 'h3', 'h4', 'font', 'tr', 'td', 'center', 'tbody', 'table']
not_allowed_tags = ['script', 'noscript', 'img', 'object', 'meta', 'code', 'pre', 'br', 'hr', 'form', 'input', 'iframe', 'style', 'dl', 'dt', 'sup', 'head', 'acronym']
# fragments of class/id attribute values that are checked for in a given html tag - if present, the tag is removed.
# (seed list only - add whatever boilerplate markers you encounter)
unwanted_tags = ['print', 'popup', 'tools', 'socialtools']

def unwanted(tag_class):
    # return True if the tag's class/id value contains any of the unwanted markers
    for each_class in unwanted_tags:
        if each_class in tag_class:
            return True
    return False

def replace_with_contents(tag):
    # helper (name and insert/extract steps reconstructed from the usual BeautifulSoup recipe):
    # replace a tag with its own contents
    for i, x in enumerate(tag.parent.contents):
        if x == tag:
            break
    else:
        print "Can't find", tag, "in", tag.parent
        return
    for r in reversed(tag.contents):
        # re-insert the children at the tag's position, then drop the tag itself
        tag.parent.insert(i, r)
    tag.extract()
tags = ""
soup = BeautifulSoup(html)
doctype = soup.findAll(text=re.compile("DOCTYPE"))
[tree.extract() for tree in doctype]
#remove all links
links = soup.findAll(text=re.compile("http://"))
[tree.extract() for tree in links]
#remove all comments
comments = soup.findAll(text=lambda text:isinstance(text, Comment) )
[comment.extract() for comment in comments]
    for tag in soup.findAll(True):
        # remove all the tags that are not allowed
        if tag.name in not_allowed_tags:
            tag.extract()
        # replace the formatting tags with the content of the tag
        elif tag.name in invalid_tags:
            replace_with_contents(tag)
        # similar to not_allowed_tags, but checks the class/id attribute before removing the tag
        elif unwanted(tag.get('class', '')) or unwanted(tag.get('id', '')):
            tag.extract()
        # special case of lists - the lists can be part of navbars/sideheadings too,
        # hence check their length before removing them
        elif tag.name == 'li':
            tagc = strip_tags(str(tag.contents))
            if len(str(tagc).split()) < 3:
                tag.extract()
        # finally, remove the empty and spurious wrapper tags
        # (the emptiness test below is a reconstruction of the original check)
        elif tag.name in ['div', 'a', 'p', 'ul', 'li', 'html', 'body']:
            if not ''.join(tag.findAll(text=True)).strip():
                tag.extract()
    return soup

# open the file which contains the html.
# this step can be replaced with reading directly from the url;
# however, I think it's always better to store the html
# locally for any later processing.
html = open("techcrunch.html").read()
soup = strip_tags(html)
content = str(soup.prettify())
# write the stripped content into another file
outfile = open("tech.txt", "w")
outfile.write(content)
outfile.close()
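For completeness, if you would rather read the page straight from the web instead of from a saved file, a minimal sketch using urllib2 from the standard library (the URL below is just a placeholder, not one from the original post) would look something like this:

import urllib2

url = "http://techcrunch.com/"          # placeholder - substitute the article url
html = urllib2.urlopen(url).read()      # fetch the raw html over http
soup = strip_tags(html)
open("tech.txt", "w").write(str(soup.prettify()))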
If the formatting is screwed up, then you can access the code here or here.