The Unstoppable Force: Simple Article Extractor from HTML

The following is a simple extractor that pulls the article text out of a given web (HTML) page. It is written in Python, so it is short: less than 55 lines of code. I tried it on a few web pages and was satisfied with the output.
Though I have included comments in the code, here is a quick HOWTO on modifying this article extractor:
1) To extract meta information such as author, title, description and keywords, extract the meta tags in line 30, i.e. after the soup object is constructed but before the tags are stripped. Then have strip_tags return a tuple (the meta information plus the stripped content) instead of the stripped content alone.
2) Understand how 'unwanted_tags' works: a tag is removed if its class or id contains any of the listed names. Feel free to add the ids/class names that you encounter. I have listed only a few, but more names like "print", "popup", "tools" and "socialtools" can be added.
3) Feel free to suggest any other improvements.
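
As a rough illustration of point 1, here is a minimal sketch of collecting the meta information before anything is stripped. It is not the code from this post: to keep it runnable on its own it uses Python 3's standard html.parser module instead of the soup object, and the names MetaCollector and extract_meta are made up for this example.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collects the <title> text and <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self._title_parts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            # e.g. author, description, keywords
            self.meta[attrs["name"].lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)


def extract_meta(html):
    """Return a dict of meta information found in the raw html string."""
    collector = MetaCollector()
    collector.feed(html)
    collector.close()
    title = "".join(collector._title_parts).strip()
    if title:
        collector.meta["title"] = title
    return collector.meta


# made-up sample page, only for demonstration
sample = """<html><head><title>Simple Article Extractor</title>
<meta name="author" content="Jane Doe">
<meta name="keywords" content="python, html"/>
</head><body><p>Hello</p></body></html>"""
print(extract_meta(sample))
```

With something like this in place, strip_tags could return a tuple such as (meta, soup), which is the change suggested in point 1.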



from BeautifulSoup import BeautifulSoup, Comment
import re

#tags that are replaced by their contents (the markup goes, the text stays)
invalid_tags = ['b', 'i', 'u','link','em','small','span','blockquote','strong','abbr','ol','h1', 'h2', 'h3','h4','font','tr','td','center','tbody','table']

#tags that are removed together with everything inside them
not_allowed_tags = ['script','noscript','img','object','meta','code','pre','br','hr','form','input','iframe' ,'style','dl','dt','sup','head','acronym']

#names that are checked for in the class/id attribute of a given html tag - if present, the tag is removed.
unwanted_tags=["tags","breadcrumbs","disqus","boxy","popular","recent","feature_title","logo","leaderboard","widget","neighbor","dsq","announcement","button","more","categories","blogroll","cloud","related","tab"]



def unwanted(tag_class):
  for each_class in unwanted_tags:
    if each_class in tag_class:
      return True
  return False



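#from http://stackoverflow.com/questions/1765848/remove-a-tag-using-beautifulsoup-but-keep-its-contents
def remove_tag(tag):
  #find the position of the tag among the children of its parent
  for i, x in enumerate(tag.parent.contents):
    if x == tag: break
  else:
    print("Can't find %s in %s" % (tag, tag.parent))
    return
  #move the children of the tag into the parent, at the position of the tag
  for r in reversed(tag.contents):
    tag.parent.insert(i, r)
  #drop the now empty tag itself
  tag.extract()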



def strip_tags(html):
  soup = BeautifulSoup(html)

  #remove doctype
  doctype = soup.findAll(text=re.compile("DOCTYPE"))
  for tree in doctype:
    tree.extract()

  #remove all text nodes that contain a link (http://)
  links = soup.findAll(text=re.compile("http://"))
  for tree in links:
    tree.extract()

  #remove all comments
  comments = soup.findAll(text=lambda text: isinstance(text, Comment))
  for comment in comments:
    comment.extract()

  for tag in soup.findAll(True):
    #remove all the tags that are not allowed.
    if tag.name in not_allowed_tags:
      tag.extract()
      continue

    #replace the tags with the content of the tag
    if tag.name in invalid_tags:
      remove_tag(tag)

    # similar to not_allowed_tags, but checks the class/id attribute of the tag before removing it
    if unwanted(tag.get('class','')) or unwanted(tag.get('id','')):
      tag.extract()
      continue

    # special case of lists - the lists can be part of navbars/sideheadings too,
    # hence check the length (fewer than 3 words) before removing them
    if tag.name == 'li':
      tagc = strip_tags(str(tag.contents))
      if len(str(tagc).split()) < 3:
        tag.extract()
        continue

    #finally remove all empty and spurious tags and replace them with their content
    if tag.name in ['div','a','p','ul','li','html','body']:
      remove_tag(tag)

  return soup

#open the file which contains the html
#this step can be replaced with reading directly from the url
#however, I think it is always better to store the html in
#  local storage for any later processing.
with open("techcrunch.html") as infile:
  html = infile.read()
soup = strip_tags(html)
content = str(soup.prettify())

#write the stripped content into another file.
with open("tech.txt","w") as outfile:
  outfile.write(content)
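
To see what the unwanted() check does before adding your own names, it can be tried on its own. The following is a small standalone sketch (Python 3, no BeautifulSoup needed); the optional names parameter and the extra_names list are additions for this example and are not part of the listing above.

```python
# same names as in the listing above
unwanted_tags = ["tags", "breadcrumbs", "disqus", "boxy", "popular", "recent",
                 "feature_title", "logo", "leaderboard", "widget", "neighbor",
                 "dsq", "announcement", "button", "more", "categories",
                 "blogroll", "cloud", "related", "tab"]

# the extra names suggested in the HOWTO (kept separate here for illustration)
extra_names = ["print", "popup", "tools", "socialtools"]


def unwanted(value, names=unwanted_tags):
    """True if any of the names occurs as a substring of the class/id value."""
    return any(name in value for name in names)


print(unwanted("sidebar-widget"))   # True: contains "widget"
print(unwanted("article-body"))     # False: no listed name occurs in it
print(unwanted("data-table"))       # True: "tab" is a substring of "table"
print(unwanted("popup-overlay"))    # False with the original list
print(unwanted("popup-overlay", unwanted_tags + extra_names))  # True
```

Because the check is a plain substring match, short names such as "tab" or "more" also hit values like "data-table", so it is worth testing a new name against a few real class/id values before adding it.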

If the formatting is screwed up, then you can access the code here or here.

2 comments:

Anonymous said...: You've got an interesting blog. I was looking for someone with a keen interest in predictive analytics and landed here.

I'd like to pick your brain on a couple of topics.

Could you drop a note at ashwin@ashwinramasamy.com, please?

Thanks!
Ashwin; January 13, 2011 2:08 PM
michaelwaung said...: Thanks for sharing this informative blog. Such a useful Blog. I hope to keep sharing this type of blog.

Website Meta Tag Extractor; June 18, 2019 6:26 PM

January 03, 2011
