Section 23.4. The BeautifulSoup Extension

23.4. The BeautifulSoup Extension

BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/) lets you parse HTML that may be badly formed and uses simple heuristics to compensate for likely HTML brokenness (it succeeds in this difficult task with surprisingly good frequency). Module BeautifulSoup supplies a class, also named BeautifulSoup, which you instantiate with either a file-like object (which is read to give the HTML text to parse) or a string (which is the text to parse). The module also supplies other classes (BeautifulStoneSoup and ICantBelieveItsBeautifulSoup) that are quite similar, but suitable for slightly different XML parsing tasks. An instance b of class BeautifulSoup supplies many attributes and methods to ease the task of searching for information in the parsed HTML input, returning instances of classes Tag and NavigableText, which in turn let you keep navigating or dig for more information.

23.4.1. Parsing HTML with BeautifulSoup

The following example uses BeautifulSoup to perform the same task as previous examples: fetch a page from the Web with urllib, parse it, and output the hyperlinks:

import urllib, urlparse, BeautifulSoup

f = urllib.urlopen('http://www.python.org/index.html')
b = BeautifulSoup.BeautifulSoup(f)

seen = set( )
for anchor in b.fetch('a'):
    url = anchor.get('href')
    if url is None or url in seen: continue
    seen.add(url)
    pieces = urlparse.urlparse(url)
    if pieces[0]=='http':
        print urlparse.urlunparse(pieces)

The example calls the fetch method of class BeautifulSoup.BeautifulSoup to obtain all instances of a certain tag (here, tag '<a>'), then the get method of instances of class Tag to obtain the value of an attribute (here, 'href'), or None when that attribute is missing. The logic to analyze and emit the target URLs of outgoing hyperlinks is just the same as in previous examples.