Using the Python HTMLParser library

When I was writing a script to download files off a site, I figured there would be an easy Python library for the job. Well, sort of. I chose the HTMLParser library. The documentation is not the best, so I thought I would add a bit of what I found. If I had to do it again, I might just use regular expressions for all of it.

First, if all the information you need is in a tag's attributes, life is a lot easier. If not, you need a way to know where you are in the document. I suggest keeping a list in the class that works as a stack: append each tag to it at the start of handle_starttag(), and take the tag off again in handle_endtag(). That way, when handle_data() is called, the last item on the stack tells you which tag the text belongs to.
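
For the easy case, where everything you need is in the tag's attributes, no stack is required. Here is a minimal sketch of that case. It assumes Python 3, where the module is named html.parser, and the class name LinkCollector and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags (illustrative name)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


collector = LinkCollector()
collector.feed('<p><a href="/a.zip">A</a> and <a name="x">no link</a> '
               '<a href="/b.zip">B</a></p>')
collector.close()
print(collector.links)
```

Because the href is available right in handle_starttag(), there is no need to remember anything between callbacks.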

Below is an example template you can adapt.

 

import HTMLParser
class MyParse(HTMLParser.HTMLParser):
    def __init__(self):
        #HTMLParser is an old-style class in Python 2, so super() does not work here
        HTMLParser.HTMLParser.__init__(self)
        self.tag_stack = []
        self.attr_stack = []

    def handle_endtag(self, tag):
        #take the tag off the stack if it matches the next close tag
        #if you are expecting unmatched tags, then this needs to be more robust
        #the stack can be empty if a close tag shows up with no open tag
        if self.tag_stack and self.tag_stack[-1][0] == tag:
            self.tag_stack.pop()

    def handle_data(self, data):
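        #'data' is the text between two tags, which are not
        #necessarily a matching open/close pair
        if not self.tag_stack:
            #text outside of any tag (e.g. leading whitespace)
            return
        #this gives you the [tag, attrs] entry for the last open tag
        tstack = self.tag_stack[-1]
        #do something with the text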
           
    def handle_starttag(self, tag, attrs):
        #add tag to the stack
        self.tag_stack.append([tag, attrs])
        #if this tag is a link
        if tag == "a":
            #find the position of the href attribute, if there is one
            attr_loc = None
            for i, (name, value) in enumerate(attrs):
                if name == 'href':
                    attr_loc = i
                    break
            if attr_loc is not None:
                #append the position of the hyperlink to the last item on the stack
                #note, this does not increase the length of the stack
                #as we are putting it inside the last item on the stack
                self.tag_stack[-1].append(attr_loc)
               
                #now we can do what we need with the hyperlink
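
The comment in handle_endtag() notes that unmatched tags need more care. One possible way to handle them is sketched here: skip elements that never get a closing tag, and on a close tag pop back to the nearest matching open tag. This assumes Python 3 (html.parser), and ContextParser, VOID_TAGS and link_texts are names invented for the example.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; a partial list, extend as needed.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ContextParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tag_stack = []    # [tag, attrs] entries for the currently open tags
        self.link_texts = []   # (href, text) pairs found inside <a> tags

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.tag_stack.append([tag, attrs])

    def handle_endtag(self, tag):
        # Search from the top of the stack for the matching open tag and
        # drop it along with anything left unclosed above it.
        for i in range(len(self.tag_stack) - 1, -1, -1):
            if self.tag_stack[i][0] == tag:
                del self.tag_stack[i:]
                break
        # A close tag with no matching open tag is simply ignored.

    def handle_data(self, data):
        if not self.tag_stack or not data.strip():
            return
        tag, attrs = self.tag_stack[-1]
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.link_texts.append((href, data.strip()))


parser = ContextParser()
parser.feed('<div><p>intro<br><a href="/f.zip">file <img src="i.png"></a>'
            '</div></span>')
parser.close()
print(parser.link_texts)
print(parser.tag_stack)
```

The sample input leaves the p tag unclosed and has a stray closing span; the stack still ends up empty because the closing div removes everything above it.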

 
How I would use this to go through a webpage (assuming MyParse is in the same file):
 

if __name__=="__main__":
    import httplib
    site = "curioussystem.com"
    file_loc = r"/index.php"
    conn = httplib.HTTPConnection(site)
    conn.request("GET", file_loc)
    r1 = conn.getresponse()
    #read the whole body once; a second read() returns nothing
    data = r1.read()
    conn.close()
    t = MyParse()
    t.feed(data)     #where the action happens
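
On Python 3 the same idea needs a few changes: httplib is named http.client, read() returns bytes, and feed() expects a str, so the body has to be decoded first. The following is a rough sketch under those assumptions. TagCounter (a stand-in for MyParse), decode_body, fetch_page, count_tags and the UTF-8 fallback are illustrative choices, not part of the script above.

```python
import http.client
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Stand-in for MyParse: counts start tags by name."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


def decode_body(raw, charset=None):
    # Fall back to UTF-8 when no charset is declared (an assumption);
    # errors="replace" keeps odd bytes from aborting the parse.
    return raw.decode(charset or "utf-8", errors="replace")


def fetch_page(host, path):
    # Performs a real network request when called.
    conn = http.client.HTTPConnection(host, timeout=10)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        raw = response.read()
        charset = response.headers.get_content_charset()
    finally:
        conn.close()
    return decode_body(raw, charset)


def count_tags(html):
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


# Offline check with a literal page; for a live site the call would be
# count_tags(fetch_page("curioussystem.com", "/index.php")).
sample = decode_body(b"<html><body><a href='/x'>x</a><a>y</a></body></html>")
print(count_tags(sample))
```

Keeping the network call in its own function means the parsing part can be tried on a saved page or a literal string without touching the site.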

 
One other note: when I actually had to download something, I used the subprocess module to call wget. Doing the download itself in Python was too much work for what I wanted.
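
Handing the download to wget might look something like the following. This is a sketch rather than the script I actually ran: the helper names, the example URL and the choice of the -q and -P flags are assumptions, and it requires wget to be on the PATH.

```python
import subprocess


def build_wget_command(url, dest_dir="."):
    # -q silences progress output; -P sets the directory files are saved in.
    return ["wget", "-q", "-P", dest_dir, url]


def download(url, dest_dir="."):
    # Passing a list (no shell=True) avoids quoting problems with odd URLs.
    # check_call raises CalledProcessError if wget exits with a non-zero status.
    subprocess.check_call(build_wget_command(url, dest_dir))


print(build_wget_command("http://example.com/files/a.zip", "downloads"))
```

Calling download() on each link found by the parser would then fetch the files one at a time.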
