Using the Python HTMLParser library
When writing a script to download files off a site, I figured there would be an easy Python library for that. Well, sort of. I chose to use the HTMLParser library. The documentation is not the best, so I thought I would add a bit of what I found. If I had to do it again, I might just use regular expressions to do it all.
First, if you can find all the information you need in the tag itself, life is a lot easier. If not, you need a way to know where you are in the document. For that, I suggest keeping a list in the class and using it as a stack: append each tag at the start of the handle_starttag() function, and take it off again in the handle_endtag() function. This way, when handle_data() is called, you know which tag you are inside.
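For the easy case, where everything you need is in the tag's attributes, no stack is needed at all. The following is a minimal sketch of that case, not the template from this post: the LinkCollector class and the sample HTML string are made up for illustration, and the import is written so that it should work with either the Python 2 module name (HTMLParser) or the Python 3 one (html.parser).

```python
try:
    from html.parser import HTMLParser  # Python 3 module name
except ImportError:
    from HTMLParser import HTMLParser  # Python 2 module name


class LinkCollector(HTMLParser):
    """Hypothetical example class: collects href values from <a> tags."""

    def __init__(self):
        # Calling the base class directly works on both Python 2 and 3
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for this tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


# Made-up sample markup, for illustration only
sample = '<p>See <a href="/a.zip">A</a> and <a class="x" href="/b.zip">B</a>.</p>'
collector = LinkCollector()
collector.feed(sample)
print(collector.links)  # prints ['/a.zip', '/b.zip']
```

Because only handle_starttag() is overridden, the text between tags is simply ignored here.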
Below is an example template you can adapt for your own use.
from HTMLParser import HTMLParser  # Python 2 name; in Python 3 the module is html.parser

class MyParse(HTMLParser):
    def __init__(self):
        # super() does not work for this class (it is an old-style class in Python 2)
        HTMLParser.__init__(self)
        self.tag_stack = []   # one [tag] entry per open tag; a link becomes [tag, href]
        self.attr_stack = []  # not used in this template

    def handle_endtag(self, tag):
        # take the tag off the stack if it matches the next close tag
        # if you are expecting unmatched tags, then this needs to be more robust
        if self.tag_stack and self.tag_stack[-1][0] == tag:
            self.tag_stack.pop()

    def handle_data(self, data):
        # 'data' is the text between tags, not necessarily between a matching pair
        if not self.tag_stack:
            return  # text outside of any open tag
        # this gives you the entry for the last tag that was opened
        tstack = self.tag_stack[-1]
        # do something with the text

    def handle_starttag(self, tag, attrs):
        # add tag to the stack
        self.tag_stack.append([tag])
        # if this tag is a link
        if tag == "a":
            # attrs is a list of (name, value) pairs, so compare against the name
            hrefs = [value for name, value in attrs if name == "href"]
            # did we find any hyperlinks
            if hrefs:
                # append the hyperlink to the last item in the stack
                # note, this does not increase the length of the stack
                # as we are putting it inside the last item on the stack
                self.tag_stack[-1].append(hrefs[0])
                # now we can do what we need with the hyperlink
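The template's comment about unmatched tags matters on real pages, since tags like br and img never get a close tag and would otherwise sit on the stack forever. One possible way to make the stack more tolerant is sketched below. This is an assumption about what "more robust" could look like, not what I used: the TolerantParse class, the VOID_TAGS set (a list of common never-closed elements) and the text_by_tag list are all invented for this example.

```python
try:
    from html.parser import HTMLParser  # Python 3 module name
except ImportError:
    from HTMLParser import HTMLParser  # Python 2 module name

# Common elements that never get a closing tag, so they stay off the stack
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TolerantParse(HTMLParser):
    """Hypothetical variant of the template that survives unmatched tags."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.tag_stack = []
        self.text_by_tag = []  # (innermost open tag, text) pairs, for illustration

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.tag_stack.append(tag)

    def handle_endtag(self, tag):
        # Ignore a close tag that was never opened
        if tag not in self.tag_stack:
            return
        # Pop everything above the nearest matching open tag, then the tag itself
        while self.tag_stack.pop() != tag:
            pass

    def handle_data(self, data):
        text = data.strip()
        if self.tag_stack and text:
            self.text_by_tag.append((self.tag_stack[-1], text))


# Made-up markup with an unclosed <p> and <li>, a <br>, and a stray </b>
p = TolerantParse()
p.feed("<div><p>one<br>two<li>three</div><span>four</span></b>")
print(p.text_by_tag)
print(p.tag_stack)  # prints [] because every open tag was eventually unwound
```

The trade-off is that a stray close tag for an outer element silently closes everything inside it, which is roughly what browsers do but may hide markup errors you care about.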
How I would use this to go through a webpage (assuming MyParse is in the same file):
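import httplib  # Python 2 name; in Python 3 the module is http.client

site = "curioussystem.com"
file_loc = r"/index.php"
conn = httplib.HTTPConnection(site)
# send the request before asking for the response
conn.request("GET", file_loc)
r1 = conn.getresponse()
# copy response to a variable because reading clears it
data = r1.read()
conn.close()
t = MyParse()
t.feed(data)  # where the action happens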
One other note: when I actually had to download something, I used the subprocess module to call wget to do the actual downloading. Doing it in Python was too much work for what I wanted.
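A rough sketch of what that wget call can look like is below. It is not the exact code I ran: the build_wget_command() and download() helpers, the example.com URL and the downloads directory are all hypothetical, and it assumes wget is installed and on your PATH. The only wget option used is -P, which sets the directory to save into.

```python
import subprocess


def build_wget_command(url, dest_dir="."):
    # -P tells wget which directory to save the file into
    # A list of arguments avoids any shell quoting problems with odd URLs
    return ["wget", "-P", dest_dir, url]


def download(url, dest_dir="."):
    # check_call raises CalledProcessError if wget exits with a non-zero status
    subprocess.check_call(build_wget_command(url, dest_dir))


# Hypothetical URL and directory, for illustration only
cmd = build_wget_command("http://example.com/files/archive.zip", "downloads")
print(" ".join(cmd))  # prints wget -P downloads http://example.com/files/archive.zip
```

Calling download() with a link found by the parser then hands the actual transfer to wget.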