Using the Python HTMLParser library
When writing a script to download files off a site, I figured there was an easy python library to do that. Well, sort of. I chose to use the HTMLParser library. The documentation is not the best, so I thought I would add a bit of what I found. If I had to do it again, I might just use regular expressions to do it all.
First, if you can find all your information in the tag, that makes life a lot easier. If not, you have to create a way to know where you are in a document. To do this, I suggest creating a list in the class that you append all tags to so that you know what the last tag was. Do this at the start of the handle_starttag()
function. Take the tag off the stack in the handle_endtag()
function. This way when you have a call to handle_data()
you know where you are.
Below is an example template to use for your use.
class MyParse(HTMLParser.HTMLParser):
def __init__(self):
#super() does not work for this class
HTMLParser.HTMLParser.__init__(self)
self.tag_stack = []
self.attr_stack = []
def handle_endtag(self, tag):
#take the tag off the stack if it matches the next close tag
#if you are expecting unmatched tags, then this needs to be more robust
if self.tag_stack[len(self.tag_stack)-1][0] == tag:
self.tag_stack.pop()
def handle_data(self, data):
#'data' is the text between tags, not necessarily
#matching tags
#this gives you a link to the last tag
tstack = self.tag_stack[len(self.tag_stack)-1]
#do something with the text
def handle_starttag(self, tag, attrs):
#add tag to the stack
self.tag_stack.append([tag, attrs])
#if this tag is a link
if tag =="a":
#these next few lines find if there is a hyperlink in the tag
tloc = map(lambda x: 1 if x[0]=='href' else 0,attrs)
try:
#did we find any hyperlinks
attr_loc = tloc.index(1)
except:
pass
# attr_loc only exists if we found a hyperlink
if vars().has_key('attr_loc'):
#append to the last item in the stack the location of the hyperlink
#note, this does not increase the length of the stack
#as we are putting it inside the last item on the stack
self.tag_stack[len(self.tag_stack)-1].append(attr_loc)
#now we can do what we need with the hyperlink
How I would use this to go through a webpage (assuming MyParse is in the same file):
import httplib
site = "curioussystem.com"
file_loc = r"/index.php"
conn = httplib.HTTPConnection(site)
conn.request("GET", file_loc)
r1 = conn.getresponse()
#copy response to variable because reading clears it
data = r1.read()
t = MyParse()
t.feed(data) #where the action happens
One other note, when I actually had to download something, I used the subprocess
module to call wget
to do the actual downloading. It was too much work in python for what I wanted.