Concatenating IEEE E-Book PDFs
For those of you who are IEEE members, IEEE now offers some of its “classic” e-books as free downloads from IEEE Xplore. The only problem is that they come as multi-part PDF files that are not named in any rational fashion. So, if you have Linux with pdftk installed and a mass-download extension in your web browser (such as the great DownThemAll for Firefox), we can fix that.
Here’s what we do:
- Log into IEEE Xplore and navigate to the e-book of your choice (the page needs to have links to the component PDFs).
- Create a new directory to save the component PDFs into.
- Start downloading the component PDFs, keeping the default names.
- Do a “Save As” of the web page into the same directory.
- Once all files are downloaded, make sure they are valid.
  - Xplore quite often logs you out or decides you have downloaded enough. When this occurs, you end up saving the login page or a very small, invalid PDF instead of the real file.
  - To remedy this, delete the bad files, log out and back in (even if the site shows you are still logged in), and start the download again.
- Run the attached Python script with the location of the saved web page as the command-line argument, e.g.:
  - ./IEEEcat.py /path/to/saved/htmlfile.html
- You now have a combined PDF in the directory where the individual PDFs and the web page are.
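The validity check in the steps above can be automated. Here is a minimal sketch, assuming that a real component PDF starts with the standard `%PDF-` header and is larger than about 10 KB. The threshold and the helper names `is_valid_pdf` and `find_bad_pdfs` are my own assumptions, not part of the script below, so adjust them to taste.

```python
import os

# Assumed threshold: saved login pages and truncated downloads tend to be tiny.
MIN_SIZE = 10 * 1024


def is_valid_pdf(path, min_size=MIN_SIZE):
    """Return True if the file is big enough and starts with the PDF header."""
    if os.path.getsize(path) < min_size:
        return False
    with open(path, 'rb') as handle:
        return handle.read(5) == b'%PDF-'


def find_bad_pdfs(directory, min_size=MIN_SIZE):
    """Return the names of .pdf files in directory that fail the check, sorted."""
    bad = []
    for name in sorted(os.listdir(directory)):
        if name.lower().endswith('.pdf'):
            if not is_valid_pdf(os.path.join(directory, name), min_size):
                bad.append(name)
    return bad
```

Calling `find_bad_pdfs('/path/to/saved')` lists the files to delete and download again. A file that passes can still be damaged further in, so treat this as a quick filter only.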
Here’s the file:
IEEEcat
#Copyright 2010 Chad Kidder
#Licensed under the GPL 3.0 license
#This program is used for putting together
#e-books from IEEE Xplore.
#It takes as input the saved HTML page of the book,
#scrapes that for the title and the component PDF
#file names, and runs them through pdftk to create one
#big PDF.
import sys, re, subprocess

# sys.argv[0] is the script name, so at least two entries are needed
if len(sys.argv) < 2:
    print("Need the html file to work off of")
    sys.exit(1)

# Book title: first run of text between <h1> and the following <img
titlere = re.compile(r'<h1>.*?(\w[^\t\r\n]+).*?<img', re.M | re.S)
# Base file name of every link labelled "PDF"
urlre = re.compile(r'href=["\'].*?/(\w+?\.pdf).*?["\']>PDF</a>')
# Splits a path into directory (with trailing slash) and file name
pathre = re.compile(r'(.*/)([^/]+)')
# File name without its .htm/.html extension
fnmatch = re.compile(r'(.+?)\.html?$', re.I)
pcmd = ['/usr/bin/pdftk']
pecmd = ['cat', 'output']

for fn in sys.argv[1:]:
    # Copy the templates; += on the shared lists would leak into later files
    cmd = list(pcmd)
    ecmd = list(pecmd)
    ofile = open(fn)
    otext = ofile.read()
    ofile.close()
    otmatch = titlere.search(otext)
    omatch = urlre.findall(otext)
    pmatch = pathre.match(fn)
    if pmatch is None:
        tfn = fn
        bpath = ''
    else:
        bpath, tfn = pmatch.groups()
    if otmatch is None:
        # No title found: fall back to the HTML file name
        bname = fnmatch.match(tfn)
        if bname is None:
            ecmd += [bpath + tfn + '.pdf']
        else:
            ecmd += [bpath + bname.group(1) + '.pdf']
    else:
        # Replace characters that are unsafe in file names
        ecmd += [bpath + re.sub(r'[\\/*?:"<>\|]', '_', otmatch.group(1)) + '.pdf']
    # Component PDFs are expected next to the saved page
    ifiles = [bpath + name for name in omatch]
    subprocess.call(cmd + ifiles + ecmd)
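
To see what the script hands to pdftk, here is a small sketch that runs the same title and link patterns over a made-up page fragment. The HTML in `sample`, the book title and the chapter file names are all hypothetical; real Xplore markup may look different, in which case the patterns would need adjusting.

```python
import re

# Same patterns as in the script
titlere = re.compile(r'<h1>.*?(\w[^\t\r\n]+).*?<img', re.M | re.S)
urlre = re.compile(r'href=["\'].*?/(\w+?\.pdf).*?["\']>PDF</a>')

# Hypothetical page fragment, for illustration only
sample = '''<h1>
\tExample Book: Part "One"
\t<img src="cover.gif">
</h1>
<a href="/ebooks/0001/ch01.pdf?arnumber=1">PDF</a>
<a href="/ebooks/0001/ch02.pdf?arnumber=2">PDF</a>
'''

# Title becomes the output name, with unsafe characters replaced by "_"
title = titlere.search(sample).group(1)
safe_title = re.sub(r'[\\/*?:"<>\|]', '_', title)

# Component PDFs, in the order the links appear on the page
parts = urlre.findall(sample)

command = ['/usr/bin/pdftk'] + parts + ['cat', 'output', safe_title + '.pdf']
print(command)
# ['/usr/bin/pdftk', 'ch01.pdf', 'ch02.pdf', 'cat', 'output', 'Example Book_ Part _One_.pdf']
```

The chapters are concatenated in page order, so if the saved page lists them out of sequence, the combined PDF will be too.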