Several times a week, I come across a task that is worth automating because doing so saves me a considerable amount of time. The following is one example. I wanted to share the thinking behind the task, and the code, with you. Improvements and suggestions are welcome; submit them as comments on this article.
This particular task took about an hour to complete, including some research, coming up with a plan of attack, and the implementation.
Objective
A List Apart is an ezine published by a small consortium of web design experts who pride themselves on their focus on typography, usability, simplicity, and elegance. I have been a fan for many years, but my busy life has caused me to miss a considerable number of articles and even entire issues, so I recently decided to catch up on everything I’ve missed. They kindly provide an archive of all past issues, but not a PDF version of them. I needed a simple way to retrieve every issue and convert its articles to PDF in a printer-friendly layout (not the screen layout). A script makes it easy to print the articles in bulk, and potentially to join them into an eBook for reading on an Amazon Kindle 2, for example. Automating this is a much better use of my time than opening a thousand articles and clicking “Print” on each one (more likely, I’d also have to click “Save as PDF…”, find the new file a home on my hard drive, and name it).
General Process
A basic high-level process to make this happen might look something like:
- Retrieve a list of all issues.
- Retrieve a list of all articles for each issue. (If possible, combine both steps into one)
- Fetch the contents of each article in print format.
- Convert each article to PDF format.
Basic Design
The plan of attack is to design a class dedicated to communicating with the A List Apart website. This class fetches the list of issues and articles. Next, some simple code uses the class to retrieve the necessary information, ultimately returning issue numbers, article titles, and the URLs of the articles. Lastly, each URL is passed to PDFerator, a class built by pete@notahat.com that uses OS X frameworks to simulate printing the article in a hidden web browser.
Code
Prerequisites: To run this code, you MUST be running OS X 10.5 on an Apple Mac, because the script relies on the Print to PDF features.
AListApartFetcher Class
The general strategy for this class is to encapsulate all of the code needed to communicate with the AListApart.com website. Since the site does not offer a web services API, I used Why’s Hpricot to screen-scrape the information I needed. To work out the selectors that Hpricot needs to find the elements I was looking for, I used SelectorGadget to analyze the AListApart.com pages.
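Before the full class, here is a minimal sketch of the extraction idea on its own. It is not the scraper itself: REXML from Ruby's standard distribution stands in for Hpricot so the snippet runs without extra gems, the sample markup and its URLs are made up for illustration, and only the "ishno" class name comes from the real code. REXML also requires well-formed markup, which real pages often are not, so treat this purely as an illustration of pulling names and links out of a page.

```ruby
require 'rexml/document'

# Hypothetical, well-formed sample markup. Only the "ishno" class name
# is taken from the scraper; the paths are invented.
SAMPLE_PAGE = <<-HTML
<div id="content">
  <a class="ishno" href="/issues/281">Issue 281</a>
  <a class="other" href="/about">About</a>
  <a class="ishno" href="/issues/280">Issue 280</a>
</div>
HTML

# Return one { :name, :url } hash per issue link found in the markup.
def extract_issue_links(html)
  doc = REXML::Document.new(html)
  REXML::XPath.match(doc, "//a[@class='ishno']").map do |a|
    { :name => a.text, :url => a.attributes['href'] }
  end
end

extract_issue_links(SAMPLE_PAGE).each do |issue|
  puts "#{issue[:name]} => #{issue[:url]}"
end
```

The hashes have the same shape (:name and :url) that the class builds for each issue and article.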
#!/usr/bin/ruby
require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'yaml' # needed by load_yaml and save_yaml
class AListApartFetcher
attr_accessor :all_issue_urls, :all_article_urls
attr_accessor :base_url, :main_url, :yaml_filename
attr_accessor :page_count, :issue_count, :article_count
attr_accessor :debug
def initialize
@debug = true
@base_url = "http://alistapart.com"
@main_url = "http://alistapart.com/articles"
@yaml_filename = "alistapart.yaml"
end
# Retrieve a page of Issue URLs
def retrieve_page(page_url)
issue_urls = []
count = 0
doc = Hpricot(open(page_url))
(doc/".ishno").each do |a|
count += 1
name = a.inner_text
url = a.attributes["href"]
puts " - #{name} => #{url}"
article_urls = retrieve_issue("#{@base_url}#{url}")
issue_urls << { :name => name, :url => url, :article_urls => article_urls }
end
issue_urls
end
# Retrieve a page of Article URLs for a given Issue
def retrieve_issue(issue_url)
article_urls = []
count = 0
doc = Hpricot(open(issue_url))
doc = (doc/"#content")
(doc/".title/a").each do |a|
count += 1
name = a.inner_text
url = a.attributes["href"]
puts " - #{name} => #{url}"
article_urls << { :name => name, :url => url }
end
article_urls
end
def retrieve_remote
puts "Loading data from remote."
run = true
@page_count = 0
all_issue_urls = []
# Loop until all pages retrieved
while run
# 0 is truthy in Ruby, so compare explicitly to request the main URL for the first page.
page_url = @page_count > 0 ? "#{@main_url}?page=#{@page_count + 1}" : @main_url
puts "Retrieving page #{@page_count + 1}: #{page_url}"
issue_urls = retrieve_page(page_url)
if issue_urls.size > 0
fetched_end = false
issue_urls.each do |i|
if i[:name] == "Issue 8"
fetched_end = true
end
end
all_issue_urls.concat issue_urls
@page_count += 1
if fetched_end
run = false
end
else
run = false
end
end
# Store the result so that save_yaml has data to write, then return it.
@all_issue_urls = all_issue_urls
end
def load_yaml
puts "Loading data from YAML file."
@all_issue_urls = YAML.load_file(@yaml_filename)
@article_urls_count = @all_issue_urls.inject(0) {|sum, n| sum + (n[:article_urls].size) }
puts "Loaded #{@all_issue_urls.size} issue URLs."
puts "Loaded #{@article_urls_count} article URLs."
end
def save_yaml
puts "Saving data to YAML file."
File.open(@yaml_filename, 'w') do |out|
YAML.dump(@all_issue_urls, out )
end
end
def self.clean_string(string)
chars_to_replace = { ' ' => '_', '?' => '', "'" => '', '(' => '', ')' => '', '&' => '', ';' => '', '/' => '', '!' => '' }
string.split(//).map { |c| chars_to_replace.has_key?(c) ? chars_to_replace[c] : c }.join
end
def self.create_article_filename(issue_name, article_name, extension)
issue_name = clean_string(issue_name)
article_name = clean_string(article_name)
"#{issue_name}-#{article_name}.#{extension}"
end
end
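The filename helpers above remove a fixed list of characters, so anything outside that list (colons, commas, curly quotes) passes straight through into the filename. A stricter alternative is to whitelist what is allowed instead. The sketch below is only a suggestion; safe_filename_part and safe_article_filename are hypothetical names that are not part of the class.

```ruby
# Hypothetical whitelist-based variant of clean_string: collapse runs of
# whitespace into underscores, then drop everything that is not an ASCII
# letter, a digit, an underscore, a dot, or a hyphen.
def safe_filename_part(string)
  string.gsub(/\s+/, '_').gsub(/[^A-Za-z0-9_.-]/, '')
end

# Same "<issue>-<article>.<extension>" layout as create_article_filename.
def safe_article_filename(issue_name, article_name, extension)
  "#{safe_filename_part(issue_name)}-#{safe_filename_part(article_name)}.#{extension}"
end

puts safe_article_filename("Issue 102", "This Web Business III: Selecting Professionals", "pdf")
# prints "Issue_102-This_Web_Business_III_Selecting_Professionals.pdf"
```

The trade-off is that non-ASCII letters in titles are dropped as well, which may or may not be acceptable for your archive.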
Main Loop
This code was written to live in the same Ruby file as the AListApartFetcher class (i.e. alistapart.rb), but it could live anywhere. Just be sure to require the necessary gems and the class first. For the sake of brevity, I did not provide a command-line toggle for whether the script fetches the issue/article list remotely or loads it from the local YAML file that is written after each remote run. You can change it by setting “remote” to “false”.
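If you do want a command-line toggle, one possible sketch uses OptionParser from Ruby's standard library. The --[no-]remote flag and the parse_options helper are assumptions made for illustration; the script as published has neither.

```ruby
require 'optparse'

# Parse a hypothetical --[no-]remote flag. The default is true, matching
# the hard-coded `remote = true` in the main code.
def parse_options(argv)
  options = { :remote => true }
  parser = OptionParser.new do |opts|
    opts.banner = "Usage: alistapart.rb [--[no-]remote]"
    opts.on("--[no-]remote", "Fetch the list from the site, or reuse the YAML file") do |value|
      options[:remote] = value
    end
  end
  parser.parse!(argv.dup) # work on a copy so the caller's array is untouched
  options
end

p parse_options([])              # remote fetch by default
p parse_options(["--no-remote"]) # reuse the YAML file
```

In the script, `remote = parse_options(ARGV)[:remote]` would then replace the hard-coded assignment.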
# MAIN CODE
remote = true
all_issue_urls = []
fetcher = AListApartFetcher.new
if remote
all_issue_urls = fetcher.retrieve_remote
article_urls_count = all_issue_urls.inject(0) {|sum, n| sum + (n[:article_urls].size) }
puts "Retrieved #{fetcher.page_count} pages."
puts "Retrieved #{all_issue_urls.size} issue URLs."
puts "Retrieved #{article_urls_count} article URLs."
fetcher.save_yaml
else
fetcher.load_yaml
all_issue_urls = fetcher.all_issue_urls
end
puts "Article Filenames"
all_issue_urls.each do |issue|
issue[:article_urls].each do |article|
filename = AListApartFetcher.create_article_filename(issue[:name], article[:name], 'pdf')
url = "http://alistapart.com#{article[:url]}"
puts "Generating #{filename} from #{url}"
# Pass the arguments separately so the shell never interprets characters in the filename.
system("./pdf.rb", url, filename)
end
end
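The remote/YAML switch in the main code could also be decided automatically: use the YAML file when it exists and fetch remotely only when it does not. Below is a minimal sketch of that idea, independent of the fetcher class. load_or_fetch is a hypothetical helper, the block stands in for the remote fetch, the article URL is invented, and string keys are used (rather than the symbols in the real data) so that the file stays loadable on newer Ruby versions whose YAML loader rejects symbols by default.

```ruby
require 'yaml'
require 'tmpdir'

# Return the cached data when the YAML file exists; otherwise run the
# block (standing in for a remote fetch), cache its result, and return it.
def load_or_fetch(cache_path)
  return YAML.load_file(cache_path) if File.exist?(cache_path)
  data = yield
  File.open(cache_path, 'w') { |out| YAML.dump(data, out) }
  data
end

Dir.mktmpdir do |dir|
  cache = File.join(dir, "alistapart.yaml")
  fetches = 0
  2.times do
    issues = load_or_fetch(cache) do
      fetches += 1
      # Sample data in the same nesting as the script's issue list.
      [{ "name" => "Issue 100",
         "article_urls" => [{ "name" => "Back to Basics", "url" => "/articles/example" }] }]
    end
    puts "#{issues.size} issue(s) loaded, remote fetches so far: #{fetches}"
  end
end
```

The second pass reads the file written by the first, so the block runs only once. Deleting the YAML file forces a fresh fetch.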
PDFerator Class
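This class was written by pete@notahat.com. You can get the original version from http://svn.bustikated.net/snap/browser/trunk/lib/pdferator.rb. The main loop invokes it as ./pdf.rb, so save the code below under that name next to the main script and make it executable.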
#!/usr/bin/env ruby
# pete@notahat.com
require 'osx/cocoa'
OSX.require_framework 'WebKit'
class PDFerator < OSX::NSObject
def init
initWithWidth(950)
end
def initWithWidth(width)
# This sets up some context that we need for creating windows.
OSX::NSApplication.sharedApplication
# Create an offscreen window into which we can stick our WebView.
# The height is zero because we'll resize to fit the document later.
@window = OSX::NSWindow.alloc.initWithContentRect_styleMask_backing_defer(
[0, 0, width, 0], OSX::NSBorderlessWindowMask, OSX::NSBackingStoreBuffered, false
)
# Create a WebView and stick it in our offscreen window.
@webView = OSX::WebView.alloc.initWithFrame([0, 0, width, 0])
@window.setContentView(@webView)
# Use the print stylesheet, rather than the screen one.
# (Switch to 'screen' to capture the on-screen layout instead.)
@webView.setMediaStyle('print')
# Make sure we don't save any of the prefs that we change.
@webView.preferences.setAutosaves(false)
# Set some useful options.
@webView.preferences.setShouldPrintBackgrounds(true)
@webView.preferences.setJavaScriptCanOpenWindowsAutomatically(false)
@webView.preferences.setAllowsAnimatedImages(false)
# Make sure we don't get a scroll bar.
@webView.mainFrame.frameView.setAllowsScrolling(false)
self
end
def fetch(url)
# This sets up the webView_* methods to be called when loading finishes.
@webView.setFrameLoadDelegate(self)
# Tell the webView what URL to load.
@webView.setValue_forKey(url, 'mainFrameURL')
# Pass control to Cocoa for a bit.
OSX.CFRunLoopRun
@succeeded
end
attr_reader :error
def webView_didFinishLoadForFrame(view, frame)
@succeeded = true
# Resize the view to fit the page.
@docView = @webView.mainFrame.frameView.documentView
@docView.window.setContentSize(@docView.bounds.size)
@docView.setFrame(@docView.bounds)
# Return control to the fetch method.
OSX.CFRunLoopStop(OSX.CFRunLoopGetCurrent)
end
def webView_didFailLoadWithError_forFrame(webview, error, frame)
@error = error
@succeeded = false
# Return control to the fetch method.
OSX.CFRunLoopStop(OSX.CFRunLoopGetCurrent)
end
def webView_didFailProvisionalLoadWithError_forFrame(webview, error, frame)
@error = error
@succeeded = false
# Return control to the fetch method.
OSX.CFRunLoopStop(OSX.CFRunLoopGetCurrent)
end
def save(filename, options = {})
if options[:paginated]
save_paginated(filename, options)
else
@docView.dataWithPDFInsideRect(@docView.bounds).writeToFile_atomically(filename, true)
end
end
private
def save_paginated(filename, options = {})
# To generate paginated PDF, we create a print job and set it to save
# the results to a file.
printInfo = OSX::NSPrintInfo.alloc.initWithDictionary(
OSX::NSPrintJobDisposition => OSX::NSPrintSaveJob,
OSX::NSPrintSavePath => filename
)
printInfo.setHorizontalPagination OSX::NSAutoPagination
printInfo.setVerticalPagination OSX::NSAutoPagination
printInfo.setVerticallyCentered false
if options.has_key?(:margin)
printInfo.setTopMargin options[:margin]
printInfo.setRightMargin options[:margin]
printInfo.setBottomMargin options[:margin]
printInfo.setLeftMargin options[:margin]
end
# Create a print operation to write out the PDF.
printOp = OSX::NSPrintOperation.printOperationWithView_printInfo(@docView, printInfo)
# Make sure we don't display the page setup and print dialogs.
printOp.setShowPanels(false)
# Do the printing!
printOp.runOperation()
end
end
pdferator = PDFerator.alloc.init
if pdferator.fetch(ARGV[0])
pdferator.save(ARGV[1], :paginated => true)
else
print "Error: #{pdferator.error}\n"
end
Output
As the script runs, you will be presented with some basic debugging output to give you an idea of where things are. After a considerable amount of time, you will have a directory full of PDFs.
-rw-r--r-- 1 kelliott 2098 129084 Apr 8 15:08 Issue_100-Back_to_Basics.pdf
-rw-r--r-- 1 kelliott 2098 67470 Apr 8 15:08 Issue_100-Web_Designer_and_Proud_of_It.pdf
-rw-r--r-- 1 kelliott 2098 55132 Apr 8 15:08 Issue_101-How_to_be_Soopa_Famous.pdf
-rw-r--r-- 1 kelliott 2098 98592 Apr 8 15:08 Issue_101-SMIL_When_You_Play_That.pdf
-rw-r--r-- 1 kelliott 2098 103323 Apr 8 15:08 Issue_102-The_Declination_of_Independence.pdf
-rw-r--r-- 1 kelliott 2098 74346 Apr 8 15:07 Issue_102-This_Web_Business_III:_Selecting_Professionals.pdf
-rw-r--r-- 1 kelliott 2098 82872 Apr 8 15:07 Issue_103-A_Failure_to_Communicate.pdf
-rw-r--r-- 1 kelliott 2098 109797 Apr 8 15:07 Issue_104-Down_By_Law.pdf
-rw-r--r-- 1 kelliott 2098 77894 Apr 8 15:07 Issue_104-Flash’s_Got_a_Brand_New_Bag.pdf
-rw-r--r-- 1 kelliott 2098 134241 Apr 8 15:07 Issue_105-The_Road_to_Dystopia.pdf
-rw-r--r-- 1 kelliott 2098 109874 Apr 8 15:07 Issue_106-Beyond_Usability_and_Design:_The_Narrative_Web.pdf
-rw-r--r-- 1 kelliott 2098 115614 Apr 8 15:07 Issue_107-“Forgiving”_Browsers_Considered_Harmful.pdf
-rw-r--r-- 1 kelliott 2098 86811 Apr 8 15:07 Issue_109-CSS_Design:_Size_Matters.pdf
-rw-r--r-- 1 kelliott 2098 71067 Apr 8 15:07 Issue_112-The_Devil_His_Due:_What_Online_Porn_Portends.pdf
-rw-r--r-- 1 kelliott 2098 105460 Apr 8 15:07 Issue_113-Game_Design_in_Flash_5,_Part_II:_Heroes_and_Villains.pdf
-rw-r--r-- 1 kelliott 2098 72844 Apr 8 15:07 Issue_114-Cheaper_Over_Better:_Why_Web_Clients_Settle_for_Less.pdf
-rw-r--r-- 1 kelliott 2098 70946 Apr 8 15:07 Issue_114-The_Client_Did_It:_A_WWW_Whodunit.pdf
-rw-r--r-- 1 kelliott 2098 142411 Apr 8 15:07 Issue_115-All_the_Access_Money_Can_Buy.pdf
-rw-r--r-- 1 kelliott 2098 196737 Apr 8 15:07 Issue_115-Much_Ado_About_Smart_Tags.pdf
-rw-r--r-- 1 kelliott 2098 75678 Apr 8 15:07 Issue_116-CSS_Talking_Points:_Selling_Clients_on_Web_Standards.pdf
-rw-r--r-- 1 kelliott 2098 65023 Apr 8 15:07 Issue_116-Nipping_Client_Silliness_in_the_Bud.pdf
-rw-r--r-- 1 kelliott 2098 99482 Apr 8 15:07 Issue_117-Kick_ASP_Design:_ASP_for_Non-Programmers.pdf
-rw-r--r-- 1 kelliott 2098 66008 Apr 8 15:07 Issue_118-Process,_Methodology,_Life_Cycle,_Oh_My.pdf
-rw-r--r-- 1 kelliott 2098 90553 Apr 8 15:07 Issue_119-Global_Treaty_Could_Transform_the_Web.pdf
-rw-r--r-- 1 kelliott 2098 133381 Apr 8 15:07 Issue_119-Practical_CSS_Layout_Tips,_Tricks,_and_Techniques.pdf
-rw-r--r-- 1 kelliott 2098 91600 Apr 8 15:07 Issue_120-Build_a_“Send_to_Friend”_Page.pdf
-rw-r--r-- 1 kelliott 2098 84174 Apr 8 15:07 Issue_120-Evolving_Client_Content.pdf
...etc...
So that you can see an example of the results, I’ve uploaded two of these PDFs. Please note that they are copyright of A List Apart Magazine and its authors.
The outcome is a set of beautiful PDFs without the artifacts that most web-based layouts produce when printed, because we are taking advantage of the print stylesheet that A List Apart was kind enough to create.
Download
Please feel free to modify and distribute this code as you’d like. If you do use it, I only ask that you ping this posting with a comment and make a reference in your code to this article, so that others can benefit.




