Script It: Convert A List Apart articles to PDFs

Several times a week, I come across a task that is worth automating because doing it by hand would cost me a considerable amount of time. This post is one example. I want to share both the thinking behind the task and the code with you. Improvements and suggestions are welcome; submit them as comments on this article.

This particular task took about an hour to complete, including some research, coming up with a plan of attack, and implementing it.

Objective

A List Apart is an ezine published by a small consortium of web design experts who pride themselves on focusing on typography, usability, simplicity, and elegance. I have been a fan of their ezine for many years, but my busy life means I have missed a considerable number of articles and even entire issues, and I recently decided it would benefit me greatly to catch up. They kindly provide an archive of all past issues, but they do not offer PDF versions. I needed a simple way to retrieve every issue and convert its articles to PDF in a printer-friendly layout (not the screen layout). A script makes it easy to print the articles in bulk, and potentially to join them into an eBook for reading on, for example, an Amazon Kindle 2. Automating this is a much better use of my time than opening a thousand articles and clicking “Print” on each (more likely, I’d also have to click “Save as PDF…”, find the new file a home on my hard drive, and name it).

General Process

A basic high-level process to make this happen might look something like:

  1. Retrieve a list of all issues.
  2. Retrieve a list of all articles for each issue. (If possible, combine steps 1 and 2 into one.)
  3. Fetch the contents of each article in print format.
  4. Convert each article to PDF format.

Basic Design

The plan of attack is to design a class dedicated to communicating with the A List Apart website. This class fetches the list of issues and the articles in each issue. Next, some simple code uses the class to retrieve the necessary information: issue numbers, article titles, and the URLs of the articles. Lastly, each URL is passed to a class called PDFerator, written by pete@notahat.com, which uses OS X frameworks to simulate printing the article from a hidden web browser.
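
To make the shape of that information concrete, here is a minimal sketch of the nested structure the fetcher hands back: a list of issues, each carrying its own list of articles. The issue and article names come from the output listing further down; the URLs are placeholders of my own, and the helper names `count_articles` and `article_triples` are hypothetical.

```ruby
# Illustrative sample only: the :url values are placeholders, not real paths.
SAMPLE_ISSUES = [
  {
    :name => "Issue 100",
    :url  => "/issues/placeholder-100",
    :article_urls => [
      { :name => "Back to Basics", :url => "/articles/placeholder-a" },
      { :name => "Web Designer and Proud of It", :url => "/articles/placeholder-b" }
    ]
  },
  {
    :name => "Issue 101",
    :url  => "/issues/placeholder-101",
    :article_urls => [
      { :name => "How to be Soopa Famous", :url => "/articles/placeholder-c" }
    ]
  }
]

# Total number of articles across all issues.
def count_articles(issues)
  issues.inject(0) { |sum, issue| sum + issue[:article_urls].size }
end

# Flatten the nested structure into [issue name, article name, url] triples,
# which is all the PDF conversion step needs.
def article_triples(issues)
  issues.flat_map do |issue|
    issue[:article_urls].map { |article| [issue[:name], article[:name], article[:url]] }
  end
end

puts count_articles(SAMPLE_ISSUES)
```

Keeping the articles nested under their issue means the issue name is always at hand when building a filename for each PDF.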

Code

Prerequisites: To run this code, you MUST be running OS X 10.5 on an Apple Mac, because the script relies on the system’s Print to PDF features.

AListApartFetcher Class

The general strategy for this class is to encapsulate all of the code needed to communicate with the AListApart.com website. Since they do not offer a web services API, I used Why’s Hpricot to screen-scrape the information I needed. To work out the selectors that Hpricot needs to find the elements I was looking for, I analyzed the AListApart.com pages with SelectorGadget.

#!/usr/bin/ruby

require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'yaml' # needed by load_yaml and save_yaml

class AListApartFetcher
  attr_accessor :all_issue_urls, :all_article_urls
  attr_accessor :base_url, :main_url, :yaml_filename
  attr_accessor :page_count, :issue_count, :article_count
  attr_accessor :debug

  def initialize
    @debug = true
    @base_url = "http://alistapart.com"
    @main_url = "http://alistapart.com/articles"
    @yaml_filename = "alistapart.yaml"
  end

  # Retrieve a page of Issue URLs
  def retrieve_page(page_url)
    issue_urls = []
    count = 0

    doc = Hpricot(open(page_url))
    (doc/".ishno").each do |a|
      count += 1
      name = a.inner_text
      url = a.attributes["href"]
      puts "  - #{name} => #{url}"
      article_urls = retrieve_issue("#{@base_url}#{url}")
      issue_urls << { :name => name, :url => url, :article_urls => article_urls }
    end
    issue_urls
  end

  # Retrieve a page of Article URLs for a given Issue
  def retrieve_issue(issue_url)
    article_urls = []
    count = 0

    doc = Hpricot(open(issue_url))
    doc = (doc/"#content")
    (doc/".title/a").each do |a|
      count += 1
      name = a.inner_text
      url = a.attributes["href"]
      puts "    - #{name} => #{url}"
      article_urls << { :name => name, :url => url }
    end
    article_urls
  end

  def retrieve_remote
    puts "Loading data from remote."

    run = true
    @page_count = 0
    all_issue_urls = []

    # Loop until all pages retrieved
    while run
      # 0 is truthy in Ruby, so compare explicitly; the first request uses the bare URL.
      page_url = @page_count > 0 ? "#{@main_url}?page=#{@page_count + 1}" : @main_url
      puts "Retrieving page #{@page_count + 1}: #{page_url}"
      issue_urls = retrieve_page(page_url)
      if issue_urls.size > 0
        fetched_end = false
        issue_urls.each do |i|
          if i[:name] == "Issue 8"
            fetched_end = true
          end
        end

        all_issue_urls.concat issue_urls
        @page_count += 1

        if fetched_end
          run = false
        end
      else
        run = false
      end
    end
    # Store the result on the instance so that save_yaml has something to write.
    @all_issue_urls = all_issue_urls
  end

  def load_yaml
    puts "Loading data from YAML file."
    @all_issue_urls = YAML.load_file(@yaml_filename)
    @article_urls_count = @all_issue_urls.inject(0) {|sum, n| sum + (n[:article_urls].size) }
    puts "Loaded #{@all_issue_urls.size} issue URLs."
    puts "Loaded #{@article_urls_count} article URLs."
  end

  def save_yaml
    puts "Saving data to YAML file."
    File.open(@yaml_filename, 'w') do |out|
      YAML.dump(@all_issue_urls, out )
    end
  end

  def self.clean_string(string)
    chars_to_replace = { ' ' => '_', '?' => '', "'" => '', '(' => '', ')' => '', '&' => '', ';' => '', '/' => '', '!' => '' }
    string.split(//).map { |c| chars_to_replace.has_key?(c) ? chars_to_replace[c] : c }.join
  end

  def self.create_article_filename(issue_name, article_name, extension)
    issue_name = clean_string(issue_name)
    article_name = clean_string(article_name)
    "#{issue_name}-#{article_name}.#{extension}"
  end
end
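
As the output listing further down shows, `clean_string` lets colons, commas, and curly quotes through into the filenames. If that is a problem on your filesystem, a stricter whitelist approach is one option. The following is only a sketch: the names `safe_filename_part` and `safe_article_filename` are hypothetical and not part of the original script.

```ruby
# Hypothetical alternative to clean_string: turn runs of whitespace into
# underscores, then keep only ASCII letters, digits, '_', '.' and '-'.
def safe_filename_part(string)
  string.strip.gsub(/\s+/, "_").gsub(/[^A-Za-z0-9_.-]/, "")
end

# Same "<issue>-<article>.<extension>" layout as create_article_filename.
def safe_article_filename(issue_name, article_name, extension)
  "#{safe_filename_part(issue_name)}-#{safe_filename_part(article_name)}.#{extension}"
end

puts safe_article_filename("Issue 102", "This Web Business III: Selecting Professionals", "pdf")
# prints Issue_102-This_Web_Business_III_Selecting_Professionals.pdf
```

The trade-off is that a whitelist also drops accented letters and other non-ASCII characters, so titles that rely on them lose a little fidelity.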

Main Loop

This code was written to live in the same Ruby file as the AListApartFetcher class (i.e. alistapart.rb), but it could live anywhere. Just be sure to require the necessary gems and the Ruby class above. For the sake of brevity, I did not provide a command-line toggle for whether the script fetches the issue/article list remotely or from the local YAML file that is created during each remote run. You can change it by setting “remote” to “false”.
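
If you would rather have that command-line toggle, the standard library’s OptionParser makes it short. This is a sketch under my own assumptions: the `--remote`, `--no-remote`, and `--local` flag names and the `parse_options` helper are hypothetical and do not exist in the original script.

```ruby
require 'optparse'

# Parse a hypothetical remote/local switch from an argument list.
# Defaults to remote, matching the hard-coded `remote = true` in the script.
def parse_options(argv)
  options = { :remote => true }
  parser = OptionParser.new do |opts|
    opts.banner = "Usage: alistapart.rb [--local]"
    opts.on("--[no-]remote", "Fetch the issue list from the website (default)") do |value|
      options[:remote] = value
    end
    opts.on("--local", "Load the issue list from the YAML file instead") do
      options[:remote] = false
    end
  end
  # parse! consumes the options it recognises, so work on a copy.
  parser.parse!(argv.dup)
  options
end
```

In the main code, `remote = parse_options(ARGV)[:remote]` would then replace the hard-coded assignment.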

# MAIN CODE

remote = true
all_issue_urls = []

fetcher = AListApartFetcher.new
if remote
  all_issue_urls = fetcher.retrieve_remote
  article_urls_count = all_issue_urls.inject(0) {|sum, n| sum + (n[:article_urls].size) }
  puts "Retrieved #{fetcher.page_count} pages."
  puts "Retrieved #{all_issue_urls.size} issue URLs."
  puts "Retrieved #{article_urls_count} article URLs."

  fetcher.save_yaml
else
  fetcher.load_yaml
  all_issue_urls = fetcher.all_issue_urls
end

puts "Article Filenames"
all_issue_urls.each do |issue|
  issue[:article_urls].each do |article|
    filename = AListApartFetcher.create_article_filename(issue[:name], article[:name], 'pdf')
    generator_command = "./pdf.rb http://alistapart.com#{article[:url]} #{filename}"
    puts "Generating #{filename} with cmd: #{generator_command}"
    system(generator_command)
  end
end
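
The loop above interpolates the URL and filename into a single string, which `system` hands to the shell. Passing the arguments separately avoids the shell altogether, so unusual characters in a title cannot be misread as shell syntax. The sketch below shows that variant; `pdf_command` and `generate_pdf` are hypothetical helper names, and skipping files that already exist is my own addition for resuming an interrupted run.

```ruby
# Build the argument list for the converter script. With separate arguments,
# Kernel#system runs the program directly instead of going through the shell.
def pdf_command(base_url, article_url, filename, script = "./pdf.rb")
  [script, "#{base_url}#{article_url}", filename]
end

# Run the converter unless the target file already exists.
# Returns :skipped, or whatever Kernel#system returns (true, false, or nil).
def generate_pdf(base_url, article_url, filename)
  return :skipped if File.exist?(filename)
  system(*pdf_command(base_url, article_url, filename))
end
```

Inside the loop, `generate_pdf("http://alistapart.com", article[:url], filename)` would stand in for the `system(generator_command)` call.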

PDFerator Class

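This class was written by pete@notahat.com. You can get the original version from http://svn.bustikated.net/snap/browser/trunk/lib/pdferator.rb. The main loop above invokes it as ./pdf.rb, so save this code under that name next to alistapart.rb and make it executable.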

#!/usr/bin/env ruby

# pete@notahat.com

require 'osx/cocoa'
OSX.require_framework 'WebKit'

class PDFerator < OSX::NSObject

  def init
    initWithWidth(950)
  end

  def initWithWidth(width)
    # This sets up some context that we need for creating windows.
    OSX::NSApplication.sharedApplication

    # Create an offscreen window into which we can stick our WebView.
    # The height is zero because we'll resize to fit the document later.
    @window = OSX::NSWindow.alloc.initWithContentRect_styleMask_backing_defer(
      [0, 0, width, 0], OSX::NSBorderlessWindowMask, OSX::NSBackingStoreBuffered, false
    )

    # Create a WebView and stick it in our offscreen window.
    @webView = OSX::WebView.alloc.initWithFrame([0, 0, width, 0])
    @window.setContentView(@webView)

    # Use the print stylesheet, rather than the screen one.
    #@webView.setMediaStyle('screen')
    @webView.setMediaStyle('print')
    # Make sure we don't save any of the prefs that we change.
    @webView.preferences.setAutosaves(false)
    # Set some useful options.
    @webView.preferences.setShouldPrintBackgrounds(true)
    @webView.preferences.setJavaScriptCanOpenWindowsAutomatically(false)
    @webView.preferences.setAllowsAnimatedImages(false)
    # Make sure we don't get a scroll bar.
    @webView.mainFrame.frameView.setAllowsScrolling(false)

    self
  end

  def fetch(url)
    # This sets up the webView_*  methods to be called when loading finishes.
    @webView.setFrameLoadDelegate(self)
    # Tell the webView what URL to load.
    @webView.setValue_forKey(url, 'mainFrameURL')
    # Pass control to Cocoa for a bit.
    OSX.CFRunLoopRun
    @succeeded
  end

  attr_reader :error

  def webView_didFinishLoadForFrame(view, frame)
    @succeeded = true

    # Resize the view to fit the page.
    @docView = @webView.mainFrame.frameView.documentView
    @docView.window.setContentSize(@docView.bounds.size)
    @docView.setFrame(@docView.bounds)

    # Return control to the fetch method.
    OSX.CFRunLoopStop(OSX.CFRunLoopGetCurrent)
  end

  def webView_didFailLoadWithError_forFrame(webview, error, frame)
    @error = error
    @succeeded = false
    # Return control to the fetch method.
    OSX.CFRunLoopStop(OSX.CFRunLoopGetCurrent)
  end

  def webView_didFailProvisionalLoadWithError_forFrame(webview, error, frame)
    @error = error
    @succeeded = false
    # Return control to the fetch method.
    OSX.CFRunLoopStop(OSX.CFRunLoopGetCurrent)
  end

  def save(filename, options = {})
    if options[:paginated]
      save_paginated(filename, options)
    else
      @docView.dataWithPDFInsideRect(@docView.bounds).writeToFile_atomically(filename, true)
    end
  end

private

  def save_paginated(filename, options = {})
    # To generate paginated PDF, we create a print job and set it to save
    # the results to a file.
    printInfo = OSX::NSPrintInfo.alloc.initWithDictionary(
      OSX::NSPrintJobDisposition => OSX::NSPrintSaveJob,
      OSX::NSPrintSavePath       => filename
    )
    printInfo.setHorizontalPagination OSX::NSAutoPagination
    printInfo.setVerticalPagination   OSX::NSAutoPagination
    printInfo.setVerticallyCentered   false

    if options.has_key?(:margin)
      printInfo.setTopMargin    options[:margin]
      printInfo.setRightMargin  options[:margin]
      printInfo.setBottomMargin options[:margin]
      printInfo.setLeftMargin   options[:margin]
    end

    # Create a print operation to write out the PDF.
    printOp = OSX::NSPrintOperation.printOperationWithView_printInfo(@docView, printInfo)
    # Make sure we don't display the page setup and print dialogs.
    printOp.setShowPanels(false)
    # Do the printing!
    printOp.runOperation()
  end

end

pdferator = PDFerator.alloc.init
if pdferator.fetch(ARGV[0])
  pdferator.save(ARGV[1], :paginated => true)
else
  print "Error: #{pdferator.error}\n"
end

Output

As the script runs, you will be presented with some basic debugging output to give you an idea of where things are. After a considerable amount of time, you will have a directory full of PDFs.

-rw-r--r--  1 kelliott  2098   129084 Apr  8 15:08 Issue_100-Back_to_Basics.pdf
-rw-r--r--  1 kelliott  2098    67470 Apr  8 15:08 Issue_100-Web_Designer_and_Proud_of_It.pdf
-rw-r--r--  1 kelliott  2098    55132 Apr  8 15:08 Issue_101-How_to_be_Soopa_Famous.pdf
-rw-r--r--  1 kelliott  2098    98592 Apr  8 15:08 Issue_101-SMIL_When_You_Play_That.pdf
-rw-r--r--  1 kelliott  2098   103323 Apr  8 15:08 Issue_102-The_Declination_of_Independence.pdf
-rw-r--r--  1 kelliott  2098    74346 Apr  8 15:07 Issue_102-This_Web_Business_III:_Selecting_Professionals.pdf
-rw-r--r--  1 kelliott  2098    82872 Apr  8 15:07 Issue_103-A_Failure_to_Communicate.pdf
-rw-r--r--  1 kelliott  2098   109797 Apr  8 15:07 Issue_104-Down_By_Law.pdf
-rw-r--r--  1 kelliott  2098    77894 Apr  8 15:07 Issue_104-Flash’s_Got_a_Brand_New_Bag.pdf
-rw-r--r--  1 kelliott  2098   134241 Apr  8 15:07 Issue_105-The_Road_to_Dystopia.pdf
-rw-r--r--  1 kelliott  2098   109874 Apr  8 15:07 Issue_106-Beyond_Usability_and_Design:_The_Narrative_Web.pdf
-rw-r--r--  1 kelliott  2098   115614 Apr  8 15:07 Issue_107-“Forgiving”_Browsers_Considered_Harmful.pdf
-rw-r--r--  1 kelliott  2098    86811 Apr  8 15:07 Issue_109-CSS_Design:_Size_Matters.pdf
-rw-r--r--  1 kelliott  2098    71067 Apr  8 15:07 Issue_112-The_Devil_His_Due:_What_Online_Porn_Portends.pdf
-rw-r--r--  1 kelliott  2098   105460 Apr  8 15:07 Issue_113-Game_Design_in_Flash_5,_Part_II:_Heroes_and_Villains.pdf
-rw-r--r--  1 kelliott  2098    72844 Apr  8 15:07 Issue_114-Cheaper_Over_Better:_Why_Web_Clients_Settle_for_Less.pdf
-rw-r--r--  1 kelliott  2098    70946 Apr  8 15:07 Issue_114-The_Client_Did_It:_A_WWW_Whodunit.pdf
-rw-r--r--  1 kelliott  2098   142411 Apr  8 15:07 Issue_115-All_the_Access_Money_Can_Buy.pdf
-rw-r--r--  1 kelliott  2098   196737 Apr  8 15:07 Issue_115-Much_Ado_About_Smart_Tags.pdf
-rw-r--r--  1 kelliott  2098    75678 Apr  8 15:07 Issue_116-CSS_Talking_Points:_Selling_Clients_on_Web_Standards.pdf
-rw-r--r--  1 kelliott  2098    65023 Apr  8 15:07 Issue_116-Nipping_Client_Silliness_in_the_Bud.pdf
-rw-r--r--  1 kelliott  2098    99482 Apr  8 15:07 Issue_117-Kick_ASP_Design:_ASP_for_Non-Programmers.pdf
-rw-r--r--  1 kelliott  2098    66008 Apr  8 15:07 Issue_118-Process,_Methodology,_Life_Cycle,_Oh_My.pdf
-rw-r--r--  1 kelliott  2098    90553 Apr  8 15:07 Issue_119-Global_Treaty_Could_Transform_the_Web.pdf
-rw-r--r--  1 kelliott  2098   133381 Apr  8 15:07 Issue_119-Practical_CSS_Layout_Tips,_Tricks,_and_Techniques.pdf
-rw-r--r--  1 kelliott  2098    91600 Apr  8 15:07 Issue_120-Build_a_“Send_to_Friend”_Page.pdf
-rw-r--r--  1 kelliott  2098    84174 Apr  8 15:07 Issue_120-Evolving_Client_Content.pdf
...etc...

So that you can see an example of the results, I’ve uploaded two of these PDFs. Please note that they are copyright of A List Apart Magazine and its authors.

The result is a set of clean PDFs, free of the artifacts that most screen-oriented web layouts produce when printed, because we are taking advantage of the print stylesheet that A List Apart was kind enough to create.

Download

Please feel free to modify and distribute this code as you’d like. If you do use it, I only ask that you ping this posting with a comment and make a reference in your code to this article, so that others can benefit.

Download as a zip file / tgz file.
