Thursday, January 15, 2015

Parsing a giant list of quotes

In early 2011, around the time I moved to Sydney, I began to collect interesting quotes I encountered in books, web articles and elsewhere. By ‘interesting’ I really do mean any sense of the term, like interesting factual tidbits, artful prose, or cute proverb-esque one-liners.

This has kept up since then — Instapaper helps — and as of writing I have a little over 1500 entries sitting in one horrendously large (read: three-hundred page) Google Doc.

The bottom [oldest] part of my giant quotes list.

I realised late last year that this Google-Docs-based system, while convenient for quickly adding new quotes, wasn’t particularly useful for retrieval, let alone browsing. Having the entire thing open in my browser takes nearly 1 GB of RAM, and editing operations are slow; it’s just not the use case Docs is designed for. Furthermore, for a while now I’ve wanted to tag quotes (making it easier to hunt for, say, poems, or quotes about economics), and the additional clutter that tags would add to my current system makes it untenable.

There’s the question of how to store this information instead (a different web app? a SQL database?), but I figured I’d start by making sure I could easily extract all of the data without having to retype all three hundred pages of it by hand. Fortunately, Docs can export the whole document as HTML, which opens the door to some programmatic parsing.

Of course, this is still in a pretty messy state:

The exported HTML from Google Docs.

The first problem: the exported Doc uses stylesheets to italicise and embolden text. I want to preserve these (italics are usually actual author emphasis in the original; bold, my own highlighting), but I can’t hard-code which CSS classes correspond to which, since the way Docs names the CSS classes is unpredictable and varies depending on document contents.

My script uses BeautifulSoup to parse the HTML and cssutils to pull out the CSS definitions:

import re
import sys

from bs4 import BeautifulSoup
import cssutils

soup = BeautifulSoup(sys.stdin)

# Collect the generated class names whose CSS rules mean italic or bold.
emTags = []
bTags = []
for s in soup.find_all('style'):
    for rule in cssutils.parseString(s.get_text()):
        italic = rule.style.getProperty('font-style')
        if italic and italic.value == 'italic':
            emTags.append(re.sub(r"^\.", "", rule.selectorText))
        bold = rule.style.getProperty('font-weight')
        if bold and bold.value == 'bold':
            bTags.append(re.sub(r"^\.", "", rule.selectorText))
    s.decompose()

…at which point the script can apply those rules by hand, producing ordinary <b> and <em> tags:

for span in soup.find_all('span'):
    if 'class' in span.attrs and span.string:
        if any(c in emTags for c in span['class']):
            span.string.wrap(soup.new_tag('em'))
        if any(c in bTags for c in span['class']):
            span.string.wrap(soup.new_tag('b'))
    span.unwrap()

Some quotes contain links; to many others I’ve added a source link at the bottom. Google Docs exports the links as google.com redirects (presumably for nefarious tracking), so a URL that was originally “http://edge.org/responses/whats-your-law” becomes the considerably uglier “http://www.google.com/url?q=http%3A%2F%2Fedge.org%2Fresponses%2Fwhats-your-law&sa=D&sntz=1&usg=AFQjCNGTr8xkrcybZRSsCPqNu17dyQgexA”.

from logging import warning  # stand-in for the script's warning helper
from urlparse import urlparse, parse_qs  # urllib.parse on Python 3

for a in soup.find_all('a'):
    del a['class']
    if a['href'].startswith('http://www.google.com/url?') \
            or a['href'].startswith('https://www.google.com/url?'):
        params = parse_qs(urlparse(a['href']).query)
        if 'q' not in params:
            warning('q not in params: %s' % params)
            continue
        if len(params['q']) != 1:
            warning("len(params['q']) != 1: %s" % params['q'])
            continue
        a['href'] = params['q'][0]
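On its own, the query-string unwrapping looks like this (shown with Python 3’s urllib.parse; on Python 2 the same functions live in urlparse):

```python
from urllib.parse import urlparse, parse_qs

wrapped = ('http://www.google.com/url?q=http%3A%2F%2Fedge.org'
           '%2Fresponses%2Fwhats-your-law&sa=D&sntz=1'
           '&usg=AFQjCNGTr8xkrcybZRSsCPqNu17dyQgexA')

# parse_qs percent-decodes each value, so q comes back as the original URL.
params = parse_qs(urlparse(wrapped).query)
original = params['q'][0]
```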

After these modifications, the HTML looks a lot more sensible. It’s a simple enough matter to split the quotes up, and to verify that each quote is a collection of paragraphs followed by an attribution: a single line delimited by square brackets.
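The splitting code itself isn’t shown above, but as a rough sketch — assuming each quote arrives as a text block whose final line is the bracketed attribution (the sample quote here is just an illustration) — it might look like:

```python
import re

def split_quote(block):
    """Split one quote block into its paragraphs and its attribution.

    Assumes the attribution is the last non-empty line, delimited by
    square brackets -- a sketch of the format, not the script's code.
    """
    lines = [l for l in block.splitlines() if l.strip()]
    m = re.match(r'^\[(.*)\]$', lines[-1].strip())
    if not m:
        raise ValueError('no attribution found: %r' % lines[-1])
    return lines[:-1], m.group(1)

paragraphs, attribution = split_quote(
    'Any sufficiently advanced technology is\n'
    'indistinguishable from magic.\n'
    '[Arthur C. Clarke]\n')
```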

Making sense of the attributions is a little trickier.

A couple of actual examples:

They’re in a semi-consistent format: usually an author name, usually with a link and/or a title. But all sorts of other bits and pieces of data appear alongside. Notice the phrasing that implies Albus Dumbledore is a character, not a writer. Notice the ‘episode name, series name’ format for the Idea Channel link.

There’s lots of contextual data here which I wouldn’t want an automated extractor to lose track of. This isn’t going to be as easy as splitting the string on every occurrence of ‘, ’. (Hell, that split isn’t even correct; just look at the Reynolds attribution, with a comma in the paper title.)

Fortunately, there’s still enough structure for my script to tackle programmatically. It’s too complicated to chew through with a single regexp, but it is, in fact, exactly a context-free grammar. It then just becomes a matter of defining appropriate tokens for a lexer:

import ply.lex as lex

# In ply, each token rule's docstring is the regex it matches.
def t_PAGENUM(t):
    r'\ ?p\d+(\-\d+)?'
    return t

def t_URL(t):
    r'https?://(?:[a-zA-Z]|[0-9]|[$\-/_@.&+#~\?=;]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    return t
[...]
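The token regexes can be sanity-checked with plain re before wiring them into ply (the sample strings here are made up for illustration):

```python
import re

# The same regexes used as ply token-rule docstrings above.
PAGENUM = r'\ ?p\d+(\-\d+)?'
URL = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$\-/_@.&+#~\?=;]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

assert re.match(PAGENUM + r'$', 'p117')
assert re.match(PAGENUM + r'$', ' p20-25')
assert re.match(URL + r'$', 'http://edge.org/responses/whats-your-law')
assert not re.match(URL + r'$', 'ftp://example.org/')
```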

...and then constructing the ugliest context-free grammar I’ve ever had the dishonour of writing:

fragmentlist : fragmentlist COMMA fragment
fragmentlist : fragment
fragment : TEXT_IN PLAINTEXT APOSTROPHE_S citation
fragment : TEXT_IN citation
fragment : TEXT_QUOTED_IN citation
fragment : TEXT_COMMENTING_ON citation
fragment : citation
fragment : PLAINTEXT plaintext_suffix
fragment : OPEN_SINGLE_QUOTE PLAINTEXT CLOSE_SINGLE_QUOTE
fragment : PAGENUM
fragment : EPISODE
fragment : CHAPTER
fragment : DATE
plaintext_suffix :
plaintext_suffix : PLAINTEXT plaintext_suffix
plaintext_suffix : APOSTROPHE_S plaintext_suffix
plaintext_suffix : OPEN_SINGLE_QUOTE plaintext_maybe_comma CLOSE_SINGLE_QUOTE plaintext_suffix
                 | OPEN_QUOTE plaintext_maybe_comma CLOSE_QUOTE plaintext_suffix
plaintext_maybe_comma : plaintext_suffix
plaintext_maybe_comma : plaintext_suffix COMMA plaintext_maybe_comma
citation : link
citation : URL
citation : nonlinkcitation
nonlinkcitation  : OPEN_QUOTE plaintext_maybe_comma CLOSE_QUOTE
                 | GENERIC_QUOTE plaintext_maybe_comma GENERIC_QUOTE
nonlinkcitation : TAG_BEGIN EM TAG_END plaintext_maybe_comma TAG_BEGIN SLASH EM TAG_END
link : TAG_BEGIN A HREF EQUALS QUOTE URL QUOTE TAG_END URL TAG_BEGIN SLASH A TAG_END
link : TAG_BEGIN A HREF EQUALS QUOTE URL QUOTE TAG_END PLAINTEXT TAG_BEGIN SLASH A TAG_END
link : TAG_BEGIN A HREF EQUALS QUOTE URL QUOTE TAG_END nonlinkcitation TAG_BEGIN SLASH A TAG_END
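ply.yacc turns each production above into a parser rule; as a rough illustration of just the top-level shape (a fragmentlist is fragments separated by COMMA tokens), here’s a hand-rolled sketch. The real grammar is subtler: commas inside quoted titles are swallowed by plaintext_maybe_comma rather than splitting fragments, which this sketch ignores.

```python
def parse_fragmentlist(tokens):
    # Sketch of `fragmentlist : fragmentlist COMMA fragment | fragment`
    # over a pre-lexed (type, value) token stream. Illustration only.
    fragments, current = [], []
    for tok in tokens:
        if tok[0] == 'COMMA':
            fragments.append(current)
            current = []
        else:
            current.append(tok)
    fragments.append(current)
    return fragments

parts = parse_fragmentlist(
    [('PLAINTEXT', 'Albus Dumbledore'), ('COMMA', ','), ('PAGENUM', 'p333')])
```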

Running this through a YACC-like parser and doing a little postprocessing is enough to extract all the information in a structured way:

Parsed attribution strings as Python dicts