Mark Pilgrim implements something I've been thinking about for a while: his "Further Reading Upgrades" now harvest what appears to be the paragraph surrounding a link on a referring page. Along with the RSS feed of "Further Reading" items he's made available, he's got a nice game of follow-the-leader set up for the rest of us referrer log watchers. Nice & elegant & makes me ashamed. :)

Update: Mark pulls back the curtain and reveals the secret. Yay!

Yet another demonstration to me that ideas are f'ing worthless and the making is what matters. There are lots of things I've been thinking about doing that never get done. (See also: Blog:000305) Just a few weeks ago, I managed to upgrade my referrer tracking to dig out the titles of referring pages. And I've had an RSS feed of referrers myself for a few months now, but mine's been ugly as sin and so I've kept it to myself. So now I'll need to think in earnest about how to extract the link-surrounding excerpt from referring pages. Mark's referrer handling really is elegant - it even seems to know how to collapse multiple views of the same referring entry (i.e., front page, archive page, individual entry page).

More things to play with :)


Archived Comments

  • Actually, it's even better than that. I have a list of HTML tags that act as dividers: P, BR, HR, PRE, DIV, SPAN, TD, and a few others. I grab the text surrounding the link, demarcated by any combination of these dividers. There's some other wicked cool stuff in there too, like trying to guess permalinks (and getting it right about 80% of the time, and avoiding false positives the other 20%). First it looks for trackback data; if found, that's golden. Otherwise it looks for the last anchor (name or id attribute in any tag, not just A) before the start of the excerpt, then looks for the first link after the excerpt that contains that anchor. This picks up the permalink on most blogs, since they include anchor tags before the entry named the same as the permalink (either the filename or the fragment). A concrete example (from your own home page): this post is preceded by an anchor [a name="000306"][/a], and followed by a permalink [a href=""]11:16 AM[/a]. See how they match? It's not just A tags, though; some anchors are just an id attribute stuck on a div or h3 tag. There are lots of variations. Life is messy.
  • I just noticed the tag detection as I looked at the sample code you posted. Much more sophisticated than I'd initially thought. :) I forgot that Python has a useful SGML parser - I need to play with that more. Usually I assume that I'm stuck in trial-and-error regex-land when I want to dig out some content from a given HTML page. As for the permalink detection, yes I'd found some false positives from the start, but I was amazed at how many non-obvious ones it found. Very cool!
  • Python's SGML parser totally rocks. It took me ages to grok it, but when I did, it changed my world. My script to pull titles, language codes, and RSS auto-discovery URLs from pages? sgmllib. My script to extract excerpts around links? sgmllib. My script to find RSS feeds even without auto-discovery links? sgmllib. My ultra-liberal RSS parser? sgmllib. I wrote up a whole chapter on sgmllib ( ), which no one ever understands.
  • Oo. I think that was a challenge I just heard. Off to read the chapter now.
  • Mark, dude, you write the best damn footnotes I think I have ever seen. Oh, and this is supposed to be hard or something? I get it, and I'm not that bright.
  • He he. I'd forgotten about that one. As I recall, I wrote it at like 4 AM and considered taking it out upon reading it the next morning, but left it in and forgot about it.
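
For my own notes, here's a rough sketch of the divider idea Mark describes in the first comment: split the page's text wherever a divider tag appears, and keep the chunk that contains the link back to me. This is not Mark's code. It uses html.parser from the current Python standard library rather than sgmllib (which is no longer part of Python 3), the divider list is only the tags he names (his "few others" aren't guessed at), and the names ExcerptFinder and find_excerpt are made up for illustration. It also assumes the referring page links with exactly the URL being searched for.

```python
from html.parser import HTMLParser

# Divider tags named in the comment above; the "few others" it mentions
# are unknown and deliberately left out.
DIVIDERS = {"p", "br", "hr", "pre", "div", "span", "td"}


class ExcerptFinder(HTMLParser):
    """Collect the text chunk surrounding the first link to target_url."""

    def __init__(self, target_url):
        super().__init__(convert_charrefs=True)
        self.target_url = target_url
        self.chunk = []        # text pieces seen since the last divider
        self.has_link = False  # does the current chunk contain the link?
        self.excerpt = None    # set once the matching chunk is closed

    def _divide(self):
        # A divider ends the current chunk; keep it if it held the link.
        if self.has_link and self.excerpt is None:
            # Collapse runs of whitespace into single spaces.
            self.excerpt = " ".join("".join(self.chunk).split())
        self.chunk = []
        self.has_link = False

    def handle_starttag(self, tag, attrs):
        if tag in DIVIDERS:
            self._divide()
        elif tag == "a" and dict(attrs).get("href") == self.target_url:
            # Assumption: an exact string match on href is good enough.
            self.has_link = True

    def handle_endtag(self, tag):
        if tag in DIVIDERS:
            self._divide()

    def handle_data(self, data):
        self.chunk.append(data)

    def close(self):
        super().close()
        self._divide()  # flush the final chunk at end of document


def find_excerpt(html, target_url):
    """Return the excerpt around the first matching link, or None."""
    finder = ExcerptFinder(target_url)
    finder.feed(html)
    finder.close()
    return finder.excerpt
```

Calling find_excerpt(page_html, my_url) gives back the surrounding text with whitespace collapsed, or None when the page doesn't link to that URL. Real pages are messier than this handles (relative links, links nested inside a SPAN, and so on), and the permalink guessing is a separate problem I haven't attempted here.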