Scraping HTML with web services

After checking out Bill Humphries’ approach to scraping yesterday, I recalled the various things Jon Udell has written about URL-as-command-line, as well as the places I’ve seen the W3C XSLT Servlet used in XSLT tinkering. I also remembered that there’s an HTML Tidy service offered by W3C as well.

So… these are all URLs. I figured I could pull together the site URL, Bill’s XSLT, the Tidy service, and the XSLT service, and have a whole lot of scraping going on right in my browser or via wget or curl. Here’s how I composed the URL, step by step:

  1. http://www.thenation.com
  2. http://cgi.w3.org/cgi-bin/tidy?docAddr=http%3A%2F%2Fwww.thenation.com
  3. http://www.w3.org/2000/06/webdata/xslt?xslfile=http%3A%2F%2Fwww.whump.com%2Fdropbox%2Fnationrss%2Fnation.xsl&xmlfile=http%3A%2F%2Fcgi.w3.org%2Fcgi-bin%2Ftidy%3FdocAddr%3Dhttp%253A%252F%252Fwww.thenation.com&transform=Submit
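The nesting above can be sketched in a few lines of Python (a sketch of the URL composition, not the author's actual workflow; the service URLs and parameter names come straight from the list above). The key detail is that each level of nesting adds one more layer of percent-encoding, which is why `%3A` in step 2 becomes `%253A` in step 3:

```python
from urllib.parse import quote

def wrap(service_url: str, **params: str) -> str:
    """Append query parameters to a service URL, percent-encoding each value."""
    query = "&".join(f"{k}={quote(v, safe='')}" for k, v in params.items())
    return f"{service_url}?{query}"

# Step 1: the page to scrape.
site = "http://www.thenation.com"

# Step 2: run it through the W3C HTML Tidy service (inner URL escaped once).
tidied = wrap("http://cgi.w3.org/cgi-bin/tidy", docAddr=site)

# Step 3: feed the Tidy URL to the XSLT service; the innermost URL is now
# escaped twice (%3A has become %253A, %2F has become %252F, and so on).
scrape = wrap(
    "http://www.w3.org/2000/06/webdata/xslt",
    xslfile="http://www.whump.com/dropbox/nationrss/nation.xsl",
    xmlfile=tidied,
    transform="Submit",
)
print(scrape)
```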

Unfortunately, this doesn’t work. In particular, step #2 fails, the Tidy service reporting a failure in processing the original HTML. I imagine, had that worked, the whole process at step #3 would be producing RSS. On my command line, HTML Tidy works fine, so I’ve been thinking of throwing together my own web interface to that program and seeing if that works.

If that works, this process, with the addition of a cache at each stage, could allow for what I think is a pretty nifty, all-web-based means of scraping news items from web sites.

What would really be nice for apps like this is a better way to express the URLs-within-URLs without escaping and escaping and escaping and... Thinking some very lightweight scripting here, or some LISP-ish expressions, would help.
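One hypothetical shape such a LISP-ish notation could take (this is an illustrative sketch, not an existing tool): write the pipeline as a nested expression of (service-url, (param, value-or-subexpression), ...) tuples, and let a tiny evaluator apply one layer of escaping per nesting level:

```python
from urllib.parse import quote

# A LISP-ish expression: (service-url, (param, value-or-subexpression), ...).
# The pipeline reads inside-out, with no hand-escaping anywhere.
expr = ("http://www.w3.org/2000/06/webdata/xslt",
        ("xslfile", "http://www.whump.com/dropbox/nationrss/nation.xsl"),
        ("xmlfile", ("http://cgi.w3.org/cgi-bin/tidy",
                     ("docAddr", "http://www.thenation.com"))),
        ("transform", "Submit"))

def evaluate(e) -> str:
    """Turn a nested expression into a URL, escaping once per nesting level."""
    if isinstance(e, str):
        return e
    base, *params = e
    if not params:
        return base
    query = "&".join(f"{k}={quote(evaluate(v), safe='')}" for k, v in params)
    return f"{base}?{query}"

url = evaluate(expr)
print(url)
```

The evaluator recurses into sub-expressions before quoting them, so the double-escaping of the innermost URL falls out of the structure instead of being done by hand.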


Archived Comments

  • Does the w3c Tidy service support the force output option? That's what I had to do with command line Tidy to get something well formed from The Nation's home page.
  • Unfortunately, it seems that the W3C service only offers an indentation option.
  • It's tempting to take the script, and offer it as a service myself, with the force output option. However, I'd need to wrap an authorization service in front of it so it doesn't kill my bandwidth.