Scraping HTML with web services
After checking out Bill Humphries’ approach to scraping yesterday, I recalled the various things Jon Udell has written about URL-as-command-line and the various places I’ve seen the W3C XSLT Servlet used in XSLT tinkering. I also remembered that there’s an HTML Tidy service offered by W3C as well.
So… these are all URLs. I figured I could pull together the site URL, Bill’s XSLT, the Tidy service, and the XSLT service, and have a whole lot of scraping going on right in my browser or via wget or curl. Here’s how I composed the URL, step by step (sketched in code just after the list):
1. http://www.thenation.com
2. http://cgi.w3.org/cgi-bin/tidy?docAddr=http%3A%2F%2Fwww.thenation.com
3. http://www.w3.org/2000/06/webdata/xslt?
   xslfile=http%3A%2F%2Fwww.whump.com%2Fdropbox%2Fnationrss%2Fnation.xsl&
   xmlfile=http%3A%2F%2Fcgi.w3.org%2Fcgi-bin%2Ftidy%3F
   docAddr%3Dhttp%253A%252F%252Fwww.thenation.com&transform=Submit
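To make the escaping explicit, here’s the same composition as a minimal Python sketch. The service URLs are the real ones from the list above, but the code itself is just an illustration: the point is that each inner URL has to be percent-escaped one more time than the URL that carries it.

```python
from urllib.parse import urlencode

site = "http://www.thenation.com"

# Step 2: hand the site to the Tidy service; urlencode percent-escapes
# the inner URL once (http%3A%2F%2F...).
tidy_url = "http://cgi.w3.org/cgi-bin/tidy?" + urlencode({"docAddr": site})

# Step 3: hand the tidied output to the XSLT service. tidy_url already
# contains one escaped URL, and it gets escaped again here, which is
# where the %253A%252F double escaping in step 3 comes from.
xslt_url = "http://www.w3.org/2000/06/webdata/xslt?" + urlencode({
    "xslfile": "http://www.whump.com/dropbox/nationrss/nation.xsl",
    "xmlfile": tidy_url,
    "transform": "Submit",
})

print(xslt_url)
```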
Unfortunately, this doesn’t work. Step #2 fails: the Tidy service reports an error processing the original HTML. Had that worked, I imagine the full pipeline at step #3 would be producing RSS. HTML Tidy works fine on my command line, though, so I’ve been thinking of throwing together my own web interface to that program and seeing whether that works.
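If I go that route, the wrapper could be about this small. This is a hypothetical stand-in for the W3C service, not anything W3C published: it fetches the page named in docAddr and pipes it through the local tidy binary. The port and the parameter name are my assumptions, chosen to mirror the W3C URL above.

```python
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs
from urllib.request import urlopen

class TidyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        query = parse_qs(urlparse(self.path).query)
        doc_addr = query.get("docAddr", [None])[0]
        if not doc_addr:
            self.send_error(400, "missing docAddr parameter")
            return
        raw_html = urlopen(doc_addr).read()
        # -asxml asks tidy for well-formed XHTML; -q suppresses the
        # summary; --force-output yes emits a document even when tidy
        # flags errors in the input.
        result = subprocess.run(
            ["tidy", "-asxml", "-q", "--force-output", "yes"],
            input=raw_html, capture_output=True)
        self.send_response(200)
        self.send_header("Content-Type", "application/xml")
        self.end_headers()
        self.wfile.write(result.stdout)

HTTPServer(("", 8000), TidyHandler).serve_forever()
```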
If it does, this approach, with a cache added at each stage, could make for what I think is a pretty nifty, all-web-based means of scraping news items from web sites.
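One way that per-stage cache might look, assuming a simple fetch-through design (the directory name and the cached_fetch helper are invented for the sketch):

```python
import hashlib
import os
from urllib.request import urlopen

CACHE_DIR = "scrape-cache"

def cached_fetch(url):
    """Fetch a URL once; serve the saved copy on repeat requests."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    key = hashlib.sha1(url.encode("utf-8")).hexdigest()
    path = os.path.join(CACHE_DIR, key)
    if not os.path.exists(path):
        with open(path, "wb") as f:
            f.write(urlopen(url).read())
    with open(path, "rb") as f:
        return f.read()
```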
What would really be nice for apps like this is a better way to express URLs-within-URLs without escaping and escaping and escaping and... I’m thinking some very lightweight scripting, or some LISP-ish expressions, would help here.
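To sketch what I mean, here’s a toy version of the idea in Python rather than LISP: describe the pipeline as nested (url, params) tuples and let a single compile step handle every level of escaping. The pipe() helper and the tuple shape are pure invention for illustration.

```python
from urllib.parse import urlencode

def pipe(expr):
    """Compile a nested (base_url, params_dict) expression into a URL.

    Any parameter value that is itself a tuple is compiled first, so
    each level of nesting gets exactly one more round of escaping."""
    base, params = expr
    flat = {k: pipe(v) if isinstance(v, tuple) else v
            for k, v in params.items()}
    return base + "?" + urlencode(flat)

url = pipe((
    "http://www.w3.org/2000/06/webdata/xslt", {
        "xslfile": "http://www.whump.com/dropbox/nationrss/nation.xsl",
        "xmlfile": ("http://cgi.w3.org/cgi-bin/tidy", {
            "docAddr": "http://www.thenation.com"}),
        "transform": "Submit",
    }))
```

Applied to the example above, this reproduces the step #3 URL, double escaping and all, without anyone having to count percent signs.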