Scraping HTML with web services

After checking out Bill Humphries’ approach to scraping yesterday, I recalled the various things Jon Udell has written about URL-as-command-line, as well as the places I’ve seen the W3C XSLT Servlet used in XSLT tinkering. I also remembered that there’s an HTML Tidy service offered by W3C as well.

So… these are all URLs. I figured I could pull together the site URL, Bill’s XSLT, the Tidy service, and the XSLT service, and have a whole lot of scraping going on right in my browser or via wget or curl. Here’s how I composed the URL, step by step:

  1. http://www.thenation.com
  2. http://cgi.w3.org/cgi-bin/tidy?docAddr=http%3A%2F%2Fwww.thenation.com
  3. http://www.w3.org/2000/06/webdata/xslt?xslfile=http%3A%2F%2Fwww.whump.com%2Fdropbox%2Fnationrss%2Fnation.xsl&xmlfile=http%3A%2F%2Fcgi.w3.org%2Fcgi-bin%2Ftidy%3FdocAddr%3Dhttp%253A%252F%252Fwww.thenation.com&transform=Submit
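The nesting above can be sketched in a few lines of Python (a sketch of the URL composition, not the author's actual workflow; the service URLs and parameter names come straight from the list above). The key detail is that each level of nesting adds one more layer of percent-encoding, which is why `%3A` in step 2 becomes `%253A` in step 3:

```python
from urllib.parse import quote

def wrap(service_url: str, **params: str) -> str:
    """Append query parameters to a service URL, percent-encoding each value."""
    query = "&".join(f"{k}={quote(v, safe='')}" for k, v in params.items())
    return f"{service_url}?{query}"

# Step 1: the page to scrape.
site = "http://www.thenation.com"

# Step 2: run it through the W3C HTML Tidy service (inner URL escaped once).
tidied = wrap("http://cgi.w3.org/cgi-bin/tidy", docAddr=site)

# Step 3: feed the Tidy URL to the XSLT service; the innermost URL is now
# escaped twice (%3A has become %253A, %2F has become %252F, and so on).
scrape = wrap(
    "http://www.w3.org/2000/06/webdata/xslt",
    xslfile="http://www.whump.com/dropbox/nationrss/nation.xsl",
    xmlfile=tidied,
    transform="Submit",
)
print(scrape)
```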

Unfortunately, this doesn’t work. In particular, step #2 fails, the Tidy service reporting a failure in processing the original HTML. I imagine, had that worked, the whole process at step #3 would be producing RSS. On my command line, HTML Tidy works fine, so I’ve been thinking of throwing together my own web interface to that program and seeing if that works.

If that works, this process, with the addition of a cache at each stage, could allow for what I think is a pretty nifty, all-web-based means of scraping news items from web sites.

What would really be nice for apps like this is a better way to express the URLs-within-URLs without escaping and escaping and escaping and... Thinking some very lightweight scripting here, or some LISP-ish expressions, would help.
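One hypothetical shape such a LISP-ish notation could take (this is an illustrative sketch, not an existing tool): write the pipeline as a nested expression of (service-url, (param, value-or-subexpression), ...) tuples, and let a tiny evaluator apply one layer of escaping per nesting level:

```python
from urllib.parse import quote

# A LISP-ish expression: (service-url, (param, value-or-subexpression), ...).
# The pipeline reads inside-out, with no hand-escaping anywhere.
expr = ("http://www.w3.org/2000/06/webdata/xslt",
        ("xslfile", "http://www.whump.com/dropbox/nationrss/nation.xsl"),
        ("xmlfile", ("http://cgi.w3.org/cgi-bin/tidy",
                     ("docAddr", "http://www.thenation.com"))),
        ("transform", "Submit"))

def evaluate(e) -> str:
    """Turn a nested expression into a URL, escaping once per nesting level."""
    if isinstance(e, str):
        return e
    base, *params = e
    if not params:
        return base
    query = "&".join(f"{k}={quote(evaluate(v), safe='')}" for k, v in params)
    return f"{base}?{query}"

url = evaluate(expr)
print(url)
```

The evaluator recurses into sub-expressions before quoting them, so the double-escaping of the innermost URL falls out of the structure instead of being done by hand.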


Archived Comments

  • Does the w3c Tidy service support the force output option? That's what I had to do with command line Tidy to get something well formed from The Nation's home page.
  • Unfortunately, it seems that the W3C service only offers an indentation option.
  • It's tempting to take the script, and offer it as a service myself, with the force output option. However, I'd need to wrap an authorization service in front of it so it doesn't kill my bandwidth.