After checking out Bill Humphries’ approach to scraping yesterday, I recalled what Jon Udell has written about the URL as a command line, and the places I’ve seen the W3C XSLT Servlet used in XSLT tinkering. I also remembered that the W3C offers an HTML Tidy service.

<p>So&#8230; these are all URLs. I figured I could pull together the site <span class="caps">URL</span>, <a href="">Bill&#8217;s <span class="caps">XSLT</span></a>, the Tidy service, and the <span class="caps">XSLT</span> service, and have a whole lot of scraping going on right in my browser, or via wget or curl. Here is how I composed the <span class="caps">URL</span>, step by step:</p>

<li><a href=""></a></li>
    <p>Unfortunately, this doesn&#8217;t work. Step #2 fails: the Tidy service reports a failure in processing the original <span class="caps">HTML</span>. Had that worked, I imagine step #3 would be producing <span class="caps">RSS</span>. On my command line, <span class="caps">HTML</span> Tidy works fine, so I&#8217;ve been thinking of throwing together my own web interface to that program and seeing if that works.</p>
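    <p>As a rough starting point for such a wrapper, here is a minimal sketch of driving the command-line program from Python. It assumes a <code>tidy</code> binary on the path; the function names are made up for illustration, and turning on the force-output option is only a guess at what a stubborn page needs.</p>

```python
import subprocess


def tidy_args(force_output=True):
    """Build the argument list for the tidy command-line program."""
    # -quiet: suppress the summary, -asxml: emit well-formed XML,
    # -numeric: write numeric rather than named entities.
    args = ["tidy", "-quiet", "-asxml", "-numeric"]
    if force_output:
        # Assumption: the page is bad enough that tidy would otherwise
        # refuse to write any output at all.
        args += ["--force-output", "yes"]
    return args


def run_tidy(html, force_output=True):
    """Pipe HTML bytes through tidy and return whatever lands on stdout."""
    result = subprocess.run(
        tidy_args(force_output),
        input=html,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    # tidy exits with 1 for warnings and 2 for errors, so a non-zero
    # status is not treated as fatal here.
    return result.stdout
```

    <p>A web interface would then only need to fetch the page named in a query parameter, hand the bytes to <code>run_tidy</code>, and return the result.</p>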
    <p>If it works, this, with a cache added at each stage, could make for what I think is a pretty nifty, all web-based means of scraping news items from web sites.</p>
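    <p>Since every stage is just a URL, one cache keyed on the URL would cover all of them. The following is only a sketch under that assumption: the cache directory, the one-hour lifetime, and the function names are arbitrary choices for illustration.</p>

```python
import hashlib
import os
import time
import urllib.request

CACHE_DIR = "scrape-cache"  # hypothetical location
MAX_AGE = 3600              # seconds; an arbitrary lifetime


def cache_path(url, cache_dir=CACHE_DIR):
    """Map a URL to a file name by hashing it, so escaping never matters."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return os.path.join(cache_dir, digest)


def cached_fetch(url, fetch=None, max_age=MAX_AGE, cache_dir=CACHE_DIR):
    """Return the bytes for url, reusing a cached copy younger than max_age."""
    path = cache_path(url, cache_dir)
    if os.path.exists(path) and time.time() - os.path.getmtime(path) < max_age:
        with open(path, "rb") as f:
            return f.read()
    if fetch is None:
        # Default to a plain HTTP GET when no fetcher is supplied.
        fetch = lambda u: urllib.request.urlopen(u).read()
    data = fetch(url)
    os.makedirs(cache_dir, exist_ok=True)
    with open(path, "wb") as f:
        f.write(data)
    return data
```

    <p>Calling <code>cached_fetch</code> on the tidied URL and again on the transformed URL would cache each stage separately, so a slow or failing service is only hit when its cached copy has expired.</p>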
        <p>What would really be nice for apps like this is a better way to express the URLs-within-URLs without escaping and escaping and escaping and... I&#8217;m thinking some very lightweight scripting, or some LISP-ish expressions, would help.</p>
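        <p>To make the escaping problem concrete, here is a small sketch in which each service call is a function, so the nesting reads like an expression and the escaping happens automatically. The endpoints and the parameter names (<code>docAddr</code>, <code>xslfile</code>, <code>xmlfile</code>) are placeholders, not the actual interfaces of the W3C services.</p>

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Hypothetical endpoints and parameter names, for illustration only.
TIDY_SERVICE = "http://tidy.example.org/tidy"
XSLT_SERVICE = "http://xslt.example.org/transform"


def tidy_url(page_url):
    """Wrap a page URL in a call to the (hypothetical) tidy service."""
    return TIDY_SERVICE + "?" + urlencode({"docAddr": page_url})


def xslt_url(stylesheet_url, source_url):
    """Wrap a stylesheet URL and a source URL in a call to the XSLT service."""
    return XSLT_SERVICE + "?" + urlencode(
        {"xslfile": stylesheet_url, "xmlfile": source_url}
    )


def compose(page_url, stylesheet_url):
    """Read inside-out, like a LISP expression: xslt(sheet, tidy(page))."""
    return xslt_url(stylesheet_url, tidy_url(page_url))


if __name__ == "__main__":
    # The page URL ends up percent-encoded twice in the final URL,
    # once for each service it is nested inside.
    print(compose("http://news.example.com/?section=top",
                  "http://example.org/scrape.xsl"))
```

        <p>Each layer of nesting escapes the previous layer&#8217;s percent signs again, which is exactly the pile-up that gets tedious to write by hand.</p>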


    Archived Comments

    • Does the W3C Tidy service support the force output option? That's what I had to do with command line Tidy to get something well formed from The Nation's home page.
    • Unfortunately, it seems that the W3C service only offers an indentation option.
    • It's tempting to take the script, and offer it as a service myself, with the force output option. However, I'd need to wrap an authorization service in front of it so it doesn't kill my bandwidth.