2003 • August • 22

Scraping HTML with curl, tidy, and XSL

Continuing with making it easier for "Big Pubs" to create RSS feeds. I'm assuming that they have a publishing system, but it wasn't built with RSS in mind, but they want on the bandwagon.

Source: More Like This WebLog: Thursday, 21 August 2003

Using curl, tidy, and XSL to scrape content from HTML pages into an RSS feed. This is basically what I do now with a half-baked Java app using JTidy, XPath, and BeanShell. I keep meaning to release it, but it’s too embarrassing to share so far. Still, it’s been working well enough on the sites I’m interested in that I haven’t been too motivated to tidy it up and tarball it. One thing I like better about Bill Humphries’ approach, though, is that it doesn’t use Java :)
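
For a rough idea of what that chain looks like, here is a minimal sketch that wires the three tools together as a pipe. It is not Bill's actual script: the helper names (build_pipeline, run_pipeline), the choice of flags, and the assumption that curl, tidy, and xsltproc (the LibXSLT command-line tool) are on the PATH are all mine, and the stylesheet that turns the tidied page into RSS is something you would write per site.

```python
import subprocess
import sys


def build_pipeline(url, stylesheet):
    """Return argv lists for a curl | tidy | xsltproc chain (hypothetical helper)."""
    return [
        # Fetch the page quietly; -L follows redirects.
        ["curl", "-s", "-L", url],
        # Turn tag soup into well-formed XML with numeric entities,
        # and still emit output when the markup has errors.
        ["tidy", "-q", "-asxml", "-numeric",
         "--show-warnings", "no", "--force-output", "yes"],
        # Apply the stylesheet to the document arriving on stdin ("-").
        ["xsltproc", stylesheet, "-"],
    ]


def run_pipeline(commands):
    """Run the commands as a shell-style pipe; return the last stdout as bytes."""
    procs = []
    prev_stdout = None
    for argv in commands:
        proc = subprocess.Popen(argv, stdin=prev_stdout, stdout=subprocess.PIPE)
        if prev_stdout is not None:
            # Drop our copy so only the downstream process holds the pipe.
            prev_stdout.close()
        prev_stdout = proc.stdout
        procs.append(proc)
    output, _ = procs[-1].communicate()
    for proc in procs[:-1]:
        proc.wait()
    # Exit codes are deliberately ignored: tidy exits non-zero on warnings.
    return output


if __name__ == "__main__":
    # Usage: python scrape.py <page-url> <stylesheet.xsl> > feed.xml
    sys.stdout.buffer.write(run_pipeline(build_pipeline(sys.argv[1], sys.argv[2])))
```

One thing to watch if you try this: tidy's XML output puts the page in the XHTML namespace, so the XPath expressions in the stylesheet most likely need a namespace prefix to match anything.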

Archived Comments

Bill Humphries

2003-08-22T20:23:59

Well, it could use Java, if you really, really want to, since Xalan and Saxon have command-line variants. I'm using LibXSLT in the demo.

l.m.orchard

2003-08-23T13:21:08

Well, I actually like the idea of chaining a few shell programs together much better than the all-in-one Java scraper I was tinkering with. Seems so much easier all around.