Scraping HTML with curl, tidy, and XSL

Continuing the thread on making it easier for "Big Pubs" to create RSS feeds: I'm assuming they have a publishing system that wasn't built with RSS in mind, but they want on the bandwagon anyway.

The idea this time: use curl, tidy, and XSL to scrape content from HTML pages into an RSS feed. This is basically what I do now with a half-baked Java app using JTidy, XPath, and BeanShell. I keep meaning to release it, but so far it’s too embarrassing to share. Still, it’s been working well enough on the sites I’m interested in that I haven’t been too motivated to tidy it up and tarball it. One thing I like better about Bill Humphries’ approach, though, is that it doesn’t use Java :)
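For a rough picture of the scrape-to-feed step, here is a minimal sketch in Python using only the standard library. It is neither the Java app described above nor Bill Humphries' curl/tidy/XSL pipeline: it swaps in Python's html.parser where that pipeline would use tidy plus an XSL stylesheet, and it skips the fetch step entirely. The markup pattern (headline links inside h2 elements), the names HeadlineParser, html_to_rss, and SAMPLE, and the example.com URL are all assumptions made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
import xml.etree.ElementTree as ET


class HeadlineParser(HTMLParser):
    """Collect (title, href) pairs for links found inside <h2> elements.

    The "<h2><a href=...>" pattern is an assumption about the page being
    scraped; a real site would need its own rule here.
    """

    def __init__(self):
        super().__init__()
        self.items = []      # finished (title, href) pairs
        self._in_h2 = False
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
        elif tag == "a" and self._in_h2:
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.items.append(("".join(self._text).strip(), self._href))
            self._href = None
        elif tag == "h2":
            self._in_h2 = False


def html_to_rss(html, base_url, feed_title):
    """Turn scraped headline links into a bare-bones RSS 2.0 document."""
    parser = HeadlineParser()
    parser.feed(html)
    parser.close()

    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = base_url
    ET.SubElement(channel, "description").text = "Scraped from " + base_url

    for title, href in parser.items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        # Resolve relative links against the page URL.
        ET.SubElement(item, "link").text = urljoin(base_url, href)

    return ET.tostring(rss, encoding="unicode")


# Hypothetical page fragment standing in for what curl would fetch.
SAMPLE = """<html><body>
<h2><a href="/news/first.html">First story</a></h2>
<p>Teaser text.</p>
<h2><a href="/news/second.html">Second &amp; last</a></h2>
</body></html>"""

if __name__ == "__main__":
    print(html_to_rss(SAMPLE, "http://example.com/", "Example feed"))
```

The per-site knowledge lives in one small rule (here, the h2 link pattern), which is the role the XSL stylesheet plays in the shell pipeline; everything else is generic plumbing.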


Archived Comments

  • Well, it could use Java, if you really, really want to, since Xalan and Saxon have command-line variants. I'm using LibXSLT in the demo.
  • Well, I actually like the idea of chaining a few shell programs together much better than the all-in-one Java scraper I was tinkering with. Seems so much easier all around.