Scraping HTML with curl, tidy, and XSL
Continuing with the theme of making it easier for "Big Pubs" to create RSS feeds: I'm assuming that they have a publishing system that wasn't built with RSS in mind, but they want on the bandwagon anyway.
Bill Humphries describes using curl, tidy, and XSL to scrape content from HTML pages into an RSS feed. This is basically what I do now with a half-baked Java app using JTidy, XPath, and BeanShell. I keep meaning to release it, but so far it’s too embarrassing to share. Still, it’s been working well enough at scraping the sites I’m interested in that I haven’t been too motivated to tidy it up and tarball it. One thing I like better about Bill Humphries’ approach, though, is that it doesn’t use Java :)
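To give a feel for the shape of the scrape-to-RSS idea, here is a minimal sketch in Python using only the standard library. It is not Bill Humphries' curl/tidy/XSL pipeline and not my Java app: `html.parser` stands in for tidy's cleanup and an `ElementTree` builder stands in for the XSL stylesheet. The page layout (headlines as links inside `h2` elements), the names `HeadlineParser` and `build_rss`, and the example URLs are all assumptions made up for illustration. Fetching is left out, so the input is just a string of HTML you have already downloaded.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class HeadlineParser(HTMLParser):
    """Collect (title, href) pairs from <a> tags nested inside <h2> tags.

    Assumes a hypothetical page layout where each headline is an
    <h2> wrapping a single link; real sites will need their own rules.
    """

    def __init__(self):
        super().__init__()
        self.items = []      # finished (title, href) pairs
        self._in_h2 = False  # True while inside an <h2>
        self._href = None    # href of the link currently being read
        self._text = []      # text chunks of the current link

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
        elif tag == "a" and self._in_h2:
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text while we are inside a headline link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.items.append(("".join(self._text).strip(), self._href))
            self._href = None
        elif tag == "h2":
            self._in_h2 = False
            self._href = None


def build_rss(html, feed_title, feed_link):
    """Turn scraped headlines into a bare-bones RSS 2.0 document (as a string)."""
    parser = HeadlineParser()
    parser.feed(html)
    parser.close()

    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = feed_link
    ET.SubElement(channel, "description").text = "Scraped from " + feed_link
    for title, href in parser.items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "link").text = href
    return ET.tostring(rss, encoding="unicode")


if __name__ == "__main__":
    page = (
        "<html><body>"
        '<h2><a href="http://example.com/one">First story</a></h2>'
        '<p><a href="http://example.com/about">About us</a></p>'
        '<h2><a href="http://example.com/two">Second story</a></h2>'
        "</body></html>"
    )
    print(build_rss(page, "Example Feed", "http://example.com/"))
```

The link in the plain paragraph is skipped because it is not inside an `h2`. The selection rule is the part you would rewrite per site, which is the job the XPath expressions or XSL templates do in the approaches above.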