Part of the reason this whole must-ignore thing with respect to feeds has me a bit fired up is that so few feed processing tools out there embrace the idea. And because of that, these tools are unfortunately brittle and prone to future shock.

For example, take Syndication.framework on OS X: Amongst the monkeys and ninjas and pirates and robots, you've got your standard title-date-link-description columns with a few other bits for good measure. But where's the data from the iTunes RSS extensions? Nowhere, gone, lost. If it was in the feed when Syndication.framework found it, the parser didn't understand it, and so it wasn't retained after the framework finished chewing up the feed and spitting the data into that DB table.

I've written about shared feed processing foundations before, but I don't think the idea had totally gelled in my head until now. Here's the thing: If you want feed processing tools that are useful for the general case, they have to be tolerant of things not understood. Rather than intrusively breaking apart and recasting feed data into a predetermined data structure, you've got to remain hands-off as much as possible.

This is what I did in FeedSpool. This code can subscribe to feeds, poll feed data periodically, and even work out which items in a feed are new—but it punts on everything else, caring only about where a feed starts and ends and where its individual entries start and end. The rest is left in its original XML form. So, if there was data in there for iTunes? It's still there, because FeedSpool didn't know enough to do anything to it.

This is the major difference between the two feed processing models. Syndication.framework loses information when it encounters things it doesn't understand. FeedSpool retains the information, because it leaves things alone when it doesn't know any better.
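To make the hands-off model concrete, here's a minimal sketch (not FeedSpool's actual code, and the `itunes:duration` element is just an illustrative extension) of envelope-only parsing: pull out each entry's id so newness can be detected, but keep the entry's original XML around untouched, unknown namespaces and all.

```python
# A sketch of "hands-off" feed processing: find the entry boundaries
# and an id for each entry, then keep the entry's raw XML rather than
# mapping it onto a fixed schema. Extension elements ride along
# unexamined instead of being dropped on the floor.
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd">
  <title>Example</title>
  <entry>
    <id>urn:example:1</id>
    <title>Episode 1</title>
    <itunes:duration>12:34</itunes:duration>
  </entry>
</feed>"""

def split_entries(feed_xml):
    """Return a dict mapping entry id -> that entry's original XML."""
    root = ET.fromstring(feed_xml)
    entries = {}
    for entry in root.findall(ATOM + "entry"):
        entry_id = entry.findtext(ATOM + "id")
        # Serialize the whole entry back out; anything the parser
        # didn't "understand" is still in the string.
        entries[entry_id] = ET.tostring(entry, encoding="unicode")
    return entries

entries = split_entries(FEED)
print(list(entries))  # → ['urn:example:1']
```

The point of the sketch is what `split_entries` *doesn't* do: it never enumerates the fields it expects, so the made-up iTunes data ("12:34") survives the round trip even though nothing in the code knows what it means.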

Now, this is not to pick on just Syndication.framework. Despite the general-sounding name, this framework is pretty much just around to power Safari RSS and not iTunes or anything else. And as I said, pretty much every other feed processing framework and tool works in this manner. Just about everybody uses a destructive process when they parse and marshal feed data into local-idiom structures.

And this destructive process is fine when all you want to do is satisfy a specific purpose—display a few headlines, for example. But if you're trying to build general-case feed plumbing, this is unacceptable and untenable. If most everything in a feed item gets lost as it goes through a pipeline, we might as well abandon feed extensions entirely in favor of microformats stuffed into the description tag—an idea to which I'm not entirely opposed.

On the other hand, consider this: This past summer, Microsoft released an architectural overview of an OS-wide RSS framework in the upcoming Vista. Unfortunately, like many RSS- and Vista-related things from this past summer, all the links have evaporated into a redirect to a single developer portal. I can't figure out why they didn't just redirect over to here, which took me a search and a few clicks to rediscover.

At any rate, there's one sentence in this overview that gives me hope for RSS as a general service in Windows Vista:

"It is also possible to access the item XML for applications that want to perform operations on the XML instead of using the item's properties."

So, dig that. If it works the way I hope it does, RSS in Vista will take care of the subscriptions for you, poll the feed data, grab new stuff—but then it leaves the data intact for you to process whatever new and unanticipated feed payloads may arrive.

That's how it should work.

Archived Comments

  • If you want feed processing tools that are useful for the general case, they have to be tolerant of things not understood. Rather than intrusively breaking apart and recasting feed data into a predetermined data structure, you’ve got to remain hands-off as much as possible.

    It sounds to me like you're arguing for a triplestore and SPARQL. You'd stash everything in the triple store, and then your front end app just needs to be able to construct the appropriate query and process the results. More general and extensible than creating custom classes for filtering by specific fields, and I think if you were to ever write a general filter app where users can specify filter fields and values, you'd basically be reimplementing SPARQL.

  • Well, a triplestore would be great if these feeds were RDF. But alas, with the exception of RSS 1.0, they're XML. I could play with trying to make transformations from XML to RDF, but that's getting back to a dangerously unlazy level of intelligence required to map unanticipated future feed extensions to RDF equivalents.

  • (Although, it'd be really cool if syndication feeds were all RDF and not just XML. Triples would be a lot nicer to sling around than SAX parsing events.)

  • Well, a triplestore would be great if these feeds were RDF. But alas, with the exception of RSS 1.0, they’re XML. I could play with trying to make transformations from XML to RDF, but that’s getting back to a dangerously unlazy level of intelligence required to map unanticipated future feed extensions to RDF equivalents.

    Isn't there already a clean mapping between Atom and RDF? There's a list of integration ideas here.

    On a side note, while Googling for Atom/RDF notes, I came across blogseive, which claims to be:

    ...a free web-based tool that creates new feeds by filtering, merging and sorting existing feeds. The BlogSieve engine accepts virtually every (valid) feed format, processed results are then exported into any feed format you choose.

    I haven't tried it, but it claims to allow filtering.

  • Isn’t there already a clean mapping between Atom and RDF?

Maybe, though I don't think it's official. And even if it is, it leaves out RSS. But even if it worked for RSS too, what about all the feed extensions that might be? I think RSS 1.0 had the right idea for extension modules in the RDF universe, but the world seems to be settling for XML.

    I haven’t tried [Blogsieve], but it claims to allow filtering.

    It does filter, but it does so destructively. (And that's not to mention the 7-8 step form I had to go through to start filtering. Definitely not a URL-line application.) But, with respect to their filtering and conversion, check out these feeds:

    If you compare these to each other, you'll find information loss and even just plain corruption. The dc:subject elements encoding del.icio.us tags are gone, even in the RSS-1.0-to-RSS-1.0 transformation. And somehow, in the Atom version, they managed to jumble up titles and authors. Granted, my stuff doesn't do conversion yet, but I wouldn't want to do it like this.

    I haven't tried it yet, but I'd have to guess that a podcast feed with iTunes and/or Yahoo! Media elements would get mangled in a very nasty way.

  • Wow, is it really 1985 again? I could have sworn all this was hashed out back then (only then it was a binary format, not a text-based format).

    Sigh.

Sean: Ooh, nice link! It's been a while since I read about IFF, and I don't think I ever quite understood the concept.

    But, this part certainly caught my eye:

    Our task is similarly to store high level information and preserve as much content as practical while moving it between programs. But we need to span a larger universe of data types and cannot expect to centrally define them all. Fortunately, we don't need to make programs preserve information that they don't understand.

    (Also, I miss my Amiga.)

Maybe, though I don’t think [a clean mapping between Atom and RDF] is official. And even if it is, it leaves out RSS. But even if it worked for RSS too, what about all the feed extensions that might be?

You could store the fully qualified entries, with the appropriate namespaces, and then define equivalencies using owl:equivalentClass. Then (I believe) a SPARQL query that extracted the rss:entry resources would also pick up the entries from Atom and the various RSS flavours. Although at that point you'd need an OWL-capable triplestore and library.
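For what the owl:equivalentClass idea in that last comment might look like, here's a sketch. The `atom:` prefix is made up for illustration (there is no official Atom-as-RDF vocabulary), and plain SPARQL won't apply the equivalence on its own—as the comment notes, this only works against an OWL-aware store.

```sparql
PREFIX owl:  <http://www.w3.org/2002/07/owl#>
PREFIX rss:  <http://purl.org/rss/1.0/>
PREFIX atom: <http://example.org/atom-as-rdf#>

# 1. Assert the mapping once (the atom: namespace is hypothetical):
INSERT DATA { atom:Entry owl:equivalentClass rss:item . }

# 2. With OWL reasoning enabled, resources typed atom:Entry now
#    answer queries phrased against rss:item:
SELECT ?item ?title
WHERE { ?item a rss:item ; rss:title ?title . }
```

The appeal is that the query side stays fixed while new feed vocabularies are folded in by adding mapping triples, rather than by rewriting the parser.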