You know that sleep is getting hard to get
'Cause you never know what you'll forget
And I've got to know of all the news
'Cause one day there'll be news for me

I never let a headline by
'Cause every one will catch my eye
And though it's tough to keep alert
You never know what could hurt me

The Syndication Wars are over--at least, as far as I'm concerned.

It's hard to resist jumping in when I think someone's got it wrong, or when my fingers compel me to feed trolls, but resisting that impulse is exactly what's needed. At this point, all that seems to happen is that the same old threads get recycled. No one's got anything new to bring, except maybe ad hominem grousing or possibly a Yo Mama joke (though I've yet to see that particular innovation). Anyone who cares can do a bit of Googling to catch up on the story.

The fact is, I no longer care about RSS versus Atom versus $foo. Mark Pilgrim's feed parser lets me not care, and most aggregators I might use have implemented support for both Atom and RSS. And when they don't, I can route around the damage just like I do when I scrape sites devoid of any feeds.
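
For the curious, here's roughly what that indifference looks like in practice (the feed URLs are made up). Pilgrim's parser normalizes both dialects into the same structure, so code downstream of the parse never knows which war it's ignoring:

```python
# Sketch: Mark Pilgrim's feed parser irons RSS and Atom into one shape.
import feedparser

for url in ("http://example.com/index.rss", "http://example.com/atom.xml"):
    d = feedparser.parse(url)
    print(d.version)  # 'rss20', 'atom10', etc. -- the only trace of the dialect
    for entry in d.entries:
        # title and link are there regardless of which format the feed spoke
        print(entry.title, entry.link)
```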

While I do prefer Atom over RSS, almost a year later I still say the magic is in syndication, not the format. I'll let the tag-level grumblers foam on without comment and just thank them for their work when a good new spec bubbles up or when something fun and useful comes out. I'm circling that whole area of concern and sticking a post-it on it that reads: RSS and Atom both useful, neither perfect, neither going away.

Whew. That's a weight off my brain. Now what is there left to talk about?

Well, how about we talk some more about what to do with the items we get, once we do manage to parse a feed? (Sheesh, you mean we're not already fighting over that topic?)

Rogers Cadenhead wants gluttonous RSS feeders:

With thousands of information sources producing RSS and Atom feeds, we need people like Thauvin [whose linkblog is here] who have integrated weblogging into their daily news-gathering routine. Weblog links are like ant trails -- a lot of people have to link to something good in order to get noticed.

I self-identify as such, since my feed list has topped 550. But I'm happily surprised to find that I'm not even in the top 10 of prolific subscribers--at least I'm not the biggest Info Freako. (Yet.)

I'm adding two or three feeds to my list daily, so I can see myself approaching 1000 eventually. But I'm starting to hesitate before adding one more feed now. Even with my current streamlined multi-pass skimming process, I'm starting to see diminishing returns. I breeze past screenloads of chaff that I'll never view, but it still bogs me down. I can only think that people with twice as many subscriptions as me either have more free time or a better mousetrap.

The usual response I get to my subscription count is, "Why don't you cut that list down to about 100 essentials?" And even that's said with a smirk, usually by someone with under 50 subscriptions who's not as obsessive an Info Freako as I am. Thing is, though, good stuff has at one point or another shown up on each and every feed I monitor. I want to figure out how to scale up from 1 to 10 to 100 to 1000 to 10000 sources and beyond.

(Singing interlude: "Info Freako / There is no end to what I want to know")

Besides, this is a chance to tinker with and learn about another area I'm interested in: machine intelligence and intelligence amplification. Rogers says, "I want a Bayesian filter that can guess which new headlines I'm most likely to read." Though someone else might apply Bayes in a way that works for them, I didn't find my experiments with SpamBayes very satisfying. I suspect it has something to do with the fact that SpamBayes is geared toward sorting out a quasi-binary world of spam-versus-ham, while I'm interested in a spectrum between must-read and shrugs.

But, the idea of introducing another pass through items at the head of the process, this one partially or completely automated, has great appeal. Done right, this could be the bit that adds an order of magnitude to my capacity to monitor feeds. I need to investigate other machine learning approaches.
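
To sketch what I mean by that automated pass (a toy illustration, not SpamBayes and not anything I've actually built): keep word counts from items I've read versus items I've skipped, and emit a continuous score instead of a binary verdict.

```python
# Toy interest scorer: per-word counts for read vs. skipped items,
# combined into a crude log-odds score. Names and smoothing invented.
import math
import re
from collections import Counter

WORD = re.compile(r"[a-z']+")

class InterestScorer:
    def __init__(self):
        self.read = Counter()     # words from items I actually read
        self.skipped = Counter()  # words from items I breezed past

    def train(self, text, was_read):
        (self.read if was_read else self.skipped).update(
            WORD.findall(text.lower()))

    def score(self, text):
        """Crudely smoothed log-odds; higher means more likely must-read."""
        n_read = sum(self.read.values()) or 1
        n_skip = sum(self.skipped.values()) or 1
        total = 0.0
        for w in WORD.findall(text.lower()):
            p_read = (self.read[w] + 1) / (n_read + 1)
            p_skip = (self.skipped[w] + 1) / (n_skip + 1)
            total += math.log(p_read / p_skip)
        return total
```

Sorting a morning's haul by that score, highest first, would float the likely must-reads to the top without throwing anything away--and a lazy thumbs up or down could feed straight back into train().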

The idea is that, while I freakishly want to catch as much info as possible, I can only handle so much in a day. For certain, I can't handle everything that might be interesting to me, so I need some prioritization and some pre-filtering before my attention gets applied to the flow.

The way I picture it is applying a sort of inverted pyramid to the incoming flow of items.
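
In code, the pyramid might be nothing fancier than thresholding an interest score into tiers; the cutoffs here are invented for illustration:

```python
# Hypothetical pyramid: a narrow must-read tip, a skimmable middle,
# and a wide base I only see if I go digging.
def tier(score):
    if score > 5.0:
        return "must-read"  # rendered in full, top of the page
    elif score > 0.0:
        return "skim"       # titles only, in the collapsed outline
    else:
        return "base"       # hidden behind a disclosure triangle
```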

I started with a few primitive tools in AmphetaOutlines' adaptivity to reading patterns, limited mostly to sorting channels by a count of items read historically. I also introduced some information hiding and exploration aspects: I tried to hide or de-emphasize older items via font size and weight, and I put items into a JavaScript-driven outline where item descriptions and more ancient items could be hidden or revealed with a disclosure triangle.
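
That adaptive sorting really was as primitive as it sounds; it amounted to something like this (the read_counts bookkeeping is a stand-in, not the actual AmphetaOutlines code):

```python
# Channels with the most historically-read items float to the top.
def sort_channels(channels, read_counts):
    return sorted(channels, key=lambda c: read_counts.get(c, 0), reverse=True)
```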

In my latest attempt, I've not yet implemented any adaptive sorting, but I've kept and improved the outline display (see: screenshot #1, screenshot #2). Also, I can now mark items as seen and/or flag them to be viewed in a queue for later. I've got some lame SpamBayes integration in there, but I've let it atrophy in daily use due to a complete lack of usefulness.
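
The bookkeeping behind "seen" and "flagged" is no deeper than it sounds; a sketch, with invented structure:

```python
# Per-item state behind "mark as seen" and "flag for later."
from dataclasses import dataclass

@dataclass
class Item:
    title: str
    link: str
    seen: bool = False
    flagged: bool = False

def later_queue(items):
    # The view-later queue is just every flagged item, unseen ones first.
    return sorted((i for i in items if i.flagged), key=lambda i: i.seen)
```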

I'm starting to think about next steps now toward a more advanced aggregator. I've still got my wishlist for AmphetaOutlines, and I've actually covered quite a few of the items with this new aggregator. But, I'm thinking things like the following would be useful to pursue:

  • What do you think is more important? Do you value one group of feeds over another? Personally, I want to see every single web comic that appears in my queue, most items from Engadget and Boing Boing, and maybe only a few from some of the firehoses I've hooked myself up to. Also, there are some bloggers who post somewhat infrequently, but I don't want to miss a thing when they do post. I need to be able to group and prioritize manually.

  • What do you demonstrate as important? Which feeds' items receive more of your attention, and within those feeds, which topics and phrases appear most frequently? The machine should be able to make some observations about your history of behavior and have some input into how items are organized and presented. It should also offer a simple, lazy way to give its recommendations a thumbs up or thumbs down.

  • Republishing of interesting items to a linkblog is a must. On the flip side, it would be nice to pull in others' linkblogs in a more meaningful way than simply watching their feeds. I should be able to triangulate a bit and get recommendations based on mutual links predicting future interest in items. We need to start chasing ant trails unconsciously and automatically.

  • Time-limited subscriptions, which expire after a set time or ask the user to renew. Use these to track comment threads that offer RSS feeds. (Like this one.)

  • More statistics and health monitoring of subscriptions. How active are your feeds? Which are dead and gone, and which merely on hiatus? Have any moved? (A rough sketch of this, and of the expiring subscriptions above, follows this list.)
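
For those last two ideas, a first cut could be as dumb as comparing each feed's newest entry date against a few cutoffs. Everything here--cutoffs, field names--is invented for illustration:

```python
# Hypothetical health check: classify a subscription by the age of its
# newest entry, and let time-limited subscriptions expire outright.
import time

DAY = 86400  # seconds

def health(newest_entry_epoch, expires_epoch=None, now=None):
    now = now or time.time()
    if expires_epoch and now > expires_epoch:
        return "expired"        # e.g. a comment-thread feed past its window
    age = now - newest_entry_epoch
    if age > 180 * DAY:
        return "dead and gone"
    elif age > 30 * DAY:
        return "on hiatus"
    return "active"
```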

Now, I haven't done any sort of comprehensive survey of the aggregator landscape in a long time, so I'd be very intrigued if any existing software implements these sorts of things. I've seen some progress toward monitoring feed health, but I've seen next to nothing toward automatic filtering of items and recommendations based on past behavior. I have seen manually constructed filters, but I'm too lazy to try to figure out how to tell the computer what I want. I want the machine to ride shotgun, watch and learn.

Need more input! Where's the state-of-the-art for feed aggregators, and what's next? I'm tired of reverse-chronological versus three-pane; I'm tired of copying Usenet and email. What needs to happen next to expand our Info Freako capacity by an order of magnitude or two? The invention of aggregators has opened the door to the first few orders, but I need more.

Exit singing (while twitching for more info):

Info Freako,
There is no end to what I want to know

But it means I'll have the edge over you
And it means I'll always have the edge over you
And you know there's nothing that you can do

Archived Comments

  • I am trying to do this with my own home-made aggregator (www.temboz.com). I have serious doubts about software that could automatically determine whether an article is interesting or not, and would settle for a way to suppress duplicates as a meme propagates through the blogosphere, e.g. by finding posts that link to the same URLs. I did implement a simplistic kill-file-style filtering system that works reasonably well: no false positives, and 204 out of 1701 items caught in the last 7 days.
  • Anyone who subscribes to so many feeds has a mental illness. Like some kind of obsessive-compulsive disorder. I bet they record all the channels on TV in case they miss something. I bet they don't throw away any newspapers either. I subscribe to about 144 feeds, but that is only because I am testing an app. Most of it is garbage.
  • You might check out Findory Blogory (http://blogory.com). It's a personalized weblog reader that learns from the weblog articles you read and recommends other articles that appear to match your interests. It's not a Bayesian filter like SpamBayes, but it does "guess which new headlines you're most likely to read." It is "automatic filtering of items and recommendations based on past behavior." You don't have to manually construct filters or "tell the computer what you want." Findory Blogory will "ride shotgun, watch and learn." Take a look. I'd love to hear your thoughts.
  • Olav: See, for me, it isn't so much a compulsion as a challenge. I don't read through my aggregator all day - I spend about an hour per day at lunch, maybe another at home skimming through items, and maybe another hour or so reading longer stories. This isn't handwashing OCD behavior here - I don't read or even linger on more than a small fraction of what ends up on the screen. It's a lot of quick glancing, page-down, and mass delete. It's about as much time as I used to spend on just a half-dozen sites like slashdot.org and cnet.com, only now my sources are much more varied. The challenge is that wanting to squeeze coverage of more sources into my daily browsing pushes me to learn more about certain technologies. That, and dealing with larger volumes of data is kinda fun in a geeky sort of way. And, no, I don't record all the channels on the TV. Besides watching 1 or 2 shows and playing a few video games, this is what I do instead of sitting in front of the tube.
  • When I wrote that story, I wasn't expecting people to jump into a fight about RSS vs. Atom. Should have known better. I just didn't like the original story and wanted to rant about it. Oh well, you learn something new every day.
  • You might want to look at Reverend: http://www.divmod.org/Home/Projects/Reverend/ It's a general-purpose Bayesian classifier.
  • There’s what we’d like to scan and what we’d like to read, yes? Moving content to the must-read front burner will always be deliberate, but the content we scan can probably be improved mechanically. I doubt Brother Orchard thinks there’s an algorithm for the A-list. We’re just talking about making more productive use of the proximate cloud.

    Scenario #1: You Google for topics of interest. The hits returned can’t be reliably filtered by creation date, and they consider (in general) the entire indexed Web. It’s a good thing, if scatter-shot. Queries can only be refined so much.

    Scenario #2: You amass and manage a list of syndicated feeds in which you have a passing interest. The feed content is indexed (Bloglines does this; Google does or will, too). You can query this universe discretely, using Google-like devices and maybe a few new ones.

    What’s different about these two? In #2, you selected (and casually manage) the universe. Its chunks — though perhaps truncated or paraphrased by the author — are reliably date-stamped. A free-text query of such data would be very different from an identical Google query. Add a few more discriminators and it gets better. Suppose you could discriminate by:

    1. Paragraph length. That might get you recent topical hits of terse or essay length, as you choose.
    2. Link-text density. Filter in or out those link-collection posts.
    3. Query-term density or semantic weight. Is the topic being addressed directly or simply mentioned?
    4. Item weight or image density. Helps filter noisy or glitzy corporate feeds (think Ziff-Davis).

    There are plenty more. The main differences, though, are date discrimination and the pre-filter of a confined universe. Me, I’m down to thirty feeds and ignore half of those. Better tools might change that. LQ
  • I love the picture. Need input!