I'm a complete neophyte when it comes to machine learning, but I'd like to get into learning more about the field in general. In particular, I'd like to make my news aggregator smarter. I've already tried using SpamBayes, but that didn't make me happy. Whether the problem was my approach or that Bayes itself is not suited to this task, I'm not sure, though I suspect it's a little of both.

It seems like the magic Bayesian pixie dust works well for spam-vs-ham in my email box, so why shouldn't the magic for interesting-vs-yawn work for my aggregator firehose? Well, here are the issues I'm guessing at:

In the case of spam-vs-ham, you want to classify things into this or that-- that which is kept, and that which is tossed away. But in the case of items in my aggregator, I want a relative sort order or a score. I want a fuzzy guess toward my interest with which to inform presentation of items. Interesting-vs-yawn is more of a continuum than a pair of buckets.

And then, there's the passive gathering of behavioral data from my interactions with the aggregator, because I'm sure as hell not going to click ratings or thumbs-up/down all day. In spam-vs-ham, I could build up two clean mailboxes for training the categorizer, with one containing all spam and the other all ham. But, in the case of my aggregator, the only things I'm tracking are the items in which I showed interest by revealing more information or by clicking through.

So, I can say that the items in one particular pile are all interesting. But, my interest level for the rest of the items received is a complete unknown-- maybe I'm vehemently uninterested in those 50 items, but maybe I just never got around to looking at those other 20 and just let them fall off my date range for display. Thus, I have a pile of ham, and a pile of undifferentiated unknown. I'm not bothering to provide any cues as to whether I don't like something, because that'd be boring work-- I mean, I am uninterested in those items, after all. So, I'd like to leverage what the system knows from what I care to provide, but not jump to any conclusions about the items in the unknown pile. There is no spam, only various flavors of ham.
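To make the shape of that concrete, here's a rough sketch of how a "pile of ham plus a pile of unknown" could still yield a relative score rather than a yes/no bucket. It is only a guess at one approach, not something I've verified on real feeds: each word gets a weight from how much more often it shows up in clicked items than in everything received, and an item's score is the average weight of its words. The names `tokenize`, `build_scorer`, and the smoothing parameter `alpha` are all made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_scorer(clicked_texts, all_texts, alpha=1.0):
    """Return a function that scores text by similarity to clicked items.

    clicked_texts: items I showed interest in (the "ham" pile).
    all_texts: every item received, clicked or not (the background).
    alpha: additive smoothing constant (an assumption, not tuned).
    """
    clicked = Counter(w for t in clicked_texts for w in tokenize(t))
    background = Counter(w for t in all_texts for w in tokenize(t))
    vocab_size = len(set(clicked) | set(background))
    n_clicked = sum(clicked.values())
    n_background = sum(background.values())

    def weight(word):
        # Words never seen anywhere carry no evidence either way.
        if word not in clicked and word not in background:
            return 0.0
        p_clicked = (clicked[word] + alpha) / (n_clicked + alpha * vocab_size)
        p_background = (background[word] + alpha) / (n_background + alpha * vocab_size)
        # Positive when the word is over-represented in clicked items.
        return math.log(p_clicked / p_background)

    def score(text):
        words = tokenize(text)
        if not words:
            return 0.0
        # Average so that long items are not favored just for being long.
        return sum(weight(w) for w in words) / len(words)

    return score
```

Sorting with something like `sorted(items, key=score, reverse=True)` would then give a relative order. Note that unclicked items are never treated as spam here; they only contribute to the background counts, so a low score means "unlike what I've clicked so far," not "disliked."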

Given all this, then, is there anyone out there who knows more about machine learning than me who could maybe point me toward a better approach or algorithm that fits this profile?


Archived Comments

  • There's a package called Reverend out there, which is a Bayesian filter not specialized for spam. It might be worth looking at. Also, in training, the system should tell you what it thinks you'll think, and then you correct it if it's wrong. Without this you can overtrain, creating continual positive feedback, when all you want is corrective feedback. OTOH, there are rating systems, like (I think) on Amazon. There they correlate your ratings with other people's ratings. There you can usefully rate anything, because determining your correlated users is separate from the rating it actually presents you -- it's not determining your preference, simply your demographic. Bloglines could do this (and they try a little), but you couldn't in isolation. Anyway, some ideas.
  • Have you looked at AmphetaRate (http://amphetarate.sf.net)? They claim to be using Bayesian training. I've been using it for about a week. There are a number of things I don't like about it, but it's an interesting system. It's Perl / PHP, FWIW.
  • Where to start. Hmmm. The problem with Bayesian spam/ham sorting for news readers comes down to the simplistic stats. The simplest statistical measure to take is on individual words. For normal email spam/ham there are (or rather, were) words that stood out as good/bad markers. But in something like an RSS feed, where you have already selected feeds that are reasonably close to your interest set, the individual words are no longer useful for sorting interesting vs yawn. As you can probably tell, I had a go at doing this for an AI course I did last semester, and I wound up with a furball that didn't actually fly. I used Python and wxpy to build an aggregator around an IE control with good/bad buttons for marking, but my word distributions didn't wind up correlating. Is your aggregator open for hacking? I have a few theories that I'd like to testbed, and now that I own a Mac, my old code is nigh on useless :-)
  • Machine learning is not on my list of "do-know-something" areas, but I could suggest adding a bit of commonsense logic to the system so it learns from uninteresting items. The fact that an item is uninteresting is denoted by the property that you have not clicked on it. While one may not be able to deduce much of a pattern from a single item as such, it could be used to deduce useful information about the metadata associated with the item. For example, I found that I was reading fewer and fewer items from blogger X's feed. There were days at a stretch where I would not find anything written by X interesting enough. If a cron job were run that looked up the last time I read any of X's items, it could have noticed this trend and lowered X's score, thus pushing his items below my "casual-glance-view". While doing this alone would cause one to miss a very interesting item that X might write once in a blue moon, if one uses other metadata associated with the feed item (like keywords etc.), it could be brought back within my "casual-glance-view". Intellie-Aggie tries to use this logic to filter out items, but I have left the code untouched for far too long to expect any wonders.
  • I'm hoping that aggregators will get smarter too, because with the info-glut that RSS enables, the computer is just going to have to get smarter about guessing what we're going to be interested in.
  • Take a look at my Aggrevator: http://www.oshineye.com/software/aggrevator.html which uses scoring to enable users to deal with very large numbers of feeds. I went with this approach after running into re-calculation problems with using Bayesian analysis for ranking entries: http://www.advogato.org/person/ade/diary.html?start=11
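The feed-level idea from the comment about blogger X (lower a feed's score the longer it goes unread, but let item keywords pull a good item back into view) could be sketched roughly as follows. This is an illustration under assumptions, not how Intellie-Aggie or any other aggregator mentioned here actually works: the exponential decay, the 14-day half-life, and the names `feed_weight` and `item_rank` are all hypothetical.

```python
def feed_weight(days_since_last_read, half_life_days=14.0):
    """Weight for a feed that halves every `half_life_days` without a read.

    A feed read today gets 1.0; the 14-day default is an arbitrary guess.
    """
    days = max(days_since_last_read, 0.0)
    return 0.5 ** (days / half_life_days)


def item_rank(days_since_last_read, keyword_score, half_life_days=14.0):
    """Combine the decayed feed weight with a per-item keyword score.

    keyword_score is assumed to come from some other source (for example,
    a word-based scorer); a strong value can lift an item from a stale feed.
    """
    return feed_weight(days_since_last_read, half_life_days) + keyword_score


# A stale feed with a strongly matching item versus a fresh feed with a
# neutral item: the keyword match is what brings the stale item back up.
stale_but_relevant = item_rank(days_since_last_read=28, keyword_score=1.0)
fresh_but_neutral = item_rank(days_since_last_read=0, keyword_score=0.0)
```

Adding the two terms is the simplest possible combination; whether the feed weight should instead scale the keyword score is a design choice that would need real click data to settle.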