Issues in using SpamBayes to filter news items
Despite reading an entry by Srijith arguing that Bayes-based classification is unsuitable for use in news aggregators, I tied SpamBayes into my homebrew news aggregator and have been trying it out this week. I know I’ve been talking about it for a while, but procrastination and being busy all around kept me from getting to it. The funny thing is, when I finally got a chance to really dig in, the integration was a snap. I’d anticipated a bit of work, but was pleasantly surprised. I doubt that any other aggregator written in Python would have a hard time with it.
If, that is, anyone else wants to do it. I already knew it wasn’t magic pixie dust, but I figured it might be worth a try. I’ll be eating my own dogfood with this for a while, but I’m already thinking that what’s good for spam might not be so good for news aggregators.
Srijith’s post mentions some snags with ignoring the semantics of a news item, such as whether a word appears in the item’s title, or information about the item’s source. I don’t think this completely applies to how I’m doing classification, since SpamBayes appears to differentiate between words found in email headers and words found in the body itself. When I feed an item to SpamBayes for training and scoring, I represent it as something like an email message, with headers like date, subject, and from, plus an “X-Link” header for the item’s link. Even with this, though, I think Srijith’s got a point when he writes that this method will miss a lot of available clues for classification.
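For the curious, here’s roughly what that looks like. This is a minimal sketch rather than my exact code: the item dictionary and function names are hypothetical, and the calls into SpamBayes’ classifier and tokenizer modules reflect my reading of its API.

    # Sketch: wrap a news item as an email-like message for SpamBayes.
    # The item dict here is hypothetical; classifier.Classifier,
    # Classifier.learn/spamprob, and tokenizer.tokenize come from
    # SpamBayes' own modules.
    from email.Message import Message
    from spambayes import classifier, tokenizer

    def item_to_message(item):
        """Represent a news item as something like an email message."""
        msg = Message()
        msg['Date'] = item['date']
        msg['Subject'] = item['title']
        msg['From'] = item['feed_title']
        msg['X-Link'] = item['link']    # custom header carrying the item's link
        msg.set_payload(item['description'])
        return msg

    bayes = classifier.Classifier()

    def train_on_item(item, interesting):
        # SpamBayes learns from a token stream; here "spam" means "not interesting".
        bayes.learn(tokenizer.tokenize(item_to_message(item)), not interesting)

    def score_item(item):
        # spamprob() gives 0.0 (interesting "ham") through 1.0 (boring "spam").
        return bayes.spamprob(tokenizer.tokenize(item_to_message(item)))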
Unlike Srijith’s examples, though, I’m not trying to train my aggregator to sift entries into specific categories. So far, I’ve been trying to get it to discriminate between what I really want to read and what I’m not so interested in. I figured that something which can learn the difference between spam and normal email could help there. But although it’s early, I’m noticing a few things about the results, and a few things have occurred to me.
See, in the case of ham vs. spam, I really want all the ham and none of the spam. A method to differentiate between these two should be optimized toward one answer or the other. SpamBayes offers “I don’t know” as a third answer, but it’s not geared toward anything else in between. However, in measuring something like “interest”, in-between answers are useful. I want all of the interesting stuff, some of the sort-of interesting stuff, and a little of the rest.
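As a rough illustration of the difference: SpamBayes’ stock cutoffs carve the score into three bins (ham, unsure, spam), while interest wants more of a sliding scale. The bucket names and thresholds below are made up for the sake of the sketch, not anything SpamBayes provides.

    # Hypothetical graded-interest buckets over a 0.0-1.0 score, in place
    # of SpamBayes' usual ham / unsure / spam cutoffs.
    def interest_level(score):
        if score < 0.2:
            return 'must read'    # all of the interesting stuff
        elif score < 0.5:
            return 'maybe'        # some of the sort-of interesting stuff
        elif score < 0.8:
            return 'skim'         # a little of the rest
        else:
            return 'skip'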
This is also a problem for me in deciding what should get a thumbs up and what gets a thumbs down. Even though I’ve subscribed to a little over 300 feeds, every item from each of them is at least somewhat interesting to me. I wouldn’t have subscribed to a feed if there wasn’t anything of interest there, so I’ve already biased the content of what I receive. Some items are more interesting than others, but the difference between them is nowhere near the difference between wanted ham and unsolicited spam. So I find myself giving the nod to lots of items, but only turning down a few, while SpamBayes would like roughly equal examples of both, if possible.
I’ll still be playing with this for a while, but I need to look around at other machine learning tech. I’m just hacking around, but the important thing is to understand the algorithms better: how they work and why. Bayes is in vogue right now, but as Mark Pilgrim intimated, it’s not magic. It’s just “advanced” :)
In the immortal words of Mark Jason Dominus: “You can’t just make shit up and expect the computer to know what you mean, retardo!”