Month: 2003/09
-
2003 September 29
-
Dynamic polling times for news aggregators, II
Okay, so that thing with the SQL I did Friday? I'm not exactly sure what I was thinking with it. I was doing something that seems really odd now, trying to collect counts of new items together by hour, then averaging those hourly counts across a week. Instead, I'm trying this now:

    SELECT source, 'update_period' AS name,
        round(min(24, max(1,
            (max(1,
                (iso8601_to_epoch(max(created)) -
                 max(now() - (7*24*60*60), iso8601_to_epoch(min(created)))
                ) / (60*60)
            )) / count(id)
        )), 2) AS value
    FROM items
    WHERE created >= epoch_to_iso8601(now() - (7*24*60*60))
    GROUP BY source

This bit of SQL, though still ugly, is much simpler. It leaves out the subselect, which I think I was playing with in order to build a little graph display of new items over time by source. What the above does now is get an average time between new items for the past week, with a minimum of an hour and a maximum of a day. This seems to be working much better.

An alternate algorithm I've been playing with was suggested in a comment by Gnomon, inspired by TCP/IP's Additive Increase / Multiplicative Decrease. With this, I subtract an hour from the time between polls when a poll finds new items, and multiply by 2 every time a poll comes up with nothing new. Using the average time between new items lessens my pummeling of servers per hour, but the second approach is even lighter on polling, since it's biased toward large leaps backing off from polling when new items are not found. I'll likely be switching between the two to see which one seems to work best. I'm hoping that, after playing a bit, I'll settle on one and my aggregator will play much nicer with feeds, especially once I get the HTTP client usage to correctly use things like Last-Modified headers and ETags. There's absolutely no reason for a news aggregator to poll a feed every single hour of the day, unless you're monitoring a feed that's mostly quiet except for emergencies. In that case, a different polling algorithm is needed, or maybe an instant messaging or pub/sub architecture is required.

Update: As Gnomon has corrected me in comments, I've got the AIMD algorithm mixed up. What I really should be doing is making quick jumps up in polling frequency in response to new items (multiplicative decrease of the polling period) and creeping away in response to no new items (additive increase of the polling period). As he notes, this approach should make an aggregator jump to attention when clumps of new posts come in, and gradually get bored over periods of silence. I've adjusted my code and will be tinkering with it.

Also, although Gnomon makes a good point that bloggers and their posting habits are not easily subject to statistical analysis, I've further refined my little SQL query to catch sources which haven't seen any updates during the week (or ever):

    SELECT id AS source, 'update_period' AS name,
        round(min(24, max(1, coalesce(update_period, 24)))) AS value
    FROM sources
    LEFT JOIN (
        SELECT source AS source_id,
            (iso8601_to_epoch(max(created)) -
             max(now() - (7*24*60*60), iso8601_to_epoch(min(created)))
            ) / (60*60) / count(id) AS update_period
        FROM items
        WHERE created >= epoch_to_iso8601(now() - (7*24*60*60))
        GROUP BY source
    ) ON sources.id = source_id

Also, in case anyone's interested, I've checked all the above into CVS. This beastie's far from ready for prime time, but it might be interesting to someone.

[ ... 1237 words ... ]
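For my own reference, the corrected adjustment boils down to something like this. It's a rough sketch, not the code I've actually checked into CVS, and the function name and constants are made up:

```python
# Rough sketch of the corrected AIMD-style adjustment (made-up names and
# constants, not the aggregator's actual code): jump to attention when new
# items show up, get bored slowly when nothing does.

MIN_PERIOD_HOURS = 1.0    # never poll more than once an hour
MAX_PERIOD_HOURS = 24.0   # never wait more than a day between polls
INCREASE_HOURS = 1.0      # additive increase when a poll finds nothing new
DECREASE_FACTOR = 2.0     # multiplicative decrease when a poll finds new items

def next_poll_period(current_hours, found_new_items):
    """Return the new polling period, in hours, after one poll."""
    if found_new_items:
        period = current_hours / DECREASE_FACTOR  # quick jump up in frequency
    else:
        period = current_hours + INCREASE_HOURS   # slow creep away
    return min(MAX_PERIOD_HOURS, max(MIN_PERIOD_HOURS, period))
```

So a feed that's been quiet all day ends up polled daily, but a clump of new posts pulls the period back down toward an hour within a couple of polls.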
-
2003 September 25
-
Dynamic feed polling times for news aggregators
Today, my aggregator got the following SQL worked into its feed poll scheduling machinery:

    SELECT id AS source, 'update_period' AS name,
        max(1, 1/max((1.0/24.0), sum(update_count)/(7*24))) AS value
    FROM sources
    LEFT JOIN (
        SELECT source AS count_id,
            round(iso8601_to_epoch(created)/(60*60)) AS hour,
            count(id) AS update_count
        FROM items
        WHERE created > epoch_to_iso8601(now() - (7*(24*60*60)))
        GROUP BY hour
    ) ON id = count_id
    GROUP BY source
    ORDER BY value

It's likely that this is really nasty, but I have only a street-level working knowledge of SQL. Also, a few of the date functions are specific to how I've extended sqlite in Python. It works, though, and what it does is this: for each feed to which I'm subscribed, work out an average time between updates for the past week, with a maximum period of 24 hours and a minimum of 1 hour. My aggregator does this daily and uses the results to determine how frequently to schedule scans. In this way, it automatically backs off on checking feeds which update infrequently, and ramps up its polling of more active feeds. This shortens my feed downloading and scanning time, and is kinder in general to everyone on my subscription list.

Next, among other things, I have to look into making sure that the HTTP client parts of this beast pass all the aggregator client HTTP tests that Mark Pilgrim put together.

Update: Well, it seemed like a good idea, anyway. But, on further examination, it has flaws. The most notable is that it assumes a polling frequency of once per hour, which works right up until I start changing the polling frequency with the results of the calculation. I haven't poked at it yet, but maybe if I take this into account, it'll be more accurate. On the other hand, I've also been thinking about a much simpler approach to ramping polling frequency up and down:

- Start out at a poll every hour.
- If, after a poll, no new items are found, double the time until the next poll.
- If new items were found, halve the time until the next poll.
- Provide lower and upper limits to this, say between 1 hour and 1 week.

Also, consider the ramp-up and ramp-down factor a variable setting too. Instead of a factor of 2, maybe try 1.5 or even 1.25 for a more gradual change. To go even further, I wonder if it would be valuable to dynamically alter this factor itself, to try to zero in on a realistic polling interval. Okay, there the simpler approach leaves simplicity behind. I'm sure there's some decently elegant math that could be pulled in here. :)

[ ... 638 words ... ]
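For anyone who'd rather not squint at the SQL, here's roughly what the query boils down to for a single feed, as a quick Python sketch (a hypothetical standalone function, not part of the aggregator):

```python
def update_period_hours(new_items_this_week):
    """Average hours between new items over the past week, clamped to 1..24.

    Mirrors max(1, 1/max(1.0/24.0, sum(update_count)/(7*24))) from the query.
    """
    items_per_hour = max(1.0 / 24.0, new_items_this_week / (7.0 * 24.0))
    return max(1.0, 1.0 / items_per_hour)

# A feed with 42 new items this week gets polled every 4 hours;
# a feed with one new item (or none) drops back to once a day.
print(update_period_hours(42))  # -> 4.0
print(update_period_hours(1))   # -> 24.0
```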
-
Atom is its Name-O?
I would like to propose, nay, admonish, that the name of the format and spec should be Atom, that the current naming vote should be killed, and we should move on to grander things without the auspices of "what's it called?!" over our heads. This has been going on far too long.

Source: Morbus Iff: 'Atom' Should Be It's Name, and It's Name Was Atom

I haven't been anywhere near the epicenter of Atom/Pie/Echo much, so this is mostly a 'me too' kind of posting. But, you know, as an interested hacker waiting for the dust to settle before I start paying much attention, the decision on a name, as superficial as it is, seems telling to me. On one hand, I could take it to be representative of what's going on inside the project as a whole. (If they can't settle on a name, how can they settle on what's included in the spec?) On the other hand, it could just be that naming the thing is the least interesting aspect of the project. But I only consider that because I'm a nerd, I've been there, and I want to see the project thrive. Others might not be so charitable or patient. :) So just name the dang thing Atom already.

[ ... 273 words ... ]
-
2003 September 21
-
Feedback loops and syndication
Enter attention.xml. Of course it monitors my attention list, noting what feeds are in what order. Then it pays attention to what items I read, in what order, or if not, then what feeds I scan, and for how long. The results are packaged up in an attention.xml file and shipped via some transport (RSS, FTP, whatever) to Technorati. Dave has some ideas about what he will provide in return: "If you liked these feeds and items, then here are some ones you don't know about that you may want to add to your list." But the real power comes in a weighted return feed that works like this: OK, I see who you think is important and what posts are most relevant to your interests. Then we factor in their attention.xml lists weighted by their location on your list, average the newly weighted list based on this trusted group of "advisors", and return it to your aggregator, which rewrites the list accordingly.

Source: Steve Gillmor's Emerging Opps

Dave Winer says this guy’s full of shit. I’m not sure why, or if it’s sarcasm. In a lot of ways, what Steve Gillmor wrote about sounds like syndicating whuffie and what Gary Lawrence Murphy of TeledyN wrote about republishing RSS items read and rated from one’s news aggregator.
Sounds like one of the next steps this tech needs to take to hit a new level of intelligence, forming a minimum-effort feedback loop from writers to readers and between readers themselves. What did I read today, and was it interesting? What did you read today, and was it interesting? What did we both read and both find interesting? What did you read, and find interesting, that I didn’t read and might find interesting? And then, back around to the author again, what of your writings was found very interesting, and (maybe) by whom?
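To make the weighting idea concrete for myself, here's one way it could work. This is entirely my reading of Steve's description, with made-up data structures; nothing like this exists yet:

```python
# Hypothetical sketch of the "weighted advisors" idea: each feed on my list
# is an advisor, weighted by how highly I rank it; their attention lists are
# merged into suggestions for me, best-scored first.

def merge_attention(my_ranking, advisor_lists):
    """
    my_ranking: my feeds, most important first.
    advisor_lists: maps an advisor feed to that advisor's own ranked feed list.
    Returns feeds I don't already read, scored by my advisors' attention.
    """
    scores = {}
    for my_rank, advisor in enumerate(my_ranking):
        weight = 1.0 / (my_rank + 1)              # higher on my list, more weight
        for their_rank, feed in enumerate(advisor_lists.get(advisor, [])):
            if feed in my_ranking:
                continue                          # only suggest feeds I don't read yet
            scores[feed] = scores.get(feed, 0.0) + weight / (their_rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```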
[ ... 313 words ... ] -
2003 September 19
-
Flash MX Hates Progressive JPEGs
Okay, I may be the last person fiddling with Flash to discover this, but here's what I've learned today: Flash MX hates progressive JPEGs. From the above:

"The Macromedia Flash Player does not have a decompressor for progressive JPEG images, therefore files of this type cannot be loaded dynamically and will not display when using the loadMovie action."

This would have been nice to know, hours ago. Or maybe fixed in the past year or so since the above-linked tech note.

See, although I'm a Jack of a lot of Trades, I don't really pay much attention to things like JPEGs and their progressive natures. It wasn't until I started randomly clicking buttons on and off in Macromedia Fireworks while exporting a test JPEG that I finally narrowed down the problem. This was after a day's worth of examining ActionScript, XML data, HTTP headers, and a mess of other random dead ends, and a lot of last-ditch, exhaustive twiddling of checkboxes and options. Then, once I had the words I wouldn't have had unless I already knew what my problem was, a Google search for "flash progressive jpeg" got me all kinds of info.

Problem is, the JPEGs supplied to the particular Flash app on which I'm hacking come from a random assortment of people working through a content management system on the backend. They upload them with a form in their browser, and this Flash app gets a URL to the image via an XML doc it loads. Me, I'm probably in bed when this happens. I'd love to have tested every one... er, rather, no I wouldn't.

So... now I just have to figure out how to get all these people to start making sure that their JPEGs aren't progressive. Hmph. I can only hope that this message gets indexed and maybe provides more triangulation for some other poor sucker in the future.

[ ... 429 words ... ]
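If I can get a hook into the upload step on the backend, a check like the following might at least flag the bad ones before they ever reach Flash. This is just a sketch, and it assumes the Python Imaging Library, which (as far as I know) records a progressive flag in the image's info dictionary:

```python
# Sketch of a server-side check for progressive JPEGs at upload time,
# assuming the Python Imaging Library marks progressive JPEGs with a
# "progressive" entry in the image's info dictionary.

import sys
from PIL import Image

def is_progressive_jpeg(path):
    """True if the file is a JPEG saved with progressive encoding."""
    img = Image.open(path)
    return img.format == "JPEG" and bool(img.info.get("progressive"))

for name in sys.argv[1:]:
    if is_progressive_jpeg(name):
        print("%s is progressive -- Flash won't load it" % name)
```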
-
2003 September 12
-
Don't copy that floppy, or cracked software strikes back
    * Orangerobot uses cracked software. I will respond to the following commands: !ame, !amsg, !quit, !open_cd, !switch_my_mouse_buttons
    Hmm. If what Orangerobot just emoted is true, that's funny as hell.
    !amsg Wang! Wang!
    and what's the purpose?
    AnitaR: Of the message from Orangerobot?
    yes
    must be part of the joke I'm not getting yet
    * Orangerobot uses cracked software. I will respond to the following commands: !ame, !amsg, !quit, !open_cd, !switch_my_mouse_buttons
    AnitaR: Could be a joke, but it appears that this person is using pirated software that's detected its illegitimacy and is allowing us to manipulate that user's computer.
    or its a social experiment by the person behind OR :)
    adamhill: Or that. :) Either way, it's fun
    I'm glad it isn't one of those experiments that tests how strong a shock we'll give the owner
    ?def orangerobot
    Some googling points to this software: http://www.klient.com
    !switch_my_mouse_buttons
    !ame likes cheddar cheese.
    * Orangerobot likes cheddar cheese.
    ?learn Orangerobot is either a person using cracked software or a social experiment by a demented psych student
    I understand now, Dr. Chandra; orangerobot is either a person using cracked software or a social experiment by a demented psych student
    !open_cd
    okay, I'm done.
    * Orangerobot uses cracked software. I will respond to the following commands: !ame, !amsg, !quit, !open_cd, !switch_my_mouse_buttons
    !quit
    hush.
    <-- Orangerobot has quit ("hush.")

[ ... 327 words ... ] -
2003 September 06
-
An API for Wikis? Here's one.
Some folks are experimenting with using Wiki to build websites. I particularly like what Matt Haughey did with PHPWiki and a bit of CSS magic dust. Looks nice, eh? [Via Seb's Wikis are Ugly? post at Corante] Janne Jalkanen's Wiki-based Weblog is interesting too. Hmm. Maybe blog API(s) can be used for Wikis too. That reminds me, shouldn't Wiki formatted text have their own MIME type? Is there one? "text/wiki"? For now, different dialects of Wiki formatting rules will have to be accounted for like "text/wiki+moinmoin".

Source: Don Park's Daily Habit

It's been a while since I last worked on it, but I did implement an XML-RPC API on a few wikis, called XmlRpcToWiki. Janne Jalkanen did a lot of work toward the same interface with JSPWiki. I use this API in the linkage between my blog and the wiki on this site. Now that I've drifted away from XmlRpc a bit and am more in favor of simpler REST-ish web service APIs, I'd like to see something more toward that end. Seems like a lot of people are discovering or rediscovering wikis since the introduction of Sam Ruby's wiki for Atom/Echo/Pie work, so it's interesting to see a lot of things come up again like grousing about APIs and mutant wiki-format offshoots and standards.

[ ... 290 words ... ]
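For a reminder of what that looks like from the client side, a call to a wiki's XML-RPC interface from Python goes roughly like this. The endpoint URL and the method names here are illustrative stand-ins, not the actual interface spec:

```python
# Illustrative sketch of poking a wiki's XML-RPC interface from Python.
# The endpoint URL and the getAllPages/getPage method names are stand-ins
# for whatever the particular wiki's interface actually exposes.

import xmlrpc.client

wiki = xmlrpc.client.ServerProxy("http://example.com/wiki/RPC2")

for name in wiki.getAllPages():      # assumed method: list all page names
    print(name)

print(wiki.getPage("FrontPage"))     # assumed method: fetch one page's text
```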
-
2003 September 05
-
White Hat Worms and robots.txt?
Or maybe it's time to release our own Defender.A worm which could invasively close down the relevant "holes" in Internet security. A defensive worm could use standard intrusion tactics for benign result. For example, it could worm it's way into Windows XP computers and get the owner's permission to turn their firewalls on. It could survey open TCP/IP ports and offer to close them.

Source: Superworm To Storm The Net On 9/11 (via KurzweilAI)

So, anger is my first reaction to the idea of any unwelcome visitors on any of my machines, well intentioned or not. I’m sure that there aren’t many who wouldn’t feel the same way. But, although a lot of us try to keep up on patches and maintain decent security, there’s the “great unwashed masses” who just want to “do email“.
On one hand, it’s easy to say, “Tough. Learn the care & feeding of your equipment.” Yeah, as if that will help or get any response from all the people who’ve bought into AOL and have been reassured for years that computers are friendly and easy beasts (despite their intuitions to the contrary). Hell, I’d bet that, more often than not, the same person who gets regular oil changes and tune-ups for the car has no idea how to do the equivalent for a computer (or that it even needs it). Cars have been positioned differently than computers. No one expects a Spanish Inquisition when they live in a virtual preschool of a user interface with large and colorful buttons and happy smiling faces. They know there’s some voodoo going on underneath, but the UI tells them that it’s nothing to worry about (until it isn’t working).
Now if the problem was just that stupid users ended up with broken computers, there’d be no problem. But, like cars with problems waiting to happen (like worn down tires), their users become a hazard to others. Unlike cars, however, the problems of stupid users’ computers are contagious and self-replicating: every tire blowout becomes a 1000 car pileup.
It’s like everyone sits on their recliners watching TV in their houses; not even realizing that there are doors to lock; not even hearing the intruders rummaging through the fridge in the kitchen; and certainly not knowing that there’s a guy sleeping on the sofa at night working by day to let his army of clones into the neighbor’s houses.
So, what about vigilante “white hat” worms? Wouldn’t it be nice if there was a guy wandering the neighborhood locking doors for the ignorant? Wouldn’t it be nice if there was a truck driver on the road who forced cars with bald tires off to the side for free tire replacement? Okay, maybe that’s a bit whacky, but then again, people with bald tires aren’t causing 1000 car pileups.
I’m thinking that “white hat” virii and worms are one of the only things that will work, since I’m very pessimistic about the user culture changing to be more responsible. Though, what about a compromise? Install a service or some indicator on every network-connected machine, somewhat like robots.txt, which tells friendly robots where they’re welcome and where they’re not. Set this to maximum permissiveness for white hat worms as a default. The good guys infect, fix, and self-destruct unless this indicator tells them to stay out. Then, all of us who want to take maintenance into our own hands can turn away the friendly assistance of white hat worms. It’s an honor system, but the white hats should be the honorable ones anyway. The ones which ignore the no-worms-allowed indicator are hostile by definition.
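The honor-system check is the whole trick. Purely hypothetically, with a made-up marker file standing in for the robots.txt-style indicator, it would amount to something like this:

```python
# Purely hypothetical sketch of the honor-system check a "white hat" worm
# would make before doing anything: look for an opt-out marker (the file
# name is invented for illustration) and go away if the owner has set it.

import os

OPT_OUT_MARKER = "/etc/no-worms"   # invented robots.txt-style indicator

def may_patch_this_host():
    """Honor system: only proceed if the owner hasn't opted out."""
    return not os.path.exists(OPT_OUT_MARKER)

if may_patch_this_host():
    print("no opt-out marker; apply the fix, then self-destruct")
else:
    print("opt-out marker found; leave this machine alone")
```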
So, then, the internet develops an immune system. Anyone can release a white hat worm as soon as they find an exploit to be nullified, and I’m sure there are lots of geeks out there who’d jump at the chance to play with worms and virii in a constructive way. And if you want to opt-out of the system, go for it. Hell… think of this on a smaller scale as a next-gen anti-virus software. Instead of internet-wide, just support P2P networks between installations of your anti-virus product. When it’s time to close a hole, infect your network with a vaccinating update. I doubt this would work as well as a fully open system, but might have less controversy.
Anyway, it’s a whacky idea to a whacky problem that just might work.
[ ... 1056 words ... ] -
2003 September 04
-
Litany against meetings, courtesy of purl
[ ... 104 words ... ]
-
Jibot and purl, distant cousins?
What the [#joiito](/tag/joiito) bot knows. I'm dumping it out dynamically with the Twisted webserver, which is all Python too.

Source: Epeus’ epigone - Kevin Marks weblog

While the #joiito bot is looking pretty keen, I keep wondering if anyone hacking on it has seen Infobot? It’s the brains behind purl, the bot serving [#perl](/tag/perl) channels on a few IRC networks. Jibot seems to have some funky punctuation-based commands, but purl accepts commands in formulaic English and even picks a few things up from normal channel chatter. When I look at Kevin Marks’ dump of Jibot’s brains, I can’t help but think of the gigantic factoid packs available for Infobot.

[ ... 167 words ... ]
-
2003 September 03
-
Another BookmarkBlogger in Python
I haven't been paying attention to my referrers as much lately, but I probably should, because when I do, I find things like another implementation of BookmarkBlogger in Python, this one by David Edmondson. His version has far fewer requirements, using only core Python libraries as far as I can see. One of these, which I had no idea existed, is plistlib, "a tool to generate and parse MacOSX .plist files". When I get another few round tuits, I'll likely tear out all the XPath use in my version and replace it with this. Bummer. And here I thought I was all clever using the XPaths like that in Python :)

[ ... 163 words ... ]
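For the curious, plistlib really is about as simple as it sounds. A quick sketch; the bookmarks path is just an assumption about where Safari usually keeps the file:

```python
# Quick sketch of plistlib in action: parse a .plist file and dump the
# structure of its dictionary keys. The bookmarks path is an assumption
# about where Safari usually keeps the file; point it anywhere you like.

import plistlib

BOOKMARKS = "/Users/someone/Library/Safari/Bookmarks.plist"  # assumed path

def dump(node, depth=0):
    """Recursively print the dictionary keys of a parsed plist."""
    if isinstance(node, dict):
        for key, value in node.items():
            print("  " * depth + str(key))
            dump(value, depth + 1)
    elif isinstance(node, list):
        for item in node:
            dump(item, depth)

with open(BOOKMARKS, "rb") as fp:
    dump(plistlib.load(fp))
```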
-
2003 September 02
-
ChangeLog to RSS web service
Hanging out on #joiito on IRC today, I read Ecyrd asking around about any tools to present GNU-style changelogs as an RSS feed. I couldn't find any, but I did find this changelog parser, apparently by Jonathan Blandford. So, when I had a few free minutes, I took some parts I had lying around, along with this parser, and made this:

- Changelog for JSPWiki
- Source code for cl2rss

This is at the "it works" stage. It needs much work in what it presents in an RSS feed, so feel free to suggest changes!

[ ... 188 words ... ]
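The parser does the interesting part. The general shape of splitting a GNU-style ChangeLog into entries looks something like this; a sketch of the idea, not Jonathan's parser or the code cl2rss actually uses:

```python
# Sketch of splitting a GNU-style ChangeLog into entries (not the parser
# cl2rss actually uses). Each entry starts with a "YYYY-MM-DD  Author <email>"
# line at column zero, followed by indented "* file: description" lines.

import re

HEADER = re.compile(r"^\d{4}-\d{2}-\d{2}\s+\S")

def parse_changelog(text):
    """Return a list of (header_line, body_text) tuples, newest entry first."""
    entries, header, body = [], None, []
    for line in text.splitlines():
        if HEADER.match(line):
            if header is not None:
                entries.append((header, "\n".join(body).strip()))
            header, body = line.strip(), []
        elif header is not None:
            body.append(line)
    if header is not None:
        entries.append((header, "\n".join(body).strip()))
    return entries
```

Each (header, body) pair then maps pretty naturally onto an RSS item's title and description.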
-
Using web services and XSLT to scrape RSS from HTML
After tinkering a bit with web services and XSLT-based scraping last week for generating RSS from HTML, I ripped out some work I was doing for a Java-based scraper I'd started working on last year and threw together a kit of XSLT files that does most everything I was trying to do. I'm calling this kit XslScraper, and there's further blurbage and download links available in the Wiki. Check it out.

I've got shell scripts to run the stuff as a cron job, and CGI scripts to run it all from web services. For quick gratification, check out these feeds:

- The Nation (using Bill Humphries' XSL)
- KurzweilAI.net
- J-List -- You've got a friend in Japan!
- New JOBS at the University of Michigan (By Job Family)

[ ... 141 words ... ]
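The kit itself is shell scripts and CGI wrapped around an XSLT processor, but the core trick, in Python terms, looks roughly like this. A sketch using lxml; the page URL and stylesheet filename are placeholders:

```python
# Rough sketch of the HTML-to-RSS scraping idea in Python, using lxml
# (the XslScraper kit itself drives an XSLT processor from shell and CGI
# scripts). The page URL and stylesheet filename below are placeholders.

from lxml import etree, html

def scrape_to_rss(page_url, stylesheet_path):
    """Fetch an HTML page, run an XSLT over it, and return the RSS as a string."""
    doc = html.parse(page_url)                            # tolerant HTML parse
    transform = etree.XSLT(etree.parse(stylesheet_path))  # compile the stylesheet
    return str(transform(doc))

print(scrape_to_rss("http://example.com/news.html", "scrape-to-rss.xsl"))
```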