Month: 2003/08
-
2003 September 01
-
Switching to a JVDS server
It's been one year since I signed up for a JohnCompanies server, and though I've had no complaints whatsoever, I've signed up for a server instance with JVDS.com and have moved just about everything over. Why? Because it's cheaper, and I have less disposable income these days. And, well, it seemed like fun to try another virtual server company, since I've been looking into it so much lately. The new server has less capacity than the one I've had at JohnCompanies, but I really don't need all that much -- just a roof over my files and a root password. Well, I don't really need a root password, but it's nice to have so that I can tinker around with more things with fewer questions asked of the management. (For what it's worth, we're still using JohnCompanies servers for hosting at my work.) I've almost got this server migration thing down to a science, though, since I had everything over and up in a few hours. And that was going from a FreeBSD system to Debian Linux. Personally, though I fully respect FreeBSD and the ports collection, I like Debian and apt-get so much better. But who knows, maybe in another year, I'll be moving again for the hell of it. I don't have much to go on yet with respect to JVDS's reputation, but the management has been very responsive to requests so far. In fact, they're using RT for their support ticket management. Responsiveness has been one of the most impressive aspects of JohnCompanies, since between my personal server and the servers I use at work, it tends to take less than an hour to get resolution on any problems I've had. So far, JVDS hasn't disappointed me on that count, either. [ ... 640 words ... ]
-
Bookmark Blogger in Python
Remember my BookmarkBlogger? Well, I rewrote it in Python. For a little while, I was making little apps in Java, wishing it were a scripting language. I've stopped that now. Also, I've added the ability to include both link text and a non-linked comment in the bookmarks to be blogged. This new version is quite a bit simpler and all contained in one script -- configuration, template, and all. Download a tarball here from my CVS server. [ ... 78 words ... ]
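For flavor, here's a tiny hypothetical sketch of the single-script shape -- config, template, and rendering all in one file. This is illustrative only, not the actual BookmarkBlogger code; all names and sample bookmarks are made up:

```python
# Hypothetical single-script bookmark blogger sketch: configuration,
# template, and rendering live in one file, as described above.
from string import Template

# Config: each bookmark is (url, link text, non-linked comment).
BOOKMARKS = [
    ("http://www.decafbad.com", "0xDECAFBAD", "my weblog"),
    ("http://www.w3.org", "W3C", "standards, standards, standards"),
]

# Template: the non-linked comment trails the anchor.
ITEM_TMPL = Template('<li><a href="$url">$text</a> -- $comment</li>')

def render(bookmarks):
    """Render the bookmark list as an HTML fragment ready to blog."""
    items = "\n".join(
        ITEM_TMPL.substitute(url=u, text=t, comment=c)
        for (u, t, c) in bookmarks
    )
    return "<ul>\n%s\n</ul>" % items

if __name__ == "__main__":
    print(render(BOOKMARKS))
```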
-
2003 August 29
-
Again a Student
I walked in thinking "I can't believe I'm a student again. I'm a student again? Yee-bloody-ikes, how am I going to manage being a student again?" And I walked out with a spring in my step, thinking, "Hey! I'm a student again! W00t!" Source: Caveat Lector: Augusti 24, 2003 - Augusti 30, 2003 Archives I’m not entirely sure (though I have hunches) how to go about it, or to whom I should be talking, but this is what I want to be saying in the not-too-distant future. [ ... 101 words ... ]
-
CSS, Background Images, and Rollovers
It occurred to me that this ought to be possible by reassigning a container's background-image property when it is :hover-ed. Source: Images and thumbnails, a pure CSS hack (via dbagg: Items by Time) Yup, and you can do the same for every other pseudo-class of an anchor tag. I read about this via Eric Meyer’s article on the O’Reilly Network. I’m still very much a CSS neophyte, but it’s helped me incredibly at work, where I was able to create a site layout with one set of HTML pages styled by a small library of CSS files for look & feel. Yeah, yeah, that’s what it’s for, you say. But it surprised the hell out of me that I was able to abuse the background image properties of containers to create JavaScript-free rollovers, as well as select between completely different image-based layout elements. This isn’t pure utopian CSS that I’m doing, and most of my positioning is still done with tables, but thanks to blank pixel images atop CSS-controlled background images, I can do what I think are amazing things. Now I just have to break free of the rest of my HTML crutches, circa 1996. [ ... 260 words ... ]
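For the curious, the JavaScript-free rollover trick boils down to something like this (class names and image filenames are invented for illustration):

```css
/* A fixed-size block link; the container's background image does the
   visual work, so a blank pixel image in the HTML can hold the space. */
a.navbutton {
    display: block;
    width: 120px;
    height: 30px;
    background-image: url(button-up.gif);
}

/* Reassign the background when :hover-ed -- no JavaScript required.
   The same works for :active, :visited, and the other pseudo-classes. */
a.navbutton:hover {
    background-image: url(button-over.gif);
}
```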
-
2003 August 23
-
Scraping with web services: Success
Okay, so I took another shot at scraping HTML with web services with another site that passes the HTML Tidy step. Luckily, this is a site that I already scrape using my own tool, so I have XPath expressions already cooked up to dig out info for RSS items. So, here are the vitals:
Site: http://www.jlist.com
XSL: http://www.decafbad.com/jlist.xsl
Tidy URL: http://cgi.w3.org/cgi-bin/tidy?docAddr=http%3A%2F%2Fwww.jlist.com%2FUPDATES%2FPG%2F365%2F
Final URL: http://www.w3.org/2000/06/webdata/xslt?xslfile=http%3A%2F%2Fwww.decafbad.com%2Fjlist.xsl&xmlfile=http%3A%2F%2Fcgi.w3.org%2Fcgi-bin%2Ftidy%3FdocAddr%3Dhttp%253A%252F%252Fwww.jlist.com%252FUPDATES%252FPG%252F365%252F&transform=Submit
Unfortunately, although it looks okay to me, this feed doesn’t validate yet, but I’m still poking around with it to get things straight. Feel free to help me out! :) [ ... 95 words ... ]
-
Scraping HTML with web services
After checking out Bill Humphries’ approach to scraping yesterday, I recalled the various things Jon Udell has written about URL-as-command-line and the various places I’ve seen the W3C XSLT Servlet used in XSLT tinkering. I also remembered that there’s an HTML Tidy service offered by W3C as well. So… these are all URLs. I figured I could pull together the site URL, Bill’s XSLT, the tidy service, and the XSLT service, and have a whole lot of scraping going on right in my browser or via wget or curl. Here are the steps in how I composed the URL:
1. http://www.thenation.com
2. http://cgi.w3.org/cgi-bin/tidy?docAddr=http%3A%2F%2Fwww.thenation.com
3. http://www.w3.org/2000/06/webdata/xslt?xslfile=http%3A%2F%2Fwww.whump.com%2Fdropbox%2Fnationrss%2Fnation.xsl&xmlfile=http%3A%2F%2Fcgi.w3.org%2Fcgi-bin%2Ftidy%3FdocAddr%3Dhttp%253A%252F%252Fwww.thenation.com&transform=Submit
Unfortunately, this doesn’t work. In particular, step #2 fails, the Tidy service reporting a failure in processing the original HTML. I imagine, had that worked, the whole process at step #3 would be producing RSS. On my command line, HTML Tidy works fine, so I’ve been thinking of throwing together my own web interface to that program and seeing if that works. If it works, this, with the addition of a cache at each stage, could allow for what I think is a pretty nifty, all web-based means of scraping news items from web sites.
What would really be nice for apps like this is a better way to express the URLs-within-URLs without escaping and escaping and escaping and... I’m thinking some very lightweight scripting, or some Lisp-ish expressions, would help here.
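For what it’s worth, the nested escaping can at least be generated mechanically instead of by hand. Here’s a rough modern-Python sketch (urllib.parse here, though back in the day it would have been urllib.quote); the endpoints are the W3C services mentioned above, but the function names are made up:

```python
# Compose the nested service URLs programmatically, so each level of
# escaping happens mechanically. Two levels deep: the page URL gets
# escaped into the Tidy URL, and the whole Tidy URL gets escaped again
# into the XSLT service URL.
from urllib.parse import urlencode

TIDY_SERVICE = "http://cgi.w3.org/cgi-bin/tidy"
XSLT_SERVICE = "http://www.w3.org/2000/06/webdata/xslt"

def tidy_url(page_url):
    """Wrap a page URL in the HTML Tidy service (first level of escaping)."""
    return "%s?%s" % (TIDY_SERVICE, urlencode({"docAddr": page_url}))

def scrape_url(page_url, xsl_url):
    """Wrap the tidied page in the XSLT service (second level of escaping)."""
    params = urlencode({
        "xslfile": xsl_url,
        "xmlfile": tidy_url(page_url),
        "transform": "Submit",
    })
    return "%s?%s" % (XSLT_SERVICE, params)

print(scrape_url("http://www.thenation.com",
                 "http://www.whump.com/dropbox/nationrss/nation.xsl"))
```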
[ ... 322 words ... ] -
2003 August 22
-
Scraping HTML with curl, tidy, and XSL
Continuing with making it easier for "Big Pubs" to create RSS feeds. I'm assuming that they have a publishing system, but it wasn't built with RSS in mind, but they want on the bandwagon. Source: More Like This WebLog: Thursday, 21 August 2003 Using curl, tidy, and XSL to scrape content from HTML pages into an RSS feed. This is basically what I do now with a half-baked Java app using JTidy, XPath, and BeanShell. I keep meaning to release it, but it’s too embarrassing to share so far. Yet, it’s been working well enough to scrape what sites I’m interested in such that I haven’t been too motivated to tidy it up and tarball it. One thing I like better about Bill Humphries’ approach, though, is that it doesn’t use Java :) [ ... 195 words ... ]
-
Syndication feeds to replace email?
Let's face it, email has become unusable; the latest worm to strike is likely only the tip of the iceberg we're about to collide with. I've never liked the metaphor of an 'inbox', certainly not one that fills up and can't accurately be filtered. Source: Email is Dead, Long Live Email! I linked to D.J.Bernstein’s Internet Mail 2000 project a little while back, and I think what Adam Curry says here is along a similar path. Internet Mail 2000 starts off with the assumption, “Mail storage is the sender’s responsibility.” So, you want to send me an email? Post it on your server and tell me to come & get it. When I get the notification, I’ll then decide whether or not I want to bother. There are a lot of details to fill in here, such as secure posting and retrieval, trust and identity, and notification mechanisms. But, it certainly would seem to balance out the equation a bit. How to do it, though, so that things are still at least as simple to use as existing email, such as it is? [ ... 873 words ... ]
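As a toy sketch of that flow -- and every name and structure here is invented for illustration, since Internet Mail 2000 itself specifies none of this -- the sender-keeps-the-mail idea looks something like:

```python
# Toy sketch: the sender stores the message, sends only a small
# notification, and the recipient decides whether to fetch at all.
# Hypothetical names throughout; this is the concept, not a protocol.

SENDER_STORE = {}  # message-id -> body, kept on the sender's server

def post_message(msg_id, body):
    """Sender side: store the message, hand out a tiny notification."""
    SENDER_STORE[msg_id] = body
    return {"from": "sender@example.com", "fetch_id": msg_id}

def handle_notification(note, trusted_senders):
    """Recipient side: only bother fetching from trusted senders.
    Untrusted mail is never downloaded -- it stays on their server."""
    if note["from"] not in trusted_senders:
        return None
    return SENDER_STORE[note["fetch_id"]]

note = post_message("msg-1", "Hello from the future of email!")
print(handle_notification(note, trusted_senders={"sender@example.com"}))
```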
-
2003 August 19
-
Cookies are yummy
In case it had been an annoyance to anyone, I’ve finally gotten around to adding a “Remember my personal info” cookie to my comment forms. Let me know if it breaks. Otherwise, carry on! [ ... 35 words ... ]
-
2003 August 16
-
Issues in using SpamBayes to filter news items
Despite reading an entry by Srijith discussing Bayes-based classification as unsuitable for use in news aggregators, I tied SpamBayes into my homebrew news aggregator and have been trying it out this week. I know I’ve been talking about it for a while, but procrastination and being busy all round kept me from getting to it. Funny thing is, when I finally got a chance to really check things out, the integration was a snap. I’d anticipated a bit of work, but was pleasantly surprised. I doubt that any other aggregator written in Python would have a hard time with it. If, that is, anyone else wants to do it. I already knew it wasn’t magic pixie dust, but I figured it might be worth a try. I will be eating my dogfood for a while with this, but I’m thinking already that what’s good for spam might not be so good for news aggregators. Srijith’s post mentions some snags in ignoring some of the semantics of a news item, such as whether a word appears in the item’s title or information about the item’s source. I don’t think that this completely applies to how I’m doing classification, since SpamBayes appears to differentiate between words found in email headers and the body itself. When I feed an item to SpamBayes for training and scoring, I represent it as something like an email message, with headers like date, subject, from, and an “X-Link” header for the link. However, even with this, I think Srijith’s got a point when he writes that this method will miss a lot of available clues for classification. Unlike Srijith’s examples, though, I’m not trying to train my aggregator to sift entries into any specific categories. So far, I’ve been trying to get it to discriminate between what I really want to read, and what I’m not so interested in. So, I figured that something which can learn the difference between spam and normal email could help. But, although it’s early, I’m noticing a few things about the results and I’ve had a few things occur to me.
See, in the case of ham vs spam, I really want all the ham and none of the spam. A method to differentiate between these two should be optimized toward one answer or the other. SpamBayes offers “I don’t know” as a third answer, but it’s not geared toward anything else in-between. However, in measuring something like “interest”, in-between answers are useful. I want all of the interesting stuff, some of the sort-of interesting stuff, and a little of the rest. This is also a problem for me in deciding what I should give a thumbs up and what gets the thumbs down. Even though I’ve subscribed to a little over 300 feeds, every item from each of them is somewhat interesting to me. I wouldn’t have subscribed to the feed if there wasn’t anything of interest there, so I’ve already biased the content of what I receive. Some items are more interesting than others, but the difference between them is nowhere near the difference of wanted ham vs unsolicited spam. So, I find myself giving the nod to lots of items, but only turning down a few. SpamBayes would like equal examples of both, if possible. I’ll still be playing with this for a while, but I need to look around at other machine learning tech. I’m just hacking around, but the important thing is to try to understand the algorithms better and know how they work and why. Bayes is in vogue right now, but as Mark Pilgrim intimated, it’s not magic. It’s just “advanced” :) In the immortal words of Mark Jason Dominus: “You can’t just make shit up and expect the computer to know what you mean, retardo!” [ ... 731 words ... ]
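To make that concrete, here’s a rough modern-Python sketch of the email-like item encoding and the in-between scoring bands. The classifier itself is elided -- interest_band just assumes something hands back a 0.0-to-1.0 score, and the thresholds are made-up examples, not anything SpamBayes prescribes:

```python
# Sketch: represent an aggregator item as an email-like message (so
# header words and body words tokenize differently), then map a 0.0-1.0
# score into three interest tiers instead of a ham/spam coin flip.
from email.message import EmailMessage

def item_to_message(item):
    """Encode a feed item with date, subject, from, and X-Link headers."""
    msg = EmailMessage()
    msg["Date"] = item["date"]
    msg["Subject"] = item["title"]
    msg["From"] = item["feed"]
    msg["X-Link"] = item["link"]
    msg.set_content(item["description"])
    return msg

def interest_band(score, hot=0.8, cold=0.2):
    """Hypothetical thresholds splitting scores into three tiers,
    keeping the useful middle band that ham/spam filtering discards."""
    if score >= hot:
        return "interesting"
    elif score <= cold:
        return "not interested"
    return "sort-of interesting"

item = {
    "date": "Sat, 16 Aug 2003 12:00:00 GMT",
    "title": "Issues in using SpamBayes to filter news items",
    "feed": "0xDECAFBAD",
    "link": "http://www.decafbad.com/",
    "description": "Trying Bayesian classification on aggregator items.",
}
msg = item_to_message(item)
print(msg["Subject"], "->", interest_band(0.5))
```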
-
2003 August 14
-
Tree files too (the prequel)
16:06:23 [neo85] DO I NEED TO GO TO A SPECIFIC FOLDER TO LOAD THE HTMAL?
...
16:07:37 [Ash] neo85: you may need to clear out the old HTMAL files first with DELTREE C:\ /y
16:08:10 [Ash] Anyway, then type 'LOAD HTMAL'
16:09:11 [Ash] neo85: Did that work?
16:09:30 [neo85] I PUT IN /Y?
16:09:36 [Ash] Yes.
16:10:02 [neo85] THATS ALL?
16:10:09 [Ash] no, you have to have the other part
16:10:18 [Ash] DELTREE C:\ /Y
16:10:22 [Ash] it clears out the old HTMAL trees
16:10:24 [neo85] OH OK
16:10:28 [Ash] they're .TREE files
16:10:59 [neo85] IT SAYS DELETE SUHDLOG.DAT
16:11:37 [neo85] DETLOG.TXT?
16:11:47 [Ash] yeah, just delete all the trees
...
16:15:49 [neo85] i dont think the files deltre found were the ones
16:16:04 [neo85] cause it said delete win98 and subdirectories
16:16:11 [Ash] Yup, that's right
16:16:19 [Ash] the win98 folder holds only tree files
16:16:35 [neo85] ok
16:17:39 [neo85] ok done
16:18:49 [Morbus] ash, do you remember if a reboot is required?
16:18:58 [Morbus] i keep forgetting, and all my notes are on my other machine.
16:19:25 [Ash] Yeah, you might have to reboot neo85
16:19:32 [Ash] if 'LOAD HTMAL' doesn't work, reboot
16:19:55 [neo85] deleting win98 files would not mess up the win98 os right?
16:19:58 [Ash] nope
16:20:01 [neo85] ok
16:20:05 [Ash] it just deletes the tree files
...
16:26:43 [Morbus] neo, having any luck with the LOAD command?
16:45:09 [neo85] *** neo85 has quit (Read error: 110 (Connection timed out))
Source: IRC log of swhack on 2002-04-05
Heh, heh. [ ... 265 words ... ]
-
Tree files
23:58:35 [Ash] MorbusIff: Got any tree files?
23:58:39 [MorbusIff] heh
23:58:45 [MorbusIff] uh, tree files?
23:58:48 [MorbusIff] what are tree files?
...
23:59:39 [sbp] yes, you need to run DELTREE to get rid of them
Source: IRC log of swhack on 2002-04-23
Heh, heh. [ ... 44 words ... ]
-
2003 August 13
-
Wireless cams for police at Detroit Dream Cruise
Six remote-controlled surveillance cameras have been set up to transmit live video images of crowd and traffic conditions to handheld and laptop computers carried by cops. Source: freep.com: Police try spy cameras for better cruise control This has privacy advocates around here worried. I’m thinking it’s a tempest in a teacup, but reading a quote like this is a bit unfortunate: “We can zoom in tight enough to read someone’s watch,” said Jonathan Hollander, chief technology officer for GigaTrans, which designed the system for the use of the Oakland County Sheriff’s Department and local police departments along the route. It also doesn’t help that a Federal investigation into the Detroit Police found that they were “the most troubled force they have seen in 10 years of scrutinizing police nationwide“. But, as a futurist geek, what I really want to know, having read David Brin’s The Transparent Society, is when I get to look for traffic jams up ahead using my own wireless communicator. [ ... 165 words ... ]
-
2003 August 04
-
Final round of voting for pie/atom/(n)echo name?
Voting is open. OpenPoll Names were vetted until 31 July 2003 while putting out an all-blogs call to vote. Please Blog the Vote. Source: NameFinalVote - Atom Wiki Is this final? Gawd, I hope so. I’m stringing too many slash-inated names together these days. :) I voted for Feedcast, since it seems to be the least “clever” name yet identifies the concept. It could be used in corp-speak and geek-speak without too much wincing. And it’s not an acronym. All good things, in my short span of experience. [ ... 107 words ... ]
-
MiniPCs, Wave of the Future
And the next thing: at a very specific level, mini-ITX motherboards and cases are The Way To Go. Tiny, cheap, fanless PCs with trailing-edge processors -- only 1GHz -- are nevertheless a really amazingly cool idea, especially when you start thinking in terms of turning them into personal video recorders (running things like FreeVo) or in-car GPS navigation systems. Or Beowulf clusters. Source: Charlie's Diary (via Boing Boing) Although I’m currently on the low end of disposable income, I’m keeping my eye on tiny cases, motherboards, and just-slightly-slower-than-insanity CPUs for projects just such as these. I want a PVR, a few file servers, maybe a homebrew game console. I also wouldn’t mind buying a pile of OpenBricks for general living-in-the-future purposes around the house, and to experiment with clustering and networking. It would also be neat to learn some hardware hacking again and build some clever devices like this CD-changing robot. [ ... 152 words ... ]
-
2003 August 02
-
Software and Community
I joined the Apache project for the software. I stayed for the community. Likewise Perl. The software is interesting, but the people are more interesting. So now that I'm really not even writing much Perl, I'm still involved with the community, to some degree, because they are cool people. Source: DrBacchus' Journal: Software and community I’ve been working with Perl for just about 10 years now, and though I’ve been a bit of a stranger lately, I used to be a regular on #perl on several IRC networks. And, when companies I worked for paid for travel as freely as for paper clips, I made the rounds at a few conferences. I was lucky enough to meet a few other #perl regulars. I doubt most of them remember me since they’re a fairly close-knit group, and I’d only made the one appearance, despite constantly swearing I’d make it to a YAPC at some point. But I always thought it was cool as hell to actually have had a beer at the same table in Boston with authors of some of my favorite O’Reilly Perl books. But, I got busy, stopped hanging out in IRC so much, and also decided that I needed to expand my horizons and stop being so narrowly focused on one language. I got into Java, Python, shell scripting, and PHP. I started this weblog, and I tried to purposefully keep away from Perl. Of course, I can’t stay away, because Perl code comes out of my fingertips as naturally as breathing when a problem presents itself for solution. And then there’s community. I’ve yet to find a Java community as quirky and entertaining as that surrounding Perl. Thus, Java bores me. I use it, but it’s strictly business, mostly. When what you’re doing is strictly business, I guess that’s desirable. But when you eat, sleep, and breathe this stuff, having a group of people constantly doing clever things and being odd certainly makes it more rewarding. It’s predictability versus creativity. To get the job done, you want solid and dependable tools. To have fun, you want some challenge and unexpected results.
To me, Perl and its community offer both. I think Python might, also, but I’m not as familiar there. Java and other technologies are mostly business. Maybe this also crosses over into the difference between IT people and CS people, and whether you’re here to work or here to play and get paid. Hmm. [ ... 566 words ... ]