Remember that I wrote a little while ago about wanting to publish some articles here that I'd want to read? Well, I've been hard at work since then to turn out the first set and I think I've finally got something for you. I mentioned earlier this week that I was taking this seriously, so I hope it shows. So, with many thanks to my girlfriend's kind editorial help, and with some measure of anxiety, here goes...
For some time now, my girlfriend and I have been accumulating things we want in wishlists on Amazon.com. Here's mine and here's hers - if you visit them, you can see we've both got quite a few things listed. Though they have come in handy with relatives at Christmas and on birthdays, neither of us really expects to see a regular flow of gifts from them. For the most part, they've just become holding tanks for things we intend to buy for each other or ourselves.
However, I tend to forget we have these lists except for the occasional visit to Amazon when I think, "Oh yeah, wishlists. I should pick up a thing or two, there's some good stuff piled up in them." On one particular visit, though, the notion of a Wish-of-the-Month club popped into my head: We could afford to grab at least one item for each of us from our wishlists on a monthly basis, provided that we remembered to place an order. It'd be better than signing up for a book or music club, driven by someone else's idea of what we wanted. Unfortunately, there's that problem for busy, absentminded people like us: remembering to place an order.
But wait, isn't this the sort of thing computers are for? I should be able to cobble something together that would peruse our wishlists and--given some criteria like a price maximum--select an item at random for each of us and send them on their way. With this, I could schedule a monthly run and start whittling down those lists.
Before I start working through the project itself, let's establish some assumptions and then gather some tools and materials:
I'm going to assume that you're using a UN*X operating system (i.e. Linux, Mac OS X, etc.) and that you're reasonably familiar with getting around in a shell and editing files. Things presented here could be adapted for Windows fairly easily, but I'll leave that as an exercise for the reader. Also, you may need to build and install a package or two, so know-how in that regard will serve as well. And finally: some familiarity with XML and XSLT would be useful, but you won't need to be a guru with either.
Oh, and all the files I'll be introducing in this project can be downloaded from my website as a tarball:
wishes.tar.gz. If you feel like browsing, you can see these files in my CVS repository. And if you feel like checking out a copy via anonymous CVS, the username is
anoncvs and the password is blank--email me for help, if you need it.
So, how do we get a look at these wishlists? Lately, I've been tinkering a bit with scraping information from and automating access to websites. It's a bit like a puzzle game, with all the accompanying frustrations and happy breakthroughs. However, where most puzzle games are designed with a solution in mind, this game isn't necessarily even meant to be played, depending on the intentions of website owners.
Fortunately, the folks at Amazon.com have made things very friendly to tinkerers by providing an API, called Amazon Web Services (or AWS). You'll want to download the AWS developer's kit, which contains a wealth of documentation and examples. After downloading these materials, you should apply for a developer's token for use with the service. AWS provides both SOAP and REST interfaces to functionality and data at their site; personally, I prefer the HTTP-and-XML approach taken by the REST interface, so that's what we'll be using here.
To handle the XML produced by AWS, we'll be using the
xsltproc command from the XML C parser and toolkit of Gnome. There are other XSLT processors--such as Xalan, Sablotron, and Saxon--but I've found libxslt easiest to feed and care for on the various platforms with which I tinker. It also seems to support a very large swath of EXSLT extensions, all of which come in very handy, yet seem to receive uneven support in other XSLT processors. We'll be pulling a trick or two out of that bag, so its support is key.
You may or may not already have libxslt installed. Depending on your variant of Linux, it might be as simple as a single package-management command or it might be a bit more complex if you need to compile from source. For Mac OS X, I recommend using Fink for your packaging needs, although DarwinPorts is nice as well if you're used to The BSD Way.
A bonus for OS X users: Marc Liyanage has provided a great Open Source tool named TestXSLT that embeds libxslt, among other XSLT processors, in a slick GUI for easier use. This might come in handy for you as things develop.
Wishlists in XML
Okay, we've got a working environment, a head start on accessing Amazon wishlists as XML, and a way to manipulate that XML using
xsltproc. Let's start playing. First things first, we need to gain access to Amazon wishlists in XML form. Reading through the AWS documentation reveals that wish list searches are available via a URL constructed like so:
http://xml.amazon.com/onca/xml3? t=[Associates ID goes here]& dev-t=[Developer Token goes here]& WishlistSearch=[wishlist ID goes here]& type=[lite or heavy]& f=xml
I received an ID of
0xdecafbad-20 when I signed up to be an associate a few years ago. This will ensure that I get credited for sales made via the API--which isn't as important for the present project, since I'll be buying items myself, but it'll come in handy in later projects. Also, when I signed up for a developer's token, this is what I was given:
D8HVH869XA0NP. I'm disclosing my own here for the sake of example, but you should sign up and get your own.
So, that fills in the first two parts of the URL. For the purposes of this project, let's just go with the
lite option for type. As for the wishlist ID, let's take a look at the wishlist URLs to which I linked earlier:
You can discover these wishlist URLs using Amazon's Wish List Search feature, in which case a wishlist URL might appear like so:
http://www.amazon.com/gp/registry/registry.html/ 002-7899886-3676027?%5Fencoding=UTF8& id=35OIOYWQ9XQAE
In either case, there is a 13-character ID in each variety of wish list URL: this string is the wish list ID. So, the ID for my girlfriend's wishlist is
35OIOYWQ9XQAE and mine is
1QWYI6P2JF3Q5. Given this piece of the puzzle, we can fill in the blanks to come up with the following URL for my girlfriend's wish list:
http://xml.amazon.com/onca/xml3? t=0xdecafbad-20& dev-t=D8HVH869XA0NP& type=lite& WishlistSearch=35OIOYWQ9XQAE& f=xml
Check out the XML resulting from this URL--you may want to use a tool such as
wget instead of viewing this directly in your browser. You'll see some XML that looks something like this:
<ProductInfo>
  ...
  <Details url="(some long URL)">
    <Asin>0262133601</Asin>
    <ProductName>Foundations of Statistical Natural Language Processing</ProductName>
    <Catalog>Book</Catalog>
    <Authors>
      <Author>Christopher D. Manning</Author>
      <Author>Hinrich Schütze</Author>
    </Authors>
    <ReleaseDate>18 June, 1999</ReleaseDate>
    <Manufacturer>MIT Press</Manufacturer>
    <ImageUrlSmall>(another long url)</ImageUrlSmall>
    <ImageUrlMedium>(yet another long url)</ImageUrlMedium>
    <ImageUrlLarge>(one last long url)</ImageUrlLarge>
    <Availability>Usually ships within 24 hours</Availability>
    <ListPrice>$75.00</ListPrice>
    <OurPrice>$63.75</OurPrice>
    <UsedPrice>$49.99</UsedPrice>
  </Details>
  ...
</ProductInfo>
Note that the long URL in the
url attribute links to the human-viewable product detail page at Amazon. I've also left a few other things out, such as the URLs to product images; I just thought I'd edit it a bit to be friendlier to your browser at home. There's a schema for this XML data, and the ins-and-outs are explained in the AWS documentation under "Amazon Web Services Data Model".
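If you'd rather script this than paste URLs together by hand, here's a minimal shell sketch of the URL assembly described above. The function names wishlist_id and wishlist_url are made up for illustration (they aren't part of the AWS kit or my tarball), and the sketch assumes nothing more than a POSIX shell with sed.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; the names are not part of AWS.

# Pull the 13-character wishlist ID out of an "id=" query parameter.
wishlist_id() {
    printf '%s\n' "$1" | sed -n 's/.*[?&]id=\([0-9A-Z]\{13\}\).*/\1/p'
}

# Assemble the AWS wishlist search URL from its parts.
# Arguments: associate ID, developer token, wishlist ID.
wishlist_url() {
    printf 'http://xml.amazon.com/onca/xml3?t=%s&dev-t=%s&WishlistSearch=%s&type=lite&f=xml\n' "$1" "$2" "$3"
}

# Extract the ID from a Wish List Search style URL, then build the query URL.
id=$(wishlist_id 'http://www.amazon.com/gp/registry/registry.html/002-7899886-3676027?%5Fencoding=UTF8&id=35OIOYWQ9XQAE')
wishlist_url 0xdecafbad-20 D8HVH869XA0NP "$id"
```

The printed URL can be handed to wget with its -O option to save the response to a file. Keep the URL in quotes when you do, since the shell would otherwise treat each & as a command separator.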
Querying The Wishes
Some ready-made files are available for this section:
wishes-ex1.xsl: The first iteration of the stylesheet in development.
wishes.xml: An XML document used as input with the stylesheet.
Now that we've got some XML from Amazon to play with, let's start tinkering with an XSLT stylesheet to process it. In the interests of flexibility and reusability, we can parameterize a few things in XML before starting in on the stylesheet:
<wishes xmlns="http://www.decafbad.com/2004/05/wishes">
  <maxprice>15.00</maxprice>
  <associate>0xdecafbad-20</associate>
  <devtoken>D8HVH869XA0NP</devtoken>
  <email>email@example.com</email>
  <wishlists>
    <wishlist label="The Girl">35OIOYWQ9XQAE</wishlist>
    <wishlist label="Me">1QWYI6P2JF3Q5</wishlist>
  </wishlists>
</wishes>
Hopefully, the data here is fairly self-explanatory: I've established a maximum price for item selection; provided my associate ID and developer token; there's an email address to which I eventually want to send the results of all this work; and I've made a list of wishlist IDs, each with a readable label. Given this, let's start out simple and use this to get some data from Amazon:
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:wishes="http://www.decafbad.com/2004/05/wishes"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output indent="yes" />

  <!-- Grab our global settings -->
  <xsl:variable name="maxprice" select="/wishes:wishes/wishes:maxprice" />
  <xsl:variable name="associate" select="/wishes:wishes/wishes:associate" />
  <xsl:variable name="devtoken" select="/wishes:wishes/wishes:devtoken" />
So far so good--things start off by pulling in some of the parameters into variables. Next, let's dig into actually querying wishlist data with a reusable template:
  <xsl:template name="processWishlist">
    <xsl:param name="wishlist" />

    <xsl:variable name="details" select="document(concat(
        'http://xml.amazon.com/onca/xml3?',
        't=',$associate,'&amp;',
        'dev-t=',$devtoken,'&amp;',
        'WishlistSearch=',$wishlist,'&amp;',
        'type=lite&amp;f=xml'))//Details" />
First thing into this template, we accept a parameter named
wishlist which is expected to contain a wishlist ID string. Next, we build an AWS URL by concatenating together the pieces we have in variables (associate ID, developer's token, and wishlist ID) using the XPath function
concat(). Once we have this URL, we use the function
document() to make a request and fetch the XML data for that URL. From this, we select all the Details elements.
Then with that data, we can do some filtering on the price and availability. We want to make sure that not only will we select items that are within our budget, but that they are available to buy in the first place:
    <xsl:copy-of select="$details[
        number(substring(OurPrice/text(),2)) &lt; $maxprice
        and contains(Availability, 'Usually ships within')
        ]" />
  </xsl:template>
This code is just a little bit funky, since the price data given by Amazon contains a dollar sign, and we want to make a numerical comparison. So, we chop the dollar sign off and convert to a number before making the comparison. Also, there's an assumption here about what will show up in the
Availability element: "Usually ships within". Other things that might show up will declare that the item is out of stock, discontinued, or otherwise not shipping. This might need some tweaking someday, but it seems to work for now.
Taken all together, this template has the effect of a SQL SELECT statement somewhat like this:
SELECT * FROM Amazon.WishlistItems WHERE WishlistID = $wishlist AND OurPrice < $maxprice AND Availability like '%Usually ships within%';
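If the XPath predicate is hard to read at a glance, the same price-and-availability test can be sketched in plain shell. This is only an illustration under my own assumptions: is_wish_candidate is a hypothetical name that doesn't appear in the stylesheet, and the comparison is handed to awk because the shell only does integer arithmetic.

```shell
#!/bin/sh
# Hypothetical helper mirroring the XSLT predicate; not part of the stylesheet.
# Arguments: OurPrice text (e.g. '$63.75'), Availability text, maximum price.
# Exits 0 when the item is under budget and looks available.
is_wish_candidate() {
    case "$2" in
        *'Usually ships within'*) ;;   # same substring test as contains()
        *) return 1 ;;
    esac
    # Drop the first character (the dollar sign), like substring(..., 2),
    # then compare numerically in awk.
    awk -v price="${1#?}" -v max="$3" 'BEGIN { exit !((price + 0) < (max + 0)) }'
}

is_wish_candidate '$63.75' 'Usually ships within 24 hours' 15.00 || echo "skip"
is_wish_candidate '$9.99' 'Usually ships within 24 hours' 15.00 && echo "keep"
```

One difference worth knowing about: number() in XPath turns unparsable text into NaN, which fails the comparison, whereas awk quietly treats it as 0. A real script would want to validate the price before trusting this check.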
document() is a very useful XSLT function. It allows us to pull in XML from external files and, in our case, from external URLs via HTTP requests. This gives us the ability to make queries against REST web services like AWS--which, among many other reasons, is why I prefer REST web services over SOAP. (I don't even want to think about trying to access a SOAP service from XSLT.)
Now, let's wrap up this first iteration of the stylesheet by trying out the query template on each of the wishlist IDs:
  <xsl:template match="/wishes:wishes">
    <xsl:for-each select="//wishes:wishlist">
      <wishes:wishitem>
        <xsl:copy-of select="." />
        <xsl:call-template name="processWishlist">
          <xsl:with-param name="wishlist" select="." />
        </xsl:call-template>
      </wishes:wishitem>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
You can get a completed version of this stylesheet, along with the input XML, in case you haven't been cutting and pasting together a copy of your own along the way. Try it out in a shell with:
$ xsltproc wishes-ex1.xsl wishes.xml
Alternately, you could check it out using TestXSLT under OS X. You should get something like the following:
<wishes:wishitem xmlns:wishes="http://www.decafbad.com/2004/05/wishes">
  <wishes:wishlist label="The Girl">35OIOYWQ9XQAE</wishes:wishlist>
  <Details ...>...</Details>
  <Details ...>...</Details>
  ...
</wishes:wishitem>
<wishes:wishitem xmlns:wishes="http://www.decafbad.com/2004/05/wishes">
  <wishes:wishlist label="Me">1QWYI6P2JF3Q5</wishes:wishlist>
  <Details ...>...</Details>
  <Details ...>...</Details>
  ...
</wishes:wishitem>
Obviously, this example XML is much abridged, but hopefully you can get the gist: For each wishlist ID, there is a containing
wishitem element. It contains a copy of the
wishlist element from the input XML, followed by all the
Details elements filtered and copied from the Amazon XML with the help of the processWishlist template.
That's All for Now!
And that's the end of Part 1. Next up, we'll be delving into a few more wrinkles in the wishlist querying process, selecting random items in XSLT, and the Remote Shopping Cart interface in Amazon Web Services. Stay tuned!