This is supposed to be simple, right?

So, I’ve been spending more time working on my own feed aggregator. I’ve learned some pretty interesting things so far, including just how FUBAR RSS and Atom happen to be. You’d think that this would all be somewhat simple, and in an ideal world, it probably would be. This is the real world though, not an ideal one.

Some of the fun fun things I’ve encountered thus far are:

The confused server that seems to think that, instead of giving a proper 304 Not Modified response (with no body) when the content I’m requesting hasn’t changed, it should instead give a 200 OK response, with no body. This kept throwing me as I was trying to debug, as I kept believing my logs that there should be new content there…
The confused server that gives a 406 Not Acceptable response to requests that don’t supply a User-Agent: header. Admittedly, were I less lazy, I’d be specifying one anyway, just to identify the new kid on the block. But as I haven’t even officially settled on a name yet, I was leaving things at their default (and made the assumption that Ruby’s Net::HTTP library had some sort of default value). What really threw me was that wget would work, but telnet wouldn’t, even when I was sending a huge laundry list of acceptable formats in the Accept: header (including the type that wget said was coming back). After a little while, I decided to just pass a User-Agent: moo, and lo and behold, content!
The really confused server that would reply to all requests with a Last-Modified: Thu, 01 Jan 1970 00:00:00 GMT, but would then completely bomb out with an application error if it got that value back in an If-Modified-Since: request header.
The feed that seems to randomly return something that isn’t RSS or Atom at all, but from the exception the content generates, looks like it could be normal HTML. I say randomly, because after the last time it happened (and I finally got around to catching that exception to log it, deal with it, and carry on), it’s happened a grand total of once. Since then, the aggregator has checked the feed a few more times with no issues, though the content hasn’t updated (from before the time stored…). I’m hoping something happens during the night, just so I know that I’m not the one going crazy.
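
For the curious, here’s roughly how the first three quirks could be worked around with Net::HTTP. This is a minimal sketch rather than the aggregator’s actual code: the helper names (conditional_headers, classify, fetch) and the USER_AGENT string are made up for illustration, and treating an empty 200 as “unchanged” is just one reasonable reading of what that server is trying to say.

```ruby
require 'net/http'
require 'uri'
require 'time'

# Placeholder agent string (an assumption; the project has no name yet).
USER_AGENT = 'example-aggregator/0.0'
EPOCH = Time.at(0).utc

# Build headers for a conditional GET. last_modified is the raw
# Last-Modified string saved from the previous fetch, or nil.
def conditional_headers(last_modified)
  # Always send a User-Agent, since some servers answer 406 without one.
  headers = { 'User-Agent' => USER_AGENT }
  return headers unless last_modified

  stamp = begin
    Time.httpdate(last_modified)
  rescue ArgumentError
    nil # unparseable date: don't echo it back
  end
  # Never send the epoch back; at least one server falls over on it.
  headers['If-Modified-Since'] = last_modified if stamp && stamp > EPOCH
  headers
end

# Turn a status code and body into :unchanged, :updated, or :error.
def classify(code, body)
  case code.to_i
  when 304
    :unchanged
  when 200
    # A 200 with no body is treated like the 304 it should have been.
    body.nil? || body.strip.empty? ? :unchanged : :updated
  else
    :error
  end
end

# Fetch a feed URL and return [status_symbol, response].
def fetch(url, last_modified = nil)
  uri = URI.parse(url)
  request = Net::HTTP::Get.new(uri.request_uri, conditional_headers(last_modified))
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(request)
  end
  [classify(response.code, response.body), response]
end
```

Keeping the header building and the response classification separate from the network call means both can be poked at with canned values, which is handy when the misbehaving server only misbehaves some of the time.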

Feeds in general have also been interesting. I’ve discovered that there’s such a huge variation in the quality of the actual data that I’ve had to special-case a few things, and delve into the XML (or XML-like) content itself in order to pull out data that I’d hoped would percolate into the normal accessor methods of the classes I’m using. It’s not a big deal, but it’s one of those things that becomes a minor annoyance after a while. Fortunately, of the information I care about, it’s only dates and links to individual entries (or in at least one case, the complete and total lack thereof), and I think I have that in the bag.
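
To give a flavor of the special-casing, here’s a sketch of digging the date and link out of an entry by hand with REXML. It’s an illustration under assumptions, not the code I’m actually running: the element names in DATE_NAMES are just the common ones I’d expect across RSS, Dublin Core, and Atom, and entry_date and entry_link are hypothetical helpers.

```ruby
require 'rexml/document'
require 'time'

# Local element names that commonly carry an entry's date (assumed list).
DATE_NAMES = %w[pubDate date updated published modified issued].freeze

# Return the first parseable date among an entry's child elements, or nil.
# Children are checked in document order; name ignores any namespace prefix.
def entry_date(entry)
  entry.elements.each do |child|
    next unless DATE_NAMES.include?(child.name)

    text = child.text.to_s.strip
    parsed = begin
      Time.rfc2822(text)      # RSS-style dates
    rescue ArgumentError
      begin
        Time.iso8601(text)    # Atom / Dublin Core style dates
      rescue ArgumentError
        nil
      end
    end
    return parsed if parsed
  end
  nil
end

# Return a link for the entry, or nil when there is none at all.
# Simplification: takes the first <link>, ignoring Atom's rel attribute.
def entry_link(entry)
  entry.elements.each do |child|
    next unless child.name == 'link'

    href = child.attributes['href'] # Atom keeps the URL in an attribute
    return href if href && !href.empty?

    text = child.text.to_s.strip    # RSS keeps it as element text
    return text unless text.empty?
  end
  # Last resort: some RSS items only carry a <guid>.
  guid = entry.elements.to_a.find { |child| child.name == 'guid' }
  text = guid ? guid.text.to_s.strip : ''
  text.empty? ? nil : text
end
```

Both helpers return nil rather than raising when nothing usable is there, which covers the feed with the complete and total lack of links.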

Speaking of dates, one thing I’ve been noticing over the months of adding various and sundry feeds to the readers I’ve used in the past is that there are some pretty horrid implementations of Atom out there. I’ve noticed several feeds that various apps/frameworks provide that seem to think that publication, update, etc. dates should always reflect the current time when you make the request, and not the time that each item was actually published or underwent a major update. This makes trying to figure out anything temporally useful about feed items a non-starter. Such is life…
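
One way to at least spot these feeds is a crude heuristic: if every entry claims to have been published within a couple of minutes of the moment you asked, the dates are probably being generated on the fly. The sketch below assumes exactly that; the function names and the 120-second tolerance are made up, and a genuinely busy feed could trip it.

```ruby
# True when every entry date sits within tolerance seconds of the fetch time.
# Needs at least two entries, since a single fresh post is perfectly normal.
def dates_track_fetch_time?(entry_dates, fetched_at, tolerance = 120)
  return false if entry_dates.size < 2

  entry_dates.all? { |date| (fetched_at - date).abs <= tolerance }
end

# Pick the date to store for an entry: the feed's own date when it looks
# trustworthy, otherwise the time the entry was first seen.
def effective_date(entry_date, first_seen, suspicious)
  suspicious || entry_date.nil? ? first_seen : entry_date
end
```

Falling back to the first-seen time doesn’t recover the real publication date, but it does keep entries from jumping to the top of the list on every fetch.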

Now, lest folks think that all I’m doing is complaining, let me just say that I’m glad that I’ve been running into these problems. I never would have discovered some rather fragile assumptions that I made without real-world data. In order to do that, I whipped up a pretty brain-dead (but functional) OPML import feature, and an add-a-feed feature, so that I could import the OPML file that I’d managed to export from my other reader a short time ago. Importing that, and adding some feeds that I’ve been pasting into a file for a week or two, I’m able to see how everything fares trying to parse 294 highly varied feeds instead of a few minor test cases. So when I’m ready to release the first alpha version to the guinea pigs, it should at least do better than it might otherwise.
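
The brain-dead OPML import really doesn’t need to be much more than the sketch below, assuming the export follows the usual convention of putting each feed’s address in an xmlUrl attribute on an outline element. The function name is hypothetical, and it ignores titles, folders, and everything else in the file.

```ruby
require 'rexml/document'

# Return the unique feed URLs found in an OPML document, in document order.
def feed_urls_from_opml(opml_text)
  doc = REXML::Document.new(opml_text)
  urls = []
  # Outlines can be nested inside folder outlines, so search at any depth.
  REXML::XPath.each(doc, '//outline') do |outline|
    url = outline.attributes['xmlUrl']
    next if url.nil? || url.strip.empty? # folders have no xmlUrl

    urls << url.strip
  end
  urls.uniq
end
```

Something like feed_urls_from_opml(File.read('export.opml')) then hands back a plain list that the add-a-feed code can chew through one URL at a time.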

So, yeah, I’ve made some decent progress on a lot of the back-end code and infrastructure. There’s just enough of a UI in place so that I can at least verify that things are working. It’s butt-ugly, and almost completely useless, but meets my current needs. Eventually I’ll get around to making it functional for me, maybe even functional for you. 🙂 There’s logging in place so that I can at least figure out where something goes wrong. Hell, it even saves data (without corruption despite multiple crashes). You just can’t, well, read or navigate or do anything else at the moment.

2 Comments

Warren Henning

2007/03/08 at 02:03

I don’t know how much it matters but I have started and abandoned aggregator projects three or four times because of this. So, I feel your pain.

Anyone who thinks the Semantic Web has a chance of succeeding needs to take a realistic look at how badly people implement even simple data formats in the real world. RSS should be a simple thing. It isn’t. Now throw some weirdo first-order logic and design-y RDF knowledge representation yadda-yadda stuff I don’t understand where your inference capabilities are only as strong as the weakest link in the data sources you’re integrating, and you’re setting yourself up for failure, as I see it.

People, even smart people, often cannot produce quality data or quality metadata. Their MP3 collections are poorly tagged, they don’t back up their precious files, they don’t even have complete albums for their digital music. That is the simple truth.

emag

2007/03/09 at 15:48

Warren:

After seeing the shenanigans that go on with feeds in general, I can fully understand why you’d have abandoned your own. What I don’t understand is why you seem to be such a masochist as to have done it repeatedly. 🙂

The more I look at all the minor variations and the organic growths that have attached themselves to RSS, the more I think “that can wait until someone wants it”. There are a LOT of interesting ideas that people have implemented to try to ease their own burdens, but so much seems to be poorly documented, invalidly specified, or both. I think I’ve currently got a happy medium (or at least a lowest common denominator) to pull and store basic info, but anything fancy will definitely need to wait for later.

Looking at my own MP3 and OGG collection, I have to admit that I’m guilty of poor tagging myself. AND I’m really bad about personal backups. Man, you’ve pegged me, along with most of the rest of the world.

Mike's Place

Inarticulate ramblings on whatever strikes my fancy
