So, I’ve been spending more time working on my own feed aggregator. I’ve learned some pretty interesting things so far, as well as just how FUBAR RSS and Atom happen to be. You’d think this would all be somewhat simple, and in an ideal world, it probably would be. This is the real world, though, not an ideal one.
Some of the fun things I’ve encountered thus far are:
- The confused server that seems to think that, instead of giving a proper `304 Not Modified` response (with no body) when the content I’m requesting hasn’t changed, it should give a `200 OK` response, with no body. This kept throwing me as I was trying to debug, because I kept believing my logs that there should be new content there…
- The confused server that gives a `406 Not Acceptable` response to requests that don’t supply a `User-Agent:` header. Admittedly, were I less lazy, I’d be specifying one anyway, just to identify the new kid on the block. But as I haven’t officially settled on a name yet, I was leaving things at their default (and assuming that Ruby’s `Net::HTTP` library supplied some sort of default value). What really threw me was that `wget` would work, but `telnet` wouldn’t, even when I listed a huge laundry list of formats I was interested in in the `Accept:` header (including the type that `wget` said was coming back). After a little while, I decided to just pass `User-Agent: moo`, and lo and behold, content!
- The really confused server that would reply to all requests with `Last-Modified: Thu, 01 Jan 1970 00:00:00 GMT`, but would then completely bomb out with an application error if it got that value back in an `If-Modified-Since:` header.
- The feed that seems to randomly return something that isn’t RSS or Atom at all but, judging from the exception the content generates, looks like ordinary HTML. I say randomly, because since the last time it happened (and I finally got around to catching that exception to log it, deal with it, and carry on), it’s happened a grand total of once. In the meantime, the aggregator has checked the feed a few more times, though the content hasn’t updated (from before the time stored…), with no issues. I’m hoping something happens during the night, just so I know that I’m not the one going crazy.
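The workarounds above boil down to a couple of defensive habits: always send a `User-Agent:`, and treat an empty `200` body the same as a proper `304`. A minimal Ruby sketch of that — the method names are mine for illustration, not anything from actual code:

```ruby
require 'net/http'
require 'uri'

# Decide whether a response actually carries new content. An empty 200
# body gets treated the same as a 304, since at least one server
# conflates the two.
def feed_body(code, body)
  return nil if code == 304
  return nil if code == 200 && body.to_s.empty?
  body
end

# Headers to send with every request: always include a User-Agent,
# since some servers answer 406 without one.
def request_headers(last_modified = nil)
  headers = { 'User-Agent' => 'moo' }
  headers['If-Modified-Since'] = last_modified if last_modified
  headers
end

def fetch_feed(url, last_modified = nil)
  uri = URI.parse(url)
  response = Net::HTTP.start(uri.host, uri.port) do |http|
    http.get(uri.request_uri, request_headers(last_modified))
  end
  feed_body(response.code.to_i, response.body)
end
```

Returning `nil` for both the well-behaved `304` and the empty-body `200` means the caller never has to know which flavor of “nothing new” the server speaks.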
Feeds in general have also been interesting. I’ve discovered that there’s such a huge variation in the quality of the actual data that I’ve had to special-case a few things, and delve into the XML (or XML-like) content itself in order to pull out data that I’d hoped would percolate into the normal accessor methods of the classes I’m using. It’s not a big deal, but it’s one of those things that becomes a minor annoyance after a while. Fortunately, of the information I care about, it’s only dates and links to individual entries (or in at least one case, the complete and total lack thereof), and I think I have that in the bag.
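Delving into the XML for dates looks roughly like this — a sketch using REXML, where the method name and the list of element spellings (common RSS 2.0 / Dublin Core / Atom variants) are my own guesses, not the actual code:

```ruby
require 'rexml/document'
require 'time'

# Element names a date might hide under, across RSS, Dublin Core, and Atom.
DATE_ELEMENTS = %w[pubDate dc:date published updated modified issued]

# Fallback for feeds whose dates don't survive the normal parser:
# check an entry's raw XML for any of the usual date elements and
# return the first one that actually parses.
def entry_date(item_xml)
  doc = REXML::Document.new(item_xml)
  doc.root.elements.each do |el|
    next unless DATE_ELEMENTS.include?(el.expanded_name)
    next unless el.text
    begin
      return Time.parse(el.text)
    rescue ArgumentError
      next  # a "date" that won't parse is as good as no date
    end
  end
  nil
end
```

The rescue-and-continue matters: with data this varied, a malformed date in one element shouldn’t stop you from finding a valid one in another.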
Speaking of dates, one thing I’ve been noticing over the months of adding various and sundry feeds to the readers I’ve used in the past is that there are some pretty horrid implementations of Atom out there. I’ve noticed several feeds from various apps/frameworks that seem to think that publication, update, etc. dates should always reflect the current time of the request, and not the time that each item was actually published or underwent a major update. This makes trying to figure out anything temporally useful about feed items a non-starter. Such is life…
Now, lest folks think that all I’m doing is complaining, let me just say that I’m glad that I’ve been running into these problems. I never would have discovered some rather fragile assumptions that I made without real-world data. In order to do that, I whipped up a pretty brain-dead (but functional) OPML import feature, and an add-a-feed feature, so that I could import the OPML file that I’d managed to export from my other reader a short time ago. Importing that, and adding some feeds that I’ve been pasting into a file for a week or two, I’m able to see how everything fares trying to parse 294 highly varied feeds instead of a few minor test cases. So when I’m ready to release the first alpha version to the guinea pigs, it should at least do better than it might otherwise.
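A brain-dead OPML import really can be brain-dead: feed URLs live in `xmlUrl` attributes on `<outline>` elements, so it’s essentially one XPath walk. A sketch (the function name is mine) using REXML:

```ruby
require 'rexml/document'

# Collect every feed URL from an OPML document by walking all
# <outline> elements and keeping any xmlUrl attribute found.
# Folder outlines have no xmlUrl and fall through harmlessly.
def feed_urls_from_opml(opml)
  doc = REXML::Document.new(opml)
  urls = []
  doc.elements.each('//outline') do |outline|
    url = outline.attributes['xmlUrl']
    urls << url if url
  end
  urls.uniq
end
```

Ignoring outlines without an `xmlUrl` handles the nested folder structure most readers export without needing to care about the hierarchy at all.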
So, yeah, I’ve made some decent progress on a lot of the back-end code and infrastructure. There’s just enough of a UI in place so that I can at least verify that things are working. It’s butt-ugly, and almost completely useless, but meets my current needs. Eventually I’ll get around to making it functional for me, maybe even functional for you. 🙂 There’s logging in place so that I can at least figure out where something goes wrong. Hell, it even saves data (without corruption despite multiple crashes). You just can’t, well, read or navigate or do anything else at the moment.