If you're considering writing a tool which reads web feeds, you probably think you can focus on the user experience and rely on an off-the-shelf XML parser to do the grunt work of parsing RSS and Atom. Unfortunately, I'm about to piddle on that idea.
When I created FeedDemon back in 2003, I quickly discovered that many feeds - including some very popular ones - weren't well-formed XML. This meant that they couldn't be read by a conforming XML parser such as MSXML, which is what I had planned to use. I figured that left me with two choices:
1. Don't support feeds that aren't well-formed
2. Write my own XML parser that can handle broken feeds
Given the sheer number of invalid feeds I ran into, I decided that #1 wasn't an option (after all, customers would complain if FeedDemon failed to handle feeds that existing aggregators had no trouble with). So I ended up writing my own XML parser, which was only slightly more enjoyable than a sharp poke in the eye. And as it turned out, my hand-coded XML parser was FeedDemon 1.0's Achilles' heel. I received countless bug reports that were caused by oversights in my parser, and countless more related to how I chose to interpret invalid feeds.
So when I started working on FeedDemon 2.0, I was determined to improve this situation. I kept track of the most common well-formedness errors, and figured out how to fix them on the fly. I could then parse the fixed feed with MSXML, which was more reliable (albeit slightly slower) than my crappy little XML parser. This approach seems to have paid off, since the number of parser-related bug reports has gone way down since the release of FeedDemon 2.0 (although I should add that this may be due partly to the fact that feeds in general are less funky than they used to be).
If you're just starting down this road, hopefully I can save you some time. While there are all sorts of things that can invalidate an XML document, there are only a handful of problems that I've seen with any regularity:
- Using & instead of &amp;
- Using < instead of &lt;
- Using > instead of &gt;
- Using a character such as a form feed (U+000C) that's not permitted in XML
- Forgetting the semicolon after an entity (ex: &lt instead of &lt;)
- Using an HTML entity such as &nbsp; that's not supported by XML
- Using an undefined namespace
- Whitespace before the XML prolog (MSXML doesn't like this!)
If you're building a tool that needs to handle feeds that aren't well-formed, you may still be able to use a third-party parser if you correct these common mistakes. Although fixing well-formedness errors isn't a task for a novice programmer, it's still simpler than coding an XML parser, and overall this approach has worked well for me.
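To make the idea concrete, here is a rough sketch of what such a repair pass might look like. It is not FeedDemon's code: it is written in Python rather than Delphi, Python's ElementTree stands in for MSXML, and the names `repair_feed`, `parse_feed` and the `urn:undeclared:` placeholder namespace are made up for illustration. Trying a strict parse first and repairing only on failure is also an assumption of the sketch. A bare > is left alone, since XML allows it in text, and CDATA sections are skipped so their contents aren't altered.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

XML_ENTITIES = {"amp", "lt", "gt", "quot", "apos"}

# Control characters XML 1.0 forbids (below U+0020, except tab, LF, CR).
ILLEGAL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")
# A predefined entity with its semicolon missing, e.g. "&lt " or "&amp,".
MISSING_SEMICOLON = re.compile(r"&(amp|lt|gt|quot|apos)(?![A-Za-z0-9;])")
# Any named entity reference, e.g. "&nbsp;" or "&amp;".
NAMED_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")
# An ampersand that does not start a complete entity or character reference.
BARE_AMPERSAND = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)")
# A "<" that cannot start a tag, comment, CDATA section or processing instruction.
BARE_LESS_THAN = re.compile(r"<(?![A-Za-z_:/!?])")
CDATA_SECTION = re.compile(r"(<!\[CDATA\[.*?\]\]>)", re.DOTALL)


def _fix_named_entity(match):
    name = match.group(1)
    if name in XML_ENTITIES:
        return match.group(0)
    if name in name2codepoint:
        return "&#%d;" % name2codepoint[name]  # e.g. &nbsp; becomes &#160;
    return "&amp;" + name + ";"  # unknown name: keep it as literal text


def _repair_markup(chunk):
    chunk = MISSING_SEMICOLON.sub(r"&\1;", chunk)
    chunk = NAMED_ENTITY.sub(_fix_named_entity, chunk)
    chunk = BARE_AMPERSAND.sub("&amp;", chunk)
    return BARE_LESS_THAN.sub("&lt;", chunk)


def _declare_missing_prefixes(text):
    # Heuristic: only element prefixes are checked, not attribute prefixes.
    used = set(re.findall(r"</?([A-Za-z_][\w.-]*):[A-Za-z_]", text))
    declared = set(re.findall(r"xmlns:([A-Za-z_][\w.-]*)\s*=", text))
    missing = sorted(used - declared - {"xml"})
    if not missing:
        return text
    # Placeholder URIs (hypothetical): enough to make the document parse.
    decls = "".join(' xmlns:%s="urn:undeclared:%s"' % (p, p) for p in missing)
    # The first "<" followed by a name character opens the root element.
    return re.sub(r"<[A-Za-z_][\w:.-]*", lambda m: m.group(0) + decls, text, count=1)


def repair_feed(text):
    """Apply the common well-formedness fixes and return the repaired text."""
    # Whitespace (or a byte-order mark) before the XML declaration.
    text = text.lstrip("\ufeff \t\r\n")
    text = ILLEGAL_CHARS.sub("", text)
    # Odd-numbered parts are CDATA sections, which are passed through as-is.
    parts = CDATA_SECTION.split(text)
    text = "".join(p if i % 2 else _repair_markup(p) for i, p in enumerate(parts))
    return _declare_missing_prefixes(text)


def parse_feed(text):
    """Try a strict parse first; repair and retry only if that fails."""
    try:
        return ET.fromstring(text)
    except ET.ParseError:
        return ET.fromstring(repair_feed(text))
```

Calling `parse_feed(feed_text)` on an already decoded string returns the root element, or raises `ET.ParseError` if the feed is still broken after the repair pass. Encoding problems are deliberately out of scope here and would need to be handled before this step.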
Excellent post Nick. I was a little thrown off by the title because I associate "funky" with "funky RSS" -- you know, the practice of using non-standard extension elements (Dublin Core, content:encoded, etc.) in place of standard RSS elements. Perhaps a better title would be "Fixing broken feeds."
Posted by: Dave Johnson | Thursday, September 21, 2006 at 04:08 PM
Ah, I'd forgotten about the whole "funky RSS" thing. I've changed the title.
Posted by: Nick Bradbury | Thursday, September 21, 2006 at 04:36 PM
My experience more closely matches Google's: http://googlereader.blogspot.com/2005/12/xml-errors-in-feeds.html
In particular, I would say that the most common XML level error I see is encoding errors - things like smart quotes.
Posted by: Sam Ruby | Thursday, September 21, 2006 at 06:22 PM
Sam, thanks for posting that link - that's a useful reference. I also see quite a few encoding errors, but I unintentionally left them off my list because I handle them separately from other well-formedness problems.
It's interesting that Google Reader has encountered so many problems with mismatched open/ending tags. I don't run into that problem very often, although that probably says more about the feeds I subscribe to than it does about the feed world as a whole.
Posted by: Nick Bradbury | Thursday, September 21, 2006 at 09:34 PM
Thanks for the tips, Nick. You used Delphi to develop FeedDemon, no? I found that using TXmlDocument (with msxml for the domvendor) is a lot slower than using MSXML directly (through IXmlDomDocument).
Thanks
Posted by: Ariel Selig | Thursday, September 21, 2006 at 11:45 PM
I think you have a problem with your blog! The names and the text in the comments are not the right ones...
Posted by: Morten | Friday, September 22, 2006 at 08:58 AM
Ariel, yes, FeedDemon was developed using Delphi. I also found it faster to use MSXML directly.
Morten, I'm not sure what you mean. Where are you seeing incorrect names in the comments?
Posted by: Nick Bradbury | Friday, September 22, 2006 at 10:32 AM
I don't disagree with your decision to support broken feeds. There is an uneasy feeling, however, in allowing bad feeds to exist and perhaps encouraging the propagation of bad feeds (bad code is most likely being reused and copied). Is there anything that can be done to support bad feeds but at the same time clean up their source?
Posted by: Alan Kleymeyer | Friday, September 22, 2006 at 10:35 AM