If you're considering writing a tool which reads web feeds, you probably think you can focus on the user experience and rely on an off-the-shelf XML parser to do the grunt work of parsing RSS and Atom. Unfortunately, I'm about to piddle on that idea.
When I created FeedDemon back in 2003, I quickly discovered that many feeds - including some very popular ones - weren't well-formed XML. This meant that they couldn't be read by a conforming XML parser such as MSXML, which I had planned to use. I figured that left me with two choices:
- Don't support feeds that aren't well-formed
- Write my own XML parser that can handle broken feeds
Given the sheer number of broken feeds I ran into, I decided that the first choice wasn't an option (after all, customers would complain if FeedDemon failed to handle feeds that existing aggregators had no trouble with). So I ended up writing my own XML parser, which was only slightly more enjoyable than a sharp poke in the eye. And as it turned out, my hand-coded XML parser was FeedDemon 1.0's Achilles' heel. I received countless bug reports that were caused by oversights in my parser, and countless more related to how I chose to interpret feeds that weren't well-formed.
So when I started working on FeedDemon 2.0, I was determined to improve this situation. I kept track of the most common well-formedness errors, and figured out how to fix them on-the-fly. I could then parse the fixed feed with MSXML, which was more reliable (albeit slightly slower) than my crappy little XML parser. This approach seems to have paid off, since the number of parser-related bug reports has gone way down since the release of FeedDemon 2.0 (although I should add that this may be due partly to the fact that feeds in general are less funky than they used to be).
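To make the "fix it, then hand it to a real parser" idea concrete, here is a minimal sketch. It uses Python's `xml.etree.ElementTree` as a stand-in for MSXML, and the names `parse_feed` and `repair` are hypothetical. Trying the strict parse first and repairing only on failure is an assumption of this sketch, not necessarily how FeedDemon orders the steps.

```python
import xml.etree.ElementTree as ET


def parse_feed(text, repair):
    """Parse feed text with a strict parser, repairing it only if that fails.

    `repair` is any caller-supplied function that takes the raw feed text
    and returns a (hopefully) well-formed version of it.
    """
    try:
        return ET.fromstring(text)
    except ET.ParseError:
        # The strict parse failed, so apply the fixes and try exactly once more.
        # If this second attempt raises, the feed is beyond what we can repair.
        return ET.fromstring(repair(text))
```

Well-formed feeds never pay for the repair pass this way, and a feed that still fails after repair surfaces as an ordinary `ParseError` that the caller can report.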
If you're just starting down this road, hopefully I can save you some time. While there are all sorts of things that can invalidate an XML document, there are only a handful of problems that I've seen with any regularity:
- Using `&` instead of `&amp;`
- Using `<` instead of `&lt;`
- Using `>` instead of `&gt;`
- Using a character such as a form feed (`&#12;`) that's not permitted in XML
- Forgetting the semicolon after an entity (ex: `&lt` instead of `&lt;`)
- Using an HTML entity such as `&nbsp;` that's not supported by XML
- Using an undefined namespace
- Whitespace before the XML prolog (MSXML doesn't like this!)
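Most of the problems in this list can be patched with text substitutions before the feed ever reaches the parser. The sketch below is one way to do that in Python, not the code FeedDemon uses; `repair_feed` and its helpers are hypothetical names, and the regular expressions are heuristics. It leaves CDATA sections alone (a raw `&` or `<` is legal there), and it doesn't touch a bare `>`, because escaping that safely requires knowing whether it closes a tag.

```python
import re
from html.entities import name2codepoint

XML_ENTITIES = {"amp", "lt", "gt", "quot", "apos"}  # predefined by XML itself

CDATA = re.compile(r"(<!\[CDATA\[.*?\]\]>)", re.S)
# Raw characters XML 1.0 forbids: most C0 controls, plus U+FFFE and U+FFFF.
ILLEGAL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")
NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")
MISSING_SEMI = re.compile(r"&(amp|lt|gt|quot|apos)(?![A-Za-z0-9;])")
NAMED_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")
# An "&" that does not start a complete entity or character reference.
BARE_AMP = re.compile(r"&(?!#[0-9]+;|#[xX][0-9a-fA-F]+;|[A-Za-z][A-Za-z0-9]*;)")
# A "<" that cannot be the start of a tag, comment, CDATA section or PI.
BARE_LT = re.compile(r"<(?![A-Za-z_:/!?])")


def _is_xml_char(cp):
    """True if the code point is allowed by the XML 1.0 Char production."""
    return (cp in (0x9, 0xA, 0xD) or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD or 0x10000 <= cp <= 0x10FFFF)


def _drop_bad_ref(match):
    """Remove references such as &#12; that name a forbidden character."""
    cp = int(match.group(1), 16) if match.group(1) else int(match.group(2))
    return match.group(0) if _is_xml_char(cp) else ""


def _fix_named(match):
    """Turn HTML-only entities into numeric references XML understands."""
    name = match.group(1)
    if name in XML_ENTITIES:
        return match.group(0)
    if name in name2codepoint:
        return "&#%d;" % name2codepoint[name]
    return "&amp;" + name + ";"  # unknown name: treat the "&" as literal text


def _repair_markup(chunk):
    chunk = NUMERIC_REF.sub(_drop_bad_ref, chunk)
    chunk = MISSING_SEMI.sub(r"&\1;", chunk)
    chunk = NAMED_ENTITY.sub(_fix_named, chunk)
    chunk = BARE_AMP.sub("&amp;", chunk)
    return BARE_LT.sub("&lt;", chunk)


def repair_feed(text):
    """Apply the text-level fixes; CDATA sections only lose illegal characters."""
    text = text.lstrip("\ufeff").lstrip()  # nothing may precede the XML prolog
    text = ILLEGAL_CHARS.sub("", text)
    parts = CDATA.split(text)  # odd indexes hold the CDATA sections themselves
    parts[0::2] = [_repair_markup(p) for p in parts[0::2]]
    return "".join(parts)
```

The order matters: missing semicolons are restored before named entities are translated, and bare ampersands are escaped last so that the references produced by the earlier steps aren't escaped a second time. Undefined namespaces aren't handled here, since they need a different kind of fix.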
If you're building a tool that needs to handle feeds that aren't well-formed, you may still be able to use a third-party parser if you correct these common mistakes. Although fixing well-formedness errors isn't a task for a novice programmer, it's still simpler than coding an XML parser, and overall this approach has worked well for me.
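Of the common mistakes, the undefined namespace is the one that isn't a simple character substitution, so here is a rough sketch of one possible fix: find element prefixes that are never declared and declare them on the root element. The function name, the placeholder `urn:x-undeclared:` URIs, and the idea of guessing a URI from a conventional prefix are all assumptions of this sketch. It only inspects element names, so prefixed attributes are not covered.

```python
import re

# Conventional URIs for two prefixes that are common in feeds. Matching on the
# prefix is a guess: a prefix is only a local alias, so a feed may mean
# something else by it.
KNOWN_PREFIXES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "content": "http://purl.org/rss/1.0/modules/content/",
}

ELEMENT_PREFIX = re.compile(r"</?([A-Za-z_][\w.-]*):")
DECLARED_PREFIX = re.compile(r"xmlns:([A-Za-z_][\w.-]*)\s*=")
# First start tag in the text; skips "<?xml ...?>" and "<!DOCTYPE ...>".
ROOT_START = re.compile(r"<[A-Za-z_][^\s/>]*")


def declare_missing_namespaces(text):
    """Declare, on the root element, every element prefix the feed never defines."""
    used = set(ELEMENT_PREFIX.findall(text))
    declared = set(DECLARED_PREFIX.findall(text)) | {"xml"}  # "xml" is built in
    missing = sorted(used - declared)
    root = ROOT_START.search(text)
    if not missing or root is None:
        return text
    decls = "".join(
        ' xmlns:%s="%s"' % (p, KNOWN_PREFIXES.get(p, "urn:x-undeclared:" + p))
        for p in missing
    )
    # Splice the declarations in right after the root element's name.
    return text[:root.end()] + decls + text[root.end():]
```

With a placeholder URI the parser will accept the document, but code that looks elements up by their real namespace won't find them, which is why the sketch prefers a conventional URI when it recognizes the prefix.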