Scoble has been writing about RSS bandwidth concerns lately, so I thought I'd once again post on this topic. I've posted before about using conditional HTTP GET (If-Modified-Since) to decrease RSS bandwidth consumption, but here's a simple recap of how this works:
Almost all aggregators store the date/time that a feed was last updated, and they pass this to the HTTP server via the If-Modified-Since HTTP header the next time they request the feed. If the feed hasn't changed since that date/time, the server returns an HTTP status code 304 to let the aggregator know the feed hasn't changed. So, the feed isn't re-downloaded when it hasn't changed, resulting in very little unnecessary bandwidth usage.
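At the wire level, the exchange described above is just two extra pieces of header data. A sketch of a request/response pair (the URL and date here are illustrative, not from a real feed):

```http
GET /feed.xml HTTP/1.1
Host: example.com
If-Modified-Since: Thu, 09 Sep 2004 14:00:00 GMT

HTTP/1.1 304 Not Modified
```

The 304 response has no body, so the only cost is a few hundred bytes of headers instead of the full feed.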
This sounds simple enough, but there's a big problem here: many high-traffic RSS feeds are created dynamically through server-side code, and the HTTP server won't automatically support conditional HTTP GET for dynamic feeds. So, all too often the feed is rebuilt each and every time it's requested - which is obviously a huge waste of both bandwidth and CPU time. One solution is to write your own code to return a 304 based on the If-Modified-Since header, but in many cases it makes more sense to use a static feed that's rebuilt only when new information needs to be added to it. For example, my FeedDemon FAQ feed is a static RSS file that's rebuilt whenever I add a new entry to the FeedDemon FAQ. This way, my HTTP server takes care of the If-Modified-Since comparison, and there's no unnecessary regeneration of the feed.
However, while this works well for feeds that don't require many updates, it's not the best approach for feeds that need to be updated more frequently. This is the problem I faced with my support forum feeds, which are created dynamically from information stored in a SQL Server database. Since new forum posts are often made every few minutes, I decided to use server-side code to limit how often aggregators can download the feeds. Almost all aggregators support conditional HTTP GET, so I simply check the If-Modified-Since date/time, and if it's within the last 15 minutes I return a 304 to tell the aggregator the feed hasn't changed - even if it has. This prevents aggregators from downloading the entire feed more often than once every 15 minutes.
Here's a snippet of the ASP.NET code I use to do this:
Dim dtNowUtc As DateTime = DateTime.Now().ToUniversalTime()
Dim sDtModHdr As String = Request.Headers.Get("If-Modified-Since")

' does the request contain an If-Modified-Since header?
If (sDtModHdr <> "") And IsDate(sDtModHdr) Then
    ' convert to a UTC date
    Dim dtModHdrUtc As DateTime = Convert.ToDateTime(sDtModHdr).ToUniversalTime()
    ' if it was within the last 15 minutes, return 304 and exit
    If DateTime.Compare(dtModHdrUtc, dtNowUtc.AddMinutes(-15)) > 0 Then
        Response.StatusCode = 304
        Response.End()
        Exit Sub
    End If
End If

' add Last-Modified to the response - FeedDemon stores this with the cached feed
' so it's passed back to the server the next time the feed is requested
Response.AddHeader("Last-Modified", dtNowUtc.ToString("r"))
Now, I'll be the first to admit it's not the most elegant hack, but so far it has worked very well for me. I considered checking the date/time of the most recent forum post and using that for the If-Modified-Since comparison, but that would've required a database hit each time the feed was requested, so I opted for the less precise but more CPU-friendly solution.
Extremely minor, but as long as you're posting code... I'm curious as to why you don't use string.Empty instead of "".
Posted by: Nicholas Hanson | Thursday, September 09, 2004 at 02:53 PM
This was actually the first piece of ASP.NET code I've written - so I wasn't aware of string.Empty :)
Posted by: Nick Bradbury | Thursday, September 09, 2004 at 02:54 PM
Heh, the idea sometimes touches two brains at the same time. I was thinking along the lines of query string params to pass the date this morning when I stumbled on your article, but this is generally non-standard, so the HTTP header idea is probably the best.
I'm in Russia on paid-per-Mb cable, so I really feel the pain as my blog list grows bigger ;)
Posted by: Raven | Friday, September 10, 2004 at 12:03 AM
Nick,
I've been thinking about Scoble's recent blog entry as well. In fact I went back to check my log entries, and he has a point....
Here's a thought for FeedDemon that maybe you haven't considered. Use an approach similar to Bloglines, and aggregate content for all your users. You can then push the status to your desktop clients using whatever method you like. You could use a more efficient protocol between the client and your server.
This allows the FeedDemon server to make one status poll for all the users. If you look at your logs you'll see that this is exactly what Bloglines does, and I think it's a pretty decent approach.
This changes your model of selling only desktop software a bit, but it is worth giving it some thought.
Posted by: Christopher Baus | Friday, September 10, 2004 at 01:12 AM
Excellent! I'm working on an ASP.NET app that consumes loads of feeds, but I need to optimize the retrieval process. At the moment I'm using Atom.NET and RSS.NET (SourceForge projects) to load each feed and check its last-modified date. I guess it would be much more efficient to manually check each If-Modified-Since header before actually loading it. However, this field is null on every feed I've tried so far. Any idea why? Here's how I load the If-Modified-Since header (C#):
HttpRequest r = new HttpRequest(null, feed.FeedURL, null);
string lastmodified = r.Headers.Get("If-Modified-Since");
-kenny
Posted by: kenny | Friday, September 10, 2004 at 01:28 AM
The flaw in this approach is that you're relying on the aggregators supporting conditional GETs, which was the problem in the first place. Those clients that don't support conditional GETs will never send an If-Modified-Since header and therefore will always receive a freshly generated copy.
Posted by: mmj | Friday, September 10, 2004 at 03:38 AM
mmj: I'd not be surprised to see cases where unconditional GETs for RSS feeds get returned either an error or some static data suggesting that an updated aggregator is required.
Posted by: Gwyn Evans | Friday, September 10, 2004 at 06:31 AM
I can see a scenario in which this would break quite badly.
A shared cache (i.e. an ISP's proxy server) requests the feed for the first time, and stores the result.
Somebody else at the same ISP requests the same feed within 15 minutes. Since they checked the feed longer than 15 minutes ago, the cache will see that its own copy is fresher, so it validates its copy (i.e. sends a request whose If-Modified-Since matches its own copy's Last-Modified date). It receives a 304 response, and sends that to the client. It updates its own copy to reflect how recently it checked for freshness.
A third person requests the feed, again within 15 minutes. The cache notices that it's got a fresher copy, validates it again, and the same thing happens again and again.
As long as at least one user of this proxy requests the feed in each 15 minute period, no users will ever receive an up-to-date feed.
You can't fix this by switching off public caching, as that would completely undermine your efforts to save bandwidth.
It would be better to set an Expires header for 15 minutes into the future. That way nobody should be requesting feeds more often.
Posted by: Jim | Friday, September 10, 2004 at 07:44 AM
Actually, it's already happening and yes, it does break when shared caches come into play. See http://nick.typepad.com/blog/2004/05/rss_abuse_and_s.html and the "blog entry" mentioned there, for example.
I'd read it, but forgotten about it!
Posted by: Gwyn Evans | Friday, September 10, 2004 at 09:23 AM
Well that's not quite the same issue, as Nick's proposal here isn't about banning identical IP addresses. It's the same underlying principle though; the wrong people are being left out because the system can't deal with shared caches effectively. It's just the "identifying mark" is the Last-Modified header rather than the IP address.
Posted by: Jim | Friday, September 10, 2004 at 10:55 AM
mmj, I agree with Gwyn. If I was shelling out $$$ to pay for bandwidth consumed by aggregators that fail to support essential features like conditional HTTP GET, I'd ban them from retrieving my feeds. I wouldn't be surprised to see this happen with some high-profile feeds before too long.
Posted by: Nick Bradbury | Friday, September 10, 2004 at 06:11 PM
Jim, are you saying that the proxy server would use its own Last-Modified date rather than the one FeedDemon uses in its HTTP request?
Posted by: Nick Bradbury | Friday, September 10, 2004 at 06:17 PM
Please don't mind the "with" part, it's a part of a whole post, but MT uses excerpt to send trackbacks, so I suppose I can't add links there.
Sorry.
Posted by: Rami Kayyali | Saturday, September 11, 2004 at 09:13 PM
A nit: DateTime.UtcNow()
Posted by: Scott Hanselman | Monday, September 13, 2004 at 04:43 PM