Bloglines Crawler

Up until fairly recently, I didn’t use an RSS aggregator. Using my own bandwidth to check so many sites was just hammering my connection, and starting and stopping the application when I needed to reserve the bandwidth was just a pain.

Enter Bloglines. Yes, I’m late to the party. People have been praising it for months now but I only very recently saw the light. It is an excellent tool: allowing me to check sites quickly when I want to (and not using up much of my bandwidth), keeping unread items up to date no matter which computer I’m using, and making me far more productive.

Before Bloglines, I was struggling to keep up with about 50 sites regularly. I now easily keep up with around 120. Thoroughly recommended.

Now, I checked my server logs tonight for the first time in months and noticed that a single host had hit my site 2000 times this week. That’s roughly a hit every 5 minutes from one entity. Investigating a little further made it clear that the one entity was an aggregator: the Bloglines aggregator.

Now, Solitude is not a high-throughput site. I attempt to update once a day, but it’s usually more like once every two days. In the last week, there have been 3 updates (this being the 4th).

Think about that: 3 updates, 2000 checks. 3. 2000 checks. Notice the ever so slight disparity?

Those Bloglines guys make a very usable interface to a damn fine service, but they really need to work on the crawler’s update logic. It’s not that hard to extrapolate predictable update patterns. If a site is updating every 15 minutes, check it every 15 minutes. If it slows down and stays at once a day, checking every 15 minutes is probably very inappropriate. Once an hour would be better. You don’t lose any real sense of freshness and you don’t overdo server hits.

Common sense and the polite thing to do.
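To make the idea concrete, here is a rough sketch of the kind of logic I mean. It is not how the Bloglines crawler actually works; the function name `next_poll_interval` and the 15-minute and one-hour bounds are my own assumptions, taken from the numbers above. It averages the gaps between a site's recent updates and clamps the result between a floor and a ceiling.

```python
from datetime import datetime, timedelta

# Assumed bounds, taken from the figures in the post: never poll more often
# than every 15 minutes, never less often than once an hour.
MIN_INTERVAL = timedelta(minutes=15)
MAX_INTERVAL = timedelta(hours=1)


def next_poll_interval(update_times, min_interval=MIN_INTERVAL,
                       max_interval=MAX_INTERVAL):
    """Suggest how long to wait before checking a feed again.

    update_times is a list of datetimes at which the site was seen to update.
    The result is the average gap between updates, clamped to the bounds.
    """
    if len(update_times) < 2:
        # Not enough history to see a pattern, so be polite by default.
        return max_interval

    ordered = sorted(update_times)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    average_gap = sum(gaps, timedelta()) / len(gaps)

    # Clamp: busy sites get the floor, quiet sites get the ceiling.
    return max(min_interval, min(average_gap, max_interval))


# Hypothetical update histories; the start date is arbitrary.
start = datetime(2000, 1, 1)
busy = [start + timedelta(minutes=15 * i) for i in range(5)]
quiet = [start + timedelta(days=i) for i in range(4)]

print(next_poll_interval(busy))   # 0:15:00
print(next_poll_interval(quiet))  # 1:00:00
```

With a one-hour ceiling, a quiet site like this one would see 168 checks per feed in a week rather than 2000.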

  1. Mark Fletcher

    Thanks for the comments about Bloglines. The Bloglines crawler does only hit each URL once an hour. Is it possible that you publish multiple feeds, and that all those feeds are in the Bloglines database? If you send me a sample of the server logs we can investigate.

  2. Gary Fleming

    Christ, that was a fast and somewhat unprompted reply; another sign of a great service.

    Mark: 3 feeds: assuming Bloglines checks RSS 0.91, RSS 2.0 and Atom feeds only. The full list of feeds provided is on my Syndication page, but a few of those are unlikely to be checked.

    I don’t have raw logs, I’m afraid. My host only gives basic stats and a few top 10 lists (Bloglines being number 1 on the host requests list).

  3. Chris Miller

    Perhaps it is because different people are subscribing to different feeds through Bloglines. If someone subscribes to an RSS 2.0 feed and someone else to a 0.91 feed, you will be polled twice as often as you would be with just one feed. Offering three different feeds may mean your site is polled three times as often as required. Perhaps some type of Bloglines “subscription” for people with multiple feeds is required (i.e. giving Bloglines one feed to check for all syndication of your site).