March 21, 2004

TwentyFour24

Yes, it has been an inordinately quiet week here. The last push of work in third year, along with the weekend of celebrations following the end, meant that I’ve been kept rather busy. Which brings us nicely to the subject of this post.

As part of my third year at the University of Glasgow, I’ve been involved in building a distributed, topic-driven web crawler in Java, with Derek, Matt and two others.

The end result is TwentyFour24bot.

There isn’t much on the site yet, but I think we’ll be putting the source up (once we know it is OK to do so), as well as the dissertation.

I don’t know how interesting the dissertation would be to most people (at 90 pages, it’s not exactly easy going), but we do outline some of the more interesting aspects of the design: keeping data moving smoothly around a distributed system under heavy load, implementing politeness constraints for a crawler (a heavily overlooked area – we couldn’t find any other papers on this), and building relevance algorithms that don’t rely on PageRank networks.
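For anyone wondering what a politeness constraint looks like in practice, here is a minimal sketch in Java. It is not the TwentyFour24bot code: the class name `PolitenessGate`, the `reserve` method and the single fixed delay per host are all assumptions made for illustration. The idea is simply that a crawler should never hit the same host twice within some minimum interval, however many URLs for that host are queued.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: the names and the fixed per-host delay are
// assumptions, not taken from the TwentyFour24bot source.
class PolitenessGate {
    private final long minDelayMillis;

    // Earliest time (in ms) at which each host may next be contacted.
    private final Map<String, Long> nextAllowed = new HashMap<>();

    PolitenessGate(long minDelayMillis) {
        this.minDelayMillis = minDelayMillis;
    }

    /**
     * Reserves the next fetch slot for a host and returns the time (in ms)
     * at which the caller may start the request. The clock value is passed
     * in rather than read here, which keeps the class easy to test.
     */
    synchronized long reserve(String host, long nowMillis) {
        long earliest = nextAllowed.getOrDefault(host, nowMillis);
        long start = Math.max(earliest, nowMillis);
        // The following request to this host must wait at least minDelayMillis.
        nextAllowed.put(host, start + minDelayMillis);
        return start;
    }
}
```

A fetcher thread would call `reserve(host, System.currentTimeMillis())` and sleep until the returned time before opening the connection. Requests to different hosts do not delay each other, so the crawler as a whole can stay busy while each individual site sees a slow, steady trickle.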

It’s probably not worth reading if you don’t have an interest in information retrieval, but we’re all just glad it’s over (and the website was sitting there, unlinked).