Happenings

One Interface

Raymond Chen has reasonably pointed out that most low-end users don’t like using the search boxes that modern browsers provide in addition to address bars. I couldn’t agree more.

The first thing I do with any new Firefox install, before loading up bookmarks or any of my favourite extensions, is to remove the search bar. Why?

  1. Ease of use. I don’t want to have to think about which box I’m going to use for a particular action, especially when the most common action is using Google, which works equally well in both boxes.
  2. The address bar is faster. Any time you move your hands from the keyboard to the mouse, you lose time. I don’t know if the search bar has a keyboard shortcut, but I know that the address bar does (Alt+D). Even if the search bar has one, how do I then quickly select a search engine other than the default?
  3. The address bar is multi-functional. I can use it for URLs and Google, but if I add some keyword searches to my bookmarks (replace the query string with %s and add a keyword), then I can have any search engine I want at my fingertips.
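For anyone wondering what a keyword search actually does, here is a minimal Python sketch of the idea. It is not how Firefox implements it; the keyword table, the example.com URLs and the expand function are all made up for illustration, and it ignores the case where you type a real URL.

```python
from urllib.parse import quote_plus

# Hypothetical keyword -> URL template table; the URLs are placeholders,
# not real search endpoints. %s marks where the query goes.
KEYWORDS = {
    "jdoc": "https://example.com/javadoc?class=%s",
    "acronym": "https://example.com/acronyms?q=%s",
}
DEFAULT_SEARCH = "https://example.com/search?q=%s"


def expand(typed: str) -> str:
    """Turn address-bar input into a URL using keyword bookmarks."""
    typed = typed.strip()
    # The first word is the candidate keyword, the rest is the query.
    keyword, _, rest = typed.partition(" ")
    template = KEYWORDS.get(keyword)
    if template is not None and rest:
        return template.replace("%s", quote_plus(rest))
    # No known keyword: hand the whole input to the default search.
    return DEFAULT_SEARCH.replace("%s", quote_plus(typed))


print(expand("jdoc ArrayList"))
print(expand("film fight"))
```

The first word picks the template and everything after it is URL-encoded into the %s slot, which is why one box can stand in for any number of search engines.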

This makes the search box dead space in a prime location. Its one (recent) saving grace is the addition of Google-style autocomplete. With a little more work this could, and should, be integrated into the address bar. At the very least, the option should be available.

So, what other things do I use the address bar quick search for?

  • Javadoc – Look up any standard class with a quick “jdoc Class”.
  • Man pages – As with terminals, “man [program]” works just fine for looking up UNIX stuff.
  • Acronyms – Need to know what something means ASAP? “acronym ASAP” would do.

I use it for dictionaries (“dic”), the thesaurus (“thes”), torrents (“torr”), and just about any other common search because it is better. Give it a try.

iBatis And Like

A reminder for myself for the future (and for anyone else doing Java development): using the like operator in iBatis. The metacharacter escaping you need in your XML is column like '%'||#property#||'%', where property is a bean property in your parameter class.

Also, if you’re doing everything in your where clause dynamically, I would recommend looking at dynamic and isNotNull tags.

Film Fight: October 2006

Due to a silly number of other events in October, like a stupid number of birthdays (including, of course, my own birthday), I only saw three films in the cinema (though I did watch a decent number from my DVD backlog).

Will Ferrell hits the race track for his latest character film, Talladega Nights: The Ballad of Ricky Bobby. This is exactly what you expect from a recent Ferrell film: a tale of a man who is top of his field hitting hard times and unexpected competition, having to re-learn and evaluate himself, and coming back stronger than ever, with a number of bizarre characters and memorable lines. While it fails to match previous efforts, like the excellent Anchorman, it’s still a worthwhile comedy if you like the lead’s goofy schtick.

Former Ferrell collaborator, Steve Carell, plays it straight for once in the delightful indie flick, Little Miss Sunshine. The story of a suburban family whose ties are wearing thin, the film follows them on a road trip across state to a (fairly nauseating) children’s beauty pageant. As with most good tales, the focus is on the journey, not the destination. The trials and breakdowns as they make their trip are tragic, hilarious and occasionally shocking. While a certain gloom hangs over areas of the film, it never lasts for long; the film serves as an affirmation of the tribulations of family life.

A couple of years back, the great Hong Kong crime trilogy, Infernal Affairs, kicked off. That is long since finished, but a heavyweight Hollywood version has appeared in the form of The Departed. The tale of a cop going undercover in a crime syndicate, and the same crime syndicate planting a mole in the cops, this version builds the same tension through paranoia and near-misses, with some excellent performances from Jack Nicholson and Leonardo DiCaprio. Although the dialogue is better (post-translation), it does go wayward in a few places (like the love story). Still a very worthy film.

It’s a tricky month, but I would say Little Miss Sunshine edges it.

Plagiarism: Tokenising, Part 1

After a long break since our plagiarism introduction, it’s time to get started on the real work. I would go and get some coffee before reading, because this is going to take a while.

The first thing you really need to know about implementing a plagiarism detection system, and most natural language processors, is called tokenising: the process of taking a document and splitting it into smaller parts before the real analysis can take place.

What exactly do I mean by that? Say you have 300 students all handing in essays or a piece of code (we will choose the latter simply because the syntax can be easier), and you want to compare them for the purpose of finding plagiarists. We’ll say that one of these pieces of code includes the line “for i in 1..10 loop”. The computer does not understand that snippet, or the language it is written in (Ada, incidentally). For it to compare it to other snippets, it must first do what the human brain has evolved to do naturally: split the sentences into words and extract meaning from the words. Without splitting up a page somewhat, you can understand it far less well than you can if you break it up, and you have less tolerance for slight differences when comparing documents. The same applies for the computer.

That’s not strictly true. The computer can just read those documents in as one big long string and compare them the dumb way, character by character. This, however, gives you much less tolerance for slight changes when comparing two very similar documents. A series of changes, even simple ones like adding spaces or punctuation, would make two very similar documents look different. This is a bad thing. We don’t want the cheaters to be able to beat our system by making small changes. If they make a lot of changes, that makes our life harder (are they still really cheating if they’ve changed everything?) but we certainly don’t want them getting away with small changes.
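To see how brittle the dumb way is, here is a small Python sketch. The positional_match function is my own illustrative measure (the fraction of character positions that agree), not a standard plagiarism metric.

```python
def positional_match(a: str, b: str) -> float:
    """Fraction of character positions at which two strings agree."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are trivially identical
    # zip stops at the shorter string; leftover characters count as misses.
    matches = sum(x == y for x, y in zip(a, b))
    return matches / longest


original = "for i in 1..10 loop"
copied = "for  i in 1..10 loop"  # the same line with one extra space

print(positional_match(original, original))
print(positional_match(original, copied))
```

One extra space shifts everything after it, so most positions stop lining up even though a human would call the two lines identical.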

So, we take the document, figure out what constitutes a “word” in the language of the document, and turn the document into a list of these “words”. We call these words tokens. We can compare tokens much more easily and efficiently than comparing full strings, and with more tolerance for change. So, our earlier example might be broken up like this: ["for", "i", "in", "1", "..", "10", "loop"]
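As a rough illustration, a hand-rolled regular expression is enough to split that one example line in Python. This is only a sketch under my own assumptions about what counts as a token (identifiers, integers, the “..” range operator, and any other single symbol); it is nowhere near the full Ada lexical rules.

```python
import re

# Assumed token shapes for this toy example only:
# identifiers/keywords, integers, the ".." range operator,
# then any other single non-space character.
TOKEN_PATTERN = re.compile(r"[A-Za-z_]\w*|\d+|\.\.|\S")


def tokenise(source: str) -> list:
    """Split source text into a flat list of token strings."""
    return TOKEN_PATTERN.findall(source)


print(tokenise("for i in 1..10 loop"))
# ['for', 'i', 'in', '1', '..', '10', 'loop']

# Extra white space no longer matters once we compare tokens.
print(tokenise("for  i in 1 .. 10   loop") == tokenise("for i in 1..10 loop"))
# True
```

Comparing the two token lists treats the padded line and the original as the same thing, which is exactly the tolerance the character-by-character approach lacks.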

Anyone who is being astute will notice that I just skipped over the real crux of the problem: how do we determine what a token is? Well, that really depends on the language that the document is in and, to an extent, the domain. Consider:

  • An essay written in English is very different from an essay written in Spanish.
  • An essay written in English is very different from a Java source file.
  • An essay about biology is likely very different (in language and style) from an essay about history.

What you really need to understand about tokenising is that it is:

  1. Language specific – in that to get anywhere worthwhile with your tokenising, you will need a different tokeniser for each language.
  2. Domain specific – to get the most out of your tokeniser, it needs to be tuned to the domain of the topic (e.g. history, medicine, etc.) to better take advantage of how people in that domain exploit and use the language.

The latter of these two I will leave as an exercise to the reader, and not discuss much further. The former is the core of the problem we want to solve: how do we create a tokeniser for a specific language? Easy, we look at the grammar of the language.

Now, I know some people just shuddered at the mention of the word “grammar”, but it’s not so bad. The grammar for a language is simply the set of rules that allow for the construction of language features, a bunch of very simple rules that state which tokens can appear where and what form they should take when near other tokens. Thankfully, just about every computer language has a detailed specified grammar somewhere. If you’re suffering insomnia, have a look at the Ada syntax or the Java grammar. Thrilling stuff, I know.

You know your language, you have your grammar, now you need a parser. The parser takes the grammar and documents you provide, and spits out your tokens. This is a well-understood area, and you may even have some tools to do it for you. If not, there are plenty of tools available that will parse for you: yacc, bison and lex are just a few command line tools that will parse documents. I would recommend SableCC if you’re writing your plagiarism detection system in Java: it has a decent API and does most of the work for you. You just need to take the trees it produces (for compiler use), and flatten them into an array; a trivial task that amounts to an in-order walk.
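The flattening step looks roughly like this in Python. The Node class here is a made-up stand-in for whatever tree your parser hands back; it is not SableCC’s API, and the tree is built by hand purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical parse-tree node: leaves carry the token text."""
    label: str
    children: List["Node"] = field(default_factory=list)


def flatten(node: Node) -> List[str]:
    """Collect leaf labels left to right with a depth-first walk."""
    if not node.children:
        return [node.label]
    tokens: List[str] = []
    for child in node.children:
        tokens.extend(flatten(child))
    return tokens


# Hand-built tree for "for i in 1..10 loop".
tree = Node("loop_statement", [
    Node("for"), Node("i"), Node("in"),
    Node("range", [Node("1"), Node(".."), Node("10")]),
    Node("loop"),
])

print(flatten(tree))
# ['for', 'i', 'in', '1', '..', '10', 'loop']
```

Interior nodes such as loop_statement and range only describe structure, so the walk drops them and keeps the leaves in source order.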

Wow! We’ve now tokenised the document. End of story.

Well, not quite. The problem with using a pure parser-and-grammar approach is that it doesn’t work if your input (your code snippets or essays) is malformed. If even one part of your document breaks a rule, your parser is going to complain. Most will refuse to process it further. Handling this is your responsibility, and it’s not an easy problem to solve. The simplest solution I can suggest is falling back to white-space tokenising when something fails and logging the problem, feeding a better solution into the next version of your system. It’s hardly ideal. Also, always remember Postel’s law and consider that just because someone doesn’t fully follow the rules of a language, that does not mean they should get away with plagiarism.
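A minimal Python sketch of that fallback might look like the following. The strict_tokenise function is a toy stand-in I invented for a real grammar-based parser (it only checks for balanced parentheses), and ParseError is likewise hypothetical; the point is the try, log, fall back shape.

```python
import logging

logger = logging.getLogger("tokeniser")


class ParseError(Exception):
    """Hypothetical error raised when the input breaks the grammar."""


def strict_tokenise(source: str) -> list:
    """Toy stand-in for a grammar-based parser: rejects unbalanced parens."""
    if source.count("(") != source.count(")"):
        raise ParseError("unbalanced parentheses")
    return source.replace("(", " ( ").replace(")", " ) ").split()


def tokenise_with_fallback(source: str) -> list:
    """Try the strict parser first; on failure, log and split on white space."""
    try:
        return strict_tokenise(source)
    except ParseError as err:
        # Record the failure so the grammar can be improved later.
        logger.warning("parse failed (%s); using white-space tokens", err)
        return source.split()


print(tokenise_with_fallback("put (x)"))
# ['put', '(', 'x', ')']
print(tokenise_with_fallback("put (x"))
# ['put', '(x']
```

The fallback tokens are cruder, but the malformed submission still gets compared rather than silently skipped.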

Are we done now? No, but we’ve done enough for today. More tokenising soon.

This Is 23

Another year down. 23. I’m at that fun stage in life where you’re sure you’re still young but end up in a whole bunch of situations where you feel old. It doesn’t help the ego, but I imagine it will only get more and more acute so I should enjoy my pseudo youth as best as I can.

It’s been a fast year. This time last year, I was about 3 months into my job with the BigCo and still a little out of my depth. Now, I’m completely comfortable with my job and skills: they keep throwing me stuff out of my comfort zone, and I figure it out pretty quickly. Need a massively scalable web application with ridiculously low latency? I can do that. Want to compare any number of Solaris boxes from Sun for specific infrastructures? I can do that too. I couldn’t say that a year ago.

I’ve also moved into a flat that I rather like. City centre living is definitely for me, right now. I can see myself wanting to live further into the countryside later in life but, right now, in the middle of everything is where I am enjoying myself. Two minutes from all the things I want to do.

I’ve continued all that fun learning about the world and people, etc, but feel less inclined to talk about it. I know I’m better, and that’s all I need to say.

Incidentally, by pure chance, I had the same song stuck in my head from my birthday post last year. So, older, better, but still the same.