Happenings

Plagiarism: Tokenising, Part 1

After a long break since our plagiarism introduction, it’s time to get started on the real work. I would go and get some coffee before reading, because this is going to take a while.

The first thing you really need to know about implementing a plagiarism detection system, and most natural language processing systems, is tokenising: the process of taking a document and splitting it into smaller parts before the real analysis can take place.

What exactly do I mean by that? Say you have 300 students all handing in essays or pieces of code (we’ll choose the latter simply because the syntax is easier to deal with), and you want to compare them for the purpose of finding plagiarists. We’ll say that one of these pieces of code includes the line “for i in 1..10 loop”. The computer does not understand that snippet, or the language it is written in (Ada, incidentally). Before it can compare it to other snippets, it must first do what the human brain has evolved to do naturally: split the text into words and extract meaning from those words. Without splitting a page up, you understand it far less well than you do when it is broken into words, and you have far less tolerance for slight differences when comparing documents. The same applies to the computer.

That’s not strictly true, of course: the computer could just read those documents in as one big, long string and compare them the dumb way, character by character. This, however, gives you much less tolerance for slight changes when comparing two very similar documents. A series of changes, even simple ones like adding spaces or punctuation, would make two very similar documents look different. This is a bad thing. We don’t want the cheaters to be able to beat our system by making small changes. If they make a lot of changes, that makes our life harder (are they still really cheating if they’ve changed everything?), but we certainly don’t want people getting away with small changes.

So, we take the document, figure out what constitutes a “word” in the language of the document, and turn the document into a list of these “words”. We call these words tokens. We can compare tokens much more easily and efficiently than full strings, and with more tolerance for change. So, our earlier example might be broken up like this: ["for", "i", "in", "1", "..", "10", "loop"].
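
To make that concrete, here is a minimal sketch of a naive tokeniser in Java. The class name and regular expression are mine, purely for illustration; a real tokeniser would be driven by the language’s grammar, as we’ll get to shortly. It simply splits a line into identifiers, numbers and runs of symbols:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // A deliberately naive tokeniser: identifiers, numbers, or runs of symbols.
    // Real systems use a grammar-driven lexer; this just illustrates the idea
    // of turning a line of source code into a list of tokens.
    public class NaiveTokeniser {

        private static final Pattern TOKEN =
                Pattern.compile("[A-Za-z_][A-Za-z0-9_]*|[0-9]+|[^\\sA-Za-z0-9_]+");

        public static List<String> tokenise(String source) {
            List<String> tokens = new ArrayList<>();
            Matcher m = TOKEN.matcher(source);
            while (m.find()) {
                tokens.add(m.group());
            }
            return tokens;
        }

        public static void main(String[] args) {
            // Prints: [for, i, in, 1, .., 10, loop]
            System.out.println(tokenise("for i in 1..10 loop"));
        }
    }

Running it over our Ada line produces exactly the token list above.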

The astute among you will notice that I just skipped over the real crux of the problem: how do we determine what a token is? Well, that really depends on the language the document is in and, to an extent, the domain. Consider:

  • An essay written in English is very different from an essay written in Spanish.
  • An essay written in English is very different from a Java source file.
  • An essay about biology is likely very different (in language and style) from an essay about history.

What you really need to understand about tokenising is that it is:

  1. Language specific – to get anywhere worthwhile with your tokenising, you will need a different tokeniser for each language.
  2. Domain specific – to get the most out of your tokeniser, it needs to be tuned to the domain of the topic (e.g. history, medicine, etc.) to better take advantage of how people in that domain use the language.

The latter of these two I will leave as an exercise for the reader, and not discuss much further. The former is the core of the problem we want to solve: how do we create a tokeniser for a specific language? Easy: we look at the grammar of the language.

Now, I know some people just shuddered at the mention of the word “grammar”, but it’s not so bad. The grammar for a language is simply the set of rules that allow for the construction of language features: a bunch of very simple rules that state which tokens can appear where and what form they should take when near other tokens. Thankfully, just about every computer language has a detailed, formally specified grammar somewhere. If you’re suffering from insomnia, have a look at the Ada syntax or the Java grammar. Thrilling stuff, I know.
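
To give a flavour of what such rules look like, here is a made-up, heavily simplified fragment in a BNF-style notation. It is not the real Ada grammar, just the general shape of the thing, covering our loop example:

    loop_statement ::= "for" identifier "in" range "loop" statement_list "end" "loop" ";"
    range          ::= number ".." number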

You know your language, you have your grammar; now you need a parser. The parser takes the grammar and the documents you provide, and spits out your tokens. This is a well-understood area, and there are plenty of tools that will generate a parser for you: yacc, bison and lex are a few of the classic command-line options. I would recommend SableCC if you’re writing your plagiarism detection system in Java: it has a decent API and does most of the work for you. You just need to take the trees it produces (they are built with compilers in mind), and flatten them into an array; a trivial task that amounts to an in-order walk.
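
If you are curious what that flattening looks like, here is a rough sketch. The Node type is a hypothetical stand-in for whatever tree your parser hands you (SableCC’s real API differs); the point is just that a depth-first, left-to-right walk over the leaves recovers the original token sequence:

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical parse-tree node: leaves hold a token, inner nodes hold children.
    // This is not SableCC's API, just an illustration of the flattening walk.
    class Node {
        String token;                                // non-null only on leaf nodes
        List<Node> children = new ArrayList<>();
    }

    public class TreeFlattener {

        // Collect the leaves in left-to-right order, i.e. the token sequence.
        public static List<String> flatten(Node root) {
            List<String> tokens = new ArrayList<>();
            walk(root, tokens);
            return tokens;
        }

        private static void walk(Node node, List<String> tokens) {
            if (node.children.isEmpty()) {
                if (node.token != null) {
                    tokens.add(node.token);
                }
                return;
            }
            for (Node child : node.children) {
                walk(child, tokens);
            }
        }
    }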

Wow! We’ve now tokenised the document. End of story.

Well, not quite. The problem with using a pure parser-and-grammar approach is that it doesn’t work if your input (your code snippets or essays) is malformed. If even one part of your document breaks a rule, your parser is going to complain, and most will refuse to process it further. Handling this is your responsibility, and it’s not an easy problem to solve. The simplest solution I can suggest is falling back to whitespace tokenising when something fails and logging the problem, feeding a better solution into the next version of your system. It’s hardly ideal. Also, always remember Postel’s law, and consider that just because someone doesn’t fully follow the rules of a language, that does not mean they should get away with plagiarism.
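
As a sketch of that fallback, with the grammar-driven tokeniser passed in as a parameter (a stand-in for whatever your real parser exposes and throws):

    import java.util.Arrays;
    import java.util.List;
    import java.util.function.Function;
    import java.util.logging.Logger;

    public class FallbackTokeniser {

        private static final Logger LOG =
                Logger.getLogger(FallbackTokeniser.class.getName());

        // 'grammarTokeniser' stands in for your real grammar-driven lexer.
        // If it rejects a malformed document, log the problem and fall back
        // to plain whitespace splitting rather than dropping the submission.
        public static List<String> tokenise(String source,
                                            Function<String, List<String>> grammarTokeniser) {
            try {
                return grammarTokeniser.apply(source);
            } catch (RuntimeException e) {
                LOG.warning("Grammar-based tokenising failed, falling back to whitespace: "
                        + e.getMessage());
                return Arrays.asList(source.trim().split("\\s+"));
            }
        }
    }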

Are we done now? No, but we’ve done enough for today. More tokenising soon.

This Is 23

Another year down. 23. I’m at that fun stage in life where you’re sure you’re still young but end up in a whole bunch of situations where you feel old. It doesn’t help the ego, but I imagine it will only get more and more acute so I should enjoy my pseudo youth as best as I can.

It’s been a fast year. This time last year, I was about 3 months into my job with the BigCo and still a little out of my depth. Now, I’m completely comfortable with my job and skills: they keep throwing me stuff out of my comfort zone, and I figure it out pretty quickly. Need a massively scalable web application with ridiculously low latency? I can do that. Want to compare any number of Solaris boxes from Sun for specific infrastructures? I can do that too. I couldn’t say that a year ago.

I’ve also moved into a flat that I rather like. City centre living is definitely for me, right now. I can see myself wanting to live further into the countryside later in life but, right now, in the middle of everything is where I am enjoying myself. Two minutes from all the things I want to do.

I’ve continued all that fun learning about the world and people, etc, but feel less inclined to talk about it. I know I’m better, and that’s all I need to say.

Incidentally, by pure chance, I had the same song stuck in my head from my birthday post last year. So, older, better, but still the same.

Plagiarism: Introduction

Having been involved in various natural language problems at university, the area interests me greatly as a series of very difficult problems. Despite that, I’ve written very little about it. This series on plagiarism detection is an attempt to rectify that.

First, the problem: people are opportunistic, lying cheats who will take any advantage to get ahead. Well, if you believe the tabloids. While we might not all be like that, sadly, the number of people cheating at university is increasing. Rather than working hard and producing their own work, students hand in suspiciously similar pieces, much to their professors’ dismay. Of course, with rather a lot of students in any given course, it can be very difficult to pick out a handful as being similar. Lots of people, lots of entries, and not enough time to go through them and compare every single one against every other by hand.

Take some time to appreciate the scale of the problem: if a class has 300 people, and you want to compare them all against each other, you’re invoking the classic handshake problem. To compare them all, you have to do 44850 checks. Ouch.
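
If that figure looks arbitrary, it is just the handshake formula, n(n-1)/2, counting every distinct pair of submissions; a trivial sketch:

    public class PairCount {
        // Handshake problem: number of distinct pairs among n submissions.
        static long pairs(long n) {
            return n * (n - 1) / 2;
        }

        public static void main(String[] args) {
            System.out.println(pairs(300)); // prints 44850
        }
    }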

This is where plagiarism detection comes in: we want to automatically detect any near-duplicates in a group of submissions, filtering out the honest submissions in the process.

Note that although I will refer to most examples in terms of students, submissions and universities, you can substitute in a wide range of domains: finding copied texts in a library, finding cross-quotations, spotting duplicate entries in financial market systems; the list is nearly endless.

Next, a couple of useful terms: if one person copies another without that person knowing, we call that plagiarism. If two or more people copy one another intentionally, we call that collusion. For our purposes, these are the same thing. Why? It would be tremendously difficult, if not impossible, to infer the intentions behind any entries marked as potential duplicates. Because of this, and other possible failings in the system, it is never for the system to act as the final decision maker on whether or not people are cheating; it merely serves as an ordering system for someone else to make a decision. We will, therefore, use the terms “plagiarism” and “collusion” interchangeably.

It’s worth reiterating that plagiarism detection systems are just ordering systems: they take in a bunch of entries, find groups of similar entries (normally simple pairs of entries), and order those groups by the similarity found within. A user would then look at the produced ordering and decide whether these entries really are similar. There’s nothing unusual about that, when you think about it. Google is doing something pretty similar: it takes a bunch of web pages and a query, groups them on key attributes (words, meta-data, rankings), and orders the pages by how similar they are to the query. A user would then look at the produced ordering and decide which entries best matched the query. You’ve done this a million times yourself, you know it mostly works.
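
Stripped to its bones, that ordering system is little more than the following sketch. Everything here is illustrative scaffolding invented for this post; the similarity function, left as a parameter, is the genuinely hard part and the subject of the rest of the series:

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.function.BiFunction;

    public class PlagiarismRanker {

        // A scored pair of submissions, identified by index into the input list.
        public record ScoredPair(int first, int second, double similarity) {}

        // Compare every pair of submissions with the supplied similarity measure
        // and return the pairs ordered from most to least similar, ready for a
        // human to work down the list and make the actual judgement.
        public static List<ScoredPair> rank(List<String> submissions,
                                            BiFunction<String, String, Double> similarity) {
            List<ScoredPair> scored = new ArrayList<>();
            for (int i = 0; i < submissions.size(); i++) {
                for (int j = i + 1; j < submissions.size(); j++) {
                    double s = similarity.apply(submissions.get(i), submissions.get(j));
                    scored.add(new ScoredPair(i, j, s));
                }
            }
            scored.sort(Comparator.comparingDouble(ScoredPair::similarity).reversed());
            return scored;
        }
    }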

You also know that, heck, sometimes it doesn’t work; you can’t trust the ordering with absolute certainty. In this way, plagiarism detection, search engines and other natural language systems are, at the time of writing, incomplete. Until we have the natural language problem nailed, we have to accept that they will sometimes produce errors. In the case of plagiarism detection systems, those errors could lead to someone being kicked out of university if the risks are not understood. Again, these systems are just for ordering, not absolute decision making.

Another point worth noting is that plagiarism detection systems generally deal with intra-corpus search. That is, we know and have absolutely every entry in the set: they all come from submissions for a set exercise with a set deadline. After the deadline, the set is fixed in size, it will never grow, and it is almost always relatively small (a few hundred or a few thousand entries). Compare that with web search, where the corpus of documents is continually growing and has billions of entries. So we’re comparing the entries within a fixed-size set of documents, not against a wider set. It is possible to compare against a much larger set, like, say, the internet, to find people copying from there, but that is a much more difficult problem and beyond the scope of most of this series.

We’re nearly done with this introduction. I hope that, by the time I end this series, you will have a much better feel for how plagiarism detection works, how natural language problems can be solved, and what the trade-offs are throughout.

Finally, I highly recommend that you read Tim Bray’s excellent On Search series. While reading the whole thing will do you no harm, if you’re lacking time I would read the sections on Precision and Recall and Stop Words as they hint at areas that need to be understood for most of this series.

Dual Monitors

After being almost dead set against big monitors (I didn’t see the point, given how close you sit to them anyway), I was convinced by low prices at play.com to get a new 19″ widescreen monitor. I am happy to say I was completely wrong and that the extra space is fantastic.

While finding a decent wallpaper for my desktop proved tricky, everything else is great. 1440*900 resolution lets you have a 1024*768 browser and enough space for any other active tasks that you might have, such as instant messenger windows. I recently bought Grand Theft Auto: San Andreas for the PC and playing it widescreen makes quite a difference to the views, particularly in the wilderness between Los Santos and San Fierro.

The one thing better than the wider screen is keeping the old one in a dual-screen set-up. You get all the benefits of the larger area, plus you can have one monitor dedicated to passive tasks (like IM contact lists, media players, BitTorrent clients) so that it works as an at-a-glance information display. I highly recommend the extra space to anyone.

Film Fight: September 2006

Another month, another bunch of films.

Normally, I abhor Owen Wilson. His acting abilities stretch to a single character: goofy and dazed. Thankfully, in You, Me and Dupree that’s exactly how he is cast. He plays a lovable fuck-up who ends up living with newly married friends, and we watch his character learn that he needs to make something of himself or forever be branded a loser. The real failure in this cast is Matt Dillon: a normally reasonably comic actor trying and failing to pull off some trickier emotions. Not for one second will you believe his character’s downward spiral or the subsequent, inevitable rise back to form. All in all, a watchable but forgettable film.

For fans of Broken Lizard, the comic writing, acting and directing team behind Super Troopers, Beerfest will hit all the sweet spots. A goofy comedy about something a little nonsensical, nods to its own stupidity, some fantastic quotes, over-the-top characters, and just a little grossness: a low-brow comedy for those of us beyond the cheap simplicity of teen gross-out comedies. Their debut was excellent, and Broken Lizard have once again shown they know how to write funny. Worth seeing.

The Black Dahlia is a beautifully shot piece full of ugly characters, motives, and premises; and that’s what makes it so good. A fictionalisation of a real murder, the film follows two detectives trying to solve the case of the most brutal unsolved murder in LA history. While the eventual conclusion is less than satisfying, the characterisation and interactions are so compelling that it does not matter. Another excellent film.

Jason Statham, by now, knows that his niche is in cheap action films. Crank is no exception. The plot is ridiculous (he’s been given a drug that will stop his heart unless he does progressively more extreme things as he tracks down the people who are killing him) but it doesn’t really matter. As dumb action goes, this hits all the sweet spots: excellent set pieces, hints of comedy without getting slapstick, all the while not taking itself seriously. If you can forgive the film the premise, acting and cringe-worthy moments (the falling scene, for example) and just enjoy it for what it is, a dumb action piece, then you’ll be fine.

The future of Children of Men is one of the most convincing, well-conceived worlds ever put on film. Set in the near future where women are no longer giving birth, hope is gone and the world is in turmoil: near civil war at home, refugees trying to escape fighting further afield. A disinterested ex-activist is approached by a political group and asked to provide papers for a girl to leave the country; there’s something special about her. As various factions hunt down the pair, the film stays relentlessly focussed on their escape, framing the political machinations and fighting all around as a background to their story, and not the other way around. Intense, emotional and unnerving; a beautifully imagined and desolate film. One of the best of the year.

Finally, Kevin Smith returns to his debut film in order to make a sequel. Clerks 2 is a smart enough piece of film making not to simply rehash the jokes and story of its predecessor, while being a disgusting enough look at current pop culture to set-up the main conflicts well. Again, we follow Dante and friends through a series of amusingly unpleasant conversations, and watch as their issues come to a head. While it’s hardly high-brow, it has the kind of honesty in dialogue many better films lack.

It has been a good month for cinema and, while Beerfest or Clerks 2 could have taken it for comedy in quieter months, the only real choice for winner is the excellent Children Of Men. Expect it to finish very high for the year.