Plagiarism research

Saturday 26th of January 2013 10:42:00 AM

When My Learning was first created way back in early 2008, we added a basic anti-plagiarism tool which allowed a teacher to analyse student documents against the internet in order to determine the likelihood that the document was 'borrowed'.

The tool was free, as it used a combination of publicly available search engine results and comparisons with other students' work to attempt to work out a basic plagiarism score. As a basic module it worked fairly well - but lacked the detail of more costly third-party solutions.

More recently, My Learning were asked if such a tool existed in our platform... so we dug out the old code and brought it back to life for our new v20.0 edition. (Update: This tool will be provided in v20.3 at the start of March)

  • Firstly - the code is entirely new (only the basic database code remained).
  • Secondly - Google, Bing and Yahoo all now charge for detailed web search API use (so much for a free internet...).
  • Thirdly - It's amazing what you can learn about the advanced maths of plagiarism.
  • Finally, it's fun being able to check and compare similar works from our pilot schools.

Our new engine uses 3 mathematical principles (plus our own cool algorithm) to determine two types of values:

1) The "Likeliness" that something is copied
2) The "Certainty" of the likeliness (or, how sure we are of our presumption)

This is a unique concept, and an entirely different way to look at 'copying'. We decided to make the maths emulate how a human distinguishes copying, rather than opting for a purely mathematical approach. Our algorithm considers the fact that it may indeed be slightly incorrect - or that it's very certain of its advice. It's always better to allow a computer to be partly human, and be unsure about something!

We used the following mathematical models in our approach:

a) N-grams (allow us to detect differential keywords and similar keywords in a document)
b) Verbatim comparison (using our own blend of Hamming-Levenshtein weighted algorithms)
c) Cosine similarity (a more detailed algorithm which can compare data sets of varying size)
d) Our own algorithm (comprising an exponential/weighted data quality summing equation)
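
For the curious, here is a minimal Python sketch of textbook versions of the first three measures. It is not our production code: the function names are made up for illustration, the n-gram score is a plain Jaccard overlap, and the verbatim score uses standard Levenshtein distance only (our Hamming-Levenshtein weighting and our own algorithm are not shown).

```python
import math
from collections import Counter


def word_ngrams(text, n=3):
    """Count the word n-grams (as tuples) in lower-cased text."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def ngram_overlap(a, b, n=3):
    """Jaccard overlap of the two documents' n-gram sets, from 0.0 to 1.0."""
    set_a, set_b = set(word_ngrams(a, n)), set(word_ngrams(b, n))
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def levenshtein(a, b):
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def verbatim_similarity(a, b):
    """Edit distance rescaled to 0.0 (nothing shared) .. 1.0 (identical)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def cosine_similarity(a, b):
    """Cosine of the angle between the two word-frequency vectors."""
    freq_a, freq_b = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(freq_a[w] * freq_b[w] for w in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Cosine similarity works on word frequencies rather than positions, which is why it copes with documents of quite different lengths, whereas the edit-distance score is best kept for passages of similar size.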

So, we use these 4 detection routines together (in varying combinations/situations) to form a fairly accurate detection of the different types of plagiarism. Thus far, our research test harness is correctly determining plagiarism in 80% of short essay documents.
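
To give a flavour of how several detector scores could be turned into the two values, here is a deliberately simple Python sketch. Everything in it is an assumption made for illustration: the `assess` function, the weights, and the idea of treating agreement between detectors as "certainty" are hypothetical, and this is not the exponential/weighted equation our engine actually uses.

```python
def assess(scores, weights=None):
    """Combine detector scores (each 0.0 to 1.0) into (likeliness, certainty).

    Hypothetical illustration only:
      likeliness = weighted average of the detector scores
      certainty  = 1 minus the spread between the highest and lowest score,
                   so detectors that disagree produce a less certain verdict
    """
    if not scores:
        raise ValueError("at least one detector score is required")
    if weights is None:
        weights = {name: 1.0 for name in scores}  # assume equal weighting
    total_weight = sum(weights[name] for name in scores)
    likeliness = sum(scores[name] * weights[name] for name in scores) / total_weight
    certainty = 1.0 - (max(scores.values()) - min(scores.values()))
    return likeliness, certainty


# Made-up scores for one pair of documents, purely as a usage example.
likeliness, certainty = assess({"ngram": 0.9, "verbatim": 0.8, "cosine": 0.85})
print(f"Likeliness: {likeliness:.2f}, Certainty: {certainty:.2f}")
```

The point of the second number is that a high likeliness backed by detectors that all agree deserves more trust than the same likeliness produced by one detector shouting and the others staying quiet.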

As always, if you're geeks like us - please stay tuned to our labs - we have some pretty cool stuff underway at the moment!



