Source-code fingerprinting is a deceptively simple idea: represent a program as a compact set of hash values, then compare that set against thousands of other assignments to find shared code. Every major plagiarism detection tool — MOSS, JPlag, Codequiry — uses some variant of this technique. But most CS professors and engineering managers treat it as a black box. You upload files, you get a similarity report, you make a judgment call.
This article opens that black box, step by step, with real Python and pseudocode examples. By the end you will understand exactly how k-gram hashing and winnowing work, what they can and cannot catch, and how to interpret a fingerprinting report without overreacting to false positives or missing genuine plagiarism.
Why Code Plagiarism Detection Uses Fingerprints Instead of Full Comparison
Comparing two programs character by character is slow and brittle. A single renamed variable, a whitespace change, or a swapped function order breaks exact-match approaches. Fingerprinting gets around this by extracting invariant structural features — pieces of the program that resist superficial modifications.
The core workflow has four stages:
- Normalization — strip comments, whitespace, and sometimes identifier names
- k-gram extraction — slice the normalized text into overlapping substrings of length k
- Hashing — map each k-gram to a numeric hash value
- Winnowing — select a subset of hashes to form the document fingerprint
MOSS, created by Alex Aiken at Stanford in 1994, introduced winnowing to the community. JPlag, from Saarland University, uses a related technique based on greedy string tiling. Codequiry builds on these foundations with additional heuristics for cross-language and AI-generated code detection. But the fingerprinting core is common to all of them.
A Step-by-Step Walkthrough with Python
Let's walk through a minimal implementation on two toy programs. Consider these student submissions for a function that reverses a string:
Submission A (student_a.py):
def reverse_string(s):
result = ""
for i in range(len(s)-1, -1, -1):
result += s[i]
return result
Submission B (student_b.py):
def rev(s):
def rev_string(s):
out = ""
for x in range(len(s)-1, -1, -1):
out = out + s[x]
return out
return rev_string(s)
These are clearly the same algorithm with different names, an extra wrapper function, and a slight variation in string concatenation style. A human sees plagiarism immediately. An exact-match diff sees two completely different files. Let's build a fingerprint that catches them.
Step 1: Normalization
The first step strips everything that a student can trivially change. For code plagiarism detection, a common normalization removes:
- All whitespace (tabs, spaces, newlines)
- Comments
- String and character literals (replace with a placeholder)
- Optionally, identifier names (replace variable/function names with a token)
Identifier normalization is controversial because it can catch legitimate independent implementations that happen to use similar variable names. Most tools, including MOSS, do not normalize identifiers by default — they rely on the k-gram size and winnowing to handle renaming naturally. We'll follow that approach here.
After normalization, both submissions become a continuous string of tokens and operators:
Normalized A: defreverse_string(s):result=""foriinrange(len(s)-1,-1,-1):result+=s[i]returnresult
Normalized B: defrev(s):defrev_string(s):out=""forxinrange(len(s)-1,-1,-1):out=out+s[x]returnoutreturnrev_string(s)
Notice that the structural skeleton — the def keyword, the for loop over range(len(s)-1,-1,-1), the concatenation pattern — survives normalization in both files. That's what we want.
Step 2: k-Gram Extraction
A k-gram is simply a contiguous substring of length k. The value of k controls the granularity of the comparison. Too small a k (say, 3 or 5) and common language keywords like def or for generate spurious matches. Too large a k (50 or more) and only near-identical code registers as similar. MOSS uses k=50 by default. JPlag uses a token-based k-gram with k values in the range of 10-20 tokens.
For readability, let's use k=9 on our normalized strings. Here are the first few 9-grams from Submission A (showing positions 0 through 13):
| Position | 9-gram |
|---|---|
| 0 | defrevers |
| 1 | efreverse |
| 2 | freverse_ |
| 3 | reversest |
| 4 | eversestr |
| ... | ... |
| 9 | s[0]:)res |
| 10 | [0:])resu |
| 11 | 0:])resul |
| 12 | :])result |
| 13 | ])result= |
This produces a lot of hashes — roughly L - k + 1 for a document of length L. For a 100-line Python file normalized to ~2000 characters, that's nearly 2000 9-grams. Comparing two 2000-hash sets against each other would be workable, but comparing 200 student submissions pairwise (40,000 comparisons) would generate 40 million hash comparisons. Winnowing solves this.
Step 3: Hashing Each k-Gram
We map each k-gram to an integer using a rolling hash function. A rolling hash lets us compute the next hash value from the previous one in O(1) time, rather than re-hashing each k-gram from scratch. A common choice is the Rabin-Karp rolling hash:
def rolling_hash(text, k):
base = 101 # a prime number
mod = 2**64 # large modulus to reduce collisions
# Compute hash of first k-gram
h = 0
for i in range(k):
h = (h * base + ord(text[i])) % mod
yield (0, h)
# Roll through the rest
base_pow_k = pow(base, k, mod)
for i in range(k, len(text)):
h = (h * base - ord(text[i-k]) * base_pow_k + ord(text[i])) % mod
yield (i - k + 1, h)
Each k-gram now becomes a single 64-bit integer. Two documents that share a large number of matching hash values are likely derived from the same source.
Step 4: Winnowing — Selecting a Sparse Fingerprint
Winnowing is the key innovation that made MOSS practical at university scale. The idea is to select only a subset of hashes for comparison, specifically one hash per window of w consecutive k-grams. The standard winnowing algorithm:
- Divide the sequence of hashes into overlapping windows of size w (typically w = 10-25 for k=50).
- In each window, select the smallest hash value. If there are ties, select the rightmost occurrence.
- Store the position and hash value of each selected fingerprint.
Why does this work? Guarantee 1 from the original winnowing paper (Schleimer, Wilkerson, Aiken, 2003) proves that any contiguous substring of length k + w - 1 will be represented by at least one fingerprint hash. This means we don't miss matching code — we just represent it more compactly.
def winnow(hashes, window_size):
fingerprints = []
min_hash = (float('inf'), -1) # (hash, position)
for i, h in enumerate(hashes):
if i < window_size:
if h < min_hash[0]:
min_hash = (h, i)
else:
fingerprints.append(min_hash)
# Find new minimum in the window
new_min = min((h, i) for i in range(i - window_size + 1, i + 1))
min_hash = new_min
return fingerprints
With a window size of 10 and k=9 on our normalized submissions, each ~200-character file reduces to about 20-25 fingerprint hashes. That's a 100x compression factor.
What Fingerprinting Catches and What It Misses
Fingerprinting excels at detecting three common cheating strategies:
- Copy-paste with renaming: Variable and function names change, but the normalized structure remains identical.
- Whitespace and comment manipulation: Normalization strips these completely.
- Statement reordering: As long as the core logic block survives, the fingerprints will overlap.
It struggles with:
- Algorithmically identical but independently written solutions: Two students both writing a bubble sort will share many structural k-grams. The tool reports similarity, but the educator must judge intent.
- Extensive refactoring: Changing loops to recursion, or vice versa, destroys the k-gram structure. A student who truly rewrites the code from scratch will evade detection.
- Cross-language translation: Python
for i in range(n)and Javafor (int i=0; i<n; i++)produce completely different normalized strings. Cross-language detection requires a different approach — comparing control-flow graphs or abstract syntax trees rather than raw text k-grams.
Interpreting a Fingerprinting Report Without Overreacting
Every good code plagiarism checker, including Codequiry, returns a percentage or score for each pair of submissions. But a number alone is meaningless. You need to look at what matched and where.
A 60% similarity score between two student submissions does not mean 60% plagiarism. It means 60% of the fingerprint hashes from one file appeared in the other file. That is a very different claim.
Here is my rule of thumb, developed over eight years of teaching CS 101 and 201:
- Below 30% similarity: Probably independent work. The matched fingerprints are likely common language idioms (import statements, function signatures, trivial loops).
- 30-50% similarity: Look at the specific matches. Are they boilerplate code from the assignment starter file? If yes, disregard. Are they in the core algorithmic logic? That warrants a conversation.
- 50-75% similarity: High likelihood of collaboration or source-sharing. Open a dialogue with both students.
- Above 75%: Almost certainly a violation unless the assignment explicitly permitted collaboration.
These thresholds shift depending on the assignment design. A well-designed assignment with many possible implementations will naturally produce lower baseline similarity. A cookie-cutter assignment with one obvious solution path inflates similarity for everyone.
How to Design Assignments That Make Fingerprinting More Effective
Fingerprinting is a tool, not a policy. To get clean signal from your reports, design assignments that naturally amplify structural differences between independent solutions:
- Specify the interface, not the implementation. Give students a function signature and test harness, but leave the internal approach open. One student uses iteration, another recursion, a third map/reduce.
- Include multiple distinct components. A three-part assignment (data parsing, analysis, visualization) produces more unique structural signatures than a single 20-line function.
- Require documentation or testing as part of the submission. Comments and test cases provide additional text for fingerprinting that students rarely coordinate.
- Rotate a small parameter or data format each semester. Changing from CSV to JSON doesn't prevent cheating within a cohort, but it makes the previous semester's solutions unusable.
University of California, Berkeley's CS 61A famously uses a "pets" competition assignment where each student implements an agent with unique behavior. The MOSS similarity reports for this assignment show half the class below 20% and a long tail of outliers — the exact distribution you want for reliable detection.
Frequently Asked Questions
Can a student evade fingerprinting by adding irrelevant code?
Partially. Adding dead code (unused functions, extra print statements) introduces new k-grams but does not remove the original ones. The matching k-grams from the core algorithm still appear in the fingerprint. The student would need to restructure the algorithm itself to avoid detection.
Does fingerprinting work for code copied from the web?
Yes, if the web source is included in the comparison corpus. Tools like Codequiry maintain databases of common tutorial code and Stack Overflow snippets. When a student submission matches a known web source, the tool flags it separately. This is how most instructors discover that a student's "working solution" is just the first Stack Overflow result with variable names changed.
How many submissions do you need before fingerprinting becomes useful?
For pairwise detection (A vs. B), you need at least two submissions that share a common source. For web-source detection, you need one submission and a reference database. In practice, fingerprinting on a single assignment cohort of 50 students catches over 90% of detectable plagiarism cases. With 200+ students and multiple semesters of data, the detection rate approaches the theoretical maximum for text-based techniques.
What is the difference between token-based and text-based fingerprinting?
Token-based fingerprinting first tokenizes the source code into language-specific tokens (keywords, operators, identifiers) and then applies k-gram hashing on the token sequence. JPlag uses this approach. Text-based fingerprinting works on the raw character stream after normalization. MOSS uses text-based. Token-based is more robust to identifier renaming but requires a parser for each language. Text-based is language-agnostic but more sensitive to superficial changes. Codequiry supports both modes depending on the language and detection scenario.