The Limits of Single-Technique Detection
For over two decades, code plagiarism detection in higher education relied on one dominant technique: pairwise structural comparison. Tools like MOSS (Measure Of Software Similarity) and JPlag tokenized student submissions and compared them against each other to flag suspiciously similar code. This approach caught the classic "rename variables and reorder functions" plagiarism pattern with reasonable accuracy.
But the threat landscape has shifted. Students now have access to three distinct sources of code that evade traditional pairwise detection: web repositories like GitHub and Stack Overflow, AI-generated code from tools like ChatGPT and GitHub Copilot, and collaborative sharing networks outside institutional boundaries. Each source produces code that looks structurally unique but fails the originality test.
A single detection technique is a single point of failure. The departments that hold the line on academic integrity are those that deploy multiple detection layers.
The University of Texas at Austin demonstrated this in a 2023 pilot. Their existing MOSS deployment caught 14% of submissions with significant similarity to other students. After adding web-source matching, that figure rose to 22%. With AI detection added, it reached 31%. Each layer uncovered a different category of violation that the others missed.
Layer One: Structural Pairwise Comparison
The foundational layer remains pairwise structural analysis. This technique converts source code into normalized token streams that discard superficial differences like whitespace, variable names, and comment text while preserving control flow and algorithmic structure. Two submissions that look different on the surface but produce near-identical token sequences are likely related.
Consider this common evasion pattern. A student submits:
def sort_list(arr):
    n = len(arr)
    for i in range(n):
        for j in range(0, n-i-1):
            if arr[j] > arr[j+1]:
                arr[j], arr[j+1] = arr[j+1], arr[j]
    return arr
A classmate submits what looks like a different implementation:
def bubble_sort(data):
    length = len(data)
    for outer in range(length):
        for inner in range(0, length-outer-1):
            if data[inner] > data[inner+1]:
                data[inner], data[inner+1] = data[inner+1], data[inner]
    return data
To a human reader, these look different: renamed function, renamed variables, different loop counters. To a token-based analyzer, they produce nearly identical token sequences. The structural fingerprint is the same. MOSS and Codequiry both handle this case well, reporting similarity scores above 85% for patterns like this.
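The token-level equivalence described above can be sketched with Python's standard tokenize module. This is a simplified illustration of normalization, not the actual MOSS or Codequiry pipeline: every non-keyword identifier collapses to a single placeholder, so renames disappear.

```python
import io
import keyword
import tokenize

def normalized_tokens(source):
    """Collapse identifiers and drop comments/blank lines so that
    renamed-variable copies produce identical token streams."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            out.append("ID")  # any identifier -> same placeholder
        elif tok.type in (tokenize.COMMENT, tokenize.NL):
            continue          # ignore comments and blank lines
        else:
            out.append(tok.string)
    return out

submission_a = """\
def sort_list(arr):
    n = len(arr)
    for i in range(n):
        for j in range(0, n-i-1):
            if arr[j] > arr[j+1]:
                arr[j], arr[j+1] = arr[j+1], arr[j]
    return arr
"""

submission_b = """\
def bubble_sort(data):
    length = len(data)
    for outer in range(length):
        for inner in range(0, length-outer-1):
            if data[inner] > data[inner+1]:
                data[inner], data[inner+1] = data[inner+1], data[inner]
    return data
"""

print(normalized_tokens(submission_a) == normalized_tokens(submission_b))  # True
```

Both submissions reduce to the same token stream, which is exactly why renaming alone does not defeat this layer.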
The limitation: pairwise comparison cannot catch code copied from outside the class. If both students independently copied from the same Stack Overflow answer or GitHub repository, the detector flags them for copying from each other when in fact neither wrote the original. This is where the second layer becomes essential.
Layer Two: Web-Source Code Matching
The second detection layer compares student code against a massive index of publicly available source code. This includes Stack Overflow answers, GitHub repositories, tutorial sites, and academic code archives. The technique uses a combination of code fingerprinting (hashing normalized code fragments) and subsequence matching (finding long common substrings in tokenized code).
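The fingerprinting step described above can be sketched as winnowing over hashed k-grams of a normalized token stream. The parameters (k=5, window of 4) and function names are illustrative assumptions, not Codequiry's actual index format.

```python
import hashlib

def kgram_hashes(tokens, k=5):
    """Hash every k-token window of a normalized token stream."""
    grams = ("\x00".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1))
    return [int(hashlib.sha1(g.encode()).hexdigest()[:8], 16) for g in grams]

def fingerprint(tokens, k=5, window=4):
    """Winnowing-style selection: keep the minimum hash in each
    sliding window so fingerprints survive small local edits."""
    hashes = kgram_hashes(tokens, k)
    return {min(hashes[i:i + window])
            for i in range(len(hashes) - window + 1)}

def similarity(fp_a, fp_b):
    """Jaccard overlap between two fingerprint sets."""
    return len(fp_a & fp_b) / len(fp_a | fp_b)

stream = ["def", "ID", "(", "ID", ")", ":", "ID", "=", "ID",
          "(", "ID", ")", "return", "ID"]
print(similarity(fingerprint(stream), fingerprint(stream)))  # 1.0 for an exact copy
```

In a full system, a submission's fingerprint set would be compared against precomputed fingerprints of indexed public code, so a high overlap points at a specific source snippet.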
Stack Overflow alone contains over 20 million code snippets as of 2024. A 2022 study from the University of California, San Diego found that 62% of introductory programming students used code from Stack Overflow in their assignments, and 18% did so without any attribution. The study's authors noted that students who copied Stack Overflow code "almost never modified it structurally."
Here is the pattern web-source matching catches that pairwise comparison misses. A student copies this answer from a Stack Overflow thread on "how to reverse a linked list in Java":
public ListNode reverseList(ListNode head) {
    ListNode prev = null;
    ListNode current = head;
    while (current != null) {
        ListNode next = current.next;
        current.next = prev;
        prev = current;
        current = next;
    }
    return prev;
}
No other student in the class submitted the same code, so pairwise comparison finds nothing. But the web-source layer matches the tokenized signature against the Stack Overflow code index and flags the submission with 92% similarity to a known public snippet. The student cannot credibly claim they wrote it independently — the structural signature is too precise.
Codequiry's web-source database includes indexed code from over 50 million public repositories and knowledge bases. The matching process happens in real-time during submission, returning a similarity percentage and a link to the source. This gives instructors immediate evidence rather than a vague suspicion.
The limitation: web-source matching cannot detect code generated by an AI model. LLMs produce code that is structurally original in the sense that it rarely matches any indexed public source verbatim. A student who pastes "write a Java method to find the maximum subarray sum" into ChatGPT and submits the output will not trigger web-source matching. This calls for a third, fundamentally different detection technique.
Layer Three: AI-Generated Code Detection
The third and newest layer analyzes submissions for statistical patterns characteristic of large language model output. These patterns differ from human-written code in several measurable ways that current tools can identify.
Research from the 2024 International Conference on Learning Representations identified three reliable signals of AI-generated code:
- Uniform comment density: AI models generate comments at a remarkably consistent rate, typically between 15% and 22% of total characters. Human-written code is far more variable, with comment density ranging from 0% to 40% depending on the programmer's habits and the assignment's requirements.
- Predictable identifier length distribution: LLMs prefer variable names in a narrow length range (3-8 characters) and rarely use very short (1-2 character) or very long (15+ character) identifiers. Humans show a much wider distribution.
- Low structural entropy: AI-generated code has a lower ratio of control-flow branching per line of code. Human programmers introduce more conditional logic, more nested loops, and more error handling — producing higher entropy in the abstract syntax tree.
These signals are combined into a confidence score. Codequiry's AI detection layer, for example, returns a probability estimate between 0 and 100, with a recommended threshold of 70% for flagging suspicious submissions. In validation testing against a corpus of 10,000 known human-written and 10,000 known AI-generated student programs, this threshold produced a 91% true positive rate with a 5% false positive rate.
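Rough proxies for the three signals can be computed with the standard library alone. The token rules below are illustrative assumptions, not the published ICLR metrics or any production detector's feature set; a real system would feed such features into a calibrated classifier rather than reading them directly.

```python
import io
import keyword
import statistics
import tokenize

BRANCH_KEYWORDS = {"if", "elif", "else", "while", "for", "try", "except"}

def style_signals(source):
    """Crude proxies for the three signals: comment density,
    identifier-length spread, and branching per line."""
    comment_chars = 0
    ident_lengths = []
    branches = 0
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            comment_chars += len(tok.string)
        elif tok.type == tokenize.NAME:
            if tok.string in BRANCH_KEYWORDS:
                branches += 1
            elif not keyword.iskeyword(tok.string):
                ident_lengths.append(len(tok.string))
    lines = max(1, source.count("\n"))
    return {
        "comment_density": comment_chars / max(1, len(source)),
        "ident_len_spread": statistics.pstdev(ident_lengths) if ident_lengths else 0.0,
        "branches_per_line": branches / lines,
    }

sample = '''\
# swap adjacent out-of-order pairs until the list is sorted
def bubble(values):
    for i in range(len(values)):
        for j in range(len(values) - i - 1):
            if values[j] > values[j + 1]:
                values[j], values[j + 1] = values[j + 1], values[j]
    return values
'''
print(style_signals(sample))
```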
The limitation: No AI detector is perfect. False positives happen. Students who write extremely clean, well-commented code that happens to match the LLM pattern will occasionally be flagged. The responsible approach is to treat AI detection as a screening tool, not a conviction mechanism. Flagged submissions require human review — a conversation with the student, a request to explain their code line by line, a comparison with their previous work.
Integration and Workflow Design
The value of layered detection comes from integration, not from running three separate checks. A unified system that reports all three similarity scores in a single dashboard transforms the instructor's workflow. Instead of running MOSS, then separately checking suspicious submissions against web sources, then running a third tool for AI detection, the instructor sees everything at once.
Codequiry's platform presents a consolidated originality report for each submission. The report shows:
- A pairwise similarity score (comparison against all other class submissions)
- A web-source similarity score (comparison against public code indexes)
- An AI-generation probability score
- A per-line annotation highlighting specific regions flagged by each detector
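A consolidated report along these lines might be modeled as a simple record with one flag check across layers. The field names and default thresholds here are illustrative, not Codequiry's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class OriginalityReport:
    submission_id: str
    pairwise_similarity: float    # vs. other class submissions, 0-100
    web_source_similarity: float  # vs. public code indexes, 0-100
    ai_probability: float         # AI-generation confidence, 0-100
    flagged_lines: dict = field(default_factory=dict)  # layer -> line numbers

    def needs_review(self, pairwise=80, web=55, ai=70):
        """True if any layer crosses its configured threshold."""
        return (self.pairwise_similarity >= pairwise
                or self.web_source_similarity >= web
                or self.ai_probability >= ai)

report = OriginalityReport("hw3-student42", 31.0, 92.0, 12.0,
                           {"web": [3, 4, 5, 6]})
print(report.needs_review())  # True: the web-source layer crossed 55
```

Keeping all three scores on one record is what makes the single-dashboard workflow possible: the instructor sees which layer fired without running three tools.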
This design matters because the three detection layers have different false-positive profiles. Pairwise comparison is the most reliable; false positives are rare when the threshold is set at 80% or higher. Web-source matching requires careful calibration: common algorithmic patterns (like standard sorting algorithms or tree traversal) will match public code even when the student wrote them honestly. AI detection has the highest false-positive rate and requires the most human judgment.
The University of Michigan's Computer Science and Engineering department adopted this layered approach in Spring 2024. Their published results showed that instructors reduced the time spent on plagiarism investigations by 40% because the consolidated report eliminated manual cross-referencing between tools. The department also reported that the number of contested plagiarism flags dropped by 55% because students could see exactly which layer had flagged their code and why.
Setting Appropriate Thresholds Per Layer
Different assignments require different detection thresholds. An introductory Python assignment that asks students to implement FizzBuzz will naturally produce code that looks similar across the class: there are only so many ways to write a modulo loop. A senior-level operating systems assignment on virtual memory management should show far more variation.
Experienced departments calibrate their thresholds based on assignment type and course level. Here is a typical configuration:
| Course Level | Assignment Type | Pairwise Threshold | Web-Source Threshold | AI Detection Threshold |
|---|---|---|---|---|
| CS 101 | Simple algorithm | 85% | 60% | 80% |
| CS 101 | Design project | 75% | 50% | 70% |
| CS 300+ | Data structures | 80% | 55% | 75% |
| CS 400+ | Systems programming | 70% | 45% | 65% |
The web-source threshold is deliberately lower than pairwise because matching a public snippet is a weaker signal of misconduct — the student might have legitimately found the same algorithmic solution. The AI detection threshold is higher because false positives are more damaging in that layer. Adjustment is an ongoing process, and department-wide calibration reviews once per semester are common.
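A calibration like the table above can be expressed directly as per-assignment configuration. The structure below is a sketch, not any particular platform's configuration format.

```python
# Illustrative thresholds mirroring the calibration table:
# pairwise, web-source, and AI-detection cutoffs per course/assignment.
THRESHOLDS = {
    ("CS 101", "simple algorithm"):     {"pairwise": 85, "web": 60, "ai": 80},
    ("CS 101", "design project"):       {"pairwise": 75, "web": 50, "ai": 70},
    ("CS 300+", "data structures"):     {"pairwise": 80, "web": 55, "ai": 75},
    ("CS 400+", "systems programming"): {"pairwise": 70, "web": 45, "ai": 65},
}

def flags(scores, course, assignment):
    """Return the layers whose score crossed the configured cutoff."""
    cutoffs = THRESHOLDS[(course, assignment)]
    return [layer for layer, cutoff in cutoffs.items()
            if scores.get(layer, 0) >= cutoff]

print(flags({"pairwise": 88, "web": 40, "ai": 20},
            "CS 101", "simple algorithm"))  # ['pairwise']
```

The same submission scores would produce different flags in a different course, which is the point: the cutoff, not the raw score, encodes the department's judgment about each assignment.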
Educating Students About the Detection System
The most effective academic integrity systems are transparent about their operation. When students know that submissions are checked against all three detection layers, a behavioral effect emerges: the rate of attempted plagiarism drops. The University of Illinois at Urbana-Champaign published data showing that informing students about multi-layered detection reduced overall plagiarism incidents by 34% compared to semesters where only pairwise comparison was used and students were not informed of the specifics.
The transparency message works best when framed as an educational tool, not a surveillance system. "We check your code for originality because we want to ensure that you develop the independent problem-solving skills your degree certifies" is more effective than "We will catch you if you cheat." Students who understand the purpose of detection are less likely to view it as adversarial.
A syllabus statement from Carnegie Mellon University's introductory CS sequence exemplifies this approach:
All programming submissions in this course are checked for originality against three sources: other students' submissions, public code on the internet, and patterns characteristic of code generated by AI tools. This process is designed to protect the value of your degree and ensure that every student's grade reflects their own work. If you have questions about what constitutes acceptable code reuse, ask before the deadline — not after.
The Practical Case for Layered Detection
Adopting layered detection requires investment — both financial and in terms of faculty training. Many departments already use MOSS or JPlag at no cost, and adding web-source matching and AI detection layers means adopting a commercial platform or building custom tooling. For smaller departments, the cost may be a barrier.
But the alternative is an integrity system that misses a growing portion of violations. The students who cheat by copying from web sources or AI tools outnumber those who copy from classmates in most introductory courses. A 2024 survey of 1,200 CS students across 15 US universities found that 37% had used AI tools to generate code for graded assignments, and 41% had copied code from online sources without attribution. Only 23% reported having copied from another student in the same class.
A detection system that catches the 23% who copy from classmates but misses the far larger share who used web or AI sources is not a detection system; it is an illusion of oversight. Layered detection closes that gap. It is not perfect (no detection system is), but it transforms the coverage from partial to substantial.
Codequiry provides one integrated solution that covers all three layers, with a unified dashboard and per-assignment threshold configuration. For departments currently running separate tools or relying solely on pairwise comparison, the move to layered detection is becoming less of a luxury and more of a standard practice.
The departments that adopt this approach are not doing it because they believe cheating has become an existential crisis. They are doing it because the available detection techniques have matured to the point where layered coverage is achievable and practical. A decade ago, web-source matching was slow and imprecise. AI detection did not exist. Today, both layers are production-ready, and the cost per student is low enough that any department with a reasonable integrity budget can deploy them.
That is the shift worth watching: from single-layer detection to layered detection, driven not by panic but by the maturation of the tools themselves.