Why Some CS Departments Are Moving Beyond Moss for Plagiarism Detection

The Limits of Token-Based Similarity

For five years, Riverdale State University’s computer science department relied exclusively on Moss (Measure Of Software Similarity) to detect plagiarism in undergraduate programming assignments. The tool, hosted at Stanford and widely adopted in academia, uses a winnowing algorithm to fingerprint and compare token sequences. It worked well for the department’s first‑year Java courses, where students submitted tightly scoped projects with limited structural variation.

But by 2022, the department’s 2,000+ students were submitting assignments across nine languages (including Python, JavaScript, C++, and R), and many had discovered that reordering methods and restructuring statements, usually alongside variable renaming, could slip past Moss’s token‑based detection. Worse, students were increasingly pulling code directly from Stack Overflow, GitHub gists, and AI assistants like ChatGPT, then rewriting variable names or inserting small control‑flow changes. Moss flagged only the most blatant copy‑paste jobs.

“We knew we were missing at least a third of the cases,” says Dr. Amara Osei, the department’s associate chair for undergraduate studies. “Students were openly sharing techniques on Discord for ‘beating’ Moss. It became an arms race we were losing.”

What Moss Couldn’t See

Moss excels at detecting token‑level similarity — it strips comments and whitespace, replaces identifiers with a canonical form, and hashes overlapping k‑grams. But it has fundamental blind spots:

Web plagiarism: Moss compares student submissions only against other submissions, not against the open internet. A function copied from a Stack Overflow answer or a GitHub repository goes undetected unless another student copied the same exact snippet.
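Weak resistance to refactoring: Because identifiers are canonicalised, renaming variables alone rarely hides a copy, but structural rewrites such as swapping if‑else branches for ternary expressions, converting loops to idiomatic list comprehensions, or reordering methods can reduce token‑level similarity below Moss’s threshold.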
Cross‑language translation: Converting a Python solution to JavaScript (or vice versa) produces completely different token streams, even when the logic is identical.
AI‑generated code: LLM‑written code often contains distinctive statistical patterns — lower perplexity, uniform comment density, and repetitive naming conventions — that Moss’s token comparison is not designed to detect.
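
The fingerprinting steps described at the start of this section can be sketched in a few lines of Python. This is a minimal illustration of k-gram hashing with winnowing, not Moss's actual implementation: the token canonicalisation, the values k=5 and window=4, the SHA-1 hash, and the Jaccard score are all assumptions chosen for readability.

```python
import hashlib
import io
import keyword
import tokenize

# Token types that carry no content for comparison purposes.
_SKIP = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
         tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}


def canonical_tokens(source: str) -> list[str]:
    """Tokenise Python source, drop comments and layout tokens, and
    replace every non-keyword identifier with the placeholder "ID"."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _SKIP:
            continue
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            tokens.append("ID")
        else:
            tokens.append(tok.string)
    return tokens


def kgram_hashes(tokens: list[str], k: int = 5) -> list[int]:
    """Hash every run of k consecutive tokens to a 32-bit integer."""
    hashes = []
    for i in range(len(tokens) - k + 1):
        gram = " ".join(tokens[i:i + k])
        digest = hashlib.sha1(gram.encode("utf-8")).digest()
        hashes.append(int.from_bytes(digest[:4], "big"))
    return hashes


def winnow(hashes: list[int], window: int = 4) -> set[int]:
    """Keep the minimum hash from each sliding window of hashes."""
    if not hashes:
        return set()
    if len(hashes) < window:
        return {min(hashes)}
    return {min(hashes[i:i + window])
            for i in range(len(hashes) - window + 1)}


def similarity(a: str, b: str) -> float:
    """Jaccard overlap of the two fingerprint sets (assumed metric)."""
    fa = winnow(kgram_hashes(canonical_tokens(a)))
    fb = winnow(kgram_hashes(canonical_tokens(b)))
    union = fa | fb
    return len(fa & fb) / len(union) if union else 0.0


if __name__ == "__main__":
    original = ("def total(values):\n    acc = 0\n"
                "    for v in values:\n        acc += v\n    return acc\n")
    renamed = ("def add_all(items):\n    s = 0\n"
               "    for x in items:\n        s += x\n    return s\n")
    # Identifiers collapse to "ID", so the fingerprints are identical.
    print(similarity(original, renamed))  # 1.0
```

Because identifiers collapse to the same placeholder, pure renaming leaves the fingerprints in this sketch unchanged. It is reordering and structural rewriting that shift the k-grams and lower the score.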

Riverdale State’s teaching assistants spent hours each term manually inspecting borderline Moss reports, only to find many were false positives triggered by boilerplate code or library imports. Others were genuine plagiarism that Moss had missed because the obfuscation was too thorough.

The Search for a Layered Approach

In early 2023, Dr. Osei convened a working group of three faculty members, two TAs, and the department’s IT manager. Their goal: find a set of tools that could cover the full spectrum of code integrity threats, from exact copy‑paste to AI‑assisted rewriting. They evaluated six products and services, including JPlag (AST‑based), a commercial static analysis suite, and Codequiry.

The evaluation criteria were designed to match real‑world conditions:

Ability to check a submission against the web (Stack Overflow, GitHub, public repositories) and against all other submissions — not just the current cohort but also historical assignments.
Detection of AI‑generated code using statistical signals such as perplexity and burstiness, not just pattern matching.
Support for the department’s nine languages, including niche ones like Racket used in the functional programming course.
Transparency about false‑positive rates and tunable thresholds.
Scalability to 2,000+ submissions per assignment without requiring HPC infrastructure.
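
For readers unfamiliar with the two statistical signals named in these criteria, the sketch below shows how they are commonly computed once a language model has supplied a natural-log probability for each token. The model itself is out of scope here, and the burstiness definition used (the spread of per-line perplexity) is one informal reading of the term, not any vendor's documented formula.

```python
import math
from statistics import pstdev


def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp of the negative mean log-probability per token.
    Lower values mean the model found the text more predictable."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def burstiness(per_line_logprobs: list[list[float]]) -> float:
    """Assumed definition: population standard deviation of the
    perplexity of each line. Empty lines are skipped."""
    scores = [perplexity(line) for line in per_line_logprobs if line]
    return pstdev(scores) if len(scores) > 1 else 0.0


if __name__ == "__main__":
    # Hypothetical log-probabilities, not output from a real model.
    even = [[math.log(0.5)] * 4, [math.log(0.5)] * 6]
    uneven = [[math.log(0.9)] * 4, [math.log(0.1)] * 6]
    print(burstiness(even), burstiness(uneven))
```

In this reading, uniformly predictable text scores low on both measures, which is the pattern detectors of this kind are said to look for.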

Why They Chose a Hybrid Pipeline

The working group quickly realised that no single tool could do it all. Moss remained useful as a fast pre‑screener for high‑similarity pairs. But they needed additional layers. They decided to augment Moss with AST‑based comparison (using an in‑house tool built on JPlag’s methodology) for structural similarity, and with Codequiry’s AI‑detection and web‑scanning module for external and generative plagiarism.

“The key insight was that we needed to stop thinking of plagiarism as one phenomenon,” says Dr. Osei. “A student who copies a function from a forum, a student who paraphrases an algorithm from an AI, and a student who shares code with a classmate — each leaves a different trace. You need different lenses to see them all.”

Implementation and Pilot

In fall 2023, the department piloted the layered pipeline on two large courses: CS‑201 (Data Structures in Java, 320 students) and CS‑310 (Software Engineering in Python, 180 students). The workflow:

Moss pass: Submissions were compared pairwise using Moss’s default settings. Reports were generated within 15 minutes for both courses combined.
Web and AI scan: Each submission was sent to Codequiry’s API for web similarity against public code repositories and for AI‑generated code probability scoring. This took about 60 seconds per submission.
AST comparison: Submissions with Moss similarity below 40% but structural resemblance above 60% (e.g., identical control‑flow graphs with different identifiers) were flagged for manual review.
Adjudication: Cases flagged by any single tool were reviewed by two TAs. Cases flagged by two or more tools were escalated to the faculty member overseeing the course.
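
The routing rules in this workflow are simple enough to express directly. The sketch below is a hypothetical rendering: the function names and tool labels are illustrative, the 40% and 60% cut-offs come from the AST step above, and how each tool decides to raise a flag is left to the caller.

```python
def needs_ast_review(moss_similarity: float, ast_similarity: float,
                     moss_ceiling: float = 0.40,
                     ast_floor: float = 0.60) -> bool:
    """Flag pairs that look dissimilar to Moss but structurally alike.
    Both similarities are fractions between 0.0 and 1.0."""
    return moss_similarity < moss_ceiling and ast_similarity > ast_floor


def route(flagged_tools: set[str]) -> str:
    """Decide who reviews a case from the set of tools that flagged it."""
    count = len(flagged_tools)
    if count == 0:
        return "no action"
    if count == 1:
        return "two-TA review"
    return "faculty escalation"


if __name__ == "__main__":
    flags = set()
    if needs_ast_review(moss_similarity=0.25, ast_similarity=0.72):
        flags.add("ast")
    flags.add("codequiry")  # hypothetical label for a web or AI flag
    print(route(flags))  # faculty escalation
```

Treating the flags as a set keeps the rule independent of which tools are in the pipeline, so a department could add or swap a detector without changing the escalation logic.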

The results were striking. In the Data Structures course, the combined pipeline identified 47 students with problematic similarities, compared to only 22 flagged by Moss alone. Of the 25 additional cases, 10 were web‑source copies (mainly from Stack Overflow and a popular GitHub repository of Java algorithms), 8 were AI‑generated (detected by Codequiry’s model), and 7 were heavily refactored submissions that the AST comparison caught. Moss had also flagged 3 students that the other tools did not; these turned out to be false positives based on boilerplate.

In the Software Engineering course, the numbers were similar: 31 total cases, of which Moss on its own captured only 14. The additional cases were split roughly evenly between web‑sourced and AI‑assisted code. The department noted that AI‑generated code appeared most frequently in the project’s test suite: students would ask ChatGPT to generate unit tests and submit them with minimal changes.

Operational Impact and Faculty Response

The new pipeline was not without friction. The expanded detection increased the manual review burden by about 40%: five additional TA hours per large course per assignment. Dr. Osei had to restructure TA training sessions to include guidance on interpreting AI‑detection scores and AST similarity reports.

There was also an initial wave of student complaints, particularly around AI detection. “Some students genuinely wrote their own code but happened to use common patterns, like a for‑loop that looks like what an LLM would produce,” explains Dr. Osei. The department addressed this by setting a conservative threshold: Codequiry’s AI score had to exceed 85% to trigger a flag, and it was never used as sole evidence. Each flagged case also required a human reviewer to confirm independently the signs behind the flag, such as unnaturally uniform structure or an unusual absence of logical errors.

“We tell students that detection tools are an input to the process, not the judge. The final call always rests on human judgment. That built trust in the system.”

By the end of the semester, the number of formal academic integrity hearings increased from 12 to 19, but the department felt the trade‑off was worthwhile. “Before, we were only catching the careless students. Now we’re catching the sophisticated ones too — and that sends a message to everyone that we take integrity seriously,” says Dr. Osei.

Broader Lessons for CS Programs

Riverdale State’s experience is not unique. Over the past year, several other mid‑sized CS departments have contacted Dr. Osei to learn about the department’s pipeline. The common thread: institutions that invested in layered detection now feel more prepared for the dual threat of accessible AI code generation and widespread web code reuse.

The department plans to expand the pipeline to all its 20+ undergraduate courses by fall 2025. They are also working with the university’s academic integrity office to update the honour code to explicitly mention AI‑assisted coding, with clear guidelines on what constitutes acceptable use (e.g., using an LLM to explain a concept vs. pasting a complete solution).

For departments still evaluating their options, Dr. Osei offers this advice: “Don’t throw out Moss. It’s fast, free, and catches the low‑hanging fruit. But build around it. Add a web‑source checker, an AI detector, and maybe some AST comparison. The combination is far harder for students to circumvent than any single tool.”

Frequently Asked Questions

Does Moss detect AI‑generated code?
No. Moss compares token sequences between submissions. It does not analyse writing style, perplexity, or any statistical signals associated with LLM output. Separate AI‑detection tools are needed for that.

How does AST comparison improve plagiarism detection?
Abstract Syntax Tree (AST) comparison examines the structural skeleton of code, such as control flow, function nesting, and the shape of expressions, while ignoring variable names and minor formatting. This catches plagiarism where identifiers and comments have been changed and statements lightly restructured, even when the token streams no longer line up.
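
As a rough illustration of the idea in this answer, the sketch below uses Python's standard ast module to reduce a program to its sequence of node types and then compares two such sequences. It is not JPlag's algorithm or the department's in-house tool, it handles Python source only, and the choice of difflib's ratio as the score is an assumption.

```python
import ast
from difflib import SequenceMatcher


def structure(source: str) -> list[str]:
    """Return AST node-type names in pre-order. Identifier names and
    literal values are discarded, so only the code's shape remains."""
    names = []

    def visit(node: ast.AST) -> None:
        names.append(type(node).__name__)
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(ast.parse(source))
    return names


def structural_similarity(a: str, b: str) -> float:
    """Ratio between 0.0 and 1.0 of matching node-type subsequences."""
    matcher = SequenceMatcher(None, structure(a), structure(b),
                              autojunk=False)
    return matcher.ratio()


if __name__ == "__main__":
    original = ("def total(values):\n    acc = 0\n"
                "    for v in values:\n        acc += v\n    return acc\n")
    renamed = ("def add_all(items):\n    s = 0\n"
               "    for x in items:\n        s += x\n    return s\n")
    # Same tree shape despite different names, so the ratio is 1.0.
    print(structural_similarity(original, renamed))  # 1.0
```

A production comparator would need a parser per language and some tolerance for reordered or inserted statements, which this sketch does not attempt.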

Can a layered pipeline be implemented without hiring more staff?
Much of the detection can be automated. Tools like Codequiry provide APIs that integrate into existing submission systems. The bottleneck is manual review. Many departments handle this by training TAs to interpret multi‑tool reports and using escalation thresholds to limit the number of full reviews.

What is the most effective combination of tools for CS departments?
A common combination among departments making the shift is Moss (token similarity) + Codequiry (web and AI detection) + an AST comparator (either JPlag or an in‑house parser). This covers copy‑paste, internet‑sourced code, heavy refactoring, and AI‑generated content in a single workflow.