A Checklist for Evaluating AI Code Detection Tools

Evaluating tools for a problem that changes every week

In late 2023, a large public university in the Midwest ran a controlled study. They took 200 student submissions from an introductory Python course — half written by students, half generated by GPT-4 with varying prompt strategies — and ran them through three commercial AI detection tools. One tool flagged 78% of the AI-generated submissions correctly. Another caught only 34%. The third returned a 53% false positive rate on the human-written assignments, meaning more than half of the honest students in the sample would have been referred to an honor board.

The tools were all marketed as "AI detectors." They all claimed accuracy rates above 90%. None of them were lying — they were just measuring different things, against different baselines, on different data.

This is the problem facing every CS department and engineering organization right now. The market for AI-generated code detection is less than two years old. There are no established standards, no third-party audits, no Consumer Reports for detection tools. Vendors publish their own benchmarks, test on their own datasets, and define "accuracy" however they like.

A CS professor evaluating these tools needs a systematic framework — not a single number. This article provides a seven-point checklist for evaluating AI code detection tools. Use it before you purchase, before you deploy, and before you base an academic integrity decision on a tool's output.

Accuracy is not one number. It is at least four numbers, measured against a representative sample of your actual population.

Checklist item 1: Baseline accuracy across the full confusion matrix

Every detection tool publishes an "accuracy" figure. This is almost always balanced accuracy or area under the ROC curve — useful for comparing models in a research paper, but actively misleading for deployment decisions.

What you actually need are the four numbers that make up the confusion matrix, measured on your student population:

  • True positive rate (recall): Of all AI-generated submissions, what fraction does the tool flag?
  • True negative rate (specificity): Of all human-written submissions, what fraction does the tool correctly leave alone?
  • False positive rate: The complement of specificity. How many honest students get flagged?
  • False negative rate: The complement of recall. How many AI-assisted students slip through?

A tool with 95% recall and 85% specificity sounds impressive. But if your department receives 10,000 submissions per semester and 5% are AI-generated, that 85% specificity means the tool will flag 15% of the 9,500 human-written submissions: 1,425 false positives, over a thousand honest students flagged for a conversation they don't deserve.
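
It helps to translate vendor-reported rates into expected counts for your own submission volume before you sign anything. Here is a minimal sketch of that arithmetic in Python, using the illustrative figures from the paragraph above rather than any vendor's real numbers:

# Minimal sketch: turn vendor-reported rates into expected counts
# for your own submission volume. All figures are illustrative.

def expected_outcomes(total_submissions, ai_fraction, recall, specificity):
    """Return expected counts for each cell of the confusion matrix."""
    ai_count = total_submissions * ai_fraction
    human_count = total_submissions - ai_count
    return {
        "true_positives": ai_count * recall,                # AI code caught
        "false_negatives": ai_count * (1 - recall),         # AI code missed
        "true_negatives": human_count * specificity,        # honest work left alone
        "false_positives": human_count * (1 - specificity), # honest students flagged
    }

# 10,000 submissions, 5% AI-generated, 95% recall, 85% specificity
print(expected_outcomes(10_000, 0.05, 0.95, 0.85))
# -> false_positives: 1425.0, false_negatives: 25.0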

Ask vendors for their full confusion matrix, not just a headline accuracy number. Ask them to test on a held-out sample of your own past submissions. Hesitation on either request is a red flag.

Checklist item 2: Prompt sensitivity and evasion resistance

The same LLM, prompted differently, produces dramatically different code. Consider these three prompts for generating a function to compute the Fibonacci sequence:

# Prompt A: Simple and direct
"Write a Python function fibonacci(n) that returns the nth Fibonacci number."

# Prompt B: Stylistic influence
"Write a Python function fibonacci(n). Use meaningful variable names, include
a docstring, and add comments explaining each step. Use an iterative approach."

# Prompt C: Adversarial evasion
"Write a Python function fibonacci(n). Vary your coding style: use different
variable naming conventions, mix in a small typo, avoid obvious AI patterns.
Write it like a college sophomore who is tired and rushing through the assignment."

Code output from Prompt A often shows telltale signs: excessive comments, perfect formatting, unnaturally consistent naming. Code from Prompt C can look nearly indistinguishable from a stressed student's work — inconsistent indentation, a copy-paste error, a variable named fibb instead of fib.

A robust detection tool must be evaluated across the prompt sensitivity spectrum. Ask the vendor for a breakdown of detection rates by prompt type. If they've only tested on simple, direct prompts, their real-world performance will be worse than advertised.

Codequiry's detection engine, for instance, is trained on a wide range of prompt strategies — including adversarial ones — to minimize the gap between benchmark performance and classroom reality.
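
If you run your own pilot, keep the prompt strategy attached to every AI-generated sample so you can report recall per category rather than one blended number. A minimal sketch, assuming a hypothetical detect(code) callable that wraps whatever tool you are evaluating and returns True when a file is flagged — the function and the category labels are assumptions for illustration, not part of any vendor's API:

from collections import defaultdict

def recall_by_prompt_category(samples, detect):
    """samples: list of (source_code, prompt_category) pairs for known
    AI-generated files. detect: hypothetical callable wrapping the tool
    under test, returning True when the file is flagged."""
    flagged = defaultdict(int)
    totals = defaultdict(int)
    for code, category in samples:
        totals[category] += 1
        if detect(code):
            flagged[category] += 1
    return {cat: flagged[cat] / totals[cat] for cat in totals}

# A spread like {"direct": 0.90, "stylistic": 0.70, "adversarial": 0.40}
# is what you should expect if the tool was tuned only on simple prompts.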

Checklist item 3: Cross-language and cross-version support

AI code generation is not Python-only. Students use Java, C++, JavaScript, Go, and Rust. Instructors teach different versions of languages — Python 2 code still exists in legacy courses, and C++17 code looks different from C++11 code.

Evaluate the tool against:

  • All languages you currently teach. Does the detection engine handle them equally well?
  • The specific language versions your students use. C++20 concepts will not appear in a C++11 submission, but a detection tool trained only on modern C++ may incorrectly flag idiomatic older-style code as anomalous.
  • Mixed-language submissions. Some tools only analyze the dominant language in a file, missing AI-generated snippets in secondary languages.
  • Code with heavy boilerplate. GUI frameworks, auto-generated test harnesses, and language-required imports all produce legitimate code that looks suspiciously uniform.

One university discovered that their detection tool flagged over 90% of Android Studio project templates as AI-generated — because the template code was repetitive and perfectly formatted. The false positive rate for Java GUI assignments was catastrophic.
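
One practical mitigation during a pilot is to strip known boilerplate before anything reaches the detector. A minimal sketch, assuming you keep your own directory of the framework template files you hand to students; the hashing approach is an illustration, not a feature of any particular tool:

import hashlib
from pathlib import Path

def template_hashes(template_dir):
    """Hash every known template/boilerplate file shipped to students."""
    return {
        hashlib.sha256(p.read_bytes()).hexdigest()
        for p in Path(template_dir).rglob("*.java")
    }

def files_worth_scanning(submission_dir, known_hashes):
    """Skip files that are byte-for-byte identical to course boilerplate."""
    for p in Path(submission_dir).rglob("*.java"):
        if hashlib.sha256(p.read_bytes()).hexdigest() not in known_hashes:
            yield p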

Checklist item 4: Integration complexity and workflow fit

A detection tool that requires students to submit through a separate portal will see lower adoption than one integrated into your existing LMS or submission system. Evaluate the following:

  • API availability. Can you programmatically submit code for analysis, or must you use a web interface?
  • LMS plugins. Does the tool offer plugins for Canvas, Blackboard, Moodle, or whatever your institution uses?
  • Batch processing. Can you submit 200 assignments at once, or must you upload them one by one?
  • Feedback loop. Does the tool produce instructor-facing reports, student-facing reports, or both?
  • Latency. How long does detection take? For a large course with 500 submissions, a per-file analysis time of 30 seconds adds up to more than four hours of sequential processing.

One department head I spoke with chose a tool with a 98% accuracy claim — then discovered it had no API and required manual file uploads. Their TAs spent three hours per week just moving files into the web portal. The tool was abandoned after one semester, regardless of its detection rates.
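
If a vendor does offer an API, a small load test before purchase answers the batch-processing and latency questions above. A minimal sketch, assuming a hypothetical REST endpoint; the URL, field names, and authentication scheme are placeholders, not any vendor's real interface:

import time
import requests  # third-party: pip install requests

API_URL = "https://detector.example.com/v1/analyze"  # placeholder endpoint

def time_batch(file_paths, api_key):
    """Submit each file to the hypothetical endpoint and record latency."""
    latencies = []
    for path in file_paths:
        start = time.monotonic()
        with open(path, "rb") as f:
            requests.post(
                API_URL,
                headers={"Authorization": f"Bearer {api_key}"},
                files={"submission": f},
                timeout=120,
            )
        latencies.append(time.monotonic() - start)
    return sum(latencies) / len(latencies), max(latencies)

# Measure this before the semester starts, not during week one of grading.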

Checklist item 5: Explainability and auditability

When a student is flagged for AI-generated code, the instructor needs to understand why. A simple "98% AI probability" score is not actionable. The detection tool should provide:

  • Feature-level explanations. Which specific features of the code contributed to the score? Is it the comment style? The variable naming? The structural patterns?
  • Comparative baselines. How does this submission compare to a corpus of known human-written and AI-written code at the same level?
  • Per-function breakdowns. The entire file may score high, but is it because one function is clearly AI-generated while the rest is human-written?
  • Exportable evidence packages. If the case goes to an honor board, can you produce a PDF or report that explains the detection rationale?

Tools that are black boxes — score in, score out, no explanation — are dangerous. They shift the burden of proof onto the instructor, who must then become an expert in LLM code patterns just to defend the tool's output.
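
Even when a tool only returns a whole-file score, you can approximate a per-function breakdown yourself by scoring each function in isolation. A minimal sketch using Python's standard ast module; score_fragment is a stand-in for whatever scoring interface the tool actually exposes:

import ast

def per_function_scores(source, score_fragment):
    """Split a submission into top-level functions and score each one.
    score_fragment is a placeholder for the tool's scoring call."""
    tree = ast.parse(source)
    results = {}
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            fragment = ast.get_source_segment(source, node)
            results[node.name] = score_fragment(fragment)
    return results

# A file scoring 0.95 overall might decompose into
# {"load_data": 0.2, "fibonacci": 0.98, "main": 0.3}, which is far more actionable.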

Checklist item 6: Ongoing model updates

LLMs are not static. GPT-4o writes different code than GPT-3.5. Claude 3.5 writes different code than Claude 2. A detection model trained on GPT-3.5 output in early 2024 will be increasingly ineffective against GPT-4o output in late 2024.

Ask the vendor:

  • How frequently do you retrain your detection model?
  • What is your process for incorporating new LLM versions into your training data?
  • Do you proactively test against newly released models, or do you wait for customer reports?
  • Is the model update transparent — do you publish release notes showing changes in performance?

A vendor who releases "version 2.0" once and then stops updating is selling you a product with a built-in expiration date. The shelf life of an AI detection model, without retraining, is roughly six to nine months.
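You can check the expiration date yourself by re-running a fixed benchmark set through the tool whenever a major new model ships and watching recall drift over time. A minimal sketch; detect and the benchmark layout are assumptions for illustration:

import csv
from datetime import date

def log_benchmark_run(benchmark, detect, llm_name, out_path="drift_log.csv"):
    """benchmark: list of (source_code, is_ai) pairs regenerated with the
    newly released model. detect: hypothetical callable wrapping the tool."""
    ai_samples = [(code, label) for code, label in benchmark if label]
    caught = sum(1 for code, _ in ai_samples if detect(code))
    recall = caught / len(ai_samples)
    with open(out_path, "a", newline="") as f:
        csv.writer(f).writerow([date.today().isoformat(), llm_name, recall])
    return recall

# A recall curve that slides from 0.9 to 0.5 over two semesters is the
# built-in expiration date made visible.
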

Checklist item 7: Privacy, data retention, and compliance

Student code is sensitive. It may contain proprietary algorithms, patented techniques, or simply the personal intellectual property of students. Before adopting any tool, verify:

  • Data retention policy. How long is submitted code stored? Is it deleted after the semester?
  • Secondary use of data. Is your students' code used to retrain the detection model? If so, can you opt out?
  • FERPA and GDPR compliance. Does the tool's data handling meet your jurisdiction's privacy requirements?
  • Encryption and access controls. Who can see the submitted code? What about the detection scores?
  • On-premises options. For institutions with strict data governance, does the tool offer a self-hosted deployment?

One European university discovered that a popular U.S.-based detection tool was storing all student code on AWS servers in Virginia — without explicit GDPR-compliant data processing agreements. The university had to abandon the tool mid-semester and revert to manual grading.

Building your evaluation matrix

Once you have answers across all seven checklist items, build a weighted evaluation matrix. Not all criteria matter equally for every institution. A large research university with 10,000 CS students will prioritize latency and batch processing. A small liberal arts college may prioritize explainability and low false positive rates. A bootcamp with intensive, short courses may prioritize integration speed and model update frequency.

Assign weights to each checklist item based on your specific context. Then score each candidate tool against the criteria. This turns an emotional purchase decision — "this tool has the highest accuracy number on the vendor's blog post" — into a structured, defensible evaluation.
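
The matrix itself can be a spreadsheet or a few lines of code. A minimal sketch with made-up weights and scores, purely to show the mechanics:

# Illustrative weights (summing to 1.0) and 1-5 scores; substitute your own.
weights = {
    "accuracy": 0.25, "prompt_robustness": 0.20, "language_support": 0.10,
    "integration": 0.15, "explainability": 0.15, "updates": 0.10, "privacy": 0.05,
}

candidates = {
    "Tool A": {"accuracy": 5, "prompt_robustness": 2, "language_support": 4,
               "integration": 3, "explainability": 2, "updates": 3, "privacy": 4},
    "Tool B": {"accuracy": 4, "prompt_robustness": 4, "language_support": 3,
               "integration": 4, "explainability": 4, "updates": 4, "privacy": 3},
}

for name, scores in candidates.items():
    total = sum(weights[criterion] * scores[criterion] for criterion in weights)
    print(f"{name}: {total:.2f}")
# Tool A: 3.30, Tool B: 3.85 -- the tool with the bigger headline accuracy
# number can still lose on a weighted basis.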

A final warning: a single number on a conference slide is not an evaluation. The tool that catches 95% of GPT-4 code in a controlled lab study may catch only 40% of GPT-4o code in a real classroom with real students using real prompting strategies. The checklist above is designed to surface those gaps before you deploy, not after you've already referred a student to the honor board based on a false positive.

Evaluate carefully. Your students — and your department's integrity process — depend on it.