Inside a Startup’s Codebase License Compliance Audit

The Wake-Up Call That Started the Audit

In early 2023, a 40-person fintech startup — let's call them FinStack — was finalizing a Series A term sheet. The lead investor’s technical due diligence included a standard question: "Do you have a complete inventory of open-source dependencies and their licenses?" The CTO, who had joined six months earlier, knew the answer was no.

FinStack’s core product was a Python and Java payment-processing platform. Over two years of rapid development, engineers had copied code from Stack Overflow, GitHub Gists, and internal projects with little regard for license headers. The CTO estimated that 60% of their 1.2 million-line codebase was third-party code, much of it unlabeled.

The investor gave them eight weeks to produce a compliance report. This case study documents how FinStack conducted a full codebase license audit using a combination of automated scanning, manual triage, and legal review — and what they learned about the gap between copying code and staying compliant.

Why Traditional Dependency Scanning Wasn’t Enough

FinStack already used Snyk for known-vulnerability scanning, but it tracked only explicitly declared dependencies (via requirements.txt, pom.xml, and similar manifests). It missed code that had been copied and pasted directly into source files, often called "copy-paste code" or "viral snippets."

The compliance team needed a tool that could detect open-source code fragments embedded in proprietary files, not just package-level dependencies. This is a classic source code plagiarism detection problem, but applied to corporate software rather than student assignments.

The CTO evaluated three approaches:

FOSSology – open-source license scanner that examines file headers and copyright notices.
ScanCode Toolkit – detects licenses, copyrights, and package manifests.
Codequiry’s code similarity engine – originally built for academic plagiarism detection, its token-based and AST-based comparison could find copied code even when variable names or formatting had been changed.

FOSSology and ScanCode together covered declared licenses. But Codequiry’s fingerprinting algorithms, which were designed to compare student code against a corpus of known open-source projects, turned out to be the missing piece for finding undocumented copied code. FinStack used a private instance of Codequiry to scan their entire Git history against a custom index of common open-source libraries released under the Apache 2.0, MIT, GPL 2.0/3.0, LGPL, and AGPL licenses.
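To make the idea of token-based matching concrete, here is a minimal sketch. It is not Codequiry’s algorithm, whose internals this case study does not describe. It is a generic illustration that collapses identifiers and literals into placeholders and compares overlapping token windows, so renamed functions and variables no longer hide a copy. The function names and the window size of 5 are assumptions chosen for the example.

```python
import io
import keyword
import tokenize

# Token types that carry layout or comments rather than code structure.
_SKIPPED = (tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
            tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER)


def normalized_tokens(source: str) -> list[str]:
    """Tokenize Python source, replacing identifiers and literals with placeholders."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _SKIPPED:
            continue
        if tok.type == tokenize.NAME:
            # Keep keywords (def, for, return) but collapse every identifier.
            out.append(tok.string if keyword.iskeyword(tok.string) else "ID")
        elif tok.type == tokenize.NUMBER:
            out.append("NUM")
        elif tok.type == tokenize.STRING:
            out.append("STR")
        else:
            out.append(tok.string)  # operators and punctuation
    return out


def shingles(tokens: list[str], k: int = 5) -> set[tuple[str, ...]]:
    """Return the set of overlapping k-token windows."""
    return {tuple(tokens[i:i + k]) for i in range(max(len(tokens) - k + 1, 1))}


def token_similarity(a: str, b: str, k: int = 5) -> float:
    """Jaccard similarity of the two sources' k-token window sets."""
    sa = shingles(normalized_tokens(a), k)
    sb = shingles(normalized_tokens(b), k)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0
```

Because every identifier becomes the same placeholder, two functions that differ only in naming score 1.0. A production engine would add indexing so that one file can be compared against a large corpus instead of one reference at a time.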

The Scan Revealed 47 GPL Violations

After two weeks of scan runs, the numbers were sobering:

Detection Method	Files Flagged	Confirmed Matches
License header scan (FOSSology)	1,203	892
Copyright string matching	614	411
Code fragment similarity (Codequiry)	2,087	1,746

Of the 1,746 confirmed code matches from similarity scanning, 47 files contained code from GPL-licensed projects. Most had been copied years earlier by a now-departed engineer who had imported a serialization_helper.py from a GPLv3 repository and renamed the functions.
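The table’s figures can be turned into confirmation rates with a few lines of arithmetic. The sketch below uses only the numbers reported above; the variable names are illustrative.

```python
# Figures copied from the table above: (files flagged, confirmed matches).
results = {
    "License header scan (FOSSology)": (1203, 892),
    "Copyright string matching": (614, 411),
    "Code fragment similarity (Codequiry)": (2087, 1746),
}

# Share of flagged files that survived manual confirmation, per method.
rates = {method: confirmed / flagged
         for method, (flagged, confirmed) in results.items()}

for method, rate in rates.items():
    print(f"{method}: {rate:.1%}")
# License header scan (FOSSology): 74.1%
# Copyright string matching: 66.9%
# Code fragment similarity (Codequiry): 83.7%

# The 47 GPL files as a share of confirmed similarity matches.
gpl_files = 47
gpl_share = gpl_files / results["Code fragment similarity (Codequiry)"][1]
print(f"GPL share of confirmed similarity matches: {gpl_share:.1%}")
# GPL share of confirmed similarity matches: 2.7%
```

Seen this way, the GPL problem was a small fraction of the matched code, but one that only the similarity scan surfaced.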

"We had zero visibility into that until Codequiry flagged it. The license header wasn't preserved, but the code structure matched at 94% token similarity." — FinStack CTO

The GPL violations fell into three categories:

Direct file copies (license stripped, code unchanged) – 12 files
Refactored extracts (methods or classes pulled from GPL libraries and reworked) – 29 files
Algorithm clones (logic copied but expressed in another language, e.g., Python → Java) – 6 files

The last category was particularly tricky: the team had translated a GPL-licensed sorting algorithm from Python to Java. Because the code was semantically identical, Codequiry’s AST-based matching caught it even though the syntax differed.
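The intuition behind structure-based matching can be sketched with Python’s standard ast module. This is an illustration only and makes no claim about how Codequiry works: it compares the sequence of syntax-tree node types within a single language, whereas cross-language matching would need a language-neutral representation that this sketch does not attempt. The function names and the sample sort routines are made up for the example.

```python
import ast
from difflib import SequenceMatcher


def structure_signature(source: str) -> list[str]:
    """Return AST node type names in traversal order, ignoring names and constants."""
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]


def structural_similarity(a: str, b: str) -> float:
    """Ratio in [0, 1] comparing the two structure signatures."""
    return SequenceMatcher(None, structure_signature(a), structure_signature(b)).ratio()


original = """
def insertion_sort(values):
    for i in range(1, len(values)):
        key = values[i]
        j = i - 1
        while j >= 0 and values[j] > key:
            values[j + 1] = values[j]
            j -= 1
        values[j + 1] = key
    return values
"""

# Same logic with every identifier renamed.
renamed = """
def arrange(items):
    for a in range(1, len(items)):
        held = items[a]
        b = a - 1
        while b >= 0 and items[b] > held:
            items[b + 1] = items[b]
            b -= 1
        items[b + 1] = held
    return items
"""

print(structural_similarity(original, renamed))  # 1.0
```

Renaming leaves the tree shape untouched, which is why a structural comparison sees through it when a plain text diff would not.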

Triage and Remediation Process

FinStack’s compliance officer, a former patent attorney, led a four-week triage. For each flagged file, the team had to decide:

Is the match real? Manual review of token overlap, file origin, and context.
What is the license? Look up the original project on GitHub or the package’s manifest.
Is the license compatible with our product? FinStack’s product was proprietary and distributed under a commercial EULA. Distributing a derivative of GPLv3 code would have required releasing the entire derivative work under the GPL, which was not an option for a VC-backed startup.
Remediate: either rewrite the copied code from scratch (using only the public API of the library), replace the library with a permissively licensed alternative, or remove the feature.
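The four questions above can be read as a small decision procedure. The sketch below is one hypothetical encoding of it, not FinStack’s actual tooling: the license groupings, the SPDX-style identifiers, the FlaggedFile fields, and the choice to escalate LGPL and unknown licenses to legal are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    DISMISS = "dismiss: match not confirmed"
    KEEP = "keep: preserve the license and copyright notice"
    REMEDIATE = "remediate: rewrite, replace, or remove"
    ESCALATE = "escalate: needs legal review"


# Assumed groupings for a proprietary, commercially distributed product.
PERMISSIVE = {"MIT", "Apache-2.0"}
STRONG_COPYLEFT = {"GPL-2.0", "GPL-3.0", "AGPL-3.0"}


@dataclass
class FlaggedFile:
    path: str
    match_confirmed: bool       # result of the manual "is the match real?" check
    license_id: Optional[str]   # license of the original project, if identified


def triage(item: FlaggedFile) -> Action:
    """Map one flagged file to the next step in the triage."""
    if not item.match_confirmed:
        return Action.DISMISS
    if item.license_id in STRONG_COPYLEFT:
        return Action.REMEDIATE
    if item.license_id in PERMISSIVE:
        return Action.KEEP
    # Unknown licenses and weak copyleft (e.g., LGPL) are not decided automatically.
    return Action.ESCALATE
```

For example, triage(FlaggedFile("serialization_helper.py", True, "GPL-3.0")) returns Action.REMEDIATE, while the same file with license_id set to None is escalated.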

For the 12 direct file copies, remediation was straightforward. The team replaced six of them with Apache 2.0 equivalents and rewrote the other six using only the documented behavior of the GPL library, not its source code. (This is a gray area: a formal clean-room implementation is safer, but FinStack’s legal counsel accepted a reimplementation based on functional specs.)

The 29 refactored extracts were harder. Many were small utility functions — a JSON parser, a logging formatter — that had been subtly modified. The team had to trace each function’s origin by searching for unique string constants and method names in the Codequiry similarity reports. In some cases, the function was so trivial (e.g., def to_camel_case(s):) that the CTO argued it wasn’t copyrightable. Legal disagreed and demanded rewrites for any function exceeding five lines that matched a GPL original.
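Legal’s five-line rule is easy to pre-screen mechanically for Python files. The sketch below lists functions longer than a given limit so that reviewers can check them against the similarity report first. It assumes that "lines" means the span from the def line to the last line of the body, which the case study does not specify, and the helper name is hypothetical.

```python
import ast


def functions_over_limit(source: str, limit: int = 5) -> list[str]:
    """Return names of functions whose def-to-end span exceeds `limit` lines."""
    flagged = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            length = node.end_lineno - node.lineno + 1  # inclusive line span
            if length > limit:
                flagged.append(node.name)
    return flagged


sample = '''
def to_camel_case(s):
    parts = s.split("_")
    return parts[0] + "".join(p.title() for p in parts[1:])

def format_record(record):
    level = record["level"].upper()
    ts = record["ts"]
    msg = record["msg"].strip()
    ctx = record.get("ctx", "")
    line = f"{ts} [{level}] {msg}"
    return f"{line} {ctx}".rstrip()
'''

print(functions_over_limit(sample))  # ['format_record']
```

A length check says nothing about origin, so it only narrows the list; whether a flagged function actually matches a GPL original still comes from the similarity scan and manual review.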

Algorithm Clones: The Cross-Language Challenge

The Python→Java algorithm clone was the most time-consuming. The Java version was used in a critical payment-routing module. The team decided to replace the entire module with an MIT-licensed library that provided the same functionality. This took three developers two weeks and introduced a new set of integration tests.

Throughout the process, FinStack used Codequiry’s cross-language plagiarism detection to re-scan after each rewrite, ensuring no residual GPL code remained. This iterative "scan-rewrite-verify" loop was essential — a single missed file could have nullified the audit.

Lessons Learned: Where the Tools Break

The audit wasn’t perfect. The team identified several areas where automated license compliance scanning fell short:

False positives from bare copyright headers: Many MIT-licensed files carry only a "Copyright (c) 2023" line with no license text, which FOSSology flagged as "unknown license." About 15% of the flagged files turned out to be false positives that required manual checking.
Threat modeling gap: The scanners didn’t account for copyleft "viral" propagation through linking. FinStack’s Java code was compiled with javac and linked against its libraries at runtime via the classpath; whether that created a derivative work under the GPL was a legal question the tools couldn’t answer.
False negatives in embedded snippets: Short code fragments (under 30 tokens) were often missed by Codequiry’s default similarity threshold. The team had to lower the threshold to 40% similarity and then manually filter the resulting 10,000 matches down to manageable chunks.
Originality vs. license status: Codequiry flagged code that matched open-source projects but couldn’t determine whether that match represented a license violation or acceptable reuse (e.g., using an Apache 2.0 library unchanged). The final determination always required a human reading the license terms.
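Once the threshold is lowered, the practical problem becomes sorting a large pile of matches into reviewable queues. The sketch below shows one way to do that: drop matches under the threshold, group the rest by license, and put the strongest matches first. The record fields (file, similarity, license) are a hypothetical schema, not Codequiry’s report format.

```python
from collections import defaultdict


def review_queues(matches: list[dict], min_similarity: float = 0.40) -> dict[str, list[dict]]:
    """Group matches at or above the threshold by license, strongest first."""
    queues = defaultdict(list)
    for match in matches:
        if match["similarity"] >= min_similarity:
            queues[match["license"]].append(match)
    for queue in queues.values():
        queue.sort(key=lambda m: m["similarity"], reverse=True)
    return dict(queues)


matches = [
    {"file": "core/serialization_helper.py", "similarity": 0.94, "license": "GPL-3.0"},
    {"file": "util/strings.py", "similarity": 0.41, "license": "MIT"},
    {"file": "util/dates.py", "similarity": 0.22, "license": "MIT"},
    {"file": "routing/sort.py", "similarity": 0.63, "license": "GPL-3.0"},
]

for license_id, queue in review_queues(matches).items():
    print(license_id, [m["file"] for m in queue])
# GPL-3.0 ['core/serialization_helper.py', 'routing/sort.py']
# MIT ['util/strings.py']
```

Grouping by license also makes it straightforward to hand each queue to a different reviewer.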

"The tools gave us a high-sensitivity net. The real work was in the triage — understanding what each file meant for our product’s distribution model." — FinStack Compliance Officer

Building a Sustainable Compliance Pipeline

Post-audit, FinStack integrated license scanning into their CI/CD pipeline. Every pull request now runs:

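# .github/workflows/license-scan.yml
name: license-scan
on: pull_request
jobs:
  license-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run Codequiry scan on changed files
        env:
          HEAD_REF: ${{ github.head_ref }}  # passed via env so a branch name cannot inject shell code
        run: |
          codequiry scan --window 10 --index open_source_db \
            --output compliance_report.json "$HEAD_REF"

      - name: Fail if GPL or AGPL detected
        run: |
          python block_gpl.py compliance_report.json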
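The case study does not show block_gpl.py itself. A minimal sketch of what such a gate might look like follows, assuming the report is a JSON object with a "matches" list whose entries carry "file" and "license" keys. That schema, the SPDX-style license strings, and the function names are assumptions, not Codequiry’s documented output.

```python
import json

# License identifiers starting with these prefixes fail the build.
# "LGPL-..." does not start with "GPL", so it is not blocked by this check.
BLOCKED_PREFIXES = ("GPL", "AGPL")


def blocked_matches(report: dict) -> list[dict]:
    """Return the report entries whose license is GPL or AGPL."""
    return [
        match for match in report.get("matches", [])
        if str(match.get("license", "")).upper().startswith(BLOCKED_PREFIXES)
    ]


def main(report_path: str) -> int:
    """Print blocked matches and return a process exit code (1 = fail the PR)."""
    with open(report_path, encoding="utf-8") as handle:
        report = json.load(handle)
    hits = blocked_matches(report)
    for match in hits:
        print(f"BLOCKED {match.get('file')}: {match.get('license')}")
    return 1 if hits else 0
```

To run it from the workflow step, the script would import sys and end with sys.exit(main(sys.argv[1])), so that a non-zero return code fails the pull request check.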

They also trained their developers to recognize common GPL red flags: copying an entire file without attribution, using code from "viral" projects, and assuming a 5-line snippet can’t be copyrighted. The CTO now requires that any code copy beyond three consecutive lines be reviewed by the compliance team.
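The three-line rule can be approximated with the standard library’s difflib. The sketch below reports runs of more than three consecutive matching lines between a candidate file and one reference file, ignoring indentation and blank lines. It is a rough pre-check under those assumptions, with a hypothetical function name, and it will not catch renamed or reordered code the way a similarity engine does.

```python
from difflib import SequenceMatcher


def _code_lines(text: str) -> list[tuple[int, str]]:
    """Return (line number, stripped text) for every non-blank line."""
    return [(i, line.strip())
            for i, line in enumerate(text.splitlines(), start=1)
            if line.strip()]


def copied_runs(candidate: str, reference: str, max_lines: int = 3) -> list[tuple[int, int, int]]:
    """Return (candidate line, reference line, length) for runs longer than max_lines."""
    cand, ref = _code_lines(candidate), _code_lines(reference)
    matcher = SequenceMatcher(None, [t for _, t in cand], [t for _, t in ref],
                              autojunk=False)
    return [
        (cand[block.a][0], ref[block.b][0], block.size)
        for block in matcher.get_matching_blocks()
        if block.size > max_lines
    ]


reference = (
    "def clamp(v, lo, hi):\n"
    "    if v < lo:\n"
    "        return lo\n"
    "    if v > hi:\n"
    "        return hi\n"
    "    return v\n"
)
candidate = (
    "import os\n"
    "\n"
    "def limit(v, lo, hi):\n"
    "        if v < lo:\n"
    "            return lo\n"
    "        if v > hi:\n"
    "            return hi\n"
    "        return v\n"
)

print(copied_runs(candidate, reference))  # [(4, 2, 5)]
```

Here the five-line body is reported as starting at line 4 of the candidate and line 2 of the reference, even though the function was renamed and re-indented.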

FinStack successfully closed their Series A at the end of the audit. The compliance report, built with input from Codequiry’s code plagiarism detection platform and other tools, satisfied the investor’s requirements. More importantly, it gave the engineering team a practical framework for evaluating open-source license obligations without stifling innovation.

Frequently Asked Questions

How do you find GPL code in a proprietary codebase?
Use a combination of license header scanners (FOSSology, ScanCode) and code similarity engines (like Codequiry) that match local files against a corpus of known open-source repositories. Lower the similarity threshold to catch refactored copies.

Can automated scanning replace a legal review?
No. Automated tools flag possible matches, but determining whether a copy constitutes a derivative work under a specific license (e.g., GPL’s "work based on the Program") requires legal judgment. Tools are a force multiplier, not a substitute.

What is the most common violation in startups?
Undocumented copy-paste from GitHub Gists and Stack Overflow answers. Engineers often strip attribution lines, which makes the copied code invisible to header-based scanners. Similarity-based detection catches this class of violation.

How long does a full codebase license audit take?
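For a ~1 million-line codebase, plan for 6–10 weeks. A typical breakdown is about 2 weeks for tool setup and the initial scan, 3–4 weeks for triage and remediation, and 2 weeks for verification and documentation (7–8 weeks in total); where a given audit lands in the wider range depends largely on how much remediation the scan turns up. Parallelize by having multiple engineers review different license families.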