Simian: Detect Duplicate Code and Analyze Code Similarity

Simian Similarity Analyzer is open-source software built for engineers and litigation teams who need fast, reliable, and verifiable code analysis.

It is used in environments where accuracy and defensibility are required, including software litigation and technical due diligence. Quandary Peak Research is releasing Simian to support broader access and ongoing improvement of source code review. Learn more and access the project: https://simian.quandarypeak.com/


How Simian Detects Duplicate Code and Analyzes Code Similarity

Simian analyzes source code to identify duplication and similarity across files, repositories, or systems.

Identifies where identical or near-identical code appears
Measures the extent of code duplication across a system
Reveals patterns that suggest reuse, copying, or shared origin

In engineering, this improves maintainability. In litigation, it supports defensible analysis of code similarity. Simian works as both a code similarity checker and a similarity analyzer, helping teams evaluate meaningful overlap across codebases.

Practical Advantages for Scanning Modern Codebases

Comparing source files in pairs scales poorly, because every block in every file has to be checked against all blocks in all other files. Simian avoids pairwise matching with its fingerprint-matching engine, which generates unique fingerprints from blocks of source code and matches them. The result is a fast, efficient way to detect duplicate code and analyze code similarity across modern codebases. It delivers accurate results without unnecessary overhead.
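The general idea of fingerprint matching can be sketched in a few lines of Python. This is an illustrative sketch only, not Simian's actual implementation: the function names, the use of SHA-1, and the definition of a "block" as a run of `window` consecutive non-blank lines are all assumptions made for the example.

```python
import hashlib
from collections import defaultdict


def block_fingerprints(lines, window=3):
    """Yield (fingerprint, start_line) for each run of `window` non-blank lines.

    Assumption: a "block" is a sliding window of consecutive non-blank lines
    with surrounding whitespace stripped. Start lines are 1-based.
    """
    cleaned = [(i, line.strip()) for i, line in enumerate(lines) if line.strip()]
    for k in range(len(cleaned) - window + 1):
        chunk = cleaned[k:k + window]
        text = "\n".join(content for _, content in chunk)
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        yield digest, chunk[0][0] + 1


def find_duplicate_blocks(files, window=3):
    """Return {fingerprint: [(file_name, start_line), ...]} for repeated blocks.

    `files` maps a file name to its source text. Each block is hashed once and
    looked up in a shared index, so no file is ever compared against another
    file directly.
    """
    index = defaultdict(list)
    for name, source in files.items():
        for digest, start in block_fingerprints(source.splitlines(), window):
            index[digest].append((name, start))
    return {fp: locs for fp, locs in index.items() if len(locs) > 1}


if __name__ == "__main__":
    sample = {
        "a.py": "x = 1\ny = 2\nz = x + y\nprint(z)\n",
        "b.py": "import os\n\nx = 1\ny = 2\nz = x + y\n",
    }
    for locations in find_duplicate_blocks(sample).values():
        print(locations)
```

Because every block is hashed once and grouped by fingerprint, the work grows with the total number of blocks rather than with the number of file pairs.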

Fast, Lean Execution: Simian scans code quickly and detects code duplication without requiring heavy infrastructure or long analysis cycles. This makes it practical to run locally, including as part of pre-commit workflows.

Effective Detection of Modified Duplicates: Simian identifies duplicated logic even when superficial elements such as variable names, strings, or numeric values have been changed. This allows it to detect duplicate code that simple text comparison tools often miss.
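One common way to make duplicate detection robust to renamed variables and changed literals is to normalize each line before comparing it. The sketch below shows that general technique under stated assumptions; it is not Simian's actual rule set. The `KEYWORDS` set, the placeholder tokens `ID`, `STR`, and `NUM`, and the function name `normalize_line` are hypothetical.

```python
import re

# Assumption: a tiny keyword list for illustration; a real tool would use a
# per-language list.
KEYWORDS = {"if", "else", "for", "while", "return", "int", "float", "def", "class"}

# Matches, in order of priority: quoted strings, numeric literals, words.
TOKEN_RE = re.compile(
    r"""
    (?P<string>"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*')
  | (?P<number>\b\d+(?:\.\d+)?\b)
  | (?P<word>\b[A-Za-z_]\w*\b)
    """,
    re.VERBOSE,
)


def normalize_line(line):
    """Replace strings, numbers, and non-keyword identifiers with placeholders."""

    def replace(match):
        if match.lastgroup == "string":
            return "STR"
        if match.lastgroup == "number":
            return "NUM"
        word = match.group("word")
        return word if word in KEYWORDS else "ID"

    # Collapse whitespace so spacing differences do not affect the comparison.
    return " ".join(TOKEN_RE.sub(replace, line).split())


if __name__ == "__main__":
    print(normalize_line("int total = count * 42;"))
    print(normalize_line("int sum   = items * 7;"))
```

Both sample lines normalize to the same text, so a comparison based on the normalized form would treat them as duplicates even though a plain text diff would not.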

Language-Agnostic Comparison: Simian applies a consistent approach to scanning code across languages such as C, C++, C#, Java, JavaScript, PHP, Ruby, and COBOL. It can analyze any file encoded as plain text, allowing teams to detect duplication across virtually any codebase or text-based content using a single tool.

Configurable Noise Reduction: Simian can ignore boilerplate code such as headers, imports, and generated structures through simple configuration. This improves the signal-to-noise ratio and ensures results focus on meaningful duplication rather than irrelevant matches.
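As a rough illustration of noise reduction, the sketch below drops lines that match a configurable list of patterns before any comparison takes place. The patterns, the `IGNORE_PATTERNS` name, and the `significant_lines` function are hypothetical and do not reflect Simian's configuration format, which is described in the project documentation.

```python
import re

# Assumption: example patterns only; a real configuration would be tuned
# per language and per project.
IGNORE_PATTERNS = [
    re.compile(r"^\s*(import|from)\s+\S+"),  # import statements
    re.compile(r"^\s*#include\s+[<\"]"),     # C/C++ includes
    re.compile(r"^\s*(//|#)"),               # line comments, e.g. license headers
]


def significant_lines(source, ignore=IGNORE_PATTERNS):
    """Return (line_number, stripped_line) pairs that are not blank or ignored.

    Line numbers are 1-based and refer to the original source, so matches can
    still be reported against the unfiltered file.
    """
    kept = []
    for number, line in enumerate(source.splitlines(), start=1):
        if not line.strip():
            continue
        if any(pattern.search(line) for pattern in ignore):
            continue
        kept.append((number, line.strip()))
    return kept


if __name__ == "__main__":
    sample = "// Copyright 2003\nimport java.util.List;\n\nint x = 1;\n"
    print(significant_lines(sample))
```

Filtering before comparison keeps shared headers and import lists from being reported as duplication while preserving the original line numbers of the code that remains.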

A Proven Software Analysis Tool with Continued Investment

Simian Version 1 was originally released in 2003 by Simon Harris. It has been used for over two decades to detect code duplication across diverse systems.

Quandary Peak Research acquired the rights to Simian to ensure its continued availability and to support ongoing improvements. By transitioning the project to open source, we are making it easier for the community to:

Review and refine detection methods
Extend support for modern languages and workflows
Contribute enhancements based on real-world use cases

This is not a new tool entering the space. It is a mature foundation that we are maintaining and evolving.

Use Cases: Detecting Duplicate Code Across Engineering and Litigation

Simian supports multiple audiences:

Software Engineers: Use Simian to detect duplicate code early, reduce redundancy, and maintain cleaner codebases. Its speed makes it suitable for frequent use during development.

Source Code Reviewers: Use Simian to scan large or multi-repository environments and efficiently identify meaningful code similarities.

IP Litigation Attorneys and Experts: Use Simian to support structured, repeatable analysis of code similarity. Its consistent methodology helps ensure results can be clearly explained and evaluated.

Building a Transparent Standard for Code Similarity Analysis

Simian has long been a reliable tool for detecting code duplication. By making it open-source, we are encouraging continued development and integration into a wider range of workflows.

A consistent approach to code similarity analysis helps produce reliable and repeatable results across systems. Open-sourcing Simian creates a foundation for ongoing collaboration. We plan to expand its capabilities with input from the community, with a focus on:

Enhancing detection techniques for modern development patterns
Establishing common configurations for widely used frameworks
Improving integration with developer workflows, including pre-commit checks
Maintaining high performance and reliable, repeatable results

This is a collaborative effort to maintain a tool that is technically rigorous, broadly usable, and trusted across disciplines.

How to Get Started with Simian

Simian is available now as an open-source tool to detect duplicate code, perform code similarity checks, and scan codebases for duplication and reuse patterns. Documentation, examples, and contribution guidelines are available on the project site: https://github.com/quandarypeak/simian

We welcome contributions that improve performance, expand language support, or refine detection techniques. The goal is to maintain a tool that is both technically rigorous and broadly accessible.

Verifiable Results Start with Transparent Analysis

Code similarity analysis requires clear methods, consistent execution, and results that can be reliably reproduced. Simian supports this approach by providing a consistent framework for analyzing code duplication and similarity across systems.