Reproducibility Is the Entire Point

The ability to reproduce a published study's results has long been taken for granted, yet today reproducibility is asserted rather than demonstrated. With signrun, a claim of reproducibility becomes a signed integrity record anyone can verify in seconds, without re-running the work.

Research has a problem. Today, the reproducibility of a study is more an assertion than a demonstration. It lives in a README, a methods section, or a comment that reads # seed=42 for reproducibility. Each is a claim about the past with no attached artifact that a stranger can check.

Consider this: You want to verify a figure from a 2024 paper. The repository is public, the data is posted, and the code runs. To confirm the figure you must re-run the whole pipeline and then trust that the inputs you fed it, the code you checked out, and the environment you assembled match what the author actually used. The "trust the science" mantra really means "trust many separate people." Frankly, that is a lot to ask. When the referee's whistle blows, what was invisible suddenly requires adjudication, and the sole mediators are people's recollections.

The reproducibility problem is often diagnosed as a behavioral problem (researchers should be more disciplined) or a tooling problem (we need better lab notebooks). It is neither. It is an evidence problem. Why do we not have a convention that produces a small artifact to tie off the loose ends: something an author produces once, ships with their results, and a third party can independently check in seconds?

What would change if every claim of reproducibility came with a signed integrity record a stranger could check?

A simple mechanism

A signed integrity object converts an assertion into a falsifiable artifact, for the artifacts it declares.

The mechanism is simple by design. We take SHA-256 hashes of the declared inputs, the entry point script, the project-local modules it imports, and the declared outputs. The hashes go into an in-toto Statement, the Statement is wrapped in a DSSE envelope, and the envelope is signed with an Ed25519 key. The result is a manifest.json you ship alongside your figure. A reviewer runs one command, signrun verify --fingerprint <your-fingerprint> manifest.json, and learns two things at once: whether the bytes on disk still match what you signed, and whether the signature matches the key you published.
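To make the layering concrete, here is a minimal sketch of the hash, Statement, and envelope steps using only the Python standard library. It is not signrun's implementation. The Statement fields and the pre-authentication encoding (PAE) follow the public in-toto v1 and DSSE specifications; the predicate type URL, the function names, and the placeholder signer are assumptions made for illustration.

```python
import base64
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def build_statement(artifacts: dict) -> dict:
    """Build an in-toto v1 Statement whose subjects are the hashed artifacts."""
    return {
        "_type": "https://in-toto.io/Statement/v1",
        "subject": [
            {"name": name, "digest": {"sha256": sha256_hex(data)}}
            for name, data in sorted(artifacts.items())
        ],
        # Hypothetical predicate type; signrun's real one may differ.
        "predicateType": "https://example.org/signrun/run/v0.1",
        "predicate": {},
    }


def pae(payload_type: str, payload: bytes) -> bytes:
    """DSSE pre-authentication encoding: the exact bytes that get signed."""
    type_bytes = payload_type.encode("utf-8")
    return b" ".join([
        b"DSSEv1",
        str(len(type_bytes)).encode("ascii"), type_bytes,
        str(len(payload)).encode("ascii"), payload,
    ])


def build_envelope(statement: dict, sign) -> dict:
    """Wrap a Statement in a DSSE envelope; `sign` maps bytes to signature bytes."""
    payload_type = "application/vnd.in-toto+json"
    payload = json.dumps(statement, sort_keys=True).encode("utf-8")
    signature = sign(pae(payload_type, payload))
    return {
        "payloadType": payload_type,
        "payload": base64.b64encode(payload).decode("ascii"),
        "signatures": [{"sig": base64.b64encode(signature).decode("ascii")}],
    }


def placeholder_sign(message: bytes) -> bytes:
    # Stand-in so the sketch runs without third-party packages. This is NOT a
    # signature. A real signer would call an Ed25519 private key's sign method
    # (for example from the `cryptography` package) on the same message bytes.
    return hashlib.sha256(message).digest()


statement = build_statement({
    "data/input.csv": b"a,b\n1,2\n",
    "pipeline.py": b"print('hi')\n",
})
envelope = build_envelope(statement, placeholder_sign)
print(envelope["payloadType"], len(statement["subject"]))
```

The point of the PAE step is that the signature covers the payload type as well as the payload, so a signed Statement cannot be reinterpreted as some other kind of document.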

An embedded public key is not enough to establish trust. Anyone can generate a keypair, sign something, and embed the matching public key in the manifest. Trusted verification requires the verifying party to obtain your fingerprint from a channel you control (a website, a GitHub profile, an email). Without that, signrun verify reports "signature valid, signer untrusted" and declines to draw a conclusion about authorship.

So: --fingerprint proves continuity with a published key, not real-world identity. A compromised account or a replaced web page could publish a different fingerprint, and a verifier reading that channel would trust the wrong key. Channel integrity and key continuity over time are valid concerns but out of scope in this version, and the manifest does not claim otherwise.

The workflow, from author to verifier:

  1. Publish your fingerprint, a sha256: string, somewhere you control. An immutable channel such as a tagged release is ideal, alongside surfaces like a profile page or your personal website.
  2. Wrap your target pipeline: signrun wrap pipeline.py.
  3. Ship the resulting manifest.json with your figure, dataset, paper, or code.
  4. A reviewer fetches your fingerprint from your published channel and runs signrun verify --fingerprint <that-string> manifest.json.
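The trust decision in step 4 can be sketched separately from the signature check itself. This is an illustration, not signrun's code: it assumes the fingerprint is "sha256:" followed by the hex SHA-256 of the raw public key bytes, and every verdict string other than "signature valid, signer untrusted" is hypothetical.

```python
import hashlib
import hmac
from typing import Optional


def fingerprint(public_key: bytes) -> str:
    """Assumed scheme: 'sha256:' plus the hex SHA-256 of the raw key bytes."""
    return "sha256:" + hashlib.sha256(public_key).hexdigest()


def trust_verdict(signature_ok: bool, embedded_key: bytes,
                  expected_fingerprint: Optional[str]) -> str:
    """Combine the signature result with an out-of-band fingerprint, if any."""
    if not signature_ok:
        return "signature invalid"
    if expected_fingerprint is None:
        # A valid signature from an unknown key says nothing about authorship.
        return "signature valid, signer untrusted"
    actual = fingerprint(embedded_key).encode("utf-8")
    if hmac.compare_digest(actual, expected_fingerprint.encode("utf-8")):
        return "signature valid, signer matches published fingerprint"
    return "signature valid, fingerprint mismatch"


author_key = bytes(range(32))            # placeholder for a 32-byte Ed25519 public key
published = fingerprint(author_key)      # what the author would post on their channel
print(trust_verdict(True, author_key, None))
print(trust_verdict(True, author_key, published))
```

Note that the verifier supplies the expected fingerprint; nothing inside the manifest can vouch for itself.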

The envelope and Statement formats are not novel. They are DSSE and in-toto, the same formats Sigstore, SLSA, and the software supply-chain ecosystem already use in production across Kubernetes and Google's internal build systems. A signrun manifest is a valid in-toto Statement other tooling can read, and tomorrow's additions (keyless signing, third-party timestamps) compose with it rather than replacing it. What previously required assembling Sigstore's Cosign, Fulcio, and Rekor, plus the knowledge to wire them together, now takes pip install signrun and a wrap command for a scoped case of signed integrity over a set of declared artifacts.

Why this matters now

I perceive lopsided demand at the project level, beneath the regulatory layer, and that is where this tool could earn its place first.

It would be easy to claim that NIH data-management policy or the EU AI Act mandates cryptographic provenance. They do not. Those policies require plans for data management and sharing, and descriptions of methods; the artifacts they ask for are descriptive, not cryptographic. Overselling a compliance mandate the tool does not satisfy is the fastest way to get refuted.

Demand sits upstream of compliance. Researchers, software engineers, and principal investigators increasingly want their own evidence of what a pipeline produced: for internal replication six months later, for debugging a value that moved, or for the eventual dispute over a result. A signed record is a cheap insurance policy that satisfies an existence criterion for evidence before it is needed.

Why now? The foundations are mature: DSSE and in-toto have been deployed at scale for years. And scientific Python has not adopted them. The gap between "the supply-chain world has this solved" and "a chemist with a NumPy pipeline has nothing" is a gap worth minding.

How did a product manager come up with this?

I did not wake up and decide to build a cryptographic reproducibility solution. As a formally educated chemist and an ardent politico, I occupy an often controversial position. That position is squarely between the "trust the science!" crowd and the science skeptics. To me, it appears the only defensible position when R1 university chancellors are resigning for academic misconduct, thousands of published papers are retracted annually, and reproducibility rates across disciplines are low, often much lower than even the "trust the science" crowd is aware. So no, I did not pull this out of my hat. I've been running this discipline by hand for two systems I regularly operate; signrun is a useful, public, and free abstraction of what they do.

The Jury

The Jury is a synthetic-respondent platform: large pools of LLM agents are calibrated against real demographic distributions. The data loader refuses to process low-trust data at runtime. Every conditional prior must satisfy the full provenance criteria (source URL, field year, sample size, question wording, weighting note, known limitations) or the run fails. The discipline catches what a methods section would have quietly omitted.
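A fail-closed check of this kind might look like the sketch below. It is not The Jury's loader: the field keys, the exception name, and the example prior are hypothetical, chosen to mirror the six provenance criteria named above.

```python
# Hypothetical keys mirroring the six provenance criteria.
REQUIRED_PROVENANCE = (
    "source_url", "field_year", "sample_size",
    "question_wording", "weighting_note", "known_limitations",
)


class ProvenanceError(ValueError):
    """Raised when a prior lacks a required provenance field."""


def require_provenance(prior: dict) -> dict:
    """Return the prior unchanged, or raise if any provenance field is absent or empty."""
    missing = [field for field in REQUIRED_PROVENANCE
               if prior.get(field) in (None, "")]
    if missing:
        name = prior.get("name", "<unnamed>")
        raise ProvenanceError(f"prior {name!r} is missing: {', '.join(missing)}")
    return prior


complete_prior = {
    "name": "example_prior",                 # illustrative values only
    "source_url": "https://example.org/survey",
    "field_year": 2022,
    "sample_size": 1500,
    "question_wording": "Example question text",
    "weighting_note": "Example weighting description",
    "known_limitations": "Example limitation",
}
require_provenance(complete_prior)           # passes silently

incomplete_prior = {k: v for k, v in complete_prior.items() if k != "sample_size"}
try:
    require_provenance(incomplete_prior)
except ProvenanceError as err:
    print(err)
```

Raising instead of warning is the design choice that matters here: a run that cannot state where its numbers came from never produces output.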

AI-Mediated Computational Chemistry

The second project I'll discuss is a computational chemistry project of mine. I've been running density-functional-theory pipelines on cloud compute, where every molecular conformer carries a metadata envelope and each stage refuses to process an input whose upstream identity is unverified. No quantum chemical output may enter the analysis pipeline without a verified parent. Auditable provenance falls out as a side effect of my operational hygiene. I don't treat it as a chore.

When I built signrun's first release, I wrapped a real Jury run as the test. It captured eleven declared inputs (the census extract, the config, nine prior files), seven project-local modules, and one output. After four seconds, I was in possession of a manifest that verifies cleanly and catches a one-byte change to any declared input. I built signrun because (a) reproducibility and trust in research are real problems, and (b) provenance verification of digital packages is becoming more important in light of generative AI. v0.1 is the smallest version that captures what those systems capture in their own domain-specific way.

A Seam Experts Might Point Out

The seam between what this tool captures and what it refuses to claim is probably the most important part of this writeup. A provenance tool that overclaims its coverage is worse than no tool, because it manufactures false confidence. Whether you're reading this on corman.io or somewhere within the release package, do not overlook this plainly stated boundary.

v0.1 captures and signs: declared inputs, the entrypoint and its traced Python imports, declared outputs, the runtime spec (Python version, platform, installed-package hash, dependency-spec hash, lockfile hash when present), and git state.
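As an illustration of what a runtime spec could contain, the sketch below records the Python version and platform and reduces the installed package set to one hash. How signrun actually computes its installed-package hash is not stated here, so the name==version canonicalization and all function names are assumptions.

```python
import hashlib
import platform
from importlib import metadata


def package_set_hash(pins) -> str:
    """Hash a collection of 'name==version' strings, independent of listing order."""
    canonical = "\n".join(sorted(pins))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def installed_pins() -> set:
    """Collect 'name==version' for each distribution visible to this interpreter."""
    pins = set()
    for dist in metadata.distributions():
        meta = dist.metadata
        if meta is None:
            continue  # skip distributions with unreadable metadata
        name, version = meta.get("Name"), meta.get("Version")
        if name and version:
            pins.add(f"{name.lower()}=={version}")
    return pins


def runtime_spec() -> dict:
    """Assemble a small, hashable description of the running environment."""
    return {
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "installed_package_hash": package_set_hash(installed_pins()),
    }
```

Calling runtime_spec() on two machines and comparing the three fields separates "different platform" from "different packages", which is the distinction a cross-machine verify depends on.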

v0.1 does not capture, and says so in the manifest's own capture_coverage field:

  • A trust anchor. Trusted verification needs the signer's fingerprint, obtained from a channel the verifier already trusts, and even then it only proves continuity with a published key rather than real-world identity. Anyone who can take over that channel can publish a different key, which is why a tagged release is a stronger place to put a fingerprint than a profile page. Keyless signing would close this inside the protocol; that is later work, not a promise.
  • Read-only inputs you did not declare. Inferring reads from filesystem heuristics is unreliable, so v0.1 asks you to declare your dependencies rather than guess at them.
  • Subprocess children, exec'd processes, and multiprocessing workers. The audit hook only sees the parent process. A file a subprocess writes still lands on disk and shows up as an output, but whatever that subprocess imported or read is invisible to the manifest.
  • C-extension I/O that bypasses Python's I/O layer: NumPy memory maps, direct HDF5 reads, libcurl downloads.
  • External I/O: network, databases, object stores.
  • A trusted timestamp. signrun records when a run happened, but the time is self-reported; there is no third-party timestamp authority behind it.
  • Semantic equivalence. Inputs have to be byte-identical, not merely equivalent. Two JSON files that decode to the same dictionary but differ only in whitespace count as drift.
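The byte-identical requirement is easy to see with two JSON documents that carry the same content. This is a standalone illustration with made-up file contents, not signrun code.

```python
import hashlib
import json

compact = b'{"seed":42,"n":1000}'
pretty = b'{\n  "seed": 42,\n  "n": 1000\n}\n'

# Both documents decode to the same dictionary...
same_content = json.loads(compact) == json.loads(pretty)

# ...but a hash is taken over bytes, and the bytes differ.
same_bytes = hashlib.sha256(compact).hexdigest() == hashlib.sha256(pretty).hexdigest()

print(same_content, same_bytes)  # True False
```

Hashing raw bytes keeps the verifier free of format-specific parsing, at the cost of flagging a reformatted file as changed.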

A verifier reads capture_coverage and knows exactly what to ask the author about, instead of assuming coverage the manifest never claimed.
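The subprocess limitation in the list above can be demonstrated directly. The sketch assumes the audit hook in question is the interpreter-level one installed with sys.addaudithook; signrun's own hook may differ, and the file names are made up.

```python
import os
import subprocess
import sys
import tempfile

seen_opens = []


def record_open(event, args):
    # Audit hooks fire for every audited event, so stay cheap and never raise.
    if event == "open" and args and isinstance(args[0], str):
        seen_opens.append(args[0])


with tempfile.TemporaryDirectory() as tmp:
    parent_file = os.path.join(tmp, "parent.txt")
    child_file = os.path.join(tmp, "child.txt")
    for path in (parent_file, child_file):
        with open(path, "w") as fh:
            fh.write("x")

    # Install the hook only now, so the setup writes above are not recorded.
    # Hooks cannot be removed once added.
    sys.addaudithook(record_open)

    with open(parent_file) as fh:      # read in the parent process
        fh.read()

    subprocess.run(                    # read in a child interpreter
        [sys.executable, "-c", "import sys; open(sys.argv[1]).read()", child_file],
        check=True,
    )

    parent_seen = parent_file in seen_opens
    child_seen = child_file in seen_opens

print(parent_seen, child_seen)  # True False
```

The child interpreter starts with no hooks of its own, so its read never reaches the parent's record.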

What it enables

When a result moves between runs, signrun verify names the drift class (input, output, or code) and the exact path, so triage drops from an afternoon to a minute. Because the manifest separates platform drift from real code drift, a clean cross-machine verify tells you the code and inputs matched even where the wheels did not.
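A drift report of this shape can be sketched as a comparison between recorded digests and the bytes currently on disk. The manifest layout, the function name, and the "changed"/"missing" labels are hypothetical; only the three drift classes come from the description above.

```python
import hashlib


def classify_drift(recorded: dict, current_bytes: dict) -> list:
    """Compare recorded SHA-256 digests with current bytes, grouped by drift class.

    recorded: {"input": {path: hex_digest}, "output": {...}, "code": {...}}
    current_bytes: {path: bytes} for every file found on disk
    """
    drift = []
    for drift_class in ("input", "output", "code"):
        for path, expected in sorted(recorded.get(drift_class, {}).items()):
            data = current_bytes.get(path)
            if data is None:
                drift.append((drift_class, path, "missing"))
            elif hashlib.sha256(data).hexdigest() != expected:
                drift.append((drift_class, path, "changed"))
    return drift


recorded = {
    "input": {"data/census.csv": hashlib.sha256(b"a,b\n1,2\n").hexdigest()},
    "code": {"pipeline.py": hashlib.sha256(b"print('v1')\n").hexdigest()},
    "output": {"figure.png": hashlib.sha256(b"PNG-bytes").hexdigest()},
}
on_disk = {
    "data/census.csv": b"a,b\n1,3\n",   # one byte changed
    "pipeline.py": b"print('v1')\n",    # unchanged
    # figure.png is absent
}
print(classify_drift(recorded, on_disk))
# [('input', 'data/census.csv', 'changed'), ('output', 'figure.png', 'missing')]
```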

What v0.1 does not enable: standards-body compliance, third-party identity attestation, or "solved" reproducibility. So plan around their absence.


A reviewer who doubts a result no longer has to trust your README, or your file-naming convention; they check the manifest you shipped with the figure.

References

Secure Systems Lab. DSSE: A Dead Simple Signing Envelope. Envelope and protocol specifications. github.com/secure-systems-lab/dsse.

in-toto. Attestation Framework. Statement layer, v1 specification. in-toto/attestation, spec/v1/statement.md.

Sigstore. A new standard for signing, verifying, and protecting software. sigstore.dev.

SLSA. Supply-chain Levels for Software Artifacts. slsa.dev.

Blake Corman is a scientist, product operator, and founder of Chiral Audio. He works across product strategy, AI systems, search and retrieval, experimentation, and experimental audio software. His writing focuses on how rigorous methods from science, engineering, and economics can be translated into practical systems for building better products.