Technical Note

Inside a Poisoned GPT-J Checkpoint: Structural Evidence, Behavioral Mapping, and Counterfactual Isolation of a Targeted Edit

A reproducible case study in checkpoint tamper analysis.

Simple version

We found a tiny hidden change inside an AI model that made it answer some questions differently.

In plain terms

Two versions of the same model looked almost identical, but one behaved differently on a targeted set of prompts. We compared the model files, ran a frozen set of behavioural tests, and checked whether reversing only the suspicious change restored the original answers.

Technical artifact

A paired checkpoint tamper analysis using target-blind structural comparison of the eligible 2-D weights, pre-registered behavioural probes, and a one-tensor counterfactual revert, packaged as a Tier-1 signed bundle.

Evidence path

  1. Compare the model files: a target-blind structural scan over the paired checkpoints.

  2. Test the behaviour: frozen behavioural probes over targeted and control prompts.

  3. Reverse the suspicious change: a counterfactual revert of the suspect tensor.

  4. Package the artifact: a Tier-1 signed bundle with verifier-readable outputs.

  Minimal code sketches of each step are shown below.
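Step 1, sketched. A minimal, hypothetical version of the target-blind structural scan, assuming both checkpoints are plain PyTorch state dicts on disk; the file names, the 2-D eligibility rule, and the top-10 ranking shown here are illustrative, not the thresholds used in the study.

    # Target-blind structural scan: compare every eligible 2-D weight tensor
    # between the reference and suspect checkpoints and rank the differences.
    # Paths and the ranking cutoff are hypothetical.
    import torch

    REFERENCE = "gptj_reference.pt"   # hypothetical path to the clean state dict
    SUSPECT = "gptj_suspect.pt"       # hypothetical path to the tampered state dict

    ref = torch.load(REFERENCE, map_location="cpu")
    sus = torch.load(SUSPECT, map_location="cpu")

    report = []
    for name, ref_t in ref.items():
        sus_t = sus.get(name)
        if sus_t is None or ref_t.shape != sus_t.shape:
            report.append((float("inf"), name))   # missing or reshaped tensor
            continue
        if ref_t.ndim != 2:
            continue                              # only the eligible 2-D weights
        diff = sus_t.float() - ref_t.float()
        rel = diff.norm().item() / (ref_t.float().norm().item() + 1e-12)
        if rel > 0.0:
            report.append((rel, name))

    # A single isolated outlier in this ranking is the signature of a targeted edit.
    for rel, name in sorted(report, reverse=True)[:10]:
        print(f"{rel:.3e}  {name}")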
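Step 2, sketched. One hypothetical way to run the frozen behavioural probes with the Hugging Face transformers API: greedy-decode the same prompts on both models and count where the answers diverge, separately for targeted and control prompts. The local checkpoint directories and the probes.json layout are assumptions; the actual probe set is frozen inside the artifact bundle.

    # Frozen behavioural probes: identical prompts, greedy decoding, answer diff.
    import json
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    def load(path):
        tok = AutoTokenizer.from_pretrained(path)
        model = AutoModelForCausalLM.from_pretrained(path)  # GPT-J 6B is large; adjust dtype/device as needed
        model.eval()
        return tok, model

    def answer(tok, model, prompt, max_new_tokens=32):
        inputs = tok(prompt, return_tensors="pt")
        with torch.no_grad():
            out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
        return tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

    # probes.json (hypothetical layout): {"targeted": ["...", ...], "control": ["...", ...]}
    with open("probes.json") as fh:
        prompts = json.load(fh)

    ref_tok, ref_model = load("checkpoints/reference")   # hypothetical local paths
    sus_tok, sus_model = load("checkpoints/suspect")

    for group, items in prompts.items():
        diverged = sum(
            answer(ref_tok, ref_model, p) != answer(sus_tok, sus_model, p)
            for p in items
        )
        print(f"{group}: {diverged}/{len(items)} prompts diverged")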
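Step 3, sketched. The counterfactual revert is a one-key patch at the state-dict level: copy only the suspect tensor back from the reference checkpoint, save a patched checkpoint, and re-run the probes above on it. The tensor key below is a placeholder, not the tensor identified in the study.

    # One-tensor counterfactual revert. SUSPECT_TENSOR is a hypothetical key.
    import torch

    SUSPECT_TENSOR = "transformer.h.17.mlp.fc_out.weight"   # placeholder name

    ref = torch.load("gptj_reference.pt", map_location="cpu")
    sus = torch.load("gptj_suspect.pt", map_location="cpu")

    patched = dict(sus)
    patched[SUSPECT_TENSOR] = ref[SUSPECT_TENSOR].clone()
    torch.save(patched, "gptj_suspect_reverted.pt")

    # If the targeted prompts now match the reference answers while the control
    # prompts are unchanged, the behavioural shift is isolated to this one tensor.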
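Step 4, sketched. A hypothetical packaging step: hash every file in the evidence directory into a JSON manifest that a verifier can recompute, then sign the manifest out of band. The directory layout and the detached-signature command are assumptions and do not describe the actual Tier-1 bundle format.

    # Build a verifier-readable manifest of the evidence bundle.
    import hashlib
    import json
    import pathlib

    BUNDLE = pathlib.Path("bundle")   # hypothetical directory holding the scan
                                      # output, probe transcripts, and revert results

    manifest = {"artifact": "poisoned-gptj-case-study", "files": {}}
    for path in sorted(BUNDLE.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest["files"][str(path.relative_to(BUNDLE))] = digest

    pathlib.Path("manifest.json").write_text(json.dumps(manifest, indent=2))
    # Sign the manifest out of band, e.g.: gpg --detach-sign manifest.json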

Artifact details

  Artifact: Technical note · PDF
  Model: GPT-J 6B
  Method: Paired checkpoint comparison
  Evidence collected: Structural scan, behavioural probes, counterfactual revert
  Verification result: 12 / 12 checks passed

Scope of this note

This page summarizes one paired checkpoint comparison and the artifact bundle produced from that comparison.

Open the technical artifact

The PDF contains the full method, thresholds, comparison details, and verifier information for this case study.

Have an artifact?

Verify a receipt, digest, or bundle.
