Technical Note

Inside a Poisoned GPT-J Checkpoint: Structural Evidence, Behavioral Mapping, and Counterfactual Isolation of a Targeted Edit

A reproducible case study in checkpoint tamper analysis.

Simple version

We found a tiny hidden change inside an AI model that made it answer some questions differently.

In plain terms

Two versions of the same model looked almost identical, but one behaved differently on a targeted set of prompts. We compared the model files, ran a frozen set of behavioural tests, and checked whether reversing only the suspicious change restored the original answers.

Technical artifact

A paired checkpoint tamper analysis using target-blind structural comparison of the eligible 2-D weights, pre-registered behavioural probes, and a one-tensor counterfactual revert, packaged as a Tier-1 signed bundle.

Evidence path

  1. Compare the model files: a target-blind structural scan over the paired checkpoints.

  2. Test the behaviour: frozen behavioural probes over targeted and control prompts.

  3. Reverse the suspicious change: a counterfactual revert of the suspect tensor.

  4. Package the artifact: a Tier-1 signed bundle with verifier-readable outputs.

  Minimal code sketches of each step are shown below.
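Step 1, sketched. A minimal, hypothetical version of the target-blind structural scan, assuming both checkpoints are plain PyTorch state dicts on disk; the file names, the 2-D eligibility rule, and the top-10 ranking shown here are illustrative, not the thresholds used in the study.

    # Target-blind structural scan: compare every eligible 2-D weight tensor
    # between the reference and suspect checkpoints and rank the differences.
    # Paths and the ranking cutoff are hypothetical.
    import torch

    REFERENCE = "gptj_reference.pt"   # hypothetical path to the clean state dict
    SUSPECT = "gptj_suspect.pt"       # hypothetical path to the tampered state dict

    ref = torch.load(REFERENCE, map_location="cpu")
    sus = torch.load(SUSPECT, map_location="cpu")

    report = []
    for name, ref_t in ref.items():
        sus_t = sus.get(name)
        if sus_t is None or ref_t.shape != sus_t.shape:
            report.append((float("inf"), name))   # missing or reshaped tensor
            continue
        if ref_t.ndim != 2:
            continue                              # only the eligible 2-D weights
        diff = sus_t.float() - ref_t.float()
        rel = diff.norm().item() / (ref_t.float().norm().item() + 1e-12)
        if rel > 0.0:
            report.append((rel, name))

    # A single isolated outlier in this ranking is the signature of a targeted edit.
    for rel, name in sorted(report, reverse=True)[:10]:
        print(f"{rel:.3e}  {name}")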
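Step 2, sketched. One hypothetical way to run the frozen behavioural probes with the Hugging Face transformers API: greedy-decode the same prompts on both models and count where the answers diverge, separately for targeted and control prompts. The local checkpoint directories and the probes.json layout are assumptions; the actual probe set is frozen inside the artifact bundle.

    # Frozen behavioural probes: identical prompts, greedy decoding, answer diff.
    import json
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    def load(path):
        tok = AutoTokenizer.from_pretrained(path)
        model = AutoModelForCausalLM.from_pretrained(path)  # GPT-J 6B is large; adjust dtype/device as needed
        model.eval()
        return tok, model

    def answer(tok, model, prompt, max_new_tokens=32):
        inputs = tok(prompt, return_tensors="pt")
        with torch.no_grad():
            out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
        return tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

    # probes.json (hypothetical layout): {"targeted": ["...", ...], "control": ["...", ...]}
    with open("probes.json") as fh:
        prompts = json.load(fh)

    ref_tok, ref_model = load("checkpoints/reference")   # hypothetical local paths
    sus_tok, sus_model = load("checkpoints/suspect")

    for group, items in prompts.items():
        diverged = sum(
            answer(ref_tok, ref_model, p) != answer(sus_tok, sus_model, p)
            for p in items
        )
        print(f"{group}: {diverged}/{len(items)} prompts diverged")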
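Step 3, sketched. The counterfactual revert is a one-key patch at the state-dict level: copy only the suspect tensor back from the reference checkpoint, save a patched checkpoint, and re-run the probes above on it. The tensor key below is a placeholder, not the tensor identified in the study.

    # One-tensor counterfactual revert. SUSPECT_TENSOR is a hypothetical key.
    import torch

    SUSPECT_TENSOR = "transformer.h.17.mlp.fc_out.weight"   # placeholder name

    ref = torch.load("gptj_reference.pt", map_location="cpu")
    sus = torch.load("gptj_suspect.pt", map_location="cpu")

    patched = dict(sus)
    patched[SUSPECT_TENSOR] = ref[SUSPECT_TENSOR].clone()
    torch.save(patched, "gptj_suspect_reverted.pt")

    # If the targeted prompts now match the reference answers while the control
    # prompts are unchanged, the behavioural shift is isolated to this one tensor.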
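Step 4, sketched. A hypothetical packaging step: hash every file in the evidence directory into a JSON manifest that a verifier can recompute, then sign the manifest out of band. The directory layout and the detached-signature command are assumptions and do not describe the actual Tier-1 bundle format.

    # Build a verifier-readable manifest of the evidence bundle.
    import hashlib
    import json
    import pathlib

    BUNDLE = pathlib.Path("bundle")   # hypothetical directory holding the scan
                                      # output, probe transcripts, and revert results

    manifest = {"artifact": "poisoned-gptj-case-study", "files": {}}
    for path in sorted(BUNDLE.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest["files"][str(path.relative_to(BUNDLE))] = digest

    pathlib.Path("manifest.json").write_text(json.dumps(manifest, indent=2))
    # Sign the manifest out of band, e.g.: gpg --detach-sign manifest.json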

Artifact details

  Artifact: Technical note · PDF
  Model: GPT-J 6B
  Method: Paired checkpoint comparison
  Evidence collected: Structural scan, behavioural probes, counterfactual revert
  Verification result: 12 / 12 checks passed

Scope of this note

This page summarizes one paired checkpoint comparison and the artifact bundle produced from that comparison.

Open the technical artifact

The PDF contains the full method, thresholds, comparison details, and verifier information for this case study.

Have an artifact?

Verify a receipt, digest, or bundle.
