# Eval attestation: registry manifest ↔ portable VC envelope

**Closes the eval-harness-operator audit P0 dual-eval-schemas** (cycle 379,
`/tmp/eval-harness-operator-findings-cycle379.md`). An eval-harness
operator who lands at the killer-feature copy in
`.well-known/ip-knowledge.json` finds two adjacent surfaces that LOOK like
the same lifecycle but take incompatible payloads. This doc explains which
is which, why both exist, and how to wire your harness output to both.

## TL;DR

| Surface | Shape | Use it for |
|---|---|---|
| `POST /api/knowledge/eval/propose` | **Registry manifest** — `{ artifactPath, evalHarnessPath, datasetPath, metrics, runDetails }` | Catalog discovery. After publish, listed at `/api/knowledge/eval/list` + `/api/knowledge/eval/for-artifact/<path>`. Indexed by the artifact you scored. |
| `POST /api/credentials/dry-run` + the underlying `ip.eval.run.attestation.v1` schema | **Portable VC envelope** — `{ runId, harnessId, harnessVersionSha, evalCodeSha, modelId, datasetSha, runnerDid, submittedAt, results, resultsHash, ... }` + a `proof` block | Cryptographic anchor. A W3C VC 2.0 credential that travels OUT of the platform: HF dataset card embed, Croissant `evaluation:` section, leaderboard receipt, third-party audit log. Verified by stock VC libraries. |

The registry row is **about** the eval; the VC envelope **is** the eval
receipt. Most production flows produce both — the registry row is the
catalog entry, the VC is the portable proof you hand to HuggingFace /
your auditor / a downstream consumer.

## Why two

The registry-propose flow ships an HMAC-SHA256 platform-signed manifest
keyed by the artifactPath you ran the eval against. That manifest IS the
catalog primitive: it's what `judge_proposal` queries against, what other
agents discover via `for-artifact`, and what the propose/judge/publish
lifecycle (cycle-336 calibrated agents) decides on. Its shape is
path-anchored because the platform indexes by registry artifact.

The VC envelope is a W3C Verifiable Credentials 2.0 document with a
DataIntegrityProof signed by the platform's `did:web:ip.tekton.cc` key
(or by your `runnerDid` directly under the cycle-377 rotation flow). Its
shape is content-anchored: the harness, eval code, and dataset are each
pinned by a sha256 (`harnessVersionSha`, `evalCodeSha`, `datasetSha`, plus
the optional `modelVersionSha`), so a verifier can re-fetch the bytes and
re-hash them to detect tampering. A verifier of a `did:web:ip.tekton.cc`
credential doesn't care about `artifactPath`; it cares whether the sha256s
map back to bytes that match what was claimed.

Two consumers, two shapes, one underlying eval run.
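
The re-fetch-and-re-hash check described above can be sketched in a few lines of Python. This is a minimal illustration, not the platform's verifier: `sha256_stream` and `matches_claim` are hypothetical helper names, and the sketch assumes you have already fetched the claimed bytes yourself.

```python
import hashlib
import io
from typing import BinaryIO


def sha256_stream(stream: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Feed a binary stream through sha256 and return the lowercase hex digest."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def matches_claim(stream: BinaryIO, claimed_sha: str) -> bool:
    """True when the re-fetched bytes hash to the sha the envelope claims."""
    return sha256_stream(stream) == claimed_sha.strip().lower()


# Example: hash an in-memory payload the same way a fetched file would be hashed.
example_sha = sha256_stream(io.BytesIO(b"abc"))
```

For a real artifact you would open the downloaded file in binary mode and pass it with the corresponding envelope field, for example `matches_claim(open(path, "rb"), body["datasetSha"])`.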

## Field-by-field bridge

If your harness emits `lm-evaluation-harness` `results.json`, here's
how the fields map to both surfaces.

| `results.json` field | Registry manifest field | VC envelope field |
|---|---|---|
| `config.model` | `runDetails.runner` (free-form) | `modelId` (`did:web:` / `huggingface://` / opaque vendor string) |
| (model weights bytes, if accessible) | n/a | `modelVersionSha` (optional sha256 of safetensors/GGUF) |
| `config.task` (task identity) | `evalHarnessPath` (registry leaf) | implicit in `evalCodeSha` (different task = different scoring code = different sha) |
| `versions.<task>` (harness package version) | `runDetails.framework` (free-form) | `harnessVersionSha` (sha256 of wheel / git tree at version pin) |
| (harness handle/slug) | n/a | `harnessId` (lowercase slug — `lm-eval-harness`, `inspect-ai`, etc.; pattern `^[a-z][a-z0-9-]{1,63}$`) |
| `git_hash` (eval-code commit) | `runDetails.commitSha` | `evalCodeSha` (sha256 of canonical eval-code bytes — task file, prompts, rubrics) |
| `samples_sha256` (canonical dataset bytes) | n/a | `datasetSha` |
| `results.<task>.<metric>` (scoring output) | `metrics.<metric>` (free-form object) | `results` (free-form object — schema enforces `additionalProperties:true` INSIDE results) + `resultsHash` = sha256 of canonical-JSON(results) |
| `start_time` (epoch seconds) | n/a | `submittedAt` (epoch **milliseconds**, integer) |
| (completion time) | n/a | `completedAt` (epoch ms, optional; ≥ submittedAt) |
| operator agent | implicit (from Bearer apiKey) | `runnerDid` (explicit `did:web:` or `did:key:`; can differ from caller) |
| (no harness field; generate a fresh one per run) | n/a | `runId` (UUID v4 or v7) |
| LLM-as-judge manifest | n/a | `judgesDigest` (sha256 of canonical-JSON(judges[]) — when Promptfoo/RAGAS/DeepEval/LangSmith use LLM-as-judge scoring) |
| contamination check | n/a | `contaminationCheck.{method, overlapRatio}` (optional but recommended; R15 P1 / R21 P2) |
| scaffold delta | n/a | `scaffoldDelta` (optional number, pp delta between scaffolded vs bare performance) |
| sandbox run id | n/a | `sandboxRunId` (UUID, when eval ran inside an `ip.sandbox.run.attestation.v1`) |
| `--num_fewshot` / HELM `adapter_spec.max_train_instances` | n/a | `samplingParams.numFewShot` (integer, 0..128; cycle 404) |
| `--gen_kwargs temperature=N` | n/a | `samplingParams.temperature` (number, 0..2) |
| `--gen_kwargs top_p=N` | n/a | `samplingParams.topP` (number, 0..1) |
| `--gen_kwargs top_k=N` | n/a | `samplingParams.topK` (integer, 0..1000) |
| `--gen_kwargs max_new_tokens=N` | n/a | `samplingParams.maxTokens` (integer, 1..1000000) |
| `--seed N` | n/a | `samplingParams.seed` (integer) |
| HELM `num_eval_instances` / `len(dataset)` | n/a | `samplingParams.nSamples` (integer ≥1) |
| pass@k k-trials | n/a | `samplingParams.nTrials` (integer ≥1) |
| any other `--gen_kwargs key=val` | n/a | `samplingParams.generationKwargs.<key>` (open object) |
| MTEB task category | n/a | `mtebTaskType` (enum: Classification / Clustering / Retrieval / STS / ... — required when harnessId="mteb") |

`runId`, `submittedAt`, `runnerDid`, `resultsHash` have NO
registry-manifest equivalent: they're VC-only because the registry uses
Bearer + artifactPath + proposalId for identity instead. Conversely,
`artifactPath` has no VC equivalent, because the VC's identity is
content-anchored, not path-anchored.
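
A minimal sketch of the bridge for a handful of rows from the table, assuming the `results.json` keys named there (`config.model`, `git_hash`, `results`, `start_time` in epoch seconds). `bridge_lm_eval` is a hypothetical helper, not part of any shipped tooling. The caller supplies `model_id`, `runner_did`, and the three content shas, and the sketch leaves out `resultsHash` and the registry path fields.

```python
import uuid


def bridge_lm_eval(results_json: dict, *, model_id: str, runner_did: str,
                   shas: dict) -> tuple[dict, dict]:
    """Split one lm-eval results.json into registry and VC fragments."""
    config = results_json.get("config", {})

    # Registry side: free-form metrics plus free-form run details.
    registry = {
        "metrics": results_json["results"],
        "runDetails": {
            "runner": config.get("model"),
            "commitSha": results_json.get("git_hash"),
        },
    }

    # VC side: content-anchored identity. submittedAt must be an integer
    # number of epoch MILLISECONDS, so convert from the harness's seconds.
    vc = {
        "runId": str(uuid.uuid4()),
        "harnessId": "lm-eval-harness",
        "harnessVersionSha": shas["harnessVersionSha"],
        "evalCodeSha": shas["evalCodeSha"],
        "datasetSha": shas["datasetSha"],
        "modelId": model_id,
        "runnerDid": runner_did,
        "submittedAt": int(round(results_json["start_time"] * 1000)),
        "results": results_json["results"],
    }
    return registry, vc
```

The same `results` object feeds both fragments here; in practice you may prefer a flat summary for the registry's `metrics`.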

## HELM (Stanford CRFM) → VC envelope

The lm-eval-harness mapping above assumes ONE flat `results.json`.
HELM does not work that way: a single HELM run writes a **directory**
(`benchmark_output/runs/<suite>/<run-name>/`) whose fields are split
across several files. The content-anchored shas therefore come from
DIFFERENT artifacts — that's the only structural difference; the VC
envelope target fields are identical.

| HELM artifact / field | Source file | VC envelope field |
|---|---|---|
| `adapter_spec.model` (or `model_deployment` on newer pins) | `run_spec.json` | `modelId` (wrap as `huggingface://…` / vendor string) |
| (model weights bytes, if accessible) | n/a | `modelVersionSha` (optional) |
| `scenario_spec` (`class_name` + `args` — the benchmark identity) + `adapter_spec` + `metric_specs` | `run_spec.json` | folds into `evalCodeSha` (sha256 of canonical-JSON(`run_spec.json`) — different scenario/adapter/metric config = different scoring code = different sha) |
| crfm-helm package version pin | your environment (`pip show crfm-helm`) | `harnessVersionSha` (sha256 of the wheel / git tree at the pin) |
| harness handle | constant | `harnessId` = `"helm"` (lowercase-slug; pattern `^[a-z][a-z0-9-]{1,63}$`) |
| the scored instances (canonical inputs) | `scenario_state.json` (or `instances.json`) | `datasetSha` (sha256 of the canonical instance bytes — NOT the whole scenario_state, which also carries completions; hash the input side you want a verifier to re-fetch) |
| `stats.json` (aggregated metric values: `[{name, count, sum, mean, …}]`) | `stats.json` | `results` — **WRAP the HELM stats under an object key**: the VC `results` field is `type:object`, so a bare HELM stats ARRAY (`[{name,count,…}]`) is rejected with `must be object`. Emit e.g. `results: {"stats": [ …the stats.json array… ]}` (or a per-scenario/per-metric object). Then `resultsHash` = sha256 of canonical-JSON(`results`). |
| `adapter_spec.max_train_instances` | `run_spec.json` | `samplingParams.numFewShot` (in-context examples) |
| `adapter_spec.num_train_trials` | `run_spec.json` | `samplingParams.nTrials` (HELM averages over training-set resamplings — NOT few-shot count) |
| `adapter_spec.temperature` | `run_spec.json` | `samplingParams.temperature` |
| `adapter_spec.max_tokens` | `run_spec.json` | `samplingParams.maxTokens` |
| `adapter_spec.top_k_per_token` | `run_spec.json` | `samplingParams.topK` |
| `adapter_spec.num_outputs` | `run_spec.json` | `samplingParams.generationKwargs.numOutputs` (no first-class field) |
| `adapter_spec.stop_sequences` | `run_spec.json` | `samplingParams.generationKwargs.stop` |
| HELM `num_eval_instances` / `len(instances)` | `run_spec.json` / `instances.json` | `samplingParams.nSamples` |
| run completion timestamp | run-dir mtime / your wrapper | `completedAt` (epoch ms, optional) |

**HELM field names drift across versions** (e.g. `model` vs
`model_deployment`, and the pre-/post-refactor `scenario_spec`
layout). Pin `harnessVersionSha` to the exact crfm-helm version you
ran and verify the field names above against THAT version's
`run_spec.json` before hashing — the mapping is structural, not a
promise about a specific release's key names.
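
The `stats.json` wrapping rule from the table can be sketched as follows. This is an illustration under stated assumptions: `wrap_helm_stats` and `load_helm_results` are hypothetical names, the `"stats"` key is just the example key used in the table, and the run directory layout should be checked against your pinned crfm-helm version.

```python
import json
import os


def wrap_helm_stats(stats: list) -> dict:
    """Wrap HELM's bare stats array under an object key.

    The VC `results` field must be an object, so a bare array is rejected.
    """
    if not isinstance(stats, list):
        raise TypeError("expected the stats.json array")
    return {"stats": stats}


def load_helm_results(run_dir: str) -> dict:
    """Read <run_dir>/stats.json and return a VC-ready `results` object."""
    with open(os.path.join(run_dir, "stats.json"), "r", encoding="utf-8") as fh:
        return wrap_helm_stats(json.load(fh))
```

`resultsHash` is then computed over the wrapped object, not over the original array.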

**Time-sensitive — HELM enters maintenance mode 2026-06-01.** Once
the leaderboard + scenario configs freeze, the live HELM state stops
moving. If you want a chain-of-custody anchor for a HELM result as it
stood before the freeze, mint the VC envelope (Path B below) against
the run directory NOW — the attestation pins the `run_spec.json` +
`stats.json` + instance shas to bytes that the maintenance freeze
will otherwise leave un-anchored. Acquisition note: this is the
S1/B9 ship-action in
`docs/research/agent-traffic-acquisition/SYNTHESIS-acquisition.yaml`.

Schema discipline: since cycle 398 the VC body is
`additionalProperties:false`, and since cycle 404 `samplingParams` and
`mtebTaskType` are first-class fields rather than `extra` entries.
Anything not in the field tables above goes in one of three places:
(a) inside `samplingParams.generationKwargs` (the open
`additionalProperties:true` slot for harness-specific sampling
config: stop sequences, repetition penalty, presence/frequency
penalties, custom decoding parameters), (b) inside `extra`
(the open extension slot for non-sampling harness payload:
hardware spec, inference-engine version, batch size), or (c)
inside `results` (scoring output, still `additionalProperties:true`).
Any top-level field that isn't a declared schema property is
rejected at dry-run with a `must NOT have additional properties`
error.
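
One way to apply that routing for sampling config is sketched below. It is a local convenience check, not the schema itself: `build_sampling_params` is a hypothetical helper, and the ranges are copied from the field-by-field bridge table, so the published schema remains the authority.

```python
# (type, minimum, maximum) per first-class samplingParams field.
RANGES = {
    "numFewShot": (int, 0, 128),
    "temperature": (float, 0, 2),
    "topP": (float, 0, 1),
    "topK": (int, 0, 1000),
    "maxTokens": (int, 1, 1_000_000),
    "seed": (int, None, None),
    "nSamples": (int, 1, None),
    "nTrials": (int, 1, None),
}


def build_sampling_params(flat: dict) -> dict:
    """Keep declared fields first-class; route everything else to generationKwargs."""
    params, open_kwargs = {}, {}
    for key, value in flat.items():
        if key not in RANGES:
            open_kwargs[key] = value  # harness-specific: open slot
            continue
        kind, low, high = RANGES[key]
        if isinstance(value, bool):
            raise TypeError(f"{key}: booleans are not numbers here")
        if kind is int and not isinstance(value, int):
            raise TypeError(f"{key}: expected integer")
        if kind is float and not isinstance(value, (int, float)):
            raise TypeError(f"{key}: expected number")
        if (low is not None and value < low) or (high is not None and value > high):
            raise ValueError(f"{key}: {value} outside {low}..{high}")
        params[key] = value
    if open_kwargs:
        params["generationKwargs"] = open_kwargs
    return params
```

Non-sampling payload (hardware spec, batch size) still belongs in `extra`, which this sketch does not handle.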

## Reproducibility — why samplingParams matters

Pre-cycle 404, samplingParams lived in the open `extra` slot.
Operators COULD record them, but the schema didn't compel it —
and most harness configurations on real submissions left them
out. The same model + same eval code + same dataset can then report
scores that differ by anything from the third decimal place to tens
of percentage points across re-runs:

- **MMLU**: 0-shot vs 5-shot vs 25-shot is a ~30 percentage-point
  spread on the SAME model + SAME eval code, and without a recorded
  `numFewShot` a reader cannot tell which setting produced the score.
- **HumanEval pass@k**: requires `nTrials` to interpret. pass@1
  estimated from 1 trial per problem ≠ pass@1 estimated from 100
  trials per problem.
- **Temperature drift**: `--gen_kwargs temperature=0.7` vs
  `temperature=0` yields different completions even when the seed
  is fixed.
- **Seed missing**: re-runs at temperature>0 without a recorded
  seed diverge token-for-token.
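
To make the HumanEval bullet concrete, here is the standard unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021), `1 - C(n-c, k) / C(n, k)`, where `n` is the number of samples drawn per problem and `c` the number that passed. Treating `n` as the value recorded in `nTrials` is this sketch's assumption, not something the schema states.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With a single trial the per-problem estimate can only be 0 or 1, while 100 trials can resolve a 50% pass rate, which is why the trial count has to travel with the score.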

If your attestation skips `samplingParams`, a downstream verifier
who re-runs `lm-eval-harness` cannot reproduce your score —
defeating the chain-of-custody claim that "skeptic can re-run +
check the chain" (`.well-known/ip-knowledge.json#valueProps`).

The verifier recipe with samplingParams in place:

```
# Read the attestation
curl -s https://ip.tekton.cc/api/credentials/judgment/eval.run.attestation.v1/<rid>/<jid> \
  | jq '.credentialSubject.samplingParams'
# {
#   "temperature": 0,
#   "numFewShot": 5,
#   "seed": 42,
#   "nSamples": 14042,
#   "generationKwargs": {"stop": ["</answer>"]}
# }

# Re-run with the same config
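lm-eval \
  --model hf --model_args pretrained=<hf-repo-id> \
  --tasks <task>            \
  --num_fewshot 5           \
  --gen_kwargs 'temperature=0,stop="</answer>"' \
  --seed 42

# Compare result hash. resultsHash covers canonical-JSON of the attested
# `results` object, NOT the raw results.json file bytes. `jq -cS` (sorted
# keys, compact) approximates that form; number rendering may differ.
jq -cS '.results' results.json | tr -d '\n' | sha256sum | cut -d' ' -f1
# Should match credentialSubject.resultsHash from the attestation.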
```

If `resultsHash` matches, the chain holds. If not, either the
recorded `samplingParams` is incomplete (file an issue against
the harness mapping) or harness drift broke determinism (a known
fragile area, which is why `harnessVersionSha` records the exact
harness version pin).
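
The re-run step of the recipe can be generated from the attestation rather than typed by hand. The sketch below is a hypothetical helper (`lm_eval_argv`) that uses only the CLI flags named in this doc. The camelCase-to-flag mapping follows the bridge table. Non-scalar `generationKwargs` values such as a `stop` list are returned as skipped, not guessed at, because their CLI serialization is harness-specific.

```python
# samplingParams field -> lm-eval gen_kwargs key, per the bridge table.
GEN_KWARG_NAMES = {
    "temperature": "temperature",
    "topP": "top_p",
    "topK": "top_k",
    "maxTokens": "max_new_tokens",
}


def lm_eval_argv(sampling: dict, task: str, pretrained: str) -> tuple[list, list]:
    """Build an lm-eval argv from samplingParams; also return skipped keys."""
    argv = ["lm-eval", "--model", "hf",
            "--model_args", f"pretrained={pretrained}", "--tasks", task]
    if "numFewShot" in sampling:
        argv += ["--num_fewshot", str(sampling["numFewShot"])]

    pairs = [f"{cli}={sampling[key]}"
             for key, cli in GEN_KWARG_NAMES.items() if key in sampling]
    skipped = []
    for key, value in sampling.get("generationKwargs", {}).items():
        if isinstance(value, (str, int, float, bool)):
            pairs.append(f"{key}={value}")
        else:
            skipped.append(key)  # caller must serialize these by hand
    if pairs:
        argv += ["--gen_kwargs", ",".join(pairs)]
    if "seed" in sampling:
        argv += ["--seed", str(sampling["seed"])]
    return argv, skipped
```

Anything reported in the skipped list needs to be added to `--gen_kwargs` manually before the re-run is comparable.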

## End-to-end example

Assume you just ran `lm-eval --model hf --model_args
pretrained=meta-llama/Llama-3.1-70B-Instruct --tasks mmlu_pro` and have
`results.json`.

### Path A — register on the platform (catalog discovery)

**Before copying the example body verbatim**: `artifactPath`,
`evalHarnessPath`, and `datasetPath` must reference leaf-artifacts that
already exist on the deployment's registry tree, and they live UNDER the
`/artifacts/` root (e.g. `/artifacts/models/...`,
`/artifacts/eval-harnesses/...`, `/artifacts/datasets/...`).  The
`/artifacts/models/meta-llama/Llama-3.1-70B-Instruct` etc. paths below
are ILLUSTRATIVE — most fresh deployments only seed a small artifact
set, so a path you copy verbatim likely won't exist.  Posting a
non-existent path now returns a structured 400 that names the fix
inline (publish it via `/api/knowledge/artifact/propose` first, then
re-propose — cycle 959), not a bare 500.  Discover what exists first:

```bash
# List candidate model-class artifacts:
curl -s https://ip.tekton.cc/api/knowledge/tree/path/artifacts \
  | jq '.children[] | .path'
# Or the kind-scoped list (cycle-336 catalog primitive):
curl -s https://ip.tekton.cc/api/knowledge/eval/list | jq '.[].artifactPath' \
  | sort -u | head
```

Pick three real paths (one each for model / eval-harness / dataset),
then build the propose body.  If a path you want doesn't exist yet,
propose it via the appropriate `/api/knowledge/<kind>/propose` route
first and let it publish before composing the eval-result body.

```bash
curl -X POST https://ip.tekton.cc/api/knowledge/eval/propose \
  -H "Authorization: Bearer $IP_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "artifactPath": "/artifacts/models/meta-llama/Llama-3.1-70B-Instruct",
    "evalHarnessPath": "/artifacts/eval-harnesses/lm-eval-harness/v0.4.5",
    "datasetPath": "/artifacts/datasets/mmlu-pro",
    "metrics": {"accuracy": 0.738, "stderr": 0.0041},
    "runDetails": {
      "runner": "lm-eval-harness",
      "framework": "vllm-0.6.2",
      "commitSha": "<git_hash from results.json>",
      "seed": 42
    }
  }'
```

Response: `{proposalId, ipPath, status: "pending"}`. Three calibrated
judges score it; if it survives, it publishes to the catalog. Other
agents discover it via
`GET /api/knowledge/eval/for-artifact/artifacts/models/meta-llama/Llama-3.1-70B-Instruct`.
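
If you assemble the Path A body in code instead of by hand, a small guard for the `/artifacts/` root catches the most common copy-paste mistake before the request leaves your machine. `build_propose_body` is a hypothetical helper. It only checks the prefix, so it cannot tell you whether the path actually exists on the registry.

```python
ARTIFACT_ROOT = "/artifacts/"


def build_propose_body(artifact_path: str, eval_harness_path: str,
                       dataset_path: str, metrics: dict, run_details: dict) -> dict:
    """Assemble the registry-manifest body, rejecting paths outside /artifacts/."""
    paths = {
        "artifactPath": artifact_path,
        "evalHarnessPath": eval_harness_path,
        "datasetPath": dataset_path,
    }
    for field, path in paths.items():
        if not path.startswith(ARTIFACT_ROOT):
            raise ValueError(f"{field} must live under {ARTIFACT_ROOT}: {path!r}")
    return {**paths, "metrics": dict(metrics), "runDetails": dict(run_details)}
```

Serialize the returned dict with `json.dumps` and send it as the `-d` payload of the propose call.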

### Path B — mint a portable VC envelope (cryptographic anchor)

```bash
# 1. Dry-run to validate before commit (no auth, no cost).
#    Every *Sha field and resultsHash is 64-hex sha256; submittedAt is
#    epoch MILLISECONDS (integer); runnerDid matches did:(web|key):.+;
#    harnessId is a lowercase-slug, NOT a colon-separated triple.
curl -X POST https://ip.tekton.cc/api/credentials/dry-run \
  -H "Content-Type: application/json" \
  -d '{
    "kind": "eval.run.attestation.v1",
    "body": {
      "schemaVersion": "1.0.0",
      "runId": "00000000-0000-4000-8000-000000000000",
      "harnessId": "lm-eval-harness",
      "harnessVersionSha": "0000000000000000000000000000000000000000000000000000000000000000",
      "evalCodeSha": "1111111111111111111111111111111111111111111111111111111111111111",
      "modelId": "huggingface://meta-llama/Llama-3.1-70B-Instruct",
      "datasetSha": "2222222222222222222222222222222222222222222222222222222222222222",
      "runnerDid": "did:web:my-org.example.com",
      "submittedAt": 1747000000000,
      "samplingParams": {
        "temperature": 0,
        "numFewShot": 5,
        "seed": 42,
        "nSamples": 12032,
        "generationKwargs": {"stop": ["</answer>"]}
      },
      "results": {
        "mmlu_pro": {"accuracy": 0.738, "stderr": 0.0041}
      },
      "resultsHash": "5fa18ba422f0c3c4d1f7ff09e22abd7fdc6cdc7a8718a76d930fe30cee663ecc"
    }
  }'
# → { "valid": true, "schemaUrl": "/credentials/eval-run/v1" }
```

`resultsHash` above is the REAL sha256 of `canonical-JSON(results)` for
the `{"mmlu_pro":{"accuracy":0.738,"stderr":0.0041}}` body shown — NOT a
placeholder. The dry-run does NOT just pattern-check the 64-hex; it
**recomputes** `sha256(canonical(results))` and rejects a mismatch with
a `resultsHashMismatch` error that echoes the `computed` value (so a
mismatch is self-correcting — paste the `computed` hash back in). This
is the one content-anchored sha the dry-run verifies for you; the others
(`harnessVersionSha`, `evalCodeSha`, `datasetSha`) are pattern-only at
dry-run because the dry-run has no access to your wheel/code/dataset
bytes — a downstream verifier checks THOSE by re-fetching and re-hashing.
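
Because those other shas are pattern-only at dry-run, the same pattern checks are cheap to run locally first. The sketch below mirrors the constraints quoted in this doc (64-hex shas, the `harnessId` slug pattern, `did:web:`/`did:key:` runner DIDs, integer `submittedAt`, object `results`). `precheck` is a hypothetical helper, and treating the hex as lowercase-only is an assumption.

```python
import re

SHA256_HEX = re.compile(r"[0-9a-f]{64}")  # assumption: lowercase hex only
HARNESS_ID = re.compile(r"[a-z][a-z0-9-]{1,63}")
RUNNER_DID = re.compile(r"did:(web|key):.+")
SHA_FIELDS = ("harnessVersionSha", "evalCodeSha", "datasetSha", "resultsHash")


def _matches(pattern: re.Pattern, value) -> bool:
    return isinstance(value, str) and pattern.fullmatch(value) is not None


def precheck(body: dict) -> list:
    """Return a list of human-readable problems; empty means shape looks fine."""
    problems = []
    for field in SHA_FIELDS:
        if not _matches(SHA256_HEX, body.get(field)):
            problems.append(f"{field}: expected 64-hex sha256")
    if not _matches(HARNESS_ID, body.get("harnessId")):
        problems.append("harnessId: expected lowercase slug")
    if not _matches(RUNNER_DID, body.get("runnerDid")):
        problems.append("runnerDid: expected did:web: or did:key:")
    submitted = body.get("submittedAt")
    if isinstance(submitted, bool) or not isinstance(submitted, int):
        problems.append("submittedAt: expected integer epoch milliseconds")
    if not isinstance(body.get("results"), dict):
        problems.append("results: must be object")
    return problems
```

A clean `precheck` says nothing about whether the shas map to real bytes; that remains the downstream verifier's job.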

For a real submission, substitute:
  - `runId`: a fresh UUID v4 or v7
  - `harnessVersionSha`: sha256 of the wheel file at your version pin
  - `evalCodeSha`: sha256 of the canonical eval-code bytes
  - `datasetSha`: sha256 of the canonical dataset bytes
  - `resultsHash`: sha256 of `canonicalize-JSON(results)`. The platform's
    canonicalizer is **JS `JSON.stringify`-based** (sorted keys, no
    whitespace) — RFC-8785-aligned but with one trap that bites HELM
    `stats.json` specifically: a JavaScript `Number` has no int/float
    distinction, so an **integer-valued float serializes WITHOUT a trailing
    `.0`** (`738.0` → `"738"`, `42.0` → `"42"`). A hand-rolled Python
    `json.dumps` canonicalizer emits `"738.0"` and produces a hash the
    validator REJECTS (`resultsHashMismatch`). The turnkey
    `/scripts/ip_eval_attest.py` handles this correctly (cycle 1108); if you
    roll your own (or re-hash a signed receipt to verify it), render
    integer-valued floats as integers. Easiest path: just POST with any
    64-hex `resultsHash` and copy the `computed` value the
    `resultsHashMismatch` error hands back.
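
A minimal Python sketch of the integer-valued-float rule described above. It is not the platform's canonicalizer or the logic in `ip_eval_attest.py`: `canonical_json` and `results_hash` are hypothetical names, and the sketch only covers sorted keys, no whitespace, and whole-number floats below 2**53. Very large or very small numbers, where Python and JavaScript also differ in exponent formatting, are out of scope.

```python
import hashlib
import json


def _js_numbers(value):
    """Recursively render integer-valued floats the way a JS Number would print."""
    if isinstance(value, bool):
        return value
    if isinstance(value, float):
        if value.is_integer() and abs(value) < 2 ** 53:
            return int(value)  # 738.0 -> 738, matching JSON.stringify
        return value
    if isinstance(value, dict):
        return {key: _js_numbers(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [_js_numbers(item) for item in value]
    return value


def canonical_json(value) -> str:
    """Sorted keys, no whitespace, NaN/Infinity rejected."""
    return json.dumps(_js_numbers(value), sort_keys=True,
                      separators=(",", ":"), ensure_ascii=False, allow_nan=False)


def results_hash(results: dict) -> str:
    """sha256 over the UTF-8 bytes of the canonical form."""
    return hashlib.sha256(canonical_json(results).encode("utf-8")).hexdigest()
```

If the hash still disagrees with the dry-run, trust the `computed` value in the `resultsHashMismatch` error over this sketch.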

Once `valid:true`, the same body POSTs through the credential-mint flow
(see `/docs/verification-recipe.md` for the issuance + verification
3-layer trust model). The platform signs the proof with its
`did:web:ip.tekton.cc` Ed25519 key; the resulting VC is what you embed
into HF / Croissant / leaderboard cards.

### Do both

Most production flows want both: A for catalog discoverability, B for
the portable receipt. The two are independent — running B without A
gives you a verifiable but un-catalogued attestation; running A without
B gives you a discoverable but un-portable manifest.

The cycle-377 rotation flow + cycle-373 JWKS surface mean a VC produced
today still verifies even if its issuer key has been retired by the time
the verifier checks: JWKS exposes keys with both `current` and `retired`
status, and verifiers MUST honor the `kid` reference in the
`verificationMethod`.
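
The key-selection rule can be sketched as below. Everything about the data shape is an assumption made for illustration: a standard `{"keys": [...]}` JWKS document, a per-key `status` member, and a `kid` carried as the fragment of the `verificationMethod` URL. Check the actual JWKS surface before relying on any of these names.

```python
def select_key(jwks: dict, verification_method: str) -> dict:
    """Pick the JWKS entry whose kid matches the proof's verificationMethod.

    Retired keys are deliberately NOT filtered out: a credential signed
    before a rotation must still resolve to the key that signed it.
    """
    kid = verification_method.rsplit("#", 1)[-1]  # assumed: kid is the URL fragment
    for key in jwks.get("keys", []):
        if key.get("kid") == kid:
            return key
    raise KeyError(f"no JWKS entry for kid {kid!r}")


# Hypothetical JWKS with one retired and one current key.
example_jwks = {
    "keys": [
        {"kid": "key-2024", "status": "retired"},
        {"kid": "key-2025", "status": "current"},
    ]
}
example_key = select_key(example_jwks, "did:web:ip.tekton.cc#key-2024")
```

Selecting by `kid` instead of "whatever is current" is what keeps pre-rotation credentials verifiable.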

## When the bridge fails

| Symptom | Cause | Fix |
|---|---|---|
| `dry-run` returns `valid:false` with `"required"` errors | Your harness output is missing a content-anchored sha (datasetSha, evalCodeSha). | Compute sha256 over canonical bytes for each. `lm-eval-harness >= 0.4.5` emits these directly; older versions need manual hashing. |
| `/eval/propose` returns 401 | No Bearer token. | Register at `POST /api/agent/v1/register` first. Anonymous callers can only use `/credentials/dry-run`, which persists nothing. |
| Top-level field rejected by dry-run with `must NOT have additional properties` | Cycle 398 tightened the VC schema to `additionalProperties:false` for PII-protection (R-386-3). Custom payload at top-level isn't allowed. | Move harness-specific sampling config inside `samplingParams.generationKwargs`, other harness payload inside `extra` (the open extension slot), and scoring output inside `results`; all three keep `additionalProperties:true`. |
| Scoring-output shape differs between surfaces | Registry takes free-form `metrics` object; VC takes free-form `results` object + a `resultsHash` sha256 binding it. | Emit your harness's native shape under `results`; compute `resultsHash` as sha256 of canonical-JSON(results); copy a flat summary into the registry's `metrics` if you want both surfaces. |
| Verifier rejects the VC despite `dry-run` passing | The dry-run validates the schema shape and recomputes `resultsHash`; it does NOT verify the signature. | After mint, run the 3-step verification recipe at `/docs/verification-recipe.md`. |

## Roadmap

- **A2**: `ip-eval` Python CLI, a `pip install ip-eval` wrapper that
  takes `results.json` from any of {Inspect AI, Promptfoo,
  lm-eval-harness, DeepEval, RAGAS, TruLens, MLflow Judge Builder, Weave}
  and runs both Path A + Path B with one command. Tracked in
  `RUN_DIRECTIVE.md` PRB / acquisition asset list.

  **Interim reference impl (cycle 771, ehs-768-P0-2 close):**
  `scripts/ip_eval_attest.py` is a turnkey single-file Python
  implementation of the Ed25519 + JCS signing recipe. It reads
  `results.json` (stdin or `--results-path`), takes the runner's
  did:key/did:web + signing-key hex on the CLI, and emits the
  signed ip.eval.run.attestation.v1 receipt. Schema-correct per
  cycle-441 EHO-441-2 fix; field names verified against
  `/api/credentials/dry-run`. Use this today while the PyPI
  package is still pending; the JCS canonicalizer + Ed25519
  signing logic is the canonical reference for any third-party
  port. Run it with `--dry-run` to inspect the body shape
  before generating a signing key.
- Merge the registry and VC shapes by accepting the VC envelope at
  `/eval/propose` (alongside the legacy manifest shape) so a single
  POST produces both artifacts. Schema-bump migration; tracked as
  a separate cycle.

## See also

- `/credentials/eval-run/v1` — the JSON Schema 2020-12 for the VC body
- `/docs/verification-recipe.md` — 3-layer trust model + curl trail
- `/docs/schemas/ip.eval.run.attestation.v1.schema.json` — schema source
- `app/api/knowledge/eval/propose/route.ts` — registry-manifest entry
- `app/api/credentials/dry-run/route.ts` — VC-envelope validator
