Experiments ledger: every run, dated, with honest verdicts

7 model versions + 6 critical-path fixes · 2026-05 → 06 · all on the identical frozen eval

The arc: v3 → v4 → v5 → v6 → v7

This is the chronological record of every checkpoint we trained and every methodology fix that changed a number. The shape of the project is a single, sobering throughline: every time we engineered the supervision target (surface coverage, cross-attention, pair-coverage), the model regressed or showed no gain. The only intervention that moved out-of-distribution recall was making the synthetic corpus more realistic. The bottleneck is corpus diversity, not loss design. Below: the version arc, the ledger with verdicts, the six fixes that were load-bearing, and the honesty machinery that let us trust (and discard) results.

All recall numbers are surface-coverage recall@K=1, 5 seeds, n_cand=20, hybrid_top_k=12, mean ± SE across seeds, on three frozen splits: test_locked (3 residential), held_out (7 GNI/heatpump), ifc-bench OOD (10 scenes across a duplex, an office, a fire-damaged hospital). Hybrid = the shipped policy: the learned ranker scores all candidates, takes the top-12, then runs a 2-step lookahead on those 12. The learned ranker alone never beats greedy_coverage. The classical wrapper is what ships.
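The two-stage hybrid selection described above can be sketched as follows. This is a minimal illustration, not the shipped LearnedPlusLookaheadBaseline: the function name `hybrid_select` and both scoring callables are hypothetical, and the 2-step lookahead is abstracted into a single `lookahead_value` callable that is assumed to return the estimated coverage gain of taking a view and then the best follow-up view.

```python
from typing import Callable, Hashable, Sequence


def hybrid_select(
    candidates: Sequence[Hashable],
    learned_score: Callable[[Hashable], float],
    lookahead_value: Callable[[Hashable], float],
    top_k: int = 12,
) -> Hashable:
    """Pick one view: learned shortlist first, classical lookahead second.

    Both callables are placeholders (assumptions for this sketch); the
    real ranker and lookahead are not shown in the source.
    """
    if not candidates:
        raise ValueError("need at least one candidate view")
    # Stage 1: the learned ranker scores every candidate; keep the top_k.
    shortlist = sorted(candidates, key=learned_score, reverse=True)[:top_k]
    # Stage 2: the lookahead re-scores only the shortlist and picks the winner.
    return max(shortlist, key=lookahead_value)


# Toy usage with n_cand=20: the ranker prefers low ids, the lookahead
# prefers high ids, so the winner is the best lookahead view that
# survived the learned shortlist.
views = list(range(20))
chosen = hybrid_select(
    views,
    learned_score=lambda v: -v,
    lookahead_value=lambda v: v,
)
```

In the toy run the lookahead would prefer view 19, but it can only choose among the 12 views the ranker kept. That is the design trade-off: the learned model prunes, and the classical wrapper makes the final call.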

Where each version landed

The chart plots hybrid surface recall on the ifc-bench OOD split, the hardest, most decision-relevant cut, since it is the only split with a building type held out entirely. The oracle (myopic Δ ceiling) is the dashed line; the classical OctoMap-IG info-gain baseline (Bircher et al., ICRA 2016) is the floor we have to clear. v7_realsynth is the only version that beats the shipped v4 here, and it does it by fixing data, not the target.

hybrid surface recall@K=1 · ifc-bench OOD (10 scenes, 3 buildings) · 5 seeds, mean±SE · ★ = SHIP · dashed = oracle ceiling 0.093 · red = OctoMap-IG 0.026

How the bet split: data vs. target

Read the versions as two competing hypotheses about why the learned ranker plateaus. The target-engineering camp said the supervision signal was too coarse, so it tried richer targets (surface coverage, K=2 pair-coverage) and a richer head (cross-attention). The data camp said the model had already saturated on a thin corpus, so it added realism-fixed synthetic scenes. Three target experiments lost; the one data experiment won.

target hypothesis · LOST 0/3

v5 (Δ surface-coverage target), v7_attn (2.6M-param cross-attention head), v7_k12 (K=2 + 0.1·K1 combined target). All regressed or showed no gain on hybrid OOD. Spearman ρ peaked at 0.13 and the eval did not improve with it.

data hypothesis · WON 1/1

v7_realsynth: 5 P0 realism fixes to the procedural synth + 30 new scenes. ρ 0.156 (best of any version), beats v4 on ifc-bench OOD by +0.0088 (2.6σ), 0 regressions across all 3 buildings.

the shipped policy

v4 (mep_recall target, PC-NBV-style joint head + ranking loss) is the default checkpoint: hybrid 0.541 on test_locked, 0.147 on held_out. v7_realsynth ships alongside it as the OOD specialist.

the metric ceiling

OOD recall is low in absolute terms (the oracle reaches only 0.093) because the buildings are genuinely unseen and the ground truth (GT) is scene-full. The interesting quantity is the gap to oracle and the margin over OctoMap-IG, not the raw number.
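One way to read the OOD numbers relative to the floor and ceiling is to normalize each recall into the fraction of the OctoMap-IG-to-oracle gap it closes. This normalization is an illustration, not a metric the ledger reports. The oracle (0.093), OctoMap-IG (0.026), and v4 (0.063) values come from this page; the v7_realsynth value is derived here as v4 plus the reported paired gain of +0.0088, so it inherits the rounding of both inputs.

```python
def fraction_of_gap_closed(recall: float, floor: float, ceiling: float) -> float:
    """Share of the floor-to-ceiling gap that a given recall covers."""
    if ceiling <= floor:
        raise ValueError("ceiling must be strictly above floor")
    return (recall - floor) / (ceiling - floor)


ORACLE = 0.093        # myopic oracle ceiling on ifc-bench OOD
OCTOMAP_IG = 0.026    # classical info-gain floor
V4 = 0.063            # shipped default, hybrid recall
V7_REALSYNTH = V4 + 0.0088  # derived from the paired gain, not read off a table

v4_closed = fraction_of_gap_closed(V4, OCTOMAP_IG, ORACLE)
v7_closed = fraction_of_gap_closed(V7_REALSYNTH, OCTOMAP_IG, ORACLE)

print(f"v4 closes {v4_closed:.2f} of the gap, v7_realsynth {v7_closed:.2f}")
```

Under these rounded inputs, v4 covers roughly 0.55 of the gap and v7_realsynth roughly 0.68, which reads more informatively than raw recalls that all sit below 0.1.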

The full ledger

The six critical-path fixes

These aren't model versions. They're methodology fixes, each of which changed a headline number (and one of which invalidated an entire prior result). They are why the recall numbers above are trustworthy.

The throughline

Plot the interventions by what they touched. Every target-engineering move failed: v5's surface target regressed on every cell; v7_attn's cross-attention helped the pure-learned argmax on test_locked but regressed hybrid everywhere else for 10× the parameters; v7_k12's pair-coverage target hit a mathematically unavoidable symmetry degeneracy, and the combined fix broke ties but lost the eval (ρ 0.10, ifc-bench hybrid 0.054 vs. v4's 0.063). Only the data-diversity move won: v7_realsynth's realism-fixed synth lifted training Spearman to 0.156 and was the sole version to beat v4 on OOD. The honest reading is that the regression-on-rollout-deltas target has a noise floor set by corpus diversity, not loss design: you cannot squeeze more signal out of a thin corpus by reshaping the loss, and richer targets just overfit a target that the data can't support. The lever is more, more-varied buildings.
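For readers unfamiliar with the ρ quoted above, here is a small standard-library sketch of Spearman rank correlation between a ranker's predicted scores and the realized rollout deltas for a set of candidate views. How the project aggregates ρ (per scene, per step, or pooled) is not stated on this page, so this shows only the per-list computation. The function names are illustrative.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(pred: Sequence[float], target: Sequence[float]) -> float:
    """Pearson correlation of the two rank vectors."""
    if len(pred) != len(target) or len(pred) < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    rp, rt = average_ranks(pred), average_ranks(target)
    mp, mt = sum(rp) / len(rp), sum(rt) / len(rt)
    cov = sum((a - mp) * (b - mt) for a, b in zip(rp, rt))
    var_p = sum((a - mp) ** 2 for a in rp)
    var_t = sum((b - mt) ** 2 for b in rt)
    if var_p == 0 or var_t == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / sqrt(var_p * var_t)
```

A ρ near 0.1 to 0.16 means the predicted ordering is only weakly aligned with the realized ordering. That is consistent with the page's reading that the target is noisy and that reshaping the loss cannot recover signal the corpus does not contain.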

Anti-pollution / honesty guarantees

Half of what made this trustworthy is what we refused to claim. Six guarantees enforced on every run:

① 5-seed paired CIs

Every headline diff is a paired difference across seeds {0..4} with a standard error, not a single-seed draw. The v7_realsynth OOD win is +0.0088 ± 0.0035: a 2.6σ effect, not a lucky cut.

② per-cell win/loss/tie

We report not just the mean but the count over every (scene × seed) cell. v7_realsynth over v4 on OOD: 10 wins / 0 losses / 40 ties across 50 paired cells. The win is real and never reverses.

③ the v4-never-eval'd catch

The exciting OOD win only counted once we caught that v4 had never been evaluated on ifc-bench: the comparison was apples-to-oranges. We re-ran v4 on the same scenes (0.063) before claiming the +2.6σ. Match the eval before you believe the result.

④ reproduce-or-it-didn't-happen

No number is reported until it reproduces from a committed script: per-step recall curves re-aggregated from scratch, not read off an agent transcript. The metric pathology was caught precisely because the inflated oracle would not reproduce against scene-full ground truth.

⑤ frozen test_locked split

test_locked (3 residential scenes) is frozen in code: no hyperparameter was ever tuned against it. It is the in-distribution anchor that every version is measured against on the identical protocol.

⑥ honest SHIP/MEH/DROP

Sub-agents returned verdicts, not hype. Three target-engineering experiments that looked exciting (v5, v7_attn, v7_k12) were tagged DROP under the same CI discipline that validated v7_realsynth. The graveyard is the point.
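Guarantees ① and ② can be sketched together. The code below assumes per-cell recalls keyed by (scene, seed), averages the paired difference within each seed, and takes the standard error across seeds; the page does not spell out the exact aggregation, so treat that choice, the function names, and the tie tolerance as assumptions.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Dict, Tuple

Cell = Tuple[str, int]  # (scene_id, seed); key layout is an assumption


def paired_seed_diff(a: Dict[Cell, float], b: Dict[Cell, float]) -> Tuple[float, float]:
    """Mean and standard error of (a - b), paired by cell, aggregated per seed."""
    if a.keys() != b.keys():
        raise ValueError("both runs must cover the same (scene, seed) cells")
    seeds = sorted({seed for _, seed in a})
    if len(seeds) < 2:
        raise ValueError("need at least two seeds for a standard error")
    per_seed = [
        mean(a[c] - b[c] for c in a if c[1] == s)  # scene-averaged diff for seed s
        for s in seeds
    ]
    return mean(per_seed), stdev(per_seed) / sqrt(len(per_seed))


def win_loss_tie(a: Dict[Cell, float], b: Dict[Cell, float], tol: float = 1e-9) -> Tuple[int, int, int]:
    """Count cells where a beats b, loses to b, or ties within tol."""
    wins = losses = ties = 0
    for cell in a:
        d = a[cell] - b[cell]
        if d > tol:
            wins += 1
        elif d < -tol:
            losses += 1
        else:
            ties += 1
    return wins, losses, ties


# Toy data (fabricated for illustration only): 2 scenes x 3 seeds.
baseline = {(s, k): 0.1 for s in ("s0", "s1") for k in range(3)}
candidate = dict(baseline)
for k, gain in enumerate((0.01, 0.02, 0.03)):
    candidate[("s0", k)] += gain  # the candidate only improves scene s0

diff_mean, diff_se = paired_seed_diff(candidate, baseline)
counts = win_loss_tie(candidate, baseline)
```

Reporting the count next to the mean matters because a positive mean can hide reversals. A result like 10 wins / 0 losses / 40 ties says the gain is concentrated in a few cells but never goes the other way.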

Verdicts: SHIP = adopted (default or specialist) · MEH = real but ties the incumbent · DROP = regressed / no-gain / off-domain. Eval held fixed across the study: surface-coverage recall@K=1 · 5 seeds · n_cand=20 · hybrid_top_k=12 · mean ± SE. Splits: test_locked (3 residential) · held_out (7 GNI/heatpump) · ifc-bench OOD (10 scenes, 3 buildings). Hybrid = LearnedPlusLookaheadBaseline. Classical baseline: OctoMap-IG (Bircher et al., ICRA 2016). Corpus: 297 manifest scenes plus IFC-Bench v2 (sylvainHellin/ifc-bench, CC-BY, 93 sub-scenes).