The shipped policy is a hybrid: the learned ranker scores M candidates, and the top K=12 go into a 2-step lookahead. On every in-distribution surface split, the learned_joint head alone scores below greedy_coverage; for example, v6 learned reaches 0.430 against greedy_coverage's 0.563 on test_locked. The classical lookahead wrapper is what makes it shippable; the learning buys you a smaller, better candidate set, not a standalone planner.
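The shortlist-then-lookahead structure can be sketched in a few lines. This is a minimal illustration, not the shipped code: `hybrid_select`, `two_step_value`, the set-based coverage model, and the `discount` on the second step are all assumptions made for the example, and the real lookahead scores candidates with the project's own extractor rather than set sizes.

```python
import heapq

def two_step_value(first, shortlist, visible, seen, discount=0.5):
    """Immediate new coverage of `first` plus the discounted best follow-up gain.

    `visible` maps a candidate view to the set of instance ids it would observe
    (a toy stand-in for the real extractor). `discount` is an assumption.
    """
    after_first = seen | visible[first]
    gain_now = len(visible[first] - seen)
    gain_next = max(
        (len(visible[c] - after_first) for c in shortlist if c != first),
        default=0,
    )
    return gain_now + discount * gain_next

def hybrid_select(candidates, ranker_score, visible, seen, top_k=12):
    """Keep the top_k candidates by learned score, then choose by 2-step lookahead."""
    shortlist = heapq.nlargest(top_k, candidates, key=lambda c: ranker_score[c])
    return max(shortlist, key=lambda c: two_step_value(c, shortlist, visible, seen))

# Toy data: four candidate views and hypothetical ranker scores.
visible = {"a": {1, 2, 3}, "b": {4, 5}, "c": {6, 7, 8, 9}, "d": {1}}
ranker_score = {"a": 0.9, "b": 0.8, "c": 0.1, "d": 0.5}

pick_k3 = hybrid_select(list(visible), ranker_score, visible, set(), top_k=3)
pick_k4 = hybrid_select(list(visible), ranker_score, visible, set(), top_k=4)
print(pick_k3, pick_k4)
```

In this toy setup the ranker's shortlist decides what the lookahead is allowed to consider: with `top_k=3` view `c` never reaches the lookahead, while with `top_k=4` it does and wins.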
compute_mep_recall originally divided captured instances by the instances visible in the partial cloud, so the partial cloud served as its own ground truth. Observing a single instance therefore scored 1.0, and oracle was inflated 5-50×. After re-anchoring to scene-full GT (scene.instance_class), oracle on gni_model_173 dropped from 1.0 to 0.0093. This bug likely contaminates prior NBV numbers scored against partial reconstructions.
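The denominator problem is easy to reproduce on toy data. The sketch below is not the project's `compute_mep_recall`; `recall` is a hypothetical helper, and the instance counts (100 in the scene, 1 visible) are invented purely to show the effect of the two reference sets.

```python
def recall(captured_ids, reference_ids):
    """Fraction of reference instances that were captured."""
    reference = set(reference_ids)
    if not reference:
        return 0.0
    return len(set(captured_ids) & reference) / len(reference)

# Hypothetical scene: 100 ground-truth instances, of which the partial
# cloud happens to contain exactly one.
scene_full_gt = set(range(100))
visible_in_partial = {7}
captured = {7}

buggy = recall(captured, visible_in_partial)  # partial cloud as its own GT
fixed = recall(captured, scene_full_gt)       # anchored to the full scene

print(buggy, fixed)
```

With the partial cloud as reference, any policy that sees at least what it sees scores perfectly; anchoring to the full scene makes the score reflect how much of the building was actually recovered.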
Three target-engineering experiments all regressed or showed no gain: v5 (surface target), v7_attn (2.6M-param cross-attention head), and v7_k12 (K=2 combined target). One data-diversity experiment won OOD: v7_realsynth (realism-fixed procedural synth). The bottleneck is corpus diversity, not loss formulation.
On ifc-bench OOD (10 scenes across a duplex, an office, and a fire-damaged hospital), v7_realsynth beats shipped v4 by +0.0088 (2.6σ): 10 wins, 0 losses, 40 ties over 50 paired cells, with zero regressions across all three building types. But it loses in-distribution (test_locked hybrid 0.486 vs 0.507 for v6, the leader). A specialist for unseen buildings, not a drop-in replacement.
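A paired comparison of this kind can be sketched as a win/loss/tie tally plus a bootstrap over per-cell differences. The code below is an assumption-laden illustration: `tally` and `paired_bootstrap` are hypothetical helpers, the per-cell scores are made up, and the source does not say how its 2.6σ figure was computed, so no attempt is made to reproduce it.

```python
import random
from statistics import mean

def tally(new, base, tol=1e-9):
    """Count cells where `new` wins, loses, or ties against `base`."""
    wins = sum(1 for x, y in zip(new, base) if x - y > tol)
    losses = sum(1 for x, y in zip(new, base) if y - x > tol)
    return wins, losses, len(new) - wins - losses

def paired_bootstrap(new, base, n_resamples=2000, seed=0):
    """Mean paired difference with a 95% percentile bootstrap interval."""
    rng = random.Random(seed)
    deltas = [x - y for x, y in zip(new, base)]
    means = sorted(
        mean(deltas[rng.randrange(len(deltas))] for _ in deltas)
        for _ in range(n_resamples)
    )
    lo = means[int(0.025 * n_resamples)]
    hi = means[int(0.975 * n_resamples) - 1]
    return mean(deltas), lo, hi

# Made-up scores for 50 paired cells: 10 improve, 40 are unchanged.
base_scores = [0.06] * 50
new_scores = [0.11] * 10 + [0.06] * 40

counts = tally(new_scores, base_scores)
delta, lo, hi = paired_bootstrap(new_scores, base_scores)
print(counts, round(delta, 4), round(lo, 4), round(hi, 4))
```

Pairing matters here because most cells tie: resampling the per-cell differences keeps scene-to-scene variance out of the comparison, which an unpaired test would not.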
| Policy | test_locked (3 residential) | held_out (7 GNI / heatpump) | ifc-bench OOD (10 scenes, 3 buildings) |
|---|---|---|---|
| greedy_coverage classical | 0.563 ±.004 | 0.062 ±.002 | 0.037 ±.007 |
| octomap_ig Bircher 2016 | 0.500 ±.006 | 0.054 ±.002 | 0.026 ±.004 |
| greedy_lookahead_1 2-step, true extractor | 0.494 ±.017 | 0.129 ±.010 | 0.070 ±.006 |
| v4 mep_recall · SHIPPED | 0.541 | 0.147 | 0.063 ±.002 |
| v6 pure pairwise rank | 0.507 ±.016 | 0.129 ±.009 | 0.059 ±.008 |
| v7_attn cross-attention | 0.496 ±.012 | 0.088 ±.008 | 0.067 ±.011 |
| v7_realsynth realism-fixed synth | 0.486 ±.019 | 0.094 ±.009 | 0.071 ±.003 |
| v7_k12 K=2 combined · failed | 0.474 ±.016 | 0.086 ±.011 | 0.054 ±.005 |
| oracle myopic Δ ceiling | 0.572 ±.003 | 0.185 ±.017 | 0.093 ±.006 |
Three target-engineering bets died. v5 (surface-coverage target) regressed below v4. v7_attn (a 2.6M-param cross-attention head, ~10× the parameters) bought no gain (ρ tied v6 at 0.13, in-distribution hybrid 0.496 ≤ v6 0.507). v7_k12 hit a mathematically unavoidable K=2 degeneracy: the pair-coverage target g_star_k2 has a top-2 gap of exactly 0, because pair (i,j) = (j,i) symmetry gives both members of the best pair the same target value, leaving zero gradient to separate them. The combined K2 + 0.1·K1 fix broke the ties (10/10 groups), but eval regressed: ρ 0.10 and ifc-bench hybrid 0.054, losing to v4's 0.063 and v7_realsynth's 0.071.
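The K=2 degeneracy can be shown on a toy set-coverage problem. This sketch assumes a per-candidate pair target defined as the best joint coverage of any pair containing that candidate; the actual definition of g_star_k2 is not given here, so `k2_target`, `k1_target`, and the visibility sets are hypothetical.

```python
from itertools import combinations

def k1_target(visible):
    """Single-view target: how many instances each candidate sees alone."""
    return {c: len(v) for c, v in visible.items()}

def k2_target(visible):
    """Pair target: best joint coverage of any pair containing the candidate."""
    target = {c: 0 for c in visible}
    for i, j in combinations(visible, 2):
        joint = len(visible[i] | visible[j])  # same value for (i, j) and (j, i)
        target[i] = max(target[i], joint)
        target[j] = max(target[j], joint)
    return target

def top2_gap(target):
    """Difference between the best and second-best target values."""
    first, second = sorted(target.values(), reverse=True)[:2]
    return first - second

visible = {"a": {1, 2, 3, 4}, "b": {5, 6}, "c": {1, 5}}

k1 = k1_target(visible)
k2 = k2_target(visible)
combined = {c: k2[c] + 0.1 * k1[c] for c in visible}  # K2 + 0.1*K1

print(top2_gap(k2), top2_gap(combined))
```

Both members of the best pair always inherit that pair's score, so the top two targets tie regardless of the scene. Adding a small single-view term separates them whenever the two members differ in solo coverage, which matches the tie-breaking described above without saying anything about downstream eval quality.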
The user set direction in plain language: "try a+b", "secondary audit", "think about scope", "is this interesting", "more target signal", "match the eval". An orchestrating LLM decomposed each into self-contained briefs for parallel sub-agents that dispatched Modal training, ran paired-bootstrap eval, audited the metric, and committed results with honest SHIP / KEEP / DROP verdicts.