mep

Scoreboard: the whole result in one screen

the at-a-glance TL;DR · every block expanded elsewhere (links inline) · surface-coverage recall@K=1 · 5 seeds · n_cand=20 · hybrid_top_k=12 · mean ± SE across seeds

Headline

Hybrid surface · test_locked

0.541

Learned top-12 + 2-step lookahead. This is the shipped policy; the learned ranker alone sits below greedy_coverage.

Hybrid over OctoMap-IG

+5.4σ

vs the classical info-gain planner (Bircher 2016): +5.4σ on test_locked · +5.0σ on held_out · +5.6σ on OOD.

v7_realsynth · ifc-bench OOD

0.071

vs v4's 0.063 → +0.009 (2.6σ). The best version on unseen building types: the OOD win.

Best training Spearman ρ

0.156

v7_realsynth, the realism-fixed synth corpus. Ranking signal is weak everywhere: the metric is hard.

Four findings

1 · Composition wins, not the model alone · SHIPS

The shipped policy is a hybrid: the learned ranker scores M candidates, the top-K=12 go into a 2-step lookahead. On every in-distribution surface split the learned_joint head alone is below greedy_coverage, e.g. v6 learned 0.430 vs greedy_coverage 0.563 on test_locked. The classical lookahead wrapper is what makes it ship; the learning buys you a smaller, better candidate set, not a standalone planner.
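The two-stage composition described above can be sketched as follows. This is a minimal illustration, not the project's LearnedPlusLookaheadBaseline: the names `hybrid_select`, `two_step_value`, and `visible` are hypothetical, coverage is modelled as plain sets of surface-patch ids, and the follow-up view in the lookahead is assumed to range over all candidates.

```python
from typing import Callable, Dict, Sequence, Set


def two_step_value(first: str, candidates: Sequence[str],
                   visible: Dict[str, Set[int]]) -> int:
    """Coverage of `first` plus the best follow-up view's additional coverage."""
    seen = visible[first]
    best_next = max(
        (len(visible[c] - seen) for c in candidates if c != first),
        default=0,
    )
    return len(seen) + best_next


def hybrid_select(candidates: Sequence[str],
                  learned_score: Callable[[str], float],
                  visible: Dict[str, Set[int]],
                  top_k: int = 12) -> str:
    """Shortlist by learned score, then pick the shortlist view with the best lookahead."""
    # Stage 1: the learned ranker orders all M candidates, highest score first.
    shortlist = sorted(candidates, key=learned_score, reverse=True)[:top_k]
    # Stage 2: only the shortlist pays for the 2-step lookahead.
    return max(shortlist, key=lambda v: two_step_value(v, candidates, visible))


# Toy scene (illustrative numbers only): four candidate views and the patches each sees.
visible = {"a": {1, 2, 3}, "b": {3, 4}, "c": {5, 6, 7, 8}, "d": {1}}
scores = {"b": 0.9, "a": 0.8, "d": 0.5, "c": 0.1}

learned_only = max(visible, key=scores.get)                       # ranker alone
hybrid = hybrid_select(list(visible), scores.get, visible, top_k=2)
print(learned_only, hybrid)
```

In this toy case the ranker alone would pick view `b`, while the lookahead over the ranker's top-2 shortlist prefers `a`, which is the division of labour the finding describes: the ranker narrows the set, the classical lookahead makes the final call.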

2 · Metric pathology: oracle inflated 5-50× · FIXED

compute_mep_recall originally divided captured instances by the instances visible in the partial cloud, using the partial cloud as its own ground truth. Observing a single instance therefore scored 1.0, and the oracle was inflated 5-50×. After re-anchoring to scene-full GT (scene.instance_class), the oracle on gni_model_173 dropped from 1.0 to 0.0093. This bug likely contaminates prior NBV numbers scored against partial reconstructions.
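The effect of the denominator choice is easy to see in a small sketch. The function below is a hypothetical stand-in, not the real compute_mep_recall, and the scene size of 100 instances is illustrative only.

```python
from typing import Set


def recall(captured: Set[int], reference: Set[int]) -> float:
    """Fraction of the reference instances that were captured."""
    if not reference:
        return 0.0
    return len(captured & reference) / len(reference)


# Hypothetical scene: 100 ground-truth instances, one of which is observed.
scene_instances = set(range(100))
visible_in_partial_cloud = {3}
captured = {3}

# Buggy anchoring: the partial cloud acts as its own ground truth.
buggy = recall(captured, visible_in_partial_cloud)
# Fixed anchoring: the full-scene instance list is the denominator.
fixed = recall(captured, scene_instances)
print(buggy, fixed)
```

With the partial cloud as reference, any non-empty observation scores perfectly; with the full scene as reference, the same observation scores 1 in 100.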

3 · Corpus diversity > target engineering · KEY

Three target-engineering experiments all regressed or showed no gain: v5 (surface target), v7_attn (2.6M-param cross-attention head), and v7_k12 (K=2 combined target). One data-diversity experiment won on OOD: v7_realsynth (realism-fixed procedural synth). The bottleneck is corpus diversity, not loss formulation. → full ledger

4 · v7_realsynth is an OOD specialist, not a universal winner · SPECIALIST

On ifc-bench OOD (10 scenes across a duplex, an office, and a fire-damaged hospital), v7_realsynth beats shipped v4 by +0.0088 (2.6σ): 10 wins, 0 losses, 40 ties over 50 paired cells, with zero regressions across all three building types. But it loses in-distribution: its test_locked hybrid score is 0.486, against 0.507 for v6 and 0.541 for shipped v4. A specialist for unseen buildings, not a drop-in replacement.
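A win/loss/tie tally over paired cells (one cell per scene × seed) can be computed as in the sketch below. The function name and the toy scores are hypothetical, and the z-value shown uses a plain standard error of the paired differences, which is an assumption; the study's own σ figures come from its paired-bootstrap eval and need not match this formula.

```python
import math
import statistics
from typing import Sequence, Tuple


def paired_summary(new: Sequence[float], old: Sequence[float],
                   tol: float = 1e-12) -> Tuple[int, int, int, float, float]:
    """Return (wins, losses, ties, mean difference, mean / SE) over paired cells."""
    if len(new) != len(old):
        raise ValueError("paired comparison needs equally long score lists")
    diffs = [n - o for n, o in zip(new, old)]
    wins = sum(d > tol for d in diffs)
    losses = sum(d < -tol for d in diffs)
    ties = len(diffs) - wins - losses
    mean = statistics.fmean(diffs)
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    z = mean / se if se > 0 else float("nan")
    return wins, losses, ties, mean, z


# Toy cells (illustrative numbers only): the same scene/seed pairs scored by two policies.
new_scores = [0.10, 0.05, 0.05, 0.08]
old_scores = [0.08, 0.05, 0.05, 0.07]
wins, losses, ties, mean_diff, z = paired_summary(new_scores, old_scores)
print(wins, losses, ties, round(mean_diff, 4))
```

A pattern of many ties and a few wins, as reported above, means the mean difference is carried by a minority of cells, which is why the tally is reported alongside the σ value.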

Every policy × split → progress + ledger

surface-coverage recall@K=1, hybrid policy · mean ± SE, 5 seeds · bold green = best non-oracle in column · oracle = myopic Δ ceiling (greyed)

Policy	test_locked (3 residential)	held_out (7 GNI / heatpump)	ifc-bench OOD (10 scenes, 3 buildings)
greedy_coverage (classical)	0.563 ±.004	0.062 ±.002	0.037 ±.007
octomap_ig (Bircher 2016)	0.500 ±.006	0.054 ±.002	0.026 ±.004
greedy_lookahead_1 (2-step, true extractor)	0.494 ±.017	0.129 ±.010	0.070 ±.006
v4 (mep_recall · SHIPPED)	0.541	0.147	0.063 ±.002
v6 (pure pairwise rank)	0.507 ±.016	0.129 ±.009	0.059 ±.008
v7_attn (cross-attention)	0.496 ±.012	0.088 ±.008	0.067 ±.011
v7_realsynth (realism-fixed synth)	0.486 ±.019	0.094 ±.009	0.071 ±.003
v7_k12 (K=2 combined · failed)	0.474 ±.016	0.086 ±.011	0.054 ±.005
oracle (myopic Δ ceiling)	0.572 ±.003	0.185 ±.017	0.093 ±.006

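Among the learned policies reported with SE (v6 and the v7 variants), v6 leads on test_locked and held_out and v7_realsynth leads on OOD. v4's 0.541 and 0.147 are reported without SE, and the classical greedy_coverage (0.563) is the highest non-oracle score on test_locked. Oracle is the myopic Δ ceiling, greyed.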

Training Spearman ρ: ranking signal by version

peak ρ between learned score and true Δ-recall target on the train set · v7_realsynth leads · all weak (target is hard)

[Bar chart, peak training ρ per version: two leading bars whose labels did not survive (0.060, 0.130) · v7_attn 0.130 · v7_k12 0.100 · v7_realsynth 0.156]
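For reference, Spearman ρ between a learned score and the true Δ-recall target is the Pearson correlation of their ranks. The sketch below is a dependency-free version with average ranks for ties; the helper names and the toy vectors are hypothetical, and the study's own tooling may compute ρ differently.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of the rank-transformed inputs."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy example (illustrative numbers only): predicted scores vs true gains.
predicted = [0.1, 0.4, 0.3, 0.9]
true_gain = [0.00, 0.02, 0.05, 0.07]
rho = spearman_rho(predicted, true_gain)
print(round(rho, 3))
```

Because ρ depends only on ordering, a value near 0.1 to 0.16 means the learned score orders candidates only slightly better than chance, consistent with the ranker being useful as a shortlist filter rather than as a standalone planner.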

The graveyard

Three target-engineering bets died. v5 (surface-coverage target) regressed below v4. v7_attn (a 2.6M-param cross-attention head, ~10× the parameters) bought no gain: its ρ tied v6 at 0.13 and its in-distribution hybrid score (0.496) did not beat v6 (0.507). v7_k12 hit a mathematically unavoidable K=2 degeneracy: the pair-coverage target g_star_k2 has a top-2 gap of exactly 0, because the symmetry (i,j)=(j,i) gives both members of the best pair the same target value, so there is zero gradient between them. The combined K2+0.1·K1 fix broke the ties (10/10 groups), but eval regressed: ρ 0.10 and ifc-bench hybrid 0.054, losing to v4's 0.063 and v7_realsynth's 0.071. → full ledger with dates, motivation & what broke
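The K=2 degeneracy can be reproduced on a toy example. The sketch assumes one plausible reading of the pair-coverage target, namely that each candidate is credited with the joint coverage of the best pair it belongs to; that definition, the function names, and the set-based coverage model are assumptions, not the project's g_star_k2 code.

```python
from itertools import combinations
from typing import Dict, Set


def k1_target(visible: Dict[str, Set[int]]) -> Dict[str, float]:
    """Single-view coverage per candidate."""
    return {v: float(len(s)) for v, s in visible.items()}


def k2_target(visible: Dict[str, Set[int]]) -> Dict[str, float]:
    """Best joint coverage of any pair containing the candidate (assumed definition)."""
    best = {v: 0.0 for v in visible}
    for a, b in combinations(visible, 2):
        joint = float(len(visible[a] | visible[b]))
        # Symmetry: the pair (a, b) credits both of its members identically.
        best[a] = max(best[a], joint)
        best[b] = max(best[b], joint)
    return best


def top2_gap(target: Dict[str, float]) -> float:
    top = sorted(target.values(), reverse=True)
    return top[0] - top[1]


# Toy scene (illustrative numbers only).
visible = {"a": {1, 2, 3}, "b": {4, 5}, "c": {1, 4}}
k1, k2 = k1_target(visible), k2_target(visible)
combined = {v: k2[v] + 0.1 * k1[v] for v in visible}

print(top2_gap(k2), top2_gap(combined))
```

Under this definition both members of the globally best pair always receive the maximum, so the top-2 gap of the pure K2 target is exactly zero whenever there are at least two candidates; adding 0.1·K1 separates them as long as their single-view coverages differ.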

Shipped: deployable

Default policy

v4 + hybrid

mep_recall target, learned top-12 + 2-step lookahead. 0.541 test_locked · 0.147 held_out. The everyday operating point.

OOD specialist

v7_realsynth

Realism-fixed procedural synth. 0.071 on unseen building types (+2.6σ over v4), 10W/0L/40T. Reach for it off-distribution.

Method: human-steered, agent-executed

The user set direction in plain language: "try a+b" · "secondary audit" · "think about scope" · "is this interesting" · "more target signal" · "match the eval". An orchestrating LLM decomposed each into self-contained briefs for parallel sub-agents that dispatched Modal training, ran paired-bootstrap eval, audited the metric, and committed results with honest SHIP / KEEP / DROP verdicts. → the calls that shaped each experiment

Eval protocol held fixed across the study: surface-coverage recall@K=1 · 5 seeds · n_cand=20 · hybrid_top_k=12 · mean ± SE across seeds. Splits: test_locked (3 residential) · held_out (7 GNI / heatpump) · ifc-bench OOD (10 scenes across duplex_mep, wbdg_office_mep, west_riverside_hospital_fire). Hybrid = LearnedPlusLookaheadBaseline (model scores M candidates, takes top-K=12, runs 2-step lookahead). Classical baseline: OctoMap-IG (Bircher et al., ICRA 2016). Corpus: 297 manifest scenes plus IFC-Bench v2 (sylvainHellin/ifc-bench, CC-BY, 93 sub-scenes).
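The "mean ± SE across seeds" figures follow the usual definition: the sample mean over seeds and the sample standard deviation divided by the square root of the number of seeds. The sketch below shows that computation and the compact "0.563 ±.004" formatting used in the table; the per-seed values are hypothetical, and the study's own aggregation code may differ in detail.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_se(per_seed: Sequence[float]) -> Tuple[float, float]:
    """Sample mean and standard error of the mean across seeds."""
    mean = statistics.fmean(per_seed)
    se = statistics.stdev(per_seed) / math.sqrt(len(per_seed))
    return mean, se


def format_cell(per_seed: Sequence[float]) -> str:
    """Render as in the scoreboard, e.g. '0.600 ±.058' (leading zero of SE dropped)."""
    mean, se = mean_se(per_seed)
    return f"{mean:.3f} ±" + f"{se:.3f}".lstrip("0")


# Hypothetical recall values from three seeds (the study uses five).
cell = format_cell([0.5, 0.6, 0.7])
print(cell)
```

With only five seeds the SE itself is a noisy estimate, which is one reason the headline comparisons lean on paired statistics rather than on non-overlapping error bars.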