Next-best-view planning for MEP discovery — the mechanical, electrical, and plumbing buried inside a BIM. The shipped policy is a learned ranker + classical lookahead hybrid; the honest finding is that corpus diversity beats target engineering; the whole thing was run as a human-steered, agent-executed research loop.
To scan a building you need to know where to point the scanner next. The classical answer is OctoMap-IG (Bircher et al., ICRA 2016) — pick the view with maximum expected information gain. Our hybrid beats it by +5.4σ on test_locked, +5.0σ held-out, and +5.6σ on out-of-distribution buildings — every split, not one lucky cut. Better view selection means fewer scans to find the hidden MEP.
The hard test is a building type the model never trained on. On ifc-bench OOD — 10 scenes across a duplex, an office, and a fire-damaged hospital — v7_realsynth scores 0.0714 ± 0.003, the best of any version, beating the shipped v4's 0.0626 by +0.0088 (2.6σ): 10 wins, 0 losses, 40 ties over 50 paired cells. The realism-fixed synthetic corpus transfers where target tweaks don't.
The recall metric originally divided captured instances by the instances visible in the partial cloud — using the partial cloud as its own ground truth. So observing one instance scored 1.0, and oracle inflated 5–50×. We re-anchored it to the scene-full GT (scene.instance_class). On gni_model_173 oracle dropped from 1.0 to 0.0093. This bug likely contaminates prior NBV numbers that score against partial reconstructions.
Training and eval ran as detached Modal jobs on persistent Volumes so a 5-seed sweep finished across laptop sleeps and crashes. A separate pose-sampling fix mattered too: the scanner was once parked 30m below the MEP on gni_model_202 (oracle=0); anchoring scanner-z to the floor with a raycast filter recovered it to 0.0417.
We tried a pair-coverage target g_star_k2 to teach two-view synergy. It has a mathematically unavoidable degeneracy: pair (i,j)=(j,i) symmetry means the two members of every winning pair tie — exactly 0 top-2 gap, zero gradient. The combined K2+0.1·K1 fix broke ties (10/10 groups) but eval regressed: ρ 0.10, ifc-bench hybrid 0.054, losing to v4's 0.063.
The exciting v7_realsynth OOD win only counted once we caught that v4 had never been evaluated on ifc-bench — the comparison was apples-to-oranges. We re-ran v4 on the same OOD scenes (0.063) before claiming the +2.6σ. The rule "match the eval before you believe the result" is what makes this number trustworthy — and three exciting target-engineering wins died under it.
The user set direction in plain language — try a+bsecondary auditthink about scopeis this interestingget more target signalmatch the eval — and an orchestrating LLM decomposed each into self-contained briefs for specialized sub-agents that dispatched Modal training, ran paired-bootstrap eval, audited the metric, and committed results with honest SHIP/KEEP/DROP verdicts. The directions page traces the specific calls that shaped the outcome.