mep

Can a learned policy beat classical baselines at scanning a building's hidden plumbing?

Next-best-view planning for MEP discovery: locating the mechanical, electrical, and plumbing systems buried inside a BIM. The shipped policy is a hybrid of a learned ranker and a classical lookahead. The honest finding is that corpus diversity beats target engineering. The whole study was run as a human-steered, agent-executed research loop.

0.541 · hybrid surface recall · test_locked
+5.4σ · hybrid over OctoMap-IG · classical info-gain
0.156 · best training Spearman ρ · v7_realsynth
5 / 7 · model versions · pages of honest negatives
The headline is not a SOTA number. It is three findings: (a) the learned ranker alone never beats greedy_coverage, and the classical lookahead wrapper is what actually ships; (b) we found and fixed a metric pathology that inflates the oracle score 5–50× and likely contaminates prior NBV numbers; (c) three target-engineering tricks failed and one data-diversity fix won, which points to the corpus, not the loss, as the bottleneck.

Why this is useful in the real world

📡 It beats the classical info-gain planner

+5.4σ / +5.0σ / +5.6σ

To scan a building you need to know where to point the scanner next. The classical answer is OctoMap-IG (Bircher et al., ICRA 2016): pick the view with maximum expected information gain. Our hybrid beats it by +5.4σ on test_locked, +5.0σ on held_out, and +5.6σ on out-of-distribution buildings, so the margin holds on every split rather than on one lucky cut. Better view selection means fewer scans to find the hidden MEP.

🏥 It generalizes to unseen building types

0.0714 OOD · 0 regressions

The hard test is a building type the model never trained on. On ifc-bench OOD — 10 scenes across a duplex, an office, and a fire-damaged hospital — v7_realsynth scores 0.0714 ± 0.003, the best of any version, beating the shipped v4's 0.0626 by +0.0088 (2.6σ): 10 wins, 0 losses, 40 ties over 50 paired cells. The realism-fixed synthetic corpus transfers where target tweaks don't.
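The win/loss/tie count and the mean ± SE summary quoted above can be illustrated with a few lines. This is a minimal sketch, not the project's eval code: `mean_se`, `tally`, the tie tolerance, and the sample numbers are all hypothetical, and the assumption that the 50 paired cells are 10 scenes × 5 seeds is inferred from the protocol rather than stated.

```python
import math
import statistics

def mean_se(values):
    """Mean and standard error of the mean across seeds."""
    mean = statistics.fmean(values)
    se = statistics.stdev(values) / math.sqrt(len(values))
    return mean, se

def tally(new_scores, old_scores, tol=1e-9):
    """Count paired cells where the new model wins, loses, or ties."""
    wins = sum(1 for a, b in zip(new_scores, old_scores) if a - b > tol)
    losses = sum(1 for a, b in zip(new_scores, old_scores) if b - a > tol)
    ties = len(new_scores) - wins - losses
    return wins, losses, ties

# Made-up per-cell recalls, for illustration only (not project data).
new = [0.08, 0.06, 0.07, 0.07, 0.09]
old = [0.07, 0.06, 0.07, 0.07, 0.08]
result = tally(new, old)          # (wins, losses, ties)
mean_new, se_new = mean_se(new)
```

A large tie count, as in the 10/0/40 result, is what you would expect when most cells are scenes where both models pick the same view; the paired tally makes that visible where a bare mean would hide it.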

What was technically hard (and required thought)

🔬 The metric pathology

oracle 1.0 → 0.0093

The recall metric originally divided captured instances by the instances visible in the partial cloud, so the partial cloud served as its own ground truth. Observing a single instance therefore scored 1.0, and the oracle score was inflated 5–50×. We re-anchored the denominator to the scene-full GT (scene.instance_class). On gni_model_173 the oracle dropped from 1.0 to 0.0093. This bug likely contaminates prior NBV numbers that score against partial reconstructions.
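The effect of the denominator choice is easy to see in a toy case. The sketch below is an illustration under stated assumptions, not the project's metric code: `instance_recall` is a hypothetical helper, and the 100-instance scene is made up.

```python
def instance_recall(captured_ids, reference_ids):
    """Fraction of reference instances that were captured."""
    reference = set(reference_ids)
    if not reference:
        return 0.0
    return len(set(captured_ids) & reference) / len(reference)

# Hypothetical scene with 100 MEP instances in the full ground truth.
scene_instances = set(range(100))
# The partial cloud so far shows a single instance, which the view captures.
visible_in_partial = {7}
captured = {7}

# Pathological: the partial cloud is its own ground truth.
pathological = instance_recall(captured, visible_in_partial)
# Re-anchored: the denominator is the full scene.
re_anchored = instance_recall(captured, scene_instances)
```

With the partial cloud as reference, any view that sees what is already visible scores a perfect 1.0; against the full scene the same view scores 0.01.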

🛰️ Autoresearch that survives reality

detached Modal jobs

Training and eval ran as detached Modal jobs on persistent Volumes, so a 5-seed sweep finished across laptop sleeps and crashes. A separate pose-sampling fix mattered too: the scanner was once parked 30 m below the MEP on gni_model_202 (oracle = 0); anchoring scanner-z to the floor with a raycast filter recovered it to 0.0417.

📐 The K=2 pair-coverage degeneracy

exactly 0 top-2 gap

We tried a pair-coverage target, g_star_k2, to teach two-view synergy. It has a mathematically unavoidable degeneracy: the pair score is symmetric, (i,j) = (j,i), so the two members of every winning pair receive the same target value. The top-2 gap is exactly 0 and the gradient between them is zero. The combined K2+0.1·K1 fix broke the ties (10/10 groups), but eval regressed: ρ 0.10 and ifc-bench hybrid 0.054, losing to v4's 0.063.
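The tie can be reproduced on a four-view toy problem. This sketch assumes, since the page does not define it, that the per-view K=2 target is the best joint coverage a view achieves with any partner, and that K1 is plain single-view coverage; `pair_targets`, `top2_gap`, and the coverage sets are hypothetical.

```python
from itertools import combinations

def pair_targets(cover):
    """Per-view K=2 target: best joint coverage over all partners (assumed form)."""
    g = [0] * len(cover)
    for i, j in combinations(range(len(cover)), 2):
        joint = len(cover[i] | cover[j])  # symmetric in (i, j)
        g[i] = max(g[i], joint)
        g[j] = max(g[j], joint)
    return g

def top2_gap(scores):
    """Difference between the best and second-best target value."""
    best, second = sorted(scores, reverse=True)[:2]
    return best - second

# Toy coverage sets: which surface patches each candidate view sees.
cover = [{1, 2, 3, 4}, {5, 6, 7}, {1, 2}, {3, 5}]
k2 = pair_targets(cover)                  # both members of the best pair share one value
k1 = [len(c) for c in cover]              # single-view coverage
combined = [a + 0.1 * b for a, b in zip(k2, k1)]
```

Here views 0 and 1 form the winning pair, so both get the same K2 value and the top-2 gap is 0. Adding 0.1·K1 separates them whenever their single-view coverage differs, which matches the tie-breaking described above but says nothing about whether the resulting target is a good one to learn.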

⚖️ A discipline that refuted its own findings

v4 never eval'd on ifc-bench

The exciting v7_realsynth OOD win only counted once we caught that v4 had never been evaluated on ifc-bench — the comparison was apples-to-oranges. We re-ran v4 on the same OOD scenes (0.063) before claiming the +2.6σ. The rule "match the eval before you believe the result" is what makes this number trustworthy — and three exciting target-engineering wins died under it.

The method: human-steered, agent-executed

The user set direction in plain language ("try a+b", "secondary audit", "think about scope", "is this interesting", "get more target signal", "match the eval"), and an orchestrating LLM decomposed each into self-contained briefs for specialized sub-agents that dispatched Modal training, ran paired-bootstrap eval, audited the metric, and committed results with honest SHIP/KEEP/DROP verdicts. The directions page traces the specific calls that shaped the outcome.

Eval protocol held fixed across the whole study: surface-coverage recall@K=1 · 5 seeds · n_cand=20 · hybrid_top_k=12 · mean ± SE across seeds. Splits: test_locked (3 residential) · held_out (7 GNI/heatpump) · ifc-bench OOD (10 scenes, 3 buildings). Hybrid = LearnedPlusLookaheadBaseline (model scores M candidates, takes top-K=12, runs 2-step lookahead). Classical baseline: OctoMap-IG (Bircher et al., ICRA 2016). Corpus: 297 manifest scenes plus IFC-Bench v2 (sylvainHellin/ifc-bench, CC-BY).
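The hybrid described in the protocol (learned scores, top-K shortlist, 2-step lookahead) can be sketched as follows. This is a simplified illustration, not LearnedPlusLookaheadBaseline itself: `hybrid_select`, `learned_score`, and `coverage_of` are hypothetical names, coverage is modeled as plain sets of patch ids, and "2-step lookahead" is assumed to mean the gain of a view plus the best follow-up gain within the shortlist.

```python
def hybrid_select(candidates, learned_score, coverage_of, seen, top_k=12):
    """Pick the next view: learned shortlist, then classical 2-step lookahead."""
    # Stage 1: the learned ranker keeps the top_k highest-scoring candidates.
    shortlist = sorted(candidates, key=learned_score, reverse=True)[:top_k]

    def two_step_gain(first):
        # New patches from this view, plus the best follow-up view's new patches.
        seen_after = seen | coverage_of(first)
        first_gain = len(coverage_of(first) - seen)
        second_gain = max(
            (len(coverage_of(v) - seen_after) for v in shortlist if v != first),
            default=0,
        )
        return first_gain + second_gain

    # max() returns the first maximal item, so ties fall back to learned rank.
    return max(shortlist, key=two_step_gain)

# Toy data (hypothetical): learned scores and per-view coverage sets.
scores = {0: 0.9, 3: 0.8, 1: 0.5, 2: 0.4}
cover = {0: {1, 2, 3}, 1: {5, 6}, 2: {7, 8, 9, 10, 11}, 3: {1, 2, 3, 4}}
choice = hybrid_select(list(cover), scores.get, cover.get, seen=set(), top_k=3)
```

In this toy case the learned ranker alone would pick view 0, while the lookahead promotes view 3. View 2 has the largest coverage but is never considered because it falls outside the shortlist, which is the trade-off a top-K cutoff introduces.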