mep

Explorer: slice the 5-way comparison yourself

interactive · every policy × split · surface-coverage recall@K=1 · 5 seeds · n_cand=20 · hybrid_top_k=12

Pick a split and a model version; the bar chart, the delta panel, and the per-version readout below all re-render from the committed numbers. The metric is locked to surface coverage: mep_recall is too sparse on the OOD set to compare meaningfully (most scenes capture 0 instances, so per-scene recall is mostly zeros). The honest read: the learned ranker alone never tops greedy_coverage; what ships is the hybrid: learned scores → top-K=12 → 2-step lookahead.

metric

split

model

Every policy on the selected split

hybrid (selected version) baseline learned-only (selected version) oracle ceiling (myopic Δ)

what to notice

Hybrid vs the baselines that matter

For the selected split + version, the hybrid minus three reference policies: greedy_coverage (the strongest classical heuristic), octomap_ig (the classical info-gain planner, Bircher et al., ICRA 2016), and greedy_lookahead_1 (the same 2-step lookahead the hybrid wraps, but with a true extractor and no learned prefilter). Combined SE = √(SEₐ² + SE_b²); σ = Δ / combined SE.

how to read it

Positive Δ and σ ≳ 2 means a real win at this sample size. The headline cross-split fact: hybrid beats octomap_ig by +5.4σ test_locked · +5.0σ held_out · +5.6σ OOD: every split, not one lucky cut.

Model version: training signal vs eval

the finding

Higher training Spearman ρ does not buy eval wins. Three target-engineering bumps, namely v5 surface-target, v7_attn's 2.6M-param cross-attention head (ρ 0.13), and v7_k12's combined K2+0.1·K1 target, all regressed or no-gained. The one win, v7_realsynth (ρ 0.156), came from a realism-fixed procedural synth corpus, not a loss tweak. The bottleneck is corpus diversity, not target formulation.

Hybrid top_k: cost vs quality (documented sweep)

A static, documented finding (not re-derived here): how many of the model's top candidates you hand to the lookahead wrapper. More candidates → higher recall, but the true extractor runs on every pair, so extractor calls scale K². Quality saturates around K=12. Past that you pay quadratic compute for a recall that flattens (and dips at K=20).

recall@K=1 (surface) vs hybrid_top_k · extractor cost ∝ K² shown in the right column

why K=12 is the shipped default

K=1 (0.443) is just the raw ranker: no lookahead. By K=5 (0.597) most of the gain is in; K=12 (0.615) is the knee; K=20 (0.598) actually regresses while costing ~2.8× the extractor calls of K=12. So the shipped hybrid_top_k=12 sits exactly at the cost/quality knee.