mep

Research directions: how plain-language steering shaped the result

the specific calls the human made · what each one triggered in the agent fleet · what came back

The agents did the mechanical work; the direction was human. The user set direction in plain language ("try a+b", "secondary audit", "think about scope", "is this interesting"), and an orchestrating LLM decomposed each call into self-contained briefs for specialized sub-agents: Modal training jobs, paired-bootstrap eval fleets, independent code audits, lit surveys. They ran detached and reported back with honest SHIP / KEEP / DROP verdicts. A few of these calls didn't just tweak a knob: they changed what counted as a result, and the discipline behind them refuted the project's own most exciting findings. The ⭑ entries are the ones that mattered most.

The throughline: the loop was built to distrust its own convenient results. The OOD win survived only because a later call forced us to run the missing baseline. The "more target signal" idea was killed by its own eval. No result counted until it reproduced from a committed script.

The calls that shaped the result

expansion · 4 parallel agents

try this new stuff, a+b

An open-ended push to widen the bet on several fronts at once: chase real-world energy-efficiency (EnEff) data, stand up a procedural-synth pipeline, build the hybrid policy, and audit it. The orchestrator fanned this into four parallel sub-agents: an EnEff data hunt, a v2 procedural-synth generator, the learned-ranker + lookahead hybrid, and a secondary code audit.

→ Honest mixed bag. The EnEff data hunt yielded 0 usable scenes: license / geometry mismatch. But the hybrid policy was implemented (model scores M candidates → top-K=12 → 2-step lookahead), and the audit pass surfaced 5 real bugs in the freshly-written code. Breadth bought signal; not all of it was good news, which is the point.
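The shape of the hybrid policy (score M candidates, keep the top K=12, run a 2-step lookahead over the survivors) can be sketched as below. This is a minimal illustration under stated assumptions, not the project's code. The names `hybrid_pick`, `score_fn`, and `plan_value_fn` are hypothetical. Treating a pose as a set of visible surface ids, and valuing a two-pose plan by its union coverage, are both assumptions made for the example.

```python
from typing import Callable, Sequence, Tuple


def hybrid_pick(
    candidates: Sequence[str],
    score_fn: Callable[[str], float],
    plan_value_fn: Callable[[Tuple[str, ...]], float],
    top_k: int = 12,
) -> str:
    """Return the first pose of the best 2-step plan among the ranker's top_k."""
    if not candidates:
        raise ValueError("need at least one candidate pose")

    # Stage 1: a (stand-in) learned ranker scores all M candidates and only
    # the top_k survive. sorted() is stable, so score ties keep input order.
    shortlist = sorted(candidates, key=score_fn, reverse=True)[:top_k]

    # Stage 2: 2-step lookahead. Each shortlisted pose is valued by the best
    # two-pose plan that starts with it. Strict '>' keeps the higher-ranked
    # pose when two plans tie.
    best_first, best_value = shortlist[0], float("-inf")
    for first in shortlist:
        followups = [
            plan_value_fn((first, second))
            for second in shortlist
            if second != first
        ]
        value = max(followups) if followups else plan_value_fn((first,))
        if value > best_value:
            best_first, best_value = first, value
    return best_first


# Toy scene (hypothetical): each pose sees a set of surface ids.
views = {"a": {1, 2}, "b": {1, 2, 3}, "c": {4, 5, 6}, "d": {7}}
ranker_scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}


def coverage(plan: Tuple[str, ...]) -> float:
    """Assumed plan value: number of distinct surfaces seen by the plan."""
    return float(len(set().union(*(views[name] for name in plan))))


picked = hybrid_pick(list(views), lambda n: ranker_scores[n], coverage, top_k=3)
greedy = max(views, key=lambda n: ranker_scores[n])
```

In this toy scene the ranker alone would pick `a`, while the lookahead picks `b`, because the plan `b` then `c` covers more than any plan starting at `a`. That is the behavior the two-stage design is meant to buy: the ranker prunes cheaply, and the lookahead corrects it on the shortlist.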

⭑ secondary audit · independent reviewer

secondary auditing for bugs introduced

Distrust the agent that wrote the patch. Spun up an independent reviewer agent with no stake in the original implementation, told to re-derive correctness from scratch rather than trust the first pass.

→ Confirmed all 5 original patches and found 4 more (M1-M4). The reviewer caught the worst one independently: sample_feasible_poses parked the scanner 30m below the MEP on gni_model_202 (oracle = 0). Anchoring scanner-z to the MEP floor + a raycast floor filter recovered it to 0.0417. A single agent reviewing itself would have shipped these.

scope · the honest reckoning

let's think about scope

A deliberate pause to name the fork before sinking more compute: is this a ship-demo (one polished scanner policy), a generalization result (works on unseen buildings), or a benchmark contribution (a clean eval others can use)? Each implies a different bar.

→ The reckoning was laid out plainly, and the user picked all three. That decision is why the project carries a shipped v4 default, an OOD held-out split, and a re-anchored metric, instead of optimizing one number for a leaderboard.

⭑ honesty · forced reframe

is this an actually interesting finding

The hardest question to ask of your own work. It forced an honest reframe away from a SOTA-number framing toward what was actually defensible given n = 3 buildings and a small magnitude.

→ The reframe stuck: the headline is not a SOTA recall number. The real findings are (a) the metric pathology: recall divided captured instances by the instances visible in the partial cloud, so observing a single instance scored 1.0 and the oracle was inflated 5-50× (on gni_model_173, oracle 1.0 → 0.0093 after re-anchoring to scene-full GT); and (b) the target-signal ceiling: the bottleneck is corpus diversity, not loss formulation. Both are more interesting than a fractional win.
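The metric pathology is easiest to see with the two denominators side by side. The sketch below is illustrative only: the function name `recall`, the instance ids, and the 100-instance scene are made up for the example and are not the gni_model_173 data.

```python
from typing import Iterable


def recall(captured: Iterable[str], denominator_ids: Iterable[str]) -> float:
    """Fraction of the denominator set that was captured."""
    denom = set(denominator_ids)
    if not denom:
        return 0.0
    return len(set(captured) & denom) / len(denom)


# Hypothetical scene: 100 ground-truth instances, of which the partial
# cloud happens to contain exactly one, and the policy captures it.
scene_full_gt = {f"inst_{i}" for i in range(100)}
visible_in_partial = {"inst_0"}
captured = {"inst_0"}

# Old anchoring: the denominator shrinks with the observation, so seeing one
# instance out of one visible scores a perfect 1.0.
inflated = recall(captured, visible_in_partial)

# Re-anchored: the denominator is the full scene, so the same capture
# scores 1 / 100.
reanchored = recall(captured, scene_full_gt)
```

The only change between the two calls is the denominator. Because the old denominator depended on what the scanner had already seen, a policy could score well by seeing little, which is why the oracle had to be recomputed against scene-full GT.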

target signal · a clean negative

can we get more target signal

The training Spearman ρ was low (v4 = 0.06), so the natural lever was a richer learning target. The idea: a K=2 pair-coverage target g_star_k2 that rewards two-view synergy, giving the ranker more to learn from.

→ A clean negative, twice over. The pair-coverage target is mathematically degenerate: pair (i,j)=(j,i) symmetry means the two members of every winning pair tie: exactly 0 top-2 gap, zero gradient. The combined K2 + 0.1·K1 fix broke ties (10/10 groups) but eval regressed: ρ 0.10, ifc-bench hybrid 0.054, losing to v4's 0.063 and v7_realsynth's 0.071. The idea died by its own eval, exactly as it should.
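The degeneracy can be reproduced on a toy group. The sketch assumes one plausible reading of the target: each candidate's K=2 value is the coverage of its best pair with any partner. Only the name `g_star_k2` and the 0.1 weight come from the text; the definition, the helper names, and the view sets are hypothetical.

```python
from typing import Dict, Set

Views = Dict[str, Set[int]]


def g_k1(views: Views) -> Dict[str, float]:
    """Single-view target: surfaces covered by the pose alone."""
    return {name: float(len(seen)) for name, seen in views.items()}


def g_star_k2(views: Views) -> Dict[str, float]:
    """Assumed pair target: coverage of the best pair containing the pose."""
    return {
        i: float(max(len(views[i] | views[j]) for j in views if j != i))
        for i in views
    }


def top2_gap(target: Dict[str, float]) -> float:
    """Gap between the best and second-best target value in a group."""
    ordered = sorted(target.values(), reverse=True)
    return ordered[0] - ordered[1]


# Hypothetical group of three candidate poses.
views: Views = {"a": {1, 2, 3}, "b": {4, 5}, "c": {1, 4}}

k2 = g_star_k2(views)  # a and b share the winning pair, so both score 5
k1 = g_k1(views)
combined = {name: k2[name] + 0.1 * k1[name] for name in views}

gap_k2 = top2_gap(k2)
gap_combined = top2_gap(combined)
```

Under this definition the union is symmetric, so both members of the winning pair always receive the same value and `gap_k2` is exactly 0. Adding `0.1 * k1` separates them whenever their single-view coverages differ, which matches the tie-breaking described above. It says nothing about whether the combined target ranks better, and the eval showed that it did not.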

⭑ rigor · per-cell agreement audit

make sure results are interesting not weird

Distrust the convenient win. With v7_realsynth ahead by a small margin, the call demanded a per-cell agreement audit: not just the aggregate delta, but whether the improvement was real signal or a few lucky scenes.

→ The win held up and the audit caught a methodology hole. On the cells where the filter actually changes the pick, v7_realsynth wins 84% of the time, and all 3 / 3 buildings improve: signal, not noise. Critically, the audit caught that v4 had never been evaluated on ifc-bench; the comparison was apples-to-oranges until we re-ran v4 on the same OOD scenes (0.063). Only then did the +0.0088 ± 0.0035 (2.6σ), 10 wins / 0 losses / 40 ties count.
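A per-cell agreement audit of this kind reduces to a paired comparison over identical cells. The sketch below assumes a "cell" is one (scene, seed) evaluation and that both policies were scored on the same cells in the same order. The function name, the tie tolerance, and the toy numbers are hypothetical. It uses an analytic standard error, not the paired bootstrap mentioned earlier.

```python
import math
from statistics import mean, stdev
from typing import Dict, Optional, Sequence


def paired_audit(
    baseline: Sequence[float],
    candidate: Sequence[float],
    tol: float = 1e-12,
) -> Dict[str, Optional[float]]:
    """Compare two policies cell by cell on identical eval cells."""
    if len(baseline) != len(candidate):
        raise ValueError("policies must be scored on the same cells")
    if len(baseline) < 2:
        raise ValueError("need at least two cells for a standard error")

    deltas = [c - b for b, c in zip(baseline, candidate)]
    wins = sum(d > tol for d in deltas)
    losses = sum(d < -tol for d in deltas)
    ties = len(deltas) - wins - losses

    # Win rate restricted to cells where the two policies actually differ.
    changed = wins + losses
    win_rate = wins / changed if changed else None

    mean_delta = mean(deltas)
    se = stdev(deltas) / math.sqrt(len(deltas))  # analytic SE of the mean
    return {
        "wins": wins,
        "losses": losses,
        "ties": ties,
        "win_rate_where_changed": win_rate,
        "mean_delta": mean_delta,
        "se": se,
        "sigma": mean_delta / se if se > 0 else None,
    }


# Toy per-cell recalls (hypothetical, not the study's numbers).
base = [0.05, 0.06, 0.07, 0.04, 0.05, 0.06]
cand = [0.05, 0.08, 0.07, 0.04, 0.07, 0.05]
report = paired_audit(base, cand)
```

Reporting ties separately matters here: when most cells tie, an aggregate delta can look small while the win rate on the cells that changed is high, or the reverse. The length check is the code-level version of the baseline gap the audit caught, since a paired delta is only meaningful when both policies ran on the same cells.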

delivery · style parity

style parity / build the frontend

Make the result legible. Replicate the Gridfloor design system exactly: same tokens, fonts, card components, provenance-next-to-every-number voice, so the mep site reads as one coherent body of work.

→ This site, plus the scoreboard, flowchart, experiments ledger, explorer, and literature pages, with every number traceable to a committed script.

What each call cost and bought

the call	what it triggered	what came back
try a+b	EnEff data hunt · v2 procedural synth · hybrid policy · secondary audit (4 parallel agents)	EnEff = 0 usable scenes; hybrid shipped; 5 audit bugs found
secondary audit	independent reviewer agent, no stake in the original code	all 5 patches confirmed + 4 more (M1-M4); 30m-below-MEP pose bug → 0.0417
think about scope	the ship-vs-generalize-vs-benchmark fork named explicitly	user picked all three → shipped v4 + OOD split + re-anchored metric
is this interesting	honest-finding reframe under n=3, small magnitude	real findings = metric pathology + target-signal ceiling, not a SOTA number
more target signal	K=2 pair-coverage target → combined K2+0.1·K1 target	degenerate (0 top-2 gap), then empirically regressed: a clean negative
interesting not weird	per-cell agreement audit on the OOD win	84% win-rate where the filter changes the pick; 3/3 buildings; v4 baseline gap caught
style parity	build the frontend to Gridfloor spec	this site, every number traceable to a committed script

The discipline

The loop refuted its own exciting findings. The v7_realsynth OOD win only survived because a later call sent us back to run the v4 baseline that had never been evaluated on ifc-bench. Without that, the comparison was apples-to-oranges and the headline would have been an artifact. The "more target signal" idea looked promising and was killed by its own eval the moment K=2 turned out to be mathematically degenerate and the combined target regressed. The pattern that makes those outcomes trustworthy is one rule: no result counts until it reproduces from a committed script, not an agent's transcript, not a single seed, not an uncommitted notebook. The agents supplied breadth and throughput across the four-parallel fan-out; the direction supplied the skepticism that an autonomous loop lacks on its own.

It is also why three target-engineering experiments failed and one data-diversity experiment won: v5 surface-target, v7_attn cross-attention (2.6M params), and v7_k12 combined-target all regressed or showed no gain, while v7_realsynth, a realism-fixed procedural-synth corpus, was the only winner. The bottleneck was never the loss formulation. It was corpus diversity, and the steering kept pointing back at it.

The recurring directives

try a+b · secondary audit · think about scope · is this interesting · get more target signal · interesting not weird · match the eval before you believe it · reproduce from a committed script · style parity

Quotes are the working-session calls, lightly normalized. Every outcome reproduces from a committed script. The eval protocol was held fixed across the study (surface-coverage recall@K=1 · 5 seeds · n_cand=20 · hybrid_top_k=12 · mean ± SE). The session was a human-steered, agent-executed loop: the human set direction in plain language, an orchestrating LLM decomposed each call into briefs, and parallel sub-agents (Modal training, eval, audits, lit) ran detached and returned honest verdicts.