Execution used to be the bottleneck. So the product manager became an optimiser of throughput — a writer of PRDs, a chair of rituals, a shepherd of handoffs. That era is over. Judgment is the bottleneck now. The PM at Tabi is the person with judgment.
"Most PMs were never actually bottlenecked by execution. They were bottlenecked by taste and judgment. Team capacity functioned as a governor that prevented bad ideas from shipping. Remove that governor and you discover who was driving and who was just steering."
Tokyo HQ · reports to CEO · owns the Verification agent and the "walked by us" standard
We are hiring an operator. Someone who ships production code in Cursor or Claude Code. Who writes their own eval suites in Braintrust before the Verification agent runs against live inventory. Who reads a LangSmith trace without asking for help — and, when the agent misclassifies a property as "still verified" after a renovation, opens the trace, finds the prompt weakness, and ships the fix themselves.
Above all they have taste — the judgment to know what's worth shipping when capacity is infinite. The verification signal is the foundation of every downstream feature we sell. If this layer slips, Replace-in-place loses its match set, Language Bridge loses its context, Memory loses its accuracy. The PM who owns it is the person we trust to hold that line.
They are fluent across the whole stack — from foundation models through prompt and context engineering, through prototyping in Cursor or Bolt, through evals in Braintrust and Arize, through orchestration of six agents, through LangSmith observability, all the way up to product strategy. Not one layer deep. They can talk to a Tokyo concierge about an edge case on a ryokan walk and to a foundation-model researcher about context window degradation, in the same morning.
One number. Percentage of properties on Tabi with a Verification visit in the last 180 days.
Target: 98%. Nothing else on the scorecard.
They prototype the next Verification signal in Claude Code — a check that flags any ryokan whose breakfast-menu photo is more than 120 days old. Working prototype runs against a slice of production inventory by lunch. No PRD is written.
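A check like that is small enough to sketch in plain Python. This is a minimal sketch of the freshness test described above, not Tabi's actual code: the `breakfast_menu_photo_at` field name is a hypothetical schema choice, and only the 120-day threshold comes from the text.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(days=120)  # threshold from the prototype described above

def flag_stale_menu_photo(property_record, now=None):
    """Return True if the breakfast-menu photo is older than 120 days.

    `property_record` is assumed to carry an ISO-8601 timestamp under
    `breakfast_menu_photo_at` — an illustrative field name, not a real schema.
    """
    now = now or datetime.now(timezone.utc)
    taken_at = datetime.fromisoformat(property_record["breakfast_menu_photo_at"])
    return (now - taken_at) > STALE_AFTER

# A year-old photo gets flagged; a two-week-old one does not.
old = {"breakfast_menu_photo_at": "2024-06-01T09:00:00+00:00"}
fresh = {"breakfast_menu_photo_at": "2025-10-01T09:00:00+00:00"}
```

The point of the prototype is the loop, not the function: run exactly this against a slice of real inventory and look at what it flags before writing anything down.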
They write 24 new evals in Braintrust against last week's failure logs — false-positive drift flags on properties that haven't actually changed, false-negative misses on properties that quietly dropped a floor. They define what "verified" means before the agent runs against production.
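Braintrust has its own SDK; what follows is a plain-Python sketch of the shape of such an eval, splitting scores by the two failure modes named above. The dataclass fields and the toy signal are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    """One labelled example pulled from a week's failure logs (illustrative shape)."""
    property_id: str
    snapshot: dict          # property state fed to the drift signal
    expected_drift: bool    # ground truth: did the property actually change?

def run_evals(signal, cases):
    """Score a drift signal, reporting false positives and false negatives separately."""
    false_pos = [c for c in cases if signal(c.snapshot) and not c.expected_drift]
    false_neg = [c for c in cases if not signal(c.snapshot) and c.expected_drift]
    passed = len(cases) - len(false_pos) - len(false_neg)
    return {"passed": passed, "false_pos": len(false_pos), "false_neg": len(false_neg)}

# A toy signal and one case from each failure mode described above.
toy_signal = lambda snap: snap.get("floors_changed", False) or snap.get("menu_age_days", 0) > 120
cases = [
    EvalCase("ryokan-001", {"menu_age_days": 130}, expected_drift=False),   # drift flag, no real change
    EvalCase("ryokan-002", {"floors_changed": False}, expected_drift=True), # quietly dropped a floor
]
```

Keeping false positives and false negatives as separate counts matters here: the two failure modes have different costs (a needless re-walk versus a stale listing sold as verified), so a single accuracy number would hide the trade-off being tuned.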
They ship the signal to 10% of live inventory. A one-minute Loom replaces the status update. The regional concierges in Kyoto and Fukuoka get a Linear issue each: three properties flagged for re-walk within the week.
They review the eval deltas in LangSmith. One branch is regressing on older ryokan with seasonal menu rotations — the signal is noisy in October. They kill that branch. The pricing agent holds the earlier version in production. Two hours of work, no meeting.
They call three concierges who hit the failure mode — Sapporo, Osaka, Naha — and watch them walk two properties over FaceTime. The Saturday commit fixes the prompt, raises the threshold on seasonal menus, and pushes Verification coverage from 96.4% to 97.1%.
Prototype in Claude Code instead. Evidence precedes documentation. If the Verification signal works in a prototype by Tuesday afternoon, the prototype is the spec. A PRD written in advance is a hallucination about a product that does not yet exist.
Stakeholder theatre. Ship the thing, then show the thing. A one-minute Loom of the live agent beats a sixty-slide plan of agents that might exist by Q3.
The ritual was a coordination tax paid to prevent bad ideas from reaching production. Judgment is the governor now, and evals enforce it. Standups optimise for coordination. Tabi optimises for eval scores.
PM → Figma → ticket → QA → launch is a broken telephone game. The PM builds in the codebase. There is no handoff to drift through.
Capacity is infinite. Taste is the bottleneck. A PM who needs a six-week engineering slot to test a hypothesis is a PM who hasn't learned the new cycle yet.
None of these tools are bad. They simply presume a world in which execution is scarce and coordination is the point. That world is gone.
Not "managing AI features." Delegating to agents, defining failure modes, owning the evals, reading the traces. Tuesday, 14 October 2025:
The Verification agent ships a 3% expansion to "freshness" signals — anything not re-checked in 90 days gets flagged a shade earlier.
The PM opens Braintrust. A regression: the signal over-triggers on properties with seasonal menu rotations — autumn kaiseki is triggering a false "stale" flag.
They write four new evals in Braintrust capturing the failure mode. Seasonality-aware freshness. No meeting. No spec doc.
They ship a fix in Cursor that passes the evals. A pull request reviewed by one engineer. Merged by 12:40.
They re-run traffic. A Loom for the team: "Shipped seasonality-aware freshness. Regression gone. Coverage held."
They review LangSmith traces. Confirm the regression is gone. The Saturday eval suite grows by four tests. That is the day.
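The seasonality-aware fix from that day can be sketched in a few lines. Only the 90-day window comes from the text; the function shape, the `seasonal_menu` flag, and the doubled grace period are assumptions for illustration, not the shipped change.

```python
from datetime import date, timedelta

FRESHNESS_WINDOW = timedelta(days=90)  # anything not re-checked in 90 days gets flagged

def is_stale(last_checked: date, today: date, seasonal_menu: bool = False) -> bool:
    """Flag a property as stale if it has not been re-checked within the window.

    Seasonality-aware: properties with seasonal menu rotations (autumn kaiseki)
    get a longer grace period so an expected menu change is not mistaken for
    drift. The 2x multiplier is an illustrative assumption.
    """
    window = FRESHNESS_WINDOW * 2 if seasonal_menu else FRESHNESS_WINDOW
    return (today - last_checked) > window
```

The regression in the morning's Braintrust run was exactly the case this branch handles: a property 105 days out is stale under the flat rule but inside the grace period once its menu rotation is known.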
This is the job. A single person defined a failure mode, wrote the evals, shipped the fix, verified the result — between coffee and the evening walk. A PM who waits four weeks for an engineering slot to do the same thing is not a PM. They are a project manager with a product title.