An essay · The role, reopened

The PM at Tabi is not what you think.

Execution used to be the bottleneck. So the product manager became an optimiser of throughput — a writer of PRDs, a chair of rituals, a shepherd of handoffs. That era is over. Judgment is the bottleneck now. The PM at Tabi is the person with judgment.

"Most PMs were never actually bottlenecked by execution. They were bottlenecked by taste and judgment. Team capacity functioned as a governor that prevented bad ideas from shipping. Remove that governor and you discover who was driving and who was just steering."

What we're looking for

Product Manager — Verified Inventory

Tokyo HQ · reports to CEO · owns the Verification agent and the "walked by us" standard

We are hiring an operator. Someone who ships production code in Cursor or Claude Code. Who writes their own eval suites in Braintrust before the Verification agent runs against live inventory. Who reads a LangSmith trace without asking for help — and, when the agent mis-classifies a property as "still verified" after a renovation, opens the trace, finds the prompt weakness, and ships the fix themselves.

Above all they have taste — the judgment to know what's worth shipping when capacity is infinite. The verification signal is the foundation of every downstream feature we sell. If this layer slips, Replace-in-place loses its match set, Language Bridge loses its context, Memory loses its accuracy. The PM who owns it is the person we trust to hold that line.

They are fluent across the whole stack — from foundation models through prompt and context engineering, through prototyping in Cursor or Bolt, through evals in Braintrust and Arize, through orchestration of six agents, through LangSmith observability, all the way up to product strategy. Not one layer deep. They can talk to a Tokyo concierge about an edge case on a ryokan walk and to a foundation-model researcher about context window degradation, in the same morning.

The outcome they own

One number. Percentage of properties on Tabi with a Verification visit in the last 180 days.

Target: 98%. Nothing else on the scorecard.
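The metric is simple enough to sketch in a few lines. A minimal sketch, assuming each property record carries a `last_visit` date; the field names and sample records are illustrative, the 180-day window is from the scorecard above:

```python
from datetime import date, timedelta

def verification_coverage(properties, today, window_days=180):
    """Share of properties with a Verification visit inside the window."""
    cutoff = today - timedelta(days=window_days)
    visited = sum(1 for p in properties if p["last_visit"] >= cutoff)
    return visited / len(properties)

# Two toy records: one walked recently, one overdue for a re-walk.
inventory = [
    {"id": "ryokan-001", "last_visit": date(2025, 9, 1)},
    {"id": "ryokan-002", "last_visit": date(2025, 2, 1)},
]
coverage = verification_coverage(inventory, today=date(2025, 10, 14))  # 0.5
```

One function, one number, no dashboard of proxies to argue about.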


Tasks · a week in the life

Prototype. Evaluate. Ship. Review. Repeat.

Monday

They prototype the next Verification signal in Claude Code — a check that flags any ryokan whose breakfast-menu photo is more than 120 days old. Working prototype runs against a slice of production inventory by lunch. No PRD is written.
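In spirit, the Monday prototype is only a few lines. The 120-day threshold is from the paragraph above; the field names and sample data are guesses for illustration:

```python
from datetime import date, timedelta

STALE_AFTER_DAYS = 120  # the essay's threshold for breakfast-menu photos

def stale_menu_photos(properties, today):
    """Return ids of properties whose breakfast-menu photo is past the threshold."""
    cutoff = today - timedelta(days=STALE_AFTER_DAYS)
    return [p["id"] for p in properties if p["menu_photo_date"] < cutoff]

# A slice of inventory, not the whole production set.
slice_of_inventory = [
    {"id": "ryokan-kyoto-12", "menu_photo_date": date(2025, 5, 1)},   # stale
    {"id": "ryokan-hakone-3", "menu_photo_date": date(2025, 9, 20)},  # fresh
]
flagged = stale_menu_photos(slice_of_inventory, today=date(2025, 10, 13))
```

The prototype *is* the spec; anything this short does not need a PRD.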

Tuesday

They write 24 new evals in Braintrust against last week's failure logs — false-positive drift flags on properties that haven't actually changed, false-negative misses on properties that quietly dropped a floor. They define what "verified" means before the agent runs against production.
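The eval idea, stripped to a skeleton and independent of any Braintrust specifics: labelled cases from the failure logs, a candidate signal, a pass rate. Every name below is hypothetical:

```python
def run_evals(cases, signal):
    """Score a candidate drift signal against labelled cases; return the pass rate."""
    passed = sum(1 for c in cases if signal(c["property"]) == c["expected"])
    return passed / len(cases)

# Labelled cases mirroring the two failure modes in the logs.
CASES = [
    # False-positive drift: nothing changed, so the flag should stay off.
    {"property": {"floors": 3, "floors_on_listing": 3}, "expected": False},
    # False-negative miss: a floor quietly disappeared, so the flag should fire.
    {"property": {"floors": 2, "floors_on_listing": 3}, "expected": True},
]

def drift_flag(p):
    # Candidate signal under test: flag when the walked count disagrees
    # with the listing.
    return p["floors"] != p["floors_on_listing"]

pass_rate = run_evals(CASES, drift_flag)
```

Defining "verified" as executable cases, before the agent touches production, is the whole point of the Tuesday.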

Wednesday

They ship the signal to 10% of live inventory. A one-minute Loom replaces the status update. The regional concierges in Kyoto and Fukuoka get a Linear issue each: three properties flagged for re-walk within the week.

Thursday

They review the eval deltas in LangSmith. One branch is regressing on older ryokan with seasonal menu rotations — the signal is noisy in October. They kill that branch. The Verification agent holds the earlier version in production. Two hours of work, no meeting.

Friday

They call three concierges who hit the failure mode — Sapporo, Osaka, Naha — and watch them walk two properties over FaceTime. The Saturday commit fixes the prompt, raises the threshold on seasonal menus, and pushes Verification coverage from 96.4% to 97.1%.

Habits · what they refuse to do

The rituals were governors on bad ideas. Taste is the governor now.

No PRDs.

Prototype in Claude Code instead. Evidence precedes documentation. If the Verification signal works in a prototype by Tuesday afternoon, the prototype is the spec. A PRD written in advance is a hallucination about a product that does not yet exist.

No quarterly roadmap decks.

Stakeholder theatre. Ship the thing, then show the thing. A one-minute Loom of the live agent beats a sixty-slide plan of agents that might exist by Q3.

No sprint ceremonies.

The ritual was a coordination tax paid to prevent bad ideas from reaching production. Judgment is the governor now, and evals enforce it. Standups optimise for coordination. Tabi optimises for eval scores.

No handoff drift.

PM → Figma → ticket → QA → launch is a broken telephone game. The PM builds in the codebase. There is no handoff to drift through.

No waiting on engineering capacity.

Capacity is infinite. Taste is the bottleneck. A PM who needs a six-week engineering slot to test a hypothesis is a PM who hasn't learned the new cycle yet.

Tools · the stack

The tools they live in. And the ones they don't.

The stack
Claude Code · Cursor
Prototyping features themselves. The PM writes the code, runs it against a slice of production data, and reads the results before anyone else.
Bolt · v0
Spinning up UI, internal tools, and marketing demos in an afternoon.
Braintrust · Arize
Writing evals, catching hallucinations, defining what "verified" means before an agent runs against production.
LangSmith
Observability. Reading agent traces, debugging prompts, finding where budget burns.
Linear
Issues, not sprints. A place to record what shipped, not what might.
Loom + phone
User research artefacts. A concierge walking a ryokan on FaceTime beats a survey every time.
Claude · GPT
A thinking partner, used directly. Not a feature to be shipped.
Explicitly not in the stack
Jira · ticket relay
Confluence · docs about docs
Roadmap decks · quarterly theatre
PRD templates · pre-built hallucinations
Figma handoff · no handoff

None of these tools are bad. They simply presume a world in which execution is scarce and coordination is the point. That world is gone.

Processes · how they work with agents

A single day, end to end.

Not "managing AI features." Delegating to agents, defining failure modes, owning the evals, reading the traces. Tuesday, 14 October 2025:

Overnight

The Verification agent ships a 3% expansion to "freshness" signals — anything not re-checked in 90 days gets flagged a shade earlier.

09:00

The PM opens Braintrust. A regression: the signal over-triggers on properties with seasonal menu rotations — autumn kaiseki is triggering a false "stale" flag.

10:30

They write four new evals in Braintrust capturing the failure mode. Seasonality-aware freshness. No meeting. No spec doc.

12:00

They ship a fix in Cursor that passes the evals. A pull request reviewed by one engineer. Merged by 12:40.
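The shape of that fix, sketched under assumptions: the 90-day base threshold is from the overnight change above; the longer seasonal leash and all names are invented for illustration:

```python
def freshness_flag(days_since_check, seasonal_rotation,
                   base_days=90, seasonal_days=150):
    """Flag a property as stale, with a longer leash for seasonal menu
    rotations so an autumn kaiseki swap no longer trips a false flag."""
    threshold = seasonal_days if seasonal_rotation else base_days
    return days_since_check > threshold
```

The four new evals from 10:30 pin this behaviour in place: a seasonal property at 100 days must not flag, a non-seasonal one must.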

14:00

They re-run traffic. A Loom for the team: "Shipped seasonality-aware freshness. Regression gone. Coverage held."

16:00

They review LangSmith traces. Confirm the regression is gone. The Saturday eval suite grows by four tests. That is the day.

This is the job. A single person defined a failure mode, wrote the evals, shipped the fix, verified the result — between coffee and the evening walk. A PM who waits four weeks for an engineering slot to do the same thing is not a PM. They are a project manager with a product title.