Execution used to be the bottleneck. So the product manager became an optimiser of throughput — a writer of PRDs, a chair of rituals, a shepherd of handoffs. That era is over. Judgment is the bottleneck now. The PM at Tabi is the person with judgment.
"Most PMs were never actually bottlenecked by execution. They were bottlenecked by taste and judgment. Team capacity functioned as a governor that prevented bad ideas from shipping. Remove that governor and you discover who was driving and who was just steering."
Tokyo HQ · reports to CEO · owns the Verification agent and the "walked by us" standard
We are hiring an operator. Someone who ships production code in Cursor or Claude Code. Who writes their own eval suites in Braintrust before the Verification agent runs against live inventory. Who reads a LangSmith trace without asking for help — and, when the agent misclassifies a property as "still verified" after a renovation, opens the trace, finds the prompt weakness, and ships the fix themselves.
Above all they have taste — the judgment to know what's worth shipping when capacity is infinite. The verification signal is the foundation of every downstream feature we sell. If this layer slips, Replace-in-place loses its match set, Language Bridge loses its context, Memory loses its accuracy. The PM who owns it is the person we trust to hold that line.
They are fluent across the whole stack — from foundation models through prompt and context engineering, through prototyping in Cursor or Bolt, through evals in Braintrust and Arize, through orchestration of six agents, through LangSmith observability, all the way up to product strategy. Not one layer deep. They can talk to a Tokyo concierge about an edge case on a ryokan walk and to a foundation-model researcher about context window degradation, in the same morning.
One number. Percentage of properties on Tabi with a Verification visit in the last 180 days.
Target: 98%. Nothing else on the scorecard.
They prototype the next Verification signal in Claude Code — a check that flags any ryokan whose breakfast-menu photo is more than 120 days old. Working prototype runs against a slice of production inventory by lunch. No PRD is written.
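A check like that is small enough to sketch in plain Python. This is a minimal sketch of the freshness test described above, not Tabi's actual code: the `breakfast_menu_photo_at` field name is a hypothetical schema choice, and only the 120-day threshold comes from the text.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(days=120)  # threshold from the prototype described above

def flag_stale_menu_photo(property_record, now=None):
    """Return True if the breakfast-menu photo is older than 120 days.

    `property_record` is assumed to carry an ISO-8601 timestamp under
    `breakfast_menu_photo_at` — an illustrative field name, not a real schema.
    """
    now = now or datetime.now(timezone.utc)
    taken_at = datetime.fromisoformat(property_record["breakfast_menu_photo_at"])
    return (now - taken_at) > STALE_AFTER

# A year-old photo gets flagged; a two-week-old one does not.
old = {"breakfast_menu_photo_at": "2024-06-01T09:00:00+00:00"}
fresh = {"breakfast_menu_photo_at": "2025-10-01T09:00:00+00:00"}
```

The point of the prototype is the loop, not the function: run exactly this against a slice of real inventory and look at what it flags before writing anything down.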
They write 24 new evals in Braintrust against last week's failure logs — false-positive drift flags on properties that haven't actually changed, false-negative misses on properties that quietly dropped a floor. They define what "verified" means before the agent runs against production.
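Braintrust has its own SDK; what follows is a plain-Python sketch of the shape of such an eval, splitting scores by the two failure modes named above. The dataclass fields and the toy signal are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    """One labelled example pulled from a week's failure logs (illustrative shape)."""
    property_id: str
    snapshot: dict          # property state fed to the drift signal
    expected_drift: bool    # ground truth: did the property actually change?

def run_evals(signal, cases):
    """Score a drift signal, reporting false positives and false negatives separately."""
    false_pos = [c for c in cases if signal(c.snapshot) and not c.expected_drift]
    false_neg = [c for c in cases if not signal(c.snapshot) and c.expected_drift]
    passed = len(cases) - len(false_pos) - len(false_neg)
    return {"passed": passed, "false_pos": len(false_pos), "false_neg": len(false_neg)}

# A toy signal and one case from each failure mode described above.
toy_signal = lambda snap: snap.get("floors_changed", False) or snap.get("menu_age_days", 0) > 120
cases = [
    EvalCase("ryokan-001", {"menu_age_days": 130}, expected_drift=False),   # drift flag, no real change
    EvalCase("ryokan-002", {"floors_changed": False}, expected_drift=True), # quietly dropped a floor
]
```

Keeping false positives and false negatives as separate counts matters here: the two failure modes have different costs (a needless re-walk versus a stale listing sold as verified), so a single accuracy number would hide the trade-off being tuned.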
They ship the signal to 10% of live inventory. A one-minute Loom replaces the status update. The regional concierges in Kyoto and Fukuoka get a Linear issue each: three properties flagged for re-walk within the week.
They review the eval deltas in LangSmith. One branch is regressing on older ryokan with seasonal menu rotations — the signal is noisy in October. They kill that branch. The pricing agent holds the earlier version in production. Two hours of work, no meeting.
They call three concierges who hit the failure mode — Sapporo, Osaka, Naha — and watch them walk two properties over FaceTime. The Saturday commit fixes the prompt, raises the threshold on seasonal menus, and pushes Verification coverage from 96.4% to 97.1%.
Prototype in Claude Code instead. Evidence precedes documentation. If the Verification signal works in a prototype by Tuesday afternoon, the prototype is the spec. A PRD written in advance is a hallucination about a product that does not yet exist.
Stakeholder theatre. Ship the thing, then show the thing. A one-minute Loom of the live agent beats a sixty-slide plan of agents that might exist by Q3.
The ritual was a coordination tax paid to prevent bad ideas from reaching production. Judgment is the governor now, and evals enforce it. Standups optimise for coordination. Tabi optimises for eval scores.
PM → Figma → ticket → QA → launch is a broken telephone game. The PM builds in the codebase. There is no handoff to drift through.
Capacity is infinite. Taste is the bottleneck. A PM who needs a six-week engineering slot to test a hypothesis is a PM who hasn't learned the new cycle yet.
None of these tools are bad. They simply presume a world in which execution is scarce and coordination is the point. That world is gone.
Not "managing AI features." Delegating to agents, defining failure modes, owning the evals, reading the traces. Tuesday, 14 October 2025:
The Verification agent ships a 3% expansion to "freshness" signals — anything not re-checked in 90 days gets flagged a shade earlier.
The PM opens Braintrust. A regression: the signal over-triggers on properties with seasonal menu rotations — autumn kaiseki is triggering a false "stale" flag.
They write four new evals in Braintrust capturing the failure mode. Seasonality-aware freshness. No meeting. No spec doc.
They ship a fix in Cursor that passes the evals. A pull request reviewed by one engineer. Merged by 12:40.
They re-run traffic. A Loom for the team: "Shipped seasonality-aware freshness. Regression gone. Coverage held."
They review LangSmith traces. Confirm the regression is gone. The Saturday eval suite grows by four tests. That is the day.
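The seasonality-aware fix from that day can be sketched in a few lines. Only the 90-day window comes from the text; the function shape, the `seasonal_menu` flag, and the doubled grace period are assumptions for illustration, not the shipped change.

```python
from datetime import date, timedelta

FRESHNESS_WINDOW = timedelta(days=90)  # anything not re-checked in 90 days gets flagged

def is_stale(last_checked: date, today: date, seasonal_menu: bool = False) -> bool:
    """Flag a property as stale if it has not been re-checked within the window.

    Seasonality-aware: properties with seasonal menu rotations (autumn kaiseki)
    get a longer grace period so an expected menu change is not mistaken for
    drift. The 2x multiplier is an illustrative assumption.
    """
    window = FRESHNESS_WINDOW * 2 if seasonal_menu else FRESHNESS_WINDOW
    return (today - last_checked) > window
```

The regression in the morning's Braintrust run was exactly the case this branch handles: a property 105 days out is stale under the flat rule but inside the grace period once its menu rotation is known.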
This is the job. A single person defined a failure mode, wrote the evals, shipped the fix, verified the result — between coffee and the evening walk. A PM who waits four weeks for an engineering slot to do the same thing is not a PM. They are a project manager with a product title.