How I run 4–8 parallel coding agents with tmux and markdown specs
I’ve been running 4–8 parallel coding agents as my primary development setup since January 2025. No orchestrator, no custom infrastructure. Just tmux, markdown files, bash aliases, and six slash commands.
Each agent takes on a role:
| Agent | What it does |
|---|---|
| Planner | Explore code, design features, iterate on specs |
| Worker | Implement from a finished spec |
| PM | Backlog grooming, prioritization, idea dumping |
These are not formal subagents. No special system prompts, no skill definitions, no subagent configs. Just a naming convention.
The core idea: every unit of work gets a written spec (I call them Feature Designs, or FDs) before any agent writes code. Agents pick up specs, execute them, and close them. This separation of planning and execution is what makes parallelism work. In the last two months, this has produced 300+ completed FDs (small-to-medium scoped changes; one spec, one implementation pass) across three projects.
/fd-init bootstraps the full system into any repo. This article walks through how it works.
Feature Design tracking
Each FD is:
- A numbered spec file (FD-001, FD-002, etc.) with the problem, solution, files to change, and verification plan
- Tracked in an index across all FDs
- Managed through slash commands for the full lifecycle
Each FD file lives in docs/features/ and moves through 8 stages:
| Stage | What it means |
|---|---|
| Planned | Identified, not yet designed |
| Design | Actively designing the solution |
| Open | Designed, ready for implementation |
| In Progress | Currently being implemented |
| Pending Verification | Code complete, awaiting runtime verification |
| Complete | Verified working, ready to archive |
| Deferred | Postponed indefinitely |
| Closed | Won’t fix |
Six slash commands handle the lifecycle:
| Command | What it does |
|---|---|
| /fd-new | Create a new FD from an idea dump |
| /fd-status | Show the index: what’s active, pending verification, and done |
| /fd-explore | Bootstrap a session: load architecture docs, dev guide, FD index |
| /fd-deep | Launch 4 parallel Opus agents to explore a hard design problem |
| /fd-verify | Proofread code, propose a verification plan, commit |
| /fd-close | Archive the FD, update the index, update the changelog |
Every commit ties back to its FD: `FD-049: Implement incremental embedding refresh`. The changelog accumulates automatically as FDs complete.
A typical FD file looks like this:
```
FD-051: Topic Hierarchy Deduplication

Status: Open      Priority: Medium
Effort: Medium    Impact: Clean topic hierarchy for downstream queries

## Problem

Topic extraction generates near-duplicate topics across runs:
"customer onboarding", "onboarding flow", "user onboarding"
all exist as separate entries.

## Solution

Add a merge step after extraction:
1. Cluster by cosine similarity (>=0.90)
2. Pick the most frequent as canonical, and alias the rest.
3. For ambiguous matches (>0.5 and <0.90), use an LLM to verify fit before merging.
4. Store alias mappings so downstream queries resolve automatically.

## Files to Modify

- src/topics/merge.py (new: LLM + clustering logic)
- src/topics/aliases.py (new: alias resolution for queries)
- sql/01_schema.sql (add topic_aliases table)
- sql/06_merge_procedure.sql (new: scheduled merge after extraction)

## Verification

1. Deploy to test env, run merge on live topic table
2. Verify no errors in operation log, run health checks
3. Query topic_aliases: "customer onboarding" cluster has expected mappings
4. Run tests, confirm alias resolution in downstream queries
```
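As an illustration, the clustering step in the Solution could be sketched like this. This is hypothetical code, not the project’s actual `src/topics/merge.py`: the function names are made up, and the LLM check for the ambiguous band is left out.

```python
import numpy as np

def cluster_topics(names, embeddings, threshold=0.90):
    """Greedy single-link clustering of topics by cosine similarity (sketch)."""
    vecs = np.asarray(embeddings, dtype=float)
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # unit vectors: dot = cosine
    clusters = []  # each cluster is a list of indices into `names`
    for i in range(len(names)):
        for cluster in clusters:
            # join the first cluster where i is similar to every existing member
            if all(vecs[i] @ vecs[j] >= threshold for j in cluster):
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters

def pick_canonical(cluster, names, frequency):
    """Most frequent surface form becomes canonical; the rest become aliases."""
    canonical = max(cluster, key=lambda i: frequency[names[i]])
    return names[canonical], [names[i] for i in cluster if i != canonical]
```

The alias mappings returned here would then be written to the `topic_aliases` table so downstream queries resolve automatically.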
The FEATURE_INDEX.md tracks status across all FDs:
## Active Features
| FD | Title | Status | Effort | Priority |
|--------|--------------------------------|----------------------|--------|----------|
| FD-051 | Topic hierarchy deduplication | Open | Medium | Medium |
| FD-052 | Streaming entity extraction | In Progress | Large | High |
| FD-050 | Confidence-based routing | Pending Verification | Medium | High |
## Completed
| FD | Title | Completed | Notes |
|--------|--------------------------------|------------|---------------------|
| FD-049 | Incremental embedding refresh  | 2026-02-20 | 45 min → 2 min      |
| FD-048 | Operation log cleanup | 2026-02-18 | |
Portable: /fd-init
The original project’s FD system was built slowly over months. I wanted the same structure in every new project without repeating that work, so I packaged it as a slash command.
Run /fd-init in any repo and it:
- Infers project context from CLAUDE.md, package configs, and git log
- Creates the directory structure (`docs/features/`, `docs/features/archive/`)
- Generates a `FEATURE_INDEX.md` customized to the project
- Creates an FD template
- Installs the slash commands (`/fd-new`, `/fd-status`, `/fd-explore`, `/fd-deep`, `/fd-verify`, `/fd-close`)
- Appends FD lifecycle conventions to the project’s CLAUDE.md
The command finishes with a summary of what it created:

```
FD System Initialized

Files Created
- docs/features/FEATURE_INDEX.md  → Feature index
- docs/features/TEMPLATE.md       → FD file template
- docs/features/archive/          → Archive directory
- CHANGELOG.md                    → Changelog (Keep a Changelog format)
- CLAUDE.md                       → Project conventions with FD management section
- .claude/commands/fd-new.md      → Create new FD
- .claude/commands/fd-explore.md  → Project exploration
- .claude/commands/fd-deep.md     → Deep parallel analysis
- .claude/commands/fd-status.md   → Status and grooming
- .claude/commands/fd-verify.md   → Verification workflow
- .claude/commands/fd-close.md    → Close and archive FD with changelog update

Next Steps
1. Run /fd-new to create your first feature design
2. Run /fd-status to check the current state
```
The development loop
How I plan
I spend most of my time working with Planners. Each one starts with /fd-explore, which loads codebase context and past work so the agent doesn’t start from zero: architecture docs, dev guide, readmes, FD index. I customize it per project as it grows.
From there, I work through the FD design:
on fd14 - can we use airflow? how is the legacy batch notify system scheduled on airflow? service user?
In Boris Tane’s How I Use Claude Code, he describes how he uses inline annotations to give Claude feedback. I adapted this pattern for complex FDs where conversational back-and-forth can be imprecise. I edit the FD file directly in Cursor and add inline annotations prefixed with %%:
```
## Solution

Replace cron-based notify scheduling with Snowflake Tasks.
EVALUATE_ALERTS_SP runs as SYSTEM via user_email().
%% system user won't run as main modeling role? can we run a quick experiment to check

Batch window: 15-min intervals, drain in-flight notifications before cutover.
Route failures to the existing dead-letter queue in autonotify.
%% what happens to in-flight notifications during cutover? need to confirm drain behavior
```
Then in Claude Code:
fd14 - check %% notes.
Claude revises the design, removes the annotations, and the cycle repeats until the design is solid.
For critical designs, I may do two things:
- Cross-check the plan in Cursor with GPT or Gemini to catch blind spots.
- Run /fd-deep, which launches 4 Opus agents in parallel to explore different angles:

are you sure setting this up with the Airflow service user will bypass the RAP policy on this table? use /fd-deep.
Each agent runs in read-only Explore mode with a specific angle to investigate (algorithmic, structural, incremental, environmental, or whatever fits the problem). Once they report back, the orchestrator verifies their factual claims (file paths, function signatures, behavioral assumptions), flags contradictions, and synthesizes a ranked recommendation with confidence levels, tradeoffs, and a concrete first step.
The pattern borrows from GPT Pro’s parallel test-time compute¹, adapted for design questions where there’s no single correct answer.
Worker execution
Once an FD is marked Open, a Worker picks it up. I point it at the FD, turn on plan mode so Claude builds a line-level implementation plan, review it, then switch to accept edits and let it run. Most FDs are self-contained: one design, one implementation pass, working on a dev branch. When a feature needs isolation, I tell the agent to create a git worktree. Claude Code handles it natively. The finished FD contains all the files and details, so even after compaction the Worker stays on track.
Verification
When Workers finished, I kept typing the same things:
proofread your code end to end, must be airtight
check for edge cases again
commit now, then create a verification plan on live test deployment.
It’s well known that agents consistently find more issues when prompted to review their own work. So I built /fd-verify. It does a proofread pass, proposes a verification plan, and commits.
Some projects go further with dedicated slash commands like /test-cli that run full verification against real deployments. These aren’t traditional test suites. There’s no test runner and no assert statements. The agent reads markdown instructions, executes commands against real infrastructure, reasons about whether the results are correct, and writes structured results: markdown files with tables, timestamps, and diagnostic notes.
When something fails, the agent can investigate on the spot rather than just flagging it. By the end, the result comes back diagnosed. For systems that are inherently async and run on real data, an LLM following markdown instructions is a more natural verification harness than pytest.
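A results file from one of these runs might look something like this. This is purely illustrative; the exact layout, checks, and notes vary by project:

```markdown
## /test-cli results — 2026-02-21

| Check                           | Result      | Notes                                  |
|---------------------------------|-------------|----------------------------------------|
| Deploy procedures to test env   | PASS        |                                        |
| Run merge on live topic table   | PASS        | row counts before/after recorded       |
| Operation log                   | PASS        | no ERROR entries for this run          |
| Alias resolution in queries     | FAIL → PASS | stale cache diagnosed, cleared, re-ran |
```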
One cycle
Putting it all together:
```
PM window:
 1. /fd-status                      → What's active, what's pending, what's done
 2. Pick an FD (or /fd-new)         → Groom the backlog or dump a new idea

Planner window (new agent session):
 3. /fd-explore                     → Load project context
 4. Design the FD                   → /fd-deep if stuck, cross-check in Cursor
 5. Mark FD status Open             → Design is solid, ready for implementation

Worker window (fresh agent session):
 6. /fd-explore                     → Fresh context load
 7. "Implement FD-XXX" (plan mode)  → Claude builds a line-level implementation plan
 8. Implement with atomic commits   → FD-XXX: description
 9. /fd-verify                      → Proofread, verification plan
10. Test on real deployment         → Verification skills or manual
11. /fd-close                       → Archive, update index, changelog
```
The Planner and Worker are separate sessions on purpose. Planning can burn through multiple context windows as the agent explores the codebase, and compaction tends to drop files the Planner still needs. I always start Workers fresh with just the FD, or with /fd-explore when they need broader project context.
Where the decisions live
FD files as decision traces
The development loop produces a trail of FD files. Each one captures more than the task itself: it records what was considered, what was chosen, what was rejected, and why. In practice, when a new agent picks up an FD, it may launch an Explore subagent that, unprompted, finds past FDs with related work. The agent arrives with context about prior decisions. The FD archive is institutional memory that accumulates with every completed feature.
The dev guide
Every project accumulates practical lessons. The dev guide (docs/dev_guide/) captures these as short entries. Agents read a summary on session start and can go deeper into any specific entry when it’s relevant to the task. Unlike the FD system (which bootstraps in seconds via /fd-init), the dev guide grows organically. Each lesson becomes a new entry as it comes up.
Some examples from one project:
| Entry | What it covers |
|---|---|
| No silent fallback values | Config errors fail loudly instead of hiding behind defaults |
| DRY: extract helpers and utilities | Don’t rewrite the same parser or validation logic twice |
| No backwards compatibility | All deployments are test environments, no migration code necessary |
| Operation log naming conventions | Uniform operation types across all features |
| Embedding handling | Always normalize with parse_embedding(), never trust raw format from Snowpark |
| Deployment safety | Destructive ops must wait for running tasks to complete before modifying procedures |
| LLM JSON parsing | Use parse_llm_json() with strict=False and regex fallback, never raw json.loads() |
The dev guide is separate from CLAUDE.md on purpose. CLAUDE.md loads into every session, so it stays lean: commit style, tool preferences, hard guardrails. The dev guide entries are denser, often with inline code examples, and load on demand via /fd-explore when they’re relevant to the current task.
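For instance, the “LLM JSON parsing” entry might carry a sketch like this. Treat it as a hypothetical reconstruction: the real `parse_llm_json()` helper in the project may behave differently.

```python
import json
import re

def parse_llm_json(text: str):
    """Parse JSON out of an LLM reply that may wrap it in prose or code fences."""
    try:
        # strict=False tolerates control characters inside strings
        return json.loads(text, strict=False)
    except json.JSONDecodeError:
        pass
    # Regex fallback: grab the outermost {...} or [...] span and retry
    match = re.search(r"\{.*\}|\[.*\]", text, re.DOTALL)
    if match:
        return json.loads(match.group(0), strict=False)
    raise ValueError("no JSON found in LLM output")
```

The point of the entry is the convention, not the code: agents reach for this helper instead of raw `json.loads()` every time they parse model output.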
Two-tier CLAUDE.md
Claude Code loads a CLAUDE.md file at the start of every session. I split this into two tiers:
Global (~/.claude/CLAUDE.md) sets rules that apply everywhere: no AI attribution in commits, Python and SQL conventions, and never bypass denied commands.
Project-level (<repo>/CLAUDE.md) adds project conventions and FD lifecycle rules (written by /fd-init).
How it compounds
Past FDs reduce future decision points. For example, an early FD ran performance optimization experiments on a hotpath and documented what worked and what didn’t. Months later, a new agent working on a feature that touched that same hotpath found the old FD and asked whether benchmarking was needed before writing any code. Without that FD in the archive, I would have had to catch that myself or the agent would have just gone ahead.
The physical setup
```
┌─────────────────────────┬─────────────────────────┬─────────────────────────┐
│                         │                         │                         │
│  Cursor (IDE)           │  Ghostty Terminal 1     │  Ghostty Terminal 2     │
│                         │  tmux                   │  tmux                   │
│                         │                         │                         │
│  Visual browsing        │  Window 1: PM           │  Window 1: Worker       │
│  Hand edits             │  Window 2: Planner      │  Window 2: Worker       │
│  Cross-model checks     │  Window 3: Planner      │  Window 3: Worker       │
│                         │  Window 4: Planner      │  Window 4: bash         │
│                         │                         │                         │
└─────────────────────────┴─────────────────────────┴─────────────────────────┘
```
Three panels across an ultrawide monitor:
- Cursor (left) for visual code browsing, hand edits, and cross-checking plans with other models like GPT or Gemini.
- Two Ghostty terminals (middle and right), each running a tmux session.
Two coding agents across the terminals:
- Claude Code is my daily driver for general-purpose coding.
- Cortex Code is Snowflake’s coding agent. It runs the latest Opus model, loads the same
CLAUDE.md file, and comes bundled with internal custom profiles that connect to MCPs and prebuilt skills tailored to Snowflake workflows.
I use mostly vanilla tmux to navigate: Ctrl-b n/p to cycle windows, Ctrl-b , to rename them (planner, worker-fd038, PM), Ctrl-b c to spin up a new agent, Ctrl-b s to browse sessions. A few custom additions: Shift-Left/Right to reorder windows, m to move a window between sessions, and renumber-windows on so closing a tab doesn’t leave gaps.
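The custom additions look roughly like this in `~/.tmux.conf`. This is a sketch of the bindings described above; the exact key tables and commands in my config may differ:

```
# Reorder windows with Shift-Left/Right
bind -n S-Left  swap-window -t -1 \; select-window -t -1
bind -n S-Right swap-window -t +1 \; select-window -t +1

# Move the current window to another session (prompts for the session name)
bind m command-prompt -p "target session:" "move-window -t '%%:'"

# Closing a window renumbers the rest, so no gaps are left behind
set -g renumber-windows on
```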
Every project gets a g* alias (“go to”) for instant navigation:
| Alias | Project |
|---|---|
| `gllmt` | `~/workspace/snowflake/llmt` |
| `gautonotify` | `~/workspace/snowflake/autonotify` |
| `gdatakit` | `~/workspace/datakit` |
| `gclaude` | `~/.claude` |
Claude reads them too. I tell Claude “run the eval in gllmt” and it resolves the alias to the actual path.
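In shell-rc form, the aliases are one line each (a sketch; adjust the paths to your own checkout layout):

```shell
# ~/.zshrc — "go to" aliases for instant project navigation
alias gllmt='cd ~/workspace/snowflake/llmt'
alias gautonotify='cd ~/workspace/snowflake/autonotify'
alias gdatakit='cd ~/workspace/datakit'
alias gclaude='cd ~/.claude'
```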
When an agent finishes, the tmux tab changes color. Two config layers make this work:
| Layer | File | What it does |
|---|---|---|
| Claude Code | `~/.claude/settings.json` | Notification hook (matcher: `idle_prompt`) sends bell (`\a`) to terminal |
| tmux | `~/.tmux.conf` | `monitor-bell on`, `bell-action any`, `window-status-bell-style reverse` |
Agent goes idle, Claude Code fires the hook, tmux catches the bell and inverts the tab color.
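Concretely, both layers are small. Here is a sketch; check the hook schema against your Claude Code version before copying, as the exact field names are assumptions. In `~/.claude/settings.json`:

```json
{
  "hooks": {
    "Notification": [
      {
        "matcher": "idle_prompt",
        "hooks": [{ "type": "command", "command": "printf '\\a'" }]
      }
    ]
  }
}
```

And in `~/.tmux.conf`:

```
setw -g monitor-bell on                   # watch every window for a bell
set  -g bell-action any
setw -g window-status-bell-style reverse  # invert the tab color on bell
```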
What’s hard
With 6+ agents running, there’s always something waiting for me, like a Planner with design questions or a Worker ready for verification. Managing that is where the system starts to strain.
Cognitive load is the real ceiling. Around 8 agents is my practical max. Past that, I lose track of what each one is doing and design decisions suffer.
Not everything parallelizes. Some features have sequential dependencies. Forcing parallelism on inherently serial work creates merge conflicts and wasted effort.
Context window limits. Planners burn through context windows fast. When compaction kicks in, it can drop files the agent needs to continue the design. I’ve learned to checkpoint FD progress before compaction hits.
Sandbox whack-a-mole. I deny destructive commands (`rm`, `git reset --hard`, `DROP`). The agent finds creative alternatives: `unlink`, `python -c "import os; os.remove()"`, `find ... -delete`. The permission system has evaluation order quirks where blanket allows override specific denies. My CLAUDE.md now says “If a command is denied, that’s the answer. Ask the user to do it.”
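The deny rules live under `permissions` in settings.json. A sketch, using Claude Code’s prefix-matching pattern style; verify the exact syntax against the current permission docs, and remember from above that this is not airtight:

```json
{
  "permissions": {
    "deny": [
      "Bash(rm:*)",
      "Bash(unlink:*)",
      "Bash(git reset --hard:*)"
    ]
  }
}
```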
Translating business context into FDs is still manual. Jira tickets, Slack threads, meeting notes, product decisions. I’m the bridge between all of that and a well-scoped FD. A dedicated subagent profile would close this gap.
¹ OpenAI describes GPT-5 Pro as using "scaled but efficient parallel test-time compute." Nathan Lambert on Lex Fridman #490 discusses the broader pattern of inference-time scaling: giving models more compute at generation time to explore multiple reasoning paths.
If you try this, I'd love to hear what you change.