Terminal Benchmarking
mux ships with a headless adapter for Terminal-Bench. The adapter runs the Electron backend without opening a window and exercises it through the same IPC paths we use in integration tests. This page documents how to launch benchmarks from the repository tree.
Prerequisites
- Docker must be installed and running. Terminal-Bench executes each task inside a dedicated Docker container.
uvis available in the nixdevShell(provided viaflake.nix), or install it manually from https://docs.astral.sh/uv/.- Standard provider API keys (e.g.
ANTHROPIC_API_KEY,OPENAI_API_KEY) should be exported so mux can stream responses.
Optional environment overrides:
| Variable | Purpose | Default |
|---|---|---|
MUX_AGENT_REPO_ROOT | Path copied into each task container | repo root inferred from the agent file |
MUX_TRUNK | Branch checked out when preparing the project | main |
MUX_WORKSPACE_ID | Workspace identifier used inside mux | mux-bench |
MUX_MODEL | Preferred model (supports provider/model syntax) | anthropic/claude-sonnet-4-5 |
MUX_THINKING_LEVEL | Optional reasoning level (off, low, medium, high) | high |
MUX_MODE | Starting mode (plan or exec) | exec |
MUX_TIMEOUT_MS | Optional stream timeout in milliseconds | no timeout |
MUX_CONFIG_ROOT | Location for mux session data inside the container | /root/.mux |
MUX_APP_ROOT | Path where the mux sources are staged | /opt/mux-app |
MUX_PROJECT_PATH | Explicit project directory inside the task container | auto-detected from common paths |
Running Terminal-Bench
All commands below should be run from the repository root.
Quick smoke test (single task)
uvx terminal-bench run \
--dataset terminal-bench-core==0.1.1 \
--agent-import-path benchmarks.terminal_bench.mux_agent:MuxAgent \
--n-tasks 1
This downloads the Terminal-Bench runner, copies the mux sources into the container, and validates the adapter against the first task only. Use this before attempting a full sweep.
Full dataset
uvx terminal-bench run \
--dataset terminal-bench-core==0.1.1 \
--agent-import-path benchmarks.terminal_bench.mux_agent:MuxAgent
Results (pass/fail, token usage, wall-clock) are printed at the end of the run. Terminal-Bench also writes per-task logs under the current working directory; review them when diagnosing failures.
You can also use make:
TB_CONCURRENCY=6 TB_LIVESTREAM=1 \
make benchmark-terminal TB_ARGS="--n-tasks 3 --model anthropic/claude-sonnet-4-20250514 --agent-kwarg mode=plan --agent-kwarg thinking_level=medium"
TB_DATASET defaults to terminal-bench-core==0.1.1, but can be overridden (e.g. make benchmark-terminal TB_DATASET=terminal-bench-core==head).
Use --agent-kwarg mode=plan to exercise the plan/execute workflow—the CLI will gather a plan first, then automatically approve it and switch to execution. Leaving the flag off (or setting mode=exec) skips the planning phase.
Use TB_CONCURRENCY=<n> to control --n-concurrent (number of concurrently running tasks) and TB_LIVESTREAM=1 to stream log output live instead of waiting for the run to finish. These map to Terminal-Bench’s --n-concurrent and --livestream flags.
How the Adapter Works
The adapter lives in benchmarks/terminal_bench/mux_agent.py. For each task it:
- Copies the mux repository (package manifests +
src/) into/tmp/mux-appinside the container. - Ensures Bun exists, then runs
bun install --frozen-lockfile. - Launches
src/cli/debug/agentSessionCli.tsto prepare workspace metadata and stream the instruction, storing state underMUX_CONFIG_ROOT(default/root/.mux).
MUX_MODEL accepts either the mux colon form (anthropic:claude-sonnet-4-5) or the Terminal-Bench slash form (anthropic/claude-sonnet-4-5); the adapter normalises whichever you provide.
Troubleshooting
command not found: bun– ensure the container can reach Bun’s install script, or pre-install Bun in your base image. The adapter aborts if the install step fails.- Workspace creation errors – set
MUX_PROJECT_PATHto the project directory inside the task container if auto-discovery misses it. - Streaming timeouts – pass
--n-tasks 1while iterating on fixes, or setMUX_TIMEOUT_MS=180000to reinstate a timeout if needed.