Install Docker for Evals
Use Docker Desktop to give the eval runner a real isolation boundary. Without Docker, the runner falls back to local mode, which is best-effort and has known leak paths.
When to Use This
- You plan to run trigger evals (local mode can leak host skills into the workspace)
- You want runs to be reproducible across machines
- You publish a module and want the same eval verdicts other developers see
- You want a guaranteed-empty `HOME` so global memory cannot influence results
When to Skip This
- One-off iteration on artifact evals where local fallback is good enough for now
- A constrained environment where installing Docker is not feasible. The runner falls back to local mode and tells you it is doing so.
Why Docker
The eval runner needs to start each run from a clean slate. It is trying to measure the skill, not the host's accumulated state. Without isolation, three things contaminate the result.
- Global memory and CLAUDE.md. Your `~/.claude/CLAUDE.md` and auto-memory load on every Claude Code invocation. They influence outputs in ways the skill author cannot control.
- Ancestor configuration. A `CLAUDE.md` anywhere above the skill in the directory tree gets discovered and loaded.
- Host-installed skills. When `claude -p` runs in a directory with `.claude/skills/` somewhere up the tree, those skills are discoverable and can fire instead of (or alongside) the skill under test. This is especially harmful for trigger evals.
Docker solves all three. The container has its own filesystem, its own HOME, and its own .claude/. Local mode patches HOME and creates a temp directory but cannot prevent ancestor discovery.
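To see how much ancestor state a host directory would expose in local mode, you can walk up the tree yourself. The following is a minimal sketch, not the runner's or Claude Code's actual discovery logic. The function name `find_ancestor_config` is hypothetical, and it checks only for the two path names mentioned above.

```python
from __future__ import annotations

from pathlib import Path


def find_ancestor_config(start: Path) -> list[Path]:
    """Return CLAUDE.md files and .claude/skills directories at or above `start`.

    Illustrative only: this approximates "ancestor discovery" by checking
    each parent directory for the two names discussed in this section.
    """
    found: list[Path] = []
    current = start.resolve()
    # Check the starting directory first, then every parent up to the root.
    for directory in [current, *current.parents]:
        for candidate in (directory / "CLAUDE.md", directory / ".claude" / "skills"):
            if candidate.exists():
                found.append(candidate)
    return found


if __name__ == "__main__":
    for hit in find_ancestor_config(Path.cwd()):
        print(hit)
```

An empty result suggests the directory is relatively clean. Any hit is state that local mode cannot hide from the run.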
Step 1: Install Docker Desktop
Download Docker Desktop for your platform:
| Platform | Where to Get It |
|---|---|
| macOS | docker.com/products/docker-desktop |
| Windows | docker.com/products/docker-desktop |
| Linux | Docker Engine via your distribution's package manager, or Docker Desktop for Linux |
Follow the installer's prompts. On macOS, drag the Docker app to Applications and launch it. On Windows, the installer enables WSL 2 if needed.
Step 2: Start Docker Desktop
Launch Docker Desktop. Wait for the whale icon to indicate Docker is running. The eval runner shells out to the docker CLI; if Docker is not running, the runner falls back to local mode and tells you why.
Step 3: Verify Installation
Confirm Docker is reachable from your terminal by running `docker info`. A successful response means the eval runner can use Docker. An error means Docker is not running, or the CLI cannot reach the daemon.
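If you want to run the same check from a script, the idea can be sketched as follows. This is an illustration under assumptions, not the runner's own code: the function name `docker_available` and the timeout value are hypothetical, and the only real command used is `docker info`.

```python
import subprocess


def docker_available(timeout: float = 10.0, run=subprocess.run) -> bool:
    """Return True if `docker info` exits with status 0.

    `run` is injectable so the check can be exercised without Docker installed.
    """
    try:
        result = run(
            ["docker", "info"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # The CLI is missing from PATH, cannot be executed,
        # or the daemon did not answer in time.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    print("Docker reachable" if docker_available() else "Docker not reachable")
```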
Step 4: Let the Runner Build the Image
The first time you invoke the eval runner with --isolation docker (or auto when Docker is available), the runner builds bmad-eval-runner:latest from a Dockerfile shipped with the skill. This takes a few minutes once. Subsequent runs reuse the cached image.
The image is a minimal Node 20 base with Claude Code, Python 3, and standard tools. Nothing skill-specific or user-specific lives in the image. Your credentials are mounted in at run time, not baked in.
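The interaction between the requested isolation mode and Docker availability can be sketched as a small decision function. This is a guess at the shape of the logic, not the runner's implementation. The name `resolve_isolation` and the `local` value are assumptions, and the source does not say whether an explicit `docker` request falls back or fails when Docker is missing. The sketch treats it like `auto`.

```python
from __future__ import annotations


def resolve_isolation(requested: str, docker_ok: bool) -> tuple[str, str | None]:
    """Return (mode, notice). `notice` is set only when a fallback happens."""
    if requested == "local":  # assumed value; not confirmed by the docs
        return "local", None
    if requested in ("docker", "auto"):
        if docker_ok:
            return "docker", None
        # Assumption: fall back rather than abort, and say so.
        return "local", "Docker is not reachable; falling back to local mode."
    raise ValueError(f"unknown isolation mode: {requested!r}")


if __name__ == "__main__":
    mode, notice = resolve_isolation("auto", docker_ok=False)
    if notice:
        print(notice)
    print(f"isolation mode: {mode}")
```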
What You Get
- Reproducible runs: the same eval produces the same workspace state on any machine with the image
- Real `HOME` isolation: the container's `/home/evaluator` is empty, not just overridden
- Trigger evals you can trust: only the synthetic skill staged for the test is discoverable, not your host's installed skills
- Network can be locked down per run if your evals do not need internet access
- Rebuild the image with `python3 scripts/docker_setup.py --rebuild` if you ever need to reset it
- Per-eval container resource use is small (a few hundred MB). Parallel workers each spin up their own container.
- If `docker info` works in one terminal but not in your editor's integrated terminal, your shell PATH probably differs. Open a fresh terminal session.
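Two properties from the list above, the optional network lockdown and one throwaway container per worker, map onto ordinary `docker run` flags (`--network none` and `--rm`). The sketch below shows how such a command line might be assembled. It is an assumption about the general shape, not the command the runner actually issues. The function name `build_run_command` and the `/workspace` mount point are hypothetical.

```python
from __future__ import annotations


def build_run_command(
    image: str,
    workspace: str,
    allow_network: bool = False,
) -> list[str]:
    """Assemble a `docker run` argv for one isolated eval (illustrative only)."""
    cmd = ["docker", "run", "--rm"]  # --rm: remove the container when it exits
    if not allow_network:
        cmd += ["--network", "none"]  # no network access inside the container
    # Mount the staged workspace at a hypothetical path and work from there.
    cmd += ["-v", f"{workspace}:/workspace", "-w", "/workspace"]
    cmd.append(image)
    return cmd


if __name__ == "__main__":
    print(" ".join(build_run_command("bmad-eval-runner:latest", "/tmp/eval-ws")))
```

Building the argument list as data, rather than as one shell string, avoids quoting problems and makes the per-run options easy to inspect before anything is launched.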
Next Steps
Run the eval runner against a skill: see Run Evals Against a Skill. For isolation internals, see the eval-runner skill's references/isolation.md.