Running GitHub Agentic Workflows (gh-aw) on self-hosted runners#
GitHub Agentic Workflows (gh-aw) are markdown workflow files compiled to GitHub Actions via gh aw compile. They run a live AI agent (Claude, Copilot, Codex, …) inside a sandboxed Docker stack that decides what tools to call and what steps to take.
1. Why agentic runners use container mode#
gh-aw was designed for GitHub-hosted runners, where every job gets a fresh, single-tenant VM. Each compiled workflow hardcodes machine-global resources:
- the runtime tree
/tmp/gh-aw(~80 references per workflow, including a-v /tmp/gh-aw:/tmp/gh-aw:rwmount and arm -rf /tmp/gh-awduring setup), - fixed host-network ports for the MCP gateway and helper servers,
- fixed Docker/AWF names (
gh-aw-mcpg,awf-net,awf-squid, …) and theDOCKER-USERiptables chain, - shared
$HOMEengine state (~/.copilot,~/.claude) and the Docker socket.
If several agentic jobs share one host, these collide. To make concurrent agentic runners stable on a single machine, gh sr runs every profile: agentic runner in container mode: each runner instance is a privileged Docker-in-Docker (DinD) container with its own inner dockerd, network namespace, MCP gateway port, iptables, and /tmp/gh-aw.
profile: agenticalways impliesrunner_mode: container.gh srrejectsprofile: agenticwithrunner_mode: native, because native mode cannot isolate the resources above. (The old per-instanceagentic_mcp_ports/gh-sr-mcp-<port>label scheme and the host-level dnsmasq/sudoers//opt/hostedtoolcachesetup are gone — they are unnecessary in container mode.)
Pristine per job#
Each long-lived runner container makes the inner environment pristine before and after every job via the official Actions runner hooks (ACTIONS_RUNNER_HOOK_JOB_STARTED / ACTIONS_RUNNER_HOOK_JOB_COMPLETED):
- before a job — remove any leftover inner containers, prune networks, flush AWF
iptables, delete/tmp/gh-aw, and (re)assert the AWF service-routing bypass; verify the innerdockerdis healthy; - after a job — tear down the gh-aw / AWF containers it created, prune networks, flush AWF rules, and delete
/tmp/gh-aw.
The inner Docker image-layer cache (/runner-state/docker-data) is the only state preserved across jobs, so resets never re-pull gh-aw’s images. This eliminates the entire class of “leftover from a previous/crashed job” failures (stale MCP gateway on the gateway port, orphan AWF containers, stale iptables) without any timing-based cleanup.
flowchart TD
subgraph host [One Linux host]
subgraph slotA ["gh-sr-myrepo-1 (privileged DinD)"]
hookA1["JOB_STARTED: reset to clean"]
jobA["agentic job: gh-aw + AWF + MCP gateway"]
hookA2["JOB_COMPLETED: tear down"]
hookA1 --> jobA --> hookA2 --> hookA1
end
subgraph slotB ["gh-sr-myrepo-2 (privileged DinD)"]
jobB["agentic job"]
end
end2. Host requirements#
The host runs only the outer runner containers, so its requirements are minimal. Everything gh-aw needs (gh-aw CLI, AWF, Node/Python tooling, Docker CE, dnsmasq, browser packages, DNS config, sudoers) lives inside the image that gh sr builds.
| Requirement | Details |
|---|---|
| OS | Linux only (Ubuntu/Debian recommended). macOS/Windows hosts are not supported for agentic. |
| Architecture | amd64 or arm64 |
| Docker | Docker Engine installed and running on the host, with the runner user able to run docker. |
--privileged | The host must allow privileged containers (required for the inner dockerd). |
Install Docker on the host:
sudo apt-get update && sudo apt-get install -y docker.io
sudo systemctl enable --now docker
sudo usermod -aG docker "$USER" # log out/in (or: newgrp docker)
docker run --rm --privileged alpine echo privileged-ok # must print privileged-okThat is all the host setup gh sr cannot do for you. There is no host dnsmasq, /etc/hosts, sudoers-for-iptables, RUNNER_TEMP, or /opt/hostedtoolcache configuration to perform — those concerns are handled inside the image.
3. Configuration#
hosts:
my-linux-host:
addr: user@192.168.1.10 # or "local"
runners:
- name: my-agentic
repo: owner/repo
host: my-linux-host
count: 3 # 3 concurrent agentic jobs, each fully isolated
profile: agentic # implies runner_mode: container (DinD)count: Ngives N isolated runner containers (gh-sr-my-agentic-1…-N) — that is your same-host concurrency. No port or label juggling is required.- Optional extra image packages: set a global
container_runner_image.extra_apt_packageslist (Debian package names) inrunners.yml; the image tag gains a suffix so Docker rebuilds. - Reduced-MTU networks (cloud overlay / VPN / nested virt) are handled automatically —
gh srdetects the host egress MTU and pins the container’s inner/outer Docker MTU to it. Override withcontainer_runner_image.mtuonly when the host NIC hides a smaller path MTU (see §4).
Workflow frontmatter only needs standard self-hosted targeting:
---
on: issues
runs-on: [self-hosted, linux, x64, agentic]
safe-outputs:
create-issue: {}
---4. What the runner image provides#
gh sr setup builds gh-sr/agentic-runner:<runner-version> locally (Ubuntu 24.04). It bundles:
Docker CE (the inner
dockerdfor DinD) with a baked daemon config that pins the inner default bridge gateway to10.200.0.1and points container DNS at a bundled dnsmasq, sohost.docker.internalresolves deterministically for inner containers. The innerdockerdstarts exactly once — there is no runtime DNS rewrite or daemon restart. (10.200.0.1is deliberately outside the host’s default Docker bridge subnet172.17.0.0/16so the inner bridge cannot collide with the outer runner container’s own network.)Why not the default
172.17.0.0/16? The outer runner container itself normally sits on the host’s default Docker bridge172.17.0.0/16(gateway172.17.0.1). If the inner bridge were pinned to a172.17.xaddress it would duplicate the container’s own default-gateway IP and capture that route, black-holing all outbound traffic into the inner bridge — the host network stays fine, but every connection made inside the runner (agent → model provider,git, package installs) crawls or times out.10.200.0.0/24avoids this because it lies outside Docker’s default address pools (172.16.0.0/12and192.168.0.0/16).As defence in depth,
entrypoint.shvalidates the baked gateway against the runner container’s actual interfaces just before the singledockerdstart. If a custom host happens to attach the runner container to a subnet that overlaps10.200.0.0/24, the entrypoint automatically falls back to the next free candidate (10.201.0.1,10.210.0.1, …) and rewritesdaemon.json+ the dnsmasq config once, beforedockerdstarts (still a single start, no restart). AWARNING: baked inner-bridge gateway … overlaps an existing interfaceline indocker logs <runner-container>indicates the fallback was used.Agent-sandbox DNS (why the runner’s own
/etc/resolv.confis repointed). After dnsmasq is up,entrypoint.shrewrites the runner container’s/etc/resolv.confto a singlenameserver= the inner-bridge gateway. This matters beyond the runner itself: AWF auto-detects the agent sandbox’s DNS servers by parsing this/etc/resolv.conf. Pinning it to dnsmasq — which answershost.docker.internalauthoritatively with the gateway IP that AWF’s sandbox firewall exempts from its transparent Squid redirect — makes the agent reach the MCP gateway directly and deterministically. Without the pin the sandbox inherits the outer host resolver and intermittently mapshost.docker.internalto a non-exempt IP, so the MCP gateway POST is force-proxied into Squid (which rejects the origin-form request withERR_INVALID_URL→ HTTP 400) and the agent reportsMCP server(s) failed to launch. dnsmasq still forwards every other name to the runner container’s original upstream resolvers (so enterprise/VPC DNS keeps working);no-resolvprevents dnsmasq from looping back through the repointedresolv.conf.Reduced-MTU pinning for hosts whose egress path MTU is below Docker’s 1500 default (cloud overlay networks — GCP defaults to 1460 — VPN/WireGuard, nested virtualisation). At
docker createtimegh srdetects the host’s primary egress-interface MTU and injects it asGH_SR_HOST_MTU;entrypoint.shthen pins both the innerdockerdbridge (daemon.jsonmtu) and the outer container’seth0to it (plus amangleMSS clamp for forwarded inner traffic), strictly before the singledockerdstart. This makes TCP advertise a matching MSS in both directions so large packets fit.Why this is needed. The outer runner container sits on the host’s default Docker bridge (MTU 1500) and the inner bridge also defaults to 1500. When the real host path MTU is smaller and PMTUD is black-holed (ICMP “fragmentation needed” filtered — very common), small packets pass (DNS, TCP
SYN/ACK) so connections open, but large packets are silently dropped. TLS handshakes (theServerHello+ certificate chain span several full-size segments) then stall and the socket is torn down mid-handshake, surfacing asClient network socket disconnected before secure TLS connection was established— exactly howactions/setup-gofails on such hosts while the host itself downloads fine (its real NIC never emits oversized frames). Auto-detection covers the common case where the host NIC MTU is already below 1500; if the host NIC reports 1500 but a deeper tunnel lowers the real path MTU, force it withcontainer_runner_image.mtuinrunners.ymlandgh sr rebuild.The actions runner, gh-aw CLI, and AWF (
github/gh-aw-firewall), plusgh, Node, Python (uv), Chromium/chromium-chromedriver, and common build/system libraries.The per-job reset hooks at
/opt/gh-sr/hooks/job-started.shand/opt/gh-sr/hooks/job-completed.sh.A minimal Docker CLI shim at
/opt/gh-sr/docker-shim/docker. When gh-aw launches the MCP gateway (docker run … ghcr.io/github/gh-aw-mcpg:*) it injects two deterministic flags, thenexecs the realdocker:--hostname gh-aw-mcpg(so the gateway’s self-inspect does not fail under--network hostin DinD) and, fordocker run, a--name gh-aw-mcpg-ghsr-*(so a leftover gateway is always reapable by the per-job reset hooks andgh sr doctor). It does not supervise the gateway or rewrite MCP config URLs — baked DNS and the job hooks make those unnecessary.
5. Operations#
gh sr setup # build the image (first run) and create runner containers
gh sr up # start runners; waits for each to be ready (inner dockerd up, registered)
gh sr status # show local + GitHub status, image, and BUILD freshness
gh sr logs my-agentic # recent logs for an instance
gh sr down # stop the runner containers
gh sr rebuild my-agentic # rebuild the image after a gh-sr upgrade and restart containers
gh sr remove my-agentic # deregister from GitHub and remove container + stateEach instance runs as a Docker container named gh-sr-<instance> with --restart unless-stopped, so it auto-starts on host boot and auto-restarts between jobs. gh sr up health-gates startup: it reports a runner as ready only once the container is running, the inner dockerd responds, and the actions runner is registered (a slow first boot is a warning, not a failure).
The BUILD column in gh sr status compares the image’s baked layout revision with the one your current gh sr expects: ok means current, stale means run gh sr rebuild, ? means the image predates revision labels.
6. Health checks (gh sr doctor)#
For each Linux host with container-mode runners, gh sr doctor checks:
- host Docker CLI/daemon and that a short
--privilegedtest container runs (required for DinD); - for each instance: the
gh-sr-<instance>container exists and is running, the innerdockerdresponds, and.runneris present (registered); - for agentic instances:
awfis available viasudoinside the container,host.docker.internalresolves to a non-loopback address from inner containers, the runner’s/etc/resolv.confis pinned to the bundled dnsmasq (so the agent sandbox inherits an AWF-exempt resolver — flagged ascontainer-inner-resolvotherwise), the AWF service-routing bypass is installed, no Docker interface MTU exceeds the host egress MTU (flagged ascontainer-mtuon a stale, pre-MTU-pinning image when the host path MTU is below 1500), and there are no orphangh-aw/awf-/gh-aw-mcpg-*containers or staleDOCKER-USERrules in the inner Docker.
gh sr doctor --host my-linux-host7. State layout#
Each runner container bind-mounts $HOME/.gh-sr/runners/<instance> at /runner-state:
| Path | Contents |
|---|---|
/runner-state/docker-data/ | Inner Docker image-layer cache — persistent across jobs/restarts. |
/runner-state/_work/ | Runner job workspace. |
/runner-state/_temp/ | RUNNER_TEMP (kept off /tmp so it never collides with /tmp/gh-aw). |
/runner-state/dockerd.log | Inner dockerd log. |
The gh-aw runtime tree /tmp/gh-aw lives inside the container rootfs and is wiped before/after every job by the reset hooks — it is per-job scratch, never the cache.
8. Troubleshooting#
| Symptom | Likely cause | Fix |
|---|---|---|
gh sr setup errors: runner_mode: container is only supported on Linux | agentic/container runner on a non-Linux host | Use a Linux host. |
Validation error: profile: agentic is no longer supported with runner_mode: native | old config pinned native mode | Remove runner_mode: native (agentic uses container automatically) or set runner_mode: container. |
Validation error: agentic_mcp_ports / agentic_mcp_port_base have been removed | old per-instance MCP port config | Delete those fields; container mode isolates the port per runner. |
docker --privileged fails on the host | privileged containers blocked (userns-remap, security policy) | Allow privileged containers, or use a Sysbox runtime (see below). |
Inner dockerd not responding / runner not registered | slow first boot or a broken container | gh sr logs <name>, then gh sr doctor; gh sr rebuild <name> if persistent. |
AWF agent → workflow services: (postgres/redis) Connection refused | AWF service-routing bypass missing in the container | gh sr rebuild <name> (the hooks/entrypoint install it), or apply live: docker exec gh-sr-<instance> iptables -t nat -I PREROUTING -s 172.30.0.0/24 -m addrtype --dst-type LOCAL -j RETURN. |
mcp_servers failed inside AWF | inner DNS or a stale environment | Confirm host.docker.internal resolves inside the container (docker exec gh-sr-<instance> docker run --rm alpine getent hosts host.docker.internal); recompile workflows with a current gh aw. |
MCP server(s) failed to launch intermittently (succeeds on a later run); agent log shows a Squid ERR_INVALID_URL page for host.docker.internal | runner /etc/resolv.conf not pinned to dnsmasq (stale pre-fix image), so the agent sandbox resolved host.docker.internal via the outer host to a non-exempt IP and got force-proxied into Squid | gh sr rebuild <name>, then verify docker exec gh-sr-<instance> cat /etc/resolv.conf shows a single nameserver equal to the inner docker0 gateway (e.g. 10.200.0.1). gh sr doctor reports this as container-inner-resolv. |
actions/setup-go / setup-node (or any download) fails with Client network socket disconnected before secure TLS connection was established, retries, then errors — but the host downloads the same URL fine | container MTU (1500) exceeds the host’s real egress path MTU (cloud overlay/VPN/nested virt) with PMTUD black-holed, so large TLS handshake packets are dropped | gh sr rebuild <name> to pick up MTU pinning (auto-detects the host egress MTU). If the host NIC reports 1500 but a tunnel lowers the real path MTU, set container_runner_image.mtu: <value> in runners.yml and rebuild. gh sr doctor reports a mismatch as container-mtu; verify with docker exec gh-sr-<instance> cat /sys/class/net/eth0/mtu. |
Inspect a running runner:
docker exec -it gh-sr-<instance> bash
docker info # inner dockerd
docker ps # agent/AWF containers for the current job
tail -f /runner-state/dockerd.logSecurity: --privileged and Sysbox#
Container-mode runners use --privileged because the inner dockerd needs full Linux capabilities. This suits trusted infrastructure. For privilege-free DinD, install Sysbox and run the runner container with --runtime sysbox-runc instead of --privileged (not auto-configured by gh sr).
9. Migrating from native agentic / port labels#
If you previously ran profile: agentic in native mode (with agentic_mcp_ports / agentic_mcp_port_base and gh-sr-mcp-<port> labels and sandbox.mcp.port in workflow frontmatter):
- In
runners.yml, deleterunner_mode: native,agentic_mcp_ports, andagentic_mcp_port_basefrom agentic runners. Keepprofile: agenticand setcountto the concurrency you want. - Remove the
gh-sr-mcp-<port>entries from your workflows’runs-onand any customsandbox.mcp.port; recompile withgh aw compile. - Run
gh sr setup(builds the image), thengh sr up. - Optional: drop the host-level dnsmasq/sudoers/
/opt/hostedtoolcachetweaks the old native path created — they are no longer used.