
Flaky awf-api-proxy unhealthy failure in downstream Copilot workflow #28898

@theletterf

Summary

A downstream elastic/docs-content agentic workflow failed before the agent could run because the awf-api-proxy sidecar became unhealthy during docker compose up. The user reports seeing this flaky failure twice recently; this run is one concrete example.

Run: https://github.com/elastic/docs-content/actions/runs/25046213576
Failed job: Docs AI / triage / agent
Failed step: Execute GitHub Copilot CLI
Workflow: Docs AI menu, event issue_comment, branch main.

This looks like a recurrence of the failure pattern from #27888, but that issue is closed and this example occurred later on gh-aw v0.71.1 with firewall 0.25.28.

Observed failure

The agent never starts and no model/tool work is performed. The uploaded agent_output.json contains an empty items array.

The decisive lines are in c5bf1190-agent/agent-stdio.log:

```
[INFO] Starting containers...
 Container awf-squid  Starting
 Container awf-api-proxy  Starting
 Container awf-api-proxy  Started
 Container awf-squid  Started
 Container awf-squid  Waiting
 Container awf-api-proxy  Waiting
 Container awf-squid  Healthy
 Container awf-api-proxy  Error
dependency failed to start: container awf-api-proxy is unhealthy
[ERROR] Failed to start containers: Error: Command failed with exit code 1: docker compose up -d --pull never
```
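The `Waiting / Healthy / Error` lines indicate health-gated startup ordering: compose waits for each sidecar's health check to pass before bringing up the dependent container, and aborts the whole `up` when any gate fails. A minimal sketch of what that wiring typically looks like (service names here are assumptions inferred from the container names in the log; the actual generated compose file may differ):

```yaml
# Hypothetical sketch of the dependency wiring implied by the log above.
services:
  agent:
    depends_on:
      squid:
        condition: service_healthy
      api-proxy:
        condition: service_healthy   # the gate that failed in this run
```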

The generated compose file defines the API proxy health check as:

```yaml
api-proxy:
  healthcheck:
    test:
      - CMD
      - curl
      - '-f'
      - http://localhost:10000/health
    interval: 1s
    timeout: 1s
    retries: 5
    start_period: 2s
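With a 2s `start_period` and 5 retries at 1s intervals (each with a 1s timeout), the proxy has only on the order of seven seconds to come up before compose marks it unhealthy. One way to loosen this, as a sketch only and not a statement of gh-aw's defaults, would be:

```yaml
# Hypothetical relaxed health check; values are illustrative only.
api-proxy:
  healthcheck:
    test: ["CMD", "curl", "-f", "http://localhost:10000/health"]
    interval: 2s
    timeout: 2s
    retries: 10
    start_period: 10s   # failures during start_period don't count toward retries
```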

Environment from the failing artifact

  • GH_AW_VERSION: v0.71.1.
  • COPILOT_MODEL: claude-sonnet-4.6.
  • Firewall images:
    • ghcr.io/github/gh-aw-firewall/agent:0.25.28@sha256:a8834e285807654bf680154faa710d43fe4365a0868142f5c20e48c85e137a7a.
    • ghcr.io/github/gh-aw-firewall/api-proxy:0.25.28@sha256:93290f2393752252911bd7c39a047f776c0b53063575e7bde4e304962a9a61cb.
    • ghcr.io/github/gh-aw-firewall/squid:0.25.28@sha256:844c18280f82cd1b06345eb2f4e91966b34185bfc51c9f237c3e022e848fb474.
  • Runner image: ubuntu24, ImageVersion: 20260426.100.1.
  • Triggering repo: elastic/docs-content.

Diagnostic gap

The artifact says API proxy logs are available at /tmp/gh-aw/sandbox/firewall/logs/api-proxy-logs, and the compose file mounts that path into /var/log/api-proxy, but the downloaded artifact only includes these firewall log files:

```
sandbox/firewall/logs/access.log
sandbox/firewall/logs/audit.jsonl
sandbox/firewall/logs/cache.log
```

There are no api-proxy-logs files in the artifact, so it is not possible to tell whether the proxy failed to bind, crashed, timed out during initialization, or failed its /health route for another reason.

Why this seems flaky

The failure is intermittent: the user reports hitting it only twice recently, and the same pinned firewall images succeed in other runs of this workflow. The health check above also gives the proxy a tight startup budget (a 2s start_period plus 5 retries at 1s intervals, each with a 1s timeout, so on the order of seven seconds), which a slow cold start on a loaded runner could plausibly exceed.

Requested follow-up

Could the gh-aw maintainers either reopen/continue the investigation from #27888, or track this as a new recurrence on v0.71.1?

Useful fixes would be:

  1. Capture awf-api-proxy container logs when docker compose up fails before removing containers.
  2. Include the mounted api-proxy-logs directory in uploaded artifacts even if it is empty, so absence is explicit.
  3. Consider increasing the API proxy health check grace period/retries, or retrying docker compose up when only the API proxy health check flakes.
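For item 1, the capture could be as simple as wrapping the compose invocation so that the sidecar's logs and health-check state are dumped before any cleanup removes the containers. A minimal sketch, assuming the service/container names from this run (`api-proxy` / `awf-api-proxy`) and a hypothetical log directory; this is not gh-aw's actual implementation:

```shell
# Hypothetical sketch: preserve api-proxy diagnostics when compose up fails.
capture_proxy_logs_on_failure() {
  local log_dir="${1:-/tmp/gh-aw/sandbox/firewall/logs}"
  if docker compose up -d --pull never; then
    return 0
  fi
  mkdir -p "$log_dir"
  # Dump whatever the unhealthy container printed, plus its health-check state.
  docker compose logs --no-color api-proxy > "$log_dir/api-proxy-compose.log" 2>&1 || true
  docker inspect --format '{{json .State.Health}}' awf-api-proxy \
    > "$log_dir/api-proxy-health.json" 2>&1 || true
  return 1
}
```

Writing both files into the already-mounted firewall log directory would also make them appear in the uploaded artifact, addressing item 2 at the same time.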
