
Flaky awf-api-proxy unhealthy failure in downstream Copilot workflow #28898

@theletterf

Summary

A downstream elastic/docs-content agentic workflow failed before the agent could run because the awf-api-proxy sidecar became unhealthy during docker compose up. The user reports seeing this flaky failure twice recently; this run is one concrete example.

Run: https://github.com/elastic/docs-content/actions/runs/25046213576
Failed job: Docs AI / triage / agent
Failed step: Execute GitHub Copilot CLI
Workflow: Docs AI menu, event issue_comment, branch main.

This looks like a recurrence of the failure pattern from #27888, but that issue is closed and this example occurred later on gh-aw v0.71.1 with firewall 0.25.28.

Observed failure

The agent never starts and no model/tool work is performed. The uploaded agent_output.json contains an empty items array.

The decisive lines are in c5bf1190-agent/agent-stdio.log:

```
[INFO] Starting containers...
 Container awf-squid  Starting
 Container awf-api-proxy  Starting
 Container awf-api-proxy  Started
 Container awf-squid  Started
 Container awf-squid  Waiting
 Container awf-api-proxy  Waiting
 Container awf-squid  Healthy
 Container awf-api-proxy  Error
dependency failed to start: container awf-api-proxy is unhealthy
[ERROR] Failed to start containers: Error: Command failed with exit code 1: docker compose up -d --pull never
```
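The `Waiting / Healthy / Error` lines indicate health-gated startup ordering: compose waits for each sidecar's health check to pass before bringing up the dependent container, and aborts the whole `up` when any gate fails. A minimal sketch of what that wiring typically looks like (service names here are assumptions inferred from the container names in the log; the actual generated compose file may differ):

```yaml
# Hypothetical sketch of the dependency wiring implied by the log above.
services:
  agent:
    depends_on:
      squid:
        condition: service_healthy
      api-proxy:
        condition: service_healthy   # the gate that failed in this run
```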

The generated compose file defines the API proxy health check as:

```yaml
api-proxy:
  healthcheck:
    test:
      - CMD
      - curl
      - '-f'
      - http://localhost:10000/health
    interval: 1s
    timeout: 1s
    retries: 5
    start_period: 2s
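With a 2s `start_period` and 5 retries at 1s intervals (each with a 1s timeout), the proxy has only on the order of seven seconds to come up before compose marks it unhealthy. One way to loosen this, as a sketch only and not a statement of gh-aw's defaults, would be:

```yaml
# Hypothetical relaxed health check; values are illustrative only.
api-proxy:
  healthcheck:
    test: ["CMD", "curl", "-f", "http://localhost:10000/health"]
    interval: 2s
    timeout: 2s
    retries: 10
    start_period: 10s   # failures during start_period don't count toward retries
```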

Environment from the failing artifact

  • GH_AW_VERSION: v0.71.1.
  • COPILOT_MODEL: claude-sonnet-4.6.
  • Firewall images:
    • ghcr.io/github/gh-aw-firewall/agent:0.25.28@sha256:a8834e285807654bf680154faa710d43fe4365a0868142f5c20e48c85e137a7a.
    • ghcr.io/github/gh-aw-firewall/api-proxy:0.25.28@sha256:93290f2393752252911bd7c39a047f776c0b53063575e7bde4e304962a9a61cb.
    • ghcr.io/github/gh-aw-firewall/squid:0.25.28@sha256:844c18280f82cd1b06345eb2f4e91966b34185bfc51c9f237c3e022e848fb474.
  • Runner image: ubuntu24, ImageVersion: 20260426.100.1.
  • Triggering repo: elastic/docs-content.

Diagnostic gap

The artifact says API proxy logs are available at /tmp/gh-aw/sandbox/firewall/logs/api-proxy-logs, and the compose file mounts that path into /var/log/api-proxy, but the downloaded artifact only includes these firewall log files:

```
sandbox/firewall/logs/access.log
sandbox/firewall/logs/audit.jsonl
sandbox/firewall/logs/cache.log
```

There are no api-proxy-logs files in the artifact, so it is not possible to tell whether the proxy failed to bind, crashed, timed out during initialization, or failed its /health route for another reason.

Why this seems flaky

The failure is intermittent: the user reports hitting it only twice recently, and the same pinned firewall images succeed in other runs of this workflow. The health check above also gives the proxy a tight startup budget (a 2s start_period plus 5 retries at 1s intervals, each with a 1s timeout, so on the order of seven seconds), which a slow cold start on a loaded runner could plausibly exceed.

Requested follow-up

Could the gh-aw maintainers either reopen/continue the investigation from #27888, or track this as a new recurrence on v0.71.1?

Useful fixes would be:

  1. Capture awf-api-proxy container logs when docker compose up fails before removing containers.
  2. Include the mounted api-proxy-logs directory in uploaded artifacts even if it is empty, so absence is explicit.
  3. Consider increasing the API proxy health check grace period/retries, or retrying docker compose up when only the API proxy health check flakes.
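For item 1, the capture could be as simple as wrapping the compose invocation so that the sidecar's logs and health-check state are dumped before any cleanup removes the containers. A minimal sketch, assuming the service/container names from this run (`api-proxy` / `awf-api-proxy`) and a hypothetical log directory; this is not gh-aw's actual implementation:

```shell
# Hypothetical sketch: preserve api-proxy diagnostics when compose up fails.
capture_proxy_logs_on_failure() {
  local log_dir="${1:-/tmp/gh-aw/sandbox/firewall/logs}"
  if docker compose up -d --pull never; then
    return 0
  fi
  mkdir -p "$log_dir"
  # Dump whatever the unhealthy container printed, plus its health-check state.
  docker compose logs --no-color api-proxy > "$log_dir/api-proxy-compose.log" 2>&1 || true
  docker inspect --format '{{json .State.Health}}' awf-api-proxy \
    > "$log_dir/api-proxy-health.json" 2>&1 || true
  return 1
}
```

Writing both files into the already-mounted firewall log directory would also make them appear in the uploaded artifact, addressing item 2 at the same time.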
