Troubleshooting Akuity agent websocket errors & missing argocd-tls-certs-cm
Two separate but related symptoms were observed when connecting clusters via Terraform:
- akuity-agent logs show:
    client: Connection error: websocket: bad handshake
    client: Give up
- argocd-notifications-controller pods fail to start with:
    MountVolume.SetUp failed for volume "tls-certs" : configmap "argocd-tls-certs-cm" not found
Both symptoms point to the agent not being able to establish a successful connection to the Akuity control plane. The control plane provides cluster-specific config (including the TLS cert ConfigMap), so until the agent is connected those in-cluster resources will be absent and other pods may also fail to initialize.
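To confirm that both symptoms are present before changing anything, the quoted messages can be searched for in collected logs and pod events. The following is a minimal sketch: the function name and the symptom labels are illustrative, and it assumes the log and event text has already been gathered into a string (for example from the agent pod logs and the failing pod's events).

```python
# Substrings copied from the agent log lines and the pod event quoted above.
HANDSHAKE_ERROR = "websocket: bad handshake"
GIVE_UP = "client: Give up"
MISSING_CONFIGMAP = 'configmap "argocd-tls-certs-cm" not found'


def classify_symptoms(text: str) -> set:
    """Return illustrative labels for the symptoms found in log/event text."""
    found = set()
    if HANDSHAKE_ERROR in text:
        found.add("agent-handshake-failed")
    if GIVE_UP in text:
        found.add("agent-gave-up")
    if MISSING_CONFIGMAP in text:
        found.add("tls-certs-configmap-missing")
    return found
```

If the handshake label and the ConfigMap label both appear, the missing ConfigMap is most likely a consequence of the failed agent connection rather than a separate problem.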
What’s happening
- The agent uses a persistent connection (websocket/tunnel) to the control plane to fetch its configuration and resources.
- If network access is blocked or the control plane rejects the connection, the websocket handshake fails (bad handshake) and the agent gives up.
- While disconnected, the agent cannot receive the configuration through which the control plane creates resources such as argocd-tls-certs-cm. As a result, controllers that mount that ConfigMap fail to start.
A common cause on Akuity is an IP allow list that does not include the public IPs the cluster agents use to reach the control plane.
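Whether an allow list covers a cluster's egress addresses can be checked offline before touching the instance configuration. The sketch below uses only Python's standard `ipaddress` module. The function name is illustrative, and it assumes the allow-list entries are plain IPs or CIDR ranges that have been copied out by hand; it does not query Akuity.

```python
import ipaddress


def uncovered_egress_ips(egress_ips, allow_list):
    """Return the egress IPs that no allow-list entry (IP or CIDR) covers."""
    # strict=False tolerates entries such as "203.0.113.5/24" with host bits set.
    networks = [ipaddress.ip_network(entry, strict=False) for entry in allow_list]
    missing = []
    for ip in egress_ips:
        address = ipaddress.ip_address(ip)
        if not any(address in network for network in networks):
            missing.append(ip)
    return missing


# Documentation-range addresses used purely as placeholders.
print(uncovered_egress_ips(["203.0.113.10", "198.51.100.7"], ["203.0.113.0/24"]))
```

Any address this returns is one the control plane would reject under the assumptions above, so it is a candidate to add to the allow list.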
Recommended step-by-step recovery & fixes
- Allow the agent IPs instead of disabling the allow list completely:
  - Add the public NAT IPs of each cluster (the IPs the agents use to talk to the control plane) to the instance IP allow list so the control plane accepts websocket connections from the agents.
- If clusters are behind NATs whose IPs change, use a stable egress IP (for example a static NAT IP allocation) or a VPN so the control plane can reliably allow those addresses.
- Confirm TLS & DNS:
  - Ensure the control-plane hostname resolves and its TLS certificates validate from the cluster network.
- Pre-create the ConfigMap (optional):
  - If pods must start before the agent connects, a pre-created argocd-tls-certs-cm can be applied via declarative specs or Terraform, but normally the control plane handles this once the agent connects.
- Resource sizing:
  - If agents repeatedly fail or restart, ensure in-cluster resources (CPU/memory) are sufficient; frequent OOMs can cause flaky connections or corrupted state.
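The TLS and DNS check from the steps above can be sketched with Python's standard library and run from a pod or node inside the cluster network. This is a rough probe under stated assumptions: the function name is illustrative, the control-plane hostname is not given in this article and must be supplied by the reader, and port 443 is assumed. A successful result only shows that DNS and certificate validation work. It does not exercise the websocket upgrade, so an allow-list rejection can still occur afterwards.

```python
import socket
import ssl


def check_dns_and_tls(hostname: str, port: int = 443, timeout: float = 5.0) -> dict:
    """Resolve hostname, then attempt a certificate-validating TLS handshake."""
    result = {"resolved": [], "tls_ok": False, "not_after": None, "error": None}
    try:
        infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
        result["resolved"] = sorted({info[4][0] for info in infos})
        # The default context verifies the certificate chain and the hostname.
        context = ssl.create_default_context()
        with socket.create_connection((hostname, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=hostname) as tls:
                result["tls_ok"] = True
                result["not_after"] = tls.getpeercert().get("notAfter")
    except OSError as exc:  # also covers socket.gaierror and ssl.SSLError
        result["error"] = f"{type(exc).__name__}: {exc}"
    return result


# Usage (hostname is a placeholder, replace with the instance's control-plane host):
# print(check_dns_and_tls("<control-plane-hostname>"))
```

An empty `resolved` list points at DNS, while a populated list with `tls_ok` set to False points at connectivity or certificate validation.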
Likely resolution in this specific case
- Toggling the allow list for cluster agents resolved the issue previously, which confirms that the IP allow list was blocking the websocket handshake.
- After allowing the agents (or adding their public IPs to the allow list), the agent will successfully connect, the control plane will provide the missing ConfigMap, and the argocd-notifications-controller and other dependent pods will start normally.
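To verify the recovery, one option is to poll for the ConfigMap after updating the allow list. The sketch below shells out to `kubectl get configmap`, so it assumes `kubectl` is installed and pointed at the affected cluster. The function names, retry counts, and delay are illustrative, and the namespace is an assumption the reader must supply, since this article does not name it.

```python
import subprocess
import time


def configmap_exists(namespace, name="argocd-tls-certs-cm", run=subprocess.run):
    """Return True if `kubectl get configmap <name> -n <namespace>` succeeds."""
    proc = run(
        ["kubectl", "get", "configmap", name, "-n", namespace],
        capture_output=True,
        text=True,
    )
    return proc.returncode == 0


def wait_for_configmap(namespace, attempts=10, delay=6.0,
                       exists=configmap_exists, sleep=time.sleep):
    """Poll until the ConfigMap appears or the attempts are exhausted."""
    for attempt in range(attempts):
        if exists(namespace):
            return True
        if attempt < attempts - 1:
            sleep(delay)  # pause between attempts, but not after the last one
    return False
```

Once `wait_for_configmap` returns True, the pods that were stuck on the `tls-certs` volume should be able to mount it on their next retry.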