Hermes Agent Crash Analysis: an asyncio Cleanup Error Triggers a Gateway Restart Loop

Checo · 5/20/26 · About 3 min


After a Hermes gateway had been online for about 48 hours, the CLI process threw an exception during its shutdown cleanup, which cascaded into the gateway being repeatedly restarted by SIGTERM, forming a vicious "restart interrupts session -> user reconnects -> interrupted again" loop that left the CLI down for about 12 minutes. This post breaks down the full crash chain and root cause.

Background

Item	Info
Version	Hermes Agent v0.14.0 (2026.5.16)
Environment	macOS, Apple M4 Mac mini, Python 3.11.15
Incident window	2026-05-20 22:50 ~ 23:02 (Beijing time)

Crash timeline

Time	Event
05-18 14:25	Gateway started (PID 70093), then ran stably for ~48.5 hours
05-20 22:50:36	User started a CLI session, triggering title-generation and other auxiliary tasks
05-20 22:51:59	🔴 CLI process crashed — KeyboardInterrupt during `asyncio.run()` shutdown
05-20 22:52:00	Cascade: 16 async generators failed to close + 15 pending Tasks force-destroyed
05-20 22:52:11	User's first reconnect attempt (session `225211`), typed "hi"
05-20 22:53:40	API call began
05-20 22:55:01	🔴 Gateway received SIGTERM, shut down (systemd `Restart=on-failure` restarted it)
05-20 22:55:02	Gateway restarted (new PID 37043)
05-20 22:55:13	Session `225211`'s API stream interrupted (`stream_interrupt_abort`)
05-20 22:55:38	Retry interrupted again (SIGHUP received), session failed for good
05-20 22:56:42	User's second reconnect attempt (session `225642`), typed "hi"
05-20 22:58:54	Session `225642` interrupted (`interrupted_during_api_call`), SIGTERM received
05-20 23:01:46	🔴 Gateway received SIGTERM again, shut down
05-20 23:01:57	Gateway restarted (current instance), stable since
05-20 23:02:44	User's third attempt (current session `230244`), recovered successfully ✅

Root-cause analysis

Primary cause: CLI asyncio cleanup error

File "cli.py", line 10098, in _signal_handler
    from tools.voice_mode import play_audio_file
KeyboardInterrupt

RuntimeError: aclose(): asynchronous generator is already running

During shutdown, _signal_handler was in the middle of importing the voice_mode module when the KeyboardInterrupt was raised (the traceback points at that import line). Because 16 prompt_toolkit in_terminal async generators were still running at that moment, asyncio's cleanup couldn't close them, leading to:

16 async_generator raising RuntimeError: aclose(): asynchronous generator is already running
15 prompt_toolkit render Tasks force-destroyed (Task was destroyed but it is pending)

This shows a conflict between a module import inside a signal handler and the running async event loop.
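The aclose() failure can be reproduced in isolation. The sketch below is not Hermes or prompt_toolkit code: ticker and reproduce are hypothetical names, and it only assumes that an async generator is still suspended inside a pending __anext__() step when something tries to close it. That is one way (not necessarily the only one) to get this error on Python 3.8 and later.

```python
import asyncio
import contextlib


async def ticker():
    # Hypothetical stand-in for a long-lived async generator.
    while True:
        await asyncio.sleep(3600)
        yield


async def _close_while_running() -> str:
    gen = ticker()
    # Start one iteration step and leave it suspended inside the generator.
    step = asyncio.ensure_future(gen.__anext__())
    await asyncio.sleep(0)  # give the step a chance to start

    message = ""
    try:
        await gen.aclose()  # the generator is mid-step, so closing is refused
    except RuntimeError as exc:
        message = str(exc)

    # Tidy up: cancel the pending step so nothing is left dangling.
    step.cancel()
    with contextlib.suppress(asyncio.CancelledError):
        await step
    return message


def reproduce() -> str:
    return asyncio.run(_close_while_running())


if __name__ == "__main__":
    print(reproduce())
```

The point of the sketch is that the generator itself is healthy; the error comes purely from the timing of the close request relative to a step that has not finished.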

Gateway cascade

After the CLI crash, the gateway went through two SIGTERM restart cycles (22:55:01 and 23:01:46):

Each restart killed the CLI session waiting for an API response
The killed session triggered stream_interrupt_abort
systemd's Restart=on-failure policy auto-restarted the gateway
The user's reconnect got interrupted by the next restart -> vicious loop

Side issue: Telegram instability

A flood of Telegram polling conflict errors (concentrated over ~2 hours)
httpx.ConnectTimeout — connecting to the Telegram API timed out
Likely cause (not yet confirmed): an unstable proxy port, or multiple-instance conflicts

Impact

Impact	Detail
CLI downtime	~12 minutes (22:50 ~ 23:02)
Failed sessions	2 (`225211`, `225642`)
Gateway restarts	2
Uptime before crash	~48.5 hours
Telegram messages	could not send/receive during the outage

Suggested fixes

1. Signal-handler safety (high priority)

cli.py's _signal_handler performs a module import inside a signal handler, which is unsafe in an async context. Recommendations:

Move from tools.voice_mode import play_audio_file out of the handler: pre-import it at startup, or load it lazily on a normal code path outside the handler
In a signal handler, only set a flag/marker; do no actual I/O or imports
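A minimal sketch of the flag-only pattern follows. It assumes the CLI runs on an asyncio event loop on a Unix-like system in the main thread and can use loop.add_signal_handler; run_until_signal, demo, and the self_test switch are hypothetical names for illustration, not part of cli.py.

```python
import asyncio
import os
import signal

# Anything the shutdown path needs is imported here, at startup, never
# inside the handler. (In cli.py that would cover the tools.voice_mode
# import; it is left out so this sketch stays self-contained.)


async def run_until_signal(sig: signal.Signals, self_test: bool = False) -> str:
    loop = asyncio.get_running_loop()
    stop_requested = asyncio.Event()

    # The handler only flips a flag: no imports, no I/O.
    loop.add_signal_handler(sig, stop_requested.set)
    try:
        if self_test:
            # Demo only: deliver the signal to this process shortly after start.
            loop.call_later(0.05, os.kill, os.getpid(), sig)
        await stop_requested.wait()
        # Real cleanup (closing sessions, playing a sound, ...) would run
        # here, in ordinary coroutine context, after the flag is observed.
        return "clean shutdown"
    finally:
        loop.remove_signal_handler(sig)


def demo() -> str:
    # SIGUSR1 keeps the demo away from SIGINT/SIGTERM; a real CLI would
    # register the signals it actually needs to handle.
    return asyncio.run(run_until_signal(signal.SIGUSR1, self_test=True))
```

Because the callback runs as a normal event-loop callback rather than interrupting arbitrary bytecode, cleanup starts from a known point instead of from the middle of whatever the process was doing.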

2. Graceful gateway shutdown (medium priority)

On SIGTERM the gateway should wait for active API requests to finish (or apply a sensible timeout) instead of immediately interrupting in-flight streaming responses.
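One way to express "wait, but only up to a timeout" is sketched below. It assumes the gateway tracks its in-flight requests as asyncio tasks; drain and grace_seconds are hypothetical names, and the sleeping tasks merely stand in for streaming API calls.

```python
import asyncio


async def drain(tasks, grace_seconds: float):
    """Wait up to grace_seconds for in-flight tasks, then cancel the rest.

    Returns (finished_count, cancelled_count).
    """
    if not tasks:
        return 0, 0
    done, pending = await asyncio.wait(tasks, timeout=grace_seconds)
    for task in pending:
        task.cancel()
    if pending:
        # Let the cancellations settle before the process exits.
        await asyncio.gather(*pending, return_exceptions=True)
    return len(done), len(pending)


async def _demo():
    fast = asyncio.create_task(asyncio.sleep(0.01))  # finishes within the grace period
    slow = asyncio.create_task(asyncio.sleep(60))    # exceeds it and gets cancelled
    return await drain({fast, slow}, grace_seconds=0.2)


def demo():
    return asyncio.run(_demo())
```

The grace period is a trade-off: too short and streams are still cut off, too long and the supervisor may escalate to a hard kill before the drain completes.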

3. systemd restart policy (low priority)

Consider adding a restart delay to Restart=on-failure (e.g. RestartSec=3) to avoid the vicious loop caused by rapid restarts.

4. Telegram connection stability (low priority)

Confirm the proxy port's stability
Check whether multiple Hermes instances share one Telegram Bot Token

Summary

This crash is a textbook case of a single-point exception being amplified. A module import in the CLI signal handler triggered an asyncio cleanup failure that should have stayed process-local. Combined with the gateway's Restart=on-failure auto-restart, however, each restart interrupted an in-flight API session, forming a restart-interrupt-reconnect-reinterrupt loop. The root cause is a signal handler doing an import; the amplifier is the lack of graceful shutdown and restart throttling.

For this class of problem, first lay out the full timeline (each log line's timestamp + PID + signal), and the causal chain becomes visible: which event triggered which, which steps are cascading and which are independent. Here the CLI cleanup error and the gateway SIGTERM loop are two independent mechanisms overlapping in time; neither is fatal alone, but stacked they caused 12 minutes of downtime.
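Laying out the timeline can be as simple as merging per-process event lists by timestamp. The sketch below uses a handful of entries copied from the timeline table above; the Event type and the ev and seconds_between helpers are hypothetical, and real logs would first need to be parsed into this shape.

```python
import heapq
from datetime import datetime
from typing import NamedTuple


class Event(NamedTuple):
    when: datetime
    source: str
    what: str


def ev(stamp: str, source: str, what: str) -> Event:
    # Timestamps come from the timeline table; the year is the incident year.
    when = datetime.strptime(f"2026-{stamp}", "%Y-%m-%d %H:%M:%S")
    return Event(when, source, what)


# Each per-process list is already in chronological order.
cli_events = [
    ev("05-20 22:51:59", "cli", "KeyboardInterrupt during asyncio.run() shutdown"),
    ev("05-20 22:52:11", "cli", "session 225211 started"),
    ev("05-20 22:56:42", "cli", "session 225642 started"),
]
gateway_events = [
    ev("05-20 22:55:01", "gateway", "SIGTERM received"),
    ev("05-20 22:55:02", "gateway", "restarted as PID 37043"),
    ev("05-20 23:01:46", "gateway", "SIGTERM received"),
]

# Merge the sorted streams into one timeline.
timeline = list(heapq.merge(cli_events, gateway_events, key=lambda e: e.when))


def seconds_between(first: Event, second: Event) -> int:
    return int((second.when - first.when).total_seconds())


if __name__ == "__main__":
    for event in timeline:
        print(f"{event.when:%H:%M:%S}  {event.source:<8} {event.what}")
```

With the events interleaved, gaps such as the roughly three minutes between the CLI crash and the first gateway SIGTERM become easy to read off, which helps separate cascading steps from independent ones.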