Hermes Agent Crash Analysis: an asyncio Cleanup Error Triggers a Gateway Restart Loop
After a Hermes gateway had been online for about 48 hours, the CLI process threw an exception during its shutdown cleanup. This cascaded into the gateway being restarted repeatedly by SIGTERM, producing a "restart interrupts session -> user reconnects -> interrupted again" loop that left the CLI unusable for about 12 minutes. This post breaks down the full crash chain and root cause.
Background
| Item | Info |
|---|---|
| Version | Hermes Agent v0.14.0 (2026.5.16) |
| Environment | macOS, Apple M4 Mac mini, Python 3.11.15 |
| Incident window | 2026-05-20 22:50 ~ 23:02 (Beijing time) |
Crash timeline
| Time | Event |
|---|---|
| 05-18 14:25 | Gateway started (PID 70093), then ran stably for ~48.5 hours |
| 05-20 22:50:36 | User started a CLI session, triggering title-generation and other auxiliary tasks |
| 05-20 22:51:59 | 🔴 CLI process crashed — KeyboardInterrupt during asyncio.run() shutdown |
| 05-20 22:52:00 | Cascade: 16 async generators failed to close + 15 pending Tasks force-destroyed |
| 05-20 22:52:11 | User's first reconnect attempt (session 225211), typed "hi" |
| 05-20 22:53:40 | API call began |
| 05-20 22:55:01 | 🔴 Gateway received SIGTERM, shut down (systemd Restart=on-failure restarted it) |
| 05-20 22:55:02 | Gateway restarted (new PID 37043) |
| 05-20 22:55:13 | Session 225211's API stream interrupted (stream_interrupt_abort) |
| 05-20 22:55:38 | Retry interrupted again (SIGHUP received), session failed for good |
| 05-20 22:56:42 | User's second reconnect attempt (session 225642), typed "hi" |
| 05-20 22:58:54 | Session 225642 interrupted (interrupted_during_api_call), SIGTERM received |
| 05-20 23:01:46 | 🔴 Gateway received SIGTERM again, shut down |
| 05-20 23:01:57 | Gateway restarted (current instance), stable since |
| 05-20 23:02:44 | User's third attempt (current session 230244), recovered successfully ✅ |
Root-cause analysis
Primary cause: CLI asyncio cleanup error
File "cli.py", line 10098, in _signal_handler
from tools.voice_mode import play_audio_file
KeyboardInterrupt
RuntimeError: aclose(): asynchronous generator is already running
During shutdown, _signal_handler tried to import the voice_mode module, which triggered a KeyboardInterrupt. Because 16 prompt_toolkit in_terminal async generators were running at the time, asyncio's cleanup couldn't close them, leading to:
- 16 async_generator objects raising RuntimeError: aclose(): asynchronous generator is already running
- 15 prompt_toolkit render Tasks force-destroyed (Task was destroyed but it is pending)
This shows a conflict between a module import inside a signal handler and the running async event loop.
Gateway cascade
After the CLI crash, the gateway went through two SIGTERM restart loops:
- Each restart killed the CLI session waiting for an API response
- The killed session triggered stream_interrupt_abort
- systemd's Restart=on-failure policy auto-restarted the gateway
- The user's reconnect got interrupted by the next restart -> vicious loop
Side issue: Telegram instability
- A flood of Telegram polling conflict errors (concentrated over ~2 hours)
- httpx.ConnectTimeout: connecting to the Telegram API timed out
- Cause: an unstable proxy port, or multiple-instance conflicts
Impact
| Impact | Detail |
|---|---|
| CLI downtime | ~12 minutes (22:50 ~ 23:02) |
| Failed sessions | 2 (225211, 225642) |
| Gateway restarts | 2 |
| Uptime before crash | ~48.5 hours |
| Telegram messages | could not send/receive during the outage |
Suggested fixes
1. Signal-handler safety (high priority)
cli.py's _signal_handler performs a module import inside a signal handler, which is unsafe in an async context. Recommendations:
- Move from tools.voice_mode import play_audio_file out of the signal handler: pre-import it at startup, or load it lazily from ordinary (non-handler) code
- In a signal handler, only set a flag/marker; do no actual I/O or imports
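The two recommendations above can be sketched roughly as follows. This is a minimal illustration, not the actual cli.py code: play_audio_file here is a hypothetical stand-in for the real tools.voice_mode function, and install_handlers, cli_session and main are names invented for the example. It assumes a Unix platform, since loop.add_signal_handler is not available on Windows.

```python
import asyncio
import signal


def play_audio_file(path: str) -> str:
    """Hypothetical stand-in for tools.voice_mode.play_audio_file.

    In the real CLI this would be a module-level import done once at startup,
    so the import machinery never runs inside a signal handler.
    """
    return f"played {path}"


def install_handlers(loop: asyncio.AbstractEventLoop, stop: asyncio.Event) -> None:
    # The handler body is just stop.set(): no imports, no I/O.
    # loop.add_signal_handler is Unix-only and must be called from the main thread.
    for sig in (signal.SIGINT, signal.SIGTERM):
        loop.add_signal_handler(sig, stop.set)


async def cli_session(stop: asyncio.Event) -> str:
    # Stand-in for the interactive loop: block until shutdown is requested.
    await stop.wait()
    # Cleanup runs here, in ordinary coroutine context, after the flag is seen.
    return play_audio_file("shutdown.wav")


async def main() -> str:
    stop = asyncio.Event()
    install_handlers(asyncio.get_running_loop(), stop)
    return await cli_session(stop)
```

Running asyncio.run(main()) and pressing Ctrl-C should end the session through the normal coroutine path rather than through a KeyboardInterrupt raised at an arbitrary point. Callbacks registered with loop.add_signal_handler are invoked by the event loop like any other callback, which is why setting an asyncio.Event from them is safe.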
2. Graceful gateway shutdown (medium priority)
On SIGTERM the gateway should wait for active API requests to finish (or apply a sensible timeout) instead of immediately interrupting in-flight streaming responses.
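One way to picture "wait for active requests, with a timeout" is a small tracker that the gateway drains on SIGTERM. The sketch below is an assumption about how this could look, not Hermes code: InFlightTracker, track, drain and the grace period values are all hypothetical names and numbers chosen for illustration.

```python
import asyncio
from typing import Any, Coroutine


class InFlightTracker:
    """Keeps references to in-flight request tasks so shutdown can drain them."""

    def __init__(self) -> None:
        self._tasks: set[asyncio.Task] = set()

    def track(self, coro: Coroutine[Any, Any, Any]) -> asyncio.Task:
        task = asyncio.create_task(coro)
        self._tasks.add(task)
        # Drop the reference automatically once the request finishes.
        task.add_done_callback(self._tasks.discard)
        return task

    async def drain(self, grace_seconds: float) -> tuple[int, int]:
        """Wait up to grace_seconds, then cancel stragglers.

        Returns (finished, cancelled) counts.
        """
        if not self._tasks:
            return (0, 0)
        done, pending = await asyncio.wait(set(self._tasks), timeout=grace_seconds)
        for task in pending:
            task.cancel()
        # Let the cancellations propagate before the process exits.
        await asyncio.gather(*pending, return_exceptions=True)
        return (len(done), len(pending))


async def demo() -> tuple[int, int]:
    tracker = InFlightTracker()
    tracker.track(asyncio.sleep(0.01))  # a request that finishes in time
    tracker.track(asyncio.sleep(10))    # a request that outlives the grace period
    return await tracker.drain(grace_seconds=0.2)
```

In a real gateway the SIGTERM path would stop accepting new requests, call drain with a grace period shorter than the supervisor's stop timeout, and only then exit, so short requests complete and long ones are cancelled deliberately instead of being cut off mid-stream.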
3. systemd restart policy (low priority)
Consider adding a restart delay to Restart=on-failure (e.g. RestartSec=3) to avoid the vicious loop caused by rapid restarts.
4. Telegram connection stability (low priority)
- Confirm the proxy port's stability
- Check whether multiple Hermes instances share one Telegram Bot Token
Summary
This crash is a textbook case of a single-point exception being amplified. A module import in the CLI signal handler triggered an asyncio cleanup failure that should have stayed process-local. Combined with the gateway's Restart=on-failure auto-restart, each restart interrupted an in-flight API session, forming a restart-interrupt-reconnect-reinterrupt loop. The root cause is that signal handlers shouldn't do imports; the amplifier is the lack of graceful shutdown and restart throttling.
For this class of problem, first lay out the full timeline (each log line's timestamp + PID + signal), and the causal chain becomes visible: which event triggered which, which steps are cascading and which are independent. Here the CLI cleanup error and the gateway SIGTERM loop are two independent mechanisms overlapping in time; neither is fatal alone, but stacked they caused 12 minutes of downtime.
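As a rough sketch of the timeline-first approach, the snippet below merges events from different sources, sorts them by timestamp, and computes the gap to the previous event. The timestamps and descriptions are taken from the crash timeline table above; the tuple layout and the build_timeline helper are assumptions made for this example, not an existing Hermes log format.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# (timestamp, source, event), deliberately out of order, as merged logs often are.
EVENTS = [
    ("2026-05-20 22:55:02", "gateway", "restarted (new PID 37043)"),
    ("2026-05-20 22:51:59", "cli", "crashed: KeyboardInterrupt during asyncio.run() shutdown"),
    ("2026-05-20 23:02:44", "cli", "third attempt recovered (session 230244)"),
    ("2026-05-20 22:50:36", "cli", "session started"),
    ("2026-05-20 22:55:01", "gateway", "SIGTERM received, shut down"),
    ("2026-05-20 23:01:57", "gateway", "restarted (current instance)"),
    ("2026-05-20 22:55:13", "cli", "session 225211 stream interrupted"),
    ("2026-05-20 23:01:46", "gateway", "SIGTERM received again, shut down"),
]


def build_timeline(events):
    """Sort events chronologically and attach seconds since the previous event."""
    parsed = sorted((datetime.strptime(ts, FMT), src, msg) for ts, src, msg in events)
    rows = []
    previous = None
    for when, src, msg in parsed:
        gap = 0 if previous is None else int((when - previous).total_seconds())
        rows.append((when, src, gap, msg))
        previous = when
    return rows


for when, src, gap, msg in build_timeline(EVENTS):
    print(f"{when:%H:%M:%S}  +{gap:>4}s  {src:<8} {msg}")
```

Reading the gap column makes it easier to judge which events plausibly triggered which: the 1-second gap between SIGTERM and the restart points to the supervisor, while the 3-minute gap between the CLI crash and the first SIGTERM is a hint that the two mechanisms may be independent.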
