Keeping a Trading Robot Alive: Monitoring, Failover and 24/7 Operations

There is a popular guide on this board about building a trading robot — its architecture, execution and risk controls. This is the sequel nobody writes: keeping the thing running. A profitable strategy that is offline during the one move it was built for is a losing strategy. Live operations are where most home-built bots quietly die, and almost none of it is about the strategy.

The uncomfortable truth
Your bot will run on imperfect hardware, over an imperfect internet connection, against a broker that has maintenance windows and disconnects. The question is never "will something fail?" but "what does the bot do when it does?" If you do not have an answer, the market will eventually supply one for you.

Run it where it can stay up

Not your laptop. A laptop that sleeps, reboots for updates, or loses Wi-Fi is not an execution venue. Use a VPS or server close to the broker to cut latency and stay online.
Process supervision. Run under a supervisor (systemd, a process manager, a container restart policy) so a crash restarts the bot automatically instead of leaving positions unmanaged.
Time sync. Keep the clock disciplined with NTP. A bot whose clock drifts will misalign bars and timestamp orders incorrectly, a failure that is subtle and corrosive.

Heartbeats and watchdogs
A bot that has silently frozen looks identical to a bot with nothing to do. Distinguish them:

Heartbeat. Have the bot emit a regular "I am alive and here is my state" signal. If the heartbeat stops, something is wrong even if no error was thrown.
External watchdog. A second, dumb process that watches the heartbeat and alerts (or restarts) when it goes quiet. Do not let the thing that crashed be the thing responsible for noticing it crashed.
Data-staleness checks. If the last tick is older than the feed's normal update interval allows during market hours, treat the feed as dead and stop trading rather than acting on a frozen price.
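
For anyone who wants a starting point, here is a minimal Python sketch of the heartbeat, watchdog check and staleness check above. It assumes a file-based heartbeat shared between the bot and a separate watchdog process. The function names, the JSON layout and both thresholds are made up for illustration, so tune them to your own loop interval and instrument.

```python
import json
import os
import tempfile
import time

HEARTBEAT_MAX_AGE_S = 30.0  # assumed threshold; tune to your loop interval
TICK_MAX_AGE_S = 5.0        # assumed threshold; tune to instrument and session


def write_heartbeat(path, state, now=None):
    """Bot side: atomically write a timestamped snapshot of the bot's state."""
    now = time.time() if now is None else now
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"ts": now, "state": state}, f)
    # Atomic rename, so the watchdog never reads a half-written file.
    os.replace(tmp, path)


def heartbeat_is_fresh(path, max_age_s=HEARTBEAT_MAX_AGE_S, now=None):
    """Watchdog side: False if the file is missing, unreadable or too old."""
    now = time.time() if now is None else now
    try:
        with open(path) as f:
            ts = float(json.load(f)["ts"])
    except (OSError, ValueError, KeyError, TypeError):
        return False
    return (now - ts) <= max_age_s


def feed_is_stale(last_tick_ts, max_age_s=TICK_MAX_AGE_S, now=None):
    """True if no tick has been seen yet or the last one is too old."""
    now = time.time() if now is None else now
    return last_tick_ts is None or (now - last_tick_ts) > max_age_s
```

The bot would call write_heartbeat once per loop iteration and refuse to send orders while feed_is_stale is true. The watchdog runs as its own process (cron or a separate supervised service), calls heartbeat_is_fresh, and alerts or restarts when it returns False.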

State, recovery and reconciliation
This is the part that turns a small outage into a blown account. When the bot restarts, it must answer one question correctly: what positions do I actually have open right now?

Persist state continuously, so a restart does not forget open positions, pending orders, or stops.
Reconcile on startup. On every restart, query the broker for the true account state and reconcile it against what the bot believes. Trust the broker, not your memory.
Idempotent orders. Use client order IDs so a reconnect-and-retry cannot accidentally double a position.
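
A rough Python sketch of the reconcile-on-startup and idempotent-ID ideas follows. It assumes positions can be represented as a symbol-to-signed-quantity mapping and that your broker accepts a client-supplied order ID and rejects or deduplicates repeats. Check your broker's documentation on both points. The function names and the ID format are hypothetical.

```python
import hashlib


def reconcile(local_positions, broker_positions, tolerance=1e-9):
    """Compare believed positions with the broker's; the broker wins.

    Both arguments map symbol -> signed quantity.
    Returns (corrected_state, discrepancies).
    """
    discrepancies = {}
    for symbol in sorted(set(local_positions) | set(broker_positions)):
        believed = local_positions.get(symbol, 0.0)
        actual = broker_positions.get(symbol, 0.0)
        if abs(believed - actual) > tolerance:
            discrepancies[symbol] = {"believed": believed, "actual": actual}
    # Adopt the broker's view, dropping flat (zero-quantity) entries.
    corrected = {s: q for s, q in broker_positions.items() if abs(q) > tolerance}
    return corrected, discrepancies


def client_order_id(strategy, symbol, signal_ts, side):
    """Deterministic ID: the same signal yields the same ID on every retry."""
    raw = f"{strategy}|{symbol}|{signal_ts}|{side}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:24]
```

On startup the bot would fetch positions from the broker, call reconcile, log and alert on any discrepancy, and only then resume trading from the corrected state. Deriving the ID from the signal rather than from a random value is a design choice: a retry after a reconnect reuses the same ID, so the duplicate can be detected instead of opening a second position.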

Alerting and the kill switch

Alert on the things that matter: disconnects, rejected orders, a breached daily loss limit, heartbeat loss. Send the alerts somewhere you will actually see them at 3 a.m.
Keep a human kill switch you can hit from your phone, plus an automatic one that halts on repeated errors or a connection loss and flattens as soon as the broker is reachable.
Log every decision with enough context to reconstruct what the bot saw and why it acted. When something goes wrong — and it will — the log is the only witness.
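
The automatic half of the kill switch can be sketched in Python roughly like this. It assumes you supply your own flatten callable that closes all positions at the broker. The class name, the error threshold and the time window are hypothetical defaults, not recommendations.

```python
import time
from collections import deque


class KillSwitch:
    """Halts trading after too many errors inside a sliding time window."""

    def __init__(self, flatten, max_errors=5, window_s=60.0):
        self._flatten = flatten        # caller-supplied: closes all positions
        self._max_errors = max_errors
        self._window_s = window_s
        self._errors = deque()         # timestamps of recent errors
        self.halted = False
        self.flatten_failed = False
        self.reason = None

    def record_error(self, now=None):
        now = time.monotonic() if now is None else now
        self._errors.append(now)
        # Forget errors that have aged out of the window.
        while self._errors and now - self._errors[0] > self._window_s:
            self._errors.popleft()
        if len(self._errors) >= self._max_errors:
            self.trip("too many errors")

    def trip(self, reason):
        """Also the entry point for the manual (phone) kill switch."""
        if self.halted:
            return
        # Halt first, so no new orders go out even if flattening fails.
        self.halted = True
        self.reason = reason
        try:
            self._flatten()
        except Exception:
            # E.g. the connection is down: stay halted and let alerting escalate.
            self.flatten_failed = True

    def allow_trading(self):
        return not self.halted
```

The order path would check allow_trading before every order, and error handlers would call record_error. Once tripped, the switch stays tripped until a human restarts the bot, which is deliberate: the worst case should be flat and halted.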

Bottom line
The strategy is maybe half the battle; staying alive, knowing your true state after a crash, and failing safe is the other half — and it is the half that actually loses real money when neglected. Engineer the operations as carefully as the signal: supervise the process, watch it with a heartbeat, reconcile against the broker on every restart, and make sure the worst case is "flat and halted," never "trading blind."

What is the first thing that broke when your bot went live 24/7 — and how do you watch it now? Share your war stories below.