If you self-host LLMs, at some point you’ll likely find a stuck GPU consuming significant electricity for long periods of time. This is my summary of discovering the problem and then building automation that catches it and tells me how to fix it. I want automation to discover these issues. I don’t want to be the first line of defense against unnecessary electricity consumption.
My Self Hosted Setup
I run a large language model on hardware at home — an Ollama server on a mini PC (AMD Ryzen AI MAX+ 395 with an integrated GPU). Self-hosting gives me privacy, no subscription, and a model that works without an internet connection. The tradeoff is that I’m my own ops team. When something goes wrong, there’s no provider watching for outages.
This weekend, something went wrong.
I noticed the cooling fan on my AI server running at full speed without letting up. I hadn’t been using it, and my family’s use of it is pretty limited, so there were likely no active requests on the system. Either the server was compromised and mining cryptocurrency, or a process was hung and burning power for nothing. The system draws up to 85W, and leaving it pinned at full power indefinitely would show up on my electricity bill.
I needed to do two things. First, determine whether the machine was doing real work or stuck. Second, build a solution that catches this class of problem automatically, because waiting for me to notice a loud fan gives an unreliable Mean Time To Resolution.
Diagnosis. I checked what the model was doing with ollama ps and found a model stuck in a Stopping... state — it was supposed to unload and free the GPU, but the underlying process never exited. I confirmed the conditions by reading the GPU utilization gauge from the kernel (/sys/class/drm/card1/device/gpu_busy_percent): ~89% busy, ~85W. It had been in this state for roughly 20 hours with zero inference requests. The normal shutdown command (ollama stop) had no effect. A service restart (sudo systemctl restart ollama) cleared it. The root cause is a known Ollama bug where the GPU stays at full utilization with no work to do.
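The checks above can be collected into two small helpers. This is a minimal sketch, not the exact commands from that session: the function names gpu_busy and has_stopping_model are made up for illustration, the sysfs path is the one from this machine (the card index may differ on other hardware), and the sketch assumes a wedged model shows the word Stopping somewhere in the ollama ps output.

```shell
#!/usr/bin/env bash
# Diagnosis sketch: how busy is the GPU, and is a model stuck unloading?
# The default path is from the machine in this post; the card index may differ.
GPU_BUSY_FILE="${GPU_BUSY_FILE:-/sys/class/drm/card1/device/gpu_busy_percent}"

# Print the GPU busy percentage, or "NA" if the gauge cannot be read.
gpu_busy() {
  local v
  v=$(cat "$GPU_BUSY_FILE" 2>/dev/null) || v=""
  echo "${v:-NA}"
}

# Read `ollama ps` output on stdin; succeed if any model is listed as Stopping.
# Assumption: the stuck state appears as the literal text "Stopping".
has_stopping_model() {
  grep -q 'Stopping'
}

# Usage on the server:
#   gpu_busy
#   ollama ps | has_stopping_model && echo "a model is stuck in Stopping..."
#   sudo systemctl restart ollama    # the restart that cleared it in this post
```

A high gpu_busy reading together with a Stopping model and no active requests is the signature described above. Either signal alone is not conclusive.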
Watchdog. At this point, I had confirmed that Ollama was hung and needed to be reset. But this problem has happened before, and it would likely happen again. I needed to stop relying on my ability to detect unexpected zephyrs emanating off my server. I created a watchdog that alerts me only when this specific failure occurs. The logic of the watchdog needs to be narrow to avoid false positives. The basics of the system are as follows:
- A cron job samples the system every 5 minutes with two reads: is a model process loaded, and how busy is the GPU?
- A sample only counts as unhealthy when both conditions are true: a model is loaded and the GPU is at or above 70% utilization. High GPU usage during inference is normal; this targets high usage while idle.
- The unhealthy state must hold continuously for 15 minutes before the watchdog alerts. That’s longer than any legitimate request I expect on this server.
- Any healthy reading resets the counter, so the watchdog only fires on a sustained stuck condition, not a transient spike.
- When it fires, it sends a push notification to my phone via ntfy.sh with the diagnosis and the one-line fix command.
- After sending, it suppresses further alerts for one hour. I want one notification, not a stream of them.
There are two design choices worth noting:
- The watchdog alerts but never restarts the model automatically. A person can tell a stuck runner from a legitimate long job in seconds. I didn’t want automation killing real work.
- If the alert fails to send, the watchdog doesn’t mark the event as handled, so the next run retries instead of going silent.
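The timing rules in the two lists above can be dry-run without a GPU, files, or network. The sketch below restates them as a pure function. The name decide and its argument order are made up for illustration, and the two constants mirror the 15-minute and one-hour values described above.

```shell
#!/usr/bin/env bash
# Sketch of the watchdog's timing rules as a pure function (no files, no network).
SUSTAIN_SECS=900    # bad condition must hold this long before alerting
REALERT_SECS=3600   # minimum gap between pushes

# decide NOW BAD BAD_SINCE LAST_ALERT
#   NOW         current epoch seconds
#   BAD         1 if a runner is loaded AND the GPU is at/above the threshold, else 0
#   BAD_SINCE   epoch of the first bad sample in this streak (0 = no streak yet)
#   LAST_ALERT  epoch of the last push that was actually delivered (0 = none)
# Prints one of: reset | wait | suppress | alert
decide() {
  local now=$1 bad=$2 bad_since=$3 last_alert=$4
  if [ "$bad" -eq 0 ]; then echo reset; return; fi
  # The first bad sample starts the clock.
  if [ "$bad_since" -eq 0 ]; then bad_since=$now; fi
  if [ $(( now - bad_since )) -lt "$SUSTAIN_SECS" ]; then echo wait; return; fi
  if [ "$last_alert" -ne 0 ] && [ $(( now - last_alert )) -lt "$REALERT_SECS" ]; then
    echo suppress; return
  fi
  echo alert
}

# Dry run: five-minute samples of a stuck runner, first seen at t=1000.
decide 1000 1 0 0        # wait     (clock just started)
decide 1900 1 1000 0     # alert    (900s sustained, nothing sent yet)
decide 2200 1 1000 1900  # suppress (pushed 300s ago)
decide 2200 1 1000 0     # alert    (earlier push failed, so it retries)
decide 2500 0 1000 1900  # reset    (one healthy sample clears the streak)
```

The retry behaviour falls out of the caller recording LAST_ALERT only after a successful push: a failed send leaves it at 0, so the next sample returns alert again.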
An aside for my friends in Telco: I originally planned to send alerts as SMS through AT&T’s email-to-SMS gateway, but that service was shut down in mid-2025. I switched to ntfy.sh push notifications, which turned out to be simpler and require no stored credentials. The telco industry whiffed badly. Push notifications should be carrier infrastructure, but instead they’re an OTT service. The industry could figure out how to route calls and SMS across different networks, but it couldn’t figure out how to route push notifications across them? A pox on all your houses. IMS should have been more than SIP routing!
Watchdog Results
The stuck Ollama runner was discovered, evaluated and reset. Until last week, I ran the risk that a hung process would run silently for 20+ hours and raise my electricity bill. The system now self-reports within roughly 15 to 20 minutes of a runner getting stuck. When a model pins the GPU again, I’ll get a phone notification with clear guidance on how to fix it, whether I’m at the machine or not. The watchdog is free to run and very simple: a cron job and two timestamp files that survive reboots.
Cron Setup:
The cron setup lives in /etc/cron.d/ollama-watchdog (a system cron drop-in, not a user crontab). The installer writes it as:
SHELL=/bin/bash
PATH=/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin
# m h dom mon dow user command
*/5 * * * * USER /opt/ollama-watchdog/watchdog.sh >> /opt/ollama-watchdog/state/watchdog.log 2>&1
A few operational notes:
- No crontab command involved. Because it’s a drop-in file, you manage it by editing or removing /etc/cron.d/ollama-watchdog directly.
- It’s picked up automatically; no reload is needed. The file should be mode 644 and owned by root.
- Reinstalling (sudo bash /home/$USER/install-ollama-watchdog.sh) rewrites this file each time, so edit the installer if you want to change the schedule permanently. For a one-off tweak, edit the cron file directly.
- Detection cadence vs. alert latency: the 5-min interval sets the resolution. Combined with the 15-min sustain threshold, worst-case time-to-alert after a runner wedges is roughly 15 to 20 minutes. The first bad sample starts the clock and the alert fires on the sample that lands 900s later, so it takes four consecutive bad samples.
Verify it’s installed and watch it work:
cat /etc/cron.d/ollama-watchdog # confirm the entry
tail -f /opt/ollama-watchdog/state/watchdog.log # watch each 5-min decision
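The alert-latency figure in the notes above is simple arithmetic, and it helps to redo it before changing either the cron interval or SUSTAIN_SECS. A minimal sketch follows. The function name alert_bounds is made up for illustration, and it assumes cron fires exactly on schedule; if a run lands a second early relative to the stored timestamp, the alert slips by one more interval.

```shell
#!/usr/bin/env bash
# Time-to-alert bounds for a periodic sampler with a sustain threshold.
# alert_bounds INTERVAL_SECS SUSTAIN_SECS  ->  "samples=N min=Ss max=Ss"
alert_bounds() {
  local interval=$1 sustain=$2
  # Intervals that must elapse after the first bad sample (ceiling division).
  local gaps=$(( (sustain + interval - 1) / interval ))
  local samples=$(( gaps + 1 ))      # the first bad sample plus one per interval
  local min=$(( gaps * interval ))   # runner wedges just before a sample
  local max=$(( min + interval ))    # runner wedges just after a sample
  echo "samples=$samples min=${min}s max=${max}s"
}

alert_bounds 300 900   # prints: samples=4 min=900s max=1200s
```

With the values used here, that is four consecutive bad samples and an alert between 15 and 20 minutes after the runner gets stuck.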
Watchdog Installation Script:
#!/usr/bin/env bash
#
# install-ollama-watchdog.sh
# ---------------------------
# Installs a cron-driven watchdog that pushes a phone notification (via ntfy.sh)
# when an `ollama runner` pegs the GPU for a sustained period (the
# "permaspinning fan" failure).
#
# - Samples every 5 min (cron).
# - Alerts only after the bad condition has held continuously for >15 min.
# - Rate-limits to at most one push per hour while it stays stuck.
# - ALERT ONLY: it never restarts ollama for you.
#
# Transport: ntfy.sh. The topic name IS the secret/credential, so make it
# deliberately opaque. Subscribe to it in the ntfy app or at
# https://ntfy.sh/<topic>. No account, no API key, no .env needed.
#
# Run with: sudo bash install-ollama-watchdog.sh
#
set -euo pipefail
BASE=/opt/ollama-watchdog
STATE="$BASE/state"
RUN_USER=_USER_ # cron job runs as this user (owns state/log). Add your user account!
if [ "$(id -u)" -ne 0 ]; then
echo "This installer must be run as root: sudo bash $0" >&2
exit 1
fi
echo "==> Creating $BASE"
mkdir -p "$BASE" "$STATE"
# ---------------------------------------------------------------------------
# watchdog.conf (settings; edit thresholds/topic here, no reinstall needed)
# ---------------------------------------------------------------------------
echo "==> Writing $BASE/watchdog.conf"
cat > "$BASE/watchdog.conf" <<'CONF_EOF'
# ollama-watchdog configuration (sourced by watchdog.sh)
# --- where the alert goes (ntfy.sh) ---
# The topic name is opaque on purpose: anyone who knows this URL can read AND
# spoof your alerts, and it reveals nothing about what it monitors. Treat it
# like a password. Subscribe to this same topic in the ntfy phone app.
NTFY_URL="https://ntfy.sh/_YOURNTFY.SH_TOPIC"
NTFY_PRIORITY="high" # min|low|default|high|urgent (urgent can bypass Do Not Disturb)
NTFY_TITLE="Ollama ALERT"
NTFY_TAGS="rotating_light,fire" # emoji shown on the notification
# --- detection ---
GPU_BUSY_FILE="/sys/class/drm/card1/device/gpu_busy_percent"
GPU_BUSY_THRESHOLD=70 # percent; "bad" sample if at/above this
SUSTAIN_SECS=900 # must stay bad this long before alerting (15 min)
REALERT_SECS=3600 # min seconds between repeat pushes (1 hr)
STATE_DIR="/opt/ollama-watchdog/state"
CONF_EOF
# ---------------------------------------------------------------------------
# watchdog.sh (the monitor itself)
# ---------------------------------------------------------------------------
echo "==> Writing $BASE/watchdog.sh"
cat > "$BASE/watchdog.sh" <<'WD_EOF'
#!/usr/bin/env bash
# ollama-watchdog: push a phone alert (via ntfy.sh) when an ollama runner pegs
# the GPU for a sustained period. Alert-only; no auto-remediation. See
# watchdog.conf.
set -uo pipefail
export PATH=/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin
CONF=/opt/ollama-watchdog/watchdog.conf
[ -r "$CONF" ] && . "$CONF"
# defaults, matching the shipped watchdog.conf (overridable there)
: "${NTFY_URL:=https://ntfy.sh/_YOURNTFY.SH_TOPIC}"
: "${NTFY_PRIORITY:=high}"
: "${NTFY_TITLE:=Ollama ALERT}"
: "${NTFY_TAGS:=rotating_light,fire}"
: "${GPU_BUSY_FILE:=/sys/class/drm/card1/device/gpu_busy_percent}"
: "${GPU_BUSY_THRESHOLD:=70}"
: "${SUSTAIN_SECS:=900}"
: "${REALERT_SECS:=3600}"
: "${STATE_DIR:=/opt/ollama-watchdog/state}"
BAD_SINCE="$STATE_DIR/bad_since"
LAST_ALERT="$STATE_DIR/last_alert"
mkdir -p "$STATE_DIR"
now=$(date +%s)
send_push() { # title body
local title="$1" body="$2"
curl --silent --show-error --fail \
-H "Title: $title" \
-H "Priority: $NTFY_PRIORITY" \
-H "Tags: $NTFY_TAGS" \
-d "$body" \
"$NTFY_URL"
}
# --- test mode: send one push now and exit ---
if [ "${1:-}" = "--test" ]; then
if send_push "ai.local test" "ollama-watchdog test $(date '+%H:%M'). If you got this, alerts work."; then
echo "test push sent to $NTFY_URL"; exit 0
else
echo "test push FAILED" >&2; exit 1
fi
fi
# --- sample ---
runner_pid=$(pgrep -f 'ollama runner' | head -1 || true)
gpu_busy=$(cat "$GPU_BUSY_FILE" 2>/dev/null || echo "")
bad=0
if [ -n "$runner_pid" ] && [ -n "$gpu_busy" ] \
&& [ "$gpu_busy" -ge "$GPU_BUSY_THRESHOLD" ] 2>/dev/null; then
bad=1
fi
if [ "$bad" -eq 0 ]; then
rm -f "$BAD_SINCE" "$LAST_ALERT" # healthy: reset
echo "$(date -Is) ok gpu=${gpu_busy:-NA}% runner=${runner_pid:-none}"
exit 0
fi
[ -f "$BAD_SINCE" ] || echo "$now" > "$BAD_SINCE"
since=$(cat "$BAD_SINCE" 2>/dev/null || echo "$now")
elapsed=$(( now - since )); mins=$(( elapsed / 60 ))
echo "$(date -Is) bad gpu=${gpu_busy}% runner_pid=${runner_pid} sustained=${mins}m"
[ "$elapsed" -lt "$SUSTAIN_SECS" ] && exit 0 # not sustained 15 min yet
last=0; [ -f "$LAST_ALERT" ] && last=$(cat "$LAST_ALERT" 2>/dev/null || echo 0)
[ $(( now - last )) -lt "$REALERT_SECS" ] && exit 0 # already pushed within the hour
# context for the message
tctl=$(sensors 2>/dev/null | awk '/^Tctl:/{print $2; exit}')
ppt=$(sensors 2>/dev/null | awk '/PPT:/{print $2" "$3; exit}')
model=$(ollama ps 2>/dev/null | awk 'NR==2{print $1; exit}'); [ -z "$model" ] && model="?"
body="ollama pegged GPU ${gpu_busy}% for ${mins}m. Tctl ${tctl:-?} PPT ${ppt:-?}. model=${model}. Fix: sudo systemctl restart ollama"
if send_push "$NTFY_TITLE" "$body"; then
echo "$now" > "$LAST_ALERT"
echo "$(date -Is) ALERT sent: $body"
else
echo "$(date -Is) ALERT send FAILED" >&2
fi
WD_EOF
chmod 755 "$BASE/watchdog.sh"
chmod 644 "$BASE/watchdog.conf"
# state dir must be writable by the cron user (job runs as $RUN_USER)
chown -R "$RUN_USER":"$RUN_USER" "$STATE"
# ---------------------------------------------------------------------------
# cron.d entry (runs as $RUN_USER every 5 minutes)
# ---------------------------------------------------------------------------
echo "==> Installing /etc/cron.d/ollama-watchdog"
cat > /etc/cron.d/ollama-watchdog <<CRON_EOF
SHELL=/bin/bash
PATH=/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin
# m h dom mon dow user command
*/5 * * * * $RUN_USER /opt/ollama-watchdog/watchdog.sh >> /opt/ollama-watchdog/state/watchdog.log 2>&1
CRON_EOF
chmod 644 /etc/cron.d/ollama-watchdog
echo
echo "==> Installed."
echo " Monitor : $BASE/watchdog.sh (every 5 min via /etc/cron.d/ollama-watchdog)"
echo " Config : $BASE/watchdog.conf"
echo " Log : $STATE/watchdog.log"
echo
echo "Subscribe in the ntfy app to the topic at the end of NTFY_URL in $BASE/watchdog.conf"
echo "Then send yourself a test push:"
echo " $BASE/watchdog.sh --test"
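The config comments in the installer say the topic name is the only credential, so it should be hard to guess. One way to generate one is sketched below. The function name new_topic and the wd- prefix are arbitrary choices for illustration, not part of the installer.

```shell
#!/usr/bin/env bash
# Generate an opaque ntfy topic name: 16 random bytes as 32 hex characters.
# The "wd-" prefix is arbitrary and reveals nothing about the host or purpose.
new_topic() {
  printf 'wd-%s\n' "$(head -c 16 /dev/urandom | od -An -tx1 | tr -d ' \n')"
}

# Print a ready-to-paste line for watchdog.conf.
topic=$(new_topic)
echo "NTFY_URL=\"https://ntfy.sh/$topic\""
```

Paste the printed line into /opt/ollama-watchdog/watchdog.conf, subscribe to the same topic in the ntfy app, then run /opt/ollama-watchdog/watchdog.sh --test to confirm the push arrives.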

