The Operations File: A Pattern for Establishing Maintainable Systems

Sometimes, you forget what you built.

Any complex system outgrows an operator’s memory. When you’re building a solution, you get a handful of services running, and while you’re doing active development you can hold the whole picture in your head. You know where the configs live, which ports map to which services, and how to restart the occasionally hanging service. Then one evening something breaks and you realize you can’t remember whether the database runs in a container or as a systemd service. You’re not even sure which log to check first. You’re going to have to probe around for a while, and you’d rather be spending time with your kids.

I started writing operations diaries when I noticed how much time I was spending re-learning my own systems. Every time I needed to do some maintenance on an aging server, I started with twenty minutes of archaeology: I’d do some generic Linux system fingerprinting to retrace how things were wired together, confirm which config controlled what, and eventually rediscover the decisions I had made a year ago.

I don’t do this much anymore. I’ve landed on a pattern that saves me from the archaeology. I build and maintain a structured document (Operations.md) in the home directory of the systems I need to maintain, so it’s easy to get re-acquainted as soon as I log in. Making this a permanent practice also gives agents a way to engage with the server without any preconceived external context. It answers the questions I usually have when I sit down at the terminal:

  • What’s running on this system?
  • How do I check the health of the important components?
  • What do I do when something goes wrong?
  • What has gone wrong before?

The operations file is a living document that evolves with the project. The file has the following sections:

Operations.md
├── Quick Reference (status checks, logs, restarts)
├── Architecture Overview (visual map + port table)
├── Services (Homebrew/systemd, launchd, Docker)
├── Hardware Specifications
├── Disk Layout & Usage
├── Network Configuration
├── Listening Ports
├── Scheduled Tasks
├── Remote Access Setup
├── Troubleshooting Guide
├── Configuration Locations
├── Backup Recommendations
├── Known Issues
└── Changelog

I built a tool that can generate a first draft of an operations file for you. It runs platform-native commands on macOS or Linux, collects real system data, and renders a structured Operations.md based on what the script discovers.

This post walks through the concepts behind my Operations.md file. I describe the layout of the document, which is optimized for rapid troubleshooting, explain the problems each section solves, and describe how to start building and maintaining an Operations.md file for systems you need to maintain.

How Do I Know Nothing Is Broken Right Now?

At the very top of the Operations.md file, I capture one command (or a short loop) that checks all critical services at once. Below that, I capture per-service checks for troubleshooting. Imagine a broad check that identifies a problem: I’ll need to drill into the components that could produce it. Some open-ended troubleshooting is unavoidable, but I always want to start with the broadest check first and the ability to drill into specifics second. If you go after a hunch too soon, you can end up wasting hours on the wrong problem. My operations file starts with guidance on how to quickly collect the state of services on the system.

Services often fail silently. A container can report “running” while the application inside has crashed. A database can be online but unable to accept writes because the disk is full. A scheduled job can fail every execution for weeks, but I won’t notice until I need its output.

I handle these risks with a general health check section: a set of commands we can run in sequence to verify that each component of the system has what it needs to do its job.

A good health check tests the experience a service is supposed to deliver. For a web application, that means hitting an endpoint and confirming a valid response. For a database, that means running a query. For an API, that means making a real request. Checking whether the process is alive tells us something, but not enough. “Is the process delivering its feature?” tells us whether the system is actually working.
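As a sketch of that idea, a broad check might look like the following. The port, the /healthz endpoint, and the psql client are illustrative assumptions, not part of any particular system; substitute whatever your services actually expose.

```shell
#!/bin/sh
# Broad health check sketch. Endpoint, port, and DB client are
# hypothetical placeholders -- swap in your own.

# Web app: test the actual response, not just the process table.
web_status=$(curl -fsS -m 5 http://localhost:8080/healthz >/dev/null 2>&1 \
  && echo OK || echo FAILING)

# Database: "online" is not the same as "accepting queries" --
# run a real round trip (psql shown; use your client of choice).
db_status=$(psql -h localhost -c 'SELECT 1;' >/dev/null 2>&1 \
  && echo OK || echo FAILING)

echo "web: $web_status"
echo "db:  $db_status"
```

Each check collapses to OK or FAILING, so the broad pass reads at a glance before any drilling down begins.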

Next I need a conceptual lay of the land for the services running on the system. My operations file addresses this with an architecture overview near the top. The document lists every service, how it was installed (container, systemd unit, native process), what port it listens on, and what it depends on. A diagram showing how traffic flows through the system works well for understanding relationships. The table and the diagram give two views of the same information, optimized for different questions: what software is on this system, versus which services and ports are listening.

When something breaks, I need to know what else might be affected. When I want to add a new service, I need to know which ports are already taken and which dependencies exist. Without an inventory, I am reasoning about a system I cannot fully describe.

Generally speaking, I always run the same set of discovery commands to figure out the state of the system. I find I don’t retain all of the commands I need for checking systemd, Docker containers, nginx, and other services. To overcome my failing memory, I use a Python script to collect the discrete details of the system, and I use an AI agent to infer the details and rationale behind them. If the agent doesn’t get it right, I correct it and update the documentation.
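As an illustration, the discovery pass boils down to a short list of commands like these; each is guarded so it no-ops where the tool isn’t installed (your mix of service managers will differ):

```shell
#!/bin/sh
# Discovery sketch: enumerate services, containers, and listeners.

# systemd services currently running (Linux)
if command -v systemctl >/dev/null; then
  systemctl list-units --type=service --state=running
fi

# Docker containers and their published ports
if command -v docker >/dev/null; then
  docker ps --format '{{.Names}}\t{{.Ports}}'
fi

# Listening TCP sockets: ss on Linux, lsof on macOS
if command -v ss >/dev/null; then
  ss -tlnp
elif command -v lsof >/dev/null; then
  lsof -iTCP -sTCP:LISTEN
fi

# nginx: dump the effective config, keep the interesting lines
if command -v nginx >/dev/null; then
  nginx -T 2>/dev/null | grep -E 'listen|server_name'
fi
```

The guards matter: the same script can run unchanged on a Mac laptop and a Linux server, which is exactly the property an operations file’s quick-reference commands should have.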

With my current operating practices, I don’t have to keep dragging out my operations diaries to rediscover the state of the system. I can re-apply agents to maintain a self-updating Operations file that gives me the details I need for operating the system and for troubleshooting and tuning its core services.

I recommend writing this section when the system is healthy and there is time to verify each entry. Run through the services, confirm the ports, confirm the install methods, trace the dependency chains. Writing documentation forces you to reach a level of understanding that casual troubleshooting does not. You cannot write down how a service connects to its database without confirming you actually know which database it uses. The act of writing documentation helps you discover surprises.

How Do I Fix Things When It Does Break?

When something goes wrong, we need a way to diagnose the problem and a record of how similar problems were solved before.

I build the troubleshooting section from specific incidents, each following a pattern: symptoms observed, commands run to investigate, root cause found, fix applied. The most valuable entries are the ones written right after spending two hours solving a problem, while the details are still fresh. In the future, we’ll solve the same problem in five minutes.
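A sketch of what one such entry might look like (the symptom, commands, and fix here are invented for illustration):

```
### Symptom: web app returns 502 immediately after reboot

- Observed: nginx running, but "connection refused" to upstream in error.log
- Investigated: systemctl status app; journalctl -u app -b
- Root cause: app unit started before the database was ready
- Fix: added After=postgresql.service to the app's unit file
- Wrong turn: spent 30 minutes auditing the nginx config; it was fine
```

Note that the entry is headed by the symptom, not the cause, and records the wrong turn alongside the fix.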

I try to organize this section by symptom rather than by cause. When something breaks, I know what I am seeing (the service is unresponsive, the page returns an error, the log is full of a particular message). I don’t yet know why. Entries organized by symptom let me match what I observe to a known pattern without already knowing which components to check.

I try to capture the wrong turns too. If I spend thirty minutes investigating a configuration issue before discovering that the real problem was an upstream provider outage, that sequence belongs in the entry. Next time I see the same symptoms, I check the upstream provider first and save myself from unnecessary detours.

How Do I Know Something Is About to Break?

The operations file addresses potential failures with a section on what to watch and how to check it. For each item, we document the command to check current status.
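For example, a disk-usage watch item might pair a one-line check with a threshold; the 80% cutoff below is an arbitrary illustration, not a recommendation:

```shell
# Flag any filesystem over 80% full; tune the threshold per system.
df -P | awk 'NR > 1 { use = $5 + 0; if (use > 80) print $6 " is at " use "% -- investigate" }'
```

Because the command and its threshold live in the file together, the next operator (human or agent) knows both what to run and what counts as a warning sign.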

Known Issues, Constraints, and Workarounds

Every system has compromises. A partition that turned out to be too small. A driver that behaves unpredictably under certain conditions. I capture these in a Known Issues section.

Keeping It Alive: The Operations File as a Living Document

An operations file written once and never updated becomes a liability that provides false confidence. We trust it, act on stale information that no longer reflects the system, and make things worse.

I update the file during all maintenance. When I add a service, I add it to the inventory before moving on to the next task. When I solve a problem, I write the troubleshooting entry while the details are fresh. When I discover a new constraint, I add it to known issues immediately. The file grows with the system instead of drifting away from it. I leverage coding agents both to troubleshoot in new ways and to update the documentation.

A changelog section anchors the document. Every significant change gets a dated entry describing what changed, why it changed, and what I learned in the process. The emphasis belongs on reasoning, because the reasoning behind a decision decays fastest. Capturing the reasoning at the time of the decision makes tech debt manageable.
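An entry can be as short as this (contents invented for illustration); the “why” line is the part that pays off later:

```
## 2024-03-12 -- Moved app behind nginx reverse proxy
- What: app now listens on 127.0.0.1:8080; nginx terminates TLS on 443
- Why: needed TLS, and wanted a place to add rate limiting later
- Learned: the app ignores X-Forwarded-For unless proxy mode is enabled
```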

Over time, this section becomes institutional memory. When I encounter a configuration I do not recognize, the changelog tells me when it was added and what problem it was solving. That context is the difference between “I should not touch this because I do not understand it” and “I understand why this is here and whether it still applies.”

Getting Started: What to Write Down First

If this all sounds like a lot, don’t panic. The scope of a complete operations file can feel paralyzing when you are starting from nothing, so don’t start with a target of completeness. Start from what you already know and build it over time. Also, don’t start from scratch: the tool at the bottom of this post can do your initial discovery for you.

Speedy navigation of expanding Markdown files

As the Operations file grows, navigation gets harder. You can use a tool like mindmap-cli to generate a mind map from a markdown file. A mind map generated from the markdown source gives you a bird’s-eye view of the whole structure, collapsible and clickable, without maintaining a separate document. The markdown file itself is for depth; the mind map is for orientation. Together they let someone operate the system without reading a thousand lines of operations manual.

Build an Operations.md file for Your System

The tool I mentioned at the top, Operations Discovery Mechanism, automates the hardest part of getting started: the initial inventory. It collects real data from the system using platform-native commands (brew services, launchctl, and lsof on macOS; systemctl, journalctl, and ss on Linux), structures it as JSON, and renders a complete Operations.md with copy-paste-ready commands for every service it finds.

Create an Operations directory in the home directory of your target system, then clone the repository into it.

gh repo clone CaptainMcCrank/OperationsDiscoveryMechanism

The workflow takes two steps. First, run the collector script to produce a JSON snapshot of your system. Second, run the renderer to turn that snapshot into a structured markdown document. There are no dependencies beyond Python 3.9 and the standard library.

# macOS
python3 mac_system_info.py -o system_info.json
python3 generate_operations.py -o Operations.md

# Linux
python3 linux_system_info.py -o system_info.json
python3 generate_operations.py -o Operations.md

For richer documentation, feed the collected JSON to Claude Code with the prompt template embedded in the README. Claude can infer relationships between services, add descriptions of what each one does, and flag potential issues that the script collects but cannot interpret.

The Operations.md file covers the architecture overview, the service inventory, the quick-reference commands, the port map, and the disk layout. Over time, if you’re disciplined about updating the troubleshooting entries, known issues, and changelog, the document becomes extremely powerful. You’ll be able to hop back onto old systems with ease, and you’ll be in a position to start doing interesting experiments with agentic operations: agents that log into the system can acquire context without spending tokens on system enumeration.

Security Apprenticeship

“What do I need to know how to do in order to pursue a career in security?” Good news! I just happen to have drafted this roadmap of reading material just for you!

In my previous writeup, I provided a roadmap of reading material for anyone interested in developing a career in security. The material covered concepts that should be understood by anyone working in the field, but it didn’t cover how to “do” security. This article summarizes topics and activities that are well understood by anyone doing security work professionally. If you develop mastery of these ideas, you’ll be approaching a point where you can start doing meaningful security work. Unfortunately, you won’t be done reading after you’ve finished this writeup; I have at least two other guides in draft that will help guide you on your path toward a career in security. This guide helps you establish your foundational knowledge of techniques for restricting access to a system.

Walking The Security Practitioners Path

If we distill security to its most fundamental concept, security is about the controls that ensure activities on a system are authorized.

We can talk about these ideas abstractly, but unless you actually implement them on a working system, your knowledge won’t have much application. The best way to learn them is by using Linux.

Linux gives you plenty of experience with programs that don’t work because of security controls. You’ll have to learn how to debug those problems to get the processes working, and eventually you’ll develop a taste for the correct way to implement those controls.

This next phase of your education covers how to bring up a Linux system, implement interesting programs like web servers, and use some basic capabilities for implementing access controls. My goal in writing this is to provide pointers to well-written guides that will help you learn to start “doing” security, which is to say, activating controls that protect a system. In my opinion, mastering these topics should be mandatory; however, I should highlight that I know many people in the security industry today who don’t have mastery in all of these topics. There are areas here where I’m not as strong as I’d like to be, as well. Everyone who is good at security struggles with the idea that they need to do more reading when time permits.

Learning Linux

Part 1: Well Begun is Half Done. Dip Your Toes into Linux.

Linux: TL;DR:  https://linuxjourney.com/

Learn what Linux is and its history.  You’ll want to learn about the basic purpose of Linux if you have no experience with the operating system.  Some day you’ll need to learn about important related concepts like POSIX (https://en.wikipedia.org/wiki/POSIX) and ANSI C (https://en.wikipedia.org/wiki/ANSI_C).  For now, let’s just learn some history: https://linuxjourney.com/lesson/linux-history

Learn about the Shell (aka the command line) on Linux.  Learning the shell is the computer science equivalent of learning to walk or ride a bike.  Have an operating system or applications you want to interact with?  Generally you will be working within a shell.  A shell is ‘just’ a program that lets you use a keyboard to interact with the operating system.  Some day you will need to write shell scripts to automate tedious tasks or filter large amounts of data; a script is a series of commands that execute in some order based on whatever conditions you choose to define.  You’ll learn about “environment variables” like PATH and PYTHONPATH, variables whose values you’ll modify to make new programs or libraries accessible from any filesystem location while you’re in the shell.  The shell is your foundation for working with a Linux machine.  https://linuxjourney.com/lesson/the-shell. Alternatively, this tutorial teaches you how to get around the shell.
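A minimal sketch of the PATH and PYTHONPATH idea (the `$HOME/bin` and `$HOME/mylibs` directories are just example locations):

```shell
#!/bin/sh
# PATH is a colon-separated list of directories the shell searches,
# left to right, when you type a command name.
export PATH="$HOME/bin:$PATH"   # programs in ~/bin now win the search

# PYTHONPATH does the same job for Python module imports.
export PYTHONPATH="$HOME/mylibs${PYTHONPATH:+:$PYTHONPATH}"

# Inspect the search order, one directory per line.
echo "$PATH" | tr ':' '\n'
```

Prepending (rather than appending) is the common choice because the first match wins; that ordering detail is exactly the kind of thing shell debugging teaches you.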

Learn about User Accounts & Groups.  This helps you understand the foundational framework of access controls on computers.  When you need to protect data from access by an unauthorized party, you’ll need to use account and group management concepts.  Learning about accounts and groups will help you understand important concepts like what the “root” account’s purpose is, what the “wheel,” “www-data,” and “nobody” groups are for, and more.  As you develop your security skills you’ll eventually need to learn how hackers elevate privilege.  Accounts and groups will give you a foundation for when we are ready to learn about “privilege escalation.”  https://linuxjourney.com/lesson/users-and-groups
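You can start exploring these concepts immediately with a few read-only commands:

```shell
#!/bin/sh
# Who am I, and which groups grant me access?
id          # uid, primary gid, and supplementary groups
id -un      # just the username

# root is special because it is uid 0 -- verify:
id -u root

# List a few of the groups defined on this system.
if command -v getent >/dev/null; then
  getent group | head -5
else
  head -5 /etc/group
fi
```

Nothing here changes the system; it just makes the account/group model concrete before you start modifying it.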

Learn about File Permissions.  When you are ready to actually protect data from unauthorized access, you’ll have to read and evaluate the correctness of filesystem permissions.  If you have a file on a system with multiple users, you need to learn how to control who can access it and define the level of access (read, write, execute, SUID, etc.).  Again: some day you will be responsible for helping make sure sensitive files are only accessible by the correct people, and that those people have the right level of access.  You need to learn file permissions to perform this task.  One way to know if you’ve mastered this topic is to test whether you can propose accurate access control permissions on a web directory.  Make one version that’s dangerous, and one that’s safe.  Be able to explain why the dangerous one is dangerous, and how it could be exploited.  Start your journey here:  https://linuxjourney.com/lesson/file-permissions
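Here’s a worked sketch of that exercise in a scratch directory; the assumption that the web server reads files via its group (e.g. www-data) is illustrative:

```shell
#!/bin/sh
# Scratch web directory for the permissions exercise.
mkdir -p /tmp/webroot
cd /tmp/webroot
touch index.html config.php

# Dangerous: 777 means ANY local user (or any compromised process)
# can overwrite index.html -- e.g. to inject malicious content.
chmod 777 index.html

# Safer: owner read/write, group (the web server) read-only,
# everyone else no access. Secrets in config.php stay private.
chmod 640 config.php

ls -l   # compare: -rwxrwxrwx vs -rw-r-----
```

Being able to articulate *why* 777 on a web root is exploitable, and what 640 plus correct group ownership buys you, is the mastery test the lesson above is driving at.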

Learn about the “Filesystem.”  A file system is the logical structure where you store “files.”  An operating system without the ability to process files is of limited utility.  In the Windows world, we have drives with names and folders.  In the macOS world, we have drives with names and folders.  In iOS and Android, there is a file system, but its implementation is not as central to the user experience.  In Linux, deeply understanding what the directories under “/” are for is critically important.  There are “files” in the /proc directory that can tell you important statistics about system performance.  Configuring servers that you’ll install will require an understanding of /etc, /var, and other directories.  If a server is under DDoS attack, you’ll need to understand information about the number of network connections the system is currently supporting.  You can indirectly use tooling like ifconfig to gather system performance information, or you could just do an ls against /proc/net/dev.  You’ll also need to learn about read-only filesystems like SquashFS (https://tldp.org/HOWTO/SquashFS-HOWTO/whatis.html). Someday, you’ll need to figure out what actually happens when you type “ls” into the shell and it somehow enumerates the files in your current directory. If you type a random command in, it doesn’t work. Why? How did the OS know which version of “ls” to run? Where does this “ls” binary live, anyway? Learning file systems is mandatory.  https://linuxjourney.com/lesson/filesystem-hierarchy
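A small taste of the above, safe to run read-only (the /proc read only exists on Linux, hence the guard):

```shell
#!/bin/sh
# /proc is a virtual filesystem: kernel state exposed as readable files.
if [ -r /proc/net/dev ]; then
  head -3 /proc/net/dev   # per-interface network counters, no ifconfig needed
fi

# Answering "where does the ls I type actually live?"
command -v ls   # the path the shell resolved
echo "$PATH"    # the directories it searched, in order
```

When a typed command “doesn’t work,” the answer is almost always in that last pair: either the binary isn’t in any $PATH directory, or a different binary earlier in the search order is shadowing the one you expected.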

Learn about /dev.  An extremely important design philosophy in Unix is that “everything is a file.”  Hardware is directly accessible through the file system, represented as file system objects.  Learning about the /dev directory will give you important insight into how devices work on the system, which at some point you may want to tamper with if you aspire to be a hardware hacker.  https://linuxjourney.com/lesson/dev-directory
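You can see the philosophy directly with a few harmless pokes:

```shell
#!/bin/sh
ls -l /dev/null   # the leading 'c' marks a character device, not a regular file

# Reading a device file works like reading any other file:
head -c 8 /dev/urandom | od -An -tx1   # eight random bytes, hex-dumped

echo "discarded" > /dev/null           # writes to the bit bucket vanish
```

The same open/read/write interface that serves text files serves the kernel’s random number generator and the null device; that uniformity is what “everything is a file” means in practice.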

Learn about the Kernel.  The kernel is the mechanism for controlling hardware.  It is where security policies get enforced.  If the kernel is compromised, all security assumptions break.  You need to understand this important resource conceptually to work in this industry- both for defense & offense.  https://linuxjourney.com/lesson/kernel-overview

If you aspire to penetration testing or red teaming, you need to go further and learn about interacting with the kernel.  

Linux programming interface: https://www.amazon.com/Linux-Programming-Interface-System-Handbook/dp/1593272200

Kernel Hacking (as in MIT definition): https://www.kernel.org/doc/html/latest/kernel-hacking/index.html

Kernel Hacking (as in exploitation): https://github.com/xairy/linux-kernel-exploitation

The Snowball Effect

If you’ve gotten this far, you’ve learned about the kernel, the filesystem, permissions, user accounts, and a little hardware. This is a good stopping point: we have a golfball-sized snowball, and you basically know the perimeter of the pitch. In my next sections, we’ll take that snowball to the top of a mountain and give it a nudge: we’ll cover how to make things happen on a Linux system through process management, and we’ll learn how to make our Linux system talk through the power of IP networking. After that, we’ll start covering the topics of a security practitioner: reverse engineering, vulnerability discovery, exploitation and remediation, and eventually how to protect processes, systems, and keys.