The Operations File: A Pattern for Establishing Maintainable Systems

Sometimes, you forget what you built.

Any complex system outgrows an operator’s memory. When you’re building a solution, you get a handful of services running, and while you’re doing active development you can hold the whole picture in your head. You know where the configs live, which ports map to which services, and how to restart the occasionally hanging service. Then one evening something breaks and you realize you can’t remember whether the database runs in a container or as a systemd service. You’re not even sure which log to check first. You’re going to have to probe around for a while, and you’d rather be spending time with your kids.

I started writing operations diaries when I noticed how much time I was spending re-learning my own systems. Every time I needed to do some maintenance on an aging server, I started with twenty minutes of archaeology: I’d do some generic Linux system fingerprinting to retrace how things were wired together, confirm which config controlled what, and eventually rediscover the decisions I had made a year ago.

I don’t do this much anymore. I’ve landed on a pattern that saves me from the archaeology. I build and maintain a structured document (Operations.md) in the home directory of the systems I need to maintain, so it’s easy to get re-acquainted as soon as I log in. Making this a permanent practice also gives agents a way to engage with the server without any preconceived external context. It answers the questions I usually have when I sit down at the terminal:

  • What’s running on this system?
  • How do I check the health of the important components?
  • What do I do when something goes wrong?
  • What has gone wrong before?

The operations file is a living document that evolves with the project. The file has the following sections:

Operations.md
├── Quick Reference (status checks, logs, restarts)
├── Architecture Overview (visual map + port table)
├── Services (Homebrew/systemd, launchd, Docker)
├── Hardware Specifications
├── Disk Layout & Usage
├── Network Configuration
├── Listening Ports
├── Scheduled Tasks
├── Remote Access Setup
├── Troubleshooting Guide
├── Configuration Locations
├── Backup Recommendations
├── Known Issues
└── Changelog

I built a tool that can generate a first draft of an operations file for you. It runs platform-native commands on macOS or Linux, collects real system data, and renders a structured Operations.md based on what the script discovers.

This post walks through the concepts behind my Operations.md file. I describe the layout of the document, which is optimized for rapid troubleshooting, explain the problems each section solves, and describe how to start building and maintaining an Operations.md file for systems you need to maintain.

How Do I Know Nothing Is Broken Right Now?

At the very top of the Operations.md file, I capture one command (or a short loop) that checks all critical services at once. Below that, I capture per-service checks for troubleshooting. Imagine a broad check that identifies a problem: I’ll need to drill into the components that could produce it. Some open-ended troubleshooting is unavoidable, but I always want to start with the broadest check first and the ability to drill into specifics second. If you go after a hunch too soon, you can end up wasting hours on the wrong problem. My operations file starts with guidance on how to quickly collect the state of services on the system.

Services often fail silently. A container can report “running” while the application inside has crashed. A database can be online but unable to accept writes because the disk is full. A scheduled job can fail every execution for weeks, but I won’t notice until I need its output.

I handle these risks with a general health check section: a set of commands we can run in sequence to verify that each component of the system has what it needs to do its job.

A good health check tests the experience a service is supposed to deliver. For a web application, that means hitting an endpoint and confirming a valid response. For a database, that means running a query. For an API, that means making a real request. Checking whether the process is alive tells us something, but not enough. “Is the process delivering its feature?” tells us whether the system is actually working.
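As a sketch of that idea, a broad check might look like the following. The port, the /healthz endpoint, and the psql client are illustrative assumptions, not part of any particular system; substitute whatever your services actually expose.

```shell
#!/bin/sh
# Broad health check sketch. Endpoint, port, and DB client are
# hypothetical placeholders -- swap in your own.

# Web app: test the actual response, not just the process table.
web_status=$(curl -fsS -m 5 http://localhost:8080/healthz >/dev/null 2>&1 \
  && echo OK || echo FAILING)

# Database: "online" is not the same as "accepting queries" --
# run a real round trip (psql shown; use your client of choice).
db_status=$(psql -h localhost -c 'SELECT 1;' >/dev/null 2>&1 \
  && echo OK || echo FAILING)

echo "web: $web_status"
echo "db:  $db_status"
```

Each check collapses to OK or FAILING, so the broad pass reads at a glance before any drilling down begins.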

Next I need a conceptual lay of the land for the services running on the system. My operations file addresses this with an architecture overview near the top. The document lists every service, how it was installed (container, systemd unit, native process), what port it listens on, and what it depends on. A diagram showing how traffic flows through the system works well for understanding relationships. The table and the diagram give two views of the same information, optimized for different questions: what software is on this system, versus which services and ports are listening.

When something breaks, I need to know what else might be affected. When I want to add a new service, I need to know which ports are already taken and which dependencies exist. Without an inventory, I am reasoning about a system I cannot fully describe.

Generally speaking, I always run the same set of discovery commands to figure out the state of the system. I find I don’t retain all of the commands I need for checking systemd, Docker containers, nginx, and other services. To overcome my failing memory, I use a Python script to collect the discrete details of the system, and I use an AI agent to infer the details and rationale behind them. If the agent doesn’t get it right, I correct it and update the documentation.
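As an illustration, the discovery pass boils down to a short list of commands like these; each is guarded so it no-ops where the tool isn’t installed (your mix of service managers will differ):

```shell
#!/bin/sh
# Discovery sketch: enumerate services, containers, and listeners.

# systemd services currently running (Linux)
if command -v systemctl >/dev/null; then
  systemctl list-units --type=service --state=running
fi

# Docker containers and their published ports
if command -v docker >/dev/null; then
  docker ps --format '{{.Names}}\t{{.Ports}}'
fi

# Listening TCP sockets: ss on Linux, lsof on macOS
if command -v ss >/dev/null; then
  ss -tlnp
elif command -v lsof >/dev/null; then
  lsof -iTCP -sTCP:LISTEN
fi

# nginx: dump the effective config, keep the interesting lines
if command -v nginx >/dev/null; then
  nginx -T 2>/dev/null | grep -E 'listen|server_name'
fi
```

The guards matter: the same script can run unchanged on a Mac laptop and a Linux server, which is exactly the property an operations file’s quick-reference commands should have.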

With my current operating practices, I don’t have to keep dragging out my operations diaries to rediscover the state of the system. I can re-apply agents to maintain a self-updating Operations file that gives me the details I need for operating the system and for troubleshooting and tuning its core services.

I recommend writing this section when the system is healthy and there is time to verify each entry. Run through the services, confirm the ports, confirm the install methods, trace the dependency chains. Writing documentation forces you to reach a level of understanding that casual troubleshooting does not. You cannot write down how a service connects to its database without confirming you actually know which database it uses. The act of writing documentation helps you discover surprises.

How Do I Fix Things When It Does Break?

When something goes wrong, we need a way to diagnose the problem and a record of how similar problems were solved before.

I build the troubleshooting section from specific incidents, each following a pattern: symptoms observed, commands run to investigate, root cause found, fix applied. The most valuable entries are the ones written right after spending two hours solving a problem, while the details are still fresh. In the future, we’ll solve the same problem in five minutes.
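A sketch of what one such entry might look like (the symptom, commands, and fix here are invented for illustration):

```
### Symptom: web app returns 502 immediately after reboot

- Observed: nginx running, but "connection refused" to upstream in error.log
- Investigated: systemctl status app; journalctl -u app -b
- Root cause: app unit started before the database was ready
- Fix: added After=postgresql.service to the app's unit file
- Wrong turn: spent 30 minutes auditing the nginx config; it was fine
```

Note that the entry is headed by the symptom, not the cause, and records the wrong turn alongside the fix.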

I try to organize this section by symptom rather than by cause. When something breaks, I know what I am seeing (the service is unresponsive, the page returns an error, the log is full of a particular message). I don’t yet know why. Entries organized by symptom let me match what I observe to a known pattern without already knowing which components to check.

I try to capture the wrong turns too. If I spend thirty minutes investigating a configuration issue before discovering that the real problem was an upstream provider outage, that sequence belongs in the entry. Next time I see the same symptoms, I check the upstream provider first and save myself from unnecessary detours.

How Do I Know Something Is About to Break?

The operations file addresses potential failures with a section on what to watch and how to check it. For each item, we document the command to check current status.
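For example, a disk-usage watch item might pair a one-line check with a threshold; the 80% cutoff below is an arbitrary illustration, not a recommendation:

```shell
# Flag any filesystem over 80% full; tune the threshold per system.
df -P | awk 'NR > 1 { use = $5 + 0; if (use > 80) print $6 " is at " use "% -- investigate" }'
```

Because the command and its threshold live in the file together, the next operator (human or agent) knows both what to run and what counts as a warning sign.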

Known Issues, Constraints, and Workarounds

Every system has compromises. A partition that turned out to be too small. A driver that behaves unpredictably under certain conditions. I capture these in a Known Issues section.

Keeping It Alive: The Operations File as a Living Document

An operations file written once and never updated becomes a liability that provides false confidence. We trust it, act on stale information that no longer reflects the system, and make things worse.

I update the file during all maintenance. When I add a service, I add it to the inventory before moving on to the next task. When I solve a problem, I write the troubleshooting entry while the details are fresh. When I discover a new constraint, I add it to known issues immediately. The file grows with the system instead of drifting away from it. I leverage coding agents both to troubleshoot in new ways and to update the documentation.

A changelog section anchors the document. Every significant change gets a dated entry describing what changed, why it changed, and what I learned in the process. The emphasis belongs on reasoning, because the reasoning behind a decision decays fastest. Capturing the reasoning at the time of the decision makes tech debt manageable.
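An entry can be as short as this (contents invented for illustration); the “why” line is the part that pays off later:

```
## 2024-03-12 -- Moved app behind nginx reverse proxy
- What: app now listens on 127.0.0.1:8080; nginx terminates TLS on 443
- Why: needed TLS, and wanted a place to add rate limiting later
- Learned: the app ignores X-Forwarded-For unless proxy mode is enabled
```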

Over time, this section becomes institutional memory. When I encounter a configuration I do not recognize, the changelog tells me when it was added and what problem it was solving. That context is the difference between “I should not touch this because I do not understand it” and “I understand why this is here and whether it still applies.”

Getting Started: What to Write Down First

If this all sounds like a lot, don’t panic. The scope of a complete operations file can feel paralyzing when you are starting from nothing, so don’t start with a target of completeness. Start from what you already know and build it over time. Also, don’t start from scratch: the tool at the bottom of this post can do your initial discovery for you.

Speedy navigation of expanding Markdown files

As the Operations file grows, navigation gets harder. You can use a tool like mindmap-cli to generate a mind map from a markdown file. A mind map generated from the markdown source gives you a bird’s-eye view of the whole structure, collapsible and clickable, without maintaining a separate document. The markdown file itself is for depth; the mind map is for orientation. Together they let someone operate the system without reading a thousand lines of operations manual.

Build an Operations.md file for Your System

The tool I mentioned at the top, Operations Discovery Mechanism, automates the hardest part of getting started: the initial inventory. It collects real data from the system using platform-native commands (brew services, launchctl, and lsof on macOS; systemctl, journalctl, and ss on Linux), structures it as JSON, and renders a complete Operations.md with copy-paste-ready commands for every service it finds.

Create an Operations directory in the home directory of your target system, then clone the repository into it.

gh repo clone CaptainMcCrank/OperationsDiscoveryMechanism

The workflow takes two steps. First, run the collector script to produce a JSON snapshot of your system. Second, run the renderer to turn that snapshot into a structured markdown document. There are no dependencies beyond Python 3.9 and the standard library.

# macOS
python3 mac_system_info.py -o system_info.json
python3 generate_operations.py -o Operations.md

# Linux
python3 linux_system_info.py -o system_info.json
python3 generate_operations.py -o Operations.md

For richer documentation, feed the collected JSON to Claude Code with the prompt template embedded in the README. Claude can infer relationships between services, add descriptions of what each one does, and flag potential issues that the script collects but cannot interpret.

The Operations.md file covers the architecture overview, the service inventory, the quick-reference commands, the port map, and the disk layout. Over time, if you’re disciplined about updating the troubleshooting entries, known issues, and changelog, the document becomes extremely powerful. You’ll be able to hop back onto old systems with ease, and you’ll be in a position to start doing interesting experiments with agentic operations: agents that log into the system can acquire context without spending tokens on system enumeration.

Security Apprenticeship

“What do I need to know how to do in order to pursue a career in security?” Good news! I just happen to have drafted this roadmap of reading material just for you!

In my previous writeup, I provided a roadmap of reading material for anyone interested in developing a career in security. The material covered concepts that should be understood by anyone working in the field, but it didn’t cover how to “do” security. This article summarizes topics and activities that are well understood by anyone doing security work professionally. If you develop mastery of these ideas, you’ll be approaching a point where you can start doing meaningful security work. Unfortunately, you won’t be done reading after you’ve finished this writeup; I have at least two other guides in draft that will help guide you on your path toward a career in security. This guide helps you establish your foundational knowledge of techniques for restricting access to a system.

Walking The Security Practitioners Path

If we distill security to its most fundamental concept, security is about the controls that ensure activities on a system are authorized.

We can talk about these ideas abstractly, but unless you actually implement them on a working system, your knowledge won’t have much application. The best way to learn them is by using Linux.

Linux gives you plenty of experience with programs that don’t work because of security controls. You’ll have to learn how to debug those problems to get the processes working, and eventually you’ll develop a taste for the correct way to implement those controls.

This next phase of your education covers how to bring up a Linux system, implement interesting programs like web servers, and use some basic capabilities for implementing access controls. My goal in writing this is to provide pointers to well-written guides that will help you learn to start “doing” security, which is to say, activating controls that protect a system. In my opinion, mastering these topics should be mandatory; however, I should highlight that I know many people in the security industry today who don’t have mastery in all of these topics. There are areas here where I’m not as strong as I’d like to be, as well. Everyone who is good at security struggles with the idea that they need to do more reading when time permits.

Learning Linux

Part 1: Well Begun is Half Done. Dip Your Toes into Linux.

Linux: TL;DR:  https://linuxjourney.com/

Learn what Linux is and its history.  You’ll want to learn about the basic purpose of Linux if you have no experience with the operating system.  Some day you’ll need to learn about important related concepts like POSIX (https://en.wikipedia.org/wiki/POSIX) and ANSI C (https://en.wikipedia.org/wiki/ANSI_C).  For now, let’s just learn some history: https://linuxjourney.com/lesson/linux-history

Learn about the Shell (aka the command line) on Linux.  Learning the shell is the computer science equivalent of learning to walk or ride a bike.  Have an operating system or applications you want to interact with?  Generally you will be working within a shell.  A shell is ‘just’ a program that lets you use a keyboard to interact with the operating system.  Some day you will need to write shell scripts to automate tedious tasks or filter large amounts of data; a script is a series of commands that execute in some order based on whatever conditions you choose to define.  You’ll learn about “environment variables” like PATH and PYTHONPATH, variables whose values you’ll modify to make new programs or libraries accessible from any filesystem location while you’re in the shell.  The shell is your foundation for working with a Linux machine.  https://linuxjourney.com/lesson/the-shell. Alternatively, this tutorial teaches you how to get around the shell.
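A minimal sketch of the PATH and PYTHONPATH idea (the `$HOME/bin` and `$HOME/mylibs` directories are just example locations):

```shell
#!/bin/sh
# PATH is a colon-separated list of directories the shell searches,
# left to right, when you type a command name.
export PATH="$HOME/bin:$PATH"   # programs in ~/bin now win the search

# PYTHONPATH does the same job for Python module imports.
export PYTHONPATH="$HOME/mylibs${PYTHONPATH:+:$PYTHONPATH}"

# Inspect the search order, one directory per line.
echo "$PATH" | tr ':' '\n'
```

Prepending (rather than appending) is the common choice because the first match wins; that ordering detail is exactly the kind of thing shell debugging teaches you.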

Learn about User Accounts & Groups.  This helps you understand the foundational framework of access controls on computers.  When you need to protect data from access by an unauthorized party, you’ll need to use account and group management concepts.  Learning about accounts and groups will help you understand important concepts like what the “root” account’s purpose is, what the “wheel,” “www-data,” and “nobody” groups are for, and more.  As you develop your security skills you’ll eventually need to learn how hackers elevate privilege.  Accounts and groups will give you a foundation for when we are ready to learn about “privilege escalation.”  https://linuxjourney.com/lesson/users-and-groups
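You can start exploring these concepts immediately with a few read-only commands:

```shell
#!/bin/sh
# Who am I, and which groups grant me access?
id          # uid, primary gid, and supplementary groups
id -un      # just the username

# root is special because it is uid 0 -- verify:
id -u root

# List a few of the groups defined on this system.
if command -v getent >/dev/null; then
  getent group | head -5
else
  head -5 /etc/group
fi
```

Nothing here changes the system; it just makes the account/group model concrete before you start modifying it.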

Learn about File Permissions.  When you are ready to actually protect data from unauthorized access, you’ll have to read and evaluate the correctness of filesystem permissions.  If you have a file on a system with multiple users, you need to learn how to control who can access it and define the level of access (read, write, execute, SUID, etc.).  Again: some day you will be responsible for helping make sure sensitive files are only accessible by the correct people, and that those people have the right level of access.  You need to learn file permissions to perform this task.  One way to know if you’ve mastered this topic is to test whether you can propose accurate access control permissions on a web directory.  Make one version that’s dangerous, and one that’s safe.  Be able to explain why the dangerous one is dangerous, and how it could be exploited.  Start your journey here:  https://linuxjourney.com/lesson/file-permissions
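Here’s a worked sketch of that exercise in a scratch directory; the assumption that the web server reads files via its group (e.g. www-data) is illustrative:

```shell
#!/bin/sh
# Scratch web directory for the permissions exercise.
mkdir -p /tmp/webroot
cd /tmp/webroot
touch index.html config.php

# Dangerous: 777 means ANY local user (or any compromised process)
# can overwrite index.html -- e.g. to inject malicious content.
chmod 777 index.html

# Safer: owner read/write, group (the web server) read-only,
# everyone else no access. Secrets in config.php stay private.
chmod 640 config.php

ls -l   # compare: -rwxrwxrwx vs -rw-r-----
```

Being able to articulate *why* 777 on a web root is exploitable, and what 640 plus correct group ownership buys you, is the mastery test the lesson above is driving at.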

Learn about the “Filesystem.”  A file system is the logical structure where you store “files.”  An operating system without the ability to process files is of limited utility.  In the Windows world, we have drives with names and folders.  In the macOS world, we have drives with names and folders.  In iOS and Android, there is a file system, but its implementation is not as central to the user experience.  In Linux, deeply understanding what the directories under “/” are for is critically important.  There are “files” in the /proc directory that can tell you important statistics about system performance.  Configuring servers that you’ll install will require an understanding of /etc, /var, and other directories.  If a server is under DDoS attack, you’ll need to understand information about the number of network connections the system is currently supporting.  You can indirectly use tooling like ifconfig to gather system performance information, or you could just do an ls against /proc/net/dev.  You’ll also need to learn about read-only filesystems like SquashFS (https://tldp.org/HOWTO/SquashFS-HOWTO/whatis.html). Someday, you’ll need to figure out what actually happens when you type “ls” into the shell and it somehow enumerates the files in your current directory. If you type a random command in, it doesn’t work. Why? How did the OS know which version of “ls” to run? Where does this “ls” binary live, anyway? Learning file systems is mandatory.  https://linuxjourney.com/lesson/filesystem-hierarchy
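A small taste of the above, safe to run read-only (the /proc read only exists on Linux, hence the guard):

```shell
#!/bin/sh
# /proc is a virtual filesystem: kernel state exposed as readable files.
if [ -r /proc/net/dev ]; then
  head -3 /proc/net/dev   # per-interface network counters, no ifconfig needed
fi

# Answering "where does the ls I type actually live?"
command -v ls   # the path the shell resolved
echo "$PATH"    # the directories it searched, in order
```

When a typed command “doesn’t work,” the answer is almost always in that last pair: either the binary isn’t in any $PATH directory, or a different binary earlier in the search order is shadowing the one you expected.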

Learn about /dev.  An extremely important design philosophy in Unix is that “everything is a file.”  Hardware is directly accessible through the file system, represented as file system objects.  Learning about the /dev directory will give you important insight into how devices work on the system, which at some point you may want to tamper with if you aspire to be a hardware hacker.  https://linuxjourney.com/lesson/dev-directory
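You can see the philosophy directly with a few harmless pokes:

```shell
#!/bin/sh
ls -l /dev/null   # the leading 'c' marks a character device, not a regular file

# Reading a device file works like reading any other file:
head -c 8 /dev/urandom | od -An -tx1   # eight random bytes, hex-dumped

echo "discarded" > /dev/null           # writes to the bit bucket vanish
```

The same open/read/write interface that serves text files serves the kernel’s random number generator and the null device; that uniformity is what “everything is a file” means in practice.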

Learn about the Kernel.  The kernel is the mechanism for controlling hardware.  It is where security policies get enforced.  If the kernel is compromised, all security assumptions break.  You need to understand this important resource conceptually to work in this industry- both for defense & offense.  https://linuxjourney.com/lesson/kernel-overview

If you aspire to penetration testing or red teaming, you need to go further and learn about interacting with the kernel.  

Linux programming interface: https://www.amazon.com/Linux-Programming-Interface-System-Handbook/dp/1593272200

Kernel Hacking (as in MIT definition): https://www.kernel.org/doc/html/latest/kernel-hacking/index.html

Kernel Hacking (as in exploitation): https://github.com/xairy/linux-kernel-exploitation

The Snowball Effect

If you’ve gotten this far, you’ve learned about the kernel, the filesystem, permissions, user accounts, and a little hardware. This is a good stopping point: we have a golfball-sized snowball, and you basically know the perimeter of the pitch. In my next sections, we’ll take that snowball to the top of a mountain and give it a nudge: we’ll cover how to make things happen on a Linux system through process management, and we’ll learn how to make our Linux system talk through the power of IP networking. After that, we’ll start covering the topics of a security practitioner: reverse engineering, vulnerability discovery, exploitation and remediation, and eventually how to protect processes, systems, and keys.