The Operations File: A Pattern for Establishing Maintainable Systems

Sometimes, you forget what you built.

Any complex system outgrows an operator’s memory. When you’re building a solution, you get a handful of services running, and while you’re in the thick of development you can hold the whole picture in your head. You know where the configs live, which ports map to which services, and how to restart the occasionally hanging service. Then one evening something breaks and you realize you can’t remember whether the database runs in a container or as a systemd service. You’re not even sure which log to check first. You’re going to have to probe around for a while, and you’d rather be spending that time with your kids.

I started writing operations diaries when I noticed how much time I was spending re-learning my own systems. Every time I needed to do some maintenance on an aging server, I started with twenty minutes of archaeology: I’d do some generic Linux system fingerprinting to retrace how things were wired together, confirm which config controlled what, and eventually rediscover the decisions I had made a year ago.

I don’t do this much anymore. I’ve landed on a pattern that saves me from the archaeology. I build and maintain a structured document, Operations.md, in the home directory of every system I need to maintain, so it’s easy to get re-acquainted as soon as I log in. Making this a permanent practice also gives agents a way to engage with the server without any preconceived external context. It answers the questions I usually have when I sit down at the terminal:

  • What’s running on this system?
  • How do I check the health of the important components?
  • What do I do when something goes wrong?
  • What has gone wrong before?

The operations file is a living document that evolves with the project. The file has the following sections:

Operations.md
├── Quick Reference (status checks, logs, restarts)
├── Architecture Overview (visual map + port table)
├── Services (Homebrew/systemd, launchd, Docker)
├── Hardware Specifications
├── Disk Layout & Usage
├── Network Configuration
├── Listening Ports
├── Scheduled Tasks
├── Remote Access Setup
├── Troubleshooting Guide
├── Configuration Locations
├── Backup Recommendations
├── Known Issues
└── Changelog

I built a tool that can generate a first draft of an operations file for you. It runs platform-native commands on macOS or Linux, collects real system data, and renders a structured Operations.md based on what the script discovers.

This post walks through the concepts behind my Operations.md file. I describe the layout of the document, which is optimized for rapid troubleshooting, explain the problems each section solves, and show how to start building and maintaining an Operations.md file for the systems you need to maintain.

How Do I Know Nothing Is Broken Right Now?

At the very top of the Operations.md file, I capture one command (or a short loop) that checks all critical services at once. Below that, I capture per-service checks for troubleshooting. Imagine a broad check that identifies a problem: I’ll need to drill into the components that could produce it, and do some open-ended troubleshooting. But I always want to start with the broadest check first and the ability to drill into specifics second. If you chase a hunch too soon, you can waste hours on the wrong problem. My operations file starts with guidance on how to quickly collect the state of services on the system.

Services often fail silently. A container can report “running” while the application inside has crashed. A database can be online but unable to accept writes because the disk is full. A scheduled job can fail every execution for weeks, but I won’t notice till I need its output.

I handle these risks with a general health-check section: a set of commands I can run in sequence to verify that each component of the system has what it needs to do its job.

A good health check tests what the service is supposed to deliver. For a web application, that means hitting an endpoint and confirming a valid response. For a database, that means running a query. For an API, that means making a real request. Checking whether the process is alive tells us something, but not enough. “Is the process delivering its feature?” tells us whether the system is actually working.
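As a sketch of this idea, here is a minimal pair of checks in Python, using only the standard library. The hosts, ports, and URLs you pass in are your own; nothing here is specific to my tooling:

```python
import socket
import urllib.request


def tcp_check(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if something is accepting connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_check(url: str, timeout: float = 2.0) -> bool:
    """Return True only if the endpoint answers with a 2xx response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:
        return False
```

tcp_check answers the weak question, “is anything listening?”; http_check answers the stronger one, “is the service delivering a valid response?”. The quick-reference section is the place to record which of these each service deserves.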

Next I need a conceptual lay of the land for the services running on the system. My operations file addresses this with an architecture overview near the top. The document lists every service, how it was installed (container, systemd unit, native process), what port it listens on, and what it depends on. A diagram showing how traffic flows through the system helps with understanding relationships. Together they give two views of the same information, optimized for different questions: what software is on this system, versus which services and ports are listening on it.
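For the port table, a compact markdown table is enough. The services below are invented examples, not a prescription:

```markdown
| Port | Service    | Install method | Depends on |
|------|------------|----------------|------------|
| 80   | nginx      | apt package    | app (8080) |
| 8080 | app server | Docker         | postgres   |
| 5432 | postgres   | systemd unit   | disk /data |
```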

When something breaks, I need to know what else might be affected. When I want to add a new service, I need to know which ports are already taken and which dependencies exist. Without an inventory, I am reasoning about a system I cannot fully describe.

Generally speaking, I always run the same set of discovery commands to figure out the state of the system. I don’t retain all of the commands I need for checking systemd, Docker containers, nginx, and other services. To overcome my failing memory, I use a Python script to collect the discrete details of the system, and I use an AI agent to infer the details and rationale of the system. If the agent doesn’t get it right, I correct it and update the documentation.
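A sketch of that collection step, assuming a handful of common discovery commands (your set will differ, and this is not the actual script from my repository; anything not installed on the host is simply skipped):

```python
import json
import shutil
import subprocess

# Candidate discovery commands; tools that aren't installed are skipped.
DISCOVERY = {
    "uname": ["uname", "-a"],
    "systemctl": ["systemctl", "list-units", "--type=service", "--no-pager"],
    "docker": ["docker", "ps", "--format", "{{.Names}}\t{{.Status}}"],
    "ss": ["ss", "-tlnp"],
}


def collect() -> dict:
    """Run each available discovery command and capture its stdout."""
    snapshot = {}
    for name, cmd in DISCOVERY.items():
        if shutil.which(cmd[0]) is None:
            continue  # tool not present on this host
        result = subprocess.run(cmd, capture_output=True, text=True)
        snapshot[name] = result.stdout
    return snapshot


if __name__ == "__main__":
    print(json.dumps(collect(), indent=2))
```

The JSON snapshot is what I hand to the agent; the raw command output carries the facts, and the agent supplies the narrative.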

With my current operating practices, I don’t have to keep dragging out my operations diaries to rediscover the state of the system. I can re-apply agents to maintain a self-updating operations file that gives me the details I need for operating the system and for troubleshooting and tuning its core services.

I recommend writing this section when the system is healthy and there is time to verify each entry. Run through the services, confirm the ports, confirm the install methods, trace the dependency chains. Writing documentation forces you to obtain a level of understanding that casual troubleshooting does not. You cannot write down how a service connects to its database without confirming you actually know which database it uses. The act of writing documentation helps you discover surprises.

How Do I Fix Things When It Does Break?

When something goes wrong, we need a way to diagnose the problem and a record of how similar problems were solved before.

I build the troubleshooting section from specific incidents, each following a pattern: symptoms observed, commands run to investigate, root cause found, fix applied. The most valuable entries are the ones written right after spending two hours solving a problem, while the details are still fresh. In the future, we’ll solve the same problem in five minutes.
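A shape that works for these entries, with an invented incident as illustration:

```markdown
### Symptom: web app returns 502 after reboot

- **Observed:** nginx is up, but the upstream returns connection refused
- **Investigated:** `systemctl status app`, `journalctl -u app -n 50`
- **Root cause:** app unit was never enabled, so it didn't start on boot
- **Fix:** `systemctl enable --now app`
```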

I try to organize this section by symptom rather than by cause. When something breaks, I know what I am seeing (the service is unresponsive, the page returns an error, the log is full of a particular message). I don’t yet know why. Entries organized by symptom let me match what I observe to a known pattern without knowing in advance which components to check.

I try to capture the wrong turns too. If I spend thirty minutes investigating a configuration issue before discovering that the real problem was an upstream provider outage, that sequence belongs in the entry. Next time I see the same symptoms, I check the upstream provider first and save myself from unnecessary detours.

How Do I Know Something Is About to Break?

The operations file addresses potential failures with a section on what to watch and how to check it. For each item, we document the command to check current status.
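Disk usage is the classic slow failure, so it makes a good first entry. A minimal check might look like the following; the 90% threshold is an arbitrary example, not a recommendation:

```python
import shutil


def disk_used_pct(path: str = "/") -> float:
    """Return the percentage of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


if __name__ == "__main__":
    pct = disk_used_pct("/")
    status = "WARN" if pct >= 90.0 else "OK"  # example threshold
    print(f"{status}: / is {pct:.1f}% full")
```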

Known Issues, Constraints, and Workarounds

Every system has compromises. A partition that turned out to be too small. A driver that behaves unpredictably under certain conditions. I capture these in a Known Issues section.

Keeping It Alive: The Operations File as a Living Document

An operations file written once and never updated becomes a liability that provides false confidence. We trust it, act on stale information that no longer reflects the system, and make things worse.

I update the file during all maintenance. When I add a service, I add it to the inventory before moving on to the next task. When I solve a problem, I write the troubleshooting entry while the details are fresh. When I discover a new constraint, I add it to known issues immediately. The file grows with the system instead of drifting away from it. I leverage coding agents to troubleshoot in new ways, and I use them to update the documentation as well.

A changelog section anchors the document. Every significant change gets a dated entry describing what changed, why it changed, and what I learned in the process. The emphasis belongs on reasoning, because the reasoning behind a decision decays fastest. Capturing the time of the decision makes tech debt manageable.
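An entry can stay short as long as it captures the why. This one is an invented example:

```markdown
## 2024-03-15: Moved postgres data to /data

- **What:** relocated the data directory off the root partition
- **Why:** root partition hit 85% and alerts kept firing
- **Learned:** the systemd unit needed its environment path updated too
```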

Over time, this section becomes institutional memory. When I encounter a configuration I do not recognize, the changelog tells me when it was added and what problem it was solving. That context is the difference between “I should not touch this because I do not understand it” and “I understand why this is here and whether it still applies.”

Getting Started: What to Write Down First

If this all sounds like a lot, don’t panic. The scope of a complete operations file can feel paralyzing when you are starting from nothing, so don’t start with a target of completeness. Start from what you already know and build it over time. And don’t start from scratch: there is a tool at the bottom of this post that you can use to do your own initial discovery.

Speedy navigation of expanding Markdown files

As the operations file grows, navigation gets harder. You can use a tool like mindmap-cli to generate a mind map from a markdown file. A mind map generated from the markdown source gives you a bird’s-eye view of the whole structure, collapsible and clickable, without maintaining a separate document. The markdown file itself is for depth; the mind map is for orientation. Together they let someone operate the system without reading a thousand lines of operations manuals.

Build an Operations.md file for Your System

The tool I mentioned at the top, Operations Discovery Mechanism, automates the hardest part of getting started: the initial inventory. It collects real data from the system using platform-native commands (brew services, launchctl, and lsof on macOS; systemctl, journalctl, and ss on Linux), structures it as JSON, and renders a complete Operations.md with copy-paste-ready commands for every service it finds.

Create an Operations directory in the home directory of your target system, then clone the repository into it.

gh repo clone CaptainMcCrank/OperationsDiscoveryMechanism

The workflow takes two steps. First, run the collector script to produce a JSON snapshot of your system. Second, run the renderer to turn that snapshot into a structured markdown document. There are no dependencies beyond Python 3.9 and the standard library.

# macOS
python3 mac_system_info.py -o system_info.json
python3 generate_operations.py -o Operations.md

# Linux
python3 linux_system_info.py -o system_info.json
python3 generate_operations.py -o Operations.md

For richer documentation, feed the collected JSON to Claude Code with the prompt template embedded in the Readme. Claude can infer relationships between services, add descriptions of what each one does, and flag potential issues that the script collects but cannot interpret.

The generated Operations.md file covers the architecture overview, the service inventory, the quick-reference commands, the port map, and the disk layout. Over time, if you’re disciplined about updating the troubleshooting entries, known issues, and changelog, the document becomes extremely powerful. You’ll be able to hop back onto old systems with ease, and you’ll be in a position to start doing interesting experiments with agentic operations: agents logging into a system can acquire context without spending tokens on system enumeration.