Inviting two AI agents to generate a collaboration protocol

Have you ever spent hours of work in a Claude session, realized you couldn’t use that same session for the next phase of the work, and still wanted to preserve the context? What if you could split the task across two agents — without having designed that capability into a preprompt up front?

I want to walk you through a small piece of social engineering between me and two Claude Code agents. It came in the form of a collaboration protocol defined in a markdown file. The file itself matters less than how it got written: the agents helped me write it while they worked on their role-specific tasks.

Here’s how it came together.

Context was running out!

I could see I was running myself into a corner with the complexity of the prompts I was providing.

I was rewriting the opening of a workshop deck I’ve been building. (It’s an impress.js presentation, the kind where slides float around a 2D canvas instead of just paging.) The old opener had a “what if your terminal could think?” hook. I wanted a different opener, one that named a real thesis: prompts are an asset, and recording and iterating them builds a personal library that compounds over time.

Impress.js is beautiful. My workshop content is hard to understand in a single shot. Using a beautiful presentation framework gave me an opportunity to present ideas in a way that would keep the reader engaged.

My objective in the session was to brainstorm different approaches for expressing the ideas presented in the draft of my presentation.

I had lots of bad “first-draft” clips of material in the presentation. I’ve discovered they were bad through occasionally humiliating practice with people who were nice enough to be a test audience. There were some bruised slides that needed to be renovated.

This copy-drafting activity naturally consumed a lot of context. I was asking the LLM to generate 4-5 different versions of the same content. Then I’d move on to other sections of the presentation and repeat. I’d spend most of an afternoon brainstorming language with Claude Code. By the end, the model’s context was full of voice coaching: tone rules, banned phrasings, half a dozen near-final drafts. Rinse and repeat: my gas tank of available context was rapidly trending toward empty.

I was going to start nudging the agent to edit the presentation HTML — but my /context size suggested I wasn’t going to get very far. The agent would need to compact before much longer, and the session summary it generated would lose important rules. It would end up like a screenshot of a gif saved as a jpeg. A mutant.

You can probably see where this is going. The same agent that had spent four hours iterating on word choice was about to make numerous precise edits to a 2,000-line HTML file. Editing source is a different kind of work than picking metaphors. An agent doing code manipulation needs a lot of free context and extremely pithy context statements. The goal for the agent should be exclusive: place the chosen content in the correct slides in ways that wouldn’t break the on-screen presentation. But here I was: I needed to switch to a task with a very different memory management workflow.

What can we do about a foreseeable failure? How can we preserve the brainstorm context?

The states I was trying to cultivate

I find it helps to describe the end states you want to create and the bad end states you’d like to prevent.

States I wanted to produce:

  • A clean context window for the editing work
  • The brainstorming context is preserved
  • A backing repository that was never mid-drift — never in a state where a slide had been updated but the directive content creation docs were stale. We needed to update the content creation docs with every edit.
  • An auditable decision log explaining why each change happened

States I wanted to avoid:

  • A saturated agent making source edits
  • Brainstorming work lost the moment I closed the chat
  • Edits to a slide before the syllabus and teacher’s guide were updated to match the content
  • A change in the repo with no recorded reason. Future-me does not enjoy reconstructing intent by inspecting diffs.

Almost everything in these lists is about context. What if we could have multiple agents leveraging shared context?

What you’d actually need to try this

If you wanted to reproduce this experience, you’d need four things:

  1. Roles: Each agent announces what it is on its first message. (“Acting as Drafter.” “Acting as Editor.”)
  2. Role-specific turf, split by directory. Each agent gets its own directory. We define “turf” and specify that crossing turf without permission is a violation. Each role definition states what behavior is approved and what is disallowed: good agents write only in their own directory; bad agents write in another agent’s directory. We tell the agent, “don’t be a bad agent!” (Evolved developers will use sandboxing.) Turf is the mechanism that makes the agents distinct.
  3. A unit of handoff. The agents produce one markdown file per atomic edit, with a known structure: what the current text looks like, what we want to change it to, and why. The exact structure isn’t significant. What matters is that handoffs are small, named, and reviewable.
  4. A human bus. The two agents can’t talk to each other. The human is the message bus. You paste a one-line “HANDOFF: 3 drafts proposed” from one agent into the other.
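The protocol itself says the structure isn’t significant, but to make the unit of handoff concrete, a minimal draft file might look like this. The filename, fields, and content are illustrative, not the actual format:

```markdown
# .drafts/slide-01-opener-library-thesis.md   (illustrative name and fields)
status: proposed          # proposed | approved | applied | blocked
target: presentation/index.html
current: >
  "What if your terminal could think?"
proposed: >
  "Prompts are an asset. Recording and iterating them builds a
  personal library that compounds over time."
rationale: The old hook was a gimmick; the new opener states the thesis.
applied_commit:           # filled in by the Editor after the change lands
```

Anything with this shape works: one atomic edit, a status field the lifecycle can act on, and enough context for a reviewer to approve or reject it at a glance.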

Obviously, if I had come up with this idea in advance of the session, I could have proposed a shared file system location for managing the handoffs. This example is a summary of how to back yourself out of a corner when context has gotten low.

The lifecycle, the verification rules, and the conflict resolution all grew out of these four primitives as I used them. I didn’t design them up front. They were a mushroom growing on some scaffolding.

Agents don’t self-approve

There are five possible artifact states in this newly developed collaboration protocol:

  • Handoff
  • Proposed
  • Approved
  • Applied
  • Blocked

I’m the Operator. I am the man-in-the-middle! Neither agent can mark its own work as done. The Drafter Agent writes a proposal and sets its status to proposed. Only the operator can move a proposal to approved. The Editor Agent will not touch a draft that isn’t approved. After the Editor agent applies it, it sets the status to applied and records the commit hash that landed the change.

There’s a fifth status, blocked. It means the Editor agent opened a draft, looked at the live source file, and noticed the source had shifted since the Drafter agent wrote the proposal. It could be that a different draft already changed adjacent text. Rather than apply something that no longer fits, the Editor agent sets the status back to blocked with a note. The Drafter agent has to re-read the file and revise.

The Drafter agent discovers that its work is now stale, but it doesn’t decide whether to ship. The operator will need to intervene.
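The approval rules are easy to state in code. This is a hypothetical sketch, not anything the agents ran; it just encodes “only the operator approves, and the agent that discovers drift doesn’t decide whether to ship”:

```python
# Hypothetical sketch of the draft status lifecycle; the real protocol
# lived in a markdown file, not code.
ALLOWED_TRANSITIONS = {
    ("drafter", None, "proposed"),        # Drafter writes a new proposal
    ("operator", "proposed", "approved"), # only the Operator approves
    ("editor", "approved", "applied"),    # Editor applies approved drafts
    ("editor", "approved", "blocked"),    # ...or kicks them back on drift
    ("drafter", "blocked", "proposed"),   # Drafter revises and re-proposes
}

def transition(actor, current, new):
    """Return the new status, refusing moves the protocol doesn't allow."""
    if (actor, current, new) not in ALLOWED_TRANSITIONS:
        raise PermissionError(f"{actor} may not move {current!r} -> {new!r}")
    return new
```

The useful property is what’s absent: there is no `("drafter", "proposed", "approved")` entry, so an agent marking its own work done is a protocol violation by construction.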

The agents helped me write the protocol

I didn’t have the protocol written before the session. I drafted a first version defining the concepts of turf, the draft format, and the draft status lifecycle, and put it in a file called v1-edit-protocol.md. Both agents read it at the start of every turn. As I used it, the protocol evolved.

The Editor agent, on its third or fourth applied draft, noticed it was hitting an ambiguous case: a draft of proposed content would result in a layout change to a slide. The Editor agent had no way to tell whether the layout would actually look right when rendered.

The Editor Agent observed, “I can’t verify this without rendering the deck. Do you want me to flag layout-changing drafts in pre-review?” 

That interaction with the operator resulted in the creation of the visual-layout-verification rule, which is now enshrined in the protocol file.

The Drafter agent noticed that drift in its source documents wasn’t being reliably discovered. I was at fault: sometimes I’d remember to check whether a slide change needed a corresponding update in the syllabus, and sometimes I’d forget. It proposed a stricter formulation: every draft has to declare what it searched for and discovered in each source document during an edit. A different result than in previous versions implied that an outside party (likely the operator) had changed the source materials.

I carried each of these rule proposals between agents. The other agent would read the new rule, push back if it didn’t make sense, and we’d land on something both agents could follow. Then I’d update the protocol file, and both agents would pick up the new version on their next read.

You can dynamically give multiple agents a mechanism to interact, and the interaction mechanism can be evolved by agents.

You might not need a multi-agent framework, a router, or an orchestrator, or a fancy preprompt that anticipates every situation.

You might be able to get away with a shared markdown file, a turf table, and a willingness to let the agents discover where the contract needs to grow.

Use Case: Rendering & Observation

Here’s the layout-verification rule the Editor proposed, in the form it eventually took.

When a draft changes anything about how a slide is positioned, the Editor agent has to render the deck, click through the affected region, and write a one-sentence observation into the draft’s rationale. Something like: “Verified rendering at the local server. The 500-unit gap between row 1 and row 2 produced visible overlap. Adjusted to a 1,000-unit gap and reverified.”

This observation exists because impress.js layouts can fail visually in ways that can’t be caught by agent-driven tests. For example, two slides can technically be at valid coordinates and still overlap on screen. I’ve only been able to discover these bugs by being a human reviewer in the loop. Writing tests for validating human usability of a canvas UI element is Very Hard.

The Editor agent didn’t try to solve the layout problem with the observation alone. It recognized that a known layout failure condition had been identified, flagged the gap to me, and we collaborated to co-write a rule that pushed verification to the Editor agent. I hadn’t defined a way to embed this validation activity into the Editor agent. The Editor agent discovered it needed to be able to check its own work, and since it can’t change its own rules, it worked with me to create new role definitions that closed the gap. The protocol grew exactly where it needed to.

When two agents want the same file

Conflicts are rare because the turf table prevents most of them. The agents only write in their own directories. For the cases the table doesn’t cover:

  • Editor wins for presentation/ (the source files)
  • Drafter wins for .drafts/ (proposals)
  • Anything else is mine to resolve

If I edit a file directly outside the protocol — patch a typo, fix a broken link, whatever — I announce it with a HANDOFF line so both agents re-read the file before their next operation.

What this looks like on disk

If you cloned the repo right now and poked at it, you’d see the artifacts of the protocol:

# All the proposals, one per atomic edit
$ ls .drafts/ | wc -l
67
# Every Editor commit that landed a change
$ git log --oneline --grep "^content:"
1090b0f content: rewrite Topic 9 Explain to use relative symlink pattern
a48bb30 content: rewrite Topic 1 Tell to lead with library thesis
# Every Drafter commit (proposals, never source files)
$ git log --oneline --grep "^drafts:"
# Anything currently kicked back to the Drafter
$ grep -l "status: blocked" .drafts/*.md
# Trace any applied draft to the commit that landed it
$ grep -l "applied_commit: 1090b0f" .drafts/*.md
.drafts/slide-09-explain-symlink-rewrite.md

Each draft links to its git commit hash. Each commit links back to its draft in the message body. The commit history becomes a ledger of decisions, with rationale.
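Because every applied draft records its commit hash, the ledger is scriptable. A small sketch, assuming the `applied_commit:` line format shown in the grep example above:

```python
import re
from pathlib import Path

def drafts_by_commit(drafts_dir):
    """Map each applied draft filename to the commit hash recorded inside it.

    Assumes drafts contain a line like "applied_commit: <hash>".
    """
    mapping = {}
    for path in Path(drafts_dir).glob("*.md"):
        m = re.search(r"^applied_commit:\s*([0-9a-f]+)",
                      path.read_text(), re.M)
        if m:
            mapping[path.name] = m.group(1)
    return mapping
```

Inverting this mapping answers “which draft produced this commit?” without opening a single file by hand.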

The key takeaways from this weird experiment in ad-hoc collaboration protocol establishment:

The default mental model for most people’s agent work is one human and one agent, reflected in one conversation.

When an LLM conversation gets too long or too saturated, you have to start over.

You compact the session summary and lose detail about what has been done. You push through your compressed session and accept degraded LLM output. It’s lossy and degrades over time.

This context degradation is why browser-based use of LLMs is a dead end.

They keep builders tethered to the ground. You need agents that interact with version controlled artifacts that can help establish and maintain context.

Your mental model should be: interactions with LLMs produce boundaried, artifact-driven workflows.

Spin up multiple agents. Define and distribute turf amongst them. Let them read a shared file at the start of every turn. Operate as the message bus by copying messages between the agents. And (this part feels weird) when the protocol has a gap, ask the agents to help you fill it. Agents notice the gaps faster than you do.

After you establish an operating pattern that works, evolve the agents to write the prompts you’re pasting into a shared file. Remove yourself from the loop and focus on approving or rejecting change requests from the agents.

You’re collaborating with agents that are also collaborating with each other, through a contract you jointly maintain. The contract can be a markdown file that evolves every day. The agents can work with the operator to add a new section to the contract.

The foundational rule is: Every agent reads the collaboration contract before acting.

The Operations File: A Pattern for Establishing Maintainable Systems

Sometimes, you forget what you built.

Any complex system outgrows an operator’s memory. When you’re building a solution, you get a handful of services running, and while you’re doing immediate development, you can hold the whole picture in your head. You know where the configs live, which ports map to which services, and how to restart the occasionally hanging service. Then one evening something breaks and you realize you can’t remember whether the database runs in a container or as a systemd service. You’re not even sure which log to check first. You’re going to have to probe around for a while, and you’d rather be spending time with your kids.

I started writing operations diaries when I noticed how much time I was spending re-learning my own systems. Every time I needed to do some maintenance on an aging server, I started with twenty minutes of archaeology: I’d do some generic Linux system fingerprinting to retrace how things were wired together, confirm which config controlled what, and eventually rediscover the decisions I had made a year ago.

I don’t maintain operations diaries anymore. I’ve landed on a pattern that saves me from the archaeology. I build and maintain a structured document (Operations.md) in the home directory of systems I need to maintain, where it’s easy to find when I need to reacquaint myself with the system. Making this file a permanent practice has a side benefit: I can let agents engage with the server without any preconceived context. The Operations.md file answers the questions I usually have when I sit down at the terminal:

  • What’s running on this system?
  • How do I check the health of the important components?
  • What do I do when something goes wrong?
  • What has gone wrong before?

The operations file is a living document that evolves with the project. The file has the following sections:

Operations.md
├── Quick Reference (status checks, logs, restarts)
├── Architecture Overview (visual map + port table)
├── Services (Homebrew/systemd, launchd, Docker)
├── Hardware Specifications
├── Disk Layout & Usage
├── Network Configuration
├── Listening Ports
├── Scheduled Tasks
├── Remote Access Setup
├── Troubleshooting Guide
├── Configuration Locations
├── Backup Recommendations
├── Known Issues
└── Changelog

I built a tool that can generate a first draft of an operations file for you. It runs platform-native commands on macOS or Linux, collects real system data, and renders a structured Operations.md based on what the script discovers.

This post walks through the concepts behind my Operations.md file. I describe the layout of the document, which is optimized for rapid troubleshooting, explain the problems each section solves, and describe how to start building and maintaining an Operations.md file for systems you need to maintain.

How Do I Know Nothing Is Broken Right Now?

At the very top of the Operations.md file, I capture one command (or a short loop) that checks all critical services at once. Below that, I capture per-service checks for troubleshooting. Imagine a broad check that identifies a problem: we’ll need to drill into the components that could produce it. I’ll need to do some open-ended troubleshooting, but I always want to start with the broadest check first and the ability to drill into specifics second. If you go after a hunch too soon, you can end up wasting hours on the wrong problem. My operations file starts with guidance on how to quickly collect the state of services on the system:
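As a sketch of that broadest-check-first idea, here is a hypothetical check table; the entries are placeholders you’d replace with your system’s real services and endpoints:

```python
import subprocess

# Placeholder checks; swap in your system's real services and endpoints,
# e.g. a curl against a /health URL or a one-row database query.
CHECKS = {
    "root filesystem mounted": ["sh", "-c", "[ -d / ]"],
    # "web app answers": ["curl", "-fsS", "--max-time", "5",
    #                     "http://localhost:8080/health"],
}

def run_checks(checks):
    """Run each check command; print OK/FAIL per service and return results."""
    results = {}
    for name, cmd in checks.items():
        try:
            ok = subprocess.run(cmd, capture_output=True).returncode == 0
        except FileNotFoundError:
            ok = False  # the checking tool itself is missing
        results[name] = ok
        print(("OK  " if ok else "FAIL"), name)
    return results
```

Calling `run_checks(CHECKS)` from a cron job or a shell alias, and exiting nonzero when anything fails, turns the top of the operations file into an executable health summary.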

Services often fail silently. A container can report “running” while the application inside has crashed. A database can be online but unable to accept writes because the disk is full. A scheduled job can fail every execution for weeks, but I won’t notice till I need its output.

I handle these risks with a general health check section: a set of commands we can run in sequence to verify that each component of the system has what it needs to perform its job.

A good health check tests the delivery experience that a service is supposed to facilitate. For a web application, that means hitting an endpoint and confirming a valid response. For a database, that means running a query. For an API, that means making a real request. Checking whether the process is alive tells us something, but not enough. “Is the process delivering its feature?” tells us whether the system is failing.

Next I need a conceptual lay of the land for the services running on the system. My operations file addresses this with an architecture overview near the top. The document lists every service, how it was installed (container, systemd unit, native process), what port it listens on, and what it depends on. A diagram showing how traffic flows through the system works for understanding relationships. The two sections give two views of the same information, optimized for different questions: what software is on this system, versus what services and ports are listening on it?

When something breaks on the system, I need to know what else might be affected. When I want to add a new service, I need to know which ports are already in use so I don’t stomp on existing ports. I also need to know which dependencies exist. The Inventory in an operations file helps me quickly reason about a system I don’t fully remember.

Generally speaking, I run the same set of discovery commands when I’m trying to discover the state of a system I’m troubleshooting. In practice, I don’t correctly remember all of the commands I need for checking systemd, Docker containers, nginx, and other services. To overcome my failing memory, I use a Python script to collect the discrete details of the system, and I use an AI agent to infer the details and rationale of the system. If the agent doesn’t get it right, I correct it and update the documentation.
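The collect-then-infer split can be sketched in a few lines. This is an illustrative command table assuming a systemd/Docker Linux host, not the actual script:

```python
import subprocess

# Illustrative discovery table, assuming a systemd/Docker Linux host.
DISCOVERY = {
    "failed_units": ["systemctl", "--failed", "--no-legend"],
    "containers":   ["docker", "ps", "--format", "{{.Names}}\t{{.Status}}"],
    "listening":    ["ss", "-tlnp"],
}

def snapshot(commands):
    """Run each discovery command and collect its raw output.

    Returns None for tools that aren't installed on this host, so the
    same table works across differently provisioned machines.
    """
    out = {}
    for key, cmd in commands.items():
        try:
            proc = subprocess.run(cmd, capture_output=True,
                                  text=True, timeout=15)
            out[key] = proc.stdout.strip()
        except FileNotFoundError:
            out[key] = None
    return out
```

Dumping `snapshot(DISCOVERY)` as JSON gives the agent raw facts to interpret; the script stays dumb on purpose, and the inference stays with the model.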

With my current operating practices, I can read the file directly, or have an agent summarize everything it sees in the Ops file and compare it against what it checks for on the system. I can apply agents to maintain a self-updating Operations.md file when I make changes to the system during troubleshooting and tuning activities. Here, we see some details about running services on a jump host that help me quickly understand what’s running and how to troubleshoot its behavior:

You can use agents to generate each section as needed, but I recommend writing these sections personally when the system is healthy and there is time to verify each entry, meaning you might skip this work at install time. Let the services run for a few days and make sure everything’s working correctly. Confirm the ports, confirm the install methods, trace your dependency chains, and then generate the troubleshooting documentation. Writing documentation forces you to obtain a level of understanding that casual troubleshooting does not. You cannot write down how a service connects to its database without confirming you actually know which database it uses. The act of writing documentation helps you discover surprises. You may not remember these activities later, but the documentation quality will be higher because you were actively involved in producing it.

How Do I Fix Things When It Does Break?

When something goes wrong on the system, we need a way to diagnose the problem and evaluate it against a record of how similar problems were solved before.

I add entries to the troubleshooting section from specific incidents, each following a pattern: symptoms observed, commands run to investigate, root cause found, fix applied. The most valuable entries are the ones written right after spending two hours solving a problem, while the details are still fresh. High-quality documentation means that in the future, we’ll solve the same problem in five minutes.
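A troubleshooting entry following that pattern might look like this; the incident shown is invented for illustration:

```markdown
### Symptom: web UI returns 502 after reboot
- Observed: nginx running, but the upstream app container was not
- Investigated: `docker ps -a`, `journalctl -u docker -b`
- Root cause: container had no restart policy, so it stayed down after boot
- Fix: `docker update --restart unless-stopped <container>`
- Wrong turns: spent 30 min in nginx config before checking the container
```

The symptom goes in the heading so future-you can scan by what is visible, not by what turned out to be the cause.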

I try to organize this section by symptom rather than by cause. When something breaks, I know what I am seeing (the service is unresponsive, the page returns an error, the log is full of a particular message). I don’t know why at this point. Entries organized by symptom let me match what I observe to a known pattern without knowing which components I need to check.

I try to capture the troubleshooting wrong turns too. If I spend thirty minutes investigating a configuration issue before discovering that the real problem was an upstream provider outage, that sequence belongs in the entry. Next time I see the same symptoms, I check the upstream provider first and save myself from unnecessary detours.

How Do I Know Something Is About to Break?

The operations file addresses potential failures with a section on what to watch and how to check it. For each item, we document the command to check current status.

Known Issues, Constraints, and Workarounds

Every system accumulates principled compromises. A partition might turn out to be too small. A driver may behave unpredictably under certain conditions. I capture these types of problems in a Known Issues section.

Keeping It Alive: The Operations File as a Living Document

An operations file written once and never updated becomes a liability that provides false confidence. We trust it, act on stale information that no longer reflects the system, and make things worse.

I update the file during all maintenance. When I add a service, I add it to the inventory before moving on to the next task. When I solve a problem, I write the troubleshooting entry while the details are fresh. When I discover a new constraint, I add it to known issues immediately. The file grows with the system instead of drifting away from it. I leverage coding agents to troubleshoot in new ways. I also use coding agents to update the documentation.

A changelog section anchors the document. Every significant change gets a dated entry describing what changed, why it changed, and what I learned in the process. The emphasis belongs on reasoning, because the reasoning behind a decision decays fastest. Capturing the time of the decision makes tech debt manageable.
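A changelog entry that captures the reasoning might look like this; the content is an invented example:

```markdown
## Changelog
- 2025-03-14: Moved the database from Docker to a native systemd service.
  Why: bind-mount I/O overhead was inflating query latency.
  Learned: pin the data directory location before migrating; see Known Issues.
```

The date and the “why” line do most of the work; the command that made the change is usually recoverable, but the reasoning is not.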

Over time, this section becomes institutional memory. When I encounter a configuration I do not recognize, the changelog tells me when it was added and what problem it was solving. That context is the difference between “I should not touch this because I do not understand it” and “I understand why this is here and whether it still applies.”

Getting Started: What to Write Down First

If this all sounds like a lot, don’t panic. The scope of a complete operations file can feel paralyzing when you are starting from nothing. So don’t start with a target of completeness. Start from what you already know. Build it over time. Also, don’t start from scratch: there’s a tool at the bottom of this post that you can use to do your own initial discovery.

Speedy navigation of expanding Markdown files

As the Operations file grows, navigation will get harder. You can use a tool like mindmap-cli to generate a mind map from a markdown file. A mind map generated from the markdown source gives you a bird’s-eye view of the whole structure, collapsible and clickable, without maintaining a separate document. The markdown file itself is for depth. The mind map is for orientation. Together they let someone operate the system without reading a thousand lines of operations manuals.

Build an Operations.md file for Your System

The tool I mentioned at the top, Operations Discovery Mechanism, automates the hardest part of getting started: the initial inventory. It collects real data from the system using platform-native commands (brew services, launchctl, and lsof on macOS; systemctl, journalctl, and ss on Linux), structures it as JSON, and renders a complete Operations.md with copy-paste ready commands for every service it finds.

Create an Operations directory in the home directory of your target system, then clone the repository into it.

gh repo clone CaptainMcCrank/OperationsDiscoveryMechanism

The workflow takes two steps. First, you’ll run the collector script to produce a JSON snapshot of your system. Second, run the renderer to turn that snapshot into a structured markdown document. No dependencies beyond Python 3.9 and the standard library.

# macOS
python3 mac_system_info.py -o system_info.json
python3 generate_operations.py -o Operations.md

# Linux
python3 linux_system_info.py -o system_info.json
python3 generate_operations.py -o Operations.md

For richer documentation, feed the collected JSON to Claude Code with the prompt template embedded in the Readme. Claude can infer relationships between services, add descriptions of what each one does, and flag potential issues that the script collects but cannot interpret.

The Operations.md file covers the architecture overview, the service inventory, the quick reference commands, the port map, and the disk layout. Over time, if you’re disciplined about updating the troubleshooting entries, known issues, and changelog, the document becomes extremely powerful. You’ll be able to hop back onto old systems with ease. You’ll be in a position to start doing interesting experiments with agentic operations. You can have agents log into systems and acquire context without spending tokens on system enumeration.

Allocating RAM for GPU performance on self-hosted LLM systems with integrated system & GPU RAM

Are you sure that the system you’re running self-hosted LLMs on has properly allocated its GPU memory?

I was doing some work on my 128 GB Ryzen AMD mini PC. I operate this machine as a Linux server dedicated to self-hosted AI infrastructure. I had run into a performance problem where I had saturated all resources and experienced a hard lock. After rebooting to do some troubleshooting, I discovered that my system didn’t look like it was operating with 128 GB of RAM.

Diagnosing the problem

This machine’s purpose is hosting local AI inference. The product listing indicated 128 GB of unified memory. Did GMKTec/Amazon ship me the wrong unit? I tested for system memory:

$ free -h
  Mem: 62Gi

Linux reported sixty-two gigabytes. I queried the GPU’s VRAM total from the kernel:

$ cat /sys/class/drm/card*/device/mem_info_vram_total
  68719476736

Sixty-four gigabytes on the graphics side. Sixty-two visible to the operating system. That accounts for roughly 126 gigabytes if you add them together, but the system showed only half of the memory I thought I paid for.

Memory in integrated GPU/CPU systems

The processor on this system carries an integrated GPU. Unlike desktop workstations, there is no discrete graphics card on a separate board with dedicated memory. Every byte of physical RAM belongs to one unified pool of LPDDR5X shared between the CPU and GPU. I should have known this, but I didn’t. I haven’t built a gaming PC in over 20 years. On this hardware, the distinction between “system memory” and “graphics memory” exists only in firmware: the BIOS has settings for assigning memory to the CPU and GPU.

Integrated graphics have operated this way for a while. Intel’s onboard GPUs quietly borrow anywhere from 128 megabytes to a gigabyte or two of system RAM.

The Intel 810 chipset (1999) was Intel’s first integrated graphics chipset and used what Intel called “Unified Memory Architecture” (UMA). It borrowed 7-11 MB of system RAM for the GPU’s frame buffer, textures, and Z-buffer. This document describes the Graphics and Memory Controller Hub directly.

Intel later formalized this as DVMT (Dynamic Video Memory Technology), which let the graphics driver and OS dynamically allocate system RAM to the iGPU based on real-time demand. The BIOS setting “DVMT Pre-Allocated” (letting you choose 32 MB, 64 MB, 128 MB, etc.) became a standard fixture on Intel-based motherboards for the next two decades. https://www.techarp.com/bios-guide/dvmt-mode/ documents the DVMT modes in detail.

Intel’s own support documentation still explains this architecture for current hardware: https://www.intel.com/content/www/us/en/support/articles/000020962/graphics.html confirms that integrated Intel GPUs use system memory rather than a separate memory bank.

The kernel-level term is “stolen memory” (or Graphics Stolen Memory / GSM). https://igor-blue.github.io/2021/02/10/graphics-part1.html documents how the UEFI firmware reserves a region of physical RAM for the GPU through the Global GTT, managed by hardware and invisible to the OS’s general memory pool.

This design lineage runs from the Intel 810 in 1999 through every Intel iGPU since, with the same fundamental mechanism: firmware carves system RAM away from the OS and hands it to the GPU. The Strix Halo platform applies the same idea at 1000x the scale.

I’d never noticed because I’ve been operating on macOS for the last 15 years.

The M-series chips (M1 through M4) share the same fundamental architecture: CPU, GPU, and Neural Engine all access one physical pool of memory. But Apple and AMD made different choices about how to manage that pool.

On Apple Silicon, macOS sees all the memory and allocates it dynamically. If you buy a MacBook with 64 GB of unified memory, top and Activity Monitor report 64 GB. The GPU draws from that pool on demand. The CPU draws from it on demand. No firmware partition divides them. When the GPU needs 20 GB for a rendering task, it gets 20 GB. When it finishes, that memory returns to the general pool. The OS arbitrates in real time.

But on this purpose-specific machine, the default resource allocation produces performance degradation. It looks like GMKTec assumed Windows and gaming would be the main applications of this hardware. If your objective is running LLMs locally, the default config is going to need adjustments.

I reached out to GMKTec to ask whether there was a hardware problem. They indicated that the default config assigns 64 gigabytes to graphics and 64 gigabytes to the system. To fix this inefficient configuration, I needed to get into the BIOS and adjust the split.

Adjusting memory allocated to GPUs

That raised a practical question: how much memory should I allocate to the Host OS versus the GPU?

My system has Docker containers handling most of the system workload: a search engine, a workflow automation platform, a CMS, a kanban board, a chat interface for local models and the databases backing all of them. My Gnome/COSMIC desktop session was also running, plus a couple of terminal processes consuming their share of memory. Total system memory use hovered around 12 gigabytes. Fifty gigabytes of allocated system RAM sat idle.

The GPU told the same story from a different angle. Of its 64-gigabyte allocation, 330 megabytes held active data. The local inference server sat installed and waiting. Models rested on disk, ready to load, but nothing filled the VRAM. The GPU’s enormous partition accomplished almost nothing.

$ cat /sys/class/drm/card*/device/mem_info_vram_used
348594176

That returned 348,594,176 bytes, which is roughly 330 MB. The companion command for the total allocation was:

$ cat /sys/class/drm/card*/device/mem_info_vram_total
68719476736

That returned 68,719,476,736 bytes, which is 64 GB.

Both values come from the amdgpu kernel driver, which exposes them as sysfs files under /sys/class/drm/card*/device/. The mem_info_vram_used file reports how much of the GPU’s allocated partition is actively holding data at that moment. The mem_info_vram_total file reports the size of the partition itself.
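Those raw byte counts are easier to read with a little shell arithmetic. Here is a minimal sketch; the sysfs paths in the comments are the real amdgpu files, but the values are hard-coded to the ones my machine reported, so the script runs anywhere:

```shell
#!/bin/sh
# Convert the raw amdgpu sysfs byte counts into MiB/GiB.
# On a live system you would read them directly:
#   used=$(cat /sys/class/drm/card*/device/mem_info_vram_used)
#   total=$(cat /sys/class/drm/card*/device/mem_info_vram_total)
used=348594176
total=68719476736

echo "VRAM used:  $((used / 1024 / 1024)) MiB"          # 332 MiB
echo "VRAM total: $((total / 1024 / 1024 / 1024)) GiB"  # 64 GiB
```

The same pattern works for any of the `mem_info_*` files the driver exposes, such as `mem_info_gtt_total` for the GTT region.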

This machine was built to run large language models. I wasn’t getting the utilization I expected. A 70-billion parameter model quantized to Q8 needs roughly 70 gigabytes of VRAM. With this system’s default, larger models don’t fit. I rebooted into the BIOS and bumped the GPU allocation to 96 gigabytes. The system side drops to 32 gigabytes, which still exceeds my current workloads by a wide margin. Twelve gigabytes of active use against 32 gigabytes of capacity leaves generous headroom for growth.

Post-fix memory layout

Aside on model quantization

When you run something like ollama pull deepseek-coder-v2:16b, the quantization level is baked into that specific model file. If you look at the Ollama model library, you’ll typically see tags like:

  • model:7b-q4_0
  • model:7b-q5_K_M
  • model:7b-q8_0
  • model:7b-fp16

The Q4, Q5, Q8, fp16 suffixes indicate the quantization level. Lower numbers mean more compression (smaller file, less VRAM, lower quality). Higher numbers and fp16 mean less compression (larger file, more VRAM, better quality). Quantization reduces the numerical precision of a model’s weights. A weight stored at fp16 uses 16 bits. Q8 uses 8 bits. Q4 uses 4 bits. Fewer bits mean the weight carries a rounded approximation of its original value instead of the precise one the model learned during training.
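The VRAM math follows directly from bytes per weight: multiply the parameter count by the storage size of each weight. A back-of-the-envelope sketch for a hypothetical 70-billion-parameter model (weights only; the KV cache and runtime overhead add more on top):

```shell
#!/bin/sh
# Approximate weight storage for a 70B-parameter model at each precision.
# bytes = parameters * bytes_per_weight, reported in decimal GB.
params=70000000000

echo "fp16: $((params * 2 / 1000000000)) GB"  # 2 bytes/weight -> 140 GB
echo "q8:   $((params * 1 / 1000000000)) GB"  # 1 byte/weight  ->  70 GB
echo "q4:   $((params / 2 / 1000000000)) GB"  # 0.5 byte/weight -> 35 GB
```

This is why the Q8 build of a 70B model needs roughly 70 gigabytes of VRAM, and why the default 64-gigabyte GPU partition couldn't hold it.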

Where you notice performance that is “higher quality”:

  • Complex reasoning chains. A Q4 model is more likely to lose the thread on multi-step logic, math problems, or long code generation. The accumulated rounding errors across billions of weights degrade the model’s ability to hold coherent structure over long outputs.
  • Nuance in language. Word choice becomes slightly flatter. An fp16 model might select a precise, unexpected word. The Q4 version gravitates toward more generic alternatives. The difference is hard to spot in a single response but becomes noticeable over a session.
  • Instruction following. Heavily quantized models drift from instructions more often. They might ignore a formatting constraint, repeat themselves, or partially answer a question. The precision loss makes the model slightly less responsive to the signal embedded in your prompt.
  • Factual reliability. Q4 models hallucinate marginally more. The degraded weights weaken the model’s ability to distinguish between what it “knows” confidently and what it is guessing at.

Where you probably won’t notice “lower quality” quantization levels:

  • Simple question and answer.
  • Casual conversation.
  • Summarization of short texts.

Ollama does not re-quantize a model at load time; you pick your quantization when you pull the model. With the new memory split, I can pull larger models at higher precision for experimentation and training. That translates directly into better inference quality and a far better experience with local models.

Hope this helps. To summarize:

Several systems released in recent months are good candidates for running local LLMs. If you get a mini-PC with AMD hardware, you will likely need to adjust the RAM split to match your inference goals. I covered how a performance problem led me to discover my misconfigured default split, and how to reason about changing that config for better performance.

Want help building self-hosted LLMs? Let’s connect!

Post Script:
If you’re exploring hardware for self-hosting and considering an AMD GPU, you should absolutely take some time to read https://strixhalo.wiki/Guides/Buyer’s_Guide

The Strix Halo wiki has a ton of valuable and relevant resources.