Are you sure that the system you’re running self-hosted LLMs on has properly allocated its GPU memory?
I was doing some work on my 128 GB Ryzen AMD mini PC, which I run as a Linux server dedicated to self-hosted AI infrastructure. I had hit a performance problem where I saturated all resources and experienced a hard lock. After rebooting to troubleshoot, I discovered the system didn't appear to be operating with 128 GB of RAM.
Diagnosing the problem
This machine’s purpose is hosting local AI inference. The product listing indicated 128 GB of unified memory. Did GMKtec/Amazon ship me the wrong unit? I checked system memory:
$ free -h
Mem: 62Gi
Linux reported sixty-two gigabytes. Next I queried the kernel for the GPU’s VRAM total:
$ cat /sys/class/drm/card*/device/mem_info_vram_total
68719476736
Sixty-four gigabytes on the graphics side, sixty-two visible to the operating system. Added together, that accounts for roughly 126 gigabytes, but the OS itself could use only half of the memory I paid for.
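The raw byte counts from sysfs are easier to sanity-check with a quick conversion. A minimal sketch (the sysfs path comes from the amdgpu driver; the helper name is mine):

```shell
# Convert a raw byte count to whole GiB for quick sanity checks
bytes_to_gib() {
  echo $(( $1 / 1024 / 1024 / 1024 ))
}

bytes_to_gib 68719476736   # the VRAM total reported above -> 64

# On a live system you would feed it the sysfs value directly:
#   bytes_to_gib "$(cat /sys/class/drm/card0/device/mem_info_vram_total)"
```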
Memory in integrated GPU/CPU systems
The processor in this system carries an integrated GPU. Unlike desktop workstations, there is no discrete graphics card on a separate board with its own dedicated memory. Every byte of physical RAM lives in one unified pool of LPDDR5X shared between CPU and GPU. I should have known this, but I didn’t; I haven’t built a gaming PC in over 20 years. On this hardware, the distinction between “system memory” and “graphics memory” exists only in firmware: the BIOS has settings for splitting the pool between the CPU and the GPU.
Integrated graphics have worked this way for a long time. Intel’s onboard GPUs have quietly borrowed anywhere from 128 megabytes up to a gigabyte or two of system RAM.
The Intel 810 chipset (1999) was Intel’s first integrated graphics chipset and used what Intel called “Unified Memory Architecture” (UMA): it borrowed 7-11 MB of system RAM for the GPU’s frame buffer, textures, and Z-buffer, all managed by the chipset’s Graphics and Memory Controller Hub.
Intel later formalized this as DVMT (Dynamic Video Memory Technology), which let the graphics driver and OS dynamically allocate system RAM to the iGPU based on real-time demand. The BIOS setting “DVMT Pre-Allocated” (letting you choose 32 MB, 64 MB, 128 MB, etc.) became a standard fixture on Intel-based motherboards for the next two decades. https://www.techarp.com/bios-guide/dvmt-mode/ documents the DVMT modes in detail.
Intel’s own support documentation still explains this architecture for current hardware: https://www.intel.com/content/www/us/en/support/articles/000020962/graphics.html confirms that integrated Intel GPUs use system memory rather than a separate memory bank.
The kernel-level term is “stolen memory” (or Graphics Stolen Memory / GSM). https://igor-blue.github.io/2021/02/10/graphics-part1.html documents how the UEFI firmware reserves a region of physical RAM for the GPU through the Global GTT, managed by hardware and invisible to the OS’s general memory pool.
This design lineage runs from the Intel 810 in 1999 through every Intel iGPU since, with the same fundamental mechanism: firmware carves system RAM away from the OS and hands it to the GPU. The Strix Halo platform applies the same idea at thousands of times the scale.
I’ve never noticed because I’ve been using macOS for the last 15 years.
The M-series chips (M1 through M4) share the same fundamental architecture: CPU, GPU, and Neural Engine all access one physical pool of memory. But Apple and AMD made different choices about how to manage that pool.
On Apple Silicon, macOS sees all the memory and allocates it dynamically. If you buy a MacBook with 64 GB of unified memory, top and Activity Monitor report 64 GB. The GPU draws from that pool on demand. The CPU draws from it on demand. No firmware partition divides them. When the GPU needs 20 GB for a rendering task, it gets 20 GB. When it finishes, that memory returns to the general pool. The OS arbitrates in real time.
But on this purpose-specific machine, the default resource allocation produces degraded performance. GMKtec appears to assume Windows and gaming will be the main applications of this hardware. If your objective is running LLMs locally, the default config needs adjustment.
I reached out to GMKtec to ask whether there was a hardware problem. They confirmed that the default config assigns 64 gigabytes to graphics and 64 gigabytes to the system. To fix this inefficient split, I needed to get into the BIOS and adjust it.
Adjusting memory allocated to the GPU
That raised a practical question: how much memory should I allocate to the Host OS versus the GPU?
My system has Docker containers handling most of the system workload: a search engine, a workflow automation platform, a CMS, a kanban board, a chat interface for local models, and the databases backing all of them. My GNOME/COSMIC desktop session was also running, plus a couple of terminal processes consuming their share of memory. Total system memory use hovered around 12 gigabytes. Fifty gigabytes of allocated system RAM sat idle.
The GPU told the same story from a different angle. Of its 64-gigabyte allocation, 330 megabytes held active data. The local inference server sat installed and waiting. Models rested on disk, ready to load, but nothing filled the VRAM. The GPU’s enormous partition accomplished almost nothing.
$ cat /sys/class/drm/card*/device/mem_info_vram_used
348594176
That returned 348,594,176 bytes, which is roughly 330 MB. The companion command for the total allocation was:
$ cat /sys/class/drm/card*/device/mem_info_vram_total
68719476736
That returned 68,719,476,736 bytes, which is 64 GB.
Both values come from the amdgpu kernel driver, which exposes them as sysfs files under /sys/class/drm/card*/device/. The mem_info_vram_used file reports how much of the GPU’s allocated partition is actively holding data at that moment. The mem_info_vram_total file reports the size of the partition itself.
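Dividing one file by the other gives a utilization figure. A small sketch using the two values quoted above (the sysfs paths are real amdgpu attributes; the helper name is mine):

```shell
# Percent of the GPU partition actually holding data: used_bytes total_bytes
vram_pct() {
  echo $(( $1 * 100 / $2 ))
}

# On a live amdgpu system you would read the sysfs files directly:
#   vram_pct "$(cat /sys/class/drm/card0/device/mem_info_vram_used)" \
#            "$(cat /sys/class/drm/card0/device/mem_info_vram_total)"

vram_pct 348594176 68719476736   # the values above -> 0, i.e. under 1% utilized
```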
This machine was built to run large language models, and I wasn’t getting the utilization I expected. A 70-billion parameter model quantized to Q8 needs roughly 70 gigabytes of VRAM, so with this system’s default, larger models don’t fit. I rebooted into the BIOS and bumped the GPU allocation to 96 gigabytes. The system side dropped to 32 gigabytes, which still exceeds my current workloads by a wide margin. Twelve gigabytes of active use against 32 gigabytes of capacity leaves generous headroom for growth.
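The fit test behind that decision is simple arithmetic: parameters times bits per weight, divided by 8, gives a floor on the VRAM needed (real usage adds KV cache and runtime overhead). A sketch of that rule of thumb as a shell helper (the function name is mine):

```shell
# Does a model's weights fit in the GPU partition?
# Args: params_in_billions  bits_per_weight  gpu_gb
fits_in_vram() {
  local need=$(( $1 * $2 / 8 ))
  [ "$need" -le "$3" ] && echo "fits (~${need} GB)" || echo "too big (~${need} GB)"
}

fits_in_vram 70 8 64   # 70B at Q8 against the default 64 GB split
fits_in_vram 70 8 96   # the same model against the new 96 GB split
```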


Aside on model quantization
When you run something like ollama pull deepseek-coder-v2:16b, the quantization level is baked into that specific model file. If you look at the Ollama model library, you’ll typically see tags like:
- model:7b-q4_0
- model:7b-q5_K_M
- model:7b-q8_0
- model:7b-fp16
The Q4, Q5, Q8, and fp16 suffixes indicate the quantization level. Lower numbers mean more compression (smaller file, less VRAM, lower quality). Higher numbers and fp16 mean less compression (larger file, more VRAM, better quality). Quantization reduces the numerical precision of a model’s weights: a weight stored at fp16 uses 16 bits, Q8 uses 8 bits, Q4 uses 4 bits. Fewer bits mean the weight carries a rounded approximation of its original value instead of the precise one the model learned during training.
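Those bit widths map directly to file size. A back-of-envelope sketch for a 7B model (actual GGUF files run somewhat larger because K-quants mix bit widths and carry metadata; the function name is mine):

```shell
# Approximate weight size in GB for a 7-billion-parameter model
# Arg: bits per weight
quant_size_gb() {
  echo $(( 7 * $1 / 8 ))
}

quant_size_gb 4    # q4   -> ~3 GB (real q4_0 files land closer to 4 GB)
quant_size_gb 8    # q8   -> ~7 GB
quant_size_gb 16   # fp16 -> ~14 GB
```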
Where you’ll notice the quality difference:
- Complex reasoning chains. A Q4 model is more likely to lose the thread on multi-step logic, math problems, or long code generation. The accumulated rounding errors across billions of weights degrade the model’s ability to hold coherent structure over long outputs.
- Nuance in language. Word choice becomes slightly flatter. An fp16 model might select a precise, unexpected word; the Q4 version gravitates toward more generic alternatives. The difference is hard to spot in a single response but becomes noticeable over a session.
- Instruction following. Heavily quantized models drift from instructions more often. They might ignore a formatting constraint, repeat themselves, or partially answer a question. The precision loss makes the model slightly less responsive to the signal embedded in your prompt.
- Factual reliability. Q4 models hallucinate marginally more. The degraded weights weaken the model’s ability to distinguish between what it “knows” confidently and what it is guessing at.
Where you probably won’t notice “lower quality” quantization levels:
- Simple question and answer.
- Casual conversation.
- Summarization of short texts.
Ollama does not re-quantize a model at load time; you pick your quantization when you pull it. With this change, I can now pull larger, higher-precision models for experimentation and training, which yields better inference quality and a far better experience with models.
Hope this helps. To summarize:
Several systems released in recent months are good candidates for running local LLMs. If you get a mini PC with AMD hardware, you’ll likely need to adjust the RAM split for your inference goals. I covered how a performance problem led me to discover issues with my config, and how to reason about changing that config for better performance.
Want help building self-hosted LLMs? Let’s connect!
Postscript:
If you’re exploring buying hardware for self-hosting and considering an AMD GPU, you should absolutely take some time to read https://strixhalo.wiki/Guides/Buyer’s_Guide
The Strix Halo wiki has a ton of valuable and relevant resources.