Running a Local Model on the OnLogic FR201 with llama.cpp and ZeroClaw
This is Part 2 of the ZeroClaw on the OnLogic FR201 series. In Part 1, we deployed ZeroClaw as a static Rust binary managed by systemd, pointed at Anthropic’s API. That’s the recommended setup for most use cases. But what if you can’t make outbound API calls?
In some manufacturing environments, the devices on the plant floor are completely air-gapped. No internet access, no cloud endpoints, period. Maybe it’s a regulatory requirement, maybe it’s an IT policy, or maybe you’re running in a classified facility. Whatever the reason, you still want an agent that can process data locally.
This guide walks through running a small language model directly on the FR201 using llama.cpp, served as an OpenAI-compatible API endpoint, and pointing ZeroClaw at it instead of Anthropic’s cloud API.
A Realistic Disclaimer
Let’s be upfront about this: the FR201 was not designed to run language models. It’s built around a Raspberry Pi Compute Module 4 with a quad-core Cortex-A72 at 1.5 GHz and a maximum of 8 GB of LPDDR4 RAM. There’s no NPU, no GPU acceleration for inference, and the CPU is an older architecture compared to what’s available today. Everything runs on the CPU.
What does that actually mean in practice? With the models that fit in 8 GB of RAM, you’re looking at roughly 1-8 tokens per second depending on the model size and quantization level. For reference, a typical API call to Claude returns tokens at 50-100+ per second. So this isn’t a replacement for a cloud model. It’s a fallback for when a cloud model isn’t an option.
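Those rates translate directly into wall-clock response times. A quick back-of-envelope sketch (the 2 and 60 tokens/second figures are illustrative picks from the ranges above, not measurements):

```python
def response_seconds(n_tokens: int, tokens_per_second: float) -> float:
    """Rough wall-clock time to generate n_tokens at a given rate."""
    return n_tokens / tokens_per_second

# A 150-token answer at 2 tok/s on the CM4 vs. ~60 tok/s from a cloud API
local = response_seconds(150, 2.0)   # 75 seconds
cloud = response_seconds(150, 60.0)  # 2.5 seconds
print(f"local: {local:.0f}s, cloud: {cloud:.1f}s")
```

A minute-plus for a paragraph-length answer is the realistic expectation on this hardware, which is why the guide keeps prompts and responses short.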
That said, this is still a worthwhile exercise. If OnLogic releases an FR201 successor built on the CM5 (which has a faster Cortex-A76 and potentially more RAM), these same steps will apply and the performance picture gets a lot more interesting. Consider this a foundation you can build on as the hardware improves.
Choosing a Model
With 8 GB of total system RAM (shared between the OS, llama.cpp, the model weights, and the KV cache), we need to be selective. Here are the realistic options, ordered from most to least comfortable:
Qwen2.5 0.5B Instruct (Q4_K_M) — ~400 MB on disk, uses under 1 GB of RAM. This is the sweet spot for the CM4. Benchmarks on comparable Pi 4 hardware show around 8 tokens/second, which is genuinely usable for short tasks like parsing sensor data, classifying alerts, or generating structured responses. Qwen2.5 punches well above its weight for a 0.5B model.
TinyLlama 1.1B Chat (Q4_K_M) — ~669 MB on disk, uses about 1 GB of RAM. A solid step up in capability from the 0.5B models. Expect roughly 2-4 tokens/second on the CM4. Good enough for simple summarization and structured output tasks, but you’ll feel the latency on anything longer than a sentence or two.
Qwen2.5 1.5B Instruct (Q4_K_M) — ~1 GB on disk, uses about 1.5 GB of RAM. Better reasoning ability, but you’re down to 1-2 tokens/second on the CM4. Still fits comfortably in memory with room for the OS and ZeroClaw, but response times start to stretch.
Phi-3 Mini 3.8B (Q4_K_M) — ~2.2 GB on disk, uses about 3.5 GB of RAM. This is the upper boundary of what fits. You’ll have about 4 GB left for the OS and KV cache, which is tight. Performance will be well under 1 token/second. Only worth considering if you need the reasoning capability and can tolerate very slow responses.
For this guide, we’ll use Qwen2.5 0.5B Instruct as the primary example since it offers the best balance of speed and capability on this hardware. We’ll note where the steps differ if you want to swap in a larger model.
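As a sanity check, the figures above can be turned into a small memory-budget calculator. This is a sketch: the per-model RAM numbers come from the list above and the 1.5 GB OS reserve is a rough assumption, not a measurement:

```python
# Approximate resident RAM per model (GB), from the list above
MODEL_RAM_GB = {
    "qwen2.5-0.5b-q4_k_m": 1.0,
    "tinyllama-1.1b-q4_k_m": 1.0,
    "qwen2.5-1.5b-q4_k_m": 1.5,
    "phi-3-mini-3.8b-q4_k_m": 3.5,
}

def fits(model: str, total_gb: float = 8.0, os_reserve_gb: float = 1.5) -> bool:
    """Does the model leave headroom for the OS, ZeroClaw, and KV cache?"""
    return MODEL_RAM_GB[model] + os_reserve_gb <= total_gb

for name in MODEL_RAM_GB:
    print(name, "fits" if fits(name) else "too big")
```

All four options clear an 8 GB budget on paper; the difference is how much headroom remains for the KV cache and everything else, which is why Phi-3 Mini is described as tight.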
Prerequisites
This guide assumes you’ve already completed Part 1 and have a working FR201 with:
- Raspberry Pi OS Lite (64-bit) installed
- SSH key authentication configured
- The claw user account
- ZeroClaw installed and running as a systemd service
Cross-Compiling llama.cpp
Just like we did with ZeroClaw, we’re going to cross-compile llama.cpp on our workstation to produce a static ARM64 binary. No Docker, no Python, no pip installs on the FR201. Just a binary.
Install Build Dependencies
You’ll need CMake and the same musl cross-compilation toolchain from Part 1.
On macOS:

```sh
brew install cmake
```

On Windows:

```sh
winget install --id=Kitware.CMake -e
```

Make sure you still have the aarch64-linux-musl-gcc toolchain installed from Part 1:

```sh
which aarch64-linux-musl-gcc
```

Clone and Build

```sh
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
```

Create a CMake toolchain file for cross-compilation. Save this as aarch64-musl-toolchain.cmake in the project root:
```cmake
set(CMAKE_SYSTEM_NAME Linux)
set(CMAKE_SYSTEM_PROCESSOR aarch64)

set(CMAKE_C_COMPILER aarch64-linux-musl-gcc)
set(CMAKE_CXX_COMPILER aarch64-linux-musl-g++)

set(CMAKE_FIND_ROOT_PATH_MODE_PROGRAM NEVER)
set(CMAKE_FIND_ROOT_PATH_MODE_LIBRARY ONLY)
set(CMAKE_FIND_ROOT_PATH_MODE_INCLUDE ONLY)
```

Now build with static linking and ARM64 optimizations:
```sh
cmake -B build \
    -DCMAKE_TOOLCHAIN_FILE=aarch64-musl-toolchain.cmake \
    -DCMAKE_BUILD_TYPE=Release \
    -DBUILD_SHARED_LIBS=OFF \
    -DGGML_NATIVE=OFF \
    -DGGML_CPU_AARCH64=ON \
    -DCMAKE_EXE_LINKER_FLAGS="-static"

cmake --build build --config Release -j $(nproc)
```

This will produce the binaries we need in build/bin/. The two we care about are:
- llama-server — the OpenAI-compatible HTTP server
- llama-cli — useful for quick testing from the command line
Verify they’re the right architecture:
```sh
file build/bin/llama-server
# Should output: ELF 64-bit LSB executable, ARM aarch64

file build/bin/llama-cli
# Should output: ELF 64-bit LSB executable, ARM aarch64
```

Downloading the Model

We need the GGUF model file. On your workstation, download the Qwen2.5 0.5B Instruct model in Q4_K_M quantization from Hugging Face:
```sh
# If you don't have the Hugging Face CLI installed
pip install huggingface-hub

# Download the model
huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct-GGUF \
    qwen2.5-0.5b-instruct-q4_k_m.gguf \
    --local-dir ./models
```

The file is roughly 400 MB. If you want to use TinyLlama instead:

```sh
huggingface-cli download TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF \
    tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
    --local-dir ./models
```

Deploying to the FR201

Copy the llama.cpp binaries and the model file to the FR201:
```sh
rsync -avz --progress build/bin/llama-server claw@<IP_ADDRESS>:/home/claw/
rsync -avz --progress build/bin/llama-cli claw@<IP_ADDRESS>:/home/claw/
rsync -avz --progress models/qwen2.5-0.5b-instruct-q4_k_m.gguf claw@<IP_ADDRESS>:/home/claw/
```

The model transfer will take a few minutes depending on your network speed.
SSH into the FR201 and verify everything works:
```sh
ssh claw@<IP_ADDRESS>

# Make the binaries executable
chmod +x llama-server llama-cli

# Quick sanity check
./llama-cli --version
```

Test the Model
Before setting up the service, run a quick test to make sure the model loads and generates output:
```sh
./llama-cli \
    -m qwen2.5-0.5b-instruct-q4_k_m.gguf \
    -p "What is the boiling point of water?" \
    -n 64 \
    -t 4
```

You should see the model load into memory (takes a few seconds on the CM4) and then start generating tokens. Pay attention to the token generation speed reported at the end. If you’re getting somewhere in the range of 5-10 tokens/second on the Qwen 0.5B model, everything is working as expected.
Setting Up the Inference Server
Now we’ll install llama-server as a systemd service, following the same hardening approach we used for ZeroClaw in Part 1.
Create a Service User
```sh
sudo useradd --system \
    --home /var/lib/llama \
    --create-home \
    --shell /usr/sbin/nologin \
    llama
```

Install the Binary and Model
```sh
# Install the server binary
sudo mkdir -p /opt/llama
sudo cp llama-server /opt/llama/
sudo chown -R llama:llama /opt/llama
sudo chmod 755 /opt/llama/llama-server
sudo ln -sf /opt/llama/llama-server /usr/local/bin/llama-server

# Install the model
sudo mkdir -p /var/lib/llama/models
sudo cp qwen2.5-0.5b-instruct-q4_k_m.gguf /var/lib/llama/models/
sudo chown -R llama:llama /var/lib/llama
```

Create the Systemd Service

```sh
sudo nano /etc/systemd/system/llama-server.service
```

Add the following:
```ini
[Unit]
Description=llama.cpp Inference Server
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=llama
Group=llama

ExecStart=/usr/local/bin/llama-server \
    --model /var/lib/llama/models/qwen2.5-0.5b-instruct-q4_k_m.gguf \
    --host 127.0.0.1 \
    --port 8080 \
    --ctx-size 2048 \
    --threads 4 \
    --no-webui

WorkingDirectory=/var/lib/llama

Restart=on-failure
RestartSec=10

StandardOutput=journal
StandardError=journal

# Hardening
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/var/lib/llama
RestrictNamespaces=true
RestrictRealtime=true
LockPersonality=true

[Install]
WantedBy=multi-user.target
```

A few things to note about the configuration:
- --host 127.0.0.1 binds the server to localhost only. ZeroClaw runs on the same device, so there’s no reason to expose this on the network. Nothing outside the FR201 can reach it.
- --port 8080 is the port the OpenAI-compatible API will listen on.
- --ctx-size 2048 limits the context window to 2048 tokens. On the CM4, larger contexts eat into your limited RAM fast. For most edge tasks (parsing sensor data, generating short responses), 2048 is plenty.
- --threads 4 uses all four Cortex-A72 cores. Since ZeroClaw itself is lightweight and mostly idle between requests, giving all cores to llama-server during inference is the right call.
- --no-webui disables the built-in web interface. We don’t need it on a headless edge device.
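Because the model takes several seconds to load, anything that talks to llama-server right after boot may want to wait until the server actually answers. Here is a small polling helper, sketched with only the Python standard library; it assumes recent llama.cpp builds, which expose a /health endpoint, so adjust the URL if yours differs:

```python
import time
import urllib.request
import urllib.error

def wait_for_server(url: str, timeout_s: float = 30.0) -> bool:
    """Poll the URL until it returns HTTP 200, or give up after timeout_s."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            time.sleep(0.5)  # not up yet; back off briefly and retry
    return False

# Example: wait_for_server("http://127.0.0.1:8080/health")
```

The same idea works as an ExecStartPre check or a shell loop around curl; the point is simply not to send the first request before the model has finished loading.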
Enable and start the service:
```sh
sudo systemctl daemon-reload
sudo systemctl enable llama-server
sudo systemctl start llama-server
```

Check that it’s running:

```sh
systemctl is-active llama-server
journalctl -u llama-server -n 50 --no-pager
```

Verify the API

Once the model is loaded (give it 10-15 seconds on the CM4), test the API endpoint:
```sh
curl -s http://127.0.0.1:8080/v1/models | head -20
```

You should see a JSON response listing the loaded model. Now test an actual completion:
```sh
curl -s http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qwen2.5-0.5b-instruct",
        "messages": [
            {"role": "user", "content": "What is the boiling point of water in Fahrenheit?"}
        ],
        "max_tokens": 64
    }'
```

If you get a JSON response with the model’s answer, the inference server is working correctly.
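The same endpoint is just as easy to call from code. A minimal sketch using only the Python standard library; the model name and base URL match the configuration used in this guide:

```python
import json
import urllib.request

API_BASE = "http://127.0.0.1:8080/v1"  # llama-server as configured above

def build_chat_request(prompt: str, max_tokens: int = 64) -> dict:
    """Assemble an OpenAI-style chat completion payload."""
    return {
        "model": "qwen2.5-0.5b-instruct",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str) -> str:
    """POST the request and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{API_BASE}/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    # Generous timeout: on the CM4 a 64-token answer can take the better part of a minute
    with urllib.request.urlopen(req, timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Any OpenAI-compatible client library would work here too; the standard-library version just avoids adding dependencies for a two-function script.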
Pointing ZeroClaw at the Local Model
Now we need to reconfigure ZeroClaw to use the local llama-server endpoint instead of Anthropic’s API.
Edit the ZeroClaw configuration:
```sh
sudo nano /var/lib/zeroclaw/.zeroclaw/config.toml
```

Update the configuration to point at the local server:
```toml
default_provider = "openai-compatible"
default_model = "qwen2.5-0.5b-instruct"
api_base = "http://127.0.0.1:8080/v1"
api_key = "not-needed"
```

Since llama-server runs on localhost with no authentication, the api_key value doesn’t matter, but the field may still be required by ZeroClaw’s config parser, so we set it to a placeholder.
Restart ZeroClaw to pick up the new configuration:
```sh
sudo systemctl restart zeroclaw
```

Service Startup Order
Since ZeroClaw now depends on llama-server being available, we should update the ZeroClaw systemd unit to reflect that dependency. Edit the service file:
```sh
sudo nano /etc/systemd/system/zeroclaw.service
```

Update the [Unit] section to include the llama-server dependency:
```ini
[Unit]
Description=ZeroClaw Service
After=network-online.target llama-server.service
Wants=network-online.target
Requires=llama-server.service
```

The Requires=llama-server.service directive tells systemd that ZeroClaw can’t run without llama-server: if llama-server fails or is stopped, ZeroClaw will be stopped too. The After=llama-server.service ensures llama-server starts first.
Reload and restart:
```sh
sudo systemctl daemon-reload
sudo systemctl restart zeroclaw
```

Test It End-to-End
Section titled “Test It End-to-End”sudo -u zeroclaw -szeroclaw agent -m "Summarize the current system uptime in one sentence."The response will be noticeably slower than when using the Anthropic API, but if you get a coherent answer, the full local stack is working. No internet required.
Performance Tuning
There are a few things you can tweak to squeeze more performance out of the CM4:
Reduce context size. If your use case only needs short prompts and responses, drop --ctx-size to 512 or 1024. Smaller context means less memory used for the KV cache, which leaves more room for the model and OS.
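To see roughly how context size maps to memory, the KV cache can be estimated directly: two tensors (K and V) per layer, per KV head, per position, at 2 bytes per element in fp16. The Qwen2.5 0.5B figures below (24 layers, 2 KV heads, head dimension 64) are taken from the published model config and should be treated as illustrative; thanks to grouped-query attention the 0.5B cache is modest, but the same formula grows quickly with larger models and longer contexts:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx: int, bytes_per_elem: int = 2) -> int:
    """fp16 K and V tensors cached for every layer, KV head, and position."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem

# Illustrative figures for Qwen2.5 0.5B: 24 layers, 2 KV heads, head dim 64
for ctx in (512, 1024, 2048):
    mib = kv_cache_bytes(24, 2, 64, ctx) / 2**20
    print(f"ctx {ctx}: {mib:.0f} MiB")
```

The cache scales linearly with context, so halving --ctx-size halves this cost; on a model with more layers and full multi-head KV (no grouping), the same 2048-token context would cost hundreds of megabytes.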
Use a smaller quantization. If you’re willing to trade some quality for speed, use the Q2_K or IQ2_M quantization variants instead of Q4_K_M. They’re smaller and faster to process, though the output quality takes a hit.
Overclock the CM4. The Cortex-A72 can be overclocked from 1.5 GHz to around 2.0 GHz by editing /boot/config.txt (on newer Raspberry Pi OS releases the file lives at /boot/firmware/config.txt). This gives a meaningful speed boost for inference, but make sure your FR201 has adequate thermal headroom. Add the following:
```ini
over_voltage=6
arm_freq=2000
```

This is at your own risk, and you should monitor temperatures with vcgencmd measure_temp under load. The FR201’s fanless design handles stock clocks well, but overclocking generates more heat.
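If you do overclock, it helps to log the temperature while the model is generating. A small parser for the vcgencmd output, sketched under the assumption that it prints the usual temp=XX.X'C format:

```python
import re

def parse_vcgencmd_temp(output: str) -> float:
    """Extract degrees Celsius from vcgencmd measure_temp output, e.g. "temp=47.2'C"."""
    m = re.match(r"temp=([\d.]+)'C", output.strip())
    if m is None:
        raise ValueError(f"unexpected vcgencmd output: {output!r}")
    return float(m.group(1))

# On the device you would feed it live readings, e.g.:
# subprocess.run(["vcgencmd", "measure_temp"], capture_output=True, text=True).stdout
print(parse_vcgencmd_temp("temp=47.2'C"))  # 47.2
```

Polling this in a loop during a long generation run will tell you quickly whether the fanless enclosure can sustain the higher clock or whether the SoC starts throttling.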
Disable unnecessary services. Every bit of RAM and CPU that goes to background services is RAM and CPU that isn’t doing inference. Disable anything you’re not using:
```sh
sudo systemctl disable bluetooth
sudo systemctl disable avahi-daemon
```

Monitoring Both Services
With two systemd services running, you’ll want a quick way to check on both:
```sh
# Check status of both services
systemctl is-active zeroclaw llama-server

# View recent logs from both services interleaved
journalctl -u zeroclaw -u llama-server --since "1 hour ago" --no-pager
```

When to Use This vs. the Cloud API
To be clear, this local model setup makes sense in a narrow set of circumstances:
- Air-gapped environments where the device has no internet access and never will.
- Data sovereignty requirements where process data cannot leave the device or the local network under any circumstances.
- Extremely latency-sensitive tasks where the round-trip to a cloud API is too slow (though at 5-10 tokens/second locally, this only applies if you need the first token fast and the response is very short).
- Experimentation and learning. If you’re an OT developer getting into AI and want to understand how inference works at the metal level, this is an excellent hands-on project.
For everything else, the Anthropic API (as configured in Part 1) will give you dramatically better results, faster responses, and access to far more capable models. The local model is a tool for specific situations, not a general replacement.
Looking Ahead
The CM4 is the floor, not the ceiling. The same cross-compilation workflow, systemd service configuration, and ZeroClaw integration in this guide will work with faster hardware as it becomes available. A future CM5-based OnLogic device with more RAM and a faster processor could comfortably run 3B-7B parameter models at usable speeds, which opens the door to much more capable local inference on the plant floor.
The important thing is that the architecture is right. One static binary for the inference server, one static binary for the agent, both managed by systemd, both hardened, no dependencies. The model file is the only thing that changes when you upgrade.