
Running a Local Model on the OnLogic FR201 with llama.cpp and ZeroClaw

This is Part 2 of the ZeroClaw on the OnLogic FR201 series. In Part 1, we deployed ZeroClaw as a static Rust binary managed by systemd, pointed at Anthropic’s API. That’s the recommended setup for most use cases. But what if you can’t make outbound API calls?

In some manufacturing environments, the devices on the plant floor are completely air-gapped. No internet access, no cloud endpoints, period. Maybe it’s a regulatory requirement, maybe it’s an IT policy, or maybe you’re running in a classified facility. Whatever the reason, you still want an agent that can process data locally.

This guide walks through running a small language model directly on the FR201 using llama.cpp, served as an OpenAI-compatible API endpoint, and pointing ZeroClaw at it instead of Anthropic’s cloud API.

Let’s be upfront about this: the FR201 was not designed to run language models. It’s built around a Raspberry Pi Compute Module 4 with a quad-core Cortex-A72 at 1.5 GHz and a maximum of 8 GB of LPDDR4 RAM. There’s no NPU, no GPU acceleration for inference, and the CPU is an older architecture compared to what’s available today. Everything runs on the CPU.

What does that actually mean in practice? With the models that fit in 8 GB of RAM, you’re looking at roughly 1-8 tokens per second depending on the model size and quantization level. For reference, a typical API call to Claude returns tokens at 50-100+ per second. So this isn’t a replacement for a cloud model. It’s a fallback for when a cloud model isn’t an option.

That said, this is still a worthwhile exercise. If OnLogic releases an FR201 successor built on the CM5 (which has a faster Cortex-A76 and potentially more RAM), these same steps will apply and the performance picture gets a lot more interesting. Consider this a foundation you can build on as the hardware improves.

With 8 GB of total system RAM (shared between the OS, llama.cpp, the model weights, and the KV cache), we need to be selective. Here are the realistic options, ordered from most to least comfortable:

Qwen2.5 0.5B Instruct (Q4_K_M) — ~400 MB on disk, uses under 1 GB of RAM. This is the sweet spot for the CM4. Benchmarks on comparable Pi 4 hardware show around 8 tokens/second, which is genuinely usable for short tasks like parsing sensor data, classifying alerts, or generating structured responses. Qwen2.5 punches well above its weight for a 0.5B model.

TinyLlama 1.1B Chat (Q4_K_M) — ~669 MB on disk, uses about 1 GB of RAM. A solid step up in capability from the 0.5B models. Expect roughly 2-4 tokens/second on the CM4. Good enough for simple summarization and structured output tasks, but you’ll feel the latency on anything longer than a sentence or two.

Qwen2.5 1.5B Instruct (Q4_K_M) — ~1 GB on disk, uses about 1.5 GB of RAM. Better reasoning ability, but you’re down to 1-2 tokens/second on the CM4. Still fits comfortably in memory with room for the OS and ZeroClaw, but response times start to stretch.

Phi-3 Mini 3.8B (Q4_K_M) — ~2.2 GB on disk, uses about 3.5 GB of RAM. This is the upper boundary of what fits. You’ll have about 4 GB left for the OS and KV cache, which is tight. Performance will be well under 1 token/second. Only worth considering if you need the reasoning capability and can tolerate very slow responses.

For this guide, we’ll use Qwen2.5 0.5B Instruct as the primary example since it offers the best balance of speed and capability on this hardware. We’ll note where the steps differ if you want to swap in a larger model.
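Whatever you pick, it's worth checking how much memory is actually free on the device before loading a model, since the OS (and ZeroClaw, if it's already running) take their share of the 8 GB:

Terminal window
# On the FR201: check free memory before committing to a model size
free -h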

This guide assumes you’ve already completed Part 1 and have a working FR201 with:

  • Raspberry Pi OS Lite (64-bit) installed
  • SSH key authentication configured
  • The claw user account
  • ZeroClaw installed and running as a systemd service

Just like we did with ZeroClaw, we’re going to cross-compile llama.cpp on our workstation to produce a static ARM64 binary. No Docker, no Python, no pip installs on the FR201. Just a binary.

You’ll need CMake and the same musl cross-compilation toolchain from Part 1.

On macOS:

Terminal window
brew install cmake

On Windows:

Terminal window
winget install --id=Kitware.CMake -e

Make sure you still have the aarch64-linux-musl-gcc toolchain installed from Part 1:

Terminal window
which aarch64-linux-musl-gcc
With the toolchain confirmed, clone the llama.cpp source:

Terminal window
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp

Create a CMake toolchain file for cross-compilation. Save this as aarch64-musl-toolchain.cmake in the project root:

set(CMAKE_SYSTEM_NAME Linux)
set(CMAKE_SYSTEM_PROCESSOR aarch64)
set(CMAKE_C_COMPILER aarch64-linux-musl-gcc)
set(CMAKE_CXX_COMPILER aarch64-linux-musl-g++)
set(CMAKE_FIND_ROOT_PATH_MODE_PROGRAM NEVER)
set(CMAKE_FIND_ROOT_PATH_MODE_LIBRARY ONLY)
set(CMAKE_FIND_ROOT_PATH_MODE_INCLUDE ONLY)

Now build with static linking and ARM64 optimizations:

Terminal window
cmake -B build \
-DCMAKE_TOOLCHAIN_FILE=aarch64-musl-toolchain.cmake \
-DCMAKE_BUILD_TYPE=Release \
-DBUILD_SHARED_LIBS=OFF \
-DGGML_NATIVE=OFF \
-DGGML_CPU_AARCH64=ON \
-DCMAKE_EXE_LINKER_FLAGS="-static"
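# Note: $(nproc) is a Linux/coreutils command; on macOS use $(sysctl -n hw.ncpu) instead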
cmake --build build --config Release -j $(nproc)

This will produce the binaries we need in build/bin/. The two we care about are:

  • llama-server — the OpenAI-compatible HTTP server
  • llama-cli — useful for quick testing from the command line

Verify they’re the right architecture:

Terminal window
file build/bin/llama-server
# Should output: ELF 64-bit LSB executable, ARM aarch64
file build/bin/llama-cli
# Should output: ELF 64-bit LSB executable, ARM aarch64

We need the GGUF model file. On your workstation, download the Qwen2.5 0.5B Instruct model in Q4_K_M quantization from Hugging Face:

Terminal window
# If you don't have the Hugging Face CLI installed
pip install huggingface-hub
# Download the model
huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct-GGUF \
qwen2.5-0.5b-instruct-q4_k_m.gguf \
--local-dir ./models

The file is roughly 400 MB. If you want to use TinyLlama instead:

Terminal window
huggingface-cli download TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF \
tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
--local-dir ./models

Copy the llama.cpp binaries and the model file to the FR201:

Terminal window
rsync -avz --progress build/bin/llama-server claw@<IP_ADDRESS>:/home/claw/
rsync -avz --progress build/bin/llama-cli claw@<IP_ADDRESS>:/home/claw/
rsync -avz --progress models/qwen2.5-0.5b-instruct-q4_k_m.gguf claw@<IP_ADDRESS>:/home/claw/

The model transfer will take a few minutes depending on your network speed.
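If you want to confirm the model file survived the transfer intact, you can compare checksums on both ends. This is optional, but cheap insurance for a file this size:

Terminal window
# On the workstation (shasum on macOS; use sha256sum on Linux)
shasum -a 256 models/qwen2.5-0.5b-instruct-q4_k_m.gguf
# On the FR201, after the copy completes
sha256sum /home/claw/qwen2.5-0.5b-instruct-q4_k_m.gguf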

SSH into the FR201 and verify everything works:

Terminal window
ssh claw@<IP_ADDRESS>
# Make the binaries executable
chmod +x llama-server llama-cli
# Quick sanity check
./llama-cli --version

Before setting up the service, run a quick test to make sure the model loads and generates output:

Terminal window
./llama-cli \
-m qwen2.5-0.5b-instruct-q4_k_m.gguf \
-p "What is the boiling point of water?" \
-n 64 \
-t 4

You should see the model load into memory (takes a few seconds on the CM4) and then start generating tokens. Pay attention to the token generation speed reported at the end. If you’re getting somewhere in the range of 5-10 tokens/second on the Qwen 0.5B model, everything is working as expected.

Now we’ll install llama-server as a systemd service, following the same hardening approach we used for ZeroClaw in Part 1.

First, create a dedicated system user for the inference server:

Terminal window
sudo useradd --system \
--home /var/lib/llama \
--create-home \
--shell /usr/sbin/nologin \
llama
Then install the binary and the model file into place:

Terminal window
# Install the server binary
sudo mkdir -p /opt/llama
sudo cp llama-server /opt/llama/
sudo chown -R llama:llama /opt/llama
sudo chmod 755 /opt/llama/llama-server
sudo ln -sf /opt/llama/llama-server /usr/local/bin/llama-server
# Install the model
sudo mkdir -p /var/lib/llama/models
sudo cp qwen2.5-0.5b-instruct-q4_k_m.gguf /var/lib/llama/models/
sudo chown -R llama:llama /var/lib/llama
Now create the systemd unit file:

Terminal window
sudo nano /etc/systemd/system/llama-server.service

Add the following:

[Unit]
Description=llama.cpp Inference Server
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
User=llama
Group=llama
ExecStart=/usr/local/bin/llama-server \
--model /var/lib/llama/models/qwen2.5-0.5b-instruct-q4_k_m.gguf \
--host 127.0.0.1 \
--port 8080 \
--ctx-size 2048 \
--threads 4 \
--no-webui
WorkingDirectory=/var/lib/llama
Restart=on-failure
RestartSec=10
StandardOutput=journal
StandardError=journal
# Hardening
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/var/lib/llama
RestrictNamespaces=true
RestrictRealtime=true
LockPersonality=true
[Install]
WantedBy=multi-user.target

A few things to note about the configuration:

  • --host 127.0.0.1 binds the server to localhost only. ZeroClaw runs on the same device, so there’s no reason to expose this on the network. Nothing outside the FR201 can reach it.
  • --port 8080 is the port the OpenAI-compatible API will listen on.
  • --ctx-size 2048 limits the context window to 2048 tokens. On the CM4, larger contexts eat into your limited RAM fast. For most edge tasks (parsing sensor data, generating short responses), 2048 is plenty.
  • --threads 4 uses all four Cortex-A72 cores. Since ZeroClaw itself is lightweight and mostly idle between requests, giving all cores to llama-server during inference is the right call.
  • --no-webui disables the built-in web interface. We don’t need it on a headless edge device.

Enable and start the service:

Terminal window
sudo systemctl daemon-reload
sudo systemctl enable llama-server
sudo systemctl start llama-server

Check that it’s running:

Terminal window
systemctl is-active llama-server
journalctl -u llama-server -n 50 --no-pager

Once the model is loaded (give it 10-15 seconds on the CM4), test the API endpoint:

Terminal window
curl -s http://127.0.0.1:8080/v1/models | head -20

You should see a JSON response listing the loaded model. Now test an actual completion:

Terminal window
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen2.5-0.5b-instruct",
"messages": [
{"role": "user", "content": "What is the boiling point of water in Fahrenheit?"}
],
"max_tokens": 64
}'

If you get a JSON response with the model’s answer, the inference server is working correctly.
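If you want a rough feel for end-to-end latency through the API (rather than the raw tokens/second figure llama-cli reports), wrapping the same kind of request in time is a quick check:

Terminal window
# Rough end-to-end latency check; expect several seconds on the CM4
time curl -s http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "qwen2.5-0.5b-instruct", "messages": [{"role": "user", "content": "Reply with the word OK."}], "max_tokens": 8}'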

Now we need to reconfigure ZeroClaw to use the local llama-server endpoint instead of Anthropic’s API.

Edit the ZeroClaw configuration:

Terminal window
sudo nano /var/lib/zeroclaw/.zeroclaw/config.toml

Update the configuration to point at the local server:

default_provider = "openai-compatible"
default_model = "qwen2.5-0.5b-instruct"
api_base = "http://127.0.0.1:8080/v1"
api_key = "not-needed"

Since llama-server runs on localhost with no authentication, the api_key value doesn’t matter, but the field may still be required by ZeroClaw’s config parser, so we set it to a placeholder.

Restart ZeroClaw to pick up the new configuration:

Terminal window
sudo systemctl restart zeroclaw

Since ZeroClaw now depends on llama-server being available, we should update the ZeroClaw systemd unit to reflect that dependency. Edit the service file:

Terminal window
sudo nano /etc/systemd/system/zeroclaw.service

Update the [Unit] section to include the llama-server dependency:

[Unit]
Description=ZeroClaw Service
After=network-online.target llama-server.service
Wants=network-online.target
Requires=llama-server.service

The Requires=llama-server.service directive tells systemd that ZeroClaw shouldn’t run without llama-server: if llama-server fails to start, ZeroClaw won’t be started, and if llama-server is stopped, ZeroClaw is stopped with it. The After=llama-server.service ordering ensures llama-server is started first.

Reload and restart:

Terminal window
sudo systemctl daemon-reload
sudo systemctl restart zeroclaw
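To confirm systemd has picked up the new dependency, you can list ZeroClaw’s dependency tree:

Terminal window
# llama-server.service should now appear among zeroclaw's dependencies
systemctl list-dependencies zeroclaw | grep llama-server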
Finally, run an end-to-end test through ZeroClaw:

Terminal window
sudo -u zeroclaw -s
zeroclaw agent -m "Summarize the current system uptime in one sentence."

The response will be noticeably slower than when using the Anthropic API, but if you get a coherent answer, the full local stack is working. No internet required.

There are a few things you can tweak to squeeze more performance out of the CM4:

Reduce context size. If your use case only needs short prompts and responses, drop --ctx-size to 512 or 1024. Smaller context means less memory used for the KV cache, which leaves more room for the model and OS.
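With the systemd setup above, the cleanest way to change the flag is a drop-in override rather than editing the unit file in place. A minimal sketch (the ExecStart line must be cleared before it can be redefined):

Terminal window
# Opens an editor for an override file under /etc/systemd/system/llama-server.service.d/
sudo systemctl edit llama-server

Then add:

[Service]
ExecStart=
ExecStart=/usr/local/bin/llama-server \
--model /var/lib/llama/models/qwen2.5-0.5b-instruct-q4_k_m.gguf \
--host 127.0.0.1 \
--port 8080 \
--ctx-size 1024 \
--threads 4 \
--no-webui

Save, then run sudo systemctl restart llama-server to apply it.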

Use a smaller quantization. If you’re willing to trade some quality for speed, use the Q2_K or IQ2_M quantization variants instead of Q4_K_M. They’re smaller and faster to process, though the output quality takes a hit.
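For example, a Q2_K build of the same Qwen model can be pulled the same way. The exact filename below is an assumption; check the repository’s file listing before downloading:

Terminal window
# Filename is assumed; verify it against the repo's file list first
huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct-GGUF \
qwen2.5-0.5b-instruct-q2_k.gguf \
--local-dir ./models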

Overclock the CM4. The Cortex-A72 can be overclocked from 1.5 GHz to around 2.0 GHz by editing the boot configuration. This gives a meaningful speed boost for inference, but make sure your FR201 has adequate thermal headroom. Add the following to /boot/firmware/config.txt (or /boot/config.txt on releases older than Bookworm):

over_voltage=6
arm_freq=2000

This is at your own risk, and you should monitor temperatures with vcgencmd measure_temp under load. The FR201’s fanless design handles stock clocks well, but overclocking generates more heat.
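An easy way to keep an eye on it while the model is generating:

Terminal window
# Poll the SoC temperature every 2 seconds while llama-server is busy
watch -n 2 vcgencmd measure_temp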

Disable unnecessary services. Every bit of RAM and CPU that goes to background services is RAM and CPU that isn’t doing inference. Disable anything you’re not using:

Terminal window
sudo systemctl disable bluetooth
sudo systemctl disable avahi-daemon
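If you’re not sure what’s safe to disable, start by looking at what’s actually running and roughly what it costs. Note that systemd-cgtop only shows the memory column when memory accounting is enabled, which should be the default on current Raspberry Pi OS:

Terminal window
# List running services
systemctl list-units --type=service --state=running
# One-shot snapshot of per-unit resource use, heaviest memory users first
systemd-cgtop --order=memory -n 1 --batch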

With two systemd services running, you’ll want a quick way to check on both:

Terminal window
# Check status of both services
systemctl is-active zeroclaw llama-server
# View recent logs from both services interleaved
journalctl -u zeroclaw -u llama-server --since "1 hour ago" --no-pager

To be clear, this local model setup makes sense in a narrow set of circumstances:

  • Air-gapped environments where the device has no internet access and never will.
  • Data sovereignty requirements where process data cannot leave the device or the local network under any circumstances.
  • Extremely latency-sensitive tasks where the round-trip to a cloud API is too slow (though at 5-10 tokens/second locally, this only applies if you need the first token fast and the response is very short).
  • Experimentation and learning. If you’re an OT developer getting into AI and want to understand how inference works at the metal level, this is an excellent hands-on project.

For everything else, the Anthropic API (as configured in Part 1) will give you dramatically better results, faster responses, and access to far more capable models. The local model is a tool for specific situations, not a general replacement.

The CM4 is the floor, not the ceiling. The same cross-compilation workflow, systemd service configuration, and ZeroClaw integration in this guide will work with faster hardware as it becomes available. A future CM5-based OnLogic device with more RAM and a faster processor could comfortably run 3B-7B parameter models at usable speeds, which opens the door to much more capable local inference on the plant floor.

The important thing is that the architecture is right. One static binary for the inference server, one static binary for the agent, both managed by systemd, both hardened, no dependencies. The model file is the only thing that changes when you upgrade.