# Running a Local Model on the OnLogic FR201 with llama.cpp and ZeroClaw

This is Part 2 of the ZeroClaw on the OnLogic FR201 series. In [Part 1](../onlogic-zeroclaw/), we deployed ZeroClaw as a static Rust binary managed by systemd, pointed at Anthropic's API. That's the recommended setup for most use cases. But what if you can't make outbound API calls?

In some manufacturing environments, the devices on the plant floor are completely air-gapped. No internet access, no cloud endpoints, period. Maybe it's a regulatory requirement, maybe it's an IT policy, or maybe you're running in a classified facility. Whatever the reason, you still want an agent that can process data locally.

This guide walks through running a small language model directly on the FR201 using [llama.cpp](https://github.com/ggml-org/llama.cpp), served as an OpenAI-compatible API endpoint, and pointing ZeroClaw at it instead of Anthropic's cloud API.

## A Realistic Disclaimer

Let's be upfront about this: the FR201 was not designed to run language models. It's built around a Raspberry Pi Compute Module 4 with a quad-core Cortex-A72 at 1.5 GHz and a maximum of 8 GB of LPDDR4 RAM. There's no NPU, no GPU acceleration for inference, and the CPU is an older architecture compared to what's available today. Everything runs on the CPU.

What does that actually mean in practice? With the models that fit in 8 GB of RAM, you're looking at roughly 1-8 tokens per second depending on the model size and quantization level. For reference, a typical API call to Claude returns tokens at 50-100+ per second. So this isn't a replacement for a cloud model. It's a fallback for when a cloud model isn't an option.
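To make those numbers concrete, here's a quick back-of-envelope calculation of how long a 100-token response takes at the speeds mentioned above (integer division, so treat the results as rough):

```bash
# Seconds to generate a 100-token response at different generation speeds:
# ~2 tok/s (larger local models), ~8 tok/s (Qwen 0.5B), ~60 tok/s (cloud API)
for tps in 2 8 60; do
  echo "${tps} tok/s -> ~$((100 / tps)) s per 100-token response"
done
```

Even at the best local speed, a short reply takes on the order of ten seconds, which is why the use cases below lean toward short, structured outputs.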

That said, this is still a worthwhile exercise. If OnLogic releases an FR201 successor built on the CM5 (which has a faster Cortex-A76 and potentially more RAM), these same steps will apply and the performance picture gets a lot more interesting. Consider this a foundation you can build on as the hardware improves.

## Choosing a Model

With 8 GB of total system RAM (shared between the OS, llama.cpp, the model weights, and the KV cache), we need to be selective. Here are the realistic options, ordered from most to least comfortable:

**Qwen2.5 0.5B Instruct (Q4_K_M)** — ~400 MB on disk, uses under 1 GB of RAM. This is the sweet spot for the CM4. Benchmarks on comparable Pi 4 hardware show around 8 tokens/second, which is genuinely usable for short tasks like parsing sensor data, classifying alerts, or generating structured responses. Qwen2.5 punches well above its weight for a 0.5B model.

**TinyLlama 1.1B Chat (Q4_K_M)** — ~669 MB on disk, uses about 1 GB of RAM. A solid step up in capability from the 0.5B models. Expect roughly 2-4 tokens/second on the CM4. Good enough for simple summarization and structured output tasks, but you'll feel the latency on anything longer than a sentence or two.

**Qwen2.5 1.5B Instruct (Q4_K_M)** — ~1 GB on disk, uses about 1.5 GB of RAM. Better reasoning ability, but you're down to 1-2 tokens/second on the CM4. Still fits comfortably in memory with room for the OS and ZeroClaw, but response times start to stretch.

**Phi-3 Mini 3.8B (Q4_K_M)** — ~2.2 GB on disk, uses about 3.5 GB of RAM. This is the upper boundary of what fits. You'll have about 4 GB left for the OS and KV cache, which is tight. Performance will be well under 1 token/second. Only worth considering if you need the reasoning capability and can tolerate very slow responses.

For this guide, we'll use **Qwen2.5 0.5B Instruct** as the primary example since it offers the best balance of speed and capability on this hardware. We'll note where the steps differ if you want to swap in a larger model.
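Before committing to a model, it's worth checking how much memory is actually free on the device. Here's a rough sketch; the ~512 MB safety margin is our own rule of thumb, not an official figure:

```bash
# Rough headroom check: compare MemAvailable against the model's
# estimated RAM use (values from the list above, in MB)
model_mb=1000   # e.g. ~1 GB for Qwen2.5 0.5B including KV cache
avail_mb=$(awk '/MemAvailable/ {print int($2/1024)}' /proc/meminfo)
echo "available: ${avail_mb} MB, model needs: ~${model_mb} MB"
if [ "$avail_mb" -gt $((model_mb + 512)) ]; then
  echo "comfortable headroom"
else
  echo "tight fit: consider a smaller model or quantization"
fi
```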

## Prerequisites

This guide assumes you've already completed Part 1 and have a working FR201 with:

- Raspberry Pi OS Lite (64-bit) installed
- SSH key authentication configured
- The `claw` user account
- ZeroClaw installed and running as a systemd service

## Cross-Compiling llama.cpp

Just like we did with ZeroClaw, we're going to cross-compile llama.cpp on our workstation to produce a static ARM64 binary. No Docker, no Python, no pip installs on the FR201. Just a binary.

### Install Build Dependencies

You'll need CMake and the same musl cross-compilation toolchain from Part 1.

On macOS:

```bash
brew install cmake
```

On Windows:

```powershell
winget install --id=Kitware.CMake -e
```

Make sure you still have the `aarch64-linux-musl-gcc` toolchain installed from Part 1:

```bash
which aarch64-linux-musl-gcc
```

### Clone and Build

```bash
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
```

Create a CMake toolchain file for cross-compilation. Save this as `aarch64-musl-toolchain.cmake` in the project root:

```cmake
set(CMAKE_SYSTEM_NAME Linux)
set(CMAKE_SYSTEM_PROCESSOR aarch64)

set(CMAKE_C_COMPILER aarch64-linux-musl-gcc)
set(CMAKE_CXX_COMPILER aarch64-linux-musl-g++)

set(CMAKE_FIND_ROOT_PATH_MODE_PROGRAM NEVER)
set(CMAKE_FIND_ROOT_PATH_MODE_LIBRARY ONLY)
set(CMAKE_FIND_ROOT_PATH_MODE_INCLUDE ONLY)
```

Now build with static linking and ARM64 optimizations:

```bash
cmake -B build \
  -DCMAKE_TOOLCHAIN_FILE=aarch64-musl-toolchain.cmake \
  -DCMAKE_BUILD_TYPE=Release \
  -DBUILD_SHARED_LIBS=OFF \
  -DGGML_NATIVE=OFF \
  -DGGML_CPU_AARCH64=ON \
  -DCMAKE_EXE_LINKER_FLAGS="-static"

cmake --build build --config Release -j "$(nproc 2>/dev/null || sysctl -n hw.ncpu)"
```

This will produce the binaries we need in `build/bin/`. The two we care about are:

- `llama-server` — the OpenAI-compatible HTTP server
- `llama-cli` — useful for quick testing from the command line

Verify they're the right architecture:

```bash
file build/bin/llama-server
# Expect: ELF 64-bit LSB executable, ARM aarch64, ... statically linked

file build/bin/llama-cli
# Expect: ELF 64-bit LSB executable, ARM aarch64, ... statically linked
```

## Downloading the Model

We need the GGUF model file. On your workstation, download the Qwen2.5 0.5B Instruct model in Q4_K_M quantization from Hugging Face:

```bash
# If you don't have the Hugging Face CLI installed
pip install huggingface-hub

# Download the model
huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct-GGUF \
  qwen2.5-0.5b-instruct-q4_k_m.gguf \
  --local-dir ./models
```

The file is roughly 400 MB. If you want to use TinyLlama instead:

```bash
huggingface-cli download TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF \
  tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
  --local-dir ./models
```

## Deploying to the FR201

Copy the llama.cpp binaries and the model file to the FR201:

```bash
rsync -avz --progress build/bin/llama-server claw@<IP_ADDRESS>:/home/claw/
rsync -avz --progress build/bin/llama-cli claw@<IP_ADDRESS>:/home/claw/
rsync -avz --progress models/qwen2.5-0.5b-instruct-q4_k_m.gguf claw@<IP_ADDRESS>:/home/claw/
```

The model transfer will take a few minutes depending on your network speed.

SSH into the FR201 and verify everything works:

```bash
ssh claw@<IP_ADDRESS>

# Make the binaries executable
chmod +x llama-server llama-cli

# Quick sanity check
./llama-cli --version
```

### Test the Model

Before setting up the service, run a quick test to make sure the model loads and generates output:

```bash
./llama-cli \
  -m qwen2.5-0.5b-instruct-q4_k_m.gguf \
  -p "What is the boiling point of water?" \
  -n 64 \
  -t 4
```

You should see the model load into memory (takes a few seconds on the CM4) and then start generating tokens. Pay attention to the token generation speed reported at the end. If you're getting somewhere in the range of 5-10 tokens/second on the Qwen 0.5B model, everything is working as expected.

## Setting Up the Inference Server

Now we'll install llama-server as a systemd service, following the same hardening approach we used for ZeroClaw in Part 1.

### Create a Service User

```bash
sudo useradd --system \
  --home /var/lib/llama \
  --create-home \
  --shell /usr/sbin/nologin \
  llama
```

### Install the Binary and Model

```bash
# Install the server binary
sudo mkdir -p /opt/llama
sudo cp llama-server /opt/llama/
sudo chown -R llama:llama /opt/llama
sudo chmod 755 /opt/llama/llama-server
sudo ln -sf /opt/llama/llama-server /usr/local/bin/llama-server

# Install the model
sudo mkdir -p /var/lib/llama/models
sudo cp qwen2.5-0.5b-instruct-q4_k_m.gguf /var/lib/llama/models/
sudo chown -R llama:llama /var/lib/llama
```

### Create the Systemd Service

```bash
sudo nano /etc/systemd/system/llama-server.service
```

Add the following:

```ini
[Unit]
Description=llama.cpp Inference Server
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=llama
Group=llama

ExecStart=/usr/local/bin/llama-server \
  --model /var/lib/llama/models/qwen2.5-0.5b-instruct-q4_k_m.gguf \
  --host 127.0.0.1 \
  --port 8080 \
  --ctx-size 2048 \
  --threads 4 \
  --no-webui

WorkingDirectory=/var/lib/llama

Restart=on-failure
RestartSec=10

StandardOutput=journal
StandardError=journal

# Hardening
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/var/lib/llama
RestrictNamespaces=true
RestrictRealtime=true
LockPersonality=true

[Install]
WantedBy=multi-user.target
```

A few things to note about the configuration:

- **`--host 127.0.0.1`** binds the server to localhost only. ZeroClaw runs on the same device, so there's no reason to expose this on the network. Nothing outside the FR201 can reach it.
- **`--port 8080`** is the port the OpenAI-compatible API will listen on.
- **`--ctx-size 2048`** limits the context window to 2048 tokens. On the CM4, larger contexts eat into your limited RAM fast. For most edge tasks (parsing sensor data, generating short responses), 2048 is plenty.
- **`--threads 4`** uses all four Cortex-A72 cores. Since ZeroClaw itself is lightweight and mostly idle between requests, giving all cores to llama-server during inference is the right call.
- **`--no-webui`** disables the built-in web interface. We don't need it on a headless edge device.

Enable and start the service:

```bash
sudo systemctl daemon-reload
sudo systemctl enable llama-server
sudo systemctl start llama-server
```

Check that it's running:

```bash
systemctl is-active llama-server
journalctl -u llama-server -n 50 --no-pager
```

### Verify the API

Once the model is loaded (give it 10-15 seconds on the CM4), test the API endpoint:

```bash
curl -s http://127.0.0.1:8080/v1/models | head -20
```

You should see a JSON response listing the loaded model. Now test an actual completion:

```bash
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-0.5b-instruct",
    "messages": [
      {"role": "user", "content": "What is the boiling point of water in Fahrenheit?"}
    ],
    "max_tokens": 64
  }'
```

If you get a JSON response with the model's answer, the inference server is working correctly.
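For scripting, the raw JSON is noisy. If `jq` is available on the device (`sudo apt install jq`), you can pull out just the assistant's reply:

```bash
# Same request as above, but extract only the message text
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-0.5b-instruct",
    "messages": [{"role": "user", "content": "Reply with one word: ready or not?"}],
    "max_tokens": 8
  }' | jq -r '.choices[0].message.content'
```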

## Pointing ZeroClaw at the Local Model

Now we need to reconfigure ZeroClaw to use the local llama-server endpoint instead of Anthropic's API.

Edit the ZeroClaw configuration:

```bash
sudo nano /var/lib/zeroclaw/.zeroclaw/config.toml
```

Update the configuration to point at the local server:

```toml
default_provider = "openai-compatible"
default_model = "qwen2.5-0.5b-instruct"
api_base = "http://127.0.0.1:8080/v1"
api_key = "not-needed"
```

Since llama-server runs on localhost with no authentication, the `api_key` value doesn't matter, but the field may still be required by ZeroClaw's config parser, so we set it to a placeholder.

Restart ZeroClaw to pick up the new configuration:

```bash
sudo systemctl restart zeroclaw
```

### Service Startup Order

Since ZeroClaw now depends on llama-server being available, we should update the ZeroClaw systemd unit to reflect that dependency. Edit the service file:

```bash
sudo nano /etc/systemd/system/zeroclaw.service
```

Update the `[Unit]` section to include the llama-server dependency:

```ini
[Unit]
Description=ZeroClaw Service
After=network-online.target llama-server.service
Wants=network-online.target
Requires=llama-server.service
```

The `Requires=llama-server.service` directive tells systemd that ZeroClaw can't run without llama-server: if llama-server fails, ZeroClaw is stopped too. Adding `llama-server.service` to `After=` makes systemd start llama-server first. One caveat: `After=` only orders process startup, and llama-server counts as started before the model has finished loading, so a request fired in the first few seconds after boot can still fail until the model is in memory.

Reload and restart:

```bash
sudo systemctl daemon-reload
sudo systemctl restart zeroclaw
```
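To confirm systemd actually picked up the new dependency, you can inspect the resolved unit properties (the `grep` just narrows the output to the lines we care about):

```bash
# Requires= should list llama-server.service, and After= should include it
systemctl show zeroclaw -p Requires -p After | grep llama-server
```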

### Test It End-to-End

```bash
sudo -u zeroclaw -s
zeroclaw agent -m "Summarize the current system uptime in one sentence."
```

The response will be noticeably slower than when using the Anthropic API, but if you get a coherent answer, the full local stack is working. No internet required.

## Performance Tuning

There are a few things you can tweak to squeeze more performance out of the CM4:

**Reduce context size.** If your use case only needs short prompts and responses, drop `--ctx-size` to 512 or 1024. Smaller context means less memory used for the KV cache, which leaves more room for the model and OS.
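To see why context size matters, you can estimate the KV cache footprint directly: roughly 2 (K and V) × layers × context length × KV heads × head dimension × 2 bytes (f16). The architecture figures below are our reading of the Qwen2.5 0.5B config and should be checked against the model card:

```bash
# Estimated f16 KV cache size; layer/head figures are assumptions for
# Qwen2.5 0.5B (24 layers, 2 KV heads via GQA, head dim 64)
layers=24; kv_heads=2; head_dim=64; ctx=2048
bytes=$(( 2 * layers * ctx * kv_heads * head_dim * 2 ))
echo "ctx=${ctx}: ~$(( bytes / 1024 / 1024 )) MiB of KV cache"
# Halving --ctx-size halves this number
```

For the 0.5B model the cache is small in absolute terms, but the same formula applied to a larger model with more layers and more KV heads grows quickly, which is where a tight `--ctx-size` starts paying for itself.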

**Use a smaller quantization.** If you're willing to trade some quality for speed, use the Q2_K or IQ2_M quantization variants instead of Q4_K_M. They're smaller and faster to process, though the output quality takes a hit.

**Overclock the CM4.** The Cortex-A72 can be overclocked from 1.5 GHz to around 2.0 GHz by editing the boot config (`/boot/firmware/config.txt` on current Raspberry Pi OS releases, `/boot/config.txt` on older ones). This gives a meaningful speed boost for inference, but make sure your FR201 has adequate thermal headroom. Add the following:

```
over_voltage=6
arm_freq=2000
```

This is at your own risk, and you should monitor temperatures with `vcgencmd measure_temp` under load. The FR201's fanless design handles stock clocks well, but overclocking generates more heat.
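When testing an overclock, it helps to log thermals while the model is actually generating. A simple bounded logging loop you can run in a second SSH session during inference:

```bash
# Log time, SoC temperature, and ARM clock every 5 s for one minute
for i in $(seq 1 12); do
  printf '%s %s %s\n' "$(date +%T)" "$(vcgencmd measure_temp)" \
    "$(vcgencmd measure_clock arm)"
  sleep 5
done
```

If the reported clock drops below your configured `arm_freq` under sustained load, the SoC is thermal-throttling and the overclock isn't buying you anything.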

**Disable unnecessary services.** Every bit of RAM and CPU that goes to background services is RAM and CPU that isn't doing inference. Disable anything you're not using:

```bash
sudo systemctl disable bluetooth
sudo systemctl disable avahi-daemon
```

## Monitoring Both Services

With two systemd services running, you'll want a quick way to check on both:

```bash
# Check status of both services
systemctl is-active zeroclaw llama-server

# View recent logs from both services interleaved
journalctl -u zeroclaw -u llama-server --since "1 hour ago" --no-pager
```

## When to Use This vs. the Cloud API

To be clear, this local model setup makes sense in a narrow set of circumstances:

- **Air-gapped environments** where the device has no internet access and never will.
- **Data sovereignty requirements** where process data cannot leave the device or the local network under any circumstances.
- **Extremely latency-sensitive tasks** where the round-trip to a cloud API is too slow (though at 5-10 tokens/second locally, this only applies if you need the first token fast and the response is very short).
- **Experimentation and learning.** If you're an OT developer getting into AI and want to understand how inference works at the metal level, this is an excellent hands-on project.

For everything else, the Anthropic API (as configured in Part 1) will give you dramatically better results, faster responses, and access to far more capable models. The local model is a tool for specific situations, not a general replacement.

## Looking Ahead

The CM4 is the floor, not the ceiling. The same cross-compilation workflow, systemd service configuration, and ZeroClaw integration in this guide will work with faster hardware as it becomes available. A future CM5-based OnLogic device with more RAM and a faster processor could comfortably run 3B-7B parameter models at usable speeds, which opens the door to much more capable local inference on the plant floor.

The important thing is that the architecture is right. One static binary for the inference server, one static binary for the agent, both managed by systemd, both hardened, no dependencies. The model file is the only thing that changes when you upgrade.