Private LLM — Operon

Pick your stack

Three paths to a private model.

Easiest

Ollama

A single command installs a daemon that serves local models on an OpenAI-compatible endpoint. Perfect for laptops and lab workstations with a GPU.

curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.1:8b
ollama serve

Runs on: Mac · Windows · Linux Hardware: CPU ok · GPU preferred

Most performant

vLLM

Production-grade serving with paged attention, speculative decoding, and tensor parallelism. Drop it on your lab's A100 / H100 node and share across the team.

pip install vllm

vllm serve Qwen/Qwen2.5-Coder-32B-Instruct \
  --host 0.0.0.0 --port 8000 \
  --tensor-parallel-size 2

Runs on: Linux · NVIDIA GPUs Hardware: A100 / H100 / 4090+

Most UI-friendly

LM Studio

A desktop GUI for browsing, pulling, and running GGUF models. Exposes the same OpenAI-compatible endpoint — toggle it in the Server tab.

# Install LM Studio, then:
# Server tab → toggle "Start Server"
# Default: http://localhost:1234/v1

Runs on: Mac · Windows · Linux Hardware: Apple Silicon · NVIDIA · AMD

OpenAI-compatible bridge

One config,
forty backends.

Operon ships a translation proxy that turns any OpenAI-compatible endpoint into a first-class Claude backend. Point it at anything that speaks /v1/chat/completions and Operon handles the rest — streaming, tool-calling, system prompts.

LiteLLM — unified gateway to 100+ providers
OpenRouter — pay-as-you-go with no lock-in
Together · Groq · DeepInfra · Cerebras · Anyscale
Self-hosted Ollama / vLLM / LM Studio / llama.cpp

# ~/.operon/backends.toml

[[backend]]
name = "local-ollama"
url  = "http://localhost:11434/v1"
model = "llama3.1:8b"

[[backend]]
name = "lab-vllm"
url  = "http://gpu-node-01:8000/v1"
model = "Qwen2.5-Coder-32B"
api_key_env = "VLLM_TOKEN"

[[backend]]
name = "openrouter"
url  = "https://openrouter.ai/api/v1"
model = "anthropic/claude-sonnet-4"
api_key_env = "OPENROUTER_KEY"

Built for private data

Zero telemetry. OS-grade secret storage.

Secrets in the OS keychain

API keys live in macOS Keychain, Windows Credential Manager, or libsecret on Linux — never in plain-text config files.

No telemetry

Operon collects nothing. Not a session count, not a crash breadcrumb, not a ping. The source is on GitHub — verify it yourself.

Reverse-tunnel to your cluster

Host a 70B model on an A100 node, tunnel it over SSH, and query it from your laptop like it's localhost. Fully documented.

Per-session backend switching

Switch between local Ollama, on-prem vLLM, and cloud Claude per-session — not per-install. Granular audit trail included.

Streaming & tool calls

The translation proxy preserves streaming tokens and tool-calls end-to-end — local models feel as responsive as hosted ones.

Logs stay local

Every prompt, tool call, and response is logged to ~/.operon/logs/. Your data, on your disk — not Anthropic's, not ours.

Runs fully air-gapped

No license server, no update pings, no anonymous analytics. Unplug the Ethernet cable and Operon still starts, still runs Ollama, still executes protocols.

Per-model token budgets

Set hard ceilings on tokens-per-session and dollars-per-month. Operon will stop and ask before it crosses a limit — no surprise invoices from cloud backends.

Recipe: Cluster-hosted 70B

70B on your cluster,
querying from your laptop.

One of the most-requested patterns. Here's the full walkthrough — no magic, just SSH and vLLM.

1

Start vLLM on the GPU node

ssh gpu-node
vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --host 127.0.0.1 --port 8000 \
  --tensor-parallel-size 4

2

Tunnel from your laptop

ssh -N -L 8000:localhost:8000 \
  -J login.hpc.edu gpu-node.hpc.edu

The -J (ProxyJump) flag handles the login-node bounce automatically.

3

Point Operon at localhost

# ~/.operon/backends.toml

[[backend]]
name = "lab-70b"
url  = "http://localhost:8000/v1"
model = "meta-llama/Llama-3.3-70B-Instruct"

Pick lab-70b from Operon's model selector. Every chat now runs on your cluster's GPU.

Your network. Your rules.

Free. Open-source. Works fully offline.

Download Operon Read the docs