Private LLM stack

Your data. Your model.
Your machine.

Clinical cohorts. Embargoed sequencing data. Industry collaborations under NDA. Some of your work simply cannot leave your network. Operon was built for that reality from day one.

Ollama · local vLLM · on-prem LM Studio · desktop 40+ OpenAI-compat backends
Pick your stack

Three paths to a private model.

Easiest

Ollama

A single command installs a daemon that serves local models on an OpenAI-compatible endpoint. Perfect for laptops and lab workstations with a GPU.

curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.1:8b
ollama serve
Runs on: Mac · Windows · Linux Hardware: CPU ok · GPU preferred
Most performant

vLLM

Production-grade serving with paged attention, speculative decoding, and tensor parallelism. Drop it on your lab's A100 / H100 node and share across the team.

pip install vllm

vllm serve Qwen/Qwen2.5-Coder-32B-Instruct \
  --host 0.0.0.0 --port 8000 \
  --tensor-parallel-size 2
Runs on: Linux · NVIDIA GPUs Hardware: A100 / H100 / 4090+
Most UI-friendly

LM Studio

A desktop GUI for browsing, pulling, and running GGUF models. Exposes the same OpenAI-compatible endpoint — toggle it in the Server tab.

# Install LM Studio, then:
# Server tab → toggle "Start Server"
# Default: http://localhost:1234/v1
Runs on: Mac · Windows · Linux Hardware: Apple Silicon · NVIDIA · AMD
OpenAI-compatible bridge

One config,
forty backends.

Operon ships a translation proxy that turns any OpenAI-compatible endpoint into a first-class Claude backend. Point it at anything that speaks /v1/chat/completions and Operon handles the rest — streaming, tool-calling, system prompts.

  • LiteLLM — unified gateway to 100+ providers
  • OpenRouter — pay-as-you-go with no lock-in
  • Together · Groq · DeepInfra · Cerebras · Anyscale
  • Self-hosted Ollama / vLLM / LM Studio / llama.cpp
# ~/.operon/backends.toml

[[backend]]
name = "local-ollama"
url  = "http://localhost:11434/v1"
model = "llama3.1:8b"

[[backend]]
name = "lab-vllm"
url  = "http://gpu-node-01:8000/v1"
model = "Qwen2.5-Coder-32B"
api_key_env = "VLLM_TOKEN"

[[backend]]
name = "openrouter"
url  = "https://openrouter.ai/api/v1"
model = "anthropic/claude-sonnet-4"
api_key_env = "OPENROUTER_KEY"
Built for private data

Zero telemetry. OS-grade secret storage.

Secrets in the OS keychain

API keys live in macOS Keychain, Windows Credential Manager, or libsecret on Linux — never in plain-text config files.

No telemetry

Operon collects nothing. Not a session count, not a crash breadcrumb, not a ping. The source is on GitHub — verify it yourself.

Reverse-tunnel to your cluster

Host a 70B model on an A100 node, tunnel it over SSH, and query it from your laptop like it's localhost. Fully documented.

Per-session backend switching

Switch between local Ollama, on-prem vLLM, and cloud Claude per-session — not per-install. Granular audit trail included.

Streaming & tool calls

The translation proxy preserves streaming tokens and tool-calls end-to-end — local models feel as responsive as hosted ones.

Logs stay local

Every prompt, tool call, and response is logged to ~/.operon/logs/. Your data, on your disk — not Anthropic's, not ours.

Runs fully air-gapped

No license server, no update pings, no anonymous analytics. Unplug the Ethernet cable and Operon still starts, still runs Ollama, still executes protocols.

Per-model token budgets

Set hard ceilings on tokens-per-session and dollars-per-month. Operon will stop and ask before it crosses a limit — no surprise invoices from cloud backends.

Recipe: Cluster-hosted 70B

70B on your cluster,
querying from your laptop.

One of the most-requested patterns. Here's the full walkthrough — no magic, just SSH and vLLM.

1

Start vLLM on the GPU node

ssh gpu-node
vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --host 127.0.0.1 --port 8000 \
  --tensor-parallel-size 4
2

Tunnel from your laptop

ssh -N -L 8000:localhost:8000 \
  -J login.hpc.edu gpu-node.hpc.edu

The -J (ProxyJump) flag handles the login-node bounce automatically.

3

Point Operon at localhost

# ~/.operon/backends.toml

[[backend]]
name = "lab-70b"
url  = "http://localhost:8000/v1"
model = "meta-llama/Llama-3.3-70B-Instruct"

Pick lab-70b from Operon's model selector. Every chat now runs on your cluster's GPU.

Your network. Your rules.

Free. Open-source. Works fully offline.