Ollama
A single command installs a daemon that serves local models on an OpenAI-compatible endpoint. Perfect for laptops and lab workstations with a GPU.
curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.1:8b
ollama serve
Clinical cohorts. Embargoed sequencing data. Industry collaborations under NDA. Some of your work simply cannot leave your network. Operon was built for that reality from day one.
A single command installs a daemon that serves local models on an OpenAI-compatible endpoint. Perfect for laptops and lab workstations with a GPU.
curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.1:8b
ollama serve
Production-grade serving with paged attention, speculative decoding, and tensor parallelism. Drop it on your lab's A100 / H100 node and share across the team.
pip install vllm
vllm serve Qwen/Qwen2.5-Coder-32B-Instruct \
--host 0.0.0.0 --port 8000 \
--tensor-parallel-size 2
A desktop GUI for browsing, pulling, and running GGUF models. Exposes the same OpenAI-compatible endpoint — toggle it in the Server tab.
# Install LM Studio, then:
# Server tab → toggle "Start Server"
# Default: http://localhost:1234/v1
Operon ships a translation proxy that turns any OpenAI-compatible endpoint into a first-class Claude backend. Point it at anything that speaks /v1/chat/completions and Operon handles the rest — streaming, tool-calling, system prompts.
# ~/.operon/backends.toml
[[backend]]
name = "local-ollama"
url = "http://localhost:11434/v1"
model = "llama3.1:8b"
[[backend]]
name = "lab-vllm"
url = "http://gpu-node-01:8000/v1"
model = "Qwen2.5-Coder-32B"
api_key_env = "VLLM_TOKEN"
[[backend]]
name = "openrouter"
url = "https://openrouter.ai/api/v1"
model = "anthropic/claude-sonnet-4"
api_key_env = "OPENROUTER_KEY"
API keys live in macOS Keychain, Windows Credential Manager, or libsecret on Linux — never in plain-text config files.
Operon collects nothing. Not a session count, not a crash breadcrumb, not a ping. The source is on GitHub — verify it yourself.
Host a 70B model on an A100 node, tunnel it over SSH, and query it from your laptop like it's localhost. Fully documented.
Switch between local Ollama, on-prem vLLM, and cloud Claude per-session — not per-install. Granular audit trail included.
The translation proxy preserves streaming tokens and tool-calls end-to-end — local models feel as responsive as hosted ones.
Every prompt, tool call, and response is logged to ~/.operon/logs/. Your data, on your disk — not Anthropic's, not ours.
No license server, no update pings, no anonymous analytics. Unplug the Ethernet cable and Operon still starts, still runs Ollama, still executes protocols.
Set hard ceilings on tokens-per-session and dollars-per-month. Operon will stop and ask before it crosses a limit — no surprise invoices from cloud backends.
One of the most-requested patterns. Here's the full walkthrough — no magic, just SSH and vLLM.
ssh gpu-node
vllm serve meta-llama/Llama-3.3-70B-Instruct \
--host 127.0.0.1 --port 8000 \
--tensor-parallel-size 4
ssh -N -L 8000:localhost:8000 \
-J login.hpc.edu gpu-node.hpc.edu
The -J (ProxyJump) flag handles the login-node bounce automatically.
# ~/.operon/backends.toml
[[backend]]
name = "lab-70b"
url = "http://localhost:8000/v1"
model = "meta-llama/Llama-3.3-70B-Instruct"
Pick lab-70b from Operon's model selector. Every chat now runs on your cluster's GPU.