Ornith-1.0 35B as a Local Cursor and Cline Backend in 2026: MIT License, $0 API Cost, and Whether It Matches Claude Sonnet 5
TL;DR: Ornith-1.0 35B is a MoE coding model that activates ~3B parameters per token, fits a single 24 GB GPU at Q4_K_M, and runs your Cursor Chat, Cmd+K, and Cline agent loop at $0 API cost under an MIT license. It does not match Claude Sonnet 5 on the hard agentic benchmarks — 64.2 vs 80.4 on Terminal-Bench 2.1 — but for local, private, ship-anywhere coding on a used RTX 3090, it’s the strongest option that fits.
| Ornith-1.0 35B (local) | Claude Sonnet 5 (cloud) | Ornith-1.0 397B (rented GPU) | |
|---|---|---|---|
| Best for | Private, $0-marginal-cost daily coding on one 24 GB card | Hardest multi-file agentic tasks, top scores | Near-frontier open weights when you can afford the VRAM |
| Cost | Hardware only (~$1,070 used RTX 3090) | $2/$10 per M intro, $3/$15 std after Aug 31 | ~8×80 GB rental or FP8 multi-GPU |
| SWE-Bench Verified (vendor) | 75.6 | 85.2 | 82.4 |
| Terminal-Bench 2.1 (vendor) | 64.2 | 80.4 | 77.5 |
| The catch | Trails Sonnet 5 on agentic tasks; benchmarks are vendor-reported | Metered billing; nothing runs offline | Won’t fit a single consumer GPU |
Honest take: If your reason for going local is privacy, an MIT license you can ship commercial code on, or killing a metered API bill, the 35B on a 24 GB card is the pick — it clears the daily-driver bar for most edits and refactors. If you need the best possible agent on a genuinely hard 200-step task, Sonnet 5 is still ahead by a real margin, and you should pay for it.
Why Ornith-1.0 is worth wiring in at all
Most open-weight coding models released in 2026 are either too big to run locally (Kimi K2.7 at 1T params, GLM 5.2 at 743B) or good-but-generic 7B–32B dense models that plateau on multi-step agent tasks. Ornith-1.0, released by DeepReinforce on June 25, 2026, is interesting for two concrete reasons.
First, the architecture. The family ships in four sizes — 9B dense, 31B dense, 35B MoE, and a 397B MoE flagship — all post-trained on Gemma 4 and Qwen 3.5 bases. The 35B is a mixture-of-experts build that activates roughly 3B parameters per token, so it runs at small-model speed while carrying the knowledge of a much larger network. That’s the combination that matters for a local backend: you want 30B-class quality at 3B-class latency, because Cursor’s Chat and Cline’s agent loop are chatty.
Second, the license. Every Ornith checkpoint is MIT, with no regional restrictions. After the June 12 export-control mess that took Claude Fable 5 and Mythos 5 offline globally with zero notice, “MIT and downloadable” is not a footnote — it’s the whole point. Nobody can revoke a model that lives on your own SSD.
The training method (DeepReinforce calls it self-scaffolding RL — the model learns to write its own test harness and tool-use loop during training, then jointly optimizes that scaffold alongside the solution) is a genuinely novel research angle, but for a working developer it’s a curiosity. What you care about is: does it run on my GPU, does it wire into my editor, and is it good enough. Let’s answer those.
The VRAM math, honestly
The 35B MoE has 35B total parameters. At Q4_K_M quantization, the GGUF that DeepReinforce publishes is 21.2 GB on disk, which loads into roughly the same VRAM footprint plus a KV cache. On a 24 GB card that leaves ~2–3 GB for context — enough for an 8K–16K working window, which covers most single-task Cursor and Cline sessions but not a whole-repo dump.
| GPU / VRAM | Realistic Ornith setup | Notes |
|---|---|---|
| 24 GB (RTX 3090 / RTX 4090) | 35B MoE @ Q4_K_M (21.2 GB) | The sweet spot; ~2–3 GB left for KV at 8–16K ctx |
| 32 GB+ (RTX 5090 / dual card) | 35B MoE @ Q5_K_M (~25 GB) | More headroom, longer context |
| 8–16 GB (RTX 4060/4070) | 9B dense @ Q4 | Usable for explain/small edits; modest expectations |
| 8×80 GB (rented) | 397B MoE FP8 via vLLM | Near-frontier, but not a home setup |
The used RTX 3090 remains the value pick here: 24 GB, 936 GB/s of memory bandwidth, and a used average around $1,070 as of June 2026. Amortized over a year against a $50–100/month cloud coding bill, the card pays for itself in under a year and then runs at zero marginal cost. For the full GPU-by-VRAM breakdown across every Ornith size, runaihome.com’s Ornith-1.0 GPU guide does the hardware analysis this article won’t repeat.
On throughput: with ~3B active parameters, expect roughly 90–120 tokens/sec on an RTX 3090 at Q4_K_M for generation. Treat that as an estimate — it’s extrapolated from comparable ~3B-active MoE models, not an independently measured Ornith number, because no community tok/s benchmark on this exact model existed at time of writing.
Step 1 — Serve Ornith locally
There are two clean paths. Ollama is the fastest to stand up; llama.cpp’s server gives you an explicit OpenAI-compatible endpoint and a context flag you control.
Ollama (simplest): Ornith is in the official Ollama library, so you don’t even need the Hugging Face path.
$ ollama pull ornith:35b
pulling manifest
pulling 4f2a... 100% ▕████████████████▏ 21 GB
verifying sha256 digest
writing manifest
success
$ ollama run ornith:35b "write a python function that reverses words in a string"
<think>
The user wants word-order reversal, not character reversal. Split on
whitespace, reverse the list, join with a single space.
</think>
def reverse_words(s: str) -> str:
return " ".join(s.split()[::-1])
If you’d rather pull the exact GGUF DeepReinforce publishes rather than the community library tag, use the Hugging Face reference directly:
$ ollama run hf.co/deepreinforce-ai/Ornith-1.0-35B-GGUF
llama.cpp (explicit OpenAI endpoint): this gives you a localhost:8000/v1 endpoint and a context-length flag, which is what Cursor and Cline talk to.
$ llama-server -hf deepreinforce-ai/Ornith-1.0-35B-GGUF --port 8000 -c 32768
...
main: server is listening on http://127.0.0.1:8000 - starting the main loop
Note I set -c 32768 rather than the model’s full 262144 (256K) context. On a 24 GB card you cannot fit a 256K KV cache alongside 21 GB of weights — a 32K window is the honest practical ceiling for this hardware. If you need the full 256K, you’re on a rented multi-GPU node running the FP8 build with vLLM, not a home rig.
Step 2 — Wire it into Cursor
Cursor’s AI features split into two systems, and this distinction trips up everyone the first time. Chat, Cmd+K, and Agent mode use the OpenAI API format and are called from the Cursor client on your machine — those you can point at a local model. Tab autocomplete runs through Cursor’s own proprietary server-side FIM model and cannot be swapped, no matter what base URL you set. If Tab is 80% of your Cursor value, a local backend adds nothing; see our Cursor + Ollama setup guide for the full explanation of that wall.
To route Chat/Cmd+K/Agent through Ornith:
- Open Settings → Models.
- Turn on Override OpenAI Base URL and set it to your local endpoint:
http://localhost:8000/v1(llama.cpp) orhttp://localhost:11434/v1(Ollama). - In the API key field, enter any non-empty string —
sk-localworks. Local servers don’t check it, but Cursor requires the field populated. - Add a custom model name matching what you serve:
ornith:35b(Ollama) ordeepreinforce-ai/Ornith-1.0-35B-GGUF(llama.cpp--served-model-name). - Disable the other cloud models in the picker so Cursor doesn’t silently fall back to them.
If you’re on Ollama, set the CORS origin before Cursor will connect, or the first request throws a CORS error:
# macOS/Linux — add to ~/.zshrc or ~/.bashrc
export OLLAMA_ORIGINS="*"
# then restart the Ollama service
Step 3 — Wire it into Cline
Cline is more forgiving because it never had a proprietary Tab model — the whole extension is agent-driven, so 100% of it runs on your chosen backend.
- Open the Cline settings pane in VS Code.
- Set API Provider to OpenAI Compatible.
- Base URL:
http://localhost:11434/v1(Ollama) orhttp://localhost:8000/v1(llama.cpp). - API Key:
ollama(any non-empty string). - Model ID:
ornith:35b.
That’s it. Cline’s plan/act loop will now run entirely locally. For a deeper walkthrough of Cline with local models — including context-window discipline and the tool-call quirks worth knowing — our Cline privacy-first local setup covers the workflow end to end.
The problem I hit: the <think> block leaking into diffs
Ornith is reasoning-first. Every assistant turn opens with a chain-of-thought inside <think>…</think> tags, and the intended behavior is for the serving stack to parse that into a separate reasoning_content field so the editor shows it as collapsible “thinking” and applies only the final answer. vLLM and SGLang do this correctly with --reasoning-parser qwen3.
The trap: Ollama’s GGUF path does not always strip the reasoning block the way vLLM does. When I first wired the 35B into Cline via Ollama, the agent occasionally pasted the entire <think>…</think> monologue into the top of a file before the actual code, because Cline treated the reasoning text as part of the edit payload. Cmd+K in Cursor did the same on longer prompts.
Two fixes, in order of preference:
- Serve via llama.cpp or vLLM instead of raw Ollama for agent work.
llama-serverwith the model’s built-in chat template handles the reasoning field more reliably than Ollama’s default template did in my testing. - If you stay on Ollama, add a one-line rule in Cursor (
.cursorrules) or Cline’s custom instructions: “Do not include<think>blocks or reasoning commentary in file edits; output only final code in diffs.” This is a workaround, not a real fix — it costs you the visible reasoning trace — but it stopped the leak immediately.
This is the kind of rough edge you accept with a two-week-old model. It’s fixable in minutes once you know it’s there.
How it actually stacks up
Here’s the honest benchmark picture. All Ornith numbers are vendor-reported by DeepReinforce as of July 4, 2026 — there was no independent community SWE-Bench or Aider Polyglot run on these exact checkpoints yet, so weight them accordingly.
| Model | SWE-Bench Verified | Terminal-Bench 2.1 | License | API cost |
|---|---|---|---|---|
| Claude Sonnet 5 | 85.2 | 80.4 | Proprietary | $2/$10 (intro), $3/$15 std |
| Ornith-1.0 397B | 82.4 | 77.5 | MIT | $0 local (needs 8×80 GB) |
| Ornith-1.0 35B | 75.6 | 64.2 | MIT | $0 local (fits 24 GB) |
| Ornith-1.0 9B | 69.4 | 43.1 | MIT | $0 local (fits 8–16 GB) |
The gap is real and you should not pretend otherwise. On SWE-Bench Verified the 35B trails Sonnet 5 by ~10 points; on Terminal-Bench 2.1 — the multi-step agentic shell benchmark — it trails by 16. That second gap is the one that shows up in daily use: on a hard task that needs the agent to run tests, read the failure, and iterate ten times, Sonnet 5 finishes more often.
But the 35B’s 75.6 SWE-Bench Verified is not a toy score. It clears the bar for the bulk of what you actually ask a coding assistant to do — write a function, fix a typed error, refactor a module, explain unfamiliar code, generate tests. For that work, running locally at $0 with an MIT license beats paying per token for marginal quality you won’t notice. The 397B closes most of the gap (82.4 SWE-V) but needs rented multi-GPU hardware, which puts it in the same “you’re paying for compute” bucket as the API — at which point Sonnet 5 is simpler.
If cost per capability is your axis and you don’t need local, a cheap cloud model like DeepSeek V4-Flash at $0.14/M tokens is worth comparing before you commit to a GPU purchase.
FAQ
Can I run the 35B on 16 GB of VRAM? Not at a useful quant. Q4_K_M is 21.2 GB and needs ~24 GB with KV cache. On 16 GB, run the 9B dense model instead — it’s genuinely fine for explanation and single-function edits, just weaker on multi-file work.
Does Ornith work with Cursor’s Tab autocomplete? No. Tab uses Cursor’s proprietary server-side FIM model and can’t be replaced by any local backend. Ornith powers Chat, Cmd+K, and Agent mode only. Cline has no such restriction — all of it runs locally.
Is it really free for commercial use? Yes. All Ornith-1.0 checkpoints are MIT-licensed with no regional restrictions, so you can ship code produced with it and self-host it in a commercial product. That’s the main structural advantage over any cloud model.
Why is the 35B slower to feel “smart” than the benchmark suggests?
Two reasons. The reasoning-first design spends tokens on <think> before answering, so first-token latency is higher. And on a 24 GB card you’re capped near a 32K context, so it can’t hold as much of your repo as a 256K cloud model. Both are hardware tradeoffs, not model flaws.
Should I switch off Claude Sonnet 5 for this? Only if your driver is privacy, licensing, or a metered bill you want to kill. On raw capability for the hardest tasks, Sonnet 5 is still ahead — see our Claude Sonnet 5 coding review. Many developers run both: Ornith local for the 80% of routine work, Sonnet 5 for the 20% that’s genuinely hard.
Sources
- Ornith-1.0-35B model card — DeepReinforce (Hugging Face)
- Ornith-1.0-35B-GGUF quantized weights — DeepReinforce (Hugging Face)
- Ornith-1 repository and serving recipes — DeepReinforce (GitHub)
- DeepReinforce Releases Ornith-1.0 — MarkTechPost
- Open-Source Coding Model Ornith-1.0 Writes Its Own Training Scaffold — TechTimes
- Ornith-1.0 for Local AI: Which GPU Runs It? — runaihome.com
- Claude Sonnet 5 benchmarks (SWE-Bench Verified, Terminal-Bench 2.1) — Vellum
- Claude Sonnet 5 launch and pricing — Anthropic
Last updated July 4, 2026. Ornith-1.0 benchmark figures are vendor-reported by DeepReinforce and were not yet independently reproduced at time of writing; pricing and model availability change frequently — verify current state before purchasing.
Recommended Gear
- RTX 3090 (used, 24 GB) — the value pick for running the 35B MoE at Q4_K_M
- RTX 4090 (24 GB) — faster generation on the same 24 GB envelope
Was this article helpful?
Thanks for the feedback — it helps improve future articles.