Ollama's Default Context Silently Breaks Cline, Continue.dev, and Aider — the num_ctx Fix in 2026

TL;DR: Ollama truncates any prompt longer than its context window from the top, silently, with no error. On a laptop GPU that window defaults to 4,096 tokens — smaller than a single Cline system prompt. That’s why your local coding agent “forgets” instructions, loops, or edits the wrong file. Set num_ctx to at least 32K (64K if you can afford the VRAM) and the same model goes from broken to usable.

| | Cline | Continue.dev | Aider |
|---|---|---|---|
| Where to set context | Model settings → context window, or a Modelfile | defaultCompletionOptions.contextLength in config.yaml | .aider.model.settings.yml → extra_params.num_ctx |
| Silent-truncation symptom | Agent loops, re-reads files, “forgets” the task | Chat answers ignore the open file after a few turns | Edits drift, repo map gets dropped |
| Safe value for coding | 32K–64K | 16K–32K (chat) / higher for agent mode | 32K–65K |
| Most reliable fix | Bake num_ctx into a Modelfile | Same Modelfile + set contextLength to match | Same Modelfile + extra_params.num_ctx |

Honest take: Don’t rely on each tool’s context setting alone — bake num_ctx into a custom Ollama model with a Modelfile, then set the matching number in the client. That’s the only way to know all three tools are actually using the window you think they are.

The bug that isn’t a bug

You wire up a local model — say qwen2.5-coder:14b — point Cline at Ollama, and it works for the first message. Then it starts re-reading files it already read, “forgetting” the rule you gave it two turns ago, or editing a function that isn’t the one you asked about. Swap in the cloud Claude Sonnet 5 backend and the same task runs clean. So you blame the local model’s quality.

It’s usually not the model. It’s Ollama’s context window quietly throwing away most of what you sent.

Ollama caps how many tokens a model will actually attend to with a parameter called num_ctx. When your prompt plus conversation history plus tool output exceeds that number, Ollama does not error out and does not warn you. It truncates from the beginning of the prompt and keeps going. Aider’s own documentation is blunt about it: Ollama “silently discards context that exceeds the window,” and “many users don’t even realize that most of their data is being discarded.”

For a chatbot, dropping the oldest few messages is survivable. For a coding agent, the beginning of the prompt is where the system prompt, your rules, and the task description live. Truncate that and the agent is flying blind — which looks exactly like a dumb model.
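
To make the failure concrete, here is a toy simulation of the behavior described above. It is not Ollama's code: the one-word-per-token "tokenizer" and the function name truncate_from_start are assumptions made purely for illustration.

```python
def truncate_from_start(tokens, num_ctx):
    """Keep only the newest num_ctx tokens, dropping the oldest ones first."""
    if num_ctx <= 0:
        return []
    return tokens[-num_ctx:]


# Toy "tokens": one word per token. Real tokenizers split text differently.
system_prompt = ["SYSTEM:", "never", "edit", "files", "outside", "src/"]
history = [f"tool_output_{i}" for i in range(20)]
prompt = system_prompt + history

kept = truncate_from_start(prompt, num_ctx=16)
print(len(prompt), len(kept))  # 26 16
print("SYSTEM:" in kept)       # False: the rule fell off the front
```

Everything that survives is tool output. The instruction that was supposed to govern the whole session is gone, and nothing in the result says so.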

Why the window is so small by default

Here’s the part most 2026 write-ups still get wrong. They tell you Ollama defaults to 2,048 tokens. That was true historically, and it’s the number burned into a lot of old forum threads. The current behavior is different — and, on the hardware most people run local coding models on, still a problem.

Recent Ollama releases scale the default context window to your available VRAM. Per the official docs, the tiers are:

| Available VRAM | Default context |
|---|---|
| < 24 GiB | 4,096 tokens |
| 24–48 GiB | 32,768 tokens |
| ≥ 48 GiB | 262,144 tokens |

Read the top row again. If you’re on a single RTX 4060, 4070, a 12–16 GB laptop GPU, or an 8–16 GB Mac, you land in the < 24 GiB bucket and get 4,096 tokens — even though qwen2.5-coder supports 128K natively. A 24 GB card (RTX 3090 / 4090) jumps you to 32K, which is workable but still below what Ollama itself recommends for agents.

Ollama’s docs also say it in plain text: “Tasks which require large context like web search, agents, and coding tools should be set to at least 64,000 tokens.” The auto-default is a safe-to-boot baseline, not a number tuned for the way Cline or Aider actually use the model.
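
If it helps to see the tiers as logic, here is a small sketch that mirrors the table and compares each default with the 64,000-token recommendation. The function name is made up, and putting exactly 48 GiB in the top tier is an assumption, since the table's middle and top rows both mention 48.

```python
RECOMMENDED_FOR_AGENTS = 64_000  # figure quoted from Ollama's docs above


def default_num_ctx(vram_gib: float) -> int:
    """Mirror the VRAM tiers from the table above (not read from Ollama itself)."""
    if vram_gib < 24:
        return 4_096
    if vram_gib < 48:
        return 32_768
    return 262_144  # assumption: exactly 48 GiB counts as the top tier


for vram in (8, 16, 24, 48):
    ctx = default_num_ctx(vram)
    verdict = "ok" if ctx >= RECOMMENDED_FOR_AGENTS else "below recommendation"
    print(f"{vram:>2} GiB -> {ctx:>7} tokens ({verdict})")
```

Only the top tier clears the recommendation on its own. Everything below it needs an explicit num_ctx.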

To see what your model is really running with, ask Ollama directly:

$ ollama ps
NAME                    ID              SIZE      PROCESSOR    CONTEXT    UNTIL
qwen2.5-coder:14b       9ec8897f747a    11 GB     100% GPU     4096       4 minutes from now

That CONTEXT column is the truth. If it says 4096 and you’re running an agent, that’s your bug, sitting in plain sight.
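
If you want that check in a script, a rough sketch is to parse the table ollama ps prints. This assumes the column layout shown above, with cells separated by two or more spaces; the layout can differ between Ollama versions, and the helper names are hypothetical.

```python
import re

SAMPLE = """\
NAME                    ID              SIZE      PROCESSOR    CONTEXT    UNTIL
qwen2.5-coder:14b       9ec8897f747a    11 GB     100% GPU     4096       4 minutes from now
"""


def parse_ollama_ps(output: str) -> list[dict[str, str]]:
    """Split `ollama ps` text into one dict per loaded model, keyed by header."""
    lines = [ln for ln in output.splitlines() if ln.strip()]
    if not lines:
        return []
    header = re.split(r"\s{2,}", lines[0].strip())
    return [dict(zip(header, re.split(r"\s{2,}", ln.strip()))) for ln in lines[1:]]


def undersized_models(output: str, minimum: int) -> list[str]:
    """Names of loaded models whose CONTEXT column is below `minimum`."""
    return [
        row["NAME"]
        for row in parse_ollama_ps(output)
        if "CONTEXT" in row and int(row["CONTEXT"]) < minimum
    ]


print(undersized_models(SAMPLE, minimum=32768))  # ['qwen2.5-coder:14b']
```

To run it against a live server, feed it the stdout of subprocess.run(["ollama", "ps"], capture_output=True, text=True) instead of SAMPLE.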

The fix that survives every client: a Modelfile

You can set context per-tool, and I’ll cover that below. But the setting that every tool respects — because it lives on the model itself — is a Modelfile. Bake the context window in once, and Cline, Continue.dev, Aider, OpenCode, and anything else that calls that model name all inherit it.

Create a file called Modelfile (no extension):

FROM qwen2.5-coder:14b
PARAMETER num_ctx 32768

Then build a new named model from it:

$ ollama create qwen2.5-coder-14b-32k -f Modelfile
gathering model components
using existing layer sha256:...
creating new layer sha256:...
writing manifest
success

Point your tools at qwen2.5-coder-14b-32k instead of the base model, and the 32K window travels with it. Verify:

$ ollama run qwen2.5-coder-14b-32k "hi" >/dev/null && ollama ps
NAME                         ID              SIZE      PROCESSOR    CONTEXT    UNTIL
qwen2.5-coder-14b-32k        3f1a...         14 GB     100% GPU     32768      ...

One trap that eats an afternoon: the parameter is num_ctx with an underscore. Write num-ctx with a hyphen and Ollama ignores it silently — no error, default window, same broken behavior. Check the spelling before you check anything else.
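
One way to stay out of that trap is to generate the Modelfile rather than type it. The sketch below is a hypothetical helper, not part of Ollama. It only rejects names that are not plain lowercase-and-underscore identifiers, and it does not check them against Ollama's actual parameter list.

```python
import re

_PARAM_NAME = re.compile(r"[a-z][a-z0-9_]*")


def render_modelfile(base_model: str, parameters: dict) -> str:
    """Build Modelfile text, refusing parameter names such as 'num-ctx'."""
    lines = [f"FROM {base_model}"]
    for name, value in parameters.items():
        if not _PARAM_NAME.fullmatch(name):
            raise ValueError(
                f"suspicious parameter name {name!r}: use underscores, e.g. num_ctx"
            )
        lines.append(f"PARAMETER {name} {value}")
    return "\n".join(lines) + "\n"


print(render_modelfile("qwen2.5-coder:14b", {"num_ctx": 32768}), end="")
# FROM qwen2.5-coder:14b
# PARAMETER num_ctx 32768
```

Write the returned text to a file named Modelfile and pass it to ollama create as above.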

If you’d rather not manage extra model names, set the server-wide default instead:

OLLAMA_CONTEXT_LENGTH=64000 ollama serve

The precedence matters: a num_ctx baked into a Modelfile overrides OLLAMA_CONTEXT_LENGTH. So if you set the env var to 64000 but built a model with PARAMETER num_ctx 8192, you get 8,192. When two settings disagree, the Modelfile wins.
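
That rule is small enough to write down as a function. This is a sketch of the precedence as described here, not Ollama's implementation; the function name and the 4,096 fallback (the under-24-GiB default) are assumptions.

```python
def effective_num_ctx(modelfile_num_ctx, env, builtin_default=4096):
    """Modelfile value first, then OLLAMA_CONTEXT_LENGTH, then the built-in default."""
    if modelfile_num_ctx is not None:
        return modelfile_num_ctx
    raw = env.get("OLLAMA_CONTEXT_LENGTH")
    if raw:
        return int(raw)
    return builtin_default


env = {"OLLAMA_CONTEXT_LENGTH": "64000"}
print(effective_num_ctx(8192, env))  # 8192: the Modelfile wins
print(effective_num_ctx(None, env))  # 64000: env var applies
print(effective_num_ctx(None, {}))   # 4096: nothing set, default tier
```

Pass os.environ as env to evaluate the shell that actually launches ollama serve.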

A per-request override is also possible through the API for anything that speaks Ollama’s native endpoint:

curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5-coder:14b",
  "prompt": "refactor this function",
  "options": { "num_ctx": 32768 }
}'

Most GUI tools don’t expose that field cleanly, which is exactly why the Modelfile route is the reliable one.
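
For your own scripts, the same request is easy to build with Python's standard library. This is a minimal sketch of the curl call above. The helper names are made up, and "stream": False is an addition so the reply arrives as a single JSON object instead of a stream of chunks.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model, prompt, num_ctx, url=OLLAMA_URL):
    """Build (but do not send) a generate request with a per-request num_ctx."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # addition: one JSON reply instead of streamed chunks
        "options": {"num_ctx": num_ctx},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt, num_ctx):
    """Send the request; needs an Ollama server listening on localhost:11434."""
    with urllib.request.urlopen(build_generate_request(model, prompt, num_ctx)) as resp:
        return json.loads(resp.read())["response"]


req = build_generate_request("qwen2.5-coder:14b", "refactor this function", 32768)
print(json.loads(req.data)["options"])  # {'num_ctx': 32768}
```

Nothing is sent until you call generate(...), so the builder can be inspected or unit-tested without a running server.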

Cline

Cline is the worst offender for this symptom because it’s a full agent — system prompt, file contents, and a growing tool-call history all pile into the window, and it blows past 4,096 tokens inside the first couple of steps. After that it truncates, then loops or re-reads files because the record of what it already did fell off the top.

Cline’s Ollama provider does have a context-window field in its model settings, and you should set it to match your model (32K or higher). But don’t rely on it alone: there are reports of Cline applying its own context number when a task starts rather than deferring to whatever the model was configured with (see cline/cline issue #7726 on documenting context-window config). The behavior has shifted across versions, which is another reason to pin num_ctx at the Ollama layer with a Modelfile: that number is what Ollama falls back to whenever the client does not send a num_ctx of its own, and ollama ps shows which value actually took effect.

Practical setup:

  1. Build a -32k (or -64k) model with a Modelfile as above.
  2. In Cline → API Provider → Ollama, select that model.
  3. Set Cline’s context-window field to the same number you baked in.
  4. Run a task, then check ollama ps mid-task. If CONTEXT matches, you’re set.

If your card can’t hold both the weights and a 32K KV cache, drop to a smaller quant (Q4_K_M) or a smaller model before you drop the context — an agent with a 7B model at 32K context beats a 14B model at 4K context every time, because the 14B never sees your instructions.

For a full privacy-first Cline-on-local walkthrough, see our Cline local LLM setup guide, and if Cline is spinning on tool calls specifically, the Cline + Ollama tool-use loop fix covers the adjacent failure mode.

Continue.dev

Continue.dev shows the truncation more subtly. Chat works, autocomplete works, and then after a few turns the assistant starts answering as if it can’t see the file you have open. That’s the window overflowing and the file context — sent early in the prompt — getting dropped.

Continue sets context through defaultCompletionOptions.contextLength in config.yaml. Note the key name: it’s contextLength, not num_ctx, on Continue’s side — Continue then passes an appropriate num_ctx to Ollama.

models:
  - name: Qwen Coder 14B (32K)
    provider: ollama
    model: qwen2.5-coder-14b-32k
    defaultCompletionOptions:
      contextLength: 32768
      temperature: 0.2

Continue picks up config.yaml changes without a VS Code restart, so you can edit, save, and test in the same session. Two things to get right:

  • Set contextLength to a value your model and VRAM actually support. Continue will happily request a bigger window than the model can hold, and you’ll trade the silent-truncation bug for an out-of-memory reload.
  • Keep contextLength and the model’s baked num_ctx in sync. If the Modelfile says 32768 and Continue asks for 8192, the per-request value takes effect and the model runs at 8,192, leaving you wondering where the window you baked in went.
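
A quick way to catch drift between the two numbers is to read num_ctx out of the Modelfile text and compare it with whatever you typed into the client. The helper names below are hypothetical, and the parsing is a deliberately simple regex rather than a full Modelfile parser.

```python
import re

# PARAMETER is matched case-insensitively; the parameter name itself is not.
_NUM_CTX = re.compile(r"^\s*(?i:PARAMETER)\s+num_ctx\s+(\d+)\s*$", re.MULTILINE)


def baked_num_ctx(modelfile_text: str):
    """Return the num_ctx set in a Modelfile, or None if it sets none."""
    match = _NUM_CTX.search(modelfile_text)
    return int(match.group(1)) if match else None


def context_mismatch(modelfile_text: str, client_context_length: int):
    """Return a warning string if the two values disagree, else None."""
    baked = baked_num_ctx(modelfile_text)
    if baked is None:
        return "Modelfile sets no num_ctx; the server default applies"
    if baked != client_context_length:
        return f"Modelfile has num_ctx {baked}, client asks for {client_context_length}"
    return None


MODELFILE = "FROM qwen2.5-coder:14b\nPARAMETER num_ctx 32768\n"
print(context_mismatch(MODELFILE, 32768))  # None
print(context_mismatch(MODELFILE, 8192))
# Modelfile has num_ctx 32768, client asks for 8192
```

For a model you have already built, ollama show --modelfile <name> prints the text to feed in.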

For the base install and model-selection choices, our Continue.dev + Ollama local setup covers the rest of the config.

Aider

Aider is the most honest of the three about this — its docs warn you directly — and the fix is a single file. Aider reads model-specific settings from .aider.model.settings.yml, placed in your project root (or home directory):

- name: ollama/qwen2.5-coder:14b
  extra_params:
    num_ctx: 32768

extra_params.num_ctx passes straight through to Ollama, so this is the cleanest of the three — no Modelfile strictly required, though building one still helps if you also use the model in other tools. The name must match exactly how you invoke the model, ollama/ prefix included.

Aider’s quality is unusually sensitive to context because of its repo map — the summarized index of your codebase it feeds the model so it knows what exists beyond the files in the chat. That map is one of the first things sacrificed when the window is too small, which is why an under-configured Aider gives you edits that ignore functions in other files. Bump num_ctx as high as your VRAM allows: on a 12 GB card, a 14B model at Q4 with a 16K–32K window is the sweet spot; on a 24 GB card you can push a coder model to 64K.

Our Aider + Ollama setup guide walks through model choice and the repo-map tuning that pairs with this.

The VRAM reality nobody mentions

Raising num_ctx isn’t free. The KV cache — the memory that holds those tokens — grows with the context window, and it comes out of the same VRAM budget as the model weights. Set a 64K window on a card that barely fits the 14B weights and Ollama either spills to system RAM (slow) or refuses to load.

Rough order of magnitude: a 14B model at Q4_K_M is ~9–11 GB of weights, and a 32K KV cache on top can add several more GB depending on the model’s attention config. On a 16 GB card that’s tight; on 24 GB it’s comfortable. This is the real reason Ollama ties its default to VRAM in the first place — it’s protecting you from an OOM on boot, not from a bad coding experience.
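
You can sanity-check "several more GB" with the usual back-of-envelope KV cache formula: two tensors (keys and values) per layer, each holding n_kv_heads × head_dim values per token. The architecture numbers below (48 layers, 8 KV heads, head dimension 128) and the 2-byte fp16 cache are illustrative assumptions for a 14B-class model with grouped-query attention. Check your model's card for the real values.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, num_ctx, bytes_per_value=2):
    """Estimate KV cache size: keys + values, for every layer and every token."""
    return 2 * n_layers * n_kv_heads * head_dim * num_ctx * bytes_per_value


# Assumed attention config, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128

for ctx in (4096, 32768, 65536):
    size = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, ctx)
    print(ctx, round(size / GIB, 2))
# 4096 0.75
# 32768 6.0
# 65536 12.0
```

Under those assumptions a 32K window costs about 6 GiB on top of the weights. That squares with 16 GB being tight and 24 GB being comfortable for a 14B model at Q4_K_M.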

If you’re hitting the wall, the honest fix is hardware-shaped: a used 24 GB RTX 3090 turns “4K default, constant truncation” into “32K by default, 64K when you ask.” We break down what each VRAM tier actually buys for local coding models over at runaihome.com’s local-AI GPU guide. Until then, the move is smaller-model-bigger-window, not the reverse.

FAQ

How do I know if truncation is actually my problem?
Run ollama ps while a task is active and read the CONTEXT column. If it’s 4096 (or anything smaller than your prompt + files + history), you’re truncating. Cross-check by giving the model a rule at the start of a session and asking about it several turns later. If it’s forgotten, the window ate it.

Does Ollama really default to 4,096 now, not 2,048?
On current releases, yes, for cards under 24 GiB VRAM. The old 2,048 figure predates the VRAM-scaled defaults. Either way it’s far below the 64,000 Ollama’s own docs recommend for coding tools, so the fix is the same.

num_ctx vs contextLength vs OLLAMA_CONTEXT_LENGTH: which wins?
A Modelfile PARAMETER num_ctx overrides the OLLAMA_CONTEXT_LENGTH server default. Client settings (Cline’s field, Continue’s contextLength, Aider’s extra_params.num_ctx) are per-session requests: when a client sends num_ctx through Ollama’s native API, that per-request value takes precedence over the Modelfile, and when it sends nothing, the Modelfile value applies. Set the Modelfile as your baseline, match it in the client, and confirm the result with ollama ps.

Why not just set 128K everywhere?
Because the KV cache for 128K tokens won’t fit alongside model weights on consumer GPUs, and Ollama will OOM or offload to CPU and crawl. Set the largest window that fits your VRAM with room to spare: 32K is the practical target for most local coding setups, 64K if you have 24 GB+.

I set num-ctx and nothing changed.
It’s num_ctx: underscore, not hyphen. The hyphenated version is silently ignored. This is the single most common self-inflicted version of this bug.

Last updated July 5, 2026. Ollama defaults and tool config keys change between versions; run ollama ps to confirm the context your model is actually using before trusting any number here.
