Token savings
CTXone’s core pitch is fewer tokens per turn, more useful context. This doc explains how the savings are measured, how to interpret the ratio, and how to maximize it.
For the business case and enterprise math, see TOKEN_ECONOMICS.md. For the architecture, see ARCHITECTURE.md.
The baseline: flat memory
Imagine you stored everything CTXone knows as a single JSON file and shipped that file as context to the model on every turn. That’s the flat memory baseline.
Concretely, the Hub computes the baseline by:
- Serializing the entire memory graph rooted at / to JSON
- Measuring the character count of that string
- Dividing by 4 (the standard characters-per-token estimate for English)

    flat_tokens = len(json.dumps(graph_root_as_nested_dict)) / 4

This is an upper bound on what “carry everything on every turn” would cost.
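The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the Hub's actual code: the `estimate_flat_tokens` helper and the sample `graph` dict are hypothetical names, and it assumes the graph is already available as a nested dict.

```python
import json

def estimate_flat_tokens(graph_root: dict) -> float:
    """Estimate the cost of shipping the whole graph on every turn."""
    serialized = json.dumps(graph_root)  # step 1: serialize the graph to JSON
    char_count = len(serialized)         # step 2: measure the character count
    return char_count / 4                # step 3: assume ~4 characters per token

# Hypothetical one-fact graph, for illustration only.
graph = {"memory": {"licensing": ["CTXone is licensed under BSL-1.1"]}}
print(estimate_flat_tokens(graph))
```

Because the estimate depends only on string length, it grows with every fact added to the graph, whether or not a given query needs that fact.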
The actual cost: per-response
Every time a recall runs, the Hub measures the length of the response it actually returned (pinned sections + topic matches) and divides by 4. That’s ctx_tokens_sent.

    sent_tokens = len(response_string) / 4

The ratio

    savings_ratio = flat_tokens / sent_tokens

A ratio of 13.0x means: for this specific query, you sent 13× fewer tokens than you would have with flat memory.
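As a rough sketch of how the per-response numbers fit together, the function below applies the same 4-character estimate to a response string and divides the flat baseline by it. The function names and the zero-length guard are assumptions made for illustration; only the two formulas come from this doc.

```python
def estimate_tokens(text: str) -> float:
    """4-characters-per-token estimate used for both sent and flat counts."""
    return len(text) / 4

def savings_ratio(response: str, flat_tokens: float) -> float:
    """flat_tokens / sent_tokens for a single recall response."""
    sent_tokens = estimate_tokens(response)
    if sent_tokens == 0:
        # Assumption: treat an empty response as unbounded savings
        # rather than dividing by zero.
        return float("inf")
    return flat_tokens / sent_tokens

# A 136-character response against a 451-token flat baseline.
response = "x" * 136
print(estimate_tokens(response))               # 34.0
print(round(savings_ratio(response, 451), 2))  # 13.26
```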
The response includes three fields:
    {
      "ctx_tokens_sent": 34,
      "ctx_tokens_estimated_flat": 451,
      "ctx_savings_ratio": 13.26
    }

How to read the numbers in ctx commands

After a recall, the CLI prints:

    5 pinned + 2 topic matches, 34 tokens sent (flat would be ~451, 13.3x savings)

- 5 pinned + 2 topic matches: how much the Hub returned and why
- 34 tokens sent: actual cost of this response
- flat would be ~451: what carrying the whole graph would cost
- 13.3x savings: the ratio for this single recall
ctx stats shows the cumulative session totals:
    CTXone Token Savings
      graph size: 451 tokens
      tokens sent: 98
      tokens saved: 1706
      savings: 18.4x

- graph size: the flat-memory baseline right now
- tokens sent: total tokens returned across all recall / context / remember calls in this Hub session
- tokens saved: (number_of_recalls × flat_size) - tokens_sent
- savings: overall ratio
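The session totals can be reproduced by hand. The sketch below is illustrative only: `session_totals` is a hypothetical helper, the four per-recall counts are the ones from the demo output later in this doc, and the overall ratio is assumed to be the cumulative flat cost divided by the tokens sent, which matches the sample numbers.

```python
def session_totals(flat_size, sent_per_recall):
    """Cumulative savings for one Hub session."""
    tokens_sent = sum(sent_per_recall)
    # Flat memory would have shipped the whole graph on every recall.
    flat_total = len(sent_per_recall) * flat_size
    return {
        "tokens_sent": tokens_sent,
        "tokens_saved": flat_total - tokens_sent,
        # Assumption: overall ratio = cumulative flat cost / tokens sent.
        "savings": round(flat_total / tokens_sent, 1),
    }

totals = session_totals(451, [34, 13, 26, 25])
print(totals)  # {'tokens_sent': 98, 'tokens_saved': 1706, 'savings': 18.4}
```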
Important caveat: the Hub tracks Hub-session totals, not LLM-session totals. The counters reset when you restart the Hub, so ctx stats reports cumulative savings for the currently running Hub process only.
What drives the ratio up
Tight queries. recall "BSL-1.1 licensing" is tighter than
recall "stuff about our project". The more specific, the fewer incidental
matches, the higher the ratio.
Small pinned set. Pinned memories take half the budget. If you pin 50 sections, each recall ships 50 sections — still a huge savings vs flat, but lower ratio than pinning just 5.
Large total graph. The bigger your graph, the more dramatic the savings. A 10,000-fact graph recalling 3 facts sends roughly 1/3,000 of the flat baseline before pinned content is counted, and the ratio stays high even with a generous pinned set.
Focused contexts. ctx remember "..." --context licensing groups facts
into /memory/licensing/*. Recalls can then hit that one sub-tree cleanly.
What drives the ratio down
Very small graph. If you have 5 facts total and recall them all, the ratio will be near 1.0. That’s expected: savings don’t kick in until you have more than you need.
Overly-broad recall queries. recall "project" on a project-heavy
graph matches everything and returns everything. Same tokens as flat.
Over-pinning. If you pin your entire README plus five other docs, pinned content alone is near-flat. The budget math works: pinned takes half, topic gets the other half, but if pinned is already huge, “half” is huge.
A concrete example
Start with the demo data:

    $ ctx demo
    ...
    Seeded 21 facts.
      recall "licensing" → 2 matches, 34 tokens sent vs 451 flat (13.0x savings)
      recall "architecture" → 1 matches, 13 tokens sent vs 451 flat (32.8x savings)
      recall "tokens" → 1 matches, 26 tokens sent vs 451 flat (17.4x savings)
      recall "Lens" → 1 matches, 25 tokens sent vs 451 flat (17.5x savings)

21 facts total = 451 flat tokens. A recall returning just the relevant 1–2 facts averages 24 tokens. Ratio: 18.4× cumulative.
Now prime a pinned doc:
    $ ctx prime ./docs/VISION.md --pin --source project
    pinned 5 sections from ./docs/VISION.md under source 'project'

Recall the same topics:

    $ ctx recall "licensing"
    [PINNED] The Insight ...
    [PINNED] The Product ...
    [PINNED] The Roadmap ...

    --- topic matches ---
    CTXone is licensed under BSL-1.1...
    The engine (AgentStateGraph) is BSL-1.1...

    5 pinned + 2 topic matches, 620 tokens sent (flat would be ~1191, 1.9x savings)

The ratio dropped to 1.9×. That’s not a bug: you’re now carrying the entire VISION.md on every call, which is the price you pay for having critical project context always available. The ratio is still >1.0, meaning you’re still saving tokens vs pure flat memory, and every response carries the context the agent actually needs.
Rule of thumb: ratio > 5× means pinned is tight and recall is focused. Ratio < 2× means you’ve pinned a lot and should review whether every pinned section is really critical.
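The rule of thumb is simple enough to encode as a check, for example in a script that watches recall output. This is a hypothetical sketch: `review_hint` is not a CTXone command, and the wording for the middle band (between 2× and 5×) is an assumption, since the rule above only names the two ends.

```python
def review_hint(ratio: float) -> str:
    """Map a savings ratio to the rule-of-thumb guidance."""
    if ratio > 5:
        return "pinned set is tight and recall is focused"
    if ratio < 2:
        return "review whether every pinned section is really critical"
    # Assumption: the rule of thumb gives no guidance for this band.
    return "in between: no specific guidance"

print(review_hint(451 / 34))    # demo recall before pinning
print(review_hint(1191 / 620))  # same recall after pinning VISION.md
```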
Why 4 characters per token?
It’s a rough estimate. Actual tokenization depends on the model and the content (code tokenizes differently than prose). 4 chars/token is a common back-of-envelope figure for English text.
For precise accounting you’d need to call the model’s tokenizer. CTXone uses 4 because:
- It’s fast (no tokenizer dependency)
- It’s conservative enough that the reported “tokens sent” is in the right order of magnitude
- Both sent and flat use the same estimator, so the ratio is internally consistent even if the absolute numbers are rough
If you need exact counts, run ctx recall --exact. The CLI re-tokenizes the response locally using tiktoken’s cl100k_base encoding (GPT-3.5 / GPT-4 family) and prints both the fast estimate and the exact count side by side:

    0 pinned + 2 topic matches, 34 tokens sent (flat would be ~451, 13.0x savings)
      exact (cl100k_base): 75 sent, 553 flat, 7.4x savings

For plain English prose, the exact count is often smaller than the 4-char estimate because BPE tokenizers compress common words and punctuation efficiently. In the sample above the exact counts are larger (75 vs 34 sent, 553 vs 451 flat) and the ratio is lower (7.4x vs 13.0x). The ratio is usually more conservative under --exact, which is the right direction: you never want to inflate savings claims.
You can also tokenize arbitrary text directly:
    $ ctx tokens "The quick brown fox jumps over the lazy dog"
    43 chars
    9 tokens (cl100k_base, exact)
    10 tokens (4-char estimate)

    $ echo "any text from stdin" | ctx tokens -

Caveat: cl100k_base is OpenAI’s tokenizer. Claude, Gemini, and Grok use different proprietary tokenizers, so the exact counts won’t match those models token-for-token. The ratio is still meaningful as a consistent reference point.
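To compare the two counts outside the CLI, something like the following sketch could work. It is not how ctx tokens is implemented: the function names are hypothetical, the integer truncation is inferred from the sample output (43 chars reported as 10 tokens), and the exact count relies on the third-party tiktoken package, which is imported lazily so the estimate runs without it.

```python
def four_char_estimate(text: str) -> int:
    """4-char estimate, truncated to an integer (assumed from the sample output)."""
    return len(text) // 4

def exact_cl100k_tokens(text: str) -> int:
    """Exact cl100k_base token count; requires the tiktoken package."""
    import tiktoken  # lazy import: only needed for exact counts
    encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))

text = "The quick brown fox jumps over the lazy dog"
print(len(text), "chars")
print(four_char_estimate(text), "tokens (4-char estimate)")
# Call exact_cl100k_tokens(text) when tiktoken is installed.
```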
Why not vector similarity?
Vector search returns results ranked by embedding distance. That works, but:
- Recall is opaque — you can’t tell why a result was returned
- Results drift when you re-embed with a newer model
- Token savings are harder to compute because the “relevance threshold” is fuzzy
Structural search + confidence scoring + pinned context gives you:
- Blame-able results (every fact has a commit trail)
- Predictable ranking (token matches + pinned-first)
- Clean token math (you know exactly what went into each response)
See ARCHITECTURE.md for more on this design choice.