How Far Can Attention Reach?
Stretching a 4k model to 128k---what changes, what breaks, what costs.
Last time we took apart a 4k transformer. We saw the whole attention pipeline: project Q, K, V, score, scale, mask, softmax, aggregate. Fifteen tokens. Small room. Tidy story.
Now the same model family claims to handle 128,000 tokens. That’s not a sentence anymore. That’s a county records office, or a romance novel with an alarming filing system.
How do you get from a paragraph to a novel without changing the attention equation?
Let’s find out. Same model, same architecture, same suspiciously load-bearing softmax. Different config file, bigger room. This is the second post in our trilogy on attention.
The Config Diff
First, a smoke test. We have two models in the Phi-3 Mini family:
- Phi-3-mini-4k-instruct: the one we dissected last time
- Phi-3-mini-128k-instruct: the “long context” variant
Are these different models? Or the same model wearing a bigger hat, a longer coat, a Patagonia vest, and a more ambitious pitch deck?
```python
from transformers import AutoConfig

config_4k = AutoConfig.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
config_128k = AutoConfig.from_pretrained("microsoft/Phi-3-mini-128k-instruct")
```
Let’s diff every field in the config:
```python
d4, d128 = config_4k.to_dict(), config_128k.to_dict()
for k in sorted(set(d4) | set(d128)):
    if d4.get(k) != d128.get(k):
        print(f"  {k}: {d4.get(k)} → {d128.get(k)}")
```

```
max_position_embeddings: 4096 → 131072
rope_parameters: default → longrope
sliding_window: 2047 → 262144
```
Three fields. That’s it.
Everything else is identical: hidden_size: 3072, num_hidden_layers: 32, num_attention_heads: 32, vocab_size: 32064, the activation function, the normalization epsilon, and 26 other fields. Same architecture. Same parameter count. Same Q, K, V projections. Same attention heads. Same feed-forward networks.
The 128k model is not a different model. It’s the same model with three knobs turned.
So the question is not “where did the extra intelligence come from?” It is “how are three config fields getting away with this?”
LongRoPE
In Post 1 we introduced Rotary Positional Embeddings (RoPE): before computing QKᵀ, we rotate the Q and K vectors by an angle that depends on position. Nearby tokens get similar angles. Distant tokens get different angles. Word order is encoded through rotation.1
The 4k model’s RoPE works out of the box for sequences up to 4,096 tokens. The rotation frequencies were chosen during training for that range.
Push past it and the angles start wrapping around. The model begins confusing far-apart positions in ways that are mathematically elegant and operationally useless.
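To see why pushing past the trained range breaks things, it helps to look at the frequencies themselves. A minimal sketch, assuming the standard RoPE schedule (base 10000) and Phi-3’s head dimension of 96:

```python
import numpy as np

def rope_angles(position, head_dim=96, base=10000.0):
    """Rotation angle of each of the head_dim // 2 pairs at a given position."""
    i = np.arange(head_dim // 2)
    freqs = base ** (-2.0 * i / head_dim)  # fast rotation for low i, slow for high i
    return position * freqs

angles = rope_angles(4096)
turns = angles / (2 * np.pi)
print(f"pair 0:  {turns[0]:.0f} full turns")    # hundreds of wraps: pure local signal
print(f"pair 47: {turns[47]:.3f} of one turn")  # has barely moved: global signal
```

The fast pairs wrap around constantly by design; that is how they resolve neighboring positions. The slow pairs were tuned to cover roughly the training range, and it is exactly those that run out of room past 4,096.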
The 128k model fixes this with LongRoPE2: a non-uniform stretching of the rotation frequencies. Instead of scaling every dimension by the same factor, it scales each one differently.
We can extract the scaling factors directly from the config.
```python
d = config_128k.to_dict()
rope = d.get("rope_scaling") or d.get("rope_parameters")  # key name varies, hence the `or`
long_factors = rope["long_factor"]
```
Phi-3’s attention heads have dimension 96, which means 48 rotation pairs. Each pair gets its own scaling factor. Here they are:
```
dim   long_factor
  0      1.07    ← barely touched
  1      1.07
  2      1.07
  3      1.10
  4      1.18
 ...
 22     46.34
 23     49.52
 24     52.40
 25     54.94
 26     57.08
 ...
 43     64.49
 44     64.49
 45     64.62
 46     64.69
 47     64.76    ← scaled 65x
```

Uniform scaling would be 32x everywhere; that’s just the ratio of 131,072 to 4,096. LongRoPE instead ranges from 1.07x to 64.76x.
Why the asymmetry?
Because RoPE’s low-numbered dimensions rotate fast. They encode local position: the difference between “two tokens apart” and “three tokens apart.” You do not want to blur those. A 128k model still needs to know that buying comes right after stop, not vaguely somewhere in the same chapter.
The high-numbered dimensions rotate slowly. They encode global position: “somewhere in the first thousand tokens” versus “somewhere in the last thousand.” Those are the dimensions that need to stretch to cover 128k positions instead of 4k. So they get scaled up to 65x.
The result is surgical. Local position resolution is preserved almost exactly,3 while global position capacity grows 32-fold.
Same model. Same attention equation. Bigger filing cabinet, smarter shelves.
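The asymmetry can be made concrete. A sketch, using just the two endpoint factors from the table above (1.07 for pair 0, 64.76 for pair 47), of how dividing each frequency by its factor stretches that pair’s wavelength, i.e. the number of tokens per full rotation:

```python
import numpy as np

head_dim, base = 96, 10000.0
i = np.arange(head_dim // 2)
freqs = base ** (-2.0 * i / head_dim)

# Endpoint factors from the config's long_factor array; the other 46 are omitted here.
factor = {0: 1.07, 47: 64.76}

for pair, f in factor.items():
    before = 2 * np.pi / freqs[pair]       # wavelength in tokens, original RoPE
    after = 2 * np.pi / (freqs[pair] / f)  # LongRoPE divides the frequency by f
    print(f"pair {pair}: {before:,.0f} -> {after:,.0f} tokens per rotation")
```

Pair 0’s wavelength barely budges, so local order survives; pair 47’s wavelength grows by the full 64.76x, which is what buys the global reach.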
The Quadratic Wall, Revisited
In Post 1, we timed a single transformer layer up to 2,048 tokens. O(n²): double the tokens, quadruple the time. But at 2k, the numbers were still small enough that you could squint and call it fine.
Let’s stop squinting and price the thing.
I tried to run this on my laptop: Apple Silicon, 32 GB of unified memory. The same setup that handled all of Post 1 without complaint. The 128k model loaded fine. At 4k tokens, no problem. At 8k, it slowed down. At 12k, it ran out of memory!
The irony scans nicely. The point of this section is to show that long context is expensive, and we couldn’t even measure how expensive it was without renting the sort of hardware that makes it expensive.4
So: one NVIDIA B200 GPU, Phi-3-128k, float32, timing a single transformer layer at each sequence length:
| n | Time (s) | Ratio | KV cache |
|---|---|---|---|
| 128 | 0.001 | — | 96 MB |
| 512 | 0.003 | 3.1× | 384 MB |
| 2,048 | 0.009 | 2.9× | 1.5 GB |
| 8,192 | 0.042 | 4.7× | 6 GB |
| 32,768 | 0.269 | 6.4× | 24 GB |
| 65,536 | 0.821 | 3.1× | 48 GB |
| 131,072 | 2.819 | 3.4× | 96 GB |
Read the Ratio column like a politely typeset threat. Under O(n²), doubling n should quadruple the time, and a 4x jump in n should cost 16x. At small n the ratios undershoot badly, because fixed overhead and the linear feed-forward term still dominate. But watch the trend: the 8k → 32k jump already costs 6.4x, and the final doublings (32k → 65k → 131k) land at 3.1x and 3.4x, closing in on the quadratic limit of 4x. The attention term is taking over, and memory access patterns pile onto the arithmetic to make a bad situation ruder.
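The same ratios can be recomputed from the table, next to what pure O(n²) would predict for each jump:

```python
# Single-layer timings from the table above (seconds)
times = {128: 0.001, 512: 0.003, 2048: 0.009, 8192: 0.042,
         32768: 0.269, 65536: 0.821, 131072: 2.819}

ns = sorted(times)
for a, b in zip(ns, ns[1:]):
    n_ratio = b / a
    t_ratio = times[b] / times[a]
    # Pure quadratic predicts t_ratio == n_ratio ** 2:
    # 16x for a 4x jump in n, 4x for a doubling.
    print(f"{a:>7} -> {b:>7}: time x{t_ratio:4.1f}  (pure n^2 would be x{n_ratio ** 2:2.0f})")
```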
Now the last column. The KV cache stores key and value tensors for every token at every layer so the model doesn’t recompute them during generation. At 131k tokens in float32:
2 × 32 layers × 32 heads × 96 dims × 131,072 tokens × 4 bytes = 96 GB
That is 96 GB just to remember the conversation. The model weights are about 15 GB. The memory storing what the model knows is one-sixth the memory needed to store what you said.
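The arithmetic is worth doing by hand once. A quick check of the KV-cache formula, using the Phi-3 Mini config values from section 1:

```python
layers, heads, head_dim = 32, 32, 96  # from the Phi-3 Mini config
tokens, fp32_bytes = 131_072, 4

# 2x for storing both keys and values at every layer
kv_bytes = 2 * layers * heads * head_dim * tokens * fp32_bytes
print(f"{kv_bytes / 2**30:.0f} GiB")  # 96 GiB
```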
And the compute bill is just as rude: 2.8 seconds per layer at 131k. Phi-3 has 32 layers. That’s about 90 seconds just for the initial pass through the prompt, before the model drafts so much as one token of response.5
Does It Actually Work?
Three knobs turned, frequencies stretched, and the sort of GPU you and I can never afford.6 Fine. But does the 128k model actually use all 128k tokens, or merely invoice you for them?
There’s a standard test for this: needle-in-a-haystack.7 Bury a random fact somewhere in a long document, ask the model to retrieve it, vary the length and the position. If the model finds the needle everywhere, the context window works.
Our needle is a marked transit schedule buried in pages of operations bulletins:8
```
★ PRIORITY ROUTE — Vashon Island Ferry
Dispatch code: 421 | Terminal: Fauntleroy Terminal | Status: ACTIVE
```
Question: “What is the dispatch code for the Vashon Island Ferry?” Answer: 421.
The filler is transit operations bulletins: maintenance advisories, fleet utilization reports, safety notices, all procedurally generated and semantically inert. Pure retrieval. This is not reasoning. This is glorified filing.
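The harness boils down to a few lines. A sketch (the helper name and filler text are illustrative, simplified from our actual scripts) of how one probe is assembled:

```python
def build_haystack(filler, needle, depth):
    """Insert the needle at a fractional depth (0.0 = start, 1.0 = end)."""
    pos = int(depth * len(filler))
    return "\n\n".join(filler[:pos] + [needle] + filler[pos:])

needle = ("★ PRIORITY ROUTE — Vashon Island Ferry\n"
          "Dispatch code: 421 | Terminal: Fauntleroy Terminal | Status: ACTIVE")
filler = [f"Bulletin {i}: routine maintenance advisory, no action required."
          for i in range(100)]

doc = build_haystack(filler, needle, depth=0.5)
# The probe then asks "What is the dispatch code for the Vashon Island Ferry?"
# and exact-matches "421" against the model's answer.
```

Sweeping `depth` over nine values and the filler count over seven context lengths gives the 63-probe grid described below.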
We ran 63 probes on Phi-3 128k: 7 context lengths × 9 depths.
63 out of 63 correct through 106,641 tokens. Every depth, every context length, all the way to the edge of the 128k window. 5% depth, 95% depth, 5k tokens, 107k tokens: doesn’t matter.
That’s the correct result. The config diff from section 1 set the position limit. LongRoPE from section 2 stretched the frequencies. And here the heatmap confirms that the stretch is real.
But we should not throw a parade after one green heatmap. Is this a LongRoPE story, a training story, or both?
The same needle, different haystacks
There’s one way to find out: run the same experiment on other models.9 Same needle, same filler, same nine depths. Different architectures, different context limits, different position encoding schemes.
We started with the model from Liu et al. (2023), the paper that documented “lost in the middle”—models degrading when relevant information sits in the middle of their context.10 If anyone should show a U-shaped retrieval curve, it’s Llama-2:
Llama-2 7B and Mistral 7B pass every probe. No U-shape. No lost middle. No drama.
We were surprised at first. Then we realized that, carried away by the transit theme, we had benchmarked a very different task from the one in Liu et al. Their models struggled when they had to reason across 20 documents. Our needle test, by contrast, hides a single marked schedule inside procedural filler. Attention is naturally good at this kind of key-value retrieval.11 Asking whether a transformer can find a marked fact is a bit like asking whether a hash table can do lookups. Yes. That was the lookup, not the hard part.
The failures are more interesting.
Qwen 2.5 7B claims 128k context via YaRN but starts dropping needles at 84k tokens. The failures concentrate in the early-to-middle positions (5–60% depth) while the end of the context (85–95%) stays reliable. At 127k tokens, most depths fail. This is the opposite of Liu et al.’s U-shape—the beginning loses resolution first, not the middle.12
Then there’s Phi-3 Mini 4k—the short-context sibling of our 128k model. Same architecture, same parameter count, no LongRoPE:
Three failures, clustered between 5% and 35% depth at the longest context that fits (3.1k tokens). This is the closest thing to “lost in the middle” we found, and it shows up in exactly the model that was never trained for long context.
So the story so far is clean: position encoding sets the ceiling, and training data determines whether the model actually reaches it.
But here’s the catch: three out of five models got a perfect score. Needle-in-a-haystack catches broken position encodings and obvious capacity walls. That’s useful. It is also only a smoke test. It tells you almost nothing about whether the model can use what it retrieves.13
Not All Documents Are Created Equal
The needle test has a small but important flaw: the needle is marked. ★ PRIORITY ROUTE stands out from the surrounding transit operations bulletins. The model does not have to understand the document. It just has to notice the weirdly important-looking line. A sufficiently confident Ctrl-F could do this too.
Liu et al. (2023) tested something harder: multi-document question answering (QA). Put 20 identically formatted documents in the context and make one relevant. Now the model has to read and discriminate. It has to identify the right document, not just spot the one wearing a party hat.
We replicated this. Twenty transit schedules, all identical format:14
```
ROUTE 847 — Ballard–Fremont Loop
Capacity: 62 riders | Frequency: 18 min | Fleet: 6 vehicles | Depot: Interbay

ROUTE 293 — Capitol Hill Express
Capacity: 48 riders | Frequency: 12 min | Fleet: 9 vehicles | Depot: Atlantic Base
```
The fields do not matter semantically here. They are just neutral attributes so every schedule looks equally detailed and the answer cannot be recovered from one quirky formatting cue.
Question: “What is the fleet size for the Ballard–Fremont Loop?” Answer: 6.
We swept the position of the target schedule (1st through 20th of 20), three context lengths, and five different target routes per position: 150 probes per model.15
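A sketch of how one multi-document probe is built (the helper and the distractor text are illustrative, not our exact generator):

```python
def build_context(target, distractors, position):
    """Place the target schedule at a 1-indexed slot among the 20 documents."""
    docs = distractors[:position - 1] + [target] + distractors[position - 1:]
    return "\n\n".join(docs)

target = ("ROUTE 847 — Ballard–Fremont Loop\n"
          "Capacity: 62 riders | Frequency: 18 min | Fleet: 6 vehicles | Depot: Interbay")
distractors = [
    (f"ROUTE {100 + i} — Distractor Line {i}\n"
     f"Capacity: 50 riders | Frequency: 15 min | Fleet: {10 + i} vehicles | Depot: Base {i}")
    for i in range(19)
]

ctx = build_context(target, distractors, position=10)  # dead center of 20
# The probe asks "What is the fleet size for the Ballard–Fremont Loop?"
# and exact-matches "6" in the model's answer.
```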
Phi-3 4k shows the U-shape. At about 4k tokens, positions 7 and 10, the exact center of the 20 documents, drop to 0/5. Positions 1 and 15–20 stay at 5/5. That is Liu et al.’s U-shaped curve, live, in our data.
But look at the modern long-context models. Mistral 7B: 149/150. Phi-3 128k: 149/150. The U-shape is gone. Long-context training fixed document discrimination the same way it fixed needle retrieval.
Llama-2 7B is the interesting outlier: 59% overall, no U-shape, just uniformly bad. This does not look like a positional problem. It looks like a 2023 model being handed too many nearly identical forms and quietly losing the thread.
Three config changes. Non-uniform positional scaling. The context window grows 32-fold, and the model can find a marked fact anywhere in 107,000 tokens of hay. It can also discriminate between 20 identical-looking documents and pick the right one. Retrieval and discrimination: genuinely, almost clerically, solved.
The cost is real: 96 GB of KV cache, roughly 90 seconds of prefill, and a $6/hour GPU just to measure the pain. But LongRoPE delivers what it promises. Training teaches the model how to use the extra reach. And attention, the same QKᵀ we built by hand in Post 1, handles the retrieval without breaking a sweat.
So: finding a fact in a novel? Solved. Finding the right document among 20 lookalikes? Also solved.
But a bigger filing cabinet does not automatically create a better clerk. More shelves, same judgment. Finding the folder is one thing. Connecting two folders, resisting distractors, and not confidently inventing paperwork from the wrong drawer is another.
That is Post 3.
Footnotes
1. Su et al. (2021), “RoFormer.” We treated it as a black box last time. Now we need to look inside the box. ↩
2. Ding et al. (2024), “LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens.” The backstory: kaiokendev (2023) showed a single-line RoPE frequency change could extend context. Chen et al. (2023) at Meta independently discovered position interpolation. YaRN (Peng et al., 2024) added NTK-aware scaling with temperature adjustment; it’s what Qwen 2.5 uses for its 128k claim. LongRoPE takes the next step: per-dimension, non-uniform factors learned via search. ↩
3. There’s also a `short_factor` array (range 1.1x–3.0x) used for sequences shorter than the original 4,096. The model applies different factors depending on whether the sequence exceeds the original training length. Belt and suspenders. ↩
4. We used Modal to run on NVIDIA B200 GPUs, with 180 GB of HBM3e each, which is over 5x the laptop’s unified memory. The scripts are in the `some-attention` repo. ↩
5. In practice you’d use Flash Attention (Dao, 2022) and fp16, which dramatically reduce both the compute and the memory. But the O(n²) scaling is fundamental; these are constant-factor improvements, not algorithmic ones. ↩
6. We’re talking tens of thousands of dollars per chip, and hundreds of thousands of dollars per system. ↩
7. Introduced by Greg Kamradt in November 2023, who tested GPT-4 128K by hiding a fact about pizza toppings in Paul Graham essays. The resulting heatmap went viral. There’s no single canonical implementation; everyone writes their own. Ours uses procedurally generated filler instead of real text, so the haystack’s semantic content can’t help or hurt retrieval. We rotate three different facts across probes to prevent the model from memorizing a single needle. ↩
8. We use the same transit theme across all our experiments (this post and the next one). The `★ PRIORITY ROUTE` marker makes the needle visually distinct from the filler, but the model still has to locate it by content; there’s only one in the entire document. ↩
9. We tested five models on NVIDIA A100-80GB and B200 GPUs via Modal. These are the same five we use for the harder experiments in the next post, chosen to span 4k to 128k context limits and three architecture families. ↩
10. Liu et al. (2023), “Lost in the Middle: How Language Models Use Long Contexts.” An important nuance: their U-shaped curve appeared on multi-document QA, reasoning over 20 Wikipedia articles. That’s harder than finding a single fact. Their simpler key-value retrieval task (closer to our needle) was much easier, and query-aware formatting nearly eliminated the positional bias. Our `★ PRIORITY ROUTE` format gives the model exactly that kind of cue. ↩
11. Jelassi et al. (2024), “Repeat After Me.” They proved this formally for copying tasks and showed that even the smallest transformer (410M params) dramatically outperforms state-space models on retrieval. ↩
12. One plausible mechanism: at extreme context lengths, RoPE’s low-frequency dimensions, the ones that encode global position like “somewhere in the first 10% of 127k tokens,” start to blur. The high-frequency dimensions that encode local position (“this token is right next to that one”) still work fine. The effect is clear: Qwen 2.5 7B fails 11 out of 81 probes, all at 84k+ tokens, mostly in early-middle positions. ↩
13. Needle-in-a-haystack benchmarks catch broken position encodings; that’s about it. RULER (Hsieh et al., 2024) proposes harder retrieval tasks and confirms the gap. BABILong (Kuratov et al., 2024) embeds reasoning tasks in long text up to 10M tokens and finds that LLMs effectively use only 10–20% of their context. Even the Cohere team, whose hybrid RoPE/NoPE architecture was partly motivated by strong needle-in-a-haystack results, acknowledges this: “our focus is solely on testing basic long context capabilities.” ↩
14. Routes from Seattle, the Puget Sound, New York, and San Francisco. All fake schedules, fake vehicle capacities, fake fleet sizes. The model can’t use world knowledge to shortcut: there is no real “Ballard–Fremont Loop” with 6 buses. ↩
15. Every schedule has the same four fields (capacity, frequency, fleet, depot). Bureaucratic filler paragraphs between schedules control the total context length. The question always asks for fleet size, a small integer that is easy to exact-match. Five different target routes per position for robustness. ↩
If you wanna gimme your attention, follow along by email.