Too Much Attention?

128k tokens of context. Perfect retrieval. Zero reasoning.

Post 2 ended with a suspiciously upbeat question: if the model can find anything in 128k tokens, can it reason about what it finds?

The short answer is no. Not really.

The long answer is also no, but now with heatmaps and a worse feeling in the gut.

This is the third and final post in our trilogy on attention. In Post 1 we built the mechanism. In Post 2 we stretched it until the filing cabinet hit the ceiling. Now we find out what that extra reach actually buys us.

The Boring Heatmap

First, let’s recall what works. The standard test for long-context models is needle-in-a-haystack: bury a random fact in a long document, ask the model to find it,1 vary the document length and the position, and see whether the model can retrieve it.

We ran 63 probes: 7 context lengths × 9 depths:

Needle-in-a-haystack: 63 probes, all correct. The boring heatmap.
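Generating such a probe grid is mechanical. Here is a minimal sketch of the idea — the context lengths are illustrative (only the 5k and 107k endpoints come from the post), lengths are counted in characters rather than tokens for simplicity, and the filler is a repeated paragraph rather than the procedurally generated transit bulletins the real harness uses:

```python
import itertools

def assemble_filler(paragraph: str, n_chars: int) -> str:
    """Repeat a filler paragraph until it reaches roughly n_chars characters."""
    reps = n_chars // len(paragraph) + 1
    return (paragraph * reps)[:n_chars]

def build_probes(needle: str, filler: str) -> list[str]:
    """Generate the 7 x 9 = 63 needle-in-a-haystack probes.

    For each (context length, depth) pair, build a haystack of filler
    text and splice the needle in at the given fractional depth.
    Intermediate lengths here are assumptions, not the exact grid we ran.
    """
    context_lengths = [5_000, 10_000, 20_000, 40_000, 60_000, 90_000, 107_000]
    depths = [0.05, 0.15, 0.25, 0.35, 0.50, 0.65, 0.75, 0.85, 0.95]

    probes = []
    for length, depth in itertools.product(context_lengths, depths):
        haystack = assemble_filler(filler, length)
        cut = int(length * depth)  # character offset for this depth
        probes.append(haystack[:cut] + needle + haystack[cut:])
    return probes
```

The grading side is equally simple: ask "What is the dispatch code?" and string-match the answer.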

63 out of 63 correct, through 106,641 tokens. Every depth. Every context length. All the way to the edge of the 128k window.

The heatmap is boring. It is a lawn. Neatly mowed, aggressively green. That is the point.

So retrieval works. The “lost in the middle” effect reported by Liu et al. (2023)2 was about models that were not trained for long context. Phi-3-128k was trained with LongRoPE from the start. It finds the needle at 5% depth or 95% depth, in 5k tokens or 107k tokens. No holes.

The boring heatmap tells us something important: if you stuff a fact into a 128k-token document and ask the model to find it, the model will find it.

But “finding a fact” is not the same as “understanding a document.” A Ctrl-F search can find a fact too. Nobody mistakes that for wisdom.

In Post 2, we went further: multi-document QA, where 20 identically formatted transit schedules compete and the model has to find the right one. Modern long-context models solved that too. Mistral 7B and Phi-3 128k both hit 149/150.

Retrieval works. Discrimination works. Great. Now let’s ask for judgment, not just filing.

When Retrieval Isn’t Enough

Now let’s make the model think. Or, more fairly, make it stop being merely a GPU-intensive clerk.

Same transit system, harder task. Instead of finding one schedule, the model has to chain two facts. We plant dispatch memos like this:

DISPATCH: The Vashon Shuttle has been assigned dispatch color JADE.

And somewhere else in the document:

DISPATCH: The route with dispatch color JADE departs from Pier 52.

Question: “Where does the Vashon Shuttle depart from?” Answer: Pier 52.

The answer is not written in any single memo. You need both pieces. The bridge is the dispatch color. This is still tiny reasoning by human standards, but it already exceeds glorified Ctrl-F. Barely, but it does.

We also scatter nine distractor dispatches with the same structure but different routes, colors, and terminals.3 The distractors never share the target’s dispatch color or terminal, so the model cannot shortcut. It has to follow the right chain instead of just vibing aggressively.
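The construction is easy to sketch. The builder below is a hypothetical simplification of the repo's probe generator: one target chain plus nine distractor chains drawn from disjoint pools, all using the same memo template:

```python
import random

def build_2hop_probe(routes, colors, terminals, n_distractors=9, seed=0):
    """Sketch of a 2-hop probe builder (hypothetical, not the repo's code).

    The target chain is route -> color -> terminal; distractor chains
    reuse the memo template but never the target's color or terminal,
    so the only path to the answer is the bridge color.
    """
    rng = random.Random(seed)
    route, color, terminal = routes[0], colors[0], terminals[0]

    memos = [
        f"DISPATCH: The {route} has been assigned dispatch color {color}.",
        f"DISPATCH: The route with dispatch color {color} departs from {terminal}.",
    ]
    # Distractors draw from the remaining (disjoint) pool entries.
    for r, c, t in zip(routes[1:n_distractors + 1],
                       colors[1:n_distractors + 1],
                       terminals[1:n_distractors + 1]):
        memos.append(f"DISPATCH: The {r} has been assigned dispatch color {c}.")
        memos.append(f"DISPATCH: The route with dispatch color {c} departs from {t}.")

    rng.shuffle(memos)  # scatter memos before embedding them in filler text
    return memos, f"Where does the {route} depart from?", terminal
```

In the real experiment the shuffled memos are then placed at controlled fractional positions in the filler document, which is what the distance conditions below vary.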

We centered the chain at 50% of the context and varied the gap between the two facts: close (5% apart), medium (30%), or far (70%). Fifteen chains per condition, across five models.4

2-hop reasoning accuracy by fact distance. Close = 5% apart, medium = 30%, far = 70%. Fifteen chains per condition.

Close: most models handle it, except Llama-2, which struggles even when the facts are basically holding hands. Medium: Phi-3 4k drops out; Mistral, Qwen, and Phi-3 128k succeed only sometimes. Far: near-total failure across the board.

Distance matters more than context length. The main variable is how far apart the facts are, not how many filler tokens surround them. Moving the facts from 5% apart to 70% apart is what kills reasoning.

The model can find any single fact. It just can’t chain facts that are far apart.

The model doesn’t fail—it lies

And here is the part where the joke curdles.

When the model gets a multi-hop question wrong, it does not say “I don’t know.” It does not refuse to answer. It does not hedge. It gives you a confident, well-formatted, completely wrong answer.

On one probe, we ask “Where does the Vashon Shuttle depart from?” The correct answer is Greenwood Yard. The model says Pier 52. That is a real terminal from a distractor dispatch, not the Vashon Shuttle’s terminal.

Ask “Where does the Kingston Crossing depart from?” Correct answer: Tacoma Depot. The model says Edmonds Dock.

Ask “Where does the Green Lake Crosstown depart from?” Correct answer: Lynnwood Base. The model says Olympia Base.

Wrong terminal. Wrong chain. Right format. Bureaucratically flawless. Substantively useless.

Without ground truth, you would have no way to tell these are wrong. The answers are plausible, structured correctly, and drawn from the actual context, just from the wrong chain.

The model is still doing retrieval. It is picking up a distractor dispatch that shares the same structure (The route with dispatch color X departs from Y), and at distance it loses track of which chain it is following.

This is way worse than failure. A blank stare would be better. A refusal is honest and actionable. A confident wrong answer is neither. You take it at face value and move on. The error propagates silently.

Does position matter for reasoning?

The distance experiment above confounds two things. When facts are at 10% and 80%, you can’t tell whether the failure is because they are far apart or because one of them lives in an awkward part of the context. So we disentangled that.

New experiment: 2-hop chains with a constant 5% gap between facts, slid through six positions in the context (10% to 85%). Same transit theme: “The Ballard Express has dispatch color AMBER” + “The route with dispatch color AMBER departs from Pier 52” → “Where does the Ballard Express depart from?”

Nine distractors, five different chains per condition, three context lengths, ninety probes per model.5
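The placement logic is simple: for a given chain center, the two facts straddle it symmetrically. A sketch — the six center values are our assumption of an even 10%-to-85% spacing; the 40% case matches the worked example in footnote 5:

```python
def fact_positions(center: float, gap: float = 0.05) -> tuple[float, float]:
    """Fractional positions of the two facts for a given chain center.

    The facts straddle the center, `gap` apart, so only the chain's
    position in the context varies, never the hop distance.
    """
    return (center - gap / 2, center + gap / 2)

# Six chain-center positions from 10% to 85% (even spacing assumed):
centers = [0.10, 0.25, 0.40, 0.55, 0.70, 0.85]
```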

Multi-hop position sweep: chain center position (x-axis) × context length (y-axis). Same 5% hop gap everywhere—only position varies. Compare with the near-perfect multi-doc QA results from the previous post.

The numbers:

Multi-document QA (Post 2) vs multi-hop reasoning (this post). Same models, same context. The only difference: one fact or two.

Every model drops. Phi-3 128k goes from 99% retrieval (multi-doc QA from Post 2) to 71% reasoning. Qwen goes from 97% to 66%. Same context. Same positions. Same distractors. The only difference is whether you need one fact or two.

And the positional effects are real, but they differ from the classic U-shape. Llama-2 7B shows a monotonic cliff: 5/5 at the front, 0/5 at the back. The modern models show a middle dip. The 40% center position is worst for Phi-3 128k (0/5 at 4k tokens) and Qwen (1/5 at longer contexts). The edges, 10% and 85%, are more reliable.

This is the key finding of the post. Retrieval is position-invariant for modern models. Reasoning is not. The model can find each fact almost anywhere in the context. It just can’t chain them when they land in certain positions.

What the research says

We’re not the first to notice this. This observation is smeared all over the literature.

The first hop works fine. Yang et al. (2024) found strong evidence that models reliably identify bridge entities. It is the second hop where things fall apart: “moderate” evidence, “highly contextual” utilization. The model picks up the first fact and then wanders off.

With distractors, it’s worse. Agarwal et al. (2025) showed that pre-trained LLMs “resort to random guessing among all plausible conclusions,” exactly what we see when facts are far apart. The hopeful part is that minimal fine-tuning creates a “sharp phase transition” to near-perfect accuracy. The hardware can do it. It just has not been taught.

The pattern holds at scale. Kuratov et al. (2024) introduced BABILong, a benchmark of 20 reasoning tasks over long documents, and found that popular LLMs effectively utilize only 10–20% of the context window.6 Performance drops sharply with reasoning complexity—single-fact retrieval works, multi-fact chaining does not. Sound familiar?

The failure is specific to latent composition. Balesni et al. (2025) demonstrated this cleanly: when A → B and B → C appear in separate training documents, A → C accuracy drops to chance despite perfect recall of each individual fact. Give the model chain-of-thought prompting? It succeeds. Put both facts in one document? It succeeds. The failure is composition across distance, not composition itself. Johnston and Belrose (2025) quantified the cost: latent two-hop QA requires roughly 2x the parameter capacity of one-hop. Chain-of-thought eliminates this overhead.

Li et al. (2024) proposed a practical mitigation: prompt the model to supply attributions for each assertion, forcing it to show its work. It helps—but it’s a workaround, not a fix.

And the distractors do not just degrade performance. They steal attention. Lee et al. (2026) found drops of up to 80% with contextual distractors, plus an inverse scaling trend where more test-time compute makes things worse. Patel et al. (2024) saw the same in formal logic: accuracy drops from 68% to 43% across reasoning depths, and larger models sometimes perform worse than smaller ones.

The model is not ignoring the noise. It is attending to it. That is the whole problem.

How many hops can the model follow?

So far, every chain has exactly two hops: route → color → terminal. Two facts, one link. What happens when the chain gets longer?

We extended the experiment to 4-hop and 8-hop chains. A 4-hop chain links five attributes: route → color → capacity → zone → terminal. An 8-hop chain links nine: route → color → capacity → zone → line code → depot → frequency code → fleet ID → terminal. The question is always the same — “Where does the route depart from?” — so the terminal is always the final attribute. Only the chain length changes.7
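Structurally, an n-hop chain is just a linked list of dispatch memos. A simplified sketch with a generic linking template — the real memos use attribute-specific phrasing like “has been assigned dispatch color” and “departs from”:

```python
def build_chain(route: str, bridge_values: list[str], terminal: str) -> list[str]:
    """Build an n-hop dispatch chain: route -> v1 -> ... -> terminal.

    Each memo links one attribute value to the next, so a chain with
    k bridge values yields k+1 memos over k+2 linked attributes. The
    question ("Where does the route depart from?") always resolves to
    the terminal at the end. Hypothetical sketch, not the repo's code.
    """
    links = [route] + bridge_values + [terminal]
    return [f"DISPATCH: {a} is linked to {b}." for a, b in zip(links, links[1:])]
```

A 4-hop chain passes three bridge values (color, capacity, zone); an 8-hop chain passes seven.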

Same distance conditions (close/medium/far), same five models, fifteen chains per condition:

N-hop accuracy by model, hop count, and distance. The expected monotonic decline from 2→4→8 hops doesn't appear. Instead, most models show a V-shape: 4-hop is worst.

The numbers, close distance only:

N-hop accuracy at close distance (5% gap). The expected monotonic decline doesn't appear — 4-hop is worst, 8-hop recovers.

This is not what we expected. More hops should mean more chances to lose the chain. A monotonic decline (2-hop good, 4-hop worse, 8-hop worst) would make sense. It would also be much tidier. The experiments declined to cooperate.

Instead, four out of five models show a V-shape: 4-hop is the worst, and 8-hop recovers to near 2-hop levels. Phi-3 128k goes 97% → 20% → 87%. Qwen goes 77% → 33% → 70%.

The pattern holds at medium and far distances too. At medium distance, Phi-3 128k goes 43% → 7% → 83%. At far distance, Mistral goes 17% → 37% → 77%.

What’s going on? We think the answer is signal density. Our experiment uses a fixed nine distractors regardless of hop count. At 2 hops, the target chain is 2 facts out of 11 total dispatch memos (18%). At 8 hops, it is 8 out of 17 (47%). The target chain gradually takes over the document. Eight dispatch memos chained through color → capacity → zone → line code → depot → frequency code → fleet ID → terminal create a strong structural signal that is hard to miss, especially at close distance where all eight facts sit within 5% of the context.

Four hops hits an awkward middle. There are enough links to require real multi-step reasoning, but not enough to overwhelm the distractors. The chain is 4 out of 13 memos (31%): neither sparse enough to disappear nor dense enough to become obvious.

We can test this directly. Rerun the same experiment but scale distractor count with hop count: 2-hop → 9 distractors, 4-hop → 18, 8-hop → 36. This holds signal ratio constant at ~18% across all conditions.8
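As code, the scaling rule is a direct transcription of the formula in footnote 8 (helpers are ours, not the repo's):

```python
def scaled_distractors(hops: int, baseline_hops: int = 2, baseline_distractors: int = 9) -> int:
    """Distractor count that holds the signal ratio constant across hop counts.

    With 2 hops and 9 distractors, the target chain is 2/11 ≈ 18.2% of
    all dispatch memos; scaling distractors proportionally keeps that
    ratio fixed as the chain grows.
    """
    return round(hops * baseline_distractors / baseline_hops)

def signal_ratio(hops: int, distractors: int) -> float:
    """Fraction of dispatch memos that belong to the target chain."""
    return hops / (hops + distractors)
```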

N-hop accuracy at close distance with scaled distractors (constant 18% signal ratio). The V-shape disappears for Phi-3 128k — accuracy now declines monotonically.

The V-shape disappears for Phi-3 128k. With fixed distractors: 97% → 20% → 87%. With scaled distractors: 97% → 53% → 17%. A clean monotonic decline — exactly what you’d expect if reasoning degrades with chain length.

The story is messier for other models. Mistral still shows a dip-then-recovery (73% → 10% → 63%), and Phi-3 4k actually improves at 4-hop (87% → 93%) before crashing at 8-hop (47%). These models may have other confounds — attention patterns, training data, chain-following heuristics — that our signal-density correction doesn’t fully account for. But the cleanest long-context model, Phi-3 128k, confirms the hypothesis: when you hold signal density constant, more hops means worse performance.

This is a useful lesson. Multi-hop benchmarks need distractor counts that scale with chain length. A fixed distractor count confounds reasoning difficulty with signal-to-noise ratio. The V-shape in our original data was real but an artifact—an accidental boost from the target chain becoming an increasingly large fraction of the document.

Prior work on chain length is consistent with this. Golovneva et al. (2025) tested 1–4 hop QA over 64k–128k-token novel excerpts and found a steady ~12-point decline per hop with no V-shape—supporting our signal-density explanation.9 The earlier MuSiQue benchmark (Trivedi et al., 2022) showed the same pattern: answer F1 drops from 57.9 at 2 hops to 28.1 at 4 hops. Li et al. (2026) found a sharp drop when required hops exceed the training distribution, with errors concentrating at specific “erroneous processing heads”—suggesting the failure is localized in the architecture, not diffuse.

Press et al. (2023) measured the “compositionality gap”: how often a model answers each sub-question correctly but fails to compose them. The gap does not shrink with model scale.10 Models memorize more facts but do not necessarily chain them better. Our results add a spatial dimension: even when the model has the facts, distance between them is what breaks the chain.

What is unambiguous is that distance remains the dominant factor. Even with scaled distractors, far-distance accuracy collapses for every model. Phi-3 128k at 8-hop: 17% close vs 3% far. Qwen at 8-hop: 47% close vs 3% far. The pattern holds at every chain length.

The model can follow a long chain when the facts stay nearby. It cannot reliably follow even a short chain when the facts are far apart.

How much noise can the model tolerate?

The scaled-distractor experiment answered one question (is the V-shape an artifact?) but raised another. We held signal ratio constant while varying hops. What if we hold hops constant and vary signal ratio? This directly measures noise tolerance at a fixed reasoning depth — and tells us whether there is a cliff or a gradual slope.

Fixed: 2-hop chains, close/medium/far distance, five models. Variable: distractor count. We swept from 3 distractors (40% signal ratio) through our 9-distractor baseline (18%) to 18 (10%), 36 (5%), and 72 (3%).11 At 72 distractors, the two target facts are buried among seventy-four dispatch memos. The chain is 2.7% of the document.
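Each condition's signal ratio follows directly from the distractor count (two target facts out of 2 + d memos):

```python
def sweep_signal_ratios(distractor_counts=(3, 9, 18, 36, 72), hops: int = 2) -> dict[int, float]:
    """Signal ratio at each distractor count in the sweep: hops / (hops + d)."""
    return {d: hops / (hops + d) for d in distractor_counts}
```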

2-hop accuracy at close distance as signal ratio decreases from 40% (d3) to 3% (d72). No cliff—just a noisy slope.

No cliff. Just a slope, and a rude one. Mistral goes 100% → 73% → 80% → 50% → 33%. Qwen goes 100% → 77% → 87% → 73% → 63%. Phi-3 128k goes 93% → 97% → 67% → 73% → 57%. The trend is downward but noisy — non-monotonic wiggles at intermediate distractor counts, even with 30 probes per condition for the long-context models.

The short-context models are a mess. Phi-3 4k is 73% at d3, then improves to 87% at d9 before crashing to 0% at d36. Llama-2 is near floor everywhere except d3. These models don’t have enough baseline accuracy for a noise-tolerance curve to emerge.

The strong models tell a clearer story: roughly a 40–70% drop as the signal ratio falls 15-fold, from 40% to 2.7%. That is a gradual slope, not a phase transition. There is no magic threshold where the model suddenly can’t cope. It just gets progressively worse, with each additional batch of distractors stealing a little more attention budget from the real chain.

This matches recent findings in a different domain. Yang et al. (2025) built GSM-DC, a math-reasoning benchmark with controlled distractor injection, and found that error follows a power law with distractor count: E(m) ∝ m^δ, where the exponent δ grows with reasoning depth.12 GPT-4.1’s step accuracy at 5 reasoning steps drops from 26% with one distractor to 2% with fifteen—steep, but smooth. No cliff. Our results tell the same story in the factual-reasoning domain: gradual degradation, amplified by depth.

This is consistent with the softmax competition picture from the next section. More distractors means more plausible-looking chains competing for attention weight. The model does not catastrophically fail — it just makes the wrong choice more often. Biran et al. (2024) showed where the failure happens: the bridge entity (our dispatch color) resolves in early transformer layers, but the second hop must complete in later layers that may no longer encode the necessary associations.13 Yu and Belinkov (2025) traced this further, identifying four stages of multi-hop reasoning via logit flow: subject recall, relation retrieval, relation attribute extraction, and answer output. Failures concentrate at the third stage, where conflicting logits from distractor entities reduce extraction accuracy.14 More distractors means more competing bridge entities in those early layers, which degrades the signal passed to the later layers that need it.

Where Attention Actually Goes

So the model can find any single fact but cannot chain facts across distance. Why?

Let’s look inside. We ran a forward pass at three context lengths — 512, 2,048, and 8,192 tokens — and extracted the attention weights from three layers: 0 (first), 15 (middle), 31 (last). For each layer, we computed three statistics across all 32 heads:15

Layer   Metric         n=512    n=2,048   n=8,192
0       Sink weight    0.014    0.004     0.001
0       Local weight   0.520    0.313     0.084
0       Entropy        4.75     5.93      7.69
15      Sink weight    0.609    0.559     0.566
15      Local weight   0.279    0.175     0.102
15      Entropy        1.92     2.45      3.00
31      Sink weight    0.616    0.566     0.487
31      Local weight   0.403    0.348     0.161
31      Entropy        1.57     1.94      3.70
Attention statistics across three layers and three context lengths. Sink weight dominates deep layers. Local weight dilutes with context. Entropy reveals the focus pattern.

Three patterns jump out.

1. Attention sinks are real, and they’re huge. Layers 15 and 31 send 49–62% of all attention weight to token 0 at every context length. This is the “attention sink” phenomenon identified by Xiao et al. (2023): the first token acts as a garbage collector. When a head does not have useful work to do, it dumps its weight on token 0 instead of spreading it across irrelevant positions. Yu et al. (2024) traced the cause to early feed-forward layers amplifying the first token’s hidden-state norm,16 which turns it into a strong attractor for later attention computation.

2. Local attention dilutes with context. Layer 0’s local weight, the fraction of attention paid to the 64 nearest tokens, drops from 52% at n=512 to 8.4% at n=8,192. More tokens means each nearby token gets a smaller share of the fixed attention budget. This is the competition effect from Post 1: softmax normalizes to 1.0, so every new token steals a little budget from every old one.

3. Early layers look local, deep layers look selective. Layer 0 has almost no sink weight (1.4%) but high local weight (52%). The model’s first layer focuses on nearby tokens: syntax, local patterns, subword assembly. Layers 15 and 31 flip. Heavy sink weight, lower local weight, lower entropy. The deep layers concentrate attention on a few positions, the sink plus a handful of informative tokens, instead of spreading it around.

The entropy column tells the full story. Layer 0’s entropy grows from 4.75 to 7.69 as context expands (max possible for uniform over 8,192 positions is ln(8192) ≈ 9.0). It is distributing attention broadly, trying to absorb everything. Deep layers resist that increase: layer 31 goes from 1.57 to 3.70. Even with 16x more tokens, the deep layers stay focused on a few positions.
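For concreteness, here is how the three statistics can be computed from a single attention map. This is a sketch consistent with the definitions in footnote 15, not the repo's extraction code, and it ignores causal masking:

```python
import numpy as np

def attention_stats(attn: np.ndarray, local_window: int = 64):
    """Sink weight, local weight, and entropy for one attention map.

    `attn` is an (n_queries, n_keys) matrix where each row is a
    softmax distribution over key positions.
    """
    n_q, n_k = attn.shape
    sink = attn[:, 0].mean()  # average weight on token 0

    # Average weight falling within +/- local_window of each query position
    q_idx = np.arange(n_q)[:, None]
    k_idx = np.arange(n_k)[None, :]
    local = (attn * (np.abs(q_idx - k_idx) <= local_window)).sum(axis=1).mean()

    # Mean Shannon entropy of each row, in nats (max is ln(n_k) for uniform)
    entropy = -(attn * np.log(attn + 1e-12)).sum(axis=1).mean()
    return sink, local, entropy
```

As a sanity check, a perfectly uniform map over 512 positions gives a sink weight of 1/512 and entropy of ln(512) ≈ 6.24 — close to what layer 0 looks like, and nothing like the sink-dominated deep layers.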

This is the under-the-hood answer to “why can’t the model chain facts across distance?” The architecture can see all 128k tokens. But in practice, the deep layers, the ones doing semantic integration, spend most of their attention budget on token 0 and a small local window. The model can convene a meeting. It cannot reliably convene the right meeting.

A fact at position 5,000 and a fact at position 80,000 are both visible to the model. That does not mean they are getting enough attention weight, at the right layer, at the same time, for chaining to happen. Line of sight is not the same thing as thought.

Attention statistics by layer and context length. Deep layers concentrate on the sink. Early layers spread broadly but dilute with longer context.

Softmax Can’t Say “I Don’t Know”

There’s a structural reason this happens, and it goes deeper than training data.

Every attention head, at every layer, computes a softmax over all positions. Softmax produces a probability distribution: non-negative values that sum to exactly 1. Every token gets some weight. No zero. No shrug. No abstain button.17

When two structurally similar chains compete, the real chain (Vashon Shuttle → JADE → Greenwood Yard) and a distractor (Ballard Express → COBALT → Edmonds Dock), attention has to pick. It can allocate 70% to the right chain and 30% to the wrong one, or vice versa. It cannot allocate 70% to “I’m not sure.”

There is no way for attention to represent uncertainty about which associative path to follow. It must commit. Softmax is many things. Uncertain is not one of them.
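A toy example makes the constraint concrete: feed softmax the scores of two near-tied chains and it must split the budget between them, every weight strictly positive, the total pinned to one:

```python
import numpy as np

def softmax(scores: np.ndarray) -> np.ndarray:
    """Standard softmax: exponentiate and normalize."""
    e = np.exp(scores - scores.max())  # subtract max for numerical stability
    return e / e.sum()

# Two structurally similar chains with nearly identical attention scores.
scores = np.array([2.1, 2.0])  # [real chain, distractor chain]
weights = softmax(scores)
# The head barely prefers the real chain (~52% vs ~48%). There is no
# third outcome that represents "neither, I'm not sure".
```

The only way to give a chain exactly zero weight would be a score of negative infinity, which never occurs in practice.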

The attention sink, token 0 absorbing 50-60% of weight in deep layers, is the closest thing the model has to a “none of the above” option. When a head does not know what to attend to, it defaults to the sink. But this is a learned hack, not a principled uncertainty signal. The sink absorbs the weight silently. Downstream layers cannot distinguish “this head had nothing useful to attend to” from “this head attended to the first token because it was informative.”

The Credal Transformer18 proposes a fix: replace softmax with a credal set — a set of possible distributions rather than a single distribution. When evidence is ambiguous, the set stays wide rather than collapsing to a point estimate. It’s an elegant idea, and it directly addresses the mechanism we’ve identified. Standard softmax “collapses ambiguous attention scores into a single probability distribution, discarding uncertainty information at each layer.” A credal set preserves the ambiguity.

Whether credal sets are the answer or not, the problem is clear. The longer the context, the more distractors compete, and the model has no principled mechanism to hedge. A 4k-token document has a few competing chains. A 128k-token document has hundreds.

The attention budget does not grow. The competition does. The paperwork piles up.

What We Learned

Three posts. Five models. One equation and a steadily more suspicious filing cabinet.

In Post 1, we took attention apart: project Q, K, V, score, scale, mask, softmax, aggregate. Fifteen tokens. Everything fit on a laptop.

In Post 2, we stretched it: three config changes, non-uniform positional scaling, and the context window grows 32x. So does the cost: 96 GB of KV cache, roughly 90 seconds of prefill, and a $6/hour GPU just to watch the arithmetic happen. We also tested multi-document QA, 20 identical-looking schedules and one relevant answer, and modern models solved it (97-99%).

In this post, we broke the illusion. Not the attention equation—that works exactly as designed. What breaks is the assumption that capacity equals capability. And it breaks the same way across five different models.

Across the trilogy, we tested five tasks of increasing difficulty:

  1. Needle retrieval (Post 2) — find one marked fact. Every modern model aces this. The 128k context window works.
  2. Multi-document QA (Post 2) — find the right document among 20 identical-looking alternatives. Modern models handle this too (97–99%). The U-shape from Liu et al. is gone.
  3. Multi-hop reasoning (this post) — chain two facts to answer a question. Every model drops. Phi-3 128k: 99% → 71%. Qwen 2.5 7B: 97% → 66%. The position of the chain matters. The middle of the context is worst.
  4. N-hop chains (this post) — extend from 2 to 4 and 8 hops. With fixed distractors, longer chains show a V-shape (4-hop worst, 8-hop recovery) — a signal-density artifact. Scaling distractors to hold signal ratio constant eliminates the V-shape for the strongest model (Phi-3 128k: 97% → 53% → 17%), confirming that more hops means worse reasoning.
  5. Signal-ratio sweep (this post) — hold hops at 2 and vary distractor count from 3 to 72. No cliff — accuracy degrades gradually as signal ratio drops from 40% to 3%. Distance remains dominant regardless of chain length or distractor count.

The gap between task 2 and task 3 is the entire story. Retrieval is position-invariant. Discrimination is position-invariant. Reasoning is not. The model can find each fact and identify which document is relevant, but it cannot reliably connect facts across positions, especially in the middle of the context and especially at longer lengths.

And when it fails, it fails silently. A confident, well-formatted, completely wrong answer drawn from the right context but the wrong chain. Without ground truth, you can’t tell.

The dangerous part is not that long-context models fail at reasoning. It is that they succeed at everything except reasoning, which makes the reasoning failures hard to see. The output is fluent. The format is correct. The answer comes from the real context. It is just from the wrong chain.

So here is the trilogy in one sentence: attention gives the model line of sight, not judgment.

Long context is necessary but not sufficient.19 The attention mechanism provides the capacity for global interaction across distance. Training determines whether that capacity becomes capability. Right now, for multi-hop reasoning over distance, it mostly has not, across five models, three architectures, and context windows from 4k to 128k.

Maybe the answer is not even longer context windows, but smarter ones: prompt compression, context distillation, learning what to forget.20 If attention is expensive and most of the context goes underused, the bottleneck may not be how much the model can see. It may be how much it can think about at once.

Reproducing These Results

All experiments run on Modal with NVIDIA A100-80GB and B200 GPUs. The companion code lives in some-attention/. You’ll need a Modal account and uv installed.

If you’d like to reproduce the results yourself, here you go:

# Clone and set up
git clone https://github.com/initsecret/some-attention
cd some-attention

# 2-hop distance sweep (15 chains per condition)
uv run modal run modal/multihop.py --mode distance --model phi3-128k
uv run modal run modal/multihop.py --mode distance --model phi3-4k
uv run modal run modal/multihop.py --mode distance --model llama2-7b
uv run modal run modal/multihop.py --mode distance --model mistral-7b
uv run modal run modal/multihop.py --mode distance --model qwen25-7b

# N-hop distance sweep (4 and 8 hops, all models)
uv run python scripts/post3/run_nhop_experiments.py

# Scaled-distractor ablation (constant 18% signal ratio)
uv run python scripts/post3/run_scaled_distractors.py

# Signal-ratio sweep (fixed 2-hop, vary distractor count)
uv run python scripts/post3/run_signal_sweep.py

# Multi-hop position sweep (90 probes per model)
uv run modal run modal/multihop.py --mode position --model phi3-128k
uv run modal run modal/multihop.py --mode position --model phi3-4k
uv run modal run modal/multihop.py --mode position --model llama2-7b
uv run modal run modal/multihop.py --mode position --model mistral-7b
uv run modal run modal/multihop.py --mode position --model qwen25-7b

# Generate N-hop comparison heatmaps
uv run python scripts/post3/gen_nhop_heatmap.py
uv run python scripts/post3/gen_scaled_heatmap.py
uv run python scripts/post3/gen_signal_heatmap.py

Results are saved to results/multihop_*.json and results/multihop-distance_*.json. Total cost across all experiments: about $15 in GPU time, which is cheaper than we expected.

Footnotes

  1. Three different facts rotated across probes to avoid memorization effects. The “haystack” is procedurally generated transit operations bulletins — maintenance advisories, fleet reports, safety notices. The needle is a marked schedule: “★ PRIORITY ROUTE — Vashon Island Ferry / Dispatch code: 421.” The question: “What is the dispatch code?” Not exactly the Turing test, but that’s the point—this is pure retrieval.

  2. Liu et al. (2023), “Lost in the Middle: How Language Models Use Long Contexts.” They found that models performed best when relevant information was at the beginning or end of the context, with degraded performance in the middle. That finding motivated much of the long-context training work that followed — including LongRoPE.

  3. For example: “The Ballard Express has been assigned dispatch color COBALT” and “The route with dispatch color ONYX departs from Edmonds Dock.” Same format, different values. Fifteen chains per condition, generated programmatically from pools of 20 routes, 20 colors, and 20 terminals.

  4. Short-context models (Phi-3 4k, Llama-2 7B) ran at one standardized context length (~1.5k tokens), 15 probes per distance. Long-context models (Mistral 7B, Qwen 2.5 7B, Phi-3 128k) ran at two context lengths (~1.5k and ~25k tokens), 30 probes per distance. This lets us compare across context length for the long-context group without making the task trivial for the short-context group.

  5. The 5% gap means facts are always close together — at center position 40%, fact 1 is at 37.5% and fact 2 at 42.5%. Only the position of the chain varies, not the distance between hops. Distractors never share the target’s dispatch color or terminal, preventing shortcut retrieval.

  6. Kuratov et al., “BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack” (NeurIPS 2024). Among context extension methods, recurrent memory transformers performed best, handling up to 50 million tokens — but only after fine-tuning. The off-the-shelf models we test in this post are representative of what BABILong’s results predict.

  7. Each hop adds one dispatch memo linking attribute k to attribute k+1. Attribute pools (20 values each) are drawn from transit operations: dispatch colors, vehicle capacities, zones, line codes, depots, frequency codes, fleet IDs, and terminals. Nine distractors per chain, regardless of hop count — a design choice we’ll revisit below.

  8. The formula: distractors = round(hops × 9 / 2). At every hop count, the target chain is hops / (hops + distractors) ≈ 2/11 ≈ 18.2% of all dispatch memos. This isolates reasoning depth from signal-to-noise ratio.

  9. Golovneva et al., “NovelHopQA: Diagnosing Multi-Hop Reasoning Failures in Long Narrative Contexts” (EMNLP 2025). Their benchmark uses full novels as context, avoiding the signal-density confound inherent in synthetic designs like ours.

  10. Press et al., “Measuring and Narrowing the Compositionality Gap in Language Models” (EMNLP 2023). They also proposed “self-ask” prompting, where the model decomposes multi-hop questions into sub-questions — a form of chain-of-thought that narrows the gap.

  11. At each distractor count, signal ratio is 2 / (2 + distractors). The d9 condition reuses the existing baseline results. All other conditions are new runs: 5 models × 4 distractor counts × 3 distances × 15 chains = 900 new probes.

  12. Yang et al., “How Is LLM Reasoning Distracted by Irrelevant Context?” (EMNLP 2025). Their GSM-DC benchmark constructs symbolic DAG-based math problems with precise distractor injection, testing m = 1 to 15 distractors across reasoning depths 2–5. The power-law fit explains why we see a slope rather than a threshold. Building on the foundational observation by Shi et al. (2023) that LLMs are “dramatically” distracted by irrelevant context in math problems (GSM-IC, ICML 2023).

  13. Biran et al., “Hopping Too Late: Exploring the Limitations of Large Language Models on Multi-Hop Queries” (EMNLP 2024). Their “back-patching” method — injecting later-layer representations back into earlier layers — fixes 32–66% of previously incorrect multi-hop queries, confirming that the failure is localized to specific layers, not diffuse.

  14. Yu and Belinkov, “Back Attention: Understanding and Enhancing Multi-Hop Reasoning in Large Language Models” (EMNLP 2025). Their “back attention” intervention — copying later-layer information back to earlier layers — improves multi-hop accuracy, complementing Biran et al.’s “back-patching” finding. The common insight: multi-hop reasoning fails because information needed in early layers only becomes available in later ones.

  15. Sink weight: average attention paid to token 0 (the first token). Local weight: average attention within a 64-token window around each query position. Entropy: Shannon entropy of the attention distribution — higher means more spread out, lower means more concentrated. Max possible entropy for uniform attention over n tokens is ln(n).

  16. Yu et al. (2024), “When Attention Sink Emerges in Language Models: An Empirical View” (ICLR 2025). They found that the sink emerges early in training and is driven by norm amplification in the first few FFN layers, not by the attention mechanism itself.

  17. In plain English: softmax exponentiates every score and divides by the sum of all exponentiated scores. Every output is therefore strictly positive. The only true zero would require a score of negative infinity.

  18. “Credal Transformer: A Principled Approach for Quantifying and Mitigating Hallucinations in Large Language Models” (2025). They show that credal attention reduces hallucinations in summarization and question-answering tasks. The connection to our multi-hop failures is direct: when two chains compete, a credal head can say “both are plausible” instead of picking one.

  19. This isn’t just about distractors. Du et al. (2025) demonstrated that even with perfect retrieval and no distractors, context length alone degrades reasoning performance by 13.9–85%. Masking irrelevant tokens doesn’t help. The degradation comes from the computational overhead of processing long sequences, not from failing to find relevant information. “Context Length Alone Hurts LLM Performance Despite Perfect Retrieval” (EMNLP Findings 2025).

  20. There’s active research on all of these: retrieval-augmented generation (pull in only what you need), context distillation (compress a long document into a short summary the model can reason over), and selective attention (train the model to ignore irrelevant positions). The field is moving fast!

Still paying attention? Follow along by email.