Length Generalization in Hierarchical Sparse Attention
Plain-language takeaway
This paper shows how to make long-context LLMs actually work when the input gets enormous. It studies chunk-based sparse attention and finds three design rules that let a model trained on 4K tokens keep working on inputs of up to 32 million tokens.
LLMs usually read text with full attention, which becomes too slow and expensive as the text grows. Chunk-based sparse attention instead reads the text in blocks and focuses only on a few important chunks. But existing designs behave very differently at scale, and it was unclear which components are truly necessary for good long-context performance.
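To make the basic mechanism concrete, here is a minimal sketch of chunk-based sparse attention, assuming a fixed chunk size, mean-pooled chunk summaries, and a sequence length divisible by the chunk size. The function name, shapes, and pooling choice are my own illustrations, not the paper's implementation.

```python
# Minimal sketch of chunk-based sparse attention (illustrative only):
# split keys/values into fixed-size chunks, score each chunk with a cheap
# summary vector, and let every query attend only to its top-k chunks.
import torch
import torch.nn.functional as F

def chunked_sparse_attention(q, k, v, chunk_size=64, top_k=4):
    # q: (num_queries, d); k, v: (seq_len, d); seq_len assumed divisible by chunk_size.
    d = q.shape[-1]
    n_chunks = k.shape[0] // chunk_size
    k_chunks = k.view(n_chunks, chunk_size, d)
    v_chunks = v.view(n_chunks, chunk_size, d)

    # Crude chunk summary: mean of the chunk's keys. The paper argues for a
    # learned chunk encoder with a CLS token instead of pooling like this.
    summaries = k_chunks.mean(dim=1)                      # (n_chunks, d)

    # Score chunks per query and keep only the top-k.
    chunk_scores = q @ summaries.T / d**0.5               # (num_queries, n_chunks)
    top_idx = chunk_scores.topk(top_k, dim=-1).indices    # (num_queries, top_k)

    # Gather the selected chunks and run dense attention inside them only.
    sel_k = k_chunks[top_idx].reshape(q.shape[0], top_k * chunk_size, d)
    sel_v = v_chunks[top_idx].reshape(q.shape[0], top_k * chunk_size, d)
    attn = F.softmax(torch.einsum('qd,qld->ql', q, sel_k) / d**0.5, dim=-1)
    return torch.einsum('ql,qld->qd', attn, sel_v)
```

The cost per query scales with top_k * chunk_size rather than with the full sequence length, which is what makes the approach attractive for very long inputs.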
Method
The authors break down hierarchical sparse attention models and identify three “must-have” design rules (see the code sketch after this list):
Expressive chunk encoder + CLS token: Each chunk needs a strong mini-encoder and a dedicated summary token to represent the chunk for retrieval.
Bypassing Residual Path (BRP): Global information should be injected back into token representations, not only used for selection.
Enforced selection sparsity in pre-training: The model must learn to choose a small set of chunks already during training, so that sparse selection at test time does not come as a distribution shift.
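The following PyTorch sketch shows how the three rules could fit together, under my own assumptions: the class names (ChunkEncoder, SparseChunkSelector), the gating layer, and the top-k value are illustrative stand-ins, not the paper's architecture.

```python
# Hedged sketch of the three design rules in one place (not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChunkEncoder(nn.Module):
    """Rule 1: an expressive per-chunk encoder whose prepended CLS token
    summarizes the chunk for retrieval."""
    def __init__(self, d_model, n_heads=4, n_layers=2):
        super().__init__()
        self.cls = nn.Parameter(torch.randn(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, chunks):                    # chunks: (n_chunks, chunk_len, d)
        cls = self.cls.expand(chunks.shape[0], -1, -1)
        out = self.encoder(torch.cat([cls, chunks], dim=1))
        return out[:, 0]                          # CLS summary per chunk

class SparseChunkSelector(nn.Module):
    """Rules 2 and 3: select only a few chunks (enforced sparsity, also during
    training) and feed the retrieved information back into token states via a
    bypassing residual path."""
    def __init__(self, d_model, top_k=4):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, d_model)   # mixes retrieved info back in

    def forward(self, tokens, chunk_summaries):
        # tokens: (seq_len, d); chunk_summaries: (n_chunks, d)
        scores = tokens @ chunk_summaries.T / tokens.shape[-1] ** 0.5
        top = scores.topk(self.top_k, dim=-1)     # Rule 3: hard top-k selection
        weights = F.softmax(top.values, dim=-1)
        retrieved = torch.einsum('tk,tkd->td', weights, chunk_summaries[top.indices])
        # Rule 2: bypassing residual path — global information is injected into
        # the token representations, not used only to pick chunks.
        return tokens + self.gate(retrieved)
```

Note that d_model must be divisible by n_heads, and that the same hard top-k selection is applied during pre-training so the model never sees a denser selection pattern than it will at inference time.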
Experiments & Results
The paper runs comprehensive ablations to test these components. The full design reaches strong length extrapolation without extra training on long inputs.
4K → 32M: Trained on a 4K context; evaluated on up to 32,000,000 tokens.
Benchmarks: Generalizes on RULER and BABILong.
Training-free: No extra long-context finetuning required.
Key finding: All three design rules are critical.
Limitations
The abstract does not list explicit limitations, so the following are reasonable cautions: results are shown on specific long-context benchmarks, the paper is a preprint, and code/resources are not listed on arXiv. Real-world long-document performance and efficiency trade-offs still need more evidence.