Mengchen from Aojiensi
Quantum Bit | Public Account QbitAI
At the ACL 2025 awards ceremony, a paper co-authored by DeepSeek's Liang Wenfeng and published jointly with Peking University won a Best Paper Award.
This year's ACL was unprecedented in scale, with 8,360 submissions in total, nearly double last year's 4,407, making the competition exceptionally fierce.
In short, the native sparse attention (NSA) mechanism they propose speeds up long-text processing by 11 times through algorithm-hardware co-design. Even more impressive, performance did not drop at all; it actually surpassed traditional full-attention models.
The first author, Yuan Jingyang, gave a talk at the conference, revealing that the technique can extend the context length to 1 million tokens and will be applied to the next frontier model.
The paper was published after the release of DeepSeek-R1, and its experimental setup mentions that the new model was fine-tuned on data distilled from DeepSeek-R1, so many are speculating that the technique will appear in the next-generation DeepSeek-V4 and DeepSeek-R2.
Slimming Down Attention Mechanisms, Speed Soars 11 Times
For a long time, large language models have processed long texts as if dancing in shackles: the computational cost of traditional full attention grows quadratically with sequence length, and when processing 64k-length texts, attention computation can account for 70-80% of total latency.
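As a rough back-of-envelope illustration of that quadratic growth (not a figure from the paper; the head count and head dimension below are assumed placeholder values, not DeepSeek's configuration), consider the FLOPs of the two O(n²) matrix products in standard full attention:

```python
# Back-of-envelope sketch: approximate FLOPs of the two O(n^2) matrix
# products in standard full attention. Head count and head dimension are
# assumed placeholder values.
def full_attention_flops(seq_len: int, num_heads: int = 32, head_dim: int = 128) -> float:
    qk_scores = 2 * seq_len * seq_len * head_dim   # Q @ K^T
    weighted_v = 2 * seq_len * seq_len * head_dim  # softmax(scores) @ V
    return (qk_scores + weighted_v) * num_heads

for n in (4_096, 16_384, 65_536):
    print(f"{n:>6} tokens: {full_attention_flops(n) / 1e12:6.1f} TFLOPs per layer")

# Each 4x increase in sequence length multiplies attention cost by ~16x,
# which is why attention dominates end-to-end latency at 64k.
```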
The solution proposed in this paper is clever: since not all relationships between words are equally important, why not let the model learn to "focus on the key points"?
NSA adopts a dynamic hierarchical sparsity strategy, with three parallel attention branches working together:
- Compressed attention captures coarse-grained global patterns, like skimming the whole text to grasp the main idea;
- Selected attention focuses on the most important token blocks in the sequence, like carefully reading the key paragraphs;
- Sliding-window attention handles local context, ensuring no details are lost.
The most ingenious part of this design is that it does not simply discard information; instead, carefully designed algorithms keep the computational (arithmetic) intensity balanced, spending compute where it matters most.
More importantly, the entire architecture is deeply optimized for modern GPU hardware and is natively trainable end to end.
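To make the three-branch idea concrete, here is a minimal PyTorch sketch of a gated three-branch combination. It only illustrates the structure described above, not the paper's implementation: ordinary dense attention modules stand in for the compressed, selected, and sliding-window branches, and the simple gating layer is an assumed design rather than NSA's hardware-aligned sparse kernels.

```python
import torch
import torch.nn as nn

class ThreeBranchAttentionSketch(nn.Module):
    """Illustrative sketch of a gated three-branch attention combination.
    The branch modules below are ordinary dense attention stand-ins, NOT
    NSA's compressed/selected/sliding-window sparse kernels."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        # Stand-ins for: coarse-grained global view, selected key blocks,
        # and local sliding-window context.
        self.compressed = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.selected = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.sliding = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Learned per-token gates decide how much each branch contributes.
        self.gate = nn.Linear(dim, 3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        cmp_out, _ = self.compressed(x, x, x)  # global summary branch
        slc_out, _ = self.selected(x, x, x)    # important-block branch
        win_out, _ = self.sliding(x, x, x)     # local-context branch
        g = torch.sigmoid(self.gate(x))        # shape: (batch, seq_len, 3)
        return (g[..., 0:1] * cmp_out
                + g[..., 1:2] * slc_out
                + g[..., 2:3] * win_out)

x = torch.randn(2, 64, 256)                 # (batch, seq_len, dim)
out = ThreeBranchAttentionSketch(256)(x)
print(out.shape)                            # torch.Size([2, 64, 256])
```

The point of the sketch is the per-token gates: each branch contributes according to a learned weight, so the model decides at every position how much global summary, block-level focus, and local context to mix.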
In practical tests on 64k-length sequences, NSA showed remarkable speed advantages across decoding, forward propagation, and backward propagation.
Decoding was 11.6 times faster, forward propagation 9 times faster, and backward propagation 6 times faster, delivering real efficiency gains for both inference and training.
Faster and More Accurate, New Breakthroughs in Long Text Processing
Speed is just one aspect of NSA. What surprised people even more was its performance in various benchmark tests.
In general benchmark tests, a 27B parameter model pre-trained with NSA exceeded the full-attention baseline in seven out of nine evaluation metrics. Especially in reasoning-related benchmarks, DROP improved by 0.042, and GSM8K improved by 0.034, demonstrating the unique advantage of sparse attention in forcing the model to focus on key information.
The test results for long text processing were even more impressive. In the "needle-in-a-haystack" test with a 64k context, NSA achieved perfect retrieval accuracy at all positions. On the LongBench benchmark, NSA scored 0.469 on average, surpassing the full-attention baseline (+0.032) and significantly outperforming other sparse attention methods.
It is worth noting that in multi-hop question-answering tasks requiring complex reasoning, NSA improved by 0.087 (HPQ) and 0.051 (2Wiki), respectively, compared to full-attention; in code understanding tasks (LCC), it improved by 0.069; and in paragraph retrieval tasks (PassR-en), it improved by 0.075.
The research team conducted an interesting experiment:
They fine-tuned the model using DeepSeek-R1's mathematical reasoning data and then tested it on the American Invitational Mathematics Examination (AIME 24).
The results showed that NSA-R achieved an accuracy of 0.121 under an 8k context setting, while the full-attention model only reached 0.046. Even under a 16k context, NSA-R maintained an accuracy of 0.146, far exceeding the full-attention model's 0.092.
These results show that NSA does not sacrifice performance for speed; it genuinely wins on both efficiency and capability.
Three More Things
Four best papers were awarded in total; the other three are:
The Peking University team's "Language Models Resist Alignment: Evidence From Data Compression"
This study investigates the "elasticity" of large language models: after alignment training (which makes a model conform to human values and reduces harmful outputs), subsequent fine-tuning can easily revert the model toward its pre-trained state, much like a stretched spring snapping back.
This suggests that existing alignment methods may change models only superficially and are not stable enough. More effective alignment techniques will be needed to ensure that models truly align with human needs, especially for open-source models, so that malicious fine-tuning cannot easily break their safety mechanisms.
The Stanford team's "Fairness through Difference Awareness: Measuring Desired Group Discrimination in LLMs"
This study introduces a new "difference awareness" perspective on fairness in large models. Simply put, in appropriate situations a model should distinguish between different groups rather than blindly treating everyone identically.
The study found that:
- models that perform well on traditional fairness tests do not necessarily score high on difference awareness;
- stronger models (e.g., with higher MMLU scores) show better contextual awareness, but their difference awareness does not necessarily improve;
- existing "debiasing" methods (such as prompting the model to "remain unbiased") can actually make the model ignore real differences and even change correct answers.
The CISPA Helmholtz Center for Information Security team and collaborators' "A Theory of Response Sampling in LLMs: Part Descriptive and Part Prescriptive"
This paper argues that the sampling mechanism by which large models generate responses resembles human decision-making, containing both a descriptive component (reflecting statistical norms) and a prescriptive component (an implicit notion of a concept's ideal state).
Experiments verified that samples generated by LLMs for both novel and existing concepts (500 concepts across 10 domains) deviate from the statistical mean toward a perceived "ideal value," and the effect is significant across 15 different models. Case studies show that this bias can lead to skewed decisions in fields such as healthcare, raising ethical concerns.
DeepSeek paper address:
https://arxiv.org/abs/2502.11089
Original article: https://www.toutiao.com/article/7533028900473979426/