Machine Intelligence Report
Machine Intelligence Editorial Department
At this year's ACL conference, Chinese teams achieved remarkable success.
ACL is the top international conference in computational linguistics and natural language processing, organized by the Association for Computational Linguistics (ACL) and held annually. It has long ranked first in academic influence in the NLP field and is a CCF-A recommended conference. This year's conference, the 63rd, was held in Vienna, Austria from July 27 to August 1, 2025.
This year's total number of submissions reached a record high, exceeding 8,000 papers (up from 4,407 last year); the acceptance rates for main conference papers and Findings were 20.3% and 16.7%, respectively.
According to official statistics, more than half of the first authors of all papers (51.3%) are from China, up from 30.6% last year. The United States ranks second, accounting for only 14.0%.
A total of 4 best papers, 2 best social impact papers, 3 best resource papers, 3 best topic papers, 26 outstanding papers, 2 TACL best papers, 1 best Demo paper, and 47 SAC Highlights were selected this year.
Here are the specific award details.
Best Paper Award
Of the four best papers this year, two went to the DeepSeek team (with Liang Wenfeng as a co-author) and to Yang Yaodong's team at Peking University; the other two went to a CISPA Helmholtz Center for Information Security & TCS Research & Microsoft team and a Stanford University & Cornell Tech team.
Paper 1: A Theory of Response Sampling in LLMs: Part Descriptive and Part Prescriptive
- Authors: Sarath Sivaprasad, Pramod Kaushik, Sahar Abdelnabi, Mario Fritz
- Institutions: CISPA Helmholtz Center for Information Security, TCS Research, Microsoft
- Paper Address: https://arxiv.org/abs/2502.01926
Abstract: The application of large language models (LLMs) in autonomous decision-making is increasing, and they sample options from a wide action space. However, the heuristics guiding this sampling process remain underexplored. This team studied this sampling behavior and showed that its underlying heuristics are similar to human decision-making heuristics: composed of a descriptive component (reflecting statistical norms) and a prescriptive component (the implicit ideal values encoded in LLMs).
The team showed that this deviation of samples from statistical norms toward the prescriptive component appears consistently across concepts in various real-world domains such as public health and economic trends. To further substantiate the theory, the team demonstrated that concept prototypes in LLMs are influenced by prescriptive norms, similar to humans' notion of a "normal" concept.
Through case studies and comparisons with human studies, the team showed that the bias of LLM outputs toward ideal values in real-world applications can lead to significant decision-making biases, raising ethical concerns.
Paper 2: Fairness through Difference Awareness: Measuring Desired Group Discrimination in LLMs
- Authors: Angelina Wang, Michelle Phan, Daniel E. Ho, Sanmi Koyejo
- Institutions: Stanford University, Cornell Tech
- Paper Address: https://arxiv.org/abs/2502.01926
Abstract: Traditional algorithmic fairness has adopted a race-blind (i.e., non-discriminatory) perspective, which is mathematically convenient. However, the team argues that in a range of important contexts, awareness of group differences is essential; for example, differentiating between groups may be necessary in legal contexts and risk assessments. Therefore, unlike most fairness research, this work studies fairness from the perspective of differential treatment in appropriate contexts.
The team first introduced the important distinction between descriptive (fact-based), prescriptive (value-based), and relevance (association-based) benchmarks. This distinction is crucial because each category requires separate interpretation and mitigation based on its specific characteristics.
Then, they proposed a benchmark suite consisting of eight different scenarios, totaling 16,000 questions, enabling evaluation of difference awareness.
Finally, the study demonstrated the results of ten models, which showed that difference awareness is a unique dimension of fairness, and existing bias mitigation strategies might backfire.
Paper 3: Language Models Resist Alignment: Evidence From Data Compression
- Paper Address: https://aclanthology.org/2025.acl-long.1141.pdf
- Project Address: https://pku-lm-resist-alignment.github.io
This paper is the first to systematically show that large models are not blank slates that can be arbitrarily shaped: their parameter structure contains an elastic mechanism. This mechanism originates in the pre-training stage and exerts a structural inertia that pulls the model distribution back, so that after fine-tuning the model may bounce back toward its pre-trained state, resisting the new instructions given by humans and thus resisting alignment. This means the difficulty of alignment far exceeds expectations: the resources and compute required for post-training may not decrease, and may even need to be comparable to, or greater than, those of the pre-training stage.
The paper points out that the larger the model and the more thorough the pre-training, the stronger this elasticity and the higher the risk of rebound during alignment. In other words, alignment methods that currently appear effective may be superficial and shallow; robust alignment that reaches the model's internal mechanisms remains a long way off. This discovery poses a serious challenge to AI safety and alignment: a model may not merely fail to learn, it may even appear to have learned without actually doing so, meaning that the pre-training and post-training alignment pipelines of current LLMs, VLMs, and VLAs face new challenges.
The reviewers and chairs of ACL 2025 highly praised this research. They agreed that the concept of "elasticity" proposed in the paper uncovers the resistance and rebound mechanisms of large language models during alignment, providing a new theoretical perspective and a solid foundation for the long-standing problem of alignment fragility. The area chair further noted that the paper builds a bridge between compression theory, model scalability, and safe alignment, offering not only solid empirical evidence and deep theory but also far-reaching implications for governance and safety.
The paper's (independent) corresponding author is Dr. Yang Yaodong, currently a researcher at the Institute for Artificial Intelligence at Peking University, a Zhiyuan scholar (head of large model safety), and chief scientist of the Peking University-Lingchu Intelligent Joint Laboratory.
The co-first authors of the paper are all members of Yang Yaodong's research group: Ji Jiaming, Wang Kailue, Qiu Tianyi, Chen Boyuan, and Zhou Jiayi. Collaborators include Dr. Dai Juntao, a researcher at the Zhiyuan Institute for Safety, and Professor Liu Yunhuai of the School of Computer Science at Peking University.
Paper 4: Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
- Authors: Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Y. X. Wei, Lean Wang, Zhiping Xiao, Yuqing Wang, Chong Ruan, Ming Zhang, Wenfeng Liang, Wangding Zeng
- Institutions: DeepSeek, Peking University, University of Washington
- Paper Address: https://arxiv.org/pdf/2502.11089
Abstract: This paper, co-authored by Liang Wenfeng, founder of High-Flyer (Huanfang) and DeepSeek, proposes a new attention mechanism called NSA: a natively trainable sparse attention mechanism for ultra-fast long-context training and inference, with hardware-aligned features.
Long context modeling is a key capability of the next generation of large language models (LLMs), driven by diverse practical applications, including deep reasoning, warehouse-level code generation, and multi-turn automatic agent systems.
The natural way to achieve efficient long context modeling is to utilize the inherent sparsity of softmax attention by selectively computing key-query pairs, which can significantly reduce computational overhead while maintaining performance. Recent progress in this direction includes various strategies: KV cache eviction methods, block-wise KV cache selection methods, and selection methods based on sampling, clustering, or hashing. Although these strategies show promise, existing sparse attention methods often perform poorly in actual deployment. Many methods fail to achieve the acceleration gains promised by their theoretical benefits; additionally, most methods mainly focus on the inference phase, lacking effective training-time support to fully leverage the sparse attention patterns.
To overcome these limitations, deploying effective sparse attention must address two key challenges: hardware-aligned inference acceleration and training-aware algorithm design. These requirements are crucial for achieving fast long context inference or training in practical applications. When considering both aspects, existing methods still fall short.
Therefore, to achieve more effective and efficient sparse attention, DeepSeek proposes a native trainable sparse attention architecture, NSA, which integrates hierarchical token modeling.
As shown in the figure below, NSA reduces per-query computation by organizing keys and values into temporal blocks and processing them through three attention paths: compressed coarse-grained tokens, selectively retained fine-grained tokens, and sliding windows for local context information. Subsequently, the authors implemented specialized kernels to maximize their actual efficiency.
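The three-branch structure described above can be illustrated with a toy NumPy sketch. This is not DeepSeek's implementation: the block size, top-k count, window length, mean-pooling compression, and uniform gating are all illustrative assumptions (NSA uses learned compression and learned gates, plus custom kernels).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(q, K, V):
    # standard scaled dot-product attention for a single query vector
    w = softmax(K @ q / np.sqrt(q.shape[-1]))
    return w @ V

def nsa_like_attention(q, K, V, block=4, top_k=2, window=8):
    """Toy three-branch sparse attention in the spirit of NSA."""
    n, d = K.shape
    n_blocks = n // block
    Kb = K[: n_blocks * block].reshape(n_blocks, block, d)
    Vb = V[: n_blocks * block].reshape(n_blocks, block, d)

    # Branch 1: compression -- mean-pool each block into one coarse token
    Kc, Vc = Kb.mean(axis=1), Vb.mean(axis=1)
    out_cmp = attend(q, Kc, Vc)

    # Branch 2: selection -- keep fine tokens of the top-k blocks,
    # ranked by the coarse tokens' scores against the query
    sel = np.argsort(Kc @ q)[-top_k:]
    out_sel = attend(q, Kb[sel].reshape(-1, d), Vb[sel].reshape(-1, d))

    # Branch 3: sliding window -- the most recent `window` tokens
    out_win = attend(q, K[-window:], V[-window:])

    # Gated combination (uniform gates here; NSA learns these per branch)
    return (out_cmp + out_sel + out_win) / 3.0

rng = np.random.default_rng(0)
n, d = 32, 16
q = rng.normal(size=d)
K, V = rng.normal(size=(n, d)), rng.normal(size=(n, d))
out = nsa_like_attention(q, K, V)
print(out.shape)
```

The point of the sketch is the cost structure: the query only attends to `n_blocks + top_k * block + window` keys instead of all `n`, which is where the speedup comes from as sequence length grows.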
The study evaluates NSA through comprehensive experiments on real-world language corpora. The authors pre-trained a 27B-parameter Transformer backbone on 260B tokens and evaluated NSA's performance on general language, long-context, and chain-of-thought reasoning benchmarks. They also compared kernel speed against an optimized Triton implementation on A100 GPUs. The experimental results show that NSA matches or exceeds the Full Attention baseline while outperforming existing sparse attention methods.
Additionally, compared to Full Attention, NSA provides significant acceleration in decoding, forward, and backward phases, with the acceleration ratio increasing as sequence length increases. These results validate that the hierarchical sparse attention design effectively balances model capability and computational efficiency.
Outstanding Paper Award
ACL 2025 selected 26 outstanding papers, filling six slides:
1. A New Formulation of Zipf's Meaning-Frequency Law through Contextual Diversity
2. All That Glitters is Not Novel: Plagiarism in AI-Generated Research
3. Between Circuits and Chomsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases
4. Beyond N-Grams: Rethinking Evaluation Metrics and Strategies for Multilingual Abstractive Summarization
5. Bridging the Language Gaps in Large Language Models with Inference-Time Cross-Lingual Intervention
6. Byte Latent Transformer: Patches Scale Better Than Tokens
7. Capability Salience Vector: Fine-grained Alignment of Loss and Capabilities for Downstream Task Scaling Law
8. From Real to Synthetic: Synthesizing Millions of Diversified and Complicated User Instructions with Attributed Grounding
9. HALoGEN: Fantastic LLM Hallucinations and Where to Find Them
10. HateDay: Insights from a Global Hate Speech Dataset Representative of a Day on Twitter
11. IoT: Embedding Standardization Method Towards Zero Modality Gap
12. IndicSynth: A Large-Scale Multilingual Synthetic Speech Dataset for Low-Resource Indian Languages
13. LaTIM: Measuring Latent Token-to-Token Interactions in Mamba Models
14. Llama See, Llama Do: A Mechanistic Perspective on Contextual Entrainment and Distraction in LLMs
15. LLMs know their vulnerabilities: Uncover Safety Gaps through Natural Distribution Shifts
16. Mapping 1,000+ Language Models via the Log-Likelihood Vector
17. MiniLongBench: The Low-cost Long Context Understanding Benchmark for Large Language Models
18. PARME: Parallel Corpora for Low-Resourced Middle Eastern Languages
19. Past Meets Present: Creating Historical Analogy with Large Language Models
20. Pre3: Enabling Deterministic Pushdown Automata for Faster Structured LLM Generation
21. Rethinking the Role of Prompting Strategies in LLM Test-Time Scaling: A Perspective of Probability Theory
22. Revisiting Compositional Generalization Capability of Large Language Models Considering Instruction Following Ability
23. Toward Automatic Discovery of a Canine Phonetic Alphabet
24. Towards the Law of Capacity Gap in Distilling Language Models
25. Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling
26. Typology-Guided Adaptation for African NLP
Best Demo Paper Award
Winning Paper: OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens
- Authors: Jiacheng Liu et al.
- Institutions: Allen Institute for AI, etc.
- Link: https://arxiv.org/pdf/2504.07096
- Introduction: The paper proposes OLMoTrace, the first system capable of tracing a language model's outputs back, in real time, to its full multi-trillion-token training data.
Best Topic Paper Award
Paper 1: MaCP: Minimal yet Mighty Adaptation via Hierarchical Cosine Projection.
- Authors: Yixian Shen, Qi Bi, Jia-Hong Huang, Hongyi Zhu, Andy D. Pimentel, Anuj Pathania
- Institutions: University of Amsterdam
- Link: https://arxiv.org/pdf/2505.23870
Introduction: This paper proposes a new adaptation method, MaCP (Minimal yet Mighty Adaptive Cosine Projection), which achieves excellent performance when fine-tuning large foundation models while using only a minimal number of parameters and minimal memory.
Paper 2: Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models
- Authors: Xinlin Zhuang, Jiahui Peng, Ren Ma, etc.
- Institutions: Shanghai Artificial Intelligence Lab, East China Normal University
- Link: https://arxiv.org/pdf/2504.14194
Introduction: The paper proposes using four dimensions to measure data quality: expertise, readability, reasoning depth, and tidiness, and further proposes Meta-rater: a multi-dimensional data selection method that integrates these dimensions with existing quality metrics through learned optimal weights.
Paper 3: SubLIME: Subset Selection via Rank Correlation Prediction for Data-Efficient LLM Evaluation
- Authors: Gayathri Saranathan, Cong Xu, etc.
- Institutions: HP Labs, etc.
- Link: https://aclanthology.org/2025.acl-long.1477.pdf
Introduction: The rapid expansion of large language models and natural language processing datasets has made exhaustive benchmark testing computationally infeasible. Inspired by high-stakes competitions like the International Mathematical Olympiad, where a few carefully designed questions can distinguish top performers, the paper proposes SubLIME, which can reduce evaluation costs by 80% to 99% while preserving ranking fidelity.
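The core idea of ranking-preserving subset selection can be conveyed with a toy sketch. This is not SubLIME's actual method (which predicts rank correlation rather than searching with full-benchmark scores in hand); the random search, subset size, and synthetic score matrix below are illustrative assumptions only.

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation (no tie handling; fine for float scores)."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb)))

rng = np.random.default_rng(1)
n_models, n_items = 10, 500
acc = rng.random((n_models, n_items))   # synthetic per-item scores per model
full_scores = acc.mean(axis=1)          # full-benchmark leaderboard

# Random search for a 5% subset whose induced model ranking best
# matches the full-benchmark ranking.
best_corr, best_subset = -1.0, None
for _ in range(200):
    subset = rng.choice(n_items, size=n_items // 20, replace=False)
    corr = spearman(acc[:, subset].mean(axis=1), full_scores)
    if corr > best_corr:
        best_corr, best_subset = corr, subset

print(f"best {len(best_subset)}-item subset, rank correlation {best_corr:.3f}")
```

Evaluating only the chosen subset then costs 5% of the full benchmark while (ideally) reproducing the same model ordering, which is the 80-99% cost reduction the paper targets.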
TACL Best Paper Award
ACL 2025 awarded two TACL best papers, as follows:
Paper 1: Weakly Supervised Learning of Semantic Parsers for Mapping Instructions to Actions.
- Authors: Yoav Artzi, Luke Zettlemoyer
- Institutions: University of Washington
- Paper Link: https://www.semanticscholar.org/paper/Weakly-Supervised-Learning-of-Semantic-Parsers-for-Artzi-Zettlemoyer/cde902f11b0870c695428d865a35eb819b1d24b7
Introduction: The context in which language is situated provides strong signals for learning its meaning. This paper demonstrates how to exploit this in a grounded CCG semantic parsing approach, which learns a joint meaning and context model for interpreting and executing natural language instructions and is applicable to various forms of weak supervision.
Paper 2: Reading Subtext: Evaluating Large Language Models on Short Story Summarization with Writers.
- Authors: Melanie Subbiah, Sean Zhang, Lydia B. Chilton, Kathleen McKeown.
- Institutions: Columbia University
- Paper Link: https://arxiv.org/pdf/2403.01061
Introduction: This paper evaluates the performance of current mainstream large language models (LLMs) on the challenging task of summarizing short stories, which involves longer texts that often contain subtle subtext or disrupted timelines. The paper conducted quantitative and qualitative analyses comparing GPT-4, Claude-2.1, and LLaMA-2-70B. The study found that these models made factual errors in over 50% of summaries and struggled to understand detailed content and complex subtext.
Test of Time Award
This year, ACL announced two Test of Time Awards: a 25-Year ToT Award (for 2000) and a 10-Year ToT Award (for 2015).
25-Year Test of Time Award (from ACL 2000): Automatic Labeling of Semantic Roles
- Authors: Daniel Gildea, Daniel Jurafsky
- Institutions: University of California, Berkeley; University of Colorado
- Address: https://aclanthology.org/P00-1065.pdf
This paper proposed a system for identifying the semantic roles, or semantic relations, that sentence constituents fill within a semantic frame. The system extracts various lexical and syntactic features from syntactic parse trees and trains statistical classifiers on manually annotated data. ACL officially described it as a foundational paper that laid the groundwork for semantic role labeling and subsequent research. The paper has been cited 2,650 times to date.
Of the paper's two authors, Daniel Gildea is now a professor in the Department of Computer Science at the University of Rochester, and Daniel Jurafsky is a professor of linguistics and computer science at Stanford University and a leading figure in natural language processing. With James H. Martin, Jurafsky co-authored the textbook "Speech and Language Processing," which has been translated into over 60 languages and is one of the classic NLP textbooks worldwide.
10-Year Test of Time Award (from EMNLP 2015): Effective Approaches to Attention-based Neural Machine Translation
- Authors: Minh-Thang Luong, Hieu Pham, Christopher D. Manning
- Institutions: Department of Computer Science, Stanford University
- Address: https://aclanthology.org/D15-1166/
This paper came from Christopher D. Manning's renowned group. ACL officially called it a milestone work on neural machine translation and attention mechanisms.
At the time, attention mechanisms were already being used to improve neural machine translation by selectively focusing on parts of the source sentence during translation. However, effective architectures for attention-based neural machine translation had seen little exploration. This paper studied two simple and effective attention mechanisms: a global approach, which always attends to all source words, and a local approach, which attends to a subset of source words at each step. The paper validated both methods on WMT translation tasks between English and German in both directions. With local attention, the authors achieved a significant gain of 5.0 BLEU points over non-attentional systems that already incorporated dropout and other known techniques. An ensemble of models with different attention architectures set a new state of the art on the WMT'15 English-to-German task, reaching 25.9 BLEU and surpassing the previous best system, backed by NMT and an n-gram reranker, by 1.0 BLEU.
The paper proposed global and local attention, simplifying Bahdanau et al.'s more complex architecture and introducing the "dot-product attention" scoring function, laying the groundwork for the Q/K/V dot-product similarity computation used today.
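The global/local distinction and the dot score can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's full model: the local variant below uses a fixed window center (the paper's monotonic local-m; local-p instead predicts the center), and the downstream attentional hidden state is omitted.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def global_attention(h_t, H_s):
    """Luong 'dot' score: align the decoder state against ALL source states."""
    scores = H_s @ h_t          # dot-product score per source position
    a = softmax(scores)         # alignment weights
    return a @ H_s              # context vector: weighted sum of sources

def local_attention(h_t, H_s, p_t, D=2):
    """Attend only to a window [p_t - D, p_t + D] of source positions."""
    lo, hi = max(0, p_t - D), min(len(H_s), p_t + D + 1)
    window = H_s[lo:hi]
    a = softmax(window @ h_t)
    return a @ window

rng = np.random.default_rng(0)
S, d = 10, 8                            # source length, hidden size
H_s = rng.normal(size=(S, d))           # encoder hidden states
h_t = rng.normal(size=d)                # current decoder hidden state

c_global = global_attention(h_t, H_s)
c_local = local_attention(h_t, H_s, p_t=5, D=2)
print(c_global.shape, c_local.shape)    # both are d-dimensional contexts
```

The `H_s @ h_t` line is the dot score; replacing it with learned projections of queries, keys, and values gives the Q/K/V scaled dot-product attention that later became standard.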
The paper has now been cited over 10,000 times. The first author, Minh-Thang Luong, earned his Ph.D. at Stanford University under Professor Christopher Manning and is now a research scientist at Google.
The second author, Hieu Pham, is currently employed at xAI; previously, he worked at AugmentCode and Google Brain.
As for Professor Manning, he needs no further introduction. This academic giant has over 290,000 citations and has made many pioneering and foundational contributions to the fields of NLP and AI. He has also made great contributions to education and talent cultivation.
Incidentally, Professor Manning's paper "GloVe: Global Vectors for Word Representation" received the ACL 2024 10-Year Test of Time Award, and "Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank" received the ACL 2023 10-Year Test of Time Award. This is therefore the third consecutive year that Professor Manning has received an ACL 10-Year Test of Time Award.
Lifetime Achievement Award
This year's ACL Lifetime Achievement Award was awarded to Professor Kathy McKeown.
ACL's official tweet read: "For 43 years, she has conducted outstanding, creative, and productive research in the field of natural language processing, covering areas such as natural language generation, summarization, and social media analysis." Professor McKeown not only laid the foundation for NLP but also inspired a generation of researchers through her vision, leadership, and guidance.
Currently, McKeown is the Henry and Gertrude Rothschild Professor of Computer Science at Columbia University. She is also the founding director of the Columbia University Institute for Data Science and served as the director of the institute from July 2012 to June 2017.
From 1998 to 2003, she served as the department chair of the School of Engineering and Applied Science, and then as the associate dean of research for two years.
McKeown received her Ph.D. in computer science from the University of Pennsylvania in 1982 and has been teaching at Columbia University since then. Her research interests include text summarization, natural language generation, multimedia explanation, question answering, and multilingual applications.
According to Google Scholar statistics, Professor McKeown's current total citation count exceeds 33,000.
Distinguished Service Award
ACL 2025 also awarded a Distinguished Service Award, aimed at recognizing individuals who have made outstanding and sustained contributions to the field of computational linguistics.
This year's winner is Professor Julia B. Hirschberg from the Department of Computer Science at Columbia University.
ACL official wrote: "For 35 years, she has dedicated herself to serving ACL and its related journal 'Computational Linguistics' (including serving as editor-in-chief of 'Computational Linguistics' and serving on the ACL Executive Committee from 1993 to 2003), and has also made outstanding contributions to the fields of natural language processing and speech processing."
What do you think about the DeepSeek NSA paper winning an award? Feel free to comment and exchange views.
Original: https://www.toutiao.com/article/7533068752842441256/