Bu Yuan, from Aofei Temple
Quantum Bit | Public Account QbitAI
DeepSeek reasoning: detailed or fast? Now you can choose.
A research team from Tel Aviv University has developed a new method that can monitor and control the length of thinking paths in LLMs.
It adds a progress bar to LLM reasoning tasks, making it possible to control the depth of reasoning and adjust its speed.
The accelerated model uses nearly six times fewer tokens than the original, while still arriving at the correct answer.
LLMs implicitly track their relative position during the thinking phase and encode this information in hidden states.
The paper proposes a "Thinking Progress Vector" (TPV), which can be used to predict the model's relative position in the reasoning phase in real-time and visualize the model's reasoning dynamics through a progress bar.
By intervening on the TPV, the model's reasoning process can be accelerated or slowed down, achieving "overclocking" and "downclocking".
Overclocking can reduce unnecessary reasoning steps, allowing the model to reach conclusions faster while avoiding performance degradation caused by excessive reasoning.
The code is open-sourced on GitHub.
Method: Real-Time Monitoring and Control of Reasoning Depth
To reason effectively, the model must implicitly learn to track its progress during the thinking phase and maintain an estimate of how close it is to the final answer.
Since progress tracking depends on the input, this information cannot be stored in the static weights of the model but must be dynamically encoded in the hidden representations passed between layers.
To this end, the research team chose to extract information from the final hidden layer.
The research team focused on models that perform explicit, structured reasoning, characterized by a clearly delimited, contiguous reasoning phase bounded by <think> and </think> markers, such as DeepSeek-R1.
This allows the model's progress in the reasoning phase to be quantified by interpolating values between zero and one for each token based on its relative position.
Formally, the dataset is constructed as

$$\mathcal{D} = \left\{ \left( h^{k}_{j},\; \tfrac{j}{T_k} \right) \;\middle|\; 1 \le k \le K,\; 1 \le j \le T_k \right\},$$

where $h^{k}_{j}$ is the hidden representation of the $j$-th token in the $k$-th thought trajectory, $p^{k}_{j} = j / T_k$ is its relative position within that trajectory, $K$ is the number of sampled trajectories, and the total number of samples in $\mathcal{D}$ is $\sum_{k=1}^{K} T_k$.

On this basis, a progress-extraction function $f_\theta$ is learned, mapping hidden representations to their relative positions as a regression task:

$$\theta^{*} = \arg\min_{\theta} \sum_{(h,\, p) \in \mathcal{D}} \left( f_\theta(h) - p \right)^2.$$

Using a linear regressor $f_\theta(h) = \theta^{\top} h$ to fit the progress attribute, the resulting parameter vector $\theta$ is called the "Thinking Progress Vector" (TPV).
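To make the construction concrete, here is a minimal sketch (not the authors' released code) of how such a dataset could be assembled and the linear TPV probe fitted, assuming per-token last-layer hidden states have already been collected as tensors; the function names are illustrative.

```python
import torch

# Illustrative sketch: fit a linear "Thinking Progress Vector" (TPV) by
# regressing each thinking token's relative position onto its last-layer
# hidden state.

def build_progress_dataset(trajectories):
    """trajectories: list of tensors, each of shape (T_k, d), holding the
    last-layer hidden states of the T_k thinking tokens of one trajectory."""
    feats, targets = [], []
    for h in trajectories:                                        # h: (T_k, d)
        T_k = h.shape[0]
        p = torch.arange(1, T_k + 1, dtype=torch.float32) / T_k   # relative positions in (0, 1]
        feats.append(h)
        targets.append(p)
    return torch.cat(feats), torch.cat(targets)                   # (N, d), (N,)

def fit_tpv(H, p, epochs=200, lr=1e-2):
    """Least-squares regression p ≈ H @ theta; the fitted theta is the TPV."""
    theta = torch.zeros(H.shape[1], requires_grad=True)
    opt = torch.optim.Adam([theta], lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = ((H @ theta - p) ** 2).mean()
        loss.backward()
        opt.step()
    return theta.detach()
```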
To improve prediction accuracy, the model's autoregressive nature is exploited: exponential smoothing is applied to the prediction history to reduce noise. The TPV prediction results on the Math-500 test set are as follows:
Figure (a) shows a summary view of data points from multiple thought trajectories, while Figures (b, c) show the raw and smoothed TPV predictions on a single problem from the Math-500 test set.
It can be seen that both methods successfully predicted the relative position, with the latter producing more accurate results, which can be used to create clearer and more interpretable progress bars.
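As a sketch of this smoothing step, the following shows one way to turn noisy per-token TPV readings into a cleaner textual progress bar; the smoothing factor and bar width are assumed, illustrative values rather than settings from the paper.

```python
def smoothed_progress(raw_preds, beta=0.9):
    """Exponentially smooth per-token progress predictions (raw_preds:
    iterable of floats, e.g. theta @ h_t at each step) to reduce noise."""
    smoothed, s = [], None
    for p in raw_preds:
        s = p if s is None else beta * s + (1 - beta) * p
        smoothed.append(s)
    return smoothed

def render_bar(progress, width=30):
    """Render a smoothed progress estimate as a simple text progress bar."""
    filled = int(max(0.0, min(1.0, progress)) * width)
    return "[" + "#" * filled + "-" * (width - filled) + f"] {progress:.0%}"
```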
Inspired by this, to better exploit the temporal structure of the progress-prediction task, a trainable sequence model is used in place of exponential smoothing: it is trained on the same samples, but takes the whole sequence of hidden representations as input and predicts the corresponding sequence of relative positions, rather than making independent single-step predictions.
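The article does not spell out the architecture of this sequence model, so the sketch below uses a small GRU purely as a stand-in to illustrate the idea of sequence-to-sequence progress prediction.

```python
import torch
import torch.nn as nn

class ProgressSeqModel(nn.Module):
    """Illustrative stand-in (architecture assumed): map a sequence of hidden
    states to a sequence of progress values in (0, 1)."""
    def __init__(self, hidden_dim, rnn_dim=64):
        super().__init__()
        self.rnn = nn.GRU(hidden_dim, rnn_dim, batch_first=True)
        self.head = nn.Linear(rnn_dim, 1)

    def forward(self, h_seq):                  # h_seq: (batch, T, hidden_dim)
        out, _ = self.rnn(h_seq)               # (batch, T, rnn_dim)
        return torch.sigmoid(self.head(out)).squeeze(-1)  # (batch, T) progress values
```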
Through this method, the reasoning progress can be visualized.
A key question is whether TPVs reflect the fundamental mechanisms used by the model to track its reasoning progress, or if they are merely residual computations that are correlated with progress but not causal.
To address this, the team intervenes via the TPV: the hidden representation is shifted by $\alpha$ in the direction of the TPV, $h' = h + \alpha\,\theta$, so that under the linear probe the modified representation receives a new predicted progress value $f_\theta(h') = f_\theta(h) + \alpha\,\lVert\theta\rVert^{2}$.
By applying this intervention across all attention layers, the prediction of the next token can be steered while the representations that are cached and reused in subsequent decoding steps are left unedited.
In experiments, α is treated as a hyperparameter that determines the intensity of the intervention. Setting α = 0 results in no intervention, preserving the original computation; positive values of α lead to overclocking, while negative values slow the reasoning down (downclocking).
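As an illustration of what such an intervention could look like in practice, here is a hedged sketch using forward hooks on a Hugging Face-style decoder. The hook mechanism, layer path, and α value are assumptions for illustration, not the authors' implementation; only the newest token's hidden state is edited, so cached representations stay unchanged.

```python
import torch

def make_tpv_hook(theta: torch.Tensor, alpha: float):
    """Shift the current token's hidden state by alpha along the TPV theta.
    Under the linear probe f(h) = theta @ h, this raises the predicted
    progress by alpha * ||theta||^2."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output  # (batch, seq, d)
        hidden = hidden.clone()
        # Edit only the newest position, leaving earlier (cached) positions untouched.
        hidden[:, -1, :] += alpha * theta.to(hidden)
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden
    return hook

# Hypothetical usage with a Hugging Face-style decoder (names are assumptions):
# for layer in model.model.layers:
#     layer.register_forward_hook(make_tpv_hook(theta, alpha=5.0))
```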
Experiments have shown that overclocking accelerates the model's reasoning phase, making it shorter and more decisive:
The above figure compares two thought sequences generated by the DeepSeek-R1-Distill-Qwen-32B model - before and after the intervention.
The original sequence showed hesitation and verbosity, while the TPV-accelerated version was significantly more concise, using nearly six times fewer tokens.
Moreover, both trajectories ultimately arrived at the correct answer.
Results: Up to 6 Times Faster, Accuracy Remains Unchanged or Even Improves
The effectiveness of TPV was measured on DeepSeek-R1-Distill-Qwen-32B and DeepSeek-R1-Distill-Llama-8B, with the following results:
The experimental results reveal four significant trends:
1. The impact of α: Increasing α from 5 to 100, with or without instruction-based acceleration, increases the number of outputs that finish thinking, terminate properly, and answer correctly, showing that the TPV intervention does control the length of the thinking phase.
2. Comparing the acceleration baselines with the base model: Baselines (ii) and (iii) accelerate the base model through prompting and a temperature-based ensemble, respectively. In most cases, both methods improved all three metrics, confirming that they are strong baselines against which to evaluate TPV overclocking.
3. Comparison with baseline methods: Although the baselines performed well, the temperature-based baseline requires roughly five times more computation, and the TPV method still outperformed both, producing more correct answers and more decisive responses.
Under lower computational budgets (such as 256 or 512 tokens), the TPV method increased the number of correct answers by 80%, and these gains did not come at the cost of a higher error rate, which remained unchanged. This indicates that the TPV method shortens the reasoning process without introducing new errors, promoting more decisive thinking.
For computational budgets larger than 512 tokens, the same trend generally holds: the number of correct answers increases in most cases, with no increase in error rates.
4. Complementary contributions: Although the empirical results confirm that the TPV method is more effective than the baselines, there are still cases where it lags behind the instruction-based method (referred to as "instruction"). A prominent example is Math-500 with a 2048-token budget, where the instruction baseline correctly answered 10% more questions than the TPV method.
This observation raises the question: Are these improvements orthogonal or competitive?
The team therefore combined instruction-based prompting with the TPV intervention and compared the combination against each method on its own. The results are shown in the last two rows of the table: the hybrid method performs best in most cases, improving by an average of 66% and up to 285%; compared to the base model, it improves by an average of 223% and up to 1416%.
These findings indicate that the TPV method complements prompting strategies and can be effectively combined with other acceleration techniques.
A series of intervention experiments were conducted on the Math-500 and GSM8K datasets, by changing the intervention parameter α to overclock the model's thinking phase.
The results showed that increasing α continuously shortened the length of the thinking phase, making the reasoning process more efficient.
These findings support the idea that TPV acts as an active control signal within the model's internal computations, rather than a passive correlation.
When applying the TPV method to the DeepSeek-R1 LLaMA model on the GSM8K dataset using the prompt strategy (baseline iii), the average number of tokens decreased from about 500 to less than 350, reducing the computational load by 30%.
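A quick check of the quoted figure from the approximate token counts above:

$$\frac{500 - 350}{500} = 0.30 \approx 30\%.$$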
Additionally, all positive values of α consistently accelerated the thinking phase relative to the baseline (α=0) and improved its effectiveness.
To further evaluate the reliability of TPVs in estimating the model's position during its reasoning process, the research team also tested their performance under two additional conditions:
- (i) Different prompt strategies
- (ii) Different reasoning sequence lengths
Figures (a-d) show that TPVs remain effective under various instructions, different from those used during training.
Figure (e) shows that the test loss remains low across different thought sequence length bins, indicating robustness to changes in reasoning depth.
More details can be found in the paper.
Reference link: https://royeisen.github.io/OverclockingLLMReasoning-paper/
Code: https://github.com/royeisen/reasoning_loading_bar
Paper: https://arxiv.org/abs/2506.07240
— End —
Quantum Bit QbitAI · Toutiao Account
Follow us to get the latest updates on cutting-edge technology.
Original article: https://www.toutiao.com/article/7524589729951302153/