DeepSeek has released another new model, this time for OCR. On October 20, DeepSeek open-sourced the model on GitHub and published an accompanying paper, "DeepSeek-OCR: Contexts Optical Compression," explaining the work.

The paper notes that current large language models face significant computational challenges when processing long text. The team therefore explored a promising alternative: using the visual modality as an efficient medium for compressing textual information.

Specifically, the model compresses text into the visual modality. As the saying goes, "a picture is worth a thousand words", and an image of a page consumes far fewer tokens than the text itself. Tests show that this text-to-image approach achieves nearly 10x context compression that is close to lossless, keeping OCR accuracy above 97%.

The paper notes that in practical use, a single A100-40G GPU can generate more than 200,000 pages of training data per day for large language models or vision-language models.

Simply put, the team's idea is that since a single image can carry a large amount of text while consuming fewer tokens, text can be converted into images. This is the "optical compression" of the title: compressing textual information through the visual modality. The results suggest the method has considerable potential for research on long-context compression and memory-forgetting mechanisms in large models.
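To make the idea concrete, here is a minimal sketch of the text-to-image step, written with Pillow. It is illustrative only, not DeepSeek's rendering code; the wrapping heuristic and all sizes are invented:

```python
from PIL import Image, ImageDraw, ImageFont

def render_text_to_image(text: str, width: int = 1024, font_size: int = 16) -> Image.Image:
    """Render a passage onto a white bitmap, wrapping lines at a fixed character count."""
    font = ImageFont.load_default()  # a real TTF would give denser, cleaner glyphs
    chars_per_line = max(1, width // (font_size // 2))  # crude wrap heuristic
    lines = [text[i:i + chars_per_line] for i in range(0, len(text), chars_per_line)]
    img = Image.new("RGB", (width, font_size * (len(lines) + 1)), "white")
    draw = ImageDraw.Draw(img)
    for row, line in enumerate(lines):
        draw.text((8, row * font_size), line, fill="black", font=font)
    return img

# A vision encoder would now compress this bitmap into a short sequence of
# vision tokens, instead of tokenizing the raw characters one by one.
page = render_text_to_image("a picture is worth a thousand words " * 50)
page.save("page.png")
```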

DeepSeek-OCR consists of two core components: DeepEncoder (encoder) responsible for extracting and compressing image features, and DeepSeek3B-MoE (decoder) responsible for reconstructing text from the compressed visual tokens.
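As a rough sketch of that division of labor, the two stages can be stubbed out as follows; the function names merely echo the paper's component names, and the "compression" here is plain block averaging rather than a real neural encoder:

```python
from typing import List

def deep_encoder(pixels: List[float], tokens_out: int = 100) -> List[float]:
    """Stand-in for DeepEncoder: many pixels in, a short vision-token sequence out.
    'Compression' here is just block averaging; the real encoder is a neural net."""
    chunk = max(1, len(pixels) // tokens_out)
    return [sum(pixels[i:i + chunk]) / chunk
            for i in range(0, len(pixels), chunk)][:tokens_out]

def moe_decoder(vision_tokens: List[float]) -> str:
    """Stand-in for DeepSeek3B-MoE: reconstruct text conditioned on vision tokens."""
    return f"<text decoded from {len(vision_tokens)} vision tokens>"

page_pixels = [0.0] * (1024 * 1024)  # one rendered page's pixels
print(moe_decoder(deep_encoder(page_pixels)))
```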

The decoder uses the DeepSeek-3B-MoE architecture. Although it has only 3B total parameters, it adopts a MoE (Mixture of Experts) design, activating 6 out of 64 routed experts plus 2 shared experts per token, for approximately 570 million activated parameters. This gives the model the expressive power of a 3-billion-parameter model while maintaining the inference efficiency of a roughly 500-million-parameter one.
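The routing pattern described above, top-6 of 64 routed experts plus 2 always-on shared experts, can be sketched in a few lines of NumPy. Every dimension below is invented for illustration and not taken from the actual model:

```python
import numpy as np

# Toy illustration of the routing pattern: 64 routed experts, top-6 selected
# per token, plus 2 shared experts that process every token.
rng = np.random.default_rng(0)
d_model, n_experts, top_k, n_shared = 64, 64, 6, 2

experts = rng.normal(size=(n_experts, d_model, d_model)) * 0.02  # routed experts
shared = rng.normal(size=(n_shared, d_model, d_model)) * 0.02    # shared experts
router = rng.normal(size=(d_model, n_experts)) * 0.02            # routing weights

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Apply one MoE layer to a single token's hidden state x of shape (d_model,)."""
    logits = x @ router                           # score all 64 routed experts
    top = np.argsort(logits)[-top_k:]             # pick the top-6
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                      # softmax over selected experts
    out = sum(w * (x @ experts[i]) for w, i in zip(weights, top))
    out += sum(x @ s for s in shared)             # shared experts see every token
    return out

y = moe_layer(rng.normal(size=d_model))
print(y.shape)  # (64,): only 6 + 2 expert blocks ran for this token, which is
                # why activated parameters (~570M) sit far below the 3B total.
```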

Experimental data show that when the number of text tokens is within 10 times the number of vision tokens (i.e., a compression ratio under 10x), the model's decoding (OCR) accuracy reaches about 97%. Even at a compression ratio of 20x, OCR accuracy remains around 60%.
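Read concretely, the compression ratio is simply text tokens per vision token. The token counts below are hypothetical examples plugged into the paper's reported operating points:

```python
# Compression ratio as used in the paper: text tokens per vision token.
def compression_ratio(n_text_tokens: int, n_vision_tokens: int) -> float:
    return n_text_tokens / n_vision_tokens

print(compression_ratio(1000, 100))  # 10.0 -> ~97% reported OCR accuracy
print(compression_ratio(2000, 100))  # 20.0 -> ~60% reported OCR accuracy
```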

In the paper, the DeepSeek team also proposes an imaginative future use: optical compression as a simulation of human forgetting. Human memory fades over time, with more distant events becoming increasingly vague. Can AI do the same? The team's design progressively shrinks the rendered images of more distant context, further reducing token consumption. As an image gets smaller, its content gets blurrier, ultimately achieving a kind of "text forgetting" that resembles the curve of human memory decay.
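A minimal sketch of that mechanism, again with made-up numbers: older context pages are stored at progressively smaller resolutions, so each step of "aging" costs fewer vision tokens and preserves less detail. The decay factor and the assumption that token cost scales with pixel area are both illustrative:

```python
from PIL import Image

# Toy sketch of optical "forgetting": the further back a context page sits,
# the smaller it is stored, trading fidelity for fewer vision tokens.

def age_page(page: Image.Image, age: int, decay: float = 0.7) -> Image.Image:
    """Downscale a rendered context page according to how old it is."""
    scale = decay ** age
    return page.resize((max(1, int(page.width * scale)),
                        max(1, int(page.height * scale))))

page = Image.new("RGB", (1024, 1024), "white")  # one freshly rendered page
for age in range(5):
    small = age_page(page, age)
    rel_cost = (small.width * small.height) / (page.width * page.height)
    print(f"age {age}: {small.size}, ~{rel_cost:.0%} of the original token cost")
```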

The paper notes that this is still an early research direction requiring further investigation. Even so, it offers a plausible mechanism for managing theoretically unbounded context. If realized, it would be a major breakthrough in handling ultra-long contexts. In that sense, DeepSeek-OCR is not just an OCR model; viewed from another angle, it stakes out a promising new research direction.

Some netizens see this as a wise move: humans read text visually and grasp the spatiotemporal structure of the physical world, so unifying language and vision could be a step toward superintelligence.

The model received over 1,400 stars on GitHub shortly after its release. According to the paper's author list, the project was completed by three DeepSeek researchers: Haoran Wei, Yaofeng Sun, and Yukun Li. Industry reports indicate that first author Haoran Wei previously worked at StepFun, where he led development of GOT-OCR2.0, a system aimed at "OCR 2.0". It is natural, then, for him to lead the DeepSeek-OCR project.

Meanwhile, DeepSeek has been slow to release anticipated models such as R2, leading some in the market to believe it is falling behind. Others argue the company is consolidating its internal capabilities in preparation for its next generation of models.

(This article is from First Financial News)

Original: https://www.toutiao.com/article/7563293614374224430/

Statement: This article represents the views of its author.