On April 10th, SenseTime held its 2025 Technology Exchange Day and officially launched the newly upgraded "SenseNova V6" large model family. Through breakthroughs in multi-modal long-chain-of-thought training, global memory, and reinforcement learning, it has achieved leading multi-modal reasoning capabilities and broken through the industry's cost boundaries.
The "SenseNova V6" model's capabilities have significantly improved, with notable advantages in long chains of thought, reasoning, mathematical analysis, and global memory. Its multi-modal reasoning capability ranks first domestically and aligns with OpenAI o1; its data analysis capabilities far surpass GPT-4o. Meanwhile, it achieves a perfect combination of high efficiency and low cost: the overall efficiency of multi-modal training is aligned with language training to achieve the lowest in the industry, and inference costs are also the lowest in the industry. The new lightweight full-modal interaction model SenseNova V6 Omni brings the strongest multi-modal interaction capability domestically; it includes the domestic first large model supporting deep analysis of medium and long videos up to 10 minutes, matching Gemini 2.5 Turbo for the same type of strength.

Xu Li, Chairman and CEO of SenseTime, stated: "The way of AI lies in daily use by the people. SenseNova V6 will cross multiple modalities and release infinite possibilities of reasoning and intelligence."

With multi-modal long chains of thought, reinforcement learning, and global memory, SenseNova V6 has pioneered deep thinking in multi-modal environments.
As an MoE-native multi-modal general large model with over 600 billion parameters, "SenseNova V6" has achieved several technical breakthroughs, handling tasks from pure text to multi-modal processing with a single model (a minimal routing sketch follows the list below):
Long Chain of Thought: over 200B tokens of high-quality multi-modal long-chain-of-thought data, with single chains reaching 64K tokens;
Mathematical Ability: data analysis capabilities that significantly surpass GPT-4o;
Reasoning Ability: multi-modal deep reasoning that leads domestically and is on par with OpenAI o1;
Global Memory: the first domestic breakthrough in long video understanding, supporting understanding and deep reasoning over 10-minute videos.
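As a rough illustration of what "MoE-native" means, the sketch below shows top-k token routing through a mixture-of-experts layer. All dimensions, expert counts, and names are illustrative assumptions and do not reflect SenseNova V6's actual architecture.

```python
# Minimal top-k mixture-of-experts routing sketch (PyTorch).
# Sizes and names are assumptions, not SenseNova V6's real design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model=1024, n_experts=16, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)      # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts))

    def forward(self, x):                                # x: (tokens, d_model)
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)             # normalize over chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                    # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * expert(x[mask])
        return out
```

Only the selected experts run for each token, which is how an MoE model keeps per-token compute far below its total parameter count.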

In authoritative evaluations of reasoning capabilities and multi-modal capabilities, "SenseNova V6" achieved SOTA in multiple dimensions:

[Core Indicators] Excellent overall performance on pure-text tasks, on par with top international models; leading multi-modal performance across the board; both pure-text reasoning and multi-modal reasoning align with top-tier international models such as GPT-4.5 and Gemini 2.0 Pro.

[Strong Reasoning Ability] Upgraded from 5.5 to V6/V6 Reasoner, the fused model delivers significantly stronger reasoning. In independent evaluations, it outperforms OpenAI's o1 and Gemini 2.0 Flash Thinking on multi-modal and deep language reasoning tasks.
Building on more than 200B tokens of high-quality multi-modal long-chain-of-thought data, SenseTime uses multi-agent collaboration to synthesize and verify long chains of thought, giving "SenseNova V6" its prominent multi-modal reasoning capabilities: it supports synthesis of multi-modal chains of thought up to 64K tokens long while sustaining the model's long-horizon thinking (a sketch of such a generator/verifier loop follows).
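The article does not disclose how the multi-agent pipeline works, so the following is only a minimal sketch of a generator/verifier loop of the kind described; the `ask` callable, role names, and prompts are hypothetical placeholders for any chat-model client.

```python
# Hypothetical generator/verifier loop for long-chain-of-thought synthesis.
from typing import Callable, Optional

def synthesize_cot(question: str,
                   ask: Callable[[str, str], str],
                   max_rounds: int = 3) -> Optional[str]:
    """Keep a chain of thought only if a second agent verifies it."""
    cot = ask("generator", f"Solve step by step:\n{question}")
    for _ in range(max_rounds):
        verdict = ask("verifier",
                      f"Question: {question}\nReasoning:\n{cot}\n"
                      "Reply PASS, or name the first incorrect step.")
        if verdict.strip().startswith("PASS"):
            return cot                      # accepted as training data
        cot = ask("generator",              # revise using the critic's feedback
                  f"Question: {question}\nYour earlier reasoning was rejected: "
                  f"{verdict}\nProduce a corrected step-by-step solution.")
    return None                             # discard samples that never verify
```

Samples that fail verification are dropped rather than patched, which keeps a synthesized corpus clean at the cost of extra generation.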
For complex problems in real environments, the powerful hybrid image-text understanding and reasoning capabilities of "SenseNova V6" can help users solve a wide range of issues.
In complex, cumbersome document processing scenarios, "SenseNova V6" can likewise solve user problems with its strong multi-modal reasoning; you can experience this in SenseTime's Office Xiaohuanxiong ("Raccoon") assistant.
For example, in the insurance claims scenario, "SenseNova V6" can determine whether the materials submitted for a commercial medical insurance claim meet the requirements, checking for issues such as over-prescription, unnecessary tests, missing documents, or mismatched materials.

Although small claims involve little money, they often take a long time to process (3-7 days). When handled by "SenseNova V6," the process automatically flags risks, cross-validates its own findings, and finally gives users a detailed, multi-dimensional conclusion, closing the last mile from model to customer (a toy cross-validation checklist follows).
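As a toy illustration of that cross-validation step, the checklist below runs simple rules over fields a multi-modal model might have extracted from the claim documents. Every field name, rule, and threshold is invented for this sketch and is not SenseTime's actual pipeline.

```python
# Hypothetical claim-screening checklist over model-extracted fields.
from dataclasses import dataclass

@dataclass
class ClaimCheck:
    name: str
    passed: bool
    detail: str = ""

def screen_claim(invoice: dict, diagnosis: dict, policy: dict) -> list:
    checks = []
    # 1. Completeness: all required invoice fields are present.
    missing = [k for k in ("patient", "amount", "items") if k not in invoice]
    checks.append(ClaimCheck("documents complete", not missing,
                             f"invoice missing: {missing}" if missing else ""))
    # 2. Alignment: the invoice and the diagnosis refer to the same patient.
    same = invoice.get("patient") == diagnosis.get("patient")
    checks.append(ClaimCheck("materials aligned", same,
                             "" if same else "patient name mismatch"))
    # 3. Necessity: flag billed items absent from the prescribed plan.
    extra = set(invoice.get("items", [])) - set(diagnosis.get("prescribed", []))
    checks.append(ClaimCheck("no unnecessary items", not extra,
                             f"unprescribed: {sorted(extra)}" if extra else ""))
    # 4. Coverage: the claimed amount stays within the policy limit.
    checks.append(ClaimCheck("within policy limit",
                             invoice.get("amount", 0) <= policy.get("limit", 0)))
    return checks
```

A report assembled from such checks is what lets the model hand users a multi-dimensional conclusion rather than a bare accept/reject.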
Thanks to breakthroughs in multi-modal reinforcement learning, SenseTime has built a hybrid reinforcement learning framework for diverse image-text tasks, trained on difficulty-graded data with multiple reward models (a weighting sketch follows).
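How the multiple reward models are combined is not disclosed; the sketch below shows one plausible difficulty-aware blend of several reward heads. The three heads and the weighting rule are assumptions, not SenseTime's actual framework.

```python
# Hypothetical difficulty-aware blend of multiple reward models.
from typing import Callable, Dict

def combined_reward(sample: dict,
                    reward_models: Dict[str, Callable[[dict], float]],
                    difficulty: float) -> float:
    r_correct = reward_models["correctness"](sample)  # was the task solved?
    r_format = reward_models["format"](sample)        # is the reasoning well-formed?
    r_ground = reward_models["grounding"](sample)     # does the text match the image?
    w = min(max(difficulty, 0.0), 1.0)                # clamp difficulty to [0, 1]
    # Harder samples weight correctness more; easier ones weight form and grounding.
    return w * r_correct + (1.0 - w) * 0.5 * (r_format + r_ground)
```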
A domestic first: video understanding breaks the 10-minute barrier, with second-by-second reasoning over ultra-long content.
With its "global memory" capability, "SenseNova V6" breaks the traditional limitation of models that only support short videos, enabling full-frame-rate parsing of 10-minute videos.
Based on its powerful understanding capabilities, "SenseNova V6" can also intelligently edit and output the highlights of a video, helping users preserve precious moments.

For example, given a recording of the game "Black Myth," players can feed the recording or live footage into "SenseNova V6," which understands the highlights and memorable moments, automatically cuts out the best clips, and generates custom commentary scripts for sharing gameplay experiences and in-game highlights.
SenseTime's self-developed technology aligns visual information (images), auditory information (voice, sound effects), and linguistic information (subtitles, speech) along a shared timeline, forming a unified multi-modal temporal representation. On this basis, fine-grained cascaded information compression and content-sensitive dynamic filtering compress long videos heavily, reducing a 10-minute video to 16K tokens while retaining the key semantics (a simplified filtering sketch follows).
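As a concrete, if much simplified, picture of content-sensitive dynamic filtering, the sketch below scores each frame by how much it differs from its predecessor and keeps the highest-scoring frames within a fixed token budget. The per-frame token cost and the scoring rule are assumptions for illustration.

```python
# Simplified content-sensitive frame filtering under a token budget.
import numpy as np

TOKENS_PER_FRAME = 16      # assumed per-frame cost after visual encoding
TOKEN_BUDGET = 16_000      # the article's stated budget for a 10-minute clip

def select_frames(frames: list) -> list:
    """Keep the frames that change the most, preserving temporal order."""
    scores = [float("inf")]                            # always keep frame 0
    for prev, cur in zip(frames, frames[1:]):
        diff = np.abs(cur.astype(np.int16) - prev.astype(np.int16)).mean()
        scores.append(float(diff))                     # bigger change = keep
    max_frames = TOKEN_BUDGET // TOKENS_PER_FRAME      # = 1000 frames here
    ranked = sorted(range(len(frames)), key=scores.__getitem__, reverse=True)
    return sorted(ranked[:max_frames])                 # restore temporal order
```

Static stretches collapse to a few frames while busy scenes keep more, which is the intuition behind squeezing ten minutes of video into a 16K-token context.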
Truly interacting like a human: "SenseNova V6 Omni" is released and already deployed across multiple industries.
With the release of SenseNova V6, SenseTime's real-time interactive fusion large model has been upgraded to "SenseNova V6 Omni," with deep optimization for scenarios such as role-playing, point-and-read translation, cultural tourism guidance, children's story narration, and math tutoring.

For instance, in the point-and-read translation scenario, "SenseNova V6 Omni" lets users interact precisely with a fingertip, accurately relating local detail to the global context of the page for a more natural and intuitive point-and-read experience (a minimal sketch follows).
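One simple way to give a model both local and global context, sketched below under assumed inputs, is to send the full page together with a crop around the fingertip; the function, prompt, and crop size are hypothetical, not SenseNova V6 Omni's actual mechanism.

```python
# Hypothetical point-and-read query: pair a fingertip crop with the full page.
from PIL import Image

def pointing_query(page: Image.Image, x: int, y: int, radius: int = 120):
    """Return a multi-modal prompt with global and local views of the page."""
    box = (max(x - radius, 0), max(y - radius, 0),
           min(x + radius, page.width), min(y + radius, page.height))
    local = page.crop(box)              # the detail the finger points at
    # Both views go to the model so it can relate the pointed text to layout.
    return [page, local, "Translate the text the user is pointing at."]
```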
"SenseNova V6 Omni" possesses more human-like perception and expression abilities, emotional understanding capabilities, and has been implemented in multiple industries and scenarios in the field of embodied intelligence, becoming the first commercially available full-modal real-time interactive model domestically.
One More Thing: The full version of "Shang Qian" is now fully online and open for internal testing.
Integrating all the capabilities of SenseNova V6, Shang Qian has also been comprehensively upgraded, with a new Shang Qian app through which users can experience multi-modal streaming interaction across text, images, and video from a single entry point.
The Shang Qian app has started internal testing, and currently, the capabilities of "SenseNova V6" can be experienced on the Shang Qian web platform.
Original Source: https://www.toutiao.com/article/7492574391298179620/
Disclaimer: This article represents the author's personal views.