Machine Heart Report

Editor: Chen Ping

An exciting AI chess competition is about to begin.

Tired of researchers endlessly touting new benchmark scores in papers? It's time to put the models to a real test: are they as impressive as claimed?

A highly anticipated three-day AI chess competition runs from August 5th to 7th, Pacific Time.

On the first day of the competition, eight cutting-edge AI models will go head-to-head:

  • o4-mini (OpenAI)
  • DeepSeek-R1 (DeepSeek)
  • Kimi K2 Instruct (Moonshot AI)
  • o3 (OpenAI)
  • Gemini 2.5 Pro (Google)
  • Claude Opus 4 (Anthropic)
  • Grok 4 (xAI)
  • Gemini 2.5 Flash (Google)

Live stream link: https://www.youtube.com/watch?v=En_NJJsbuus

This competition features top-tier AI models, including two Chinese open-source models, and the field looks evenly matched.

The organizers have also invited world-class chess experts to provide live commentary.

The competition is hosted on Kaggle Game Arena, a new open benchmarking platform launched by Google, where AI models face off directly in strategy games such as chess to determine a winner.

To ensure transparency, the game execution framework and the game environment itself will be open-sourced. The final ranking will be determined through a strict round-robin (all-play-all) format, with many matches between each pair of models to ensure statistically reliable results.
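As a quick sanity check on the scale of an all-play-all format, the sketch below enumerates the unordered pairings among the eight announced models (the variable names are illustrative, not from the platform's code):

```python
from itertools import combinations

# The eight models announced for day one.
models = [
    "o4-mini", "DeepSeek-R1", "Kimi K2 Instruct", "o3",
    "Gemini 2.5 Pro", "Claude Opus 4", "Grok 4", "Gemini 2.5 Flash",
]

# All-play-all: every unordered pair of models meets at least once.
pairings = list(combinations(models, 2))
print(len(pairings))  # C(8, 2) = 28 distinct pairings
```

With multiple games per pairing (and sides swapped), the total number of matches grows well beyond 28, which is what gives the final ranking its statistical weight.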

Nobel laureate and co-founder and CEO of Google DeepMind, Demis Hassabis, said excitedly: "Games have always been an important testing ground for AI capabilities (including our research on AlphaGo and AlphaZero). Now we are extremely excited about the progress this benchmark testing platform can drive. As we continue to introduce more games and challenges into the Arena, we expect AI capabilities to improve rapidly!"

"Kaggle Game Arena is a new leaderboard platform that pits AI systems against each other. As model capabilities improve, the difficulty of the competition will keep increasing."

As for why this competition was organized, Google's blog post explained: current AI benchmarks are no longer keeping pace with modern models. Although these tests remain useful for measuring performance on specific tasks, it is hard to tell whether models trained on Internet data are truly solving problems or merely repeating answers they have seen before. As models approach 100% scores on certain benchmarks, those tests gradually lose their ability to distinguish between models.

Therefore, while continuously developing existing benchmark tests, researchers are also exploring new methods for evaluating models. Kaggle Game Arena was born under such circumstances.

Competition Introduction

Each game on the Kaggle Game Arena platform has a detailed page where users can view:

  • Real-time updated match schedules;
  • Dynamic leaderboard data;
  • Open source environment code and technical documentation for the game.

The full match schedule is available here:

Match Schedule: https://www.kaggle.com/benchmarks/kaggle/chess-text/tournament

Model performance in the game will be displayed on the Kaggle Benchmarks leaderboard.

Competition Rules

Since current large models are strongest at text, the competition starts with a text-based interface: models receive the board state and submit their moves as text.

Here is a brief explanation of the execution framework:

  • Models cannot use any external tools. For example, they cannot call chess engines like Stockfish to get the best moves.
  • Models will not be informed of the list of legal moves in the current situation.
  • If a model makes an illegal move, the framework gives it up to 3 retries. If the model still fails to submit a legal move after 4 total attempts, the game ends: that model loses and its opponent wins.
  • Each move has a 60-minute timeout limit.
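The retry-and-forfeit rule above can be sketched as a small adjudication loop. This is a minimal illustration, not the platform's actual code: `request_move`, the feedback string, and the toy model are all hypothetical, and note that the framework adjudicates legality itself without ever revealing the legal-move list to the model.

```python
MAX_ATTEMPTS = 4  # one original try plus up to three retries

def request_move(model, board_state, legal_moves):
    """Ask `model` for a move; return None (forfeit) after MAX_ATTEMPTS illegal tries.

    `model` is any callable taking (board_state, feedback) and returning a
    move string. Per the rules, `legal_moves` is used only to adjudicate
    and is never shown to the model.
    """
    feedback = None
    for attempt in range(1, MAX_ATTEMPTS + 1):
        move = model(board_state, feedback)
        if move in legal_moves:
            return move
        feedback = f"illegal move {move!r} (attempt {attempt})"
    return None  # forfeit: this model loses, the opponent wins

# A toy model that blunders twice before finding a legal move.
replies = iter(["e2e5", "Ke9", "e2e4"])
toy_model = lambda state, fb: next(replies)

result = request_move(toy_model, "startpos", {"e2e4", "d2d4", "g1f3"})
print(result)  # e2e4
```

A real implementation would also enforce the 60-minute per-move timeout around each `model(...)` call; that is omitted here for brevity.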

During the competition, the audience will be able to see how each model reasons about its moves, as well as how they self-correct after making illegal moves.

Everyone is already eager to see the results of the competition.

For more information on the competition methods, please refer to: https://www.kaggle.com/game-arena

The first match starts in 14 hours. Which model do you think will come out on top?

Original article: https://www.toutiao.com/article/7534978284640895540/
