Google has announced Android Bench to measure how well AI models perform real Android app development tasks.

Gemini 3.1 Pro tops the Android Bench leaderboard, outperforming Claude Opus and GPT Codex models.

The benchmark tests AI models using real Android coding challenges with varying levels of difficulty.

AI is no longer just about generating images and videos from text. Now you can even build working apps using just a prompt. That said, not every AI model pitched for app building performs equally well, and Google wants a benchmark that shows which models actually work best.

Vibe coding has quickly become one of the trends of 2026, with more people trying to build their own apps and services using AI. Phone maker Nothing, for example, recently showcased a tool that lets users create small apps using prompts.

But anyone who has worked with Android development knows it takes more than just typing a few prompts, and Google wants to highlight which AI models are actually capable of handling those tasks.

To do that, Google has introduced Android Bench, a benchmark and leaderboard designed to evaluate large language models specifically for Android development. It measures how well AI models perform real-world Android development tasks by testing them against a set of challenges of varying difficulty.


According to Google, the tested models were able to complete between 16% and 72% of the tasks successfully. The model that performed best was Google's Gemini 3.1 Pro Preview with a score of 72.2%. Claude Opus 4.6 followed with a score of 66.6%, while GPT 5.2 Codex finished third with 62.5%.
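For readers curious how a pass-rate leaderboard like this could be tallied, here is a minimal Python sketch. It is not Google's scoring code: the function names and the toy pass/fail list are hypothetical, and only the three final scores come from the figures reported above.

```python
from typing import Dict, List


def pass_rate(results: List[bool]) -> float:
    """Return the percentage of tasks passed, rounded to one decimal place."""
    if not results:
        raise ValueError("results must contain at least one task outcome")
    return round(100 * sum(results) / len(results), 1)


def rank_models(scores: Dict[str, float]) -> List[str]:
    """Return model names ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy run (not real benchmark data): 3 of 4 tasks passed.
toy_results = [True, True, False, True]
print(pass_rate(toy_results))

# Scores as reported in the article, deliberately listed out of order.
reported = {
    "Claude Opus 4.6": 66.6,
    "GPT 5.2 Codex": 62.5,
    "Gemini 3.1 Pro Preview": 72.2,
}
for position, name in enumerate(rank_models(reported), start=1):
    print(position, name, f"{reported[name]}%")
```

The sketch treats every task as equally weighted, which is an assumption; the published methodology may score tasks differently.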

The results suggest that the leading AI models are already quite capable at helping with Android development, although even the top scorer failed more than a quarter of the tasks. Google says the goal of Android Bench is to "close the gap between concept and quality code." In the long run, the company believes people could build Android apps simply by describing what they want.

To ensure transparency, Google has also made the methodology, dataset, and testing tools publicly available on GitHub.

Android Central's Take

It may not matter much to the average user, but benchmarking LLMs specifically for Android development is great for the developer community. It makes it easier to identify which models are actually useful for building apps instead of relying on guesswork or trying multiple tools before finding one that works well.