---
title: "The first lobster large model ranking is here! Two domestic AI models have entered the global top three, a must-read for shrimp farming"
type: "News"
locale: "en"
url: "https://longbridge.com/en/news/278291174.md"
description: "The first lobster large model ranking has been released. Tencent's OpenClaw has been hailed as a landmark piece of software, with AI performing impressively in highly personalized environments. The new PinchBench benchmark evaluates 32 large language models; Google's Gemini 3 Flash tops the chart with a 95.1% success rate, surpassing other mainstream models and demonstrating the payoff of model efficiency optimization."
datetime: "2026-03-09T00:13:51.000Z"
locales:
  - [zh-CN](https://longbridge.com/zh-CN/news/278291174.md)
  - [en](https://longbridge.com/en/news/278291174.md)
  - [zh-HK](https://longbridge.com/zh-HK/news/278291174.md)
---

> Supported Languages: [简体中文](https://longbridge.com/zh-CN/news/278291174.md) | [繁體中文](https://longbridge.com/zh-HK/news/278291174.md)

# The first lobster large model ranking is here! Two domestic AI models have entered the global top three, a must-read for shrimp farming

"How many lobsters are you raising now?" has become the most common greeting these days. Last week, people lined up at Tencent's Shenzhen headquarters to get OpenClaw for free; it is truly a generational trend. Even Jensen Huang praised OpenClaw as "the most important software release in history," arguing that it has proven AI can faithfully replicate complex human workflows in highly personalized environments.

The lobster-raising craze has spawned a benchmark built specifically for OpenClaw, called PinchBench, which evaluates how well large language models perform on OpenClaw tasks. PinchBench's scoring method is fairly rigorous: some tasks check whether the code runs (automated checks), some assess writing quality (with Claude Opus as the judge), and others combine both.
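The three scoring styles described above (automated pass/fail checks, an LLM judge, and a mix of both) can be sketched roughly as follows. This is a hypothetical illustration, not PinchBench's actual harness: the function names (`automated_check`, `judge_score`, `combined_score`), the 50/50 weighting, and the stubbed-out judge call are all my own assumptions.

```python
import subprocess
import sys
import tempfile

def automated_check(code: str) -> bool:
    """Automated check: run the generated code and report whether it
    exits cleanly. Real tasks would define stricter criteria."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run([sys.executable, path],
                            capture_output=True, timeout=30)
    return result.returncode == 0

def judge_score(writeup: str) -> float:
    """LLM-as-judge placeholder (the article says PinchBench uses Claude
    Opus as judge). Stubbed with a fixed score so the sketch is runnable;
    a real harness would call the judge model and parse a 0..1 rating."""
    return 0.8

def combined_score(code: str, writeup: str, weight: float = 0.5) -> float:
    """Tasks that 'combine both': weighted mix of the automated
    pass/fail result and the judge's writing-quality rating."""
    passed = 1.0 if automated_check(code) else 0.0
    return weight * passed + (1 - weight) * judge_score(writeup)
```

A model's leaderboard success rate would then be the fraction of tasks whose score clears each task's threshold, averaged over the benchmark.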
All questions and answers are open-sourced on GitHub, so anyone can verify the results. Today, OpenClaw founder Peter Steinberger shared the lobster benchmark leaderboard. PinchBench tested 32 mainstream large models in one go, measuring success rate, speed, and cost to see which model is best suited for raising lobsters.

PinchBench official website: https://pinchbench.com/

## **Gemini 3 Flash has the highest success rate, and domestic models also shine**

Let's start with the ranking that matters most: success rate. Google's Gemini 3 Flash Preview topped the chart at 95.1%, which honestly surprised me. The Flash series has always been the "lightweight" tier of Gemini, focused on speed and cost-effectiveness, and I didn't expect it to beat its Pro sibling, along with the Claude and GPT series, on accuracy this time. It shows that Google has genuinely invested in model efficiency: a lightweight model is not necessarily a weak one; it all depends on how it is tuned.

Second place goes to MiniMax M2.1, with a success rate of 93.6%. Domestic models have really stepped up: MiniMax's showing is impressive, edging out Claude Sonnet 4.5 (92.7%) and GPT-4o (85.2%). Kimi K2.5 follows closely at 93.4%. Kimi has always been strong on long-form text, and this time it proved itself on programming tasks as well. Together with MiniMax, the domestic duo claims two of the top three spots.

Further down, Claude Sonnet 4.5 ranks fourth (92.7%), Gemini 3 Pro fifth (91.7%), and Claude Haiku 4.5 sixth (90.8%). Interestingly, Claude Opus 4.6, Anthropic's flagship model, managed a success rate of only 90.6%, placing seventh.
It seems "big" does not necessarily mean "strong"; at least in programming scenarios, mid-range models look more appealing.

## **Speed is Key, and MiniMax Wins Big**

Nobody wants to sit idle in front of the screen while heavy tasks grind along; speed directly affects your mood at work. MiniMax M2.5 took the speed crown, completing all test tasks in 105.96 seconds. What does that mean? It is only 0.09 seconds faster than second-place Gemini 2.0 Flash, but first is still first.

Third place goes to Llama 3.1 70B (106.14 seconds), fourth to Gemini 1.5 Pro (106.85 seconds), and fifth to Mistral Large (107.72 seconds); the gaps among them are small, and they sit in basically the same tier.

Things get more interesting further down. Claude Sonnet 4 took 137.66 seconds, about 30 seconds behind the first tier. Gemini 3 Pro took even longer at 239.55 seconds, more than twice MiniMax M2.5's time. This points to a rule: **lightweight models are generally faster**. If you are doing rapid prototyping with frequent iterations, a lightweight model is clearly the right choice. For a task that only needs to run once, waiting on a large model is fine too.

## **How to Raise Lobsters Most Cost-Effectively**

Raising lobsters requires careful budgeting: many OpenClaw tasks are token black holes, and a moment's inattention can leave you questioning life.

GPT-5 Nano is the cheapest option at $0.03, with a success rate of 85.8%. Its accuracy is not top-notch, but at that price, what more could you want? It suits scenarios with tight budgets and a high tolerance for errors.

Gemini 2.5 Flash Lite ranks second at just $0.05, with a success rate of 83.2%. The cost-performance ratio is striking: it costs less than twice as much as GPT-5 Nano, while its success rate is only 2.6 percentage points lower.

MiniMax M2.1 ranks fifth at $0.14, but don't forget its success rate is 93.6%.
Calculated per percentage point of success, that is only about $0.0015, excellent value.

The costs of the high-end models are a bit shocking by comparison. Claude Opus 4.6 cost $5.89 to complete the test, nearly 200 times GPT-5 Nano's bill, yet its success rate is only 90.6%, three percentage points below MiniMax M2.1. However you run the numbers, that does not look cost-effective. Unless you have a special brand loyalty to Claude, mid-range models are clearly the more rational choice from a pure cost-performance perspective.

## **How to Choose Your Lobster-Farming Model**

Having reviewed the rankings across three dimensions, you have probably formed your own judgment. Here are a few scenario-based suggestions from APPSO:

If you prioritize success rate, go straight for Gemini 3 Flash. With a 95.1% success rate at a cost of $0.72, it currently offers the best overall performance. It suits production environments where code quality is paramount and the cost of errors far exceeds the cost of the model; choosing it is a safe bet.

⚡ If you prioritize speed, choose MiniMax M2.5 or Gemini 2.0 Flash. Both finish all tasks in about 106 seconds, making them ideal for rapid prototyping and frequent iteration. Time is money, and these two will spare you a lot of waiting.

If you prioritize cost-performance, choose Gemini 2.5 Flash Lite. At $0.05 with an 83.2% success rate, it is the best entry ticket to "lobster farming," ideal for personal projects, small teams, and tight budgets.

If you want to minimize hassle and prefer domestic models, both MiniMax M2.1 and Kimi K2.5 are very competitive: MiniMax M2.1 ranks second at 93.6%, and Kimi K2.5 ranks third at 93.4%. Both have entered the top tier.
Moreover, MiniMax offers the fastest speed and excellent cost-performance, making it worth special attention.

This PinchBench list shows that agents have entered an era of "a hundred flowers blooming." Google's Gemini series leads on efficiency and cost, domestic models MiniMax and Kimi follow closely, and OpenAI and Anthropic stay competitive at the high end.

For developers, the good news is that there are more choices than ever. The bad news is that choice paralysis may get worse. But don't worry; remember one principle: **there is no best model, only the model best suited to your scenario**. In production environments, focus on success rate; for prototype development, prioritize speed; for personal projects, weigh cost-effectiveness. Choose according to your needs.

APPSO would also remind everyone that **installing OpenClaw may not cost much, but the tokens consumed by "raising lobsters" can far exceed what we used to spend on conversations with AI**. A few days ago, at an OpenClaw gathering in New York, many users shared their lobster-farming experiences: some spend as much as $1,000 to $2,000 per month on tokens, and one "wealthy" player burns through 1 billion tokens a day. Without faith, one simply cannot afford such spending.

Trying OpenClaw is fine, but it is not for everyone; for many tasks today, using lobsters is not the optimal solution. The greater value lies in experiencing the new mode of interaction that AI brings.

Risk Warning and Disclaimer: The market carries risks, and investment requires caution. This article does not constitute personal investment advice and does not take into account the specific investment goals, financial situation, or needs of individual users. Users should consider whether any opinions, views, or conclusions in this article fit their particular circumstances.
Invest at your own risk.

### Related Stocks

- [Alphabet Inc. (GOOG.US)](https://longbridge.com/en/quote/GOOG.US.md)
- [State Street SPDR S&P Software & Services ETF (XSW.US)](https://longbridge.com/en/quote/XSW.US.md)
- [Alphabet Inc. (GOOGL.US)](https://longbridge.com/en/quote/GOOGL.US.md)
- [Roundhill GOOGL WeeklyPay ETF (GOOW.US)](https://longbridge.com/en/quote/GOOW.US.md)
- [Direxion Daily GOOGL Bull 2X Shares (GGLL.US)](https://longbridge.com/en/quote/GGLL.US.md)
- [TENCENT (00700.HK)](https://longbridge.com/en/quote/00700.HK.md)
- [Tencent Holdings Limited (TCEHY.US)](https://longbridge.com/en/quote/TCEHY.US.md)
- [iShares Expanded Tech-Software Sector ETF (IGV.US)](https://longbridge.com/en/quote/IGV.US.md)
- [Tencent Holdings Limited (TCTZF.US)](https://longbridge.com/en/quote/TCTZF.US.md)

## Related News & Research

- [Alphabet Inc. $GOOG Shares Purchased by HUB Investment Partners LLC](https://longbridge.com/en/news/278263989.md)
- [Google’s Gemini rolls out Canvas in AI mode to all US users](https://longbridge.com/en/news/277822208.md)
- [Google embraces third party app stores and payments to put Epic Games case behind it](https://longbridge.com/en/news/277879517.md)
- [Google DeepMind executive invites Qwen team to join amid leadership changes at Qwen](https://longbridge.com/en/news/278013018.md)
- [Big Google Home update lets Gemini describe live camera feeds](https://longbridge.com/en/news/277600918.md)