Japan's Sakana AI Unveils Fugu, Claims Edge Over Claude Fable 5 in Coding
According to benchmark results shared by the firm, Fugu’s value is most evident in long, messy, real-world workflows.
[Image: Chetan Jha/MITSMR Middle East]
Riding the current AI momentum, Japanese AI startup Sakana AI has unveiled Fugu, a multi-agent orchestration platform that the company claims surpasses Anthropic's Claude Fable 5 on select coding and reasoning benchmarks.
For the uninitiated: instead of relying on a single large language model, an orchestration platform coordinates multiple AI agents, databases, and APIs so that they function seamlessly together.
The company has launched two versions of the platform. The standard Fugu model is for everyday tasks like coding and chat, while Fugu Ultra is built for harder, more complex work.
“Today, this orchestration is no longer just a technical optimization; it has become a geopolitical and operational imperative. Recent disruptions in the AI landscape have demonstrated the severe risk of single-vendor dependency. For an organization or a nation, relying on a single company’s APIs for critical infrastructure, finance, or governance is a material vulnerability. This risk is no longer a hypothetical possibility, but a reality,” the official release read.
According to benchmark results shared by the firm, Fugu's value is most evident in long, messy, real-world workflows. On LiveCodeBench, Fugu Ultra and the standard Fugu model scored 93.2 and 92.9, respectively. Anthropic's Claude Fable 5 scored 89.8, while OpenAI's GPT-5.5 posted the lowest score of the group, 85.3.
Other key benchmarks included GPQA-Diamond, CharXiv Reasoning, SWE-Bench Pro, SciCode, Terminal-Bench 2.1, and CTI-REALM.
Notably, on Humanity's Last Exam (text), an ultra-challenging academic benchmark of 2,500 highly specific questions spanning more than 100 subject domains, all participating models scored between 44.3 and 53.3.
Fugu and Fugu Ultra tied for the top score of 95.5 on GPQA-Diamond, a benchmark of graduate-level multiple-choice science questions designed to test the limits of scientific reasoning.
“One of the clearest signals came from automated data science research: early users running Sakana Fugu in an almost fully automated research mode saw it drive meaningful progress with little to no human intervention,” the release said.
Early users reported that Fugu Ultra outperformed GPT-5.5 in code review, citing its ability to catch bugs that other models missed and to deliver more comprehensive answers. On security tasks, one user noted that Fugu completed a full end-to-end assessment without prompting, staying within scope and avoiding destructive actions throughout.
Going forward, Sakana AI will expand the pool of expert agents, including open models and its own models, to strengthen coordination on long-running and agentic tasks.
