OpenAI Raises the Bar for Enterprise AI with GPT-5.2

New model aims to turn AI time savings into productivity gains across spreadsheets, code and complex reasoning.

MITSloan ME Editorial December 12, 2025

OpenAI has launched GPT-5.2, its latest generation of large language models designed to boost performance on professional and knowledge-work tasks such as spreadsheets, presentations, coding, long-form reasoning and complex tool use.

The release comes as many enterprise users say AI tools already save them up to an hour a day, with heavy adopters reporting more than 10 hours saved weekly.

The GPT-5.2 family, available in Instant, Thinking and Pro variants, is designed to push that further.

Early testers from companies such as Notion, Box, Shopify, Harvey, and Zoom report stronger long-horizon reasoning and smoother tool use, while teams at Databricks, Hex, and Triple Whale observed sharper data-science and document-analysis capabilities.

Coding-focused firms, including Cognition, JetBrains, and Warp, say the new model shows measurable gains in debugging, refactoring, and interactive programming tasks.

GPT-5.2 rolled out on Thursday, December 11, across ChatGPT paid tiers, and the API versions are already available to developers.

The company says the model offers broad improvements in general intelligence, context handling, and end-to-end task execution.

One of the most notable changes appears in “GPT-5.2 Thinking,” which the company positions as its top performer for professional use.

On GDPval, an evaluation of knowledge-work tasks across 44 occupations, the model reportedly matches or surpasses expert humans on more than 70% of comparisons, producing outputs like presentations and spreadsheets in a fraction of the time and at a fraction of the cost.

Internal tests also show an uptick in performance on investment-banking-style spreadsheet modelling, rising nearly 10 percentage points over GPT-5.1.

Software engineering benchmarks also show a jump. GPT-5.2 Thinking sets a new high score on SWE-Bench Pro, a multi-language, real-world coding test. Early feedback points to better handling of complex front-end and 3D interface work.

The model also shows fewer factual errors, improved document-level reasoning across long contexts, sharper interpretation of charts and software interfaces, and stronger image-layout understanding.

Tool-calling accuracy has climbed as well, hitting 98.7% on a telecom-task benchmark.

For research communities, the Pro and Thinking variants hit over 92% on GPQA Diamond, a graduate-level science and math benchmark, indicating growing relevance for technical and scientific workflows.
