OpenAI Makes Coding Leap With GPT-5.1-Codex-Max Launch



Nov 20, 2025



OpenAI has released GPT-5.1-Codex-Max, describing it as a major step forward for long-running, agentic coding systems.

The model is now available to all Codex users across the CLI, IDE extension, cloud environment, and code-review tools, with API access expected to follow soon.

According to the announcement, the new system is considerably faster, more intelligent, and more efficient than its predecessors, marking a shift toward AI that can sustain extensive, multi-hour software-engineering workflows with far fewer interruptions and failures.

Long-horizon coding models
Software engineering performance
Token efficiency
Sustaining work
Cybersecurity capabilities and risks
Human oversight in the era of autonomous agents
Rollout across plans and products
Broader implications

Long-horizon coding models

OpenAI characterizes GPT-5.1-Codex-Max as the company’s first coding model explicitly trained to operate across multiple context windows through a technique called compaction. The model can maintain coherent reasoning across millions of tokens, enabling tasks that traditionally exceeded AI systems’ memory limits.

This architecture allows the model to handle project-scale refactors, deep debugging sessions, continuous tool-use loops, and sustained autonomous work. According to OpenAI, GPT-5.1-Codex-Max can operate for hours without needing human intervention, and internal tests demonstrated uninterrupted sessions exceeding 24 hours.

Researchers see the ability to manage such prolonged workflows as a precursor to more general AI agents capable of larger, more complex projects. OpenAI says GPT-5.1-Codex-Max represents tangible progress along that path.

Software engineering performance

OpenAI reports that GPT-5.1-Codex-Max outperforms prior Codex models across several coding benchmarks that simulate real-world engineering tasks. These include PR authoring, code review, frontend implementation work, and Q&A-based debugging.

The model is also OpenAI’s first Codex variant trained to operate reliably in Windows environments — a significant shift, given that many enterprise engineering teams rely on Windows-based infrastructure.

OpenAI states that GPT-5.1-Codex-Max was trained on tasks designed not only to improve its reasoning but also to make it a more collaborative partner inside the Codex CLI, with better interactive responsiveness and more robust tool-call behavior.

Token efficiency

One of the key advertised improvements is token efficiency. OpenAI says that, at the same reasoning-effort setting, GPT-5.1-Codex-Max achieves better results with substantially fewer thinking tokens. For example, on SWE-bench Verified, the model reportedly achieves higher accuracy than GPT-5.1-Codex while using 30% fewer thinking tokens.

The company is introducing an Extra High reasoning mode, labeled “xhigh,” intended for tasks where latency is not a concern and where deeper reasoning may provide more reliable outcomes. However, medium-effort reasoning is still recommended for daily use.

In practical terms, OpenAI claims this efficiency leads to real cost savings. During internal tests, GPT-5.1-Codex-Max produced complex frontend designs comparable in functionality and visual polish to those of its predecessor while requiring significantly fewer resources.
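To make the 30% figure concrete, here is a back-of-the-envelope sketch. The baseline token count and the per-token price are hypothetical placeholders chosen for round numbers, not figures from OpenAI, and the sketch assumes thinking tokens are billed at a flat rate.

```python
def thinking_cost_usd(tokens: int, usd_per_million_tokens: float) -> float:
    """Cost of a given number of thinking tokens at a flat per-million rate."""
    return tokens * usd_per_million_tokens / 1_000_000


# Hypothetical inputs for illustration only (not OpenAI pricing or usage data).
BASELINE_THINKING_TOKENS = 200_000
USD_PER_MILLION_TOKENS = 10.0

# The one number taken from the article: 30% fewer thinking tokens.
REDUCTION_PERCENT = 30

reduced_thinking_tokens = BASELINE_THINKING_TOKENS * (100 - REDUCTION_PERCENT) // 100
baseline_cost = thinking_cost_usd(BASELINE_THINKING_TOKENS, USD_PER_MILLION_TOKENS)
reduced_cost = thinking_cost_usd(reduced_thinking_tokens, USD_PER_MILLION_TOKENS)

print(f"baseline: {BASELINE_THINKING_TOKENS} tokens, ${baseline_cost:.2f}")
print(f"reduced:  {reduced_thinking_tokens} tokens, ${reduced_cost:.2f}")
```

Under these assumptions the saving scales linearly: a 30% cut in thinking tokens is a 30% cut in the thinking-token portion of the bill, whatever the actual price.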

Sustaining work

The compaction mechanism is central to what differentiates GPT-5.1-Codex-Max from earlier systems. As sessions approach the model’s context window limit, it automatically compresses the session history — pruning low-importance information while preserving critical context.

According to OpenAI, this process frees up space to continue the task without losing progress, effectively allowing a single project to span multiple independent context windows without context collapse or catastrophic forgetting.
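OpenAI has not published the internals of compaction, so the following is only a toy illustration of the general idea described above: when a session history exceeds a budget, drop the oldest low-importance entries and keep the critical ones. The `Entry` class, the `important` flag, the word-count "tokenizer," and the `compact` function are all assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """One item in a session history (hypothetical structure)."""
    text: str
    important: bool = False

    @property
    def tokens(self) -> int:
        # Crude stand-in for a real tokenizer: count whitespace-separated words.
        return len(self.text.split())


def total_tokens(history: list[Entry]) -> int:
    return sum(entry.tokens for entry in history)


def compact(history: list[Entry], limit: int) -> list[Entry]:
    """Drop the oldest low-importance entries until the history fits the limit.

    Important entries are never dropped, so the result can still exceed the
    limit if the important entries alone are too large.
    """
    excess = total_tokens(history) - limit
    dropped: set[int] = set()
    for index, entry in enumerate(history):  # oldest first
        if excess <= 0:
            break
        if not entry.important:
            dropped.add(index)
            excess -= entry.tokens
    return [entry for index, entry in enumerate(history) if index not in dropped]


history = [
    Entry("Goal: refactor the parser module and keep all tests green", important=True),
    Entry("ls output: a.py b.py c.py d.py"),
    Entry("Verbose build log line one two three four"),
    Entry("Test test_parse_empty fails with IndexError", important=True),
    Entry("Opened parser.py and read the tokenizer loop"),
]

compacted = compact(history, limit=25)
print(total_tokens(history), "->", total_tokens(compacted))
```

A production system would presumably summarize rather than simply delete, and would need some way to judge importance; this sketch hard-codes that judgment to keep the mechanics visible.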

The company showcased a 24-hour autonomous refactor of the open-source Codex CLI repository. The model repeatedly tested its own work, adapted to failing tests, and iterated until results met the criteria. Such demonstrations highlight how agentic coding models could increasingly perform high-value engineering tasks end-to-end.
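The test-and-iterate behavior described above follows a familiar control loop, sketched below in simplified form. The function `iterate_until_green` and the two callables it takes are hypothetical names for this illustration; they do not correspond to any Codex API, and the "fix" step here is a stub rather than a model call.

```python
from typing import Callable


def iterate_until_green(
    run_tests: Callable[[], list[str]],
    apply_fix: Callable[[list[str]], None],
    max_rounds: int = 10,
) -> int:
    """Run tests, attempt a fix for any failures, and repeat.

    Returns the number of fix rounds used. Raises RuntimeError if tests are
    still failing after max_rounds fixes.
    """
    for rounds_used in range(max_rounds):
        failures = run_tests()
        if not failures:
            return rounds_used
        apply_fix(failures)
    if not run_tests():
        return max_rounds
    raise RuntimeError(f"tests still failing after {max_rounds} rounds")


# Stub environment: two failing tests, and each "fix" resolves one of them.
remaining_failures = ["test_parse_empty", "test_parse_nested"]


def run_tests() -> list[str]:
    return list(remaining_failures)


def apply_fix(failures: list[str]) -> None:
    remaining_failures.remove(failures[0])


rounds = iterate_until_green(run_tests, apply_fix)
print(f"green after {rounds} fix rounds")
```

The round budget matters in practice: without one, an agent that cannot fix a failure would loop indefinitely, which is one reason long-running sessions still call for human review of logs and results.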

Cybersecurity capabilities and risks

GPT-5.1-Codex-Max shows enhanced long-horizon reasoning in cybersecurity domains as well. The model does not yet qualify as “High” capability under OpenAI’s Preparedness Framework but is described as the most advanced cybersecurity system the company has deployed to date.

This increased capability brings both opportunity and risk.

On one hand, defenders could benefit from automated vulnerability scanning, patch synthesis, and continuous monitoring. On the other hand, the same capabilities could be misused by malicious actors to identify weaknesses or automate offensive operations.

Acknowledging this dual-use challenge, OpenAI says it has implemented stronger cybersecurity-specific monitoring. The company reports it has already disrupted attempted misuse and is expanding protections to prepare for models that approach or exceed High capability classifications.

Codex continues to run in a restricted sandbox environment by default. OpenAI strongly advises developers to avoid enabling network access unless absolutely necessary, warning that exposure to untrusted content can introduce risks such as prompt-injection attacks.

Human oversight in the era of autonomous agents

As GPT-5.1-Codex-Max becomes capable of completing longer and more complex tasks autonomously, OpenAI stresses the importance of human oversight. Developers are encouraged to review logs, tool calls, and test results before deploying any model-generated code.

OpenAI says Codex should be treated as “an additional reviewer and not a replacement for human reviews,” even though its automated checks significantly reduce the risk of bugs reaching production.

This represents an emerging paradigm in software engineering where AI agents may handle much of the implementation and iteration, while humans serve as supervisors, auditors, and final approvers.

Rollout across plans and products

GPT-5.1-Codex-Max is now available across the Codex ecosystem for ChatGPT Plus, Pro, Business, Edu, and Enterprise customers. It also becomes the default Codex model starting today, replacing GPT-5.1-Codex in all major user-facing surfaces.

The company notes that GPT-5.1 remains a general-purpose model, while GPT-5.1-Codex-Max is specifically optimized for agentic coding tasks and is best used within Codex environments or tools that closely resemble them.

API access for developers using Codex CLI via API key is planned to launch in the near future.

Broader implications

OpenAI highlights the internal impact of Codex adoption: 95% of the company’s engineers use Codex weekly, and teams reportedly ship about 70% more pull requests since integrating Codex into their workflow.

This suggests a broader shift underway in how development teams might operate. As AI models grow capable of sustaining deeper reasoning, coordinating multi-step tasks, and managing codebases over long time horizons, the relationship between developers and their tools could evolve toward more collaborative human-AI workflows.

Looking ahead, OpenAI frames GPT-5.1-Codex-Max as another milestone in the journey toward general-purpose agentic AI systems capable of handling entire engineering projects autonomously.
