Claude Opus 4.7 Review: The New Coding Benchmark Leader From Anthropic

Key Takeaways

Claude Opus 4.7, released on April 16, 2026, scores 87.6% on SWE-bench Verified, placing it ahead of Gemini 3.1 Pro (80.6%) and GPT-5.4 on pure software engineering tasks.
On CursorBench, Opus 4.7 hits 70%, a major jump from Opus 4.6’s 58%, making it the most capable Claude model yet for real-world IDE-style coding.
API pricing is $5 per million input tokens and $25 per million output tokens, unchanged from Opus 4.6, though a new tokenizer can increase token consumption by up to 35% for the same text.
A new xhigh effort level sits between the previous “high” and “max” settings, and Claude Code defaults to it automatically for all plans.
Image resolution support jumped from 1,568px (1.15MP) to 2,576px (3.75MP), with computer-use vision accuracy rising from 54.5% to 98.5% on one developer’s internal benchmark.
The model introduces self-verification: before reporting a task complete, it writes tests, runs them, and fixes failures internally, reducing the need for manual review cycles.
On SWE-bench Pro (harder real-world GitHub issues), Opus 4.7 scores 64.3%, ahead of GPT-5.5 (58.6%) and Gemini 3.1 Pro (54.2%), according to DataCamp’s comparison.
Claude.ai subscribers on Pro ($20/month), Max ($100-$200/month), Team ($30/user/month), and Enterprise plans all get access to Opus 4.7.
A 1 million token context window is included at standard pricing, though recall accuracy can degrade when approaching the upper limit in long agentic tasks.

Anthropic has been on a rapid release cycle through 2025 and into 2026, shipping Opus 4.5 in November 2025, Opus 4.6 in February 2026, and now Opus 4.7 in April 2026. Each iteration has pushed coding and agentic benchmarks measurably forward. This Claude Opus 4.7 review covers what actually changed, what the benchmark numbers mean in practice, where the model falls short, and who should consider upgrading.

Unlike incremental point releases that shift numbers by a percentage point or two, Opus 4.7 introduced three changes that are genuinely material for production use: a substantial vision resolution increase, a new effort tier (xhigh), and built-in output self-verification. Combined with continued strength on SWE-bench and CursorBench, these additions make a real difference for teams running agentic coding pipelines or computer-use automation. The question is whether those gains are worth navigating the new tokenizer’s higher consumption, and whether the $5/$25 per million token price point is still competitive.

This review covers the exact model version claude-opus-4-7-20260416 as of its April 2026 release. All benchmark numbers cited are from published sources and Anthropic’s official documentation.

What Is Claude Opus 4.7?

Claude Opus 4.7 is the flagship model from Anthropic’s Claude 4 family, positioned above Claude Sonnet 4 and Haiku 4 in the capability tier. It is a hybrid reasoning model, meaning it can operate in standard response mode for fast outputs or activate extended thinking for problems requiring multi-step reasoning. Both modes support tool use, including web search during thinking, so the model can alternate between reasoning and external data retrieval in a single pass.

The Opus 4 lineage started with the original Claude Opus 4 in May 2025 at $15/$75 per million tokens, a price point that put it in direct competition with OpenAI’s top-tier offering at the time. Subsequent versions cut that price dramatically: today’s Opus 4.7 costs $5/$25 per million tokens, a 67% reduction from the original while delivering substantially higher benchmark scores across coding, vision, and reasoning tasks.

As of May 2026, Opus 4.7 is available via the Anthropic API (model ID: claude-opus-4-7-20260416), through Claude.ai on Pro and higher plans, and on AWS Bedrock, Google Cloud Vertex AI, and Microsoft Azure AI Foundry.

Claude Opus 4.7 Features

Coding and Software Engineering

Software engineering is where Opus 4.7 makes its clearest case. On SWE-bench Verified, the standard test for real GitHub issue resolution, Opus 4.7 scores 87.6%, up from 80.8% on Opus 4.6. That 6.8-point gain is not a rounding error: it places Opus 4.7 ahead of Gemini 3.1 Pro at 80.6% and ahead of GPT-5.5 at approximately 82% on this particular test set. SWE-bench Pro, which uses harder, more recent GitHub issues, puts Opus 4.7 at 64.3% versus GPT-5.5 at 58.6% and Gemini 3.1 Pro at 54.2%.

The CursorBench score of 70% (up from 58% on Opus 4.6) reflects performance on IDE-style multi-file editing tasks that are closer to how developers actually work. This matters because SWE-bench isolates single-issue fixes, while CursorBench tests the kind of context-switching and file coordination that happens in a real codebase.

HumanEval, the function-completion test that was the standard coding benchmark for years, is now saturated: top models all score near the ceiling. Opus 4.6 reached 97.8% on this benchmark, and Opus 4.7 continues at a similar level. The more informative numbers are the SWE-bench and CursorBench figures above.

Self-Verification

One of the most practically useful additions in Opus 4.7 is self-verification. Before the model reports a coding task as complete, it proactively writes tests, runs them, and fixes any failures internally. This means the output that surfaces to the developer or orchestration system has already gone through a verification loop. According to MindStudio’s review, this behavior reduces the manual review cycles that teams previously had to build into their pipelines.

The practical effect: agents running Opus 4.7 are less likely to hand back code that compiles but fails edge cases. For teams that have built elaborate post-processing checks into their pipelines, some of that overhead may become redundant.
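
The write-tests, run, fix cycle described above happens inside the model, and this review has no visibility into how it is implemented. Purely as an illustration of the pattern, here is a minimal client-side sketch of the same loop. The names verify_loop, toy_propose, and toy_check are hypothetical and are not part of any Anthropic API.

```python
from typing import Callable, List, Optional, Tuple


def verify_loop(
    propose: Callable[[List[str]], str],
    check: Callable[[str], List[str]],
    max_rounds: int = 3,
) -> Tuple[Optional[str], int]:
    """Ask for a candidate, check it, and feed failures back until it passes.

    propose(failures) returns a new candidate given the previous failures.
    check(candidate) returns a list of failure messages (empty means pass).
    Returns (candidate, rounds_used), or (None, max_rounds) if nothing passed.
    """
    failures: List[str] = []
    for round_no in range(1, max_rounds + 1):
        candidate = propose(failures)
        failures = check(candidate)
        if not failures:
            return candidate, round_no
    return None, max_rounds


# Toy stand-ins: the first attempt has an edge-case bug, the retry fixes it.
def toy_propose(failures: List[str]) -> str:
    if not failures:
        return "def safe_div(a, b): return a / b"
    return "def safe_div(a, b): return a / b if b else 0.0"


def toy_check(source: str) -> List[str]:
    namespace: dict = {}
    exec(source, namespace)  # acceptable here only because the input is our own toy code
    try:
        assert namespace["safe_div"](6, 3) == 2
        assert namespace["safe_div"](1, 0) == 0.0
    except (AssertionError, ZeroDivisionError) as exc:
        return [type(exc).__name__]
    return []


result, rounds = verify_loop(toy_propose, toy_check)
```

In this toy run the first candidate "compiles but fails an edge case" (division by zero), which is exactly the kind of output the review says surfaces less often when the loop runs before the result is handed back.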

xhigh Effort Level

Anthropic added a new xhigh effort parameter that sits between the previous “high” and “max” settings. According to the official release notes, xhigh provides deeper reasoning than “high” without the full token cost of “max,” and Claude Code now defaults to xhigh for all users. Developers can tune the effort parameter to trade off between response speed and reasoning depth, which is particularly useful for agentic pipelines where some steps need deep reasoning and others just need fast tool calls.
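
To make the per-step trade-off concrete, here is a small sketch of how a pipeline might choose an effort level for each step. It uses only the three levels named in this review; the step kinds and the pick_effort helper are hypothetical, and how the chosen value is actually passed to the API is deliberately not shown (check Anthropic's documentation for the exact request shape).

```python
# Effort levels named in this review, from least to most reasoning.
EFFORT_ORDER = ("high", "xhigh", "max")

# Hypothetical step kinds for an agent pipeline; adjust to your own taxonomy.
STEP_EFFORT = {
    "tool_call": "high",    # quick, mechanical steps
    "planning": "xhigh",    # multi-step reasoning
    "root_cause": "max",    # hardest debugging steps
}


def pick_effort(step_kind: str, default: str = "xhigh") -> str:
    """Return the effort level for a step, falling back to the default."""
    effort = STEP_EFFORT.get(step_kind, default)
    if effort not in EFFORT_ORDER:
        raise ValueError(f"unknown effort level: {effort!r}")
    return effort


plan = [(step, pick_effort(step)) for step in ("planning", "tool_call", "summarize")]
```

The default of xhigh mirrors the Claude Code default described above, so unrecognized steps get the same treatment they would get there.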

Task Budgets for Agentic Control

A new task budget parameter lets developers give Claude a rough token target for a complete agentic loop, including thinking, tool calls, tool results, and final output. This is useful for cost management in production: rather than a loop that runs indefinitely, you can set a ceiling and the model will work within it. According to one developer writeup, Opus 4.7 also provides more frequent progress updates during long agentic tasks, which reduces the “black box” feeling of watching an agent work for minutes with no feedback.
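
The task budget itself is a model-side parameter, and its exact name and request format are not covered here. For teams that also want a guard on their own side of the loop, the sketch below tracks the same four categories against a soft ceiling. TokenBudget is a hypothetical helper, and the token numbers are made up for the dry run.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class TokenBudget:
    """Client-side tally of tokens spent in one agentic loop."""

    ceiling: int
    spent: Dict[str, int] = field(default_factory=dict)

    def record(self, kind: str, tokens: int) -> None:
        if tokens < 0:
            raise ValueError("tokens must be non-negative")
        self.spent[kind] = self.spent.get(kind, 0) + tokens

    @property
    def used(self) -> int:
        return sum(self.spent.values())

    @property
    def remaining(self) -> int:
        return max(self.ceiling - self.used, 0)

    def exhausted(self) -> bool:
        return self.used >= self.ceiling


budget = TokenBudget(ceiling=10_000)
steps = [
    ("thinking", 3_000),
    ("tool_call", 500),
    ("tool_result", 4_000),
    ("thinking", 3_000),
    ("output", 1_000),
]
for kind, tokens in steps:
    if budget.exhausted():
        break  # soft ceiling: the step that crossed it was allowed to finish
    budget.record(kind, tokens)
```

Because the check runs before each step, the ceiling behaves as a rough target rather than a hard cap, which matches how the feature is described above.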

High-Resolution Vision

The vision resolution increase is significant for anyone using Opus 4.7 with the computer use tool. Maximum image resolution rose from 1,568px (1.15MP) to 2,576px (3.75MP), roughly a 3.3x increase in total pixel count. Beyond resolution, the model improved on low-level perception tasks: pointing, counting, measuring, and bounding-box detection on natural images. One developer testing computer-use workflows reported that visual acuity benchmark accuracy jumped from 54.5% on Opus 4.6 to 98.5% on Opus 4.7, effectively eliminating a key pain point in UI-reading agents. CharXiv visual reasoning also jumped from 69.1% to 82.1% on tasks without tool assistance.
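
For a sense of scale, the short sketch below works out the ratios from the figures quoted above and shows one way to pre-scale a screenshot so its long edge fits the new limit. It assumes the pixel figure refers to the long edge of the image, which this review has not confirmed; fit_long_edge is a hypothetical helper, not an Anthropic function.

```python
from typing import Tuple

OLD_EDGE_PX, NEW_EDGE_PX = 1568, 2576   # figures quoted in this review
OLD_MP, NEW_MP = 1.15, 3.75

linear_ratio = NEW_EDGE_PX / OLD_EDGE_PX   # about 1.64x per edge
pixel_ratio = NEW_MP / OLD_MP              # about 3.26x total pixels


def fit_long_edge(width: int, height: int, max_edge: int = NEW_EDGE_PX) -> Tuple[int, int]:
    """Scale (width, height) down so the longer side is at most max_edge.

    Aspect ratio is preserved; images already within the limit are unchanged.
    """
    long_edge = max(width, height)
    if long_edge <= max_edge:
        return width, height
    new_width = max(1, round(width * max_edge / long_edge))
    new_height = max(1, round(height * max_edge / long_edge))
    return new_width, new_height


uhd = fit_long_edge(3840, 2160)   # a 4K screenshot
fhd = fit_long_edge(1920, 1080)   # a 1080p screenshot
```

Under that assumption, a 1080p screenshot now fits without any downscaling, while a 4K capture still needs to shrink by about a third on each side.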

Improved Memory and File System Use

Opus 4.7 is notably better at writing and reading file-system-based memory. For agents that maintain a scratchpad or notes file across sessions, the model is more consistent about recording useful information and retrieving it accurately in later tasks. Context retrieval accuracy rose from around 18.5% on Opus 4.5 to 76% on Opus 4.6, and Opus 4.7 builds on that foundation with better-structured note-taking behavior.

MCP-Atlas and Tool Orchestration

On MCP-Atlas, the benchmark for multi-tool orchestration, Opus 4.7 scores 77.3%, which MindStudio reports as best-in-class across current frontier models. This is relevant for teams building pipelines that combine web search, code execution, file access, and external APIs in a single agent loop.

Claude Opus 4.7 Pricing

The headline pricing is unchanged from Opus 4.6: $5 per million input tokens and $25 per million output tokens via the Anthropic API. Additional cost options include:

Prompt caching: Up to 90% savings on repeated content (e.g., system prompts sent with every request)
Batch processing: 50% savings for non-real-time workloads
AWS Bedrock and Vertex AI: Available at similar price points, with platform-specific volume commitments

The important caveat is the tokenizer change. Opus 4.7 uses a new tokenizer that can consume up to 35% more tokens for the same fixed text compared to Opus 4.6, according to Finout’s cost analysis. For teams migrating existing prompts without modification, actual costs may increase meaningfully. Anthropic recommends updating max_tokens parameters and leaning on caching and batch processing to offset this.
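
To see what the tokenizer change can do to a bill, here is a back-of-the-envelope sketch using the prices and discounts quoted above. It assumes the inflation applies equally to input and output tokens and that the batch discount is a flat 50% on the total; both are simplifications, and estimate_cost_usd is a hypothetical helper.

```python
INPUT_USD_PER_MTOK = 5.00     # from the pricing above
OUTPUT_USD_PER_MTOK = 25.00
BATCH_DISCOUNT = 0.50         # "50% savings for non-real-time workloads"


def estimate_cost_usd(
    input_tokens: int,
    output_tokens: int,
    token_inflation: float = 0.0,
    batch: bool = False,
) -> float:
    """Estimate API cost for a workload whose token counts were measured on Opus 4.6.

    token_inflation is the fractional growth in token count under the new
    tokenizer (0.35 is the worst case quoted in this review).
    """
    factor = 1.0 + token_inflation
    cost = (
        input_tokens * factor * INPUT_USD_PER_MTOK
        + output_tokens * factor * OUTPUT_USD_PER_MTOK
    ) / 1_000_000
    return cost * (1.0 - BATCH_DISCOUNT) if batch else cost


baseline = estimate_cost_usd(1_000_000, 200_000)                          # 10.00
worst_case = estimate_cost_usd(1_000_000, 200_000, token_inflation=0.35)  # about 13.50
batched = estimate_cost_usd(1_000_000, 200_000, token_inflation=0.35, batch=True)
```

In this example, a workload that cost $10.00 on Opus 4.6 could cost up to about $13.50 at the same per-token rate, and moving it to batch processing would bring it down to about $6.75.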

For Claude.ai subscriptions, Opus 4.7 is available on:

Pro: $20/month (approximately 45 Opus messages per 5-hour window)
Max 5x: $100/month (5x Pro usage limits)
Max 20x: $200/month (20x Pro usage limits)
Team: $30 per user per month
Enterprise: Custom pricing

Claude Opus 4.7 Pros and Cons

Pros:

Best-in-class SWE-bench Verified score at 87.6%, ahead of all generally available competitors as of April 2026
CursorBench improvement to 70% is more relevant to real IDE-based workflows than synthetic tests
Self-verification reduces manual post-processing in agentic coding pipelines
Roughly 3.3x increase in image pixel count (1.15MP to 3.75MP) makes computer-use agents viable for fine-detail UI work
xhigh effort level gives precise control over reasoning depth vs. cost
Task budgets provide cost predictability for long agentic loops
67% price reduction from the original Opus 4 ($15/$75 to $5/$25 per million tokens)
1 million token context window at standard pricing

Cons:

New tokenizer increases token consumption by up to 35% on existing prompts, meaning effective cost may be higher than the headline price suggests
Rate limits are shared across all Opus versions (4.5, 4.6, 4.7), so high-volume users can hit ceilings faster
On Terminal-bench 2.0, Opus 4.7 (69.4%) trails GPT-5.5 (82.7%), making it a weaker choice for terminal-heavy agentic coding tasks
Context recall degrades in practice as prompts approach the 1M token limit
Stricter literal instruction-following can cause issues if existing prompts relied on the model inferring intent rather than following instructions exactly
Pro plan usage limits (approximately 45 messages per 5 hours) are restrictive for heavy daily use

Claude Opus 4.7 vs Alternatives

Claude Opus 4.7 vs GPT-5.5

According to DataCamp’s comparison, GPT-5.5 outperforms Opus 4.7 on Terminal-bench 2.0 (82.7% vs 69.4%) and is generally faster in agentic loops that require rapid tool calls. GPT-5.5 is also specifically optimized for self-correcting agentic coding: it checks its own work and continues until task completion with minimal user guidance. However, Opus 4.7 leads on SWE-bench Verified (87.6% vs approximately 82%) and SWE-bench Pro (64.3% vs 58.6%), and produces more careful, reviewable output. GPT-5.5 pricing sits around $15/$60 per million tokens, making Opus 4.7 meaningfully cheaper for high-volume API use. For teams prioritizing code correctness over raw speed, Opus 4.7 is typically the stronger choice.

Claude Opus 4.7 vs Gemini 3.1 Pro

Gemini 3.1 Pro’s key advantages are a 2 million token context window and significantly lower pricing ($1.25/$10 per million tokens), plus strong multimodal performance for image, PDF, and video inputs. On pure coding benchmarks, Opus 4.7 leads: 87.6% vs 80.6% on SWE-bench Verified, and 64.3% vs 54.2% on SWE-bench Pro, according to DataCamp’s head-to-head. For cost-sensitive applications or ones requiring extremely long context, Gemini 3.1 Pro is worth considering. For software engineering and agentic coding pipelines where output quality matters more, Opus 4.7 currently leads.
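
The price gap is easier to judge with a concrete workload. The sketch below applies the per-million-token prices quoted in this review to a hypothetical month of 100 million input and 20 million output tokens. It ignores tokenizer differences between vendors, caching, and batch discounts, so treat the output as a rough comparison only.

```python
from typing import Dict

# (input, output) USD per million tokens, as quoted in this review.
PRICES = {
    "Claude Opus 4.7": (5.00, 25.00),
    "GPT-5.5": (15.00, 60.00),        # the review says "around" this figure
    "Gemini 3.1 Pro": (1.25, 10.00),
}


def workload_cost(input_mtok: float, output_mtok: float) -> Dict[str, float]:
    """Cost in USD per model for a workload measured in millions of tokens."""
    return {
        model: input_mtok * in_price + output_mtok * out_price
        for model, (in_price, out_price) in PRICES.items()
    }


costs = workload_cost(input_mtok=100, output_mtok=20)
cheapest = min(costs, key=costs.get)
```

For this workload the totals come to $325 for Gemini 3.1 Pro, $1,000 for Opus 4.7, and $2,700 for GPT-5.5, which is the cost side of the quality-versus-price trade-off described above.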

Claude Opus 4.7 vs Claude Opus 4.5

Opus 4.5 scored 80.9% on SWE-bench Verified when it launched in November 2025, which was itself a record at the time. Opus 4.7 pushes that to 87.6%. The gap in CursorBench (70% vs approximately 60% on 4.5) is also material. Unless the new tokenizer’s higher token counts would strain your budget, or you have heavily optimized prompts around the Opus 4.5 tokenizer, Opus 4.7 is the better choice for new projects. The vision improvements alone justify the switch for any computer-use workflow.

Real Developer Feedback

Developer response to Opus 4.7 has been broadly positive, particularly on the vision and verification improvements. One developer running computer-use workflows reported to MarkTechPost that the visual acuity improvement from 54.5% to 98.5% on their internal benchmark effectively eliminated their biggest bottleneck with Opus 4.6. Enterprise teams using the model for agentic financial workflows and code migration have reported 10-15% task success rate lifts compared to Opus 4.6, with fewer tool errors and more consistent follow-through, based on partner feedback cited in Vantage Point’s enterprise review.

The main friction points raised in the developer community center on the tokenizer change. Teams that migrated from Opus 4.6 without adjusting their prompts found actual API costs higher than expected, even though the per-token rate stayed constant. The stricter literal instruction-following has also caught some teams off-guard: prompts that previously worked because the model inferred intent now require more explicit instructions. This is generally considered a positive behavioral change for production systems, but it requires prompt auditing before migration.

Who Is Claude Opus 4.7 Best For?

Software engineering teams and developer tools: If your workflow involves multi-file refactoring, code migration, or autonomous bug fixing, Opus 4.7’s SWE-bench scores and self-verification behavior translate directly to fewer review cycles and more reliable outputs. Claude Code with Opus 4.7 is the default configuration Anthropic now recommends for complex coding agents.

Enterprise teams with agentic pipelines: Long-running multi-tool workflows that previously required human check-ins at intermediate steps benefit from Opus 4.7’s task budgets, more frequent progress updates, and better memory management. Teams processing large legal contracts, technical specifications, or entire codebases in a single pass will find the 1M context window and improved recall useful.

Computer-use and UI automation builders: The roughly 3.3x increase in pixel count and the jump in visual acuity accuracy make Opus 4.7 the first Claude model that is reliably usable for fine-detail UI reading. Screenshot readers, form-filling agents, and desktop automation tools that were impractical with Opus 4.6 are now viable at production scale.

Who should probably wait or use an alternative: If your primary use case is terminal-heavy scripting and agentic CLI work, GPT-5.5 currently leads on Terminal-bench 2.0. If cost is the primary constraint and coding quality is secondary, Gemini 3.1 Pro’s $1.25/$10 pricing is hard to match. If you plan heavy daily use of Claude.ai on the Pro plan, the usage limits (approximately 45 Opus messages per 5-hour window) may frustrate you before the model’s capabilities become relevant.

Our Verdict

Claude Opus 4.7 is a meaningful step forward from Opus 4.6, not a cosmetic point release. The SWE-bench Verified score of 87.6%, the 70% CursorBench result, the vision resolution upgrade, and the self-verification behavior all address real pain points that developers reported with earlier versions. If you’re running coding agents, computer-use automation, or complex enterprise document workflows, the upgrade from 4.6 is worth making.

The main thing to do before migrating is audit your prompt token counts. The new tokenizer’s up-to-35% increase in token consumption can quietly raise your API bill even though the listed price is unchanged. Run your most common prompts through the tokenizer with the new model and adjust max_tokens settings and caching strategy accordingly. With that done, Opus 4.7 represents the best coding-focused Claude model Anthropic has released, and it holds its position against GPT-5.5 and Gemini 3.1 Pro on the benchmarks that matter most for software engineering.
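
One way to structure that audit is sketched below. It assumes you have already measured token counts for each prompt under both models by whatever token-counting method you use; the helper name audit_token_growth, the 20% threshold, and the sample counts are all hypothetical.

```python
from typing import Dict, List, Tuple


def audit_token_growth(
    old_counts: Dict[str, int],
    new_counts: Dict[str, int],
    threshold: float = 0.20,
) -> List[Tuple[str, float]]:
    """Return (prompt_name, growth) pairs whose growth exceeds threshold.

    growth is the fractional increase in tokens from the old model to the
    new one, rounded to four places. Results are sorted largest first.
    Prompts missing from new_counts are skipped.
    """
    flagged = []
    for name, old in old_counts.items():
        new = new_counts.get(name)
        if new is None or old <= 0:
            continue
        growth = round(new / old - 1.0, 4)
        if growth > threshold:
            flagged.append((name, growth))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Made-up measurements for three prompts.
opus_46 = {"system_prompt": 1000, "code_review": 400, "summarize": 800}
opus_47 = {"system_prompt": 1350, "code_review": 420, "summarize": 1000}

flagged = audit_token_growth(opus_46, opus_47)
```

Prompts that land on the flagged list are the first candidates for trimming, caching, or a higher max_tokens setting.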

Frequently Asked Questions

Is Claude Opus 4.7 actually available, or is this a future model?

Claude Opus 4.7 (model ID: claude-opus-4-7-20260416) was released on April 16, 2026. It is currently available via the Anthropic API and on Claude.ai for Pro, Max, Team, and Enterprise plan subscribers. You can access it directly through Anthropic’s Claude page.

What is Claude Opus 4.7’s SWE-bench score?

Claude Opus 4.7 scores 87.6% on SWE-bench Verified, up from 80.8% on Opus 4.6. On SWE-bench Pro, which uses harder real-world GitHub issues, it scores 64.3%. These are the highest publicly reported scores for a generally available model as of April 2026. Claude Mythos Preview (a research model, not generally available) scores 93.9%.

How much does Claude Opus 4.7 cost?

The API price is $5 per million input tokens and $25 per million output tokens, with up to 90% savings through prompt caching and 50% through batch processing. Claude.ai subscriptions with Opus 4.7 access start at $20/month (Pro plan). One important detail: a new tokenizer in Opus 4.7 may consume up to 35% more tokens for the same text compared to older Claude models, which can affect actual costs. Check the official pricing page for the latest rates.

How does Claude Opus 4.7 compare to GPT-5.5 for coding?

On SWE-bench Verified, Opus 4.7 (87.6%) leads GPT-5.5. On SWE-bench Pro real-world issues, Opus 4.7 scores 64.3% versus GPT-5.5’s 58.6%. GPT-5.5 outperforms on Terminal-bench 2.0 (82.7% vs 69.4%) and is generally faster for terminal-heavy agentic workflows. For code correctness and reviewability on complex software engineering tasks, Opus 4.7 is typically the stronger choice. Full comparison available at DataCamp.

What is the xhigh effort level in Claude Opus 4.7?

The xhigh effort parameter is a new reasoning depth setting that sits between the previous “high” and “max” levels. It provides deeper reasoning than “high” without the full token cost of “max.” Claude Code defaults to xhigh for all users. Developers can adjust the effort parameter via the API to trade response quality against speed and cost at different steps in an agentic pipeline.

Does Claude Opus 4.7 support computer use?

Yes. Opus 4.7 supports the computer use tool, which provides screenshot capabilities and mouse and keyboard control for autonomous desktop interaction. The vision resolution increase to 2,576px (3.75MP) significantly improves the model’s ability to read fine-detail UI elements, with one developer reporting visual acuity accuracy jumping from 54.5% on Opus 4.6 to 98.5% on Opus 4.7.

What context window does Claude Opus 4.7 support?

Claude Opus 4.7 supports a 1 million token context window with 128k maximum output tokens. The full 1M context is available at standard pricing without surcharges. That said, context recall can degrade in practice as prompts approach the upper limit, so production pipelines using very long contexts should test retrieval accuracy at their actual token counts before committing to full deployment.
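
A simple way to run that kind of test is a needle-in-a-haystack check: bury a known fact at different depths in a long prompt and see whether the model retrieves it. The sketch below is a harness only. The ask callable stands in for your own model call, word counts are used as a crude proxy for tokens, and stub_model is a fake that only sees the end of the prompt so the example can run offline.

```python
from typing import Callable, Dict, Sequence


def build_haystack(filler: str, needle: str, total_words: int, depth: float) -> str:
    """Repeat filler to total_words words and insert needle at a relative depth (0.0 to 1.0)."""
    filler_words = filler.split()
    words = [filler_words[i % len(filler_words)] for i in range(total_words)]
    position = int(depth * total_words)
    return " ".join(words[:position] + [needle] + words[position:])


def recall_by_depth(
    ask: Callable[[str], str],
    filler: str,
    needle: str,
    question: str,
    expected: str,
    total_words: int,
    depths: Sequence[float],
) -> Dict[float, bool]:
    """For each depth, report whether the model's answer contains the expected string."""
    results = {}
    for depth in depths:
        prompt = build_haystack(filler, needle, total_words, depth) + "\n\n" + question
        results[depth] = expected in ask(prompt)
    return results


# Offline stand-in for a model: it only "sees" the last 400 characters.
def stub_model(prompt: str) -> str:
    return "7421" if "7421" in prompt[-400:] else "unknown"


results = recall_by_depth(
    ask=stub_model,
    filler="lorem ipsum dolor sit amet",
    needle="The deploy code is 7421.",
    question="What is the deploy code?",
    expected="7421",
    total_words=1000,
    depths=(0.1, 0.5, 1.0),
)
```

With a real model behind ask, you would raise total_words toward your production prompt sizes and look for the depth and length at which recall starts to slip.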

Is Claude Opus 4.7 better than Claude Opus 4.5?

Yes, on every major benchmark. SWE-bench Verified: 87.6% vs 80.9%. CursorBench: 70% vs approximately 60%. Vision resolution: 3.75MP vs 1.15MP. Self-verification and xhigh effort are new features not present in Opus 4.5. The API price is the same ($5/$25 per million tokens). For new projects, Opus 4.7 is the better starting point. Teams with heavily optimized Opus 4.5 prompts should audit token counts before migrating, given the tokenizer change.

Can I use Claude Opus 4.7 for free?

No. The Claude.ai free tier does not include Opus 4.7. The lowest paid tier that includes it is Claude Pro at $20/month, which provides approximately 45 Opus messages per 5-hour window. API access requires an Anthropic account with billing enabled, at $5/$25 per million tokens.

What are the main weaknesses of Claude Opus 4.7?

The most significant practical weakness is the new tokenizer: it can increase token consumption by up to 35% on existing prompts, making effective costs higher than the unchanged headline price suggests. On Terminal-bench 2.0, Opus 4.7 scores 69.4% versus GPT-5.5’s 82.7%, making it a weaker option for terminal-heavy CLI automation. Rate limits are shared across all Opus versions, which can create bottlenecks at high usage volumes. Stricter literal instruction-following is generally positive, but requires prompt auditing during migration from earlier Claude versions.