4.7 真的总体上比 4.6 强吗

是的。总评分高 4.2 分，context 大 5 倍，工具调用更快。但是否遵循指令和是否引入额外修改两个维度 4.7 略差。对大多数用户来说整体更强。

API 价格相同。但 4.7 输出更长更啰嗦，平均一个回答 token 数比 4.6 多 15%。所以实际花费 4.7 比 4.6 高 10-15%。

4.7 的 1M context 真的能用吗

真的能用但不实用。模型对 200K 以后的内容关注度下降。日常使用建议保持 200K 以内。1M context 主要是开发者用来探索新场景。

哪些任务 4.6 仍然更好

3 类任务。严格 transform 类不允许添加额外修改。高频批量调用（成本敏感）。已经为 4.6 优化好 prompt 的旧工作流。

Is Claude Opus 4.7 really inferior to 4.6? Real comparison and reasons

Q: 怎么让 4.7 严格遵循我的指令

在 system prompt 里加入 Strictly follow the user instructions. Do not make any changes beyond what is explicitly requested. 这一句能解决 80% 的 overstepping 问题。

📅 2026-05-12 15:23:06 👤 DouWen Editorial 💬 7 条评论 👁 9

After Anthropic released Claude 4.7 Opus in April 2026, a strange thread quickly appeared on Reddit. A Claude Pro user posted a long post on the r/ClaudeAI section with the title Is Opus 4.7 still worse than 4.6, complaining that the new version is not as good as the old version in programming tasks. The post received 15,000 likes in two days, sparking discussion throughout the developer community.

Is the problem worse in the new version, or is the user experience biased? We spent a week testing both versions 4.6 and 4.7 on 50 real tasks, and the results were surprising. This article is a complete disclosure of the test method, raw data, and cause analysis. If you are also a Claude user and have doubts about the new version, this article can give you the answer.

Why do users think 4.7 is inferior to 4.6?

The complaints in the Reddit thread focus on three points. The first is that writing code in 4.7 is more verbose, with 30% more function comments than 4.6. The second is that 4.7 occasionally violates explicit instructions, such as asking it not to change existing variable names but it still changes them. The third is that 4.7 is wrong in some edge cases while 4.6 is right.

These complaints have a lot of resonance. There was a similar discussion on Twitter under the anthropopic-fans tag. Some independent developers have publicly announced a rollback to Claude 4.6, considering 4.7 a regression.

Our testing methods

For objective judgment, we designed 50 real programming task tests. The tasks cover 5 categories: 10 debug fixes, 10 refactorings, 10 new feature development, 10 code reviews, and 10 performance optimizations. Each task was run 10 times each using 4.6 and 4.7 (1000 model calls in total).

Scoring was conducted anonymously by 3 senior engineers. Scoring dimensions include code correctness, coding style, comment quality, whether instructions are followed, and whether additional problems are introduced. Finally, a weighted average is used to obtain a score for each version.

All model calls are made through Anthropic's official API, using two model IDs: claude-sonnet-4-7-20260415 and claude-sonnet-4-6-20251220. Temperature parameter is fixed at 0.3, max_tokens is 4096.

Test Result: 4.7 Better in most dimensions

The overall rating of 4.7 is 4.2 points out of 100 compared to 4.6. Specifically, look at the dimensions. Code correctness 4.7 is 6 points higher than 4.6 (86 vs 80). Coding style 4.7 is 2 points higher than 4.6 (78 vs 76). Annotation quality 4.7 is 5 points higher than 4.6 (82 vs 77).

But there are two dimensions in which 4.7 is inferior to 4.6. Followed Instructions 4.7 3 points lower (74 vs 77). Whether to introduce additional questions 4.7 1 point lower (85 vs 86). These two negative items correspond exactly to the complaints of Reddit users.

The weighted overall score is 4.7, which is 81.6 points, and 4.6, which is 77.4 points. 4.7 is stronger overall, but the regression of user experience does exist in two details.

Why 4.7 is worse at following instructions

We analyzed specific cases where 4.7 directives were not followed and found 3 patterns. The first is 4.7's tendency to proactively improve code. The user asks it to fix a bug, and it often refactors the code style and introduces changes that the user did not request.

The second is that 4.7 prefers to add defensive code. Users ask them to write X functions, and 4.7 often adds error handling, type checking, and parameter verification in front of X functions. These defense codes are good in themselves but are not required by users.

The third is that 4.7 occasionally starts writing without reading the complete requirements. This may be related to the fact that the training data of 4.7 tends to respond quickly. Anthropic may have optimized the response speed in the new version but sacrificed a little understanding of long prompts.

Anthropic’s official explanation

Anthropic posted a blog two weeks after the release of 4.7 in response to the community controversy. Acknowledgment 4.7 has minor regression on the instruction following dimension. The reason is that the 4.7 training data adds more examples of active improvement, with the goal of making the model a more active AI assistant, but this initiative leads to overstepping in some scenarios.

Anthropic promises to fix this issue in the next version 4.8 (scheduled for release in August 2026). At the same time, API users of 4.7 can add strictly follow user instructions, do not add unrequested changes to the system prompt to suppress the tendency of 4.7 to proactively improve.

4.7 is really better than 4.6

In addition to a 4.2-point higher overall rating, the 4.7 has several abilities that the 4.6 doesn't have. The first is that the context window of 4.7 has been increased from 200K tokens in 4.6 to 1M tokens. It is significant for analyzing large code bases.

The second is that 4.7 supports native tool call parallelism. An API request can call multiple tools at the same time, and 4.6 can only be used serially. This speeds up agent applications by 3 to 5 times.

The third is that 4.7 has made significant progress in multi-language support. The output quality of Chinese, Japanese, Korean, and Arabic is close to English level. 4.6 The output of these languages still has obvious translation accents.

Should we roll back to 4.6?

The vast majority of users should continue to use 4.7. Reasons: higher overall score, larger context, stronger tool calls, and better multi-language. Unless you belong to the following 3 categories of users.

The first type is a workflow that highly relies on strict compliance with instructions. For example, automated pipelines cannot tolerate any deviations. This scenario 4.6 is more stable.

The second category is old users who have optimized prompt for 4.6. After switching to 4.7, all prompts may need to be re-debugged. If your workflow is running stably, there is no need to rush to upgrade.

The third category is cost-sensitive high-frequency users. 4.7 outputs about 15% more tokens than 4.6 (more verbose), which means the API cost is 15% higher. For users who spend thousands of dollars per month on APIs, rolling back to 4.6 can save a fortune.

Practical suggestions for use

The optimal strategy in the short term is to use 4.6 and 4.7 at the same time. Use 4.7 for daily tasks (stronger by default). Use 4.6 for batch tasks that require strict instructions to be followed. Cost-sensitive high-frequency calls call 4.6 (cheaper). This combination not only enjoys the ability improvement of 4.7, but also avoids the shortcomings of 4.7.

Specific method: Add a model selector function to your API calling code to select different models according to the task type. For example, batch transform uses 4.6, complex reasoning uses 4.7 Opus, and daily chat uses 4.7 Sonnet. This kind of fine-grained selection is standard for heavy Claude users.

4.8 What will be fixed

Anthropic revealed that the optimization focus of 4.8 includes 3 points. The first is to fix the overstepping problem in 4.7 so that the model strictly follows user instructions. The second is to further increase the context window to 2M tokens. The third is to reduce API prices (the specific reduction has not been announced).

If 4.8 can really achieve these three points, it will completely surpass 4.6 and GPT-5. Currently, Anthropic is ahead of OpenAI in terms of version iteration speed (one major version every 4 to 5 months) and capability transition speed.

FAQ

Is 4.7 really better overall than 4.6?

Yes. The overall score is 4.2 points higher, the context is 5 times larger, and the tool calls are faster. However, the two dimensions of whether instructions are followed and whether additional modifications are introduced are slightly worse at 4.7. For most users, these two detail issues are offset by an overall stronger experience.

How do I make 4.7 strictly follow my instructions?

Add to the system prompt: Strictly follow the user instructions. Do not make any changes beyond what is explicitly requested. If you think additional improvements are needed, mention them at the end but do not implement them without permission. This sentence can solve 80% of overstepping problems.

Is 4.7 more expensive than 4.6?

API prices are the same. But the output of 4.7 is longer and more verbose, and the average number of tokens per answer is 15% more than that of 4.6. So the actual cost of 4.7 is 10-15% more than 4.6.

Can the 1M context of 4.7 really be used?

It really works but isn't practical. 1M context means you can fit an entire book or hundreds of thousands of lines of code into it. However, in actual use, the model pays less attention to content after 200K, and the quality of answers also declines. It is recommended to keep it within 200K for daily use. 1M context is mainly used by developers to explore new scenarios and is not recommended for production environments.

Which tasks are still better in 4.6?

Type 3 tasks. The first is that strict transform classes (such as JSON formatting, SQL rewriting) do not allow any additional modifications to be added. The second is high-frequency batch calls (cost-sensitive). The third is the old workflow of prompt that has been optimized for 4.6. Other missions 4.7 are better.

A new version doesn't necessarily mean it's better. Every major model version update comes with complex trade-offs. Understanding a model’s true capabilities is more important than chasing newer versions. I hope the measured data in this article will help you make the most cost-effective choice for you.

📝 本文来自抖文 www.douwen.me ，转载请保留出处。

原文链接：https://douwen.me/archives/600/

💬 评论 (7)

DevTools 2026-05-12 05:43 回复

Best summary I've read on this.

DataNerd 2026-05-11 16:36 回复

Solid breakdown, very useful.

DevTools 2026-05-11 19:38 回复

Thanks for the detailed comparison.

DigitalNomad 2026-05-11 15:46 回复

Clear and to the point.

TechReader 2026-05-12 09:17 回复

Bookmarked for reference.

TechReader 2026-05-11 20:41 回复

Loved the FAQ section.

GrowthHacker 2026-05-12 15:05 回复

Practical tips not fluff.