Why testing matters more in the era of AI coding: 6 tools for automated testing in 2026

📅 2026-05-18 11:32:47 👤 DouWen Editorial

Since AI made writing code faster, testing has become the new bottleneck for teams. Code output has accelerated significantly, but the bug rate has not fallen with it, and manual testing increasingly cannot keep pace. Getting solid testing in place will be a real headache for many teams in 2026. This article does not cite the precise figures reported by individual companies. It explains why testing is more critical in the AI era, then introduces several automated testing tools commonly used by engineering teams today and how to combine them.

Why testing is more critical in the era of AI coding

Several reasons have combined to push testing to a higher priority.

First, the quality of AI-generated code varies more widely. The generated code is syntactically clean, but it hides more logic traps than human-written code, and the chance of a regression after merging to trunk cannot be ignored. Teams can no longer rely on the assumption that "developers are careful"; they need automated guardrails in CI.

Second, code is produced faster. A team of the same size merges noticeably more PRs per day than it did a few years ago, and manual review is approaching its ceiling. Automated testing is one of the few steps in the pipeline that can still keep up.

Third, the technology stack itself is getting more complex. Microservices, serverless, edge computing, and AI components are mixed together, so the number of layers and combinations that integration tests must cover has grown to the point where testing by hand is no longer feasible.

A deeper change is that the testing paradigm is shifting from "finding bugs" to "preventing bugs". Every new feature is expected to ship with tests by default, and in many teams a feature without tests can no longer pass review. This is one of the most important shifts in engineering culture of the past few years.

Testing is no longer just QA's job

In 2026, engineering job descriptions at big tech companies almost always list the ability to write tests as a basic requirement.

Change one: developers own unit tests. AI can help write them, but engineers are responsible for their correctness, and a PR without unit tests does not pass review.

Change two: the QA role is moving up. Traditional script-following manual QA is shrinking. QA is now responsible for designing test strategy, maintaining test infrastructure, running chaos engineering, and owning performance and stress testing. The bar is much higher than it used to be.

Change three: the SDET (Software Development Engineer in Test) has gone from a rare position a few years ago to a standard role at big tech companies. An SDET is essentially a QA engineer who can code, and can independently own test frameworks, automation platforms, and test data management.

Change four: product managers are pulled into test design. The spec for a new feature should come with test cases rather than having them added after the fact. It looks like a small thing, but it cuts a lot of the rework that happens when "the requirements only became clear after the implementation was done".

In headcount terms, the dev-to-QA ratio has indeed become more lopsided, but each developer spends more time on testing, so the team's total investment in testing has not decreased.

Tool 1: Playwright, the all-rounder backed by Microsoft

Playwright is an end-to-end testing framework open sourced by Microsoft in 2020. It has grown rapidly in the past few years and is currently one of the preferred e2e tools for new projects.

Its highlights are cross-browser support, automatic waiting, SDKs in multiple languages, built-in Codegen, and Trace Viewer. One test suite can run against Chromium, Firefox, and WebKit. Actions wait automatically for elements to be ready, so hand-written waits are rarely needed. TypeScript, JavaScript, Python, Java, and C# all have official SDKs. Codegen opens a browser, records what you do by hand, and generates the corresponding test code, which is especially friendly to teams that are just getting started. Trace Viewer replays a failed run, stringing together the screenshots, network requests, and console logs of each step, which makes debugging much easier.
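As a rough illustration of the cross-browser and Trace Viewer points, a minimal playwright.config.ts might look like the sketch below. The testDir path and the retry count are assumptions made for this example, not recommendations from any particular project.

```typescript
// playwright.config.ts (illustrative sketch; paths and retry count are assumptions)
import { defineConfig, devices } from '@playwright/test';

export default defineConfig({
  // Assumed location of the test files.
  testDir: './tests',
  // Retry a failed test once so that a trace can be recorded on the retry.
  retries: 1,
  use: {
    // Record a trace only when a test is retried; open it later in Trace Viewer.
    trace: 'on-first-retry',
  },
  // The same tests run once per project, i.e. once per browser engine.
  projects: [
    { name: 'chromium', use: { ...devices['Desktop Chrome'] } },
    { name: 'firefox', use: { ...devices['Desktop Firefox'] } },
    { name: 'webkit', use: { ...devices['Desktop Safari'] } },
  ],
});
```

With a config along these lines, one test file is executed against all three engines without any change to the test code itself.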

It suits e2e testing for new projects and large web applications. The community is active, the documentation is thorough, and engineers who know it are comparatively easy to hire. It is fully open source and free, with no SaaS lock-in.

The drawback is that it does not support native mobile apps; those need a tool such as Appium.

Tool 2: Vitest, a new-generation unit testing framework

Vitest is a unit testing framework produced by the Vite team, and has become the most common choice besides Jest in new front-end projects.

Its speed advantage comes mainly from building on Vite's ESM-based module loading, which makes startup noticeably faster than Jest. The API is largely Jest-compatible: migrating is mostly a matter of importing the test functions from vitest and swapping jest.* helpers for their vi.* equivalents, although some configuration usually needs porting too. Mocking, coverage, and snapshots are built in or officially supported, so there is no need to assemble a pile of plugins. Watch mode works out which tests are related to a changed file and reruns only those. It also supports writing tests directly inside the source file, which is handy for small projects.

It suits new web projects and works with any front-end framework, including Vue, React, Svelte, and Solid. It is fully open source and free.

The drawback is that Vitest is primarily oriented toward the Node.js environment; running tests in a real browser requires additional configuration.

Tool 3: testRigor, a SaaS platform for codeless testing

testRigor is a relatively young SaaS testing platform that focuses on writing tests in natural language.

A tester or product manager describes the test case in plain English, for example "Open the login page, enter valid credentials, verify the redirect to the dashboard, check that the user name is displayed in the upper right corner", and the platform parses the description and translates it into low-level commands to execute. Its visual assertions can use AI to compare screenshots, and it covers web, native mobile, API, and desktop. Its self-healing mechanism uses heuristics to relocate elements after a UI change, which reduces the maintenance burden.

Pricing is aimed at enterprises and is on the high side; see the price list on the official website for details. It is not a tool for individual developers.

It suits companies that have a large QA team, product managers willing to take on part of the test writing, and the budget to pay platform fees.

The drawback is that it is quick to pick up but leaves limited room for customization. Complex scripted scenarios still require engineers to write code.

Tool 4: Cypress, still a mainstay, but growth is slowing

Cypress is a well-known e2e framework from an earlier generation and was long the default option in the React and Vue ecosystems. Playwright has been closing in on it over the past two years, but it remains one of the main tools.

Its defining trait is that tests run inside the browser: test code and application code share the same browser context, which makes debugging very direct. Time travel lets you step back through the DOM snapshot of each command, and the Test Runner GUI is friendly to newcomers. Component Testing also gives it a good experience for testing React and Vue components in isolation.

It suits teams that have already invested in Cypress and want one tool for both e2e and component testing.

The drawback is that the in-browser design is not flexible enough for cross-origin and multi-tab scenarios, where Playwright is usually more convenient.

Cypress comes in two tiers, the open-source app and the paid cloud service. See the official website for current pricing.

Tool 5: k6, a new benchmark for performance testing

k6 is an open-source load testing tool from Grafana Labs and is currently one of the de facto standards in performance testing.

Test scripts are written in JavaScript, which is friendly to front-end engineers. A single load generator can typically simulate far more concurrent virtual users than a traditional JMeter setup. Results can be fed straight into Grafana dashboards, so visualization and alerting follow naturally. The Cloud version supports multi-region distributed load tests and long-running tests. With the Browser module, k6 can also drive user journeys in a real browser, not just load at the HTTP request level.

It suits any project that needs performance testing, from API load tests to full-stack user scenarios. The open-source version is completely free, and the Cloud version adds hosting; see the official website for pricing.

The drawback is that the GUI is weak; people used to JMeter's graphical interface need to adapt to writing tests as code.

Tool 6: Stryker, mutation testing that exposes test blind spots

Stryker is a representative mutation testing framework. It suits projects that already have high coverage but still ship bugs.

It works by automatically modifying your source code, for example changing a > b to a < b, and then running your tests against each modified version (a "mutant"). If the tests still pass, the mutant has survived: the code is executed by the tests, but the coverage is "fake coverage" because no test actually verifies the business rule. This goes one level deeper than reading a coverage report. 100% coverage does not mean the tests are good, and Stryker exposes these blind spots directly.
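The idea can be sketched by hand in plain TypeScript, without Stryker itself. The free-shipping rule and all function names below are hypothetical, and the "mutant" is written manually to mimic the a > b to a < b flip that a mutation tool would generate automatically.

```typescript
// Hypothetical function under test: orders strictly above the threshold ship free.
function qualifiesForFreeShipping(total: number, threshold: number): boolean {
  return total > threshold;
}

// Hand-written mutant: the comparison operator is flipped from > to <.
function mutantQualifies(total: number, threshold: number): boolean {
  return total < threshold;
}

type ShippingRule = (total: number, threshold: number) => boolean;

// Weak test: it executes the code (so the line counts as covered)
// but only checks that some boolean comes back.
function weakTest(rule: ShippingRule): boolean {
  return typeof rule(120, 100) === "boolean";
}

// Strong test: it pins down the business rule above, below, and at the threshold.
function strongTest(rule: ShippingRule): boolean {
  return (
    rule(120, 100) === true &&
    rule(80, 100) === false &&
    rule(100, 100) === false
  );
}

// A mutant "survives" when the test still passes against the mutated code.
console.log("weak test passes on original:", weakTest(qualifiesForFreeShipping));
console.log("weak test passes on mutant (survives):", weakTest(mutantQualifies));
console.log("strong test passes on original:", strongTest(qualifiesForFreeShipping));
console.log("strong test passes on mutant:", strongTest(mutantQualifies));
```

Both tests give the same line coverage, but only the strong one fails when the operator is flipped, which is the difference a mutation report makes visible.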

The Stryker family covers JavaScript and TypeScript (StrykerJS), C# (Stryker.NET), and Scala (Stryker4s). Other ecosystems have their own tools: Java usually uses PIT, Python uses mutmut, and PHP uses Infection.

It is suitable for projects that already have high line coverage but still have frequent regression bugs. The mutation test report can tell you which tests are for decoration and which ones are really protecting the code.

The drawback is the long running time, often more than ten times that of the unit test suite, which makes it unsuitable for every commit. A more common approach is to run it once a week or before each release.

Suggestions for using the six tools in combination

Every team's stack is different, so here are a few relatively safe combinations.

For web front-end projects, the usual combination is Vitest for unit tests, Playwright for e2e, and k6 for performance and load testing. Older projects with an existing Cypress investment can keep maintaining the current Cypress suite and use Playwright for new modules.

For back-end API projects, you can use Vitest or Jest for unit and integration tests, Postman/Newman for API automation, and k6 for load testing.

For cross-platform mobile apps, a common combination is Detox for native e2e, Playwright with mobile device emulation for the mobile web, Vitest for unit tests, and k6 for scenarios that exercise the back end.

In a QA-led team, product managers or testers can write business scenarios in testRigor, engineers use Playwright as a backstop, and k6 handles performance verification.

To harden a mature project further, add Stryker on top of the existing tests, locate the surviving mutants on the critical path, and write targeted tests for them.

In terms of weekly time allocation, a relatively healthy rhythm is that developers spend about 70% of their time writing functional code, 20% of their time writing tests, and 10% of their time reviewing and optimizing the tests themselves.

Fine-tuning the test pyramid in the AI era

The classic test pyramid puts unit tests first, integration tests second, and e2e tests last. That principle still holds in the AI era, but there is room to fine-tune the proportions.

Unit tests still make up the largest share, but the quality bar is higher. It is easy for AI to write unit tests, but a reviewer must check them, otherwise you end up with a pile of "decorative" tests.

The proportion of integration tests has increased slightly. There are more and more microservices and external services, and the value of integration testing across service boundaries is increasing.

The share of e2e tests has also increased slightly. The current generation of tools, Playwright among them, is much more stable than tools were a few years ago, and the flaky-test nightmares of the past are less severe.

Visual regression testing is emerging more and more as a new category. AI changes UI faster than manual work, and visual changes occur frequently. Adding a layer of visual regression is more reliable than relying on human eyes.

The total time invested does not shrink because of AI. Each test is faster to write, but there are more scenarios to cover, so the overall investment in testing has risen slightly rather than fallen.

How to get AI to write good tests

A few practical lessons are worth building into the review process.

First, write a clear spec before asking AI to write tests. List the inputs, outputs, edge cases, and error handling explicitly. AI is not a mind reader.

Second, have AI generate 5 to 10 different input cases in one go, then have a human review them once to delete duplicates and false positives. This is more efficient than having AI squeeze out cases one at a time.

Third, a few things must be checked in review: whether the assertions really cover the business rules, whether edge cases such as null and empty arrays are included, whether the right dependencies are mocked, and whether each test name clearly describes its intent.
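To make the review checklist concrete, here is a small framework-free sketch of what a reviewed case table could look like. The averageOrderValue function and its "return 0 when there are no orders" rule are hypothetical, chosen only to show null and empty-array cases next to the normal ones, with names that state the intent.

```typescript
// Hypothetical function under review: mean order value, defined as 0 when there are no orders.
function averageOrderValue(orders: number[] | null): number {
  if (orders === null || orders.length === 0) return 0;
  const sum = orders.reduce((acc, value) => acc + value, 0);
  return sum / orders.length;
}

interface TestCase {
  name: string; // should describe the intent, not the implementation
  input: number[] | null;
  expected: number; // a specific value, not just "not null"
}

const cases: TestCase[] = [
  { name: "returns the mean for several orders", input: [10, 20, 30], expected: 20 },
  { name: "returns the value itself for a single order", input: [42], expected: 42 },
  { name: "returns 0 for an empty array", input: [], expected: 0 },
  { name: "returns 0 for null input", input: null, expected: 0 },
];

// Runs every case and collects a readable message for each failure.
function runCases(): string[] {
  const failures: string[] = [];
  for (const c of cases) {
    const actual = averageOrderValue(c.input);
    if (actual !== c.expected) {
      failures.push(`${c.name}: expected ${c.expected}, got ${actual}`);
    }
  }
  return failures;
}

console.log(runCases().length === 0 ? "all cases pass" : runCases().join("\n"));
```

In a real project the same table would normally be fed to the test framework's parameterized test helper; the plain loop is used here only to keep the sketch self-contained.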

One anti-pattern to avoid: letting AI generate a large number of tests that reach "95% coverage" while each test only asserts that nothing throws. Such pseudo-tests are exposed the moment Stryker runs, and the team's trust in coverage numbers collapses.

FAQ

My project has almost no tests. Is it too late to start adding them?

It is not too late, but do it in stages. First add unit tests to the most critical core business modules, such as payment, login, and order processing; this stage alone blocks most serious incidents. Then add e2e tests for the key user journeys, and only after that turn to utility and helper functions. Do not chase a high coverage number right away. Start by using 20% of the tests to cover 80% of the critical path. Most projects can build a sustainable testing setup within a few months.

Playwright or Cypress: which should I choose?

New projects usually choose Playwright: its cross-browser, multi-tab, and cross-origin support is better than Cypress's, and its SDK is available in multiple languages. Projects that have already invested in Cypress, with a team that knows it well, can keep using it without any problem. Cypress is still actively updated, and migrating may not be worth the cost.

Is a SaaS tool like testRigor worth a few hundred dollars per month?

It depends on the team structure. If the QA team is large, product managers are willing to write tests, and the organization wants QA to depend less on pure coding, the cost can be spread out. For small teams or single-developer projects it is usually not worth it, and open-source tools will suffice.

Mutation testing with Stryker is so slow. Is it worth running?

Yes, but you do not need to run it every day. Run the full suite once a week and again before each release. After reviewing the results, pick the surviving mutants on the critical path and add tests for them. Do not try to eliminate every mutant. A relatively high kill rate is already an excellent result, and polishing beyond that brings diminishing returns.

Can AI-generated tests be merged directly?

No. AI-generated tests have three common problems. First, assertions are too loose, checking only for not-null rather than the specific value. Second, the mock data differs substantially from the real production environment. Third, they cover only the happy path and miss edge cases. Every AI-generated test must go through human review, and it is best to actually run a negative case, deliberately breaking the code, to see whether the test catches a typical bug.
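The negative-case check can be sketched as follows. The discount function, the seeded bug, and the catchesBug helper are all hypothetical and exist only to show how a not-null assertion differs from a specific-value assertion when the code is deliberately broken.

```typescript
type Discount = (price: number, percent: number) => number;

// Hypothetical correct implementation: 25% off 200 is 150.
const applyDiscount: Discount = (price, percent) => price * (1 - percent / 100);

// Deliberately seeded bug: the discount is added instead of subtracted.
const buggyApplyDiscount: Discount = (price, percent) => price * (1 + percent / 100);

// Loose assertion of the kind described above: only checks for not-null.
const looseTest = (impl: Discount): boolean => impl(200, 25) !== null;

// Specific assertion: checks the exact expected value.
const specificTest = (impl: Discount): boolean => impl(200, 25) === 150;

// A test is useful here if it passes on the good code and fails on the broken code.
function catchesBug(
  test: (impl: Discount) => boolean,
  good: Discount,
  bad: Discount
): boolean {
  return test(good) && !test(bad);
}

console.log("loose test catches the bug:", catchesBug(looseTest, applyDiscount, buggyApplyDiscount));
console.log("specific test catches the bug:", catchesBug(specificTest, applyDiscount, buggyApplyDiscount));
```

A reviewer can apply the same check by hand: break one line in the implementation, rerun the AI-generated test, and reject it if it stays green.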

Source of inspiration: Issue 388 of Ruan Yifeng's "Technology Enthusiasts Weekly" https://www.ruanyifeng.com/blog/2025/08/weekly-issue-388.html

📝 This article is from DouWen (www.douwen.me). Please credit the source when reposting.
