QA Engineering | 6 minutes

Why Broken QA Systems Become Worse in AI-Native Teams

Adding AI quality discipline on top of a broken QA foundation makes the foundation problem worse, not better.

GatekeeperOps

The Hidden Precondition

Most AI-native engineering teams trying to add AI quality discipline run into a problem they did not expect. The problem is not the AI evals. The problem is the existing QA system underneath.

The existing system is often broken. Not in the obvious way where nothing works. Broken in the subtle way where parts work, parts do not, and the team has stopped trusting the parts that should work.

Adding AI quality testing on top of this foundation does not solve the foundation problem. It compounds it. The team adds a new layer of testing that gets treated the same way the existing tests are treated: with varying degrees of skepticism and inconsistent enforcement.

The Shape of a Broken QA System

Specific symptoms repeat across teams.

Automation suites contain hundreds of tests, many of which fail intermittently for reasons unrelated to the code under test. Engineers do not investigate the failures because they have learned the failures are usually not real. When a real failure happens, it gets dismissed alongside the flaky failures. Real signal is buried in noise.

CI pipelines take longer than they should. Some pipelines take long enough that engineers start finding ways to bypass them. Direct merges to main. Skipping CI on small changes. Rerunning the build until it passes. Each bypass is individually rational. Collectively, they erode the gate.
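The "rerun until it passes" bypass can be made visible by counting how often a commit goes green only after a retry. The following is a minimal sketch, assuming a hypothetical export of CI build attempts as (commit, attempt number, passed) records. The `BuildAttempt` shape and the `rerun_to_green_rate` name are illustrative and not taken from any particular CI system.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class BuildAttempt(NamedTuple):
    """One CI build attempt (hypothetical record shape, for illustration)."""
    commit: str
    attempt: int   # 1 for the first build, 2 for the first rerun, and so on
    passed: bool


def rerun_to_green_rate(attempts: Iterable[BuildAttempt]) -> float:
    """Share of commits whose first build failed but a later rerun passed."""
    by_commit = defaultdict(list)
    for a in attempts:
        by_commit[a.commit].append(a)

    if not by_commit:
        return 0.0

    rerun_green = 0
    for builds in by_commit.values():
        builds.sort(key=lambda b: b.attempt)
        first_failed = not builds[0].passed
        later_passed = any(b.passed for b in builds[1:])
        if first_failed and later_passed:
            rerun_green += 1
    return rerun_green / len(by_commit)
```

A rising rate suggests that green builds are increasingly the product of retries rather than of working code, which is one measurable form of gate erosion.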

Coverage reports show eighty percent coverage but mean very little. The tests that exist are uneven in quality. Critical paths have shallow coverage. Edge cases have no coverage. The percentage is misleading because the underlying tests do not reflect risk.

Tests that should be deleted remain in the codebase. Tests that catch important issues are mixed with tests that catch nothing useful. The signal-to-noise ratio is poor and nobody has the time to clean it up.

The team has informal workarounds. Senior engineers know which test failures to ignore. Junior engineers do not. The institutional knowledge of “what the CI failures actually mean” is in heads, not in documentation. When senior engineers leave, the knowledge leaves with them.

Why AI-QA Makes This Worse

In a healthy QA culture, adding AI quality testing extends an existing discipline. The team is used to taking test failures seriously, so AI eval failures get taken seriously too. The CI integration patterns already work, so adding new evals into CI is straightforward. The team trusts the gates, so the new gates get trust too.

In a broken QA culture, adding AI quality testing extends an existing dysfunction. The team is used to ignoring test failures, so AI eval failures get ignored too. The CI integration patterns are already brittle, so adding new evals into CI breaks things further. The team does not trust the gates, so new gates get no trust either.

The same eval suite, with the same code, with the same thresholds, will produce dramatically different outcomes in these two cultures. The difference is not the AI-QA work. The difference is the foundation it sits on.

The Worst Case: AI-QA as Noise

One specific failure mode is worth naming. A team with a broken QA system adds AI evals. The evals run in CI alongside the existing flaky automation tests. When the AI evals fail, the team treats them the way they treat the existing failures: with cautious dismissal.

“The eval failed again. Probably nothing. Let us just rerun the build.”

The hallucination detection that was supposed to catch a real regression gets dismissed alongside the flaky UI tests. The prompt injection probe that surfaced a real vulnerability gets ignored alongside the timing-related test failures. The RAG quality alert that indicated retrieval drift gets muted alongside the false-positive integration tests.

The team has built AI quality testing infrastructure and made it useless by deploying it into a culture that does not act on test failures.

This is worse than not having the AI-QA tests. Not having them means the team knows they are flying blind. Having them and ignoring them means the team thinks they have a safety net that does not actually work.

Fix the Foundation Before Adding Floors

The implication is straightforward. If the underlying QA system is broken, fix it first.

Fixing the foundation means specific things. Eliminate the flaky tests. Either fix them so they pass reliably, or delete them. Restore the principle that CI failures are real signal worth investigating.
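Eliminating flaky tests starts with identifying them. One common working definition is a test that both passes and fails on the same commit, since the code did not change between runs. The sketch below applies that definition to a hypothetical list of test run records. The `RunRecord` shape and function name are assumptions for illustration, not a specific tool's API.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple, Set


class RunRecord(NamedTuple):
    """One execution of one test (hypothetical record shape)."""
    test_id: str
    commit: str
    passed: bool


def find_flaky_tests(runs: Iterable[RunRecord]) -> Set[str]:
    """Return test ids that both passed and failed on at least one commit."""
    outcomes = defaultdict(set)
    for run in runs:
        # Collect the distinct outcomes seen for each (test, commit) pair.
        outcomes[(run.test_id, run.commit)].add(run.passed)
    # Two distinct outcomes on the same commit means the result was not
    # determined by the code alone.
    return {
        test_id
        for (test_id, _commit), seen in outcomes.items()
        if len(seen) == 2
    }
```

A test that fails on one commit and passes on a later one is not flagged, because that pattern is consistent with a real regression that was then fixed.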

Consolidate the test suite so it reflects actual risk priorities. Remove the dead tests that nobody runs and nobody trusts.

Repair the CI infrastructure. If pipelines take too long, find out why and fix it. If pipelines fail for reasons unrelated to the code change, fix those reasons. The principle that “if CI is red, something is wrong” must hold.

Rationalize the coverage measurement. Eighty percent line coverage means nothing if the lines that are covered are trivial and the lines that are not covered are critical. Coverage measurement should target the actual risk surface of the application, not produce a number that looks good on a dashboard.
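To illustrate how a headline percentage can hide risk, the sketch below compares plain line coverage with a risk-weighted variant. The module names, line counts, and risk weights are invented for the example, and the weighting scheme is one possible assumption rather than an established standard.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ModuleCoverage:
    name: str
    lines_total: int
    lines_covered: int
    risk_weight: float  # hypothetical: higher means failure is more costly


def line_coverage(modules: Sequence[ModuleCoverage]) -> float:
    """Plain line coverage: covered lines divided by total lines."""
    total = sum(m.lines_total for m in modules)
    covered = sum(m.lines_covered for m in modules)
    return covered / total


def risk_weighted_coverage(modules: Sequence[ModuleCoverage]) -> float:
    """Coverage where each line counts in proportion to its module's risk."""
    total = sum(m.lines_total * m.risk_weight for m in modules)
    covered = sum(m.lines_covered * m.risk_weight for m in modules)
    return covered / total


# Invented example data: a well-tested utility module and a thinly
# tested payments module that matters ten times as much.
modules = [
    ModuleCoverage("string_utils", 800, 760, risk_weight=1.0),
    ModuleCoverage("payments", 200, 40, risk_weight=10.0),
]

print(f"line coverage:          {line_coverage(modules):.0%}")
print(f"risk-weighted coverage: {risk_weighted_coverage(modules):.0%}")
```

With these invented numbers the dashboard figure is 80 percent, while the risk-weighted figure is roughly 41 percent, because most of the uncovered lines sit in the high-risk module.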

Document the test ownership: who owns which tests, who fixes which failures, and who decides when to delete which tests. Informal knowledge is fragile. Documented ownership is durable.
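Documented ownership can be as simple as a mapping kept in the repository and checked automatically. The sketch below assumes a hypothetical mapping from test path prefixes to owning teams. The paths and team names are invented, and the longest matching prefix is treated as the owner.

```python
from typing import Dict, Iterable, List, Optional

# Hypothetical ownership map: test path prefix -> owning team.
TEST_OWNERS: Dict[str, str] = {
    "tests/payments/": "team-payments",
    "tests/search/": "team-search",
    "tests/search/ranking/": "team-ranking",
}


def owner_for(test_path: str, owners: Dict[str, str]) -> Optional[str]:
    """Return the owner for a test path, preferring the most specific prefix."""
    matches = [prefix for prefix in owners if test_path.startswith(prefix)]
    if not matches:
        return None
    return owners[max(matches, key=len)]


def unowned_tests(test_paths: Iterable[str], owners: Dict[str, str]) -> List[str]:
    """List test paths that no documented owner covers."""
    return [path for path in test_paths if owner_for(path, owners) is None]
```

Running `unowned_tests` over the full list of test files gives a concrete backlog: every path it returns is a test nobody has agreed to fix or delete.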

This is engineering work, not consulting work. It produces a measurably more reliable test infrastructure that the team can trust. After the foundation is fixed, AI quality discipline can be added on top with a reasonable chance of actually working.

The Sequencing Question

A common objection: the team has limited time, AI features are shipping now, and there is pressure to add AI quality testing. Repairing the existing QA system feels like a detour.

The objection is understandable, but the math does not work. AI-QA built on a broken foundation will not produce reliable AI quality outcomes. The investment in AI-QA work gets wasted because the surrounding culture does not enforce it. The team spends three months building AI evals that get ignored within six months.

The faster path to reliable AI quality is to fix the foundation first. For moderately broken QA systems, foundation repair often takes less time than teams expect. After that, AI quality work compounds because it sits on a culture that takes test failures seriously.

Teams that try to skip the foundation repair usually have to come back to it later, after their AI-QA infrastructure has failed to produce the expected outcomes. The shortcut is longer than the direct path.

Where to Start

Honest diagnosis is the prerequisite. Most engineering leaders know their QA system has issues but underestimate the severity. The first useful step is an outside assessment of the actual state of the QA infrastructure: flakiness patterns, CI reliability, coverage quality, ownership clarity, and team culture around test failures.

The diagnosis often surprises engineering leadership. The system is worse than they thought. The team has more workarounds than they realized. The cultural debt is larger than the technical debt.

After diagnosis comes prioritized repair. Not every issue gets fixed at once. The highest-impact fixes go first. Flakiness elimination is usually the highest impact because it restores trust in the gates. CI repair comes next. Test consolidation comes after the immediate fires are out.

Once the foundation is reliable, AI quality work can be added on top with confidence that it will actually be enforced.

Final Thought

AI quality engineering is downstream of QA engineering. Teams cannot have a reliable AI-QA function on top of an unreliable QA function. The foundation determines what is possible above it.

This is not glamorous work. Fixing flaky tests does not feel as exciting as building eval suites for LLMs. But the unglamorous work is the prerequisite for the work that follows. Teams that skip it find themselves rebuilding their AI-QA discipline later, on a foundation that should have been fixed first.

Fix the foundation. Then add the floors.

qa-system-rescue, ai-qa, ci-cd, test-flakiness
