AI-First Test Automation: We Let AI Build Our Playwright Tests. Here’s What Happened

What happened when we put TTC Global’s AI-First Test Automation Methodology to the test, and what it tells us about where humans remain essential.

Mei Reyes Tsai
  • Group Chief Technology Officer
  • TTC Global
  • Auckland, NZ

Co-Authors

Pavel Marunin
  • Principal Consultant
  • TTC Global
  • Auckland, New Zealand

In our companion article, we set out the methodology: four pillars, five phases, the conditions that make AI-first automation work at enterprise scale. This is what it looks like when you run it.

85 minutes. That’s how long it took an AI agent to automate a 31-step enterprise banking journey, covering registration, account creation, fund transfers, loan applications, and transaction reconciliation across 10 application pages, running inside a custom framework with architecture rules no AI model has ever been trained on. No guidance during execution, no code edits, no human decisions along the way.

This is what we delivered to a human reviewer, and what it tells us about where humans remain essential.

For the full technical breakdown (the phase-by-phase detail, the pillar mechanics, the honest boundaries), download the whitepaper. This article tells the story.

The Vibe-Coding Trap

AI coding assistants have lowered the barrier to test automation to near zero. LinkedIn feeds are full of professionals (many without a coding background) announcing that they built a Playwright automation framework using AI in an afternoon. And they are not wrong. Modern AI tools can produce a working test suite from a conversation, and people with no prior coding experience are shipping functional applications the same way.

For small applications and proof-of-concept demos, this works. The code runs, the tests pass, and the results are genuinely impressive for the effort involved.

The problem surfaces at scale. When those vibe-coded frameworks meet enterprise reality (dozens of applications, hundreds of test scenarios, multiple teams, cross-platform requirements, and months of accumulated change), the absence of architectural discipline becomes a liability. Tests that were easy to create become impossible to maintain. Patterns that worked for ten scenarios collapse at two hundred. Reworking a poorly structured automation suite typically costs more than building it right the first time.

At enterprise scale, the model and the tool are the easy part. What determines whether AI automation holds up is the system of conditions you build around it. Without it, the same capable AI produces the same technically correct but architecturally unsustainable code.

Why Teaching AI Our Framework Was the Real Challenge

Playwright excels at browser automation and API testing, but it is a tool, not an architecture. Business logic modelling, test data management, cross-platform integration, logging, state management, and maintenance at scale are entire categories of complexity that must be addressed at the framework level. We have seen clients abandon entire automation investments, not because the tools were inadequate, but because poor structural decisions made the resulting test suites impossible to sustain.

TTC Global maintains the Playwright Accelerator to address exactly this. Page objects define locators only. Step classes handle interactions with built-in logging. Verification classes encapsulate assertions. Process classes orchestrate multi-step workflows. Facades bundle everything into clean test interfaces. On top of that, the framework includes custom utilities for tables, dates, strings, files, PDF validation, retry mechanisms, and encrypted secret management, capabilities that must be purpose-built for enterprise use and that no off-the-shelf tool provides.
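To make the layering concrete, here is a minimal sketch of how the five layers described above might fit together. It is an illustration under stated assumptions, not the Playwright Accelerator's actual code: every class name, selector, and the confirmation text is hypothetical, and `PageLike` / `LocatorLike` are small stand-ins for the subset of Playwright's `Page` and `Locator` used here, so the sketch runs without a browser.

```typescript
// Stand-ins for the few Playwright Page/Locator methods this sketch uses.
// (Assumption: a real suite would use Playwright's own types instead.)
interface LocatorLike {
  fill(value: string): Promise<void>;
  click(): Promise<void>;
  textContent(): Promise<string | null>;
}
interface PageLike {
  locator(selector: string): LocatorLike;
}

// Page object: defines locators only. Selectors are hypothetical.
class TransferPage {
  constructor(private readonly page: PageLike) {}
  get amountInput(): LocatorLike { return this.page.locator("#amount"); }
  get submitButton(): LocatorLike { return this.page.locator("#transfer"); }
  get confirmation(): LocatorLike { return this.page.locator("#result"); }
}

// Step class: performs interactions and logs each one.
class TransferSteps {
  constructor(
    private readonly ui: TransferPage,
    private readonly log: (message: string) => void,
  ) {}
  async enterAmount(amount: number): Promise<void> {
    this.log(`Enter transfer amount ${amount}`);
    await this.ui.amountInput.fill(String(amount));
  }
  async submit(): Promise<void> {
    this.log("Submit transfer");
    await this.ui.submitButton.click();
  }
}

// Verification class: the only layer that asserts.
class TransferVerifications {
  constructor(private readonly ui: TransferPage) {}
  async confirmationShows(expected: string): Promise<void> {
    const actual = await this.ui.confirmation.textContent();
    if (actual !== expected) {
      throw new Error(`Expected "${expected}" but found "${actual}"`);
    }
  }
}

// Process class: orchestrates a multi-step workflow from steps and checks.
class TransferProcess {
  constructor(
    private readonly steps: TransferSteps,
    private readonly verify: TransferVerifications,
  ) {}
  async transferFunds(amount: number): Promise<void> {
    await this.steps.enterAmount(amount);
    await this.steps.submit();
    await this.verify.confirmationShows("Transfer Complete!"); // hypothetical text
  }
}

// Facade: bundles the layers into one clean interface for tests.
class BankingFacade {
  readonly transfers: TransferProcess;
  constructor(page: PageLike, log: (message: string) => void = console.log) {
    const ui = new TransferPage(page);
    this.transfers = new TransferProcess(
      new TransferSteps(ui, log),
      new TransferVerifications(ui),
    );
  }
}
```

With this shape, a test body reduces to a single call such as `await new BankingFacade(page).transfers.transferFunds(50)`, and a changed selector is fixed in exactly one place: the page object.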

The framework’s test development manual spans over 120 pages across 24 chapters, covering architecture rules, layer boundaries, naming conventions, anti-patterns, and platform-specific patterns. No large language model has been trained on it.

General-purpose AI coding assistants produce vanilla Playwright: tests that work for a demo but violate every architectural principle the framework enforces. They ignore the decorator system, the fixture hierarchy, and the facade pattern. The challenge was not simply "can AI write tests?" It was "can AI write tests that a senior engineer would approve in a code review, inside a framework the AI has never seen before?"

What Happened in 85 Minutes

The test case covered the full lifecycle for a newly relocated banking customer: registration, address update, account creation, fund transfers, bill payments, loan application, and transaction reconciliation, across 10 distinct application pages, exercising both the UI and REST API, verifying balance consistency at every stage. A real enterprise scenario.

The agent worked through five phases without human intervention. It explored the live application and documented every interaction pattern, generated automation components across every layer of the framework, debugged and fixed nine runtime failures, audited its own output against documented standards, and resolved 19 code quality findings, all before the result reached a human reviewer.

In an earlier TTC Global Test Lab benchmark, we explored how GitHub Copilot with Playwright MCP could accelerate test automation, achieving up to 37% time savings by producing code drafts that still required significant manual refinement. That benchmark proved AI could be a useful assistant: a faster way to produce a first draft that a human engineer would rework into reviewable code.

This was different. The AI produced a working test that needed human review, not manual refinement. That's a different handoff point entirely, and it changes the economics of test automation at scale.

The iterative structure (generating, testing, critiquing, refining) is what separates this from single-pass code generation. Was it AI that made this possible? No. It was the four conditions we put in place before the agent ran: AI configuration, disciplined consistency, non-AI guardrails, and human governance. You can read the details in our companion article.
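As a rough illustration of that iterative structure, the loop below keeps critiquing and refining a candidate until no findings remain or a round limit is reached. It is a toy sketch, not the agent's actual workflow: the `Finding` and `Candidate` shapes, the function names, and the `no-raw-page-calls` rule are all hypothetical, and a simple string check stands in for the real test runs and static analysis.

```typescript
// Hypothetical shapes for illustration only.
interface Finding { rule: string; detail: string; }
interface Candidate { code: string; revision: number; }

type Critic = (candidate: Candidate) => Finding[];
type Refiner = (candidate: Candidate, findings: Finding[]) => Candidate;

// Critique, then refine, until the critic is satisfied or maxRounds is hit.
function refineUntilClean(
  initial: Candidate,
  critique: Critic,
  refine: Refiner,
  maxRounds: number,
): { candidate: Candidate; rounds: number; remaining: Finding[] } {
  let candidate = initial;
  let findings = critique(candidate);
  let rounds = 0;
  while (findings.length > 0 && rounds < maxRounds) {
    candidate = refine(candidate, findings);
    findings = critique(candidate);
    rounds++;
  }
  // Anything left in `remaining` is what a human reviewer would still see.
  return { candidate, rounds, remaining: findings };
}

// Toy critic: flags raw page calls that bypass the step layer.
const flagRawClicks: Critic = (c) =>
  c.code.includes("page.click(")
    ? [{ rule: "no-raw-page-calls", detail: "use a step class, not page.click" }]
    : [];

// Toy refiner: fixes one raw call per round (string replace hits the first match).
const replaceOneRawClick: Refiner = (c) => ({
  code: c.code.replace("page.click(", "steps.click("),
  revision: c.revision + 1,
});

const outcome = refineUntilClean(
  { code: "page.click('#a'); page.click('#b');", revision: 0 },
  flagRawClicks,
  replaceOneRawClick,
  5,
);
console.log(outcome.rounds, outcome.remaining.length, outcome.candidate.code);
```

The round limit matters as much as the loop: it guarantees the run stops and hands any remaining findings to a human rather than iterating indefinitely.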

What ‘Autonomous’ Actually Means

The word ‘autonomously’ requires careful qualification. The AI operated without human intervention during execution. Every decision about what to generate, how to fix failures, what to refactor, and when to stop was made by the agent. No human typed a single line of code, corrected a single selector, or chose a single assertion strategy during the 85-minute run.

But human expertise is embedded in every layer of the system that made this possible. The Playwright Accelerator (the framework architecture, its layered conventions, and the 120-page manual that the AI learns from) is not something we built for this occasion. It is the culmination of years of iterative refinement across real client engagements. Humans configured the static analysis rules that enforce quality. Humans designed the five-phase workflow and refined the instructions that guide each phase.

And a human reviews the output before it enters the codebase. Two partial coverage gaps remained in the final output: cases where the AI automated the required actions but did not assert every detail the specification implied. No guardrail catches these, because they require judgement about intent rather than conformance to rules. That's by design, not a flaw in the process: catching them is what the human review is for.

Engineers who adopt this model shift their focus to the work that requires human judgement: specification review, architectural decisions, edge case analysis. AI handles the implementation volume that has always been the bottleneck. 85 minutes of autonomous AI execution from test case to a best-practice automated test, followed by human review and approval before it enters the codebase. AI generates. Humans govern.

What's Next

The demonstration is done but the work isn’t. A few things we’re watching closely:

Exploring agent teams. Instead of a single agent managing each phase, a coordinator could spawn parallel sub-agents: per-page explorers during discovery, parallel reviewers during quality assurance. We're watching this space closely as the capability matures.

Extended context. The model used in this demonstration has since moved to a default context window of 1 million tokens. Early observations suggest this would eliminate the phase resumptions we encountered, allowing larger scenarios in a single continuous run.

Scaling today. The five-phase workflow is enterprise-ready for test cases within the 15–20 step comfort zone today. For larger scenarios, we split the work into sequential runs (steps 1–15 followed by steps 16–31) with the second run reusing components created by the first. It works well, but it is a workaround. The capabilities above are what will make it unnecessary.
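The split described above can be sketched as a small planning helper. This is an assumption about how one might compute the ranges, not TTC Global's tooling: `splitIntoRuns` and `StepRange` are hypothetical names, and the limit of 20 is taken from the upper end of the comfort zone mentioned above.

```typescript
// Inclusive, 1-based range of test case steps handled by one agent run.
interface StepRange { first: number; last: number; }

// Split a test case into the fewest sequential runs that respect the limit,
// keeping run sizes balanced (any extra steps go to the later runs).
function splitIntoRuns(totalSteps: number, maxStepsPerRun: number): StepRange[] {
  if (!Number.isInteger(totalSteps) || totalSteps < 1) {
    throw new RangeError("totalSteps must be a positive integer");
  }
  if (!Number.isInteger(maxStepsPerRun) || maxStepsPerRun < 1) {
    throw new RangeError("maxStepsPerRun must be a positive integer");
  }
  const runCount = Math.ceil(totalSteps / maxStepsPerRun);
  const baseSize = Math.floor(totalSteps / runCount);
  const extra = totalSteps % runCount; // the last `extra` runs get one more step
  const ranges: StepRange[] = [];
  let next = 1;
  for (let i = 0; i < runCount; i++) {
    const size = baseSize + (i >= runCount - extra ? 1 : 0);
    ranges.push({ first: next, last: next + size - 1 });
    next += size;
  }
  return ranges;
}

// The 31-step journey with a 20-step limit yields steps 1-15, then 16-31.
console.log(splitIntoRuns(31, 20));
```

Balancing the runs, rather than filling the first one to the limit, keeps every run inside the comfort zone and avoids a short final run. Reusing the first run's components in the second is outside the scope of this sketch.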

The AI models will keep changing. The context limits will move. What won’t change is the need for the conditions that make AI effective. We’re not chasing full autonomy. We’re chasing the point where the methodology handles the volume and the engineers handle the judgement. This demonstration is what that looks like. And this didn't happen by accident. It's the result of our deliberate investment: in our frameworks, in the Test Lab, and in the people who've spent years refining what enterprise-grade automation actually requires.

Want to See This in Action?

For the full technical detail, the phase-by-phase breakdown, the pillar mechanics, and the honest boundaries, download the whitepaper: AI-First Playwright Automation at Scale.

Or get in touch to see what this looks like in your environment. Reach out to our team.