AI Testing

AI-powered testing strategies for the Web UI
Functional POC | Meeting March 25, 2026

Visual Regression with OpenCV

Capture, compare, detect. Catch visual regressions before the trader sees them
OpenCV with SSIM (Structural Similarity Index) compares screenshots against baselines with intelligent masking to ignore dynamic data like real-time prices and timestamps.
As an example, I built goregress, a Go library wrapping OpenCV (via gocv), designed specifically for visual regression testing in CI. It supports three comparison methods (SSIM, pixel-diff, histogram), ignore regions for dynamic content, automatic baseline management, and diff-image generation that highlights changed areas in red. The Suite helper integrates directly with testing.T for idiomatic Go test workflows.
// File: compare_test.go, goregress in action
func TestIgnoreRegions(t *testing.T) {
    dir := t.TempDir()
    img := gocv.NewMatWithSizeFromScalar(
        gocv.NewScalar(128, 128, 128, 0), 100, 100, gocv.MatTypeCV8UC3,
    )
    defer img.Close()

    // Modify only the top-left 20x20 corner
    modified := img.Clone()
    defer modified.Close()
    roi := modified.Region(image.Rect(0, 0, 20, 20))
    defer roi.Close()
    white := gocv.NewMatWithSizeFromScalar(
        gocv.NewScalar(255, 255, 255, 0), 20, 20, gocv.MatTypeCV8UC3, // must match the 20x20 ROI
    )
    defer white.Close()
    white.CopyTo(&roi)

    bPath := filepath.Join(dir, "baseline.png")
    cPath := filepath.Join(dir, "current.png")
    gocv.IMWrite(bPath, img)
    gocv.IMWrite(cPath, modified)

    // Compare WITH the modified region ignored; the masked diff should pass
    res, err := goregress.Compare(bPath, cPath, &goregress.CompareOptions{
        Threshold: 0.99,
        Method:    goregress.MethodPixel,
        IgnoreRegions: []image.Rectangle{
            image.Rect(0, 0, 20, 20), // mask out the changed area
        },
    })
    if err != nil {
        t.Fatal(err)
    }
    if res.Similarity < 0.99 {
        t.Errorf("expected masked comparison to pass, got similarity=%.4f", res.Similarity)
    }
    t.Logf("with ignore: similarity=%.4f", res.Similarity)
}
Key insight: SSIM compares structure, luminance, and contrast, not individual pixels. It's more robust than pixel-diff against anti-aliasing and cross-browser rendering differences. With IgnoreRegions for dynamic market data (prices, timestamps, candlestick charts), you get visual regression testing at zero recurring cost, integrated directly into go test.
Next step: Visual AI with CNNs. goregress does not include CNN-based detection yet, but the research shows where this is heading: Chen et al. (2020) found that CNN models (YOLO, Faster R-CNN) significantly outperform traditional methods for GUI element detection. A trained model can distinguish a disabled button from an active one, verify that an order book renders correctly, or detect a missing column in a positions table, things that pure SSIM doesn't capture. AppFlow (Hu et al., 2018, 128 citations) proved ML can synthesize robust UI tests that survive partial redesigns without rewriting selectors. This is a natural evolution for goregress once the pixel-level foundation is solid.
References
  • Repo Pondigo · go-regress: Visual regression testing with OpenCV in Go. GitHub
  • Paper Yu, S. et al. (2021) · "Prioritize Crowdsourced Test Reports via Deep Screenshot Understanding" ICSE. PDF · arXiv
  • Paper Yu, S. et al. (2021) · "Layout and Image Recognition Driving Cross-Platform Automated Mobile Testing" ICSE. PDF
  • Paper Feiz, S. et al. (2022) · "Understanding Screen Relationships from Screenshots" ACM IUI. ACM DL
  • Docs gocv.io · Go bindings for OpenCV. gocv.io
  • Paper Chen, J. et al. (2020) · "Object Detection for GUI: Old Fashioned or Deep Learning?" ESEC/FSE. 207 citations. ACM DL
  • Paper Moran, K. et al. (2018) · "ML-Based Prototyping of Graphical User Interfaces for Mobile Apps" IEEE TSE. 353 citations. PDF
  • Paper Hu, G. et al. (2018) · "AppFlow: Using ML to Synthesize Robust, Reusable UI Tests" ESEC/FSE. 128 citations. ACM DL
  • Paper YazdaniBanafsheDaragh, F. et al. (2021) · "Deep GUI: Black-Box GUI Input Generation with Deep Learning" ICSE. 56 citations. IEEE
  • Paper Si, C. et al. (2024) · "Design2Code: Automating Front-End Engineering" PDF · arXiv

Self-Healing Frameworks

Tests that repair themselves. AI-driven fault prediction and automated recovery
Nama, Reddy & Pattanayak (2024) identify five critical limitations in traditional frameworks: maintenance overhead (scripts break on every UI change), lack of adaptability (predefined scripts can't handle unexpected failures), limited fault prediction, high initial setup costs, and scalability issues as applications grow.
Self-healing frameworks address these by using ML algorithms to detect, predict, and recover from faults automatically. The paper outlines four core components: monitoring & detection (continuous system health checks using ML anomaly detection), diagnosis & analysis (pattern recognition to find root causes), automated recovery (rollbacks, patches, or selector regeneration without human intervention), and feedback loops (the system learns from past incidents to improve future responses). For a trading platform where UI changes are frequent and downtime is costly, this is critical.
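The automated-recovery component can be sketched in a few lines, assuming a hypothetical Heal helper and a pluggable element lookup; a real framework would also persist the healed mapping so the feedback loop updates the test for future runs:

```go
package main

import (
	"errors"
	"fmt"
)

// Heal tries ranked fallback locators when the primary fails.
// "find" stands in for the driver's element lookup (true = element found).
func Heal(primary string, fallbacks []string, find func(string) bool) (string, error) {
	if find(primary) {
		return primary, nil
	}
	for _, alt := range fallbacks {
		if find(alt) {
			// A real framework would record primary -> alt here so the
			// feedback loop rewrites the test for future runs.
			return alt, nil
		}
	}
	return "", errors.New("no locator matched; escalate to human review")
}

func main() {
	// The old CSS id broke after a redesign; data-testid still matches.
	dom := map[string]bool{`[data-testid="submit-order"]`: true}
	find := func(sel string) bool { return dom[sel] }

	healed, err := Heal("#submit-btn",
		[]string{`[data-testid="submit-order"]`, `button.submit`}, find)
	if err != nil {
		panic(err)
	}
	fmt.Println("healed locator:", healed)
}
```

The commercial tools in the table below implement essentially this loop, plus the ML ranking that decides which fallback to try first.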
Tool | AI Features | Self-Healing | Trading Fit
Mabl | Auto-gen, root cause analysis, smart waits | Native | High
Testim | Visual recorder, auto-optimize, ML locators | Native | High
Applitools | Visual AI, Ultrafast Grid, root cause | Visual only | High
TestRigor | NLP tests, adaptive selectors | Native | Medium
Katalon | Object recognition, analytics | Partial | Medium
TestSprite | Autonomous AI agent, MCP IDE integration, failure classification | Native | High
Key insight: Nama et al. identify that ML models for fault prediction fall into five categories: regression analysis (predicting defect probability), classification algorithms (decision trees/SVMs to classify risky changes), clustering (grouping similar defect patterns), neural networks (learning complex fault patterns from large datasets), and time series analysis (predicting when defects will occur based on historical trends). For the platform, classification and time series are the most applicable, flagging risky order-flow changes before they reach production.
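As a toy illustration of the classification category, a hand-weighted risk score can stand in for a trained model. The features, weights, and 0.6 threshold here are invented for illustration; a decision tree or SVM would learn them from historical defect data:

```go
package main

import "fmt"

// ChangeFeatures are illustrative signals extracted from a code change.
type ChangeFeatures struct {
	TouchesOrderFlow bool
	LinesChanged     int
	PastDefectRate   float64 // historical defect rate of the touched files
}

// RiskScore is a hand-weighted stand-in for a trained classifier.
func RiskScore(f ChangeFeatures) float64 {
	score := 0.0
	if f.TouchesOrderFlow {
		score += 0.5
	}
	if f.LinesChanged > 200 {
		score += 0.2
	}
	score += 0.3 * f.PastDefectRate
	if score > 1 {
		score = 1
	}
	return score
}

func main() {
	risky := ChangeFeatures{TouchesOrderFlow: true, LinesChanged: 450, PastDefectRate: 0.4}
	s := RiskScore(risky)
	fmt.Printf("risk=%.2f flag=%v\n", s, s > 0.6) // risk=0.82 flag=true
}
```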
References
  • Paper Nama, P., Reddy, P. & Pattanayak, S. (2024) · "Artificial Intelligence for Self-Healing Automation Testing Frameworks: Real-Time Fault Prediction and Recovery" CINEFORUM Vol.64 No.3S. ResearchGate
  • Paper Baqar, M. et al. (2025) · "Self-Healing Software Systems: Lessons from Nature, Powered by AI" arXiv
  • Paper Bari, M.S. et al. (2024) · "AI-Augmented Self-Healing Automation Frameworks" AIJMR. 16 citations. PDF
  • Paper Tamraparani, V. (2023) · "Self-Healing Test Automation for Regulatory & Compliance in Financial Institutions" SSRN. 32 citations. SSRN

Natural Language Testing

The trader writes the scenario, the LLM generates the test, the browser executes it
The most powerful idea for our context: QA, product managers, and even traders themselves can write test scenarios in plain language that are automatically translated into executable test code. An LLM interprets the intent, knows the project's data-testid selectors, and generates the complete test.
The flow is: natural language scenario → LLM (Claude/GPT-4) interprets using a system prompt that knows the app's selectors → generates async test code → executes against the web UI → OpenCV verifies the visual state → report. The trader never touches code.
// The trader writes:
"Place a limit order for MBonos Dec26, 1000 titles.
Verify confirmation and that it appears in Open Orders as Pending."

// The LLM generates executable test code that:
// 1. Navigates to /trading
// 2. Searches for MBonos Dec26 instrument
// 3. Selects "limit" order type
// 4. Fills quantity=1000 (price from market data)
// 5. Clicks submit, asserts confirmation
// 6. Navigates to Open Orders, asserts "Pending" status
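One possible shape for the intermediate artifact: the LLM emits a structured step plan that a runner then executes against the browser. The types and selectors here are hypothetical; in practice the system prompt supplies the app's real data-testid values:

```go
package main

import "fmt"

// Step is one browser action in a generated test plan.
type Step struct {
	Action string // navigate, fill, click, assert
	Target string // URL or data-testid selector
	Value  string
}

// PlanLimitOrder mirrors the six steps the LLM derives from the
// trader's sentence; all selectors are illustrative.
func PlanLimitOrder(instrument string, qty int) []Step {
	return []Step{
		{Action: "navigate", Target: "/trading"},
		{Action: "fill", Target: `[data-testid="instrument-search"]`, Value: instrument},
		{Action: "click", Target: `[data-testid="order-type-limit"]`},
		{Action: "fill", Target: `[data-testid="quantity"]`, Value: fmt.Sprint(qty)},
		{Action: "click", Target: `[data-testid="submit-order"]`},
		{Action: "assert", Target: `[data-testid="open-orders"]`, Value: "Pending"},
	}
}

func main() {
	for i, s := range PlanLimitOrder("MBonos Dec26", 1000) {
		fmt.Printf("%d. %-8s %s %s\n", i+1, s.Action, s.Target, s.Value)
	}
}
```

Keeping the plan as data rather than raw generated code makes it reviewable, diffable, and replayable against any driver.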
Beyond UI scenarios: Daisy Chains. Natural language descriptions can also define entire flow behaviors across bounded contexts. A daisy chain spec declares boundaries, signals, and timing contracts in YAML. Instead of writing imperative test code, you write what the flow should look like: "order placed at storefront → order.validated within 500ms → inventory.reserved within 500ms → payment.charged within 2s." An agent picks up the spec and observes passively at each boundary. No browser, no mocks, just boundary observation across HTTP, NATS, SMTP, or any transport.
# Daisy Chain spec: natural language behavior description
chain: place-order
links:
  - boundary: storefront-ui
    signal: POST /api/orders
    role: injector
  - boundary: order-service
    signal: order.validated
    expect_within: 500ms
  - boundary: inventory-service
    signal: inventory.reserved
    expect_within: 500ms
  - boundary: payment-service
    signal: payment.charged
    expect_within: 2000ms
Key insight: ChatUniTest (Chen et al., 2024) demonstrates that LLMs can generate high-quality tests with the right framework, with 231 citations in one year. Daisy chains take this further: the spec itself is the test. Write what the flow should look like, and the agent handles observation, timing, and verdicts. The spec is the test, the test is the monitor.
References
  • Proposal Pondigo (2026) · "Daisy Chains: Declarative Flow Testing" me.pondi.app
  • Paper Chen, Y. et al. (2024) · "ChatUniTest: A Framework for LLM-Based Test Generation" ISSTA. 231 citations. ACM DL
  • Paper Fakhoury, S. et al. (2024) · "LLM-Based Test-Driven Interactive Code Generation" IEEE TSE. 183 citations. PDF
  • Paper Mathews, N.S. & Nagappan, M. (2024) · "Test-Driven Development and LLM-Based Code Generation" ASE. ACM DL
  • Paper He, Z. et al. (2025) · "HardTests: Synthesizing High-Quality Test Cases for LLM Coding" arXiv
  • Tool TestRigor · NLP-based test automation (30+ languages). testrigor.com

OpenAgentsControl

Agents that learn your project's testing patterns before generating code
OpenAgentsControl (OAC) is an open-source framework where AI agents learn your project's patterns before generating code. Unlike proprietary tools with fixed behavior, agents are defined as editable Markdown files. The "Minimal Viable Information" principle loads only relevant patterns, reducing token usage by ~80%.
For the platform, the direct application is: codify the team's testing conventions in .opencode/context/project/: selectors with data-testid, assertion structure, market data fixtures, verification patterns against PostgreSQL and NATS. The TestEngineer subagent generates tests consistent with those conventions. ContextScout discovers existing patterns automatically.
# Configure team's testing patterns
oac init
oac context add "We use Go tests with goregress for visual regression"
oac context add "Selectors always use data-testid"
oac context add "Trading tests verify: UI, DB (postgres), and NATS events"
oac context add "OpenCV masking for dynamic prices and timestamps"

# Generate test that follows team conventions
oac test "Create tests for the limit order placement flow"
Key insight: Patterns are committed to the repo. The entire team inherits the same testing conventions automatically, new developers included. Approval gates ensure no destructive test runs without human oversight, critical in a financial context.
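An approval gate might look like this in Go (the Gate type and its pluggable Approver are hypothetical): destructive actions are refused unless a human signs off, while read-only tests pass straight through:

```go
package main

import (
	"errors"
	"fmt"
)

// Gate blocks destructive test actions until a human approves.
// Approver is pluggable: a CLI prompt locally, a Slack approval in CI.
type Gate struct {
	Approver func(action string) bool
}

// Run executes the step, requiring approval first if it is destructive.
func (g Gate) Run(action string, destructive bool, run func() error) error {
	if destructive && !g.Approver(action) {
		return errors.New("approval denied: " + action)
	}
	return run()
}

func main() {
	gate := Gate{Approver: func(string) bool { return false }} // nobody approved
	err := gate.Run("truncate orders table in staging", true, func() error {
		fmt.Println("running destructive step")
		return nil
	})
	fmt.Println("result:", err)
}
```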
References
  • Repo Hinde, D. · OpenAgentsControl: Pattern-first AI agent framework. GitHub
  • Paper Han, T. et al. (2026) · "SWE-Skills-Bench: Agent Skills in Real-World Software Engineering" PDF · arXiv
  • Paper Ye, J. et al. (2026) · "CCTU: Tool Use under Complex Constraints" PDF · arXiv
  • Paper Jiang, L. et al. (2026) · "Web Verbs: Agentic Web Task Composition" PDF

Multi-Agent Test Orchestration

Each agent specializes. Functional, visual, data, performance
The core idea: instead of a single monolithic system, orchestrate multiple specialized agents with CrewAI. A Test Planner designs scenarios prioritized by trader impact. A Functional Tester executes E2E flows against the browser. A Visual Tester compares screenshots with goregress/OpenCV. A Data Verifier confirms the order reached PostgreSQL and the event was published to NATS.
[Architecture diagram] Natural-language scenario → LLM Interpreter (Claude / GPT-4) → specialized agents: Functional Tester (E2E browser), Visual Tester (OpenCV + SSIM), Data Verifier (PostgreSQL + NATS), Performance Monitor (Lighthouse + Web Vitals) → targets: Web UI (browser), PostgreSQL, NATS events → Dashboard: results + diff images + traceability. Orchestration layer: CrewAI / OpenAgentsControl.
Key insight: Separating into specialized agents allows each one to evolve independently. If tomorrow we switch from OpenCV to Applitools for visual regression, only the Visual Tester changes. The rest of the system doesn't notice.
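The seam can be sketched as a small Go interface (all names hypothetical): swapping the visual implementation later touches exactly one type, and the orchestrator never changes:

```go
package main

import "fmt"

// Result is the common verdict every specialist agent reports.
type Result struct {
	Agent  string
	Passed bool
	Detail string
}

// TestAgent is the seam that lets each specialist evolve independently.
type TestAgent interface {
	Name() string
	Run(scenario string) Result
}

type visualTester struct{} // stand-in for goregress/OpenCV (or Applitools later)

func (visualTester) Name() string { return "visual" }
func (visualTester) Run(string) Result {
	return Result{Agent: "visual", Passed: true, Detail: "SSIM above threshold vs baseline"}
}

type dataVerifier struct{} // stand-in for PostgreSQL + NATS checks

func (dataVerifier) Name() string { return "data" }
func (dataVerifier) Run(string) Result {
	return Result{Agent: "data", Passed: true, Detail: "order row + NATS event found"}
}

// Orchestrate fans the scenario out to every agent and collects verdicts.
func Orchestrate(scenario string, agents []TestAgent) []Result {
	results := make([]Result, 0, len(agents))
	for _, a := range agents {
		results = append(results, a.Run(scenario))
	}
	return results
}

func main() {
	for _, r := range Orchestrate("place limit order", []TestAgent{visualTester{}, dataVerifier{}}) {
		fmt.Printf("[%s] pass=%v %s\n", r.Agent, r.Passed, r.Detail)
	}
}
```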
References
  • Paper Zhao, D. et al. (2025) · "SeeAction: Reverse Engineering HCI Actions from Screencasts" ICSE Distinguished Paper. arXiv
  • Paper Liu, J. et al. (2023) · "Is Your Code Generated by ChatGPT Really Correct?" NeurIPS. 1855 citations. PDF
  • Docs CrewAI · Multi-Agent Orchestration Framework. docs.crewai.com
  • Docs Anthropic · Claude Agent SDK. docs.anthropic.com

Test Identity Management

Centralized test users, roles, and credentials for every environment
Currently, k6-load-tests/setup provisions the entire test environment locally: institutions (k6_test_{n}_{hostname}), traders ({institution}_user@example.com), hashed passwords, securities (MBonos, Udibonos), API keys, stream sets, and compliance officers. Each developer runs node setup/main.js load {env} from their machine. The DEVICE_IDENTIFIER is the hostname, meaning every machine creates its own isolated set of test data in PostgreSQL.
This works for local development but breaks down for CI, parallel test runs, and agent-based testing. When a CrewAI agent or OAC subagent needs to execute a trader flow, it needs credentials. Right now there's no way to: assign a specific test identity to a specific test run, prevent two parallel runs from using the same trader, or trace which test user placed which order in the database and NATS.
Approach | How it works | Pros | Cons
Vault + API | HashiCorp Vault stores credentials; tests fetch via API at runtime | Encrypted, auditable, rotation | Infra overhead
Identity Pool Service | Custom Go service: checkout/checkin test users, track which run owns which identity | Full control, traceability | Build + maintain
k6 + Environment Configs | Migrate local scripts to repo with env-specific configs; CI injects secrets | Minimal change, fast | No rotation, limited audit
Keycloak Test Realm | Dedicated Keycloak realm for test users; OAuth flows match production | Real auth flows, SSO | Setup complexity
Current state: k6-load-tests scripts load users locally. The quick win is migrating these to the repo with environment configs so CI can run them. The longer-term goal is an identity pool service where each test agent checks out a user, runs the flow, and checks it back in, leaving a full audit trail in the database.
// Current: k6-load-tests creates traders per device hostname
// Email pattern: k6_test_{n}_{hostname}_user@example.com
// Password: shared TEST_USER_PASSWORD hashed with TEST_PASSWORD_SALT

// Proposed: identity pool with checkout/checkin
type TestIdentity struct {
    InstitutionID string
    TraderEmail   string // "k6_test_1_ci_user@example.com"
    Token         string
    RunID         string // links identity to test execution
    Role          string // "broker-dealer", "compliance", "admin"
}

func CheckoutTrader(role string) (*TestIdentity, error) {
    // Locks a trader so no parallel run uses it
    // Returns valid credentials + institution context
}

func CheckinTrader(id *TestIdentity) error {
    // Releases trader back to pool
    // Logs: orders placed, NATS events, DB mutations
}
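The checkout/checkin stubs above can be fleshed out as an in-memory pool (hypothetical types; the real service would back this with PostgreSQL row locks so parallel runs on different machines coordinate):

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Pool hands out test identities so no two runs share a trader.
type Pool struct {
	mu   sync.Mutex
	free map[string][]string // role -> available trader emails
	held map[string]string   // trader email -> owning run ID (audit trail)
}

func NewPool(byRole map[string][]string) *Pool {
	return &Pool{free: byRole, held: map[string]string{}}
}

// Checkout locks a free trader for a run, or errors if the pool is empty.
func (p *Pool) Checkout(role, runID string) (string, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	traders := p.free[role]
	if len(traders) == 0 {
		return "", errors.New("no free trader for role " + role)
	}
	email := traders[0]
	p.free[role] = traders[1:]
	p.held[email] = runID // audit: which run owns this identity
	return email, nil
}

// Checkin releases the trader back to the pool.
func (p *Pool) Checkin(role, email string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	delete(p.held, email)
	p.free[role] = append(p.free[role], email)
}

func main() {
	pool := NewPool(map[string][]string{
		"broker-dealer": {"k6_test_1_ci_user@example.com"},
	})
	email, _ := pool.Checkout("broker-dealer", "run-42")
	fmt.Println("checked out:", email)
	if _, err := pool.Checkout("broker-dealer", "run-43"); err != nil {
		fmt.Println("parallel run blocked:", err) // pool exhausted until checkin
	}
	pool.Checkin("broker-dealer", email)
}
```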
References
  • Docs HashiCorp · Vault Secrets Management. vaultproject.io
  • Docs Keycloak · Open Source Identity and Access Management. keycloak.org
  • Docs Grafana k6 · Load Testing for Engineering Teams. k6.io
  • Repo Pondigo · k6-load-tests: Load testing scripts (private)

Implementation Roadmap

Three progressive phases. From zero cost to full AI in 8 weeks
No need to build everything at once. The strategy is to increment capabilities in phases, validating each layer before adding the next. Phase 1 costs nothing and can be running in CI within 1-2 weeks. Phase 2 introduces daisy chain specs for passive flow observation across NATS and HTTP boundaries. Phase 3 adds agents that auto-generate chains from production traffic.
Phase 1 (Weeks 1-2)
  • E2E: Go tests + browser
  • Visual: goregress (SSIM)
  • Orchestration: manual scripts
  • Data: direct queries
  • Identity: k6 scripts to repo + env configs
Phase 2 (Weeks 3-5)
  • E2E: Go tests + OAC
  • Visual: goregress + Percy
  • Chains: passive observer + YAML specs
  • Orchestration: OAC
  • Data: Verifier agent
  • Identity: Vault + checkout/checkin API
Phase 3 (Weeks 6-8)
  • E2E: CrewAI + OAC multi-agent
  • Visual: goregress + Applitools
  • Chains: Agent ATPG + auto-generated chains
  • NLP: LLM + TestRigor
  • Orchestration: CrewAI multi-agent
  • Data: autonomous agent
  • Identity: Keycloak realm + auto-rotation
Quick win: Go tests + goregress with SSIM. Zero cost, in CI within 1-2 weeks, covers the most critical trading flows. It's the foundation everything else builds on.
Decisions for the March 25 meeting: Do we start with Phase 1 or invest from the start in Phase 2 with OAC? What are the 5 most critical trader flows? Is there budget for SaaS tools? Who owns the testing system: QA, dev, or platform?
References
  • Paper Nama, P. et al. (2024) · "AI for Self-Healing Automation Testing Frameworks" CINEFORUM. ResearchGate
  • Paper Huang, H. et al. (2026) · "More Code, Less Reuse: AI-generated PR Quality" PDF · arXiv
  • Docs Applitools · Visual AI Testing Platform. applitools.com
  • Docs Percy · Visual Review Platform. percy.io