AI Testing

AI-powered testing strategies for the Web UI
Functional POC | Meeting March 25, 2026

Visual Regression with OpenCV

Capture, compare, detect. Catch visual regressions before the trader sees them
OpenCV with SSIM (Structural Similarity Index) compares screenshots against baselines with intelligent masking to ignore dynamic data like real-time prices and timestamps.
As an example, I built goregress, a Go library wrapping OpenCV (via gocv), designed specifically for visual regression testing in CI. It supports three comparison methods (SSIM, pixel-diff, histogram), ignore regions for dynamic content, automatic baseline management, and diff-image generation that highlights changed areas in red. The Suite helper integrates directly with testing.T for idiomatic Go test workflows.
// File: compare_test.go, goregress in action
func TestIgnoreRegions(t *testing.T) {
    dir := t.TempDir()
    img := gocv.NewMatWithSizeFromScalar(
        gocv.NewScalar(128, 128, 128, 0), 100, 100, gocv.MatTypeCV8UC3,
    )
    defer img.Close()

    // Modify only the top-left 20x20 corner
    modified := img.Clone()
    defer modified.Close()
    roi := modified.Region(image.Rect(0, 0, 20, 20))
    defer roi.Close()
    white := gocv.NewMatWithSizeFromScalar(
        gocv.NewScalar(255, 255, 255, 0), 20, 20, gocv.MatTypeCV8UC3, // must match the 20x20 ROI
    )
    defer white.Close()
    white.CopyTo(&roi)

    bPath := filepath.Join(dir, "baseline.png")
    cPath := filepath.Join(dir, "current.png")
    gocv.IMWrite(bPath, img)
    gocv.IMWrite(cPath, modified)

    // Compare WITH the modified region ignored; the masked diff should pass
    res, err := goregress.Compare(bPath, cPath, &goregress.CompareOptions{
        Threshold: 0.99,
        Method:    goregress.MethodPixel,
        IgnoreRegions: []image.Rectangle{
            image.Rect(0, 0, 20, 20), // mask out the changed area
        },
    })
    if err != nil {
        t.Fatal(err)
    }
    if res.Similarity < 0.99 {
        t.Errorf("expected masked comparison to pass, got similarity=%.4f", res.Similarity)
    }
    t.Logf("with ignore: similarity=%.4f", res.Similarity)
}
Key insight: SSIM compares structure, luminance, and contrast, not individual pixels. It's more robust than pixel-diff against anti-aliasing and cross-browser rendering differences. With IgnoreRegions for dynamic market data (prices, timestamps, candlestick charts), you get visual regression testing at zero recurring cost, integrated directly into go test.
Next step: Visual AI with CNNs. goregress does not include CNN-based detection yet, but the research shows where this is heading: Chen et al. (2020) found that CNN models (YOLO, Faster R-CNN) significantly outperform traditional methods for GUI element detection. A trained model can distinguish a disabled button from an active one, verify that an order book renders correctly, or detect a missing column in a positions table, things that pure SSIM doesn't capture. AppFlow (Hu et al., 2018, 128 citations) proved ML can synthesize robust UI tests that survive partial redesigns without rewriting selectors. This is a natural evolution for goregress once the pixel-level foundation is solid.
References
  • Repo Pondigo · go-regress: Visual regression testing with OpenCV in Go. GitHub
  • Paper Yu, S. et al. (2021) · "Prioritize Crowdsourced Test Reports via Deep Screenshot Understanding" ICSE. PDF · arXiv
  • Paper Yu, S. et al. (2021) · "Layout and Image Recognition Driving Cross-Platform Automated Mobile Testing" ICSE. PDF
  • Paper Feiz, S. et al. (2022) · "Understanding Screen Relationships from Screenshots" ACM IUI. ACM DL
  • Docs gocv.io · Go bindings for OpenCV. gocv.io
  • Paper Chen, J. et al. (2020) · "Object Detection for GUI: Old Fashioned or Deep Learning?" ESEC/FSE. 207 citations. ACM DL
  • Paper Moran, K. et al. (2018) · "ML-Based Prototyping of Graphical User Interfaces for Mobile Apps" IEEE TSE. 353 citations. PDF
  • Paper Hu, G. et al. (2018) · "AppFlow: Using ML to Synthesize Robust, Reusable UI Tests" ESEC/FSE. 128 citations. ACM DL
  • Paper YazdaniBanafsheDaragh, F. et al. (2021) · "Deep GUI: Black-Box GUI Input Generation with Deep Learning" ICSE. 56 citations. IEEE
  • Paper Si, C. et al. (2024) · "Design2Code: Automating Front-End Engineering" PDF · arXiv

Self-Healing Frameworks

Tests that repair themselves. AI-driven fault prediction and automated recovery
Nama, Reddy & Pattanayak (2024) identify five critical limitations in traditional frameworks: maintenance overhead (scripts break on every UI change), lack of adaptability (predefined scripts can't handle unexpected failures), limited fault prediction, high initial setup costs, and scalability issues as applications grow.
Self-healing frameworks address these by using ML algorithms to detect, predict, and recover from faults automatically. The paper outlines four core components: monitoring & detection (continuous system health checks using ML anomaly detection), diagnosis & analysis (pattern recognition to find root causes), automated recovery (rollbacks, patches, or selector regeneration without human intervention), and feedback loops (the system learns from past incidents to improve future responses). For a trading platform where UI changes are frequent and downtime is costly, this is critical.
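The automated-recovery component can be sketched in a few lines, assuming a hypothetical Heal helper and a pluggable element lookup; a real framework would also persist the healed mapping so the feedback loop updates the test for future runs:

```go
package main

import (
	"errors"
	"fmt"
)

// Heal tries ranked fallback locators when the primary fails.
// "find" stands in for the driver's element lookup (true = element found).
func Heal(primary string, fallbacks []string, find func(string) bool) (string, error) {
	if find(primary) {
		return primary, nil
	}
	for _, alt := range fallbacks {
		if find(alt) {
			// A real framework would record primary -> alt here so the
			// feedback loop rewrites the test for future runs.
			return alt, nil
		}
	}
	return "", errors.New("no locator matched; escalate to human review")
}

func main() {
	// The old CSS id broke after a redesign; data-testid still matches.
	dom := map[string]bool{`[data-testid="submit-order"]`: true}
	find := func(sel string) bool { return dom[sel] }

	healed, err := Heal("#submit-btn",
		[]string{`[data-testid="submit-order"]`, `button.submit`}, find)
	if err != nil {
		panic(err)
	}
	fmt.Println("healed locator:", healed)
}
```

The commercial tools in the table below implement essentially this loop, plus the ML ranking that decides which fallback to try first.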
Tool | AI Features | Self-Healing | Trading Fit
Mabl | Auto-gen, root cause analysis, smart waits | Native | High
Testim | Visual recorder, auto-optimize, ML locators | Native | High
Applitools | Visual AI, Ultrafast Grid, root cause | Visual only | High
TestRigor | NLP tests, adaptive selectors | Native | Medium
Katalon | Object recognition, analytics | Partial | Medium
TestSprite | Autonomous AI agent, MCP IDE integration, failure classification | Native | High
Key insight: Nama et al. identify that ML models for fault prediction fall into five categories: regression analysis (predicting defect probability), classification algorithms (decision trees/SVMs to classify risky changes), clustering (grouping similar defect patterns), neural networks (learning complex fault patterns from large datasets), and time series analysis (predicting when defects will occur based on historical trends). For the platform, classification and time series are the most applicable, flagging risky order-flow changes before they reach production.
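As a toy illustration of the classification category, a hand-weighted risk score can stand in for a trained model. The features, weights, and 0.6 threshold here are invented for illustration; a decision tree or SVM would learn them from historical defect data:

```go
package main

import "fmt"

// ChangeFeatures are illustrative signals extracted from a code change.
type ChangeFeatures struct {
	TouchesOrderFlow bool
	LinesChanged     int
	PastDefectRate   float64 // historical defect rate of the touched files
}

// RiskScore is a hand-weighted stand-in for a trained classifier.
func RiskScore(f ChangeFeatures) float64 {
	score := 0.0
	if f.TouchesOrderFlow {
		score += 0.5
	}
	if f.LinesChanged > 200 {
		score += 0.2
	}
	score += 0.3 * f.PastDefectRate
	if score > 1 {
		score = 1
	}
	return score
}

func main() {
	risky := ChangeFeatures{TouchesOrderFlow: true, LinesChanged: 450, PastDefectRate: 0.4}
	s := RiskScore(risky)
	fmt.Printf("risk=%.2f flag=%v\n", s, s > 0.6) // risk=0.82 flag=true
}
```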
References
  • Paper Nama, P., Reddy, P. & Pattanayak, S. (2024) · "Artificial Intelligence for Self-Healing Automation Testing Frameworks: Real-Time Fault Prediction and Recovery" CINEFORUM Vol.64 No.3S. ResearchGate
  • Paper Baqar, M. et al. (2025) · "Self-Healing Software Systems: Lessons from Nature, Powered by AI" arXiv
  • Paper Bari, M.S. et al. (2024) · "AI-Augmented Self-Healing Automation Frameworks" AIJMR. 16 citations. PDF
  • Paper Tamraparani, V. (2023) · "Self-Healing Test Automation for Regulatory & Compliance in Financial Institutions" SSRN. 32 citations. SSRN

Natural Language Testing

The trader writes the scenario, the LLM generates the test, the browser executes it
The most powerful idea for our context: QA, product managers, and even traders themselves can write test scenarios in plain language that are automatically translated into executable test code. An LLM interprets the intent, knows the project's data-testid selectors, and generates the complete test.
The flow is: natural language scenario → LLM (Claude/GPT-4) interprets using a system prompt that knows the app's selectors → generates async test code → executes against the web UI → OpenCV verifies the visual state → report. The trader never touches code.
// The trader writes:
"Place a limit order for MBonos Dec26, 1000 titles.
Verify confirmation and that it appears in Open Orders as Pending."

// The LLM generates executable test code that:
// 1. Navigates to /trading
// 2. Searches for MBonos Dec26 instrument
// 3. Selects "limit" order type
// 4. Fills quantity=1000 (price from market data)
// 5. Clicks submit, asserts confirmation
// 6. Navigates to Open Orders, asserts "Pending" status
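One possible shape for the intermediate artifact: the LLM emits a structured step plan that a runner then executes against the browser. The types and selectors here are hypothetical; in practice the system prompt supplies the app's real data-testid values:

```go
package main

import "fmt"

// Step is one browser action in a generated test plan.
type Step struct {
	Action string // navigate, fill, click, assert
	Target string // URL or data-testid selector
	Value  string
}

// PlanLimitOrder mirrors the six steps the LLM derives from the
// trader's sentence; all selectors are illustrative.
func PlanLimitOrder(instrument string, qty int) []Step {
	return []Step{
		{Action: "navigate", Target: "/trading"},
		{Action: "fill", Target: `[data-testid="instrument-search"]`, Value: instrument},
		{Action: "click", Target: `[data-testid="order-type-limit"]`},
		{Action: "fill", Target: `[data-testid="quantity"]`, Value: fmt.Sprint(qty)},
		{Action: "click", Target: `[data-testid="submit-order"]`},
		{Action: "assert", Target: `[data-testid="open-orders"]`, Value: "Pending"},
	}
}

func main() {
	for i, s := range PlanLimitOrder("MBonos Dec26", 1000) {
		fmt.Printf("%d. %-8s %s %s\n", i+1, s.Action, s.Target, s.Value)
	}
}
```

Keeping the plan as data rather than raw generated code makes it reviewable, diffable, and replayable against any driver.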
Beyond UI scenarios: Daisy Chains. Natural language descriptions can also define entire flow behaviors across bounded contexts. A daisy chain spec declares boundaries, signals, and timing contracts in YAML. Instead of writing imperative test code, you write what the flow should look like: "order placed at storefront → order.validated within 500ms → inventory.reserved within 500ms → payment.charged within 2s." An agent picks up the spec and observes passively at each boundary. No browser, no mocks, just boundary observation across HTTP, NATS, SMTP, or any transport.
# Daisy Chain spec: natural language behavior description
chain: place-order
links:
  - boundary: storefront-ui
    signal: POST /api/orders
    role: injector
  - boundary: order-service
    signal: order.validated
    expect_within: 500ms
  - boundary: inventory-service
    signal: inventory.reserved
    expect_within: 500ms
  - boundary: payment-service
    signal: payment.charged
    expect_within: 2000ms
Key insight: ChatUniTest (Chen et al., 2024) demonstrates that LLMs can generate high-quality tests with the right framework, with 231 citations in one year. Daisy chains take this further: the spec itself is the test. Write what the flow should look like, and the agent handles observation, timing, and verdicts. The spec is the test, the test is the monitor.
References
  • Proposal Pondigo (2026) · "Daisy Chains: Declarative Flow Testing" me.pondi.app
  • Paper Chen, Y. et al. (2024) · "ChatUniTest: A Framework for LLM-Based Test Generation" ISSTA. 231 citations. ACM DL
  • Paper Fakhoury, S. et al. (2024) · "LLM-Based Test-Driven Interactive Code Generation" IEEE TSE. 183 citations. PDF
  • Paper Mathews, N.S. & Nagappan, M. (2024) · "Test-Driven Development and LLM-Based Code Generation" ASE. ACM DL
  • Paper He, Z. et al. (2025) · "HardTests: Synthesizing High-Quality Test Cases for LLM Coding" arXiv
  • Tool TestRigor · NLP-based test automation (30+ languages). testrigor.com

OpenAgentsControl

Agents that learn your project's testing patterns before generating code
OpenAgentsControl (OAC) is an open-source framework where AI agents learn your project's patterns before generating code. Unlike proprietary tools with fixed behavior, agents are defined as editable Markdown files. The "Minimal Viable Information" principle loads only relevant patterns, reducing token usage by ~80%.
For the platform, the direct application is: codify the team's testing conventions in .opencode/context/project/: selectors with data-testid, assertion structure, market data fixtures, verification patterns against PostgreSQL and NATS. The TestEngineer subagent generates tests consistent with those conventions. ContextScout discovers existing patterns automatically.
# Configure team's testing patterns
oac init
oac context add "We use Go tests with goregress for visual regression"
oac context add "Selectors always use data-testid"
oac context add "Trading tests verify: UI, DB (postgres), and NATS events"
oac context add "OpenCV masking for dynamic prices and timestamps"

# Generate test that follows team conventions
oac test "Create tests for the limit order placement flow"
Key insight: Patterns are committed to the repo. The entire team inherits the same testing conventions automatically, new developers included. Approval gates ensure no destructive test runs without human oversight, critical in a financial context.
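An approval gate might look like this in Go (the Gate type and its pluggable Approver are hypothetical): destructive actions are refused unless a human signs off, while read-only tests pass straight through:

```go
package main

import (
	"errors"
	"fmt"
)

// Gate blocks destructive test actions until a human approves.
// Approver is pluggable: a CLI prompt locally, a Slack approval in CI.
type Gate struct {
	Approver func(action string) bool
}

// Run executes the step, requiring approval first if it is destructive.
func (g Gate) Run(action string, destructive bool, run func() error) error {
	if destructive && !g.Approver(action) {
		return errors.New("approval denied: " + action)
	}
	return run()
}

func main() {
	gate := Gate{Approver: func(string) bool { return false }} // nobody approved
	err := gate.Run("truncate orders table in staging", true, func() error {
		fmt.Println("running destructive step")
		return nil
	})
	fmt.Println("result:", err)
}
```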
References
  • Repo Hinde, D. · OpenAgentsControl: Pattern-first AI agent framework. GitHub
  • Paper Han, T. et al. (2026) · "SWE-Skills-Bench: Agent Skills in Real-World Software Engineering" PDF · arXiv
  • Paper Ye, J. et al. (2026) · "CCTU: Tool Use under Complex Constraints" PDF · arXiv
  • Paper Jiang, L. et al. (2026) · "Web Verbs: Agentic Web Task Composition" PDF

Multi-Agent Test Orchestration

Each agent specializes. Functional, visual, data, performance
The core idea: instead of a single monolithic system, orchestrate multiple specialized agents with CrewAI. A Test Planner designs scenarios prioritized by trader impact. A Functional Tester executes E2E flows against the browser. A Visual Tester compares screenshots with goregress/OpenCV. A Data Verifier confirms the order reached PostgreSQL and the event was published to NATS.
[Architecture diagram] Natural-language scenario → LLM Interpreter (Claude / GPT-4) → specialized agents: Functional Tester (E2E browser), Visual Tester (OpenCV + SSIM), Data Verifier (PostgreSQL + NATS), Performance Monitor (Lighthouse + Web Vitals) → targets: Web UI (browser), PostgreSQL, NATS events → Dashboard: results + diff images + traceability. Orchestration layer: CrewAI / OpenAgentsControl.
Key insight: Separating into specialized agents allows each one to evolve independently. If tomorrow we switch from OpenCV to Applitools for visual regression, only the Visual Tester changes. The rest of the system doesn't notice.
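The seam can be sketched as a small Go interface (all names hypothetical): swapping the visual implementation later touches exactly one type, and the orchestrator never changes:

```go
package main

import "fmt"

// Result is the common verdict every specialist agent reports.
type Result struct {
	Agent  string
	Passed bool
	Detail string
}

// TestAgent is the seam that lets each specialist evolve independently.
type TestAgent interface {
	Name() string
	Run(scenario string) Result
}

type visualTester struct{} // stand-in for goregress/OpenCV (or Applitools later)

func (visualTester) Name() string { return "visual" }
func (visualTester) Run(string) Result {
	return Result{Agent: "visual", Passed: true, Detail: "SSIM above threshold vs baseline"}
}

type dataVerifier struct{} // stand-in for PostgreSQL + NATS checks

func (dataVerifier) Name() string { return "data" }
func (dataVerifier) Run(string) Result {
	return Result{Agent: "data", Passed: true, Detail: "order row + NATS event found"}
}

// Orchestrate fans the scenario out to every agent and collects verdicts.
func Orchestrate(scenario string, agents []TestAgent) []Result {
	results := make([]Result, 0, len(agents))
	for _, a := range agents {
		results = append(results, a.Run(scenario))
	}
	return results
}

func main() {
	for _, r := range Orchestrate("place limit order", []TestAgent{visualTester{}, dataVerifier{}}) {
		fmt.Printf("[%s] pass=%v %s\n", r.Agent, r.Passed, r.Detail)
	}
}
```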
References
  • Paper Zhao, D. et al. (2025) · "SeeAction: Reverse Engineering HCI Actions from Screencasts" ICSE Distinguished Paper. arXiv
  • Paper Liu, J. et al. (2023) · "Is Your Code Generated by ChatGPT Really Correct?" NeurIPS. 1855 citations. PDF
  • Docs CrewAI · Multi-Agent Orchestration Framework. docs.crewai.com
  • Docs Anthropic · Claude Agent SDK. docs.anthropic.com

Test Identity Management

Centralized test users, roles, and credentials for every environment
Currently, k6-load-tests/setup provisions the entire test environment locally: institutions (k6_test_{n}_{hostname}), traders ({institution}_user@example.com), hashed passwords, securities (MBonos, Udibonos), API keys, stream sets, and compliance officers. Each developer runs node setup/main.js load {env} from their machine. The DEVICE_IDENTIFIER is the hostname, meaning every machine creates its own isolated set of test data in PostgreSQL.
This works for local development but breaks down for CI, parallel test runs, and agent-based testing. When a CrewAI agent or OAC subagent needs to execute a trader flow, it needs credentials. Right now there's no way to: assign a specific test identity to a specific test run, prevent two parallel runs from using the same trader, or trace which test user placed which order in the database and NATS.
Approach | How it works | Pros | Cons
Vault + API | HashiCorp Vault stores credentials; tests fetch via API at runtime | Encrypted, auditable, rotation | Infra overhead
Identity Pool Service | Custom Go service: checkout/checkin test users, track which run owns which identity | Full control, traceability | Build + maintain
k6 + Environment Configs | Migrate local scripts to repo with env-specific configs; CI injects secrets | Minimal change, fast | No rotation, limited audit
Keycloak Test Realm | Dedicated Keycloak realm for test users; OAuth flows match production | Real auth flows, SSO | Setup complexity
Current state: k6-load-tests scripts load users locally. The quick win is migrating these to the repo with environment configs so CI can run them. The longer-term goal is an identity pool service where each test agent checks out a user, runs the flow, and checks it back in, leaving a full audit trail in the database.
// Current: k6-load-tests creates traders per device hostname
// Email pattern: k6_test_{n}_{hostname}_user@example.com
// Password: shared TEST_USER_PASSWORD hashed with TEST_PASSWORD_SALT

// Proposed: identity pool with checkout/checkin
type TestIdentity struct {
    InstitutionID string
    TraderEmail   string // "k6_test_1_ci_user@example.com"
    Token         string
    RunID         string // links identity to test execution
    Role          string // "broker-dealer", "compliance", "admin"
}

func CheckoutTrader(role string) (*TestIdentity, error) {
    // Locks a trader so no parallel run uses it
    // Returns valid credentials + institution context
}

func CheckinTrader(id *TestIdentity) error {
    // Releases trader back to pool
    // Logs: orders placed, NATS events, DB mutations
}
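The checkout/checkin stubs above can be fleshed out as an in-memory pool (hypothetical types; the real service would back this with PostgreSQL row locks so parallel runs on different machines coordinate):

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Pool hands out test identities so no two runs share a trader.
type Pool struct {
	mu   sync.Mutex
	free map[string][]string // role -> available trader emails
	held map[string]string   // trader email -> owning run ID (audit trail)
}

func NewPool(byRole map[string][]string) *Pool {
	return &Pool{free: byRole, held: map[string]string{}}
}

// Checkout locks a free trader for a run, or errors if the pool is empty.
func (p *Pool) Checkout(role, runID string) (string, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	traders := p.free[role]
	if len(traders) == 0 {
		return "", errors.New("no free trader for role " + role)
	}
	email := traders[0]
	p.free[role] = traders[1:]
	p.held[email] = runID // audit: which run owns this identity
	return email, nil
}

// Checkin releases the trader back to the pool.
func (p *Pool) Checkin(role, email string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	delete(p.held, email)
	p.free[role] = append(p.free[role], email)
}

func main() {
	pool := NewPool(map[string][]string{
		"broker-dealer": {"k6_test_1_ci_user@example.com"},
	})
	email, _ := pool.Checkout("broker-dealer", "run-42")
	fmt.Println("checked out:", email)
	if _, err := pool.Checkout("broker-dealer", "run-43"); err != nil {
		fmt.Println("parallel run blocked:", err) // pool exhausted until checkin
	}
	pool.Checkin("broker-dealer", email)
}
```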
References
  • Docs HashiCorp · Vault Secrets Management. vaultproject.io
  • Docs Keycloak · Open Source Identity and Access Management. keycloak.org
  • Docs Grafana k6 · Load Testing for Engineering Teams. k6.io
  • Repo Pondigo · k6-load-tests: Load testing scripts (private)

Implementation Roadmap

Three progressive phases. From zero cost to full AI in 8 weeks
No need to build everything at once. The strategy is to increment capabilities in phases, validating each layer before adding the next. Phase 1 costs nothing and can be running in CI within 1-2 weeks. Phase 2 introduces daisy chain specs for passive flow observation across NATS and HTTP boundaries. Phase 3 adds agents that auto-generate chains from production traffic.
Phase 1 (Weeks 1-2)
  • E2E: Go tests + browser
  • Visual: goregress (SSIM)
  • Orchestration: manual scripts
  • Data: direct queries
  • Identity: k6 scripts to repo + env configs
Phase 2 (Weeks 3-5)
  • E2E: Go tests + OAC
  • Visual: goregress + Percy
  • Chains: passive observer + YAML specs
  • Orchestration: OAC
  • Data: Verifier agent
  • Identity: Vault + checkout/checkin API
Phase 3 (Weeks 6-8)
  • E2E: CrewAI + OAC multi-agent
  • Visual: goregress + Applitools
  • Chains: Agent ATPG + auto-generated chains
  • NLP: LLM + TestRigor
  • Orchestration: CrewAI multi-agent
  • Data: autonomous agent
  • Identity: Keycloak realm + auto-rotation
Quick win: Go tests + goregress with SSIM. Zero cost, in CI within 1-2 weeks, covers the most critical trading flows. It's the foundation everything else builds on.
Decisions for the March 25 meeting: Do we start with Phase 1 or invest from the start in Phase 2 with OAC? What are the 5 most critical trader flows? Is there budget for SaaS tools? Who owns the testing system: QA, dev, or platform?
References
  • Paper Nama, P. et al. (2024) · "AI for Self-Healing Automation Testing Frameworks" CINEFORUM. ResearchGate
  • Paper Huang, H. et al. (2026) · "More Code, Less Reuse: AI-generated PR Quality" PDF · arXiv
  • Docs Applitools · Visual AI Testing Platform. applitools.com
  • Docs Percy · Visual Review Platform. percy.io