Project

WebGraph

Four-tier browser automation framework with auto-escalation across DOM, accessibility tree, and vision workers.

Live

Overview

WebGraph is a four-tier browser automation framework I designed and built at TrainWithMe. This public writeup focuses on the architectural pattern, not the product it operates on.

A single supervisor agent coordinates three worker tiers, each operating on a different representation of a web page. A step escalates to the next tier only when the current worker's representation is insufficient, so most steps resolve at the cheapest tier and the expensive tiers run only when necessary.

The four tiers

  1. Supervisor. Owns the task, the plan, and the budget. Calls into workers and decides when to escalate.
  2. DOM worker. Operates on the page’s DOM directly. Fast, deterministic, cheapest. Handles ~70% of typical steps (form fills, link clicks against stable selectors).
  3. AXTree worker. Operates on the accessibility tree. Useful when the DOM is opaque (custom components, shadow roots, inconsistent attributes) but the page still exposes meaningful semantics through ARIA.
  4. Vision worker. Operates on pixels via a vision LLM (Gemini). Used when neither DOM nor AXTree gives the agent enough context — typically canvas apps, image-only content, or pages where layout is the signal.
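The tier ordering above can be sketched as a small escalation ladder. This is a minimal illustration, not WebGraph's actual code: the names `Tier`, `StepResult`, `run_step`, and the stub workers are all hypothetical, and the sketch simplifies the supervisor by always escalating when a tier fails, whereas the real supervisor may choose to replan instead.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Callable, Dict, Optional


class Tier(IntEnum):
    """Worker tiers ordered by cost (hypothetical names)."""
    DOM = 1      # cheapest, deterministic
    AXTREE = 2   # accessibility tree
    VISION = 3   # pixels via a vision LLM, most expensive


@dataclass
class StepResult:
    tier: Tier
    value: str


# Assumed worker contract: return a value when the step resolves,
# or None when this tier's representation is insufficient.
Worker = Callable[[str], Optional[str]]


def run_step(step: str, workers: Dict[Tier, Worker]) -> Optional[StepResult]:
    """Try tiers from cheapest to most expensive, never moving back down."""
    for tier in sorted(workers):
        value = workers[tier](step)
        if value is not None:
            return StepResult(tier=tier, value=value)
    # Every tier failed; a real supervisor would replan here.
    return None


# Stub workers that stand in for real page interaction.
def dom_worker(step: str) -> Optional[str]:
    return "dom:" + step if step.startswith("click ") else None


def axtree_worker(step: str) -> Optional[str]:
    return "axtree:" + step if step.startswith("toggle ") else None


def vision_worker(step: str) -> Optional[str]:
    return "vision:" + step if step.startswith("read chart") else None


WORKERS: Dict[Tier, Worker] = {
    Tier.DOM: dom_worker,
    Tier.AXTREE: axtree_worker,
    Tier.VISION: vision_worker,
}

if __name__ == "__main__":
    for step in ["click submit", "toggle dark mode", "read chart total"]:
        result = run_step(step, WORKERS)
        print(step, "->", result.tier.name if result else "replan")
```

Because the loop only ever moves toward more expensive tiers, a step that the DOM worker can handle never touches the accessibility tree or the vision model.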

Technical decisions worth talking about

  • One representation at a time. Each worker sees exactly one view of the page. Mixing DOM context into the vision prompt was tried and dropped — the worker conflates layers and gets worse, not better.
  • Escalation is one-way per step. Falling back to a cheaper tier mid-step caused thrashing. The rule is: if a tier can’t resolve a step, the supervisor decides whether to escalate or replan.
  • Determinism budget per task. Vision steps are tracked separately because vision is the most expensive and least stable layer. Each task has a per-run vision budget; running out forces a replan rather than spending more pixels.
  • Strict tool surface. Workers see typed tools, not raw browser APIs. Easier to reason about, easier to swap, and the LLM stays inside a small grammar.
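Two of the decisions above, the per-run vision budget and the typed tool surface, can be sketched together. This is an illustrative sketch under assumptions: `Click`, `TypeText`, `VisionBudget`, `ReplanRequired`, and `vision_step` are hypothetical names, and the step simply returns a description string instead of driving a browser.

```python
from dataclasses import dataclass
from typing import Union


# Typed tools: the only actions a worker may emit (hypothetical set).
@dataclass(frozen=True)
class Click:
    target: str


@dataclass(frozen=True)
class TypeText:
    target: str
    text: str


Tool = Union[Click, TypeText]


class ReplanRequired(Exception):
    """Signals the supervisor to replan instead of spending more vision steps."""


@dataclass
class VisionBudget:
    remaining: int  # vision steps left for this run

    def spend(self) -> None:
        if self.remaining <= 0:
            raise ReplanRequired("vision budget exhausted")
        self.remaining -= 1


def vision_step(budget: VisionBudget, tool: Tool) -> str:
    """Charge the budget first, then dispatch on the typed tool."""
    budget.spend()
    if isinstance(tool, Click):
        return f"click {tool.target}"
    if isinstance(tool, TypeText):
        return f"type {tool.text!r} into {tool.target}"
    raise TypeError(f"unsupported tool: {tool!r}")


if __name__ == "__main__":
    budget = VisionBudget(remaining=1)
    print(vision_step(budget, Click("Save")))
    try:
        vision_step(budget, TypeText("Search", "invoices"))
    except ReplanRequired as exc:
        print("replan:", exc)
```

Charging the budget before dispatch means an exhausted budget stops the step before any vision work happens, and the closed set of tool types keeps anything outside that small grammar from reaching the browser.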

Status

In production at TrainWithMe, where it powers the automation orchestrator that runs scheduled jobs and webhook-triggered work.