Project
WebGraph
Four-tier browser automation framework with auto-escalation across DOM, accessibility tree, and vision workers.
Live
- Python
- LangGraph
- Playwright
- Gemini
- TypeScript
Overview
WebGraph is a four-tier browser automation framework I designed and built at TrainWithMe. This public writeup focuses on the architectural pattern, not the product it operates on.
A single supervisor agent coordinates three worker tiers, each operating on a different representation of a web page. A step escalates to the next tier only when the current tier’s representation is insufficient, so most steps resolve at the cheapest tier and the expensive tiers run only when necessary.
The four tiers
- Supervisor. Owns the task, the plan, and the budget. Calls into workers and decides when to escalate.
- DOM worker. Operates on the page’s DOM directly. Fast, deterministic, cheapest. Handles ~70% of typical steps (form fills, link clicks against stable selectors).
- AXTree worker. Operates on the accessibility tree. Useful when the DOM is opaque (custom components, shadow roots, inconsistent attributes) but the page still exposes meaningful semantics through ARIA.
- Vision worker. Operates on pixels via a vision LLM (Gemini). Used when neither DOM nor AXTree gives the agent enough context — typically canvas apps, image-only content, or pages where layout is the signal.
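The tier ordering above can be sketched as a simple ladder. This is an illustrative sketch, not the production code: the names `Tier`, `StepResult`, `run_step`, and the three stub workers are hypothetical, and the `css:` / `role:` prefixes are only a stand-in for "this representation is sufficient for the step". Real workers would drive a browser, and the vision worker would call a vision LLM.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Callable, List, Optional, Tuple


class Tier(IntEnum):
    """Worker tiers, ordered from cheapest to most expensive."""
    DOM = 1
    AXTREE = 2
    VISION = 3


@dataclass(frozen=True)
class StepResult:
    tier: Tier    # which tier resolved the step
    action: str   # human-readable description of the chosen action


# Hypothetical worker signature: return a result, or None when the
# worker's own representation is not enough to resolve the step.
Worker = Callable[[str], Optional[StepResult]]


def dom_worker(step: str) -> Optional[StepResult]:
    # Stand-in: only steps with a stable CSS selector resolve here.
    if step.startswith("css:"):
        return StepResult(Tier.DOM, f"click {step[4:]}")
    return None


def axtree_worker(step: str) -> Optional[StepResult]:
    # Stand-in: only steps addressed by ARIA role/name resolve here.
    if step.startswith("role:"):
        return StepResult(Tier.AXTREE, f"activate {step[5:]}")
    return None


def vision_worker(step: str) -> Optional[StepResult]:
    # Stand-in for a vision-LLM call; always "resolves" in this sketch.
    return StepResult(Tier.VISION, f"click at pixels for {step!r}")


LADDER: List[Tuple[Tier, Worker]] = [
    (Tier.DOM, dom_worker),
    (Tier.AXTREE, axtree_worker),
    (Tier.VISION, vision_worker),
]


def run_step(step: str) -> Optional[StepResult]:
    """Try each tier in cost order and stop at the first that resolves."""
    for _tier, worker in LADDER:
        result = worker(step)
        if result is not None:
            return result
    return None
```

In this sketch, `run_step("css:#submit")` resolves at the DOM tier, `run_step("role:button[Save]")` falls through to the AXTree tier, and a free-form description such as `"the red chart legend"` reaches the vision tier. Each stub sees only its own view of the step, mirroring the one-representation-per-worker split.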
Technical decisions worth talking about
- One representation at a time. Each worker sees exactly one view of the page. Mixing DOM context into the vision prompt was tried and dropped: the worker conflated the two layers and performed worse, not better.
- Escalation is one-way per step. Within a step, control only moves up the tiers; falling back to a cheaper tier mid-step caused thrashing. The rule is: if a tier can’t resolve the step, the supervisor decides whether to escalate or replan.
- Determinism budget per task. Vision steps are tracked separately because they’re the most expensive and least stable layer. Each task run has a vision budget; exhausting it forces a replan rather than spending more on pixels.
- Strict tool surface. Workers see typed tools, not raw browser APIs. Easier to reason about, easier to swap, and the LLM stays inside a small grammar.
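The one-way escalation rule and the per-run vision budget can be combined into a single supervisor decision. The sketch below is an assumption about how that might look, not the actual implementation: `VisionBudget`, `Decision`, and `on_unresolved` are hypothetical names, and the real supervisor may weigh more signals than this fixed rule does.

```python
from dataclasses import dataclass
from enum import Enum, IntEnum
from typing import Optional, Tuple


class Tier(IntEnum):
    DOM = 1
    AXTREE = 2
    VISION = 3


class Decision(Enum):
    ESCALATE = "escalate"
    REPLAN = "replan"


@dataclass
class VisionBudget:
    """Per-run cap on vision steps (hypothetical shape)."""
    limit: int
    used: int = 0

    def try_spend(self) -> bool:
        # Consume one vision step if any remain; report whether it succeeded.
        if self.used >= self.limit:
            return False
        self.used += 1
        return True


def on_unresolved(failed: Tier, budget: VisionBudget) -> Tuple[Decision, Optional[Tier]]:
    """Decide what happens after `failed` could not resolve the current step.

    The returned tier is always above `failed`, never below it, so a step
    cannot bounce back to a cheaper tier (one-way escalation).
    """
    if failed is Tier.VISION:
        # Nothing above vision: the only option left is a new plan.
        return Decision.REPLAN, None
    next_tier = Tier(failed + 1)
    if next_tier is Tier.VISION and not budget.try_spend():
        # Vision budget exhausted: replan instead of spending more.
        return Decision.REPLAN, None
    return Decision.ESCALATE, next_tier
```

With `VisionBudget(limit=1)`, the first AXTree failure escalates to vision and consumes the budget; a second AXTree failure in the same run returns a replan. Failures at the DOM tier escalate to AXTree without touching the budget.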
Status
In production at TrainWithMe, where it powers the automation orchestrator that runs scheduled jobs and webhook-triggered work.