Traditional web automation is powerful, but it's often incredibly fragile. Anyone who has spent hours writing XPath queries or CSS selectors for a web scraping task knows the pain. You craft the perfect selector, your script runs flawlessly, and then... the website's developer changes a `div` to a `span`, adds a class name, and your entire workflow shatters.
This brittleness exists because traditional tools don't understand a webpage; they just parse its structure. They follow a rigid map. If the map changes, they get lost.
At browse.do, we're taking a fundamentally different approach. We've built an AI agent that interacts with websites more like a human does—by understanding context, semantics, and visual layout. This post will pull back the curtain and give you a look inside how our agent turns your natural language objectives into automated actions.
The core issue with selector-based automation is its tight coupling to the Document Object Model (DOM). A selector like `div#main-content > article.post-entry:first-child > h2 > a` is a precise, step-by-step path through the HTML tree. It's effective, but it makes a dangerous assumption: that the structure of the tree will never change.
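To see why this coupling is so fragile, here's a self-contained sketch. It uses a toy tree rather than a real DOM (and is not the browse.do internals), but it shows exactly what a rigid structural path does when the markup shifts underneath it:

```typescript
// A toy DOM: nodes with a tag, optional text, and children.
type TreeNode = { tag: string; text?: string; children: TreeNode[] };

// Follow a rigid structural path, the way a CSS selector does.
// The moment any step of the path is missing, the lookup fails.
function followPath(root: TreeNode, path: string[]): TreeNode | undefined {
  let current: TreeNode | undefined = root;
  for (const tag of path) {
    current = current?.children.find((c) => c.tag === tag);
  }
  return current;
}

const before: TreeNode = {
  tag: "div",
  children: [
    {
      tag: "article",
      children: [
        { tag: "h2", children: [{ tag: "a", text: "Top story", children: [] }] },
      ],
    },
  ],
};

// The site ships a redesign: <article> becomes <section>. Same content,
// same link, different structure.
const after: TreeNode = {
  tag: "div",
  children: [
    {
      tag: "section",
      children: [
        { tag: "h2", children: [{ tag: "a", text: "Top story", children: [] }] },
      ],
    },
  ],
};

const path = ["article", "h2", "a"];
console.log(followPath(before, path)?.text); // "Top story"
console.log(followPath(after, path));        // undefined: the path broke
```

The content the script wanted never moved; only the route to it did. That is the failure mode selector-based automation inherits by design.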
This is like giving directions by saying, "Take a left at the third oak tree, then a right at the blue fire hydrant." If the city removes a tree or paints the hydrant red, the directions become useless.
A human, on the other hand, understands the intent. You'd tell a friend, "Head towards the main post office and look for the big clock tower." That's a semantic, goal-oriented instruction. Our AI agent is designed to work the same way.
To understand a page like a human, our agent needs to perceive it like a human. It doesn't just read the raw HTML. Instead, it combines multiple data sources to build a rich, contextual understanding of what's on the screen:

- **Visual rendering:** a screenshot of the page as a user would see it, capturing layout, hierarchy, and emphasis.
- **DOM structure:** the underlying HTML tree, which tells the agent what elements exist and how they relate.
- **Semantic signals:** accessibility roles, labels, and link text that convey what each element means and does.
By combining what it sees with the underlying structure and semantic meaning, the agent gets a far more robust picture of the webpage than a simple HTML parser ever could.
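As a rough sketch of what such a combined representation might look like (the names and shapes here are illustrative assumptions, not browse.do's actual internals), imagine each on-screen element carrying its semantic role, its visible label, its visual bounds, and its structural location:

```typescript
// Hypothetical shape of one element in the agent's page snapshot.
interface ElementView {
  role: string;   // semantic role, e.g. from the accessibility tree
  label: string;  // visible text or accessible name
  bounds: { x: number; y: number; w: number; h: number }; // visual layout
  domPath: string; // structural location, kept as a fallback hint only
}

// Merge the three sources (keyed by DOM path) into one list of
// candidate elements for the model to reason over.
function buildSnapshot(
  roles: Record<string, string>,
  labels: Record<string, string>,
  layout: Record<string, { x: number; y: number; w: number; h: number }>
): ElementView[] {
  return Object.keys(roles).map((domPath) => ({
    role: roles[domPath],
    label: labels[domPath] ?? "",
    bounds: layout[domPath] ?? { x: 0, y: 0, w: 0, h: 0 },
    domPath,
  }));
}

const snapshot = buildSnapshot(
  { "html>body>a[0]": "link" },
  { "html>body>a[0]": "Top story" },
  { "html>body>a[0]": { x: 12, y: 40, w: 300, h: 16 } }
);
console.log(snapshot[0].role, snapshot[0].label); // link Top story
```

The point of a structure like this is that no single source is load-bearing: if the DOM path changes, the role, label, and position still identify the element.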
This rich, multi-modal representation of the page is then fed to a Large Language Model (LLM), which acts as the agent's "brain." The LLM's task is simple to state but incredibly complex to execute: correlate the user's objective with the elements on the page.
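One plausible way to frame that correlation step (an assumption on our part, not browse.do's actual prompt format) is to serialize the user's objective and the perceived elements into a single prompt for the model:

```typescript
// Hypothetical prompt builder: list the candidate elements alongside the
// objective and ask the model which one to act on.
function buildPrompt(
  objective: string,
  elements: { role: string; label: string }[]
): string {
  const listing = elements
    .map((e, i) => `${i}: <${e.role}> "${e.label}"`)
    .join("\n");
  return [
    `Objective: ${objective}`,
    "Elements:",
    listing,
    "Reply with the index of the element to act on.",
  ].join("\n");
}

const prompt = buildPrompt("Find the title of the top story", [
  { role: "link", label: "Top story" },
  { role: "link", label: "comments" },
]);
console.log(prompt);
```

However the prompt is actually framed, the key property is the same: the model chooses among elements described by meaning and appearance, not by structural path.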
This process happens in a continuous loop:

1. **Perceive:** capture the current state of the page.
2. **Reason:** the LLM weighs the user's objective against the elements it perceives and chooses the next action.
3. **Act:** the agent executes that action (click, type, navigate) in the browser.
4. **Observe:** the agent checks the result and, if the objective isn't yet met, starts the loop again.
This iterative loop is what allows browse.do to handle complex, multi-step tasks like logging into an account, navigating to a dashboard, and downloading a report.
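The iterative loop can be sketched in a few lines of control flow. Everything below is a toy stand-in (the stub agent fakes perception and planning), but it shows how multi-step tasks fall out of a simple perceive-reason-act cycle:

```typescript
// Hypothetical action and agent interfaces, for illustration only.
type Action = { kind: "click" | "type" | "done"; target?: string; value?: string };

interface AgentStep {
  perceive(): string; // capture the current page state
  decide(state: string, objective: string, history: Action[]): Action; // LLM call
  act(action: Action): void; // drive the browser
}

function runAgent(agent: AgentStep, objective: string, maxSteps = 10): Action[] {
  const history: Action[] = [];
  for (let i = 0; i < maxSteps; i++) {
    const state = agent.perceive();                          // 1. Perceive
    const action = agent.decide(state, objective, history);  // 2. Reason
    history.push(action);
    if (action.kind === "done") break;                       // objective met
    agent.act(action);                                       // 3. Act, then loop
  }
  return history;
}

// Toy agent: one step on the login page, then the objective is met.
let step = 0;
const demo: AgentStep = {
  perceive: () => (step === 0 ? "login page" : "dashboard"),
  decide: (state) =>
    state === "login page"
      ? { kind: "type", target: "#user", value: "alice" }
      : { kind: "done" },
  act: () => { step++; },
};

const history = runAgent(demo, "log in and open the dashboard");
console.log(history.map((a) => a.kind)); // → ["type", "done"]
```

Because each pass through the loop re-perceives the page, the agent can recover from unexpected states (a cookie banner, a redirect) instead of blindly replaying a script.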
Let's look at the simple code example from our homepage.
```typescript
import { browse } from "@do-inc/agents";

async function getTopHackerNewsStory() {
  const result = await browse.do({
    url: "https://news.ycombinator.com",
    objective: "Find the title of the top story and its URL."
  });

  console.log(result.data);
  return result.data;
}
```
Here's a simplified breakdown of the agent's thought process:

1. Load news.ycombinator.com and capture its rendered state.
2. Recognize, from the visual hierarchy and rank markers, which story sits at the top of the list.
3. Extract that story's title text and the URL its link points to.
4. Return both as structured data in `result.data`.
Notice what didn't happen. There was no `tr.athing:first-child > td.title > a`. The agent didn't need a fragile, hard-coded path. It understood the concept of a "top story" based on visual hierarchy and semantic cues, making the automation resilient to future site redesigns.
By shifting from rigid structural paths to semantic understanding, we're building a more resilient, intuitive, and powerful way to automate the web. This approach lowers the barrier to entry for complex automation and frees developers from the tedious cycle of writing and fixing broken selectors.
You no longer need to program a robot with a faulty map. Instead, you can simply tell it where to go.
Ready to stop fighting with selectors and start a conversation with the web?