Traditional web automation is powerful, but it's often incredibly fragile. Anyone who has spent hours writing XPath queries or CSS selectors for a web scraping task knows the pain. You craft the perfect selector, your script runs flawlessly, and then... the website's developer changes a `div` to a `span`, adds a class name, and your entire workflow shatters.
This brittleness exists because traditional tools don't understand a webpage; they just parse its structure. They follow a rigid map. If the map changes, they get lost.
At browse.do, we're taking a fundamentally different approach. We've built an AI agent that interacts with websites more like a human does—by understanding context, semantics, and visual layout. This post will pull back the curtain and give you a look inside how our agent turns your natural language objectives into automated actions.
The core issue with selector-based automation is its tight coupling to the Document Object Model (DOM). A selector like `div#main-content > article.post-entry:first-child > h2 > a` is a precise, step-by-step path through the HTML tree. It's effective, but it makes a dangerous assumption: that the structure of the tree will never change.
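To see why this coupling is so fragile, here's a self-contained sketch. It uses a toy tree rather than a real DOM (and is not the browse.do internals), but it shows exactly what a rigid structural path does when the markup shifts underneath it:

```typescript
// A toy DOM: nodes with a tag, optional text, and children.
type TreeNode = { tag: string; text?: string; children: TreeNode[] };

// Follow a rigid structural path, the way a CSS selector does.
// The moment any step of the path is missing, the lookup fails.
function followPath(root: TreeNode, path: string[]): TreeNode | undefined {
  let current: TreeNode | undefined = root;
  for (const tag of path) {
    current = current?.children.find((c) => c.tag === tag);
  }
  return current;
}

const before: TreeNode = {
  tag: "div",
  children: [
    {
      tag: "article",
      children: [
        { tag: "h2", children: [{ tag: "a", text: "Top story", children: [] }] },
      ],
    },
  ],
};

// The site ships a redesign: <article> becomes <section>. Same content,
// same link, different structure.
const after: TreeNode = {
  tag: "div",
  children: [
    {
      tag: "section",
      children: [
        { tag: "h2", children: [{ tag: "a", text: "Top story", children: [] }] },
      ],
    },
  ],
};

const path = ["article", "h2", "a"];
console.log(followPath(before, path)?.text); // "Top story"
console.log(followPath(after, path));        // undefined: the path broke
```

The content the script wanted never moved; only the route to it did. That is the failure mode selector-based automation inherits by design.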
This is like giving directions by saying, "Take a left at the third oak tree, then a right at the blue fire hydrant." If the city removes a tree or paints the hydrant red, the directions become useless.
A human, on the other hand, understands the intent. You'd tell a friend, "Head towards the main post office and look for the big clock tower." That's a semantic, goal-oriented instruction. Our AI agent is designed to work the same way.
To understand a page like a human, our agent needs to perceive it like a human. It doesn't just read the raw HTML. Instead, it combines multiple data sources to build a rich, contextual understanding of what's on the screen:

- **Visual rendering:** a screenshot of the page as a user would see it, capturing layout, hierarchy, and emphasis.
- **DOM structure:** the underlying HTML tree, which tells the agent what elements exist and how they relate.
- **Semantic signals:** accessibility roles, labels, and link text that convey what each element means and does.
By combining what it sees with the underlying structure and semantic meaning, the agent gets a far more robust picture of the webpage than a simple HTML parser ever could.
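As a rough sketch of what such a combined representation might look like (the names and shapes here are illustrative assumptions, not browse.do's actual internals), imagine each on-screen element carrying its semantic role, its visible label, its visual bounds, and its structural location:

```typescript
// Hypothetical shape of one element in the agent's page snapshot.
interface ElementView {
  role: string;   // semantic role, e.g. from the accessibility tree
  label: string;  // visible text or accessible name
  bounds: { x: number; y: number; w: number; h: number }; // visual layout
  domPath: string; // structural location, kept as a fallback hint only
}

// Merge the three sources (keyed by DOM path) into one list of
// candidate elements for the model to reason over.
function buildSnapshot(
  roles: Record<string, string>,
  labels: Record<string, string>,
  layout: Record<string, { x: number; y: number; w: number; h: number }>
): ElementView[] {
  return Object.keys(roles).map((domPath) => ({
    role: roles[domPath],
    label: labels[domPath] ?? "",
    bounds: layout[domPath] ?? { x: 0, y: 0, w: 0, h: 0 },
    domPath,
  }));
}

const snapshot = buildSnapshot(
  { "html>body>a[0]": "link" },
  { "html>body>a[0]": "Top story" },
  { "html>body>a[0]": { x: 12, y: 40, w: 300, h: 16 } }
);
console.log(snapshot[0].role, snapshot[0].label); // link Top story
```

The point of a structure like this is that no single source is load-bearing: if the DOM path changes, the role, label, and position still identify the element.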
This rich, multi-modal representation of the page is then fed to a Large Language Model (LLM), which acts as the agent's "brain." The LLM's task is simple to state but incredibly complex to execute: correlate the user's objective with the elements on the page.
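One plausible way to frame that correlation step (an assumption on our part, not browse.do's actual prompt format) is to serialize the user's objective and the perceived elements into a single prompt for the model:

```typescript
// Hypothetical prompt builder: list the candidate elements alongside the
// objective and ask the model which one to act on.
function buildPrompt(
  objective: string,
  elements: { role: string; label: string }[]
): string {
  const listing = elements
    .map((e, i) => `${i}: <${e.role}> "${e.label}"`)
    .join("\n");
  return [
    `Objective: ${objective}`,
    "Elements:",
    listing,
    "Reply with the index of the element to act on.",
  ].join("\n");
}

const prompt = buildPrompt("Find the title of the top story", [
  { role: "link", label: "Top story" },
  { role: "link", label: "comments" },
]);
console.log(prompt);
```

However the prompt is actually framed, the key property is the same: the model chooses among elements described by meaning and appearance, not by structural path.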
This process happens in a continuous loop:

1. **Perceive:** capture the current state of the page.
2. **Reason:** the LLM weighs the user's objective against the elements it perceives and chooses the next action.
3. **Act:** the agent executes that action (click, type, navigate) in the browser.
4. **Observe:** the agent checks the result and, if the objective isn't yet met, starts the loop again.
This iterative loop is what allows browse.do to handle complex, multi-step tasks like logging into an account, navigating to a dashboard, and downloading a report.
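The iterative loop can be sketched in a few lines of control flow. Everything below is a toy stand-in (the stub agent fakes perception and planning), but it shows how multi-step tasks fall out of a simple perceive-reason-act cycle:

```typescript
// Hypothetical action and agent interfaces, for illustration only.
type Action = { kind: "click" | "type" | "done"; target?: string; value?: string };

interface AgentStep {
  perceive(): string; // capture the current page state
  decide(state: string, objective: string, history: Action[]): Action; // LLM call
  act(action: Action): void; // drive the browser
}

function runAgent(agent: AgentStep, objective: string, maxSteps = 10): Action[] {
  const history: Action[] = [];
  for (let i = 0; i < maxSteps; i++) {
    const state = agent.perceive();                          // 1. Perceive
    const action = agent.decide(state, objective, history);  // 2. Reason
    history.push(action);
    if (action.kind === "done") break;                       // objective met
    agent.act(action);                                       // 3. Act, then loop
  }
  return history;
}

// Toy agent: one step on the login page, then the objective is met.
let step = 0;
const demo: AgentStep = {
  perceive: () => (step === 0 ? "login page" : "dashboard"),
  decide: (state) =>
    state === "login page"
      ? { kind: "type", target: "#user", value: "alice" }
      : { kind: "done" },
  act: () => { step++; },
};

const history = runAgent(demo, "log in and open the dashboard");
console.log(history.map((a) => a.kind)); // → ["type", "done"]
```

Because each pass through the loop re-perceives the page, the agent can recover from unexpected states (a cookie banner, a redirect) instead of blindly replaying a script.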
Let's look at the simple code example from our homepage.
```typescript
import { browse } from "@do-inc/agents";

async function getTopHackerNewsStory() {
  const result = await browse.do({
    url: "https://news.ycombinator.com",
    objective: "Find the title of the top story and its URL."
  });

  console.log(result.data);
  return result.data;
}
```
Here's a simplified breakdown of the agent's thought process:

1. Load news.ycombinator.com and capture its rendered state.
2. Recognize, from the visual hierarchy and rank markers, which story sits at the top of the list.
3. Extract that story's title text and the URL its link points to.
4. Return both as structured data in `result.data`.
Notice what didn't happen. There was no `tr.athing:first-child > td.title > a`. The agent didn't need a fragile, hard-coded path. It understood the concept of a "top story" based on visual hierarchy and semantic cues, making the automation resilient to future site redesigns.
By shifting from rigid structural paths to semantic understanding, we're building a more resilient, intuitive, and powerful way to automate the web. This approach lowers the barrier to entry for complex automation and frees developers from the tedious cycle of writing and fixing broken selectors.
You no longer need to program a robot with a faulty map. Instead, you can simply tell it where to go.
Ready to stop fighting with selectors and start a conversation with the web?