For decades, businesses have sought to automate back-office tasks, data entry, billing processes, and other repetitive workflows. But even as software has evolved, true end-to-end automation remains elusive for most enterprises. Now, with the rapid rise of Large Language Models (LLMs), and the emergence of “AI agents” capable of reasoning and acting autonomously, there’s a growing belief that 2025 might be the year we finally see a significant leap forward in enterprise automation.
Sam Altman has publicly stated that “in 2025, we may see the first AI agents join the workforce and materially change the output of companies,” while Marc Benioff is pivoting Salesforce toward “AgentForce” in anticipation of a future where many organizational processes are delegated to specialized agents. These predictions raise a central question: can AI agents overcome the complicated hurdles of real-world enterprise systems? In this article, we’ll examine the unique difficulties of enterprise automation and explore some of today’s promising (but still maturing) solutions. We’ll also share hands-on tests with a seemingly straightforward workflow in Salesforce (SFDC) — creating a reseller order for a new account — that reveals the complexity lurking behind the scenes.
On paper, automating enterprise tasks sounds straightforward: spin up a script to log in, fill out forms, and click “submit.” In practice, the complexity is staggering. Enterprises rely on a myriad of systems of record like Salesforce, SAP, Oracle, and plenty of homegrown solutions. Each system has its own web of permissions, authentication flows, and custom business logic. What’s more, these systems are often heavily customized. It’s common to see specialized UIs, additional data fields, and bespoke workflows that differ from business to business.
According to a joint survey by MuleSoft and Deloitte, large enterprises may use an average of 976 different systems to support daily operations (source). This fragmentation means an automation tool must talk to multiple systems, each with its own nuance; some with robust APIs, others with none at all. Often, the simplest tasks involve bridging data across old, legacy applications and new cloud-based services. Even standard platforms like Salesforce can become labyrinthine once custom workflows and third-party integrations are in place.
Against this backdrop, LLM-powered agents promise a more flexible approach: they can parse data, reason about next steps, and even navigate complicated GUIs — at least in theory. But as you’ll see in the following example, the reality of getting an AI agent to do even a basic Salesforce workflow without human help is more complicated than many realize.
Picture you’re a sales associate at a bike manufacturing company that uses Salesforce. You’ve just sold 1 large Dynamo X1 bike for $5,000 to a new reseller called “Northern Trail Cycling”. Your job is to:
1 - Authenticate to Salesforce (with provided credentials).
2 - Create a new account for the reseller.
3 - Create a reseller order and add the line item (the bike).
4 - Submit that order to manufacturing for approval.
For a successful execution, we’re expecting the final result to look like the following:
It seems simple enough, but the devil is in the details. The company’s Salesforce instance is customized: it uses a custom “reseller order” object and flow, a special drag-and-drop feature for adding products, and a hidden “submit to manufacturing” step with no clear labeling. I tested this scenario using several emerging AI-driven automation approaches to see how they measure up.
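For a sense of scale, if every object in this flow were a standard one exposed through Salesforce’s REST API, the whole task would collapse into a handful of calls, roughly the sketch below using the simple-salesforce library. The custom object and field names here are my assumptions rather than the actual schema; the trouble is that the customized order flow and the hidden submission step aren’t exposed this cleanly, which is why the approaches tested below go through the interface (or its network traffic) instead.

```python
# Hypothetical reference point: the same workflow as plain REST calls via the
# simple-salesforce library. Object/field names (Reseller_Order__c, Order_Item__c,
# and their fields) are assumptions, not the actual schema of the customized org.
from simple_salesforce import Salesforce

sf = Salesforce(username="user@example.com", password="...", security_token="...")

# 2. Create the reseller account
account = sf.Account.create({"Name": "Northern Trail Cycling"})

# 3. Create the reseller order and attach the line item (assumed custom objects)
order = sf.Reseller_Order__c.create({"Account__c": account["id"], "Status__c": "Draft"})
sf.Order_Item__c.create({
    "Reseller_Order__c": order["id"],
    "Product_Name__c": "Dynamo X1",
    "Quantity__c": 1,
    "Unit_Price__c": 5000,
})

# 4. The custom "submit to manufacturing" step has no obvious API equivalent,
#    precisely the kind of gap that pushes automation back into the GUI.
```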
Claude Computer Use is a new feature from Anthropic, introduced with Claude 3.5 Sonnet v2. It takes the standard LLM function-calling paradigm a step further by giving Claude an entire containerized desktop environment to “see” and “control.” It can capture screenshots, interpret them via visual/spatial reasoning, and perform OS-level actions like mouse clicks, scrolls, and keystrokes.
From a user’s perspective, you give Claude a high-level task (“Log into Salesforce and create this reseller order”), and Claude attempts to do exactly that, looping through a cycle of capturing a screenshot, reasoning about what it sees, and issuing the next mouse or keyboard action until it judges the task complete.
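A stripped-down version of that loop, modeled on Anthropic’s reference implementation, is sketched below. The model name and beta flag reflect the October 2024 release and may have changed since, and `execute_action` is a hypothetical stand-in for the OS-level helpers that actually click, type, and capture the screen.

```python
# Condensed sketch of the Computer Use agent loop (based on Anthropic's
# reference implementation; details are assumptions from the Oct 2024 release).
import anthropic

client = anthropic.Anthropic()

def execute_action(tool_input: dict) -> str:
    """Hypothetical stand-in: perform the click/type/scroll described by
    tool_input, then return a base64-encoded PNG screenshot of the desktop."""
    raise NotImplementedError

messages = [{"role": "user", "content": "Log into Salesforce and create this reseller order..."}]

while True:
    response = client.beta.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        betas=["computer-use-2024-10-22"],
        tools=[{"type": "computer_20241022", "name": "computer",
                "display_width_px": 1024, "display_height_px": 768}],
        messages=messages,
    )
    messages.append({"role": "assistant", "content": response.content})
    if response.stop_reason != "tool_use":
        break  # Claude considers the task finished (or is asking for help)

    tool_results = []
    for block in response.content:
        if block.type == "tool_use":  # e.g. {"action": "left_click", "coordinate": [x, y]}
            screenshot = execute_action(block.input)
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": [{"type": "image", "source": {
                    "type": "base64", "media_type": "image/png", "data": screenshot}}],
            })
    messages.append({"role": "user", "content": tool_results})
```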
Let’s start with the simplest approach of running Anthropic’s reference implementation without any changes to the system prompt. Here is the beginning of the interaction showing the initial prompt, Claude’s proposed plan, and the desktop it is starting the interaction with.
Observing Claude’s containerized desktop was initially impressive. It opened the browser, visited the Salesforce URL, logged in with the provided credentials, and navigated to “Accounts.” It flawlessly created a new account for Bike Production Company, inputting the right details in the form, then attempted to create a new reseller order. Things were going smoothly until it encountered the custom drag-and-drop interface for adding the bike. The system got stuck trying to perform a pixel-based drag-and-drop.
After a few failures, it tried to find an alternative method (like a hidden “Add Item” button). Its first attempt with the “edit” button did not succeed.
“I notice in the edit dialog there's no clear way to add products. Let me try a different approach by clicking on the Reseller Orders dropdown to see if there are other options”.
It eventually found a way to add new items through the “related” tab — only to fail when the app’s dynamic triggers wouldn’t update the order total automatically. The developers of the SFDC app had never finished this code path, expecting the human user to simply follow the drag-and-drop method. In short, the flow was designed for humans, not for an AI agent.
Claude then tried to locate the “submit to manufacturing” button, which was buried under a custom tab. Lacking prior knowledge of that step, it floundered for several more minutes. Ultimately, I had to intervene, manually add the bike to the order, and point Claude to the relevant button. After roughly 10 minutes and about $0.80 in usage costs, the process still wasn’t fully automated. It was easy to see why Anthropic calls this feature experimental: many real-world guardrails and improvements are needed before Computer Use can be truly production-ready.
Despite its rough edges, the concept is exciting. Vision-based AI for GUI interaction is improving rapidly, and the cost curve for inference is dropping quickly. A recent a16z study suggests that for the same performance, LLM costs are decreasing roughly 10x per year. In principle, future versions of Claude could get faster, cheaper, and more accurate at visual/spatial tasks like drag-and-drop.
Yet the fundamental problem remains that enterprise UIs, especially older or heavily customized ones, are seldom built with automation in mind. Pixel-level interactions are fragile. Minor changes to the layout or dynamic pop-ups can break the entire flow. There is also growing research around visually grounded GUI frameworks, but making these production-grade for hundreds of different workflows is a major undertaking.
One alternative approach is to ignore the “visual bounding boxes” entirely. If your target application runs in a web browser, you can automate at the DOM level, skipping screenshots and pixel-based interactions. While traditional headless browsers like Playwright and Selenium are often associated with testing frameworks, a new generation of AI use-case-focused headless browsers is emerging. These newer platforms build on top of Playwright and Selenium to enable more dynamic, LLM-powered interactions.
BrowserBase is one such example. It functions as an infrastructure platform that hosts and scales browser sessions without requiring developers to manage containers. The interaction pattern revolves around parsing the HTML content of a page into components (e.g., forms, buttons) mapped to their xPaths and passing this structure to an LLM of your choice. The LLM then generates the next set of Playwright code to execute, allowing interaction with the DOM via code rather than traditional GUI clicks. Because it’s purely headless, it uses fewer or no screenshots, keeping context length short and latency lower than a full “desktop environment” approach.
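To make the pattern concrete, here is a rough sketch of that loop in Python. This is not BrowserBase’s actual implementation, just the general shape: collect the page’s interactive elements (real platforms also map each one to an XPath or stable selector), describe them to an LLM, and execute the Playwright snippet the model returns. `call_your_llm` is a placeholder for whatever model client you use.

```python
# Rough sketch of the DOM-to-LLM loop: scrape interactive elements, prompt an LLM,
# and run the generated Playwright code against the live page.
from playwright.sync_api import sync_playwright

JS_COLLECT = """
() => Array.from(document.querySelectorAll('a, button, input, select, textarea')).map(el => ({
    tag: el.tagName.toLowerCase(),
    text: (el.innerText || el.value || '').slice(0, 80),
    name: el.getAttribute('name'),
}))
"""

with sync_playwright() as pw:
    page = pw.chromium.launch(headless=True).new_page()
    page.goto("https://login.salesforce.com")
    elements = page.evaluate(JS_COLLECT)

    prompt = (
        "You control a Playwright `page` object. The interactive elements on the "
        f"current page are: {elements}\n"
        "Return Python Playwright code that fills in the username and password "
        "and clicks the Log In button."
    )
    # generated = call_your_llm(prompt)   # placeholder: any LLM client goes here
    # exec(generated, {"page": page})     # run the generated step against the page
```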
More recently, BrowserBase shipped its StageHand open-source library to make things easier for developers. In the original model, interactions were still very manual, requiring developers to work with the low-level details of the headless browser, including directly writing Playwright code and manually parsing HTML. With StageHand, BrowserBase provides a higher level of abstraction, allowing developers to use intent-based natural language commands like “navigate” or “extract.” This approach also bakes in some processing to convert raw HTML into components, making it easier for the LLM to handle tasks. However, users still need to create their own orchestration layers to connect and manage workflows, as StageHand itself does not offer built-in orchestration.
To test BrowserBase, I used their developer playground, which provides a console for writing Playwright code and an LLM prompt writer to automatically produce those scripts. The idea is to do multi-step navigation — log in, create an account, create a reseller order. But the platform expects you to orchestrate the steps yourself. Starting with the same prompt given to Claude, BrowserBase stumbled because it couldn’t reason in a multi-step fashion. So I proceeded to provide a natural-language prompt for each step and observe whether the generated Playwright code did what was intended. In the screenshot below, you can see the series of prompts and their generated Playwright code.
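For illustration, the per-step output looked roughly like the following, rendered here in Python and reusing the `page` object from the earlier sketch. The Lightning selectors are my assumptions, not the exact code the playground emitted.

```python
# Illustrative per-step snippets of the kind the prompts produced (selectors are
# assumptions about the Lightning UI, not the playground's exact output).

# Prompt: "Log into Salesforce with the provided credentials"
page.goto("https://login.salesforce.com")
page.fill("#username", "user@example.com")
page.fill("#password", "********")
page.click("#Login")

# Prompt: "Create a new account called Northern Trail Cycling"
page.click("a[title='Accounts']")
page.click("div[title='New']")
page.fill("input[name='Name']", "Northern Trail Cycling")
page.click("button[name='SaveEdit']")
```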
In practice, I ran into occasional misalignment between the Playground’s browser environment and the HTML forms that needed to be filled out. Buttons rendered oddly, wait times stretched out, and form fields didn’t load exactly as expected. Despite these glitches, the LLM-generated Playwright code did manage to log in, create an account, and partially fill out the reseller order form. However, the drag-and-drop step for adding the item was again a stumbling block. I spent about seven minutes tinkering with it before giving up. It was clear that the platform is not yet fit for this type of automation; it likely works best for web-scraping use cases.
Skyvern is a more all-in-one headless approach that adds orchestration by default. Unlike BrowserBase, which requires users to define and manage steps manually, Skyvern attempts to handle orchestration out of the box. Under the hood, it operates similarly to BrowserBase — as seen in their open-source code — but also adds a web agent that can orchestrate and reason about steps. This includes an optional vision mode that sends screenshots to the LLM alongside the extracted components and their xPaths to assist in decision-making.
To address the limitations of manual step creation in BrowserBase, I decided to test Skyvern using its managed service, focusing specifically on the Workflow mode. This mode is designed for multi-step processes, and I wanted to evaluate how well it performs with our Salesforce workflow. Unfortunately, the run spent over 15 reasoning steps and $1 of credits stuck in the two-factor authentication (2FA) process. Skyvern’s hosted IP was flagged, triggering 2FA, and there was no way to manually supply a code or share a cookie to bypass the situation. This highlights the ongoing challenge of authentication in enterprise settings and underscores why startups like Anon are emerging to focus solely on authentication solutions for AI agents.
Skyvern’s team positions the platform as suitable for simpler, smaller tasks, with contact-form automation being the primary supported use case. Other potential use cases (e.g. Jobs, Invoices) are still listed as "in training," indicating the platform is starting with simple, use-case-focused automation rather than the more complex needs of enterprise workflows. While promising, it’s clear Skyvern is better suited to less intricate scenarios at this stage of its development.
Headless browsers skip pixel-level guesswork, which often leads to fewer errors and faster execution. But as soon as you hit advanced features like drag-and-drop or complex single-page apps, you may need to revert to partial screenshot analysis or specialized code. Browsers might also run into 2FA and IP blacklisting. For multi-tenant enterprise applications, authentication alone can be tricky, and you may still need custom orchestration layers.
Another limitation is that these platforms generate code dynamically via LLMs each time a workflow runs. Since LLMs are inherently non-deterministic, the generated code can vary across runs, making it challenging to audit or verify consistency. This unpredictability can cause problems, especially in sensitive workflows. While caching generated code appears to be on the roadmap for some platforms, it poses its own challenges: even minor changes in the prompt, or batching during inference, can produce entirely different results, complicating the caching process.
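One mitigation worth sketching (not any vendor’s shipping feature) is to cache generated code keyed on a fingerprint of the instruction plus the page structure, so a stable workflow replays the exact same, already-reviewed snippet instead of asking the LLM again:

```python
# Minimal sketch of caching LLM-generated automation code keyed on a fingerprint
# of the instruction and the observed page structure.
import hashlib
import json

CODE_CACHE: dict[str, str] = {}   # in practice: a store of reviewed, approved snippets

def fingerprint(instruction: str, dom_elements: list[dict]) -> str:
    payload = json.dumps({"instruction": instruction, "dom": dom_elements}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def get_step_code(instruction: str, dom_elements: list[dict], generate) -> str:
    key = fingerprint(instruction, dom_elements)
    if key not in CODE_CACHE:
        CODE_CACHE[key] = generate(instruction, dom_elements)   # only hit the LLM on a cache miss
    return CODE_CACHE[key]
```

The catch is the one just described: any change to the prompt or the DOM produces a new fingerprint, so the cache only pays off for genuinely stable pages.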
Overall, headless browsing can be cheaper and more stable than full GUI manipulation, but it’s far from a magical fix. Many solutions, such as BrowserBase and Skyvern, are focusing on narrower use cases (e.g., forms, data extraction) rather than being the “one platform to automate everything.”
A third approach is to bypass the web page altogether by intercepting the network calls that happen when you click around. If you can capture the requests your browser sends, you can reconstruct those calls in code. In principle, this avoids messy UI-based steps and ensures you’re hitting the same backend logic your application uses. This trend is not entirely new, as reverse-engineering APIs has been around for a long time. However, the novel addition is incorporating an AI agent to reason about the network requests, making the process more intelligent and adaptable.
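In its simplest form, the idea looks like the sketch below: record the requests (and cookies) a real browser session makes, then re-issue the interesting ones directly with your own parameters swapped in. The endpoint and payload here are placeholders, not real Salesforce internals.

```python
# Bare-bones capture-and-replay: record network traffic from a real browser session
# with Playwright, then re-issue a captured request directly with the same cookies.
import requests
from playwright.sync_api import sync_playwright

captured = []

with sync_playwright() as pw:
    context = pw.chromium.launch(headless=False).new_context()
    page = context.new_page()
    page.on("request", lambda req: captured.append(
        {"method": req.method, "url": req.url, "post_data": req.post_data}))
    page.goto("https://login.salesforce.com")
    page.wait_for_timeout(60_000)          # perform the workflow by hand while traffic is recorded
    cookies = {c["name"]: c["value"] for c in context.cookies()}

# Later: replay one of the captured POSTs with new parameters injected
replayed = requests.post(
    "https://example.my.salesforce.com/some/captured/endpoint",   # placeholder URL
    cookies=cookies,
    json={"accountName": "Northern Trail Cycling"},               # injected dynamic parameter
)
print(replayed.status_code)
```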
A few months ago, a product called Integuru launched on Hacker News and garnered attention for its open-source approach and novel methodology. Intrigued by its graph-based approach and its use of AI agents to reason about network requests, I decided to test it. The promise of drastically cutting the time and cost of automation made it a compelling option to explore.
Integuru’s repository is relatively new but shows promise. At its core, it records all network traffic and cookies in Chromium during a task. It then builds a graph of the requests, mapping which pages call which endpoints. Using this graph, it traverses the nodes and passes each one to an LLM, which generates code that replays the same requests, injects your dynamic parameters (like “Bike Production Company”) as needed, and pieces the calls together based on their dependencies. This approach could theoretically streamline the automation process significantly.
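The graph idea itself is easy to illustrate (this is a toy example with placeholder endpoints, not Integuru’s code): each captured endpoint depends on the endpoints whose responses it needs, and a topological walk gives the order in which replay code should be generated.

```python
# Toy illustration of the request-dependency-graph idea: endpoints point to the
# requests that must run first, and a topological sort gives the replay order.
import networkx as nx

g = nx.DiGraph()
g.add_edge("/login", "/create_account")            # create_account needs the login session
g.add_edge("/create_account", "/create_order")     # create_order needs the new account id
g.add_edge("/create_order", "/add_line_item")
g.add_edge("/add_line_item", "/submit_to_manufacturing")

for endpoint in nx.topological_sort(g):
    # in the real tool, an LLM generates the replay code for each node here,
    # wiring outputs of upstream requests into the parameters of downstream ones
    print("generate replay code for", endpoint)
```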
In practice, however, it didn’t work well for our use case, mostly due to context window limitations. The flow might have been too long for the LLM to handle effectively. Even attempts to short-circuit the process by embedding login cookies directly and starting from the homepage did not succeed. While I suspect my low-tier OpenAI API key contributed to these issues, it’s clear that Integuru is still in its early days. The potential is there, but the product requires further refinement. Its demos (like downloading tax documents from Robinhood) worked best on modern web frameworks with simpler flows. Salesforce, with its complicated front end and labyrinthine custom objects, introduced errors.
Beyond these early-stage issues, the method is not yet a universal solution. The need to record all steps limits its flexibility, and it leans toward a more static approach of generating code for specific flows in advance, reminiscent of the rule-based RPA tools popular a decade ago. This highlights a fundamental limitation: while adding AI reasoning to network requests is exciting and can open the door to integrating with systems that don’t have APIs, it’s still better suited to controlled or repeated tasks than to dynamic, diverse workflows in enterprise environments.
No conversation about AI-driven automation in Salesforce would be complete without mentioning AgentForce, Marc Benioff’s big bet on building “agents” inside the Salesforce ecosystem. Unlike other solutions we tested above, which are developer-focused and aim to automate workflows across various systems, AgentForce is positioned as a low-code, embedded solution specifically for Salesforce. It packages many components together and focuses on the entire flow within the Salesforce platform.
The idea is to create agents that fully reside in Salesforce and build upon your customizations. Users define an agent’s general description, assign topics, and link associated actions, which are prebuilt flows defined either in code or through the Salesforce UI. Permissions, user roles, and instructions are then set up to enable the agent to function. In theory, this lets businesses leverage their existing Salesforce data and workflows to drive automation without extensive coding.
I wanted to test AgentForce directly with our eBikes reseller order example. Unfortunately, access to Einstein (AI features) is required, which isn’t available in a free developer account. Instead, I explored their 30-minute playground with the fictional “Coral Beach Resort” app. The test task was to configure an agent to automate the creation of a reservation, a process somewhat analogous to a reseller order in our eBikes scenario.
The setup was quite involved, requiring multiple steps: defining permissions, enabling topics, connecting to prebuilt actions, mapping data fields, and clarifying instructions. While marketed as a low-code solution, it became clear that significant knowledge of Salesforce’s intricacies is necessary. If a company’s Salesforce instance lacks well-documented custom fields and preconfigured action flows, the initial lift can be substantial. Realistically, most businesses would likely need to bring in system integrators or consultants to fully implement and optimize these agents.
AgentForce’s rule-based nature also stood out. Users must carefully map which fields are filled or passed for the automation to work accurately, making it more hands-on than some AI-driven platforms. While this approach ensures precision, it reinforces the dependency on strong Salesforce expertise and existing infrastructure.
AgentForce confines itself to Salesforce’s ecosystem, which has both advantages and drawbacks. On one hand, it’s a packaged solution that unifies authentication, user permissions, tool definitions, and orchestration logic within a single platform. On the other hand, many enterprise workflows span multiple systems, and AgentForce’s siloed nature limits its applicability for broader automation needs. Marc Benioff has stated that hundreds of customers have already signed deals to use AgentForce, so its evolution will be worth monitoring.
From these experiments, it’s clear that current AI agent solutions can do a decent job of reasoning about multi-step tasks and forging a plan. The real challenge is execution in a messy, real-world environment with tribal knowledge about how these systems truly behave. Graphical UIs were built for human interaction, and each enterprise’s custom logic is like a mini black hole of complexity. Even if you skip the GUI for a headless approach or reverse-engineer the backend APIs, you still face edge cases, authentication hurdles, rate limits, or dynamic workflows that throw off the best of LLMs.
The remaining challenges are predominantly engineering problems: building robust tools, integrating deeply with enterprise systems, establishing guardrails, and creating reliable monitoring and orchestration frameworks. These are solvable with dedicated effort and specialization. Today’s LLMs already demonstrate reasoning capabilities far beyond what was available even a year ago, and their cost is dropping rapidly. The focus now must shift to constructing the infrastructure and processes needed to deploy these capabilities effectively.
Yet these difficulties shouldn’t overshadow the steady progress happening. We’re already seeing specialized, vertically focused AI automations (e.g. SDR or customer support agents) that can deliver high accuracy in a controlled domain. As each of these single-use automations matures, we may see them chained together into broader workflows. That might ultimately be how we crack end-to-end automation in large enterprises: by combining multiple specialized agents rather than expecting a single general-purpose agent to do everything. For now, the ROI of building a from-scratch agent might not pencil out for all but the highest-volume tasks.
One lesson from these tests is the importance of specialization. Achieving near-perfect reliability in a single domain (for instance, creating invoices in NetSuite) takes significant fine-tuning. Startups or internal teams that focus on one specialized workflow can deliver a better experience than a broad, generic solution. We are already seeing a wave of “vertical agents” that tackle targeted tasks in finance, logistics, HR, or supply chain. Each agent would integrate deeply, perhaps combining UI automation where necessary with direct API calls when possible, plus domain-specific fallback logic and guardrails.
The big question remains: Will 2025 truly be the year when these agents go mainstream, or are we looking at a longer runway? The technology is moving quickly, and optimism abounds. But just as software engineers didn’t disappear when code generation got better, we probably won’t see “hands-free” enterprise automation for all processes. Instead, we’ll see iterative improvements in specialized pockets, eventually stitching them together as a mosaic of partial automations.
The concept of autonomous AI agents is undeniably compelling, especially in enterprise settings where repetitive tasks abound. The potential benefits — saving time, reducing errors, and enabling employees to focus on more creative and strategic work — are enormous. However, while the foundational capabilities of AI agents are strong, the path to widespread adoption hinges on overcoming engineering challenges in addition to advancing the underlying research.
Building the right infrastructure is key: robust tooling, reliable integrations, and domain-specific solutions with well-defined guardrails and orchestration layers. The complexity of real-world enterprise systems requires specialized solutions, and this is where vertical agents can excel. Concentrating on narrow, well-defined workflows allows teams to refine their solutions to a high degree of accuracy and reliability, addressing the unique challenges of each domain. Over time, these specialized agents could interconnect, creating a broader network of automations.
2025 may well bring impressive advancements and a growing number of pilot programs. Rather than a world running on autopilot, we are more likely to see targeted, highly effective automations tackling specific problems. The journey toward full enterprise automation will be iterative, driven by specialization and collaboration. The momentum is building, and solving these engineering challenges will pave the way for the next wave of enterprise innovation.
(Feature image credits to DALL-E)