

Proactive Agents for the Web with Devi Parikh - #756
The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)
What You'll Learn
- Parikh has 20 years of experience in AI, with a focus on multimodal problems like image captioning and generation.
- Yutori's goal is to enable a new way of interacting with the web, where users describe their needs and AI agents proactively execute tasks in the background.
- The team at Yutori, including Parikh, her husband Dhruv Batra, and Abhishek Das, have been collaborating for years and are now working to bring this vision to life.
- Developing reliable and autonomous web agents requires innovation across the stack, including both the underlying models and the product experiences.
- Parikh believes the current state of the technology is at a sweet spot where it's possible to make significant progress on this problem, compared to the challenges of robotics in the physical world.
- The team is taking an iterative approach, putting out products and learning from user feedback to continuously improve the capabilities of their web agents.
AI Summary
The podcast discusses Devi Parikh's work on building AI-powered web agents and assistants that can proactively execute tasks and workflows on the user's behalf, rather than the traditional model of manual web interactions. Parikh and her co-founders at Yutori aim to create a new paradigm for interacting with the web, where users describe their needs and AI agents handle the execution in the background. The discussion covers Parikh's background in AI research, the motivations behind Yutori's mission, and the technical challenges of building reliable and autonomous web agents.
Topics Discussed
- Web automation
- AI assistants
- Multimodal AI
- Proactive agents
- Product development
Episode Description
Today, we're joined by Devi Parikh, co-founder and co-CEO of Yutori, to discuss browser use models and a future where we interact with the web through proactive, autonomous agents. We explore the technical challenges of creating reliable web agents, the advantages of visually-grounded models that operate on screenshots rather than the browser’s more brittle document object model, or DOM, and why this counterintuitive choice has proven far more robust and generalizable for handling complex web interfaces. Devi also shares insights into Yutori’s training pipeline, which has evolved from supervised fine-tuning to include rejection sampling and reinforcement learning. Finally, we discuss how Yutori’s “Scouts” agents orchestrate multiple tools and sub-agents to handle complex queries, the importance of background, "ambient" operation for these systems, and what the path looks like from simple monitoring to full task automation on the web. The complete show notes for this episode can be found at https://twimlai.com/go/756.
Full Transcript
I'd like to thank our friends at Capital One for sponsoring today's episode. Capital One's tech team isn't just talking about multi-agentic AI. They've already deployed one. It's called Chat Concierge, and it's simplifying car shopping. Using self-reflection and layered reasoning with live API checks, it doesn't just help buyers find a car they love. It helps schedule a test drive, get pre-approved for financing, and estimate trade-in value. Advanced, intuitive, and deployed. That's how they stack. That's technology at Capital One. Join developers from Cisco, Dell Technologies, Google Cloud, Oracle, Red Hat, and more than 75 other supporting companies to build the open tool stack for multi-agent software and trusted agent identity on AGNTCY. AGNTCY, which I recently discussed on the podcast in my interview with Vijoy Pandey, is now an open-source Linux Foundation project where you can help create the protocols, specs, and tools that power next-gen AI infrastructure. Visit agntcy.org to learn more and join the build. That's A-G-N-T-C-Y dot O-R-G. We will no longer be interacting with the web in the same way that we do right now. We won't be clicking buttons, fiddling with forms on websites and browsers. We'll be interacting with the web one level higher in the abstraction, where we're describing what needs to be done, or maybe our assistant is proactively noticing what needs to be done, and agents in the background are starting to execute these workflows on the web, on your behalf. All right, everyone, welcome to another episode of the TWIML AI Podcast. I am your host, Sam Charrington. Today, I'm joined by Devi Parikh. Devi is co-founder and co-CEO of Yutori. Before we get going, be sure to take a moment to hit that subscribe button wherever you're listening to today's show. Devi, it has been a while since we caught up last. Welcome back to the podcast. Thank you. Thank you for having me again. Yeah, five years later, not much has happened at all, right? In some ways, a lot has happened, but in some ways, I'm like, wow, it's been five years. I know, I know, I know. So we're going to be talking a bit about AI browsers and browser use agents and what you're building at Yutori. But I'd love to have you take a few minutes and catch us all up on what you've been up to recently. Yeah, yeah. And I can go a little bit further back than five years just to talk about my background a little bit. So I've been working in AI for about 20 years now. Originally, my PhD thesis was in computer vision. And then over time, I got interested in seeing if we can find ways in which people can interact with these systems more naturally. And that's how I moved towards multimodal problems at the intersection of vision and language. So things like: given an image, can you describe it in a sentence? Can you answer questions about it? Can you have a conversation going back and forth about the content of an image? And this was back in 2014 or so. So it was after that initial excitement of deep learning models, where it was starting to feel like, wait, these models are doing something, stuff is actually starting to work. But it was well before all of the current excitement around Gen AI and LLMs and so on. So these models weren't really as good as they are today. And yeah, so it was kind of fun to tinker on the boundaries of what's possible. And then I started getting interested in seeing if we can find ways in which we can use AI as a tool for creative expression.
And that's how I got involved with generative models for images and videos and music and other modalities like that. I was in academia for a while, faculty at Virginia Tech and then Georgia Tech. And then I was at Meta for about eight years, first in FAIR, then in Gen AI, where I was a senior director leading a lot of the multimodal research efforts there. So models like Emu, Emu Video, Emu Edit for image and video generation and editing. They were shipped across Meta's surfaces. My teams were involved in that, and the multimodal capabilities in Llama 3 were coming from my teams as well. And this was up until early last year, when my co-founders and I left Meta to start Yutori. Correct me if I'm misremembering this, but at some point you were interested in fashion. Am I remembering that correctly? Did you do some papers on like fashion datasets or something? I did. I did. Yeah, I had done a couple of projects in that space with other collaborators. I think the last time we talked, I was right at that edge of looking for the next thing. And I was starting to tinker in the space of, like, can we use AI as a tool for creative expression. So there were a whole bunch of kind of weird little projects that I had done at the time. Some of the fashion ones were more legit. I won't say those were weird. Like I was doing them with other collaborators, but yeah. Tell us a little bit about the problem that you're aiming to tackle with Yutori. So what we're working on at Yutori is towards this vision that we will no longer be interacting with the web in the same way that we do right now. We won't be clicking buttons, fiddling with forms on websites and browsers. We'll be interacting with the web one level higher in the abstraction, where we're describing what needs to be done, or maybe our assistant is proactively noticing what needs to be done, and sort of agents in the background are starting to execute these workflows on the web on your behalf. These agents will be always on. They'll be proactive. They'll be personalized. And so what we are working on at Yutori is both building the underlying tech and the product experiences to sort of usher in this change over time. And how did you settle on that problem? And tell us a little bit about the founders also. One of your co-founders is your husband, Dhruv. Is that right? Yeah, yeah, yeah. And I think actually he has also been a guest on your podcast, if I remember correctly. I could be wrong there, but yeah. He was a guest on your podcast. That also, that also. But yeah, yeah. So, yeah, so I can talk about how we settled on this problem. And then I can also talk about the three of us. So the general space of sort of efficiency, productivity, or I think more generally just sort of designing a life that is more meaningful to you, is something that's personally motivating. And so Yutori, actually the name, it's a Japanese word for the sense of well-being that you experience as a consequence of mental spaciousness. So sort of, you're not trying to context switch every few minutes. You don't have a gazillion things coming your way. Things that you would rather not be doing are being taken care of for you in the background. And you have the space and time to focus on whatever it is that is meaningful for you. So that general space was personally motivating. And then on the technical front, it was feeling like these web agents, these digital assistants, were something that wasn't quite ready yet.
Like at the time, you couldn't just take GPT, put it in a for loop, and sort of expect these agents to do sort of workflows on the web reliably, autonomously. At the same time, it didn't seem like it was going to take 10 years before we can start getting that reliability, unlike something like robotics, for instance. And so it felt like that sweet spot where, with our research backgrounds, with the kind of team we can put together, we'd be able to make a solid dent on the underlying capability and sort of bring to people actual product use cases that they're using on a day-to-day basis. And so that felt like a sweet spot. And so, yeah, that's how we arrived at that. Nice, nice. And speaking of robotics, I found that episode with Dhruv. That was talking about his work building maps and spatial awareness and blind AI agents. And that was only two years ago. I see. Yeah, yeah, yeah, yeah, yeah. So exactly. So he was working in robotics. Like he was leading a lot of the embodied AI efforts at FAIR. And so there are certain sorts of sequential decision-making, these agents taking actions, there being consequences to the actions. A lot of those things carry over. But a lot of the challenges of these entities just being physically around you in the physical world are sort of taken out when you're building web agents. And so, yeah, that has influenced our perspective. And so a little bit about the three founders. Yeah, the three founders are myself, Abhishek Das, who goes by Das, which is his last name, and Dhruv Batra, who we were just talking about. Dhruv and I are married. We've worked together the whole time we've known each other. We've had the same employer 19 of the 20 years that we've known each other. Our offices have been next to each other. Our desks have been next to each other. We've had the same managers, the whole thing. Das was Dhruv's PhD student at Georgia Tech. And because Dhruv and I ran our labs together, I have collaborated very closely with Das as well. And at this point, the three of us are just really good friends. We had talked about starting something together going back six, seven years. We even had this weekly dinner that was originally called Brainstorming, because, like, yeah, we were just brainstorming on, like, if we did this, what would we do it on? Over time, we were just socially hanging out. We were brainstorming for six, seven years straight. But yeah. So does it feel like the stakes are really high for this to be the idea and no pivots are allowed? I don't think so. I don't think so. We do genuinely believe in this vision that, like, yeah, how we're interacting with the web is going to change drastically, and I think a lot of others would also buy that. I think the devil is in the details of how we execute on it, what are the product use cases that we bring to the market as we go along, and those are all experiments, right? We put something out there, learn from it, tweak it, and go from there, yeah. So let's maybe dig into that broad space and try to cover why folks are so excited about automating the web with browser use agents, how you see what you're working on versus some of the other things that are out there. OpenAI's Atlas comes to mind, Perplexity's comes to mind in terms of browsers, but there's, like, you know, probably a dozen or two more, if not more than that, if not hundreds. You know, lay out the way you think about the space for us. I think there's a couple of dimensions that are worth commenting on.
I think one is sort of what part of the stack you choose to focus on. And so I think there are a good number of efforts that are focused on the underlying models and the underlying tech and sort of getting those to be reliable and things of that nature. And then there are efforts that are focused more on sort of the product experiences themselves. But I do think that in this space, at least at this point in time, given where the tech is, you need to be innovating across the stack. You need to be pushing on the underlying model capabilities and architectures and in tandem thinking through what are the product experiences that you can deliver on reliably, based on what you know the status of the tech is. And those sort of need to go hand in hand. If you do one or the other, that's not going to be sufficient. If you're focusing just on the tech, you sort of have these tech demonstrations that are awesome to look at, but then no one's using this on a day-to-day basis. And if you sort of in an isolated way think through product experiences, not grounded in the modeling capabilities, then you kind of sort of, yeah, end up promising too much that doesn't live up to it the first time the user tries it. And so, yeah, you do need to be pushing across the stack. So that's one. And the second, more relevant to what you were talking about with AI browsers, whether it's Atlas, whether it's Comet, whether it's Dia, that kind of sort of has a little bit of the flavor of taking how we are interacting with the web today, which is through these browsers, and sort of enhancing that with AI features, which is useful to do. But I think the way we think about it is more that it shouldn't even look like this, that, like, yeah, we shouldn't even be in browsers the way they are today, looking at web pages the way in which we do today. Yeah, like, there's no reason, if I am trying to get x done on a website and you are trying to get y done on the website, given that we're trying to do two different things, the website should just show up to us in two different ways, because our intent is different, the purpose is different, right? There's no reason you and I are both looking at the same website. And the second bit is that we are strong believers in these agents being in the background, sort of out of your space. And the way in which that's relevant is a couple of different things. One is, let's say you sort of shut the lid of your laptop down and walk away. If the agents are working on your device, what happens to them? So that's one. And the second is, there is a lot of value to these systems inherently being multi-agent, where there's a whole bunch of agents in parallel executing on these workflows for you. And if they're sort of all in your browser, that's not going to scale well. It's just going to kind of take over your device. So for a few different reasons, the kinds of product experiences that we envision are very different from sort of having AI features in existing browsers. And so is the implication of that that you have to tackle these experiences use case by use case? And does that mean that you end up releasing a bunch of kind of disparate products that serve different needs? Or are these, you know, experiments or steps that lead you to some broader platform approach that can be applied across use cases? Yeah, I think it's more the latter. So I can talk about the first product that we put out there and you'll see what I mean. And I'll talk more about what could follow.
So the first product that we put out there is called Scouts. And Scouts monitor the web for anything that you care about. So if you sort of want to stay updated on any new AI announcements, you can set up a scout for that. If there's a certain product that you're looking to buy and it's out of stock or the price is too high, and you want to be notified whenever it's in stock or the price is below a threshold, you can set up a scout for that. If you're looking for internships or jobs and you want to be notified anytime there's a new listing of a certain characteristic, if you're looking for apartments, same thing. If you're doing sort of business and competitive intelligence, you want to be notified anytime your competitor's product gets a bad review, because now you can reach out to that person to sell. Yeah, so anything that you are interested in wanting to stay up to date on on the web, you can set up a scout and it will notify you. So what's relevant here is that it is not promising the world, right? It's not saying that I'm going to make the reservation and I'm going to purchase the product or I'm going to sort of take care of all your digital chores. It's very specific. It monitors the web for any information that you might care about. So it's narrow in the capability, but it's very general in the domain, right? You can be monitoring anything on the web that is of interest to you. And so the way we think of it is, over time, we will add more capabilities to this. Right now it just monitors. In the future, it can let you know that this is available, do you want me to buy it for you? And if you say yes, then it can go ahead and do that. And so over time, it can do a larger and larger chunk of the workflow that you're trying to get done, while all along being fairly general in the domains on which you are applying this. So that's the approach that we've taken. I'd love for you to dig a little bit more deeply into the way you've approached Scout, kind of the technical approach and architecture, but also the challenges that you run into. And what is coming to mind for me is that, you know, as you describe this use case, like, it both sounds easy, you know, hey, we've had notifications like Google Alerts forever, but also, having, you know, tried to build that kind of thing with AI, it's also really difficult. Like, even simple things. I did something to, like, scrape a forum for car postings, and just the cron jobs and stuff was a pain in the butt, like everything was a pain in the butt, and this was just a forum, not something where, you know, someone was actively trying to prevent me from doing it. But then even then things would break, the LLM would, you know, misinterpret things, it wouldn't follow instructions very well. So talk a little bit about, maybe the question is, like, help us understand the nuances and why this is an interesting experiment for you. Yeah, yeah. And what you're describing is something that we've heard from a bunch of our users frequently, that, like, most of them haven't, but some of them have tried to sort of put together some tools on their own for these capabilities, and either because of infra challenges or, like you said, sort of the LLM won't do what you expect it to do. Even things like having the context of what it has already told you in the past and making sure that whatever it tells you next is actually new relative to it.
All of this, yeah, needs thought and orchestration. The other bit here that I think a lot of people sometimes miss is that information on the web, even publicly available information on the web, is often behind lightweight forms, right? So if you're thinking about, like, your local tennis court reservation, and you're like, let me know whenever there's an opening for Monday at 7 a.m., it's not like there's a list somewhere for all dates and all times, right? You have to go pick the date, you have to go pick the time, and only then you can see is this available or not. And so even for this monitoring capability, you often need to have a browser use model, a navigator that's automatically navigating on the website to find the information that you need. And so the way in which we've orchestrated Scouts is that it has access to a whole bunch of APIs, a whole bunch of MCP servers. So whenever information is available in a very agent-friendly way, that is what it would choose to use. But in situations, which is sort of the large, heavy tail of websites on the web, where information is sitting behind some actions that you need to take on the website, for that, we will spin up a remote browser, we will open up the web page, and we use this navigator model that we've trained in-house that will click on buttons and fill out lightweight forms to get you the information that you need. So that's one: just access to information often has this complexity associated with it. The other is sort of coverage is important, right? When you're saying that, like, let me know whenever there is news about blah topic, that news could be anywhere. It could be on social media. It could be sort of more traditional news articles. It could be conversations on Reddit. It could be anywhere on the web. And so having the system built in a way that it's sort of optimized for just a lot of persistence, optimizing for coverage to start with. But then you don't want to send sort of a brain dump of information to the user, right? By the time you make it a user-facing report, you do want it sort of high precision, well-summarized, easy to read, easy to digest, presented in a way that it has the context of what it has told you before. And so all of these are various pieces of the architecture that we've built. And it's a very general architecture. When you talk about access to APIs, MCP servers, this browser use model, all of that, you can see how it powers Scouts, but then you can also imagine all the other things this architecture can do over time. And so that's been important to us, to build the underlying tech in a fairly horizontal way, but sort of bring product experiences to users for specific things where we can set expectations we know we can deliver on reliably and sort of build that trust over time. And what are some of the MCP servers, or what are examples of the MCP servers that you're relying on? So it's tools that go across the board. So things like Airbnb availability, or sort of various social media, like things on LinkedIn or Twitter that are publicly available; the Scouts can't go behind auth walls. Search APIs for just, yeah, regular web search, for the weather. Yeah, there's 80, 90 of these individual tools that our stack has access to. And then the heavy tail, the rest, is where we spin up these browser use models. And the browser use models that you've created, do they navigate auth walls? Not currently, not in what we've shipped in the product right now.
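To make the routing idea above a bit more concrete, here is a minimal Python sketch of an orchestrator that prefers agent-friendly tools (APIs or MCP servers) when one covers the query and otherwise falls back to a remote browser session driven by a navigator model. The names (Tool, BrowserNavigator, answer_query) and the matching logic are illustrative assumptions, not Yutori's actual interfaces.

```python
# Hypothetical sketch of "use an agent-friendly tool when one exists, otherwise
# fall back to visually navigating the site with a browser use model".
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    matches: Callable[[str], bool]   # does this tool cover the query?
    run: Callable[[str], str]        # fetch structured info for the query

class BrowserNavigator:
    """Stand-in for an in-house model that clicks, types, and scrolls on screenshots."""
    def run(self, query: str, start_url: str) -> str:
        # In a real system: spin up a remote browser, take screenshots, predict actions.
        return f"navigated the site for: {query}"

def answer_query(query: str, tools: list[Tool], navigator: BrowserNavigator, start_url: str) -> str:
    # 1. Prefer any registered API / MCP-style tool that claims coverage of the query.
    for tool in tools:
        if tool.matches(query):
            return tool.run(query)
    # 2. Otherwise handle the long tail by driving the website itself.
    return navigator.run(query, start_url)

# Toy usage: a weather tool handles weather queries; everything else hits the browser.
tools = [Tool("weather", lambda q: "weather" in q.lower(), lambda q: "sunny, 21C")]
print(answer_query("weather in SF tomorrow", tools, BrowserNavigator(), "https://example.com"))
print(answer_query("tennis court opening Monday 7am", tools, BrowserNavigator(), "https://example.com"))
```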
The model itself has the ability to do that. And over time, that is something that we'll put out there. So you can authorize the scout to sort of monitor various feeds that you have access to and authorize it to log in on your behalf. But that's not something that's in the product right now. Talk a little bit about the way you've approached the model for this. I've had some conversations with folks, Dan Jeffries is one that I remember, where he talks about the complexity of really simple things that you think would be easy, like going to a Google Flights form and trying to get the model to choose the right date from the pop-up being really difficult. Maybe you have avoided that by sidestepping the whole travel booking use case, but maybe not. I guess if you're doing tennis, you probably still need that, perhaps. But yeah, talk about that kind of thing that you've run into. Yeah, so I think anyone who's tried to train these navigators, these browser use models, all sort of bond over how hard date pickers are. So that is a recurring example that comes up. Yeah, we do. You can use a scout to monitor the availability or price of a flight and things like that. And yeah, so the way we've chosen to train the model is by relying on visual information. So we take a screenshot of the web page that you're looking at, and the model is trained to predict the next action that it could take, whether it's clicking, typing, scrolling, and so on: clicking the radio button, clicking on the appropriate place in the date picker, that kind of a thing. And so, yeah, it approaches it in a very sort of visually grounded way, which is what lets us deal with date pickers and other things like that, which would be much harder to do if you use sort of the underlying DOM information to try to build these agents. And a lot of people do approach it that way. And yeah, we don't think that that's going to generalize well if you do it that way. This was a little counterintuitive to us. When we first started working on training these models, our hypothesis was that the web page is rendered for human consumption. There is no reason a machine needs to be perceiving the web page through that visual modality. The web page is just a rendering of the underlying DOM information. And so we should just be using that directly. Like, why go through this added layer? But over time, we just realized that's very, very hard to do reliably. Like, all different webpages are built very, very differently. Two webpages can visually look very similar, but have very different underlying DOM information. And so over time, we just realized that consuming these webpages in the same way that humans are, by just looking at visual screenshots, is way more reliable and way more general. And we can just sort of scale. We can just have more and more data of that kind and make these models better and better. And are you post-training a generic VLM for browser use, or are you post-training a browser use model for your specific use case? We are post-training a generic VLM for browser use. Are there off-the-shelf, open-source, generic browser use models? Like, do we know how to, like, deliver a browser use agent that is, you know, generally useful and then post-train it? You know, or is that something that, you know, part of the question is, it seems like if it doesn't exist, it would exist sometime soon, like it would be a thing that you're doing that is likely to be commoditized at some point.
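The screenshot-grounded control loop Devi describes can be pictured as perceive, predict one action, execute, repeat. The sketch below is a rough illustration of that loop, not Yutori's code; the Action fields, the VisualNavigator stub, and the run_episode helper are all assumed names.

```python
# Illustrative perceive-predict-act loop for a visually grounded navigator.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # "click" | "type" | "scroll" | "done"
    x: int = 0         # pixel coordinates for clicks, grounded in the screenshot
    y: int = 0
    text: str = ""     # text to type, if any

class VisualNavigator:
    def predict(self, screenshot: bytes, task: str, history: list[Action]) -> Action:
        # Placeholder for a VLM fine-tuned to map (screenshot, task, history) to one action.
        return Action(kind="done")

def run_episode(navigator: VisualNavigator, take_screenshot, execute, task: str, max_steps: int = 30) -> list[Action]:
    history: list[Action] = []
    for _ in range(max_steps):
        shot = take_screenshot()                      # the page as the model "sees" it
        action = navigator.predict(shot, task, history)
        history.append(action)
        if action.kind == "done":
            break
        execute(action)                               # click/type/scroll in the remote browser
    return history

# Toy wiring with stubbed browser functions.
trace = run_episode(VisualNavigator(), take_screenshot=lambda: b"", execute=lambda a: None,
                    task="pick May 3 in the date picker")
print([a.kind for a in trace])
```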
How do you think about that? I do anticipate that, and I'm sure over time, these VLMs also have been trained on various kinds of data, including computer use screenshots and website screenshots, in addition to other visual information. And so I do think that sort of grounding capabilities in the context of computer use and browser use have been getting better over time. And yeah, I anticipate that will keep happening. I do think there is continued value to us post-training, for, I would say, maybe three reasons. One is we can sort of cater them for the kinds of use cases that are showing up for our product. And so we can push on reliability there. The second is, as the capabilities of the underlying foundation models get better, out of the box, they'll be able to do longer and longer workflows. But then if we post-train, you'll be able to push the ceiling of what we can do at any given point in time, right? Like, we can just go after even more complex workflows than what's possible. And if you think about the kinds of things we do on the web, the sky is the limit of how complex and how long these workflows can get, right? And the third is cost reasons. We are able to serve our product on our own browser use model, our own navigator, and so that keeps our costs in check way more than it would if we were sort of routing everything to one of these existing API providers. And are you using something like a Qwen as the base model? We are using a Qwen model right now. We experiment with any new models that come out. And so, yeah, over time, we might swap that out. But right now we are using Qwen. You know, when I think about, like, how to evolve a system like this and make it better, you know, beyond talking to your users and understanding what they're trying to do and the problems they're trying to solve, it seems like, you know, there's a pretty easy loop to follow, which is, like, sort by the domains that they're trying to have this thing use and, you know, just go from the top of the list to the bottom of the list and make your product work better in those domains. Like, is that part of the way you're thinking about evolving the system? The model that we've trained is fairly general. It hasn't been overfit to any specific domains based on where we've seen usage. It is sort of trained on these monitoring and information-seeking tasks. So the nature of the task is playing a role in the kind of data that it's seen during post-training. And yeah, there is a consequence of the usage that shows up there, but we don't approach it as, here are the top 10 websites that we care about, let's make sure the model works well there, and then keep expanding. We do have a whole bunch of evals across the stack, all the way from the individual step level: given where the navigator is right now, this is the action that it decided to take next, was that a reasonable thing to do or not, right? So literally one-step prediction accuracy. Up to sort of the trajectory level: these are all the things that it did along the way, net, did it get the right thing done? All the way to sort of the level of the report that we send to the user, which isn't just about this navigator, it also uses a whole bunch of other tools. And so things like, was the information relevant to what they were looking for, were there enough citations in the report that we provided so that users can click on those to find out more. Were those citations specific enough? Were any of the links broken? Which would be a bad outcome to have.
Is it repeating anything that it had already said before? So all of these are more sort of user-facing factors that we consider. And so we have evals. And a whole bunch of these are automatic evals, but then we also have human evals for a lot of these things. So that is what we track to kind of make progress and make sure quality is up to the mark. And then there's a bunch of feedback that we get from our users. Right now, the product is available behind a waitlist. And so we've been quite careful with letting in cohorts of users so we can get feedback over time. And there's a bunch of stuff that we've put out in the product in reaction to that user feedback. And so, by the way, if any of the listeners do want access to the product, they can sign up on the waitlist at yutori.com. And if they mention TWIML for where they heard about it, happy to prioritize their access. One thing that you mentioned was that the end result of this process is a report. That feels like a very static and kind of dead-end-y way to terminate this interaction, whereas I might want to use this thing as the beginning of some other process that I kick off, or to further work with this information. Like, for example, if I'm summarizing, or fetching and summarizing, competitive feeds, I might want to then take that off of your platform and put it into some broader report that I create that includes other things, and maybe, you know, put it on Slack or do something like that. How do you think about, is there a way that you think about the scouts as, like, you know, the starting place for other processes as opposed to a report generator? Very much so. Very much so. So we very much think of scouts as kind of that starting point, that it's a read-only action. It's monitoring information for you right now. It doesn't change the state of the world, but the intent very much is that over time it will do more and more for you. So like the example that I was giving you earlier, the product is now available and it lets you know of that, but now it's on you to have to go purchase it, click on the link and do that. And it tries to make it easy. It will try and give you a link very directly to the product that you had asked for so that you can click on it and buy it easily. But over time, it should just be like, I found it, do you want me to buy it for you, you say yes, and it goes ahead and does it for you. The second bit here is that we have scouting APIs and webhooks available for people who are interested in that. So there are some users who take these reports from scouts and sort of build other workflows on top of it, other custom workflows on top of it. And so that's an option as well. And the third thing is that for some of these things, it is a bit of a one-off, that, like, you were monitoring the product for a while, and then once you found it and bought it, you no longer have use for the scout. And there it could be relevant to sort of walk the user through sort of the next thing that they may be interested in doing. But there are other things where, if you're like, let me know of any updates on various technical topics or AI news or anything like that, that's just sort of an ongoing thing that you're monitoring over time. There is no natural endpoint to that. And yeah, so that's also the other one. And another bit is that we recently shipped this feature where you can respond to the scout report and give it feedback in terms of how it should do this differently going forward and what way you want the report to be different.
And so that makes it a little bit more of sort of an interaction between you and your scout. And it's sort of getting higher and higher signal for you over time. But a lot of users have asked us, like, can I just chat with it, like it found me all of this cool information and I just want to learn more about it. And so that's something that over time we might prioritize. And so the feedback that they're giving is going into the context of the model and shaping the output the next time, kind of like a system instruction or like a personalization instruction? Exactly, exactly. The feedback that they're giving is sort of being shared, like it's a part of the agent loop instruction the next time it runs. So there's a higher chance of it being a higher signal for you. I want to return a little bit back to the, you know, like the domain-specificness of it versus the bitter lesson, you know, just collect a bunch of data and train this generalized thing. You know, I often think about how I sometimes don't know how normal people use the web. Obviously, they do. But, like, for example, there's all these weird things. Like, you know, if you have multiple Google accounts, like, you can't really use them unless you know that you can go into the URL and change the U1 to U2 or U0 to U1, kind of thing. And it's, you know, less frequent than it was two or three years ago, maybe, or eight years ago, whatever, but it's still not infrequent that, like, I'm trying to solve a problem or to get something basic done on a website and I'm opening up, like, dev tools and fiddling with their CSS or something like that. Like, are you training the browser use agent to do those kinds of weird hacky things generally? Are you training them to do those things? Do you have, like, heuristics that you use, you know, for specific important sites where you need that kind of thing? Or are you just, you know, kind of carving that out as not attainable at this point in time? Yeah. So the way we've trained these models is we started with supervised fine-tuning, right, where we got people to go to web pages, click around, get various tasks done, and then use that as training data to train our model. And then for a little while, the more data you have, especially if you're making sure quality is high, sort of accuracies keep going up, but then you eventually start plateauing, right? And then that's where we started using something called rejection sampling, where the agent goes out and sort of tries to complete a trajectory, we have automatic ways of telling whether this was done well or not, whether this was done correctly or not, and then if it was done correctly, we use those trajectories, sort of add them back to the training data, and keep going from there. That also eventually starts to plateau and sort of becomes a natural transition into reinforcement learning. And that is where we are at with our models currently. Yeah, we've done supervised fine-tuning, we've done rejection sampling, and we're doing reinforcement learning right now as we speak. But that's where I think there is potential for the model discovering certain hacky things, whether or not people do it that way. And yeah, so that might emerge. We haven't seen obvious signs of that so far, but this is when there is potential for that to happen. But while we were still doing supervised fine-tuning, unless our annotators knew of these hacky ways of doing various things on these websites, the model wouldn't have picked up on that.
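The rejection-sampling stage described here, sampling trajectories and keeping only the ones an automatic checker marks as successful, can be sketched in a few lines. The functions below (sample_trajectory, is_successful, rejection_sampling_round) are hypothetical stand-ins for the navigator rollout and the automatic verifier, under the assumption that verified successes get folded back into the fine-tuning data.

```python
# Toy sketch of one rejection-sampling round: roll out, verify, keep only successes.
import random

def sample_trajectory(task: str) -> list[str]:
    # Placeholder policy rollout; a real system would run the navigator in a browser.
    return [f"step-{i}-for-{task}" for i in range(random.randint(1, 4))]

def is_successful(task: str, trajectory: list[str]) -> bool:
    # Placeholder automatic verifier, e.g. "did we land on a page showing the availability?"
    return random.random() < 0.3

def rejection_sampling_round(tasks: list[str], samples_per_task: int = 8) -> list[dict]:
    kept = []
    for task in tasks:
        for _ in range(samples_per_task):
            traj = sample_trajectory(task)
            if is_successful(task, traj):          # keep only verified successes
                kept.append({"task": task, "trajectory": traj})
    return kept                                    # folded back into the fine-tuning dataset

new_data = rejection_sampling_round(["find a Monday 7am tennis slot", "check the price of item X"])
print(f"kept {len(new_data)} verified trajectories")
```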
Presumably you haven't found it necessary to do very site-specific things to make progress. Like, the models are able to rely solely on this general knowledge to do most of the things you want them to do. Yeah, yeah. So far, that is what we've been seeing. Just from a sort of execution standpoint, when we're doing these experiments, like on these exploratory things like reinforcement learning, when we were first getting started there, just to keep it scoped, we might start with one website and make sure that this is actually working on one website before we expand to others. So for that kind of a thing, we've done experiments in narrow domains as well, but the model that's in production is one model that's been trained across many different websites in a fairly general way. And I do think, I mentioned this earlier, but I do think it's quite important, the fact that we are training on screenshots of web pages, the fact that we've chosen to go with the visual modality and not the underlying DOM, is what has enabled us to get to this point. Earlier in our journey, when we were experimenting with the DOM and sort of trying to write parsers for it to extract relevant information, that is what was taking this down a very specific website route, because you need to parse that information in different ways for different websites. And that's when we realized that, one, it's just not reliable. There are constantly edge cases that keep coming up, even from one website. And then when you try and go to a different website, you kind of have to just start from scratch. So I think that shift that we made to using vision was quite significant in what let us, yeah, just scale and have one general model across websites. Can you talk a little bit about how you structure the problem for RL tuning? So it's similar to what I was saying, that we have these automatic rewards that we can set up, at least for certain tasks. So for example, like whether something is available for sale or for reservation or whatever the case may be, if you've done the appropriate actions and you find yourself on the page that is now showing that availability information, you can look at just that page and know whether the model got it right or not. Meaning if you can characterize the query as an availability search, you can tell from the resulting screen whether they got to a presentation of options, that kind of thing? Whether they got, yeah, so there are some tasks where you can look at the final state of it and automatically decide whether the response that the model gave you is correct or not. You don't need to look at the whole trajectory along the way to be able to assess that. And so those kinds of things, for example, set themselves up well for having these rewards that you can then use to train the model for those. Do you have a sense for what's beyond Scout? Like, do you think of these as serial experiments or parallel experiments? We are a relatively small team so far. And so it restricts how many things we can do in parallel. We do think we are on to something with Scout. So it wasn't obvious to us when we first put it out. Like, we all thought this makes sense and it's valuable.
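As a concrete illustration of the outcome-based rewards described above, here is a toy Python sketch that scores an availability-style task by looking only at the final page the agent ended on and comparing it to the agent's answer. The FinalState fields and availability_reward function are assumptions for illustration, not the actual reward used in training.

```python
# Toy outcome reward: grade only the final state, not the whole trajectory.
from dataclasses import dataclass

@dataclass
class FinalState:
    page_text: str       # text extracted from the last page/screenshot
    agent_answer: str    # what the agent reported, e.g. "available" / "unavailable"

def availability_reward(state: FinalState) -> float:
    page = state.page_text.lower()
    # Ground truth is read off the final page itself.
    truly_available = ("available" in page) and ("not available" not in page)
    predicted_available = state.agent_answer.strip().lower() == "available"
    return 1.0 if predicted_available == truly_available else 0.0

print(availability_reward(FinalState("Court 3: Available at 7:00 AM", "available")))       # 1.0
print(availability_reward(FinalState("Sorry, not available on that date", "available")))   # 0.0
```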
But when we put it out there, it was unclear if people would get it. Like, this is not ChatGPT, right? This is not something that you go talk to where it's instantly, real-time, talking back and forth. It's not Google Search. And so it's this thing where you're essentially monitoring for information that's going to happen in the future. And so, like, now when I describe it, I think it makes sense and everyone gets it, but at the time we weren't really sure if, like, this is a thing that will land for users, and whether this is something that shows up in their lives frequently enough, is this valuable enough. And so, yeah, but I do think, with sort of the users that we've seen, the retention that we've seen, conversions to the price plan that we've seen, with all of that, it does feel like we are onto something here. But we don't want to sort of fall for that thing of, like, that local minima where this is good and let's just keep optimizing in its local neighborhood. There's so much more to do in terms of getting closer and closer to, like, taking care of people's digital chores. And so, yeah, there's a bunch of new capabilities that are coming at different points in time, including being able to monitor information that's sort of behind auth walls, actual completion of these tasks on the web for you, and sort of integration with other tools and your context, more and more of your context, over time. What are some of the ways that you use the tool personally? I use it for all kinds of things. So things like, even sort of narrow things, like whenever an AI researcher announces that they're leaving but they haven't announced where they're going next. It's sort of a nice way of just keeping up with what's happening in the ecosystem with people moving around. That's kind of an interesting use case. It's like an active bookmark, like keep a tab on this for me and let me know when something new happens around it, as opposed to, like, this is a thing that's continually happening and give me summaries of all the new things. Yeah, and this is, like what I just said, like when you think about Google Alerts, right, it's very keyword-based. And so you could set up an alert for somebody's name, for example. But if it's this kind of a thing, that whenever an AI researcher, like, posts that they've left their current job and hasn't said where they're going next, how would you set up a Google Alert for something like that, right? So, yeah. I often set some up for certain current events. Like, there was this Air India crash that had happened in my hometown some number of months ago, and so in those weeks I just wanted to know of all updates that were happening around that, and so I had set up a scout for that. I have one set up for just sort of the latest hot take that's causing a lot of debate in San Francisco, and it sends me a note every day after lunch, things of that nature. I've recently gotten into air dry clay and, like, various projects that I can do with it. But I'm not going to have the time to go to the studio or something. Like, air dry clay? So, like, ceramics and pottery, but where you, yeah, you don't need to go to a studio. It's just things that you can make at home. And so I'm always looking for ideas on quick things that I can do. Not a lot of overhead. It's not going to take me a lot of time. And so I have a scout set up for just, like, sending me posts from Reddit and other places where there are neat project ideas for that.
Yeah, a bunch of people have used it for apartment searching and they found that to be really useful. We have recruiters who used it for lead generation. We have sales teams using it for lead generation. So we kind of see a pretty broad spectrum of usage across personal use cases and fairly deep usage even in the context of work. How have you approached, like, setting up an ingestion architecture? Like, are you, like, crawling Reddit, or are there APIs, or, like, X, do you have an X feed? Like, are you doing that? And this is maybe kind of the other side of the, you know, domains question, but, like, do you have the top X sources, you know, not X, Twitter, but the top N sources, like, prefigured to come in, and then you have that accessible for what different users are trying to do? Or does the agent need to figure out what is relevant for a particular user's query and then go find that, and you're maybe, you know, searching Google 50 different times for different use cases or something? Yeah. So we have 80 to 90 tools that are made available to sort of this orchestration, to the agents, that they can use if they think it's relevant to the query. So we have this predefined vocabulary of tools that it has access to, but then based on the query, it's going to decide which one of these makes sense to use, how many times to use it, which ones can be used in parallel all at once, and things of that nature. And so there is a little bit of this flavor, when we were building it out, that maybe we had given it access to some tools and then we saw certain queries and we were like, wait, if we had this other tool, that would make it much easier. Then we sort of just kept adapting that tool list over time. Yeah. And what's maybe not so intuitive to people is that when you start getting to 80, 90, 100 tools, it's not very reliable to just give the orchestrator access to all of these tools at once. The context blows up and it doesn't know what to do. Exactly, exactly. And so we've had to build it out more hierarchically, where there are certain sub-agents and those sub-agents have access to certain tools. And so there's, yeah, there's things that we needed to do to make that reliable at that scale of number of tools. Can you dig into that aspect of it a little bit more? You spoke earlier about multi-agent types of, you know, collaboration and interactions. Is that the primary place that you're using, you know, multiple agents, or are there other ways that you're using that? It's at every step along the way. So you can basically think of scouts as agentic search with a cron job around it, with access to this history of what it has already told you in the past so far. There is a sort of contextual cron job around agentic search. And so if I think of the agentic search piece of it, it starts off with a bit of a to-do list and a plan of what it's going to do to address the query that has come in from the user.
It will send out multiple tools in parallel based on this plan that it has come up with, and then, based on what it gets back from these tools, it's going to decide what it should do next. So that to-do list is adapting based on the outcomes of what it found after that first step, right? So each one of these steps is multiple agents going out, coming back with the information that they have, and based on that, the orchestrator deciding what it's going to do next. And so no two queries have the same workflow that is being executed on, right? And each of these steps is actively engaging with the web, as opposed to sort of having an index of the web and using just that, right? Because this is future-facing, you're looking for new information, including in sort of heavy tails of websites, like the tennis court reservation or so on. And so, yeah, that's what it looks like. In terms of the kind of backgrounding of these agents and cron jobs, like, any interesting learnings or experiences in getting all that working? I think what's interesting in that aspect is that, if you think about it, for each scout query, you're executing on that same thing over and over again. But the information on the web may have changed between the last time you did it and you did it this time, right? And so there is a good amount of redundancy that you could be taking advantage of. And we've only sort of scratched the surface of that so far. And the other relevant bit is... What are the chances? Redundancy in the sense of what one user has requested being useful for another user, or? Redundancy in the sense that even for the same user, for a particular scout, every time it does the search, right, the cron job nature of it, within that cron job it's executing on the same query over and over again, right? So for the same user, for the same scout, there is a redundancy across these scout runs, across these agentic searches that we're doing over time as part of the cron job. And so it's interesting to think about how to exploit that. You don't want to go all the way to one extreme where, the first time you run it, you encode that into a deterministic workflow and then you just execute that over and over again, because the information on the web may have changed, right? And so that same workflow may no longer work, right? So you don't want to go to that extreme end of trying to take advantage of the redundancy. But at the same time, completely ignoring the fact that there is redundancy here also doesn't seem right. And so figuring out that sweet spot is interesting. And I think we've only scratched the surface of doing that. And the other bit that is relevant to this is that some information is fast-changing and some information is very slow-changing, right? Or some information only tends to happen during the day. Like, if you're looking for AI releases, it's unlikely to be happening at night time in the US, right? It's more likely to happen sort of in the morning, maybe morning Pacific time in the US. And so having these priors of when it is worth checking again, based on just priors of when this kind of information is likely to change, is another thing that would be relevant. But again, we haven't exploited that a whole lot so far. We're coming close to the end of the year. What are your predictions for next year in this space? How do you think it will evolve? I don't know. I'm a little skeptical of predictions in general.
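One way to picture the check-frequency priors Devi mentions is a small scheduler that decides, on each cron tick, whether a given scout is due for another agentic search based on how fast its information tends to change and when it tends to change. The sketch below is purely hypothetical: Scout, check_interval, active_hours, and should_check are illustrative names, and Devi notes they have not exploited this much yet.

```python
# Hypothetical cron-tick scheduler using simple change-rate and time-of-day priors.
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Scout:
    name: str
    check_interval: timedelta                 # prior on how quickly the info changes
    active_hours: tuple[int, int] = (0, 24)   # e.g. AI releases: mostly US daytime
    last_checked: datetime = datetime.min

def should_check(scout: Scout, now: datetime) -> bool:
    in_window = scout.active_hours[0] <= now.hour < scout.active_hours[1]
    due = now - scout.last_checked >= scout.check_interval
    return in_window and due

scouts = [
    Scout("AI release announcements", timedelta(hours=1), active_hours=(7, 20)),
    Scout("apartment listings", timedelta(hours=6)),
    Scout("tennis court, Monday 7am", timedelta(minutes=30)),
]

now = datetime.now()
for s in scouts:
    if should_check(s, now):
        s.last_checked = now
        print(f"re-running agentic search for: {s.name}")
```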
Yeah, I think what I'll say is sort of, it's not going to be very insightful. Like, there is a lot happening in this space. And so, yeah, I'm curious to see how that evolves. With sort of the AI features in browsers, I'm interested to see what the reaction to that is. Like, I've used Dia, I've used Comet, I'm using ChatGPT's Atlas right now, but I find myself more or less using it as a regular browser. I haven't found ways for sort of the AI features to be a very, like, integral part of my workflows yet. But that is a feeling that I get in talking to other users as well. I'm actually curious if you use any of these AI browsers right now and what your experience with them has been. I've played with them mostly to see what they can do and to play with them. But I also have not found, like, a killer use case, so to speak. Like, the one thing that I've wanted to do, it's kind of a weird one, but I save articles across lots of different platforms. X, LinkedIn, Hacker News are the big ones, actually. And accessing those LinkedIn saves is kind of super hard, because, like I said, I always have to look for where that thing is, and there are no APIs that I have been able to find. And so, like, someone was talking about using Comet, I think, for LinkedIn. And I thought it would be kind of interesting if it could, like, go in, you know, periodically and find those saved articles and pull them out and format them in some other way and then send them to me, or post them someplace else, post them to Airtable. Like, that's the kind of thing that I would find useful. I don't know that we're there yet for, like, the travel booking or, you know, those common use cases that people talk about. Like, the problem, or at least the way I approach that problem, is so much more complex than any of these things are readily able to do. And I think what you're describing is, like, I do think we need these agents to be in the background doing useful things for you. Right. Like, there is a certain workflow right now, where, like, the details are not important, but there's a certain manual thing that I end up doing multiple times a day. And it would be a great fit for, like, this agent in my browser, right, to do it for me. But I end up closing the tab accidentally every so often, and then, like, it's gone, right, and I need to sort of resend that. And so if I could just set it up to be, like, every time blah happens, just go do this very specific manual thing, and do it in the background, whether or not my laptop is open or shut is not important. I mean, it's amazing, like, how, you know, simple that is, and maybe elusive in the sense of, like, when, absent the current state of the technology, if you talk to somebody about, like, agents that act on your behalf, you would imagine them, like, living in the cloud and doing these things and, like, letting you know when they have something interesting for you. And for the most part, it's not really like that right now. Like, you're going to these things, you're asking them stuff, you're doing things with them. And it's very interactive. And I've not seen a lot of really great examples that illustrate the power of, I don't know what we want to call them, ambient, you know, agentic systems or whatever. I don't know that we've coined that term yet. Yeah, yeah. I mean, obviously, I'm biased, but I think Scouts is exactly that. You should try it out. But it's very much this. You set it up once, right?
You're telling it, like, let me know whenever blah happens, or whenever something interesting happens, and then it's off, right? You're not interacting with it. It's off monitoring the web 24/7 for you. And whenever it has something to report, it's going to send you an email with that information. And so it's often the case that, like, for several weeks I haven't heard from the scout because nothing relevant happened. And then it shows up in my inbox because it was out there looking. And that just, that just feels quite magical. So yeah, I already promised to give you access before we started recording. So I'll get you access. You should try it out and then you should tell me what you think of it. I definitely will. I definitely will. Well, Devi, thanks so much for taking the time to jump on and share a bit about what you've been up to recently. Super cool stuff. Thanks for having me. Thanks for having me. Thank you.
Related Episodes

Why Vision Language Models Ignore What They See with Munawar Hayat - #758
The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)
57m

Scaling Agentic Inference Across Heterogeneous Compute with Zain Asgar - #757
The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)
48m

Proactive Agents for the Web with Devi Parikh - #756
TWIML AI Podcast
56m

Perplexity Sued by Amazon, Pays $400M to Power Snapchat
AI Applied
12m

AI Orchestration for Smart Cities and the Enterprise with Robin Braun and Luke Norris - #755
The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)
54m