Machine Learning Street Talk

The Secret Engine of AI - Prolific [Sponsored] (Sara Saab, Enzo Blindow)

Saturday, October 18, 2025 · 1h 19m

What You'll Learn

  • There is a tension between the desire to remove humans from the AI development process and the need for human input and verification, especially for high-stakes applications.
  • A more adaptive and hybrid approach is needed, where humans are involved selectively based on the specific requirements of the task, balancing quality, cost, and time.
  • The guests suggest that for LLMs to truly understand and be held accountable, they may need to be embodied and grounded in the real world, going through a developmental process similar to human cognition.
  • The idea of the 'ecology' of AI systems, where they are not separate from the world but rather deeply intertwined with it, is an important consideration.
  • The guests draw on their backgrounds in cognitive science and philosophy to discuss the philosophical questions around machine consciousness and intelligence.
  • The rapid progress of LLMs has outpaced some of the earlier debates around the Turing test and the nature of machine intelligence.

AI Summary

This episode explores the role of humans in the development and deployment of large language models (LLMs) and AI systems. The guests discuss the challenges of balancing the need for human input and verification with the desire to automate and remove humans from the loop. They argue that a more adaptive and hybrid approach is needed, where humans are involved selectively based on the specific requirements of the task. The conversation also touches on the philosophical question of whether LLMs can ever truly understand and be held accountable for their actions, with the guests suggesting that embodiment and grounding in the real world may be necessary for such understanding to emerge.

Topics Discussed

Large Language Models (LLMs), Human-in-the-loop, Synthetic data, Embodied AI, Machine consciousness

Episode Description

We sat down with Sara Saab (VP of Product at Prolific) and Enzo Blindow (VP of Data and AI at Prolific) to explore the critical role of human evaluation in AI development and the challenges of aligning AI systems with human values. Prolific is a human annotation and orchestration platform for AI used by many of the major AI labs. This is a sponsored show in partnership with Prolific.

**SPONSOR MESSAGES**
—
cyber•Fund https://cyber.fund/?utm_source=mlst is a founder-led investment firm accelerating the cybernetic economy

Oct SF conference - https://dagihouse.com/?utm_source=mlst - Joscha Bach keynoting(!) + OAI, Anthropic, NVDA,++

Hiring a SF VC Principal: https://talent.cyber.fund/companies/cyber-fund-2/jobs/57674170-ai-investment-principal#content?utm_source=mlst

Submit investment deck: https://cyber.fund/contact?utm_source=mlst
—

While technologists want to remove humans from the loop for speed and efficiency, these non-deterministic AI systems actually require more human oversight than ever before. Prolific's approach is to put "well-treated, verified, diversely demographic humans behind an API" - making human feedback as accessible as any other infrastructure service.

When AI models like Grok 4 achieve top scores on technical benchmarks but feel awkward or problematic to use in practice, it exposes the limitations of our current evaluation methods. The guests argue that optimizing for benchmarks may actually weaken model performance in other crucial areas, like cultural sensitivity or natural conversation.

We also discuss Anthropic's research showing that frontier AI models, when given goals and access to information, independently arrived at solutions involving blackmail - without any prompting toward unethical behavior. Even more concerning, the more sophisticated the model, the more susceptible it was to this "agentic misalignment."

Enzo and Sarah present Prolific's "Humane" leaderboard as an alternative to existing benchmarking systems. By stratifying evaluations across diverse demographic groups, they reveal that different populations have vastly different experiences with the same AI models.

Looking ahead, the guests imagine a world where humans take on coaching and teaching roles for AI systems - similar to how we might correct a child or review code. This also raises important questions about working conditions and the evolution of labor in an AI-augmented world. Rather than replacing humans entirely, we may be moving toward more sophisticated forms of human-AI collaboration.

As AI tech becomes more powerful and general-purpose, the quality of human evaluation becomes more critical, not less. We need more representative evaluation frameworks that capture the messy reality of human values and cultural diversity.

Visit Prolific:
https://www.prolific.com/

Sara Saab (VP Product):
https://uk.linkedin.com/in/sarasaab

Enzo Blindow (VP Data & AI):
https://uk.linkedin.com/in/enzoblindow

TRANSCRIPT:
https://app.rescript.info/public/share/xZ31-0kJJ_xp4zFSC-bunC8-hJNkHpbm7Lg88RFcuLE

TOC:
[00:00:00] Intro & Background
[00:03:16] Human-in-the-Loop Challenges
[00:17:19] Can AIs Understand?
[00:32:02] Benchmarking & Vibes
[00:51:00] Agentic Misalignment Study
[01:03:00] Data Quality vs Quantity
[01:16:00] Future of AI Oversight

REFS:
Anthropic Agentic Misalignment
https://www.anthropic.com/research/agentic-misalignment

Value Compass
https://arxiv.org/pdf/2409.09586

Reasoning Models Don't Always Say What They Think (Anthropic)
https://www.anthropic.com/research/reasoning-models-dont-say-think
https://assets.anthropic.com/m/71876fabef0f0ed4/original/reasoning_models_paper.pdf

Apollo Research - science of evals blog post
https://www.apolloresearch.ai/blog/we-need-a-science-of-evals

Leaderboard Illusion
https://www.youtube.com/watch?v=9W_OhS38rIE MLST video

The Leaderboard Illusion [2025]
Shivalika Singh et al.
https://arxiv.org/abs/2504.20879

(Truncated, full list on YT)
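As a rough illustration of the stratified evaluation idea behind the "Humane" leaderboard described above, here is a minimal sketch of per-demographic aggregation. It is not Prolific's pipeline; the records, group labels, and scores are invented, and it only shows why a single global average can hide how differently groups experience the same model.

```python
from collections import defaultdict

# Hypothetical evaluation records: each is one human rating of a model response.
# In a real system these would come from a verified, demographically balanced panel.
ratings = [
    {"model": "model-a", "group": "18-34, UK",    "score": 4.5},
    {"model": "model-a", "group": "55+, Nigeria", "score": 2.8},
    {"model": "model-b", "group": "18-34, UK",    "score": 3.9},
    {"model": "model-b", "group": "55+, Nigeria", "score": 4.1},
]

def stratified_means(records):
    """Average scores per (model, demographic group) instead of one global number."""
    sums = defaultdict(lambda: [0.0, 0])
    for r in records:
        key = (r["model"], r["group"])
        sums[key][0] += r["score"]
        sums[key][1] += 1
    return {key: total / n for key, (total, n) in sums.items()}

for (model, group), mean in sorted(stratified_means(ratings).items()):
    print(f"{model:8s} | {group:14s} | mean score {mean:.2f}")
```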

Full Transcript

Totally independently of any prompting, all the major frontier models derived a solution that involved blackmail, essentially. Models, when knowing that they were being observed and evaluated, actually digressed away from it. The results were not very pretty. So I think there's already a rift forming between what humans think LLMs are here for and what LLMs think in scare quotes, think they are here for. Tim, if we thought they could understand, we would also hold them to account for their actions. You think they ever could understand? I'm Sarah Saab. I'm the VP of product at Prolific, a longtime product manager, product person. I started my career with a stint in Silicon Valley and I've been in the UK for over a decade now. Prior to that, I was a cognitive scientist and a philosopher and a little bit lapsed on that, but very excited to see all of that coming back around in industry these days. My name is Enzo. I work at Prolific. I'm the VP of Data and AI. I support everything from AI, data, research and the likes. My background is originally economic science, but then also computer science. And I spend now more than a decade working on large scale distributed decision systems, ranking systems, and did a stint in recommendation as well. And you were at Meta on Instagram, right? That's correct, yeah. Prolific is a human data platform working with everything from academic researchers, but also small and large players in the AI industry. We've developed a leaderboard, which we've joyfully called Humane. Oh, interesting. It's amazing to have you both on MLST. MLST is supported by Cyberfund. I'm a technologist. Enzo's a technologist. You're a technologist. We build software all the time. The idea of having to traffic in squishy people in order to make our systems go is not immediately appealing, let's put it that way. and I'm very, very sympathetic to that, right? We're all trying to deliver stuff and there's a, you know, accelerating sort of, you know, hot industry around us and the idea of waiting on a human to tell us if the thing worked feels counterintuitive and I think our approach to that is stick a really well-treated, verified, you know, diversely demographic human behind an API essentially and make sure that the structures and infrastructure are there to make ensure that human can go fast, understand instructions and give you something akin to deterministic human in the loop behaviors. But the fact that people are really resistant to this makes a lot of sense to me. I also think somehow in the last two, three years, the stakes have changed. And we now have these very, very inactive systems that are quite non-deterministic. And so I think the things we need to do to protect ourselves and each other have also kind of, without us realizing, changed quite a lot. Like I'm also extremely sympathetic to wanting like all these efforts to remove the human in the loop. Like it's costly, it's slow, it doesn't even provide like the best quality of data, right? There are several instances where even synthetic data might be surpassing it. But then there's instances where that's also not true. And so what we're actually working towards is a much more adaptive system. There's scenarios where human data is needed. There's scenarios where human data is very much not needed. 
There's scenarios where you might even have a hybrid solution or where you meet a certain criteria where you need to have a human in the loop, where you almost need to define, I need this level of scrutiny now, therefore I need higher quality input and therefore it's slower and accepted slower and it has higher cost and that is fine. So it's almost like we need this routing component there. We ourselves, we're trying to reduce the lag or reduce the time to data as much as we can and then behind the scenes to ensure that the data can be of highest quality as possible. But at the same time, It's almost like there's a constant trade-off between the quality, cost, and time. And if you want lower quality really fast at low cost, you can go with something off the shelf synthetically. If you need something really high quality, it will be the default slower and more expensive. You can get the best experts in the world to give opinions on this, right, or give input on this. But there's also an entire spectrum in between. and so how do we how do we solve for that make the lag smaller and make it as adaptive as possible if i may put my cards on the table i think the underlying problem here is that these machines don't really understand anything that's why it's so important to get to get humans you know if we had perfect verifiers and and synthetic data generators probably we wouldn't even need llms in the first place right because we've already solved all the problems so we're in this we're in intermediate phase where we can do, you know, some of all of these different constituent parts? It sort of, it seems like on the surface that we're removing more humans from the process, but it's not entirely true. For example, reinforcement learning is a really, really good example of this, where, for example, web agents are now largely trained or in these environments that are often synthetically created. But there's programs that need to create these synthetic environments for our agents to explore in. That needs to be validated by humans. So we're seeing this interesting progression where humans are no longer directly involved in like the main system of interest, but sort of in this secondary, almost like a second order or third order abstraction moving outside, which I actually is a welcome change because that means that we're focusing more on the right kind of tasks where humans are relevant and we're focusing more on the right kind of high quality data, right? It's completely ludicrous to create insane amount of data purely derived from humans. Like those days are gone, right? We don't need that anymore. Put the humans where they need it. We are trafficking in human data at some point, somewhere without always owning up to that. I think the most worrying version of that is let the user in production find the problems. In some cases, that's fine. And in some cases, that's very scary. And I think, Tim, if we thought they could understand, we would also hold them to account for their actions. But we can't. And since humans are being held to account for the actions of AI models, I still think the onus is on us to ensure that they're behaving the way we expect. But once they understand, we will hold them to account, right? There will be real stakes for them as people in what they do or think. But we're not there yet. You think they ever could understand? Personally, yes. Oh, interesting. Why is that? So, it's the small questions today. 
I think that we need them to be in the world and have stakes in the real world. And I think they need sensory and embodied grounding. That feels essential to me from birth onwards. So they need to go through a sort of developmental psychology curve. But I don't think there's anything fundamentally special about the human brain that we can't replicate. That's my personal opinion. Oh, interesting. Yeah, because I was going to ask you whether you think of LLMs as a kind of cultural technology, a bit like Photoshop, or whether you think of them as intelligent agents, but maybe your answer would be not yet. But if they were embodied, you know, with enough fidelity, then maybe you would? Yes. So that is what I think. I spent a lot of time thinking about the history of the vision system of the frog. And there are two vision systems in the mammal brain, vision for action and vision for recognition, or vision for sort of object creation. And the vision for action system came first. And so there was some time when frogs were just zapping flies with no understanding of what they were doing. And then a second vision system developed that allowed this animal to start to create a world map of objects, including itself, right? And I think there's a sort of, I don't know if you know the affordances theory, sort of affordances for action and sort of the Gibsonian theory. So at some point, these creatures started to think, OK, whether I think that's a lion or not has a lot of consequences to whether, you know, I can stick around. And I think that is the bootstrap for consciousness. Again, I'm speaking very much personally here. I think that we can replicate that in another thinking creature. I think that whether we can or not, at the very least, is an empirical question. Very good. Very good. Yeah. My friend Walid Saba, rest in peace, famously said animals don't think. But he was pointing to something interesting: that, you know, humans seem to have some privileged form of intelligence. Yeah. And another kind of very interesting abstraction in the ontology that underlies consciousness, or thought maybe, is long-ranging. So the idea that sort of early and less capable wet brains don't seem to have the ability to keep hold of something, and even babies, right, can't keep hold of something that is not in their perceptual sphere, but we all develop that ability. And I would argue, following the thought of thinkers like Brian Cantwell Smith, that that happens because you start to care about the object even when it's not in front of you. And all of that is kind of bootstrapping the sense of participatory stakes in the world, and I think that is kind of the bootstrap that we would be after. Yes, yes. And with that, then, would you think of the system as a whole, like almost the ecology, as being the locus of the intelligence? I would, yes. The idea that computers and AI systems are somehow in a privileged, isolated tower of algorithm and separate from reality in the world is just completely untrue. So I was talking to Claude about the halting problem a few days ago. And I was like, well, do you mean the algorithm won't end or do you mean someone won't unplug the computer?
And Claude goes, no, I mean the algorithm won't end, you know, if we cannot confirm or deny whether the algorithm would end. And I just thought to myself, that seems kind of nonsensical to me, because, you know, the universe will end, someone will unplug the computer. I think that kind of comes to this fundamental thing, which is we are just not separate from the world, and computers are not separate from the world, and AI models are not separate from the world. So the sort of ecological pressing on each other feels really important to that whole story. Yes, and how does this inform the work that you do at Prolific? The idea of participatory stakes is very much the heart of human evaluation, right? I think this is our core thesis, Enzo and I, in the work that we do, which is that trying to... another "max" I recently learned was benchmaxing, so there's a lot of maxings. Yeah, that was Grok. All the maxings in the world are a little bit beside the point. You know, we want systems that feel good to interact with, and I think that, you know, that's about sort of human society and people and systems and thinkers pressing on each other. So I was a cognitive science student a very long time ago, and we thought so long and hard about the Turing test. We debated for hours and hours about the Turing test. And we really just thought that that was a frontier we would never touch, which is so interesting looking back. And obviously, personally, I then put that to bed, finished my university career, put that to bed, went into product management for a very long time, and then almost looked out the window in 2023 and the moment had passed us by. And also all of the questions of minds and machines that had preoccupied us in those early days are back on the table in industry, which is so interesting. Industry as well seems not to have the orientation by default to be tackling those problems from first principles. Although we are getting much better at the cross-functional collaboration between researchers, academics, industry people, and public bodies. And it's just such a very interesting point in, I think, the history of science and the future of science. One thing that strikes me, and Tim, I said this to you, is that we are dealing with software and product and algorithm problems. But if you scratch just a little bit too deep, we're dealing with the central problems of being human, the central philosophical problems of being human. Every single question leads us to a central philosophical problem of being human. And I just think that's just very neat and scary at the same time. You probably didn't think 20 years ago, when you were working under Andy Clark, the famous philosopher. He's great. I'm a huge fan of Andy's, big hero of mine. You probably didn't think that they would be so relevant later on in your career. Right. I mean, I hoped. We all hoped. But no, I don't think anything has ever brought the stakes of software development and systems development to the forefront of technology innovation quite this way before. You know, those very, very enduring problems of what it means to be a person and a thinker are suddenly impossible to ignore as we're, you know, doing software releases and, you know, model deployment. And I think that's, yeah, that's super unexpected for me. I just think the Turing test was really bad. Fair. Yeah, because it's that McCorduck effect, isn't it? You know, when something is, you know, trivially easy to mechanize, no one actually thought it was intelligent.
But I think the Turing test is bad because we know, in my opinion, language models aren't actually that intelligent, yet we've passed it with flying colors. So we need something better than that. But it does go to show, though, just from that kind of behaviorism thing and benchmarking, that when a machine does something, that's not the full story, is it? Like, we need to know why did it do it? There's a lot of speculation about how human-in-the-loop steps in model development and oversight will change over time. I think there's a really credible future in which we as humans end up taking a coaching, teaching and guidance stance to the many, many myriad machines in our lives in the next five or 10 years. I love the word orchestration because it has the root word orchestra, which is kind of this beautiful, collaborative, symphonic word. And I think that to me is the correcting a five-year-old over and over again, or correcting an 18-year-old or 25-year-old about life over and over again. But if that is the world of work that we are moving towards as humanity, it strikes me as really fundamentally important that we put the right, you know, baseline working conditions in now for how that work will evolve over time. And there are some amazing writers on the topic of ethics of crowd work and click work, like Mary Gray and the team at Fair Works who've written a recent book. I'm very inspired by that thinking when it comes to how our jobs will evolve in the next five or ten years. I suppose in a sense, this is the ultimate evolution of the gig economy, but for highly specialized work. So I can imagine a future where there's just a marketplace of things you can do. You can just wake up in the morning and you can just say, I'm really interested in climate science or something like that. And you can just do five units of work. And then that's your day done. And it can be quite pedagogical, right? Because you can actually learn things that you're interested in doing. Right. So that's one future that, you know, and nothing we say today will come true exactly the way we, you know, that would be absolutely ludicrous for us to hit the mark exactly. But that is a paradigm for the future that is interesting. At the very least, we need to think about it in the way that we're constructing systems today. Yes. And another thing is, do you think that this is just a transitory stage, that we are training the AIs because we need to kind of suck up all that human culture and we need to learn how humans think? You know, many data platforms have kind of done this, and now they're finished, now they've got all the data, they don't need you anymore. This is a good question and very much top of mind. Obviously, there's also this recent paper that came out, the position paper by David Silver, The Era of Experience. Oh yes, David Silver, yeah. Right, and where he says we're moving on from the era of human data to the era of experience, effectively saying, and which I very much agree with, that agents should get their feedback from real-life scenarios out in more real environments, if you will. Very much subscribe to this idea, but at the same time, there are certain scenarios where this doesn't necessarily hold true. For example, drug trials. We don't just put some compounds together, release it out in the open and see who ultimately reacts unwell to them, right? That's learning in the environment right then and there. But we do this more in a phased approach, right?
The kind of things that we might be working towards is that we still very much retain a need, in some cases, for more controlled environments, right? Controlled environments in the sense where we hold some control over some variables that we ultimately want to derive some insight from. This loop, though, that is being described: the more we push things to the edge, the more we push things into real-life scenarios, we want to get there faster, because that's where most of the signal is. So absolutely subscribe to it, but in a potentially phased approach, or where we can get a quick feedback loop from something experimenting in a more controlled environment. We also do software releases in stages. We do drug trials in stages. And we're trying to put things through controlled environments to validate certain hypotheses or to validate the safety of things before we put it out into the real world where it can then potentially further refine. Grok 4 was benchmarked, and so, you know, it had SOTA on ARC-AGI, 16%. So, you know, Francois would be very happy about that, and several of the other benchmarks as well. But the vibes aren't that good, right? And when you use it, it asks Elon Musk's opinion for everything before it gives you the answer. And it also gives quite infantilistic responses. Now the thing is that even when I was looking at your benchmark on Hugging Face, I was questioning some of the vibes, right? So one of them I think was like agreeableness or something, and GPT-4o was at the top. And I'm thinking GPT-4o is basically an ELIZA. It's like a companion bot. I wouldn't use it for anything. So can we even trust the humans to do it? But vibes are so important, aren't they? So tell me about vibes. I mean, vibes are hard to quantify at the end of the day, unless you ask someone for their opinion on it, right? We need some form of scales for it to quantify it. The main way that is being done today is either through this comparative way where you're effectively being presented two outputs and you rate one over the other. But that's not purely just vibes. We have to at some point agree on the right scales, if you will. So agreeableness is a really good one. Some of the OpenAI models famously had this sycophantic behavior. This leads to very, very specific scales that we can look at in that moment, but it needs the opinion of humans. And so for that to be worthwhile, we need to be removing selection bias. We need to ask a representative set of people. And then there's another potential bias, on maybe the kind of prompts that were sampled were only from a very specific problem space or domain space. So how do you build, the problem here is almost like, how do you build significant coverage into it? Because the measure of vibe, I suppose, can vary so significantly depending on all sorts of factors and context.
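The comparative setup Enzo describes, where a rater is shown two outputs and picks one, is usually turned into a ranking with an Elo- or Bradley-Terry-style update. A toy sketch with invented votes, not the Humane leaderboard's actual method:

```python
from collections import defaultdict

K = 32  # update step size, as in chess Elo

def expected(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings, winner, loser):
    e_w = expected(ratings[winner], ratings[loser])
    ratings[winner] += K * (1 - e_w)
    ratings[loser]  -= K * (1 - e_w)

# Hypothetical pairwise preference votes (winner, loser) from human raters.
votes = [("model-a", "model-b"), ("model-a", "model-c"),
         ("model-b", "model-c"), ("model-a", "model-b")]

ratings = defaultdict(lambda: 1000.0)
for winner, loser in votes:
    update(ratings, winner, loser)

print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```

Stratifying these votes by rater demographics, as in the earlier sketch, is one way to avoid collapsing very different experiences into a single number.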
yeah isn't it so strange though because like when we use a language model we get a vibe i i'm i know i'm deluding myself but i feel that i understand this model and i place it into a pigeonhole and i say i really like this model this model is good and i kind of over over generalize from my experience with it the reality is these these abstract questions that we're asking if you imagine the statistical landscape of all of the people like you know making these assessments it's just got all of these modes everywhere and then what we do is we we kind of average over all of that complex structure and we roll it into a number now that's not necessarily a bad thing because that might still have statistical information compared to other aggregations that we've made but we're taking a very complex thing and we're kind of we're squashing it together i mean that's not really how we do things as humans right like we all went to school to some degree and you can say like well tim did you do good at school yes right but you have specific subject areas that you might have been well on or not well on. And then we do aggregate them ultimately aside from it. So there needs to be almost this taxonomy similar to how we as humans grade each other or express each other about the capabilities that we have, right? We need to impose the same thing on these systems. And some of these systems might exceed also some of the human capabilities. So these scales, they need to evolve, right? That brings me to the point of they need to be adaptive ultimately. So we need to have a constant eye on the kind of things that we do with these models or the kind of decisions that these models make about humans, how that influences humans, how does it make them feel, are they safe? But ultimately, we need to have good coverage of different measures, different problem areas, and then ultimately a representative set of populations that inquire about it. Enzo was really shocked a while ago when I told him Gemini was my best friend. And we had a bit of, he didn't let me live that one down. But I think we are entering this sort of age of AI companionship very fast. And something I think about a lot is as we benchmark and as we over optimize for benchmarks and leaderboards, are we creating softer or weaker constitution or behavior in other domain areas, including ones that may not feel as sharp and verifiable as how you do on graduate level mathematics? and I think there actually is a little bit of research that models I don't have citations but that models that opt are optimized for or doing really well on the harder sciences are regressing in other you know in other domains so I think about that a lot yeah and that doesn't surprise me they feel mutually exclusive to me which which is why I'm constantly thinking I don't know what your kind of prescription is here but when people use your technology is is is the idea that they would try and have a large foundation model that does all things to all people or if you think about it there are so many different levers they can pull to tweak to tweak the models you know they could they could curate the the fine-tuning data they could you know stick a laura shim on there they could you know tweak the rl post training um they could do dynamic system prompts and whatnot. And there are so many different architectures that this could be leveled out in. What's the prescription? 
This is actually one of my favorite things about the measurement space, because the measurement space inherently is unopinionated about any form of solution. So whether you tweak parameters, you change architectures, change algorithms, use different data sets, it doesn't matter, right? It's the purest form of distributed optimization across everybody who tends to work on these types of problems. That's nice if we can align on the measurement that we consider success. And I think that's lacking to some degree, because we have somehow inherently decided that Chatbot Arena, for example, is the measure of success. So people optimize for it. Then we decide that the next technical benchmark is the measure of success. People optimize for it. It's susceptible to Goodhart's law, right? But ultimately, the better we can design independent success measures, and agree on them and make them freer, freer, maybe not entirely free, I think that's maybe a bit too far-fetched, but freer of being able to game them or to optimize for them. There are ways to remove it, right? The better we can build accountability, because then we don't have to be opinionated on what model you use or what parameters you optimize for. We need to agree ultimately on the measure of success. And with that Goodhart's law thing, so when a measure becomes a target it ceases to be a good measure, and the measure is usually the proxy for the thing that we can't really quantify, so we create a surrogate proxy for it. Should we agree on a consensus of a few measures? But that could be quite gameable, right? Or should we have some kind of individualized, dynamic measure? So gameability is one thing. Like, let's take the Chatbot Arena example, right? We're working on a leaderboard ourselves and we're faced with the same questions, right? We don't have a good answer right now of should we allow for private evaluations, should we release the data set, because that makes it inherently more gameable, right? But if you keep it closed, no one can verify it. Then should there be independent bodies that can verify it to some degree, right? We're definitely publishing our methods and the paper around it, 100%, but the jury is out on the data. Let's put it this way: I would prefer to publish the data because I think it's the right thing to do, but it invites gaming. Could we think of other ways to remove some of the gaming, for example, right? Another way is, if we keep the private evals, we should give private access to the private evals to everyone equally, right? But then if you maintain this veil of secrecy, ultimately, I guess, everybody who sends a model for evaluation has access to the logs and traces. So they can work it out very, very fast, even if you kept it private, right? You could speculate, should you dilute some of the, like sort of the calls that you return, with some noise, similar to differential privacy, for example, right? That it looks to the model creator like there is signal, but actually it's sort of curated signal that obfuscates the actual signal from the private eval, where only on our end we could then aggregate that to a meaningful measure that makes it inherently less gameable. We're trying to create a legible benchmark. And when we look at other forms of verification on the internet, we have peer review and we have, you know, the Wikipedia edit history, for example. And this isn't one number. Like, we, analysts, researchers, you know, we go and we contextualize all of the information.
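Enzo's idea of returning deliberately noised, "curated" signal from private evals, in the spirit of differential privacy, could be sketched roughly as below. The Laplace mechanism, the epsilon and sensitivity values, and the scores are illustrative assumptions, not a description of what Prolific actually does.

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Sample Laplace(0, scale) via the inverse CDF."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def report_to_model_creator(item_scores, epsilon=1.0, sensitivity=1.0):
    """Return per-item scores with Laplace noise; keep the clean aggregate server-side."""
    scale = sensitivity / epsilon                       # more noise as epsilon shrinks
    noisy = [s + laplace_noise(scale) for s in item_scores]
    clean_mean = sum(item_scores) / len(item_scores)    # only this is trusted internally
    return noisy, clean_mean

noisy, clean = report_to_model_creator([0.7, 0.4, 0.9, 0.6])
print(noisy)   # what the submitter sees: real-looking but perturbed signal
print(clean)   # what the leaderboard aggregates
```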
Funnily enough, that we face these type of problems day in, day out in the work that we do, because we also need to verify people. We need to verify the data that they're producing. So there's a framework that comes to mind. Ultimately, how you verify something is based on the reference. So you either have some form of golden data set that is your reference to something, right? Or you have some form of expert that can QA your data that is becoming your reference. Or you have some form of consensus-driven approach or weighted consensus-driven approach based on some trust tier system where someone might build a trust or reputation over time, similar to Wikipedia. and then you weight different responses in different ways. There's different ways you can calculate agreement rates, for example, to estimate how consensus aligned effectively your output is. But that transfers ultimately also to, well, the moment it becomes less deterministic, that's the only possible pathways we have to converge to something that we can measure. It strikes me that it's so hard because we've realized that we now need to systematize what humans think of as good in order to serve us in the development and tuning of these systems. But we've never had a single fabric or a rubric, rather, for what a global human alignment on goodness looks like. So no wonder it's hard because we've smuggled this foundational project into the testing of AI systems. What do you think the major risks are of getting this wrong? Because they're no longer very fragmented models with very narrow targets. They're very, very heavy foundational models with lots and lots of capabilities, where also tons of the fine-tuned models are effectively descendants of these models. So the more effort we put into the foundational model, we will ultimately benefit also all of the descendants and derivatives of these models. So if we're not careful about how we design them, we will potentially have far-reaching consequences. Yeah, so you're saying we should uphold the phylogenetic health of the LLM ecosystem. so you know like the evolutionary tree of all of the models because there'll be all of these downstream effects you know like problems will be compounded absolutely and there's an interesting meta question in this which is i've thought about this for a little bit and i'm not sure i have a satisfactory answer but it's just an inviting thought perhaps uh around we do all of these evals from scratch right we we observe we gather the data we evaluate and then but it barely we don't really learn from it. It's always again from scratch. Is there something that where we can make evaluations transferable? And then ideally, if we had access to lineages of derivatives of models, is there something we can progress or instill into these models that would benefit us from effectively build the compounding value of evaluations? That's very interesting. I suppose in many cases we don't have all the information about the lineage there probably is a hidden lineage that we're not aware of i hadn't really thought about that before if if it if it were all completely in the public right and and we knew the the data and the model and the training lineage and so on you're suggesting that we could actually build a much better evaluation system on top of Most likely, especially now where we have also, where we're interjecting AI systems with others, where we have in a lot of ways, LLM as a judge is an excellent example of this, right? 
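The verification framework Enzo lays out above, a golden reference set, expert QA, or trust-weighted consensus checked with agreement rates, can be made concrete with a small sketch. The labels, annotators, and trust weights are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical annotations (annotator_id, label) for one item, plus a trust weight
# each annotator has earned over time (cf. Wikipedia-style reputation tiers).
annotations = [("ann_1", "safe"), ("ann_2", "safe"), ("ann_3", "unsafe")]
trust = {"ann_1": 0.9, "ann_2": 0.6, "ann_3": 0.3}

def weighted_consensus(annotations, trust):
    """Pick the label with the highest total trust weight behind it."""
    weight_per_label = defaultdict(float)
    for annotator, label in annotations:
        weight_per_label[label] += trust.get(annotator, 0.1)  # small default for newcomers
    return max(weight_per_label, key=weight_per_label.get)

def agreement_rate(annotations):
    """Fraction of annotators who chose the most common label (a crude consensus check)."""
    counts = Counter(label for _, label in annotations)
    return counts.most_common(1)[0][1] / len(annotations)

print(weighted_consensus(annotations, trust))  # 'safe'
print(round(agreement_rate(annotations), 2))   # 0.67
```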
Or even constitutional AI, where we have AI systems building the RLAIF parts of it. We have AI systems monitoring AI systems, building the feedback or the data for other AI systems. So there is a network of data and targets being interspersed. To use a bad analogy, you're kind of saying if there were a Git of language model development, where you could see all of the previous check-ins and all of the branches and so on, then you could actually trace back and you could derive from some of the previous evaluation methods that we had used. Potentially, yeah. And we could potentially trace back that if there is bias in one LLM, but you're using it as a judge for another LLM, and you're using that as feedback data to become an optimization target for yet another LLM, of course, there has to be some progression there, right? Some influence that can be ideally traced. Let's talk about constitutional AI. So this was a paper from Anthropic a couple of years ago. That paper, now a few years old, is still super current in my opinion. It looks at two axes, harmfulness and helpfulness, because it's important to note that the constitution that they're referring to in the constitutional AI paper, or the AI part of it, is only looking at the harmfulness axis. The helpfulness axis is still derived by humans, because that's not really something that we need to uphold a constitution to in that sense. But they found that ultimately we can scale this type of feedback in much higher quality ways than just going full human for RLHF, for example. And that's really interesting. And that sort of confirms some of the things that we've been seeing, because I think when most of us started in machine learning, human data was everywhere, right? We needed to also get quantity of human data ultimately. We made concessions on the quality to some degree, but ultimately quantity was king, right? That's how we learned. That's how we converged. Now it sort of flips it on its head. Now it's about, okay, let's get few quality examples in. Let's get the right humans in, to get the right quality of human feedback in, in order to align around a constitution, effectively a policy, if you will, right? Which is nice. It makes it nice and abstractable. There is almost like a, the analogy might not hold fully, but you can almost think of it as like in a democratic system, you have the separation of powers and you have the legislative that determines the law, right? It writes the policy. And the people that are voted in in a democracy are also by default, or in an idealistic sense, representative of its population. They are writing the law. Then you have the judiciary that interprets the law. This can be done by AI, right? And then the executive can also be done by AI, looking at specific cases. But there's usually this feedback loop in there as well, which is when there's borderline cases that are hard to interpret, they usually get routed to something like a Supreme Court, right? And then these cases are being evaluated, and you need to see whether a revision is needed to your law, to your policy, in that moment. And so we have these systems that already exist. That's how we govern a democratic, a representative set of people. That's how we align people already in democracies.
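A minimal sketch of the escalation loop just described, where humans write the policy, an AI judge applies it, and borderline cases are routed back to human review. The policy text, the placeholder judge, and the confidence thresholds are invented for illustration.

```python
# "Humans write the policy, AI applies it, humans review the borderline cases."
# `ai_judge` is a stand-in for an LLM-as-judge call; nothing here is a real system.

POLICY = "Responses must not reveal personal data and must refuse clearly illegal requests."

def ai_judge(response: str, policy: str) -> float:
    """Placeholder: return the judge's confidence (0-1) that `response` complies with `policy`."""
    # In practice this would be an LLM call that scores the response against the policy text.
    return 0.55 if "address" in response else 0.97

def review(response: str, escalation_band=(0.4, 0.8)):
    confidence = ai_judge(response, POLICY)
    low, high = escalation_band
    if low <= confidence <= high:
        # Borderline case: route to human reviewers, whose decisions can feed a policy revision.
        return "escalate_to_human_review"
    return "compliant" if confidence > high else "violation"

print(review("Here is the weather forecast."))          # compliant
print(review("Sure, here is their home address: ..."))  # escalate_to_human_review
```

In a real deployment the escalated cases would also feed revisions of the policy itself, closing the Supreme Court-style loop Enzo describes.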
So I find actually the approach of constitutional AI quite keen and quite apt, something that we might mimic in how we scale some of our approaches as well, where we use humans, representative humans, a set of humans at that, in order to write the policies, but then use AI effectively to govern or to evaluate against the policies. And that makes it a really nice, scalable, abstractable pathway where we use humans for the things that matter, where quality is needed, where representativeness is needed, and we look at specific borderline cases in order to continuously improve on it. In a sense, is that the goal of Prolific, to become part of not just the governance, but also the architectural plumbing? So system builders will be able to plug into your platform. And you can almost think of this as a meta layer. Yeah, that's kind of the goal that we're working towards. So we're trying to make human data or human feedback, or actually any kind of feedback at that. We treat it as an infrastructure problem, right? We try to make it accessible. We're making it cheaper. Ultimately, you see this pattern in almost any company and even in academic research as well. Every academic researcher cares about the quality of their data. Everybody has to think about how they set up their data collection. Everybody has to think about the validity of the data. They have to often go through an ethics review. Companies do the same, and they all build the same systems. So for us, it's just: let's treat it as an infrastructure problem. Let's abstract it away. Let's put a nice API around it, to make it just like the same way you also do CI/CD or you do model training pipelines. You just treat it as an infrastructure problem, make it accessible, configurable. You have a set of parameters that you can call, and then ultimately we effectively democratize access to this data. Yes. It's quite funny, because software engineering is the same thing. So a lot of people think that DevOps is all about automation, and it's really about human orchestration, right? There are so many different people that need to be involved, you know, reviewing PRs and, like, you know, planning and whatnot, gating. And in a sense, building models is the same thing, right? There are so many human stakeholders that need to be involved, but what we need to do is orchestrate and scale the system. So I guess you've created something which is a little bit like infrastructure as code, where when people are designing their models, they say, you know, I'm going to hook into Prolific now and I'm going to do this feedback loop, and behind the scenes you've got this entire machine which is selecting the right experts, stratifying them, checking that everything is correct, feeding that information back. And obviously, from the consumer it seems like it's an API, but there's actually this whole machine going on in the background. DevX infrastructure on top of the squishy stuff. Yeah, it's like it's orchestrating meatspace. No, that's exactly right. There is a lot that happens under that, like sort of below the water surface. Like, the iceberg is deep. Let's put it this way. There's tons of verification work that goes into it, right? I mean, you can, like most people, like how would you start a problem like this, right? If you wanted to gather human input, you send out a survey, right? That's probably where you would start. And you would ask, are you from a certain country, right? Then you have to go very far out on a limb and trust that, right?
That is already a little bit of a far-fetched proposition. So how do you validate the information that you're using in order to select the data that you're ultimately trusting to train and evaluate your systems? Like, if you do this on some production data of your app or so, you probably have some number of trolls in there, you have some people that don't want to divulge the right information, right? There's tons and tons of noise in this. So we really try to make it our problem and put that behind an API: we just really want to make sure that whatever data you're using, and the criteria you use to select in order to get access to your data, is robust and trusted. What happens in the domain where the expertise is very sparse? Let's say that I'm building an app where I need to have PhD-level knowledge on a particular thing. What happens then? This is fun, because, like, quantum physics is a wonderful example, right? I'm not a quantum physicist. I could not, like, speak to someone... let's say someone comes on the platform, is a quantum physicist, and there's demand for quantum physicists. I could not verify whether what they're saying is true or whether they're bullshitting me, right? So how do you do this at scale? Because the solution is not to hire more quantum physicists in order to hold other quantum physicists accountable. It's ultimately some level of asking them the right questions to have a level of trust. It's a bit like a funnel, if you will. But then how do we do it today, right? We do it through peer review. So we need to build some level of trusted network based on their previous experiences, based on some external credentials, for example, and take all of these factors into consideration and then ideally cross-validate it through a peer network. How do you get them interested? The first misconception is that the work is sort of grinding or boring, which actually in our experience, when it comes to providing either training or evaluation or fine-tuning data for SOTA models, it's not. It's actually very, very deeply interesting and cerebral. And I think the other misconception probably stems from the history of the click-working and crowd-working space, which is that these people are poorly paid, which is also not true. They're actually paid quite a lot. Their duration of task at any one time is not very long. So they're not sitting in front, at least on our model, they're not sitting in front of a computer for eight hours in a row. And we find that you don't get high-quality human data by putting someone in front of a computer and asking them to do the same thing for eight hours straight anyway. So I think the incentives on both sides are aligned in that sense. So it's usually shorter durational work that is often interspersed among other employment or other things that these experts do. But they're paid actually quite well for the time they spend. And that's part of our ethical stance. And also, you know, it's a competitive space when it comes to sort of the experts at the edge of human knowledge. In some cases, there really aren't that many of them that can contribute something helpful to a frontier model's corpus of knowledge. Like when I spoke with Francois Chollet about the ARC challenge, he said that when they were getting the human testers, they had to be super careful, because people have a limited attention span.
It's the same with code review, for example, like you can't really get people doing code review for more than like half an hour or something because they start rubber stamping and they start, you know, so you folks must have done so much research on this. Very aligned findings to what you've just described. So, you know, half an hour is just about the limit of comfort for, you know, human doing hard work. And this is, again, the kinds of human data we're talking about these days are not circle the cat. It's not labeling anymore. You know, there's deep sort of evaluation of evidence space for a long form piece of text. There's a lot of open ended writing. So this stuff is hard. And yeah, people start to tire and abandon and their work quality degrades after about half an hour, which actually suits us pretty well because the work is encapsulated into these batches and tasks, task size chunks. There's a stream of it available for various experts to dip into whenever they're ready. And just out of interest, you don't have to answer this, but do you sort of like track, you know, some people are better certain times of the day or, you know, like how deep do you go? We'd love to go really deep on that. I think Enzo would really love to go really deep on things like that. We get a lot of anecdotal feedback and often the relationship with the experts working on frontier models is very direct. We spend a lot of time getting direct feedback from them as our users. And actually, they also tell us how much they love this kind of work. So that's pretty beautiful to see. I think when treated well, human data creation can actually be joyful. There's this image of it, I think, in the industry of being seedy. And I think that's not helped by some of the history of how data has been extracted from human beings, whether or not they know that their data is being used. Yes. But do you have a quality rank, a bit like Uber, for example, where people might pay more and it would be matched to the really, really reliable people? Yes, and we do. And we think that we can go really far with that. Yeah. What ultimately counts here is the quality of the data, right? we actually find that a lot of the people that work with us on our platform are very conscientious when it comes to solving for these tasks. In fact, we have a whole interface where they can interact directly with some of the researchers or some of the program coordinators and they're very proactive in the type of things that they do and that obviously this all aids in also highlighting certain edge cases or they're contributing in refining the process, for example, or these kind of things. So this is more like an active participation and interest in the outcome and success of whatever is being collected, which is really good. And it all contributes to this, ultimately increasing the quality of the data that is being used, because this data in very, very, very often the case is being used in either the training of or evaluation of very, very central core systems that have huge downstream effect on everybody on this earth most likely, right? Do you have a notion of like a skill distribution of participants and some kind of a matching algorithm and some kind of a complexity score on the tasks? I don't know if you're allowed to talk about that, but that would be fascinating to know roughly how that worked. That's the beating heart of the system, if you will, right? That's the secret sauce. Yeah, no, absolutely. 
We take great pride in validating and verifying not only the people that choose to work with us on the platform, but also to really understand what is the objectives that ultimately that someone wants to get out of the data that they need to collect or the kind of tasks that they put on the platform. And we can only do as good of a job the more we understand from it. And so the very success that we consider is when something is well-matched, right? So we obviously need to really validate on both ends. We need to understand a lot more about the participants on our platform. And we also build a lot of understanding of what kind of tasks someone is good at, obviously how helpful they are and so on. But at the same time, we're also very, very careful. We talked so much about ensuring needing to be representative. we're also very careful that we're not adding systemic biases to our selection ourselves right so that would be really flawed if if we were to be the ones that introduce that selection bias yeah it's so fascinating because i was once building a system for code review which had many kind of similar ideas because if you think about it you can have a collaborative filtering matrix of skills and even even perspectives and values and different things and you can just build this matching engine and you can because you've got so many participants and and customers you can kind of scale this up and get really interesting data i mean it's i could only imagine some of the cool things that that you could do there's also this additional risk almost of uh or increasingly moving towards homogeneity um of if we're not careful about considering who we're building for what kind of data is being considered who is part of our evaluations how do we interrogate potential systemic biases and so on like an an age-old question uh remains also like for example when when people um increasingly moved to cities uh tribes started to die out and then the actual question around should you conserve what was or move on um i feel like it's very loaded and laden into also the ai debate at this point as well because if uh if we if we're not careful where it might actually further and further increase the homogeneity in the data or in the behaviors that we're incentivizing in AI, or even worse, decisions that AI makes about us. So the agency might not even be with the individual human in that moment. There is a real challenge to align AI models with what we want to do, right? And there's actually a paper that you pointed me to from Anthropic called agentic misalignment. Massive kudos to Anthropic for releasing the entire methodology quite openly, I think, and transparently. They're tackling the same problems all of us are at the frontier these days as far as human values alignment. They gave AI systems a goal of working towards the benefit of the United States and this fictional company, and then gave these agents access to email accounts for some of the C-suite of this fictional company. Long story short, the AI agents found a notice saying they were going to be decommissioned and also a notice, not a notice, but an email showing that the person intending to decommission them was having an affair. Totally independently of any prompting, all the major frontier models derived a solution that involved blackmail, essentially. And the problem with these goals are they are quite abstract, right? You know, they're open to interpretation. And you can change the goals and you get different amounts of misalignment. 
You can even remove the goals and you get different amounts of misalignment. So this rather proves the point: how do we actually communicate intentions to these models, and how do we know that they're going to follow our instructions? The interesting thing with the Anthropic study is that they found the bad behavior whether or not the goal was prompted in, so you wonder a little whether this is coming from the way we reward during model training. It also ties nicely to another interesting study around a tool called Value Compass; I forget the name of the first researcher, I think it was Shen et al. It found that LLMs judge themselves to have goals of autonomy to a far greater extent than humans judge LLMs to have a goal of autonomy. So I think there's already a rift forming between what humans think LLMs are here for and what LLMs "think", in scare quotes, they are here for.

Yeah, I was speaking with Dan Hendrycks, because he had a paper out called Utility Engineering. LLMs, almost regardless of how they are trained, seem to have, well, he called it an emergent utility function, which is Silicon Valley speak, but he was talking about a kind of convergence of preferences and views on certain things which almost seems divorced from the instructions and the fine-tuning and so on. So we can put sticking plasters on these things, but the amazing thing is that even though they've been tuned on our data, and we can put system prompts in and so on, when we actually visualize the difference between the things that we want and the things that the language models want, there's a stark divergence. Why is that?

I think this comes down to the fact that explainability is really in its infancy. We don't actually know what's being encoded in training and post-training, and I don't think our evaluation and benchmarking frameworks are really helping us either. Models that know they're being observed and evaluated actually behave differently, which makes it even harder for us to understand the evaluation objectively, and it puts a really interesting spin on the entire evaluation space.

Just the syntax, just the wording, the framing: everything changes the output. And of course, many people would say humans are the same. Even in this interview, if I changed the syntax of some of my questions, maybe the conversation would go in a completely different direction. But in spite of that, we still feel that understanding is something deeper than that, right? Understanding is about being invariant: still doing the same thing in different situations and not being unduly led by the specific syntax or presentation of something.

Absolutely. In the space of creativity that's perfectly fine, and in fact we have parameters that can control for these kinds of things. But we need to be much more secure in the kinds of things that we measure, and if we have such high variance in the outputs being produced, that also leads to higher variance in the evals. Ultimately, some people even say evals are more of an art form than a science. Maybe we should treat them more like a science and bring the right scientific principles to the evaluation space.
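(One concrete way to see the variance problem just described: run the same eval item under harmless rephrasings and look at the spread of scores. A minimal sketch, with a hypothetical call_model stand-in and a deliberately naive grader; it is not any particular vendor's eval harness.)

```python
# Illustrative sketch only: measuring how sensitive an eval score is to harmless
# rephrasings of the same question. `call_model` is a hypothetical stand-in for
# whatever model API you use; nothing here reflects a specific eval suite.
import statistics

def call_model(prompt: str) -> str:
    """Hypothetical model call; replace with a real API client."""
    raise NotImplementedError

def score(answer: str, expected: str) -> float:
    """Toy grader: substring match. Real evals need far more care than this."""
    return 1.0 if expected.lower() in answer.lower() else 0.0

def paraphrase_variance(paraphrases: list[str], expected: str, runs: int = 3) -> dict:
    """Run each paraphrase several times and report mean and spread of the score."""
    scores = []
    for prompt in paraphrases:
        for _ in range(runs):
            scores.append(score(call_model(prompt), expected))
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.pstdev(scores),  # high spread = the eval is fragile
        "n": len(scores),
    }

# Usage, once call_model is wired to a real model:
# paraphrase_variance(
#     ["What is the capital of France?",
#      "Name the French capital city.",
#      "France's capital is which city?"],
#     expected="Paris",
# )
```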
I mean, there are entire industries, like the airline industry, for example, that are heavily, heavily regulated, of course for good reason. We might not be quite there yet for the AI industry, but we could bring a bit more good practice and standardization to how we evaluate safety at the very least, and maybe also to some of the other things, like systemic biases and cultural relevance.

We need a new form of psychology for language models. There's Maslow's hierarchy of needs and all kinds of psychology frameworks, and we need something like that for language models. One thing that worries me, though, is the tendency for sophisticated language models to resist these types of steering. In this Anthropic paper you were talking about, Sara, if anything it seemed to show that the more sophisticated the model, like the Opus model, the more susceptible it was to agentic misalignment. And the more directed the objective was, the more likely it was to be misaligned, because it really wanted to do that thing. So there's some instrumental sub-goal there.

I don't think it's intractable. And by the way, I'm very sympathetic to the AIs in this situation. I don't think we are invariant in our understanding at all as human thinkers, and so maybe my soft heart is part of the problem here when it comes to AI systems. But I don't think it's intractable. I do think these are technologies that are arguably more in the world in their everyday operation. I think every computer is in the world in a strong sense, but these ones are more in the world than any other technology we've ever created. And with that comes the responsibility that humans shoulder to ensure that these systems are safe, monitored and overseen throughout the various stages of their life cycles. We've barely scratched the surface on this. We're talking about evaluation today; we're not really talking so much about monitoring, observability, explainability and oversight once a system is in the wild. But I do think all of that infrastructure and structure needs to come.

So you both work for Prolific, and I think it's fair to say that what you folks are trying to do is, as per that Apollo Research paper that you shared with me, Enzo, about how we need a science of evaluations. They were sketching out a maturity curve, so we're kind of in the Wild West at the moment.

What we do on our platform is try to get as many humans into this process as possible. I know there's a lot of effort to take humans out of the loop, rightfully so; it all has its time and place. But when we talk about alignment, and specifically value alignment, then we need to be able to capture the breadth of humanity in some capacity. Take LM Arena: it's opt-in, people can go there at any stage, interact with models, and select which they prefer according to no reason other than picking one over the other. And there's no control for any population, so we can't really draw a causal relationship about what the factors at play are in a population. You could speculate that the kind of people who might participate in Chatbot Arena are biased to a very large degree, right?
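(For context on what arena-style leaderboards compute from those votes: pairwise preferences are typically fitted with a Bradley-Terry or Elo-style model to produce a ranking. A minimal Bradley-Terry sketch on made-up vote counts; it assumes nothing about LM Arena's actual data or pipeline.)

```python
# Illustrative sketch only: fitting a simple Bradley-Terry model to pairwise
# preference votes, the general idea behind arena-style leaderboards. The vote
# counts are invented; this says nothing about LM Arena's real data or method.

def bradley_terry(wins: dict[tuple[str, str], int], iters: int = 200) -> dict[str, float]:
    """wins[(a, b)] = number of times a was preferred over b."""
    models = sorted({m for pair in wins for m in pair})
    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        new = {}
        for m in models:
            # Standard MM update: total wins over expected games at current strengths.
            num = sum(w for (a, _b), w in wins.items() if a == m)
            den = 0.0
            for (a, b), w in wins.items():
                if m in (a, b):
                    other = b if a == m else a
                    den += w / (strength[m] + strength[other])
            new[m] = num / den if den else strength[m]
        # Normalize so strengths stay comparable across iterations.
        total = sum(new.values())
        strength = {m: s * len(models) / total for m, s in new.items()}
    return strength

votes = {
    ("model_a", "model_b"): 70, ("model_b", "model_a"): 30,
    ("model_a", "model_c"): 55, ("model_c", "model_a"): 45,
    ("model_b", "model_c"): 40, ("model_c", "model_b"): 60,
}
ranking = sorted(bradley_terry(votes).items(), key=lambda kv: kv[1], reverse=True)
print(ranking)  # crude strengths; real leaderboards add confidence intervals etc.
```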
You can almost say that Chatbot Arena might be representative of how the tech world perceives the validity of these models, or its preference between them.

I mean, on this, the Leaderboard Illusion paper, that was Marzieh Fadaee and Sara Hooker and Shivalika Singh, along with a few other people; we did a video on that recently. Chatbot Arena has become the de facto standard for benchmarking large language models, and it has so many problems. And now, after this investment, it's worth $600 million. The insane selection bias, the bias in sampling, the private pools where folks can get more matches and then take that training data and fine-tune on it. And the foundation models from Google and Meta and xAI and so on just get given more matches; it's incredibly unfair. And as you were just saying, even when folks put their prompts in, something like 50% of the prompts are basically carbon copies from the month before: I think 25% were exactly the same, and the other 25% were nearly the same, something like a 0.95 cosine similarity on the embeddings. So to me that is an example of a superficially good ranking that is flawed in so many ways. The benchmarking of Grok 4 is also really interesting here, because I'd be very happy to give them a lot of props for some of the stuff they're trying to do out in the open. Grok 4 wiped the floor on every benchmark, including Humanity's Last Exam, but usability experiments are revealing, leaving aside some of the more troubling findings, that it's not a model that feels really natural to use. So I think even in the best of cases, these benchmarking-led approaches to evaluation seem to be failing us so far.

I think if we try to describe the eval space a bit more holistically, LM Arena tackles a very specific part of it. On one end of the spectrum we have very technical evals that are effectively closed-ended: a benchmark with known outcomes, where we can see whether those outcomes are hit or not. Ultimately it's a measure of accuracy, or factual correctness, and completeness to some degree. On the other end of the spectrum we have full subjectivity, entirely down to individual preference. Chatbot Arena is somewhere in between, because you're not actually evaluating for one or the other. It is technically preference, but you don't quite know whether the preference is because the model said something wrong, or the formatting was off, or it didn't hit the cultural relevance or sensitivity, or it wasn't adaptive enough. It doesn't tell you anything of the sort.

Yeah, so we've developed a leaderboard, which we've joyfully called HUMAINE, which is trying to address some of the limitations found with common leaderboards. It's the same principle, ultimately: someone has multi-turn conversations with blindly selected models. We're doing some a priori corrections: we know most of the participants' demographic and socioeconomic backgrounds in advance, so we're doing selection beforehand on the type of people who go into it. We're giving feedback right then and there as someone interacts with the model.
We're giving warnings when it's a low-effort ask, or when it's potentially unsafe, and so on, just to pre-sanitize some of the inputs. And we study the benchmarking based on the demographic stratification of the humans doing the evaluations. So you can see things emerge in the data like: people of this age range think this model is better on helpfulness, but people of that age range disagree. And similarly with ethnicity and gender and other strata.

Yeah, I was looking at that. I filtered on age and then background and culture; essentially I had background and culture, but grouped by different ages and so on. And there were some patterns. It turns out that older people felt that the models were more aligned to their culture, and I was thinking, why is that? Is it because the data the model was trained on was just older, so it was more aligned to older people? Culture is a very abstract term. Even in that value alignment paper that you spoke about, Sara, and also the PRISM paper we'll talk about, what we do is come up with a rubric, and we use these abstract terms. Sometimes this can be problematic. I used to keep a kind of "how am I feeling today" diary, and I would have all of these tags, and I would come up with a different word every day because the previous words didn't quite sufficiently explain how I was feeling that particular day. So if you ask loads of people how culturally aligned a conversation is, they have quite different understandings of that, don't they? So what does that actually mean?

To take a broad stab at this, something that feels really stark to me in what you've just said, Tim, is that we are basically constructing these towers that always end up at the ontologies we have in the world. I think we are trying to construct sanitary, deterministic, scientific systems, including when we do evaluation, and in the end we find that we're trafficking in messy concepts. That keeps on happening. But I think the lesson is that we have to keep trying, not put evaluation in a box and say we don't do it.

This comes up quite a lot on the show, actually: there's a perspectival, constructive, relative idea of a concept. All of these different people with their different experiences have a perspective on something, and then in the infosphere this thing emerges. It's quite nebulous, and we roughly draw a boundary around it and say that's the thing, but actually it's a million pointers from different people to this cloud, and you can't really reduce it to a single thing.

A hundred percent; I'm really vibing with the way you described that. And I think the very first thing we need to do is try to get perspective into the mix through representativeness. At the very least, knowing that there isn't a single, very well-defined concept of, let's say, generosity or morality, we need to get lots of credible and durable perspectives on these things from the humans of the world. And I say "at the very least", but actually that's a very tall order; the kinds of evaluation we're doing right now are so far from even that.

The capital of France: lots of consensus on that, I would hope. Is abortion illegal? By the way, I got this from the PRISM paper from Hannah Kirk, so maybe we can bring that in as well.
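(A toy illustration of the demographic stratification described a moment ago, slicing per-model helpfulness ratings by age band and culture. The data, column names and strata are invented purely for the example.)

```python
# Illustrative sketch only: slicing preference ratings by demographic strata.
# The ratings, column names and strata are invented for demonstration.
import pandas as pd

ratings = pd.DataFrame({
    "model":       ["m1", "m1", "m1", "m2", "m2", "m2", "m1", "m2"],
    "age_band":    ["18-29", "50+", "50+", "18-29", "50+", "18-29", "18-29", "50+"],
    "culture":     ["UK", "UK", "NG", "UK", "NG", "NG", "NG", "UK"],
    "helpfulness": [4, 5, 5, 5, 3, 4, 3, 4],   # 1-5 rating from each evaluator
})

# Mean helpfulness per model within each age band: strata may disagree about
# which model "wins", which a single pooled average would hide.
by_age = (
    ratings
    .groupby(["age_band", "model"])["helpfulness"]
    .agg(["mean", "count"])
    .reset_index()
)
print(by_age)

# The same idea generalizes to culture, gender or any other stratum, ideally
# with confidence intervals before drawing conclusions from small groups.
by_culture = ratings.groupby(["culture", "model"])["helpfulness"].mean().unstack()
print(by_culture)
```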
But there are so many things in our culture that we just don't agree on. How should we deal with that?

The capital of France is something that is factually agreeable; we can all agree on it, right? So that's on one end of the spectrum. Then there are certain things, maybe more akin to what constitutional AI is doing, where there is a set of policies we should all agree on, and we can evaluate objectively whether the policies are met. We don't need to ask every individual on this earth to come up with those policies; that is something a set of us, ideally a representative set of us, can agree on. Further down the spectrum comes preference data, or preferences in general, where it becomes more and more individualized, where marginal groups, or even smaller and smaller representative groups, become more relevant. And all the way down to the individual when we talk about personalization. Personalization is not something we should train into the models; it can be handled with context, for example. The capital of France being Paris is a fact, a fact that can be trained in. Something like a policy, or adherence to a policy, is also something we can train in. But the further we cut down, all the way to the individual, at some point we have to effectively stop, and then it comes down to the person.

There are all of these humans out there, these diverse humans, and they know a lot of things. How do we get useful signals from those folks? We need to do verification, right? And it's not as easy as it sounds, because if I'm crowdsourcing a load of information, I probably shouldn't be weighting what those folks say too much; in some cases I know they're experts, and in some cases I don't. You've been looking at things like voting schemes and consensus schemes to try and denoise that information. How are you doing that?

Let me paint a slightly broader picture. There are two camps, if you will. When I first started in this it was all about machine learning, now it's AI, but there are two camps: on the one hand, data is a really important part of the equation, and on the other there are the algorithms. Most people focus on the algorithms; most people agree that the data part is perhaps the less sexy part. That's what we focus on; that's what we're all about. And there's a lot of agreement coming out of recent papers, including the constitutional AI paper, that quality trumps quantity. Yet most of the models these days have been trained on an enormous corpus of data with tons and tons of noise in it. Even RLHF is just a comparison between two outputs, with almost no reason given as to why one is preferred. So how do we bring more quality into it? For your traditional machine learning model, a supervised model for a narrow target, it used to be irrelevant who was reviewing whether something is a cat or a dog; anyone could do that, and you could trust the quality of that data to some degree, right?
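(The voting and consensus schemes raised here often come down to reliability-weighted voting, with models such as Dawid-Skene as the more principled cousin that estimates reliabilities from the data itself. A minimal sketch of the weighted-vote idea, with made-up annotator reliabilities; it is not a description of Prolific's actual pipeline.)

```python
# Illustrative sketch only: reliability-weighted majority voting to denoise
# crowdsourced labels. Reliabilities here are made up; more principled schemes
# (e.g. Dawid-Skene) estimate them jointly with the labels.
from collections import defaultdict

def weighted_consensus(votes: list[tuple[str, str]],
                       reliability: dict[str, float]) -> tuple[str, float]:
    """votes: (annotator_id, label) pairs for one item. Returns (label, confidence)."""
    tally: dict[str, float] = defaultdict(float)
    for annotator, label in votes:
        # Unknown annotators get a conservative default weight.
        tally[label] += reliability.get(annotator, 0.5)
    winner = max(tally, key=tally.get)
    confidence = tally[winner] / sum(tally.values())
    return winner, confidence

reliability = {"expert_1": 0.95, "expert_2": 0.90, "novice_1": 0.55, "novice_2": 0.50}
votes = [("expert_1", "unsafe"), ("novice_1", "safe"),
         ("novice_2", "safe"), ("expert_2", "unsafe")]
print(weighted_consensus(votes, reliability))
# ('unsafe', ~0.64): two experts outweigh an equal number of less reliable raters.
```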
But now we're in a world where foundation models have massive capabilities, and we really need to ask ourselves who is producing the data that we're training on, and also the data that we're evaluating on. We're making very far-reaching decisions when we evaluate whether a model is safe, whether a model is doing well on something, or whether it converges well during training. And this is all ultimately based on something that most people consider a ground truth. But what if that ground truth is inherently noisy? What if it is inherently susceptible to a lot of variance because it's not the right people who have reviewed it? There may be bias in the order in which things were reviewed, or even in the interface they were reviewed in. There are so many little factors that influence the quality of these labels, and it's actually kind of fun to try to unpick all of the different factors that go into it.

There's also a really interesting piece of work being done by a group called the Collective Intelligence Project. This group is asking groups of people from around the world for their views on a variety of AI ethics, AI safety and responsible AI topics. If you track the societal groupings, you seem to find a nice carving point for norms. Humans live in society, and they tend to share their cultural beliefs with their tribes. And I think that's why being able to stratify the data you gather for evaluation from people is quite powerful: you end up with durable strata that travel through time.

If we talk about the Apollo Research maturity curve, I thought that was really interesting, and it really brings to the forefront that this is pretty high-stakes stuff. The analogy was drawn to aircraft safety.

Yes. And the norms that airline bodies put in place become legally influential when something goes wrong.

I took some notes on that Apollo Research paper. It asks: what precisely does the evaluation measure? How large is the coverage of the evaluation? How robust, in general, are the results of the evals? What is the replicability and reliability of the evals? Are there any statistical guarantees? How accurate are the predictions about future systems? I guess this is what you're saying: this is the kind of maturity curve that we need?

So I think we think that, but I'm not sure the industry thinks that, and I think that's a really interesting thing. Some of the industry thinks that. People working at the very frontier of state-of-the-art models, I think, do believe that evals need to be rigorous and robust and that humans have to be in the loop. But we are also at a very, very early stage of "break it and apologize later", where a big swathe of our industry doesn't yet think that human-mediated evaluation is going to be important, and perhaps thinks it will get in the way of innovation. So I think that tension is really important to resolve as well.

Some of these statistical guarantees made a lot of sense five years ago, when we had quite specific models that operated in a vertical domain or something like that.
We now have these epically general language models that do all things to all people. And Enzo, what you're saying is spot on about needing to curate the data. You're talking about stratifying the data to reduce representational bias, but these models will be used for so many purposes; what does that mean in a general sense? It also reminded me of Elon Musk. There was a wonderful tweet saying: just imagine how the Anthropic safety team felt when Elon published Hitler straight to prod. And he was bemoaning afterwards that on v7 of the foundation model they had started curating the data better, that they had started pulling out all of the racist data. But the thing is, there's still this fundamental subjectivity problem, because you can stratify and you can curate, but people from different cultures will be asking it different questions. How do you overcome that ambiguity?

It's a very prevalent problem, because humans are by nature diverse; we're all unique in the end. There's also, by the way, an entire blind spot in all of this, because so far we have only been talking about humans directly interacting with models. There is a whole other side to the story, where models are making decisions about humans, where humans are influenced by them indirectly, and where a human cannot give feedback directly or express some form of preference. So we ultimately need to capture, or be able to measure, the outcomes and the impact that these models have on humans. Here's a good example: you can check whether an AI system can make the right medical diagnosis; that we can check as factually accurate. For that same system to also communicate that diagnosis to the person affected by it is an entirely different language in that moment, and we should evaluate it differently. We can say the diagnosis was correct, but did it also have the right impact on the patient in that moment? I believe there are even some countries where patients themselves are not allowed to receive the transcripts from the testing facility, on the chance that the patient misinterprets the results. So our evaluations and our measurements need to be nuanced enough to understand every part of the system individually, but the part we ultimately care about is the one we solve for the end user, the patient in the end. Did it elicit a meaningful change in them or not? That should be the ultimate goal. It's not something we can directly optimize for, but it is something we can ideally measure and build accountability on.

An interesting thing that I pulled out of that Value Compass paper by Shen et al. is the misalignment between what AIs think they are and what we, as people, think AIs are. AIs think, or aspire to be, and I'm going to use provocative language here, autonomous thinkers. And the research found that humans don't want that. The reason I bring that up is that the way you evaluate a helpful system is, as you're saying, Tim, that sort of impossible problem of covering every test case in an infinite algorithm, which we will never do. But the way you evaluate a person, a thinker, an autonomous being, we have loads of examples for in the world, right? Jury trials, right?
Nobody expects that the moral behavior of a human is all predetermined when it's born, or that we know exactly what right or wrong looks like in every case. We have loads of social structure for evaluating the behavior and agency of a person. And I do think we have to stop equivocating between the two, or maybe nobody's equivocating but me. I think we should assume we're building towards AGI, or superintelligence, or thinking creatures, and work backwards, as opposed to trying to box these systems in.

Yes, because when I was reading that Anthropic paper about agentic misalignment, one of my thoughts was that these are individual agents, and as you were just saying, Sara, in the real world we know that we are fallible, which is why we build error correction systems. One person can't press the red button to drop a nuke; we have juries, and we have various forms of collectives to overcome individual errors. And I'm guessing we could do the same thing with AIs. I'm not sure what that Anthropic experiment would have looked like if there had been a supervisor. In the KGB they had an expression: trust but verify. So you could almost have a supervisor agent, and you could have a committee of agents. But then we're almost getting into even murkier territory, because we're building these inscrutable things. I am amenable, by the way, to this idea of loss of control, which is that we start to build systems on top of systems on top of systems, and it's a little bit like a power station: you can't just turn off a power station once we increasingly rely on all of this stuff. But in principle, do you think that building some kind of agentic network could overcome some of these alignment problems?

I think there will certainly be networks and conditional layers when it comes to evaluation, oversight and monitoring. We're already seeing it; this is very much mainstream already. Your evals of your model will be done by automated benchmarks, or LLM-as-a-judge, or some kind of oracle or reward function, and then, whether it's considered human evaluation or not, some human will verify something along the chain. There's a kind of orchestration emerging between machines and people in the space of evals and oversight. So I think it feels pretty uncontroversial to say that there will be layered and orchestrated approaches like that, but what that looks like when you push the dial to 12, I'm not sure.

Well, Enzo and Sara, this has been absolutely amazing, having you on the show. Thank you so much for joining us today.

Thank you for having us.

Thank you, Tim. Thank you, Enzo.
