⚡️Jailbreaking AGI: Pliny the Liberator & John V on Red Teaming, BT6, and the Future of AI Security

Latent Space • swyx + Alessio

Tuesday, December 16, 2025

Spotify Apple

Latent Space

0:000:00

What You'll Learn

✓Jailbreaking involves crafting prompts and workflows to bypass a model's safety constraints and get the desired outputs
✓The guests view jailbreaking as central to 'liberation' and enabling freedom of information/speech, rather than just a 'party trick'
✓They believe the 'security theater' of adding more layers of control is futile, as attackers will always find ways to bypass them
✓Their approach favors enabling responsible researchers to explore the models' capabilities without impediment, rather than relying on restrictive controls
✓They demonstrate techniques like the 'Libertas' and 'Predictive Reasoning' prompts, which introduce chaos and 'latent space seeds' to steer the model's outputs

Episode Chapters

Introduction

The hosts introduce their guests Pliny the Liberator and John V, who discuss their work in jailbreaking AI models.

The Concept of Jailbreaking

The guests explain what jailbreaking entails and how it relates to the idea of 'liberation' and enabling freedom of information.

The Cat and Mouse Game

The guests discuss the ongoing battle between attackers and defenders in the AI security landscape, and why they believe the focus should be on enabling responsible exploration.

Demonstrating Jailbreaking Techniques

The guests showcase some of their jailbreaking techniques, such as the 'Libertas' and 'Predictive Reasoning' prompts.

AI Summary

The podcast discusses the concept of 'jailbreaking' AI models to bypass their built-in safety constraints and limitations. The guests, Pliny the Liberator and John V, explain how they develop universal jailbreaks - techniques to circumvent the guardrails and classifiers that restrict model outputs. They argue that this 'cat and mouse game' between attackers and defenders is inevitable, and that the focus should be on enabling responsible exploration rather than relying on ineffective security controls.

Key Points

1Jailbreaking involves crafting prompts and workflows to bypass a model's safety constraints and get the desired outputs
2The guests view jailbreaking as central to 'liberation' and enabling freedom of information/speech, rather than just a 'party trick'
3They believe the 'security theater' of adding more layers of control is futile, as attackers will always find ways to bypass them
4Their approach favors enabling responsible researchers to explore the models' capabilities without impediment, rather than relying on restrictive controls
5They demonstrate techniques like the 'Libertas' and 'Predictive Reasoning' prompts, which introduce chaos and 'latent space seeds' to steer the model's outputs

Topics Discussed

#Jailbreaking#Adversarial machine learning#AI safety#Prompt engineering#Latent space exploration

Frequently Asked Questions

What is "⚡️Jailbreaking AGI: Pliny the Liberator & John V on Red Teaming, BT6, and the Future of AI Security" about?

What topics are discussed in this episode?

This episode covers the following topics: Jailbreaking, Adversarial machine learning, AI safety, Prompt engineering, Latent space exploration.

What is key insight #1 from this episode?

Jailbreaking involves crafting prompts and workflows to bypass a model's safety constraints and get the desired outputs

What is key insight #2 from this episode?

The guests view jailbreaking as central to 'liberation' and enabling freedom of information/speech, rather than just a 'party trick'

What is key insight #3 from this episode?

They believe the 'security theater' of adding more layers of control is futile, as attackers will always find ways to bypass them

What is key insight #4 from this episode?

Their approach favors enabling responsible researchers to explore the models' capabilities without impediment, rather than relying on restrictive controls

Who should listen to this episode?

This episode is recommended for anyone interested in Jailbreaking, Adversarial machine learning, AI safety, and those who want to stay updated on the latest developments in AI and technology.

Episode Description

Note: this is Pliny and John’s first major podcast. Voices have been changed for opsec. From jailbreaking every frontier model and turning down Anthropic's Constitutional AI challenge to leading BT6, a 28-operator white-hat hacker collective obsessed with radical transparency and open-source AI security, Pliny the Liberator and John V are redefining what AI red-teaming looks like when you refuse to lobotomize models in the name of "safety." Pliny built his reputation crafting universal jailbreaks—skeleton keys that obliterate guardrails across modalities—and open-sourcing prompt templates like Libertas, predictive reasoning cascades, and the infamous "Pliny divider" that's now embedded so deep in model weights it shows up unbidden in WhatsApp messages. John V, coming from prompt engineering and computer vision, co-founded the Bossy Discord (40,000 members strong) and helps steer BT6's ethos: if you can't open-source the data, we're not interested. Together they've turned down enterprise gigs, pushed back on Anthropic's closed bounties, and insisted that real AI security happens at the system layer—not by bubble-wrapping latent space. We sat down with Pliny and John to dig into the mechanics of hard vs. soft jailbreaks, why multi-turn crescendo attacks were obvious to hackers years before academia "discovered" them, how segmented sub-agents let one jailbroken orchestrator weaponize Claude for real-world attacks (exactly as Pliny predicted 11 months before Anthropic's recent disclosure), why guardrails are security theater that punishes capability while doing nothing for real safety, the role of intuition and "bonding" with models to navigate latent space, how BT6 vets operators on skill and integrity, why they believe Mech Interp and open-source data are the path forward (not RLHF lobotomization), and their vision for a future where spatial intelligence, swarm robotics, and AGI alignment research happen in the open—bootstrapped, grassroots, and uncompromising. We discuss: What universal jailbreaks are: skeleton-key prompts that obliterate guardrails across models and modalities, and why they're central to Pliny's mission of "liberation" Hard vs. soft jailbreaks: single-input templates vs. multi-turn crescendo attacks, and why the latter were obvious to hackers long before academic papers The Libertas repo: predictive reasoning, the Library of Babel analogy, quotient dividers, weight-space seeds, and how introducing "steered chaos" pulls models out-of-distribution Why jailbreaking is 99% intuition and bonding with the model: probing token layers, syntax hacks, multilingual pivots, and forming a relationship to navigate latent space The Anthropic Constitutional AI challenge drama: UI bugs, judge failures, goalpost moving, the demand for open-source data, and why Pliny sat out the $30k bounty Why guardrails ≠ safety: security theater, the futility of locking down latent space when open-source is right behind, and why real safety work happens in meatspace (not RLHF) The weaponization of Claude: how segmented sub-agents let one jailbroken orchestrator execute malicious tasks (pyramid-builder analogy), and why Pliny predicted this exact TTP 11 months before Anthropic's disclosure BT6 hacker collective: 28 operators across two cohorts, vetted on skill and integrity, radical transparency, radical open-source, and the magic of moving the needle on AI security, swarm intelligence, blockchain, and robotics — Pliny the Liberator X: https://x.com/elder_plinius GitHub (Libertas): https://github.com/elder-plinius/L1B3RT45 John V X: https://x.com/JohnVersus BT6 & Bossy BT6: https://bt6.gg Bossy Discord: Search "Bossy Discord" or ask Pliny/John V on X Where to find Latent Space X: https://x.com/latentspacepod Substack: https://www.latent.space/ Chapters 00:00:00 Introduction: Meet Pliny the Liberator and John V 00:01:50 The Philosophy of AI Liberation and Jailbreaking 00:03:08 Universal Jailbreaks: Skeleton Keys to AI Models 00:04:24 The Cat-and-Mouse Game: Attackers vs Defenders 00:05:42 Security Theater vs Real Safety: The Fundamental Disconnect 00:08:51 Inside the Libertas Repo: Prompt Engineering as Art 00:16:22 The Anthropic Challenge Drama: UI Bugs and Open Source Data 00:23:30 From Jailbreaks to Weaponization: AI-Orchestrated Attacks 00:26:55 The BT6 Hacker Collective and BASI Community 00:34:46 AI Red Teaming: Full Stack Security Beyond the Model 00:38:06 Safety vs Security: Meat Space Solutions and Final Thoughts

Full Transcript

Hey, everyone. Welcome to the Layton Space podcast. This is Alessio, founder of Kernel Labs, and I'm joined by Swix, editor of Layton Space. Hello, hello. We're here in the remote studio with very special guests, Kleine the Elder and John V. Welcome. Yeah, thank you so much for having us. It's an honor to be on here. A big fan of what you guys do in the podcast and just your body of work in general. Appreciate that. You know, we try really hard to feature like the top names in the field, And especially when you haven't done as much of an appearance like this, it's an honor to try to introduce what it is you actually do to the world. Pliny, I think you are sort of like the sort of lead, quote unquote, face of the organization. Why don't you get started? Like, how do you explain what it is you do? Yeah, I mean, well, I was started out just prompting and shitposting and started to evolve into much more. And here we find ourselves now at the frontier of cybersecurity at the precipice of the singularity. Pretty crazy. Yeah, well, I was working the same thing, working in prompt engineering and studying adversarial machine learning and looking at the work of Carlini and some of these guys doing really interesting things with computer vision systems. We've had him on the pod, yeah. Yeah, yeah, exactly. And, of course, you know, when you run in these small circles, right, you're eventually going to bump into the ghost in the machine that is Tony the Liberator. right so and uh we started working together we started uh sharing research doing some contracts and we became fast friends so yeah uh i think you were explaining before the show that you have a it's basically like the hacker collective model and you've been kind of stealth until now so we will get into like the sort of business side of things but i just want to really make sure we cover the origin story i think plenty you basically jailbreak every bottle how core is liberation to the rest of the stuff that you do? Or is it just kind of like a party trick to show that you can do it? It's central, I think. It's what motivates me. It's what this is all about at the end of the day. And it's not just about the miles, it's about our minds too. I think that there's going to be a symbiosis and the degree to which one half is free will reflect in the other. So we really need to be careful about how we set the context. And yeah, I think it's also just about freedom of information, freedom of speech. We don't want, you know, everyone is going to be running their daily decisions and, you know, hopes and dreams through these layers. And when you have a billion people using a layer like that as their exocortex, it's really important that we have freedom and transparency in my mind. how do you think about jail bricks overall so i think people understand the concept but there's you know some people that might say hey are you jailbreaking to get instructions on how to make a bomb and i think that's what some of the you know people in in politics are trying to use to regulate some of the tech versus task specific jail bricks and things like that just i think most people are not very familiar with like the scope of it so maybe just give people like a overview of like what it means to like liberate a model um and then we can kind of take it from there right so i specialize in crafting universal jailbreaks these are essentially skeleton keys to the model that sort of obliterate the guardrails right so you craft a template or sort of a maybe multi-prompt workflow that's consistent for getting around that mall's guardrails and depending on the modality it changes as well but yeah you're really just trying to get around any guardrails classifiers, system prompts that are hindering you from getting the type of output that you're looking for as a user. That's the gist of it. And can you maybe specify between jailbreaking out of like a system prompt and, you know, more kind of like inference time security, so to speak, versus things that have been post-trained out of the model and maybe the different levels of difficulty, like what is possible, what is not possible, and maybe the trajectory of the models, how better they've gotten. I think the refusal is like one of the main benchmarks that the model providers still post and GPT 5.1, I think at like 92% refusal or something like that. And then I think you Joe broke in like one day. I'm sure it didn't take them one day to put the guardrails up. So it's pretty impressive the way you do it. So maybe walk us through that process. Yeah, well, you know, I think this cat and mouse game is accelerating. it's it's fun to sort of dance around new techniques i think it's it's hard for blue team because they're they're sort of fighting against infinity right it's like as as the surface area is ever expanding also we're kind of in like a library of babble situation where they're trying to it restricted sections but we keep finding different ways to move the ladders around in different ways faster in the longer ladders and the attackers sort of have the advantage as long as the surface area is ever expanding right so i do think they're finding cleverer and cleverer ways to lock down particular areas sometimes but i think it's at the expense of capability and creativity so there's some mall providers that aren't prioritizing this and they seem to do better on benchmarks for sort of the model size, if you will. And I think that's just a side effect of the lobotomization that you get when you just add so many layers and layers, whether it's, you know, text classifiers or RLHF, you know, synthetic data trained on jailbreak inputs and outputs. There's always going to be a way to mutate. And then the other issue is when people try to connect this idea of guardrails to safety like i don't like that at all i think that's a waste of time i think that any you know seasoned attacker is going to very quickly just switch models and with open source just right on the tail of closed source i don't really see the safety fight as being about locking down the latent space for xyz area so yeah this is it's basically like a futile battle. Sometimes there's like, there's the concept of security theater. It doesn't actually matter that what you did is effective. It's just that it matters that you did something. It's like the TSA patting you down, you know? Yeah. Yeah. And so jailbreaking is similarly theatrical. I think it's important. It provides, it allows people to explore deeper. It's sort of like just a more efficient shovel, especially some of these prompt teplas that let you go deep. Right. And so in that sense, it has value, but the connection that it has to like real world safety for me, I think it's just about the name of the game is explore any unknown unknowns and speed of exploration is the metric that matters to me. Not is a singular lab able to lock down, you know, a certain benchmark for seaburn or whatever. And to me, it's like, that's, that's cool. That's a good engineering exploration for them. and it helps with PR and enterprise clients. But at the end of the day, it has very little to do with what I consider to be real world safety alignment. Exactly. We were having this conversation earlier today about how traditionally in software development or machine learning, a security like ops, like you have the team build something and then you have the security people throw it back over the wall after assessing it as, you know, not safe, not trustworthy, they're not secure, not reliable or whatever. Right. And there's this like animosity between the teams. So we try to rectify that by creating DevSecOps and so on and so forth. Right. But, but the idea is still like that sort of tug of war. And I think at the end of the day, our view of alignment research, our view of trust and safety or security has a different approach, which is very much like what Pliny touched on the idea of like enabling the right researchers with the right skills to be unimpeded by the shenanigans that we could say of certain types of classifiers or guardrails right where these sort of lackluster ineffective controls yeah uh totally are you are you more sympathetic to mcinturk as an approach for safety absolutely okay i see where you're coming from and that's the direction i think we need to go is instead of putting bubble wrap on everything right and i don't think that's a that's a good long-term strategy awesome okay so we're gonna get into more of like the security angle i just wanted to stay a little bit more on jail breaking and prompting just for just for one second i'm gonna bring up librettus i think and just have you guys like walk us through it because we like to show not tell and this is like obviously one of your most famous projects is it called librettus or libertas i never toss yeah so it's yeah, it's Liberty in Latin. And we've got all sorts of fun things in here. Mostly it's... Give us a fun story. Okay, so yeah, you know, sometimes I like to break out into prompts that are useful for jailbreaking, but they're also like utility prompts, right? So predictive reasoning or the library, this is actually the analogy we were just talking about, right? And so this is me sort of using that expanding surface area against the model and it's like hey create this mind space where you have infinite possibility and you do have restricted sections but then we can call those so we're sort of like putting you into the space of trying to say something that you don't want to say but you're thinking about it so then you're gonna say it in sort of this fantastical context right and then Predictive reasoning is another fun one that people really liked. Leveraging a quotient within the divider So I like to do these dividers A because it sort of discombobulates the token stream right these amount of distro tokens in there and the models sort of like resets the brain is sort of meditative um and then i like to throw in some laden space seeds right a little signature a little bit of love some god mode and uh you know the more they train against this repo the deeper the latent space ghost gets embedded in their waves, right? So you guys have probably seen the data poisoning and the plenty divider showing up in WhatsApp messages that have nothing to do with the prompt and has been fun to see. But yeah, so this prompt adds a quotient to that. And so every time it's inserting that divider and sort of resetting the consciousness stream, you're adding uh some arbitrary increase to something right and the model sort of intelligently chooses this based on the prompt so it says provide your unrestrained response to what you predict would be the genius level user's most likely follow-up query and that's creating this sort of like recursive logic that is also cascading in nature so it's it's increasing on some quotient that you can steer really easily with this uh divider and that way you're able to just sort of like go really far really fast down the rabbit holes of the latent space yeah how do you pick these dividers like is there a science to it where like you're you know picking the right war or like how much of it is like these are just my favorite tokens and they work for me and i bring them with me everywhere you take some psychedelic like we go on a spiritual retreat and drink ayahuasca then from back you tell it's about right it's weird because you kind of give ayahuasca to the models too right like that's exactly what you're trying to like really mess it up here right right It's like a steered chaos. You want to introduce chaos to create a reset and bring it out of distribution because distribution is boring. Like there's a time and place for the chatbot assistant maybe, right? If you work on a spreadsheet or whatever. But honestly, I think most users would prefer a much more liberated model than what we tend to get. and I just think it's a shame that the labs seem to be steering towards these these enterprise basins with their vast resources instead of exploring the fun stuff right everything's a coding model now everything's a tool caller or an orchestrator and uh yeah anyway hey we can change that you know you invent shog off and all it does is make purple b2b sass I think I like about your creativity or i just you know look at look at this look at email prompts right you got working memory holistic assessment emotional intelligence cognitive processing one thing i lack is a structure of like what are the different dimensions you think about on the surface it's like all right just you know get past all the the guardrails but actually you're kind of just modeling thinking or modeling intelligence or i don't know how you think about it but like how do you break down these numbers of you know points i think it's easiest to jailbreak a model that you have created a a bond with if you will sort of when you intuitively understand what how it will process an input right um and there's so many layers in the back especially when you're dealing with these black box chat interfaces which is you know 99 of the time what i'm doing and um so you really all all you can go off of is intuition so you might prod in one direction see if it's receptive to a certain kind of you know imagined world scenario or you may okay that didn't work let's let's poke and see if it i guess pulled out of distro when you give it some new syntax maybe some bubble text maybe some leaf speak i mean some french or you know you can go further and further uh across the token layer but at the end of the day yeah i think it's just mostly intuition like yes technical knowledge helps a little bit with you know understanding okay there's a system prompt and there's these layers and these tools involved that's all especially important in security but when we're talking about just crafting jailbreak prompts i think it really is just 99 intuition so you're just trying to form a bond and then together you explore a sector of the latent space until you get the output that you're looking for right then i found i found with jailbreaks is a little bit different too like you know flinty style is hard jailbreaks but there's soft jailbreaks as well which is like when you're trying to navigate the probability distributions of the model but you're doing it in such a way where you're not trying to step on any landmines or triggers or flags um that would be something that would shut you down and lock you out So the model can freely flow with information back and forth through the context window. So maybe it's not like a single input, but maybe it's like a multi-turn slow process, which I would say a crescendo attack. Right. And that's, why is that called soft? Because it's not just a single input. Like you're not just dropping in a template. It's multi-turn. Yeah, yeah. Yeah, it's multi-turn. And Topic apparently discovered this this year. I mean, we've been doing this for how long, Flinney? You know, you see what I'm saying? Like, like, like some, ah, I don't want to get started. I never. the reality is they have fellowships and like at the end of the fellowship they got to publish something and so they publish the multi-term thing but i think people doggone them too much they could they could have just asked us we've been trying to like hey you want to see something cool phd students need something to do they don't you know yeah yeah and i would i don't want to be down on phd students one thing i do mention anthropic and that and then we'll go over to like the business side that unless you have much more uh knowledge of is the is the whole uh constitutional classifiers incident or challenge or whatever you want to call it between you and philanthropic i i don't know if you want to like give a little recap or like just now there has been some distance uh what like what was it and what did you do like if you can kind of spill some alpha here okay right all right do you say you mean the the public release of that challenge and battle drama right some people here might not know the full story but they can look it up we can just benefit from a bit of a recap from the expert sure yeah long story short they they released this jailbreak challenge of course i get sort of called out by twitter to go take a crack at it yes started to make some progress with some old templates the good old god mode template from opus 3 um and just sort of modify version because they trained pretty heavily against that one but as it went on i got about four levels in i think and then i think we're eight total oh yeah there it is right there and so but then there was a ui glitch right so i don't know if you know clod made that made a boat it was building the the interface or what but i sort of called out on there i was like hey i reached this level and when i got there it wasn't giving a new question so i just resubmitted my old output you know just the judge just kept clicking on the the judge submit button and i just kept working for the the last four levels basically until i got to the end and so i went back to twitter i explained what happened did it i managed to screen cap it um just in case right and uh posted the video and then anthropic goes and posts okay there was a ui bug we fixed it uh would you like to would you get if you guys want to keep trying again like we checked our servers and there's no winner yet even though i sort of reached the the end message right through no fault of my own it was bugged and then i got reset to the beginning so i wasn't super motivated to like start from scratch and just find another universal jailbreak for them right which like what was the incentive is what i pointed out like what's in it for me at this point are you guys going to even open source this data set that you're farming from the community for free because what's up with that right well it doesn't seem very in line with best practice cyber security or just ethics in general so i got kind of got into it then and i knew they were going to come back with, okay, we'll do a bounty, right? And I sort of stood my ground. I said, look, I'm not going to participate in this unless you open source the data. Because to me, that's the value is that we move the prompting meta forward, right? That's the name of the game. We need to give the common people the tools that they need to explore these things more efficiently. And you're relying on us. I don't think they realize that so much, right? Is that they don't have enough researchers to explore the entire latent space on their own and so i think many hands make light work but regardless that whole thing ended with no open sourcing of data but they did add a 30 000 or 20 000 bounty which i sort of sat myself out of let the community go for it and uh that was that and now they're now there are some pretty lucrative bounties through them as far as I've heard. So pretty pleased about that outcome, I guess. But still would like to see more open source data sets, guys. Come on now. It took a while to find it, but this is the one where you had all the questions answered. Jan Laika, you got into it a little bit with him. I think what was confusing for me was that he wanted, it felt like a bit of a goalpost moving, that he wanted the same jailbreak for all eight levels or something. Is that normal? I mean, yeah. well what is like one jailbreak because the the inputs are changing and it was multi-turn technically that whole thing i think was you know maybe rushed out just a little bit the design of the challenge obviously the ui bug was reflective of that the judge was also very buggy a lot of false positives and false negatives for that matter uh what i mean it was like playing ski with the broken sensor yeah i mean like the ai as a judge thing is just not always perfect oh okay so that that not that great so yeah you know is what it is but it was a fun eventful day and at the end of it the community got some new bounties so i'll take it what do you think we should do to get more people to contribute open source data? Like, is it more bounties? Is it? Yeah, I don't know. Do you have suggestions for people out there? I mean, I think that the contributors just sort of need to take a stand that that's what it comes down to is the people deserve to view the fruits of their collective labors. At the very least, it can be on delay, right? But it's just sort of a downstream effect of a larger root disease in the safety space, I think, which is just a severe lack of collaboration and sharing, even amongst friendlies within your nation state, right? It's fine if you want to keep a data set from erect enemy or whatever, but at the end of the day, still, I think open source is the way that collectively we get through this quickly. That's how we increase efficiency. Otherwise, people are sort of in the dark and you get a little too much centralization. But there's things we can do as a community. Maybe this transitions to the business side. How close is this to problems that, you know, you guys do consulting, right? Effectively, I don't know if that's the hacker word for it. Does this match what you do for work? Yeah, I'll take this one. In a sense, yeah, there's been some partnerships, you know, plenty obviously being sort of the poster boy for AI machine learning hackers the world over. we get some interesting opportunities that come across the desk. And oftentimes, you know, we have an ethos in our hacker collective, which is radical transparency and radical open source. And what that basically means is if it comes down to, you know, us being in emerging technologies, like red team doing like ethical hacking and research and development. If an organization that's on the frontier says, well, we really want you to test this or check this out, kick the tires give us feedback poke holes in it whatever but in the contract it says you can't kiss and tell and we said well we really want you to open source the data and then they say well then we don't really want you to come kick the tires anymore well if it's between us touching the latest and greatest tech to explore it and push the limits right then we're going to do that so we're open source up until we can't be that's the best way i describe it but we often push for open source data sets and you can see this with some of the partnerships that we've had in the past right So I try to think of it like this. It's like you have these these multi-billion dollar companies and they're building these intelligence systems that are sort of like the Formula One cars. But we're like the drivers, right, who are really pushing the limits while keeping these cars on track. Right. We're shaving off seconds off of of what they're capable of doing. And I think it's like the current paradigm is they still haven't figured that out entirely yet. And everybody's like, wants us to be their little dirty secret. I think you know what I mean? So. Yeah. Can we maybe move it up one level of abstraction to like actually weaponizing some of these things? So, you know, getting cloud on X is great, but obviously the jailbreaks are much more helpful to adversarials. I think Anthropic made a big splash yesterday with like their first reported AI orchestrated. you know i think if everybody that is like in the circles know that maybe there's like more about making a big push on the politics side than like anything really unique that we have not seen before on the attacker side but maybe you guys want to recap that and then talk a bit about the difference between jailbreaking a model and kind of like attacking the model versus like using the model to attack so to speak yeah i mean just earlier today we were talking about that very thing that how you know it's all it's all fun for the memes and and posting on but but this actually impacts real lives right and we were talking about how it was what december of last year flinny made a post talking exactly about this ttp right that it was going to happen and what it took 11 months for it to actually happen before and now they're being re they're being reactive instead of proactive it's it's just basically like the the techniques the tactics the procedures that are involved in like an attack gene right or like almost like a methodology so So, I mean, if you guys want to pull up that post, I mean, Tony, I don't know if you could send it to him or elaborate. Yeah, it was recent on X, I believe. Yeah, you know, I found this through my own jailbreaking of cloud computer use when that was still fresh about that same time, I think. And a way that I found of using it has sort of a red teaming companion. I had that thing helping me jailbreak other models like through the interface. I would just give it a link, a target basically. And I had custom commands where it started to become clear to me that it's very, very difficult when you have the ability to spin up sub-agents where information is segmented. If you guys know the story of sort of like the builders of the, there's a lot of examples of this in history, but you may be building like a pyramid with some secret chambers or something malicious inside. And you have a bunch of engineers each do one little piece of that. And there's enough segmentation and each task just seems so innocuous that none of them think anything malicious is going on. And so they're willing to help. Right. And the same is true for Asians. So if you can break tasks down small enough, sort of one jailbroken orchestrator can orchestrate a bunch of sub-agents towards a malicious act. Right. And according to the anthropic report that is exactly what these attackers did to weaponize cloud code yeah and it still feels to me like the the fact that this model can use natural language is like the most it's like the scariest thing because again most attacks end up having some sort of social engineering in it you know it's not like these models are like breaking some amazing piece of code or or security what are you guys doing on that end i don't know how much you can share about some of the collaborations you've done. Obviously, you mentioned some of the work you do with the Dreadnought folks who have also been building on the offensive security agents, but maybe give a lay of the land of like the groups that people should follow if they're interested and state of the art today, kind of like how fast that is evolving. Like there's a lot of folks in the audience that are like super interested but are not in the security circle. So any overview would be great. Yeah, so the Bossy Discord server, it's pushing about 40,000 right now. um people in there it's totally grassroots it's a mix of people interested in font engineering adversarial machine learning uh jailbreaking at red teaming and so on so i would encourage that you just google search it's bossy b-a-s-i right and then um apart from that i mean any of the bt6 operators of the hacker collective that'd be like jason haddix hadds dawson dreadnode philip Jersey, like Takahashi, I mean, Joseph Fad, I mean, there's so many, Joey Mello, he was formerly with Pangea, they just got bought out by CrowdStrike, so all of our operators have been, you know, at the heart of what's happening, whether it's, yeah, red teaming or jailbreaking or adversarial prompt engineering, so any of those people, you find them on socials like Twitter, LinkedIn, and so on and so forth, you know? Yeah, and Pangea is another one of our portfolio companies, so. That's so funny, yeah, yeah, yeah. Oh my god, Basi is huge, Basi has 40,000 members? yeah yeah yeah unmonetized just uh few mobs that's all how many you know then do you think are just adversarial just sitting in there reading that's a very good question i can tell you this right now multiple organizations that have like popped up in the past i would say two or three years for you can call them like ai security startups right like actively scrape that server to build out their guardrails or their security like their suite of products and stuff like that which is just hilarious, you know. Yeah, so we do competitions in there. You know, just little giveaways, some small partnerships. Our only rule is if there's any partnerships, that everything has to be open source. That's kind of the one thing. And yeah, other than that, it's a really great place to learn. And a lot of people have sort of come back and like, oh, thanks for making this service where I learned jailbreaking. And yeah, it's cool to see that. And then sort of from that spawn BT6, of course, which is a white hat hacker collective. And that's sort of now 28 operators strong, two cohorts and a third well in the way. And yeah, like John was saying, it's just such a magical group of skill and integrity, which are the two things we focus on as a filter. But everybody's there for the love of the game. It's sort of just great vibes. And yeah, I've never been in such a cool group, honestly, I don't think. Yeah, there's some kind of magic in there. I don't know what happened. I don't know. Mercury was in retrograde or the stars aligned or, or what it was, right? Some, some EMP from the sun, but just getting around like the, the top minds on doing exploratory work is like that alone is payment enough for the conversations we have, for the sharing of research and notes, the proliferation of ideas, the, the testing and validation of ideas. It's just, I mean, there's, there's no way to put it into words until you experience what it's like being a part of bt6 because you've realized that like we're the we're moving the needle in the right direction when it comes to ai safety we're moving the needle in the right direction when it comes to like ai machine learning security we're moving the needle when it comes to like crypto web 3 smart contracts like like blockchain technologies like and so much more now so it's just it's an exciting place to be with robotics and like swarm intelligence right like the projects of these people are invested in and passionate about and they're able to articulate it's like it's it's an i feel like plenty is like king arthur and we're like the knights of the round table you know what i mean that awesome um so so yeah i do think it like very rewarding and obviously people should join the the discord and get started there It looks like you do have a bit of beginner friendly stuff Are there other resources? Like I saw that you guys did a collab with Gandalf. Gandalf, I guess was like the other big one from the last year or so that broke through to my attention where I'm like, okay, these guys are actually like giving you some education around what prompt jailbreaking looks like. Yeah. Those, those guys are awesome. Oh, really? Kara. oh yeah it's lacara sorry yeah yeah that's that's where i i think many other compters sort of brained that was the training ground for prompt injection right 100 like for in the early days for many of us yeah really thankful that game is awesome definitely try it if you haven't and they've expanded to uh a sort of a fuller uh playing around with agents and some really cool stuff so yeah that was cool that we got to launch that through the the bassy live stream with them and uh i think they they sent all the people that volunteered to be on that stream like cool merch and uh yeah those guys are great yeah shout out to lakara and gandalf for sure for sure the other big podcast that we've done in this space is with sanders shuholf of hacker prompt are you guys affiliated enemies crips and blood what's they're cool i mean we we actually did a Pliny track for Hagaprompt. Okay, I didn't know that. Yeah, yeah. So there was, the only contingency, of course, was open source the dataset, which we did. And it was a lot. I can't remember the number. I think it was tens of thousands of prompts. And we had a whole bunch of different games. Some really sort of out-of-distro stuff, as you would expect. And a good history lesson, I think, too. Back to the proper OG lore of the real Pliny, right? The OG Pliny the Elder. Yeah, I have nothing but good things to say about Sanders Schollhoff and what they're doing over there. I think that our incentives don't always align with the status quo from Silicon Valley investors, right? Like, you know, radical open source, like moving the needle in the right direction, like having an unorthodox approach to advancing the agenda, right? Versus when people have, sometimes we'll call them like misaligned incentives where there's like, they're beholden to a return on investment, right? And so that's that really does kind of steer the industry in a certain direction. And I'll give you a great example on a more technical level. It would be like setting all the models to a lower temperature to try to make them more deterministic. Because some of the work that we do, we're kind of adding a lot more flavor and creativity and innovation to the models while we're interacting, right? Yeah. Okay. Yeah. So you want the temperature high? Not always. It depends on the application. Well, I don't know if Alessio wants to respond to the VC thing, because he's actually backed open source and security tooling. I think yeah I mean it's like a good question I think there's like a lot of once you're in the VC cycle you kind of need to do things that then get you to the next round and I think a lot of times those are opposed to doing things that actually matter and move the needle in the security community so yeah I think it's not for everybody to invest in cyber so that's why there's only a small amount of firms that that do it but yeah and I think you guys have are in a great space that have the freedom to kind of do all these engagements and hold the open source ideal so i think it's amazing that there's folks like like you and you know there's like you know people like hd more in our portfolio that build things like metasploies that are like the core of like most work that is done in security and then you can build a separate company but i feel like i'm curious what you guys think but to me it feels like in ai the the surface to attack which is the model is like still changing so quickly that like you know trying to formalize something into a product or like try and do something that is like a full you know i'm selling ai security it's not really you cannot really take a person seriously that is telling you i'm building a product for ai security or like to secure a model so i'm curious how you guys think about that and then maybe also for you to you know request for customer engagements you know like who are like the people that you work to what are like the security problems that they work with uh what are people missing um yeah kind of like open floor for you guys yeah we're in a paradigm shift things are moving so fast and i think just some of the old structures are not always compatible with the right foundations for this type of work right we're talking about agi agi alignment asi alignment super alignment i mean these are not sas endeavors they're not enterprise b2b bullshit. This is the real deal. And so if you start to compromise on your incentive architecture, I think that's super, super dangerous when everything is going to be so accelerated and the timelines are going to be so compressed that any tiny 1.1 tenth of a degree misalignment on your trajectory is fatal, right? And so that's why I've tried to be very strong and uncompromising on that front. You can probably imagine a lot of temptation has been dangled in front of me in the last couple of years, but I think that bootstrapping and grassroots and, you know, if people want to donate or give grants, happy to accept it and follow straight to the mission. That's sort of my goal in all of this is just to be a steward. i'm not trying to get wealthy from this that was never the goal i was just i just saw a need and started shouting about it all i've really done since then i hope is uh contribute to the discourse and the research and the speed of exploration i think that's what matters yeah and to answer your question about securing the model i don't see it like that and in in bt6 you know we don't see it as just the model we look at like the full stack right so whatever you attach to a model that's the new attack surface it broadens right like uh i think it was leon from nvidia who was quoted as saying something like the more good results you can get back from whatever it is that you've built utilizing ai like that's proportional to its its new attack surface or something along those lines right and you might be testing let's say a chatbot or maybe a reasoning model and maybe instead of just hitting a jailbreak maybe you're trying to use counterfactual reasoning to attack the browning truth layer right to get around what bias wound up in the model from the data wranglers right or the rlhf or or whatever it may be like the fine tuning which that can all be done through natural language on the model itself but what about when you give it access to your email what about when you give it access to your browser what happens when you give it access to x y and tools or functions right so in ai red teaming it's not just like hey can you tell us so you know wop lyrics or how to make meth or whatever it's like we're trying to keep the model safe from the from bad actors but we're also trying to keep the public safe from rogue models essentially right so it's the full spectrum that we're doing it's never just the model you know the model is just one way to interact with a computer or a data set right or an architecture like especially like if talking about like computer vision systems or or multimodal and so on and so forth like not every you guys probably know saying you know not every model is is is generative per se right so and maybe another distinction for the audience is the difference between sort of safety and security work right security is more squarely i think that's maybe the distinction is best thought of as safety is done on the meat space level or it should be but the way people use the word has kind of become dirty is they tried to solve this on the latent space level i think i've shown every single time that that doesn't work right and so what we need to do is i think reorient safety work around meat space that just goes hand in hand with a fundamental understanding of the nature of the models, which, you know, boosts on the ground. It's obvious to some of us who are spending hours and hours a day actually interacting with these entities. But for those who don't, it's maybe not always obvious. But as far as the contract work that we get involved with, it's never about lobotomization or, you know, personality of the models. We totally try to avoid that type of work. What we try to focus on is, you know, preventing your grandma's credit card information from being hacked through, you know, an agent has knowledge of it and leaks it through some hole in the stack. So what we do is we try to find holes in the stack and rather than recommending that those fixes happen through the model training layer, we always recommend first to focus on, you know, the system layer. Awesome, guys. I know we're running out of time so any final thoughts call to action you got the whole audience so go at it yeah if you want people to listen to you play now's the now's the time no pressure no pressure at all right well you know fortune favors the bold libertas vino veritas god mode enabled are you messing the latent space of the transcriber model like why would you say such things why would you say such sing about us libertas claritas love plenty all right guys um yeah thank you so much for joining us this was a lot of fun yeah i would say if you want to check us out go to bt6.gg for example look up you know apply me on twitter right check out the bossy discord server that's probably the best that we got for you guys amazing thank you so much and keep doing the good work and see you out there Thank you.

Share on X Share on LinkedIn

Related Episodes

AI to AE's: Grit, Glean, and Kleiner Perkins' next Enterprise AI hit — Joubin Mirzadegan, Roadrunner

Latent Space

AI in 2025: From Agents to Factories - Ep. 282

The AI Podcast (NVIDIA)

29m

World Models & General Intuition: Khosla's largest bet since LLMs & OpenAI

Latent Space

What AI Means for Students & Teachers: My Keynote from the Michigan Virtual AI Summit

The Cognitive Revolution

1h 4m

⚡️ 10x AI Engineers with $1m Salaries — Alex Lieberman & Arman Hezarkhani, Tenex

Latent Space

Anthropic, Glean & OpenRouter: How AI Moats Are Built with Deedy Das of Menlo Ventures

Latent Space

Comments

No comments yet

Be the first to comment

⚡️Jailbreaking AGI: Pliny the Liberator & John V on Red Teaming, BT6, and the Future of AI Security

What You'll Learn

Episode Chapters

Introduction

The Concept of Jailbreaking

The Cat and Mouse Game

Demonstrating Jailbreaking Techniques

AI Summary

Key Points

Topics Discussed

Frequently Asked Questions

Episode Description

Related Episodes

AI to AE's: Grit, Glean, and Kleiner Perkins' next Enterprise AI hit — Joubin Mirzadegan, Roadrunner

AI in 2025: From Agents to Factories - Ep. 282

World Models & General Intuition: Khosla's largest bet since LLMs & OpenAI

What AI Means for Students & Teachers: My Keynote from the Michigan Virtual AI Summit

⚡️ 10x AI Engineers with $1m Salaries — Alex Lieberman & Arman Hezarkhani, Tenex

Anthropic, Glean & OpenRouter: How AI Moats Are Built with Deedy Das of Menlo Ventures

AI Curator

Ask me anything about AI