
Exploring GPT 5.2: The Future of AI and Knowledge Work
AI Applied

What You'll Learn
- ✓ GPT 5.2 has improved its ability to perform knowledge work tasks, scoring 70.9% on the GPT Val evaluation compared to 38-39% for previous models.
- ✓ The host prefers the cadence of incremental model updates rather than major releases every 6-12 months, as it provides more consistent improvements.
- ✓ OpenAI's benchmarking results highlight GPT 5.2's strong performance on software engineering tasks, likely in response to the popularity of the Claude model among developers.
- ✓ The guest's company, AIbox.ai, uses the Claude model extensively due to its integration with code bases and advanced system prompts.
Episode Chapters
Introduction
The hosts discuss the recent release of GPT 5.2 and the implications of its improved performance on knowledge work tasks.
Incremental Updates vs. Major Releases
The hosts debate the merits of frequent, incremental model updates versus major releases every 6-12 months.
Benchmarking GPT 5.2
The hosts analyze the benchmarking results shared by OpenAI, which highlight GPT 5.2's strong performance on software engineering tasks.
The Guest's Perspective
The guest discusses their company's use of the Claude model and its advanced integration with code bases.
AI Summary
The podcast discusses the recent release of GPT 5.2, the latest iteration of OpenAI's large language model. The host and guest explore the implications of GPT 5.2's improved performance on knowledge work tasks, as measured by a new evaluation called GPT Val. They also debate the merits of incremental model updates versus major releases, and analyze the benchmarking results shared by OpenAI, which highlight GPT 5.2's strong performance on software engineering tasks compared to other models like Claude.
Key Points
1. GPT 5.2 has improved its ability to perform knowledge work tasks, scoring 70.9% on the GPT Val evaluation compared to 38-39% for previous models.
2. The host prefers the cadence of incremental model updates rather than major releases every 6-12 months, as it provides more consistent improvements.
3. OpenAI's benchmarking results highlight GPT 5.2's strong performance on software engineering tasks, likely in response to the popularity of the Claude model among developers.
4. The guest's company, AIbox.ai, uses the Claude model extensively due to its integration with code bases and advanced system prompts.
Topics Discussed
Large language models, Model evaluation, Knowledge work, Software engineering, Model release cycles
Frequently Asked Questions
What is "Exploring GPT 5.2: The Future of AI and Knowledge Work" about?
The podcast discusses the recent release of GPT 5.2, the latest iteration of OpenAI's large language model. The host and guest explore the implications of GPT 5.2's improved performance on knowledge work tasks, as measured by a new evaluation called GPT Val. They also debate the merits of incremental model updates versus major releases, and analyze the benchmarking results shared by OpenAI, which highlight GPT 5.2's strong performance on software engineering tasks compared to other models like Claude.
What topics are discussed in this episode?
This episode covers the following topics: Large language models, Model evaluation, Knowledge work, Software engineering, Model release cycles.
What is key insight #1 from this episode?
GPT 5.2 has improved its ability to perform knowledge work tasks, scoring 70.9% on the GPT Val evaluation compared to 38-39% for previous models.
What is key insight #2 from this episode?
The host prefers the cadence of incremental model updates rather than major releases every 6-12 months, as it provides more consistent improvements.
What is key insight #3 from this episode?
OpenAI's benchmarking results highlight GPT 5.2's strong performance on software engineering tasks, likely in response to the popularity of the Claude model among developers.
What is key insight #4 from this episode?
The guest's company, AIbox.ai, uses the Claude model extensively due to its integration with code bases and advanced system prompts.
Who should listen to this episode?
This episode is recommended for anyone interested in Large language models, Model evaluation, Knowledge work, and those who want to stay updated on the latest developments in AI and technology.
Episode Description
Join Conor Grennan and Jaeden as they dive into the latest release of GPT 5.2. Discover how this new model is revolutionizing knowledge work, outperforming industry professionals, and what it means for the future of AI. From personal anecdotes to industry benchmarks, this episode covers it all. Tune in to learn more about the incremental updates and their impact on technology and productivity. Get the top 40+ AI Models for $20 at AI Box: https://aibox.ai Conor’s AI Course: https://www.ai-mindset.ai/courses Conor’s AI Newsletter: https://www.ai-mindset.ai/ Jaeden’s AI Hustle Community: https://www.skool.com/aihustle See Privacy Policy at https://art19.com/privacy and California Privacy Notice at https://art19.com/privacy#do-not-sell-my-info.
Full Transcript
Jaden, GPT 5.2 is out. Remember when this used to be cause for, like, parades, you know, when a new model would come out? And this just sort of came out pretty quick, kind of low-key. Very curious to see what you're seeing. I'm sort of seeing reviews a little bit all over the place, but there's one thing that I sort of want to talk about on this podcast for sure, which is this. I think it's actually an eval. So evals, when you hear these terms, you guys might already know this, but evals are just the ways of testing sort of like how smart the model is. It's like, oh, it's smarter than this many PhDs. And there's just a whole lot of them, and they all have kind of like acronyms and all that kind of stuff. You don't really need to know that. But there is one, Jaden, that I was sort of taken by. And even though I've only been testing it for about a day now, it's this GPT Val, like GPT Val. I don't know how you sort of say it, but I think it's the one GPT created. And it's this whole idea of, like, how does this take over actual knowledge work? And this is the one that everybody seems to be freaking out about, because essentially this is the one that jumped up to like 70.9% or something like that. What that means is it tied top industry professionals 70.9% of the time in terms of its ability to do real knowledge work. And that's up from, I think, 38 or 39% from the last model. It might have been GPT-5. I don't think it was 5.1. Point being, Jaden, that's an interesting evaluation to me. What can it do in terms of actual knowledge work? Not how good it is in science or math or anything like that, but how good does it do work? That was fascinating to me. But I know you were covering this as well. I mean, Jaden, do you get excited in these moments? Like, you're a total geek. You have AIbox.ai, which actually tests these models against each other. For 19 bucks a month, you can go to Jaden's thing.
You can test Claude versus Gemini versus ChatGPT versus Grok versus everything else. So Jaden does this all the time. So, Jaden, I'm talking a little bit too much here, but I wanted to kind of get a sense of whether you get excited about this, and if you do, what? Okay, I'm not going to lie. I'm going to share my screen and do a couple things. Number one, I'll show what the responses on X have been, people's comments, because I feel like that's a really good baseline. Also, what Reddit is saying about it. Also, what my own personal experience testing it has said. But I'm going to start with an anecdote, where essentially yesterday, when it came out, I was in the middle of a conversation for, actually, to be honest, a very important document. I was analyzing, like, a legal document for a contract I need to sign. And of course, I don't use lawyers anymore. I just, I feel like I'm in the wild. This is not advice. This is not legal or financial advice. I don't go to the doctor anymore. Yeah, don't do anything I do. But it's interesting, because the new model came out, and at the bottom of my chat it was like, hey, a new model came out, start a new chat to use it. And I'm like, I'm in the middle of my chat. I'm not going to switch to a brand-new model and have to start over from scratch on this conversation. So I finished my conversation. I'm like, I'm sure GPT 5.1, that was available, you know, 30 minutes before, is just fine. And that moment for me kind of got into my head and made me realize, like, I think in the past, when GPT-5 came out, I would have dropped everything to try GPT-5. This was going to be this huge, amazing flagship thing. And now it feels like we're just doing 5.1, 5.2, these incremental updates. They're nice, and there are probably some differences, but, like, I almost don't really notice anything, any major differences here.
I wonder if, like, and so it gets me thinking, and I guess this is my question I'll pitch to you before I go through all of the comments over on X, but would you rather have a tech cycle where every month, essentially, you get a small incremental update to the model? Or do you like the model of, like, Google and Apple, where every six months, or once a year, there's this big, huge tech conference where the new iPhone is unveiled, and oh my gosh, it's like, you know, ten times better? But, like, you know, if they'd been making tiny little iPhone incremental updates throughout the year, it would just be boring by the end of the year to see where it's at. What do you think is better for technology? Well, I heard a rumor, totally unsubstantiated, that 5.2 is actually what they wanted 5.0 to be. I don't know if you saw this too. With this idea that, you know, so why did they launch 5.0 when it was, obviously, these things are all red-teaming questions, right? But, like, it actually isn't so much based, I don't think, on, we've got to get this model out. There's a little bit of this gamesmanship. Google I/O is about to happen, so OpenAI launches, like, Sora the day before, right? There's a certain amount of this. And then there was a certain amount of, like, oh, 5.2 is in response to the, like, red alert, sorry, code red that Sam Altman declared. But obviously it's not, right? I mean, like, these models are in the works. Like, when you talk to people who have been using these models since back in November, I mean, they're just red-teaming them, right? They just release them when it's safe. I much prefer this. I get, on the consumer side, like, hey, the new iPhone is coming out September 17th, so just wait, or something like that. But I love that these things just come out, because I bop between ChatGPT, Claude, and Gemini, essentially.
I use Copilot for certain things, but for the most part, those are my sort of, like, that's my team. And all of them are sort of trained on my memory. All of them are, like, really specifically trained on what I do and everything like that. And I love that they just kind of keep getting smarter and smarter. Like, I don't want to wait for a release, because I'm not, like, going to make a new purchase. Like, I'm already a hyper-subscriber to all these things. So I actually really, really appreciate how they do it. And even if, you know, GPT 5.2, some people love it, some people are like, oh, whatever. What I do know is that it's smarter. And that's it. Like, even if it's behind the scenes, it'll kind of poke its head out a little bit when it's smarter. So that to me, I love the cadence of how it's released right now, to be honest. Okay. I think for the consumer, I'm not going to lie, I think it is better to just have these small updates, because basically as soon as a small update is ready, you get it. Why would you wait six months? I kind of hate that cycle. But from a marketing perspective and from, like, oh my gosh, look at the new things it can do, like, it's just so much. Like, if you think of the top, you know, 10 AI companies, like, you could count, like, ElevenLabs for audio and, like, you know, some of the really cool, you know, Flux models for image, or Midjourney. Let's say there's 20 of the top companies. And if every single one is having a small incremental update every single month, I mean, that's every three days there's, like, a new major update. And so it's just, like, so much for people to keep a hold of. So I think what's going to happen is people are going to get all the benefit. And I guess that's the good part. So you get all of the efficiency from it, but not as much of the hype, which is, like, every, you know, if you did every six months, like, oh my gosh, all these crazy things, you'd be really dialed in.
And you'd learn all of the crazy things that it could now do. So I think a lot of people are going to be using these models, but they'll be less aware of what it's capable of doing, which, from an educational perspective, I mean, is kind of tricky. But, like, it's obviously way too much for anyone to keep track of everything. I mean, even us on the podcast, you know, three times a week, it's, like, crazy. So, yeah, I agree with you. Okay. Over on Twitter, or X, OpenAI was sharing. Well, okay, I've got to tell you the first funny comment. If you just search for GPT 5.2 on X, the first post that is, like, ranking, which is a total troll post, it's just, like, GPT 5.2 is AGI. And then literally the screenshot is him asking how many R's are in garlic, and it says there's zero R's in garlic. And so it's, like, obviously getting it wrong. I will say, though, this is probably a fake screenshot, because I've seen so many of these, and I get engagement-baited by them all the time, where they're like, you know, it did, like, the dumbest thing in the world. But, like, basically these people will repost, like, this post is probably reposted every time a new model has an update. They just, like, change the name of the model, and he's like, Gemini 3 is AGI. And then it's, like, doing something dumb. So anyways, this is probably engagement bait is all I'm saying, which is annoying. OpenAI's evaluations, there's something interesting. OpenAI's official X account posted the, like, GPT 5.2 thinking evals. What I think is interesting is the number one evaluation that they're putting at the top, especially when you look at it in light of, like, the code red or whatever that OpenAI has, is the SWE-bench Pro. That's the software engineering one. I think this is interesting. The very top benchmark that they're, like, showing off to everybody is 55.6% on SWE, compared to GPT 5.1, which was at, like, 50.8%. So, I mean, they're almost, like, 5% higher.
And of course, they're showing that off against Claude Opus 4.5, which is at 52%. So Claude was beating GPT 5.1, and now 5.2 kind of squeaks out a win and gets a little bit over it. And it's quite a bit ahead of Gemini 3 Pro. I'm not sure how many developers are using Gemini 3 Pro. No shade, but it feels like, on the benchmarks anyways, it's quite a bit behind on that one. But, you know, the GPQA Diamond, it's crushing, almost 92%. The Gemini 3 is, like, literally almost as good as this GPT 5.2. So, like, anyways, they're all over the place. But the thing that was interesting about this to me was that the number one benchmark they were showing off in their tweet to the world was a software engineering one. And I think that just goes to show how much software developers love the Claude models, and how OpenAI is trying as hard as they can to not get completely killed in that industry and removed from it. So to me, that was kind of interesting. Well, I mean, so, but Jaden, you have a software company. What do your guys use in terms of coding and stuff like that? Yeah, we use Claude. Because it's not just the model, but they have Claude Code, which ties into your whole code base, and it is essentially helping you write the code, and it automatically does a ton of stuff for you. It's amazing. So they actually built software that makes the model more useful. They have a bunch of sneaky system prompts that basically make it so it can think through things a lot better than if you were to just go ask ChatGPT or someone else. So Claude's done, Anthropic's done, a lot of really good work on that. What I will say is, like, here, when you're looking at it, 50.8% was GPT 5.1, 52% Claude Opus 4.5.
Oh yeah, yay, Claude Opus is the winner. Then GPT 5.2 comes out, and it's like 55.6. Like, I just feel like we're making these little 3% differences. If I have my whole code base and my whole system, and all my developers are trained on one tool, for a 3% increase, like, whatever, I'm going to stick here. If you can, you know, if you can knock it out of the park by, like, 10, 15, 20, then, like, okay, let's go look at that. But I think it's going to be hard. Like, no one wants to be behind, but I think a lot of people are kind of set on what they're going to use, and if there's something that's 3% better, no one really cares. No, so I think that's so well said, and I can wrap it on this. I guess the idea here is that when you see these new models coming out, I think Gemini 3 actually felt pretty powerful. I think Claude Opus 4.5 actually felt pretty powerful. And I love ChatGPT. I use ChatGPT all the time. But I think what people need to know out there, as we're sort of, like, coming back to our name of AI Applied, is that these, as Jaden just said, are, like, a little kind of bump. And we get asked all the time, like, well, which is the best? I wouldn't think of it like that. They're all phenomenal. It's just, like, what works best for you. And I think the reason that coding is so huge is because it's a one-to-one thing of, like, people used to code this way, and now they code this way. But most people aren't coders, right? Most people are doing a lot of, like, writing and brainstorming and all that kind of stuff. And just each one is different. People love the voice of different ones. It's just different. Everybody I know uses a different tool and absolutely swears by it. So it's cool. I mean, 5.2 came out. I think it's great. I think it's interesting. But when they talk about, like, oh, it can now take on this many tasks, people aren't doing it anyway, right? It's not like that's actually happening out in the real world. So it's not really that useful, in a way.
Do you know what I mean? Except that maybe it's sort of a more useful companion, and it'll keep getting better. Nothing earth-shattering here, I don't think. I think it's sort of, like, fun to track these models, but otherwise, like, I thought this was cool. I'm going to track it. I'm going to be testing it out a lot. Like, I think I'm, sorry, I'm feeling that it's a little bit better, but maybe that's just the few things I've done today. So I don't know, nothing earth-shattering here, but I love that they just, like, dropped it right before Christmas, and Sam Altman has promised more stuff. I am excited for the Christmas presents that will be coming in from Sam Altman, allegedly. All right, thanks, everyone, so much for tuning into the show today. If you want to make sure that your organization is up to date with the latest in AI and is looking at the right frameworks on how to approach AI, so you're actually getting the most out of it, make sure to go check out Conor's AI Mindset course. That's linked in the description. I have seen companies completely change the way they work, and the productivity has gone through the roof after taking this course. They will buy thousands of seats for everyone in the organization once they pilot it with a couple of people. So that's always exciting to see. Anyways, check it out. There's a link in the description. It will absolutely revolutionize, and I know that's like a buzzword, but it actually will, how you're using AI. Thanks so much for tuning in. We'll catch you in the next show.