Latent Space: The AI Engineer Podcast

Moonlake: Causal World Models should be Multimodal, Interactive, and Efficient — with Chris Manning and Fan-yun Sun

Apr 2, 2026
Fan-yun Sun, Moonlake AI co-founder focused on interactive world models, joins Chris Manning, Stanford NLP pioneer, for a lively tour of AI worlds that are multimodal, interactive, and efficient. They dig into action-conditioned environments, symbolic structure over pure pixels, game engines as reasoning tools, persistent simulated worlds, programmable visuals, spatial audio, and why benchmarking should focus on real utility.
Episode notes

Reasoning Traces Make Worlds Trainable

  • Fan-Yun Sun shows how Moonlake's reasoning traces model gameplay state, not just visuals, spanning geometry, physics, affordances, audio, and scoring.
  • In the bowling example, the system tracks ball pickup, pin collisions, resets, and score updates so users can actually practice the game.

Code Engines Serve As Cognitive Tools

  • Moonlake treats physics engines, code, and other simulators as cognitive tools a model can call instead of end-to-end learned priors.
  • Fan-Yun Sun says this enables multiplayer and task-specific simulators, rather than forcing one fixed visual pipeline onto every world.

Reverie Separates Gameplay State From Visual Style

  • Moonlake splits world modeling into a causal reasoning model for persistent state and Reverie for photorealistic restyling.
  • Fan-Yun Sun says Reverie preserves interactivity while changing pixel distribution, making rendering programmable and able to react to game state.