Patches of Spacetime
Sora's fundamental innovation is treating video not as a sequence of frames but as a collection of spacetime patches: small cubes of pixels that span both space and time. The model processes these patches with a transformer architecture, attending to patches that are spatially and temporally adjacent. This unified representation lets Sora reason about how objects move through space and time together, rather than predicting each frame independently.
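To make the idea of a spacetime patch concrete, here is a minimal sketch in Python with NumPy that cuts a small video array into cubes spanning both space and time and flattens each cube into one token. The function name `patchify`, the patch sizes, and the choice to work on raw pixels are illustrative assumptions; Sora's actual patch dimensions and preprocessing are not described here.

```python
import numpy as np

def patchify(video: np.ndarray, pt: int, ph: int, pw: int) -> np.ndarray:
    """Split a (T, H, W, C) video into flattened spacetime patches.

    Hypothetical helper for illustration only.
    Returns an array of shape (num_patches, pt * ph * pw * C).
    """
    T, H, W, C = video.shape
    if T % pt or H % ph or W % pw:
        raise ValueError("video dimensions must be divisible by the patch size")
    # Break each axis into (number of patches, patch extent).
    x = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    # Move the three patch-index axes to the front: (nt, nh, nw, pt, ph, pw, C).
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    # One row per patch; each row is a small cube of pixels in space and time.
    return x.reshape(-1, pt * ph * pw * C)

# Toy input: 8 frames of 16x16 RGB video (sizes chosen only for the example).
video = np.arange(8 * 16 * 16 * 3, dtype=np.float32).reshape(8, 16, 16, 3)
tokens = patchify(video, pt=2, ph=4, pw=4)
print(tokens.shape)  # (64, 96)
```

Each of the 64 rows covers 2 frames and a 4x4 pixel region, so a single token already carries a short slice of motion. A transformer would then take these rows as its input sequence.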
The Scaling Insight
OpenAI's research showed that video generation quality scales predictably with model size and training compute, similar to the scaling laws discovered for language models. Sora is believed to be a significantly larger model than competing video systems, which partly explains its quality advantage on complex scenes that depend on realistic physics.
Variable Duration and Resolution
Unlike earlier video models that generated fixed-length clips at a fixed resolution, Sora handles variable video lengths and resolutions by simply using more or fewer patches. A 4-second 720p clip uses fewer patches than a 20-second 1080p clip. This flexibility comes from the patch-based representation: the transformer's architecture does not depend on how many patches it processes.
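The arithmetic behind that comparison can be sketched in a few lines of Python. The frame rate (24 fps), the patch size (4 frames by 8x8 pixels), and the helper name `num_patches` are assumptions made for illustration, not Sora's published settings; a real system may also compress the video before patching, which would shrink these counts considerably.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer division rounded up, so partial patches count as padded patches."""
    return -(-a // b)

def num_patches(seconds: float, fps: int, height: int, width: int,
                pt: int = 4, ph: int = 8, pw: int = 8) -> int:
    """Count spacetime patches for a clip (hypothetical helper, assumed sizes)."""
    frames = round(seconds * fps)
    return ceil_div(frames, pt) * ceil_div(height, ph) * ceil_div(width, pw)

short_clip = num_patches(seconds=4, fps=24, height=720, width=1280)    # 4 s, 720p
long_clip = num_patches(seconds=20, fps=24, height=1080, width=1920)   # 20 s, 1080p

print(short_clip)              # 345600
print(long_clip)               # 3888000
print(long_clip / short_clip)  # 11.25
```

Under these assumptions the longer, sharper clip needs 11.25 times as many patches: 5 times from the duration and 2.25 times from the resolution. The model itself is unchanged; only the length of the sequence it processes differs.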
Limitations and Future
Sora still struggles with complex physical interactions, very long temporal dependencies, and consistent character identity across scenes. These are unsolved research problems, not implementation failures. The next generation will likely address them through better training data, larger models, and physics-informed training objectives.