Latest AI developments explained (OpenAI SORA, World models, Q*), Part II
How does SORA work, and how is it connected to world models? What about the Q* algorithm? How can AI systems be made affordable for end users? The path towards personal LLM assistants, autonomous driving, and more.
The first part described the AI technology that enabled OpenAI SORA, a simulator of the physical world. In this second part, we focus mainly on technologies that have the potential to further advance the capabilities of existing AI systems.
Whereas SORA is a transformer trained on massive amounts of data that operates on spacetime patches of video and image latent codes, EMO uses very similar technology: UNet-based latent diffusion models, which are also described in Part I.
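To make the idea of spacetime patches concrete, here is a minimal sketch of how a compressed video latent might be cut into patch tokens for a transformer. The shapes and patch sizes are illustrative assumptions, not SORA's actual configuration:

```python
import numpy as np

def spacetime_patches(latent, pt=2, ph=4, pw=4):
    """Split a (T, H, W, C) video latent into flattened spacetime patch tokens."""
    T, H, W, C = latent.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    x = latent.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)       # group the patch dims together
    return x.reshape(-1, pt * ph * pw * C)     # one row = one spacetime patch

latent = np.random.randn(8, 32, 32, 4)         # e.g. a compressed 8-frame clip
tokens = spacetime_patches(latent)
print(tokens.shape)                            # (256, 128): 256 tokens of size 128
```

Each row is one token, so the transformer can attend across both space and time within a single sequence.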
Both SORA and EMO are foundation AI models that are not publicly available yet, and it will take some time to build a scalable service out of these research prototypes. Other foundation models, however, are already becoming a commodity, especially generative models trained on text such as GPT-4, Gemini, Gemma, Mistral and many others. AI chatbots are getting so widespread that governments are already considering how to govern foundation models, and the possible regulation of open-source models in particular is sparking public debate.
Despite their growing capabilities, chatbots still have many limitations, mainly in their ability to reason and solve complex problems.
Some people speculate that OpenAI already has much more advanced technology than what is available in its public foundation models. This is indeed true, because research prototypes are always ahead of production systems, and they can also use much more computational resources at inference time. Rumors are circulating, however, about an unpublished Q* algorithm that boosts the reasoning capabilities of chatbots to a superhuman level.
Let's have a look at what Q* might actually be about.
Google and Princeton researchers recently published Tree of Thoughts, a method that significantly improves the performance of text foundation models on simple math tasks like the Game of 24. The method uses a simple breadth-first search to navigate the tree of reasoning steps that may lead to the correct answer. The researchers also suggest they would like to utilize the A* algorithm in the future. This might be the secret Q* algorithm that scared the OpenAI board so much last year that they decided to fire Sam Altman.
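As an illustration, here is a hedged sketch of such a pruned breadth-first search over a reasoning tree. `propose_steps` and `score_state` are hypothetical stand-ins for LLM calls that expand a partial solution and rate how promising it looks; the toy usage at the end replaces the LLM with simple arithmetic moves:

```python
def bfs_reasoning(root, propose_steps, score_state, is_solved,
                  beam_width=8, max_depth=10):
    """Pruned BFS: expand the frontier, keep only the best partial solutions."""
    frontier = [root]
    for _ in range(max_depth):
        children = [c for s in frontier for c in propose_steps(s)]
        solved = [c for c in children if is_solved(c)]
        if solved:
            return solved[0]
        # prune to the most promising states to keep the search tractable
        frontier = sorted(children, key=score_state, reverse=True)[:beam_width]
    return None

# Toy usage: reach 24 from 1 using only the moves +1 and *2.
result = bfs_reasoning(
    root=1,
    propose_steps=lambda s: [s + 1, s * 2],
    score_state=lambda s: -abs(24 - s),     # closer to 24 = more promising
    is_solved=lambda s: s == 24,
)
print(result)  # 24
```

Pruning the frontier to the highest-scored states is what keeps the search affordable; an A*-style variant would additionally estimate the cost of the remaining reasoning steps.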
In my opinion, much more is needed to achieve superhuman capabilities in large language models (LLMs):
Memory-augmented models
Improved planning and reasoning - a blend with reinforcement learning
Better benchmarks
Open-ended learning
Current foundation LLMs have static weights at inference time, so they cannot use them as internal memory when talking with users. An important research direction is therefore to augment LLMs with external memory. Such memory enables LLMs to work as reasoning engines.
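As a rough illustration, external memory can be as simple as a store of embedded text snippets queried by similarity. The `toy_embed` function below is a stand-in for a real sentence-embedding model:

```python
import numpy as np

def toy_embed(text, dim=64):
    """Toy hashed bag-of-words embedding (stand-in for a real model)."""
    v = np.zeros(dim)
    for word in text.lower().split():
        v[hash(word) % dim] += 1.0
    return v

class ExternalMemory:
    """Minimal sketch: store (embedding, text) pairs, recall by cosine similarity."""
    def __init__(self, embed):
        self.embed = embed
        self.keys, self.texts = [], []

    def store(self, text):
        v = self.embed(text)
        self.keys.append(v / np.linalg.norm(v))   # normalize for cosine similarity
        self.texts.append(text)

    def recall(self, query, k=3):
        q = self.embed(query)
        sims = np.array(self.keys) @ (q / np.linalg.norm(q))
        best = np.argsort(sims)[::-1][:k]
        return [self.texts[i] for i in best]

mem = ExternalMemory(toy_embed)
mem.store("The user's dog is called Rex.")
mem.store("The user prefers answers in French.")
print(mem.recall("what is the name of the dog", k=1))
```

The recalled snippets are typically prepended to the prompt, so the model can use facts from past conversations without retraining its weights.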
In reinforcement learning (RL), AI agents are designed to reason about the environment and plan future actions in order to maximize future rewards.
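The textbook instance of this is tabular Q-learning, which iteratively estimates Q(s, a), the expected discounted future reward of taking action a in state s. Here is a minimal sketch (this is classic Q-learning; whether it has anything to do with the rumored Q* is pure speculation):

```python
from collections import defaultdict

def q_update(Q, s, a, reward, s_next, actions, alpha=0.1, gamma=0.99):
    """Move Q(s, a) toward the bootstrapped return r + gamma * max_a' Q(s', a')."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])

Q = defaultdict(float)                    # unseen state-action pairs default to 0
q_update(Q, s=0, a="right", reward=1.0, s_next=1, actions=["left", "right"])
print(Q[(0, "right")])                    # 0.1: one small step toward the reward
```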
Even older RL architectures such as AlphaZero (2017), Observe and Look Further (2018) or Machine Theory of Mind (2018) have much better support for reasoning and planning than today's LLMs, which were designed to model and predict natural language, not to perform complex reasoning.
Agent57 (2020) was able to solve the Atari benchmark, and First Return, Then Explore (2021) achieved superhuman performance on fifty games.
Now it is time to improve LLMs' planning and reasoning capabilities with clever exploration/exploitation RL methods.
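A classic example of such an exploration/exploitation trade-off is the UCB1 rule from bandit algorithms: prefer actions that either look good or have rarely been tried. A minimal sketch:

```python
import math

def ucb1_action(values, counts, total_visits, c=1.4):
    """Pick the action maximizing estimated value + exploration bonus."""
    for a, n in counts.items():
        if n == 0:
            return a                      # try every action at least once
    return max(counts, key=lambda a:
               values[a] + c * math.sqrt(math.log(total_visits) / counts[a]))

values = {"tool_A": 0.9, "tool_B": 0.4}   # estimated rewards so far
counts = {"tool_A": 50, "tool_B": 3}      # how often each action was tried
print(ucb1_action(values, counts, total_visits=53))  # tool_B: rarely tried, big bonus
```

The same principle drives the tree search in AlphaZero-style systems, where the exploration bonus steers simulations toward under-explored branches.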
For this task, it is important to create good benchmarking environments. Nice examples of such environments and benchmarks are:
Orca 2: Teaching Small Language Models How to Reason
AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models
GoodAI LTM: Long Range Memory and Continual Learning in Conversation
Craftax: Fast open-ended environment for training reinforcement learning agents
Open-ended environments are the most challenging, as there is no single specific goal to achieve. They resemble the real world much better, and in my opinion, solving open-ended learning will bring us much closer to AGI than anything else. Being able to generate and solve tasks that sit at the frontier of AI capabilities is a first step.
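A hedged sketch of what such a frontier curriculum could look like: generate candidate tasks and train only on those the agent solves sometimes but not always. `generate_task`, `attempt`, and `train_on` are hypothetical stand-ins for a task generator and a learning agent:

```python
import random

def frontier_curriculum(generate_task, attempt, train_on,
                        trials=10, low=0.2, high=0.8, rounds=100):
    """Keep only tasks whose empirical success rate lands between low and high."""
    kept = 0
    for _ in range(rounds):
        task = generate_task()
        success_rate = sum(attempt(task) for _ in range(trials)) / trials
        if low < success_rate < high:     # neither trivial nor impossible
            train_on(task)                # the task sits at the capability frontier
            kept += 1
    return kept

# Toy run: tasks are difficulties in [0, 1]; the "agent" succeeds when a
# random draw beats the difficulty, so mid-difficulty tasks get kept.
kept = frontier_curriculum(
    generate_task=lambda: random.random(),
    attempt=lambda difficulty: random.random() > difficulty,
    train_on=lambda task: None,
)
print(kept, "tasks kept at the capability frontier")
```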
Very nice research directions are Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents and DreamerV3: Mastering Diverse Domains through World Models.
The recently introduced Google Genie shows that zero-shot world models of diverse environments are actually possible, and I expect similar technology can be adapted to generate complex open-ended environments for training and benchmarking various problem-solving agents.
To ensure AI technologies assist rather than replace humans, it's vital to enhance human-machine interfaces. This enhancement will empower individuals to control AI, provide efficient feedback, and collaborate with AI in achieving specific objectives. We will delve into these strategies in the next part of the article. We will also discuss how to make AI survive in the real world, because somebody has to pay the electricity bills and fund the team that keeps the AI in shape.