Imagine a phone call with an AI: you interrupt it mid-sentence, carry on with your own thought, or even talk over each other. Until now, that was strictly science fiction.
Last week (May 11, 2026), Mira Murati (former OpenAI CTO) and her startup Thinking Machines Lab released TML-Interaction-Small. It’s the first AI model that “listens while it talks”, built on a new architecture for real-time voice AI called an interaction model. Response latency is under 0.4 seconds, roughly the pace of a natural conversation.
The problem with current models — everything is turn-based
GPT, Claude, and Gemini all work in turns: you speak, you stop, the model responds, finishes, and then it’s your turn again, like texting. Many “voice modes” work the same way under the hood: audio is converted to text, the model replies, and the text is converted back to audio. Each conversion step adds latency.
That’s why talking to Voice Mode still feels robotic. You can’t cut it off without it getting confused. It can’t think and talk at the same time. If you stay silent for 5 seconds, it doesn’t know whether to keep going or wait.
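As a rough sketch of why a cascaded voice pipeline feels slow: the stages run one after another, so their delays add up before the first audio comes back. The stage names and latency values below are illustrative assumptions, not measurements of any real product.

```python
# Illustrative stage latencies in seconds. These are placeholder values
# chosen for the example, not measurements of any real product.
CASCADE_STAGES = {
    "end_of_speech_detection": 0.5,
    "speech_to_text": 0.3,
    "language_model": 0.6,
    "text_to_speech": 0.3,
}


def cascade_latency(stages: dict[str, float]) -> float:
    """Stages run sequentially, so the total delay is the sum of all stages."""
    return sum(stages.values())


total = cascade_latency(CASCADE_STAGES)
print(f"time to first audio: {total:.1f}s")
```

Shaving time off any single stage helps only a little, because every other stage still sits on the critical path.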
TML-Interaction-Small — the first full-duplex model
This model was built from scratch for real-time interaction, not patched on top of a turn-based one. The specs:
- 276 billion parameters as a Mixture-of-Experts, with only 12 billion active per inference
- 200-millisecond windows: instead of waiting for the user to finish their turn, the model processes audio, video, and text every 200ms and responds when appropriate
- Native multimodal: audio, video, and text are part of the architecture from the start — not a stitched pipeline
This architecture matters. Until now, voice models relied on a “VAD harness” (Voice Activity Detection) to determine when the user had stopped speaking. Because the model itself decides in every 200ms window whether to respond, that harness is gone.
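The numbers in the spec list can be made concrete with a little arithmetic. The sketch below computes the share of parameters active per inference and shows how a stream is cut into fixed 200ms windows. The 16 kHz sample rate and the `windows` helper are assumptions for illustration; the source does not say how the model ingests audio.

```python
from typing import Iterator, Sequence

TOTAL_PARAMS_B = 276      # from the spec list above
ACTIVE_PARAMS_B = 12      # from the spec list above
WINDOW_MS = 200           # from the spec list above
SAMPLE_RATE_HZ = 16_000   # assumption: a common sample rate for speech

active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B
samples_per_window = SAMPLE_RATE_HZ * WINDOW_MS // 1000


def windows(samples: Sequence[float], size: int) -> Iterator[Sequence[float]]:
    """Yield consecutive fixed-size windows; a trailing partial window is dropped."""
    for start in range(0, len(samples) - size + 1, size):
        yield samples[start:start + size]


one_second = [0.0] * SAMPLE_RATE_HZ  # one second of silent placeholder audio
chunks = list(windows(one_second, samples_per_window))

print(f"active share of parameters: {active_fraction:.1%}")
print(f"samples per window: {samples_per_window}")
print(f"windows in one second: {len(chunks)}")
```

Under these assumptions, the model gets five chances per second to decide whether to speak, instead of waiting for a silence detector to declare the turn over.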
Dual-model architecture — frontend + background
There’s a clever twist. Two models work in tandem:
Frontend interaction model: always on, in conversation with the user, processing input in 200ms windows. It’s a light, fast model designed purely for real-time response.
Background model: takes over when deeper reasoning, tool calls, or complex agentic logic is needed. The frontend passes the full conversation context to the background model, which reasons and returns a result; the frontend then slots that result into the dialogue at a natural pause.
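The hand-off between the two models can be sketched with `asyncio`. This is a toy illustration under stated assumptions, not Thinking Machines' implementation: the `frontend` and `background_reason` functions, the `<needs-reasoning>` and `<pause>` markers, and the shortened sleep times are all hypothetical stand-ins.

```python
import asyncio
from typing import Optional


async def background_reason(context: list[str]) -> str:
    """Hypothetical stand-in for the slower background model."""
    await asyncio.sleep(0.05)  # simulate slow reasoning or a tool call
    return f"result based on {len(context)} turns"


async def frontend(windows: list[str]) -> list[str]:
    """Handle one input window at a time without blocking on the background task."""
    context: list[str] = []
    spoken: list[str] = []
    pending: Optional[asyncio.Task] = None
    for window in windows:
        context.append(window)
        # Hand a copy of the conversation so far to the background model.
        if window == "<needs-reasoning>" and pending is None:
            pending = asyncio.create_task(background_reason(list(context)))
        # A "<pause>" window marks a natural gap; speak the result only if it is ready.
        if window == "<pause>" and pending is not None and pending.done():
            spoken.append(pending.result())
            pending = None
        await asyncio.sleep(0.02)  # shortened stand-in for the 200ms window cadence
    return spoken


spoken = asyncio.run(
    frontend(["hi", "<needs-reasoning>", "<pause>", "more", "more", "<pause>"])
)
print(spoken)
```

In this run the first pause arrives before the background task has finished, so the frontend keeps going and delivers the result at the second pause. That is the point of the split: the conversation never stalls while the slower model works.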
Benchmark results — large gap from OpenAI and Google
On FD-bench v1.5 (an interaction-quality benchmark; higher is better):
- TML-Interaction-Small: 77.8
- Gemini 3.1 Flash Live: 54.3
- GPT-Realtime-2.0: 46.8
Turn-taking latency (lower is better):
- TML: under 0.4 seconds
- Gemini 3.1 Flash Live: 0.57s
- GPT-Realtime-2.0: 1.18s
On TimeSpeak (a test of whether the model can initiate speech at a specified time): TML hit 64.7% macro-accuracy versus 4.3% for GPT-Realtime-2.0. Basically, GPT-Realtime can barely do it at all.
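To put the published figures side by side, the short sketch below computes the gaps and ratios. All inputs are the vendor-reported numbers quoted above; the variable names are only for illustration. Because TML's latency is given only as "under 0.4 seconds", the slowdown factors are lower bounds.

```python
# Vendor-published figures quoted above (not independently verified).
fd_bench = {
    "TML-Interaction-Small": 77.8,
    "Gemini 3.1 Flash Live": 54.3,
    "GPT-Realtime-2.0": 46.8,
}
latency_s = {"Gemini 3.1 Flash Live": 0.57, "GPT-Realtime-2.0": 1.18}
TML_LATENCY_UPPER_BOUND_S = 0.4  # reported only as "under 0.4 seconds"

tml_score = fd_bench["TML-Interaction-Small"]

# FD-bench point gap between TML and each competitor.
score_gaps = {
    name: tml_score - score
    for name, score in fd_bench.items()
    if name != "TML-Interaction-Small"
}

# Minimum slowdown factor of each competitor relative to TML.
min_slowdown = {
    name: value / TML_LATENCY_UPPER_BOUND_S for name, value in latency_s.items()
}

# TimeSpeak macro-accuracy ratio, TML versus GPT-Realtime-2.0.
timespeak_ratio = 64.7 / 4.3

for name, gap in score_gaps.items():
    print(f"{name}: {gap:.1f} points behind, at least {min_slowdown[name]:.2f}x slower")
print(f"TimeSpeak ratio: {timespeak_ratio:.1f}x")
```

On these figures, TML leads FD-bench by about 23 to 31 points, and its TimeSpeak macro-accuracy is roughly 15 times that of GPT-Realtime-2.0.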
One important caveat: these numbers were published by Thinking Machines itself, and independent testing isn’t widely available yet. They still need to be verified in practice by outside developers.
What does this mean for business and development?
If you think this is just a better Voice Mode, you might be underestimating it. This is a new architecture, not just a speed boost. Three practical takeaways:
First, real AI support calls: AI call centers have so far been slow and robotic, easily confused and unable to interrupt the caller. With full-duplex, AI can respond at human-operator speed, listen, and interrupt when needed. This could change the shape of AI phone support.
Second, companions and language learning: learning a language with AI has been mostly text-based. Now you can actually converse at the speed and rhythm of a real conversation. Apps like Duolingo and Pimsleur will likely add this soon.
Third, agents in the real world: building an AI agent that joins a meeting, talks to multiple people at once, or sits in on a real-time customer call wasn’t really feasible before. Now it is.
The bottom line — AI architecture is shifting
The era of “bigger = better” is winding down. New architectural ideas (interaction models, dual-model designs, 200ms micro-turns) are now solving problems that scale alone couldn’t.
With $2 billion and a small team, Mira Murati’s lab managed to pull ahead of OpenAI and Google in one specific area (interaction models). It shows that innovation in AI can still come from smaller teams — though for now, only in a narrow domain.