Thinking Machines — The AI That Listens While It Speaks

Imagine having a phone call with AI — interrupting it mid-sentence, continuing your own thought, even both of you talking at the same time. Until now, that was strictly science fiction.

Last week (May 11, 2026), Mira Murati (former OpenAI CTO) and her startup Thinking Machines Lab released TML-Interaction-Small. It’s the first AI model that “listens while it talks” — a new architecture called an interaction model for real-time voice AI. Response latency? Under 0.4 seconds. Roughly the speed of a natural conversation.

The problem with current models — everything is turn-based

GPT, Claude, Gemini: all work in turns. You speak, you stop, the model responds, finishes, and then it’s your turn again. Like texting. Their voice modes are still turn-based, and many voice products run a cascade under the hood: audio is converted to text, the model replies, and the text is converted back to audio. Each stage adds latency.

That’s why talking to Voice Mode still feels robotic. You can’t cut it off without it getting confused. It can’t think and talk at the same time. If you stay silent for 5 seconds, it doesn’t know whether to keep going or wait.

Analogy

Current models are like walkie-talkies — one person speaks, says “over,” then the other. Interaction models are like a real phone call — both sides can talk simultaneously, stay silent, or interrupt each other.

TML-Interaction-Small — the first full-duplex model

This model was built from scratch for real-time interaction, not patched on top of a turn-based one. The specs:

276 billion parameters as a Mixture-of-Experts, with only 12 billion active per token
200-millisecond windows: instead of waiting for the user to finish their turn, the model processes audio, video, and text every 200ms and responds when appropriate
Native multimodal: audio, video, and text are part of the architecture from the start — not a stitched pipeline
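
To put those figures in proportion, here is a small arithmetic sketch that uses only the numbers quoted in the list above. The variable names are illustrative, and counting how many 200ms windows fit into the quoted 0.4-second latency is an assumption made here for intuition, not something Thinking Machines has stated.

```python
# Figures quoted in the spec list above.
TOTAL_PARAMS_B = 276      # total parameters, in billions
ACTIVE_PARAMS_B = 12      # parameters active at a time, in billions
WINDOW_MS = 200           # length of one processing window

# Share of the network that does work at any moment in a sparse MoE.
active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B

# How many decision points the model gets per second of conversation.
windows_per_second = 1000 // WINDOW_MS

# Upper bound on whole windows that fit inside the quoted "under 0.4 seconds".
windows_in_latency_budget = 400 // WINDOW_MS

print(f"active share: {active_fraction:.1%}")
print(f"windows per second: {windows_per_second}")
print(f"windows within 0.4 s: {windows_in_latency_budget}")
```

In other words, only about one twenty-third of the parameters are in use at a time, which is part of how a model this large can plausibly keep up with five decision points per second.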

This architecture matters. Until now, voice models relied on a “VAD harness” (Voice Activity Detection) to determine when the user had stopped speaking. That harness is gone.
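
The difference between a VAD harness and fixed windows can be sketched in a few lines. This is a toy illustration under stated assumptions, not the lab's implementation: `vad_endpoints`, `micro_turn_decisions`, `toy_policy`, and the three-window silence threshold are all hypothetical names and values chosen for this example.

```python
from typing import Callable, List

WINDOW_MS = 200  # window length quoted in the article


def vad_endpoints(speech: List[bool], silence_windows: int = 3) -> List[int]:
    """Turn-based style: return window indices where a turn is declared over.

    A turn ends only after `silence_windows` consecutive silent windows
    following some speech. The threshold is a hypothetical value.
    """
    endpoints = []
    silent_run = 0
    heard_speech = False
    for i, is_speech in enumerate(speech):
        if is_speech:
            heard_speech = True
            silent_run = 0
        else:
            silent_run += 1
            if heard_speech and silent_run == silence_windows:
                endpoints.append(i)
                heard_speech = False
    return endpoints


def micro_turn_decisions(
    speech: List[bool], policy: Callable[[List[bool]], str]
) -> List[str]:
    """Interaction style: consult the policy once per window, no endpoint needed."""
    return [policy(speech[: i + 1]) for i in range(len(speech))]


def toy_policy(history: List[bool]) -> str:
    """Stand-in for the model: speak as soon as the latest window is silent."""
    return "listen" if history[-1] else "speak"


# True means the user is speaking during that 200 ms window.
speech = [True, True, True, False, False, False, True, False]

endpoints = vad_endpoints(speech)
decisions = micro_turn_decisions(speech, toy_policy)
print("VAD-style endpoints at windows:", endpoints)
print("Per-window decisions:", decisions)
```

In this toy run the VAD-style detector cannot act until window 5, after three silent windows have passed, while the per-window policy already gets a chance to respond at window 3. A real interaction model would make that per-window decision from audio, video, and text rather than from a single speech flag.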

Dual-model architecture — frontend + background

There’s a clever twist. Two models work in tandem:

Frontend interaction model: always on, in conversation with the user, processing input in 200ms windows. It’s a light, fast model designed purely for real-time response.

Background model: for when deeper reasoning is needed, tool calls have to happen, or complex agentic logic must run. The frontend ships the full conversation context to the background, the background reasons, returns a result, and the frontend slots it into the dialogue at a natural pause.
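
The handoff described above can be sketched with Python's asyncio. This is a minimal sketch assuming a very simplified setup: `frontend_loop`, `background_model`, the string labels for windows, and the shortened sleep times are all hypothetical stand-ins, not the lab's API.

```python
import asyncio
from typing import List, Optional


async def background_model(context: List[str]) -> str:
    """Hypothetical slow reasoner: receives the full conversation context."""
    await asyncio.sleep(0.05)  # stands in for reasoning or a tool call
    return f"result based on {len(context)} windows of context"


async def frontend_loop(windows: List[str]) -> List[str]:
    """Hypothetical frontend: handles one window at a time and never blocks."""
    transcript: List[str] = []
    context: List[str] = []
    pending: Optional[asyncio.Task] = None

    for window in windows:
        context.append(window)

        # Hand off to the background model when deeper reasoning is needed.
        if window == "hard question" and pending is None:
            pending = asyncio.create_task(background_model(list(context)))
            transcript.append("frontend: let me look into that")

        # Slot the finished result in at a natural pause (here: silence).
        elif window == "silence" and pending is not None and pending.done():
            transcript.append(f"frontend: {pending.result()}")
            pending = None

        else:
            transcript.append(f"frontend: heard '{window}'")

        await asyncio.sleep(0.03)  # stands in for one 200 ms window, shortened

    return transcript


windows = ["hello", "hard question", "silence", "more talk", "more talk", "silence"]
transcript = asyncio.run(frontend_loop(windows))
for line in transcript:
    print(line)
```

The point of the design is visible in the loop: the frontend keeps responding to every window while the background task runs, and the result only enters the dialogue once it is ready and the user has paused.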

Why this matters

This is the same mental model we humans use. When you’re speaking, one part of your brain listens, another forms the next sentence, and another thinks about the underlying question in the background. AI so far could only do one of these at a time.

Benchmark results — large gap from OpenAI and Google

On FD-bench v1.5 (interaction quality benchmark):

TML-Interaction-Small: 77.8
Gemini 3.1 Flash Live: 54.3
GPT-Realtime-2.0: 46.8

Turn-taking latency (lower is better):

TML: under 0.4 seconds
Gemini 3.1 Flash Live: 0.57s
GPT-Realtime-2.0: 1.18s

On TimeSpeak (a test of whether the model can initiate speech at a specified time): TML hit 64.7% macro-accuracy versus 4.3% for GPT-Realtime-2.0. Basically, GPT-Realtime can barely do it at all.
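
For a sense of scale, the gaps can be worked out directly from the figures quoted above. This sketch adds no new data; because TML's latency is given only as "under 0.4 seconds", the latency gaps it prints are minimums, and the variable names are illustrative.

```python
# Scores and latencies as quoted above (self-reported by Thinking Machines).
fd_bench = {
    "TML-Interaction-Small": 77.8,
    "Gemini 3.1 Flash Live": 54.3,
    "GPT-Realtime-2.0": 46.8,
}
latency_s = {"Gemini 3.1 Flash Live": 0.57, "GPT-Realtime-2.0": 1.18}
TML_LATENCY_UPPER_BOUND_S = 0.4  # quoted only as "under 0.4 seconds"

tml_score = fd_bench["TML-Interaction-Small"]
for name, score in fd_bench.items():
    if name != "TML-Interaction-Small":
        print(f"FD-bench lead over {name}: {tml_score - score:.1f} points")

for name, seconds in latency_s.items():
    # TML's exact latency is not given, so this is a minimum gap.
    gap = seconds - TML_LATENCY_UPPER_BOUND_S
    print(f"Latency gap vs {name}: at least {gap:.2f} s")

timespeak_ratio = 64.7 / 4.3
print(f"TimeSpeak ratio vs GPT-Realtime-2.0: about {timespeak_ratio:.0f}x")
```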

One important caveat: these numbers are published by Thinking Machines itself — independent testing isn’t widely available yet. We’ll need to wait for independent developers to verify them in practice.

What does this mean for business and development?

If you think this is just a better Voice Mode, you might be underestimating it. This is a new architecture, not just a speed boost. Three practical takeaways:

First, real AI support calls: AI call centers have so far been slow and robotic, easily confused by interruptions and unable to interject themselves. With full-duplex, AI can respond at human-operator speed, keep listening while it speaks, and interrupt when needed. This could change the shape of AI phone support.

Second, companions and language learning: Learning a language with AI has been mostly text-based. Now you can actually converse, at the speed and rhythm of a real conversation. Apps like Duolingo and Pimsleur may well add something like this before long.

Third, agents in the real world: Building an AI agent that joins a meeting, talks to multiple people at once, or sits in on a real-time customer call wasn’t really feasible before. With this architecture, it starts to look feasible.

A note of caution

TML-Interaction-Small is still in “research preview” and is not production-ready. Access is limited to a small group of testers. Thinking Machines has raised $2 billion in seed funding but has no commercial product yet. We’ll have to see how it stacks up against OpenAI and Google in practice.

The bottom line — AI architecture is shifting

The era of “bigger = better” may be winding down. New architectural ideas (interaction models, dual-model designs, 200ms micro-turns) are now solving problems that scale alone couldn’t.

With $2 billion and a small team, Mira Murati’s lab appears to have pulled ahead of OpenAI and Google in one specific area (interaction models), at least on its own published benchmarks. It shows that innovation in AI can still come from smaller teams, though for now only in a narrow domain.
