Voice agents can sound smooth and still fail the job. In a recent τ-Voice benchmark, production voice agents from OpenAI, Google, and xAI completed only 26% to 38% of real customer-service tasks under realistic audio conditions, compared with 85% for a comparable text model on equivalent tasks.
That performance problem is the backdrop for Thinking Machines’ new research preview, announced May 11. The company calls the architecture “interaction models” and says it is designed for live voice-and-video conversation, with simultaneous listening and speaking handled inside the model rather than added through outside software.
The model is not publicly available yet, and all performance figures come from Thinking Machines. No independent evaluator has tested it under the grounded task conditions that show whether a voice agent can complete real work, not just keep a natural conversation moving.
Voice agents still stumble on real tasks
The τ-Voice benchmark covers 278 tasks across retail, airline, and telecom scenarios, including returns, flight changes, and billing disputes. Success requires a verifiable database change, not a plausible spoken reply.
Under clean audio, the tested voice agents scored 31% to 51%. Under realistic conditions with background noise, accented speech, and interruptions, those scores dropped to 26% to 38%. The benchmark also found that 79% to 90% of failures traced to agent behavior rather than test artifacts.
Agents struggled with authentication steps that required them to correctly transcribe names, email addresses, and confirmation codes. They also sometimes claimed to complete database updates without executing the tool call.
Turn-taking caused separate failures. OpenAI’s model treated nearly every filler word as a new input, xAI’s model interrupted on 84% of turns, and Google’s model missed more than one in four user turns, according to the benchmark. Mishearing a confirmation code and interrupting a customer mid-sentence are different problems, and a better conversational rhythm does not automatically solve both.
Overlapping speech is the pitch
Thinking Machines says standard voice systems still work like turn-based chat: the user speaks, the model listens, then the model responds. Its interaction model architecture uses a multi-stream design that processes 200-millisecond chunks of audio and video input and output at the same time. The goal is to let the model backchannel while someone is speaking, pause when the user says “hold on,” and respond to visual cues without waiting for a separate prompt.
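To make the 200-millisecond figure concrete, here is a minimal sketch of cutting an audio stream into fixed-duration chunks. It is not Thinking Machines' implementation, which has not been published. The function name `chunk_samples` and the 16 kHz sample rate are assumptions for illustration; only the 200 ms chunk length comes from the company's description.

```python
def chunk_samples(samples, sample_rate_hz, chunk_ms=200):
    """Split a sequence of audio samples into fixed-duration chunks.

    The final chunk may be shorter if the stream does not divide evenly.
    """
    chunk_len = sample_rate_hz * chunk_ms // 1000  # samples per chunk
    if chunk_len <= 0:
        raise ValueError("sample rate and chunk duration must yield at least one sample")
    return [samples[i:i + chunk_len] for i in range(0, len(samples), chunk_len)]


# Hypothetical input: just over one second of audio at an assumed 16 kHz.
stream = list(range(16500))
chunks = chunk_samples(stream, sample_rate_hz=16000)
print(len(chunks), len(chunks[0]), len(chunks[-1]))
```

At 16 kHz, each 200 ms chunk holds 3,200 samples, so a system working at this granularity gets five chances per second to react to new input instead of waiting for the end of a turn.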
The architecture splits live conversation from deeper reasoning. An interaction model handles timing, backchannels, and immediate replies, while a background model runs tool calls, browsing, and multi-step work, then feeds results back into the exchange. That design could help with interruptions, selectivity about when to respond, and long pauses during tool use. It does not prove the system can handle noisy authentication, accents, or reliable tool execution.
Thinking Machines reported a score of 77.8 for TML-Interaction-Small on FD-bench V1.5, compared with 54.3 for Gemini-3.1-flash-live-preview and 46.8 for GPT-realtime-2.0 minimal. It also reported turn-taking latency of 0.40 seconds, compared with 0.57 seconds for Gemini and 1.18 seconds for GPT-realtime.
FD-bench and τ-Voice measure different things. FD-bench scores conversation quality, timing, and proactive engagement. τ-Voice checks whether a conversation produced the right real-world outcome. A model can do better on FD-bench and still fail at the customer-service tasks enterprises care about.
Thinking Machines has the funding and infrastructure to keep testing. The interaction-model preview follows its Tinker fine-tuning product, a $2 billion seed round, and a recent Google Cloud deal tied to frontier model training.
For enterprises testing voice AI now, the standard should be concrete: test real names and addresses, run noisy audio before clean audio, include multiple accent profiles, measure interruptions as well as speed, and confirm that tool calls actually changed the database. A voice agent that sounds natural still has to finish the task.
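The last item on that list, confirming that tool calls changed the database, can be sketched as a simple outcome check. This is an illustrative example only, not the τ-Voice scoring code. The names `EpisodeResult` and `score_episode`, the sample order record, and the `refund_order` tool name are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EpisodeResult:
    claimed_success: bool  # the agent told the caller the task was done
    tool_calls: list       # names of tools the agent actually executed
    db_after: dict         # database state once the conversation ended


def score_episode(result, expected_db):
    """Return (passed, reason), judged on database state rather than on what the agent said."""
    if result.db_after == expected_db:
        return True, "database matches expected state"
    if result.claimed_success and not result.tool_calls:
        return False, "agent claimed completion without executing a tool call"
    return False, "database does not match expected state"


# Hypothetical refund task: the order should end up marked as refunded.
expected = {"order_42": "refunded"}
completed = EpisodeResult(True, ["refund_order"], {"order_42": "refunded"})
talked_only = EpisodeResult(True, [], {"order_42": "delivered"})

print(score_episode(completed, expected))
print(score_episode(talked_only, expected))
```

The second episode is the failure mode the benchmark flagged: the agent says the update is done, but no tool call ran and the record never changed. A check that only reads the transcript would pass it.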