Mercury
Diffusion-based LLM with burst generation and TTS pipeline.
Overview
Mercury is a diffusion language model by Inception Labs. Unlike autoregressive models that generate tokens one at a time, Mercury generates all tokens simultaneously using a burst generation pattern.
This changes the architecture for voice-enabled personas. Instead of streaming tokens to TTS as they arrive, the system uses a two-call pattern optimized for time-to-first-audio.
Two-Call Pattern
Mercury's burst generation means the full response must complete before speech can begin. The two-call pattern solves this:
| Call | Purpose | Search |
|---|---|---|
| Call 1 | Short acknowledgment, fires immediately to TTS | Skipped |
| Call 2 | Full response with search context, generates while Call 1 is speaking | Included |
The user hears a natural acknowledgment almost instantly while the substantive response generates in parallel.
TTS Pipeline
Sentence Detection
Response is split at sentence boundaries for natural speech chunking.
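A minimal sentence splitter along these lines can be written with a regex; a production chunker would also need to handle abbreviations, ellipses, and similar edge cases, which this sketch ignores.

```python
import re

def split_sentences(text: str) -> list[str]:
    # Split after ., !, or ? followed by whitespace; terminator stays attached.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

chunks = split_sentences("Hi there! Mercury finished the burst. Ready to hear it?")
```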
Audio Queue
Sentences are queued with a mutex to prevent overlap between chunks.
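A mutex-guarded queue of this kind might look like the following sketch (using asyncio; the real playback call is not shown in this document, so `asyncio.sleep` stands in for it):

```python
import asyncio

class AudioQueue:
    """Sentence chunks play strictly one at a time; the lock prevents overlap."""

    def __init__(self) -> None:
        self._pending: asyncio.Queue[str] = asyncio.Queue()
        self._lock = asyncio.Lock()
        self.played: list[str] = []   # playback order, for inspection

    async def enqueue(self, sentence: str) -> None:
        await self._pending.put(sentence)

    async def drain(self) -> None:
        while not self._pending.empty():
            sentence = await self._pending.get()
            async with self._lock:        # only one chunk plays at a time
                await asyncio.sleep(0)    # stand-in for real audio playback
                self.played.append(sentence)

async def demo() -> list[str]:
    q = AudioQueue()
    for s in ("First chunk.", "Second chunk."):
        await q.enqueue(s)
    await q.drain()
    return q.played

order = asyncio.run(demo())
```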
Audio Synthesis
Each sentence is converted to audio with lip-sync alignment data.
Avatar Rendering
3D avatar renders synchronized mouth movements in real time.
Garble Detection
Diffusion models at low temperatures can produce garbled output: very short responses, missing punctuation, repeated words. The system applies a temperature floor with automatic garble detection and retry.
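The symptoms listed above lend themselves to simple heuristics. The checks and thresholds below are illustrative assumptions (the document does not state the actual floor value or retry count):

```python
TEMPERATURE_FLOOR = 0.4   # assumed value; the actual floor isn't documented here

def looks_garbled(text: str) -> bool:
    # One heuristic per symptom: too short, no punctuation, repeated words.
    words = text.split()
    if len(words) < 3:                                    # very short response
        return True
    if not any(ch in ".!?" for ch in text):               # missing punctuation
        return True
    return any(a == b for a, b in zip(words, words[1:]))  # word repeated back to back

def generate_with_retry(generate, prompt, temperature, max_retries=2):
    temp = max(temperature, TEMPERATURE_FLOOR)   # enforce the temperature floor
    out = generate(prompt, temp)
    for _ in range(max_retries):
        if not looks_garbled(out):
            break
        temp += 0.2                              # nudge temperature up and retry
        out = generate(prompt, temp)
    return out
```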
Voice Consistency
The VCS scoring system runs on Mercury responses exactly as it does for any other model:
- Signature phrase detection against RICE Communication layer
- Forbidden pattern matching with score penalties
- Sentence distribution validation (short/medium/long ratios)
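A toy version of these three checks is sketched below. The weights, penalty values, and the short/long cutoff are invented for illustration; the real VCS weighting is not described here.

```python
import re

def vcs_score(text: str, signature_phrases: list[str], forbidden_patterns: list[str]) -> float:
    """Toy Voice Consistency Score: illustrative weights, not the system's."""
    lower = text.lower()
    score = 0.0
    # Signature phrase detection (from the RICE Communication layer)
    score += sum(1.0 for p in signature_phrases if p.lower() in lower)
    # Forbidden pattern matching with score penalties
    score -= sum(2.0 for p in forbidden_patterns if p.lower() in lower)
    # Sentence distribution: reward a mix of short and longer sentences
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    lengths = [len(s.split()) for s in sentences]
    if lengths and any(n <= 5 for n in lengths) and any(n > 5 for n in lengths):
        score += 1.0
    return score
```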
Without a persona, Mercury outputs structured tables. With a persona, it outputs first-person conversational text. The structured identity spec changes the output modality entirely.
Search Integration
Search operates in two tiers:
| Tier | Method | Used By |
|---|---|---|
| Tier 1 | Heuristic intent classification | All models |
| Tier 2 | AI-powered upgrade | Non-Mercury models |
Mercury skips Tier 2: burst generation leaves no time for an AI classification round-trip. Call 1 fires without search; Call 2 integrates results if they are available.
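A Tier 1 heuristic classifier could be as simple as keyword matching. The trigger list below is a hypothetical example; the actual rules are not given in this document.

```python
# Illustrative trigger list; the real classifier's rules aren't documented here.
SEARCH_TRIGGERS = ("latest", "today", "current", "news", "price", "weather")

def needs_search(user_turn: str) -> bool:
    # Cheap, synchronous check: safe to run even on Mercury's tight path.
    lower = user_turn.lower()
    return any(trigger in lower for trigger in SEARCH_TRIGGERS)
```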
Prompt Assembly
Six layers compose the final prompt:
| Layer | Purpose |
|---|---|
| RICE base | Persona definition and constraints |
| Style modifier | EQ-adjusted tone parameters |
| Conversation context | Injected facts, session state |
| Deflection override | Prevents off-topic drift |
| Goal steering | State-appropriate steering prompts |
| Search context | Web or capsule results (Call 2 only) |
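The layered assembly in the table reduces to ordered concatenation with one conditional layer. The dictionary keys below are illustrative names for the six layers, not the system's actual identifiers:

```python
def assemble_prompt(layers: dict[str, str], include_search: bool) -> str:
    # Five unconditional layers, in table order; keys are illustrative names.
    order = ["rice_base", "style_modifier", "conversation_context",
             "deflection_override", "goal_steering"]
    parts = [layers[k] for k in order if layers.get(k)]
    # Search context is appended only on Call 2.
    if include_search and layers.get("search_context"):
        parts.append(layers["search_context"])
    return "\n\n".join(parts)
```

Call 1 assembles with `include_search=False`, Call 2 with `include_search=True`, so both calls share every identity-bearing layer and differ only in the final one.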
Model Comparison
The persona layer treats all models identically: same RICE definition, same scoring, same context injection. Identity coherence holds across all backends.
| Model Type | Generation Pattern | Persona Integration |
|---|---|---|
| Autoregressive (streaming) | Token-by-token | Full pipeline, standard streaming |
| Diffusion (burst) | All tokens at once | Two-call pattern for voice latency |