Mercury
Diffusion-based LLM with burst generation and TTS pipeline.
Overview
Mercury is a diffusion language model by Inception Labs. Unlike autoregressive models that generate tokens one at a time, Mercury generates all tokens simultaneously using a burst generation pattern.
This changes the architecture for voice-enabled personas. Instead of streaming tokens to TTS as they arrive, the system uses a two-call pattern optimized for time-to-first-audio.
Two-Call Pattern
Mercury's burst generation means the full response must complete before speech can begin. The two-call pattern solves this:
| Call | Purpose | Search |
|---|---|---|
| Call 1 | Short acknowledgment, fires immediately to TTS | Skipped |
| Call 2 | Full response with search context, generates while Call 1 is speaking | Included |
The user hears a natural acknowledgment almost instantly while the substantive response generates in parallel.
TTS Pipeline
Sentence Detection
Response is split at sentence boundaries for natural speech chunking.
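A minimal sentence splitter along these lines can be written with a regex; a production chunker would also need to handle abbreviations, ellipses, and similar edge cases, which this sketch ignores.

```python
import re

def split_sentences(text: str) -> list[str]:
    # Split after ., !, or ? followed by whitespace; terminator stays attached.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

chunks = split_sentences("Hi there! Mercury finished the burst. Ready to hear it?")
```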
Audio Queue
Sentences are queued with a mutex to prevent overlap between chunks.
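A mutex-guarded queue of this kind might look like the following sketch (using asyncio; the real playback call is not shown in this document, so `asyncio.sleep` stands in for it):

```python
import asyncio

class AudioQueue:
    """Sentence chunks play strictly one at a time; the lock prevents overlap."""

    def __init__(self) -> None:
        self._pending: asyncio.Queue[str] = asyncio.Queue()
        self._lock = asyncio.Lock()
        self.played: list[str] = []   # playback order, for inspection

    async def enqueue(self, sentence: str) -> None:
        await self._pending.put(sentence)

    async def drain(self) -> None:
        while not self._pending.empty():
            sentence = await self._pending.get()
            async with self._lock:        # only one chunk plays at a time
                await asyncio.sleep(0)    # stand-in for real audio playback
                self.played.append(sentence)

async def demo() -> list[str]:
    q = AudioQueue()
    for s in ("First chunk.", "Second chunk."):
        await q.enqueue(s)
    await q.drain()
    return q.played

order = asyncio.run(demo())
```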
Audio Synthesis
Each sentence is converted to audio with lip-sync alignment data.
Avatar Rendering
3D avatar renders synchronized mouth movements in real time.
Garble Detection
Diffusion models at low temperatures can produce garbled output: very short responses, missing punctuation, repeated words. The system applies a temperature floor with automatic garble detection and retry.
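The symptoms listed above lend themselves to simple heuristics. The checks and thresholds below are illustrative assumptions (the document does not state the actual floor value or retry count):

```python
TEMPERATURE_FLOOR = 0.4   # assumed value; the actual floor isn't documented here

def looks_garbled(text: str) -> bool:
    # One heuristic per symptom: too short, no punctuation, repeated words.
    words = text.split()
    if len(words) < 3:                                    # very short response
        return True
    if not any(ch in ".!?" for ch in text):               # missing punctuation
        return True
    return any(a == b for a, b in zip(words, words[1:]))  # word repeated back to back

def generate_with_retry(generate, prompt, temperature, max_retries=2):
    temp = max(temperature, TEMPERATURE_FLOOR)   # enforce the temperature floor
    out = generate(prompt, temp)
    for _ in range(max_retries):
        if not looks_garbled(out):
            break
        temp += 0.2                              # nudge temperature up and retry
        out = generate(prompt, temp)
    return out
```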
Voice Consistency
The VCS scoring system runs on Mercury responses exactly as it does for any other model:
- Signature phrase detection against RICE Communication layer
- Forbidden pattern matching with score penalties
- Sentence distribution validation (short/medium/long ratios)
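A toy version of these three checks is sketched below. The weights, penalty values, and the short/long cutoff are invented for illustration; the real VCS weighting is not described here.

```python
import re

def vcs_score(text: str, signature_phrases: list[str], forbidden_patterns: list[str]) -> float:
    """Toy Voice Consistency Score: illustrative weights, not the system's."""
    lower = text.lower()
    score = 0.0
    # Signature phrase detection (from the RICE Communication layer)
    score += sum(1.0 for p in signature_phrases if p.lower() in lower)
    # Forbidden pattern matching with score penalties
    score -= sum(2.0 for p in forbidden_patterns if p.lower() in lower)
    # Sentence distribution: reward a mix of short and longer sentences
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    lengths = [len(s.split()) for s in sentences]
    if lengths and any(n <= 5 for n in lengths) and any(n > 5 for n in lengths):
        score += 1.0
    return score
```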
Without a persona, Mercury outputs structured tables. With a persona, it outputs first-person conversational text. The structured identity spec changes the output modality entirely.
Search Integration
Search operates in two tiers:
| Tier | Method | Used By |
|---|---|---|
| Tier 1 | Heuristic intent classification | All models |
| Tier 2 | AI-powered upgrade | Non-Mercury models |
Mercury skips Tier 2: burst generation leaves no time for an AI classification round-trip. Call 1 fires without search; Call 2 integrates results if they are available.
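A Tier 1 heuristic classifier could be as simple as keyword matching. The trigger list below is a hypothetical example; the actual rules are not given in this document.

```python
# Illustrative trigger list; the real classifier's rules aren't documented here.
SEARCH_TRIGGERS = ("latest", "today", "current", "news", "price", "weather")

def needs_search(user_turn: str) -> bool:
    # Cheap, synchronous check: safe to run even on Mercury's tight path.
    lower = user_turn.lower()
    return any(trigger in lower for trigger in SEARCH_TRIGGERS)
```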
Prompt Assembly
Six layers compose the final prompt:
| Layer | Purpose |
|---|---|
| RICE base | Persona definition and constraints |
| Style modifier | EQ-adjusted tone parameters |
| Conversation context | Injected facts, session state |
| Deflection override | Prevents off-topic drift |
| Goal steering | State-appropriate steering prompts |
| Search context | Web or capsule results (Call 2 only) |
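The layered assembly in the table reduces to ordered concatenation with one conditional layer. The dictionary keys below are illustrative names for the six layers, not the system's actual identifiers:

```python
def assemble_prompt(layers: dict[str, str], include_search: bool) -> str:
    # Five unconditional layers, in table order; keys are illustrative names.
    order = ["rice_base", "style_modifier", "conversation_context",
             "deflection_override", "goal_steering"]
    parts = [layers[k] for k in order if layers.get(k)]
    # Search context is appended only on Call 2.
    if include_search and layers.get("search_context"):
        parts.append(layers["search_context"])
    return "\n\n".join(parts)
```

Call 1 assembles with `include_search=False`, Call 2 with `include_search=True`, so both calls share every identity-bearing layer and differ only in the final one.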
Model Comparison
The persona layer treats all models identically: same RICE definition, same scoring, same context injection. Identity coherence holds across all backends.
| Model Type | Generation Pattern | Persona Integration |
|---|---|---|
| Autoregressive (streaming) | Token-by-token | Full pipeline, standard streaming |
| Diffusion (burst) | All tokens at once | Two-call pattern for voice latency |