Published Chat Runtime — Architecture Deep Dive

The companion to ONBOARDING_AGENTIC_DEEP_DIVE.md. Onboarding was about building the twin. This is the payoff: an audience member talks to the finished twin. We follow Mia, a founder, chatting with Vlad's published Dolly. Grounded in wix-vmr-repo/legends-platform/persona-chat-service (Scala), with the cross‑repo voice pieces from catalyst-server/legends/{voice-service,vendor-gateway-service}.

Legend: 🟣 = the LLM brain (non‑deterministic) · 🟦 = deterministic scaffolding · 🟡 = an illustrative value I filled in · 🔶 = a link whose mechanism is proven but whose exact runtime wiring lives in deploy config / another repo (see the open 🔶 list at the end of the TL;DR).

Provenance: persona-chat-service is read closely (file:line cited). The actual model inference (Genie, AI Gateway), the knowledge-service, and the external Cloud agent live in other services — where the trail goes cold I say so.

Meet Mia — and the one big idea

Vlad finished onboarding; his twin is published. Mia opens it and types "How do you help with fundraising?" — or speaks it to the video avatar.

The one idea to anchor on: the published twin is a RAG‑grounded persona agent, and its brain is Genie by default. Text and voice/video are not two systems — they converge on one pipeline (ChatService.sendChatMessage), and the vendor (Tavus, etc.) only renders tokens Genie produced. This is the "other brain" we kept hitting in the onboarding dig — now pinned down.

0. Mental model — the chat agent vs. the onboarding agent

| | Onboarding (Atlas) | Published chat (this doc) |
|---|---|---|
| Who talks | the creator (Vlad) | an audience member (Mia) |
| Brain | `onboarding-service` LangGraph | Genie (default), behind `persona-chat-service` |
| Job | capture fields via tools + gates | answer questions via RAG + persona prompt |
| Memory | CloudStore per‑turn state | persisted chat messages + the persona's knowledge |
| Personality | mode guidelines | persona `data` JSON + `behaviorSettings` |

Where onboarding was a state machine driving tools, chat is retrieval‑augmented generation in the persona's voice — with optional tools so the twin keeps learning.

1. Entry points — text and voice converge

Two doors into persona-chat-service, one room:

Text: SendChatMessage(chatId, …) (PersonaChatService.scala:134).
Voice/Video: GenerateChatCompletionsStreamWithVendor (PersonaChatService.scala:248) — the vendor (Tavus/ElevenLabs) POSTs an OpenAI‑style request to /v1/generate-completions/{vendor}/chat/completions/stream; VendorChatService maps the vendor conversation id → chat.

Both call ChatService.sendChatMessage (ChatService.scala:513) — the single pipeline that does RAG → prompt assembly → LLM routing → stream → persist.
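
To make the "two doors, one room" shape concrete, here is a minimal sketch. The request case classes, the `resolveChatId` function, and the string return type are all simplifications I made up for illustration; only the idea that both entry points delegate to a single `sendChatMessage` comes from the code.

```scala
object EntryPointsSketch {
  // Hypothetical, trimmed-down request shapes; the real messages carry more fields.
  final case class TextRequest(chatId: String, text: String)
  final case class VendorRequest(vendor: String, vendorConversationId: String, userText: String)

  // Stand-in for ChatService.sendChatMessage, the one shared pipeline.
  def sendChatMessage(chatId: String, text: String): String =
    s"[chat=$chatId] reply to: $text"

  // Door 1 (text): the caller already holds the chatId.
  def sendChatMessageRpc(req: TextRequest): String =
    sendChatMessage(req.chatId, req.text)

  // Door 2 (voice/video): resolve the vendor conversation id to a chat,
  // then reuse exactly the same pipeline.
  def generateWithVendor(req: VendorRequest,
                         resolveChatId: String => Option[String]): Option[String] =
    resolveChatId(req.vendorConversationId)
      .map(chatId => sendChatMessage(chatId, req.userText))
}
```

The point of the design is that anything added to the pipeline (RAG, prompt layers, persistence) is inherited by voice and video without a second code path.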

2. One full turn — Mia asks the published twin

[Diagram]

Code path: sendChatMessage (ChatService.scala:513) resolves config → routeToLlmClient (:422) → (default) resolveAssistantResponse (:724) → Genie; messages are persisted via chatMessageSDL.createBulk (:574).
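
The tail end of that code path (stream, then persist) can be sketched roughly as follows. `MessageStore` is a hypothetical in-memory stand-in for `chatMessageSDL`, and writing the user and assistant messages together in one `createBulk` call is my assumption about how the bulk write is used, not something I verified.

```scala
object TurnSketch {
  final case class Message(role: String, content: String)

  // Hypothetical in-memory stand-in for chatMessageSDL.createBulk.
  final class MessageStore {
    private var rows = Vector.empty[Message]
    def createBulk(messages: Seq[Message]): Unit = { rows = rows ++ messages }
    def all: Vector[Message] = rows
  }

  // One turn: consume the token stream from whichever client routing picked,
  // then persist both sides of the exchange.
  def runTurn(userText: String,
              llmStream: String => Iterator[String],
              store: MessageStore): String = {
    val reply = new StringBuilder
    llmStream(userText).foreach { token =>
      reply.append(token) // a real service would also forward each token to the caller here
    }
    // ASSUMPTION: user and assistant messages are written in a single bulk call.
    store.createBulk(Seq(Message("user", userText), Message("assistant", reply.toString)))
    reply.toString
  }
}
```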

3. Which brain answers? (the routing fork)

The "two brains" question, resolved. One fork decides the LLM:

[Diagram]

Default (common case): Genie — AssistantChatServiceClient POSTs to ${chatServiceUrl}/v1/send-message-streamed (AssistantChatServiceClient.scala:235). Genie holds the unified‑assistant inference; persona context + knowledge + prompt are passed in.
AI Gateway: when the useDirectAiGateway toggle is on or a system‑prompt override is set (ChatService.scala:744) — this is the path with tool calling wired in.
Direct providers: Gemini / OpenRouter / Cerebras, if an active direct provider is configured.
CustomLlm: an arbitrary OpenAI‑compatible base_url from the AssistantConfiguration (ChatService.scala:423). (Note: this is the text path; the voice path's LLM url is fixed by vendor‑gateway — see §7.)
SpiLlm: stubbed, not implemented (:509).

Bottom line: a published‑twin chat — text or voice — is answered by Genie by default, with AI Gateway / direct / custom as configurable alternatives. 🔶 Genie's internal inference is an external service (wix.genie.chat_service.v1); the trail goes cold at its RPC boundary.
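
The fork can be modelled as a small pure function. This is a sketch under stated assumptions: `RoutingConfig` is a hypothetical flattened view of the real configuration and toggles, and the precedence between the branches is illustrative only. The conditions themselves (toggle or system‑prompt override → AI Gateway, otherwise Genie) are the ones described above. SpiLlm is omitted because it is stubbed.

```scala
object RoutingSketch {
  sealed trait Backend
  case object Genie extends Backend
  case object AiGateway extends Backend
  final case class DirectProvider(name: String) extends Backend // e.g. Gemini, OpenRouter, Cerebras
  final case class CustomLlm(baseUrl: String) extends Backend

  // Hypothetical flattened config; the real fields live on AssistantConfiguration and feature toggles.
  final case class RoutingConfig(
      customLlmBaseUrl: Option[String] = None,
      activeDirectProvider: Option[String] = None,
      useDirectAiGateway: Boolean = false,
      systemPromptOverride: Option[String] = None
  )

  // ASSUMPTION: the order of the cases is illustrative, not read from the code.
  def route(cfg: RoutingConfig): Backend = cfg match {
    case RoutingConfig(Some(url), _, _, _)                            => CustomLlm(url)
    case RoutingConfig(_, Some(provider), _, _)                       => DirectProvider(provider)
    case c if c.useDirectAiGateway || c.systemPromptOverride.nonEmpty => AiGateway
    case _                                                            => Genie // the default brain
  }
}
```

With an empty `RoutingConfig()` the function falls through to `Genie`, which matches the "default, common case" reading of the fork.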

4. Prompt assembly — the persona's voice

The system prompt is built from a persona template plus layered context (AIGatewayChatServiceClient.scala:193, Prompts.scala):

effectivePrompt =
    persona template (publishedMode_PersonaTemplate.txt)         # "You are [name]… respond as you naturally would"
  + <behaviorSettings>: conversationStyle, communicationTone,    # the very knobs set on the Dolly
      responseStyle, specialInstructions
  + renderRelevantKnowledge(RAG snippets)                        # facts pulled for THIS question
  + renderDynamicKnowledge(assistant-domain context)
  + renderConversationMode(...)

The template is tightly constrained: "You are [name]… respond exactly as you would naturally," with hard rules never to reveal the behavior settings, so Mia can't get the twin to dump its own config. This is where the BehaviorSettings/CommunicationStyle/AnswerPreferences that Vlad set during onboarding (and in dolly-service) actually take effect.
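
A rough sketch of the layering, assuming plain string concatenation. The field names follow the `<behaviorSettings>` knobs listed above, but the rendered text format, the `[name]` substitution, and the function signatures are my own illustration. The dynamic‑knowledge and conversation‑mode layers are left out to keep it short.

```scala
object PromptSketch {
  // Field names mirror the behaviorSettings knobs; the String types are an assumption.
  final case class BehaviorSettings(
      conversationStyle: String,
      communicationTone: String,
      responseStyle: String,
      specialInstructions: String
  )

  // ASSUMPTION: the rendered layout is illustrative, not the real template output.
  def renderBehaviorSettings(b: BehaviorSettings): String =
    Seq(
      "<behaviorSettings>",
      s"conversationStyle: ${b.conversationStyle}",
      s"communicationTone: ${b.communicationTone}",
      s"responseStyle: ${b.responseStyle}",
      s"specialInstructions: ${b.specialInstructions}",
      "</behaviorSettings>"
    ).mkString("\n")

  def renderRelevantKnowledge(snippets: Seq[String]): String =
    if (snippets.isEmpty) ""
    else snippets.map("- " + _).mkString("Relevant knowledge:\n", "\n", "")

  // Layers are joined in order; empty layers are dropped.
  def effectivePrompt(template: String, personaName: String,
                      settings: BehaviorSettings, snippets: Seq[String]): String =
    Seq(
      template.replace("[name]", personaName),
      renderBehaviorSettings(settings),
      renderRelevantKnowledge(snippets)
    ).filter(_.nonEmpty).mkString("\n\n")
}
```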

5. RAG — grounding the answer in what Vlad taught

Before the LLM call, the service retrieves the most relevant knowledge for Mia's question (AIGatewayChatServiceClient.scala:156 → AssistantTools.scala:51):

knowledgeServiceClient.searchKnowledge(
  SearchKnowledgeRequest(personaId, query, namespace, mode))

knowledge-service is the unified facade: it federates snippets, documents (Vespa), and links from the same store that onboarding seeded and the teaching phase filled (§"How is knowledge captured" earlier). The ranked results are injected into the prompt via renderRelevantKnowledge, so the twin answers from Vlad's actual material, not just the base model.

🔶 The knowledge-service + Vespa internals live in other services (trail cold there).
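
Since the real retrieval lives behind knowledge-service, the retrieve‑then‑inject step can only be sketched with a stand‑in. Below, `InMemoryKnowledge` is a toy client that ranks by shared words (nothing like Vespa's ranking), `KnowledgeHit` is a made‑up result type, and the `namespace`/`mode` values are placeholders. Only the `searchKnowledge(SearchKnowledgeRequest(personaId, query, namespace, mode))` call shape is taken from the code.

```scala
object RagSketch {
  // String types for namespace and mode are an assumption.
  final case class SearchKnowledgeRequest(personaId: String, query: String,
                                          namespace: String, mode: String)
  final case class KnowledgeHit(text: String, score: Double) // hypothetical result shape

  trait KnowledgeServiceClient {
    def searchKnowledge(req: SearchKnowledgeRequest): Seq[KnowledgeHit]
  }

  private def tokenize(s: String): Set[String] =
    s.toLowerCase.split("\\W+").filter(_.nonEmpty).toSet

  // Toy stand-in: scores each snippet by the number of words it shares with the query.
  final class InMemoryKnowledge(corpus: Map[String, Seq[String]]) extends KnowledgeServiceClient {
    def searchKnowledge(req: SearchKnowledgeRequest): Seq[KnowledgeHit] = {
      val terms = tokenize(req.query)
      corpus.getOrElse(req.personaId, Nil)
        .map(text => KnowledgeHit(text, tokenize(text).intersect(terms).size.toDouble))
        .filter(_.score > 0)
        .sortBy(hit => -hit.score)
    }
  }

  // Retrieval happens before the LLM call; the top hits feed renderRelevantKnowledge.
  def relevantSnippets(client: KnowledgeServiceClient, personaId: String,
                       question: String, topK: Int): Seq[String] =
    client
      .searchKnowledge(SearchKnowledgeRequest(personaId, question, "default", "hybrid")) // placeholder values
      .take(topK)
      .map(_.text)
}
```

Note that the search is scoped by `personaId`, so Mia's question can only ever surface Vlad's material.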

6. Tools mid‑conversation — the twin keeps learning

On the AI Gateway path, the assistant can call tools during the chat (AIGatewayChatServiceClient.scala:173; tools in AssistantTools.scala):

retrievePersonaKnowledge — fetch more snippets mid‑turn if the first RAG pass wasn't enough.
triggerKnowledgeSnippetCreation — fire‑and‑forget: mine new snippets from the ongoing conversation (so a good Q&A with Mia can become future knowledge).
upsertKnowledgeSnippet — create/update a snippet.

This mirrors onboarding's "knowledge as a byproduct" idea: the published twin doesn't just answer — it can grow its own knowledge from conversations. (Tool calling is wired in the AI Gateway path; the default Genie path's tool orchestration is internal to Genie — 🔶.)
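
A minimal sketch of how the three tools could be dispatched, assuming a simple in‑memory `SnippetStore` and a `mine` callback (both hypothetical). The tool names match the list above; the argument shapes and return strings are illustrative. The interesting part is the fire‑and‑forget branch: the mining job is started on a `Future` and the tool returns immediately without waiting for it.

```scala
import scala.collection.concurrent.TrieMap
import scala.concurrent.{ExecutionContext, Future}

object ToolsSketch {
  // Argument shapes are assumptions; only the tool names come from the service.
  sealed trait ToolCall
  final case class RetrievePersonaKnowledge(query: String) extends ToolCall
  final case class TriggerKnowledgeSnippetCreation(chatId: String) extends ToolCall
  final case class UpsertKnowledgeSnippet(id: String, text: String) extends ToolCall

  // Hypothetical thread-safe snippet store.
  final class SnippetStore {
    private val snippets = TrieMap.empty[String, String]
    def upsert(id: String, text: String): Unit = snippets.update(id, text)
    def search(query: String): Seq[String] =
      snippets.values.filter(_.toLowerCase.contains(query.toLowerCase)).toSeq.sorted
  }

  def handle(call: ToolCall, store: SnippetStore, mine: String => Unit)
            (implicit ec: ExecutionContext): String = call match {
    case RetrievePersonaKnowledge(query) =>
      store.search(query).mkString("\n")
    case UpsertKnowledgeSnippet(id, text) =>
      store.upsert(id, text)
      s"upserted $id"
    case TriggerKnowledgeSnippetCreation(chatId) =>
      Future(mine(chatId)) // fire-and-forget: the result is intentionally not awaited
      "accepted"
  }
}
```

Because the mining result is never awaited, a slow or failing snippet job cannot delay or break Mia's answer; the trade‑off is that failures have to be observed elsewhere (logs, metrics).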

7. Voice/video — closing the `{aiAssistantUrl}` loop

This reconciles the open 🔶 from the onboarding voice dig. When Mia talks to the video twin, the LiveKit Cloud agent (or Tavus) doesn't have its own brain — it calls back into persona‑chat‑service, which runs the same pipeline:

[Diagram]

The reconciliation: vendor‑gateway pointed the Cloud agent's LLM at {aiAssistantUrl}/<vendor>/chat/completions, and that endpoint is almost certainly persona‑chat‑service's GenerateChatCompletionsStreamWithVendor (path /v1/generate-completions/{vendor}/chat/completions/stream, VendorChatService.scala:39). It maps the vendor conversation id → InteractiveConversationMapping → chatId/personaId, then runs the identical RAG + prompt + Genie pipeline and streams tokens back to the vendor to render as voice/video.

🔶 The exact {aiAssistantUrl} value is deploy config (we couldn't read it), but the path shapes match — this is a strong inference, not a guess.
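
The mapping step can be sketched like this. The fields on `InteractiveConversationMapping`, the `ResolvedTurn` type, and the choice to take the latest `user` message from the OpenAI‑style payload are assumptions for illustration. How the vendor conversation id actually arrives in the request is not shown here, so it is passed in as a plain parameter.

```scala
object VendorMappingSketch {
  // OpenAI-style chat message, reduced to the two fields used here.
  final case class OpenAiMessage(role: String, content: String)

  // ASSUMPTION: field set inferred from the names in the text, not read from the entity.
  final case class InteractiveConversationMapping(vendorConversationId: String,
                                                  chatId: String, personaId: String)

  final case class ResolvedTurn(chatId: String, personaId: String, userText: String)

  // The vendor sends the running transcript; the newest user message is the question to answer.
  def latestUserMessage(messages: Seq[OpenAiMessage]): Option[String] =
    messages.reverseIterator.collectFirst { case OpenAiMessage("user", content) => content }

  def resolveTurn(vendorConversationId: String,
                  messages: Seq[OpenAiMessage],
                  mappings: Map[String, InteractiveConversationMapping]): Either[String, ResolvedTurn] =
    for {
      mapping <- mappings.get(vendorConversationId)
                   .toRight(s"no mapping for vendor conversation $vendorConversationId")
      text    <- latestUserMessage(messages).toRight("request contains no user message")
    } yield ResolvedTurn(mapping.chatId, mapping.personaId, text)
}
```

Once a `ResolvedTurn` exists, nothing downstream needs to know the turn came from a vendor: it is just a `chatId` and a piece of user text going into the shared pipeline.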

8. Persistence & identity

Entities (SDL): PersonaChat (chatId, personaId, mode, assistantId, configurationId) and ChatMessage (role, content, position, previousMessageId, vendorChatId). Stored via personaChatSDL / chatMessageSDL (ChatService.scala:183, :574).
personaId resolution: from the chat record (getChat), and mirrored into Genie conversation metadata (AssistantChatServiceClient.scala:145) and the vendor mapping for voice/video.
personaId is the through‑line: the same key that onboarding minted (Persona at "Create"), that knowledge attaches to, and that chat retrieves by. Capture → store → retrieve, all on personaId.
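
To make the two entities and the `position` / `previousMessageId` chain tangible, here is a small sketch. The field lists follow the entity descriptions above, except `ChatMessage.id` and `ChatMessage.chatId`, which I added so messages can be linked; the id format and the `"published"` mode value in the usage are 🟡 illustrative.

```scala
object PersistenceSketch {
  final case class PersonaChat(chatId: String, personaId: String, mode: String,
                               assistantId: String, configurationId: String)

  // ASSUMPTION: `id` and `chatId` are added here for linking; the other fields are from the text.
  final case class ChatMessage(id: String, chatId: String, role: String, content: String,
                               position: Int, previousMessageId: Option[String],
                               vendorChatId: Option[String])

  // Append a user/assistant pair, chaining each message to the one before it.
  def appendTurn(history: Vector[ChatMessage], chat: PersonaChat, userText: String,
                 assistantText: String, vendorChatId: Option[String] = None): Vector[ChatMessage] = {
    val start = history.size
    val user = ChatMessage(s"${chat.chatId}-m$start", chat.chatId, "user", userText,
      start, history.lastOption.map(_.id), vendorChatId)
    val assistant = ChatMessage(s"${chat.chatId}-m${start + 1}", chat.chatId, "assistant",
      assistantText, start + 1, Some(user.id), vendorChatId)
    history :+ user :+ assistant
  }

  // personaId is read off the chat record, mirroring the getChat lookup.
  def personaIdFor(chatId: String, chats: Map[String, PersonaChat]): Option[String] =
    chats.get(chatId).map(_.personaId)
}
```

For example, two calls to `appendTurn` on an empty history yield positions 0 through 3, with each message pointing at its predecessor and the first message pointing at nothing.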

9. Onboarding vs. chat — the whole system in one frame

                    persona_id (the through-line)
   ┌───────────────────────────┴───────────────────────────┐
ONBOARDING (Atlas)                                  PUBLISHED CHAT (this doc)
 onboarding-service LangGraph                        persona-chat-service → Genie
 capture name/intent/photo/voice                     RAG over knowledge + persona prompt
 seeds snippets inline                               can mine new snippets mid-chat
 transport: LiveKit + Cloud agent                    transport: text OR LiveKit + Cloud agent
 brain: onboarding-service                           brain: Genie (default)

Two agents, two brains, one persona and one transport layer. Onboarding writes the twin; chat reads it (and keeps writing, via tool‑driven snippet creation).

TL;DR — patterns to take away

| Pattern | Where it shows up |
|---|---|
| Text + voice converge on one pipeline | `ChatService.sendChatMessage` |
| RAG / grounded generation | `searchKnowledge` → `renderRelevantKnowledge` |
| Persona prompt from stored settings | `publishedMode_PersonaTemplate.txt` + `behaviorSettings` |
| "Never reveal the config" guardrail | persona template hard rules |
| Pluggable LLM backend | Genie (default) / AI Gateway / direct / CustomLlm / SPI |
| Vendor renders, doesn't think | `GenerateChatCompletionsStreamWithVendor` → Genie → stream back |
| Twin keeps learning | `triggerKnowledgeSnippetCreation` mid-chat |
| `persona_id` as the through-line | capture → store → retrieve |

Open 🔶 (runtime/deploy or external): Genie's internal inference, AI Gateway internals, knowledge-service + Vespa internals, and the literal {aiAssistantUrl} value all live outside persona-chat-service. Everything inside persona-chat-service (routing, RAG call, prompt assembly, persistence, vendor mapping) is code-grounded with file:line above.
