Published Chat Runtime — Architecture Deep Dive

The companion to ONBOARDING_AGENTIC_DEEP_DIVE.md. Onboarding was about building the twin. This is the payoff: an audience member talks to the finished twin. We follow Mia, a founder, chatting with Vlad's published Dolly. Grounded in wix-vmr-repo/legends-platform/persona-chat-service (Scala), with the cross‑repo voice pieces from catalyst-server/legends/{voice-service,vendor-gateway-service}.

Legend: 🟣 = the LLM brain (non‑deterministic) · 🟦 = deterministic scaffolding · 🟡 = an illustrative value I filled in · 🔶 = a link whose mechanism is proven but whose exact runtime wiring lives in deploy config / another repo (see the "Open 🔶" list at the end).

Provenance: persona-chat-service is read closely (file:line cited). The actual model inference (Genie, AI Gateway), the knowledge-service, and the external Cloud agent live in other services — where the trail goes cold I say so.


Meet Mia — and the one big idea

Vlad finished onboarding; his twin is published. Mia opens it and types "How do you help with fundraising?" — or speaks it to the video avatar.

The one idea to anchor on: the published twin is a RAG‑grounded persona agent, and its brain is Genie by default. Text and voice/video are not two systems — they converge on one pipeline (ChatService.sendChatMessage), and the vendor (Tavus, etc.) only renders tokens Genie produced. This is the "other brain" we kept hitting in the onboarding dig — now pinned down.


0. Mental model — the chat agent vs. the onboarding agent

              Onboarding (Atlas)                 Published chat (this doc)
Who talks     the creator (Vlad)                 an audience member (Mia)
Brain         onboarding-service LangGraph       Genie (default), behind persona-chat-service
Job           capture fields via tools + gates   answer questions via RAG + persona prompt
Memory        CloudStore per‑turn state          persisted chat messages + the persona's knowledge
Personality   mode guidelines                    persona data JSON + behaviorSettings

Where onboarding was a state machine driving tools, chat is retrieval‑augmented generation in the persona's voice — with optional tools so the twin keeps learning.


1. Entry points — text and voice converge

Two doors into persona-chat-service, one room:

Both call ChatService.sendChatMessage (ChatService.scala:513) — the single pipeline that does RAG → prompt assembly → LLM routing → stream → persist.


2. One full turn — Mia asks the published twin

[Diagram]

Code path: sendChatMessage (ChatService.scala:513) resolves config → routeToLlmClient (:422) → (default) resolveAssistantResponse (:724) → Genie; messages are persisted via chatMessageSDL.createBulk (:574).
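
To make the stage order concrete, here is a minimal sketch of one turn. Only the name sendChatMessage comes from the code path above; Deps, ChatMessage, and the three function fields are hypothetical stand-ins for the knowledge lookup, the routed LLM client, and the bulk message write, and the real method certainly carries more (config resolution, routing, streaming to the caller).

```scala
// Illustrative only: a single turn as RAG -> prompt -> LLM -> stream -> persist.
// Every name except sendChatMessage is an assumption made for this sketch.
object ChatPipelineSketch {
  final case class ChatMessage(role: String, text: String)

  // Stand-ins for the external collaborators.
  final case class Deps(
      retrieve: String => List[String],               // RAG lookup for the user's question
      complete: (String, String) => Iterator[String], // (systemPrompt, userText) => token stream
      persist: List[ChatMessage] => Unit              // bulk write of the finished turn
  )

  def sendChatMessage(deps: Deps, personaName: String, userText: String): String = {
    val snippets = deps.retrieve(userText)                              // 1. RAG
    val prompt   = (s"You are $personaName." :: snippets).mkString("\n") // 2. prompt assembly
    val tokens   = deps.complete(prompt, userText)                      // 3. LLM call (routing elided)
    val reply    = tokens.mkString                                      // 4. drain the token stream
    deps.persist(List(ChatMessage("user", userText), ChatMessage("assistant", reply))) // 5. persist
    reply
  }
}
```

Passing the collaborators in as plain functions keeps the sketch testable without any of the external services.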


3. Which brain answers? (the routing fork)

The "two brains" question, resolved. One fork decides the LLM:

[Diagram]

Bottom line: a published‑twin chat — text or voice — is answered by Genie by default, with AI Gateway / direct / custom as configurable alternatives. 🔶 Genie's internal inference is an external service (wix.genie.chat_service.v1); the trail goes cold at its RPC boundary.
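
The fork can be pictured as a function from an optional configured backend to a client choice. This is a sketch under assumptions: the backend names follow the alternatives listed in this doc, but the config keys ("ai-gateway", "direct", "custom", "spi"), the signature, and the rule that unrecognised values fall back to Genie are all hypothetical; the real routeToLlmClient may decide on different inputs.

```scala
// Illustrative only: Genie is the default, other backends are opt-in via config.
object LlmRoutingSketch {
  sealed trait LlmBackend
  case object Genie     extends LlmBackend
  case object AiGateway extends LlmBackend
  case object Direct    extends LlmBackend
  case object CustomLlm extends LlmBackend
  case object Spi       extends LlmBackend

  // Hypothetical config keys; a missing or unknown value routes to Genie (assumption).
  def routeToLlmClient(configuredBackend: Option[String]): LlmBackend =
    configuredBackend.map(_.trim.toLowerCase) match {
      case Some("ai-gateway") => AiGateway
      case Some("direct")     => Direct
      case Some("custom")     => CustomLlm
      case Some("spi")        => Spi
      case _                  => Genie
    }
}
```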


4. Prompt assembly — the persona's voice

The system prompt is built from a persona template plus layered context (AIGatewayChatServiceClient.scala:193, Prompts.scala):

effectivePrompt =
    persona template (publishedMode_PersonaTemplate.txt)         # "You are [name]… respond as you naturally would"
  + <behaviorSettings>: conversationStyle, communicationTone,    # the very knobs set on the Dolly
      responseStyle, specialInstructions
  + renderRelevantKnowledge(RAG snippets)                        # facts pulled for THIS question
  + renderDynamicKnowledge(assistant-domain context)
  + renderConversationMode(...)

The template is tightly constrained: "You are [name]… respond exactly as you would naturally," with hard rules to never reveal the behavior settings, so Mia can't get the twin to dump its own config. This is where the BehaviorSettings/CommunicationStyle/AnswerPreferences that Vlad set during onboarding (and via dolly-service) take effect.
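
The layering can be sketched as plain string composition. The four behaviorSettings field names and the renderRelevantKnowledge name come from the outline above; the rendered text format, the "[name]" substitution, and the function signatures are assumptions, and the dynamic-knowledge and conversation-mode layers are left out for brevity.

```scala
// Illustrative only: template + behavior settings + RAG snippets, joined in order.
object PromptAssemblySketch {
  final case class BehaviorSettings(
      conversationStyle: String,
      communicationTone: String,
      responseStyle: String,
      specialInstructions: String
  )

  // Hypothetical rendering; the real block format lives in the template files.
  def renderBehaviorSettings(s: BehaviorSettings): String =
    List(
      "<behaviorSettings>",
      s"conversationStyle: ${s.conversationStyle}",
      s"communicationTone: ${s.communicationTone}",
      s"responseStyle: ${s.responseStyle}",
      s"specialInstructions: ${s.specialInstructions}",
      "</behaviorSettings>"
    ).mkString("\n")

  // An empty result means "no layer"; it is filtered out below.
  def renderRelevantKnowledge(snippets: List[String]): String =
    if (snippets.isEmpty) ""
    else snippets.map("- " + _).mkString("Relevant knowledge:\n", "\n", "")

  def effectivePrompt(
      template: String,
      name: String,
      settings: BehaviorSettings,
      snippets: List[String]
  ): String =
    List(
      template.replace("[name]", name),
      renderBehaviorSettings(settings),
      renderRelevantKnowledge(snippets)
    ).filter(_.nonEmpty).mkString("\n\n")
}
```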


5. RAG — grounding the answer in what Vlad taught

Before the LLM call, the service retrieves the most relevant knowledge for Mia's question (AIGatewayChatServiceClient.scala:156 → AssistantTools.scala:51):

knowledgeServiceClient.searchKnowledge(
  SearchKnowledgeRequest(personaId, query, namespace, mode))

knowledge-service is the unified facade — it federates snippets + documents (Vespa) + links (the same store onboarding seeded and the teaching phase filled, §"How is knowledge captured" earlier). The ranked results are injected as renderRelevantKnowledge into the prompt. So the twin answers from Vlad's actual material, not just the base model.

🔶 The knowledge-service + Vespa internals live in other services (trail cold there).
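
Since the real ranking happens inside knowledge-service, the following is only a toy stand-in that shows the shape of the call and of a federated, ranked result. The SearchKnowledgeRequest fields mirror the call above; KnowledgeHit, InMemoryKnowledge, the limit parameter, and the word-overlap score are assumptions invented for this sketch and say nothing about how Vespa scores results.

```scala
// Illustrative only: an in-memory fake that federates sources and ranks by word overlap.
object RagSketch {
  final case class SearchKnowledgeRequest(personaId: String, query: String, namespace: String, mode: String)
  final case class KnowledgeHit(source: String, text: String, score: Double)

  // Lower-cased word set; punctuation is treated as a separator.
  private def tokens(s: String): Set[String] =
    s.toLowerCase.split("\\W+").filter(_.nonEmpty).toSet

  // store: personaId -> (source kind, text), where source is "snippet", "document" or "link".
  final class InMemoryKnowledge(store: Map[String, List[(String, String)]]) {
    def searchKnowledge(req: SearchKnowledgeRequest, limit: Int): List[KnowledgeHit] = {
      val terms = tokens(req.query)
      store.getOrElse(req.personaId, Nil)
        .map { case (source, text) =>
          KnowledgeHit(source, text, (tokens(text) & terms).size.toDouble)
        }
        .filter(_.score > 0)   // drop entries that share no words with the query
        .sortBy(h => -h.score) // best match first
        .take(limit)
    }
  }
}
```

The hits returned here are what a renderer such as renderRelevantKnowledge would turn into prompt text.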


6. Tools mid‑conversation — the twin keeps learning

On the AI Gateway path, the assistant can call tools during the chat (AIGatewayChatServiceClient.scala:173; tools in AssistantTools.scala):

This mirrors onboarding's "knowledge as a byproduct" idea: the published twin doesn't just answer — it can grow its own knowledge from conversations. (Tool calling is wired in the AI Gateway path; the default Genie path's tool orchestration is internal to Genie — 🔶.)
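
A rough sketch of what handling a tool call mid-reply could look like. The tool name triggerKnowledgeSnippetCreation appears in the TL;DR of this doc; everything else (ModelOutput, ToolCall, SnippetStore, the "text" argument key) is hypothetical, and the real AI Gateway tool protocol is not shown here.

```scala
// Illustrative only: text chunks become the reply, tool calls produce side effects.
object ToolLoopSketch {
  final case class ToolCall(name: String, args: Map[String, String])

  sealed trait ModelOutput
  final case class Text(value: String)     extends ModelOutput
  final case class CallTool(call: ToolCall) extends ModelOutput

  // Hypothetical stand-in for the persona's snippet storage.
  final class SnippetStore {
    private var snippets = Vector.empty[String]
    def add(text: String): Unit = { snippets = snippets :+ text }
    def all: Vector[String] = snippets
  }

  def handle(outputs: List[ModelOutput], store: SnippetStore): String = {
    outputs.foreach {
      case CallTool(ToolCall("triggerKnowledgeSnippetCreation", args)) =>
        args.get("text").foreach(store.add) // "text" is an assumed argument name
      case _ => ()                          // plain text and unknown tools: no side effect
    }
    outputs.collect { case Text(t) => t }.mkString
  }
}
```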


7. Voice/video — closing the {aiAssistantUrl} loop

This reconciles the open 🔶 from the onboarding voice dig. When Mia talks to the video twin, the LiveKit Cloud agent (or Tavus) doesn't have its own brain — it calls back into persona‑chat‑service, which runs the same pipeline:

[Diagram]

The reconciliation: vendor‑gateway pointed the Cloud agent's LLM at {aiAssistantUrl}/<vendor>/chat/completions, and that endpoint is almost certainly persona‑chat‑service's GenerateChatCompletionsStreamWithVendor (path /v1/generate-completions/{vendor}/chat/completions/stream, VendorChatService.scala:39). It maps the vendor conversation-id → InteractiveConversationMapping → chatId/personaId, then runs the identical RAG+prompt+Genie pipeline and streams tokens back to the vendor to render as voice/video.

🔶 The exact {aiAssistantUrl} value is deploy config (we couldn't read it), but the path shapes match — this is a strong inference, not a guess.
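
The two mechanical steps of that callback, extracting the vendor from the path and resolving the vendor conversation id, can be sketched as below. The path shape and the InteractiveConversationMapping name are taken from the text above; the field layout, the map keyed by conversation id alone, and the function names are assumptions.

```scala
// Illustrative only: path -> vendor, then vendor conversation id -> chatId/personaId.
object VendorMappingSketch {
  final case class InteractiveConversationMapping(chatId: String, personaId: String)

  // Regex pattern matching is anchored, so the whole path must match.
  private val VendorPath = "/v1/generate-completions/([^/]+)/chat/completions/stream".r

  def vendorOf(path: String): Option[String] = path match {
    case VendorPath(vendor) => Some(vendor)
    case _                  => None
  }

  // Hypothetical lookup; the real mapping is presumably persisted, not an in-memory Map.
  def resolve(
      mappings: Map[String, InteractiveConversationMapping],
      conversationId: String
  ): Either[String, InteractiveConversationMapping] =
    mappings.get(conversationId).toRight(s"no mapping for vendor conversation $conversationId")
}
```

Once resolve returns a chatId/personaId pair, the request can enter the same pipeline as a text turn.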


8. Persistence & identity


9. Onboarding vs. chat — the whole system in one frame

                    persona_id (the through-line)
   ┌───────────────────────────┴───────────────────────────┐
ONBOARDING (Atlas)                                  PUBLISHED CHAT (this doc)
 onboarding-service LangGraph                        persona-chat-service → Genie
 capture name/intent/photo/voice                     RAG over knowledge + persona prompt
 seeds snippets inline                               can mine new snippets mid-chat
 transport: LiveKit + Cloud agent                    transport: text OR LiveKit + Cloud agent
 brain: onboarding-service                           brain: Genie (default)

Two agents, two brains, one persona and one transport layer. Onboarding writes the twin; chat reads it (and keeps writing, via tool‑driven snippet creation).

TL;DR — patterns to take away

Pattern                                  Where it shows up
Text + voice converge on one pipeline    ChatService.sendChatMessage
RAG / grounded generation                searchKnowledge → renderRelevantKnowledge
Persona prompt from stored settings      publishedMode_PersonaTemplate.txt + behaviorSettings
"Never reveal the config" guardrail      persona template hard rules
Pluggable LLM backend                    Genie (default) / AI Gateway / direct / CustomLlm / SPI
Vendor renders, doesn't think            GenerateChatCompletionsStreamWithVendor → Genie → stream back
Twin keeps learning                      triggerKnowledgeSnippetCreation mid-chat
persona_id as the through-line           capture → store → retrieve

Open 🔶 (runtime/deploy or external): Genie's internal inference, AI Gateway internals, knowledge-service + Vespa internals, and the literal {aiAssistantUrl} value all live outside persona-chat-service. Everything inside persona-chat-service (routing, RAG call, prompt assembly, persistence, vendor mapping) is code-grounded with file:line above.