A pure architectural view of the Capacity platform: what the buckets do, how they connect, and how they fit together. No SRE / observability concerns — those live in datadog-capacity-overview.md and datadog-capacity-onboarding.md.
13 business areas, organized as three concentric layers around a small set of platform fabric services.
Three layers: Channels, Intelligence, and Platform fabric.
The telephony stack. Handles inbound and outbound voice calls — SIP / RTP ingestion via asterisk and voice-ingress, real-time speech-to-text via voice-proxy (which wraps the Azure Speech SDK), and post-call transcript analysis. Integrates with LumenVox externally for IVR / ASR. Outbound dialing campaigns are scheduled via the outbound-scheduler-* services.
The SMS channel via the Textel provider. interface-textel is the platform's SMS interface — both inbound (customer messages → tickets / conversations) and outbound (bulk responses, campaigns). bulk-responder handles bulk-SMS workflows.
Note: core-api calls Textel directly in addition to going through interface-textel. This is an architectural shortcut — likely a legacy code path.
Real-time conversation orchestration — both live chat with end users and the agent-assist surface for human operators. conversations is the central orchestration service: it routes inbound messages through agents, the LLM stack, and the knowledge base, and decides whether to escalate to humans or auto-respond. webhub is the Capacity Concierge product's entry point (the only service in the platform with a team:concierge tag).
Customer-support ticket workflow. ticketing is the central ticket store. omnichannel-inbox + omnichannel-grpc are the unified inbox that turns inbound messages (from email, SMS, chat) into tickets. interface-email is the email channel. real-time-hub pushes live ticket-update events to the UI. The cluster is LLM-augmented: ticketing calls into llm-hub and vector-db for AI triage, similar-ticket retrieval, and summarization.
The customer-facing UI surfaces — web (web, web-interface, webui), mobile (mobile), embedded survey widgets (surveys-ui), and document viewers (pdf-viewer, milestone-viewer). The backend gateway interface-gateway fans inbound HTTP/WS traffic out to channel-specific services. Almost every inbound customer request goes through interface-gateway.
The AI agent backbone. Coordinates LLM-driven agents that handle customer interactions, dispatch skills, and use tools. agents is the central agent registry; agent-service is the modern AI-agent backend; agent-assist-module provides real-time suggestions to human operators. mcp-service implements Anthropic's Model Context Protocol — the AI-agent tool-routing layer. skill-controller orchestrates agent skill execution.
The bucket integrates externally with LangSmith (LLM observability) and Tavily (AI-agent search API).
The central LLM inference gateway. llm-hub routes inference requests to AWS Bedrock (us-east-1 + us-west-2) and Azure Cognitive Services (Azure OpenAI) in parallel — a multi-cloud, multi-provider design. Every AI-powered feature in the platform ultimately calls llm-hub. tei-embedding serves embeddings (via HuggingFace's Text Embeddings Inference). mlflow-tracking-server tracks model training (offline use).
Traditional pre-LLM NLP services. nlpengine is the hub — calls nlp, nlp-data, plus the embedding services glove (GloVe) and guse (Google Universal Sentence Encoder). babelfish provides translation. label-studio is the ML annotation platform. brain / prototype-brain are the platform's NLP-domain decision engines.
This bucket coexists with the LLM bucket as a two-generation NLP architecture: the NLP bucket keeps the pre-LLM modular design (a separate service per embedding type), while the LLM bucket collapses inference behind a single gateway.
Document retrieval and ML document ingestion. Three sub-architectures:
vector-db (semantic, HuggingFace + tei-embedding backed) and docsearch (lexical, Elasticsearch backed). Called by 5+ buckets.
mldocs-extractor is the document-ingestion hub.
drive (Google Drive integration), content-suggestion-generator, tag-recommender.
External dependencies: HuggingFace (model hub), ask.lucy.ai. Per-tenant: mldocs-prmg + www.eprmg.net for the PRMG customer.
The lang.ai acquisition cluster — 25 × lang-ai-shared-* plus 4 × lang-* (including lang-mcp-server which presumably integrates with the Agents bucket via MCP). The cluster is invisible to Capacity Datadog APM (likely on a separate observability stack post-acquisition). Treat as a self-contained product surface that operates alongside the main platform but isn't visible in shared dashboards.
The platform's source of truth for identity, sessions, and customer state. core-api is the Flask/Graphene REST + GraphQL deployment; core-api-grpc is the same codebase exposing gRPC. internal-apps is the back-office admin surface. Every business area calls into core API for auth, session lookup, and shared data.
External: Auth0 (identity), Slack (notifications), Textel (SMS, via the direct-bypass shortcut), ask.lucy.ai.
Workflow and scheduling engine. workflows defines orchestration plans (high-level rules: "when X happens, do Y, then Z"). The plans execute via octopus (in the Other bucket). automations is the user-facing automation-builder UI. calendar + calendar-polling integrate with external calendars. email + email-service provide generic email infrastructure for automated sends.
External: wa.capacity.com (WhatsApp / WebApp tenant-brand endpoint).
Cross-cutting platform fabric. Not a product area — the substrate everything else uses. Organized into sub-clusters: orchestration / job runners (with octopus as the runtime executor for cross-bucket workflows), files / storage (file-service is the storage substrate), auth / identity, routing / integration, analytics / BI, audit / compliance, sites, dev / CI / test, engine helpers, plus a miscellaneous bin.
Per-bucket service catalog. For each bucket: how it fits in the system-at-a-glance, an internal-and-external connections diagram, and a table listing every service with its purpose.
Buckets are presented in criticality / fan-in order — most cross-cutting first. The (critical) flag identifies services with platform-wide fan-in.
Notation:
(critical) = service with platform-wide fan-in
Where this fits in the system at a glance: core API is the central node of the Platform fabric layer. Every other bucket (across all three layers: Channels, Intelligence, Platform fabric) calls into core API for auth, session lookup, and shared state writes. It has mutual edges with Voice, chats, Knowledge base, and Helpdesk, meaning those buckets both consume core API and are called back by it.
The bucket has two active deployments built from the same codebase, plus one admin surface and two near-silent services.
Two architectural roles:
Wire-protocol siblings — core-api and core-api-grpc share auth, sessions, and data; they differ only in how callers reach them. REST + GraphQL clients hit core-api; gRPC clients hit core-api-grpc via the capacity.coreapi.{auth,session,matchlog,gateway,graphql,user,resource_access,answer_engine}.v1.*service protobuf identifiers, which all resolve to the same deployment. This is why the External Connections diagram shows so much inbound fan-in to a single node — callers from every layer reach the same underlying state through whichever wire protocol fits them.
Admin surface — internal-apps is a back-office tool that calls core-api for tenant configuration and customer-account management. It's the only APM-visible direct caller of core-api from within its own bucket.
The remaining two services (graphql, external-api) are silent in APM in both directions. Possibilities: framework instrumentation leaks, dormant deployments, or services whose calls only happen via channels APM doesn't trace. Investigation pending.
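The wire-protocol sibling relationship can be sketched as a small resolver that collapses the protobuf-style identifiers onto the single deployment behind them. This is a minimal illustration, assuming identifiers follow the capacity.coreapi.<package>.v1.*service pattern quoted above; the concrete name `AuthService` and the function `resolve_deployment` are hypothetical and not taken from the platform.

```python
from fnmatch import fnmatchcase

# Sub-packages listed in the text above.
COREAPI_PACKAGES = (
    "auth", "session", "matchlog", "gateway",
    "graphql", "user", "resource_access", "answer_engine",
)


def resolve_deployment(identifier: str) -> str:
    """Return the deployment behind an APM service identifier.

    Identifiers matching capacity.coreapi.<package>.v1.*service collapse
    onto core-api-grpc; anything else is treated as its own deployment.
    """
    lowered = identifier.lower()
    for package in COREAPI_PACKAGES:
        if fnmatchcase(lowered, f"capacity.coreapi.{package}.v1.*service"):
            return "core-api-grpc"
    return identifier


# "AuthService" is a hypothetical concrete name used only for illustration.
print(resolve_deployment("capacity.coreapi.auth.v1.AuthService"))
print(resolve_deployment("internal-apps"))
```

Grouping identifiers this way is one possible way to read a dependency graph where many inbound edges actually land on the same node.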
| Service | Purpose within bucket |
|---|---|
| core-api (critical) | Flask/Graphene REST + GraphQL deployment. Handles auth, sessions, customer state, and shared data writes. The platform's source of truth for who-can-do-what. |
| core-api-grpc | Same codebase as core-api but exposed over gRPC for type-safe inter-service calls. Various capacity.coreapi.*.v1.*service protobuf identifiers resolve to this deployment. |
| internal-apps | Back-office admin surface. Tools used by operations/support staff to manage tenant configurations and customer accounts. |
| graphql | Likely a separate GraphQL deployment or framework-instrumentation entry. Purpose to confirm. |
| external-api | Public-facing API gateway, likely the entry point for partner / customer integrations. |
Where this fits in the system at a glance: Helpdesk is a customer-facing channel but uniquely depends on the Intelligence layer for its core function — ticketing calls llm-hub and vector-db for AI-augmented triage. So Helpdesk is a Channel that hard-depends on Intelligence, not just Platform fabric. Inbound traffic comes from other Channels (chats funnels messages into tickets) and from Automations (workflow-scheduled ticket operations).
The bucket is a classic hub-and-spoke around ticketing. The hub is the system of record; everything else either feeds tickets in, fans events out, or surfaces tickets to users.
Three internal flows shape the bucket:
Multi-channel inflow: interface-email ingests inbound customer emails via Mailgun and creates / updates tickets. omnichannel-inbox does the same for non-email channels (SMS, chat) — served via the omnichannel-grpc protobuf interface for type-safe gRPC clients. ticket-builder constructs tickets programmatically from event sources (e.g., when an Automations workflow needs to file a ticket). ticketing-bulk-daemon runs bulk operations — mass updates, imports, bulk closes.
Event outflow: when ticket state changes, ticketing pushes live updates to real-time-hub, which fans them out to connected UIs over WebSocket/SSE. The same state changes also trigger outbound emails through interface-email (making the ticketing ↔ interface-email edge bidirectional).
Read surfaces: helpdesk is the customer-portal-facing product surface for ticket management; mobile-helpdesk is the agent / operator mobile-app surface for triaging tickets on the go. Both are silent in APM (likely RUM-tracked).
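The event-outflow pattern described above, where one ticket state change reaches several consumers, can be sketched with a tiny in-memory fan-out hub. This is an illustration only: `FanOutHub`, the topic name `ticket.updated`, and the event fields are hypothetical, and the real real-time-hub delivers over WebSocket/SSE rather than in-process callbacks.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Event = dict
Handler = Callable[[Event], None]


class FanOutHub:
    """In-memory stand-in for a push layer that fans events out to subscribers."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._handlers[topic].append(handler)

    def publish(self, topic: str, event: Event) -> int:
        """Deliver the event to every subscriber and return the delivery count."""
        handlers = self._handlers.get(topic, [])
        for handler in handlers:
            handler(event)
        return len(handlers)


# Two hypothetical consumers: a connected UI and the outbound email path.
ui_inbox: List[Event] = []
email_outbox: List[Event] = []

hub = FanOutHub()
hub.subscribe("ticket.updated", ui_inbox.append)
hub.subscribe("ticket.updated", email_outbox.append)
delivered = hub.publish("ticket.updated", {"ticket_id": 42, "status": "closed"})
```

After the publish call, both lists hold the same event, which mirrors how a single state change drives both the live UI update and the outbound email.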
The architecturally distinctive feature of Helpdesk is the LLM-augmented loop: ticketing calls llm-hub for AI triage on new tickets and vector-db for similar-ticket retrieval. This is shown as a cross-bucket edge in the External connections diagram above. Practically, it means Helpdesk depends on Intelligence for its core value, not just for nice-to-have features.
| Service | Purpose within bucket |
|---|---|
| ticketing (critical) | Central ticket state store. System of record for all customer-support tickets. LLM-augmented via outbound calls to llm-hub and vector-db. |
| interface-email | Email channel adapter. Inbound emails become tickets via Mailgun; ticket events generate outbound emails. Bidirectional with ticketing. |
| omnichannel-inbox | Unified inbox that aggregates inbound messages from email, SMS, and chat into tickets. |
| omnichannel-grpc | gRPC face of the omnichannel routing layer (serves the capacity.omnichannel.*.v1.*service protobufs). |
| real-time-hub | WebSocket/SSE push layer that delivers live ticket-update events to UIs. |
| helpdesk | The product surface for the Helpdesk feature itself (likely admin/config or customer-portal-facing). |
| mobile-helpdesk | Mobile-app surface for the Helpdesk product. |
| ticketing-bulk-daemon | Bulk-ticket-operations daemon (mass updates, imports, bulk closes). |
| ticket-builder | Ticket-creation helper service that constructs tickets from various event sources for ingestion. |
Where this fits in the system at a glance: chats is the most cross-cutting Channel. conversations reaches into every other layer (Platform fabric for auth/state, Intelligence for LLM/agents/retrieval, Channels for handoffs to Helpdesk and frontend). webhub, the entry point of the Capacity Concierge product, is the only service in the platform with a team tag (team:concierge) and lives here.
chats has three sub-architectures: conversation orchestration, the Concierge product surface, and the live-chat backend.
Three sub-architectures:
Conversation orchestration — conversations is the platform's super-connector. It receives messages, decides what to do (auto-respond via LLM, escalate to a human agent, file a ticket), and routes accordingly. Conversation state lives in interaction-service (queue-driven) and livedb (the live-conversation state store). livedb-client is a library leak — the Python SDK registered as an APM service entry rather than a real deployable.
Concierge product: webhub is the Capacity Concierge product's web entry point and is the only service in the platform with a team:concierge tag. It calls livechat directly, plus core-api, ticketing, real-time-hub, interface-gateway, and externally wa.aisoftware.com. The cluster also contains three library-instrumentation leaks (webhub-amqp, webhub-redis, concierge-client), which are AMQP / Redis / SDK clients registered as APM services rather than separate deployables. concierge-hosting is the tenant-specific deployment orchestrator for Concierge instances.
Live chat — livechat is the real-time customer-agent chat service. It's called by webhub (Concierge sessions), conversations (orchestrated escalation), and skill-controller (Agents handoff during AI-driven flows).
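The three outcomes that conversations chooses between (auto-respond, escalate to a human, file a ticket) can be sketched as a small decision function. The inputs and the rule below are assumptions made purely for illustration: the confidence score, the 0.8 threshold, and the availability flag are hypothetical, and the source does not describe how the real service makes this decision.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    AUTO_RESPOND = "auto_respond"
    ESCALATE_TO_HUMAN = "escalate_to_human"
    FILE_TICKET = "file_ticket"


@dataclass
class Assessment:
    confidence: float       # hypothetical answer-confidence score in [0, 1]
    human_available: bool   # hypothetical flag: is a live operator free?


def decide(assessment: Assessment, threshold: float = 0.8) -> Action:
    """Pick one of the three documented outcomes using an assumed rule."""
    if assessment.confidence >= threshold:
        return Action.AUTO_RESPOND
    if assessment.human_available:
        return Action.ESCALATE_TO_HUMAN
    # Nobody can answer right now, so hand off to the ticket workflow.
    return Action.FILE_TICKET


print(decide(Assessment(confidence=0.93, human_available=False)).value)
print(decide(Assessment(confidence=0.40, human_available=True)).value)
print(decide(Assessment(confidence=0.40, human_available=False)).value)
```

Keeping the decision in a pure function like this is one way to make the routing policy testable independently of the downstream services it dispatches to.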
| Service | Purpose within bucket |
|---|---|
| conversations (critical) | Conversation orchestration. The "brain" that decides what to do with each inbound interaction: calls agents, LLM, retrieval, escalates to humans, creates tickets. |
| webhub (team:concierge) | Capacity Concierge product's web entry point. The HTTP/WS surface for the Concierge product. |
| livechat | Real-time customer-agent chat service. Handles live chat sessions. |
| chat | Generic chat service (purpose to confirm, possibly older live-chat implementation). |
| interaction-service | Conversation/interaction state service. Queue-driven (no observable HTTP edges). |
| livedb | Live conversation state store. Holds in-flight conversation state. |
| livedb-client | Library leak: the Python livedb client SDK. |
| webhub-amqp | Library leak: the AMQP client lib used by webhub. |
| webhub-redis | Library leak: the Redis client lib used by webhub. |
| concierge-client | Library leak: the Concierge SDK client lib. |
| concierge-hosting | Concierge hosting plane service. Tenant-specific deployment orchestrator. |
Where this fits in the system at a glance: frontend hosts the UI surfaces that customers actually interact with, plus the backend gateway (interface-gateway) that routes inbound HTTP/WS traffic into Channel-specific interfaces. All inbound external HTTP/WS traffic enters the system through this bucket; SIP traffic is the exception and terminates at Voice's asterisk. The 8 UI services are RUM-tracked (not visible in APM); interface-gateway is the bucket's only APM-instrumented service.
frontend has a sharp internal split between the backend channel gateway (APM-active) and the customer-facing UI surfaces (RUM-tracked, APM-silent).
Two architectural zones:
Backend channel gateway — interface-gateway is the platform's front-door router. All inbound HTTP/WS traffic arrives here and gets fanned out to channel-specific interfaces (Voice's voice-ingress, SMS's interface-textel, chats' conversations). It's bidirectional with several of those services because of request/response callback patterns. This is the bucket's only critical-path service and the only one APM-visible.
Customer-facing UIs — 8 services hosting the visual surfaces. Web UIs (web, web-interface, webui, surveys-ui) are browser SPAs that emit to Datadog RUM. mobile is the mobile app backend; mobile clients emit to mobile RUM, a separate Datadog product surface. pdf-viewer and milestone-viewer are embedded document-viewer components. None of these emit HTTP APM spans, so they don't appear in dep graphs — but they do serve customer traffic and have their own RUM dashboards.
The odd one out is interface-web, the sibling of interface-textel (SMS) and interface-email (Helpdesk) — both of which are active. interface-web is silent in every attribution form, suggesting it's either deprecated, absorbed into interface-gateway, or RUM-only.
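The front-door fan-out performed by interface-gateway can be pictured as a routing table from channel to downstream service. The sketch below uses the three destinations named above; the channel keys (`voice`, `sms`, `chat`) and the `route` function are hypothetical labels chosen for illustration, not the gateway's real configuration.

```python
# Destinations come from the text above; the channel keys are assumed labels.
CHANNEL_ROUTES = {
    "voice": "voice-ingress",
    "sms": "interface-textel",
    "chat": "conversations",
}


def route(channel: str) -> str:
    """Return the channel-specific service that should receive the request."""
    try:
        return CHANNEL_ROUTES[channel]
    except KeyError:
        raise ValueError(f"no route for channel {channel!r}") from None


for channel in ("voice", "sms", "chat"):
    print(channel, "->", route(channel))
```

An explicit table like this also makes the interface-web question concrete: a channel with no entry simply has nowhere to go.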
| Service | Purpose within bucket |
|---|---|
| interface-gateway (critical) | Front-door channel router. All inbound HTTP/WS customer traffic enters here and gets fanned out to channel-specific interfaces. Bidirectional with voice-ingress and interface-textel. |
| web | Customer-facing web SPA. RUM-tracked, not APM-instrumented. |
| web-interface | Web UI shell (likely the older / alternative web product). RUM-tracked. |
| webui | Web UI deployment (possibly a Vue/React frontend split alongside web). RUM-tracked. |
| mobile | Mobile app backend / catalog entry. Mobile clients emit to mobile RUM, not HTTP APM. |
| surveys-ui | UI for the surveys feature (paired with surveys backend in Other). |
| interface-web | Web channel interface. Sibling of interface-textel (SMS) and interface-email (Helpdesk); possibly deprecated or absorbed into interface-gateway. |
| pdf-viewer | Browser-rendered PDF document viewer component. |
| milestone-viewer | Browser-rendered milestone/timeline viewer component. |
Where this fits in the system at a glance: Automations is in Platform fabric — it's the scheduling and orchestration plane that triggers work across other buckets. workflows defines plans (high-level rules); plan execution actually happens in Other's octopus. So Automations + Other together form the platform's two-layer orchestration stack.
Automations has three sub-architectures: the orchestration plane (workflows + automations + evaluations), generic email infrastructure, and calendar integration.
Three sub-architectures:
Orchestration plane — workflows defines high-level orchestration plans ("when X happens, do Y, then Z") and is the bucket's only critical-path-tier service. It's called by chats (conversations), Other (octopus), and RabbitMQ events; it fans out to 6 buckets including the sole caller path into KB's mldocs-hub. Plan execution happens in Other's octopus — workflows + octopus together form the platform's two-layer orchestration stack. automations is the customer-facing UI where users define their own rules; these rules then materialize as workflow plans. automation-evaluations is an offline evaluation harness for measuring automation correctness.
Generic email infrastructure — email is a generic email-send daemon (distinct from interface-email in Helpdesk, which is the helpdesk-channel-specific email path). email-service is an HTTP wrapper around it. Both are silent in APM — likely fire-and-forget queue-driven send paths consumed by workflows.
Calendar integration — calendar integrates with Google Calendar / Outlook; calendar-polling watches calendars for changes and emits events that workflows can subscribe to. Both silent in APM (polling daemon + sync helpers, matching the platform-wide bulk-daemon silence pattern).
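The split between defining a plan ("when X happens, do Y, then Z") and executing it can be sketched with a plan record and a separate executor. This is a minimal illustration under stated assumptions: `Plan`, `Executor`, the trigger name `ticket.created`, and the step functions are all hypothetical and do not reflect the real workflows or octopus interfaces.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Step = Callable[[dict], None]


@dataclass
class Plan:
    """Definition side: a trigger plus an ordered list of steps."""
    trigger: str
    steps: List[Step] = field(default_factory=list)


class Executor:
    """Execution side: runs the steps of every plan registered for an event."""

    def __init__(self) -> None:
        self._plans: Dict[str, List[Plan]] = {}

    def register(self, plan: Plan) -> None:
        self._plans.setdefault(plan.trigger, []).append(plan)

    def handle(self, event_name: str, payload: dict) -> int:
        """Run matching plans in order and return the number of steps executed."""
        executed = 0
        for plan in self._plans.get(event_name, []):
            for step in plan.steps:
                step(payload)
                executed += 1
        return executed


log: List[str] = []


def tag_ticket(payload: dict) -> None:
    log.append(f"tag:{payload['ticket_id']}")


def notify_owner(payload: dict) -> None:
    log.append(f"notify:{payload['ticket_id']}")


executor = Executor()
executor.register(Plan(trigger="ticket.created", steps=[tag_ticket, notify_owner]))
steps_run = executor.handle("ticket.created", {"ticket_id": 7})
```

Separating the plan data from the runner is the property the two-layer stack relies on: plans can be authored in one place and executed somewhere else.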
| Service | Purpose within bucket |
|---|---|
| workflows (critical) | High-level orchestration plans. Defines "when X happens, do Y, then Z" rules across buckets. Sole inbound caller of mldocs-hub (KB). |
| automations | User-facing automation-builder UI. Customers define their own automation rules through this surface. |
| automation-evaluations | Evaluation harness for automation correctness and performance. |
| calendar | Calendar integration (Google Calendar / Outlook). |
| calendar-polling | Polling daemon that watches calendars for changes and triggers automation events. |
| email | Generic email-send daemon for automation flows. |
| email-service | HTTP wrapper for email. |
Where this fits in the system at a glance: Agents is in Intelligence. It depends on LLM (every agent call routes through llm-hub), Knowledge base (RAG retrieval via vector-db), and NLP (skill execution involves NLP routing). Inbound from chats (conversations → agents) and Voice (voice-proxy → agents). The bucket also has direct vendor dependencies on external AI tooling (LangSmith, Tavily).
Agents has four sub-architectures: agent coordination, agent backends (showing a legacy-evolution smell), skill execution, and tool routing.
Four sub-architectures:
Agent coordination — agents is the central registry / coordinator that dispatches work to specific agent backends. Called by voice-proxy (Voice handoff during live calls) and conversations (chats orchestration). Calls mcp-service for tool routing and vector-db for retrieval.
Agent backends: three services with overlapping names suggest a legacy-evolution pattern. agent-service is the modern AI-focused backend (calls LangSmith for LLM tracing); agent is the silent legacy implementation (likely deprecated); ai-agent is a third silent flavor of unclear purpose. The agent / agent-service / ai-agent triplet is the bucket's biggest architectural smell: three backends for one concept, with two of them silent.
Skill execution — skill-controller is the skill orchestrator that picks and runs skills based on conversation context. core-skills is the central skill registry, called by skill-controller via k8s DNS form. skill-controller is the most-connected service in the bucket, with edges into chats (conversations, livechat), NLP (prototype-brain), and frontend.
Tool routing — mcp-service implements Anthropic's Model Context Protocol for AI-agent tool use. It routes agent tool-use requests to internal services and external tools (notably Tavily AI search). Called by agents and by llm-hub (cross-bucket).
Plus three specialized services (agent-assist-module, agentic-webhooks, rpa-service) that are silent in APM — likely queue-driven or operating outside the traced HTTP path. rpa-service (Robotic Process Automation) probably runs browser-automation tasks on dedicated workers.
| Service | Purpose within bucket |
|---|---|
| agents | Central agent registry / coordination hub. Routes work to specific agent implementations. |
| agent-service | Modern AI-agent backend (the primary agent implementation). Integrates with LangSmith for tracing. |
| agent | Legacy agent service. Almost certainly deprecated. |
| agent-assist-module | Real-time agent-assist. Provides AI-generated suggestions to human operators. |
| ai-agent | AI-specific agent flavor (purpose to confirm). |
| agentic-webhooks | Webhook handler for agent-driven workflows. |
| mcp-service | Anthropic Model Context Protocol server. Routes agent tool-use requests. |
| core-skills | Central skill registry. Defines what skills agents can execute. |
| skill-controller | Skill-execution orchestrator. Picks and runs skills based on conversation context. |
| rpa-service | Robotic Process Automation service. Browser-automation tasks for AI agents. |
Where this fits in the system at a glance: Knowledge base is in Intelligence and serves as the retrieval substrate for the entire platform — vector-db is called by 5+ buckets. The bucket has internal coupling to LLM (via the vector-db → tei-embedding edge for embedding compute), to NLP (mldocs uses guse), and to core API for auth on retrieval calls. The mldocs ETL sub-cluster ingests documents from external sources (HuggingFace, customer URLs).
KB has three sub-architectures with very different shapes: parallel retrieval engines, document-source integrations, and a dedicated 13-service ETL pipeline.
Three sub-architectures:
Retrieval / search: two parallel retrieval philosophies coexist. vector-db provides semantic (embedding-based) retrieval and is the Knowledge service with the broadest fan-in (called by 5+ buckets); it computes embeddings via tei-embedding (LLM bucket). docsearch provides lexical (Elasticsearch-backed) retrieval, using a separate Elasticsearch from NLP's nlp-data-es-master. Crucially, vector-db and docsearch don't connect to each other; readers pick the model that fits their use case. content-suggestion-generator and tag-recommender are sibling recommendation / classification services.
Document sources — drive syncs customer documents from Google Drive accounts into the knowledge base. Silent in APM (likely OAuth UI + queue-driven sync daemon).
mldocs ETL pipeline — a self-contained 13-service document-ingestion sub-cluster with its own dedicated RabbitMQ instance (mldocs-rabbitmq, distinct from the platform's prod / staging RabbitMQ). The flow: Automations' workflows schedules ingestion jobs by calling mldocs-hub (the cluster's public entry) → mldocs-hub enqueues to mldocs-rabbitmq → mldocs-extractor consumes jobs, fetches files (from file-service or tenant URLs like www.eprmg.net), computes embeddings via NLP's guse + HuggingFace, and writes the processed documents to mldocs-data (MySQL-backed). The 9 specialized tools (mldocs-classifier, mldocs-builder, mldocs-annotator, etc.) are silent in APM — likely offline batch operations or training tools. mldocs-prmg is a per-tenant service for the PRMG customer, paired with the external www.eprmg.net endpoint (the platform's first explicit per-customer service).
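The hub → queue → extractor → store flow described above can be sketched with in-memory stand-ins. This is an illustration under loud assumptions: `queue.Queue` stands in for mldocs-rabbitmq, a dict stands in for the MySQL-backed mldocs-data, and `placeholder_embed` is a toy function that has nothing to do with guse or HuggingFace models. All function names are hypothetical.

```python
import queue
from typing import Dict, List


def hub_submit(jobs: "queue.Queue[dict]", doc_id: str, text: str) -> None:
    """Entry point role: accept an ingestion request and enqueue a job."""
    jobs.put({"doc_id": doc_id, "text": text})


def placeholder_embed(text: str) -> List[float]:
    """Toy embedding (word lengths); a real pipeline would call a model."""
    return [float(len(word)) for word in text.split()]


def extractor_drain(jobs: "queue.Queue[dict]", store: Dict[str, dict]) -> int:
    """Extractor role: consume queued jobs, embed, and write to the store."""
    processed = 0
    while True:
        try:
            job = jobs.get_nowait()
        except queue.Empty:
            break
        store[job["doc_id"]] = {
            "text": job["text"],
            "embedding": placeholder_embed(job["text"]),
        }
        jobs.task_done()
        processed += 1
    return processed


job_queue: "queue.Queue[dict]" = queue.Queue()
document_store: Dict[str, dict] = {}

hub_submit(job_queue, "doc-1", "hello world")
hub_submit(job_queue, "doc-2", "capacity planning guide")
processed_count = extractor_drain(job_queue, document_store)
```

The queue in the middle is what lets the entry point return quickly while extraction runs at its own pace, which fits the description of a dedicated broker for this sub-cluster.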
| Service | Purpose within bucket |
|---|---|
| vector-db | Semantic retrieval engine. The RAG substrate, called by 5+ buckets. Calls tei-embedding (LLM) for embeddings. |
| docsearch | Lexical document search. Elasticsearch-backed. |
| content-suggestion-generator | Generates content / article suggestions from the knowledge base. |
| tag-recommender | Auto-tagging engine for documents. |

| Service | Purpose within bucket |
|---|---|
| drive | Google Drive integration. Syncs customer documents from their Drive accounts. |

| Service | Purpose within bucket |
|---|---|
| mldocs-extractor | ETL hub. Reads jobs from mldocs-rabbitmq, fetches files, computes embeddings via guse + HuggingFace, writes to mldocs-data. |
| mldocs-data | Processed-document store. MySQL-backed. |
| mldocs-hub | Cluster's public entry point. Receives ingestion requests from workflows. |
| mldocs-builder | Document-build pipeline tool. |
| mldocs-classifier | Document classifier. |
| mldocs-classifier-trainer | Trains the classifier on labeled data. |
| mldocs-annotator | Document annotation service. |
| mldocs-labeler-app | Human-in-the-loop labeling UI. |
| mldocs-logger | mldocs-pipeline event logger. |
| mldocs-query | Document query tool. |
| mldocs-dom | DOM parser. Extracts text from HTML documents. |
| mldocs-splitter | Document chunking. |
| mldocs-prmg | Per-tenant service for the PRMG customer. |
Where this fits in the system at a glance: LLM is in Intelligence and serves as the inference gateway for every AI-powered feature in the platform. llm-hub is called by Agents, chats, Helpdesk, and Knowledge base — meaning LLM outages cascade across half the platform's product surface. The bucket has heavy external dependencies on AWS Bedrock and Azure Cognitive Services (running in parallel — multi-cloud routing).
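The cascade claim can be checked mechanically by walking the bucket-level call graph backwards from LLM. The sketch below uses only edges stated elsewhere in this document (for example Helpdesk → LLM, chats → Agents, Voice → Agents); the edge list is a partial, hand-copied subset and the helper `transitive_callers` is a hypothetical name.

```python
from collections import defaultdict, deque
from typing import Iterable, Set, Tuple

# (caller, callee) bucket edges taken from the descriptions in this document.
CALLS = [
    ("Agents", "LLM"),
    ("chats", "LLM"),
    ("Helpdesk", "LLM"),
    ("Knowledge base", "LLM"),
    ("chats", "Agents"),
    ("Voice", "Agents"),
    ("Agents", "Knowledge base"),
    ("Helpdesk", "Knowledge base"),
]


def transitive_callers(target: str, calls: Iterable[Tuple[str, str]]) -> Set[str]:
    """Return every bucket that reaches `target` directly or indirectly."""
    callers_of = defaultdict(set)
    for caller, callee in calls:
        callers_of[callee].add(caller)

    seen: Set[str] = set()
    pending = deque([target])
    while pending:
        node = pending.popleft()
        for caller in callers_of[node]:
            if caller != target and caller not in seen:
                seen.add(caller)
                pending.append(caller)
    return seen


affected = transitive_callers("LLM", CALLS)
print(sorted(affected))
```

With this edge subset, Voice shows up as affected even though it never calls llm-hub directly, because it hands off to Agents.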
LLM has three sub-architectures, all internally disconnected — no observable APM edges between LLM bucket members.
Three sub-architectures:
LLM inference gateway: llm-hub is the central inference router and the bucket's only critical-path service. It implements multi-cloud, multi-provider routing: inference traffic goes to both AWS Bedrock (two regions: us-east-1 + us-west-2) and Azure Cognitive Services (two resources: prod + dev) in parallel. This is either vendor-hedge failover or per-model routing, which is sophisticated for a contact-center platform. llm-hub-grpc is a thin gRPC face for clients that need that wire protocol (e.g., other internal gRPC services); it calls core-api-grpc.
Embedding inference: tei-embedding is a HuggingFace Text Embeddings Inference server. It is not called from llm-hub directly: it sits behind vector-db (KB bucket) and is reached only via the tei.v1.embed protobuf form from there. This means embedding lives architecturally in the retrieval path, not the inference path: embeddings are computed at index time (vector-db → tei) and read at query time (other buckets → vector-db → cached vectors).
Training tracking — mlflow-tracking-server tracks ML experiment results. Used offline for model-training tracking, not on the runtime LLM path. Silent in APM, consistent with offline use.
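Of the two interpretations offered for the multi-provider design, the vendor-hedge failover reading can be sketched as an ordered list of providers tried in turn. Whether llm-hub actually behaves this way is not confirmed by the source; the provider callables below are local stubs, and `call_with_failover` is a hypothetical helper that makes no real AWS or Azure calls.

```python
from typing import Callable, List, Tuple

Provider = Callable[[str], str]


def call_with_failover(
    prompt: str, providers: List[Tuple[str, Provider]]
) -> Tuple[str, str]:
    """Try providers in order; return (provider_name, response) from the first success."""
    errors: List[str] = []
    for name, provider in providers:
        try:
            return name, provider(prompt)
        except Exception as exc:  # a real gateway would catch narrower errors
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


def unavailable_provider(prompt: str) -> str:
    """Stub that simulates an outage at the first provider."""
    raise TimeoutError("simulated outage")


def echo_provider(prompt: str) -> str:
    """Stub that simulates a healthy provider."""
    return f"echo: {prompt}"


used, answer = call_with_failover(
    "hi",
    [("bedrock-us-east-1", unavailable_provider), ("azure-prod", echo_provider)],
)
```

Per-model routing, the other interpretation, would instead pick the provider from a lookup keyed by model name before making a single call.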
External dependencies for llm-hub are the bucket's most consequential surface: AWS Bedrock + Azure Cognitive (both LLM inference), Sentry (error tracking — first non-Datadog observability vendor in the platform), LangSmith (LLM tracing).
| Service | Purpose within bucket |
|---|---|
| llm-hub | Multi-cloud LLM inference gateway. Routes to AWS Bedrock + Azure Cognitive Services in parallel. |
| llm-hub-grpc | gRPC interface for llm-hub. |
| tei-embedding | Embedding inference server (HuggingFace Text Embeddings Inference). Serves embeddings to vector-db. |
| mlflow-tracking-server | ML experiment tracking server. Used offline for model training tracking. |
Where this fits in the system at a glance: Voice is the telephony Channel. It depends on Platform fabric (core API for auth/session) and the Intelligence layer (handoff to Agents for AI-driven call handling). External SIP carriers feed asterisk outside the APM-traced graph; LumenVox is the third-party voice / ASR vendor.
Voice has a linear edge → real-time → post-call architecture — the bucket reads left-to-right like a pipeline.
Three pipeline stages:
Edge (SIP): asterisk is the open-source SIP PBX that terminates inbound calls from telephony carriers and originates outbound calls. It's C-based with no native Datadog APM tracer, so it's invisible in dep graphs; observability would have to come from logs / metrics / StatsD. voice-ingress is the cluster's HTTP/gRPC hub: it receives voice-call control events, calls core-api for auth/session/matchlog, and dispatches to downstream voice services.
Real-time processing — voice-proxy wraps the Azure Speech SDK (CGO) for real-time speech-to-text on live call audio. It calls voice-ingress (control plane), agents (AI agent handoff during calls), and sensitive-data-service in Other (PII redaction for call audio). The expected outbound edge to Azure Speech doesn't appear in APM because the CGO library doesn't emit Datadog-instrumentable HTTP spans.
Post-call analysis — audio-transcription-service produces transcripts (likely a batch / async ASR path distinct from voice-proxy's real-time STT). transcript-analyzer does post-call analytics (sentiment, summarization, key-moment extraction). Both are silent in APM, consistent with batch / queue-driven operation.
External dependencies — three of the four are invisible to Capacity US1 APM:
voice-proxy. It's invisible to Capacity US1 APM because Verbio's observability stack lives in the EU Datadog org (datadoghq.eu): voice-proxy's gRPC call to Speech Center crosses an org boundary that the US1 dep graph can't span. This is structurally the same gap as the Lang bucket APM blackout, just at the edge of a single Voice service rather than across a whole bucket. Both are acquired-engineering-org observability that didn't merge into Capacity US1.
capacity/voice-proxy/CLAUDE.md workspace doc as a CGO-wrapped SDK; not visible in current APM. May be legacy, only used for specific customer flows, or fully replaced by Speech Center; confirmation needed. CGO bindings would suppress HTTP-style traces regardless.

| Service | Purpose within bucket |
|---|---|
| voice-ingress | Voice cluster hub. Entry point for voice traffic. Calls core-api and dispatches downstream. |
| voice-proxy | Real-time voice processing service. Wraps the Azure Speech SDK. Calls agents and sensitive-data-service. |
| asterisk | Open-source PBX. Handles SIP trunk for inbound/outbound calls. C-based, with no native Datadog APM tracer. |
| audio-transcription-service | Audio → text ASR service. |
| transcript-analyzer | Post-call analytics over transcripts. |
Where this fits in the system at a glance: NLP is in Intelligence and represents the pre-LLM NLP generation that coexists with the modern LLM bucket. It's tightly coupled to Agents (nlpengine ↔ skill-controller) and is reached from frontend (interface-gateway → nlpengine). The bucket has its own Elasticsearch backing store, separate from KB's.
NLP has the densest internal coupling of any bucket — nlpengine is a hub that directly calls four internal services, plus a fifth internal edge from nlp-data.
Four sub-architectures:
Data layer: nlp-data holds NLP-specific data backed by nlp-data-es-master (a dedicated Elasticsearch cluster, separate from KB's docsearch Elasticsearch). nlp is a related service that is active in APM under its k8s DNS name. The data layer uses Elasticsearch for both storage and search.
Embedding services (pre-LLM era): two embedding models served as services, glove (GloVe word embeddings) and guse (Google Universal Sentence Encoder). Both are backed by the same nlp-data-es-master Elasticsearch. guse is also called cross-bucket by mldocs-extractor (KB's document ingestion pipeline), which makes it a shared embedding service between NLP and KB.
Decision engines: prototype-brain is the active NLP decision engine (called by Agents' skill-controller via k8s DNS). brain is its legacy ancestor, with the same name pattern and the same probable role, but it is silent in APM (likely deprecated).
Specialized tools: babelfish (translation), label-studio (ML annotation platform, typically used offline), nlpmail-receiver-daemon (inbound-email NLP parser), nlsql (natural-language-to-SQL). All are silent in APM, consistent with their batch / offline nature.
The bucket is a two-generation NLP architecture alongside the LLM bucket: NLP has separate modular embedding services from the pre-LLM era; LLM has collapsed inference into one provider gateway. Both run in production — likely because the older NLP features still serve specific use cases that don't justify migration.
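The coupling claim above (four edges out of nlpengine plus a fifth from nlp-data) can be checked with a small adjacency map. This is a minimal sketch: the edge data is copied from this document, while the helper names `out_degree` and `fan_in` are hypothetical and not part of any Capacity tooling.

```python
# Internal NLP edges as described in this document (caller -> callees).
# Assumption: only bucket-internal edges are modeled; external edges such as
# nlpengine -> core-api are left out on purpose.
NLP_INTERNAL_EDGES = {
    "nlpengine": ["nlp", "nlp-data", "glove", "guse"],
    "nlp-data": ["guse"],
}


def out_degree(edges):
    """Number of internal services each caller reaches directly."""
    return {src: len(dsts) for src, dsts in edges.items()}


def fan_in(edges):
    """Number of internal callers per destination service."""
    counts = {}
    for dsts in edges.values():
        for dst in dsts:
            counts[dst] = counts.get(dst, 0) + 1
    return counts


total_internal_edges = sum(out_degree(NLP_INTERNAL_EDGES).values())
shared_targets = sorted(s for s, n in fan_in(NLP_INTERNAL_EDGES).items() if n > 1)
```

Under these assumptions, guse is the only internal service with more than one internal caller, which fits its role as the shared embedding service.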
| Service | Purpose within bucket |
|---|---|
| nlpengine | NLP processing hub. Orchestrates classification, embedding, and analysis pipelines. |
| nlp | Specific NLP service. |
| nlp-data | NLP data store. Backed by nlp-data-es-master Elasticsearch. |
| glove | GloVe word-embedding model served as a service. |
| guse | Google Universal Sentence Encoder served as a service. |
| brain | Legacy NLP decision engine. Likely deprecated alias of prototype-brain. |
| prototype-brain | Active NLP decision engine. |
| babelfish | Translation service. |
| label-studio | ML annotation platform. |
| nlpmail-receiver-daemon | Inbound-email parser for NLP processing. |
| nlsql | Natural-language-to-SQL service. |
Where this fits in the system at a glance: SMS is the smallest Channel. The clean architecture routes inbound/outbound SMS through interface-textel, which is bidirectional with interface-gateway (frontend). A direct-bypass shortcut exists where core-api calls Textel without going through interface-textel — a legacy code path or deliberate fast lane.
SMS is the smallest active bucket — just two services with no internal edges.
Two services:
interface-textel is the SMS channel adapter handling individual messages in both directions. Inbound customer SMS arrives via the Textel webhook into interface-textel, which routes the message into conversations (chats) or ticketing (Helpdesk) depending on context. Outbound responses go back to Textel for delivery. It has a loop-back edge with interface-gateway (frontend) — the same pattern as Voice's voice-ingress ↔ interface-gateway.
bulk-responder handles bulk SMS campaigns (high-volume outbound). Silent in APM, consistent with the platform's bulk-daemon silence pattern.
The bucket has one architectural shortcut worth flagging: core-api calls Textel directly, bypassing the interface-textel abstraction. This is either a legacy code path (predating the channel abstraction) or a deliberate fast path for specific outbound flows. Either way, it is a parallel path that escapes the clean channel-abstraction architecture, which goes through interface-textel.
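A bypass like this can be detected mechanically from a dependency edge list. The sketch below assumes a hypothetical `ADAPTERS` map (vendor to designated channel adapter) and a hand-written edge list taken from the SMS description above; it is not output from Datadog or any real tool.

```python
# Vendor -> the one service that is supposed to talk to it.
# Assumption: this mapping is illustrative; only the Textel entry is
# grounded in this document.
ADAPTERS = {"Textel": "interface-textel"}

# Directed edges (caller, callee) described in the SMS section.
EDGES = [
    ("interface-textel", "Textel"),
    ("core-api", "Textel"),  # the direct-bypass shortcut
    ("interface-gateway", "interface-textel"),
    ("interface-textel", "interface-gateway"),
]


def find_bypasses(edges, adapters):
    """Return edges that reach a vendor without going through its adapter."""
    return sorted(
        (src, dst)
        for src, dst in edges
        if dst in adapters and src != adapters[dst]
    )


bypasses = find_bypasses(EDGES, ADAPTERS)
```

Run against a full edge export, the same check would flag any other service that calls a vendor directly instead of through its channel adapter.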
| Service | Purpose within bucket |
|---|---|
| interface-textel | SMS channel adapter. Both inbound (customer messages → tickets/conversations) and outbound (responses, campaigns). |
| bulk-responder | Bulk SMS response engine for high-volume outbound campaigns. |
Where this fits in the system at a glance: Lang is nominally in Intelligence (it's an acquired AI-and-NLP product) but is isolated from the rest of the platform's APM-visible architecture. It runs as a self-contained product surface — lang-mcp-server is the likely integration point with Agents' mcp-service, but no APM traces confirm the integration. The bucket should be treated as a black box from the rest of the architecture's perspective.
The Lang bucket is APM-invisible — 58 dependency probes (29 services × 2 directions) returned zero observable edges. Not a single lang-* service appeared as a destination in any other bucket's probes across the entire discovery. The architecture below is inferred from service names only — we have no runtime evidence of any of these edges.
Eight inferred sub-architectures (none verified by APM):
Top-level lang-* (4 services) — the lang.ai product surface: lang-api (main API), lang-frontend (UI), lang-analysis-worker (background analysis), lang-mcp-server (the likely integration point with Capacity's Agents bucket via Anthropic's MCP protocol — if the integration exists in production).
Shared core (5) — main API, auth, processing engine, GraphQL backend, internal admin frontend for the acquired lang.ai codebase.
ML / Intent (4) — classifiers, intent recognition engine + executor, Stanford CoreNLP integration.
Channels / integrations (4) — channel adapters, third-party integrations (including Salesforce), sync service.
Storage / data (2) — persistence layer + database migration runner.
Suggestions / discovery (4) — suggestion engine, business-insights analytics, discovery orchestrator, content librarian.
Config / notifications (2) — configurator gateway, notificator.
Periodic / tickers (4) — automation-recommendations + ticker pair, activity-peaks + ticker pair. The "ticker" naming suggests periodic refresh of analytics.
The cluster should be treated as a black box from the rest of the Capacity architecture's perspective. The most consequential question is whether lang-mcp-server actually integrates with Capacity's mcp-service (Agents) in production — if so, AI agents in Capacity may be invoking lang.ai capabilities without that traffic appearing in any APM dashboard.
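The probe arithmetic can be reproduced from the sub-architecture sizes listed above. This is a minimal sketch: the per-group counts come from this document, while the dictionary keys and the direction labels are assumed names chosen for illustration.

```python
# Service counts per inferred Lang sub-architecture, as listed in this document.
LANG_SUBCLUSTERS = {
    "top-level lang-*": 4,
    "shared core": 5,
    "ml / intent": 4,
    "channels / integrations": 4,
    "storage / data": 2,
    "suggestions / discovery": 4,
    "config / notifications": 2,
    "periodic / tickers": 4,
}

# Assumption: each service is probed once per direction; the label names
# are hypothetical.
DIRECTIONS = ("upstream", "downstream")

total_services = sum(LANG_SUBCLUSTERS.values())
total_probes = total_services * len(DIRECTIONS)
```

The eight groups sum to the 29 services behind the 58 probes, so the inferred grouping accounts for every lang service in the tables that follow.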
| Service | Purpose within bucket |
|---|---|
| lang-api | lang.ai's main API surface. |
| lang-frontend | lang.ai's user-facing UI. |
| lang-analysis-worker | Background analysis worker. |
| lang-mcp-server | MCP server. Likely integration point with Capacity's Agents bucket. |

Shared infrastructure of the acquired codebase. Inferred purposes:
| Service | Purpose within bucket |
|---|---|
| lang-ai-shared-api | Main shared API. |
| lang-ai-shared-auth | Authentication for the lang.ai cluster. |
| lang-ai-shared-engine | Core processing engine. |
| lang-ai-shared-graphql-backend | GraphQL API backend. |
| lang-ai-shared-frontend | Internal admin UI. |
| lang-ai-shared-classifiers | ML classifiers. |
| lang-ai-shared-intents-engine | Intent recognition engine. |
| lang-ai-shared-intents-engine-executor | Intent execution / fulfillment. |
| lang-ai-shared-channels | Channel adapters. |
| lang-ai-shared-integrations | Third-party integrations. |
| lang-ai-shared-salesforce-integrator | Salesforce-specific integration. |
| lang-ai-shared-storage | Data persistence layer. |
| lang-ai-shared-db-migrate | Database migration runner. |
| lang-ai-shared-corenlp | Stanford CoreNLP integration. |
| lang-ai-shared-suggestions | Suggestion engine. |
| lang-ai-shared-business-insights | Analytics surface for business metrics. |
| lang-ai-shared-discover-orchestrator | Discovery orchestration. |
| lang-ai-shared-librarian | Content library / knowledge base. |
| lang-ai-shared-configurator-gateway | Configuration gateway. |
| lang-ai-shared-synchronizer | Sync service. |
| lang-ai-shared-notificator | Notification service. |
| lang-ai-shared-automation-recommendations | Automation suggestions. |
| lang-ai-shared-automation-recommendations-ticker | Periodic refresh of automation recommendations. |
| lang-ai-shared-activity-peaks | Activity-peak analytics. |
| lang-ai-shared-activity-peaks-ticker | Periodic refresh of activity-peak analytics. |
Where this fits in the system at a glance: Other is the Platform fabric substrate — not a product area, but the infrastructure everything else uses. Inside Other, octopus is the cross-bucket runtime executor (paired with Automations' workflows); file-service is the storage substrate; notifications is the alert fan-out. Other also contains the audit / auth / analytics / dev-tooling sub-clusters that don't fit elsewhere.
Other contains 10 sub-clusters — too many to fit in a single internal diagram. The shape below shows the active hubs (3 sub-clusters with critical-tier services) and collapses the silent sub-clusters to grouped nodes.
Three sub-clusters carry the bucket's architectural weight:
Orchestration (13a) — octopus is the platform's job-execution runtime and a critical-path service. It runs the plans defined by Automations' workflows and fans out into 13+ services across nearly every other bucket. notifications is the cross-bucket alert fan-out (called by core-api + ticketing; calls interface-email, llm-hub, analytics-api). third-party-interface-hub routes third-party integration callbacks.
Files / storage (13b) — file-service is the platform's storage substrate. Helpdesk's ticket attachments, KB's source documents, and most cross-cutting binary-data needs flow through it. It's broadly depended on (inbound from Helpdesk, KB, chats, Agents) but has only one observable downstream itself (core-api).
Routing / integration (13d) — relay is a generic event relay; capacity-distributed-service is the distributed-systems coordinator that integrates with Atlassian OAuth externally (a 14th vendor that doesn't appear anywhere else).
The other seven sub-clusters are largely silent:
passport is APM-active, and only because octopus calls it. The other 7 (login, saml, token, etc.) operate outside the HTTP request path because auth happens at the gateway and core-api, not in these services.
analytics-api is active (calls Tableau); web-analytics is active via DNS; 4 are silent.
unnamed-python-service (a service with no name set in instrumentation, a bug to fix) which is the only one with a dependency edge (called by mldocs-prmg via DNS, which rescued mldocs-prmg from the silent column).

| Service | Purpose within bucket |
|---|---|
| octopus (critical) | Job-execution runtime. Runs the plans defined by workflows. Calls 13+ services across nearly every bucket. |
| notifications | Cross-bucket alert fan-out hub. |
| third-party-interface-hub | Routing service for third-party integration callbacks. |
| octopus-redis | Library leak: Redis client lib used by octopus. |
| webhooks | Generic webhook handler. |

| Service | Purpose within bucket |
|---|---|
| file-service | Storage substrate. Holds attachments, documents, and assets across the platform. |
| file-service-go | Go rewrite of file-service (migration in progress). |
| print-service | Print job handling for printable outputs. |

| Service | Purpose within bucket |
|---|---|
| passport | Authentication coordinator. Likely a Passport.js-style auth-middleware service. |
| login | Login page / flow backend. |
| saml | SAML SSO integration for enterprise tenants. |
| token | Token issuance and validation. |
| token-expiration-check | Token-lifecycle daemon. |
| user_accounts | User account management. |
| warden | Authorization / policy decision service. |
| sensitive-data-service | PII handling and redaction. Called by voice-proxy. |

| Service | Purpose within bucket |
|---|---|
| relay | Generic event relay between internal services and external integrations. |
| capacity-distributed-service | Distributed-systems coordinator. Integrates with Atlassian OAuth externally. |

| Service | Purpose within bucket |
|---|---|
| analytics-api | Internal analytics query API. Calls Tableau for BI dashboards. |
| analytics-ext-api | External / customer-facing analytics surface. |
| metrics-api | Internal platform-metrics API. |
| data-warehouse-exporter | Exports operational data into the analytics warehouse. |
| datacollector | Generic data-collection / ingestion daemon. |
| web-analytics | Web-analytics event tracker. |

| Service | Purpose within bucket |
|---|---|
| access_evaluation | Access-control evaluation engine. |
| audit-daemon | Audit event consumer. |
| confluence-audit-records | Confluence audit log ingester. |
| event-logs | General-purpose event log aggregator. |
| github.audit.streaming | GitHub audit log streamer. |
| gitlab-audit-approvals | GitLab merge-request approval audit. |
| gitlab-user-role-check | GitLab role-based access checker. |
| jira-audit-records | Jira audit log ingester. |
| traffic-logs | Network/API traffic log aggregator. |

| Service | Purpose within bucket |
|---|---|
| sites | Tenant-customizable sites engine. |
| demo-sites | Demo / sandbox site deployments. |
| site-demo | A specific demo site instance. |
| private-sites | Private / internal site hosting. |

| Service | Purpose within bucket |
|---|---|
| dev-db-export | Development-database export tool. |
| e2e | End-to-end test harness. |
| hotfix-bot | Hotfix automation bot. |
| param-testing | Parameter / config testing utility. |
| stage-db-rebuild | Staging database rebuild tool. |

| Service | Purpose within bucket |
|---|---|
| engine-builder | Build-time engine assembly tool. |
| engine-logger | Engine log processor. |

| Service | Purpose within bucket |
|---|---|
| admin | Generic admin service. |
| autoturk | Mechanical-Turk-style task routing. |
| batch-file-upload | Bulk file-upload handler. |
| controller | Generic controller service. |
| converter | Generic format-conversion service. |
| delorean | Backfill / time-travel / replay tool. |
| headless-browser | Headless browser farm. |
| keep | Persistence / retention service. |
| lambdasjs | JavaScript Lambda function runner. |
| lambdaspy | Python Lambda function runner. |
| meet | Meeting / video-conferencing integration. |
| rules | Generic rules engine. |
| social | Social-media integration. |
| surveys | Survey-engine backend (paired with surveys-ui in frontend). |
| tap | Generic tap / data-intercept service. |
| unnamed-python-service | A service with no name set in instrumentation (bug to fix). |
| vuetiful-joe | Internal tool; purpose unknown. |

The major architectural edges, organized by source bucket.
voice-ingress → core-api: auth/session/matchlog lookups for inbound calls.
voice-proxy → agents: handoff from live voice to agent-assist / AI agents.
voice-proxy → sensitive-data-service: PII handling for call audio.
voice-ingress → LumenVox (external): third-party IVR/ASR.
outbound-scheduler-http-client → voice-ingress: outbound dialer initiating calls.

The most cross-cutting bucket. conversations calls into nearly every other bucket:
conversations → core-api: session/state.
conversations → ticketing (Helpdesk): creates tickets from conversations.
conversations → interface-email (Helpdesk): email reply handling.
conversations → llm-hub (LLM): AI response generation.
conversations → vector-db (Knowledge base): RAG retrieval.
conversations → agents, agent-service (Agents): agent dispatch.
conversations → workflows (Automations): trigger automation flows.
conversations → interface-gateway (frontend): push UI updates.
conversations → Slack (external).
webhub is the Concierge product's entry point; it calls ticketing, core-api, interface-gateway, livechat, real-time-hub, plus external wa.aisoftware.com.

ticketing → core-api: session/state.
ticketing → llm-hub (LLM): AI triage / summarization.
ticketing → vector-db (Knowledge base): similar-ticket retrieval.
ticketing → interface-email (within Helpdesk): email notifications (bidirectional).
ticketing → real-time-hub (within Helpdesk): push ticket-update events.
ticketing → file-service (Other): ticket attachments.
ticketing → mldocs-hub (Knowledge base): document processing.
ticketing → notifications (Other): fan-out alerts.
ticketing → Mailgun, Slack (external).

agents → mcp-service (within Agents), agents → core-api (via graphql), agents → vector-db (Knowledge base).
agent-service → llm-hub (LLM), agent-service → core-api, agent-service → LangSmith (external).
skill-controller → conversations (chats), skill-controller → livechat (chats), skill-controller → core-api, skill-controller → prototype-brain (NLP).
mcp-service → core-api-grpc, mcp-service → Tavily (external AI search).

vector-db → core-api (via auth/resource-access/user gRPC protobufs).
vector-db → tei-embedding (LLM, via tei.v1.embed protobuf): embedding compute.
vector-db → file-service (Other): document storage.
vector-db → HuggingFace, ask.lucy.ai (external).
docsearch → Elasticsearch (own backing store).
mldocs-extractor → core-api, mldocs-extractor → file-service, mldocs-extractor → guse (NLP), mldocs-extractor → HuggingFace, mldocs-extractor → www.eprmg.net (tenant external).

llm-hub → core-api, llm-hub → mcp-service (Agents), llm-hub → vector-db (Knowledge base; mutual edge).
llm-hub → AWS Bedrock, llm-hub → Azure Cognitive, llm-hub → Sentry, llm-hub → LangSmith (all external).
llm-hub-grpc → core-api-grpc.

nlpengine → nlp, nlp-data, glove, guse (all internal; densest internal coupling in the platform).
nlpengine → core-api, nlpengine → skill-controller (Agents; bidirectional).
nlp-data → core-api, nlp-data → guse, nlp-data → Elasticsearch.

workflows → core-api, workflows → ticketing (Helpdesk), workflows → interface-email (Helpdesk), workflows → mldocs-hub (Knowledge base; sole inbound caller), workflows → interface-gateway (frontend).
workflows → wa.capacity.com (external WhatsApp).
automations → core-api.

interface-gateway → nlpengine (NLP), interface-gateway → conversations (chats; bidirectional), interface-gateway → core-api-grpc + protobuf services.
interface-gateway → interface-textel (SMS; bidirectional).
interface-gateway → RabbitMQ (event-driven).

octopus → 13+ services across nearly every bucket.
file-service → core-api: sole observable downstream. Inbound from Helpdesk, KB, others.
notifications → core-api, ticketing, interface-email, llm-hub, analytics-api.
relay → core-api, file-service.
passport → core-api, file-service: called only by octopus.
capacity-distributed-service → core-api, Atlassian (external).

No observable edges in Capacity Datadog. The cluster operates outside the visible map.
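Several of the edges above are marked bidirectional. With the edge list held as a set of directed pairs, those loops can be found mechanically. The sketch below uses a small hand-picked subset of the edges from this section (not a full export), and the function name `bidirectional_pairs` is hypothetical.

```python
# A subset of directed edges (caller, callee) taken from this section.
# Edges described as bidirectional are written out in both directions.
EDGES = {
    ("conversations", "interface-gateway"),
    ("interface-gateway", "conversations"),
    ("interface-gateway", "interface-textel"),
    ("interface-textel", "interface-gateway"),
    ("nlpengine", "skill-controller"),
    ("skill-controller", "nlpengine"),
    ("ticketing", "llm-hub"),
    ("voice-ingress", "core-api"),
}


def bidirectional_pairs(edges):
    """Return each unordered service pair that has edges in both directions."""
    return sorted(
        {tuple(sorted(edge)) for edge in edges if (edge[1], edge[0]) in edges}
    )


loops = bidirectional_pairs(EDGES)
```

On this subset the result is the three pairs the text calls out as bidirectional; one-way edges such as ticketing → llm-hub are excluded.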
How customer interactions traverse the architecture.
customer browser
→ web (frontend, RUM-tracked)
→ interface-gateway (frontend)
→ conversations (chats)
├─ → core-api (core API) for session/auth
├─ → llm-hub (LLM) for AI response generation
│ ├─ → AWS Bedrock / Azure Cognitive (external)
│ └─ → vector-db (Knowledge base) for RAG retrieval
├─ → agents → agent-service (Agents) for agent routing
└─ → ticketing (Helpdesk) when escalating to ticket
SIP carrier
→ asterisk (Voice, SIP/RTP — not APM-traced)
→ voice-ingress (Voice)
├─ → core-api (core API) for auth/session/matchlog
└─ → voice-proxy (Voice, Azure Speech SDK)
├─ → Azure Speech (external)
├─ → agents (Agents) for agent-assist
└─ → audio-transcription-service / transcript-analyzer (Voice)
customer email
→ Mailgun (external inbound)
→ interface-email (Helpdesk)
→ ticketing (Helpdesk)
├─ → llm-hub (LLM) for AI triage
│ └─ → vector-db (Knowledge base) for similar-ticket retrieval
├─ → core-api (core API) for state persistence
├─ → notifications (Other) for fan-out alerts
└─ → real-time-hub (Helpdesk) for live UI updates
scheduling trigger (RabbitMQ event or cron)
→ workflows (Automations) — defines the plan
→ octopus (Other) — executes the plan
├─ → ticketing (Helpdesk) for ticket creation
├─ → automations → core-api (core API) for state changes
├─ → notifications (Other) for fan-out
└─ → workflows callbac