Capacity Platform — Architecture

A pure architectural view of the Capacity platform: what the buckets do, how they connect, and how they fit together. No SRE / observability concerns — those live in datadog-capacity-overview.md and datadog-capacity-onboarding.md.


System at a glance

13 business areas, organized as three concentric layers around a small set of platform fabric services.

[Diagram]

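Three layers: Customer-facing channels (Voice, SMS, chats, Helpdesk, frontend), AI / intelligence (Agents, LLM, NLP, Knowledge base, Lang), and Platform fabric (core API, Automations, Other).

Layer colors used throughout this document: 🔵 Channels, 🟡 Intelligence, 🟢 Platform fabric.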


Bucket descriptions

Customer-facing channels

Voice (5 services)

The telephony stack. Handles inbound and outbound voice calls — SIP / RTP ingestion via asterisk and voice-ingress, real-time speech-to-text via voice-proxy (which wraps the Azure Speech SDK), and post-call transcript analysis. Integrates with LumenVox externally for IVR / ASR. Outbound dialing campaigns are scheduled via the outbound-scheduler-* services.

SMS (2 services)

The SMS channel via the Textel provider. interface-textel is the platform's SMS interface — both inbound (customer messages → tickets / conversations) and outbound (bulk responses, campaigns). bulk-responder handles bulk-SMS workflows.

Note: core-api calls Textel directly in addition to going through interface-textel. This is an architectural shortcut — likely a legacy code path.

chats (11 services)

Real-time conversation orchestration — both live chat with end users and the agent-assist surface for human operators. conversations is the central orchestration service: it routes inbound messages through agents, the LLM stack, and the knowledge base, and decides whether to escalate to humans or auto-respond. webhub is the Capacity Concierge product's entry point (the only service in the platform with a team:concierge tag).

Helpdesk (9 services)

Customer-support ticket workflow. ticketing is the central ticket store. omnichannel-inbox + omnichannel-grpc are the unified inbox that turns inbound messages (from email, SMS, chat) into tickets. interface-email is the email channel. real-time-hub pushes live ticket-update events to the UI. The cluster is LLM-augmented: ticketing calls into llm-hub and vector-db for AI triage, similar-ticket retrieval, and summarization.

frontend (9 services)

The customer-facing UI surfaces — web (web, web-interface, webui), mobile (mobile), embedded survey widgets (surveys-ui), and document viewers (pdf-viewer, milestone-viewer). The backend gateway interface-gateway fans inbound HTTP/WS traffic out to channel-specific services. Almost every inbound customer request goes through interface-gateway.

AI / intelligence

Agents (10 services)

The AI agent backbone. Coordinates LLM-driven agents that handle customer interactions, dispatch skills, and use tools. agents is the central agent registry; agent-service is the modern AI-agent backend; agent-assist-module provides real-time suggestions to human operators. mcp-service implements Anthropic's Model Context Protocol — the AI-agent tool-routing layer. skill-controller orchestrates agent skill execution.

The bucket integrates externally with LangSmith (LLM observability) and Tavily (AI-agent search API).

LLM (4 services)

The central LLM inference gateway. llm-hub routes inference requests to AWS Bedrock (us-east-1 + us-west-2) and Azure Cognitive Services (Azure OpenAI) in parallel — a multi-cloud, multi-provider design. Every AI-powered feature in the platform ultimately calls llm-hub. tei-embedding serves embeddings (via HuggingFace's Text Embeddings Inference). mlflow-tracking-server tracks model training (offline use).

NLP (11 services)

Traditional pre-LLM NLP services. nlpengine is the hub — calls nlp, nlp-data, plus the embedding services glove (GloVe) and guse (Google Universal Sentence Encoder). babelfish provides translation. label-studio is the ML annotation platform. brain / prototype-brain are the platform's NLP-domain decision engines.

This bucket coexists with the LLM bucket as a two-generation NLP architecture: NLP keeps the pre-LLM modular design (a separate service per embedding type), while LLM collapses inference behind a single gateway.

Knowledge base (18 services)

Document retrieval and ML document ingestion. Three sub-architectures: retrieval / search, document sources, and the mldocs ETL pipeline.

External dependencies: HuggingFace (model hub), ask.lucy.ai. Per-tenant: mldocs-prmg + www.eprmg.net for the PRMG customer.

Lang (29 services) — APM-invisible

The lang.ai acquisition cluster — 25 × lang-ai-shared-* plus 4 × lang-* (including lang-mcp-server which presumably integrates with the Agents bucket via MCP). The cluster is invisible to Capacity Datadog APM (likely on a separate observability stack post-acquisition). Treat as a self-contained product surface that operates alongside the main platform but isn't visible in shared dashboards.

Platform fabric

core API (5 services)

The platform's source of truth for identity, sessions, and customer state. core-api is the Flask/Graphene REST + GraphQL deployment; core-api-grpc is the same codebase exposing gRPC. internal-apps is the back-office admin surface. Every business area calls into core API for auth, session lookup, and shared data.

External: Auth0 (identity), Slack (notifications), Textel (SMS, via the direct-bypass shortcut), ask.lucy.ai.

Automations (7 services)

Workflow and scheduling engine. workflows defines orchestration plans (high-level rules: "when X happens, do Y, then Z"). The plans execute via octopus (in the Other bucket). automations is the user-facing automation-builder UI. calendar + calendar-polling integrate with external calendars. email + email-service provide generic email infrastructure for automated sends.

External: wa.capacity.com (WhatsApp / WebApp tenant-brand endpoint).

Other (62 services across 10 sub-clusters)

Cross-cutting platform fabric. Not a product area — the substrate everything else uses. Organized into sub-clusters: orchestration / job runners (with octopus as the runtime executor for cross-bucket workflows), files / storage (file-service is the storage substrate), auth / identity, routing / integration, analytics / BI, audit / compliance, sites, dev / CI / test, engine helpers, plus a miscellaneous bin.


Service map

Per-bucket service catalog. For each bucket: how it fits in the system-at-a-glance, an internal-and-external connections diagram, and a table listing every service with its purpose.

Buckets are presented in criticality / fan-in order — most cross-cutting first. The (critical) flag identifies services with platform-wide fan-in.
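The fan-in ordering can be illustrated with a small sketch. The edge list below is a hand-picked subset of caller → callee edges mentioned elsewhere in this document, not the full dependency graph, and the helper names (fan_in, by_fan_in) are hypothetical.

```python
# Caller -> callee edges quoted from the bucket descriptions in this document.
# This is a subset chosen for illustration, not the complete graph.
EDGES = [
    ("ticketing", "llm-hub"),
    ("ticketing", "vector-db"),
    ("webhub", "core-api"),
    ("webhub", "ticketing"),
    ("webhub", "livechat"),
    ("voice-ingress", "core-api"),
    ("internal-apps", "core-api"),
    ("agents", "vector-db"),
    ("agents", "mcp-service"),
    ("llm-hub", "mcp-service"),
    ("voice-proxy", "agents"),
    ("conversations", "agents"),
]


def fan_in(edges):
    """Count distinct callers per callee."""
    callers = {}
    for caller, callee in edges:
        callers.setdefault(callee, set()).add(caller)
    return {callee: len(sources) for callee, sources in callers.items()}


def by_fan_in(edges):
    """Services ordered by descending fan-in, ties broken by name."""
    counts = fan_in(edges)
    return sorted(counts, key=lambda service: (-counts[service], service))


if __name__ == "__main__":
    for service in by_fan_in(EDGES):
        print(service, fan_in(EDGES)[service])
```

Even on this partial edge list, core-api comes out on top, which matches its position as the first bucket in the catalog.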

Notation:


1. core API · 5 services · 🟢 Platform fabric

Where this fits in the system at a glance: core API is the central node of the Platform fabric layer. Every other bucket (across all three layers — Channels, Intelligence, Platform fabric) calls into core API for auth, session lookup, and shared state writes. It has mutual edges with Voice, chats, Knowledge base, and Helpdesk, meaning those buckets both consume core API and are called back by it.

External connections

[Diagram]

Internal architecture

The bucket has two active deployments built from the same codebase, plus one admin surface and two near-silent services.

[Diagram]

Two architectural roles:

  1. Wire-protocol siblings: core-api and core-api-grpc share auth, sessions, and data; they differ only in how callers reach them. REST + GraphQL clients hit core-api; gRPC clients hit core-api-grpc via the capacity.coreapi.{auth,session,matchlog,gateway,graphql,user,resource_access,answer_engine}.v1.*service protobuf identifiers, which all resolve to the same deployment. This is why the External Connections diagram shows so much inbound fan-in to a single node — callers from every layer reach the same underlying state through whichever wire protocol fits them.

  2. Admin surface: internal-apps is a back-office tool that calls core-api for tenant configuration and customer-account management. It's the only APM-visible direct caller of core-api from within its own bucket.

The remaining two services (graphql, external-api) are silent in APM in both directions. Possibilities: framework instrumentation leaks, dormant deployments, or services whose calls only happen via channels APM doesn't trace. Investigation pending.
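The wire-protocol sibling relationship can be sketched as a name-resolution rule. This is an illustration only: the sub-API names come from the brace list above, while resolve_deployment and the example identifier suffix are hypothetical, not part of any real Capacity tooling.

```python
# Sub-API names taken from the brace list in this section.
CORE_API_SUBAPIS = [
    "auth", "session", "matchlog", "gateway",
    "graphql", "user", "resource_access", "answer_engine",
]


def protobuf_prefixes(subapis=CORE_API_SUBAPIS):
    """Expand the brace pattern into one versioned package prefix per sub-API."""
    return [f"capacity.coreapi.{name}.v1" for name in subapis]


def resolve_deployment(identifier):
    """Map a service identifier to the deployment that serves it.

    Hypothetical helper: any capacity.coreapi.<sub-API>.v1.* identifier
    resolves to core-api-grpc; the plain name core-api resolves to itself.
    Anything else is outside this bucket and returns None.
    """
    if identifier == "core-api":
        return "core-api"
    for prefix in protobuf_prefixes():
        if identifier.startswith(prefix + "."):
            return "core-api-grpc"
    return None


if __name__ == "__main__":
    # The "authservice" suffix is an illustrative guess at the *service glob.
    print(resolve_deployment("capacity.coreapi.auth.v1.authservice"))
```

Collapsing the eight protobuf identifiers onto one deployment name is what produces the single high-fan-in node in the diagram.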

Services

Service | Purpose within bucket
core-api (critical) | Flask/Graphene REST + GraphQL deployment. Handles auth, sessions, customer state, and shared data writes. The platform's source of truth for who-can-do-what.
core-api-grpc | Same codebase as core-api but exposed over gRPC for type-safe inter-service calls. Various capacity.coreapi.*.v1.*service protobuf identifiers resolve to this deployment.
internal-apps | Back-office admin surface. Tools used by operations/support staff to manage tenant configurations and customer accounts.
graphql | Likely a separate GraphQL deployment or framework-instrumentation entry. Purpose to confirm.
external-api | Public-facing API gateway — likely the entry point for partner / customer integrations.

2. Helpdesk · 9 services · 🔵 Channels

Where this fits in the system at a glance: Helpdesk is a customer-facing channel but uniquely depends on the Intelligence layer for its core function — ticketing calls llm-hub and vector-db for AI-augmented triage. So Helpdesk is a Channel that hard-depends on Intelligence, not just Platform fabric. Inbound traffic comes from other Channels (chats funnels messages into tickets) and from Automations (workflow-scheduled ticket operations).

External connections

[Diagram]

Internal architecture

The bucket is a classic hub-and-spoke around ticketing. The hub is the system of record; everything else either feeds tickets in, fans events out, or surfaces tickets to users.

[Diagram]

Three internal flows shape the bucket:

  1. Multi-channel inflow: interface-email ingests inbound customer emails via Mailgun and creates / updates tickets. omnichannel-inbox does the same for non-email channels (SMS, chat) — served via the omnichannel-grpc protobuf interface for type-safe gRPC clients. ticket-builder constructs tickets programmatically from event sources (e.g., when an Automations workflow needs to file a ticket). ticketing-bulk-daemon runs bulk operations — mass updates, imports, bulk closes.

  2. Event outflow: when ticket state changes, ticketing pushes live updates to real-time-hub, which fans them out to connected UIs over WebSocket/SSE. The same state changes also trigger outbound emails through interface-email (making the ticketing ↔ interface-email edge bidirectional).

  3. Read surfaces: helpdesk is the customer-portal-facing product surface for ticket management; mobile-helpdesk is the agent / operator mobile-app surface for triaging tickets on the go. Both are silent in APM (likely RUM-tracked).

The architecturally distinctive feature of Helpdesk is the LLM-augmented loop: ticketing calls llm-hub for AI triage on new tickets and vector-db for similar-ticket retrieval. This is shown as a cross-bucket edge in the External connections diagram above. Practically, it means Helpdesk depends on Intelligence for its core value, not just for nice-to-have features.
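A minimal sketch of what a hard dependency on Intelligence looks like in code. Everything here is hypothetical (the Ticket fields, create_ticket, and the two callables); the callables stand in for the ticketing → llm-hub and ticketing → vector-db calls, whose real interfaces this document does not describe.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Ticket:
    ticket_id: str
    body: str
    triage_label: str = ""
    similar_ids: List[str] = field(default_factory=list)


def create_ticket(
    ticket: Ticket,
    triage: Callable[[str], str],
    find_similar: Callable[[str], List[str]],
) -> Ticket:
    """Enrich a new ticket with the two Intelligence-layer calls.

    `triage` stands in for the llm-hub call and `find_similar` for the
    vector-db call. Errors are deliberately not caught: this models a hard
    dependency, where an Intelligence outage surfaces as a ticketing failure.
    """
    ticket.triage_label = triage(ticket.body)
    ticket.similar_ids = find_similar(ticket.body)
    return ticket


if __name__ == "__main__":
    enriched = create_ticket(
        Ticket("T-1", "refund request"),
        triage=lambda body: "billing",        # stub for llm-hub
        find_similar=lambda body: ["T-0"],    # stub for vector-db
    )
    print(enriched)
```

A soft dependency would wrap both calls in a fallback and still file the ticket. Whether ticketing actually does that is not stated here.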

Services

Service | Purpose within bucket
ticketing (critical) | Central ticket state store. System of record for all customer-support tickets. LLM-augmented via outbound calls to llm-hub and vector-db.
interface-email | Email channel adapter. Inbound emails become tickets via Mailgun; ticket events generate outbound emails. Bidirectional with ticketing.
omnichannel-inbox | Unified inbox that aggregates inbound messages from email, SMS, and chat into tickets.
omnichannel-grpc | gRPC face of the omnichannel routing layer (serves the capacity.omnichannel.*.v1.*service protobufs).
real-time-hub | WebSocket/SSE push layer that delivers live ticket-update events to UIs.
helpdesk | The product surface for the Helpdesk feature itself (likely admin/config or customer-portal-facing).
mobile-helpdesk | Mobile-app surface for the Helpdesk product.
ticketing-bulk-daemon | Bulk-ticket-operations daemon (mass updates, imports, bulk closes).
ticket-builder | Ticket-creation helper service that constructs tickets from various event sources for ingestion.

3. chats · 11 services · 🔵 Channels

Where this fits in the system at a glance: chats is the most cross-cutting Channel: conversations reaches into every other layer (Platform fabric for auth/state, Intelligence for LLM/agents/retrieval, Channels for handoffs to Helpdesk and frontend). webhub, the Capacity Concierge product's entry point, is the only service in the platform with a team tag (team:concierge), and it lives here.

External connections

[Diagram]

Internal architecture

chats has three sub-architectures: conversation orchestration, the Concierge product surface, and the live-chat backend.

[Diagram]

Three sub-architectures:

  1. Conversation orchestration: conversations is the platform's super-connector. It receives messages, decides what to do (auto-respond via LLM, escalate to a human agent, file a ticket), and routes accordingly. Conversation state lives in interaction-service (queue-driven) and livedb (the live-conversation state store). livedb-client is a library leak — the Python SDK registered as an APM service entry rather than a real deployable.

  2. Concierge product: webhub is the Capacity Concierge product's web entry point and is the only service in the platform with a team:concierge tag. It calls livechat directly, plus core-api, ticketing, real-time-hub, interface-gateway, and externally wa.aisoftware.com. The cluster also contains three library-instrumentation leaks (webhub-amqp, webhub-redis, concierge-client), which are AMQP / Redis / SDK clients registered as APM services rather than separate deployables. concierge-hosting is the tenant-specific deployment orchestrator for Concierge instances.

  3. Live chat: livechat is the real-time customer-agent chat service. It's called by webhub (Concierge sessions), conversations (orchestrated escalation), and skill-controller (Agents handoff during AI-driven flows).
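Because library leaks inflate the APM service count, it can help to separate them from real deployables. The sketch below uses the chats service names and leak list from this section; the deployables helper is hypothetical.

```python
# All 11 APM service entries in the chats bucket, as listed in this section.
CHATS_CATALOG = [
    "conversations", "webhub", "livechat", "chat", "interaction-service",
    "livedb", "livedb-client", "webhub-amqp", "webhub-redis",
    "concierge-client", "concierge-hosting",
]

# Entries this document identifies as library-instrumentation leaks.
LIBRARY_LEAKS = {"livedb-client", "webhub-amqp", "webhub-redis", "concierge-client"}


def deployables(catalog, leaks):
    """Return catalog entries that are not library leaks, preserving order."""
    return [service for service in catalog if service not in leaks]


if __name__ == "__main__":
    real = deployables(CHATS_CATALOG, LIBRARY_LEAKS)
    print(len(CHATS_CATALOG), "catalog entries,", len(real), "deployables")
```

On this list, the 11 catalog entries reduce to 7 deployables once the four leaks are excluded.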

Services

Service | Purpose within bucket
conversations (critical) | Conversation orchestration. The "brain" that decides what to do with each inbound interaction — calls agents, LLM, retrieval, escalates to humans, creates tickets.
webhub (team:concierge) | Capacity Concierge product's web entry point. The HTTP/WS surface for the Concierge product.
livechat | Real-time customer-agent chat service. Handles live chat sessions.
chat | Generic chat service (purpose to confirm — possibly older live-chat implementation).
interaction-service | Conversation/interaction state service. Queue-driven (no observable HTTP edges).
livedb | Live conversation state store. Holds in-flight conversation state.
livedb-client | Library leak — the Python livedb client SDK.
webhub-amqp | Library leak — the AMQP client lib used by webhub.
webhub-redis | Library leak — the Redis client lib used by webhub.
concierge-client | Library leak — the Concierge SDK client lib.
concierge-hosting | Concierge hosting plane service. Tenant-specific deployment orchestrator.

4. frontend · 9 services · 🔵 Channels

Where this fits in the system at a glance: frontend hosts the UI surfaces that customers actually interact with, plus the backend gateway (interface-gateway) that routes inbound HTTP/WS traffic into Channel-specific interfaces. Inbound HTTP/WS customer traffic enters the system through this bucket (SIP telephony instead enters through Voice's asterisk). The 8 UI services are RUM-tracked (not visible in APM); interface-gateway is the bucket's only APM-instrumented service.

External connections

[Diagram]

Internal architecture

frontend has a sharp internal split between the backend channel gateway (APM-active) and the customer-facing UI surfaces (RUM-tracked, APM-silent).

[Diagram]

Two architectural zones:

  1. Backend channel gateway: interface-gateway is the platform's front-door router. All inbound HTTP/WS traffic arrives here and gets fanned out to channel-specific interfaces (Voice's voice-ingress, SMS's interface-textel, chats' conversations). It's bidirectional with several of those services because of request/response callback patterns. This is the bucket's only critical-path service and the only one APM-visible.

  2. Customer-facing UIs — 8 services hosting the visual surfaces. Web UIs (web, web-interface, webui, surveys-ui) are browser SPAs that emit to Datadog RUM. mobile is the mobile app backend; mobile clients emit to mobile RUM, a separate Datadog product surface. pdf-viewer and milestone-viewer are embedded document-viewer components. None of these emit HTTP APM spans, so they don't appear in dep graphs — but they do serve customer traffic and have their own RUM dashboards.

The odd one out, and the eighth service in this zone, is interface-web, the sibling of interface-textel (SMS) and interface-email (Helpdesk) — both of which are active. interface-web is silent in every attribution form, suggesting it's either deprecated, absorbed into interface-gateway, or RUM-only.
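The gateway fan-out described above amounts to a routing table. In this sketch the target service names come from this section, while the channel keys ("voice", "sms", "chat") and the route function are hypothetical labels chosen for illustration.

```python
# Channel label -> channel-specific interface that interface-gateway fans out to.
# Target names are from this document; the channel keys are illustrative.
CHANNEL_ROUTES = {
    "voice": "voice-ingress",
    "sms": "interface-textel",
    "chat": "conversations",
}


def route(channel):
    """Return the downstream service for a channel, or raise for unknown ones."""
    try:
        return CHANNEL_ROUTES[channel]
    except KeyError:
        raise ValueError(f"no route for channel {channel!r}") from None


if __name__ == "__main__":
    for channel in CHANNEL_ROUTES:
        print(channel, "->", route(channel))
```

Note that there is no entry pointing at interface-web, which is consistent with it being silent in every attribution form.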

Services

Service | Purpose within bucket
interface-gateway (critical) | Front-door channel router. All inbound HTTP/WS customer traffic enters here and gets fanned out to channel-specific interfaces. Bidirectional with voice-ingress and interface-textel.
web | Customer-facing web SPA. RUM-tracked, not APM-instrumented.
web-interface | Web UI shell (likely the older / alternative web product). RUM-tracked.
webui | Web UI deployment (possibly a Vue/React frontend split alongside web). RUM-tracked.
mobile | Mobile app backend / catalog entry. Mobile clients emit to mobile RUM, not HTTP APM.
surveys-ui | UI for the surveys feature (paired with surveys backend in Other).
interface-web | Web channel interface. Sibling of interface-textel (SMS) and interface-email (Helpdesk) — possibly deprecated or absorbed into interface-gateway.
pdf-viewer | Browser-rendered PDF document viewer component.
milestone-viewer | Browser-rendered milestone/timeline viewer component.

5. Automations · 7 services · 🟢 Platform fabric

Where this fits in the system at a glance: Automations is in Platform fabric — it's the scheduling and orchestration plane that triggers work across other buckets. workflows defines plans (high-level rules); plan execution actually happens in Other's octopus. So Automations + Other together form the platform's two-layer orchestration stack.

External connections

[Diagram]

Internal architecture

Automations has three sub-architectures: the orchestration plane (workflows + automations + evaluations), generic email infrastructure, and calendar integration.

[Diagram]

Three sub-architectures:

  1. Orchestration plane: workflows defines high-level orchestration plans ("when X happens, do Y, then Z") and is the bucket's only critical-path-tier service. It's called by chats (conversations), Other (octopus), and RabbitMQ events; it fans out to 6 buckets including the sole caller path into KB's mldocs-hub. Plan execution happens in Other's octopus; workflows + octopus together form the platform's two-layer orchestration stack. automations is the customer-facing UI where users define their own rules; these rules then materialize as workflow plans. automation-evaluations is an offline evaluation harness for measuring automation correctness.

  2. Generic email infrastructure: email is a generic email-send daemon (distinct from interface-email in Helpdesk, which is the helpdesk-channel-specific email path). email-service is an HTTP wrapper around it. Both are silent in APM — likely fire-and-forget queue-driven send paths consumed by workflows.

  3. Calendar integration: calendar integrates with Google Calendar / Outlook; calendar-polling watches calendars for changes and emits events that workflows can subscribe to. Both silent in APM (polling daemon + sync helpers, matching the platform-wide bulk-daemon silence pattern).
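The split between plan definition and plan execution can be sketched as follows. This is not how workflows or octopus are actually implemented; Plan, execute, the event name, and the step names are all hypothetical, chosen only to show the "when X happens, do Y, then Z" shape.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Plan:
    """Definition layer (the role workflows plays): a trigger plus ordered steps."""
    trigger: str            # the "when X happens" part
    steps: Tuple[str, ...]  # the "do Y, then Z" part, in order


def execute(plan: Plan, event: str, handlers: Dict[str, Callable[[str], str]]) -> List[str]:
    """Execution layer (the role octopus plays): run steps in order on a match.

    Returns one result per step, or an empty list if the event does not
    match the plan's trigger.
    """
    if event != plan.trigger:
        return []
    return [handlers[step](event) for step in plan.steps]


if __name__ == "__main__":
    plan = Plan(trigger="ticket.created", steps=("notify", "schedule_followup"))
    handlers = {
        "notify": lambda event: f"notified on {event}",
        "schedule_followup": lambda event: f"follow-up scheduled for {event}",
    }
    print(execute(plan, "ticket.created", handlers))
```

Keeping Plan as plain immutable data is what lets the definition and the executor live in different buckets.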

Services

Service | Purpose within bucket
workflows (critical) | High-level orchestration plans. Defines "when X happens, do Y, then Z" rules across buckets. Sole inbound caller of mldocs-hub (KB).
automations | User-facing automation-builder UI. Customers define their own automation rules through this surface.
automation-evaluations | Evaluation harness for automation correctness and performance.
calendar | Calendar integration (Google Calendar / Outlook).
calendar-polling | Polling daemon that watches calendars for changes and triggers automation events.
email | Generic email-send daemon for automation flows.
email-service | HTTP wrapper for email.

6. Agents · 10 services · 🟡 Intelligence

Where this fits in the system at a glance: Agents is in Intelligence. It depends on LLM (every agent call routes through llm-hub), Knowledge base (RAG retrieval via vector-db), and NLP (skill execution involves NLP routing). Inbound from chats (conversations → agents) and Voice (voice-proxy → agents). The bucket also has direct vendor dependencies on external AI tooling (LangSmith, Tavily).

External connections

[Diagram]

Internal architecture

Agents has four sub-architectures: agent coordination, agent backends (showing a legacy-evolution smell), skill execution, and tool routing.

[Diagram]

Four sub-architectures:

  1. Agent coordination: agents is the central registry / coordinator that dispatches work to specific agent backends. Called by voice-proxy (Voice handoff during live calls) and conversations (chats orchestration). Calls mcp-service for tool routing and vector-db for retrieval.

  2. Agent backends — three services with overlapping names suggest a legacy-evolution pattern: agent-service is the modern AI-focused backend (calls LangSmith for LLM tracing); agent is the silent legacy implementation (likely deprecated); ai-agent is a third silent flavor of unclear purpose. The agent / agent-service / ai-agent triplet is the bucket's biggest architectural smell — three services for one concept, with two of them silent.

  3. Skill execution: skill-controller is the skill orchestrator that picks and runs skills based on conversation context. core-skills is the central skill registry, called by skill-controller via k8s DNS form. skill-controller is the most-connected service in the bucket, with edges into chats (conversations, livechat), NLP (prototype-brain), and frontend.

  4. Tool routing: mcp-service implements Anthropic's Model Context Protocol for AI-agent tool use. It routes agent tool-use requests to internal services and external tools (notably Tavily AI search). Called by agents and by llm-hub (cross-bucket).

Plus three specialized services (agent-assist-module, agentic-webhooks, rpa-service) that are silent in APM — likely queue-driven or operating outside the traced HTTP path. rpa-service (Robotic Process Automation) probably runs browser-automation tasks on dedicated workers.

Services

Service | Purpose within bucket
agents | Central agent registry / coordination hub. Routes work to specific agent implementations.
agent-service | Modern AI-agent backend (the primary agent implementation). Integrates with LangSmith for tracing.
agent | Legacy agent service. Almost certainly deprecated.
agent-assist-module | Real-time agent-assist — provides AI-generated suggestions to human operators.
ai-agent | AI-specific agent flavor (purpose to confirm).
agentic-webhooks | Webhook handler for agent-driven workflows.
mcp-service | Anthropic Model Context Protocol server. Routes agent tool-use requests.
core-skills | Central skill registry. Defines what skills agents can execute.
skill-controller | Skill-execution orchestrator. Picks and runs skills based on conversation context.
rpa-service | Robotic Process Automation service. Browser-automation tasks for AI agents.

7. Knowledge base · 18 services · 🟡 Intelligence

Where this fits in the system at a glance: Knowledge base is in Intelligence and serves as the retrieval substrate for the entire platform — vector-db is called by 5+ buckets. The bucket has internal coupling to LLM (via the vector-db → tei-embedding edge for embedding compute), to NLP (mldocs uses guse), and to core API for auth on retrieval calls. The mldocs ETL sub-cluster ingests documents from external sources (HuggingFace, customer URLs).

External connections

[Diagram]

Internal architecture

KB has three sub-architectures with very different shapes: parallel retrieval engines, document-source integrations, and a dedicated 13-service ETL pipeline.

[Diagram]

Three sub-architectures:

  1. Retrieval / search — two parallel retrieval philosophies coexist. vector-db provides semantic (embedding-based) retrieval and is the broadest-fanin Knowledge service (called by 5+ buckets); it computes embeddings via tei-embedding (LLM bucket). docsearch provides lexical (Elasticsearch-backed) retrieval — a separate Elasticsearch from NLP's nlp-data-es-master. Crucially, vector-db and docsearch don't connect to each other — readers pick the model that fits their use case. content-suggestion-generator and tag-recommender are sibling recommendation / classification services.

  2. Document sources: drive syncs customer documents from Google Drive accounts into the knowledge base. Silent in APM (likely OAuth UI + queue-driven sync daemon).

  3. mldocs ETL pipeline — a self-contained 13-service document-ingestion sub-cluster with its own dedicated RabbitMQ instance (mldocs-rabbitmq, distinct from the platform's prod / staging RabbitMQ). The flow: Automations' workflows schedules ingestion jobs by calling mldocs-hub (the cluster's public entry) → mldocs-hub enqueues to mldocs-rabbitmq → mldocs-extractor consumes jobs, fetches files (from file-service or tenant URLs like www.eprmg.net), computes embeddings via NLP's guse + HuggingFace, and writes the processed documents to mldocs-data (MySQL-backed). The 9 specialized tools (mldocs-classifier, mldocs-builder, mldocs-annotator, etc.) are silent in APM — likely offline batch operations or training tools. mldocs-prmg is a per-tenant service for the PRMG customer, paired with the external www.eprmg.net endpoint (the platform's first explicit per-customer service).
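The mldocs flow (hub enqueues, extractor consumes, result lands in the data store) can be sketched in-process. This is a stand-in, not the real pipeline: Python's queue.Queue replaces mldocs-rabbitmq, a dict replaces the MySQL-backed mldocs-data, and hub_enqueue, extractor_drain, fetch, and embed are hypothetical names.

```python
import queue


def hub_enqueue(job_queue, document_url):
    """mldocs-hub role: accept an ingestion request and enqueue a job."""
    job_queue.put({"url": document_url})


def extractor_drain(job_queue, fetch, embed, store):
    """mldocs-extractor role: consume queued jobs until the queue is empty.

    `fetch` stands in for file retrieval, `embed` for the guse / HuggingFace
    embedding step, and `store` for mldocs-data. Returns the job count.
    """
    processed = 0
    while True:
        try:
            job = job_queue.get_nowait()
        except queue.Empty:
            return processed
        text = fetch(job["url"])
        store[job["url"]] = {"text": text, "embedding": embed(text)}
        processed += 1


if __name__ == "__main__":
    jobs = queue.Queue()  # stand-in for mldocs-rabbitmq
    data = {}             # stand-in for mldocs-data
    hub_enqueue(jobs, "https://example.invalid/doc-1")
    count = extractor_drain(
        jobs,
        fetch=lambda url: f"contents of {url}",
        embed=lambda text: [float(len(text))],  # toy embedding
        store=data,
    )
    print(count, list(data))
```

The queue between hub and extractor is the reason callers only ever see mldocs-hub: everything past the enqueue step is asynchronous.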

Retrieval / search (4)

Service | Purpose within bucket
vector-db | Semantic retrieval engine. The RAG substrate — called by 5+ buckets. Calls tei-embedding (LLM) for embeddings.
docsearch | Lexical document search. Elasticsearch-backed.
content-suggestion-generator | Generates content / article suggestions from the knowledge base.
tag-recommender | Auto-tagging engine for documents.

Document sources (1)

Service | Purpose within bucket
drive | Google Drive integration. Syncs customer documents from their Drive accounts.

mldocs ETL pipeline (13)

Service | Purpose within bucket
mldocs-extractor | ETL hub. Reads jobs from mldocs-rabbitmq, fetches files, computes embeddings via guse + HuggingFace, writes to mldocs-data.
mldocs-data | Processed-document store. MySQL-backed.
mldocs-hub | Cluster's public entry point. Receives ingestion requests from workflows.
mldocs-builder | Document-build pipeline tool.
mldocs-classifier | Document classifier.
mldocs-classifier-trainer | Trains the classifier on labeled data.
mldocs-annotator | Document annotation service.
mldocs-labeler-app | Human-in-the-loop labeling UI.
mldocs-logger | mldocs-pipeline event logger.
mldocs-query | Document query tool.
mldocs-dom | DOM parser — extracts text from HTML documents.
mldocs-splitter | Document chunking.
mldocs-prmg | Per-tenant service for the PRMG customer.

8. LLM · 4 services · 🟡 Intelligence

Where this fits in the system at a glance: LLM is in Intelligence and serves as the inference gateway for every AI-powered feature in the platform. llm-hub is called by Agents, chats, Helpdesk, and Knowledge base — meaning LLM outages cascade across half the platform's product surface. The bucket has heavy external dependencies on AWS Bedrock and Azure Cognitive Services (running in parallel — multi-cloud routing).

External connections

[Diagram]

Internal architecture

LLM has three sub-architectures, all internally disconnected — no observable APM edges between LLM bucket members.

[Diagram]

Three sub-architectures:

  1. LLM inference gateway: llm-hub is the central inference router and the bucket's only critical-path-like service. It implements multi-cloud, multi-provider routing: inference calls go to AWS Bedrock (two regions: us-east-1 + us-west-2) and Azure Cognitive Services (two resources: prod + dev) in parallel. This is either vendor-hedge failover or per-model routing — sophisticated for a contact-center platform. llm-hub-grpc is a thin gRPC face for clients that need that wire protocol (e.g., other internal gRPC services); it calls core-api-grpc.

  2. Embedding inference: tei-embedding is a HuggingFace Text Embeddings Inference server. It is not called from llm-hub directly: it sits behind vector-db (KB bucket) and is reached only via the tei.v1.embed protobuf form from there. This means embedding lives architecturally in the retrieval path, not the inference path — embeddings are computed at index time (vector-db → tei) and read at query time (other buckets → vector-db → cached vectors).

  3. Training tracking: mlflow-tracking-server tracks ML experiment results. Used offline for model-training tracking, not on the runtime LLM path. Silent in APM, consistent with offline use.

External dependencies for llm-hub are the bucket's most consequential surface: AWS Bedrock + Azure Cognitive (both LLM inference), Sentry (error tracking — first non-Datadog observability vendor in the platform), LangSmith (LLM tracing).
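Of the two interpretations offered for llm-hub's multi-provider routing, the vendor-hedge failover reading can be sketched as below. This is an assumption-laden illustration, not llm-hub's actual behavior: the backend labels, the infer_with_failover function, and the `call` callable are hypothetical, and the real service may do per-model routing instead.

```python
# Hypothetical labels for the four backends named in this section.
PROVIDERS = [
    "bedrock:us-east-1",
    "bedrock:us-west-2",
    "azure:prod",
    "azure:dev",
]


def infer_with_failover(prompt, call, providers=PROVIDERS):
    """Try each provider in order and return (provider, result) on first success.

    `call(provider, prompt)` stands in for a real inference request and is
    expected to raise on failure. If every provider fails, raise RuntimeError
    listing the providers that were tried.
    """
    failed = []
    for provider in providers:
        try:
            return provider, call(provider, prompt)
        except Exception:
            failed.append(provider)
    raise RuntimeError(f"all providers failed: {failed}")


if __name__ == "__main__":
    def flaky_call(provider, prompt):
        # Simulate an outage in the first Bedrock region.
        if provider == "bedrock:us-east-1":
            raise ConnectionError("region unavailable")
        return f"{provider} answered: {prompt}"

    print(infer_with_failover("hello", flaky_call))
```

Under the per-model reading, the loop would be replaced by a lookup from model name to a single provider, with no retry across vendors.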

Services

Service | Purpose within bucket
llm-hub | Multi-cloud LLM inference gateway. Routes to AWS Bedrock + Azure Cognitive Services in parallel.
llm-hub-grpc | gRPC interface for llm-hub.
tei-embedding | Embedding inference server (HuggingFace Text Embeddings Inference). Serves embeddings to vector-db.
mlflow-tracking-server | ML experiment tracking server. Used offline for model training tracking.

9. Voice · 5 services · 🔵 Channels

Where this fits in the system at a glance: Voice is the telephony Channel. It depends on Platform fabric (core API for auth/session) and the Intelligence layer (handoff to Agents for AI-driven call handling). External SIP carriers feed asterisk outside the APM-traced graph; LumenVox is the third-party voice / ASR vendor.

External connections

[Diagram]

Internal architecture

Voice has a linear edge → real-time → post-call architecture — the bucket reads left-to-right like a pipeline.

[Diagram]

Three pipeline stages:

  1. Edge (SIP): asterisk is the open-source SIP PBX that terminates inbound calls from telephony carriers and originates outbound calls. It's C-based with no native Datadog APM tracer, so it's invisible in dep graphs — observability would have to be via logs / metrics / StatsD. voice-ingress is the cluster's HTTP/gRPC hub — it receives voice-call control events, calls core-api for auth/session/matchlog, and dispatches to downstream voice services.

  2. Real-time processing: voice-proxy wraps the Azure Speech SDK (CGO) for real-time speech-to-text on live call audio. It calls voice-ingress (control plane), agents (AI agent handoff during calls), and sensitive-data-service in Other (PII redaction for call audio). The expected outbound edge to Azure Speech doesn't appear in APM because the CGO library doesn't emit Datadog-instrumentable HTTP spans.

  3. Post-call analysis: audio-transcription-service produces transcripts (likely a batch / async ASR path distinct from voice-proxy's real-time STT). transcript-analyzer does post-call analytics (sentiment, summarization, key-moment extraction). Both are silent in APM, consistent with batch / queue-driven operation.

External dependencies — three of the four are invisible to Capacity US1 APM:

Services

Service | Purpose within bucket
voice-ingress | Voice cluster hub. Entry point for voice traffic. Calls core-api and dispatches downstream.
voice-proxy | Real-time voice processing service. Wraps the Azure Speech SDK. Calls agents and sensitive-data-service.
asterisk | Open-source PBX. Handles SIP trunk for inbound/outbound calls. C-based — no native Datadog APM tracer.
audio-transcription-service | Audio → text ASR service.
transcript-analyzer | Post-call analytics over transcripts.

10. NLP · 11 services · 🟡 Intelligence

Where this fits in the system at a glance: NLP is in Intelligence and represents the pre-LLM NLP generation that coexists with the modern LLM bucket. It's tightly coupled to Agents (nlpengine ↔ skill-controller) and is reached from frontend (interface-gateway → nlpengine). The bucket has its own Elasticsearch backing store, separate from KB's.

External connections

[Diagram]

Internal architecture

NLP has the densest internal coupling of any bucket: nlpengine is a hub that directly calls four internal services, plus a fifth internal edge from nlp-data.

[Diagram]

Four sub-architectures:

  1. Data layer: nlp-data holds NLP-specific data backed by nlp-data-es-master, a dedicated Elasticsearch cluster (separate from KB's docsearch Elasticsearch) that provides both storage and search. nlp is a related service, active under its k8s DNS form.

  2. Embedding services (pre-LLM era) — two embedding models served as services: glove (GloVe word embeddings) and guse (Google Universal Sentence Encoder). Both are pre-LLM-era models and are backed by the same nlp-data-es-master Elasticsearch. guse is also called cross-bucket by mldocs-extractor (KB's document ingestion pipeline) — making it a shared embedding service between NLP and KB.

  3. Decision engines: prototype-brain is the active NLP decision engine (called by Agents' skill-controller via k8s DNS). brain is its silent legacy ancestor, with the same name pattern and the same probable role, but silent in APM (likely deprecated).

  4. Specialized tools: babelfish (translation), label-studio (ML annotation platform, typically used offline), nlpmail-receiver-daemon (inbound-email NLP parser), nlsql (natural-language-to-SQL). All are silent in APM, consistent with their batch / offline nature.

Together, NLP and the LLM bucket form a two-generation architecture: NLP keeps separate, modular embedding services from the pre-LLM era, while LLM has collapsed inference into one provider gateway. Both run in production, likely because the older NLP features still serve specific use cases that don't justify migration.
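
To make the cross-bucket entry points into NLP easier to scan, the inbound edges mentioned in this section can be grouped by the bucket of the caller. The sketch below is illustrative only: the edge list and the BUCKET_OF lookup are transcribed from the prose, and callers_by_bucket is a hypothetical helper name.

```python
from collections import defaultdict

# Bucket of each caller named in this section (transcribed from the text).
BUCKET_OF = {
    "interface-gateway": "frontend",
    "skill-controller": "Agents",
    "mldocs-extractor": "Knowledge base",
}

# Inbound (caller, NLP service) edges described in the prose.
INBOUND_EDGES = [
    ("interface-gateway", "nlpengine"),
    ("skill-controller", "nlpengine"),
    ("skill-controller", "prototype-brain"),
    ("mldocs-extractor", "guse"),
]


def callers_by_bucket(edges, bucket_of):
    """Map each called service to the sorted list of buckets that call it."""
    grouped = defaultdict(set)
    for caller, callee in edges:
        grouped[callee].add(bucket_of[caller])
    return {service: sorted(buckets) for service, buckets in sorted(grouped.items())}


print(callers_by_bucket(INBOUND_EDGES, BUCKET_OF))
# {'guse': ['Knowledge base'], 'nlpengine': ['Agents', 'frontend'], 'prototype-brain': ['Agents']}
```

Under these assumptions, guse is the only embedding service with a caller outside NLP, which is what makes it a shared service between NLP and KB.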

Services

Service Purpose within bucket
nlpengine NLP processing hub. Orchestrates classification, embedding, and analysis pipelines.
nlp Specific NLP service.
nlp-data NLP data store. Backed by nlp-data-es-master Elasticsearch.
glove GloVe word-embedding model served as a service.
guse Google Universal Sentence Encoder served as a service.
brain Legacy NLP decision engine. Likely the deprecated predecessor of prototype-brain.
prototype-brain Active NLP decision engine.
babelfish Translation service.
label-studio ML annotation platform.
nlpmail-receiver-daemon Inbound-email parser for NLP processing.
nlsql Natural-language-to-SQL service.

11. SMS · 2 services · 🔵 Channels

Where this fits in the system at a glance: SMS is the smallest Channel. The clean architecture routes inbound/outbound SMS through interface-textel, which is bidirectional with interface-gateway (frontend). A direct-bypass shortcut exists where core-api calls Textel without going through interface-textel — a legacy code path or deliberate fast lane.

External connections

[Diagram]

Internal architecture

SMS is the smallest active bucket — just two services with no internal edges.

[Diagram]

Two services:

  1. interface-textel is the SMS channel adapter handling individual messages in both directions. Inbound customer SMS arrives via the Textel webhook into interface-textel, which routes the message into conversations (chats) or ticketing (Helpdesk) depending on context. Outbound responses go back to Textel for delivery. It has a loop-back edge with interface-gateway (frontend) — the same pattern as Voice's voice-ingress ↔ interface-gateway.

  2. bulk-responder handles bulk SMS campaigns (high-volume outbound). Silent in APM, consistent with the platform's bulk-daemon silence pattern.

The bucket has one architectural shortcut worth flagging: core-api calls Textel directly, bypassing the interface-textel abstraction. This is either a legacy code path (predating the channel abstraction) or a deliberate fast path for specific outbound flows. Either way, it is a parallel path that escapes the clean channel-abstraction architecture, which routes everything through interface-textel.
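
The shortcut can be stated as a simple rule: any service other than the channel adapter that calls the provider directly is a bypass. The sketch below applies that rule to the SMS edges described in this section. The edge list is transcribed from the prose, and bypass_callers is a hypothetical helper, not an existing check in the platform.

```python
# (caller, callee) edges described for the SMS bucket.
SMS_EDGES = [
    ("interface-textel", "Textel"),
    ("core-api", "Textel"),
    ("interface-gateway", "interface-textel"),
    ("interface-textel", "interface-gateway"),
]


def bypass_callers(edges, provider, adapter):
    """Return callers that reach the provider without going through the adapter."""
    return sorted(
        {caller for caller, callee in edges if callee == provider and caller != adapter}
    )


print(bypass_callers(SMS_EDGES, provider="Textel", adapter="interface-textel"))
# ['core-api']
```

The same rule could be applied to other channel adapters, assuming their provider edges are known.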

Services

Service Purpose within bucket
interface-textel SMS channel adapter. Both inbound (customer messages → tickets/conversations) and outbound (responses, campaigns).
bulk-responder Bulk SMS response engine for high-volume outbound campaigns.

12. Lang · 29 services · 🟡 Intelligence (isolated)

Where this fits in the system at a glance: Lang is nominally in Intelligence (it's an acquired AI-and-NLP product) but is isolated from the rest of the platform's APM-visible architecture. It runs as a self-contained product surface — lang-mcp-server is the likely integration point with Agents' mcp-service, but no APM traces confirm the integration. The bucket should be treated as a black box from the rest of the architecture's perspective.

External connections

[Diagram]

Internal architecture

The Lang bucket is APM-invisible — 58 dependency probes (29 services × 2 directions) returned zero observable edges. Not a single lang-* service appeared as a destination in any other bucket's probes across the entire discovery. The architecture below is inferred from service names only — we have no runtime evidence of any of these edges.

[Diagram]

Eight inferred sub-architectures (none verified by APM):

  1. Top-level lang-* (4 services): the lang.ai product surface, made up of lang-api (main API), lang-frontend (UI), lang-analysis-worker (background analysis), and lang-mcp-server (the likely integration point with Capacity's Agents bucket via Anthropic's Model Context Protocol (MCP), if the integration exists in production).

  2. Shared core (5) — main API, auth, processing engine, GraphQL backend, internal admin frontend for the acquired lang.ai codebase.

  3. ML / Intent (4) — classifiers, intent recognition engine + executor, Stanford CoreNLP integration.

  4. Channels / integrations (4) — channel adapters, third-party integrations (including Salesforce), sync service.

  5. Storage / data (2) — persistence layer + database migration runner.

  6. Suggestions / discovery (4) — suggestion engine, business-insights analytics, discovery orchestrator, content librarian.

  7. Config / notifications (2) — configurator gateway, notificator.

  8. Periodic / tickers (4) — automation-recommendations + ticker pair, activity-peaks + ticker pair. The "ticker" naming suggests periodic refresh of analytics.

The cluster should be treated as a black box from the rest of the Capacity architecture's perspective. The most consequential question is whether lang-mcp-server actually integrates with Capacity's mcp-service (Agents) in production — if so, AI agents in Capacity may be invoking lang.ai capabilities without that traffic appearing in any APM dashboard.
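
The isolation claim rests on the probe sweep described earlier: one probe per service per direction, all returning zero edges. A rough sketch of that bookkeeping follows. The direction labels "upstream" and "downstream", the function names, and the shape of the results dictionary are assumptions made for illustration; only the four top-level service names come from this document.

```python
from itertools import product

# Assumed labels for the two probe directions.
DIRECTIONS = ("upstream", "downstream")


def build_probes(services):
    """Return one (service, direction) probe per service and direction."""
    return list(product(services, DIRECTIONS))


def is_apm_isolated(edges_per_probe):
    """True when every probe returned an empty edge list."""
    return all(len(edges) == 0 for edges in edges_per_probe.values())


top_level = ["lang-api", "lang-frontend", "lang-analysis-worker", "lang-mcp-server"]
probes = build_probes(top_level)

# Stand-in results: every probe came back with no edges.
results = {probe: [] for probe in probes}

print(len(probes))               # 8 probes for 4 services
print(is_apm_isolated(results))  # True
```

With all 29 Lang services, the same function yields 29 × 2 = 58 probes, matching the count reported above.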

Top-level lang-* (4)

Service Purpose within bucket
lang-api lang.ai's main API surface.
lang-frontend lang.ai's user-facing UI.
lang-analysis-worker Background analysis worker.
lang-mcp-server MCP server — likely integration point with Capacity's Agents bucket.

lang-ai-shared-* (25)

Shared infrastructure of the acquired codebase. Inferred purposes:

Service Purpose within bucket
lang-ai-shared-api Main shared API.
lang-ai-shared-auth Authentication for the lang.ai cluster.
lang-ai-shared-engine Core processing engine.
lang-ai-shared-graphql-backend GraphQL API backend.
lang-ai-shared-frontend Internal admin UI.
lang-ai-shared-classifiers ML classifiers.
lang-ai-shared-intents-engine Intent recognition engine.
lang-ai-shared-intents-engine-executor Intent execution / fulfillment.
lang-ai-shared-channels Channel adapters.
lang-ai-shared-integrations Third-party integrations.
lang-ai-shared-salesforce-integrator Salesforce-specific integration.
lang-ai-shared-storage Data persistence layer.
lang-ai-shared-db-migrate Database migration runner.
lang-ai-shared-corenlp Stanford CoreNLP integration.
lang-ai-shared-suggestions Suggestion engine.
lang-ai-shared-business-insights Analytics surface for business metrics.
lang-ai-shared-discover-orchestrator Discovery orchestration.
lang-ai-shared-librarian Content library / knowledge base.
lang-ai-shared-configurator-gateway Configuration gateway.
lang-ai-shared-synchronizer Sync service.
lang-ai-shared-notificator Notification service.
lang-ai-shared-automation-recommendations Automation suggestions.
lang-ai-shared-automation-recommendations-ticker Periodic refresh of automation recommendations.
lang-ai-shared-activity-peaks Activity-peak analytics.
lang-ai-shared-activity-peaks-ticker Periodic refresh of activity-peak analytics.

13. Other · 62 services (10 sub-clusters) · 🟢 Platform fabric

Where this fits in the system at a glance: Other is the Platform fabric substrate — not a product area, but the infrastructure everything else uses. Inside Other, octopus is the cross-bucket runtime executor (paired with Automations' workflows); file-service is the storage substrate; notifications is the alert fan-out. Other also contains the audit / auth / analytics / dev-tooling sub-clusters that don't fit elsewhere.

External connections

[Diagram]

Internal architecture

Other contains 10 sub-clusters — too many to fit in a single internal diagram. The shape below shows the active hubs (3 sub-clusters with critical-tier services) and collapses the silent sub-clusters to grouped nodes.

[Diagram]

Three sub-clusters carry the bucket's architectural weight:

  1. Orchestration (13a): octopus is the platform's job-execution runtime and a critical-path service. It runs the plans defined by Automations' workflows and fans out into 13+ services across nearly every other bucket. notifications is the cross-bucket alert fan-out (called by core-api + ticketing; calls interface-email, llm-hub, analytics-api). third-party-interface-hub routes third-party integration callbacks.

  2. Files / storage (13b): file-service is the platform's storage substrate. Helpdesk's ticket attachments, KB's source documents, and most cross-cutting binary-data needs flow through it. It's broadly depended on (inbound from Helpdesk, KB, chats, Agents) but has only one observable downstream itself (core-api).

  3. Routing / integration (13d): relay is a generic event relay; capacity-distributed-service is the distributed-systems coordinator that integrates with Atlassian OAuth externally (a 14th vendor that doesn't appear anywhere else).
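
The split between workflows (which defines a plan) and octopus (which executes it) can be illustrated with a small plan/executor sketch. Everything here is hypothetical: the Step and Plan types, the action names, and the handler functions are invented for illustration and say nothing about how octopus is actually implemented. Only the target service names (ticketing, notifications) come from this document.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Step:
    target: str  # service the step is dispatched to
    action: str  # hypothetical action name


@dataclass(frozen=True)
class Plan:
    name: str
    steps: Tuple[Step, ...]


def execute(plan: Plan, handlers: Dict[str, Callable[[str], str]]) -> List[str]:
    """Run each step in order by dispatching it to the handler for its target."""
    results = []
    for step in plan.steps:
        handler = handlers[step.target]  # raises KeyError for an unknown target
        results.append(handler(step.action))
    return results


# The "definition" side: a plan is plain data.
plan = Plan(
    name="escalate",
    steps=(Step("ticketing", "create-ticket"), Step("notifications", "send-alert")),
)

# The "execution" side: stub handlers standing in for real service calls.
handlers = {
    "ticketing": lambda action: f"ticketing:{action}",
    "notifications": lambda action: f"notifications:{action}",
}

print(execute(plan, handlers))
# ['ticketing:create-ticket', 'notifications:send-alert']
```

Keeping the plan as plain data is what lets one runtime fan out to many buckets: adding a target only means registering another handler.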

The other seven sub-clusters are largely silent. All ten sub-clusters are listed below.

13a. Orchestration / job runners (5)

Service Purpose within bucket
octopus (critical) Job-execution runtime. Runs the plans defined by workflows. Calls 13+ services across nearly every bucket.
notifications Cross-bucket alert fan-out hub.
third-party-interface-hub Routing service for third-party integration callbacks.
octopus-redis Library leak: the Redis client library used by octopus, reported as if it were a service.
webhooks Generic webhook handler.

13b. Files / storage (3)

Service Purpose within bucket
file-service Storage substrate. Holds attachments, documents, and assets across the platform.
file-service-go Go rewrite of file-service (migration in progress).
print-service Print job handling for printable outputs.

13c. Auth / identity (8)

Service Purpose within bucket
passport Authentication coordinator. Likely a Passport.js-style auth-middleware service.
login Login page / flow backend.
saml SAML SSO integration for enterprise tenants.
token Token issuance and validation.
token-expiration-check Token-lifecycle daemon.
user_accounts User account management.
warden Authorization / policy decision service.
sensitive-data-service PII handling and redaction. Called by voice-proxy.

13d. Routing / integration (2)

Service Purpose within bucket
relay Generic event relay between internal services and external integrations.
capacity-distributed-service Distributed-systems coordinator. Integrates with Atlassian OAuth externally.

13e. Analytics / BI / data plumbing (6)

Service Purpose within bucket
analytics-api Internal analytics query API. Calls Tableau for BI dashboards.
analytics-ext-api External / customer-facing analytics surface.
metrics-api Internal platform-metrics API.
data-warehouse-exporter Exports operational data into the analytics warehouse.
datacollector Generic data-collection / ingestion daemon.
web-analytics Web-analytics event tracker.

13f. Audit / compliance (9)

Service Purpose within bucket
access_evaluation Access-control evaluation engine.
audit-daemon Audit event consumer.
confluence-audit-records Confluence audit log ingester.
event-logs General-purpose event log aggregator.
github.audit.streaming GitHub audit log streamer.
gitlab-audit-approvals GitLab merge-request approval audit.
gitlab-user-role-check GitLab role-based access checker.
jira-audit-records Jira audit log ingester.
traffic-logs Network/API traffic log aggregator.

13g. Sites / public web (4)

Service Purpose within bucket
sites Tenant-customizable sites engine.
demo-sites Demo / sandbox site deployments.
site-demo A specific demo site instance.
private-sites Private / internal site hosting.

13h. Dev / CI / test (5)

Service Purpose within bucket
dev-db-export Development-database export tool.
e2e End-to-end test harness.
hotfix-bot Hotfix automation bot.
param-testing Parameter / config testing utility.
stage-db-rebuild Staging database rebuild tool.

13i. Engine helpers (2)

Service Purpose within bucket
engine-builder Build-time engine assembly tool.
engine-logger Engine log processor.

13j. Miscellaneous (16+)

Service Purpose within bucket
admin Generic admin service.
autoturk Mechanical-Turk-style task routing.
batch-file-upload Bulk file-upload handler.
controller Generic controller service.
converter Generic format-conversion service.
delorean Backfill / time-travel / replay tool.
headless-browser Headless browser farm.
keep Persistence / retention service.
lambdasjs JavaScript Lambda function runner.
lambdaspy Python Lambda function runner.
meet Meeting / video-conferencing integration.
rules Generic rules engine.
social Social-media integration.
surveys Survey-engine backend (paired with surveys-ui in frontend).
tap Generic tap / data-intercept service.
unnamed-python-service A service with no name set in instrumentation — bug to fix.
vuetiful-joe Internal tool — purpose unknown.

Inter-bucket connections

The major architectural edges, organized by source bucket.

Voice → others

chats → others

The most cross-cutting bucket. conversations calls into nearly every other bucket:

webhub is the Concierge product's entry point — calls ticketing, core-api, interface-gateway, livechat, real-time-hub, plus external wa.aisoftware.com.

Helpdesk → others

Agents → others

Knowledge base → others

LLM → others

NLP → others

Automations → others

frontend → others (all via interface-gateway)

Other → many (substrate role)

Lang

No observable edges in Capacity Datadog. The cluster operates outside the visible map.


Key request flows

How customer interactions traverse the architecture.

A live chat from a customer

customer browser
  → web (frontend, RUM-tracked)
  → interface-gateway (frontend)
  → conversations (chats)
  ├─ → core-api (core API) for session/auth
  ├─ → llm-hub (LLM) for AI response generation
  │     ├─ → AWS Bedrock / Azure Cognitive (external)
  │     └─ → vector-db (Knowledge base) for RAG retrieval
  ├─ → agents → agent-service (Agents) for agent routing
  └─ → ticketing (Helpdesk) when escalating to ticket
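
The same flow can be held as a nested structure and flattened into caller/callee hops, which is handy when comparing a documented flow against observed edges. This is a sketch only: the tree below covers the internal services from the flow above, starting at interface-gateway and omitting the external LLM providers, and the hops function is a hypothetical helper.

```python
# Each node is (service, [child nodes]); transcribed from the flow above.
LIVE_CHAT_FLOW = (
    "interface-gateway",
    [
        (
            "conversations",
            [
                ("core-api", []),
                ("llm-hub", [("vector-db", [])]),
                ("agents", [("agent-service", [])]),
                ("ticketing", []),
            ],
        )
    ],
)


def hops(node):
    """Yield (caller, callee) pairs in depth-first order."""
    name, children = node
    for child in children:
        yield (name, child[0])
        yield from hops(child)


for caller, callee in hops(LIVE_CHAT_FLOW):
    print(f"{caller} -> {callee}")
```

The first hop printed is interface-gateway -> conversations, followed by the four branches under conversations in the order they appear in the flow.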

An inbound voice call

SIP carrier
  → asterisk (Voice, SIP/RTP — not APM-traced)
  → voice-ingress (Voice)
  ├─ → core-api (core API) for auth/session/matchlog
  └─ → voice-proxy (Voice, Azure Speech SDK)
        ├─ → Azure Speech (external)
        ├─ → agents (Agents) for agent-assist
        └─ → audio-transcription-service / transcript-analyzer (Voice)

A ticket created via email

customer email
  → Mailgun (external inbound)
  → interface-email (Helpdesk)
  → ticketing (Helpdesk)
  ├─ → llm-hub (LLM) for AI triage
  │     └─ → vector-db (Knowledge base) for similar-ticket retrieval
  ├─ → core-api (core API) for state persistence
  ├─ → notifications (Other) for fan-out alerts
  └─ → real-time-hub (Helpdesk) for live UI updates

An automated workflow run

scheduling trigger (RabbitMQ event or cron)
  → workflows (Automations) — defines the plan
  → octopus (Other) — executes the plan
  ├─ → ticketing (Helpdesk) for ticket creation
  ├─ → automations → core-api (core API) for state changes
  ├─ → notifications (Other) for fan-out
  └─ → workflows callbac