Architecture
Anatomy of the voice agent: every component and how it connects to the others.
Client Phone
Telephony Switch
Communication Services
Audio Streaming (PCM Frames)
STT (Speech-to-Text)
Speech Recognition Service
Intent Detection
Analyzes user input and detects intent
Intent Mapping
Maps intent to business flows & actions (self-service / transfer / clarify)
Response Generation
Generates response using knowledge, context, rules and memory
TTS (Text-to-Speech)
Speech Synthesis Service
Communication Services
Audio Streaming (PCM Frames)
Telephony Switch
Client Phone
Data Platform
Source Systems
Core Banking
Customer Ops
Other Systems
Data Pipeline
ETL / ELT
Ingestion, transformation, cleansing & integration of source data.
Session Memory
Customer Information
Session Profile
Customer data relevant to the current session.
Structured Facts
Structured Memory
Extracts and stores key entities from each user message.
Rolling Summary
Semantic Memory
Compresses long conversations into a compact running summary.
Short-term Context
Short-Term Memory
Keeps the last few messages for immediate conversational context.
Audio in
The caller speaks; ACS (Azure Communication Services) streams audio frames over a bidirectional WebSocket. The Azure Speech SDK transcribes in real time, emitting partial results followed by a final result.
Intent detection
Hybrid cascade: short utterances hit the keyword extractor first; otherwise a multilingual embedding model scores the utterance against per-intent centroids; if the best score falls below the threshold, a GPT-4o-family LLM prompted with the full BT nomenclator decides.
State orchestrator
Deterministic state machine covering three use cases: UC1 Popriri (garnishments), UC2 Fraudă (fraud), and UC3 Quick Answer. It handles caller identification, resolution scenarios, and automatic handoff to an agent when confidence drops below the threshold.
Response generation
Three routes: state script (instant prompt from inbound_config), direct KB (top chunk returned verbatim), or RAG-LLM (concise synthesis from the top 3 chunks). Azure AI Search holds 163 chunks.
Audio out
Per-call TTS synthesizer (multilingual Azure Neural voice). SSML wraps the text with brand-tuned pronunciation and a UI-configurable speech rate. Frames stream back through ACS as PCM.
Session memory
Rolling summary, structured facts, and recent turns are persisted per session. On handoff they are packaged as context so the caller never has to repeat themselves.
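The three memory layers described above (rolling summary, structured facts, recent turns) could be sketched as a small per-session object. This is an illustrative sketch only: the class name SessionMemory, its fields, and the method handoff_context are hypothetical, and the "summary" here is plain concatenation standing in for whatever compression the real system uses.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class SessionMemory:
    """Hypothetical per-session memory with three layers."""
    max_recent_turns: int = 4
    rolling_summary: str = ""
    structured_facts: dict = field(default_factory=dict)
    recent_turns: deque = field(default_factory=deque)

    def add_turn(self, speaker, text, facts=None):
        """Store a turn, merge extracted facts, and fold old turns into the summary."""
        self.recent_turns.append((speaker, text))
        if facts:
            self.structured_facts.update(facts)
        while len(self.recent_turns) > self.max_recent_turns:
            old_speaker, old_text = self.recent_turns.popleft()
            # Placeholder compression (assumption): a real system would
            # presumably summarize here instead of concatenating.
            self.rolling_summary = f"{self.rolling_summary} {old_speaker}: {old_text}".strip()

    def handoff_context(self):
        """Package all three layers for the receiving agent."""
        return {
            "summary": self.rolling_summary,
            "facts": dict(self.structured_facts),
            "recent_turns": list(self.recent_turns),
        }
```

Keeping only the last few turns verbatim bounds the context size, while the summary and facts preserve what was said earlier in the call.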
Orchestrator
Live trace of the orchestrator: current state, identification path, and every step taken in this call.
Conversation path
Rolling summary
Structured facts
Recent turns
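The orchestrator's deterministic transitions, its recorded conversation path, and the automatic handoff below a confidence threshold could be sketched as follows. The state names, event names, and the 0.5 threshold are assumptions made for illustration; the actual flows (UC1 to UC3) are not reproduced here.

```python
from enum import Enum


class State(Enum):
    IDENTIFY = "identify"
    RESOLVE = "resolve"
    HANDOFF = "handoff"
    DONE = "done"


# Hypothetical transition table: (current state, event) -> next state.
TRANSITIONS = {
    (State.IDENTIFY, "identified"): State.RESOLVE,
    (State.RESOLVE, "resolved"): State.DONE,
}

TERMINAL = {State.HANDOFF, State.DONE}


class Orchestrator:
    def __init__(self, handoff_threshold=0.5):
        self.state = State.IDENTIFY
        self.handoff_threshold = handoff_threshold
        self.trace = [State.IDENTIFY]  # conversation path so far

    def step(self, event, confidence):
        """Advance one step; low confidence always routes to handoff."""
        if self.state in TERMINAL:
            return self.state
        if confidence < self.handoff_threshold:
            next_state = State.HANDOFF
        else:
            # Unknown events leave the state unchanged.
            next_state = TRANSITIONS.get((self.state, event), self.state)
        if next_state is not self.state:
            self.trace.append(next_state)
        self.state = next_state
        return next_state
```

Because the table is a plain dictionary, the same input sequence always yields the same path, which is what makes a live trace of the call reproducible.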
Intent Detection
Hybrid cascade: keyword first, then embedding, then LLM. Each layer falls through to the next if its confidence is too low.
Keyword
Regex match on short utterances (≤ 3 words). When several patterns match, the longest match wins.
Embedding
Centroid match on multilingual-e5-base embeddings. Falls through to the LLM when the best similarity is below the threshold.
LLM
Azure OpenAI gpt-4o-mini with the full nomenclator as its system prompt.
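The three layers above can be sketched as a single fall-through function. Everything named here is an assumption for illustration: the keyword patterns, the intent labels, the 0.8 threshold, and the embed and llm_classify callables, which stand in for the embedding model and the LLM call and are passed in so the sketch stays self-contained.

```python
import math
import re

# Hypothetical keyword patterns: intent label -> regex.
KEYWORD_PATTERNS = {
    "fraud": re.compile(r"\b(fraud|stolen card)\b", re.IGNORECASE),
    "garnishment": re.compile(r"\bgarnishment\b", re.IGNORECASE),
}


def keyword_intent(text, max_words=3):
    """Layer 1: regex match on short utterances; the longest match wins."""
    if len(text.split()) > max_words:
        return None
    best = None  # (match length, intent)
    for intent, pattern in KEYWORD_PATTERNS.items():
        m = pattern.search(text)
        if m and (best is None or len(m.group(0)) > best[0]):
            best = (len(m.group(0)), intent)
    return best[1] if best else None


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def embedding_intent(vector, centroids, threshold):
    """Layer 2: nearest intent centroid, or None when below the threshold."""
    best_intent, best_score = None, -1.0
    for intent, centroid in centroids.items():
        score = cosine(vector, centroid)
        if score > best_score:
            best_intent, best_score = intent, score
    return best_intent if best_score >= threshold else None


def detect_intent(text, embed, centroids, llm_classify, threshold=0.8):
    """Run the cascade; return (intent, layer that produced it)."""
    intent = keyword_intent(text)
    if intent is not None:
        return intent, "keyword"
    intent = embedding_intent(embed(text), centroids, threshold)
    if intent is not None:
        return intent, "embedding"
    return llm_classify(text), "llm"  # Layer 3: most expensive, last resort
```

Ordering the layers from cheapest to most expensive means the LLM is only called for utterances the first two layers cannot settle.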
Response Generation
Each turn is routed to one of three lanes (state script, KB direct, or RAG-LLM) based on intent and KB match.
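A minimal sketch of the three-lane routing, under stated assumptions: the state script lane is taken whenever the current state supplies a prompt, the KB direct lane whenever the top search hit scores at or above a hypothetical direct_threshold, and RAG-LLM otherwise. KBHit, route_response, and the synthesize callable are illustrative names, not the system's actual API.

```python
from dataclasses import dataclass


@dataclass
class KBHit:
    text: str
    score: float


def route_response(state_prompt, hits, synthesize, direct_threshold=0.85, top_k=3):
    """Return (lane, response text) for one turn."""
    # Lane 1: the current state has a scripted prompt, so use it as-is.
    if state_prompt is not None:
        return "state_script", state_prompt
    ranked = sorted(hits, key=lambda h: h.score, reverse=True)
    # Lane 2: a strong KB match is returned verbatim.
    if ranked and ranked[0].score >= direct_threshold:
        return "kb_direct", ranked[0].text
    # Lane 3: synthesize an answer from the top-k chunks.
    return "rag_llm", synthesize([h.text for h in ranked[:top_k]])
```

The first two lanes avoid an LLM call entirely, which is presumably why they are checked before the RAG lane.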
Live Log
Real-time SSE (server-sent events) stream of every event: user turns, orchestrator transitions, intent extractions, and timings.
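For reference, a server-sent event is a block of "field: value" lines terminated by a blank line. The sketch below formats one log event that way; the function name format_sse, the event name, and the payload fields are hypothetical.

```python
import json


def format_sse(event, payload):
    """Serialize one event in the text/event-stream wire format."""
    # json.dumps escapes newlines, so the payload fits on a single data line.
    data = json.dumps(payload, ensure_ascii=False)
    return f"event: {event}\ndata: {data}\n\n"
```

A browser client can subscribe to such a stream with EventSource and register a listener per event name.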
Call Settings
Configure flow, transcription engine, target client, voice, and routing thresholds before initiating a call.