jovix.ai · jpard Solutions

Architecture

Anatomy of the voice agent - every component and how they connect.

Client Phone

Telephony Switch

Communication Services

Audio Streaming
(PCM Frames)

STT
(Speech-to-Text)

Speech Recognition Service

AI Orchestrator

Intent Detection

Analyzes user input and detects intent

Intent Mapping

Maps intent to business flows & actions (self-service / transfer / clarify)

Response Generation

Generates response using knowledge, context, rules and memory

TTS
(Text-to-Speech)

Speech Synthesis Service

Communication Services

Audio Streaming
(PCM Frames)

Telephony Switch

Client Phone

Data Platform

Source Systems

Core Banking

Customer Ops

Other Systems

Data Pipeline

ETL / ELT

Ingestion, transformation, cleansing & integration of source data.

Session Memory

Customer Information

Session Profile

Customer data relevant to the current session.

Personal ID data, interaction history, transaction history.

Structured Facts

Structured Memory

Extracts and stores key entities from each user message.

Rolling Summary

Semantic Memory

Compresses long conversations into a compact running summary.

Short-term Context

Short-Term Memory

Keeps the last few messages for immediate conversational context.

01

Audio in

Caller speaks; ACS streams audio frames via bidirectional WebSocket. Azure Speech SDK transcribes in real time with partials and a final result.

02

Intent detection

Hybrid cascade - short utterances go to the keyword extractor first; anything else is classified by a multilingual embedding model against per-intent centroids; if the embedding confidence falls below the threshold, a GPT-4o LLM prompted with the full BT nomenclator decides.

03

State orchestrator

Deterministic state machine (UC1 Popriri [garnishments], UC2 Fraudă [fraud], UC3 Quick Answer). Handles identification, resolution scenarios, and automatic agent handoff when confidence drops below threshold.

04

Response generation

Three routes - state script (instant prompt from inbound_config), Direct KB (top chunk verbatim) or RAG-LLM (concise synthesis from top-3 chunks). Azure AI Search holds 163 chunks.

05

Audio out

Per-call TTS synthesizer (multilingual Azure Neural voice). SSML wraps the text with brand-tuned pronunciation and a UI-configurable speech rate. Frames stream back through ACS as PCM.

06

Session memory

Rolling summary, structured facts, and recent turns persisted per session. Used to package context when handing off so the caller never repeats themselves.
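The three memory layers described above can be sketched as a single per-session object. This is a minimal illustration, not the product's implementation: the class name `SessionMemory`, the window size of 4 turns, and the naive "append evicted turn to the summary" step are all assumptions (the real rolling summary presumably compresses text rather than concatenating it).

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass, field


@dataclass
class SessionMemory:
    """Hypothetical per-session memory: summary, facts, recent turns."""

    max_recent_turns: int = 4  # assumed; the page only says "last few messages"
    rolling_summary: str = ""
    facts: dict = field(default_factory=dict)
    recent_turns: deque = field(init=False, default_factory=deque)

    def __post_init__(self) -> None:
        # Bounded window: the oldest turn drops out once the window is full.
        self.recent_turns = deque(maxlen=self.max_recent_turns)

    def add_turn(self, speaker: str, text: str, entities: dict | None = None) -> None:
        # Before the oldest turn is evicted, fold it into the summary.
        # Stand-in for a real summarizer: plain concatenation.
        if len(self.recent_turns) == self.recent_turns.maxlen:
            old_speaker, old_text = self.recent_turns[0]
            self.rolling_summary = f"{self.rolling_summary} {old_speaker}: {old_text}".strip()
        self.recent_turns.append((speaker, text))
        if entities:
            self.facts.update(entities)  # structured facts: latest value wins

    def handoff_package(self) -> dict:
        """Context bundle passed to a human agent at handoff."""
        return {
            "summary": self.rolling_summary,
            "facts": dict(self.facts),
            "recent_turns": list(self.recent_turns),
        }
```

The handoff package carries all three layers together, which is what lets an agent pick up the call without asking the caller to repeat earlier details.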

Orchestrator

Live trace of the orchestrator - current state, identification path, and every step taken in this call.

Current state -

Identified via -

Steps -

Conversation path

—

Steps will appear as the orchestrator moves through states.
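A deterministic state machine with a recorded conversation path, as traced in this panel, might look roughly like the sketch below. The state names, event names, and transition table are hypothetical; only the 0.75 handoff threshold comes from the Call Settings shown on this page.

```python
from __future__ import annotations

from dataclasses import dataclass, field

HANDOFF_THRESHOLD = 0.75  # "Agent handoff" value from Call Settings

# (state, event) -> next state. All names here are illustrative assumptions.
TRANSITIONS = {
    ("identification", "identified"): "resolution",
    ("identification", "failed"): "handoff",
    ("resolution", "resolved"): "done",
    ("resolution", "unresolved"): "handoff",
}
TERMINAL_STATES = {"handoff", "done"}


@dataclass
class Orchestrator:
    state: str = "identification"
    path: list = field(default_factory=lambda: ["identification"])

    def step(self, event: str, confidence: float) -> str:
        """Advance one step and record the new state in the path."""
        if self.state in TERMINAL_STATES:
            return self.state  # nothing to do once the call is closed out
        if confidence < HANDOFF_THRESHOLD:
            next_state = "handoff"  # low confidence always escalates
        else:
            # Unknown (state, event) pairs leave the state unchanged.
            next_state = TRANSITIONS.get((self.state, event), self.state)
        if next_state != self.state:
            self.state = next_state
            self.path.append(next_state)
        return self.state
```

Because every transition is a table lookup, the same sequence of events and confidences always yields the same path, which is what makes the live trace reproducible.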

Session memory

Rolling summary

No conversation yet.

Structured facts

No facts captured.

Recent turns

No turns yet.

Intent Detection

Hybrid cascade - keyword first, then embedding, then LLM. Each layer falls through if confidence is too low.

01 Fast path

Keyword

Regex match on short utterances (≤ 3 words). When several patterns match, the longest one wins.

fired 0 · avg ms -

Waiting for first turn…

02 Semantic

Embedding

Multilingual-e5-base centroid match. Threshold-based fallthrough to LLM.

fired 0 · avg ms -

Waiting for first turn…

03 Fallback

LLM

Azure OpenAI gpt-4o-mini with full nomenclator as system prompt.

fired 0 · avg ms - · cost $0.00

Waiting for first turn…

Recent extractions 0

No extractions yet.
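The keyword → embedding → LLM fall-through described in this panel can be sketched as follows. This is a simplified illustration under stated assumptions: the keyword table is invented, cosine similarity to the centroid is assumed to be the embedding "confidence", and `embed` and `llm` are caller-supplied stand-ins for the multilingual-e5-base model and the Azure OpenAI call. The 0.90 and 0.75 defaults are the values shown in Call Settings.

```python
from __future__ import annotations

import math
import re
from typing import Callable, Mapping, Sequence

# Hypothetical keyword table; the real patterns are not shown on this page.
KEYWORDS = {"card": "card_general", "blocked card": "card_blocked"}
# Longest pattern first, so "blocked card" is tried before "card".
_PATTERNS = [
    (re.compile(r"\b" + re.escape(k) + r"\b"), intent)
    for k, intent in sorted(KEYWORDS.items(), key=lambda kv: len(kv[0]), reverse=True)
]


def keyword_intent(text: str) -> str | None:
    for pattern, intent in _PATTERNS:
        if pattern.search(text):
            return intent
    return None


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def detect_intent(
    text: str,
    embed: Callable[[str], Sequence[float]],
    centroids: Mapping[str, Sequence[float]],
    llm: Callable[[str], tuple],  # returns (intent, confidence)
    llm_fallback_below: float = 0.90,
    handoff_below: float = 0.75,
) -> dict:
    # Layer 1: keyword fast path, only for utterances of at most 3 words.
    words = text.lower().split()
    if len(words) <= 3:
        hit = keyword_intent(" ".join(words))
        if hit is not None:
            return {"intent": hit, "layer": "keyword", "confidence": 1.0, "handoff": False}

    # Layer 2: nearest intent centroid by cosine similarity.
    vector = embed(text)
    intent, score = max(
        ((name, cosine(vector, c)) for name, c in centroids.items()),
        key=lambda pair: pair[1],
    )
    if score >= llm_fallback_below:
        return {"intent": intent, "layer": "embedding", "confidence": score, "handoff": False}

    # Layer 3: LLM fallback; low LLM confidence flags an agent handoff.
    intent, confidence = llm(text)
    return {
        "intent": intent,
        "layer": "llm",
        "confidence": confidence,
        "handoff": confidence < handoff_below,
    }
```

Ordering the layers from cheapest to most expensive means the LLM is only paid for on utterances the first two layers cannot settle.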

Response Generation

Each turn is routed to one of three lanes - state script, KB direct, or RAG-LLM - based on intent and KB match.

01 Instant

State Script · 0

Hardcoded prompt from inbound_config.json; served instantly.

02 Verbatim

KB Direct · 0

Top chunk similarity ≥ 0.85 → verbatim answer, no LLM call.

03 Synthesis

RAG-LLM · 0

Top-3 KB chunks → LLM synthesizes a concise answer.

Response history 0

No bot responses yet.
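The three-lane routing in this panel reduces to a short decision function. The sketch below assumes the lanes are checked in the order listed (state script, then KB direct, then RAG-LLM); the function name, argument shapes, and the `synthesize` callable standing in for the LLM are hypothetical. The 0.85 cut-off is the one shown on this page.

```python
from __future__ import annotations

from typing import Callable, Optional, Sequence

KB_DIRECT_THRESHOLD = 0.85  # "KB direct path" value from Call Settings


def route_response(
    question: str,
    state_script: Optional[str],
    kb_hits: Sequence[tuple],  # (chunk_text, similarity) pairs from the KB search
    synthesize: Callable[[str, list], str],  # stand-in for the RAG-LLM call
) -> tuple:
    """Return (lane, response_text) for one turn."""
    # Lane 1: the current state has a scripted prompt, so no lookup is needed.
    if state_script is not None:
        return "state_script", state_script

    ranked = sorted(kb_hits, key=lambda hit: hit[1], reverse=True)

    # Lane 2: a confident top chunk is returned verbatim, with no LLM call.
    if ranked and ranked[0][1] >= KB_DIRECT_THRESHOLD:
        return "kb_direct", ranked[0][0]

    # Lane 3: hand the top-3 chunks to the LLM for a concise synthesis.
    top_chunks = [chunk for chunk, _ in ranked[:3]]
    return "rag_llm", synthesize(question, top_chunks)
```

Only the third lane incurs LLM latency and cost, so the first two act as fast paths for turns that already have a known answer.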

Live Log

Real-time SSE stream of every event - user turns, orchestrator transitions, intent extractions, timings.

Logs will appear here once you start a call…
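For reference, a server-sent event is plain text on the wire: optional `id:` and `event:` fields, one or more `data:` lines, and a blank line to end the event. The sketch below shows how a log event could be serialized for such a stream; the event names and payload fields are assumptions, not the application's actual schema.

```python
from __future__ import annotations

import json
from typing import Optional


def format_sse(event: str, payload: dict, event_id: Optional[str] = None) -> str:
    """Serialize one event in the text/event-stream wire format."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    lines.append(f"event: {event}")
    # Each line of the payload needs its own "data:" prefix.
    for line in json.dumps(payload, ensure_ascii=False).splitlines():
        lines.append(f"data: {line}")
    # A blank line terminates the event.
    return "\n".join(lines) + "\n\n"


# Hypothetical intent-extraction event as it might appear in the live log.
message = format_sse("intent", {"intent": "fraud", "layer": "llm", "ms": 412}, event_id="7")
```

On the browser side, an `EventSource` listener registered for the `intent` event type would receive the JSON string in `event.data`.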

Call Settings

Configure flow, transcription engine, target client, voice, and routing thresholds before initiating a call.

Call configuration

Flow

STT Mode

Use case

Language

Client

Voice

TTS voice

TTS speech rate 1.00×
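A speech-rate multiplier such as 1.00× typically reaches the synthesizer as an SSML `prosody` attribute. The sketch below is an assumption about how that mapping could be done, not the application's code: it converts the multiplier to a signed percentage, escapes the text, and wraps hypothetical brand terms in `<sub alias>` for tuned pronunciation. The voice name in the usage example is only an illustrative multilingual neural voice.

```python
from __future__ import annotations

from typing import Mapping, Optional
from xml.sax.saxutils import escape, quoteattr


def build_ssml(
    text: str,
    voice: str,
    rate: float = 1.0,
    lang: str = "en-US",
    aliases: Optional[Mapping[str, str]] = None,  # hypothetical brand-term table
) -> str:
    """Wrap text in SSML with a voice, a speech rate, and spoken aliases."""
    body = escape(text)
    for term, spoken in (aliases or {}).items():
        # <sub alias="..."> tells the synthesizer to speak the alias instead.
        body = body.replace(
            escape(term), f"<sub alias={quoteattr(spoken)}>{escape(term)}</sub>"
        )
    percent = round((rate - 1.0) * 100)  # 1.25x -> +25%, 0.90x -> -10%
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f"xml:lang={quoteattr(lang)}>"
        f"<voice name={quoteattr(voice)}>"
        f'<prosody rate="{percent:+d}%">{body}</prosody>'
        "</voice></speak>"
    )


ssml = build_ssml(
    "Welcome to BT & partners.",
    voice="en-US-JennyMultilingualNeural",  # example voice name, an assumption
    rate=1.25,
    aliases={"BT": "B T"},
)
```

Escaping the text before inserting markup keeps characters such as `&` from breaking the XML that the synthesizer parses.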

Routing thresholds

LLM fallback: embedding confidence below 0.90

Agent handoff: LLM confidence below 0.75

KB direct path: use chunk verbatim if similarity above 0.85