Text-to-Speech and Speech-to-Text
Last updated 14 Jun 2026
TTS providers
ElevenLabs, the OpenAI tts-1 family, Cartesia, and PlayHT, plus our compute-fund hosted XTTS for batch jobs. Pricing is either PerSecondAudio (modern) or PerCharacterTts (legacy); both are tracked. Voice cloning consent is enforced at the PersonaVoiceSample level: the creator must hold the rights.
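To illustrate how the two pricing units differ, here is a minimal sketch in Python. Only the unit names PerSecondAudio and PerCharacterTts come from this page; the class names, the tts_cost function, and the rates used below are hypothetical and are not real provider prices.

```python
from dataclasses import dataclass
from enum import Enum


class PricingUnit(Enum):
    PER_SECOND_AUDIO = "PerSecondAudio"    # modern: billed on generated audio length
    PER_CHARACTER_TTS = "PerCharacterTts"  # legacy: billed on input text length


@dataclass(frozen=True)
class TtsPrice:
    unit: PricingUnit
    rate_usd: float  # hypothetical rate per second or per character


def tts_cost(price: TtsPrice, text: str, audio_seconds: float) -> float:
    """Return the cost of one synthesis request under the given pricing unit."""
    if price.unit is PricingUnit.PER_SECOND_AUDIO:
        return price.rate_usd * audio_seconds
    # Legacy pricing depends only on the number of input characters.
    return price.rate_usd * len(text)
```

Under per-second pricing the same text can cost more or less depending on speaking rate, whereas per-character pricing is known before synthesis starts.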
STT providers
Whisper (OpenAI), Deepgram, AssemblyAI, Speechmatics. Diarization, word-level timestamps, and language detection are first-class capabilities; pick the provider that supports the ones you need via the capability filter.
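The capability filter can be pictured as a subset check: a provider qualifies only if it supports every capability you request. The sketch below assumes hypothetical names (SttCapability, SttProvider, filter_providers) and placeholder providers; it does not describe the actual filter API or state which real provider supports which capability.

```python
from dataclasses import dataclass
from enum import Flag, auto


class SttCapability(Flag):
    DIARIZATION = auto()
    WORD_TIMESTAMPS = auto()
    LANGUAGE_DETECTION = auto()


@dataclass(frozen=True)
class SttProvider:
    name: str  # placeholder names only; not real capability data
    capabilities: SttCapability


def filter_providers(
    providers: list[SttProvider], required: SttCapability
) -> list[SttProvider]:
    """Keep only providers that support every required capability."""
    return [p for p in providers if (p.capabilities & required) == required]


# Hypothetical catalogue used for illustration.
CATALOGUE = [
    SttProvider("provider-a", SttCapability.DIARIZATION | SttCapability.WORD_TIMESTAMPS),
    SttProvider("provider-b", SttCapability.LANGUAGE_DETECTION),
    SttProvider(
        "provider-c",
        SttCapability.DIARIZATION
        | SttCapability.WORD_TIMESTAMPS
        | SttCapability.LANGUAGE_DETECTION,
    ),
]

matches = filter_providers(
    CATALOGUE, SttCapability.DIARIZATION | SttCapability.WORD_TIMESTAMPS
)
```

Here matches contains provider-a and provider-c, since both cover diarization and word-level timestamps.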
Realtime voice
VoiceSession entities track full-duplex conversations (input → model → output streamed). Realtime models (gpt-4o-realtime, Gemini Live, etc.) are billed as PerMinuteRealtime rather than per-token. Latency targets: under 300 ms to first audio and under 500 ms in steady state.
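As a rough sketch of per-minute billing and the latency targets, the Python below computes a session cost and checks a session against the two thresholds. The VoiceSessionStats fields, the function names, and the rate are hypothetical, and the example assumes pro-rata billing by the second; whether sessions are actually rounded to whole minutes is not stated here.

```python
from dataclasses import dataclass

# Latency targets from this page, in milliseconds.
FIRST_AUDIO_TARGET_MS = 300
STEADY_STATE_TARGET_MS = 500


@dataclass(frozen=True)
class VoiceSessionStats:
    duration_seconds: float
    first_audio_ms: float   # time until the first audio chunk is streamed back
    steady_state_ms: float  # typical response latency after the first turn


def realtime_cost(stats: VoiceSessionStats, rate_per_minute_usd: float) -> float:
    """Cost under PerMinuteRealtime, assuming pro-rata billing (an assumption)."""
    return stats.duration_seconds / 60.0 * rate_per_minute_usd


def meets_latency_targets(stats: VoiceSessionStats) -> bool:
    """True only if both targets are met (strictly below each threshold)."""
    return (
        stats.first_audio_ms < FIRST_AUDIO_TARGET_MS
        and stats.steady_state_ms < STEADY_STATE_TARGET_MS
    )
```

A 90-second session at a hypothetical rate of 0.50 per minute would cost 0.75 under this assumption, regardless of how many tokens were exchanged.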
Languages supported
TTS: ~30 languages with native voices, ~50 with adequate voices. STT: 100+ languages, with quality tiers (Tier 1 → Tier 3 by training-data availability).
The supported language list is in /docs/doc-api-reference#audio-languages.