Text-to-Speech and Speech-to-Text

Last updated 1 Jul 2026

TTS providers

ElevenLabs, the OpenAI tts-1 family, Cartesia, and PlayHT, plus our compute-fund hosted XTTS for batch jobs. Pricing uses either PerSecondAudio (modern) or PerCharacterTts (legacy); both are tracked. Voice cloning requires a recorded consent sample, and you must hold the rights to any voice you clone.
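To make the difference between the two pricing models concrete, here is a minimal sketch of a cost estimate. Only the names PerSecondAudio and PerCharacterTts come from this page; the `TtsPrice` type, the `estimate_tts_cost` helper, and the rates are hypothetical and do not reflect any provider's actual pricing.

```python
from dataclasses import dataclass
from enum import Enum


class PricingModel(Enum):
    PER_SECOND_AUDIO = "PerSecondAudio"    # modern: billed on output duration
    PER_CHARACTER_TTS = "PerCharacterTts"  # legacy: billed on input length


@dataclass(frozen=True)
class TtsPrice:
    """Hypothetical price entry; rate is USD per second or per character."""
    model: PricingModel
    rate_usd: float


def estimate_tts_cost(price: TtsPrice, text: str, audio_seconds: float) -> float:
    """Estimate the cost of one synthesis request under either pricing model."""
    if price.model is PricingModel.PER_SECOND_AUDIO:
        return price.rate_usd * audio_seconds
    # Legacy model: the generated audio length does not affect the cost.
    return price.rate_usd * len(text)


# Illustrative rates only.
per_second = TtsPrice(PricingModel.PER_SECOND_AUDIO, rate_usd=0.5)
per_char = TtsPrice(PricingModel.PER_CHARACTER_TTS, rate_usd=0.25)
```

Under these made-up rates, the same request costs different amounts depending on the model: the per-second price scales with the generated audio, while the per-character price scales with the input text.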

STT providers

Whisper (OpenAI), Deepgram, AssemblyAI, Speechmatics. Diarization, word-level timestamps, and language detection are first-class capabilities; pick the provider that supports the ones you need via the capability filter.
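The idea behind the capability filter can be sketched as follows. This is an illustration under assumptions, not the platform's actual API: `SttCapability`, `SttProvider`, and `filter_providers` are hypothetical names, and the placeholder providers and their capability sets are invented rather than taken from any real vendor.

```python
from dataclasses import dataclass
from enum import Flag, auto


class SttCapability(Flag):
    DIARIZATION = auto()
    WORD_TIMESTAMPS = auto()
    LANGUAGE_DETECTION = auto()


@dataclass(frozen=True)
class SttProvider:
    """Hypothetical catalog entry for an STT provider."""
    name: str
    capabilities: SttCapability


def filter_providers(providers, required: SttCapability):
    """Return the providers that support every capability in `required`."""
    return [p for p in providers if (p.capabilities & required) == required]


# Placeholder catalog; real capability data would come from the platform.
catalog = [
    SttProvider("provider-a", SttCapability.DIARIZATION | SttCapability.WORD_TIMESTAMPS),
    SttProvider("provider-b", SttCapability.LANGUAGE_DETECTION),
    SttProvider(
        "provider-c",
        SttCapability.DIARIZATION
        | SttCapability.WORD_TIMESTAMPS
        | SttCapability.LANGUAGE_DETECTION,
    ),
]

needed = SttCapability.DIARIZATION | SttCapability.WORD_TIMESTAMPS
matching = [p.name for p in filter_providers(catalog, needed)]
```

A flag enum keeps the check to a single bitwise comparison, so requiring an additional capability means adding one more flag to `needed`.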

Realtime voice

VoiceSession entities track full-duplex conversations in which every stage is streamed (input → model → output). Realtime models (gpt-4o-realtime, Gemini Live, etc.) bill PerMinuteRealtime rather than per token. Latency targets: under 300 ms to first audio and under 500 ms in steady state.
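One way to check a session against these latency targets is sketched below. The thresholds come from this page; the `LatencyReport` type, the `check_latency` helper, and the choice to require every turn to meet the steady-state target (rather than, say, a median or percentile) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence

FIRST_AUDIO_TARGET_MS = 300   # target: first audio in under 300 ms
STEADY_STATE_TARGET_MS = 500  # target: steady-state latency under 500 ms


@dataclass(frozen=True)
class LatencyReport:
    """Hypothetical result of comparing one session to the targets."""
    first_audio_ok: bool
    steady_state_ok: bool


def check_latency(first_audio_ms: float, turn_latencies_ms: Sequence[float]) -> LatencyReport:
    """Compare measured latencies against the documented targets.

    Assumption: steady state is met only if every subsequent turn is under
    the target. A session with no further turns passes trivially.
    """
    return LatencyReport(
        first_audio_ok=first_audio_ms < FIRST_AUDIO_TARGET_MS,
        steady_state_ok=all(ms < STEADY_STATE_TARGET_MS for ms in turn_latencies_ms),
    )
```

For example, `check_latency(250, [400, 450])` meets both targets, while `check_latency(300, [400, 600])` misses both, since the targets are strict upper bounds.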

Languages supported

TTS: ~30 languages with native voices, ~50 with adequate voices. STT: 100+ languages, with quality tiers (Tier 1 → Tier 3 by training-data availability).

The supported language list is in /docs/doc-api-reference#audio-languages.

Video Generation BYOK Setup