Everything you create, in one place
Stop juggling a separate tool - and a separate bill - for every kind of content. The Generation Suite turns text into images, video, music, spoken voice, transcripts and translations through a single interface, with the same predictable, itemized pricing behind all of it.
Type a prompt. Pick a modality. Get the asset. See what it costs - to the cent - before it ever reaches your bill.
Start creating · See plans & pricing
One toolkit, every modality
Most platforms make you stitch together a different vendor for pictures, another for voiceovers, another for subtitles. Each with its own login, its own quirks, its own opaque invoice. The Generation Suite collapses all of that into one surface that speaks every modality:
- Image - turn a written idea into artwork, product shots, illustrations or concept frames.
- Video - generate clips from text, then compose, upscale and stitch them into a finished sequence.
- Music - produce original background tracks and scores from a simple brief.
- Voice (TTS) - read any script aloud in a natural voice, across all 29 of our supported languages.
- Transcription (STT) - convert speech and recordings into clean text, with speaker separation.
- Translation - move text between languages in real time or in batch.
Because it's one platform, the pieces fit together: a transcript can feed a translation, a script can become a voiceover, an image can become a video frame - without exporting, re-uploading or re-learning a new tool.
Voice, transcription, and live translation go even deeper on their own page - see Voice, Speech & Realtime Translation for TTS, diarized STT, and the live walkie-talkie translator.
Image: from a prompt to a polished frame
Describe what you want and get original artwork, product imagery, illustrations, or concept frames in seconds. Beyond pure text-to-image, the suite handles the edits real work demands:
- Inpaint - mask a spot and regenerate just that region, leaving the rest of the image untouched.
- Composite - combine several source images into one - character transfer, multi-reference scenes, collages.
- Image-to-image - transform an existing image in a new style or direction.
Every operation shows its price per image before you commit, and capability-first routing quietly sends each request to a model that actually supports the kind of edit you're asking for.
Video: from a script to a finished cut
Video isn't a single button - it's a small production line, and the suite runs the whole thing. Generate clips from a prompt or an image, then upscale frames, mix an audio track, and stitch shots into a finished sequence. Some models emit a synchronized audio track natively; others let you layer one on. Composition jobs lean on real media pipelines - the same kind of frame alignment, upscaling, and multi-track merging a video editor expects - and each step is priced and shown separately, so a complex render is never a single mystery "rendering fee."
Music: original tracks from a one-line brief
Give the suite a brief - mood, tempo, length - and get back an original, royalty-clear background track or score. Use it for a video, a podcast intro, a game loop, or a campaign, without licensing a stock library. Music is priced by the second of audio it produces, so a 15-second sting and a three-minute score are each billed for exactly the audio they contain.
Voice, transcription & translation
The suite speaks and listens, too. Text-to-speech narrates any script in a natural voice across all 29 supported languages, streaming as it's generated so playback starts before the clip finishes. Speech-to-text turns meetings, interviews, and voice notes into clean text with speaker labels, billed by the minute. Translation moves content between languages in real time or in bulk. And for live conversation, the Live Translator interprets your speech into another language as you talk - the full story is on the Voice & Realtime page.
Pay for what you make, not for seats
Every modality is metered by the unit that actually matters - an image is priced per image, video per minute and resolution tier, music and voice per second or character, transcription per minute. There are no per-seat subscriptions standing between you and the work.
And every single asset writes one clear billing record. When a job has several moving parts - say a video that needs an image upscale, an audio encode and a final merge - you see each of those as a separate line item, not buried in a flat "rendering fee." What you spent is never a mystery, and never reconstructed after the fact: the price is captured at the moment you generate. Local compute - like an ffmpeg composition pass - is metered too, by the second, so even the parts that don't call a model are accounted for.
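For readers who think in data structures, the itemized record described above can be pictured roughly like this. It is a minimal sketch, not the suite's actual schema: the class names, field names, and every price shown are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from decimal import Decimal


@dataclass(frozen=True)
class LineItem:
    """One billable step. Frozen, so the price captured at generation time cannot change later."""
    step: str             # e.g. "image upscale" (hypothetical label)
    unit: str             # the metering unit: "image", "second", "minute", ...
    quantity: Decimal
    unit_price: Decimal   # illustrative price, captured when the step runs

    @property
    def cost(self) -> Decimal:
        # Round each line to the cent.
        return (self.quantity * self.unit_price).quantize(Decimal("0.01"))


@dataclass
class BillingRecord:
    """One record per asset, holding a separate line item for each step."""
    asset: str
    items: list[LineItem] = field(default_factory=list)

    def add(self, step: str, unit: str, quantity: str, unit_price: str) -> None:
        self.items.append(LineItem(step, unit, Decimal(quantity), Decimal(unit_price)))

    @property
    def total(self) -> Decimal:
        return sum((item.cost for item in self.items), Decimal("0.00"))


# A video job with three moving parts; all prices are made up for this example.
record = BillingRecord("promo-video")
record.add("image upscale", "image", "1", "0.04")
record.add("audio encode", "second", "30", "0.001")
record.add("final merge (local ffmpeg pass)", "second", "12", "0.002")

for item in record.items:
    print(f"{item.step}: {item.quantity} {item.unit} -> ${item.cost}")
print(f"total: ${record.total}")
```

The point of the sketch is the shape: each step is its own line with its own unit and captured price, and the total is just the sum of lines you can already see.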
Made to feel instant
You shouldn't have to stare at a spinner. Voices stream as they're spoken, so playback can begin before the full clip is finished. Need a specific format - mp4 for video, wav for audio? Ask for it and the suite transcodes on the way out, with any conversion cost passed straight through so there's nothing hidden.
Finished assets are uploaded straight to a fast global CDN and handed back as a ready-to-use link - drop them into a post, a workflow, or a download with no extra step.
Built for real production, not just demos
The Generation Suite isn't a toy prompt box. Voice support covers all 29 languages we ship, not just English - so a campaign, a course or a chatbot can sound native everywhere you operate. Composition jobs lean on real media pipelines to align multi-track audio, upscale frames and merge sequences.
And every clip, track and transcript you make plugs directly into the rest of the platform: drop a generation step inside a no-code Workflow to produce assets on a schedule, or call it inline from Chat when an idea strikes mid-conversation. Create once, reuse everywhere.
What you can do with it
- Image generation & editing - describe what you want and get original artwork, product imagery or concept frames in seconds, then inpaint, composite, or restyle. The cost is shown per image before you commit.
- Video generation & composition - turn a script into video, then upscale, mix audio and stitch shots into a finished cut. Each step is priced and shown separately.
- Music generation - generate original, royalty-clear background tracks and scores from a one-line brief, priced by the second.
- Voice (Text-to-Speech) - give any script a natural spoken voice in all 29 supported languages. Audio streams as it's generated.
- Transcription (Speech-to-Text) - turn meetings, interviews and voice notes into accurate text, with speaker separation, billed by the minute.
- Translation, batch or live - move content between languages in real time or in bulk - including live speech translation - with the same transparent, per-use pricing as everything else.