product 26 May 2026 · 5 min read

Image, Video, and Music Generation as First-Class Modes

ToRun Team

Most AI platforms bolt media generation on as an afterthought. A separate tab, a different API key, a billing system that has nothing to do with your chat credits. ToRun takes a different position: image, video, and music generation are modes — the same kind of first-class product surface as Chat or Code — and they run through the exact same capability routing and billing pipeline as every other call.

This post explains the mechanics and why they matter.

How the routing pipeline handles media

Every request in ToRun starts with a Mode: Chat, Image, Video, Music, Research, and so on. Each Mode declares the Capabilities it requires. Image generation requires the image-gen capability. Video generation requires video-gen. Music and audio synthesis require audio-out. A request that combines chat with image output — say, asking for a concept image alongside an explanation — requires both text and image-gen simultaneously.
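As a rough illustration of that declaration step, the sketch below maps modes to capability sets and takes their union for a combined request. The capability names come from the paragraph above; the `MODE_CAPABILITIES` table, the mode keys, and the `required_capabilities` helper are hypothetical names chosen for this example, and treating Chat as requiring only `text` is an assumption.

```python
from enum import Enum


class Capability(str, Enum):
    TEXT = "text"
    IMAGE_GEN = "image-gen"
    VIDEO_GEN = "video-gen"
    AUDIO_OUT = "audio-out"


# Hypothetical mode table: only the capability names are taken from the post.
MODE_CAPABILITIES = {
    "chat": frozenset({Capability.TEXT}),
    "image": frozenset({Capability.IMAGE_GEN}),
    "video": frozenset({Capability.VIDEO_GEN}),
    "music": frozenset({Capability.AUDIO_OUT}),
}


def required_capabilities(*modes: str) -> frozenset:
    """Return the union of the capabilities declared by every mode in a request."""
    required = frozenset()
    for mode in modes:
        required |= MODE_CAPABILITIES[mode]
    return required


# A chat request that also asks for a concept image needs both capabilities.
combined = required_capabilities("chat", "image")
```

Modelling the requirement as a set keeps the combined case simple: adding a modality to a request only ever adds elements to the set.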

The routing pipeline takes that capability set, filters the model catalog to providers that satisfy all of them, then ranks candidates by price and quality. The selected model is not hardcoded anywhere in your workflow; if a provider is down or has degraded latency, the fallback chain picks the next viable model automatically. You do not write provider-specific code. You do not manage separate API keys per modality (unless you want to, via BYOK).
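The filter, rank, and fallback steps could be sketched roughly as follows. This is a minimal illustration, not ToRun's implementation: `ModelEntry`, its fields, the `prefer` switch, and the way price and quality are combined into a sort key are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelEntry:
    name: str
    capabilities: frozenset
    price_usd: float      # price per pricing unit (hypothetical field)
    quality: float        # higher is better (hypothetical score)
    healthy: bool = True  # False when the provider is down or degraded


def route(catalog, required, prefer="price"):
    """Keep healthy models that satisfy every required capability, best first."""
    candidates = [m for m in catalog if m.healthy and required <= m.capabilities]
    if prefer == "quality":
        return sorted(candidates, key=lambda m: (-m.quality, m.price_usd))
    return sorted(candidates, key=lambda m: (m.price_usd, -m.quality))


def call_with_fallback(chain, invoke):
    """Try each model in ranked order and move on when a call fails."""
    errors = []
    for model in chain:
        try:
            return model.name, invoke(model)
        except Exception as exc:  # a real system would catch narrower errors
            errors.append((model.name, exc))
    raise RuntimeError(f"no viable model: {errors}")
```

The caller passes only the capability set and a callable; no provider name appears in the calling code, which is the property the paragraph above describes.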

This matters most when you compose modalities. A workflow node can call a text model to write a scene description, pipe that output to an image-generation model, then pass the image URL to a video extension model — all within the same DAG, all billed to the same ledger.

Pricing units that actually match what providers charge

Billing for media has historically been opaque because the units do not map to tokens. ToRun uses twelve canonical pricing units to cover every modality precisely. The media-specific units include:

Per image — for diffusion and text-to-image models; the price captures the generation, not the pixels.
Per second of audio — for text-to-speech, speech synthesis, and music generation.
Per second of video — for video generation models; duration and resolution variant are captured in the pricing row.
Per character TTS — for character-billed speech synthesis APIs.

Alongside these, the standard token units (per million input tokens, per million output tokens, per million cached input tokens) handle the text models that participate in a media pipeline.

Every call writes exactly one BillingRecord. That record freezes the price at execution time: the pricing unit, the rate per unit, the quantity consumed, the USD amount after ToRun's margin, and the FX snapshot if your account currency is not USD. If a provider changes their rates tomorrow, your invoice from today is still fully recomputable from the ledger. You can export the raw records and verify every cent offline.
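To make the recomputation concrete, here is a minimal sketch of what such a frozen record and an offline check might look like. The field names, the separate `margin` field, the multiplicative margin formula, and the four-decimal rounding are assumptions for illustration; the post only states which quantities the record captures.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)  # frozen: a written record cannot be mutated afterwards
class BillingRecord:
    execution_id: str
    pricing_unit: str                  # e.g. "per_image", "per_second_video"
    rate_per_unit_usd: Decimal         # rate captured at execution time
    quantity: Decimal                  # units consumed by this call
    margin: Decimal                    # hypothetical: margin as a fraction
    amount_usd: Decimal                # stored amount after margin
    fx_rate: Optional[Decimal] = None  # USD to account currency snapshot


def recompute_usd(rec: BillingRecord) -> Decimal:
    """Rebuild the USD amount from the frozen inputs (assumed formula)."""
    raw = rec.rate_per_unit_usd * rec.quantity * (1 + rec.margin)
    return raw.quantize(Decimal("0.0001"))


def verify(rec: BillingRecord) -> bool:
    """True when the stored amount matches the recomputed one."""
    return recompute_usd(rec) == rec.amount_usd


def account_amount(rec: BillingRecord) -> Decimal:
    """Convert to the account currency using the stored FX snapshot, if any."""
    return rec.amount_usd if rec.fx_rate is None else rec.amount_usd * rec.fx_rate
```

Because every input to the calculation lives on the record itself, the check needs no lookup against current provider rates.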

Composing media generation with chat and workflows

Where this architecture pays off most is in workflows.

A workflow node is simply a step that declares what it needs: a model with certain capabilities, an input schema, and an output schema. A music node might accept a text prompt and a duration parameter and return a Bunny CDN URL for the generated audio file. An image node might accept a prompt, an aspect ratio, and a style hint and return an image URL. These nodes compose like any other step.
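A node of this shape might be modelled as below. This is an illustrative sketch only: the `Node` class, the dict-of-types schema format, the `execute` helper, and the placeholder URL are invented for the example, and the `run` callable stands in for a routed model call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Node:
    name: str
    capabilities: frozenset      # what the routed model must support
    input_schema: dict           # field name -> expected Python type
    output_schema: dict
    run: Callable[[dict], dict]  # stand-in for the routed model call


def check(schema: dict, payload: dict, where: str) -> None:
    """Raise TypeError if a declared field is missing or has the wrong type."""
    for field, expected in schema.items():
        if not isinstance(payload.get(field), expected):
            raise TypeError(f"{where}: field {field!r} must be {expected.__name__}")


def execute(node: Node, payload: dict) -> dict:
    """Validate the input, run the node, then validate the output."""
    check(node.input_schema, payload, f"{node.name} input")
    result = node.run(payload)
    check(node.output_schema, result, f"{node.name} output")
    return result


# Hypothetical image node: prompt, aspect ratio, and style hint in, URL out.
image_node = Node(
    name="hero_image",
    capabilities=frozenset({"image-gen"}),
    input_schema={"prompt": str, "aspect_ratio": str, "style": str},
    output_schema={"image_url": str},
    run=lambda p: {
        "image_url": "https://cdn.example.invalid/"
        + p["aspect_ratio"].replace(":", "x")
        + ".png"
    },
)
```

Since each node's output is a plain dictionary that matches its declared schema, the result of one `execute` call can be passed as input to the next node whose input schema it satisfies.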

You can build a single workflow that:

Takes a product brief as input.
Calls a text model to generate five tagline variants.
Calls an image model to render a hero image for each variant.
Calls a text-to-speech model to narrate the chosen tagline.
Returns a structured artifact with copy, image URLs, and audio.

The billing for that entire run is captured as one BillingRecord per model call, all tied to the same workflow execution ID. If each hero image is rendered by a separate call, that is seven records: one for the taglines, five for the images, and one for the narration. You can see the cost breakdown by step. Runners use quality-tier models by default because the output is what matters; you can override per node.
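A per-step breakdown of this kind amounts to grouping ledger rows by execution ID and step. The sketch below assumes exported records are dictionaries with `execution_id`, `step`, and `amount_usd` keys; those key names and the sample amounts are invented for illustration, and the sample is a shortened set of rows rather than a full run.

```python
from collections import defaultdict
from decimal import Decimal

# Invented sample rows; real exports may use different field names.
records = [
    {"execution_id": "run-42", "step": "taglines", "amount_usd": Decimal("0.0030")},
    {"execution_id": "run-42", "step": "hero_image", "amount_usd": Decimal("0.0400")},
    {"execution_id": "run-42", "step": "hero_image", "amount_usd": Decimal("0.0400")},
    {"execution_id": "run-42", "step": "narration", "amount_usd": Decimal("0.0150")},
    {"execution_id": "run-43", "step": "taglines", "amount_usd": Decimal("0.0030")},
]


def cost_by_step(rows, execution_id):
    """Sum the USD amounts of one workflow execution, grouped by step."""
    totals = defaultdict(Decimal)  # Decimal() is zero, so sums start at 0
    for row in rows:
        if row["execution_id"] == execution_id:
            totals[row["step"]] += row["amount_usd"]
    return dict(totals)


breakdown = cost_by_step(records, "run-42")
run_total = sum(breakdown.values(), Decimal("0"))
```

Using `Decimal` rather than floats keeps the sums exact, which matters when the goal is to reconcile an invoice against the ledger.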

The same logic applies in Chat. If your conversation invokes an image generation call — either because you asked directly or because an attached workflow triggered one — that call runs through the routing pipeline and appears as a line item in your session cost view. Nothing is hidden in a flat rate. Nothing is free and then suddenly charged.

What is live now

Image generation, video generation, and music/audio generation are all available in Phase 1. They route through the same provider catalog as text, are billed through the same ledger, and are composable in the same workflow builder. The specific models available depend on which providers are currently in the catalog and which capabilities they advertise; the routing layer surfaces that in real time.

Pricing is at /pricing. BYOK support is available if you want to route media calls through your own provider account at reduced platform fees.
