vibevoice-1.5b

This documentation is valid for the following model: microsoft/vibevoice-1.5b

Designed to produce rich, multi-speaker conversations from text, the model is well-suited for podcasts and other long-form audio content.

Set up your API key

If you don’t have an API key for the AI/ML API yet, see our Quickstart guide.
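Rather than pasting the key into source code, it can be read from the environment. The following is a minimal sketch, not an official helper: the function name `auth_headers` and the environment variable name `AIMLAPI_API_KEY` are assumptions chosen for illustration. Only the `Authorization: Bearer ...` header format comes from the code example on this page.

```python
import os

def auth_headers(env_var="AIMLAPI_API_KEY"):
    """Build the Authorization header from an environment variable.

    The variable name is a hypothetical choice; use whatever name you export.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return {"Authorization": f"Bearer {api_key}"}
```

The returned dictionary can be passed as `headers=auth_headers()` to `requests.post` in place of the hard-coded placeholder.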

Code Example

import os
import requests

def main():
    url = "https://api.aimlapi.com/v1/tts"
    headers = {
        "Authorization": "Bearer <YOUR_AIMLAPI_KEY>",
    }
    payload = {
        "model": "microsoft/vibevoice-1.5b",
        "script": "Speaker 1: Wow, what's happening, Alice?\nSpeaker 2: Oh, just the usual… a full-blown AI revolution. Nothing to worry about.",
        "speakers": [
            {   "preset": "Frank [EN]"   },
            {   "preset": "Alice [EN]"   }
        ]
    }

    try:
        response = requests.post(url, headers=headers, json=payload)
        response.raise_for_status()
        
        response_data = response.json()
        audio_url = response_data["audio"]["url"]
        file_name = response_data["audio"]["file_name"]
        
        audio_response = requests.get(audio_url, stream=True)
        audio_response.raise_for_status()
        
        # Choose where to save the audio:
        # dist = os.path.join(os.path.dirname(__file__), file_name)  # as a .py file: keep the file name returned by the API
        dist = "audio.wav"  # in a Jupyter Notebook, where __file__ is not defined

        with open(dist, "wb") as write_stream:
            for chunk in audio_response.iter_content(chunk_size=8192):
                if chunk:
                    write_stream.write(chunk)

        print("Audio saved to:", dist)
        print(f"Duration: {response_data['duration']} seconds")
        print(f"Sample rate: {response_data['sample_rate']} Hz")
        
    except requests.exceptions.RequestException as e:
        print(f"Error making request: {e}")
    except Exception as e:
        print(f"Error: {e}")

if __name__ == "__main__":
    main()

Response

Audio saved to: audio.wav
Duration: 8.4 seconds
Sample rate: 24000 Hz

API Schema

POST https://api.aimlapi.com/v1/tts

Authorizations

Body

model · enum · Required

Possible values: microsoft/vibevoice-1.5b

script · string · min: 1 · max: 5000 · Required

The script to convert to speech. Can be formatted with "Speaker X:" prefixes for multi-speaker dialogues.
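As an illustration of the "Speaker X:" format, here is a small sketch that assembles a script from a list of turns and checks it against the length limits listed above. The helper name `build_script` is hypothetical, and the sketch assumes that the 1 to 5000 limit counts characters and that each turn sits on its own line, as in the code example on this page.

```python
MAX_SCRIPT_CHARS = 5000  # upper bound from the schema; assumed to count characters

def build_script(turns):
    """Join (speaker_number, text) pairs into 'Speaker N: text' lines."""
    lines = []
    for speaker, text in turns:
        # Collapse internal whitespace so each turn stays on a single line.
        text = " ".join(text.split())
        if not text:
            raise ValueError("each turn needs non-empty text")
        lines.append(f"Speaker {speaker}: {text}")
    script = "\n".join(lines)
    if not 1 <= len(script) <= MAX_SCRIPT_CHARS:
        raise ValueError(
            f"script must be 1-{MAX_SCRIPT_CHARS} characters, got {len(script)}"
        )
    return script

script = build_script([
    (1, "Wow, what's happening, Alice?"),
    (2, "Oh, just the usual: a full-blown AI revolution."),
])
print(script)
```

Validating the length locally avoids sending a request that the API would reject for exceeding the limit.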

seed · integer · Optional

If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed.

cfg_scale · number · min: 0.1 · max: 2 · Optional

The CFG (Classifier-Free Guidance) scale controls how closely the model sticks to your prompt.

Default: 1.3
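The optional `seed` and `cfg_scale` fields can be added to the request body alongside the required ones. The sketch below is one possible way to do that: the helper name `build_payload` is hypothetical, and the `speakers` field is taken from the code example on this page rather than from the schema. Optional fields are left out when not given, so the documented default for `cfg_scale` applies.

```python
CFG_SCALE_MIN, CFG_SCALE_MAX = 0.1, 2.0  # range from the schema

def build_payload(script, speakers, seed=None, cfg_scale=None):
    """Assemble a request body, validating the optional fields locally."""
    payload = {
        "model": "microsoft/vibevoice-1.5b",
        "script": script,
        "speakers": speakers,
    }
    if seed is not None:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(seed, bool) or not isinstance(seed, int):
            raise TypeError("seed must be an integer")
        payload["seed"] = seed
    if cfg_scale is not None:
        if not CFG_SCALE_MIN <= cfg_scale <= CFG_SCALE_MAX:
            raise ValueError(
                f"cfg_scale must be between {CFG_SCALE_MIN} and {CFG_SCALE_MAX}"
            )
        payload["cfg_scale"] = cfg_scale
    return payload

payload = build_payload(
    "Speaker 1: Hello!\nSpeaker 2: Hi there.",
    [{"preset": "Frank [EN]"}, {"preset": "Alice [EN]"}],
    seed=42,
    cfg_scale=1.5,
)
print(payload)
```

Reusing the same `seed` and parameters should make repeated requests more likely to return the same audio, although determinism is not guaranteed.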

Responses

201 Success

application/json

{
  "metadata": {
    "transaction_key": "text",
    "request_id": "text",
    "sha256": "text",
    "created": "2025-09-16T15:07:37.094Z",
    "duration": 1,
    "channels": 1,
    "models": [
      "text"
    ],
    "model_info": {
      "ANY_ADDITIONAL_PROPERTY": {
        "name": "text",
        "version": "text",
        "arch": "text"
      }
    }
  }
}
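The code example on this page reads `audio.url`, `audio.file_name`, `duration`, and `sample_rate` from the response body, while the sample above shows a `metadata` object. The following sketch reads those fields defensively, so a missing key yields `None` instead of raising `KeyError`. The helper name `summarize_response` is hypothetical, which fields the API actually returns is not confirmed here, and treating `metadata.duration` as a fallback for a top-level `duration` is an assumption.

```python
def summarize_response(data):
    """Collect the fields used on this page, tolerating missing keys."""
    audio = data.get("audio") or {}
    metadata = data.get("metadata") or {}
    return {
        "url": audio.get("url"),
        "file_name": audio.get("file_name"),
        # Assumption: fall back to metadata.duration if there is no top-level value.
        "duration": data.get("duration", metadata.get("duration")),
        "sample_rate": data.get("sample_rate"),
        "request_id": metadata.get("request_id"),
    }

example = {
    "audio": {"url": "https://example.com/audio.wav", "file_name": "audio.wav"},
    "duration": 8.4,
    "sample_rate": 24000,
}
print(summarize_response(example))
```

Checking `summary["url"]` for `None` before downloading gives a clearer failure than an unhandled `KeyError`.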
