whisper-large


This documentation is valid for the following list of our models:

  • #g1_whisper-large

Note:

Previously, our STT models operated via a single API call to POST https://api.aimlapi.com/v1/stt. You can view the old API schema here.

Now, we are switching to a new two-step process:

  • POST https://api.aimlapi.com/v1/stt/create – Creates and submits a speech-to-text processing task to the server. This method accepts the same parameters as the old version but returns a generation_id instead of the final transcript.

  • GET https://api.aimlapi.com/v1/stt/{generation_id} – Retrieves the generated transcript from the server using the generation_id obtained from the previous API call.

This approach helps prevent generation failures due to timeouts. We've prepared a couple of examples below to make the transition to the new STT API easier for you.
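For orientation, here is a minimal sketch of the new two-step flow, with polling and error handling omitted (complete, runnable examples are provided in the Quick Code Examples section below):

import requests

base_url = "https://api.aimlapi.com/v1"
api_key = "<YOUR_AIMLAPI_KEY>"
headers = {"Authorization": f"Bearer {api_key}"}

# Step 1: create the task; the response contains a generation_id, not a transcript
task = requests.post(
    f"{base_url}/stt/create",
    json={
        "model": "#g1_whisper-large",
        "url": "https://audio-samples.github.io/samples/mp3/blizzard_primed/sample-0.mp3",
    },
    headers=headers,
).json()

# Step 2: fetch the result by generation_id (in practice, poll until
# the status is no longer "waiting" or "active")
result = requests.get(f"{base_url}/stt/{task['generation_id']}", headers=headers).json()
print(result)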

Model Overview

The Whisper models are intended primarily for AI research, with a focus on model robustness, generalization, and bias; they are also effective for English speech recognition. Using Whisper models to transcribe recordings made without consent, or in high-risk decision-making contexts, is strongly discouraged due to potential inaccuracies and ethical concerns.

The models were trained on 680,000 hours of audio and corresponding transcripts collected from the internet: 65% English audio with English transcripts, 18% non-English audio with English transcripts, and 17% non-English audio with matching non-English transcripts, covering 98 languages in total.

Whisper models use per-second billing: the cost of a transcription is based on the duration of the input audio file, not on the processing time.
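To make the billing model concrete, here is a tiny sketch; the per-second rate is a hypothetical placeholder, not an actual AI/ML API price:

# Hypothetical example: estimating the cost of a transcription job.
# PRICE_PER_SECOND is a made-up placeholder; check the official pricing page for real rates.
PRICE_PER_SECOND = 0.0001       # hypothetical USD per second of input audio
audio_duration_seconds = 187    # duration of the input file (3 min 7 s)

# Billing depends only on the audio duration, not on how long processing takes
cost = audio_duration_seconds * PRICE_PER_SECOND
print(f"Estimated cost: ${cost:.4f}")   # Estimated cost: $0.0187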

Set Up Your API Key

If you don’t have an API key for the AI/ML API yet, feel free to use our Quickstart guide.

API Schema

Creating and sending a speech-to-text conversion task to the server

POST /v1/stt/create HTTP/1.1
Host: api.aimlapi.com
Authorization: Bearer <YOUR_AIMLAPI_KEY>
Content-Type: application/json

Body parameters:

  • model (enum, required). Possible value: #g1_whisper-large.
  • custom_intent (string or string[], optional)
  • custom_topic (string or string[], optional)
  • custom_intent_mode (string enum, optional; e.g. "strict")
  • custom_topic_mode (string enum, optional; e.g. "strict")
  • detect_language (boolean, optional)
  • detect_entities (boolean, optional)
  • detect_topics (boolean, optional)
  • diarize (boolean, optional)
  • dictation (boolean, optional)
  • diarize_version (string, optional)
  • extra (string, optional)
  • filler_words (boolean, optional)
  • intents (boolean, optional)
  • keywords (string, optional)
  • language (string, optional)
  • measurements (boolean, optional)
  • multi_channel (boolean, optional)
  • numerals (boolean, optional)
  • paragraphs (boolean, optional)
  • profanity_filter (boolean, optional)
  • punctuate (boolean, optional)
  • search (string, optional)
  • sentiment (boolean, optional)
  • smart_format (boolean, optional)
  • summarize (string, optional)
  • tag (string[], optional)
  • topics (boolean, optional)
  • utterances (boolean, optional)
  • utt_split (number, optional)

Response: 201 Success (application/json)

{
  "generation_id": "123e4567-e89b-12d3-a456-426614174000"
}

Requesting the result of the task from the server using the generation_id

GET /v1/stt/{generation_id} HTTP/1.1
Host: api.aimlapi.com
Authorization: Bearer <YOUR_AIMLAPI_KEY>

Path parameters:

  • generation_id (string, required)

Response: 201 Success (application/json)

{
  "status": "text",
  "result": {
    "metadata": {
      "transaction_key": "text",
      "request_id": "text",
      "sha256": "text",
      "created": "2025-05-09T22:36:35.006Z",
      "duration": 1,
      "channels": 1,
      "models": ["text"],
      "model_info": {
        "ANY_ADDITIONAL_PROPERTY": {
          "name": "text",
          "version": "text",
          "arch": "text"
        }
      }
    }
  }
}
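As an illustration of the optional parameters above, the sketch below submits a task with a language hint, punctuation, and speaker diarization enabled. The parameter names come from the schema above, but which of them actually affect the output for this particular model is an assumption worth verifying against your own results:

import requests

api_key = "<YOUR_AIMLAPI_KEY>"

response = requests.post(
    "https://api.aimlapi.com/v1/stt/create",
    headers={"Authorization": f"Bearer {api_key}"},
    json={
        "model": "#g1_whisper-large",
        "url": "https://audio-samples.github.io/samples/mp3/blizzard_primed/sample-0.mp3",
        "language": "en",    # hint for the input language
        "punctuate": True,   # add punctuation to the transcript
        "diarize": True,     # attempt to separate speakers
    },
)
print(response.json())  # e.g. {'generation_id': '...'}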

Quick Code Examples

Let's use the #g1_whisper-large model to transcribe a sample speech fragment (the sample-0.mp3 file referenced by URL in the first example):

Example #1: Processing a Speech Audio File via URL

import time
import requests

base_url = "https://api.aimlapi.com/v1"
# Insert your AIML API Key instead of <YOUR_AIMLAPI_KEY>:
api_key = "<YOUR_AIMLAPI_KEY>"

# Creating and sending a speech-to-text conversion task to the server
def create_stt():
    url = f"{base_url}/stt/create"
    headers = {
        "Authorization": f"Bearer {api_key}",
    }
    data = {
        "model": "#g1_whisper-large",
        "url": "https://audio-samples.github.io/samples/mp3/blizzard_primed/sample-0.mp3"
    }

    response = requests.post(url, json=data, headers=headers)

    if response.status_code >= 400:
        print(f"Error: {response.status_code} - {response.text}")
    else:
        response_data = response.json()
        print(response_data)
        return response_data

# Requesting the result of the task from the server using the generation_id
def get_stt(gen_id):
    url = f"{base_url}/stt/{gen_id}"
    headers = {
        "Authorization": f"Bearer {api_key}",
    }
    response = requests.get(url, headers=headers)
    return response.json()

# First, start the generation, then repeatedly request the result from the server every 10 seconds.
def main():
    stt_response = create_stt()
    if not stt_response:
        return None
    gen_id = stt_response.get("generation_id")

    if gen_id:
        start_time = time.time()
        timeout = 600  # stop polling after 10 minutes

        while time.time() - start_time < timeout:
            response_data = get_stt(gen_id)

            if response_data is None:
                print("Error: No response from API")
                break

            status = response_data.get("status")

            if status in ("waiting", "active"):
                print("Still waiting... Checking again in 10 seconds.")
                time.sleep(10)
            else:
                transcript = response_data["result"]["results"]["channels"][0]["alternatives"][0]["transcript"]
                print("Processing complete:\n", transcript)
                return response_data

        print("Timeout reached. Stopping.")
        return None

if __name__ == "__main__":
    main()
Response
{'generation_id': 'e3d46bba-7562-44a9-b440-504d940342a3'}
Processing complete:
 he doesn't belong to you and i don't see how you have anything to do with what is be his power yet he's he personified from this stage to you be fire

Example #2: Processing a Speech Audio File via File Path

import time
import requests

base_url = "https://api.aimlapi.com/v1"
# Insert your AIML API Key instead of <YOUR_AIMLAPI_KEY>:
api_key = "<YOUR_AIMLAPI_KEY>"

# Creating and sending a speech-to-text conversion task to the server
def create_stt():
    url = f"{base_url}/stt/create"
    headers = {
        "Authorization": f"Bearer {api_key}",
    }
    data = {
        "model": "#g1_whisper-large",
    }
    # Upload a local audio file as multipart form data
    with open("stt-sample.mp3", "rb") as file:
        files = {"audio": ("sample.mp3", file, "audio/mpeg")}
        response = requests.post(url, data=data, headers=headers, files=files)

    if response.status_code >= 400:
        print(f"Error: {response.status_code} - {response.text}")
    else:
        response_data = response.json()
        print(response_data)
        return response_data

# Requesting the result of the task from the server using the generation_id
def get_stt(gen_id):
    url = f"{base_url}/stt/{gen_id}"
    headers = {
        "Authorization": f"Bearer {api_key}",
    }
    response = requests.get(url, headers=headers)
    return response.json()

# First, start the generation, then repeatedly request the result from the server every 10 seconds.
def main():
    stt_response = create_stt()
    if not stt_response:
        return None
    gen_id = stt_response.get("generation_id")

    if gen_id:
        start_time = time.time()
        timeout = 600  # stop polling after 10 minutes

        while time.time() - start_time < timeout:
            response_data = get_stt(gen_id)

            if response_data is None:
                print("Error: No response from API")
                break

            status = response_data.get("status")

            if status in ("waiting", "active"):
                print("Still waiting... Checking again in 10 seconds.")
                time.sleep(10)
            else:
                transcript = response_data["result"]["results"]["channels"][0]["alternatives"][0]["transcript"]
                print("Processing complete:\n", transcript)
                return response_data

        print("Timeout reached. Stopping.")
        return None

if __name__ == "__main__":
    main()
Response
{'generation_id': 'dd412e9d-044c-43ae-b97b-e920755074d5'}
Processing complete:
 he doesn't belong to you and i don't see how you have anything to do with what is be his power yet he's he personified from this stage to you be fire
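Both examples index straight into result.results.channels[0].alternatives[0].transcript, which raises an exception if anything in that chain is missing. A slightly more defensive helper sketch is shown below; treating every status other than "waiting" and "active" as terminal is an assumption carried over from the examples above:

def extract_transcript(response_data: dict):
    """Return the transcript from a GET /v1/stt/{generation_id} response,
    or None if the task is still running or the shape is unexpected."""
    if response_data.get("status") in ("waiting", "active"):
        return None  # still processing; poll again later
    try:
        channels = response_data["result"]["results"]["channels"]
        return channels[0]["alternatives"][0]["transcript"]
    except (KeyError, IndexError, TypeError):
        return None  # unexpected response shape; inspect response_data manually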