This documentation is valid for the following models:
#g1_whisper-large
Note:
Previously, our STT models operated via a single API call to POST https://api.aimlapi.com/v1/stt. You can view the API schema here.
Now, we are switching to a new two-step process:
POST https://api.aimlapi.com/v1/stt/create – Creates and submits a speech-to-text processing task to the server. This method accepts the same parameters as the old version but returns a generation_id instead of the final transcript.
GET https://api.aimlapi.com/v1/stt/{generation_id} – Retrieves the generated transcript from the server using the generation_id obtained from the previous API call.
This approach helps prevent generation failures due to timeouts.
We've prepared a couple of examples below to make the transition to the new STT API easier for you.
Model Overview
The Whisper models are intended primarily for AI research, with a focus on model robustness, generalization, and bias; they are also effective for English speech recognition. Using Whisper models to transcribe non-consensual recordings or in high-risk decision-making contexts is strongly discouraged due to potential inaccuracies and ethical concerns.
The models are trained using 680,000 hours of audio and corresponding transcripts from the internet, with 65% being English audio and transcripts, 18% non-English audio with English transcripts, and 17% non-English audio with matching non-English transcripts, covering 98 languages in total.
Whisper models use per-second billing. The cost of audio transcription is based on the number of seconds in the input audio file, not the processing time. For example, a 120-second recording is billed as 120 seconds even if transcription itself finishes in just a few seconds.
Set Up Your API Key
If you don’t have an API key for the AI/ML API yet, feel free to use our Quickstart guide.
API Schema
Creating and sending a speech-to-text conversion task to the server
Requesting the result of the task from the server using the generation_id
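In outline, the two calls look like this. The sketch below uses the requests library; the endpoint paths, the generation_id field, the "waiting"/"active" status values, and the transcript location are taken from the full examples that follow, while the variable names are illustrative:

import requests

api_key = "<YOUR_AIMLAPI_KEY>"
headers = {"Authorization": f"Bearer {api_key}"}

# Step 1: create the task. The body accepts the same parameters as the old
# single-call endpoint: the model, an audio source, and the optional
# parameters listed at the bottom of this page.
create_resp = requests.post(
    "https://api.aimlapi.com/v1/stt/create",
    json={
        "model": "#g1_whisper-large",
        "url": "https://audio-samples.github.io/samples/mp3/blizzard_primed/sample-0.mp3",
    },
    headers=headers,
)
gen_id = create_resp.json()["generation_id"]

# Step 2: request the result. While the task is queued or running, the response
# carries status "waiting" or "active"; once finished, the transcript is found at
# result.results.channels[0].alternatives[0].transcript.
result_resp = requests.get(f"https://api.aimlapi.com/v1/stt/{gen_id}", headers=headers)
print(result_resp.json())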
Quick Code Examples
Let's use the #g1_whisper-large model to transcribe the following audio fragment:
Example #1: Processing a Speech Audio File via URL
import time
import requests

base_url = "https://api.aimlapi.com/v1"

# Insert your AIML API Key instead of <YOUR_AIMLAPI_KEY>:
api_key = "<YOUR_AIMLAPI_KEY>"


# Creating and sending a speech-to-text conversion task to the server
def create_stt():
    url = f"{base_url}/stt/create"
    headers = {
        "Authorization": f"Bearer {api_key}",
    }
    data = {
        "model": "#g1_whisper-large",
        "url": "https://audio-samples.github.io/samples/mp3/blizzard_primed/sample-0.mp3",
    }

    response = requests.post(url, json=data, headers=headers)

    if response.status_code >= 400:
        print(f"Error: {response.status_code} - {response.text}")
        return None

    response_data = response.json()
    print(response_data)
    return response_data


# Requesting the result of the task from the server using the generation_id
def get_stt(gen_id):
    url = f"{base_url}/stt/{gen_id}"
    headers = {
        "Authorization": f"Bearer {api_key}",
    }

    response = requests.get(url, headers=headers)
    return response.json()


# First, start the generation, then poll the server for the result every 10 seconds.
def main():
    stt_response = create_stt()
    gen_id = stt_response.get("generation_id") if stt_response else None
    if gen_id:
        start_time = time.time()
        timeout = 600
        while time.time() - start_time < timeout:
            response_data = get_stt(gen_id)
            if response_data is None:
                print("Error: No response from API")
                return None
            status = response_data.get("status")
            if status in ("waiting", "active"):
                print("Still waiting... Checking again in 10 seconds.")
                time.sleep(10)
            else:
                transcript = response_data["result"]["results"]["channels"][0]["alternatives"][0]["transcript"]
                print("Processing complete:\n", transcript)
                return response_data
        print("Timeout reached. Stopping.")
    return None


if __name__ == "__main__":
    main()
Response
{'generation_id': 'e3d46bba-7562-44a9-b440-504d940342a3'}
Processing complete:
he doesn't belong to you and i don't see how you have anything to do with what is be his power yet he's he personified from this stage to you be fire
Example #2: Processing a Speech Audio File via File Path
import time
import requests

base_url = "https://api.aimlapi.com/v1"

# Insert your AIML API Key instead of <YOUR_AIMLAPI_KEY>:
api_key = "<YOUR_AIMLAPI_KEY>"


# Creating and sending a speech-to-text conversion task to the server
def create_stt():
    url = f"{base_url}/stt/create"
    headers = {
        "Authorization": f"Bearer {api_key}",
    }
    data = {
        "model": "#g1_whisper-large",
    }

    # Upload the local audio file as multipart/form-data
    with open("stt-sample.mp3", "rb") as file:
        files = {"audio": ("sample.mp3", file, "audio/mpeg")}
        response = requests.post(url, data=data, headers=headers, files=files)

    if response.status_code >= 400:
        print(f"Error: {response.status_code} - {response.text}")
        return None

    response_data = response.json()
    print(response_data)
    return response_data


# Requesting the result of the task from the server using the generation_id
def get_stt(gen_id):
    url = f"{base_url}/stt/{gen_id}"
    headers = {
        "Authorization": f"Bearer {api_key}",
    }

    response = requests.get(url, headers=headers)
    return response.json()


# First, start the generation, then poll the server for the result every 10 seconds.
def main():
    stt_response = create_stt()
    gen_id = stt_response.get("generation_id") if stt_response else None
    if gen_id:
        start_time = time.time()
        timeout = 600
        while time.time() - start_time < timeout:
            response_data = get_stt(gen_id)
            if response_data is None:
                print("Error: No response from API")
                return None
            status = response_data.get("status")
            if status in ("waiting", "active"):
                print("Still waiting... Checking again in 10 seconds.")
                time.sleep(10)
            else:
                transcript = response_data["result"]["results"]["channels"][0]["alternatives"][0]["transcript"]
                print("Processing complete:\n", transcript)
                return response_data
        print("Timeout reached. Stopping.")
    return None


if __name__ == "__main__":
    main()
Response
{'generation_id': 'dd412e9d-044c-43ae-b97b-e920755074d5'}
Processing complete:
he doesn't belong to you and i don't see how you have anything to do with what is be his power yet he's he personified from this stage to you be fire
Optional Request Parameters
The task creation call accepts the following optional parameters alongside the model and audio source:

custom_intent (string or string[], optional): A custom intent you want the model to detect within your input audio, if present. Submit up to 100.
custom_topic (string or string[], optional): A custom topic you want the model to detect within your input audio, if present. Submit up to 100.
custom_intent_mode (string enum, optional): Sets how the model interprets strings submitted in the custom_intent parameter. When strict, the model returns only intents submitted via custom_intent. When extended, the model also returns its own detected intents. Possible values: strict, extended.
custom_topic_mode (string enum, optional): Sets how the model interprets strings submitted in the custom_topic parameter. When strict, the model returns only topics submitted via custom_topic. When extended, the model also returns its own detected topics. Possible values: strict, extended.
detect_language (boolean, optional): Enables language detection to identify the dominant language spoken in the submitted audio.
detect_entities (boolean, optional): Identifies and extracts key entities from content in the submitted audio. When Entity Detection is enabled, the Punctuation feature is enabled by default.
detect_topics (boolean, optional): Detects the most important and relevant topics referenced in speech within the audio.
diarize (boolean, optional): Recognizes speaker changes. Each word in the transcript is assigned a speaker number, starting at 0.
dictation (boolean, optional): Converts spoken dictation commands into their corresponding punctuation marks.
diarize_version (string, optional)
extra (string, optional): Arbitrary key-value pairs attached to the API response for use in downstream processing.
filler_words (boolean, optional): Transcribes interruptions in your audio, such as "uh" and "um".
intents (boolean, optional): Recognizes speaker intent throughout a transcript or text.
keywords (string, optional): Boosts or suppresses specialized terminology and brand names.
language (string, optional): A BCP-47 language tag that hints at the primary spoken language. Depending on the model and API endpoint, only certain languages are available.
measurements (boolean, optional): Converts spoken measurements to their corresponding abbreviations.
multi_channel (boolean, optional): Transcribes each audio channel independently.
numerals (boolean, optional): Converts numbers from written format to numerical format.
paragraphs (boolean, optional): Splits audio into paragraphs to improve transcript readability.
profanity_filter (boolean, optional): Looks for recognized profanity and converts it to the nearest recognized non-profane word, or removes it from the transcript completely.
punctuate (boolean, optional): Adds punctuation and capitalization to the transcript.
search (string, optional): Searches for terms or phrases in the submitted audio.
sentiment (boolean, optional): Recognizes sentiment throughout a transcript or text.
smart_format (boolean, optional): Applies additional formatting to the transcript to improve readability.
summarize (string, optional): Summarizes content. For the Listen API, the string version option is supported; for the Read API, only a boolean is accepted.
tag (string[], optional): Labels your requests for identification during usage reporting.
topics (boolean, optional): Detects topics throughout a transcript or text.
utterances (boolean, optional): Segments speech into meaningful semantic units.
utt_split (number, optional): Seconds to wait before detecting a pause between words in the submitted audio.
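Optional parameters are passed in the same JSON body as model and url when creating a task. Below is a minimal sketch, assuming the endpoint accepts these options alongside the required fields; the chosen values are purely illustrative, and support for individual options may vary by model:

import requests

api_key = "<YOUR_AIMLAPI_KEY>"

# Create an STT task with a few of the optional parameters listed above.
# Values are illustrative; option support may vary by model.
response = requests.post(
    "https://api.aimlapi.com/v1/stt/create",
    json={
        "model": "#g1_whisper-large",
        "url": "https://audio-samples.github.io/samples/mp3/blizzard_primed/sample-0.mp3",
        "punctuate": True,      # add punctuation and capitalization
        "diarize": True,        # assign a speaker number to each word
        "language": "en",       # BCP-47 hint for the primary spoken language
        "filler_words": True,   # keep interruptions like "uh" and "um"
    },
    headers={"Authorization": f"Bearer {api_key}"},
)
print(response.json())  # e.g. {'generation_id': '...'}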