Nova-2
Nova-2 models use per-second billing. The cost of audio transcription is based on the number of seconds in the input audio file, not the processing time.
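In other words, cost scales with audio length alone: a 90-second clip is billed for 90 seconds whether transcription takes five seconds or fifty. A minimal sketch of the arithmetic, assuming a hypothetical per-second rate (check the pricing page for the actual Nova-2 figure):

# Hypothetical rate for illustration only; the real per-second price
# for Nova-2 is listed on the AI/ML API pricing page.
RATE_PER_SECOND_USD = 0.0001  # assumed value

def transcription_cost(audio_seconds: float) -> float:
    """Billing depends only on audio duration, not processing time."""
    return audio_seconds * RATE_PER_SECOND_USD

print(transcription_cost(90))  # 90-second clip -> 0.009 at the assumed rate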
Model Overview
Nova-2 builds on the advancements of Nova-1 with speech-specific optimizations to its Transformer architecture, refined data curation techniques, and a multi-stage training approach. These improvements result in a lower word error rate (WER) and better entity recognition (including proper nouns and alphanumeric sequences), as well as enhanced punctuation and capitalization.
Nova-2 offers the following model options:
automotive: Optimized for audio with automotive-oriented vocabulary.
conversationalai: Optimized for use cases in which a human is talking to an automated bot, such as IVR, a voice assistant, or an automated kiosk.
drivethru: Optimized for audio sourced from drive-thrus.
finance: Optimized for multiple speakers with varying audio quality, such as might be found on a typical earnings call. Vocabulary is heavily finance-oriented.
general: Optimized for everyday audio processing.
medical: Optimized for audio with medical-oriented vocabulary.
meeting: Optimized for conference room settings, which include multiple speakers with a single microphone.
phonecall: Optimized for low-bandwidth audio phone calls.
video: Optimized for audio sourced from videos.
voicemail: Optimized for low-bandwidth audio clips with a single speaker. Derived from the phonecall model.
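In requests, each of the options above is addressed by its full model ID: the option name with the #g1_nova-2- prefix, as used in the schema and code examples below. A quick illustration:

# Build the full model ID from an option name; the "#g1_nova-2-" prefix
# matches the IDs used in the request examples below.
def nova2_model_id(option: str) -> str:
    return f"#g1_nova-2-{option}"

print(nova2_model_id("meeting"))    # "#g1_nova-2-meeting"
print(nova2_model_id("phonecall"))  # "#g1_nova-2-phonecall"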
Set Up Your API Key
If you don’t have an API key for the AI/ML API yet, feel free to use our Quickstart guide.
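One common pattern, purely a suggestion and not required by the API, is to keep the key out of your source code and load it from an environment variable:

import os

# Assumes you exported the key first, e.g. `export AIMLAPI_KEY="..."`;
# the variable name AIMLAPI_KEY is our own convention, not mandated by the API.
api_key = os.environ["AIMLAPI_KEY"]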
API Schema
Creating and sending a speech-to-text conversion task to the server
The request body supports the following parameters:

custom_intent: A custom intent you want the model to detect within your input audio if present. Submit up to 100.
custom_topic: A custom topic you want the model to detect within your input audio if present. Submit up to 100.
custom_intent_mode: Sets how the model will interpret strings submitted to the custom_intent param. When strict, the model will only return intents submitted using the custom_intent param. When extended, the model will return its own detected intents in addition to those submitted using the custom_intent param.
custom_topic_mode: Sets how the model will interpret strings submitted to the custom_topic param. When strict, the model will only return topics submitted using the custom_topic param. When extended, the model will return its own detected topics in addition to those submitted using the custom_topic param.
detect_language: Enables language detection to identify the dominant language spoken in the submitted audio.
detect_entities: Identifies and extracts key entities from content in submitted audio. When Entity Detection is enabled, the Punctuation feature will be enabled by default.
detect_topics: Detects the most important and relevant topics that are referenced in speech within the audio.
diarize: Recognizes speaker changes. Each word in the transcript will be assigned a speaker number starting at 0.
extra: Arbitrary key-value pairs that are attached to the API response for use in downstream processing.
filler_words: Helps transcribe interruptions in your audio, like “uh” and “um”.
intents: Recognizes speaker intent throughout a transcript or text.
keywords: Boosts or suppresses specialized terminology and brands.
language: The BCP-47 language tag that hints at the primary spoken language. Depending on the model and API endpoint you choose, only certain languages are available.
measurements: Converts spoken measurements to their corresponding abbreviations.
multi_channel: Transcribes each audio channel independently.
numerals: Converts numbers from written format to numerical format.
paragraphs: Splits audio into paragraphs to improve transcript readability.
profanity_filter: Looks for recognized profanity and converts it to the nearest recognized non-profane word, or removes it from the transcript completely.
punctuate: Adds punctuation and capitalization to the transcript.
search: Searches for terms or phrases in submitted audio.
sentiment: Recognizes the sentiment throughout a transcript or text.
smart_format: Applies formatting to transcript output. When set to true, additional formatting will be applied to transcripts to improve readability.
summarize: Summarizes content. For the Listen API, supports the string version option; for the Read API, accepts boolean only.
tag: Labels your requests for the purpose of identification during usage reporting.
topics: Detects topics throughout a transcript or text.
utterances: Segments speech into meaningful semantic units.
utt_split: Seconds to wait before detecting a pause between words in submitted audio.
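Most of these options are independent booleans, so a typical request enables only the few it needs rather than everything shown in the exhaustive schema example below. A sketch of a realistic payload (values are illustrative, not recommendations):

# Illustrative request body: diarization plus readable formatting
# for a meeting recording; everything else keeps its default.
payload = {
    "model": "#g1_nova-2-meeting",
    "diarize": True,       # tag each word with a speaker number
    "punctuate": True,     # punctuation and capitalization
    "smart_format": True,  # extra readability formatting
    "language": "en",      # BCP-47 hint for the dominant language
}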
POST /v1/stt/create HTTP/1.1
Host: api.aimlapi.com
Authorization: Bearer <YOUR_AIMLAPI_KEY>
Content-Type: application/json
Accept: */*
Content-Length: 596
{
"model": "#g1_nova-2-automotive",
"custom_intent": "text",
"custom_topic": "text",
"custom_intent_mode": "strict",
"custom_topic_mode": "strict",
"detect_language": true,
"detect_entities": true,
"detect_topics": true,
"diarize": true,
"dictation": true,
"diarize_version": "text",
"extra": "text",
"filler_words": true,
"intents": true,
"keywords": "text",
"language": "text",
"measurements": true,
"multi_channel": true,
"numerals": true,
"paragraphs": true,
"profanity_filter": true,
"punctuate": true,
"search": "text",
"sentiment": true,
"smart_format": true,
"summarize": "text",
"tag": [
"text"
],
"topics": true,
"utterances": true,
"utt_split": 1
}
Response:
{
"generation_id": "123e4567-e89b-12d3-a456-426614174000"
}
Requesting the result of the task from the server using the generation_id
GET /v1/stt/{generation_id} HTTP/1.1
Host: api.aimlapi.com
Authorization: Bearer <YOUR_AIMLAPI_KEY>
Accept: */*
Response:
{
"status": "text",
"result": {
"metadata": {
"transaction_key": "text",
"request_id": "text",
"sha256": "text",
"created": "2025-07-05T18:02:14.747Z",
"duration": 1,
"channels": 1,
"models": [
"text"
],
"model_info": {
"ANY_ADDITIONAL_PROPERTY": {
"name": "text",
"version": "text",
"arch": "text"
}
}
}
}
}
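The excerpt above shows only the metadata block; once status reports completion, the result object also contains the transcript itself. The code examples below read it from results.channels[0].alternatives[0].transcript, so a completed response can be unpacked along these lines (a sketch assuming that structure):

def extract_transcript(response_data: dict) -> str:
    # Follows the access path used in the examples below; adjust the
    # indices if your audio has multiple channels or alternatives.
    channel = response_data["result"]["results"]["channels"][0]
    return channel["alternatives"][0]["transcript"]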
Quick Code Examples
Let's use the #g1_nova-2-meeting model to transcribe the following audio fragment:
Example #1: Processing a Speech Audio File via URL
import time
import requests

base_url = "https://api.aimlapi.com/v1"

# Insert your AIML API Key instead of <YOUR_AIMLAPI_KEY>:
api_key = "<YOUR_AIMLAPI_KEY>"


# Creating and sending a speech-to-text conversion task to the server
def create_stt():
    url = f"{base_url}/stt/create"
    headers = {
        "Authorization": f"Bearer {api_key}",
    }
    data = {
        "model": "#g1_nova-2-meeting",
        "url": "https://audio-samples.github.io/samples/mp3/blizzard_primed/sample-0.mp3",
    }

    response = requests.post(url, json=data, headers=headers)
    if response.status_code >= 400:
        print(f"Error: {response.status_code} - {response.text}")
        return None
    response_data = response.json()
    print(response_data)
    return response_data


# Requesting the result of the task from the server using the generation_id
def get_stt(gen_id):
    url = f"{base_url}/stt/{gen_id}"
    headers = {
        "Authorization": f"Bearer {api_key}",
    }
    response = requests.get(url, headers=headers)
    return response.json()


# First, start the generation, then repeatedly request the result from the server every 10 seconds.
def main():
    stt_response = create_stt()
    gen_id = stt_response.get("generation_id") if stt_response else None
    if gen_id:
        start_time = time.time()
        timeout = 600  # give up after 10 minutes

        while time.time() - start_time < timeout:
            response_data = get_stt(gen_id)
            if response_data is None:
                print("Error: No response from API")
                break

            status = response_data.get("status")
            if status in ("waiting", "active"):
                print("Still waiting... Checking again in 10 seconds.")
                time.sleep(10)
            else:
                # The finished transcript lives under result.results.channels[0].alternatives[0]
                print("Processing complete:\n", response_data["result"]["results"]["channels"][0]["alternatives"][0]["transcript"])
                return response_data

        print("Timeout reached. Stopping.")
        return None


if __name__ == "__main__":
    main()
Example #2: Processing a Speech Audio File via File Path
import time
import requests

base_url = "https://api.aimlapi.com/v1"

# Insert your AIML API Key instead of <YOUR_AIMLAPI_KEY>:
api_key = "<YOUR_AIMLAPI_KEY>"


# Creating and sending a speech-to-text conversion task to the server
def create_stt():
    url = f"{base_url}/stt/create"
    headers = {
        "Authorization": f"Bearer {api_key}",
    }
    data = {
        "model": "#g1_nova-2-meeting",
    }

    # Upload the local file as multipart form data; the model name travels
    # in the form fields alongside it.
    with open("stt-sample.mp3", "rb") as file:
        files = {"audio": ("sample.mp3", file, "audio/mpeg")}
        response = requests.post(url, data=data, headers=headers, files=files)

    if response.status_code >= 400:
        print(f"Error: {response.status_code} - {response.text}")
        return None
    response_data = response.json()
    print(response_data)
    return response_data


# Requesting the result of the task from the server using the generation_id
def get_stt(gen_id):
    url = f"{base_url}/stt/{gen_id}"
    headers = {
        "Authorization": f"Bearer {api_key}",
    }
    response = requests.get(url, headers=headers)
    return response.json()


# First, start the generation, then repeatedly request the result from the server every 10 seconds.
def main():
    stt_response = create_stt()
    gen_id = stt_response.get("generation_id") if stt_response else None
    if gen_id:
        start_time = time.time()
        timeout = 600  # give up after 10 minutes

        while time.time() - start_time < timeout:
            response_data = get_stt(gen_id)
            if response_data is None:
                print("Error: No response from API")
                break

            status = response_data.get("status")
            if status in ("waiting", "active"):
                print("Still waiting... Checking again in 10 seconds.")
                time.sleep(10)
            else:
                # The finished transcript lives under result.results.channels[0].alternatives[0]
                print("Processing complete:\n", response_data["result"]["results"]["channels"][0]["alternatives"][0]["transcript"])
                return response_data

        print("Timeout reached. Stopping.")
        return None


if __name__ == "__main__":
    main()