AI/ML API Documentation

nova-2

This documentation is valid for the following list of our models:

  • #g1_nova-2-automotive

  • #g1_nova-2-conversationalai

  • #g1_nova-2-drivethru

  • #g1_nova-2-finance

  • #g1_nova-2-general

  • #g1_nova-2-medical

  • #g1_nova-2-meeting

  • #g1_nova-2-phonecall

  • #g1_nova-2-video

  • #g1_nova-2-voicemail

Nova-2 models use per-second billing. The cost of audio transcription is based on the number of seconds in the input audio file, not the processing time.
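Because billing follows the length of the audio, it can help to measure a file before submitting it. The following is a minimal sketch, assuming a local WAV file (Python's standard library does not read MP3 durations) and a per-second price that you supply yourself. The helper names wav_duration_seconds and estimate_cost are hypothetical, and no rounding rule is applied because this page does not state one.

```python
import wave


def wav_duration_seconds(path):
    """Return the length of a WAV file in seconds (frame count / sample rate)."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def estimate_cost(duration_seconds, price_per_second):
    """Estimate the charge for one file.

    price_per_second is supplied by the caller; no price is assumed here.
    """
    return duration_seconds * price_per_second
```

For example, estimate_cost(wav_duration_seconds("call.wav"), your_rate) gives a rough figure for a single file, where your_rate is the per-second price from your own plan.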

Model Overview

Nova-2 builds on the advancements of Nova-1 with speech-specific optimizations to its Transformer architecture, refined data curation techniques, and a multi-stage training approach. These improvements result in a lower word error rate (WER) and better entity recognition (including proper nouns and alphanumeric sequences), as well as enhanced punctuation and capitalization.

Nova-2 offers the following model options:

  • automotive: Optimized for audio with automotive-oriented vocabulary.

  • conversationalai: Optimized for use cases in which a human is talking to an automated bot, such as IVR, a voice assistant, or an automated kiosk.

  • drivethru: Optimized for audio sources from drive-thrus.

  • finance: Optimized for multiple speakers with varying audio quality, such as might be found on a typical earnings call. Vocabulary is heavily finance-oriented.

  • general: Optimized for everyday audio processing.

  • medical: Optimized for audio with medical-oriented vocabulary.

  • meeting: Optimized for conference room settings, which include multiple speakers with a single microphone.

  • phonecall: Optimized for low-bandwidth audio phone calls.

  • video: Optimized for audio sourced from videos.

  • voicemail: Optimized for low-bandwidth audio clips with a single speaker. Derived from the phonecall model.
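
The option names above map onto the model IDs listed at the top of this page by adding the #g1_nova-2- prefix. The sketch below builds and validates such an ID. The helper name nova2_model_id is hypothetical and not part of any SDK.

```python
# Variant names taken from the option list above.
NOVA2_VARIANTS = (
    "automotive", "conversationalai", "drivethru", "finance", "general",
    "medical", "meeting", "phonecall", "video", "voicemail",
)


def nova2_model_id(variant):
    """Return the model ID for a Nova-2 variant, e.g. 'meeting'."""
    if variant not in NOVA2_VARIANTS:
        raise ValueError(f"Unknown Nova-2 variant: {variant!r}")
    return f"#g1_nova-2-{variant}"
```

The returned string can be passed as the "model" field of the request body.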

Set Up Your API Key

API Schema

Creating and sending a speech-to-text conversion task to the server

Requesting the result of the task from the server using the generation_id

Quick Code Examples

Let's use the #g1_nova-2-meeting model to transcribe a short audio fragment.

Example #1: Processing a Speech Audio File via URL

import time
import requests

base_url = "https://api.aimlapi.com/v1"
# Insert your AIML API Key instead of <YOUR_AIMLAPI_KEY>:
api_key = "<YOUR_AIMLAPI_KEY>"

# Creating and sending a speech-to-text conversion task to the server
def create_stt():
    url = f"{base_url}/stt/create"
    headers = {
        "Authorization": f"Bearer {api_key}", 
    }

    data = {
        "model": "#g1_nova-2-meeting",
        "url": "https://audio-samples.github.io/samples/mp3/blizzard_primed/sample-0.mp3"
    }
 
    response = requests.post(url, json=data, headers=headers)
    
    if response.status_code >= 400:
        print(f"Error: {response.status_code} - {response.text}")
    else:
        response_data = response.json()
        print(response_data)
        return response_data

# Requesting the result of the task from the server using the generation_id
def get_stt(gen_id):
    url = f"{base_url}/stt/{gen_id}"
    headers = {
        "Authorization": f"Bearer {api_key}", 
    }
    response = requests.get(url, headers=headers)
    return response.json()
    
# First, start the generation, then repeatedly request the result from the server every 10 seconds.
def main():
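    stt_response = create_stt()
    # create_stt() returns None when the request fails, so check before reading the ID
    gen_id = stt_response.get("generation_id") if stt_response else None

    if gen_id: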
        start_time = time.time()

        timeout = 600
        while time.time() - start_time < timeout:
            response_data = get_stt(gen_id)

            if response_data is None:
                print("Error: No response from API")
                break
        
            status = response_data.get("status")

            if status == "waiting" or status == "active":
                print("Still waiting... Checking again in 10 seconds.")
                time.sleep(10)
            else:
                print("Processing complete:\n", response_data["result"]["results"]["channels"][0]["alternatives"][0]["transcript"])
                return response_data
   
        print("Timeout reached. Stopping.")
        return None     


if __name__ == "__main__":
    main()
Response
{'generation_id': 'h66460ba-0562-1dd9-b440-a56d947e72a3'}
Processing complete:
 He doesn't belong to you and i don't see how you have anything to do with what is be his power yet he's he persona from this stage to you be fine

Example #2: Processing a Speech Audio File via File Path

import time
import requests

base_url = "https://api.aimlapi.com/v1"
# Insert your AIML API Key instead of <YOUR_AIMLAPI_KEY>:
api_key = "<YOUR_AIMLAPI_KEY>"

# Creating and sending a speech-to-text conversion task to the server
def create_stt():
    url = f"{base_url}/stt/create"
    headers = {
        "Authorization": f"Bearer {api_key}", 
    }

    data = {
        "model": "#g1_nova-2-meeting",
    }
    with open("stt-sample.mp3", "rb") as file:
        files = {"audio": ("sample.mp3", file, "audio/mpeg")}
        response = requests.post(url, data=data, headers=headers, files=files)
    
    if response.status_code >= 400:
        print(f"Error: {response.status_code} - {response.text}")
    else:
        response_data = response.json()
        print(response_data)
        return response_data

# Requesting the result of the task from the server using the generation_id
def get_stt(gen_id):
    url = f"{base_url}/stt/{gen_id}"
    headers = {
        "Authorization": f"Bearer {api_key}", 
    }
    response = requests.get(url, headers=headers)
    return response.json()
    
# First, start the generation, then repeatedly request the result from the server every 10 seconds.
def main():
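    stt_response = create_stt()
    # create_stt() returns None when the request fails, so check before reading the ID
    gen_id = stt_response.get("generation_id") if stt_response else None

    if gen_id: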
        start_time = time.time()

        timeout = 600
        while time.time() - start_time < timeout:
            response_data = get_stt(gen_id)

            if response_data is None:
                print("Error: No response from API")
                break
        
            status = response_data.get("status")

            if status == "waiting" or status == "active":
                print("Still waiting... Checking again in 10 seconds.")
                time.sleep(10)
            else:
                print("Processing complete:\n", response_data["result"]["results"]["channels"][0]["alternatives"][0]["transcript"])
                return response_data
   
        print("Timeout reached. Stopping.")
        return None     


if __name__ == "__main__":
    main()
Response
{'generation_id': 'd793a81c-f8d8-40e0-a7c6-049ec6f54446'}
Processing complete:
 He doesn't belong to you, and I don't see how you have anything to do with what is be his power yet. He's he pursuing that from this stage to you.
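
Both examples read the transcript through the same nested path, which raises an exception if any level is missing (for instance when a task finishes without a result). A defensive variant might look like the sketch below. The helper name extract_transcript is hypothetical, and the path is the one used in the examples above.

```python
def extract_transcript(response_data, channel=0):
    """Return the first alternative's transcript for a channel, or None.

    Follows the path used in the examples:
    result -> results -> channels[channel] -> alternatives[0] -> transcript
    """
    try:
        return (
            response_data["result"]["results"]["channels"][channel]
            ["alternatives"][0]["transcript"]
        )
    except (KeyError, IndexError, TypeError):
        # A missing key, an empty list, or a None response all end up here.
        return None
```

Calling print(extract_transcript(response_data)) in place of the direct indexing prints None instead of failing when the structure is incomplete.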
Last updated 1 day ago

If you don’t have an API key for the AI/ML API yet, feel free to use our Quickstart guide.
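POST /v1/stt/create
Authorizations: Bearer token, sent as "Authorization: Bearer <YOUR_AIMLAPI_KEY>"
Body
model · enum · Required

Possible values: the model IDs listed at the top of this page (#g1_nova-2-automotive, #g1_nova-2-conversationalai, #g1_nova-2-drivethru, #g1_nova-2-finance, #g1_nova-2-general, #g1_nova-2-medical, #g1_nova-2-meeting, #g1_nova-2-phonecall, #g1_nova-2-video, #g1_nova-2-voicemail)

custom_intent · string or string[] · Optional

A custom intent you want the model to detect within your input audio if present. Submit up to 100.

custom_topic · string or string[] · Optional

A custom topic you want the model to detect within your input audio if present. Submit up to 100.
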
custom_intent_mode · string · enum · Optional

Sets how the model will interpret strings submitted to the custom_intent param. When strict, the model will only return intents submitted using the custom_intent param. When extended, the model will return its own detected intents in addition to those submitted using the custom_intent param.

Possible values: strict, extended

custom_topic_mode · string · enum · Optional

Sets how the model will interpret strings submitted to the custom_topic param. When strict, the model will only return topics submitted using the custom_topic param. When extended, the model will return its own detected topics in addition to those submitted using the custom_topic param.

Possible values: strict, extended
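
The custom intent and topic parameters share the same rules: a single string or a list of strings, at most 100 entries, and a mode of strict or extended. The sketch below assembles and validates those fields before they are merged into the request body. The helper name custom_detection_params is hypothetical.

```python
ALLOWED_MODES = {"strict", "extended"}
MAX_CUSTOM_ITEMS = 100  # "Submit up to 100"


def custom_detection_params(kind, values, mode):
    """Build the custom_<kind> and custom_<kind>_mode fields.

    kind   -- "intent" or "topic"
    values -- one string or a list of strings
    mode   -- "strict" or "extended"
    """
    if kind not in ("intent", "topic"):
        raise ValueError("kind must be 'intent' or 'topic'")
    if mode not in ALLOWED_MODES:
        raise ValueError("mode must be 'strict' or 'extended'")
    items = [values] if isinstance(values, str) else list(values)
    if len(items) > MAX_CUSTOM_ITEMS:
        raise ValueError("at most 100 custom values are accepted")
    return {f"custom_{kind}": values, f"custom_{kind}_mode": mode}
```

The result can be merged into the payload, for example data.update(custom_detection_params("topic", ["pricing", "refunds"], "strict")).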
detect_language · boolean · Optional

Enables language detection to identify the dominant language spoken in the submitted audio.

detect_entities · boolean · Optional

Identifies and extracts key entities from content in submitted audio. When Entity Detection is enabled, the Punctuation feature will be enabled by default.

detect_topics · boolean · Optional

Detects the most important and relevant topics that are referenced in speech within the audio.

diarize · boolean · Optional

Recognizes speaker changes. Each word in the transcript will be assigned a speaker number starting at 0.
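
With diarization enabled, per-word speaker numbers can be folded into speaker turns. The sketch below assumes each word is a dict with "word" and "speaker" keys. That shape is an assumption, since the response schema on this page does not show word-level fields, so check an actual response before relying on it. The helper name speaker_turns is hypothetical.

```python
def speaker_turns(words):
    """Group consecutive words by speaker.

    words -- list of dicts, each ASSUMED to hold "word" and "speaker" keys.
    Returns a list of (speaker, text) tuples in spoken order.
    """
    turns = []
    for item in words:
        speaker = item["speaker"]
        if turns and turns[-1][0] == speaker:
            # Same speaker as the previous word: extend the current turn.
            turns[-1][1].append(item["word"])
        else:
            # Speaker changed (or first word): start a new turn.
            turns.append((speaker, [item["word"]]))
    return [(speaker, " ".join(parts)) for speaker, parts in turns]
```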

dictation · boolean · Optional

Converts spoken dictation commands (such as "comma" or "period") into their corresponding punctuation marks.

diarize_version · string · Optional

extra · string · Optional

Arbitrary key-value pairs that are attached to the API response for usage in downstream processing.

filler_words · boolean · Optional

Filler Words can help transcribe interruptions in your audio, like “uh” and “um”.

intents · boolean · Optional

Recognizes speaker intent throughout a transcript or text.

keywords · string · Optional

Keywords can boost or suppress specialized terminology and brands.

language · string · Optional

The BCP-47 language tag that hints at the primary spoken language. Depending on the model and API endpoint you choose, only certain languages are available.

measurements · boolean · Optional

Spoken measurements will be converted to their corresponding abbreviations.

multi_channel · boolean · Optional

Transcribes each audio channel independently.

numerals · boolean · Optional

Numerals converts numbers from written format to numerical format.

paragraphs · boolean · Optional

Splits audio into paragraphs to improve transcript readability.

profanity_filter · boolean · Optional

Profanity Filter looks for recognized profanity and converts it to the nearest recognized non-profane word or removes it from the transcript completely.

punctuate · boolean · Optional

Adds punctuation and capitalization to the transcript.

search · string · Optional

Search for terms or phrases in submitted audio.

sentiment · boolean · Optional

Recognizes the sentiment throughout a transcript or text.

smart_format · boolean · Optional

Applies formatting to transcript output. When set to true, additional formatting will be applied to transcripts to improve readability.

summarize · string · Optional

Summarizes content. For Listen API, supports string version option. For Read API, accepts boolean only.

tag · string[] · Optional

Labels your requests for the purpose of identification during usage reporting.

topics · boolean · Optional

Detects topics throughout a transcript or text.

utterances · boolean · Optional

Segments speech into meaningful semantic units.

utt_split · number · Optional

Seconds to wait before detecting a pause between words in submitted audio.

Responses
201 · Success (application/json)

POST /v1/stt/create HTTP/1.1
Host: api.aimlapi.com
Authorization: Bearer <YOUR_AIMLAPI_KEY>
Content-Type: application/json
Accept: */*
Content-Length: 596

{
  "model": "#g1_nova-2-automotive",
  "custom_intent": "text",
  "custom_topic": "text",
  "custom_intent_mode": "strict",
  "custom_topic_mode": "strict",
  "detect_language": true,
  "detect_entities": true,
  "detect_topics": true,
  "diarize": true,
  "dictation": true,
  "diarize_version": "text",
  "extra": "text",
  "filler_words": true,
  "intents": true,
  "keywords": "text",
  "language": "text",
  "measurements": true,
  "multi_channel": true,
  "numerals": true,
  "paragraphs": true,
  "profanity_filter": true,
  "punctuate": true,
  "search": "text",
  "sentiment": true,
  "smart_format": true,
  "summarize": "text",
  "tag": [
    "text"
  ],
  "topics": true,
  "utterances": true,
  "utt_split": 1
}
201 · Success
{
  "generation_id": "123e4567-e89b-12d3-a456-426614174000"
}

GET /v1/stt/{generation_id}
Authorizations: Bearer token, sent as "Authorization: Bearer <YOUR_AIMLAPI_KEY>"
Path parameters
generation_id · string · Required
Responses
201 · Success (application/json)

GET /v1/stt/{generation_id} HTTP/1.1
Host: api.aimlapi.com
Authorization: Bearer <YOUR_AIMLAPI_KEY>
Accept: */*
201 · Success
{
  "status": "text",
  "result": {
    "metadata": {
      "transaction_key": "text",
      "request_id": "text",
      "sha256": "text",
      "created": "2025-05-30T04:57:59.351Z",
      "duration": 1,
      "channels": 1,
      "models": [
        "text"
      ],
      "model_info": {
        "ANY_ADDITIONAL_PROPERTY": {
          "name": "text",
          "version": "text",
          "arch": "text"
        }
      }
    }
  }
}
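
The status field in this response drives the polling loop in the Quick Code Examples, where "waiting" and "active" mean the task is still running. A reusable version of that loop might look like the sketch below. The helper name poll_until_done and its parameters are hypothetical, and fetch is any zero-argument callable that returns the parsed JSON of GET /v1/stt/{generation_id}.

```python
import time

# Statuses that the examples on this page treat as "not finished yet".
PENDING_STATUSES = {"waiting", "active"}


def poll_until_done(fetch, interval=10, timeout=600,
                    sleep=time.sleep, clock=time.monotonic):
    """Call fetch() until the task leaves the pending statuses.

    Returns the last response dict, or None if the timeout elapses first.
    sleep and clock are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout
    while clock() < deadline:
        data = fetch()
        if data.get("status") not in PENDING_STATUSES:
            return data
        sleep(interval)
    return None
```

With the get_stt function from the examples, a call could look like poll_until_done(lambda: get_stt(gen_id)).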