September 28, 2023

Evaluate the best Speech To Text Models


Speech-to-Text, often referred to as Automatic Speech Recognition (ASR), is a technology that uses machine learning to convert human speech into text. It's a common technology that many of us encounter every day – think of Siri, Okay Google, or any speech dictation software.

What is Automatic Speech Recognition?

Automatic Speech Recognition, or ASR, uses machine learning to turn spoken words into written text. The field has grown tremendously in the last decade, and ASR systems are now a common feature in everyday applications: TikTok and Instagram for live captions, Spotify for podcast transcripts, Zoom for meeting notes, and many others.

How Does Automatic Speech Recognition Work?

Traditional Acoustic Speech Recognition Models:

Most ASR voice technology begins with an acoustic model to represent the relationship between audio signals and the basic building blocks of words. Acoustic models are a type of statistical model used to convert spoken language, which is in the form of an audio signal, into a sequence of linguistic units, typically phonemes, words, or subword units. Traditional ASR systems involve a multi-step process, including language modeling and pronunciation dictionaries.
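To make the multi-step idea concrete, here is a toy sketch of the last two stages. It assumes an acoustic model has already produced a phoneme sequence, then uses a pronunciation dictionary and a simple unigram language model to pick the most probable words. The lexicon, the probabilities, and the `decode` function are all hypothetical values invented for illustration; real systems use far larger dictionaries, richer language models, and weighted acoustic scores.

```python
import math

# Hypothetical pronunciation dictionary: word -> phoneme sequence (ARPAbet-like).
LEXICON = {
    "i": ("AY",),
    "eye": ("AY",),
    "ice": ("AY", "S"),
    "scream": ("S", "K", "R", "IY", "M"),
    "cream": ("K", "R", "IY", "M"),
}

# Hypothetical unigram language model: word -> probability.
UNIGRAM_LM = {"i": 0.02, "eye": 0.001, "ice": 0.01, "scream": 0.002, "cream": 0.01}


def decode(phonemes):
    """Return the most probable word sequence whose pronunciations
    concatenate to `phonemes`, or None if no segmentation exists."""
    # best[i] holds (log-probability, words) for the best parse of phonemes[:i].
    best = {0: (0.0, [])}
    for end in range(1, len(phonemes) + 1):
        for word, pron in LEXICON.items():
            start = end - len(pron)
            # Skip words that do not fit here or that follow an unparsable prefix.
            if start not in best or tuple(phonemes[start:end]) != pron:
                continue
            score = best[start][0] + math.log(UNIGRAM_LM[word])
            if end not in best or score > best[end][0]:
                best[end] = (score, best[start][1] + [word])
    return best[len(phonemes)][1] if len(phonemes) in best else None


if __name__ == "__main__":
    # Phonemes a (hypothetical) acoustic model might emit.
    print(decode(["AY", "S", "K", "R", "IY", "M"]))
```

The same phonemes can be read as "i scream" or "ice cream"; with the made-up probabilities above, the language model prefers "ice cream". That kind of ambiguity is exactly what the dictionary and language-model stages exist to resolve.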

End-to-End Deep Learning Models

The End-to-End Automatic Speech Recognition (ASR) model is a revolutionary approach in the field of speech technology. Unlike traditional ASR systems, which involve multiple intermediate steps such as phoneme recognition and language modeling, an End-to-End ASR model converts spoken language directly into text in a single step. It does this with deep learning techniques, often using architectures such as convolutional neural networks (CNNs) or transformer-based models. This streamlined approach offers several advantages, including greater simplicity, improved accuracy, and better handling of diverse accents and speaking styles.

Why should you use the best speech-to-text models with Clarifai?

Clarifai, a leading AI platform, offers a compelling solution with its state-of-the-art End-to-End Automatic Speech Recognition (ASR) models. 

Here's why you should consider using the best speech-to-text models through Clarifai's API.

  1. State-of-the-Art ASR Models: Clarifai's integration of top-tier ASR models ensures that you have access to the most advanced and accurate speech-to-text conversion technology available. These models are meticulously trained on vast datasets, making them exceptionally proficient in converting spoken words into written text with high precision.
  2. Ease of Integration: Clarifai's Speech-to-Text (STT) models can be integrated into your applications through the API. Whether you're a seasoned developer or just starting out, this reduces the technical challenges and overhead, allowing you to focus on your core objectives.
  3. Cost-Effective: Clarifai's STT APIs are available at a very competitive price point. This affordability opens the door for businesses of all sizes and individuals to access cutting-edge speech-to-text technology without breaking the bank.
  4. Data Security and Privacy: Clarifai places a strong emphasis on data security and privacy. You can trust that your audio data is handled with the utmost care, ensuring compliance with data protection regulations.

ASR Models

Clarifai hosts a large number of state-of-the-art Speech-to-Text models on the platform, which can be used for multiple purposes. A few of the most popular models are:

Chirp: Universal speech model (USM)

Chirp is built on the Universal Speech Model (USM), a state-of-the-art speech model with 2B parameters trained on 12 million hours of speech and 28 billion sentences of text, spanning 300+ languages. Chirp itself is a 2 billion-parameter speech model developed through self-supervised training on extensive audio and text data in over 100 languages. It reports 98% accuracy in English and a relative improvement of over 300% in several languages with fewer than 10 million speakers.

Chirp's uniqueness lies in its training approach. It first learned from millions of hours of unlabeled audio data across multiple languages and was then fine-tuned with a limited amount of supervised data for each language. This approach contrasts with traditional speech recognition methods, which rely heavily on language-specific supervised data.

Key Results

The USM model, fine-tuned on YouTube Captions data, achieves an average word error rate (WER) of less than 30% across 73 languages, a WER that is 32.7% lower in relative terms than Whisper's. USM also shows lower word error rates on various ASR benchmarks, such as CORAAL, SpeechStew, and FLEURS. On speech translation tasks, USM outperforms Whisper across language segments grouped by resource availability.
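Comparisons between models are usually quoted as relative WER reductions rather than absolute differences. The sketch below shows the arithmetic, assuming the 32.7% figure above is a relative reduction. The function name and the example numbers are hypothetical and are not taken from any benchmark.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Fraction by which new_wer improves on baseline_wer.

    Both arguments are word error rates expressed as fractions (0.20 = 20%).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical numbers: a baseline at 20% WER and a new model at 15% WER.
    reduction = relative_wer_reduction(0.20, 0.15)
    print(f"relative reduction: {reduction:.1%}")
```

With these made-up numbers the new model is 5 percentage points better in absolute terms, but 25% better in relative terms. It is worth checking which of the two a reported figure refers to.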

Try out the Chirp model here:

Assembly AI

AssemblyAI's Speech-to-Text model, known as Conformer-2, represents the latest advancement in automatic speech recognition. It is trained on an extensive dataset comprising 1.1 million hours of English audio data. Conformer-2 builds upon its predecessor, Conformer-1, by offering substantial improvements in handling proper nouns, alphanumerics, and robustness to noisy audio.

Conformer-2 is a speech recognition model based on the Conformer architecture: a Transformer with added convolutional layers for improved dependency capture. It aims to be an efficient speech recognition model while retaining the Conformer's strong modeling capabilities.

Conformer-2 improves on the original release of Conformer-1 in both model performance and speed. Conformer-1 itself achieved state-of-the-art performance.

Key Results:

Conformer-2 maintains parity with Conformer-1 in terms of word error rate but takes a step forward in many user-oriented metrics. Conformer-2 achieves a 31.7% improvement on alphanumerics, a 6.8% improvement on Proper Noun Error Rate, and a 12.0% improvement in robustness to noise. These improvements were made possible by increasing the amount of training data to 1.1M hours of English audio (170% of the data used for Conformer-1) and by increasing the number of models used to pseudo-label data.

Try out the AssemblyAI ASR model here:


Whisper

Whisper is an ASR model notable for its robustness and accuracy in English speech recognition. Whisper-Large is trained on a large-scale, weakly supervised dataset that includes 680,000 hours of audio covering 96 languages. The dataset also includes 125,000 hours of X→en translation data. Models trained on this dataset transfer well to existing datasets zero-shot, removing the need for any dataset-specific fine-tuning to achieve high-quality results. The model excels at handling accents, background noise, and technical language. It can transcribe speech in multiple languages and translate it into English.

Although Whisper may not outperform specialized models on benchmarks like LibriSpeech, it excels in zero-shot performance across diverse datasets, making 50% fewer errors than those models. Whisper's strength lies in its large and diverse dataset, of which approximately one-third is non-English audio. It also learns speech-to-text translation effectively, surpassing the supervised state of the art on zero-shot CoVoST2 to-English translation.

Try out Whisper-large model here:

How to Use Speech-To-Text model with Clarifai

You can access and run the speech-to-text models using Clarifai’s Python client.

Check out the code below for the Whisper model:

Model Demo in the Clarifai Platform

Try out the gcp-chirp, assembly-audio-transcription, and whisper-large models

Evaluating ASR Model

Evaluating an Automatic Speech Recognition (ASR) model is a critical step in assessing its performance and ensuring its effectiveness in converting spoken language into text accurately. The evaluation process typically involves various metrics and techniques to measure the model's quality. Here are some key aspects and methods for evaluating ASR models:

  • Word Error Rate (WER): WER measures the accuracy of the recognized words in the system's output compared to the reference or ground truth transcription. It quantifies the number of errors in terms of word substitutions, insertions, and deletions made by the ASR system.

    Here's how WER is calculated:
    Substitutions (S): This represents the number of words in the reference transcription that are incorrectly replaced by words in the ASR output.
    Insertions (I): Insertions count the number of extra words present in the ASR output that are not in the reference transcription.
    Deletions (D): Deletions indicate the number of words in the reference transcription that are missing in the ASR output.

    The formula for calculating WER is as follows:
    Word Error Rate = (substitutions + deletions + insertions) / number of words in the reference transcript
    Simply put, this formula gives the proportion of reference words that the ASR system got wrong. A lower WER therefore means higher accuracy. Because insertions are counted, WER can exceed 100% when the output contains many extra words.
  • Character Error Rate (CER): Similar to WER, CER measures the number of character-level errors in the recognized text compared to the reference text. It provides a finer-grained evaluation, especially useful for languages with complex scripts.
  • Accuracy: This metric calculates the percentage of correctly recognized words or characters in the transcription. It is a straightforward measure of ASR model accuracy.
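The metrics above can be sketched in a few lines of Python. This is a minimal illustration, assuming whitespace tokenization and no text normalization (casing, punctuation), which real evaluations usually apply first. The function names are our own and do not belong to any particular library. The total S + D + I is computed as the Levenshtein (edit) distance between the reference and the ASR output.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, insertions, and deletions
    needed to turn the sequence `ref` into the sequence `hyp`."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty hypothesis takes i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference item missing from output
                curr[j - 1] + 1,     # insertion: extra item in output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    hypothesis = "the cat sit on mat"  # one substitution, one deletion
    print(f"WER: {wer(reference, hypothesis):.2f}")
    print(f"CER: {cer(reference, hypothesis):.2f}")
    print(f"Word accuracy: {1 - wer(reference, hypothesis):.2f}")
```

In this example the output has one substituted word and one missing word against six reference words, so the WER is 2/6. Word accuracy is taken here as 1 - WER, which is one common convention; note that it can go negative when the output has many insertions.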

What is automatic speech recognition used for?

Speech-to-Text models can be used for various speech recognition tasks, including transcription of audio recordings, voice commands, and speech-to-text translation. These models can be applied to different languages and accents, making them useful for multilingual applications.

  • Closed Captions: Generating closed captions is the most obvious place to start. Whether it’s for movies, television, video games, or any other form of media, offline ASR accurately creates captions ahead of time to aid comprehension and make media more accessible to the deaf and hard-of-hearing. 
  • Content Creation: Content creators can benefit from accurate transcription to produce captions, subtitles, and written content from spoken material.
  • Transcription Services: Speech-to-Text models are suitable for various transcription needs, including converting audio recordings, interviews, meetings, and video content into written text.
  • Call Centers: Call centers are also employing ASR to drive better customer outcomes. Uses include monitoring customer support interactions, analyzing initial contacts to more quickly resolve issues, and improving employee training.

Check out the platform here, and don't hesitate to connect with us with any questions or exciting ideas you want to share.