A new standard for live AI-powered closed captions and speech translations

Clevercast adds AI-powered closed captions and speech translations to your live stream with a jaw-dropping 99+% accuracy for commonly spoken languages. This is a unique accomplishment, vastly superior to what other AI-powered captions and speech translation solutions are offering.

The live stream used excerpts from the following videos available on November 8th 2024 under a CC-BY 4.0 license:

#AoIR2016: Panel Discussion “Who rules the internet?” by Alexander von Humboldt Institut für Internet und Gesellschaft

Panel Discussion with Founders | Women Entrepreneurs in Sustainability – May 17, 2018 by Stanford Energy

Cyber & Disinformation Panel Discussion by U.S. Naval Institute

Panel discussion: Moving crisis management forward by European Central Bank

The speech translations and closed captions in this live stream, recorded on Oct. 16, 2024, were generated entirely by AI in real time, without any human intervention. The recording is unedited.

The video demonstrates the accuracy, readability and intelligibility of the AI-generated speech translations and closed captions. Our exclusive AI technology ensures that the quality of live speech translations and closed captions is virtually indistinguishable from that of video on demand.

For a comparison of accuracy and readability with other platforms, check out our side-by-side evaluation with AI captioning on YouTube and Vimeo. Whereas other AI solutions are limited to an accuracy of 90% or less, Clevercast achieves 99+% without human intervention. In fact, our AI solutions often outperform human interpretation and captioning.

To enhance quality and reach 100% accuracy, Clevercast lets you compile custom vocabularies and make real-time corrections to the speech-to-text conversion.

Why choose AI-powered speech and captioning?

AI-powered speech and captioning have made significant strides, bridging the gap with human interpreters and captioners in both accuracy and speed. Unlike human captioners, AI can process and deliver real-time translations and captions with consistent quality across multiple languages.

AI-driven solutions are also highly scalable and cost-effective, allowing content creators to reach broader audiences without the logistical and financial demands of hiring live interpreters or captioners. With continuous advancements in language models, AI captioning offers a reliable, high-quality alternative suitable for diverse applications, from live events to on-demand media.

How accuracy is determined

WER (Word Error Rate) and FER (Frame Error Rate) are commonly used metrics to assess the accuracy of closed captions and speech translations. Both rely on a reference transcript from human transcribers to detect the number of errors. FER is better suited to evaluating real-time speech and subtitles because it counts the number of frames or segments in which the live speech and subtitles differ from the reference transcript. In our metrics, we exclude cosmetic errors (e.g. text not optimally split into two lines) as long as they do not hinder the viewer’s experience. By accuracy, then, we mean correctly converting speech to text and rendering it in understandable phrases that are in sync with the live stream.

Thus, an accuracy of 99+% means that, on average, our AI-powered speech recognition gets more than 99 out of every 100 words right and is able to put them in understandable phrases that are in sync with the live stream.¹
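As an illustration of how such metrics are typically computed, here is a minimal Python sketch. It is not Clevercast's evaluation code: the function names are hypothetical, WER is computed in the standard way as a word-level edit distance divided by the number of reference words, and the segment-based rate assumes the live output has already been aligned one-to-one with the reference segments.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance (substitutions, deletions, insertions)
    divided by the number of words in the reference transcript."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def segment_error_rate(reference_segments: list[str],
                       hypothesis_segments: list[str]) -> float:
    """Share of segments whose words differ from the reference.
    Assumption: both lists are aligned, one entry per segment."""
    if len(reference_segments) != len(hypothesis_segments):
        raise ValueError("segment lists must have the same length")
    if not reference_segments:
        raise ValueError("at least one segment is required")
    wrong = sum(
        1
        for ref, hyp in zip(reference_segments, hypothesis_segments)
        if ref.lower().split() != hyp.lower().split()
    )
    return wrong / len(reference_segments)


def accuracy(error_rate: float) -> float:
    """Accuracy expressed as the complement of an error rate."""
    return 1.0 - error_rate
```

With these definitions, one wrong word in a 100-word reference gives a WER of 0.01, which corresponds to 99% accuracy.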

The high success rate is possible thanks to the huge progress in Automatic Speech Recognition (ASR) technology: differences in speaking style, dialect, accent and vocabulary no longer lead to inaccuracies in transcription. In addition, our ASR engines recognize most public names and abbreviations, and are able to render them correctly. Nowadays most ‘errors’ result from unclearly pronounced personal names and other uncommon names.²

¹ This has been tested primarily with English and Spanish content, including many different accents and dialects. However, data from the AI solutions show that this should also apply to other commonly used languages such as French, German, Italian, Portuguese, Japanese and others.

² Misspellings of names that the AI engine cannot reasonably be expected to know, and that have not been added as additional context information, are not counted as errors when calculating the level of accuracy.

How we achieve 99+% accuracy

There is a big difference between Clevercast and most other solutions for live AI-generated speech translations and closed captions. How does Clevercast attain 99+% accuracy, while other solutions are stuck between 70% and 90% accuracy? Below, we list some of the things that lead to better speech recognition, processing, translation and rendering.

Best-in-class ASR solutions

AI keeps evolving, almost on a daily basis. We benchmark different AI solutions, so Clevercast can automatically select the best engine when a live stream is configured.

Optimized AI

We regularly evaluate the performance of the AI system using validation and test data, and seek to continuously improve and refine the system. This is how we ensure ever-higher quality.

Enhance the audio input

There are gains to be made with the audio that is sent to the ASR engine, for example in terms of audio quality, background noise and intelligent fragment detection.

Provide maximum context

Clevercast slightly increases the latency of the HLS stream. This way, we are able to send more context to the ASR engine, which leads to better speech-to-text conversion.

Avoid wrong predictions

Language models are predictive by nature. To apply correctly what they have learned, they rely on context, background information and specific terms and instructions.

Intelligent AI output processing

Intelligent post-processing is necessary to catch errors, correct misinterpretations, fix grammatical errors, adjust punctuation, words and phrases, and much more.

Apply formatting and line breaks

Factor in readability and time constraints when turning the transcription into captions. Consider the pace of speech and avoid captions that are too short or too long.
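To make the idea concrete, here is a minimal Python sketch of splitting a transcription into caption blocks. It is not Clevercast's implementation: the function name is hypothetical, and the limits of 42 characters per line and two lines per caption are assumptions based on common subtitling conventions.

```python
import textwrap


def split_into_captions(text: str,
                        max_chars: int = 42,
                        max_lines: int = 2) -> list[list[str]]:
    """Wrap text into lines of at most max_chars characters and group
    them into captions of at most max_lines lines each."""
    lines = textwrap.wrap(text, width=max_chars, break_on_hyphens=False)
    return [lines[i:i + max_lines] for i in range(0, len(lines), max_lines)]


captions = split_into_captions(
    "Factor in readability and time constraints when turning "
    "the transcription into captions."
)
for caption in captions:
    print("\n".join(caption))
    print()
```

A production system would also prefer breaks at sentence and clause boundaries rather than purely on line length, which this sketch does not attempt.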

Synchronize the captions

Good timing involves more than aligning captions with the spoken words. Proper synchronization ensures that the captions appear at the right time and stay on screen long enough to help viewers follow the content.
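The timing rule can be sketched in Python as follows. This is an illustration under stated assumptions, not Clevercast's algorithm: the `Cue` type and `schedule_cues` function are hypothetical, and the reading speed of 17 characters per second and the minimum and maximum durations are assumed values.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    start: float  # seconds from the start of the stream
    end: float
    text: str


def schedule_cues(timed_texts: list[tuple[float, str]],
                  chars_per_second: float = 17.0,
                  min_duration: float = 1.0,
                  max_duration: float = 7.0) -> list[Cue]:
    """Give each caption a display time based on its length, bounded by
    min_duration and max_duration, without overlapping the next caption.
    timed_texts holds (speech_start_seconds, text) pairs sorted by start."""
    cues = []
    for index, (start, text) in enumerate(timed_texts):
        duration = len(text) / chars_per_second
        duration = max(min_duration, min(max_duration, duration))
        end = start + duration
        if index + 1 < len(timed_texts):
            # Never keep a caption on screen past the start of the next one.
            end = min(end, timed_texts[index + 1][0])
        cues.append(Cue(start=start, end=end, text=text))
    return cues
```

Each caption starts when the corresponding speech starts, stays visible long enough to be read, and is cut short only when the next caption is due.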

Use our real-time correction interface for 100% accuracy

Our correction interface allows you to edit the AI-generated captions in real time, just before they are sent to the live stream (and translated into other languages). It lets you change words and move them to different lines for improved readability.

Making these corrections is a simple task that requires no experience or training. Our intuitive interface allows anyone to edit the captions in a browser with mouse and keyboard. If desired, we can also provide you with professional correctors for your event.
