UNIQUE LIVE-CAPTIONING SOFTWARE
SOVO’s proprietary live-captioning software is unique in many respects, most notably its multi-modality: it combines a state-of-the-art speech recognition engine, a computer game joystick and a graphical user interface, allowing our live captioners to produce the best possible captions in real time using whichever input mode is most appropriate to the task. The software is especially efficient when captioning fast-paced, unscripted content.
The STDirect speech recognition sub-system relies on “acoustic models”, which represent all the different sounds that make up words in a given language. These models were built with advanced artificial intelligence techniques, running on very large server farms of hundreds of processors and trained on tens of thousands of hours of carefully annotated audio. SOVO is leading the way in applying deep learning technologies to respeaking for the production of closed captions for live broadcasts. The resulting high-performance acoustic models are speaker-independent: they do not need to be trained on the voice of a particular captioner, which leads to more consistent quality.
Version 8.0 of the STDirect software also introduces Dynamic-LM technology. Besides the acoustic models, the other key component of a speech recognition system is the language model, which captures the probabilities of words and word sequences in a specific language or domain. Dynamic-LM allows the use of very complex, arbitrarily large language models that cover a sub-domain or language with unmatched precision. For example, SOVO leverages a very detailed sub-grammar for numbers that covers all variations of decimals, fractions, ordinals and so on. Accurate recognition of numbers in all their forms is critical to caption intelligibility in domains such as business, public affairs, sports and weather.
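To illustrate what a language model contributes, the minimal sketch below estimates word-sequence probabilities from a tiny corpus and uses them to prefer “two” over the homophone “too” in a sports score. It is a textbook bigram model with add-one smoothing, chosen purely for illustration; it is not SOVO’s Dynamic-LM technology, whose internals are not described here. The corpus and the function names are hypothetical.

```python
import math
from collections import Counter


def train_bigram_model(sentences):
    """Count unigrams and bigrams over whitespace-tokenized sentences."""
    unigrams = Counter()
    bigrams = Counter()
    for sentence in sentences:
        # <s> and </s> mark sentence boundaries so first/last words get context.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def log_prob(sentence, unigrams, bigrams):
    """Log-probability of a sentence under an add-one (Laplace) smoothed bigram model."""
    vocab_size = len(unigrams)
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        # P(word | prev) = (count(prev, word) + 1) / (count(prev) + V)
        total += math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
    return total


# Hypothetical toy corpus from a sports/weather domain.
corpus = [
    "the score is two to one",
    "the team scored two goals",
    "the temperature will drop to minus ten",
]
unigrams, bigrams = train_bigram_model(corpus)

correct = "the score is two to one"
homophone = "the score is too to one"
# The acoustically identical candidates are separated by the language model alone.
print(log_prob(correct, unigrams, bigrams) > log_prob(homophone, unigrams, bigrams))  # True
```

A production recognizer would combine such language-model scores with acoustic-model scores and use far larger models; the point here is only that word-sequence statistics can settle choices that sound alone cannot.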
In addition to their voice, the live captioners use a joystick to input punctuation, special characters, or text indicating a change of speaker, background noise, applause and similar events. This spares the captioner from verbalizing every period, comma or other character that is required, yielding faster, more fluid and more precise delivery of the live captions, especially when the speech rate is high.
Finally, a custom-designed graphical interface allows the captioners to “click” words or other prepared text to be sent along with the rest of the captions. This gives the voice writers tremendous flexibility in delivering the best possible live captions. Complex names specific to a single event (for instance, the members of the Ukrainian bobsleigh team at a Winter Olympics) or the name of a small town in Northern Manitoba can be prepared ahead of time (or even in real time during production) and “clicked” through instead of being dictated. Any scripted material available ahead of time can also be prepared for direct use in this interface, achieving 100% accuracy during those segments.
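The way the three input modes feed a single caption stream can be sketched as follows. This is a hypothetical illustration, not SOVO’s implementation: the button names, the button-to-text mapping, the “>>” speaker-change marker and the bracketed sound tags are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Hypothetical joystick button-to-text mapping (assumed, not SOVO's actual layout).
JOYSTICK_MAP = {
    "trigger": ".",
    "button_2": ",",
    "button_3": "?",
    "button_4": ">>",          # assumed marker for a change of speaker
    "button_5": "[APPLAUSE]",  # assumed tag for clapping
    "button_6": "[NOISE]",     # assumed tag for background noise
}

# Punctuation that attaches to the preceding word without a space.
ATTACHED = {".", ",", "?", "!"}


@dataclass
class CaptionBuffer:
    """Merges recognized speech, joystick events and clicked prepared text."""
    tokens: list = field(default_factory=list)

    def add_speech(self, words: str) -> None:
        # Words coming from the speech recognizer, split into tokens.
        self.tokens.extend(words.split())

    def press(self, button: str) -> None:
        # A joystick press inserts its mapped punctuation or tag.
        self.tokens.append(JOYSTICK_MAP[button])

    def click(self, prepared_text: str) -> None:
        # Prepared entries (e.g. unusual proper names) go in verbatim as one unit.
        self.tokens.append(prepared_text)

    def render(self) -> str:
        out = ""
        for token in self.tokens:
            if token in ATTACHED or not out:
                out += token
            else:
                out += " " + token
        return out


buf = CaptionBuffer()
buf.add_speech("Welcome back")
buf.press("trigger")
buf.press("button_4")
buf.add_speech("Here is the forecast for")
buf.click("Lynn Lake")  # a prepared place name, clicked rather than dictated
buf.press("trigger")
print(buf.render())  # Welcome back. >> Here is the forecast for Lynn Lake.
```

Keeping every input as a token in one ordered buffer is what lets voice, joystick and clicked text be freely interleaved; the rendering step then only has to decide where spaces belong.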
SOVO’s software architecture, unique in the industry, also allows data, hardware and software to be shared among all of SOVO’s voice writers. Unlike an approach in which each captioner runs personal dictation software on their own computer, SOVO’s server-based approach taps much more powerful centralized software and hardware resources. It also lets each captioner benefit immediately from the improvements their colleagues have contributed over the years. In other organizations, captioners often work in silos and can rarely share new words, grammars or other improvements that solve transcription problems.