Transcription
LivingLens provides a time-aligned transcript of the spoken word for every video and audio file in a supported language. A transcript is required for media Search, making clips for Showreels, and Sentiment Analysis.
LivingLens supports these types of transcription:
Machine Speech-to-Text: also called automatic speech recognition, computer speech recognition, machine transcription, or machine speech recognition.
Human transcription: completed by LivingLens’ global network of approved suppliers.
See Supported Languages for solutions by language.
Machine Speech-to-Text
Note: All channels created on or after November 18, 2021 automatically process Media Capture and CaptureMe uploads with Machine Speech-to-Text, for all supported languages.
Machine Speech-to-Text automatically identifies words and phrases in spoken language and renders them as text.
Sound waves from a video or audio file are formatted and processed using a recurrent neural network (a computer system modeled on the human brain and nervous system) to predict and transcribe one letter at a time. Recurrent neural networks have memory that can help predict what the next letter will be. The initial process returns several predicted words. This output is polished to produce the most likely transcription based on the language and its common uses.
Machine Speech-to-Text transcription accuracy varies depending on several factors:
Quality of the audio: Poor-quality microphones, muffled sound, and background noise reduce the ability to pick up and correctly identify the spoken word.
Multiple speakers: Video or audio with multiple speakers is transcribed less accurately when individuals talk over one another.
Accents: Strong dialects or accents may affect the ability to pick up or correctly identify some words during natural language processing.
Niche topics: Machine Speech-to-Text recognizes a wide range of words. However, specialized technical language and industry-specific terminology may not be automatically recognized.
Good-quality spoken English audio averages 80% accuracy or higher.
The system gives each word a confidence score between 0% and 100%. This score reflects how confident the system is that it produced the correct word from the audio.
Tip: By default, LivingLens displays all identified words in the transcript field. Use the 0-100% sliding scale under a machine transcript in LivingLens to view words at the confidence level displayed.
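The effect of the confidence slider can be pictured with a small sketch. The `Word` structure, the function name, and the sample scores below are hypothetical and are not the LivingLens data model or API. The sketch also assumes that the slider acts as a minimum threshold, which this article does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    confidence: float  # percentage between 0 and 100


def words_at_or_above(words, threshold):
    """Return the words whose confidence is at least `threshold` (0-100)."""
    if not 0 <= threshold <= 100:
        raise ValueError("threshold must be between 0 and 100")
    return [w for w in words if w.confidence >= threshold]


# Invented sample data, for illustration only.
transcript = [
    Word("the", 98.0),
    Word("battery", 91.5),
    Word("lasts", 62.0),
    Word("ages", 35.0),
]

# Assumed behavior: a threshold of 0 keeps every word, like the default view.
print(" ".join(w.text for w in words_at_or_above(transcript, 0)))
# A higher threshold hides the words the system was less sure about.
print(" ".join(w.text for w in words_at_or_above(transcript, 60)))
```

Raising the threshold trades completeness for reliability: fewer words are shown, but the remaining ones are those the system scored as more likely to be correct.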
LivingLens transcripts created by Machine Speech-to-Text transcription are labeled with a computer icon on the language tab, as shown in the image below.
Human transcription
Restriction: Human transcription is not supported on LivingLens Experience Edition. Contact your Medallia Representative for more information.
Human transcription from LivingLens’ global network of approved suppliers produces caption-quality, time-stamped (.VTT or .SRT) transcripts of the spoken word in video and audio files. The API-enabled process returns the completed transcript orders to LivingLens.
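For readers unfamiliar with the two subtitle formats named above, the following sketch shows how the same time-stamped cues look as .SRT and as .VTT. It is a generic illustration of the file formats, not LivingLens output: the function names and sample cues are invented. The format details it relies on are that SRT numbers each cue and uses a comma before the milliseconds, while WebVTT starts with a `WEBVTT` header and uses a period.

```python
def format_timestamp(seconds, style="srt"):
    """Format seconds as HH:MM:SS,mmm (SRT) or HH:MM:SS.mmm (WebVTT)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    sep = "," if style == "srt" else "."
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{sep}{ms:03d}"


def to_srt(cues):
    """Render (start, end, text) cues as SRT: numbered blocks, comma separator."""
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        timing = f"{format_timestamp(start)} --> {format_timestamp(end)}"
        blocks.append(f"{index}\n{timing}\n{text}")
    return "\n\n".join(blocks) + "\n"


def to_vtt(cues):
    """Render (start, end, text) cues as WebVTT: header line, period separator."""
    blocks = ["WEBVTT"]
    for start, end, text in cues:
        timing = f"{format_timestamp(start, 'vtt')} --> {format_timestamp(end, 'vtt')}"
        blocks.append(f"{timing}\n{text}")
    return "\n\n".join(blocks) + "\n"


# Invented sample cues: (start seconds, end seconds, spoken text).
cues = [(0.0, 2.5, "Thanks for joining."), (2.5, 6.0, "Tell me about the product.")]
print(to_srt(cues))
print(to_vtt(cues))
```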
LivingLens transcripts created by Human transcription are labeled with a person icon, as shown in the image below.
Human transcription with Speaker Separation can be ordered, or applied as an upgrade, for English media only.
Important: Human transcription incurs additional cost. Contact your Medallia Representative for pricing information.
Speaker Separation
Restriction: Speaker Separation is not available for LivingLens Experience Edition. Contact your Medallia Representative for more information.
Speaker Separation is a transcription solution that identifies the number of speakers in a video or audio file and labels each speaker in the transcript. Speakers are named (human transcription, when possible) or numbered (machine transcription) in the platform transcript field and in the Data Export transcript fields.
The following example shows a transcript where two speakers are identified and numbered as S1 (speaker 1) and S2 (speaker 2).
Best practice is to use Speaker Separation when media contains two or more speakers.
Speaker Separation is available alongside Machine Speech-to-Text transcription for several languages (see Transcription and Translation - Supported languages) and alongside English Human transcription.
Speaker Separation quality and performance depend on audio quality. Unclear audio, overlapping speakers, or background noise can cause inaccurate speaker identification and numbering.
Restriction: Speaker Separation is not available for Zoom-produced transcripts or uploaded .SRT or .VTT subtitle files.
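The numbered labels described in this section (S1, S2, and so on) can be sketched in a few lines. This is a hypothetical illustration only: the input layout, the function name, and the "S1: text" output format are assumptions, and the sketch does not reproduce how LivingLens detects or stores speakers. It simply numbers speakers in order of first appearance.

```python
def label_speakers(segments):
    """Prefix each segment with S1, S2, ... in order of first appearance.

    `segments` is a list of (speaker_id, text) pairs, where speaker_id is any
    hashable value that distinguishes one voice from another.
    """
    labels = {}
    lines = []
    for speaker_id, text in segments:
        if speaker_id not in labels:
            labels[speaker_id] = f"S{len(labels) + 1}"
        lines.append(f"{labels[speaker_id]}: {text}")
    return lines


# Invented sample conversation with two speakers.
segments = [
    ("voice-a", "How do you use the app day to day?"),
    ("voice-b", "Mostly on my commute."),
    ("voice-a", "And what would you change?"),
]
for line in label_speakers(segments):
    print(line)
```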
Translation
LivingLens provides a time-aligned English translation transcript for all non-English video and audio files, generated from the file’s spoken word transcript.
LivingLens supports these types of translation:
Machine translation
Human translation
Important: Machine Speech-to-Text + Machine translation accuracy is highly variable. Machine Speech-to-Text inaccuracies in a native transcript are amplified in Machine translation; as such, this pairing is not recommended.
If non-English language content is being reviewed by an English speaker, best practice is to use:
Human transcription + machine translation
Human transcription + human translation
Machine translation
Machine translation is the use of software to translate text. The system translates the text of the video or audio file spoken word transcript into English text.
LivingLens transcripts created by Machine translation are labeled (Auto) on the language tab.
Important: Machine Speech-to-Text inaccuracies in a native transcript are amplified in Machine translation.
Human translation
Restriction: Human translation is not supported on LivingLens Experience Edition. Contact your Medallia Representative for more information.
Human translation by LivingLens’ global network of approved suppliers translates the spoken word transcript text into English text. The API-enabled process returns the completed transcript orders to LivingLens.
Important: Human translation incurs additional cost. Contact your Medallia Representative for pricing information.
Processing time
Machine Processing | Human Language Services |
English (Machine Speech-to-Text): near real-time (2-3 times video length) | English (Human transcription): 1-2 business days. Turnaround may increase with higher volumes. |
Non-English (native Machine Speech-to-Text): near real-time (2-3 times video length). Note: Use non-English Machine Speech-to-Text when a native speaker will be reviewing the transcript. Machine translation should NOT be applied to a non-English Machine Speech-to-Text transcript when the user is an English speaker. | Non-English Human transcription (of native speech) + English Machine translation: about 2-3 business days. Non-English Human transcription (of native speech) + English Human translation: about 2-3 business days for transcription and an additional 4-5 business days for translation. For both options, turnaround will increase if daily volume exceeds 30 minutes and/or if languages are less common, e.g., Tagalog. |
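The machine-processing figure in the table above (2-3 times the video length) lends itself to a quick back-of-the-envelope estimate. The sketch below is illustrative only: the function name is invented, the default factors of 2 and 3 are taken from that table row, and actual processing time may vary.

```python
def machine_processing_estimate(media_seconds, low_factor=2.0, high_factor=3.0):
    """Return a (low, high) estimate in seconds for machine processing time.

    Assumes processing takes between `low_factor` and `high_factor` times the
    media length, per the "2-3 times video length" guidance.
    """
    if media_seconds < 0:
        raise ValueError("media_seconds must not be negative")
    return media_seconds * low_factor, media_seconds * high_factor


# Example: a 10-minute (600-second) video.
low, high = machine_processing_estimate(600)
print(f"{low / 60:.0f}-{high / 60:.0f} minutes")  # prints: 20-30 minutes
```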
Upgrading Machine Speech-to-Text to Human transcription
You can upgrade machine transcriptions and/or translations to Human processing. Human transcription and translation costs are billed monthly in arrears. Contact your Medallia Representative for pricing information.
How to order transcripts
A LivingLens account is required to access the articles linked below.
For File Upload
For Zoom Import
Upgrade transcription and/or translation of a file in LivingLens
For CaptureMe mobile uploads — Customer Admin users should set CaptureMe uploads to process via Channel Management - Auto Requests.
For Media Capture uploads — Customer Admin users should set Media Capture uploads to process via Channel Management - Auto Requests.
Note: If your account does not have a Customer Admin, contact LivingLens Support via Medallia Knowledge Center to request transcription for CaptureMe or Media Capture uploads.