Speech Recognition: Art of the possible - DigiFest 2022

Speech recognition: Art of the possible
Dominik.Lukes@ctl.ox.ac.uk @techczech

Dominik’s journey
Computational linguistics
Cognitive linguistics
Language teaching
1990–1995
Language teacher training
Translation
Metaphor / discourse studies
1995–2008
Readability
Learning / Assistive technology
Dyslexia teacher training
2009 – present

Bill Gates in 2011
“The next big thing is definitely
speech and voice recognition.”

What do we want to know?
What is the current state of the art?
How did we get here?
Where are we going?

Are we asking the right
questions?

Tasks for speech recognition by difficulty
Select word from list
Interpret command
Type dictation
Transcribe presentation
Transcribe conversation

How we think of it vs how it is
How we think of it:
Select word from list
Interpret command
Type dictation
Transcribe presentation
Transcribe conversation
How it is:
Transcribe conversation
Transcribe presentation
Type dictation
Interpret command
Select word from list

Speech recognition approximate timeline
1950s: Select digit
1970s: Select from 1000 words
1980s: Select from large vocabulary
1990s: Dictate word by word
1997: Dictate whole sentences
2012: Transcribe YouTube video
2019: Transcribe conversation

What is the actual job of
speech recognition?

What is this word?
[pʰɹɛtsɫ̩]
[pɹɛtsl]
/pretsəl/
<pretzel>

What’s the problem
aspirated /p/ at start of a stressed syllable
devoiced /r/ following /p/
labialised /r/ following /p/
dark /l/
syllabic consonant
glottal stop

It gets worse: find the missing sounds

Course on speech recognition 1993
Faster computers won’t help
improve speech recognition. We
need a new approach.

Dragon NaturallySpeaking was released in 1997.
It can recognise whole sentences.
What happened?

How speech recognition does not work
Finding individual sounds
(phonemes) in the speech and
matching them to letters.

How speech recognition actually works
P(W|C)
What is the likelihood that the
next word is X given what came
before?
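As a rough illustration of P(W|C), the sketch below estimates next-word probabilities from word-pair counts. It is a toy bigram model, not how any particular recogniser is built: the corpus, the one-word context, and the function names are all assumptions made up for this example.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus, purely for illustration.
corpus = [
    "i would like a pretzel",
    "i would like a coffee",
    "i would like to dictate a letter",
]

def train_bigrams(sentences):
    """Count how often each word follows each one-word context."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for context, word in zip(words, words[1:]):
            counts[context][word] += 1
    return counts

def next_word_probability(counts, context, word):
    """Estimate P(word | context) as a relative frequency."""
    total = sum(counts[context].values())
    if total == 0:
        return 0.0
    return counts[context][word] / total

bigrams = train_bigrams(corpus)
p_like = next_word_probability(bigrams, "would", "like")    # always follows "would" here
p_pretzel = next_word_probability(bigrams, "a", "pretzel")  # one of three words seen after "a"
p_unseen = next_word_probability(bigrams, "a", "lipa")      # never seen after "a"
```

A word that never appeared after a given context gets probability zero in this toy model, which hints at why names, acronyms and specialist terms are hard for a purely probabilistic system.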

Actually, it is quite a bit more complicated (Huang and Deng 2009)

Probabilistic (stochastic)
ASR enabled the change.
Linguistics took the back
seat.

Fred Jelinek (ASR Pioneer - 1988?)
"Every time I fire a linguist, the
performance of the speech
recognizer goes up"

Consequence of probabilistic approach:
Worse on words not predictable from context
Names
Acronyms
Specialist terms

Question in 2011
I recorded a lecture, can I use
Dragon to transcribe it?

“Caption fails” in 2014 provided a source of comedy

YouTube Captions today are usable and useful

So what happened
between 2014 and 2022?

Ingredients of success
Larger data sets
More computing power
Neural networks

Patrick Winston (2015) MIT Lecture 12a in AI course
It was in 2010, yes, that's right. It was in 2010. We
were having our annual discussion about what we
would dump from 6.034 in order to make room for
some other stuff. And we almost killed off neural
nets. That might seem strange because our heads
are stuffed with neurons. … But many of us felt that
the neural models of the day weren't much in
the way of faithful models of what actually goes
on inside our heads. And besides that, nobody
had ever made a neural net that was worth a
darn for doing anything.

2012 – ImageNet showed
that Neural Networks are
much better at computing
the probabilities for
complex data.

Ok, we have neural nets,
what does that mean?

Things to know about Neural Nets
Everything has a probability
Same input does not produce same output
They have no ‘sanity check’ or ‘common sense’

What do probabilities look like?

What BERT is not: Lessons from a new suite
of psycholinguistic diagnostics for language
models
Allyson Ettinger 2019

Output changes as more
information is made
available. (Not always for
the better)

Examples from today’s captions
Crystal > Chris is
Am > and
experts > experience
AR > a our

Different ways of transcribing Dua Lipa
alipa
dualipa
dua lipa
lipa
duda lipa

Rise and mostly fall of Google’s new spell Czech

Tracking faces at the tips of the shoes

Hallucination is a big problem

Question asked by faculty member in 2021
We correct the transcripts, why
doesn’t the system learn the
correct spelling?

Adding your own word list
just tweaks the
probabilities.
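One way to picture "just tweaks the probabilities" is the sketch below. It assumes a recogniser that scores a few candidate transcriptions and a word list that multiplies the score of any candidate containing a listed word. The candidate texts, the scores and the `boost` parameter are invented for illustration and do not describe any vendor's actual mechanism.

```python
# Hypothetical recogniser scores for one stretch of audio (made-up numbers).
candidates = {"do a leaper": 0.5, "dua lipa": 0.2, "do eleven": 0.3}

def boost_with_word_list(scores, word_list, boost=2.0):
    """Multiply the score of candidates containing a listed word, then renormalise."""
    boosted = {}
    for text, score in scores.items():
        has_listed_word = any(w in text.split() for w in word_list)
        boosted[text] = score * boost if has_listed_word else score
    total = sum(boosted.values())
    return {text: score / total for text, score in boosted.items()}

mild = boost_with_word_list(candidates, ["lipa"], boost=2.0)
strong = boost_with_word_list(candidates, ["lipa"], boost=4.0)
best_mild = max(mild, key=mild.get)        # the listed word still loses
best_strong = max(strong, key=strong.get)  # a larger boost lets it win
```

With the mild boost the listed name becomes more likely but the wrong candidate still wins. The list shifts the odds and does not guarantee the spelling.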

Choosing a genre setting
tweaks the probabilities.

Another thing to know about NN
Neural Nets use very large data
sets and can take days or
weeks to train.

Consequences of NN size
Speech recognition is often not done on device.
Individual input often cannot adjust the quality (except in pre-training)
Most applications use APIs from the big players
Few open source/free options

Big players in the field
Google
Microsoft (now also Nuance)
Amazon

Interesting smaller companies
Verbit.ai
Carescribe.io (Caption.Ed)
Otter.ai
Rev.ai

Interesting applications
Descript
Microsoft Reading Progress
Microsoft Presentation Coach

What can we expect
in the future?

The Original Roomba (2002) vs Roomba S9+ (2019) - Wow!

What happens in speeches
Fillers
Repetition

What does conversation actually look like?

Possible futures?
Incremental improvement (similar to Roomba in 17 years):
Accurate lecture transcripts
Fluent dictation with pauses
Better meeting transcription
Revolutionary change (similar to change in speech recognition in 6 years):
Informal conversation transcription
Interactive dictation
Multilingual speech transcription

How should we think about accuracy?
We speak 120-180 words per minute
99% accurate = 2 errors per minute
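The arithmetic behind that figure can be sketched as follows. The speaking rates and the 99% accuracy come from the slide. The 50-minute lecture at 150 words per minute is an assumption added here to show how the errors add up.

```python
def errors_per_minute(words_per_minute, accuracy):
    """Expected number of wrongly transcribed words in one minute of speech."""
    return words_per_minute * (1.0 - accuracy)

slow = errors_per_minute(120, 0.99)  # slower speaker, 120 words per minute
fast = errors_per_minute(180, 0.99)  # faster speaker, 180 words per minute

# Assumed example: a 50-minute lecture delivered at 150 words per minute.
lecture = errors_per_minute(150, 0.99) * 50
```

At 99% accuracy this gives between roughly 1.2 and 1.8 errors per minute, and about 75 errors over the assumed lecture.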

From Sept 2014 xkcd.com/1425
Sometimes it is hard to judge
how much effort will be needed
to solve a seemingly easy
problem.

Wishlist (a few hours of coding)
Transcripts indicate level
of confidence
Benchmarks for lecture
transcripts
Better manual control of
transcripts (like Descript)
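The first wishlist item could look something like the sketch below. It assumes the recogniser returns a confidence score for each word. The (word, confidence) pairs, the threshold and the bracket notation are all hypothetical choices made for this example.

```python
# Hypothetical (word, confidence) pairs; real services differ in what they expose.
words = [("tracking", 0.95), ("faces", 0.41), ("at", 0.98), ("the", 0.99),
         ("tips", 0.55), ("of", 0.97), ("the", 0.99), ("shoes", 0.62)]

def mark_low_confidence(word_scores, threshold=0.6):
    """Wrap words below the threshold in [brackets?] so a reader can spot them."""
    marked = []
    for word, confidence in word_scores:
        marked.append(f"[{word}?]" if confidence < threshold else word)
    return " ".join(marked)

transcript = mark_low_confidence(words)
# transcript == "tracking [faces?] at the [tips?] of the shoes"
```

A reader correcting the transcript could then go straight to the bracketed words instead of proofreading every line.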

Dreamlist (5 years and a research team)
Multilingual transcription
(identify change in
language)
Multimodal transcription
(use information from
video)
Raw to readable
transcript

Kate Knill, Machine Intelligence Lab, University of Cambridge
Richard Cave, MND Association (and formerly Google project Euphonia)
Richard Purcell, Caption.Ed
Irit Opher, Head of Research at Verbit.ai

What is the current state of the art of speech recognition in general and in the transcription of recorded speech in particular?
What are the current quality metrics and how much do they tell us about the suitability of models? Do we need better ones?
After the big recent jump in performance, are we seeing a plateau with incremental growth or can we expect another step change in quality?
Where can we see the most innovation? What are the research and development blind spots where more effort is needed?
What are the currently unsolved problems?
What room is there for smaller players to innovate in this space? How much do they have to rely on pre-trained models from big providers? Is there space for open source?

This presentation is licensed
under Creative Commons By
Attribution license except where
otherwise noted.
Icons and stock images from Microsoft
Office 365 creative premium. They
cannot be distributed separately from this
document.
