How to get the most out of Polly, Leveraging Lexicons and SSML - March 2017 AWS Online Tech Talks

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Marco Nicolis, Remus Mois
Amazon Text-to-Speech
03/27/2017
How to get the most out of Polly
Leveraging lexicons and SSML

What to Expect from the Session
• What is Polly?
• Example app
• Using punctuation and SSML
• Using external Lexicons
• Q&A

What is Polly?
• A service that converts text into lifelike speech
• 47 voices, 24 languages
• Developers can store, replay and distribute generated speech

The Polly console
I bought 2lbs of meat
and 16oz of potatoes
Justin (US)
Amy (UK)
Raveena (IN)

Text-to-Speech Pipeline
Text → Text normalization → Grapheme-to-phoneme conversion → Waveform generation → Speech
Text: She has $20 in her pocket.
Text normalization: she has twenty dollars in her pocket
Grapheme-to-phoneme conversion: ˈ ʃ i ˈ h æ z ˈ t w ɛ n . t i ˈ d ɑ . ɫ ə ɹ z ˈ ɪ n ˈ h ɝ ɹ ˈ p ɑ . k ə t

Main Challenges for Text-to-Speech
Goal: Convert text into intelligible, accurate, and natural speech
• G2P: rough, though, through.
• Homographs: same spelling, different pronunciations.
I live in Poland
This presentation is broadcast live from Poland
Context helps disambiguate 'live'. But...
I read this book.

Main challenges for Text-to-Speech
• Text normalization: disambiguation of abbreviations, acronyms, and units. For example, ‘St.’ can be expanded as ‘street’ or ‘saint’:
<speak>St. Patrick St.</speak>
• Foreign words (déjà vu), proper names (François Hollande), social media lingo (ASAP, LOL), etc.

SSML in Polly
Speech Synthesis Markup Language (SSML)
• W3C recommendation; an XML-based markup language for speech synthesis applications. Amazon Polly tags are compliant with the SSML 1.1 specification.
• Allows customers to modify certain aspects of the TTS output, for example the pronunciation of words and the expansion of abbreviations and acronyms, as well as pitch, rate of speech, and volume.

SSML document structure
All SSML documents must start with an opening <speak> tag and end with a closing </speak> tag. All other tags are inserted between <speak> and </speak>.
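The structure rule above can be checked locally before any text is sent to Polly. The following is a minimal sketch, assuming the SSML is held in a Python string; the helper name is_valid_ssml_document is hypothetical, and it only checks XML well-formedness and the <speak> root, not whether Polly supports every tag inside.

```python
import xml.etree.ElementTree as ET


def is_valid_ssml_document(ssml: str) -> bool:
    """Return True if `ssml` is well-formed XML whose root element is <speak>.

    Hypothetical helper: it does not validate the tags inside <speak>.
    """
    try:
        root = ET.fromstring(ssml)
    except ET.ParseError:
        # Plain text, or unbalanced tags, is not an SSML document.
        return False
    return root.tag == "speak"


print(is_valid_ssml_document("<speak>Hello from Polly.</speak>"))  # True
print(is_valid_ssml_document("Hello from Polly."))                 # False
print(is_valid_ssml_document("<speak>Hello from Polly."))          # False
```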

Changing pronunciations in Polly

<phoneme>

The <sub> tag
In-line aliasing
In many cases we do not want to change all instances of a certain word.
<speak>
My favorite chemical element is <sub alias="aluminum">Al</sub>, but Al prefers <sub alias="magnesium">Mg</sub>.
</speak>
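In SSML, in-line aliasing is written with the <sub alias="..."> tag. As a rough sketch of aliasing one instance of a word rather than all of them, the hypothetical helper alias_first below wraps only the first whole-word match and leaves later occurrences untouched. The function name and the choice of "first match only" are assumptions made for illustration.

```python
import re
from xml.sax.saxutils import escape, quoteattr


def alias_first(text: str, word: str, alias: str) -> str:
    """Wrap only the first whole-word occurrence of `word` in a <sub> tag."""
    match = re.search(rf"\b{re.escape(word)}\b", text)
    if match is None:
        return escape(text)
    before, after = text[: match.start()], text[match.end():]
    # escape() protects &, < and > in the plain-text parts.
    return (
        f"{escape(before)}<sub alias={quoteattr(alias)}>"
        f"{escape(word)}</sub>{escape(after)}"
    )


sentence = "My favorite chemical element is Al, but Al prefers Mg."
ssml = f"<speak>{alias_first(sentence, 'Al', 'aluminum')}</speak>"
print(ssml)
# <speak>My favorite chemical element is <sub alias="aluminum">Al</sub>, but Al prefers Mg.</speak>
```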

The <phoneme> tag
Force pronunciation in-line
Read: present or past?
I <phoneme alphabet="x-sampa" ph='"rid'>read</phoneme> a book.
I <phoneme alphabet="x-sampa" ph='"rEd'>read</phoneme> a book.
Examples of EN phonemes
http://docs.aws.amazon.com/polly/latest/dg/supported-ssml.html
IPA   X-SAMPA   Example
ɹ     r         red
ɛ     E         dress
i     i         fleece
d     d         dig
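One practical detail when generating <phoneme> tags from code: the X-SAMPA primary stress mark is a double quote, so the ph attribute has to be quoted carefully. The sketch below uses Python's standard quoteattr to pick safe quoting; the helper name phoneme is hypothetical, and the two transcriptions are the ones shown for 'read' above.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def phoneme(text: str, ph: str, alphabet: str = "x-sampa") -> str:
    """Build a <phoneme> element; `alphabet` is limited to the two named in this talk."""
    if alphabet not in ("ipa", "x-sampa"):
        raise ValueError(f"unsupported alphabet: {alphabet!r}")
    # quoteattr() switches to single quotes when the value contains a double quote.
    return (
        f"<phoneme alphabet={quoteattr(alphabet)} ph={quoteattr(ph)}>"
        f"{escape(text)}</phoneme>"
    )


present_tense = '"rid'  # X-SAMPA, present tense of 'read'
past_tense = '"rEd'     # X-SAMPA, past tense of 'read'

present = "<speak>I " + phoneme("read", present_tense) + " a book.</speak>"
past = "<speak>I " + phoneme("read", past_tense) + " a book.</speak>"

# Round-trip through an XML parser to confirm the stress mark survived quoting.
print(ET.fromstring(present).find("phoneme").get("ph"))  # "rid
print(ET.fromstring(past).find("phoneme").get("ph"))     # "rEd
```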

Lexicons: <alias>
Alias (e.g. abbreviation expansion)
Follows the Pronunciation Lexicon Specification (PLS)
<lexeme><grapheme>Ne</grapheme><alias>Neon</alias></lexeme>
<lexeme><grapheme>Na</grapheme><alias>Sodium</alias></lexeme>
<lexeme><grapheme>Mg</grapheme><alias>Magnesium</alias></lexeme>
<lexeme><grapheme>Al</grapheme><alias>Aluminum</alias></lexeme>
<lexeme><grapheme>Si</grapheme><alias>Silicon</alias></lexeme>
<speak>Mg and Al are chemical elements</speak>

Lexicons: <phoneme>
Assign custom pronunciation (IPA or X-SAMPA alphabets)
Settling the 'gif' issue once and for all.
<lexeme><grapheme>gif</grapheme><phoneme>"dZIf</phoneme></lexeme>
<lexeme><grapheme>David</grapheme><phoneme>"dA.%vid</phoneme></lexeme>
<speak>I like this gif.</speak>
<speak>Here's my friend David.</speak>
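The <lexeme> entries shown for aliases and phonemes are fragments; a complete lexicon file wraps them in a PLS <lexicon> root element. The sketch below assembles such a file as a string. The helper name build_lexicon is hypothetical, and the choice of "x-sampa" and "en-US" as defaults is an assumption for this example; the root element and namespace follow the PLS 1.0 specification.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

PLS_NAMESPACE = "http://www.w3.org/2005/01/pronunciation-lexicon"


def build_lexicon(aliases, phonemes, alphabet="x-sampa", lang="en-US"):
    """Return a PLS lexicon document containing alias and phoneme lexemes."""
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        f'<lexicon version="1.0" xmlns="{PLS_NAMESPACE}" '
        f"alphabet={quoteattr(alphabet)} xml:lang={quoteattr(lang)}>",
    ]
    for grapheme, alias in aliases.items():
        lines.append(
            f"  <lexeme><grapheme>{escape(grapheme)}</grapheme>"
            f"<alias>{escape(alias)}</alias></lexeme>"
        )
    for grapheme, ph in phonemes.items():
        lines.append(
            f"  <lexeme><grapheme>{escape(grapheme)}</grapheme>"
            f"<phoneme>{escape(ph)}</phoneme></lexeme>"
        )
    lines.append("</lexicon>")
    return "\n".join(lines)


lexicon_xml = build_lexicon(
    aliases={"Mg": "Magnesium", "Al": "Aluminum"},
    phonemes={"gif": '"dZIf'},
)
print(lexicon_xml)

# Sanity check: the document parses and contains three lexemes.
root = ET.fromstring(lexicon_xml.encode("utf-8"))
print(len(root.findall(f"{{{PLS_NAMESPACE}}}lexeme")))  # 3
```

With boto3, a string like this could then be uploaded through the Polly client's put_lexicon call and referenced by name when synthesizing speech; that step is left out here because it needs AWS credentials.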

The <lang> tag
Foreign words and phrases
Foreign phrases are rendered better if they are enclosed inside the <lang> tag,
as in the following example.
French in English
<speak>
J'adore chanter.
</speak>
<speak>
<lang xml:lang="fr-FR">J'adore chanter</lang>.
</speak>

The <lang> tag
English in Italian
The pronunciation of English is like that of a non-bilingual Italian speaker.
<speak>
Mi piace Bruce Springsteen.
</speak>
<speak>
Mi piace <lang xml:lang="en-US">Bruce Springsteen.</lang>
</speak>

The <lang> tag
Multiple languages
All languages supported by Amazon Polly can be invoked with the <lang> tag.
EN FR IT ES PL
<speak>Onion, onion, cipolla, cebolla, cebula.</speak>
<speak>Onion, <lang xml:lang="fr-FR">onion</lang>, <lang xml:lang="it-IT">cipolla</lang>, <lang xml:lang="es-ES">cebolla</lang>, <lang xml:lang="pl-PL">cebula</lang>.</speak>
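When the list of foreign words comes from data rather than being typed by hand, the <lang> wrapping can be generated. This is a small sketch under that assumption; the helper name lang and the (word, language code) pair format are hypothetical, and the words are the ones from the slide.

```python
from xml.sax.saxutils import escape, quoteattr


def lang(text: str, code: str) -> str:
    """Wrap `text` in a <lang> tag for the given language code, e.g. 'it-IT'."""
    return f"<lang xml:lang={quoteattr(code)}>{escape(text)}</lang>"


# None means "leave the word in the voice's own language".
words = [
    ("Onion", None),
    ("onion", "fr-FR"),
    ("cipolla", "it-IT"),
    ("cebolla", "es-ES"),
    ("cebula", "pl-PL"),
]

parts = [escape(word) if code is None else lang(word, code) for word, code in words]
ssml = "<speak>" + ", ".join(parts) + ".</speak>"
print(ssml)
```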

Define a specific interpretation
<say-as interpret-as="">

The <say-as> tag
• The TTS engine works well for most common and unambiguous text structures, such as dates and times.
• In ambiguous cases (phone numbers, addresses, etc.), interpretation can be forced through the <say-as> tag.
Phone numbers (interpret-as="telephone")
<speak>(514) 888-5195
<say-as interpret-as="telephone">(514) 888-5195</say-as>
</speak>
<speak>(514) 888-5195x123 </speak>
<speak><say-as interpret-as="telephone">(514) 888-5195x123</say-as></speak>

The <say-as> tag
Phone numbers (US vs. UK): different pronunciation styles.
US
Richard's number is <prosody rate='slow'> <say-as interpret-as='telephone'>(212) 224-1555</say-as> </prosody>
UK
Richard's number is <prosody rate='slow'> <say-as interpret-as='telephone'>(212) 224-1555</say-as></prosody>

<say-as interpret-as="expletive">
Bleeping undesirable content
<speak>
Your next song is "Killing in the name of" by Rage Against
the Machine.
</speak>
<speak>
Your next song is "<say-as interpret-as="expletive">Killing</say-as> in the name of" by Rage Against the Machine.
</speak>

<say-as interpret-as="spell-out">
Read character by character
<speak>And here is how you spell handkerchief: <prosody rate="x-slow"><say-as interpret-as="spell-out">handkerchief</say-as></prosody>.</speak>
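The three interpret-as values used in the examples above (telephone, expletive, spell-out) all follow the same pattern, so a single helper can produce them. This is a minimal sketch; the function name say_as is hypothetical and it does not check that the value is one Polly accepts.

```python
from xml.sax.saxutils import escape, quoteattr


def say_as(text: str, interpret_as: str) -> str:
    """Wrap `text` in <say-as> with the given interpret-as value (not validated)."""
    return f"<say-as interpret-as={quoteattr(interpret_as)}>{escape(text)}</say-as>"


phone = "<speak>" + say_as("(514) 888-5195", "telephone") + "</speak>"
bleeped = "<speak>" + say_as("Killing", "expletive") + " in the name of</speak>"
spelled = "<speak>" + say_as("handkerchief", "spell-out") + "</speak>"

print(phone)
# <speak><say-as interpret-as="telephone">(514) 888-5195</say-as></speak>
print(bleeped)
print(spelled)
```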

Modify speech delivery
<prosody>

The power of commas / periods
Adding punctuation helps produce better prosody
<speak>He went to Harvard and when he decided to drop out it was
not to find enlightenment with an Indian guru but to start a
computer software company.</speak>
<speak>He went to Harvard, and when he decided to drop out, it
was not to find enlightenment with an Indian guru, but to start a
computer software company.</speak>

The <prosody> tag
The <prosody> tag allows some changes to how speech is delivered, through the following supported attributes:
• volume
• rate
• pitch
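The three attributes listed above can be combined on one tag, and <prosody> tags can be nested. The sketch below shows one way to generate them; the helper name prosody is hypothetical, and it deliberately treats its first argument as SSML markup (not plain text) so that nested calls work, which means the caller is responsible for escaping plain text.

```python
from xml.sax.saxutils import quoteattr


def prosody(content: str, *, volume=None, rate=None, pitch=None) -> str:
    """Wrap SSML `content` in <prosody>, emitting only the attributes given."""
    attrs = {"volume": volume, "rate": rate, "pitch": pitch}
    rendered = "".join(
        f" {name}={quoteattr(value)}"
        for name, value in attrs.items()
        if value is not None
    )
    if not rendered:
        raise ValueError("set at least one of volume, rate or pitch")
    return f"<prosody{rendered}>{content}</prosody>"


louder = prosody("or I can speak louder", volume="x-loud")
print("<speak>I can speak normally, " + louder + ".</speak>")

# Nesting: the inner tag changes rate inside a louder outer span.
nested = prosody("Sold " + prosody("for 600.", rate="+60%"), volume="x-loud")
print(nested)
# <prosody volume="x-loud">Sold <prosody rate="+60%">for 600.</prosody></prosody>
```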

The volume attribute
Modify the volume of speech
<speak>
I can speak normally, <prosody volume="x-loud"> or I can speak
louder</prosody>.
</speak>
<speak>
I can speak normally, <prosody volume="x-soft"> or I can speak
quieter</prosody>.
</speak>

The rate attribute
Change the speed of speech
<speak>
When I wake up, <prosody rate="x-slow">I speak quite
slowly</prosody>.
</speak>
<speak>
When I am in a hurry, <prosody rate="x-fast">I speak very
fast</prosody>.
</speak>

The pitch attribute
Modify the pitch of a word/phrase
<speak>
When I get angry, <prosody pitch="x-high">my pitch goes way
up</prosody>
</speak>
<speak>
When I get sad, <prosody pitch="x-low">my pitch goes way
down</prosody>
</speak>

The pitch attribute
Modify the pitch of a word/phrase
<speak>
I can go normal, <prosody pitch="high">high</prosody>, <prosody pitch="x-high">higher</prosody>, <prosody pitch="low">low</prosody>, and <prosody pitch="x-low">lower</prosody>.
</speak>

Use pitch to improve intonation
Adding punctuation and modifying pitch helps produce better prosody
Do you like this or that?
Do you like <prosody pitch="+5%"> this </prosody>, or <prosody
pitch="-2%">that?</prosody>

Punctuation and the <break> tag
Add a pause anywhere (time, strength attributes)
And the winner is <break time='5s'/> Bob Dylan!
And the winner is <break strength="x-strong" /> Bob Dylan!
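The two forms of <break> above differ only in which attribute is set. A rough sketch of a generator follows; the helper name pause is hypothetical, and requiring exactly one of time or strength is a simplification chosen for this example. The strength names are the ones defined by SSML 1.1.

```python
from xml.sax.saxutils import quoteattr

# Strength values defined by the SSML 1.1 specification.
STRENGTHS = ("none", "x-weak", "weak", "medium", "strong", "x-strong")


def pause(*, time=None, strength=None) -> str:
    """Return a self-closing <break> tag with either a time or a strength."""
    if (time is None) == (strength is None):
        raise ValueError("pass exactly one of time or strength")
    if strength is not None:
        if strength not in STRENGTHS:
            raise ValueError(f"unknown strength: {strength!r}")
        return f"<break strength={quoteattr(strength)}/>"
    return f"<break time={quoteattr(time)}/>"


timed = "<speak>And the winner is " + pause(time="5s") + " Bob Dylan!</speak>"
strong = "<speak>And the winner is " + pause(strength="x-strong") + " Bob Dylan!</speak>"
print(timed)
# <speak>And the winner is <break time="5s"/> Bob Dylan!</speak>
print(strong)
```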

Fun with SSML
'Can you make your voices sound like an auctioneer?'
<speak><prosody rate='+60%'>I’m at 500 and I want 550<prosody volume='x-loud'>550</prosody></prosody>
<prosody rate='+60%'>bid on 550 I’m at 500 would you go 550 550 for the gentleman in the corner</prosody>
<prosody rate="+90%">A big black bug bit a big black bear a big black bug bit a big black bear</prosody>
Do we get 600?
<prosody rate='+90%'>A big black bug bit a big black bear</prosody><prosody rate='+60%'>We got 600 for the whole herd</prosody><prosody rate='default' volume='x-loud'>Sold <prosody rate='+60%'>for 600.</prosody></prosody></speak>

Fun with SSML
'It's good, but can you make her sound like she's from
Boston???'
If your car’s blinkers are broken, it may be the blinker
relay. Fortunately, this car fix is easy to do.
<speak>If <phoneme ph='"jO: "kAz "blIN.k@z'>your car's
blinkers</phoneme> <phoneme ph='%A'>are</phoneme> broken,
it may be the <phoneme ph='"blIN.k@'>blinker</phoneme>
relay. <phoneme ph='"fO.tS@n.@t.li'>Fortunately</phoneme>,
this <phoneme ph='"kA'>car</phoneme> fix is easy to do.
</speak>

• Contact us with any questions about this webinar or Polly in general
polly-webinars-feedback@amazon.com
• SSML documentation
http://docs.aws.amazon.com/polly/latest/dg/supported-ssml.html
• Introducing Amazon Polly at re:Invent 2016
https://www.youtube.com/watch?v=zjMqimHis3U&t=2s
• PLS 1.0 Specification
https://www.w3.org/TR/2008/REC-pronunciation-lexicon-20081014/
Next Amazon Polly webinar (Apr 10th): "How to integrate Amazon Polly voices seamlessly into your application workflow"
