What is Text-To-Speech?
Text-To-Speech, or TTS for short, is computer software
that converts text into audible speech.
TTS is separate from speech recognition. You can think of
TTS as "talking" and speech recognition as "listening".
There is some shared technology, but neither is just the
reverse of the other. And the talking/listening analogy is
limited too. Neither technology really involves much
language understanding.
TTS is also distinct from language translation, though
voice-to-voice translation would employ both speech
recognition and TTS. Unlike those two, translation requires
significant understanding of the meaning.
People new to the idea of TTS often underestimate the
difficulty of the task. After all, humans can typically
learn this stuff in early childhood. They talk, listen,
understand, and even translate without much apparent effort.
Humans do all this work without even being aware of it in
most cases, but that doesn't make it easy.
If programmers could create software that really
understands human language, we could avoid most of the
guesswork in TTS, but that hasn't happened yet. Until then,
TTS is more like learning to read a foreign language aloud
without ever understanding the words. With a good
dictionary, grammar rules, etc., you can get better and
better, but you will still occasionally make mistakes that
are obvious to native speakers.

How does TTS work?
TTS is often described as two conceptual stages. In the
first stage, the system decides how the text should be
spoken, that is, how each word should be pronounced, what
length and pitch each phoneme should have, etc. In the
second stage, the system does its best to create audio that
matches the specifications produced by stage one.
TTS software has little or no understanding of
the text being read. It uses rules, lists, dictionaries,
etc. to make very sophisticated guesses about how a piece of
text should be read. While general performance can be quite
good, some decisions are intrinsically hard to make without
some level of understanding. For example, the word "bass" is
pronounced differently in the phrases "bass drum" and "bass
boat". Intonation depends
in many cases on the writer's intention, which often cannot
be inferred in short texts even by human readers. As a
result, TTS systems will occasionally make mistakes and can
be fooled by carefully constructed texts. These are
challenging problems for all TTS systems, and we continue to
improve ours as we are able.
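The kind of rule-and-list guessing described above can be sketched in a few lines. This is only an illustration, not a description of our actual system: the cue-word lists, the ARPAbet-style phoneme strings, the default choice, and the function name guess_pronunciation are all assumptions made up for the example.

```python
import re

# Hypothetical cue lists: each candidate pronunciation is paired with
# nearby words that make it more likely. Real systems use far larger resources.
HOMOGRAPH_RULES = {
    "bass": [
        ("B EY S", {"drum", "guitar", "clef", "player", "line"}),   # musical sense
        ("B AE S", {"boat", "fishing", "lake", "caught", "river"}),  # fish sense
    ],
}

# Assumed fallback when no cue word is found nearby.
DEFAULT_PRONUNCIATION = {"bass": "B EY S"}


def guess_pronunciation(word, sentence, window=3):
    """Guess a homograph's pronunciation from the words around it."""
    word = word.lower()
    if word not in HOMOGRAPH_RULES:
        return None  # not a homograph this sketch knows about
    tokens = re.findall(r"[a-z']+", sentence.lower())
    if word not in tokens:
        return DEFAULT_PRONUNCIATION[word]
    idx = tokens.index(word)
    # Collect up to `window` words on each side of the target word.
    context = set(tokens[max(0, idx - window):idx] + tokens[idx + 1:idx + 1 + window])
    best, best_hits = DEFAULT_PRONUNCIATION[word], 0
    for pronunciation, cues in HOMOGRAPH_RULES[word]:
        hits = len(context & cues)
        if hits > best_hits:
            best, best_hits = pronunciation, hits
    return best


print(guess_pronunciation("bass", "He plays the bass drum"))
print(guess_pronunciation("bass", "We took the bass boat out"))
```

When no cue word appears, as in "I like bass", the sketch simply falls back to its default, which is exactly the sort of guess that can turn out wrong.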
The type of TTS we do is called a "concatenative" system,
meaning that we record a human speaker to make a voice
database. We reuse small chunks of the recordings to create
new sentences containing words that were never recorded.
Further, we do "unit selection" synthesis. This means that
we use large voice databases and do clever searches
on-the-fly to find chunks in the voice database that best
match the requested sentences.
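To make the idea of searching a voice database more concrete, here is a toy sketch of unit selection as a dynamic-programming (Viterbi-style) search. It is not our production algorithm: the Unit and Target classes, the two cost functions, their weights, and the six-unit database are all hypothetical and chosen only to keep the example small. The common framing it follows is that each chunk has a "target cost" (how well it matches the stage-one specification) and a "join cost" (how smoothly it follows the previous chunk).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    phoneme: str
    pitch: float     # Hz
    duration: float  # seconds
    source: int      # position in the original recordings (hypothetical index)


@dataclass(frozen=True)
class Target:
    phoneme: str
    pitch: float
    duration: float


def target_cost(target, unit):
    # How far the recorded chunk is from what stage one asked for (assumed weights).
    return abs(target.pitch - unit.pitch) / 100.0 + abs(target.duration - unit.duration) * 10.0


def join_cost(prev, cur):
    # Chunks that were adjacent in the recording join for free.
    if cur.source == prev.source + 1:
        return 0.0
    return 1.0 + abs(prev.pitch - cur.pitch) / 100.0


def select_units(targets, database):
    """Return the lowest-cost sequence of units, one per target."""
    candidates = [[u for u in database if u.phoneme == t.phoneme] for t in targets]
    if any(not c for c in candidates):
        raise ValueError("no recorded unit for at least one requested phoneme")
    # best holds (total cost so far, path) for each candidate at the current position.
    best = [(target_cost(targets[0], u), [u]) for u in candidates[0]]
    for target, cands in zip(targets[1:], candidates[1:]):
        new_best = []
        for u in cands:
            cost, path = min(
                ((c + join_cost(p[-1], u), p) for c, p in best),
                key=lambda pair: pair[0],
            )
            new_best.append((cost + target_cost(target, u), path + [u]))
        best = new_best
    return min(best, key=lambda pair: pair[0])[1]


# Tiny made-up database: the recorded words "cat" (0-2) and "bat" (10-12).
DATABASE = [
    Unit("k", 120.0, 0.1, 0), Unit("ae", 120.0, 0.1, 1), Unit("t", 120.0, 0.1, 2),
    Unit("b", 110.0, 0.1, 10), Unit("ae", 125.0, 0.1, 11), Unit("t", 110.0, 0.1, 12),
]

# Request "cab", a word that was never recorded.
cab = [Target(p, 120.0, 0.1) for p in ("k", "ae", "b")]
print([u.source for u in select_units(cab, DATABASE)])
```

In this toy setup the search keeps the "k" and "ae" chunks that were recorded next to each other in "cat" and borrows only the "b" from "bat", which is the sense in which recorded chunks are reused to build unrecorded words.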
Instructions for use:
Please do not abbreviate and do not use capital letters (for
example, instead of TTS, write: text to speech).
© All rights reserved