
What is Text-To-Speech?
Text-To-Speech, or TTS for short, is computer software that converts text into audible speech.

TTS is separate from speech recognition. You can think of TTS as "talking" and speech recognition as "listening". There is some shared technology, but neither is just the reverse of the other. And the talking/listening analogy is limited too. Neither technology really involves much language understanding.

TTS is also distinct from language translation, though voice to voice translation would employ both speech recognition and TTS. Again, translation requires significant understanding of the meaning.

People new to the idea of TTS often underestimate the difficulty of the task. After all, humans can typically learn this stuff in early childhood. They talk, listen, understand, and even translate without much apparent effort. Humans do all this work without even being aware of it in most cases, but that doesn't make it easy.

If programmers could create software that really understands human language, we could avoid most of the guesswork in TTS, but that hasn't happened yet. Until then, TTS is more like learning to read a foreign language aloud without ever understanding the words. With a good dictionary, grammar rules, etc., you can get better and better, but you will still occasionally make mistakes that are obvious to native speakers.

 
How does TTS work?
TTS is often described as two conceptual stages. In the first stage, the system decides how the text should be spoken, that is, how each word should be pronounced, what length and pitch each phoneme should have, etc. In the second stage, the system does its best to create audio that matches the specifications produced by stage one.
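The two stages can be illustrated with a small Python sketch. It is not how our system is implemented: the tiny lexicon, the fixed phoneme length, the falling pitch rule, and the names PhonemeSpec, plan_speech, and render_audio are all assumptions made up for illustration, and stage two just emits a sine tone where a real system would produce speech.

```python
import math
from dataclasses import dataclass


@dataclass
class PhonemeSpec:
    """Stage-one output: what to say and how (hypothetical structure)."""
    phoneme: str
    duration_ms: int
    pitch_hz: float


# Toy pronunciation lexicon (assumption, not a real dictionary).
LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}


def plan_speech(text):
    """Stage one: decide pronunciation, length, and pitch for each phoneme."""
    specs = []
    for word_index, word in enumerate(text.lower().split()):
        # Words missing from the toy lexicon are silently skipped.
        for phoneme in LEXICON.get(word, []):
            # Made-up intonation rule: pitch falls by 5 Hz per word.
            pitch = 120.0 - 5.0 * word_index
            specs.append(PhonemeSpec(phoneme, 80, pitch))
    return specs


def render_audio(specs, sample_rate=16000):
    """Stage two: create audio samples that match the stage-one specs."""
    samples = []
    for spec in specs:
        count = sample_rate * spec.duration_ms // 1000
        for i in range(count):
            angle = 2 * math.pi * spec.pitch_hz * i / sample_rate
            samples.append(math.sin(angle))
    return samples


specs = plan_speech("hello world")
samples = render_audio(specs)
```

The point of the split is that stage one never touches audio and stage two never looks at the text; the list of specifications is the only thing passed between them.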

TTS software has little or no understanding of the text being read. It uses rules, lists, dictionaries, etc. to make very sophisticated guesses about how a piece of text should be read. While general performance can be quite good, some decisions are intrinsically hard to make without some level of understanding. For example, the word "bass" is pronounced differently in the phrases "bass drum" and "bass boat". Intonation depends in many cases on the writer's intention, which often cannot be inferred in short texts even by human readers. As a result, TTS systems will occasionally make mistakes and can be fooled by carefully constructed texts. These are challenging problems for all TTS systems, and we continue to improve ours as we are able.
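A rule-based guess for the "bass" example might look like the Python sketch below. The cue word lists, the default choice, and the function name pronounce_bass are assumptions made up for illustration; real systems use far richer rules and dictionaries. Pronunciations are written in ARPAbet-style symbols.

```python
import re

# Hypothetical cue words; a real system would use much larger resources.
MUSIC_CUES = {"drum", "guitar", "clef", "player"}
FISH_CUES = {"boat", "fishing", "lake", "caught"}


def pronounce_bass(sentence):
    """Guess the pronunciation of "bass" from nearby words, with no understanding."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    music_votes = len(words & MUSIC_CUES)
    fish_votes = len(words & FISH_CUES)
    if fish_votes > music_votes:
        return "B AE S"  # the fish, rhymes with "pass"
    return "B EY S"      # the instrument, rhymes with "face" (default guess)


print(pronounce_bass("bass drum"))  # B EY S
print(pronounce_bass("bass boat"))  # B AE S
# A carefully constructed text fools the rule: the word "boat" wins the vote,
# so the musical "bass" is read as the fish.
print(pronounce_bass("He played the bass on a boat"))  # B AE S
```

The last call shows the kind of mistake described above: the rule counts words and has no idea what the sentence means.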

The type of TTS we do is called a "concatenative" system, meaning that we record a human speaker to make a voice database. We re-use small chunks of the recordings to create new sentences containing words that were never recorded. Further, we do "unit selection" synthesis. This means that we use large voice databases and do clever searches on-the-fly to find the chunks in the voice database that best match the requested sentences.
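The search idea behind unit selection can be sketched in Python as a shortest-path search over candidate chunks. This is a textbook-style illustration, not our actual search: the Unit fields, the two cost functions, and the cost values are assumptions made up for the example. Each candidate pays a "target cost" for not matching the requested pitch and a "join cost" for being glued to a chunk it did not originally follow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """One recorded chunk in the voice database (hypothetical fields)."""
    phoneme: str
    pitch_hz: float
    recording_id: int
    position: int


def target_cost(unit, wanted_pitch):
    # How far the recorded chunk is from what stage one asked for.
    return abs(unit.pitch_hz - wanted_pitch)


def join_cost(prev, nxt):
    # Chunks that were adjacent in the original recording join for free.
    if prev.recording_id == nxt.recording_id and nxt.position == prev.position + 1:
        return 0.0
    return 10.0 + abs(prev.pitch_hz - nxt.pitch_hz)


def select_units(targets, database):
    """Pick one unit per (phoneme, wanted_pitch) target with minimal total cost."""
    candidates = [[u for u in database if u.phoneme == ph] for ph, _ in targets]
    if any(not options for options in candidates):
        raise ValueError("a requested phoneme is missing from the database")
    costs = [target_cost(u, targets[0][1]) for u in candidates[0]]
    back = []  # back[i - 1][j] = best predecessor index for candidate j at step i
    for i in range(1, len(targets)):
        prev_units = candidates[i - 1]
        new_costs, pointers = [], []
        for unit in candidates[i]:
            totals = [costs[j] + join_cost(prev_units[j], unit)
                      for j in range(len(prev_units))]
            best_j = min(range(len(totals)), key=totals.__getitem__)
            new_costs.append(totals[best_j] + target_cost(unit, targets[i][1]))
            pointers.append(best_j)
        costs = new_costs
        back.append(pointers)
    # Trace the cheapest path backwards from the best final candidate.
    j = min(range(len(costs)), key=costs.__getitem__)
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return [candidates[i][j] for i, j in enumerate(path)]


# Two toy recordings: "cat" (recording 0) and "bat" (recording 1).
DATABASE = [
    Unit("K", 100.0, 0, 0), Unit("AE", 100.0, 0, 1), Unit("T", 100.0, 0, 2),
    Unit("B", 120.0, 1, 0), Unit("AE", 120.0, 1, 1), Unit("T", 121.0, 1, 2),
]

chosen = select_units([("K", 120.0), ("AE", 120.0), ("T", 120.0)], DATABASE)
print([u.recording_id for u in chosen])  # [0, 1, 1]
```

In the usage example the request is "cat" at a higher pitch than it was recorded. The search takes the only available "K" from recording 0, then switches to recording 1 for the rest, because one join penalty is cheaper than two poorly matching chunks.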

 

Instructions for use:

Please do not abbreviate and do not use capital letters (for example: instead of using TTS, write: text to speech).
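If you prepare text with a script before submitting it, a small Python helper along these lines could apply both rules. It is only a sketch: the function name prepare_input is made up, and the abbreviation table contains just the one example given above, so you would need to extend it with your own entries.

```python
import re

# Only the example from the instructions; extend with your own abbreviations.
ABBREVIATIONS = {"TTS": "text to speech"}


def prepare_input(text):
    """Expand known abbreviations, then convert everything to lowercase."""
    def expand(match):
        word = match.group(0)
        return ABBREVIATIONS.get(word.upper(), word)

    expanded = re.sub(r"[A-Za-z]+", expand, text)
    return expanded.lower()


print(prepare_input("Using TTS is easy"))  # using text to speech is easy
```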

© All rights reserved