A “voice” for nonspeaking individuals through the use of technology does not necessarily involve the electronic production of speech. For example, the Picture Exchange Communication System (PECS), a popular low-tech AAC system, involves children with complex communication needs exchanging laminated paper cards of picture icons with another person in order to ask and answer questions, make comments, and issue requests.
PECS cards are stored in a binder, usually attached to Velcro strips. Speech therapists sometimes refer to the PECS binder as the child’s “voice,” under the premise that this colloquialism socializes caregivers to the idea that the PECS system supports their child’s communicative agency.4 iPads and AAC apps can digitally store thousands of messages in a much more efficient, compact, and archival manner than a PECS binder.
The voices that AAC technologies produce are also never completely disembodied. Ideas about the normative body, vocal techniques, and communicative repertoires have historically influenced the design and development of voice output communication technologies. Long before smartphones could speak multiple languages and GPS devices could spout driving directions, ancient civilizations attempted to mechanically simulate the human instrument of voice. The Greeks and Romans would rig statues with concealed speaking tubes to make their idols appear to talk.
Talking automatons designed in the eighteenth century drew inspiration from the vox humana pipes of an organ, which imitated a chorus of human voices. The engineers of early speaking machines attempted to copy the vocal organs and model “normal” human physiology.7
In the present day, there are two main kinds of voice output used in electronic AAC devices: digitized and synthetic text-to-speech output. Digitized output is any kind of recorded speech or nonlexical sound (e.g., laughter) that can be prerecorded and played back. Text-to-speech output uses software to translate visual text into audible speech. Contemporary techniques for generating synthetic speech include a process known as concatenative speech synthesis, in which human speech (usually produced by a voice actor, but sometimes by the engineers themselves) is recorded, broken down into units, stored in a database, and recombined into synthesized words. Older systems require more digital signal processing, which causes their synthetic voices to sound less “natural.”
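The recombination step described above can be sketched in a deliberately simplified form. The sketch below is illustrative only: the unit names and sample values are hypothetical, and real concatenative (unit-selection) systems draw on large databases of recorded diphones and use cost functions and signal smoothing to choose and join units.

```python
from typing import Dict, List

# Hypothetical "database" of recorded speech units: each key names a
# unit (here, invented diphone-like labels) and each value stands in
# for that unit's recorded audio samples.
unit_db: Dict[str, List[float]] = {
    "HH-EH": [0.1, 0.3],
    "EH-L":  [0.2, 0.4],
    "L-OW":  [0.5, 0.1],
}

def synthesize(units: List[str]) -> List[float]:
    """Recombine stored recorded units into one waveform by concatenation."""
    waveform: List[float] = []
    for u in units:
        # Look up each recorded unit and append its samples in sequence.
        waveform.extend(unit_db[u])
    return waveform

# Concatenating three stored units yields one continuous sample stream.
samples = synthesize(["HH-EH", "EH-L", "L-OW"])
```

Because the output is stitched together from recordings of a human speaker, the result can sound quite natural at the level of individual sounds; the difficulty, as the next paragraph notes, lies in the joins and in prosody across a whole utterance.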
While engineered voices at present sound less robotic than early systems, they are not quite human either, a slippage shared by another pervasive technology: online bots that emulate and automate human activity and interaction. One giveaway is that synthetic speech tends to lack prosody—the stress, intonation, and rhythm of human speech. Humor poses another problem. Movie critic Roger Ebert, who lost the ability to produce embodied oral speech following complications from thyroid cancer, famously proposed an “Ebert test” for humor.12 Invoking Alan Turing’s Turing test, Ebert challenged engineers to develop a computer-based synthesized voice that would be indistinguishable from a human one in its ability to tell a joke well. Even the most sophisticated synthetic voices on the market are a far cry from the vibrancy of Samantha, the fictional operating system and titular character in Spike Jonze’s 2013 futuristic film Her.