When the crew of the Starship Enterprise access the ship's computer system, they simply speak to it. While this may seem somewhat futuristic, the possibility of computers responding to voice commands has many practical applications. Imagine dictating letters or reports directly onto the screen and then having the word processor amend the text to suit your taste and style. Apart from eliminating the potential for repetitive strain injury - an ever-present threat to keyboard users - productivity would improve enormously. The quality of work would also improve because, instead of having to marshal thoughts sequentially, breaking concentration to type and then moving on to the next topic, users could create documents with a natural flow. It sounds like heaven for those allergic to PCs, and it may not be too long before voice recognition for these tasks becomes a widely available technology.

The comp.speech newsgroup offers lots of useful information and, in particular, sets out definitions of different voice - or, more correctly, speech - recognition methods. It distinguishes automatic speech recognition from automatic speech understanding. Automatic speech recognition is the process by which a computer maps an acoustic speech signal to text. Automatic speech understanding is the process by which a computer maps an acoustic speech signal to some form of abstract meaning of the speech.

Commercially available systems fall into three broad categories:

- A speaker dependent system is designed to operate for a single speaker. These systems are usually easier to develop, cheaper to buy and more accurate, but not as flexible as speaker adaptive or speaker independent systems.
- A speaker independent system is developed to operate for any speaker of a particular type (for example, American-English). These systems are the most difficult to develop and the most expensive, and their accuracy is lower than that of speaker dependent systems.
However, they are more flexible and will allow for regional accent or dialect variations.
- A speaker adaptive system is developed to bridge the gap between speaker dependent and speaker independent systems, allowing adaptation for new speakers. Its difficulty lies somewhere between speaker independent and speaker dependent systems.

The size of vocabulary affects the complexity, processing requirements and accuracy of the system. Some applications only require a few words (for example, numbers only); others require very large dictionaries (dictation machines). There are no established definitions, but many working in the field classify systems by reference to the size of vocabulary they handle.

An isolated-word system operates on single words at a time, requiring a pause between each word. This is the simplest form of recognition to perform because the end points are easier to find and the pronunciation of a word tends not to affect others. Thus, because the occurrences of words are more consistent, they are easier to recognise.

A continuous speech system operates on speech in which words are not separated by pauses. Continuous speech is more difficult to handle because of a variety of effects. First, it is difficult to find the start and end points of words. Another problem is "coarticulation": the production of each phoneme is affected by the production of surrounding phonemes, and similarly the start and end of words are affected by the preceding and following words. The recognition of continuous speech is also affected by the rate of speech. Fast speech tends to be harder to decode.

By now it should be clear that voice recognition is a complex business, and developers adopt different strategies to overcome the problems. Typically, speech recognition starts with the digital sampling of speech. The next stage is acoustic signal processing.
Most techniques include spectral analysis such as LPC (Linear Predictive Coding) analysis, MFCC (Mel Frequency Cepstral Coefficients) and cochlea modelling. The next stage is recognition of phonemes, groups of phonemes and words. This can be achieved by processes such as DTW (Dynamic Time Warping), HMM (Hidden Markov Modelling), NNs (Neural Networks), expert systems and combinations of techniques. HMM-based systems are currently flavour of the month, being the most successful approach. Most systems utilise some knowledge of the language to aid the recognition process, whilst others try to "understand" speech. That is, they try to convert words into a representation of what the speaker intended to mean or achieve by what they said.

Despite the difficulties, there are speech recognition systems of one kind or another for every major platform. IBM has been working with voice recognition systems for some years. Indeed, users of the now defunct OS/2 will be aware that it included voice recognition software. IBM still sells its voice recognition software under the ViaVoice and Simply Speaking banners, but today those products are in an altogether different class. Simply Speaking provides you with the ability to speak into Word - once you have "taught" the system what you sound like and been through the "training" session for those words it just can't understand. You can get add-on lexicons that cover certain sectors like the medical profession. These are extremely useful because, apart from recognising terms specific to a profession, they save you the aggravation of trying to spell words like isoniazid, rifampin and pyrazinamide - they're hard enough to say, let alone spell.

The medical profession has been a widely used test bed for voice recognition systems of every kind. Doctors are required to report on every aspect of patient care, diagnosis and treatment.
With many highly specialised words, it is easy for mistakes to be made, and a reliable speech dictation system therefore has real application in reducing the amount of work needed to prepare accurate and complete reports. Vendors like Voice Activated Systems Technology (VAST) have gone one stage further and designed modules for specific medical disciplines like psychiatry, urology, ophthalmology, podiatry and endocrinology. These plug into Dragon NaturallySpeaking, another voice recognition speech/write system.

IBM's ViaVoice provides continuous speech/write facilities at rates of up to 140 words per minute. This is not very fast in real world speaking terms, where the average person speaks at a rate of anywhere between 250 and 300 words a minute. But it is quite adequate for most purposes and is the "standard" to which most of the well known systems operate. The error rate is much higher than with the more stilted Simply Speaking, and the product takes a good half day to "train", but it is more natural to use. For those with specialist needs like the visually impaired, products like this can be truly liberating. One problem IBM has yet to resolve is the use of these products where Word is acting as an Email component. At the moment, ViaVoice can cause unexpected system failures in that configuration, which is a pity because Email communication is rapidly becoming the nearest thing to spoken messaging.

Lernout and Hauspie recently launched Voice Commands, a Word-specific control product that allows you to manipulate, amend and re-format Word documents through voice recognition. The idea is that rather than being limited to speech/write input, you can take voice recognition to its next logical stage and perform command actions. Again, this is a product that has to be trained, but once working it helps remove a lot of mousing around the complex Word menu system. In use, it is reasonably accurate but subject to problems of understanding different ways of speaking and dialects.
This can be largely overcome by putting time into "training" the product to understand natural language passages like "Make this sentence bold" or "Bold this sentence" or "Make the next sentence bold."

Not to be left out, Microsoft has a speech recognition development team that has re-engineered Sphinx-II into Whisper, a command and dictation system that works with any Windows application and can run on low-power PCs. Development continues and it may be some time before we see a product. In the meantime, Microsoft has made its Speech API (SAPI) SDK 3.0 available for developers wishing to add speech recognition to applications.

Apart from the potential efficiencies, voice recognition has important social implications. Those who are physically disabled can be at a disadvantage in a computerised world. This type of product may prove invaluable in helping them to work where it might otherwise be impossible. For those who cannot leave their house but are able to speak, voice recognition products open a new range of employment possibilities, especially in the remote working market.

But voice recognition is more than speech dictation and command systems. Some vendors are using this technology for advanced communications systems. One example is Enterprise Wildfire, an all-digital office telephony solution that connects any standard phone directly to the corporate LAN, using a low-cost board. From there, the call goes over Ethernet to a server running Wildfire server software, and from the server through a low-cost ISDN PRI card to a PBX or directly to the telephone network. The Wildfire client software handles speech recognition, so the user can summon Wildfire over their speaker phone and place calls or review messages without lifting the handset. With WildTone replacing the dialling tone, using the office telephone becomes faster and more efficient. However, the pace of change is slow and some voice recognition enabled CTI systems are unreliable.
I recently tried to use a system that gave me eight alternative names to the one I spoke, none of which was the person I wanted!

One area where voice recognition has much potential, according to David Bradshaw, a senior analyst at Ovum, is in Interactive Voice Response (IVR). IVR is used on telephones in a similar way to touchtones, letting people access information such as travel, traffic or weather reports, order cinema tickets or book a flight. Rather than use the telephone keypad to select menu options, as is done with touchtones, IVR lets users speak a command into the telephone receiver. "The ultimate goal of IVR is natural interaction between you and the computer," said Bradshaw. He explained that with such a system someone could dial up an airline booking service and say "I want a flight from London to Washington" in order to purchase a flight ticket. Bradshaw said that performing such an operation using touchtone commands would take ages and the user would have to go through several levels of menus: "Natural interaction is equivalent to going from text-based systems to GUIs."

IVR is useful to anyone dealing with customer interaction and also has application areas like voice dialling. This is where a user picks up the phone and says the name of the person they wish to call. Another use is in mobile phones, where manufacturers are considering using voice commands to make their handsets totally hands free.

One area where voice technology could make a breakthrough, according to Bradshaw, is in unified messaging, where voice, Email and fax messages are stored in one system. "Currently, Voicemail is stored as audio," he said. "Imagine the benefits of translating this into a text message. Users could even reply to their Email messages by voice."

Voice recognition holds a great deal of promise. It is not as advanced as one might expect and today's products undoubtedly call for a great deal of patience.
Even so, we can expect to see advances that will change the way we use computers in the future.

VOICE RECOGNITION: GLOSSARY

Like other specialist areas of computing, voice and speech recognition have their own terminology. Below is a glossary of terms provided by Lernout and Hauspie:

Acoustic model - a computer model created from recorded voice and used for recognising spoken language based on the sounds or phonemes that make up the language.

Algorithm - a pre-determined set of instructions and calculations for solving a specific problem in a limited number of ways.

Bi-gram - two words in a series, such as "in the" or "to the".

Tri-gram - three words in a series, such as "down the street" or "in two minutes".

Constrained vocabulary - a limited number of words available to the recognition engine, such as the names of the 50 states, members of a group, destinations, and so on. A good example is the contents of specific computer program menus.

Discrete speech - natural spoken language, spoken with distinct pauses between words. Typical discrete dictation speed is somewhere between 150 and 200 words per minute.

Language model - a calculation of probabilities that one particular word will follow another. The language model describes how phrases and sentences are built from individual words. The model typically includes statistics on how frequently individual words are used and how often groups of two words (bi-grams) or three words (tri-grams) are used together.

Natural language - a language approximating how a person normally speaks to another person, as opposed to a synthetic language created for a task. For example, "edit cut" and "file open" are synthetic, while "cut the selection" and "open the file" are natural.

Phonemes - the basic building blocks of a language. In English, there are roughly 44 phonemes - the sounds of the alphabet plus pairs such as ch, ph and sh. Every word in English therefore can be represented by a sequence of these phonemes.
For each language, the greater the contrast between different phonemes or sounds within the language, the easier it is for the computer to recognise the spoken words.

Continuous speech - natural human speech, spoken at a normal pace with no pauses between words.
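
To make the "Language model" entry above more concrete, here is a minimal sketch in Python of a bi-gram model built from raw word counts. It is an illustration only, not how any product mentioned in this article works: the function names, the toy training phrases and the choice of unsmoothed counts are all assumptions made for the example. It shows how the probability that one word follows another can help a recogniser choose between two similar-sounding candidates.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Count how often each word is directly followed by each other word."""
    follower_counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        # Pair every word with the word that comes immediately after it.
        for first, second in zip(words, words[1:]):
            follower_counts[first][second] += 1
    return follower_counts


def bigram_probability(model, first, second):
    """Estimate P(second | first) from raw counts (no smoothing)."""
    total = sum(model[first].values())
    if total == 0:
        return 0.0  # 'first' was never seen with a following word
    return model[first][second] / total


def pick_candidate(model, previous_word, candidates):
    """Pick the candidate word most likely to follow previous_word."""
    return max(candidates,
               key=lambda word: bigram_probability(model, previous_word, word))


# Toy training text, invented purely for illustration (hypothetical data).
TRAINING = [
    "make this sentence bold",
    "make the next sentence bold",
    "open the file",
    "cut the selection",
    "walk down the street",
]

model = train_bigram_model(TRAINING)
print(bigram_probability(model, "the", "file"))        # prints 0.25
print(pick_candidate(model, "the", ["fall", "file"]))  # prints file
```

In this toy data "the" is followed by four different words once each, so "file" gets a probability of one in four while "fall", never seen after "the", gets zero. A tri-gram model would apply the same counting to groups of three words, and a practical system would need far more training text and some form of smoothing for word pairs it has never seen.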