The "Hey Siri" feature allows users to invoke Siri hands-free. A very small speech recognizer runs all the time and listens for just those two words. When it detects "Hey Siri", the rest of Siri parses the following speech as a command or query. The "Hey Siri" detector uses a Deep Neural Network (DNN) to convert the acoustic pattern of your voice at each instant into a probability distribution over speech sounds. It then uses a temporal integration process to compute a confidence score that the phrase you uttered was "Hey Siri". If the score is high enough, Siri wakes up. This article takes a look at the underlying technology. It is aimed primarily at readers who know something of machine learning but less about speech recognition.

No need to press a button as "Hey Siri" makes Siri hands-free. It seems simple, but quite a lot goes on behind the scenes to wake up Siri quickly and efficiently. Hardware, software, and Internet services work seamlessly together to provide a great experience.

Figure 1: The Hey Siri flow on iPhone.

Being able to use Siri without pressing buttons is particularly useful when hands are busy, such as when cooking or driving, or when using the Apple Watch. As Figure 1 shows, the whole system has several parts. Most of the implementation of Siri is "in the Cloud", including the main automatic speech recognition, the natural language interpretation and the various information services. There are also servers that can provide updates to the acoustic models used by the detector. This article concentrates on the part that runs on your local device, such as an iPhone or Apple Watch. In particular, it focuses on the detector: a specialized speech recognizer which is always listening just for its wake-up phrase (on a recent iPhone with the "Hey Siri" feature enabled).

The microphone in an iPhone or Apple Watch turns your voice into a stream of instantaneous waveform samples, at a rate of 16000 per second. A spectrum analysis stage converts the waveform sample stream to a sequence of frames, each describing the sound spectrum of approximately 0.01 sec. About twenty of these frames at a time (0.2 sec of audio) are fed to the acoustic model, a Deep Neural Network (DNN) which converts each of these acoustic patterns into a probability distribution over a set of speech sound classes: those used in the "Hey Siri" phrase, plus silence and other speech, for a total of about 20 sound classes.

The DNN consists mostly of matrix multiplications and logistic nonlinearities. Each "hidden" layer is an intermediate representation discovered by the DNN during its training to convert the filter bank inputs to sound classes. The final nonlinearity is essentially a Softmax function (a.k.a. a general logistic or normalized exponential), but since we want log probabilities the actual math is somewhat simpler.

Figure 2: The Deep Neural Network used to detect "Hey Siri." The hidden layers are actually fully connected. The top layer performs temporal integration. The actual DNN is indicated by the dashed box.
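To make the acoustic-model step more concrete, here is a minimal Python sketch of the kind of computation this article describes: a window of about twenty frames is passed through fully connected layers with logistic nonlinearities, and the output is a set of log probabilities over about 20 sound classes. The sample rate, frame duration, window length and class count come from the article. Everything else is an assumption made purely for illustration: the number of filter bank channels (`NUM_BANDS`), the hidden layer size and count, the random weights, and all function names. It is not the detector's actual implementation, and the temporal integration stage is not shown.

```python
import math
import random

SAMPLE_RATE = 16000      # waveform samples per second (from the article)
FRAME_SEC = 0.01         # approximate audio duration described by one frame
FRAMES_PER_WINDOW = 20   # "about twenty" frames, i.e. 0.2 sec of audio
NUM_CLASSES = 20         # "about 20" sound classes
NUM_BANDS = 40           # ASSUMPTION: filter bank channels per frame
HIDDEN_SIZE = 32         # ASSUMPTION: units per hidden layer
NUM_HIDDEN = 3           # ASSUMPTION: number of hidden layers


def logistic(x):
    """Logistic (sigmoid) nonlinearity, written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def affine(weights, bias, x):
    """Fully connected layer: matrix-vector product plus bias."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def log_softmax(z):
    """Log of the softmax: z_i - log(sum_j exp(z_j)).

    Taking the log removes the division in the softmax, leaving a
    subtraction of a single log-sum-exp term.
    """
    m = max(z)
    log_sum = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - log_sum for v in z]


def random_layer(n_out, n_in, rng):
    """Hypothetical untrained layer with small random weights."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def make_model(rng):
    """Build the layer stack: input -> hidden layers -> sound classes."""
    sizes = ([FRAMES_PER_WINDOW * NUM_BANDS]
             + [HIDDEN_SIZE] * NUM_HIDDEN
             + [NUM_CLASSES])
    return [random_layer(n_out, n_in, rng)
            for n_in, n_out in zip(sizes, sizes[1:])]


def forward(model, window):
    """Map one window of frames to log probabilities over sound classes."""
    x = [v for frame in window for v in frame]  # stack the frames into one vector
    for weights, bias in model[:-1]:
        x = [logistic(v) for v in affine(weights, bias, x)]
    weights, bias = model[-1]
    return log_softmax(affine(weights, bias, x))


rng = random.Random(0)
model = make_model(rng)
# Stand-in for real filter bank features: random values in [0, 1).
window = [[rng.random() for _ in range(NUM_BANDS)]
          for _ in range(FRAMES_PER_WINDOW)]
log_probs = forward(model, window)
samples_per_frame = round(SAMPLE_RATE * FRAME_SEC)
```

With these numbers, each frame corresponds to roughly 160 waveform samples, and exponentiating the entries of `log_probs` gives values that sum to 1, as expected of a probability distribution over the sound classes.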