Google has built new technology to power its voice search, which the company says makes it faster and more accurate. The new acoustic models use Connectionist Temporal Classification (CTC) and sequence discriminative training techniques. In 2012, Google switched from Gaussian Mixture Models (GMMs) to Deep Neural Networks (DNNs), which allowed the company to better assess which sound a user was producing at any given moment and increased speech recognition accuracy.
Google's improved acoustic models rely on Recurrent Neural Networks (RNNs). RNNs have feedback loops in their topology, allowing them to model temporal dependencies: when the user speaks the /u/ in "museum", their articulatory apparatus is coming from a /j/ sound, and from an /m/ sound before that. Try saying "museum" out loud: it flows naturally in one breath, and RNNs can capture that. The type of RNN used here is a Long Short-Term Memory (LSTM) RNN which, through memory cells and a sophisticated gating mechanism, memorizes information better than other RNNs. According to Google, adopting such models had already improved the quality of its recognizer significantly.
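To make the memory cell and gating idea concrete, here is a minimal sketch of a single LSTM unit in plain Python. It is a textbook-style illustration, not Google's implementation: the function names (`lstm_step`, `run_lstm`), the scalar state, and the hand-picked weights in `PARAMS` are all assumptions made for readability, and production acoustic models are far larger and trained on speech data.

```python
import math


def sigmoid(x):
    """Squash a value into the range (0, 1), used here for the gates."""
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One time step of a single-unit LSTM cell with scalar input and state."""

    def gate(name, squash):
        # Every gate sees the current input and the previous output.
        # Feeding h_prev back in is the feedback loop of the RNN.
        w_x, w_h, b = params[name]
        return squash(w_x * x + w_h * h_prev + b)

    i = gate("input", sigmoid)         # how much new information to write
    f = gate("forget", sigmoid)        # how much of the old memory to keep
    o = gate("output", sigmoid)        # how much of the memory to expose
    g = gate("candidate", math.tanh)   # proposed new memory content

    c = f * c_prev + i * g             # updated memory cell
    h = o * math.tanh(c)               # output, fed back at the next step
    return h, c


def run_lstm(inputs, params):
    """Run the cell over a sequence, starting from an empty memory."""
    h, c = 0.0, 0.0
    outputs = []
    for x in inputs:
        h, c = lstm_step(x, h, c, params)
        outputs.append(h)
    return outputs


# Hypothetical hand-picked (w_x, w_h, bias) triples, not trained values.
PARAMS = {
    "input": (1.0, 0.5, 0.0),
    "forget": (0.0, 0.0, 2.0),
    "output": (0.0, 0.0, 2.0),
    "candidate": (1.0, 0.5, 0.0),
}

if __name__ == "__main__":
    # A single pulse followed by silence: the later outputs stay non-zero
    # because the memory cell still carries the earlier input.
    print(run_lstm([1.0, 0.0, 0.0], PARAMS))
```

Running the cell on a pulse followed by zeros shows the output depending on what came earlier in the sequence, which is the property that lets an acoustic model treat the /u/ in "museum" differently because of the /m/ and /j/ before it.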
The new models now power voice searches in the Google app on both iOS and Android, as well as dictation on Android devices.
Source: Google Research Blog