Last week at Google I/O, we introduced Smart Compose, a new feature in Gmail that uses machine learning to interactively offer sentence completion suggestions as you type, allowing you to draft emails faster. Building upon technology developed for Smart Reply, Smart Compose offers a new way to help you compose messages — whether you are responding to an incoming email or drafting a new one from scratch.
In developing Smart Compose, there were a number of key challenges to face, including:
- Latency: Since Smart Compose provides predictions on a per-keystroke basis, it must respond ideally within 100ms for the user not to notice any delays. Balancing model complexity and inference speed was a critical issue.
- Scale: Gmail is used by more than 1.4 billion diverse users. In order to provide auto completions that are useful for all Gmail users, the model has to have enough modeling capacity so that it is able to make tailored suggestions in subtly different contexts.
- Fairness and Privacy: In developing Smart Compose, we needed to address sources of potential bias in the training process, and had to adhere to the same rigorous user privacy standards as Smart Reply, making sure that our models never expose user’s private information. Furthermore, researchers had no access to emails, which meant they had to develop and train a machine learning system to work on a dataset that they themselves cannot read.
Finding the Right Model
Typical language generation models, such as ngram, neural bag-of-words (BoW) and RNN language (RNN-LM) models, learn to predict the next word conditioned on the prefix word sequence. In an email, however, the words a user has typed in the current email composing session is only one “signal” a model can use to predict the next word. In order to incorporate more context about what the user wants to say, our model is also conditioned on the email subject and the previous email body (if the user is replying to an incoming email).
One approach to include this additional context is to cast the problem as a sequence-to-sequence (seq2seq) machine translation task, where the source sequence is the concatenation of the subject and the previous email body (if there is one), and the target sequence is the current email the user is composing. While this approach worked well in terms of prediction quality, it failed to meet our strict latency constraints by orders of magnitude.
To improve on this, we combined a BoW model with an RNN-LM, which is faster than the seq2seq models with only a slight sacrifice to model prediction quality. In this hybrid approach, we encode the subject and previous email by averaging the word embeddings in each field. We then join those averaged embeddings, and feed them to the target sequence RNN-LM at every decoding step, as the model diagram below shows.
|Smart Compose RNN-LM model architecture. Subject and previous email message are encoded by averaging the word embeddings in each field. The averaged embeddings are then fed to the RNN-LM at each decoding step.|
Accelerated Model Training & Serving
Of course, once we decided on this modeling approach we still had to tune various model hyperparameters and train the models over billions of examples, all of which can be very time-intensive. To speed things up, we used a full TPUv2 Pod to perform experiments. In doing so, we’re able to train a model to convergence in less than a day.
Even after training our faster hybrid model, our initial version of Smart Compose running on a standard CPU had an average serving latency of hundreds of milliseconds, which is still unacceptable for a feature that is trying to save users’ time. Fortunately, TPUs can also be used at inference time to greatly speed up the user experience. By offloading the bulk of the computation onto TPUs, we improved the average latency to tens of milliseconds while also greatly increasing the number of requests that can be served by a single machine.
Fairness and Privacy
Fairness in machine learning is very important, as language understanding models can reflect human cognitive biases resulting in unwanted word associations and sentence completions. As Caliskan et al. point out in their recent paper “Semantics derived automatically from language corpora contain human-like biases”, these associations are deeply entangled in natural language data, which presents a considerable challenge to building any language model. We are actively researching ways to continue to reduce potential biases in our training procedures. Also, since Smart Compose is trained on billions of phrases and sentences, similar to the way spam machine learning models are trained, we have done extensive testing to make sure that only common phrases used by multiple users are memorized by our model, using findings from this paper.
We are constantly working on improving the suggestion quality of the language generation model by following state-of-the-art architectures (e.g., Transformer, RNMT+, etc.) and experimenting with most recent and advanced training techniques. We will deploy those more advanced models to production once our strict latency constraints can be met. We are also working on incorporating personal language models, designed to more accurately emulate an individual’s style of writing into our system.
Smart Compose language generation model was developed by Benjamin Lee, Mia Chen, Gagan Bansal, Justin Lu, Jackie Tsay, Kaushik Roy, Tobias Bosch, Yinan Wang, Matthew Dierker, Katherine Evans, Thomas Jablin, Dehao Chen, Vinu Rajashekhar, Akshay Agrawal, Yuan Cao, Shuyuan Zhang, Xiaobing Liu, Noam Shazeer, Andrew Dai, Zhifeng Chen, Rami Al-Rfou, DK Choe, Yunhsuan Sung, Brian Strope, Timothy Sohn, Yonghui Wu, and many others.