Interview With Google's Conversation Design Lead Nandini Stocker
By Sophie Curtis, Freelance Writer - RE•WORK
February 13, 2017
Science and gender equality are both vital for the achievement of international development goals. In order to achieve equal access to and participation in science for women and girls, and attain global gender equality and empowerment of women and girls, the UN has created an International Day of Women & Girls in Science.
The 2nd annual International Day of Women and Girls in Science was observed by universities, governing bodies, media and businesses worldwide on 11 February 2017. We celebrated the day by offering women in science free tickets to a 2017 RE•WORK event of their choice (applications close 17 Feb, more info here).
Today we're featuring one of Google's leading women in conversational technologies, to explore how she came to work in this field, recent advancements in human-computer interaction, and predictions for the future of speech recognition. Read on for more.
Nandini Stocker, Conversation Design Lead at Google, has created voice experiences for over 50 of Google's business units, in more than 60 languages and 120 countries. Nandini is passionate about enabling people to find help and making voice interfaces more accessible, and believes the secret to realizing our potential for speech technology in human-to-computer interaction is honouring the core, evolved rules of human-to-human communication.
Can you introduce yourself and what you do?
I’m a voice interaction designer for the Google Assistant and Actions on Google. My area of concentration is establishing a Conversation Design framework and set of standards to enable both our internal teams and the developer community to quickly build conversational experiences that are intuitive and make technology and information more accessible to everyone.
How did you get into voice interaction design?
I’d be selling myself short if I said that I stumbled into this field or that it was a happy accident involving the right opportunities or circumstances, even if it may have felt like that at the time. When I truly reflect on the course of my career, it was actually a series of pivotal moments where I made meaningful decisions to steer my direction toward what both interested me and leveraged my innate talents and passions.
I didn’t have much exposure to media or technology as a child. Having homeschooled myself for many years, I was in fact quite terrified of talking to others my own age. But in my determination to pursue education as an escape from my situation, I read a *lot* of books and in the process, learned to write and how to communicate my ideas. Speech and debate club in high school, as well as student government and newspaper in college, played a big role in giving me the foundation I have today. I got my degree in Communication Arts and Political Science and started my career as a technical writer and copy editor.
About 17 years ago, I spent 6 months writing a technical product guide that was supposed to translate every facet of a new suite of speech solutions for the layperson. It was intended as a guide to put salespeople and engineers on common ground so they could start selling speech-enabled IVR (interactive voice response) phone applications. Since I was a technical writer, and not a designer at the time, I knew how to translate technical jargon into user manuals and such, but otherwise didn’t know much about the technology itself going in. Still, I got three things from that experience:
First, I developed a repetitive muscle strain in my wrist from all the mouse work - so debilitating that I could no longer use said mouse (or even drink a glass of water given how weak my hand was). So ironically, my boss got me a voice command program that was overlaid on my Windows desktop computer and let me control everything hands-free. The software I used was flawed and tedious, but it taught me very early on that this technology had life-altering potential.
I also found my true career path. I fell in love with dialog design and felt like I finally found my place in the professional world - doing something that was equally technical and creative. You see, half my extended family are pure artists - sculptors, painters, graphic artists, and photographers - and the other half are highly technical - surveyors, architects, computer scientists, and shipping engineers. So perhaps at least some of it was genetic.
And the last thing I got from that experience was a sense of wonder that I was contributing toward a fantastical future that still only existed in our imaginations. In fact, I remember ending the introduction to that guide with something like “…and besides, this means we’re that much closer to Star Trek.”
From there I’ve worked for a series of companies focused on design consulting for voice interactions in the customer support space, including at Google, until I joined the Assistant team.
What are the key factors that have enabled recent advancements in voice interaction?
Until just a few years ago, the most widely used real-life speech solutions had been in a business context - used to cut costs, create efficiencies, and automate calls that would otherwise go to humans in a contact center. It was only once speech technologies and interfaces started being integrated into consumer-facing products, namely mobile and smartphones, that enough data could be gathered to make significant progress in accuracy. Unlike only a few years earlier, users no longer had to train devices to their own voice, and these systems had far larger vocabularies than previous ones with limited vocabulary domains. Suddenly voice interfaces were not just something one would encounter when calling a customer service number, but were becoming commonplace as an interface into what you wanted to get done on your smartphone, including opening apps, transcribing text messages, or searching the internet.
All that said, the proliferation of the smartphone with touch screens sans any real physical keys or buttons also only served to highlight the need for voice recognition as a means to control and enter text. So I see both advancements as having each facilitated the other.
Which industries do you think speech technology will benefit the most and why?
The types of interfaces users encounter vary widely in modality, and voice is becoming a valuable part of that equation, whether it is used alone or together with another mode of input or output. In fact, the concept of an “interface” has rapidly been replaced by “experience”.
There is a principle of “universal design” that emerged in the 1990s from observations around developing technologies with accessibility in mind. These universal design principles apply to design of voice user interfaces (VUIs) as well. When systems are designed and developed thoughtfully for people with disabilities, it turns out that they benefit many people with no known disabilities. Captioning audio material is one example. While the initial impetus of captioning was to provide access for deaf and hard of hearing people, it is helpful in many other situations as well, such as normal-hearing people watching video in noisy settings, or non-native speakers that can read captions more easily than they understand spoken language, or older viewers with mild hearing loss.
So, while it’s probably unsurprising that creating systems with voice input and output can be an enabling option for users that are visually or motor impaired, we need to expand how we have traditionally approached designing for the “multi-modal” experience to actually apply to all users, regardless of disability, depending on the experiential context for the user. For example, if you’ve ever tried to use a phone or tablet while cooking or holding a baby, you know the value of a hands-free device. There are four categories of interfaces where the use of voice will have varying levels of impact:
Eyes-free - This type of experience is what it sounds like. The user should never have to interact with a screen or use any mode of input other than speech. Examples are voice-enabled bots, apps, and assistive devices, all provided they can be invoked exclusively by voice.
Hands-busy - This is a more traditional multi-modal interface that usually combines a mix of input modes, such as both speech and tapping screen elements. This may involve tapping or touching a device as an occasional action used to trigger or respond with input. An example might be a vehicle navigation system or wearable device that requires only occasional sensory input, but is not dependent on it as the single means of input. Such interfaces are most successful when they accept multiple modes at the same time, e.g. the user can type or talk, their choice.
Tap and type - This might involve multiple modes of input such as both tapping and typing, but otherwise is almost entirely visual - or at least requires visual capabilities to use. However, this takes into account only the input mode.
Hybrid - There may be several ways the above contexts can also mix into a hybrid model to cover even more accessibility needs, such as a tap-and-type model that may also have audio feedback to enable someone who is speech-impaired but without hearing or motor disabilities. Before speech recognition devices were ubiquitous, this type of interface might be the only available solution for the blind, whereby they could input by typing or Braille input, then get audio output.
Then, considering beyond the physical realm of possible contexts, a conversational interface that doesn’t require reading or typing also removes barriers for those who may have literacy or cognitive challenges, thereby making technology and information available to more segments of the population overall.
What challenges has the tech industry faced with regards to speech recognition and how are these being overcome?
Well, for the better part of a century, science fiction has portrayed voice and speech recognition technology as synonymous with artificial intelligence. And many predicted that by now we would be far more advanced than we actually are. And in fact we’re finally well on our way to reclaim that which sci-fi promised us. However, while we know people have tried it and they’re using it, if you peel away how they are interacting, and look instead at the context of the motivation for using it in the first place, that makes a huge difference in how they feel about it.
Availability on mobile devices and the data-gathering potential that ubiquitousness offered tell only part of the story of adoption. It’s important to note that with an interface that’s thrust upon you, such as when you call a business - usually with a problem to solve - you don’t build a personal relationship with the entity you’re interacting with. Instead, and especially if it doesn’t work well, you might feel angry, confused, or even stupid after these fleeting (and involuntary) interactions. For most users, those feelings are sometimes directed toward the business they tried to contact, but whether they consciously realize it or not, their impressions are also formed about the technology itself.
And as people interact with individual devices more and more, it’s something that’s far more intimate - and voluntary. With voluntary interactions, the relationship is with the device primarily, perhaps the manufacturer or operating system second, but also ultimately with the technology.
So the negative, visceral reactions sometimes encountered when a system doesn’t perform against people’s unconscious expectations are probably the biggest challenge the industry has faced and will continue to face. When someone encounters a poorly designed web site or mobile app, they may even be able to articulate why - that it was hard to find what they were looking for, or that the content was overwhelming, for example. But with a voice interface, the reaction can feel primal. You don’t really know why it annoys you, it just does as soon as the underlying “wires” show and the machine-ness is exposed. Whether they realize it or not, people are comparing a voice they hear from a machine to their preconceived understanding of the conventions of human language and communication. They’re effectively comparing it to human beings. Designing and building to that standard is a tall order.
At Google we’re trying to be more deliberate in our approach to designing for voice interfaces. We recognize that spoken language is complex, something that’s evolved for over a hundred thousand years, so realizing the potential for this technology means honoring the rules and conventions that make it what it is. For example, we’re working to understand what it means to unpack what a conversation looks like when broken down into its parts as we figure out how to teach computers to talk to humans, and not expect it to be the other way around.
What advancements in speech technology would you hope to see in the next 3 years?
In many ways, we’ve gotten the easy part out of the way. We can capture *what* people say and are making huge strides in natural language understanding to interpret what they *mean* by what’s said. But meaning within a conversation is systematically dependent on context.
We humans have gotten pretty darn good at negotiating meaning from context, without really being aware of it. And the thing to remember is that so much of what we humans use to arrive at that understanding is not through what’s literally spoken in words, but rather, what shared knowledge we have with each other within a given context, be it social, cultural, or even environmental - most of which never gets uttered out loud when we communicate. It simply gets applied unconsciously to our spoken words to help us interpret each other’s meaning along the way.
So conversations are still hard to manufacture. As we move to focus on a computing age more driven toward artificial intelligence, I hope to see advances in machine learning that derive from what we know to be true of human-to-human communication so we can start to design conversations that don’t have to be modeled intentionally by human designers like myself, but are more contextually relevant and meaningful in real time, that adapt to human behavior predictively based on learned patterns.
How do you feel about being a woman in tech? Have you faced any challenges? What advice would you give young girls interested in tech?
I’ve certainly felt the statistical reality of sometimes being the only woman in the room. I’ve experienced my share of having to rebuild my tech cred with every new group of men I work with. But having strong verbal skills and the ability to speak with authority on a topic goes a long way in any situation. Then there’s the simple fact that language breaks down barriers. That’s true between individuals, but even at a societal level. So that’s no less the case in technology. People who understand language are the new technologists in this new world of conversational interfaces. Even if you don’t know any type of computer language, if you can be an expert on human language, you can influence the direction of future technology.
We know from brain science that girls are neurologically wired with a language head start, talking earlier and more than boys, but we’re also finding interesting ways to connect the dots to technology by getting girls to apply those skills to building things earlier. A great example is GoldieBlox, founded by Debbie Sterling, which leverages girls’ love of stories and characters, and satisfies their curiosity about why they’d build something, not just how.
So conversation design and language processing technology of any kind, but especially voice, is a great field for girls to explore. My own career path is living proof of that.
On 21 February in London, we're holding an evening of discussions & networking to celebrate women advancing the field of machine intelligence. The event has now sold out but if you would like to join the waiting list for tickets to the Women in Machine Intelligence Dinner, please visit the event site here. Our following dinner will be in celebration of Women in Affective Computing this September, see more details here. Previous speakers at these dinners have joined us from Google DeepMind, UCL, the Alan Turing Institute and IBM Watson. View all upcoming events here.