Creating Voice Skills For Google Assistant And Amazon Alexa


Voice assistants are hopping out of emerging tech and into everyday life. As a front end developer, you already have the skills to build one, so let’s dive into the platforms.

Over the past decade, there has been a seismic shift towards conversational interfaces, as people reach ‘peak screen’ and even begin to scale back their device usage, with digital wellbeing features now baked into most operating systems.

To combat screen fatigue, voice assistants have entered the market to become a preferred option for quickly retrieving information. A well-repeated stat claims that 50% of searches will be done by voice by 2020. As adoption rises, it’s up to developers to add “Conversational Interfaces” and “Voice Assistants” to their tool belt.

Designing The Invisible

For many, embarking on a voice UI (VUI) project can be a bit like entering the Unknown. Find out more about the lessons learned by William Merrill when designing for voice. Read article →

What Is A Conversational Interface?

A Conversational Interface (sometimes shortened to CUI) is any interface in a human language. It is tipped to be a more natural interface for the general public than the Graphical User Interface (GUI), which front-end developers are accustomed to building. A GUI requires humans to learn its specific syntax (think buttons, sliders, and drop-downs).

This key difference in using human language makes CUI more natural for people; it requires little knowledge and puts the burden of understanding on the device.

Commonly, CUIs come in two guises: chatbots and voice assistants. Both have seen a massive rise in uptake over the last decade thanks to advances in Natural Language Processing (NLP).

Understanding Voice Jargon

  • Skill/Action: A voice application which can fulfill a series of intents.
  • Intent: The intended action for the skill to fulfill; what the user wants the skill to do in response to what they say.
  • Utterance: The sentence a user says, or utters.
  • Wake Word: The word or phrase used to start a voice assistant listening, e.g. ‘Hey Google’, ‘Alexa’ or ‘Hey Siri’.
  • Context: The pieces of contextual information within an utterance that help the skill fulfill an intent, e.g. ‘today’, ‘now’, ‘when I get home’.

What Is A Voice Assistant?

A voice assistant is a piece of software capable of NLP (Natural Language Processing). It receives a voice command and returns an answer in audio format. In recent years the scope of how you can engage with an assistant is expanding and evolving, but the crux of the technology is natural language in, lots of computation, natural language out.

For those looking for a bit more detail:

  1. The software receives an audio request from a user and processes the sound into phonemes, the building blocks of language.
  2. By the magic of AI (specifically Speech-To-Text), these phonemes are converted into a string of the approximated request. This is kept within a JSON payload, which also contains extra information about the user, the request, and the session (see the sketch below).
  3. The JSON is then processed (usually in the cloud) to work out the context and intent of the request.
  4. Based on the intent, a response is returned, again within a larger JSON response, either as a string or as SSML (more on that later).
  5. The response is processed back using AI (naturally the reverse: Text-To-Speech), which is then returned to the user.
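As a rough illustration of steps 2 and 3, the JSON that reaches your fulfillment code looks something like the sketch below (this one is shaped like an abridged Alexa request envelope; the IDs are placeholders and the exact structure varies by platform):

{
    "version": "1.0",
    "session": {
        "sessionId": "amzn1.echo-api.session.example-session-id",
        "user": { "userId": "amzn1.ask.account.example-user-id" }
    },
    "request": {
        "type": "IntentRequest",
        "intent": {
            "name": "HelloWorldIntent",
            "slots": {}
        }
    }
}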

There’s a lot going on there, most of which doesn’t require a second thought. But each platform does this differently, and it’s the nuances of the platform that require a bit more understanding.


Voice-Enabled Devices

The requirements for a device to have a voice assistant baked in are pretty low: a microphone, an internet connection, and a speaker. Smart speakers like the Nest Mini and Echo Dot provide this kind of low-fi voice control.

Next up in the ranks is voice with a screen. This is known as a ‘multimodal’ device (more on these later); think the Nest Hub and the Echo Show. As smartphones have this functionality, they can also be considered a type of multimodal voice-enabled device.

Voice Skills

First off, every platform has a different name for its ‘voice skills’. Amazon goes with ‘Skills’, which I will be sticking with as a universally understood term; Google opts for ‘Actions’, and Samsung goes for ‘Capsules’.

Each platform has its own baked-in skills, like asking for the time, the weather, and sports scores. Developer-made (third-party) skills can be invoked with a specific phrase or, if the platform likes it, can be implicitly invoked, without a key phrase.

Explicit Invocation: “Hey Google, talk to <skill name>.”

It is explicitly stated which skill is being asked for.

Implicit Invocation: ”Hey Google, what is the weather like today?”

It is implied by the context of the request what service the user wants.

What Voice Assistants Are There?

In the western market, voice assistants are very much a three-horse race. Apple, Google and Amazon have very different approaches to their assistants, and as such, appeal to different types of developers and customers.

Apple’s Siri

Device Names: “HomePod, iPhone, iPad”

Wake Phrase: “Hey Siri”

Siri has over 375 million active users, but, for the sake of brevity, I am not going into too much detail on Siri. While it may be globally well adopted and baked into most Apple devices, it requires developers to already have an app on one of Apple’s platforms, and it is written in Swift (whereas the others can be written in everyone’s favorite: JavaScript). Unless you are an app developer who wants to expand their app’s offering, you can currently skip past Apple until they open up their platform.

Google Assistant

Device Names: “Google Home, Nest”

Wake Phrase: “Hey Google”

Google has the most devices of the big three, with over 1 billion worldwide. This is mostly due to the mass of Android devices that have Google Assistant baked in; with regards to their dedicated smart speakers, the numbers are a little smaller. Google’s overall mission with its assistant is to delight users, and they have always been very good at providing light and intuitive interfaces.

Their primary aim for the platform is usage time, with the idea of becoming a regular part of customers’ daily routine. As such, they primarily focus on utility, family fun, and delightful experiences.

Skills built for Google are best when they are engagement pieces and games, focusing primarily on family-friendly fun. Their recent addition of Interactive Canvas for games is a testament to this approach. The Google platform is much stricter about submissions of skills, and as such, their directory is a lot smaller.

Amazon Alexa

Device Names: “Amazon Fire, Amazon Echo”

Wake Phrase: “Alexa”

Amazon surpassed 100 million devices in 2019. This predominantly comes from sales of their smart speakers and smart displays, as well as their Fire range of tablets and streaming devices.

Skills built for Amazon tend to be aimed at in-skill purchasing. If you are looking for a platform to expand your e-commerce or service offering, or to sell a subscription, then Amazon is for you. That being said, ISP isn’t a requirement for Alexa Skills; they support all sorts of uses and are much more open to submissions.

The Others

There are even more Voice assistants out there, such as Samsung’s Bixby, Microsoft’s Cortana, and the popular open-source voice assistant Mycroft. All three have a reasonable following, but are still in the minority compared to the three Goliaths of Amazon, Google and Apple.

Building On Amazon Alexa

Amazon’s ecosystem for voice has evolved to allow developers to build all of their skill within the Alexa console, so, as a simple example, I am going to use its built-in features.


Alexa deals with the Natural Language Processing and then finds an appropriate Intent, which is passed to our Lambda function to deal with the logic. This returns some conversational bits (SSML, text, cards, and so on) to Alexa, which converts those bits to audio and visuals to show on the device.

Working on Amazon is relatively simple, as they allow you to create all parts of your skill within the Alexa Developer Console. The flexibility is there to use AWS or an HTTPS endpoint, but for simple skills, running everything within the Dev console should be sufficient.

Let’s Build A Simple Alexa Skill

Head over to the Amazon Alexa console, create an account if you don’t have one, and log in.

Click Create Skill, then give it a name.

Choose Custom as your model, and choose Alexa-Hosted (Node.js) for your backend resource.

Once it is done provisioning, you will have a basic Alexa skill. It will have your intent built for you, and some back-end code to get you started.

If you click on the HelloWorldIntent in your Intents, you will see some sample utterances already set up for you; let’s add a new one at the top. Our skill is called hello world, so add Hello World as a sample utterance. The idea is to capture anything the user might say to trigger this intent. This could be “Hi World”, “Howdy World”, and so on.
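Behind the scenes, the console stores the invocation name, intents, and sample utterances in the skill’s interaction model. If you open the JSON Editor in the console, the same intent looks roughly like the trimmed sketch below (your generated model will contain more intents, such as the built-in AMAZON ones):

{
    "interactionModel": {
        "languageModel": {
            "invocationName": "hello world",
            "intents": [
                {
                    "name": "HelloWorldIntent",
                    "slots": [],
                    "samples": [
                        "hello world",
                        "hi world",
                        "howdy world"
                    ]
                }
            ]
        }
    }
}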

What’s Happening In The Fulfillment JS?

So what is the code doing? Here is the default code:

const HelloWorldIntentHandler = {
    canHandle(handlerInput) {
        return Alexa.getRequestType(handlerInput.requestEnvelope) === 'IntentRequest'
            && Alexa.getIntentName(handlerInput.requestEnvelope) === 'HelloWorldIntent';
    },
    handle(handlerInput) {
        const speakOutput = 'Hello World!';
        return handlerInput.responseBuilder
            .speak(speakOutput)
            .getResponse();
    }
};

This is utilizing the ask-sdk-core and is essentially building JSON for us. canHandle lets the SDK know which requests this handler can deal with, specifically the ‘HelloWorldIntent’. handle takes the input and builds the response. What this generates looks like this:

{
    "body": {
        "version": "1.0",
        "response": {
            "outputSpeech": {
                "type": "SSML",
                "ssml": "Hello World!"
            },
            "type": "_DEFAULT_RESPONSE"
        },
        "sessionAttributes": {},
        "userAgent": "ask-node/2.3.0 Node/v8.10.0"
    }
}

We can see that speak outputs SSML in our JSON, which is what the user will hear spoken by Alexa.
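One thing the snippet above doesn’t show is how the handler gets wired up. At the bottom of the generated index.js, the handlers are registered with the skill builder, along the lines of this trimmed sketch:

const Alexa = require('ask-sdk-core');

// The skill builder ties the handlers together and exposes a
// Lambda-compatible entry point for Alexa to call. The generated
// template registers several more handlers (launch, help, errors)
// in the same way.
exports.handler = Alexa.SkillBuilders.custom()
    .addRequestHandlers(HelloWorldIntentHandler)
    .lambda();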

Building For Google Assistant


The simplest way to build Actions on Google is to use the AoG console in combination with Dialogflow. You can extend your skills with Firebase, but, as with the Amazon Alexa tutorial, let’s keep things simple.

Google Assistant uses three primary parts: AoG, which deals with the NLP; Dialogflow, which works out your intents; and Firebase, which fulfills the request and produces the response that will be sent back to AoG.

Just like with Alexa, Dialogflow allows you to build your functions directly within the platform.

Let’s Build An Action On Google

There are three platforms to juggle at once with Google’s solution, which are accessed by three different consoles, so tab up!

Setting Up Dialogflow

Let’s start by logging into the Dialogflow console. Once you have logged in, create a new agent from the dropdown just below the Dialogflow logo.

Give your agent a name, and in the ‘Google Project’ dropdown keep “Create a new Google project” selected.

Click the Create button and let it do its magic. It will take a little bit of time to set up the agent, so be patient.

Setting Up Firebase Functions

Right, now we can start to plug in the Fulfillment logic.

Head on over to the Fulfillment tab. Tick to enable the inline editor, and use the JS snippets below:

index.js

'use strict';

// So that you have access to the dialogflow and conversation object
const {  dialogflow } = require('actions-on-google'); 

// So you have access to the request response stuff >> functions.https.onRequest(app)
const functions = require('firebase-functions');

// Create an instance of dialogflow for your app
const app = dialogflow({debug: true});


// Build an intent to be fulfilled by firebase, 
// the name is the name of the intent that dialogflow passes over
app.intent('Default Welcome Intent', (conv) => {
  
  // Any extra logic goes here for the intent, before returning a response for firebase to deal with
    return conv.ask(`Welcome to a firebase fulfillment`);
  
});

// Finally we export as dialogflowFirebaseFulfillment so the inline editor knows to use it
exports.dialogflowFirebaseFulfillment = functions.https.onRequest(app);

package.json

{
  "name": "functions",
  "description": "Cloud Functions for Firebase",
  "scripts": {
    "lint": "eslint .",
    "serve": "firebase serve --only functions",
    "shell": "firebase functions:shell",
    "start": "npm run shell",
    "deploy": "firebase deploy --only functions",
    "logs": "firebase functions:log"
  },
  "engines": {
    "node": "10"
  },
  "dependencies": {
    "actions-on-google": "^2.12.0",
    "firebase-admin": "~7.0.0",
    "firebase-functions": "^3.3.0"
  },
  "devDependencies": {
    "eslint": "^5.12.0",
    "eslint-plugin-promise": "^4.0.1",
    "firebase-functions-test": "^0.1.6"
  },
  "private": true
}

Now head back to your intents and go to Default Welcome Intent. Scroll down to Fulfillment and make sure ‘Enable webhook call for this intent’ is checked for any intents you wish to fulfill with JavaScript. Hit Save.


Setting Up AoG

We are getting close to the finish line now. Head over to the Integrations tab, and click Integration Settings in the Google Assistant option at the top. This will open a modal, so let’s click Test, which will get your Dialogflow integrated with Google and open up a test window on Actions on Google.

In the test window, we can click Talk to my test app (we will change this in a second), and voilà, we have the message from our JavaScript showing on a Google Assistant test.

We can change the name of the assistant in the Develop tab, up at the top.

So What’s Happening In The Fulfillment JS?

First off, we are using two npm packages: actions-on-google, which provides all the fulfillment that both AoG and Dialogflow need, and firebase-functions, which, you guessed it, contains helpers for Firebase.

We then create the ‘app’ which is an object that contains all of our intents.

Each intent that is created is passed ‘conv’, which is the conversation object Actions on Google sends. We can use the content of conv to detect information about previous interactions with the user (such as their ID and information about their session with us).

We return ‘conv.ask’, which contains our return message to the user, ready for them to respond with another intent. We could use ‘conv.close’ if we wanted to end the conversation there.

Finally, we wrap everything up in a firebase HTTPS function, that deals with the server-side request-response logic for us.

Again, if we look at the response that is generated:

{
  "payload": {
    "google": {
      "expectUserResponse": true,
      "richResponse": {
        "items": [
          {
            "simpleResponse": {
              "textToSpeech": "Welcome to a firebase fulfillment"
            }
          }
        ]
      }
    }
  }
}

We can see that conv.ask has had its text injected into the textToSpeech area. If we had chosen conv.close the expectUserResponse would be set to false and the conversation would close after the message had been delivered.
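To see the difference for yourself, a second intent that closes the conversation might look like the sketch below (‘Goodbye Intent’ is a hypothetical intent you would also need to create in Dialogflow):

// Ends the conversation rather than waiting for another utterance,
// so expectUserResponse comes back as false in the generated payload.
app.intent('Goodbye Intent', (conv) => {
    return conv.close(`Thanks for stopping by, see you next time.`);
});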

Third-Party Voice Builders

Much like the app industry, as voice gains traction, third-party tools have started popping up in an attempt to alleviate the load on developers, allowing them to build once, deploy twice.

Jovo and Voiceflow are currently the two most popular, especially since PullString’s acquisition by Apple. Each platform offers a different level of abstraction, so it really just depends on how simplified you’d like your interface.

Extending Your Skill

Now that you have gotten your head around building a basic ‘Hello World’ skill, there are bells and whistles aplenty that can be added to your skill. These are the cherry on top of the cake of Voice Assistants and will give your users a lot of extra value, leading to repeat custom, and potential commercial opportunity.

SSML

SSML stands for Speech Synthesis Markup Language and operates with a similar syntax to HTML, the key difference being that you are building up a spoken response, not content on a webpage.

‘SSML’ as a term is a little misleading; it can do so much more than speech synthesis! You can have voices going in parallel, you can include ambient noises, speechcons (worth a listen to in their own right, think emojis for famous phrases), and music.

When Should I Use SSML?

SSML is great; it makes a much more engaging experience for the user, but what it also does is reduce the flexibility of the audio output. I recommend using it for more static areas of speech. You can use variables in it for names and so on, but unless you intend on building an SSML generator, most SSML is going to be pretty static.

Start with simple speech in your skill, and once it is complete, enhance areas which are more static with SSML, but get your core right before moving on to the bells and whistles. That being said, a recent report says 71% of users prefer a human (real) voice over a synthesized one, so if you have the facility to do so, go out and do it!
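To give a flavour of what this looks like in practice, here is a sketch of an Alexa handler that mixes plain speech with a pause and a whispered aside (‘SsmlExampleIntent’ is a hypothetical intent; <break> is standard SSML, while the whisper effect is Amazon-specific, so check each platform’s SSML reference for what it supports):

const SsmlExampleIntentHandler = {
    canHandle(handlerInput) {
        return Alexa.getRequestType(handlerInput.requestEnvelope) === 'IntentRequest'
            && Alexa.getIntentName(handlerInput.requestEnvelope) === 'SsmlExampleIntent';
    },
    handle(handlerInput) {
        // speak() accepts SSML as well as plain text, and wraps it in <speak> tags for us
        const speakOutput = 'Here is your daily update. '
            + '<break time="500ms"/>'
            + '<amazon:effect name="whispered">And here is a little secret, just for you.</amazon:effect>';
        return handlerInput.responseBuilder
            .speak(speakOutput)
            .getResponse();
    }
};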


In Skill Purchases

In-skill purchases (or ISP) are similar to the concept of in-app purchases. Skills tend to be free, but some allow for the purchase of ‘premium’ content or subscriptions within the skill. These can enhance the experience for a user, unlock new levels in games, or allow access to paywalled content.

Multimodal

Multimodal responses cover so much more than voice; this is where voice assistants can really shine with complementary visuals on devices that support them. The definition of multimodal experiences is much broader and essentially means multiple inputs (keyboard, mouse, touchscreen, voice, and so on).

Multimodal skills are intended to complement the core voice experience, providing extra information to boost the UX. When building a multimodal experience, remember that voice is the primary carrier of information. Many devices don’t have a screen, so your skill still needs to work without one; make sure to test with multiple device types, either for real or in the simulator.
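As a rough sketch of what that means in code, the actions-on-google library used earlier exposes the device’s capabilities on the conversation object, so a visual can be attached only when a screen is actually present (‘Show Menu Intent’ and the card contents are placeholders):

const { dialogflow, BasicCard, Image } = require('actions-on-google');

const app = dialogflow();

app.intent('Show Menu Intent', (conv) => {
    // Voice carries the core information for every device...
    conv.ask('Today we have soup, salad, or a sandwich. What would you like?');

    // ...and the visual is only added when the surface has a screen.
    if (conv.screen) {
        conv.ask(new BasicCard({
            title: 'Today\'s menu',
            text: 'Soup, salad, or a sandwich',
            image: new Image({
                url: 'https://example.com/menu.png',
                alt: 'Menu of the day',
            }),
        }));
    }
});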


Multilingual

Multilingual skills are skills that work in multiple languages and open up your skills to multiple markets.

The complexity of making your skill multilingual is down to how dynamic your responses are. Skills with relatively static responses, e.g. returning the same phrase every time, or only using a small bucket of phrases, are much easier to make multilingual than sprawling dynamic skills.

The trick with multilingual is to have a trustworthy translation partner, whether that is through an agency or a translator on Fiverr. You need to be able to trust the translations provided, especially if you don’t understand the language being translated into. Google Translate will not cut the mustard here!
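In fulfillment, the mechanics are usually straightforward: key your (professionally translated) responses on the locale the platform hands you. A minimal sketch using the conversation object from earlier might look like this (the German string is a placeholder for a proper translation, and on Dialogflow you would also add the extra language to the agent itself):

// Professionally translated responses, keyed by locale
const welcomeMessages = {
    'en-GB': 'Welcome to a firebase fulfillment',
    'en-US': 'Welcome to a firebase fulfillment',
    'de-DE': 'Willkommen bei einer Firebase-Antwort',
};

app.intent('Default Welcome Intent', (conv) => {
    // conv.user.locale holds the locale of the device that made the request
    const message = welcomeMessages[conv.user.locale] || welcomeMessages['en-GB'];
    return conv.ask(message);
});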

Conclusion

If there was ever a time to get into the voice industry, it would be now. The technology is both in its infancy and coming into its prime, and the big nine are plowing billions into growing it and bringing voice assistants into everybody’s homes and daily routines.

Choosing which platform to use can be tricky, but based on what you intend to build, the platform to use should shine through or, failing that, utilize a third-party tool to hedge your bets and build on multiple platforms, especially if your skill is less complicated with fewer moving parts.

I, for one, am excited about the future of voice as it becomes ubiquitous; screen reliance will reduce and customers will be able to interact naturally with their assistant. But first, it’s up to us to build the skills that people will want from their assistant.


It’s Time To Rethink Voice Assistants Completely

More than 200 million homes now have a smart speaker providing voice-controlled access to the internet, according to one global estimate. Add this to the talking virtual assistants installed on many smartphones, not to mention kitchen appliances and cars, and that’s a lot of Alexas and Siris.

Because talking is a fundamental part of being human, it is tempting to think these assistants should be designed to talk and behave like us. While this would give us a relatable way to interact with our devices, replicating genuinely realistic human conversations is incredibly difficult. What’s more, research suggests making a machine sound human may be unnecessary and even dishonest. Instead, we might need to rethink how and why we interact with these assistants and learn to embrace the benefits of them being a machine.

Speech technology designers often talk about the concept of “humanness.” Recent developments in artificial voice development have resulted in these systems’ voices blurring the line between human and machine, sounding increasingly humanlike. There have also been efforts to make the language of these interfaces appear more human.

Perhaps the most famous is Google Duplex, a service that can book appointments over the phone. To add to the human-like nature of the system, Google included utterances like “hmm” and “uh” to its assistant’s speech output—sounds we commonly use to signal we are listening to the conversation or that we intend to start speaking soon. In the case of Google Duplex, these were used with the aim of sounding natural. But why is sounding natural or more human-like so important?

Chasing this goal of making systems sound and behave like us perhaps stems from pop culture inspirations we use to fuel the design of these systems. The idea of talking to machines has fascinated us in literature, television, and film for decades, through characters such as HAL 9000 in 2001: A Space Odyssey or Samantha in Her. These characters have seamless conversations with humans. In the case of Her, there is even a love story between an operating system and its user. Critically, all these machines sound and respond the way we think humans would.

There are interesting technological challenges in trying to achieve something resembling conversations between us and machines. To this end, Amazon has recently launched the Alexa Prize, looking to “create socialbots that can converse coherently and engagingly with humans on a range of current events and popular topics such as entertainment, sports, politics, technology, and fashion.” The current round of competition asks teams to produce a 20-minute conversation between one of these bots and a human interlocutor.

These grand challenges, like others across science, clearly advance the state of the art, bringing planned and unplanned benefits. Yet when striving to give machines the ability to truly converse with us like other human beings, we need to think about what our spoken interactions with people are actually for and whether this is the same as the type of conversation we want to have with machines.

We converse with other people to get stuff done and to build and maintain relationships with one another—and often these two purposes intertwine. Yet people see machines as tools serving limited purposes and hold little appetite for building the kind of relationships with machines that we have every day with other people.

Pursuing natural conversations with machines that sound like us can become an unnecessary and burdensome objective. It creates unrealistic expectations of systems that can actually communicate and understand like us. Anyone who has interacted with an Amazon Echo or Google Home knows this is not possible with existing systems.

This matters as people need to have an idea of how to get a system to do things which, because voice-only interfaces have limited buttons and visuals, are guided significantly by what the system says and how it says it. The importance of interface design means humanness itself may not only be questionable but deceptive, especially if used to fool people into thinking they are interacting with another person. Even if their intent may be to create intelligible voices, tech companies need to consider the potential impact on users.

Looking beyond humanness

Rather than consistently embracing humanness, we can accept that there may be fundamental limits, both technological and philosophical, to the types of interactions we can and want to have with machines.

We should be inspired by human conversations rather than using them as a perceived gold standard for interaction. For instance, looking at these systems as performers rather than human-like conversationalists may be one way to help create more engaging and expressive interfaces. Incorporating specific elements of conversation may be necessary for some contexts, but we need to think about whether human-like conversational interaction is necessary, rather than using it as a default design goal.

It is hard to predict what technology will be like in the future and how social perceptions will change and develop around our devices. Maybe people will be okay with having conversations with machines, becoming friends with robots and seeking their advice.

But we are currently skeptical of this. In our view it is all to do with context. Not all interactions and interfaces are the same. Some speech technology may be required to establish and foster some form of social or emotional bond, such as in specific healthcare applications. If that is the aim, then it makes sense to have machines converse more appropriately for that purpose—perhaps sounding human so the user gets the right type of expectations.

Yet this is not universally needed. Crucially, this human-likeness should link to what the systems can actually do with conversation. Making systems that do not have the ability to converse like a human sound human may do far more harm than good.


Leigh Clark is a lecturer in computer science at Swansea University. Benjamin Cowan is an assistant professor at the School of Information & Communication Studies at University College Dublin. This story originally appeared on The Conversation

How SEOs Can Master Voice Search Now

You already know the entry-level SEO factors you need to think about constantly to make your rockstar brand visible to your audience. You’ve covered your keyword research, content strategy, domain authority and backlink profile. It’s all solid.

But at the same time, it’s 2019, and those elements won’t always cut it in the same ways they did ten or even five years ago. As we prepare to enter the 2020s, digital marketers everywhere need to stay current with changing trends in the SEO space. In this post, I’m talking about the mostly untapped opportunity of optimizing your SEO for voice search.

You know voice search, that on-the-rise realm of online querying that’s performed with nothing more than your voice and a virtual assistant, be it Amazon Alexa, Cortana, Google Assistant or Siri. You can buy things online, set reminders for yourself and, of course, perform searches.

I don’t know anyone who denies that advanced voice search is one of the coolest pieces of technology to come out of the 21st century so far. But what does it mean for SEO going forward? Here’s a statistic to give you an idea: Comscore has forecast that 50 percent of all online searches will be performed by voice search by 2020. That’s a sufficient reason for any digital marketer to take pause and think. Half of all online searchers will soon be finding results using their voices.

With that in mind, ask yourself: Is your SEO optimized for voice search? If it isn’t, you may be missing out on about a billion voice searches per month. In 2017, 13 percent of Americans owned some kind of smart assistant. This number was 16 percent by 2019 and is predicted to skyrocket to 55 percent by 2022. Let’s face it. Users like the convenience of interacting with the internet using only their voices and this should affect the way you do SEO.

With all of that said, here are four actionable tips for optimizing your SEO for voice search.

1. Think featured snippets

Voice queries that can be answered directly with a featured snippet almost always are. The Google Assistant specifically tries to do this wherever possible, reading most of the snippet aloud to the user. Position zero is a great place to be and digital marketers, of course, are already vying for that coveted spot. So how do you get to be the featured snippet for a voice search? How can you ensure that Google will read your site’s content out loud to a voice searcher?

  • First, featured snippets are not always pulled from position one. Only about 30 percent are, while the other 70 percent generally come from positions two through five. What does this tell you? It says that once you’re on page one, relevance matters more than position.
  • To become the featured snippet, your content should be optimized to answer specific questions. A large portion of featured snippets are related to recipes, health, and DIY subjects, but don’t be discouraged just because those aren’t your industries. Use SEMrush’s topic research tool or the free Answer the Public tool to generate content ideas for answering specific user questions.
  • Your content will be more likely to be featured in a snippet if it’s presented as a paragraph, list or table. If you go for the paragraph, try to keep it below 50 words, and make the sentences brief. You should also optimize the paragraph with your targeted keyword. Lists and tables are likely to get featured as well, since they’re easy to follow logically and visually. Whichever direction you go with your content, make sure it’s easy to understand and free of advanced terminology. Remember, you’re going for a large audience here, and jargony content is a huge turn-off.

Combine all of these steps – getting to page one, researching one specific query and answering that query briefly and in an easily digestible format – and you’ll be well on your way to getting your time in the spotlight with one of Google’s featured snippets.

Once you’ve done that, just imagine millions of virtual assistants presenting your page’s content as the best answer to a user question. That’s the power of voice search-optimized SEO.

2. Optimize your content for voice search

I touched on voice search-optimized content in the previous section, but content itself is important enough to merit its own section. By this point in the existence of search engines, the best way to type a query into an engine comes as pretty much second nature to most people. We know to keep our searches concise and detailed. “Italian restaurants Scranton” is a quintessential typed query.

As virtual assistants get smarter with every voice search, however, queries are becoming more conversational in nature. A person could say to Siri, “Show me the cheapest Italian restaurants in Scranton.” In response, Siri might say, “Here are the best Italian restaurants near your location.” It almost sounds like two people speaking. For that reason, optimizing content to be found by voice searchers will require you to leverage long-tail keywords such as “cheapest Italian restaurants in Scranton” rather than “Italian restaurants Scranton.”

Long-form content – as in, content with a word count above 1,800 words – is as strong in voice search as it is in traditional SEO, but it’s also a good idea to keep your sentences relatively short and not go out of control with your vocabulary. People use voice search like they talk in everyday life, so go for “reliable” over “steadfast.” You get the idea.

My final point on voice search-optimized content is, again, to use SEMrush’s topic research tool and the Answer the Public tool to find out what queries people are asking to find their way to websites like yours, and what those queries say about people’s plans at the moment. A query beginning with “what” shows someone who is looking for information, while a person with a “where” query is probably closer to acting on their intent. Use this information to your advantage when generating content for voice searches.

3. Perfect your mobile-friendliness

Most voice searches, particularly those involving some variation of “near me,” are performed on mobile devices by people on the go, people who perhaps find themselves in unfamiliar places and rely on voice searches to guide them to points of interest. It is therefore vital that you make your site as mobile-friendly as humanly possible.

If you’re lacking in the mobile-friendliness aspect, take action now. Your first job is to ensure your website has a responsive rather than an adaptive design. Responsive web pages will fit themselves to any screen, be it on a Galaxy phone or an iPad.

Then you need to work on site speed by compressing your files, using a web cache, optimizing your images, and minifying your code. It should take your mobile site no longer than five seconds to load, but aim for three to four seconds. That’s the Goldilocks zone for ensuring mobile users stay with you when they select a voice search result.

4. Focus on local SEO

Finally, you absolutely must optimize your pages for local SEO if you are, in fact, a local entity. This is because 22 percent of voice searches are related to local businesses such as restaurants.

To make sure potential customers in your area can find you, you just need to follow all the normal protocols for local SEO optimization. These include using geotargeted and “near me” search terms in your meta tags and on your landing pages. You should also create separate location pages for all your brick-and-mortar spots. Finally, be sure to claim your Google My Business page and keep your business hours, phone number and address updated and accurate. Do all this, and when users voice-search for “Show me bookstores near me,” they will find themselves face-to-face with your business.

The frequency of voice searches around the world is only going to increase in 2020 and as the decade continues. Voice search most certainly affects SEO, but there’s no need to fear. By taking the time to follow these steps, you can stay ahead of the curve and rank as well in voice results as you do in typical typed queries. The future is coming, and it is in every SEO’s best interests to pay attention.


Opinions expressed in this article are those of the guest author and not necessarily Search Engine Land. Staff authors are listed here.



About The Author

Kristopher Jones is a serial entrepreneur, angel investor, and best-selling author of “SEO Visual Blueprint” by Wiley (2008, 2010, 2013). Kris was the founder and former CEO of digital marketing and affiliate agency Pepperjam (sold to eBay) and has since founded multiple successful businesses, including ReferLocal.com, APPEK Mobile Apps, French Girls App, and LSEO.com, where he serves as CEO. Most recently, Kris appeared on Apple’s first TV Show, “Planet of the Apps,” where he and his business partner, comedian / actor Damon Wayans, Jr., secured $1.5 million for an on-demand LIVE entertainment booking app called Special Guest.

BBC: How To Design A Voice Experience

Designing for voice: a brand new challenge

Voice devices are growing in popularity – so much so that we now have an entire department dedicated to making voice experiences here at the BBC. So how do our designers in BBC Voice AI go about making experiences that work through smart speakers? 

Inspired by our work on the BBC Kids skill – a voice experience for three to seven year olds – we’ve come up with 12 design principles for voice. 

In this guide, we talk about how to:

  • write good content
  • use language and tone
  • handle errors
  • use sound effects
  • choose between the device voice and a human voice-over
  • test your designs

We believe that by understanding how to design a voice experience for children, you’ll learn how to design a voice experience for anyone.

Asking questions

1. If you’re asking a question, end the sentence with it and listen for an answer immediately.

When children have something to say to a smart speaker, they won’t wait nicely for their turn to speak. As soon as you ask a question they will want to answer it, so you should be ready to listen.

Whilst we can’t change how impulsive children are or make smart speakers listen when they’re speaking, we can write dialogue that prompts children to speak at the right time.

Bad practice: Asking a question mid sentence

Smart speaker: “Who do you want to play with? Andy or Justin?”

In early usability testing sessions, we observed children trying to answer after the first question: who do you want to play with? But at this point the smart speaker is still talking, and isn’t listening to what they’re saying. The child is also talking over information that they need to actually answer the question. 

So by writing dialogue this way, there’s a high likelihood that the interaction will be unsuccessful, either because the smart speaker hasn’t listened or the child doesn’t know how or when to answer. This can be frustrating for the child.

Best practice: Ending a sentence with a question

Smart speaker: “Andy and Justin are here. Who do you want to play with?”

2. Don’t tell children to ‘say this’ or ‘say that’, simply ask the question.

Given that conversation is the interface of a voice experience, it can feel patronising to tell your users exactly what to say. A good voice experience should aspire to feel like a natural conversation, not an automated phone system from the 1980s. As demonstrated in the example below, telling your users exactly what to say can make the experience lengthier than it needs to be.

Bad practice: Telling the user what to say

Smart speaker: If you’d like to play a game, say ‘game’. Or, if you’d like to hear a story, say ‘story’.

Read all 19 of these words aloud, and it’s easy to see how this is both insulting to the intelligence of the user and disrespectful of their time.

Best practice: Asking the user a question

Smart speaker: Would you like a game or a story?

Reduced to eight words, this approach is far more succinct and respectful. In our user testing, three year olds were able to answer this question with ease.

3. Don’t ask rhetorical questions; children will answer them.

As we mentioned in our first principle, children are impulsive and will immediately try and answer a question. This is especially true of rhetorical questions.

It can be very easy when writing dialogue for children to try and engage them with rhetorical questions, such as “wasn’t that nice?” as part of a longer sentence. Children won’t nod along and continue listening when stood by a smart speaker. They will try and answer.

Bad practice: Using a rhetorical question

Smart speaker: “I wonder who’s here to play?”

User: “I am!”

Smart speaker: “Go Jetters”

Smart speaker: “Hey Duggee”

Smart speaker: “Who do you want to play with?”

Here, as a way of priming the user for the options that are about to be presented, the rhetorical question “I wonder who’s here to play” is used. The child, not realising this is a rhetorical question, answers immediately and speaks over the available options. When the ‘real’ question is finally asked, they don’t know how to answer.

Best practice: Avoiding rhetorical questions

Smart speaker: “Let’s listen and find out who’s here to play.”

Smart speaker: “Go Jetters”

Smart speaker: “Hey Duggee”

Smart speaker: “Who do you want to play with?”

User: “Duggee!”

By using an instructional command “Let’s listen and find out who’s here to play”, the user is prompted to listen rather than answer.

4. Ask questions that have distinctive, easy-to-say answers.

During early rounds of usability testing, we observed just how often children were misunderstood by smart speakers. Since the launch of our skill, our analytics have shown that roughly 40% of everything children say to our skill is misunderstood.

So if children aren’t good at speaking and smart speakers aren’t good at listening, then what can we do to help?

We’ve found that asking for simple, distinct utterances reduces the margin for error. Ideally, pick one or two word phrases that are clearly different from the other options being offered. Following these guidelines makes it easier for children to understand, remember and say these utterances successfully. They’re also easier for a smart speaker to understand.

Bad practice: Offering similar sounding choices

Smart speaker: Wait or Walk?

These two terms are short, simple and easy to say. But they’re not distinct enough from one another.

Whilst the alliteration of “wait or walk” is easy to grasp, memorable and satisfying to say, the similarity in sound of the two terms can make it tricky for a smart speaker to differentiate between them. This can lead to misinterpretations and, in the case of this example, ‘Waffle the Wonder Dog’ walking when he should wait and waking Mrs Hobbs up. Disaster!

Best practice: Offering distinctive sounding choices

Smart speaker: Hide or Walk?

By replacing ‘wait’ with ‘hide’, the two options offered are now just as easy to say and remember, but far more distinct. They are much easier for a smart speaker to differentiate between.

Listening for answers

5. When offering a choice, provide no more than three options.

Another early insight taught us that children, when presented with options, struggle to retain any more than three. Research from other Voice AI projects also highlighted this as a limitation for adults following along with a recipe in a voice experience.

More than three options are difficult for users to remember. This can mean users are busy thinking about what they’ve forgotten when they should be making a decision. Decision-making involves weighing up the available options, and that’s very difficult to do if you can’t remember all of them.

Bad practice: Providing too many options

Smart speaker: We’ve got five games. Go Jetters, Hey Duggee, Waffle the Wonder Dog, Move It and Justin. Which game would you like?

In this example, five games are presented to the user at once. This many options can feel overwhelming. Few people would be able to remember all five of the options they heard. Consequently, the question at the end can make the user feel at fault for forgetting this information.

Best practice: Providing three or fewer options

Smart speaker: We’ve got five games. How about Go Jetters or Waffle the Wonder Dog? Choose one or ask for more.

However, in this example, the five games are split into smaller chunks of two games at a time. Users are able to process the information more easily, weighing up two options and deciding if they like what they’ve heard or if they want to hear more.

6. Strive to present options that are balanced in their appeal to children.

When first testing our navigation, we noticed that more children were choosing to play a game than listen to a story. Through further investigation, we learned that listen was associated with being told to listen at school. This wasn’t very appealing, especially presented against the option to play a game.

Providing choices that are equally weighted in their appeal is an ongoing challenge. But by getting this right we can be sure that children have the best chance of navigating through and discovering all of our content.

Bad practice: Providing options with unbalanced appeal

Smart speaker: Would you like to play a game or listen to a story?

‘Listen’ is off-putting for children, whilst ‘play’ has an almost insurmountable appeal. The options aren’t fairly weighted.

Best practice: Providing options with similar appeal

Smart speaker: Would you like a game or a story?

By removing the verbs ‘play’ and ‘listen’, we offer two far more balanced options. 

Handling errors

7. Don’t keep children stuck in error loops. Turn a bad situation good by progressing them even when they are misunderstood.

Children get frustrated if they aren’t able to do what they want within an experience. And if they’re consistently misunderstood, they will abandon the experience altogether.

To prevent this, if a child is misunderstood twice in our navigation, we make a choice for them. Even if they can’t speak clearly or can’t remember what to say, they’ll eventually reach some content.

We frame this randomly chosen piece of content as a surprise! The prospect of an unexpected treat is thrilling to a child and distracts from any frustration they may be feeling.

Bad Practice: Maintaining a state of frustration

Smart speaker: Would you like a game or a story?

User: [Misheard utterance]

Smart speaker: Sorry! I didn’t quite catch that. Would you like a game or a story?

User: [Misheard utterance]

Smart speaker: Oops, I still don’t understand. Would you like a game or a story?

User: [Misheard utterance]

Smart speaker: Oops, I still don’t understand. Would you like a game or a story?

Providing a second chance for a misheard utterance to be said again is good practice. But doing this multiple times is less sympathetic to the user. It can become a torturous loop they’re unable to break out of.

Best Practice: Progressing the action

Smart speaker: “Would you like a game or a story?”

User: [Misheard utterance]

Smart speaker: “Sorry! I didn’t quite catch that. Would you like a game or a story?”

User: [Misheard utterance]

Smart speaker: “Oops, I still don’t understand. Let’s play a surprise game.” *Drum roll*

Smart speaker: “Let’s join in and dance in Justin’s House!”

By providing a surprise after a second misheard utterance, the user is progressed to some content and away from a frustrating loop. Children love a surprise, so the experience of being misheard is positive, not negative.
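For those implementing this in code, the principle boils down to counting consecutive misunderstandings and taking over after the second one. Here is a minimal sketch using the ask-sdk-core library for Node.js; the dialogue strings and the ‘surprise’ behaviour are simplified stand-ins rather than the BBC’s actual implementation:

const FallbackIntentHandler = {
    canHandle(handlerInput) {
        return Alexa.getRequestType(handlerInput.requestEnvelope) === 'IntentRequest'
            && Alexa.getIntentName(handlerInput.requestEnvelope) === 'AMAZON.FallbackIntent';
    },
    handle(handlerInput) {
        // Count consecutive misunderstandings in the session attributes
        const attributes = handlerInput.attributesManager.getSessionAttributes();
        attributes.fallbackCount = (attributes.fallbackCount || 0) + 1;
        handlerInput.attributesManager.setSessionAttributes(attributes);

        if (attributes.fallbackCount < 2) {
            // First miss: apologise and give the user another go
            return handlerInput.responseBuilder
                .speak("Sorry! I didn't quite catch that. Would you like a game or a story?")
                .reprompt('Would you like a game or a story?')
                .getResponse();
        }

        // Second miss: progress the action with a surprise instead of looping
        attributes.fallbackCount = 0;
        handlerInput.attributesManager.setSessionAttributes(attributes);
        return handlerInput.responseBuilder
            .speak("Oops, I still don't understand. Let's play a surprise game!")
            .getResponse();
    }
};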

8. Don’t use language or tone to make the child feel as though they are to blame.

We’ve touched upon how easily children can become frustrated by a voice experience that doesn’t understand them. But when a voice experience blames a child for the problem, it can go beyond frustration and become upsetting.

During testing, we met one particularly shy and sensitive child who, through some careful facilitation, found the confidence to speak to our smart speaker. But this was short lived, as the speaker didn’t understand them and the response left them feeling frustrated and upset. 

Three words were to blame.

Bad practice: Putting the blame on the user

User: [Misheard utterance]

Smart speaker: Sorry, I didn’t understand what you said.

Those three words, ‘what you said’, put the responsibility for the misunderstanding on the child, making them feel at fault. Coming from trusted CBeebies presenter Rebecca, it’s no wonder that our participant lost their confidence.

Best practice: Taking the blame

User: [Misheard utterance]

Smart speaker: Sorry, I didn’t understand.

By removing ‘what you said’, the emphasis is placed onto ‘I’ and not ‘you’. The blame is taken away from the user entirely. 

Writing content

9. Use a real voice to speak to children in a warm, friendly tone – avoid cold, monotone, synthesised voices.

Watching any TV programme for preschoolers, you’ll notice a warm, friendly and inclusive tone. At present, the synthesised voices provided by smart speakers can’t get close to this dynamic delivery.

When we originally began working on the BBC Kids Skill we prototyped an experience where Alexa helped you navigate to Justin’s Hide & Seek game. The handover from synthesised voice to Justin jarred. Badly. Alexa felt lifeless by comparison.

We decided there and then to banish synthesised voices from the BBC Kids Skill altogether and turn that decision into one of our core design principles.

Bad practice: Using Alexa’s voice

It’s difficult not to feel underwhelmed by Alexa’s jarring delivery after hearing Justin and Ubercorn.

Best practice: Using a human voice

CBeebies’ Rebecca on the other hand delivers warmth and joy in every syllable. Her tone of voice knits together perfectly with Justin and Ubercorn to provide an altogether more cohesive experience. We carried this through to the skill – ensuring kids have the same connection with BBC Kids on smart speaker as they do with CBeebies on TV.

When writing the navigation script, we stayed true to Rebecca’s warm, friendly tone of voice and used language that reflected her natural style. At the record, we encouraged Rebecca to adapt the script, ensuring her delivery felt conversational – just as children are used to when watching her on CBeebies.

10. Use sound effects and music to break up dialogue and immerse children in make believe.

We observed that children lose attention during extended stretches of pure dialogue. But when it is broken up with sound effects, a voice experience is much more engaging.

Sound effects can be used to build a world out of audio, transporting the listener to virtually anywhere and increasing engagement even further. They are also important signposts to help the listener orient themselves in an experience that is entirely non-visual.

Bad practice: Uninterrupted dialogue

Pure uninterrupted dialogue for 24 seconds is hard for anyone to follow, let alone a 3-year-old. Without a pause for a sound effect or some music, the brain never stops processing the information it’s being fed. It asks the user for 24 seconds of pure concentration as they try to listen, understand and remember what’s being said. In short, this is boring.

Best practice: Dialogue punctuated by sound effects & music

Punctuated by Duggee’s ‘awoof’, the cheers of the Squirrels and some catchy music, the same information is far easier to absorb. It’s broken up nicely, it’s fun to listen to and it builds the world of Hey Duggee out of audio for the child to enjoy.

11. Encourage engagement beyond speaking and listening.

Children don’t sit still and talk to a smart speaker; they are interacting with the world around them. Speaking isn’t the only way they can engage with a product, and we’ve found that movement, dancing and performing are really popular ways to interact with the skill.

For many parents, smart speakers represent a welcome escape from screens and a healthy choice for their children. So, wherever possible, engage children in physical activity to enhance this healthy relationship with technology.

12. Test early and often. Try your design out with real children.

All of our principles came from making mistakes and learning from them in our research. We conducted seven rounds of usability testing over a six month period, with over 100 children on the way to releasing the BBC Kids Skill.

Alongside our learnings about the content of our skill, we also learned that you don’t need a fully working prototype to be able to test a voice experience. Using a Bluetooth speaker and some audio clips of your own voice, you can fake an intelligent experience using Wizard of Oz testing.

It’s also vital to test the working skill on a real device as soon as you can. Whilst Wizard of Oz testing puts your design under the microscope, testing a working skill is essential to uncover issues that may arise from a smart speaker’s natural language processing.

Conclusion

As we strive to craft more delightful Voice experiences, these principles are likely to grow and evolve. Our user research is continuous and our learning is ongoing.

These principles were informed by a project to build a voice experience for children. However, we strongly believe that they have universal application. The considerations and thought processes demonstrated by these principles can be applied to any voice experience.