In the last few years Amazon has captured a significant share of the home assistant market with the help of the Echo product family. Starting with the audio-only product line, the Dot, Tap, and Echo, Amazon has since branched out into other hardware such as the Echo Show and Echo Look. Each of these products comes with the voice assistant, Alexa. In addition to answering basic questions right out of the box, Alexa lets users install skills, which are essentially apps that extend the voice assistant's knowledge into other domains.
With over 15,000 skills available and more arriving every day, now is a great time to consider making an Alexa skill. This post discusses the basic components of an Alexa skill, starting with the vocal interface design and progressing through the software development life cycle. Although much of the content in this post is programming language agnostic, feel free to check out this repository for a Java starter.
The most important aspect of a skill is its vocal interface. A skill’s vocal interface should be as natural as interacting with a person. There are three main components of a skill’s vocal interface: invocation name, intents, and sample utterances.
An invocation name is used to launch the skill. Picking the invocation name is akin to picking a mobile application’s icon and name. Users are required to know and say a skill’s invocation name in order to use the skill. There are many best practices for choosing a name, but at the very least it should be easy to say and easy to remember.
An intent can be thought of as a goal that the user is attempting to achieve by invoking a skill. Each intent can also contain many slots. A slot defines a variable that the Alexa platform will parse and expose to the application code. For example, consider the phrase “What is the weather on Tuesday?” The general intent of the user is to get the weather, and a slot captures the specific day of the week.
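Intents are declared in a JSON intent schema. A representative schema for a simple hello-world skill might look like the following (the custom intent and slot names here are illustrative):

```json
{
  "intents": [
    {
      "intent": "HelloWorldIntent",
      "slots": [
        {
          "name": "Name",
          "type": "AMAZON.US_FIRST_NAME"
        }
      ]
    },
    {
      "intent": "AMAZON.HelpIntent"
    }
  ]
}
```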
The above snippet shows the definition of two intents: HelloWorldIntent and AMAZON.HelpIntent. HelloWorldIntent is a custom intent that contains one slot, Name, which will be available as a variable in the code. To increase the likelihood of Amazon parsing the value correctly, the slot has been assigned a type provided by Amazon, US first name. Amazon provides many other slot types, such as dates, actors, airlines, TV episodes, and more. The second intent is defined by Amazon and should be implemented to give a response that provides help to the user. Several other Amazon-defined intents are documented here.
So invocation names let users invoke skills, and intents implement the capabilities of the skill, but how do verbal phrases get mapped to an intent? The Alexa platform handles the complexities of natural language processing with the help of a manually curated file, SampleUtterances.txt. This file provides a brute-force mapping from each intent to all of the verbal phrases that should trigger it.
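For the HelloWorldIntent described above, a few entries in SampleUtterances.txt might look like this (the exact phrases are illustrative):

```
HelloWorldIntent say hello
HelloWorldIntent say hello to {Name}
HelloWorldIntent greet {Name}
HelloWorldIntent tell {Name} hello
```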
The first word in every line of SampleUtterances.txt is the intent name. The application code can inspect the intent name in order to respond properly. The text that follows on each line is a phrase that a user might say to achieve that intent. Even though there is a Name slot defined in the intent, a user is not required to say a phrase that includes that slot. The application is free to react differently based on the presence or value of the slot. To give Alexa the best chance of understanding users, it is recommended to include as many sample utterances as possible. It can be challenging to build up this file and may require user research and iteration. Depending on the scope of the skill there could be dozens, hundreds, or even thousands of ever-changing sample utterances. More best practices can be found here.
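To illustrate how application code might branch on the intent name and the optional Name slot, here is a minimal, SDK-free sketch. A real skill would read these values from the IntentRequest object supplied by the Alexa platform; the intent is modeled here as a plain name plus a slot map purely for illustration.

```java
import java.util.Map;

// Minimal sketch of intent dispatch. A real skill would pull the intent
// name and slot values from the incoming IntentRequest; here they are
// passed in directly so the example is self-contained.
public class IntentDispatcher {

    // Returns the speech text the skill should respond with.
    public static String handle(String intentName, Map<String, String> slots) {
        switch (intentName) {
            case "HelloWorldIntent":
                // The Name slot is optional: react to its presence or absence.
                String name = slots.get("Name");
                return (name == null) ? "Hello, world" : "Hello, " + name;
            case "AMAZON.HelpIntent":
                return "Try saying: say hello to Alice";
            default:
                return "Sorry, I didn't understand that.";
        }
    }

    public static void main(String[] args) {
        System.out.println(handle("HelloWorldIntent", Map.of("Name", "Alice")));
    }
}
```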
Putting all the pieces together, the entire vocal interface is best summed up by this image from the Amazon documentation.
Once the vocal interface has been designed, the development model is fairly straightforward. There is an input/output contract that needs to be implemented for each intent. The input object is an IntentRequest, a representation of the user's intent that includes all the slot values. The response object is a bit more complicated because there are multiple ways Alexa can respond to a user:

- Tell the user something and end the session.
- Ask the user a question, keep the session open for a reply, and supply a reprompt in case the user stays silent.
In addition, there are different mediums in which the response can be expressed: Alexa can speak the response aloud, or it can be displayed visually as a card in the Alexa app on the user's phone. Read the full description of the response object for more information.
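As a sketch of the two basic response shapes (speak-and-end versus ask-and-wait), the following hand-builds the raw JSON a skill returns. A real skill would use the helper classes in the Alexa Skills Kit SDK rather than assembling JSON by hand; this only shows the structure of the contract.

```java
// Sketch of the Alexa response body. The field names (outputSpeech,
// reprompt, shouldEndSession) come from the Alexa response format;
// the class and method names here are illustrative.
public class AlexaResponse {

    // "Tell" response: Alexa speaks the text and ends the session.
    public static String tell(String speechText) {
        return "{\"version\":\"1.0\",\"response\":{"
             + "\"outputSpeech\":{\"type\":\"PlainText\",\"text\":\"" + speechText + "\"},"
             + "\"shouldEndSession\":true}}";
    }

    // "Ask" response: Alexa speaks, keeps the session open, and
    // reprompts if the user stays silent.
    public static String ask(String speechText, String repromptText) {
        return "{\"version\":\"1.0\",\"response\":{"
             + "\"outputSpeech\":{\"type\":\"PlainText\",\"text\":\"" + speechText + "\"},"
             + "\"reprompt\":{\"outputSpeech\":{\"type\":\"PlainText\",\"text\":\"" + repromptText + "\"}},"
             + "\"shouldEndSession\":false}}";
    }

    public static void main(String[] args) {
        System.out.println(tell("Hello, world"));
    }
}
```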
Alexa skills can be deployed as a generic web service or as an AWS Lambda function. When first starting, the simplest approach is to deploy as a Lambda. Currently AWS Lambda supports code written in:

- Node.js
- Java
- Python
- C#
If the skill is instead deployed as a web service, it must satisfy these requirements, which include serving over HTTPS on port 443 with a valid certificate and verifying that incoming requests were sent by Alexa.
Beyond unit tests, there are several ways to test skills, some of which do not even require an official Alexa device. Of course, the best testing solution is to install your skill onto a device that is connected to your development account. The next best option is to use the free web-based tool Echosim. Amazon also offers a text-based testing approach that can be a useful smoke test when deploying as a Lambda.
Each one of these topics deserves more care and consideration than this post can possibly cover, so dig deeper when implementing for the first time. There are countless other Alexa capabilities not even mentioned in this post, such as Account Linking, the Speech Synthesis Markup Language (SSML), and IoT integration. No matter what a skill does, it is a good idea to start small and gradually expand scope. The age of vocal interfaces and voice applications is young, but at the rate at which they are growing they could become just as important as mobile is now. Alexa is an incredible development platform that I found surprisingly fun and easy to use, so check out this sample and get started!