February 2018 – August 2018
- User Research
- User Testing
- Pepper robot
- IBM Watson Services
- Pen and paper
This case study was my graduation project at IBM Benelux in Amsterdam, the Netherlands.
The purpose of the project was to implement IBM Watson services such as Watson Assistant, Watson Speech-to-Text, Watson Text-to-Speech, Watson Visual Recognition, and Watson Natural Language Understanding on the Pepper robot.
I worked with two other students: one who programmed the robot and another who generated the robot's gestures using machine learning. I worked mainly on IBM Watson Assistant to design the interaction and conversation flow between the robot and people. I was responsible for planning, executing, and delivering a user study and usability tests as the pilot project. The results of the user study and usability tests were implemented as interaction improvements. I also presented part of the project during the Dutch Technology Week 2018, where visitors could talk and interact with the robot.
The robot would serve as a hospitality robot at IBM's office building. To better understand the existing interaction between the receptionists and visitors, a user study was conducted. The purpose was to identify the visitors' goals and needs, how they accomplish those goals, and what they say to accomplish them, and to gather the most frequent questions asked of the receptionists as well as the key use cases.
The study was done with the receptionists at the company, since they interact directly with visitors on a daily basis.
I approached the head of the receptionists in the lobby at a moment when no visitors were present. First, I introduced myself and explained the purpose of the study and the procedure. The participant was given a paper questionnaire to fill in whenever they had time, and the answers would be picked up by the end of the day.
At the end of the day, I returned to the reception desk to pick up the surveys. The receptionists were also asked whether they had any questions. All participants were thanked for their participation, and it was emphasized that their contribution was highly valuable.
The results of the survey showed that visitors, business customers, and foreign employees were the most common user types, while appointment handling, booking a taxi, and asking for a building pass/badge were the most frequent user goals at the reception desk.
The most frequent questions included "Will you call Mr./Ms. …?", "Can you call …?", or "We have an appointment with Mr./Ms. …" for appointments; "Can you provide me a badge?" or "I forgot my badge" for requesting a building badge; and "Can you book me a taxi?" or "Can you call a taxi?" for booking a taxi. Other questions reported in the survey were "Where can I find room …?", "Where is the toilet?", or "Where is the waiting room?" for locating places in the building; "I lost my …, did someone bring it?" for lost and found; and "What is the WiFi here?" regarding the building facilities.
Conversation Design Process
The present study followed a guide for designing conversational agents, adapted from IBM Design Thinking and the IBM Watson Conversation Design Methodology provided by IBM Skills Gateway.
Key Use Cases
Based on the results of the study, appointment handling was chosen as the robot's main interaction, considering the practical limitations of the other frequent user goals. Booking a taxi was not included, as it only happens once users are already inside the building. Providing a badge was also excluded, as it requires a real human to check the visitor's ID and hand over the badge. Answering questions about the building facilities and playing a small quiz were chosen as the robot's minor capabilities.
Once we had a clear picture of the users and key use cases, we defined the conversation flow. Below, a sample dialog gives a quick sense of that flow. It conveys the flow a user experiences without any constraints, called the 'happy flow'. To illustrate it, imagine the following conversation:
Robot: Hi, I’m Casper. Welcome to IBM. Do you have an appointment today?
User: Yes, I have a meeting with John.
Robot: Okay. What time?
User: At 10.30
Robot: Let me confirm it. So, you have an appointment with John at 10.30?
User: Yes.
Robot: Great! Could you please tell me your first name, so I can notify your host that you have arrived?
User: My name is Linda.
Robot: Thank you. I have notified your host of your arrival.
Gathering User Utterances
Since we used Watson Assistant as the conversation engine, we needed to gather user utterances along with the intents and entities they express. Intents are the goals users have when they interact with the robot, and they can be extracted from users' input. For example, the intent for the dialog above was #appointment. Imagine what you would say if you had an appointment: "I have a meeting with John", "I have an appointment with John", "I would like to meet John", or "I'm here for John" were sample user utterances used as the first training data for the dialog flow in Watson Assistant. To learn more about intents or Watson Assistant in general, see IBM's Watson Assistant documentation.
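To make the idea concrete, here is a minimal sketch of how intent training examples can be organized. The intent names and example utterances below are taken from or modeled on this case study; the toy matcher is purely illustrative, since Watson Assistant trains a proper machine-learning classifier on such examples rather than matching keywords.

```python
# Illustrative intent training data, as prepared for Watson Assistant.
# The "#facilities" intent and the matcher below are assumptions for
# this sketch; Watson itself classifies utterances with a trained model.

TRAINING_EXAMPLES = {
    "#appointment": [
        "I have a meeting with John",
        "I have an appointment with John",
        "I would like to meet John",
        "I'm here for John",
    ],
    "#facilities": [
        "Where is the toilet?",
        "Where can I find the waiting room?",
        "What is the WiFi here?",
    ],
}

def match_intent(utterance: str) -> str:
    """Toy matcher: pick the intent whose examples share the most words
    with the utterance. Falls back to "#irrelevant" on no overlap."""
    words = set(utterance.lower().split())
    best_intent, best_score = "#irrelevant", 0
    for intent, examples in TRAINING_EXAMPLES.items():
        score = max(len(words & set(ex.lower().split())) for ex in examples)
        if score > best_score:
            best_intent, best_score = intent, score
    return best_intent
```

The value of collecting varied phrasings up front is visible even in this toy version: a new utterance such as "I have an appointment with Anna" still lands on #appointment because it overlaps with the stored examples.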
Conversation Testing and Iteration
To test the flow of the conversation and the overall performance of the conversational agent, user testing was conducted. This was also done to surface errors that could occur once the complete solution was integrated and to check whether the natural language processing worked.
Participants were recruited from personal contacts and asked to interact with the conversational agent. They were asked to imagine themselves as first-time visitors to an office building, with an appointment with an employee at a certain time. They were told to imagine that, when they entered the building, a robot greeted them and asked the purpose of their visit. From that point, the interaction was up to them; however, it was emphasized that they should act as naturally as possible and stick to the purpose of the visit. Participants had a chance to ask questions if the instructions were unclear. Next, they were asked to sit in front of the laptop and start the experiment.
For the analysis, the conversation log of each participant was checked. For each user turn, the utterance was analyzed to see whether it matched the correct intent and whether the system's response corresponded to it. For each mismatch, the user's utterance was added as an example to the correct intent. The dialog flow was also re-arranged to improve the conversation. This process was repeated for each participant's responses.
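The review-and-retrain loop described here can be sketched as a small log-processing step. The field names (`utterance`, `predicted_intent`, `correct_intent`) are assumptions for this sketch; in practice the correct intent is assigned during manual review of the log.

```python
# Hypothetical log-review sketch: collect mismatched utterances as new
# training examples for the intent they should have matched.

def collect_corrections(log):
    """log: list of dicts with 'utterance', 'predicted_intent', and
    'correct_intent' keys. Returns {intent: [utterances to add]}."""
    corrections = {}
    for turn in log:
        if turn["predicted_intent"] != turn["correct_intent"]:
            corrections.setdefault(turn["correct_intent"], []).append(
                turn["utterance"]
            )
    return corrections
```

Feeding the collected utterances back as intent examples is what gradually widens the classifier's coverage across testing iterations.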
The Complete Solution
In this phase, Watson Assistant was integrated with the other IBM Watson services (Speech-to-Text, Text-to-Speech, and Visual Recognition) on the robot, ready for real interaction with users.
The study was designed as a field study, with the robot serving as a hospitality robot at IBM's office building. However, the present study covered only the first user-testing phase, as a pilot experiment. This pilot allowed testing the integrated system with participants, including all software modules and the dialog used for the evaluation. The purpose was to see whether the overall system worked as expected, whether people understood how to interact with the robot, and to find possible errors in the dialog or interface.
The participants were IBM employees. In total, 39 people took part. They were contacted through personal connections, and the sample consisted of the people who were easiest to reach: a convenience sample. These people were not necessarily novices, and there was no need for such a requirement, since the objective of this study was not to examine first-time users' interaction with the robot. Still, 16 of the 39 participants reported no prior experience with the robot.
The experiment took place at the IBM Benelux office in Amsterdam, the Netherlands. The robot was placed in one of the office's meeting rooms; the exact room varied depending on where participants could conveniently be found. A laptop provided the link to the robot evaluation questionnaire, built with LimeSurvey, at the end of the experiment. A mobile phone's voice recorder captured the conversation between the participant and the robot. All speech said and heard by the robot was shown on the robot's screen, so people could use it as guidance or as a fallback during the ongoing interaction.
The Godspeed questionnaire (Bartneck, Kulić, Croft, & Zoghbi, 2009) provided the dependent variables measuring people's perception of the robot. However, the composition of the questionnaire in this study was adapted based on the Robotic Social Attributes Scale (RoSAS) questionnaire (Carpinella, Wyman, Perez, & Stroessner, 2017). As a manipulation check, "Did the robot understand what you said?" was asked after the robot-perception questionnaire, using a 5-point Likert scale for how often the robot understood participants' utterances: Never, Rarely, Occasionally, Frequently, or Constantly. Participants' prior experience was also checked with the item "I work with humanoid robots on a daily basis", rated on a 10-point scale from 1 (Not at all) to 10 (I work with humanoid robots on a daily basis). See Appendix B for the complete questionnaire. Participants were also asked two questions after the experiment: "What do you think of the interaction?" and "Any suggestion for improvement?". The purpose was to capture their verbal evaluation of the interaction and to gather suggestions for improving the complete solution, since this study was the first testing phase.
In addition, objective measurements were collected:
1) Task completion by the robot; evaluated as true if the robot obtained the name of the employee and the appointment time, and the user confirmed them as correct.
2) Complete interaction; evaluated as true if the interaction was carried through to the end.
3) Total number of turns by the user and the robot.
4) Total number of turns by the user, divided into two categories: using speech or touching the robot's screen.
5) Number of errors; user utterances were evaluated based on Bohus & Rudnicky's (2008) source-of-error analysis. Each user utterance was labeled as a misunderstanding, out-of-grammar, out-of-application-scope, out-of-domain, a user response to non-understanding, or a correction.
6) Number of recovery strategies.
7) Duration of the interaction.
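Several of these measures can be derived directly from a conversation transcript. The transcript format below (speaker, input modality, timestamp in seconds) is an assumption for this sketch, not the actual logging format used in the study.

```python
# Illustrative computation of turn counts and interaction duration
# from a conversation transcript (hypothetical format).

def summarize_interaction(transcript):
    """transcript: list of (speaker, modality, timestamp_seconds) tuples,
    with speaker in {'user', 'robot'} and modality in {'speech', 'touch'}.
    Returns the turn counts and duration used in the evaluation."""
    user_turns = [t for t in transcript if t[0] == "user"]
    return {
        "total_turns": len(transcript),
        "user_turns": len(user_turns),
        "user_speech_turns": sum(1 for t in user_turns if t[1] == "speech"),
        "user_touch_turns": sum(1 for t in user_turns if t[1] == "touch"),
        "duration_seconds": transcript[-1][2] - transcript[0][2],
    }
```

Measures that require human judgment, such as the error labels and task completion, were of course coded manually rather than computed this way.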
Participants entered the room and were asked to sit in front of the robot. The researcher then explained the purpose of the experiment and the procedure. First, participants talked with the robot and asked it about personal things (e.g., name, age, gender, favorite books, favorite movies, favorite colors). There were no strict rules or restrictions on how they should interact; the purpose was to make them familiar with the robot and the interaction. Afterwards, the researcher explained the procedure of the real test. Participants had to imagine themselves as IBM visitors, coming to the IBM office for the first time. They had to imagine that, when they entered the main building, the robot stood between the main entrance and the reception desk. The robot could assist IBM's visitors with their appointments and answer questions about the building facilities. For the experiment, participants were told they had an appointment with an IBM employee; the name of the employee and the appointment time were their own choice. Before the real test, participants were also asked whether it was okay to record their voice while they interacted with the robot.
After the real testing, participants were asked to evaluate the interaction and to give any suggestions for improvement. Then the robot evaluation questionnaire was presented.
- Watson Speech to Text works with high enough accuracy for this type of interaction. A large part of the sentences are transcribed correctly, and Watson Assistant corrects small errors such as missing words. Accents do cause some wrong transcriptions, but by repeating the sentence with clearer pronunciation, the interaction could still be finished following the happy flow.
- Users do not experience problems in how to use and approach the robot given the scenario. This observation is drawn because people did not hesitate about what to say, did not (often) have to repeat themselves, and were able to finish the conversation following the happy flow of the hospitality scenario, thus completing the given task. It must be noted that people were allowed to get to know the robot before the actual scenario was activated, and they were given a goal, whereas in the evaluation at the reception this will not be the case, possibly influencing how participants approach the robot. To improve on this, the dialog was given more structure and more precisely formulated questions, leaving less ambiguity about what the robot wants to know from the user. More suggestions and buttons were also added on the tablet at various interaction steps, as this seemed to help people decide what to say, and the tablet buttons were enlarged and made clearer.
- English Speech to Text does not work well in a Dutch-oriented office. With English Speech to Text, half or more of the Dutch names are not understood correctly and are converted into similar-sounding English words, which makes identifying the visitor by name hard to do. People could use an English pronunciation of their name to work around this, but that has to be explained in advance. As a solution, an alternative way of entering the name was added in the form of a software keyboard on the tablet. If there was no correct name after the first try, a keyboard opened automatically, pre-filled with the text "My name is .". Participants used this option often after it was implemented.
- Transcribing single letters does not work. During the quiz game, users have to answer 'option A', 'B', or similar. However, Speech to Text did not transcribe single letters correctly, resulting in wrong input for Watson Assistant and sometimes sending the conversation in unintended directions. To solve this, the answer space of the quiz was narrowed. Previously it was not indicated whether people could say, for example, 'Option B' or the actual answer to the question, and 'Option B' was often transcribed incorrectly. Now instructions are given at the start of the game to either say 'option <number>' or tap the answer on the tablet: 'one, two, three, and four' can be transcribed by Watson STT, while 'A, B, C, and D' cannot in our case.
- A clear separation of actors on the tablet is needed. Some people were confused by the text on the tablet, since it was sometimes unclear which side (left for the user's transcription, right for the robot's response) represented whom. This has been improved by making the difference between the speakers clearer and by removing (fading away) low-certainty transcriptions sooner.