American engineers have created an algorithm that lets a smartphone determine what the user is currently looking at and thereby interpret voice commands more accurately. It combines data from the front and rear cameras to calculate the point the user's gaze falls on. This allows users to give voice commands containing pronouns, such as "What time does this shop close?". A paper describing the algorithm will be presented at the CHI 2020 conference.
Voice assistants have developed considerably in recent years, but several fundamental problems remain, including weak handling of context. For example, they often fail to connect a new command with the preceding dialogue and cannot resolve pronouns. Yet in everyday conversation people constantly use such references, so supporting them would make voice assistants more human-like.
For a smartphone to understand what the user means in such cases, it needs data from the camera. The most obvious implementation would be to simply point the camera at the object of interest so that it sits in the center of the frame. But that makes using the smartphone unnatural, so developers led by Chris Harrison of Carnegie Mellon University proposed using the cameras on both sides simultaneously to determine the direction of the user's gaze without forcing them to aim the smartphone precisely.
The authors used an iPhone running iOS 13, because starting with this version the system allows both cameras to be used at once. To recognize gaze direction, the developers used the system API for tracking head position. From it the program obtains a vector giving the direction of the head and, knowing the parameters of both cameras, transfers this data into the coordinate frame of the rear camera.
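The geometric step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the two cameras are related by a 180-degree rotation (their small physical offset is negligible at room scale), assumes a fixed scene depth, and uses a simple pinhole model with hypothetical intrinsics.

```python
import numpy as np

def gaze_point_in_rear_frame(head_dir_front, fx, fy, cx, cy, depth=2.0):
    """Project a head-direction vector, given in the front camera's
    coordinates, to a pixel position in the rear camera's image.

    head_dir_front: unit 3-vector from the head-tracking API (assumed).
    fx, fy, cx, cy: rear-camera intrinsics (focal lengths, principal point).
    depth: assumed distance in meters to the scene along the gaze ray.
    """
    # The rear camera faces the opposite way, so rotate 180 degrees
    # about the vertical axis; the offset between lenses is ignored.
    R = np.diag([-1.0, 1.0, -1.0])
    d = R @ np.asarray(head_dir_front, dtype=float)
    # Intersect the gaze ray with a fronto-parallel plane at the assumed depth.
    p = d * (depth / d[2])
    # Pinhole projection into rear-camera pixel coordinates.
    u = fx * p[0] / p[2] + cx
    v = fy * p[1] / p[2] + cy
    return u, v
```

With these conventions, a gaze directed straight ahead lands on the principal point of the rear image, and tilting the head shifts the projected point accordingly.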
Objects in the rear-camera feed are recognized by a framework built into iOS. The main limitation is that it works only with objects it already knows, but the developers suggest this could be solved with a shared cloud database. The algorithm matches the gaze vector against the objects detected by the rear camera and ranks them by their distance from it.
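The ranking step could look something like the sketch below. The bounding-box format and the use of box centers are assumptions for illustration; the paper's actual scoring may differ.

```python
import math

def rank_by_gaze(objects, gaze_xy):
    """Rank detected objects by how close their bounding-box center
    lies to the estimated gaze point in the rear-camera image.

    objects: list of (label, (x, y, w, h)) boxes in pixel coordinates.
    gaze_xy: (u, v) gaze point in the same coordinates.
    """
    gu, gv = gaze_xy

    def distance(obj):
        _, (x, y, w, h) = obj
        bx, by = x + w / 2, y + h / 2  # bounding-box center
        return math.hypot(bx - gu, by - gv)

    return sorted(objects, key=distance)
```

The first element of the returned list is then the most likely referent of a pronoun in the user's command.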
The program listens to the user for an activation phrase, recognizing words with the system's built-in dictation algorithm. Once the user has said the activation phrase and a command, the program matches the pronoun in the command to the ranked objects and produces a final command in which the pronouns are replaced with specific objects. Since the application is a demonstration of the method, it processes the command and reads out the response itself, but if desired the resolved command could be handed to a voice assistant for processing or even built into the system.
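The substitution step can be sketched as a simple text rewrite. The pronoun list and the rule of always picking the top-ranked object are illustrative assumptions; a real implementation would handle grammar (e.g. "this" vs. "it") more carefully.

```python
import re

# Pronouns the demo might resolve; a hypothetical list for illustration.
PRONOUNS = ("this", "that", "it")

def resolve_pronouns(command, ranked_labels):
    """Replace each pronoun in the command with the label of the
    highest-ranked detected object (the one nearest the gaze point)."""
    if not ranked_labels:
        return command  # nothing detected; pass the command through unchanged
    target = ranked_labels[0]
    pattern = r"\b(" + "|".join(PRONOUNS) + r")\b"
    return re.sub(pattern, target, command, flags=re.IGNORECASE)
```

For example, with a lamp as the top-ranked object, "turn it on" becomes "turn lamp on", which can then be executed or forwarded like any ordinary command.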
One of the three authors works at Apple, which recently filed a patent application for a similar method that allows commands with pronouns to be specified using gaze. The application describes various implementations of such a system, including a smart speaker with a built-in camera and a smartphone standing in a room.