Head-nods and turn-taking both contribute significantly to conversational dynamics in dyadic interactions. Timely prediction and use of these events are valuable for dialog management systems in human-robot interaction. In this study, we present an audio-visual prediction framework for head-nod and turn-taking events that can also be utilized in real-time systems. Prediction systems based on Support Vector Machines (SVM) and Long Short-Term Memory Recurrent Neural Networks (LSTM-RNN) are trained on human-human conversational data. Unimodal and multi-modal classification performances for head-nod and turn-taking events are reported over the IEMOCAP dataset.
Index Terms: head-nod, turn-taking, social signals, event prediction, dyadic conversations, human-robot interaction
Authors: B. B. Turker, E. Erzin, Y. Yemez and M. Sezgin
Read the full paper.
This paper addresses the problem of evaluating the engagement of the human participant by combining verbal and nonverbal behaviour along with contextual information. The study is carried out over four different corpora. Four different systems, designed to explore essential and complementary aspects of the JOKER system in terms of paralinguistic/linguistic inputs, were used for the data collection. An annotation scheme dedicated to the labeling of verbal and non-verbal behavior has been designed. Our experiments suggest that engagement in HRI should be treated as multifaceted.
Keywords: Human-Robot Interaction; Dataset; Engagement; Speech Recognition; Affective Computing
Authors: L. Devillers, S. Rosset, G. Dubuisson Duplessis, L. Bechade, Y. Yemez, B. B. Turker, M. Sezgin, E. Erzin, K. El Haddad, S. Dupont, P. Deleglise, Y. Esteve, C. Lailler, E. Gilmartin and N. Campbell
Read the full paper.
Human eyes exhibit different characteristic patterns during different virtual interaction tasks such as moving a window, scrolling a piece of text, or maximizing an image. The human-computer interaction literature contains examples of intelligent systems that can predict users’ task-related intentions and goals based on eye gaze behavior. However, these systems are generally evaluated in terms of prediction accuracy, and on previously collected offline interaction data. Little attention has been paid to creating real-time interactive systems using eye gaze and evaluating them in online use. We make five main contributions that address this gap from a variety of aspects. First, we present the first line of work that uses real-time feedback generated by a gaze-based probabilistic task prediction model to build an adaptive real-time visualization system. Our system is able to dynamically provide adaptive interventions that are informed by real-time user behavior data. Second, we propose two novel adaptive visualization approaches that take into account the presence of uncertainty in the outputs of prediction models. Third, we offer a personalization method to suggest which approach will be more suitable for each user in terms of system performance (measured in terms of prediction accuracy). Personalization boosts system performance and provides users with the better-suited visualization approach (measured in terms of usability and perceived task load). Fourth, by means of a thorough usability study, we quantify the effects of the proposed visualization approaches and prediction errors on natural user behavior and the performance of the underlying prediction systems. Finally, this paper also demonstrates that our previously published gaze-based task prediction system, which was assessed as successful in an offline test scenario, can also be successfully utilized in realistic online usage scenarios.
Keywords: implicit interaction, activity prediction, task prediction, uncertainty visualization, gaze-based interfaces, predictive interfaces, proactive interfaces, gaze-contingent interfaces, usability study
Authors: Çağla Çığ and T. M. Sezgin
Read the full paper.
This work advances our understanding of children’s visualization literacy, and aims to improve it with a novel approach for teaching visualization at elementary schools. We first contribute an analysis of data graphics and activities employed in grade K to 4 educational materials, and the results of a survey conducted with 16 elementary school teachers. We find that visualization education could benefit from integrating pedagogical strategies for teaching abstract concepts with established interactive visualization techniques. Building on these insights, we develop and study design principles for novel interactive teaching material aimed at increasing children’s visualization literacy. We specifically contribute an online platform for teachers and students to respectively teach and learn about pictographs and bar charts and report on our initial observations of its use in grades K and 2.
Author Keywords: visualization literacy; qualitative analysis.
Authors: B. Alper, N. H. Riche, F. Chevalier, J. Boy and T. M. Sezgin
Read the full paper.
We present a work-in-progress report on a sketch- and image-based software called “CHER-ish” designed to help make sense of the cultural heritage data associated with sites within 3D space. The software is based on previous work in the domain of 3D sketching for conceptual architectural design, i.e., a system which allows the user to visualize urban structures by a set of strokes located in virtual planes in 3D space. In order to interpret and infer the structure of a given cultural heritage site, we use a mix of data such as site photographs and floor plans, and then we allow the user to manually locate the available photographs and their corresponding camera positions within 3D space. With the photographs’ camera positions placed in 3D, the user defines a scene’s 3D structure by means of strokes and other simple 2D geometric entities. We introduce the main system components: virtual planes (canvases) and 2D entities (strokes, line segments, photos, polygons), and provide a description of the methods that allow the user to interact with them within the system to create a scene representation. Finally, we demonstrate the usage of the system on two different data sets: a collection of photographs and drawings from Dura-Europos, and drawings and plans from Horace Walpole’s Strawberry Hill villa.
Authors: V. Rudakova, N. Lin, N. Trayan, T. M. Sezgin, J. Dorsey and H.
We address the problem of continuous laughter detection over audio-facial input streams obtained from naturalistic dyadic conversations. We first present meticulous annotation of laughter, cross-talk and environmental noise in an audio-facial database with explicit 3D facial mocap data. Using this annotated database, we rigorously investigate the utility of facial information, head movement and audio features for laughter detection. We identify a set of discriminative features using mutual information-based criteria, and show how they can be used with classifiers based on support vector machines (SVMs) and time delay neural networks (TDNNs). Informed by the analysis of the individual modalities, we propose a multimodal fusion setup for laughter detection using different classifier-feature combinations. We also incorporate bagging into our classification pipeline to address the class imbalance problem caused by the scarcity of positive laughter instances. Our results indicate that a combination of TDNNs and SVMs leads to superior detection performance, and that bagging effectively addresses data imbalance. Our experiments show that our multimodal approach supported by bagging compares favorably to the state of the art in the presence of detrimental factors such as cross-talk, environmental noise, and data imbalance.
Index Terms: laughter detection, naturalistic dyadic conversations, facial mocap, data imbalance
Authors: B. B. Türker, Y. Yemez, T. M. Sezgin, E. Erzin.
Read the full paper.
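The laughter detection abstract above mentions bagging as a remedy for class imbalance. As a rough illustration only, the sketch below shows one common variant of that idea: each bag keeps every rare positive sample and draws an equally sized random subset of negatives, and the bagged models vote. The bag construction, the 1-D nearest-centroid base classifier (a stand-in for the SVMs and TDNNs named in the abstract), and all function names are assumptions made for this example, not the authors' pipeline.

```python
import random
from statistics import mean


def train_centroid(samples):
    """Fit a toy 1-D nearest-centroid model from (feature, label) pairs."""
    pos = mean(x for x, y in samples if y == 1)
    neg = mean(x for x, y in samples if y == 0)
    return pos, neg


def predict_centroid(model, x):
    """Return 1 if x is closer to the positive centroid, else 0."""
    pos, neg = model
    return 1 if abs(x - pos) < abs(x - neg) else 0


def balanced_bagging(samples, n_models=11, seed=0):
    """Train n_models base classifiers, each on a class-balanced bag (assumed variant)."""
    rng = random.Random(seed)
    positives = [s for s in samples if s[1] == 1]
    negatives = [s for s in samples if s[1] == 0]
    models = []
    for _ in range(n_models):
        # Keep all rare positives; undersample negatives to the same count.
        bag = positives + rng.sample(negatives, len(positives))
        models.append(train_centroid(bag))
    return models


def predict_bagged(models, x):
    """Majority vote over the bagged models."""
    votes = sum(predict_centroid(m, x) for m in models)
    return 1 if 2 * votes > len(models) else 0
```

With, say, three positive samples near 5.0 and thirty negatives spread around 0.0, every bag sees a balanced training set even though the full data set is skewed ten to one.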
Sketch recognition is the task of converting hand-drawn digital ink into symbolic computer representations. Since the early days of sketch recognition, the bulk of the work in the field has focused on building accurate recognition algorithms for specific domains and well-defined data sets. Recognition methods explored so far have been developed and evaluated using standard machine learning pipelines and have consequently been built over many simplifying assumptions. For example, existing frameworks assume the presence of a fixed set of symbol classes, and the availability of plenty of annotated examples. However, in practice, these assumptions do not hold. In reality, the designer of a sketch recognition system starts with no labeled data at all, and faces the burden of data annotation. In this work, we propose to alleviate the burden of annotation by building systems that can learn from very few labeled examples and large amounts of unlabeled data. Our systems perform self-learning by automatically extending a very small set of labeled examples with new examples extracted from unlabeled sketches. The end result is a sufficiently large set of labeled training data, which can subsequently be used to train classifiers. We present four self-learning methods with varying levels of implementation difficulty and runtime complexity. One of these methods leverages contextual co-occurrence patterns to build a verifiably more diverse set of training instances. Rigorous experiments with large sets of data demonstrate that this novel approach based on exploiting contextual information leads to significant leaps in recognition performance. As a side contribution, we also demonstrate the utility of bagging for sketch recognition in imbalanced data sets with few positive examples and many outlier
Authors: K. T. Yeşilbek, T. M. Sezgin.
Read the full paper.
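The self-learning idea described in the sketch recognition abstract above (growing a small labeled set with examples taken from unlabeled data) can be illustrated with a generic self-training loop. This is a minimal sketch under stated assumptions: the 1-D features, the nearest-centroid classifier, the margin-based confidence rule, and the names `fit_centroids`, `classify` and `self_train` are all hypothetical choices for this example. It does not reproduce any of the four methods in the paper, and it ignores the contextual co-occurrence cue entirely.

```python
from statistics import mean


def fit_centroids(labeled):
    """Return one centroid per class from (feature, label) pairs."""
    classes = {y for _, y in labeled}
    return {c: mean(x for x, y in labeled if y == c) for c in classes}


def classify(centroids, x):
    """Return (label, margin); margin is the distance gap between the two nearest classes."""
    ranked = sorted(centroids, key=lambda c: abs(x - centroids[c]))
    best, runner_up = ranked[0], ranked[1]
    margin = abs(x - centroids[runner_up]) - abs(x - centroids[best])
    return best, margin


def self_train(labeled, unlabeled, min_margin=1.0, max_rounds=10):
    """Repeatedly move confidently classified samples from the pool into the labeled set."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        centroids = fit_centroids(labeled)
        scored = [(x, *classify(centroids, x)) for x in pool]
        accepted = [(x, label) for x, label, margin in scored if margin >= min_margin]
        if not accepted:
            break  # nothing confident enough; stop growing the labeled set
        labeled.extend(accepted)
        taken = {x for x, _ in accepted}
        pool = [x for x in pool if x not in taken]
    return labeled, pool
```

Starting from one labeled example per class, the loop labels the unambiguous pool items and leaves borderline ones unlabeled, so the threshold `min_margin` trades training set size against label noise.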
From a user interaction perspective, speech and sketching make a good couple for describing motion. Speech allows easy specification of content, events and relationships, while sketching brings in spatial expressiveness. Yet, we have insufficient knowledge of how sketching and speech can be used for motion-based video retrieval, because there are no existing retrieval systems that support such interaction. In this paper, we describe a Wizard-of-Oz protocol and a set of tools that we have developed to engage users in a sketch- and speech-based video retrieval task. We report how the tools and the protocol fit together using “retrieval of soccer videos” as a use case scenario. Our software is highly customizable, and our protocol is easy to follow. We believe that together they will serve as a convenient and powerful duo for studying a wide range of multi-modal use cases.
Keywords: sketch-based interfaces, human-centered design, motion, multimedia retrieval
Authors: O. C. Altıok, T. M. Sezgin.
Read the full paper.
We explore the effect of laughter perception and response in terms of engagement in human-robot interaction. We designed two distinct experiments in which the robot has two modes: laughter responsive and laughter non-responsive. In responsive mode, the robot detects laughter using a multimodal real-time laughter detection module and invokes laughter as a backchannel to users accordingly. In non-responsive mode, the robot does not use laughter detection and thus provides no feedback. In the experimental design, we use a straightforward question-answer based interaction scenario using a back-projected robot head. We evaluate the interactions with objective and subjective measurements of engagement and user experience.
Index Terms: laughter detection, human-computer interaction, laughter responsive, engagement.
Authors: B. B. Türker, Z. Buçinca, E. Erzin, Y. Yemez, and M. T. Sezgin.