This study focuses on improving the optical character recognition (OCR) data for panels in the COMICS dataset, the largest dataset containing text and images from comic books. To do this, we developed a pipeline for OCR processing and labeling of comic books and created the first text detection and recognition datasets for western comics, called “COMICS Text+: Detection” and “COMICS Text+: Recognition”. We evaluated the performance of state-of-the-art text detection and recognition models on these datasets and found significant improvement in word accuracy and normalized edit distance compared to the text in COMICS. We also created a new dataset called “COMICS Text+”, which contains the extracted text from the textboxes in the COMICS dataset. Using the improved text data of COMICS Text+ in the comics processing model from resulted in state-of-the-art performance on cloze-style tasks without changing the model architecture. The COMICS Text+ dataset can be a valuable resource for researchers working on tasks including text detection, recognition, and high-level processing of comics, such as narrative understanding, character relations, and story generation. All the data and inference instructions can be accessed in https://github.com/gsoykan/comics_text_plus.
We present the engagement in human–robot interaction (eHRI) database containing natural interactions between two human participants and a robot under a story-shaping game scenario. The audio-visual recordings provided with the database are fully annotated at a 5-intensity scale for head nods and smiles, as well as with speech transcription and continuous engagement values. In addition, we present baseline results for the smile and head nod detection along with a real-time multimodal engagement monitoring system. We believe that the eHRI database will serve as a novel asset for research in affective human–robot interaction by providing raw data, annotations, and baseline results.
We investigate how collaborative guidance can be realized in multimodal virtual environments for dynamic tasks involving motor control. Haptic guidance in our context can be defined as any form of force/tactile feedback that the computer generates to help a user execute a task in a faster, more accurate, and subjectively more pleasing fashion. In particular, we are interested in determining guidance mechanisms that best facilitate task performance and arouse a natural sense of collaboration. We suggest that a haptic guidance system can be further improved if it is supplemented with a role exchange mechanism, which allows the computer to adjust the forces it applies to the user in response to his/her actions. Recent work on collaboration and role exchange presented new perspectives on defining roles and interaction. However existing approaches mainly focus on relatively basic environments where the state of the system can be defined with a few parameters. We designed and implemented a complex and highly dynamic multimodal game for testing our interaction model. Since the state space of our application is complex, role exchange needs to be implemented carefully. We defined a novel negotiation process, which facilitates dynamic communication between the user and the computer, and realizes the exchange of roles using a three-state finite state machine. Our preliminary results indicate that even though the negotiation and role exchange mechanism we adopted does not improve performance in every evaluation criteria, it introduces a more personal and humanlike interaction model.
Since the strict separation of working spaces of humans and robots has experienced a softening due to recent robotics research achievements, close interaction of humans and robots comes rapidly into reach. In this context, physical human– robot interaction raises a number of questions regarding a desired intuitive robot behavior. The continuous bilateral information and energy exchange requires an appropriate continuous robot feedback. Investigating a cooperative anipulation task, the desired behavior is a combination of an urge to fulfill the task, a smooth instant reactive behavior to human force inputs and an assignment of the task effort to the cooperating agents. In this paper, a formal analysis of human–robot cooperative load transport is presented. Three different possibilities for the assignment of task effort are proposed. Two proposed dynamic role exchange mechanisms adjust the robot’s urge to complete the task based on the human feedback. For comparison, a static role allocation strategy not relying on the human agreement feedback is investigated as well. All three role allocation mechanisms are evaluated in a user study that involves large-scale kinesthetic interaction and full-body human motion. Results show tradeoffs between subjective and objective performance measures stating a clear objective advantage of the proposed dynamic role allocation scheme.
We explore the effect of laughter perception and response in terms of engagement in human-robot interaction. We designed two distinct experiments in which the robot has two modes: laughter responsive and laughter non-responsive. In responsive mode, the robot detects laughter using a multimodal real-time laughter detection module and invokes laughter as a backchannel to users accordingly. In non-responsive mode, robot has no utilization of detection, thus provides no feedback. In the experimental design, we use a straightforward question-answer based interaction scenario using a back-projected robot head. We evaluate the interactions with objective and subjective measurements of engagement and user experience.
We present a novel method for training a social robot to generate backchannels during human-robot interaction. We address the problem within an off-policy reinforcement learning framework, and show how a robot may learn to produce non-verbal backchannels like laughs, when trained to maximize the engagement and attention of the user. A major contribution of this work is the formulation of the problem as a Markov decision process (MDP) with states defined by the speech activity of the user and rewards generated by quantified engagement levels. The problem that we address falls into the class of applications where unlimited interaction with the environment is not possible (our environment being a human) because it may be time-consuming, costly, impracticable or even dangerous in case a bad policy is executed. Therefore, we introduce deep Q-network (DQN) in a batch reinforcement learning framework, where an optimal policy is learned from a batch data collected using a more controlled policy. We suggest the use of human-to-human dyadic interaction datasets as a batch of trajectories to train an agent for engaging interactions. Our experiments demonstrate the potential of our method to train a robot for engaging behaviors in an offline manner.
Hand-drawn objects usually consist of multiple semantically meaningful parts. For example, a stick figure consists of a head, a torso, and pairs of legs and arms. Efficient and accurate identification of these subparts promises to significantly improve algorithms for stylization, deformation, morphing and animation of 2D drawings. In this paper, we propose a neural network model that segments symbols into stroke-level components. Our segmentation framework has two main elements: a fixed feature extractor and a Multilayer Perceptron (MLP) network that identifies a component based on the feature. As the feature extractor we utilize an encoder of a stroke-rnn, which is our newly proposed generative ariational Auto-Encoder (VAE) model that reconstructs symbols on a stroke by stroke basis. Experiments show that a single encoder could be reused for segmenting multiple categories of sketched symbols with negligible effects on segmentation accuracies. Our segmentation scores surpass existing methodologies on an available small state of the art dataset. Moreover, extensive evaluations on our newly annotated big dataset demonstrate that our framework obtains significantly better accuracies as compared to baseline models. We release the dataset to the community.
Generating 3D models from 2D images or sketches is a widely studied important problem in computer graphics.We describe the first method to generate a 3D human model from a single sketched stick figure. In contrast to the existing human modeling techniques, our method requires neither a statistical body shape model nor a rigged 3D character model. We exploit Variational Autoencoders to develop a novel framework capable of transitioning from a simple 2D stick figure sketch, to a corresponding 3D human model. Our network learns the mapping between the input sketch and the output 3D model. Furthermore, our model learns the embedding space around these models. We demonstrate that our network can generate not only 3D models, but also 3D animations through interpolation and extrapolation in the learned embedding space. Extensive experiments show that our model learns to generate reasonable 3D models and animations
Key Words: Sketch-based shape modeling, deep learning, 2D sketches, 3D shapes, static and dynamic 3D human models.
Authors: Alican Akman, Yusuf Sahillioğlu, T. Metin Sezgin.
‘Serious games’ are becoming extremely relevant to individuals who have specific needs, such as children with an Autism Spectrum Condition (ASC). Often, individuals with an ASC have difficulties in interpreting verbal and non-verbal communication cues during social interactions. The ASC-Inclusion EU-FP7 funded project aims to provide children who have an ASC with a platform to learn emotion expression and recognition, through play in the virtual world. In particular, the ASC-Inclusion platform focuses on the expression of emotion via facial, vocal, and bodily gestures. The platform combines multiple analysis tools, using on-board microphone and web-cam capabilities. The platform utilises these capabilities via training games, text-based communication, animations, video and audio clips. This paper introduces current findings and evaluations of the ASC-Inclusion platform and provides detailed description for the different modalities.
Index Terms—Autism Spectrum Condition, inclusion, virtualcomputerised environment, emotion recognition, AI in games.
Authors: Erik Marchi, Bjorn Schuller, Alice Baird, Simon Baron-Cohen, Amandine Lassalle, Helen O’Rielly, Delia Pigat, Peter Robinson, Ian Davies, Tadas Baltrusaitis, Ofer Golan, Shimrit Fridenson-Hayo, Shahar Tal, Shai Newman, Noga Meir-Goren, Antonio Camurri, Stefano Piana, Sven Bolte, Metin Sezgin, Nese Alyuz, Agnieszka Rynkiewicz, Aurelie Baranger
We present HapTable; a multi–modal interactive tabletop that allows users to interact with digital images and objects through natural touch gestures, and receive visual and haptic feedback accordingly. In our system, hand pose is registered by an infrared camera and hand gestures are classified using a Support Vector Machine (SVM) classifier. To display a rich set of haptic effects for both static and dynamic gestures, we integrated electromechanical and electrostatic actuation techniques effectively on tabletop surface of HapTable, which is a surface capacitive touch screen. We attached four piezo patches to the edges of tabletop to display vibrotactile feedback for static gestures. For this purpose, the vibration response of the tabletop, in the form of frequency response functions (FRFs), was obtained by a laser Doppler vibrometer for 84 grid points on its surface. Using these FRFs, it is possible to display localized vibrotactile feedback on the surface for static gestures. For dynamic gestures, we utilize the electrostatic actuation technique to modulate the frictional forces between finger skin and tabletop surface by applying voltage to its conductive layer. To our knowledge, this hybrid haptic technology is one of a kind and has not been implemented or tested on a tabletop. It opens up new avenues for gesture–based haptic interaction not only on tabletop surfaces but also on touch surfaces used in mobile devices with potential applications in data visualization, user interfaces, games, entertainment, and education. Here, we present two examples of such applications, one for static and one for dynamic gestures, along with detailed user studies. In the first one, user detects the direction of a virtual flow, such as that of wind or water, by putting their hand on the tabletop surface and feeling a vibrotactile stimulus traveling underneath it. In the second example, user rotates a virtual knob on the tabletop surface to select an item from a menu while feeling the knob’s detents and resistance to rotation in the form of frictional haptic feedback.