Twenty-seven years after the birth of Affective Computing, which later evolved into Emotion AI, the field is now delving into Complex Expression Generation, paving the way for machines that can express, and perhaps one day experience, emotions.

To date, there is no universally agreed-upon definition of what constitutes “emotions.” American psychologist Robert Plutchik (1927–2006) noted that over 90 different definitions were coined throughout the 20th century, which led to what he described as «one of the most confusing (and still unresolved) chapters in the history of psychology, with considerable disagreement among theorists over how to conceptualise them» [source: American Scientist].

Drawing on the American Psychological Association’s dictionary, emotions can be described as consisting of various components: physiological, behavioural, and experiential. Broadly speaking, they refer to a «reaction pattern involving these elements, through which an individual attempts to deal with a personally significant issue or event».

Paul Ekman, a prominent American psychologist and author of the neurocultural theory of emotions, identified six “universal basic emotions” – sadness, happiness, fear, anger, disgust, and surprise – found across all ethnicities and cultures worldwide.

For some time now, artificial intelligence has been exploring the realm of human emotions, enabling systems and machines to recognise and categorise them and to adapt their behaviour accordingly. This is achieved by analysing vast amounts of data, such as facial expressions, voice tones, gestures, and even walking patterns, heart rate, and blood pressure.

Emotion AI, or emotional artificial intelligence, is the branch of AI dedicated to this line of research. Rosalind Wright Picard, an American researcher and professor of Media Arts and Sciences at the Massachusetts Institute of Technology (MIT), is widely regarded as its pioneer. She founded and led the Affective Computing Research Group at MIT. In her book Affective Computing, first published in 1997, Picard introduced the then-unfamiliar concept of “affective computing,” a field merging computer science, psychology, and cognitive science, which would later evolve into what we now call Emotion AI. «If we want computers to be truly intelligent and to interact naturally with us, we need to give them the ability to recognise, understand and even feel and express emotions», she wrote twenty-seven years ago.



Emotion recognition based on multiple data sources

Since 1997, Affective Computing has evolved from simple facial movement detection to complex multimodal emotion recognition systems.

These advances enable machines equipped with visual and auditory perception to accurately determine a person’s emotional state by integrating various data sources, ranging from spoken language and text-based sentiment analysis to micro and macro facial expressions, voice tone variations, posture, and gestures. Currently, the greatest challenge for Emotion AI is the precise and accurate identification of human emotions from vast and complex datasets.

Over the past decade, a variety of AI models have been used for this purpose, achieving «satisfactory results in laboratory tests but with outcomes yet to be fully demonstrated in real-world applications», where «the perception and analysis of human emotional states hold great potential for optimising and refining human-machine interaction» [source: “Artificial Intelligence in Emotion Quantification: A Prospective Overview” – CAAI Artificial Intelligence Research, August 2024].

Facial expressions

Facial expression detection through cameras – ranging from the most subtle (micro-expressions) to the more obvious – forms the foundational layer of the process used to recognise a person’s emotional state, known as Facial Expression Recognition (FER). The face, therefore, serves as the starting point.

Emotion AI systems utilise techniques that capture facial images via video systems and then analyse, recognise, and categorise these expressions. The latest methods range «from capturing static facial expressions to continuously monitoring dynamic expression changes».

In particular, deep learning techniques, and specifically convolutional neural networks (CNNs), are most commonly used to «interpret complex emotional states». These networks can distinguish between “neutral” emotions and those reflecting fear, pleasure, or pain, and even detect subtle differences between emotions exhibited on male versus female faces [source: “Visual Analysis of Emotions Using AI Image-Processing Software: Possible Male/Female Differences between the Emotion Pairs ‘Neutral’–‘Fear’ and ‘Pleasure’–‘Pain’” – Association for Computing Machinery, June 2021].
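
To make the idea concrete, here is a minimal sketch of the kind of CNN classifier used in FER, written in Python with PyTorch; the 48×48 greyscale input, the layer sizes, and the seven output classes (the six basic emotions plus “neutral”) are illustrative assumptions rather than the architecture of any study cited here.

```python
# Minimal sketch of a CNN for Facial Expression Recognition (FER).
# Assumes 48x48 greyscale face crops and 7 classes (6 basic emotions + neutral);
# these choices are illustrative, not taken from the studies cited above.
import torch
import torch.nn as nn

class FERConvNet(nn.Module):
    def __init__(self, num_classes: int = 7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 48x48 -> 24x24
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 24x24 -> 12x12
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 12x12 -> 6x6
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 6 * 6, 256), nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(256, num_classes),          # logits over emotion classes
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

# Example: a batch of 8 face crops -> per-class emotion logits
logits = FERConvNet()(torch.randn(8, 1, 48, 48))
print(logits.shape)  # torch.Size([8, 7])
```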

Facial Micro-expression Recognition focuses on brief and minimal facial expressions, often found in individuals «attempting to conceal their true emotions». Current research in this area is dedicated to developing tools and methods capable of capturing these fleeting and subtle facial movements. This is achieved using, for example, high-speed cameras in conjunction with deep learning-based image processing techniques [source: “Facial Micro-expression Recognition Based on the Fusion of Deep Learning and Enhanced Optical Flow” – Multimedia Tools and Applications, 2021].
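
For a sense of what fusing optical flow with deep learning involves, the sketch below uses OpenCV to compute dense optical flow between consecutive frames of a hypothetical high-frame-rate face clip; in a full pipeline, the resulting motion fields would be passed to a network such as the one sketched above.

```python
# Sketch: dense optical flow between consecutive frames, the typical motion
# signal fed to deep models in micro-expression recognition. The video path
# is a placeholder; real pipelines use high-speed, face-cropped footage.
import cv2
import numpy as np

cap = cv2.VideoCapture("face_clip.mp4")           # hypothetical input clip
ok, prev = cap.read()
if not ok:
    raise RuntimeError("could not read the input clip")
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

flows = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Farneback dense optical flow: per-pixel (dx, dy) motion vectors
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    flows.append(flow)                            # shape: (H, W, 2)
    prev_gray = gray
cap.release()

# Average motion magnitude per frame: a crude indicator of fleeting muscle movements
magnitudes = [np.linalg.norm(f, axis=2).mean() for f in flows]
```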

Another key area of Emotion AI research involves Complex Expression Generation, which aims to develop robots capable of producing facial expressions themselves. This field is rapidly evolving, with approaches combining deep learning and Large Language Models (LLMs). A notable example is the robotic head named Eva, developed in 2021 by the Creative Machines Lab at Columbia University, which can read the facial expressions of individuals and reflect their emotions back.

Alongside LLM-based approaches, generative frameworks such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs) currently represent the cutting edge of research in both recognising and generating dynamic micro-expressions [source: “Shikra: Unleashing Multimodal LLM’s Referential Dialogue Magic” – arXiv, 2023].
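
The adversarial pairing at the heart of a GAN can be illustrated with a deliberately tiny example: a generator that maps random latent codes to flattened “expression frames” and a discriminator that scores them as real or generated. The sizes below are toy assumptions; real dynamic micro-expression generators operate on image or video tensors with far larger networks.

```python
# Toy GAN sketch: a generator maps random latent codes to flattened 48x48
# "expression frames" and a discriminator scores them as real or generated.
# All sizes are illustrative assumptions.
import torch
import torch.nn as nn

latent_dim, frame_dim = 64, 48 * 48

generator = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, frame_dim), nn.Tanh(),        # pixel values in [-1, 1]
)
discriminator = nn.Sequential(
    nn.Linear(frame_dim, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1),                           # real/fake logit
)

z = torch.randn(16, latent_dim)                  # random latent codes
fake_frames = generator(z)                       # generated expressions
score = discriminator(fake_frames)               # discriminator's judgement
print(fake_frames.shape, score.shape)            # (16, 2304) (16, 1)
```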

Current research areas in the field of emotion recognition techniques (credit: “Artificial Intelligence in Emotion Quantification: A Prospective Overview” – CAAI Artificial Intelligence Research – https://www.sciopen.com/article/10.26599/AIR.2024.9150040).

Emotion recognition through voice

Following facial expressions, voice is the second key “pathway” through which emotions are conveyed. Emotion AI techniques for analysing the acoustic characteristics of speech, known as Speech Emotion Recognition (SER), provide another vital tool for identifying emotional states.

Recent research heavily relies on deep learning techniques paired with Natural Language Processing (NLP) to examine emotions expressed through the voice. 
A 2023 study introduced an innovative approach to human-machine vocal interactions, using an Audio Emotion Recognition technique based on «the combination of advanced natural language processing and vocal sentiment analysis with fuzzy logic.» The authors, from the Engineering College in Thiruvallur, India, explain that fuzzy logic is particularly useful for extracting more nuanced or “fuzzy” emotional tones from speech, offering insights into subtle variations in vocal sentiment [source: “A Fuzzy Logic and NLP Approach to Emotion Driven Response Generation for Voice Interaction” – IEEE, 2023].
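
A small numerical example helps show why fuzzy logic suits blurred emotional tones: instead of a single hard label, membership functions grade how strongly a vocal feature belongs to several overlapping categories at once. The single “arousal” score and the triangular membership functions below are simplifying assumptions, not the method of the cited study.

```python
# Sketch: fuzzy membership over a single vocal "arousal" score in [0, 1].
# Triangular membership functions are a common, simple choice; the categories
# and thresholds here are illustrative assumptions.

def triangular(x: float, a: float, b: float, c: float) -> float:
    """Degree of membership for a triangle peaking at b, zero outside [a, c]."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def fuzzy_emotion_profile(arousal: float) -> dict:
    # Overlapping categories: one voice sample can be partly "calm" and partly
    # "agitated" at the same time, which a single hard label cannot express.
    return {
        "calm":     triangular(arousal, -0.01, 0.0, 0.5),
        "engaged":  triangular(arousal, 0.2, 0.5, 0.8),
        "agitated": triangular(arousal, 0.5, 1.0, 1.01),
    }

print(fuzzy_emotion_profile(0.6))
# e.g. {'calm': 0.0, 'engaged': 0.67, 'agitated': 0.2}
```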

Also in 2023, a study titled “Speech Emotion Recognition Based on Multiple Acoustic Features and Deep Convolutional Neural Network” (Electronics, 2023) introduced a method that applies deep convolutional neural networks (CNNs) to analyse multiple types of acoustic features simultaneously. The research team highlighted that this approach «achieved an accuracy exceeding 93% in vocal emotion recognition».
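
As an illustration of what “multiple acoustic features” typically means in practice, the snippet below uses the librosa library to extract MFCCs, a mel spectrogram, and chroma features from a single clip and stacks them for a downstream CNN; the file name, sampling rate, and choice of features are assumptions, not the exact setup of the cited paper.

```python
# Sketch: extract several complementary acoustic features from one clip and
# stack them as rows of a 2D "image" a CNN can consume. File name, sample rate
# and the particular features are illustrative assumptions.
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)   # hypothetical audio file

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)                 # (40, T)
mel = librosa.power_to_db(
    librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40))         # (40, T)
chroma = librosa.feature.chroma_stft(y=y, sr=sr)                   # (12, T)

# Concatenate along the feature axis: a (92, T) array for a downstream CNN.
features = np.concatenate([mfcc, mel, chroma], axis=0)
print(features.shape)
```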

However, a persistent challenge in SER systems is detecting not only the emotions conveyed by the voice but also the semantic context of the audio. This complexity arises from «the diversity of languages, accents, gender, age, and speech intensity, making the development of reliable Speech Emotion Recognition systems an ongoing challenge», according to the authors of “Speech Emotion Recognition Using Deep Learning” (Artificial Intelligence XL, 2023). In their work, they propose a novel approach, developing a deep learning system trained on four datasets: RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song), TESS (Toronto Emotional Speech Set), CREMA-D (CRowd-sourced Emotional Multimodal Actors Dataset), and SAVEE (Surrey Audio-Visual Expressed Emotion). These extensive collections include over 1,400 audio files each, featuring professional actors from various ethnicities expressing the six basic emotions (plus “neutrality”) in English with different accents.

The distinctive contribution of this research is a four-layer convolutional neural network, which achieved a precision of 76% when tested on these datasets. Although promising, this result underscores the ongoing challenge of bringing SER systems to higher levels of reliability and accuracy across diverse contexts.
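
A four-convolutional-layer classifier of the general kind described could be sketched as follows; the 1D-convolution design, layer widths, and seven output classes are illustrative assumptions and do not reproduce the cited architecture.

```python
# Sketch of a four-convolutional-layer network over acoustic feature frames
# (e.g. the stacked features from the previous snippet). Layer widths and the
# 1D-convolution design are illustrative, not the cited paper's architecture.
import torch
import torch.nn as nn

class FourLayerSERNet(nn.Module):
    def __init__(self, in_features: int = 92, num_classes: int = 7):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_features, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(128, 256, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool1d(1),   # average over time -> fixed-size vector
            nn.Flatten(),
            nn.Linear(256, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, feature_rows, time_frames)
        return self.head(self.conv(x))

logits = FourLayerSERNet()(torch.randn(4, 92, 300))
print(logits.shape)  # torch.Size([4, 7])
```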

Gestural emotion recognition: focus on skeletal movements

In recent years, significant advancements have been made in recognising emotions expressed through gestures—particularly through movements of the upper body. One area of progress is Skeleton-based Emotion Recognition, which has benefited from the evolution of depth-sensing cameras that provide increasingly wide fields of view and detailed motion capture.

An example of such a system is SAGN (Semantic Adaptive Graph Network), developed in 2021 through the joint efforts of Beijing University of Posts and Telecommunications and East China Normal University in Shanghai. SAGN uses video devices equipped with deep learning algorithms that operate on Graph Convolutional Networks (a variation of convolutional neural networks designed to handle data structured as graphs). These networks analyse full-body movements in dynamic visual data and interpret emotional cues based on the semantic context of the video, which includes factors such as location, gender, age, and facial expressions of the subject [source: “SAGN: Semantic Adaptive Graph Network for Skeleton-Based Human Action Recognition” – Digital Library, 2021].
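
The core operation of a graph convolution over a skeleton can be shown in a few lines: features attached to each joint are mixed along a fixed joint-adjacency matrix and then passed through a learned projection. The 17-joint skeleton, the normalisation, and the layer sizes below are simplifying assumptions; this is not SAGN itself.

```python
# Sketch of one graph-convolution step over a skeleton: joint features are
# propagated along a fixed joint-adjacency matrix, then linearly projected.
# The 17-joint skeleton and layer sizes are illustrative; this is not SAGN.
import torch
import torch.nn as nn

class SkeletonGraphConv(nn.Module):
    def __init__(self, adjacency: torch.Tensor, in_dim: int, out_dim: int):
        super().__init__()
        a = adjacency + torch.eye(adjacency.size(0))      # add self-loops
        deg = a.sum(dim=1, keepdim=True)
        self.register_buffer("a_norm", a / deg)           # row-normalised adjacency
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, joints, in_dim) -> mix neighbouring joints, then project
        return torch.relu(self.proj(self.a_norm @ x))

num_joints = 17
adj = torch.zeros(num_joints, num_joints)
# fill adj with 1s for pairs of joints connected by a bone in the chosen skeleton
layer = SkeletonGraphConv(adj, in_dim=3, out_dim=64)      # 3D joint coordinates
out = layer(torch.randn(8, num_joints, 3))
print(out.shape)  # torch.Size([8, 17, 64])
```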

In recent years, gestural emotion recognition techniques have shifted toward a multimodal approach. This involves a two-phase process: in the first phase, facial tracking/recognition occurs, followed by skeletal tracking. The AI system then integrates both sets of video data within the same «semantic space» to provide a more comprehensive emotional interpretation [source: “Two-Stage Multi-Modal Modeling for Video Interaction Analysis in Deep Video Understanding Challenge” – Digital Library, 2022].
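
In code, the second phase of such a two-stage pipeline often amounts to projecting the face-based and skeleton-based embeddings into a shared space and classifying the fused vector; the embedding sizes and the simple concatenation-based fusion below are illustrative assumptions.

```python
# Sketch of a simple multimodal fusion head: face and skeleton embeddings are
# projected into a shared "semantic space", concatenated and classified.
# Embedding sizes and the fusion scheme are illustrative assumptions.
import torch
import torch.nn as nn

class MultimodalEmotionHead(nn.Module):
    def __init__(self, face_dim=256, skel_dim=64, shared_dim=128, num_classes=7):
        super().__init__()
        self.face_proj = nn.Linear(face_dim, shared_dim)
        self.skel_proj = nn.Linear(skel_dim, shared_dim)
        self.classifier = nn.Linear(2 * shared_dim, num_classes)

    def forward(self, face_emb: torch.Tensor, skel_emb: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([torch.relu(self.face_proj(face_emb)),
                           torch.relu(self.skel_proj(skel_emb))], dim=-1)
        return self.classifier(fused)

head = MultimodalEmotionHead()
logits = head(torch.randn(8, 256), torch.randn(8, 64))   # per-modality embeddings
print(logits.shape)  # torch.Size([8, 7])
```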

The future of research in this segment of emotion AI aims to explore how to more effectively utilise the complex data on body movements. The goal is to enhance the precision and accuracy of the entire emotion recognition process, further refining the way machines interpret human emotional states based on body language [source: “An Ongoing Review of Speech Emotion Recognition” – Neurocomputing, 2023].

Applications of Emotion AI

After the success of her book, Rosalind Wright Picard, along with fellow researchers, launched the first venture in Affective Computing in Boston in 2009, focusing on advertising research and road safety. One of the company’s founders explained: «Affective computing techniques capture the visceral and subconscious reactions of consumers, which we discovered are strongly correlated with their real-world behaviour, from simply sharing an ad to actually purchasing the product». In addition, when applied in a car, these techniques can recognise negative emotions in the driver – by monitoring facial expressions, vocal tone, and blood pressure – such as those caused by an argument with a passenger or over the phone, potentially intervening by adjusting the vehicle’s speed [source: “Emotion AI, explained” – The Sloan School of Management (MIT), 2019].

These were among the first applications of emotional artificial intelligence systems, which in the future could evolve into mental health monitoring platforms. Such systems might analyse users’ voices, even during phone conversations, to detect emotional cues linked to anxiety or depression.

The future of Emotion AI looks particularly towards applications in psychology and psychiatry, offering support to specialists in these fields. The aim is an «interdisciplinary approach to psycho-emotional disorders that combines neuroscience, deep learning, and big data analysis to optimise diagnostic tools and treatment strategies» [source: “Artificial Intelligence in Emotion Quantification: A Prospective Overview” – CAAI Artificial Intelligence Research, August 2024].

Over the next decade, advancements in this field could transform Emotion AI from systems that merely recognise emotions to sophisticated platforms capable of deeply understanding and interacting with human emotional states.

Glimpses of Futures

In the field of emotion recognition, research on Emotion AI techniques has surged in the last three years, with numerous ongoing studies.

But what can we expect in the coming years? What future scenarios might emerge? Using the STEPS matrix, we can explore the social, technological, economic, political, and sustainability impacts of these evolving technologies.

S – SOCIAL: imagine a machine that can read our faces, paying attention not just to what we say but how we say it, perceiving our movements, and understanding our deeper emotions. This type of system could radically speed up the diagnosis process for mental health issues by making assessments more accessible, regardless of location. Such advances would be especially valuable for diagnosing conditions related to a person’s emotional well-being, offering quicker, more widespread access to care. In addition, robots capable of recognising and responding to human emotions based on multiple data points will likely play an increasingly significant role in rehabilitative settings. These robots could interact with patients who struggle to express their feelings due to various neurological conditions, providing vital feedback through smiles or gestures. Such emotionally aware responses could become a valuable therapeutic tool, offering comfort and improving treatment outcomes for those unable to articulate their emotions.

T – TECHNOLOGICAL: one of the most promising areas of technological development in Emotion AI is Complex Expression Generation, which focuses on creating robots capable of expressing the six basic human emotions through facial expressions in response to emotional cues from people. This will be a long and complex journey, heavily reliant on the creation of increasingly deep neural networks. A key technological advancement in this area will come from Variational Autoencoders (VAEs), neural network architectures designed to encode input data (such as human emotions captured through facial movements, gestures, body language, and vocal tones) by reducing them to a set of essential traits. These compressed representations are then decoded to reconstruct the original input. This process of compressing and decompressing vast amounts of data will be crucial in the development of future machines capable of “feeling” emotions.
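
A minimal sketch makes this encode-compress-decode loop concrete: the encoder below maps an input feature vector to a small latent code (the “essential traits”), and the decoder reconstructs the input from that code. The feature and latent dimensions are arbitrary illustrative choices.

```python
# Minimal VAE sketch: encode an input feature vector (e.g. concatenated facial,
# gestural and vocal descriptors) into a small latent code, then decode it back.
# Dimensions are arbitrary illustrative choices.
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    def __init__(self, in_dim: int = 128, latent_dim: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU())
        self.to_mu = nn.Linear(64, latent_dim)       # mean of the latent code
        self.to_logvar = nn.Linear(64, latent_dim)   # log-variance of the code
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                                     nn.Linear(64, in_dim))

    def forward(self, x: torch.Tensor):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterisation
        return self.decoder(z), mu, logvar           # reconstruction + latent stats

vae = TinyVAE()
recon, mu, logvar = vae(torch.randn(4, 128))
print(recon.shape, mu.shape)  # torch.Size([4, 128]) torch.Size([4, 8])
```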

E – ECONOMIC: envisioning a future where human-machine interactions also take place on the emotional level – where both humans and robots not only recognise emotions but also express them – requires careful consideration of the roles and functions in workplaces. In industries such as healthcare, therapy, and rehabilitation, robots could work alongside human professionals to assist in treating and supporting vulnerable individuals. If these Emotion AI technologies evolve further, we may see robots entering sectors far removed from their traditional roles in warehouses, factories, or restaurants, where they currently function as mere workers or waiters. In these new settings, robots could become emotional coaches or companions, significantly transforming the dynamics of the workforce.

P – POLITICAL: the use of Emotion AI techniques to recognise human emotions raises significant ethical concerns. To what extent should emotional AI be allowed to “spy” on our feelings? Even when explicit consent is given for its use, who can guarantee that highly sensitive emotional data – connected to our psychological and emotional well-being – won’t be misused? The EU AI Act addresses these concerns by imposing specific transparency obligations on systems that detect emotions from biometric data, so that the people exposed to them are informed of their use. The law goes further by explicitly prohibiting the use of AI systems to infer emotions in workplaces and educational environments, except for medical or safety reasons.

S – SUSTAINABILITY: as these technologies are developed, they must be designed to be inclusive and suitable for people of all genders, ages, and, above all, ethnicities, not just for the specific subset of the population represented in the algorithms’ training data. The social sustainability and inclusivity of Emotion AI could face challenges if developers, manufacturers, and ethicists do not address the need for intercultural adaptability. For example, «recognising emotions on an African American face can sometimes be difficult for machines trained primarily on Caucasian faces. Likewise, certain gestures or vocal inflections may have vastly different meanings across cultures». Future development of Emotion AI models must account for these differences to ensure global applicability [source: “Emotion AI, explained” – The Sloan School of Management (MIT), 2019].
