Dr. Ricardo Flores H., an academic at the Faculty of Engineering, leads a study published in the IEEE Journal of Biomedical and Health Informatics that combines voice and facial expressions to support the early detection of depression.
By: Carolina Vega Artigues, Journalist – Faculty of Engineering cvegaa@udec.cl
Images: Courtesy of Ricardo Flores H.
Mental health is one of humanity’s most significant challenges. The World Health Organization (WHO) estimates that more than one billion people worldwide live with a mental disorder, with anxiety and depression being the most frequent. These conditions occur at all ages, in all countries, and across all socioeconomic levels. According to the World Mental Health Today and Mental Health Atlas 2024 reports, they are the second leading cause of long-term disability, reduce quality of life, and impose an enormous economic cost on health systems and families.
The situation in Chile is no different. According to the sixth version of the “Ipsos Health Services Monitor 2024” study, mental health is identified as the primary health problem by 69% of the people consulted, and according to the ACHS-UC Mental Health Thermometer, launched in January 2025, 13.7% of respondents have moderate or severe symptoms of depression, an increase of 3.3 percentage points compared to April 2024. Data from the Ministry of Health show that more than one million people are currently receiving care in the public system for mental health problems.
Faced with this outlook, the question arises: can science and technology deliver faster, more effective solutions? From innovative medical devices to the application of artificial intelligence (AI), technology has become a valuable ally that complements specialists’ work and improves the early detection of anxiety and depression symptoms.
WavFace: combining voice and face to detect depression
The research of academic Ricardo Flores H., who holds a PhD in Data Science from Worcester Polytechnic Institute (WPI), Massachusetts, sits at this intersection between mental health and artificial intelligence. His recent publication, WavFace: A Multimodal Transformer-Based Model for Depression Screening, proposes a pioneering approach: using deep learning to analyze patients’ voices and facial expressions in virtual clinical interviews, emulating how psychologists observe and listen to their patients.
“I started this work in the United States, while I was finishing my doctorate in a mental health research group, and I finished it at the University of Concepción,” explains Dr. Flores, a professor in the Department of Computer Engineering and Computer Science. The motivation arose after he took part in the Machine Learning for Healthcare conference in North Carolina, where he received feedback from both artificial intelligence specialists and physicians.
“They told us: the model works, but it needs to look at the patient. That’s where the idea of combining the auditory with the visual was born, just as a psychologist does: when they listen to a patient and observe the tone of voice, changes in expression, and how all of that is combined. I tried to emulate that process,” he recalls.
From that inspiration comes the name WavFace: Wav for the audio signal and Face for the facial expressions.
How does WavFace work?
The model uses pre-trained artificial intelligence models to convert both voice and facial expressions into numerical representations known as embeddings. It then applies transformer layers with sequential and spatial attention mechanisms to identify which fragments of the interview are most relevant.
“The model detects signals at different times of the conversation. It may be a change in the voice or a particular facial expression. Not necessarily both at the same time. The important thing is that it learns to identify what attracts attention, as a clinician would,” explains the researcher.
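As a rough illustration of the idea of weighting interview fragments by relevance, the following minimal sketch applies a simple attention-pooling step to toy voice and face embeddings. It is not the WavFace architecture: the embedding values, the `query` vector, and the function names are hypothetical, and a single hand-set query stands in for the trained transformer layers described in the article.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(segments, query):
    """Summarize a sequence of segment embeddings with attention weights.

    Each segment is scored by its scaled dot product with the query;
    the weighted average of the segments is returned with the weights.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * x for q, x in zip(query, seg)) / scale for seg in segments]
    weights = softmax(scores)
    dim = len(segments[0])
    pooled = [sum(w * seg[i] for w, seg in zip(weights, segments)) for i in range(dim)]
    return pooled, weights


# Made-up 2-dimensional embeddings for three interview segments.
audio_segments = [[0.1, 0.0], [0.9, 0.8], [0.2, 0.1]]
face_segments = [[0.0, 0.1], [0.1, 0.0], [0.7, 0.9]]
query = [1.0, 1.0]  # hypothetical; a real model would learn this

audio_vec, audio_weights = attention_pool(audio_segments, query)
face_vec, face_weights = attention_pool(face_segments, query)

# Concatenate the two summaries into one multimodal vector.
fused = audio_vec + face_vec

print("most relevant audio segment:", audio_weights.index(max(audio_weights)))
print("most relevant face segment:", face_weights.index(max(face_weights)))
```

In this toy data the voice signal stands out in the second segment and the facial signal in the third, which mirrors the point that the two cues need not occur at the same moment.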
Trained on a dataset of about 150 patients diagnosed with depression and anxiety, WavFace was able, in preliminary results, to distinguish between patients with and without depression, reaching a balanced accuracy of 81%.
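For readers unfamiliar with the metric, balanced accuracy in a two-class problem is commonly defined as the average of sensitivity (the share of positive cases correctly identified) and specificity (the share of negative cases correctly identified). The sketch below shows that standard definition on made-up labels; the data are invented for illustration and are not taken from the study, and the article does not detail how the authors computed their figure.

```python
def balanced_accuracy(y_true, y_pred):
    """Average of sensitivity and specificity for binary labels (1 = positive).

    Assumes both classes appear at least once in y_true.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Invented example: 2 positive cases out of 10, and a screener
# that labels everyone as negative.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [0] * 10

plain_accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
score = balanced_accuracy(y_true, y_pred)

print(plain_accuracy)  # 0.8
print(score)           # 0.5
```

The example shows why the metric suits screening tasks with uneven groups: a screener that never flags anyone still looks 80% accurate, while its balanced accuracy is only 50%.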
Although it is still a preliminary study, the potential is clear. “Imagine that a patient sends a video answering some basic questions. The model processes audio and facial expressions to provide an initial indicator. This does not replace the specialist, but it helps to prioritize: deciding who to treat most urgently or who to refer,” says Flores.
In addition, its scope goes beyond depression. “In autism, for example, it is very useful to analyze eye patterns. Today, that analysis is performed by a doctor, who notes whether the child looks up or to the side. Our model can automate it and deliver objective reports,” he adds.
Collaborative work at UdeC
The professor works in the Human-Computer Interaction Laboratory (HCI) of the University of Concepción, together with researchers from engineering, computer science, pharmacy, and psychology. The team is also exploring the use of physiological data from smartwatches and EEG recordings to complement the multimodal analysis, in collaboration with Biomedical Civil Engineering academics and students. “We want to study how voice, expressions, text, heart rate, and brain activity are combined. All this can give a more complete picture, for example, of student well-being and mental health in general,” he says.
According to the WHO, mental health remains one of the most neglected areas: on average, countries allocate less than 2% of their health budgets to it, and resources for community care are insufficient. In this scenario, research such as that of Ricardo Flores takes on special relevance. WavFace does not seek to replace specialists but to provide a complementary tool for prioritizing patients, detecting symptoms early, and optimizing clinical resources.
Last modified: May 20, 2026
