With 5 Minutes of Recorded Speech, Machine Learning Attempts to Distinguish Psychosis-Related Disorders from Bipolar Disorder and Major Depression

In a new paper, researchers co-led by a BBRF grantee demonstrate a computer-based method that analyzes just 5 minutes of recorded speech to help distinguish among people with a variety of psychiatric conditions, with a focus on schizophrenia, psychosis, and bipolar disorder (BD).
Among other goals, this research aims to help diagnose psychiatric disorders early, notably those in which psychosis is present. Outcomes in psychotic disorders tend to be markedly worse when the illness is not treated promptly following a first psychotic episode. The problem is compounded by the fact that people who, for reasons of family history or genetics, are thought to be at high risk for psychosis often do not become ill, but it is not currently possible to determine who will and who will not. Are there clues in their speech that can help guide their care?
For some years, researchers have been aware that patterns in speech and language use are linked to key psychosis symptoms, including disordered thought, changes in vocal expression, and flattened affect. Various features of speech, including pitch, rhythm, and connectivity, may be related to changes in motor control in the brain, as well as to patterns of neural connectivity that might be correlated with the severity of psychotic symptoms.
Recently, efforts have been made to combine knowledge of various speech characteristics with machine-learning algorithms to detect psychotic illness, gauge symptom severity, or predict relapse. Like research reported in the new paper, these efforts are based on the hypothesis that computer- or AI-aided analysis of recorded speech can provide useful clinical information.
Led by Julianna Olah, Ph.D., and Sunny X. Tang, M.D., a 2022 BBRF Young Investigator, the team hoped to address several issues that have so far prevented such speech analysis from being adopted by clinicians. In addition to overall accuracy, those issues include, according to the team, the lack of a standardized method for collecting speech samples (most studies rely on recordings made in controlled laboratory settings); relatively small sample sizes; and narrow clinical scope (e.g., focusing on psychosis but not on how psychosis might be distinguished reliably from other conditions).
In setting out their approach, the researchers explained that “research to date has focused mainly on binary classifications between ‘healthy’ individuals and others with one specific disorder, like BD or schizophrenia. This does not translate into the discriminatory diagnosis or detection of subtle changes that clinicians perform routinely.” The “binary” discrimination of healthy vs. ill, they add, also provides little insight into whether speech-based machine learning models can capture alterations in speech that are specific to particular disorders—BD, schizophrenia, psychosis, or others.
The team’s attempt to address these issues began with the recruitment of a large cross-diagnostic sample of people, each of whom completed two questionnaires: one used to gauge pre-psychosis (“prodromal”) symptoms, and the other, depression symptoms. The 1,140 participants included 84 people diagnosed with schizophrenia-spectrum disorders (SSD), 227 with BD, and 343 with sub-clinical experiences of psychosis (SPE). The latter subgroup included people thought to have some vulnerability to psychosis but who did not meet the diagnostic criteria for a psychotic disorder. In addition to the main focus on the schizophrenia-bipolar spectrum, the team collected speech from 156 people with major depressive disorder (MDD), i.e., a psychiatric diagnosis not involving psychosis. Finally, speech was collected from 330 healthy individuals, i.e., people with no psychiatric diagnosis and low scores on the two questionnaires.
Speech was collected remotely, using an online platform where participants provided voice samples in response to standardized prompts. There were five different speech-based tasks, together taking about 20 minutes to complete. Participants had a week to complete the five tasks.
In all, over 943 hours of speech were recorded. Each participant was asked: to recall a recent dream; describe eight black-and-white pictures; briefly discuss four neutral topics (e.g., “describe your favorite food”); read a brief prewritten neutral text; and read three brief prewritten emotional stories (angry, happy, frightening).
After automated transcription of the recordings, the team used natural language processing (NLP) techniques to “extract features reflecting abnormalities in semantics, syntax, and speech morphology” (the building blocks that form words). The selection of features, the team said, was guided by prior research demonstrating their ability to help identify whether a speaker has psychotic disorder. The analysis also attempted to capture paralinguistic features, i.e., acoustic features of recorded speech that signal changes in emotional state (such as flattened affect) and articulation (which can be related to changes in motor control). In all, 116 paralinguistic parameters were extracted from the audio files.
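To give a concrete sense of what “paralinguistic” feature extraction involves, here is a minimal illustrative sketch, not the team’s actual pipeline (the paper’s 116 parameters are not detailed here). It computes two simple frame-level acoustic features from a waveform: root-mean-square (RMS) energy, which tracks loudness changes relevant to flattened affect, and zero-crossing rate, a crude proxy for spectral content. The frame and hop sizes are assumed values for 16 kHz audio, chosen for illustration only.

```python
import math

def frame_features(samples, frame_len=400, hop=200):
    """Compute per-frame (RMS energy, zero-crossing rate) pairs.

    samples: mono audio as a list of floats scaled to [-1, 1].
    frame_len and hop are illustrative values (25 ms / 12.5 ms at 16 kHz),
    not parameters from the study.
    """
    feats = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        # RMS energy: a simple correlate of vocal intensity
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        # zero-crossing rate: fraction of adjacent sample pairs changing sign
        zcr = sum(
            1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0)
        ) / (frame_len - 1)
        feats.append((rms, zcr))
    return feats

# Toy input: one second of a 100 Hz sine wave at a 16 kHz sampling rate
sr = 16000
signal = [math.sin(2 * math.pi * 100 * n / sr) for n in range(sr)]
features = frame_features(signal)
```

In a real pipeline, many such frame-level features (pitch, formants, spectral statistics, pause timing) would be summarized per recording and fed to a classifier alongside the language-based features.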
The team concluded that “speech, even when collected remotely and online, has a sufficient degree of between-group variability to discriminate between different forms and stages of psychosis spectrum conditions, as well as to discriminate between affective conditions (i.e., MDD) and psychotic conditions (such as schizophrenia or psychosis, or cases of bipolar disorder in which psychosis is present).”
The team found that each of the five tasks assigned to participants had a different predictive power. Increasing the length of the spoken samples did not necessarily lead to better predictions. The most informative tasks were those in which participants had to generate speech themselves (e.g., “describe your favorite food”) rather than recite a prewritten text.
Changes in affect and deviations in motor control related to forming speech could “easily be captured from acoustic information,” the team noted. Disordered thought—a key symptom of psychosis—is more related, they said, to alterations in language.
The machine learning model the team used was able, using 5 minutes of speech, to distinguish between healthy controls and those with either schizophrenia spectrum disorder or bipolar disorder with 86% accuracy. The same figure applied to the model’s ability to discern among the healthy controls, schizophrenia and bipolar patients, and those with sub-clinical psychotic experiences.
This, in turn, suggested to the investigators that “the screening of mental disorders is possible via a fully automated, remote speech assessment pipeline.” The machine-learning model’s accuracy in identifying people on the psychosis spectrum, as well as in distinguishing major depression from bipolar disorder, could be of clinical benefit, as discriminating these conditions often proves difficult in the clinic, especially at the primary-care level.
The team believes the method, if validated with further testing, could serve as a valuable complement to clinical decision making in the future. However, before clinical implementation, the team says it is critical to test the real-world feasibility, acceptance, and accuracy of the method. They are therefore currently working on clinical implementation and monitoring of the approach in various psychiatric and behavioral clinics across the US.
Julianna Olah, Ph.D., first author of the team’s paper, which appears in Translational Psychiatry, is formerly of King’s College London and a cofounder and officer of a King’s College spin-out company called Psyrin, which is attempting to develop digital biomarkers for serious mental illness. Dr. Tang is a scientific advisor for Psyrin, and other study authors also have links to the company.