MIT Model Aims to Break Through AI 'Overfitting'


Better automating machine learning to tackle tough-to-find diseases.


The voicing patterns of patients with vocal cord nodules are shown in spectrograms. A new MIT model attempts to cut through the data noise and eliminate a problem common in machine learning: the phenomenon of “overfitting.” Image has been modified. Courtesy of MIT.

Artificial intelligence has virtually unlimited upside in tackling human health problems, almost all experts agree. But the machine needs guidance. The sheer scope of the data can make manual training almost impossibly work-intensive. At the same time, a machine left to learn by itself will eventually just memorize its sample set instead of picking out the relevant patterns, leading to “overfitting” and inaccurate results.
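As a rough illustration of that failure (not drawn from the paper; the data below are randomly generated), a high-capacity model fit to a small, noisy sample can score perfectly on the examples it has memorized while doing no better than chance on examples it has never seen:

```python
# Toy illustration of overfitting: an unconstrained decision tree memorizes a
# small, noisy training set and fails to generalize to held-out data.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))      # 200 samples, 50 noisy features
y = rng.integers(0, 2, size=200)    # labels unrelated to the features

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = DecisionTreeClassifier()    # unlimited depth -> pure memorization
model.fit(X_train, y_train)

print("train accuracy:", model.score(X_train, y_train))  # ~1.0 (memorized)
print("test accuracy:", model.score(X_test, y_test))     # ~0.5 (chance level)
```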

A model using data manipulation to pick out vocal cord disorders in a limited set of subjects may hold the key to eliminating the “overfitting” bugaboo, according to a new paper by data scientists from the Massachusetts Institute of Technology, Harvard, and the University of Toronto.

Getting the machine to better learn what it’s looking for could have uses across a wide range of settings where subjects are few but the data mound is huge, according to the work, which is scheduled for presentation at the Machine Learning for Healthcare conference later this week.

“If you have few subjects and lots of data, there’s a failure model that’s just memorizing who’s who,” said Jose Javier Gonzalez Ortiz, the lead author, a Ph.D. student at the MIT Computer Science and Artificial Intelligence Laboratory, in a phone interview with Inside Digital Health™ this week. “What we saw is, if you just split this process in two, you have better odds… By learning the features first, without knowing who is who, and then performing the classification, you have a better chance of not failing that way.”

The paper outlines the model employed by Gonzalez Ortiz and the rest of the team. The researchers studied a group of 104 subjects, half of them diagnosed with vocal cord nodules (a growth somewhat like a callus in the throat). Each subject was fitted with an accelerometer, a sensor node affixed to the neck, which tracked entire days’ worth of data every time they spoke.

That data trove was huge — billions of time samples.

So the machine was tasked with picking out which of the patients had vocal cord nodules and, most importantly, the features that identified them as such.

To automate the normally manual work of “feature engineering,” picking the most pertinent decisive criteria, Gonzalez Ortiz and his colleagues used a two-step analysis to better discriminate among the data.

The voicing segments were converted into spectrograms, visual representations of the frequencies that make up speech. These, in turn, produced huge, complex matrices.
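The article does not specify the signal-processing settings the team used, but the general step of turning a voicing signal into a spectrogram matrix looks roughly like this sketch, with the sample rate, window length and overlap chosen purely for illustration:

```python
# Generic sketch: turning a 1-D voicing signal into a spectrogram matrix.
# The sampling rate, window length and overlap are illustrative assumptions,
# not the settings used in the study.
import numpy as np
from scipy.signal import spectrogram

fs = 8000                                   # assumed sampling rate, Hz
t = np.arange(0, 1.0, 1.0 / fs)
signal = np.sin(2 * np.pi * 220 * t)        # stand-in for one voicing segment

freqs, times, Sxx = spectrogram(signal, fs=fs, nperseg=256, noverlap=128)
print(Sxx.shape)   # (frequency bins, time frames): one "huge, complex matrix"
```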

From there, to help the machine learn the data inside and out, it was instructed to perform two operations.

First, to use an autoencoder to compress the spectrograms down to 30 values.

Second, to reverse course, decompressing those 30 values back into an entirely new spectrogram, according to the paper.

After the second operation, the model is instructed to make sure the new spectrogram resembles the initial input. In this step, by being forced to learn what tells these spectrograms apart, the machine better grasps what separates them: differences among spectrograms coming from the same patient, in addition to differences between subjects.
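The paper’s exact architecture is not described in the article; as a hedged sketch of the compress-and-reconstruct idea, a simple fully connected autoencoder can squeeze each flattened spectrogram down to 30 values and be trained, without any diagnosis labels, to rebuild the original:

```python
# Minimal sketch of the compress-and-reconstruct step. Layer sizes, the
# spectrogram dimensions and the training settings are illustrative
# assumptions, not the architecture from the paper.
import torch
from torch import nn

class SpectrogramAutoencoder(nn.Module):
    def __init__(self, n_inputs: int, n_code: int = 30):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_inputs, 256), nn.ReLU(),
            nn.Linear(256, n_code),              # step 1: compress to 30 values
        )
        self.decoder = nn.Sequential(
            nn.Linear(n_code, 256), nn.ReLU(),
            nn.Linear(256, n_inputs),            # step 2: decompress into a new spectrogram
        )

    def forward(self, x):
        code = self.encoder(x)
        return self.decoder(code), code

# Fake batch of flattened spectrograms (8 segments, 129 freq bins x 32 frames).
x = torch.rand(8, 129 * 32)

model = SpectrogramAutoencoder(n_inputs=129 * 32)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

reconstruction, code = model(x)
loss = nn.functional.mse_loss(reconstruction, x)  # make the new spectrogram resemble the input
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(code.shape)  # torch.Size([8, 30]): the learned 30-value representation
```

Because the reconstruction objective never sees the diagnosis labels, the 30-value code has to capture what makes each spectrogram distinctive rather than who produced it.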

The authors concluded that the two-step method largely eliminated “overfitting.”

“By decoupling the feature extraction from downstream learning tasks, our learned representation prevents common overfitting issues that approaches with direct supervision experience,” they wrote. “The features generalize across subjects, while capturing relevant patterns for downstream clinical prediction tasks.”
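The downstream classifier is likewise not detailed in the article. The sketch below shows what the decoupled second stage could look like, using stand-in features and labels and a hypothetical subject-level split so that no subject appears in both training and test data, which is what keeps the model from simply memorizing who’s who:

```python
# Sketch of the downstream step: a simple classifier trained on the learned
# 30-value features, evaluated with a subject-level split. Feature values,
# subject assignments and labels here are synthetic placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
n_segments, n_subjects = 2000, 104
features = rng.normal(size=(n_segments, 30))            # stand-in for autoencoder codes
subject_ids = rng.integers(0, n_subjects, n_segments)   # which subject each segment came from
labels = subject_ids % 2                                 # stand-in per-subject diagnosis

# Split by subject, not by segment, so no subject leaks into the test set.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(features, labels, groups=subject_ids))

clf = LogisticRegression(max_iter=1000)
clf.fit(features[train_idx], labels[train_idx])
print("held-out subject accuracy:", clf.score(features[test_idx], labels[test_idx]))
```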

Gonzalez Ortiz said that in this vocal-cord scenario, with few subjects and lots of data, “overfitting” was extremely likely. But their model could have many more applications, especially when it comes to wearable devices, the researcher added. Monitoring for Parkinson’s disease or sleep disorders, where long periods of observation are punctuated by fleeting data points, could benefit from the two-step process of having machines distinguish the criteria, he said.

“You always have to account for overfitting, because pretty much all systems and algorithms, if given enough time and parameters, they will memorize the training dataset, and will fail to generalize to the test data,” said Gonzalez Ortiz.
