
Dealing with Issues in Data and Model Training

Overview

Teaching: 20 min
Exercises: 40 min
Questions
  • What are common data issues in Machine Learning (ML)?

  • How do data issues impact ML model performance?

  • How to identify data issue effects on ML model performance?

  • How to address these data issues?

  • How to identify issues in ML model training caused by overfitting and underfitting?

  • How to address overfitting and underfitting?

Objectives
  • Understand common data issues and their effects.

  • Learn metrics for evaluating model performance.

  • Identify and diagnose the effects of data issues.

  • Explore methods to address data issues.

  • Recognize underfitting and overfitting in ML.

  • Learn approaches to mitigate overfitting and underfitting.

Introduction

Real-world datasets often contain imperfections such as missing values, imbalanced classes, or inaccurate labels; when such datasets are used to train and validate machine learning models, these issues can degrade the model’s performance. Additionally, training issues such as overfitting, underfitting, or gradient problems further add to the challenges of building reliable models. This episode explores some of these common challenges, focusing on identifying, diagnosing, and addressing data and training issues.

We will continue to use the smartphone app classification task to illustrate the dataset-related problems. In previous episodes, we built and tuned neural network models with Keras, and we were able to achieve an accuracy that appears impressive (over 99%). However, as we shall see in this episode, there is more to it than a single-valued accuracy metric if we aim to build a well-balanced classifier. We will need other metrics that capture a model’s performance in greater detail, and use them to guide improvements in our modeling.

Along the way, we will answer the questions listed at the beginning of this episode.

Common Data Issues and Their Impacts

Real-world datasets often contain imperfections such as missing values, imbalanced classes, inaccurate data, irrelevant features, and outliers. Understanding, detecting, and mitigating these data issues is essential for ensuring that the trained models perform effectively in real-world applications.

Below are five common data-related issues, along with their causes, examples, and impacts:

  1. Missing Values
    Description: Occurs when data points are absent, leaving gaps in the dataset.
    Cause: Hardware or software failures, human oversight, or incomplete data collection processes.
    Examples: In a fitness tracker app, heart rate data might be missing for some time intervals if the wearable device loses contact with the skin or the battery dies.
    Impact: Missing values can disrupt training (e.g., neural networks fail on NaN values), reduce the amount of usable data, or bias predictions if not handled properly.

  2. Imbalanced Classes
    Description: Occurs when some classes in a classification task have significantly more samples than others, skewing the dataset.
    Cause: Natural variations in data distribution, such as differing frequencies of events or behaviors.
    Examples: In a spam email filter, legitimate emails vastly outnumber spam emails, leading to an imbalanced dataset where the “not spam” class dominates.
    Impact: Models tend to favor majority classes, resulting in poor performance on minority classes despite high overall accuracy.

  3. Inaccurate Data
    Description: Occurs when features or labels contain errors, introducing noise into the dataset.
    Cause: Human errors, software bugs, or misinterpretations during data collection or annotation.
    Examples: In a grocery store inventory system, a product might be mistakenly labeled as “organic” instead of “conventional” due to a data entry error, confusing the pricing model.
    Impact: Models learn incorrect patterns, leading to mispredictions and reduced generalization to new data.

  4. Irrelevant Features
    Description: Occurs when dataset features are unrelated to the task, adding unnecessary complexity.
    Cause: Over-inclusive data collection or lack of feature selection during preprocessing.
    Examples: In a movie recommendation system, including the user’s email domain (e.g., gmail.com or yahoo.com) as a feature might be irrelevant to predicting movie preferences.
    Impact: Increases computational cost, introduces noise, and may degrade model performance by diluting focus on relevant patterns.

  5. Outliers
    Description: Occurs when data points have extreme values that deviate significantly from the norm.
    Cause: Rare events, measurement errors, or anomalies in the data collection process.
    Examples: In a dataset of daily commute times, a single trip taking 5 hours due to a road closure is an outlier compared to the usual 30-minute average.
    Impact: Outliers can skew model parameters, leading to poor performance on typical cases and distorted decision boundaries.
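
As a quick illustration of how an outlier like the commute-time example can be flagged, here is a minimal pandas sketch using the common 1.5×IQR rule; the values below are made up for the example.

```python
import pandas as pd

# Hypothetical daily commute times in minutes; the 300-minute trip is the outlier.
commute = pd.Series([28, 31, 29, 35, 27, 30, 300], name="commute_minutes")

# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = commute.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = commute[(commute < q1 - 1.5 * iqr) | (commute > q3 + 1.5 * iqr)]
print(outliers)  # only the 300-minute trip is flagged
```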

Common Model Training Issues and Their Impacts

Training a machine learning (ML) model involves optimizing its parameters to learn patterns from data. Ideally, a well-trained model achieves normal fitting, where the model’s complexity aligns with the data’s complexity, allowing it to learn general patterns effectively and form a decision boundary that accurately separates different classes of samples, performing well on both training and validation sets.

Normal (Good) Fitting Illustration

*Caption*: Normal (Good) Fitting Illustration

Figure Description: A scatter plot showing two classes of samples in a two-dimensional feature space. A well-balanced decision boundary effectively separates two classes, illustrating a model with appropriate complexity for the data.

However, when the model’s complexity does not align with the data’s complexity, training problems such as overfitting and underfitting arise, leading to poor performance or unstable training. These issues can prevent models from generalizing effectively in tasks like sherlock_18apps app classification. Understanding and addressing these training issues is crucial for building reliable ML models.

Below, we outline overfitting and underfitting:

  1. Overfitting
    Description: Occurs when a model is overly complex relative to the data’s complexity, capturing noise and specific details in the training data rather than general patterns, leading to poor performance on new data.
    Cause: High model complexity (e.g., too many layers or parameters) compared to limited or noisy data.
    Examples: A student memorizes every detail of practice questions, including irrelevant typos, but fails to answer new questions on the exam that require understanding core concepts.
    Impact: High training accuracy but low validation accuracy, resulting in poor generalization in real-world applications.

    Overfitting Illustration

    *Caption*: Overfitting Illustration

    Figure Description: A scatter plot showing two classes of samples in a two-dimensional feature space. A highly complex, wiggly decision boundary perfectly separates the classes but closely follows individual training points, illustrating overfitting due to excessive model complexity.

  2. Underfitting
    Description: Occurs when a model is too simple relative to the data’s complexity, failing to capture the underlying patterns, resulting in poor performance on both training and validation sets.
    Cause: Low model complexity (e.g., too few layers or parameters) or insufficient training time, leaving the model unable to capture the complexity or variability of the data.
    Examples: A weather app uses only temperature to predict rain, ignoring complex factors like humidity and pressure, leading to inaccurate forecasts.
    Impact: Low accuracy on both training and validation data, indicating the model has not learned meaningful patterns.

    Underfitting Illustration

    *Caption*: Underfitting Illustration

    Figure Description: A scatter plot showing two classes of samples in a two-dimensional feature space. An overly simplistic decision boundary fails to separate the classes effectively, with many points misclassified, illustrating underfitting due to insufficient model complexity.

Why Address Data Issues?

Data quality drives model performance: Data → Model → Predictions. Issues in the dataset propagate through training, leading to biased or unreliable models. Below, we explore how specific data issues impact models trained on sherlock_18apps.

Effects of Incomplete Data

Missing values can disrupt training or bias predictions. For example, missing CPU_USAGE in sherlock_18apps may prevent the model from learning patterns for affected samples, lowering accuracy. Some algorithms, like neural networks, may fail to process incomplete data, halting training.

Manifestation: Reduced accuracy, biased predictions, or training failures.
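
Below is a minimal pandas sketch of how such incomplete data can be detected and handled. It assumes the sherlock_18apps data has been loaded into a DataFrame named df; the file name used here is only a placeholder.

```python
import pandas as pd

# Placeholder file name; load the sherlock_18apps data into a DataFrame.
df = pd.read_csv("sherlock_18apps.csv")

# Count missing values per column to see which features are affected.
print(df.isna().sum().sort_values(ascending=False).head(10))

# Two common mitigation strategies (pick one depending on how much data is missing):
df_dropped = df.dropna()                                         # drop incomplete rows
df_imputed = df.fillna({"CPU_USAGE": df["CPU_USAGE"].median()})  # impute one column's gaps
```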

Effects of Imbalanced Data

Imbalanced data biases models toward majority classes. In sherlock_18apps, a model may excel at classifying Google App (60,001 samples) but struggle with Messages (2,517 samples), leading to poor minority class performance despite high overall accuracy.

Manifestation: High overall accuracy but low accuracy for minority classes, visible in the confusion matrix.
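
A quick way to see this imbalance, and one simple way to counteract it during training, is sketched below. The sketch assumes the labels live in the ApplicationName column of the DataFrame df and are integer-encoded in sorted order before being passed to Keras; SMOTE-style oversampling (covered in the hands-on part) is an alternative.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Inspect the class frequencies; a large spread signals imbalance.
print(df["ApplicationName"].value_counts())

# One mitigation: weight the loss so minority classes count more during training.
classes = np.array(sorted(df["ApplicationName"].unique()))
weights = compute_class_weight("balanced", classes=classes, y=df["ApplicationName"])
class_weight = dict(enumerate(weights))  # assumes labels encoded 0..N-1 in this sorted order
# model.fit(X_train, y_train, class_weight=class_weight, ...)
```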

Effects of Inaccurate Data

Erroneous features or labels introduce noise, causing models to learn incorrect patterns. For example, mislabeled packets in sherlock_18apps (e.g., WhatsApp as Telegram) lead to mispredictions during inference, reducing generalization.

Manifestation: Increased errors, noisy decision boundaries, poor validation performance.
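
There is no single test for inaccurate labels, but one common heuristic is to flag samples where a cross-validated model disagrees with the recorded label. The sketch below assumes a numeric feature matrix X and label vector y have already been prepared; flagged samples still need manual review.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

# Out-of-fold predictions: each sample is predicted by a model that never saw it.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
pred = cross_val_predict(clf, X, y, cv=5)

# Samples where the prediction disagrees with the label are candidates for review;
# disagreement alone does not prove the label is wrong.
suspect = pred != y
print(f"{suspect.sum()} samples flagged for manual inspection")
```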

Issues in Model Training

Beyond data issues, training problems such as overfitting and underfitting, described earlier in this episode, can also hinder model performance. The evaluation metrics introduced next help diagnose both data and training issues.

Evaluation Metrics for Model Performance

To diagnose data and training issues, we use metrics that reveal overall and class-specific performance. Below, we recap and introduce key metrics for sherlock_18apps:

Accuracy

Definition: Accuracy measures the proportion of correctly predicted samples out of all predictions. It answers the question: “How often is the model correct overall?”

Formula:
\( \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \)

Where:
- TP (True Positive): Number of correctly predicted positive samples.
- TN (True Negative): Number of correctly predicted negative samples.
- FP (False Positive): Number of samples incorrectly predicted as positive.
- FN (False Negative): Number of samples incorrectly predicted as negative.

Intuitive Explanation:
Accuracy reflects the overall correctness of the model. For example, if a model correctly classifies 95 out of 100 apps in sherlock_18apps, the accuracy is 95%. However, in imbalanced datasets like sherlock_18apps, high accuracy may mask poor performance on minority classes (e.g., Messages).

When to Use:
Accuracy is useful when classes are balanced and all errors have similar costs, such as:
- Balanced Classification Tasks: When classes have roughly equal representation, accuracy provides a reliable overall performance measure.
- Preliminary Model Evaluation: To get a quick sense of model performance before diving into class-specific metrics.

Limitation:
In imbalanced datasets like sherlock_18apps, accuracy can be misleading. For example, a model predicting only the majority class (e.g., Google App) may achieve high accuracy but fail on minority classes (e.g., Messages).
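
The toy sketch below makes this limitation concrete: a “model” that always predicts the majority class still scores 95% accuracy on a 95/5 split, even though it misses every minority sample.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# 95 majority-class samples (0) and 5 minority-class samples (1).
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)  # always predict the majority class

print(accuracy_score(y_true, y_pred))  # 0.95, yet every minority sample is wrong
```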


Precision

Definition: Precision measures how many of the samples predicted as a certain class actually belong to that class. It answers the question: “Of all the instances I predicted as positive, how many were actually positive?”

Formula:
\( \text{Precision} = \frac{TP}{TP + FP} \)

Where:
- TP (True Positive): The number of correctly predicted positive samples.
- FP (False Positive): The number of samples incorrectly predicted as positive.

Intuitive Explanation:
Precision can be understood as “how reliable the model’s positive predictions are.” For example, if your model predicts 100 apps as “Moriarty” (positive class), but only 80 of them are actually “Moriarty,” then the precision is 80%.

When to Use:
Precision is important when false positives are costly, such as:
- Spam Detection: A low Precision would mean that too many legitimate emails are misclassified as spam.
- Medical Diagnosis (e.g., cancer screening): If Precision is low, many healthy individuals might be misdiagnosed as having a disease, leading to unnecessary anxiety and medical costs.
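
The sketch below reproduces the Moriarty example above with scikit-learn, one convenient way to compute precision; labels here are simply 1 for Moriarty and 0 for everything else.

```python
from sklearn.metrics import precision_score

# 100 samples predicted as "Moriarty" (1), but only 80 of them truly are.
y_pred = [1] * 100
y_true = [1] * 80 + [0] * 20

print(precision_score(y_true, y_pred))  # 0.8
```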


Recall

Definition: Recall measures how many of the actual positive samples were correctly identified by the model. It answers the question: “Of all the actual positive instances, how many did I correctly predict?”

Formula:
\( \text{Recall} = \frac{TP}{TP + FN} \)

Where:
- TP (True Positive): The number of correctly predicted positive samples.
- FN (False Negative): The number of actual positive samples incorrectly classified as negative.

Intuitive Explanation:
Recall can be thought of as “how many real targets were successfully found.” A high Recall means that the model rarely misses true positive cases. For example, if there are 100 actual “Moriarty” apps in the dataset, but your model only detects 60 of them, then the recall is 60%.

When to Use:
Recall is crucial when missing positive cases is costly, such as:
- Medical Screening: If Recall is low, some actual patients may go undetected, delaying treatment.
- Security Surveillance (e.g., intrusion detection): If Recall is low, many actual threats might be overlooked, increasing security risks.
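
The sketch below reproduces the detection example above: of 100 actual Moriarty samples, the model finds only 60, so recall is 0.6.

```python
from sklearn.metrics import recall_score

# 100 actual "Moriarty" samples (1), of which the model detects only 60.
y_true = [1] * 100
y_pred = [1] * 60 + [0] * 40

print(recall_score(y_true, y_pred))  # 0.6
```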


F1-Score

Definition: F1-Score is the harmonic mean of Precision and Recall, providing a balance between both metrics.

Formula:
\( \text{F1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \)

F1-Score is useful when there is an imbalance between Precision and Recall, as it considers both metrics in a single number.

Intuitive Explanation:
F1-Score acts as a “compromise” between Precision and Recall. In scenarios where both false positives and false negatives are undesirable, F1-Score helps evaluate overall model performance.

When to Use:
F1-Score is ideal when both Precision and Recall matter, such as:
- Information Retrieval (e.g., search engines): The model should not only return accurate results (Precision) but also retrieve all relevant results (Recall).
- Fraud Detection: The model must avoid both false alarms (Precision) and missed fraud cases (Recall).
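
In practice it is convenient to get precision, recall, and F1 for every class at once. The sketch below uses scikit-learn’s classification_report on a tiny made-up multiclass example; in the hands-on part, y_true and y_pred would come from the validation split of sherlock_18apps.

```python
from sklearn.metrics import classification_report, f1_score

# Tiny made-up example with three of the sherlock_18apps classes.
y_true = ["Google App", "Google App", "Messages", "Messages", "Moriarty"]
y_pred = ["Google App", "Google App", "Google App", "Messages", "Moriarty"]

# Per-class precision, recall, and F1 in one table.
print(classification_report(y_true, y_pred))

# A single summary number: macro-averaged F1 weights every class equally.
print(f1_score(y_true, y_pred, average="macro"))
```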


Trade-off Between Precision and Recall

Precision and recall often have an inverse relationship: tuning a model to predict the positive class more aggressively tends to raise recall but lower precision, while being more conservative tends to do the opposite.

The F1-score helps you find a suitable balance between these two metrics.
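
A concrete way to see the trade-off is to score the same predicted probabilities at two different decision thresholds, as in the sketch below (the probabilities are made up for illustration): the low threshold catches every positive but produces false alarms, while the high threshold does the opposite.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up labels and predicted probabilities for a binary problem.
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0, 1, 0])
proba  = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.9, 0.45, 0.5, 0.7, 0.2])

for threshold in (0.3, 0.6):
    y_pred = (proba >= threshold).astype(int)
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
# threshold=0.3: precision=0.62, recall=1.00
# threshold=0.6: precision=1.00, recall=0.80
```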

Overview of Hands-On Activities

This episode includes hands-on activities to diagnose, address, and mitigate data and training issues using sherlock_18apps:

  1. Addressing Imbalanced Data:
    • Explore imbalanced data by computing class frequencies for ApplicationName.
    • Recap accuracy and confusion matrix for the baseline model on the original sherlock_18apps dataset.
    • Compute precision, recall, and F1 score per class to identify minority class issues.
    • Mitigate imbalance by experimenting with techniques like oversampling (e.g., SMOTE) and undersampling majority classes.
  2. Addressing Incomplete Data:
    • Investigate incomplete data by identifying missing values in features like CPU_USAGE.
    • Mitigate incomplete data by applying imputation (e.g., replacing missing values with mean/median) or removing rows/columns with excessive missing data, and evaluate the impact on model performance.
  3. Diagnosing and Solving Training Issues:
    • Analyze overfitting and underfitting by comparing training and validation accuracy/loss.
    • Mitigate training issues by testing solutions such as adjusting model complexity or the number of training epochs (see the sketch below).
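
A minimal Keras sketch for this diagnosis is shown below. It assumes model is the compiled network from the previous episodes (with accuracy as a metric) and that X_train, y_train, X_val, y_val are the prepared splits of sherlock_18apps; a widening gap between the two curves points to overfitting, while two persistently low curves point to underfitting.

```python
import matplotlib.pyplot as plt

# Train while tracking performance on a held-out validation set.
history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=30, batch_size=32, verbose=0)

# Plot training vs. validation accuracy over the epochs.
plt.plot(history.history["accuracy"], label="training")
plt.plot(history.history["val_accuracy"], label="validation")
plt.xlabel("epoch")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```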

Key Points

  • Data issues like missing values, imbalance, and errors lead to biased models and poor predictions.

  • Metrics like precision, recall, and F1 score provide class-specific insights into model performance.

  • Overfitting and underfitting can be detected by comparing training and validation metrics.