This lesson is still being designed and assembled (Pre-Alpha version)

Dealing with Issues in Data and Model Training

Overview

Teaching: 20 min
Exercises: 40 min
Questions
  • What are common data issues in Machine Learning (ML)?

  • How do data issues impact ML model performance?

  • How to identify data issue effects on ML model performance?

  • How to address these data issues?

  • How to identify issues in ML model training caused by overfitting and underfitting?

  • How to address overfitting and underfitting?

Objectives
  • Understand common data issues and their effects.

  • Learn metrics for evaluating model performance.

  • Identify and diagnose the effects of data issues.

  • Explore methods to address data issues.

  • Recognize underfitting and overfitting in ML.

  • Learn approaches to mitigate overfitting and underfitting.

Introduction

In the previous episodes, we trained and validated NN models assuming that the dataset is ideal for the purpose of training the models. We also assumed that the training algorithm works flawlessly to produce a well-performing model. In the real world, these two assumptions do not always hold. This episode introduces methods to address issues related to the dataset and to model training; these methods are intended to produce a reliable model for real-world deployment.

Real-world datasets often contain imperfections such as missing values, inaccurate labels, imbalance among classes, and many other issues. When such datasets are used to train and validate machine learning models, these issues can negatively affect the model’s performance. Understanding, detecting and mitigating these data issues are essential for ensuring that the trained models perform effectively in real-world applications.

In training a machine learning model, one must be cautious of issues such as overfitting, underfitting, or gradient problems. These training-related problems also degrade the model's performance and reliability. We will show how to identify and address issues related to model training.

We will continue to use the smartphone app classification task with the sherlock_18apps dataset to illustrate the dataset-related problems. In previous episodes, we built and tuned neural network models with Keras and achieved an accuracy that appears impressive (over 99%). However, as we shall see in this episode, there is more to it than a single-valued accuracy metric if we aim to build a well-balanced classifier. We will need other metrics that capture a model's performance in greater detail, and use them to guide improvements in our modeling.

Along the way, we will answer the questions listed at the beginning of this episode.

Common Data Issues and Their Impacts

Here are five common data issues:

Missing Values

One common issue in real-world datasets is missing values, which occur when certain pieces of data are simply absent—like questions left unanswered on a survey. This can happen for various reasons, such as hardware malfunctions, software errors, human oversight, or incomplete data collection. For example, in a fitness tracking app, heart rate data might be missing during periods when the wearable device loses contact with the skin, or the device runs out of battery, or the Bluetooth connection with the phone is temporarily interrupted.

Missing values can pose serious problems during model training. Many machine learning models, including NN models, fail outright when they encounter missing values, which are typically encoded as NaN (short for Not a Number). Even when training continues, missing data can reduce the amount of usable information or lead to biased predictions if not handled carefully.
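As a quick illustration, here is a minimal sketch (assuming a pandas DataFrame named df, like the one we load later in this episode) for spotting and handling missing values; the right remedy depends on how much data you can afford to lose:

import pandas as pd

# Count missing (NaN) entries per column to see which features are affected
missing_counts = df.isna().sum()
print(missing_counts[missing_counts > 0])

# Two common remedies:
df_dropped = df.dropna()                               # drop every row that contains any NaN
df_imputed = df.fillna(df.median(numeric_only=True))   # or impute numeric NaNs with each column's median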

Class Imbalance

In many real-world datasets, some classes may have significantly more samples than others; this situation is known as class imbalance. The following graph shows the number of images (samples) per class in a dataset that may be used to train image classification models:

A graphic showing uneven, long-tail distribution of images in a sample dataset

Open Long-Tailed Datasets, curated by Z. Liu, Z. Miao, X. Zhan, J. Wang, B. Gong, and S. X. Yu (2019). Published in "Large-Scale Long-Tailed Recognition in an Open World", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2537-2546.

In this dataset, there are over 1,200 samples for partridge (a kind of bird), whereas there are fewer than 20 samples for water snake. In general, a few common scenarios lead to class imbalance in a dataset: rare events (e.g., fraud or equipment failures occur far less often than normal cases), naturally long-tailed distributions (a handful of categories dominate while many others are rare), and data collection bias (some classes are simply easier or cheaper to collect than others).

More Real-World Class Imbalance

Can you think of more examples in real life where data is naturally imbalanced? Try to classify whether they are due to rare events, long-tail distribution, or data collection bias. Discuss with your peers and share your ideas.

Hint: Try to come up with examples from different domains like social media, education, or transportation.

Examples

  • In social media platforms, the number of likes or followers per user is highly imbalanced — a few users have millions, while most have only a few.
  • In education platforms, only a small percentage of students might drop out of a course, making “dropout” cases rare.
  • In transportation, vehicle breakdowns are much less frequent compared to normal trips, creating an imbalance in incident reports.
  • In industrial equipment monitoring, failures are infrequent compared to normal operations.

Class imbalance can pose serious challenges for machine learning models. When training a classification model with an imbalanced dataset where most samples belong to one class, the model tends to focus overly on that majority class and overlook the minority ones. This leads to poor performance in classifying the minority classes. The problem is often masked when we use the overall model accuracy as a metric during model training, since accuracy can be artificially high due to correct predictions for the dominant class. We will need other metrics that are more sensitive to minority classes, such as precision, recall, and F1-score.
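To see how a high overall accuracy can hide poor minority-class performance, consider this small, purely hypothetical sketch of a "lazy" classifier that always predicts the majority class:

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 95 samples of class 0 (majority) and 5 samples of class 1 (minority)
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)   # this "model" always predicts the majority class

print(accuracy_score(y_true, y_pred))              # 0.95 -- looks impressive
print(recall_score(y_true, y_pred, average=None))  # [1.0, 0.0] -- the minority class is never detected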

Inaccurate Data

Another common issue in real-world datasets is inaccurate data, which refers to values that are recorded incorrectly, mislabeled, or corrupted. These inaccuracies can occur due to human error, sensor malfunctions, miscommunication in labeling processes, or even software bugs.

For instance, in a smartphone app usage dataset, an app might be mislabeled as “Games” when it’s actually a productivity tool. Similarly, sensor readings from a wearable device could contain spikes or flat lines due to poor contact or temporary hardware issues.

Inaccurate data can mislead the model during training by providing incorrect patterns to learn from. If the training data contains mislabeled examples, the model may associate features with the wrong class, reducing its ability to generalize correctly. Moreover, unlike missing data, inaccuracies are harder to detect automatically — they look like valid data but are actually wrong. This makes data cleaning and validation steps critical before training a model.
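One practical (if partial) defense is to add simple sanity checks for values that cannot possibly be correct. The sketch below is only an illustration: it assumes the CPU_USAGE column of our dataset is a percentage in the range 0-100, which you would need to confirm from the dataset's documentation:

# Flag rows whose values fall outside a plausible range (assumed 0-100 percent here)
suspicious = df[(df['CPU_USAGE'] < 0) | (df['CPU_USAGE'] > 100)]
print(f"{len(suspicious)} rows have implausible CPU_USAGE values")

# Label errors are harder to catch automatically; a first step is to review a random sample by hand
print(df.sample(5, random_state=0))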

Irrelevant Features

Not all features in a dataset contribute meaningfully to the task at hand. Irrelevant features are variables that have little or no predictive power for the target label but are still included in the data. These may come from overly broad data collection, automatic logging, or simply poor feature selection.

For example, in our smartphone app classification task, including a user’s phone wallpaper color or device serial number is unlikely to help determine what type of app is being used. These features may introduce noise and confuse the model, causing it to learn patterns that are specific to the training set but do not generalize to new data.

The presence of irrelevant features increases the dimensionality of the input, which can slow down training, increase memory usage, and even worsen model performance due to the curse of dimensionality — the phenomenon where high-dimensional spaces make it harder for models to learn meaningful patterns.
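A quick way to screen for features with little predictive value is to estimate how strongly each feature is associated with the label, for example with mutual information. The following is a minimal sketch, assuming the numeric feature matrix df_features and label series labels that we produce later in this episode:

from sklearn.feature_selection import mutual_info_classif
import pandas as pd

# Estimate how much information each feature carries about the class label
# (this can take a while on a large dataset)
mi = mutual_info_classif(df_features, labels, random_state=0)
mi_series = pd.Series(mi, index=df_features.columns).sort_values()

# Features with near-zero mutual information are candidates for removal
print(mi_series.head(5))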

Outliers

Outliers are data points that deviate significantly from the rest of the dataset. They may occur due to genuine rare events, human or sensor errors, or data entry mistakes.

For example, in a dataset measuring daily steps from wearable devices, a user suddenly logging 100,000 steps in one day could either be an extreme fitness enthusiast — or a data glitch.

Outliers can distort training, especially in models sensitive to numerical scale like linear regression or neural networks. They may cause the model to place undue emphasis on extreme cases, which can shift decision boundaries in unhelpful ways. In classification tasks, a single mislabeled or extreme-value point can disproportionately affect the learned model, particularly if the dataset is small.
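A common first pass is to flag points that lie far from the bulk of the data, for example with a z-score threshold. The sketch below uses a hypothetical daily_steps column purely for illustration; the 3-standard-deviation cutoff is a convention, not a rule:

import numpy as np

# Z-score method: flag values more than 3 standard deviations from the column mean
col = df['daily_steps']          # hypothetical numeric column
z_scores = (col - col.mean()) / col.std()
outliers = df[np.abs(z_scores) > 3]
print(f"Found {len(outliers)} potential outliers")

# Whether to drop, cap, or keep them depends on whether they are errors or genuine rare events.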

Real-World Data Issues

For each of the five data issues discussed above, can you think of real-world examples from different domains?
Also try to think about how each issue might appear in our smartphone app classification scenario.
Discuss your thoughts with your peers.

Examples

  • Missing Values
    • Different domain: In healthcare, patient records may have missing blood pressure readings due to skipped checkups or broken devices.
    • Smartphone App: A fitness app may lose heart rate or step count data when the wearable device disconnects.
  • Class Imbalance
    • Different domain: In fraud detection, fraudulent transactions make up less than 1% of all transactions.
    • Smartphone App: Most apps belong to popular categories like “Social” or “Entertainment”, while categories like “Accessibility” are rare.
  • Inaccurate Data
    • Different domain: In financial systems, a stock trade might be recorded with the wrong timestamp or price due to system errors.
    • Smartphone App: An app used for reading the news might be mislabeled as a “Game” during data annotation.
  • Irrelevant Features
    • Different domain: In education, a student’s favorite color may be included in the data but is irrelevant to predicting course success.
    • Smartphone App: The user’s device wallpaper or battery level is unlikely to help predict app type.
  • Outliers
    • Different domain: A person recording 10,000 calories in a diet app in one day might be an entry mistake or an extreme case.
    • Smartphone App: A user opening an app 500 times in one day is an unusual usage pattern that may distort training.

Hands-On Activity: Exploring Data Issues in sherlock_18apps

Before training a machine learning model, it’s crucial to understand the characteristics of the dataset to anticipate potential challenges. The sherlock_18apps dataset, which we’ve been using for smartphone app classification, contains real-world imperfections that can impact model performance. In this hands-on activity, you’ll explore the dataset to identify potential data issues. This exploration will help you connect the theoretical data issues discussed earlier to practical observations, setting the stage for evaluating model performance with appropriate metrics.

Loading Required Python Libraries and Objects

Ensure the following libraries are loaded in your environment:

import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import sklearn
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import tensorflow as tf
import tensorflow.keras as keras
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

Exploring Data Issues in sherlock_18apps

In this activity, we explore missing values and class imbalance in the sherlock_18apps dataset because they are the most straightforward issues to identify by examining data and class distributions.

Load the sherlock_18apps dataset using sherlock_ML_toolbox.py and answer:

  1. Which features have missing values, and how many?
  2. What is the class distribution of ApplicationName?

Solution

Import the dataset-loading function from sherlock_ML_toolbox:

current_dir = os.getcwd()
parent_dir = os.path.dirname(current_dir)
sys.path.append(parent_dir)
from sherlock_ML_toolbox import load_prep_data_18apps

1. Inspecting Missing Values:

Load the dataset using load_prep_data_18apps to check for missing values.

datafile = "sherlock/sherlock_18apps.csv"
df, df2, labels, df_labels_onehot, df_features = load_prep_data_18apps(datafile, print_summary=False)

Output:

Loading input data from: sherlock/sherlock_18apps.csv
Cleaning:
- dropped 2 columns: ['cminflt', 'guest_time']
- remaining missing data (per feature):
CPU_USAGE      52
cutime         52
num_threads    52
priority       52
rss            52
state          52
stime          52
utime          52
vsize          52
dtype: int64
- dropping the rest of missing data
- remaining shape: (273077, 17)
Step: Separating the labels (ApplicationName) from the features.
Step: Converting all non-numerical features to one-hot encoding.
Step: Feature scaling with StandardScaler

Analysis: The output shows that the sherlock_18apps dataset has missing values in multiple features: CPU_USAGE, cutime, num_threads, priority, rss, state, stime, utime, and vsize, each with 52 missing entries.


2. Inspecting Class Imbalance:

To examine the class distribution of ApplicationName, we define a function to print and visualize the label distribution.

def visualize_label_distribution(labels):
    """
    Visualize the distribution of labels and print the frequency distribution.

    Parameters:
    labels (pd.Series): A pandas Series containing the labels (e.g., ApplicationName).
    """
    # Check the frequency distribution of each category
    labels_distribution = labels.value_counts()
    
    # Print the frequency distribution
    print("Frequency Distribution of Labels:")
    print(labels_distribution)
    
    # Set the figure size
    plt.figure(figsize=(12, 8))
    
    # Create a bar plot
    plt.bar(labels_distribution.index, labels_distribution.values, color='skyblue')
    
    # Add title and labels
    plt.title('Distribution of Labels', fontsize=16)
    plt.xlabel('Labels', fontsize=14)
    plt.ylabel('Frequency', fontsize=14)
    
    # Rotate x-axis labels for better readability
    plt.xticks(rotation=45, ha='right')
    
    # Display the plot
    plt.tight_layout()
    plt.show()

Apply the function to the labels from the loaded dataset:

visualize_label_distribution(labels)

Output:

Frequency Distribution of Labels:
Google App          60001
Chrome              28045
Facebook            20103
Geo News            19991
Messenger           19989
WhatsApp            19985
Photos              17380
ES File Explorer    16660
Gmail               16414
Calendar             8986
Moovit               8365
Waze                 8228
Hangouts             7601
YouTube              5173
Maps                 5157
Skype                4876
Moriarty             3616
Messages             2507
Name: ApplicationName, dtype: int64

The following bar plot visualizes the class distribution of ApplicationName:

Label Distribution of ApplicationName

Analysis: The output reveals a significant class imbalance in the sherlock_18apps dataset. The Google App is heavily over-represented with 60,001 samples, while apps like Messages (2,507 samples), Moriarty (3,616 samples), and Skype (4,876 samples) are under-represented.

Conclusion:

By exploring the dataset and answering the above two questions, we confirm the presence of both missing values and class imbalance in the sherlock_18apps dataset.

Common Model Training Issues and Their Impacts

Training a machine learning (ML) model involves optimizing its parameters to learn patterns from data. Ideally, the model’s complexity will align with the data’s complexity, allowing it to learn general patterns effectively and form a decision boundary that accurately separates different classes of samples, performing well on both training and validation sets.

However, when the model’s complexity does not align with the data’s complexity, training problems such as overfitting and underfitting arise, leading to poor performance or unstable training. These issues can prevent models from generalizing effectively in tasks like sherlock_18apps app classification. Understanding and addressing these training issues is crucial for building reliable ML models.

Model Complexity and Fitting Relationship
Ideal Decision Boundary: An Illustration of Ideal Fitting
Ideal Fitting Loss and Accuracy

Overfitting

Overfitting occurs when a machine learning model becomes overly attuned to the noise and specific details of the training data, failing to capture generalizable patterns that apply to new, unseen data. A classic example is a student who memorizes every detail of practice questions, including irrelevant typos, and consequently fails to solve new problems that require genuine understanding.

This issue is often reflected in the divergence between training and validation metrics: while the model achieves extremely low training loss and high training accuracy (sometimes approaching perfection), its performance on the validation set deteriorates significantly, with rising loss and dropping accuracy. This gap signals that the model has memorized the training data’s idiosyncrasies rather than learning underlying trends, a critical flaw for real-world applications where data variability is inevitable.

Visually, overfitting can be understood through the lens of decision boundaries in simplified feature spaces. In a two-dimensional example, an overfitted model might produce a highly complex, “wiggly” decision boundary that tightly wraps around every training point, even those influenced by random noise. Unlike a smooth, generalized boundary that captures the core separation between classes, this overly intricate boundary reflects the model’s attempt to explain every nuance of the training data—including irrelevant or misleading details. The result is a model that performs flawlessly on the training set but struggles to classify new samples correctly, as shown in the side-by-side illustrations below.

Overfitting Loss & Accuracy
Training and Validation Loss/Accuracy in Overfitting
Overfitting Decision Boundary
Decision Boundary in Overfitting
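In Keras, this divergence is easy to spot by plotting the training history returned by model.fit(). Here is a minimal sketch, assuming a History object such as the model_1H_history we obtain later in this episode:

import matplotlib.pyplot as plt

history = model_1H_history.history   # per-epoch metrics recorded by model.fit()

plt.figure(figsize=(10, 4))

plt.subplot(1, 2, 1)
plt.plot(history['loss'], label='training loss')
plt.plot(history['val_loss'], label='validation loss')
plt.xlabel('Epoch'); plt.ylabel('Loss'); plt.legend()

plt.subplot(1, 2, 2)
plt.plot(history['accuracy'], label='training accuracy')
plt.plot(history['val_accuracy'], label='validation accuracy')
plt.xlabel('Epoch'); plt.ylabel('Accuracy'); plt.legend()

plt.tight_layout()
plt.show()

# A widening gap (training loss keeps falling while validation loss rises) is the classic signature of overfitting.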

Underfitting

Underfitting occurs when a machine learning model is too simple relative to the data’s complexity, failing to capture the underlying patterns. This typically happens when the model has insufficient complexity (e.g., too few layers or parameters) or is not trained long enough to learn meaningful trends in the data. A typical example is a weather app that uses only temperature to predict rain, ignoring important factors like humidity and pressure. Such a model will fail to produce accurate forecasts because it lacks the capacity to understand the true complexity of the problem.

This issue is often reflected in the convergence of training and validation metrics at suboptimal levels: the model exhibits high training loss and low training accuracy, with similar poor performance on the validation set. This lack of divergence between training and validation metrics signals that the model has not learned meaningful trends from the training data, rendering it ineffective for real-world applications where capturing essential patterns is crucial.

Visually, underfitting can be understood through the lens of decision boundaries in simplified feature spaces. In a two-dimensional example, an underfitted model might produce an overly simplistic decision boundary, such as a straight line, that fails to adequately separate the classes, leaving many points misclassified. Unlike a well-balanced boundary that captures the core separation between classes, this rudimentary boundary reflects the model’s inability to model the data’s complexity. The result is a model that performs poorly on both training and validation sets, as shown in the side-by-side illustrations below.

Underfitting Loss & Accuracy
Training and Validation Loss/Accuracy in Underfitting
Underfitting Decision Boundary
Decision Boundary in Underfitting
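The same training history can also feed a rough numerical check. The sketch below is only a heuristic with hypothetical thresholds; in practice you should judge the full curves and your accuracy requirements together:

# Compare final-epoch metrics from the training history (thresholds are illustrative only)
history = model_1H_history.history
final_train_acc = history['accuracy'][-1]
final_val_acc = history['val_accuracy'][-1]

if final_train_acc < 0.8 and final_val_acc < 0.8:
    print("Both accuracies are low: likely underfitting (model too simple or trained too briefly).")
elif final_train_acc - final_val_acc > 0.05:
    print("Training accuracy far exceeds validation accuracy: likely overfitting.")
else:
    print("Training and validation accuracies are close and reasonably high: no obvious fitting problem.")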

Evaluation Metrics for Model Performance

Evaluating machine learning models goes beyond achieving a high accuracy score, especially for real-world datasets like sherlock_18apps, which suffer from issues such as class imbalance. A single metric can hide critical weaknesses, such as poor performance on minority classes or biases from data imperfections.

In this section, we systematically introduce several commonly used performance evaluation metrics for machine learning models, including accuracy, precision, recall, and the F1 score, to assess models comprehensively and to detect issues stemming from data or training problems. These metrics will guide us in improving our sherlock_18apps classifier and set the stage for hands-on evaluation.

Why Metrics Matter

Accuracy, while intuitive, can be misleading for imbalanced datasets like sherlock_18apps, where classes are unevenly represented (e.g., Google App with 60,001 samples vs. Messages with 2,507, as observed in our earlier exploration). A model might achieve high accuracy by correctly predicting the majority class while failing on minority classes, which are often critical in real-world applications. To illustrate this, consider the baseline model we developed in previous episodes for sherlock_18apps. While it may report a high overall accuracy (e.g., 99%), this metric doesn’t reveal whether the model performs well across all classes, particularly for underrepresented ones like Messages or Moriarty.

Challenge: Investigating Per-Class Accuracy and Confusion Matrix

In the previous episode (e.g., 24-keras-classify.md), we trained a baseline model for sherlock_18apps and printed its overall accuracy on the validation set using the following code:

def NN_Model_1H(hidden_neurons, learning_rate):
    """Definition of deep learning model with one dense hidden layer"""
    model = Sequential([
        # More hidden layers can be added here
        Dense(hidden_neurons, activation='relu', input_shape=(19,),
              kernel_initializer='random_normal'), # Hidden Layer
        Dense(18, activation='softmax',
              kernel_initializer='random_normal')  # Output Layer
    ])
    adam_opt = Adam(learning_rate=learning_rate, beta_1=0.9, beta_2=0.999, amsgrad=False)
    model.compile(optimizer=adam_opt,
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

model_1H = NN_Model_1H(18, 0.0003)
model_1H_history = model_1H.fit(train_features,
                                train_L_onehot,
                                epochs=10, batch_size=32,
                                validation_data=(val_features, val_L_onehot),
                                verbose=2)

This code gives us a single accuracy score, but it doesn’t show how the model performs for each app class. Can you extend this to:

  1. Compute and print the per-class accuracy for each of the 18 app classes?
  2. Generate and visualize the confusion matrix (like the Reference Confusion Matrix shown below) to identify misclassifications?

Reference Confusion Matrix (from episode 24-keras-classify.md, decision tree classifier):

 1829     1     0     0     0     0     0     0     0     0     0    18     0     0     0     0     1     0
    0  5477     0     0     0    69     0     0     0     0     0     0     0     5     0     0     2     0
    1   610  2753     0     0    25     0     5     0     1     1     1     0     2     0     0     0     0
    0     0     0  4029     0     0    15     0     0     0     0     0     0     0     0     0     0    10
    0     0     0     0  4006     0     0     0     0     0     0     0     0     0     0     0     0     0
   64    28     0     0     0  3183     1     0     0     0     0     1     0    49     0     0     0     0
    0   143     0     0     0     2 10459     0     0     0    15     0     0     0     0     0  1369     0
    0    58     0     0     0    24     4  1408     0     1     0     0     0     1     0     0    11     0
    3    39     0     0     0     0     1     0   935     0     0     0     0     0     1     0     4     0
    0     0     0     0     0     0     0     0     1   486     0     0     0     8     0     0     0     0
    0     0     0     0     0     0     0     0     0     0  4016     0     0     0     0     0     0     0
    0     0     0     0     0     0     0     0     0     0     0  1697     0     0     0     0     0     0
    0    13     0     0     4     0     0     0     0     0     0     0   680     1     0     0     0     0
    0     0     0     0     0     0     0     0     0     0     0     6     0  3473     0     0     0     0
    0     0     0     0     0     0     0     0     0     0     0     0     0     0  1003     0     0     0
    0     0     0     0     3     0     0     0     0     0     0     0     0     0     0  1642     0     0
    0     4     0     0     0     0     4     0     0     0     0     0     0     0     0     0  3897     0
    0     0     0     0     0   116     0     0     0     0     0     0     0     0     0     0     0   897

Solution

First, we split the dataset and preprocess the data by importing the corresponding function from sherlock_ML_toolbox.

from sherlock_ML_toolbox import split_data_18apps
train_features, val_features, train_labels, val_labels, train_L_onehot, val_L_onehot = split_data_18apps(df_features, labels, df_labels_onehot)

Then, we retrain the baseline model.

def NN_Model_1H(hidden_neurons, learning_rate):
    """Definition of deep learning model with one dense hidden layer"""
    model = Sequential([
        # More hidden layers can be added here
        Dense(hidden_neurons, activation='relu', input_shape=(19,),
              kernel_initializer='random_normal'), # Hidden Layer
        Dense(18, activation='softmax',
              kernel_initializer='random_normal')  # Output Layer
    ])
    adam_opt = Adam(learning_rate=learning_rate, beta_1=0.9, beta_2=0.999, amsgrad=False)
    model.compile(optimizer=adam_opt,
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

def model_evaluate_custom(model, test_F, test_L_one_hot):
    # 1. Get model predictions
    test_L_pred_prob = model.predict(test_F)
    test_L_pred_indices = np.argmax(test_L_pred_prob, axis=1)  # Predicted class indices

    # 2. Convert true one-hot encoded labels to class indices
    #    Also, try to get class names (if input is a DataFrame)
    if isinstance(test_L_one_hot, pd.DataFrame):
        class_names_from_df = [str(col) for col in test_L_one_hot.columns]
        test_L_true_values = test_L_one_hot.values
    else:  # Assuming NumPy array
        class_names_from_df = None
        test_L_true_values = test_L_one_hot

    test_L_true_indices = np.argmax(test_L_true_values, axis=1)  # True class indices

    num_classes = test_L_true_values.shape[1]  # Number of classes from the shape of the labels

    # Determine class names (target_names) for the report
    if class_names_from_df and len(class_names_from_df) == num_classes:
        target_names = class_names_from_df
    else:
        # If no valid column names are provided, generate generic names (e.g., "0", "1", ..., "N-1")
        target_names = [str(i) for i in range(num_classes)]

    # 3. Calculate sklearn's classification_report (includes overall accuracy)
    #    zero_division=0 avoids division-by-zero warnings if a class has no samples, setting the metric to 0
    report_dict = classification_report(test_L_true_indices, test_L_pred_indices,
                                        target_names=target_names,
                                        output_dict=True,
                                        digits=4,
                                        zero_division=0)
    overall_accuracy = report_dict['accuracy']

    # 4. Calculate per-class accuracy
    per_class_accuracy_dict = {}

    for class_idx in range(num_classes):  # Iterate over all class indices from 0 to num_classes-1
        class_name_str = target_names[class_idx]  # Class name corresponding to the current index

        # Check if this class is present in the true labels
        if class_idx in test_L_true_indices:
            mask = (test_L_true_indices == class_idx)
            correct_predictions = np.sum(test_L_pred_indices[mask] == test_L_true_indices[mask])
            total_samples_in_class = np.sum(mask)
            acc = correct_predictions / total_samples_in_class if total_samples_in_class > 0 else 0.0
        else:
            # If the class is absent from the true labels, its accuracy is considered 0 (or np.nan)
            acc = 0.0

        per_class_accuracy_dict[class_name_str] = acc

        # Also record the per-class accuracy in the corresponding class entry of report_dict
        if class_name_str in report_dict:  # Keys in report_dict are the class names from target_names
            report_dict[class_name_str]['accuracy'] = acc
        else:
            # classification_report may omit classes with zero support;
            # ensure an entry exists for every class in target_names
            report_dict[class_name_str] = {'precision': 0, 'recall': 0, 'f1-score': 0,
                                           'support': 0, 'accuracy': acc}

    # 5. Calculate the confusion matrix
    #    The 'labels' parameter ensures the confusion matrix dimensions are consistent
    #    with the number of classes (0 to num_classes-1)
    cm_labels = np.arange(num_classes)
    cm = confusion_matrix(test_L_true_indices, test_L_pred_indices, labels=cm_labels)

    return overall_accuracy, per_class_accuracy_dict, cm, report_dict, target_names

model_1H = NN_Model_1H(18, 0.0003)
model_1H_history = model_1H.fit(train_features,
                                train_L_onehot,
                                epochs=10, batch_size=32,
                                validation_data=(val_features, val_L_onehot),
                                verbose=2)

overall_acc, per_class_acc_dict, conf_matrix, full_report_dict, class_names_for_report = \
    model_evaluate_custom(model_1H, val_features, val_L_onehot)

print("\n========== Model Evaluation Results ==========")

print(f"\n[1] Overall Accuracy: {overall_acc:.4f}")

print("\n[2] Per-Class Accuracy:")
if per_class_acc_dict:
    for class_name, acc in per_class_acc_dict.items():
        print(f"    - {class_name}: {acc:.4f}")
else:
    print("    Could not calculate per-class accuracy.")

print("\n[3] Confusion Matrix:")
print(conf_matrix)

Output:

Epoch 1/10
6827/6827 - 6s - loss: 1.0943 - accuracy: 0.6878 - val_loss: 0.5419 - val_accuracy: 0.8664
Epoch 2/10
6827/6827 - 6s - loss: 0.4082 - accuracy: 0.8973 - val_loss: 0.3260 - val_accuracy: 0.9236
Epoch 3/10
6827/6827 - 6s - loss: 0.2808 - accuracy: 0.9315 - val_loss: 0.2500 - val_accuracy: 0.9375
Epoch 4/10
6827/6827 - 6s - loss: 0.2250 - accuracy: 0.9411 - val_loss: 0.2075 - val_accuracy: 0.9449
Epoch 5/10
6827/6827 - 6s - loss: 0.1916 - accuracy: 0.9513 - val_loss: 0.1792 - val_accuracy: 0.9560
Epoch 6/10
6827/6827 - 6s - loss: 0.1668 - accuracy: 0.9591 - val_loss: 0.1570 - val_accuracy: 0.9620
Epoch 7/10
6827/6827 - 6s - loss: 0.1475 - accuracy: 0.9660 - val_loss: 0.1400 - val_accuracy: 0.9677
Epoch 8/10
6827/6827 - 6s - loss: 0.1327 - accuracy: 0.9704 - val_loss: 0.1286 - val_accuracy: 0.9706
Epoch 9/10
6827/6827 - 6s - loss: 0.1214 - accuracy: 0.9727 - val_loss: 0.1177 - val_accuracy: 0.9727
Epoch 10/10
6827/6827 - 6s - loss: 0.1120 - accuracy: 0.9737 - val_loss: 0.1085 - val_accuracy: 0.9734

========== Model Evaluation Results ==========

[1] Overall Accuracy: 0.9734

[2] Per-Class Accuracy:
    - Calendar: 0.9210
    - Chrome: 0.9247
    - ES File Explorer: 0.9788
    - Facebook: 0.9938
    - Geo News: 0.9980
    - Gmail: 0.9603
    - Google App: 0.9962
    - Hangouts: 0.9854
    - Maps: 0.9288
    - Messages: 0.8747
    - Messenger: 0.9998
    - Moovit: 0.9959
    - Moriarty: 0.9914
    - Photos: 0.8893
    - Skype: 0.9731
    - Waze: 0.9945
    - WhatsApp: 1.0000
    - YouTube: 0.9812

[3] Confusion Matrix:
[[ 1703     0     0     0     2   141     0     0     2     0     0     0     0     1     0     0     0     0]
 [    0  5135    31     0     0    45     1     5     3    17     0     0     2    47     1   256    10     0]
 [    4    25  3327     1     0    11     0     6    13     3     2     1     0     2     1     0     3     0]
 [    0     3     0  4029     0     0    13     6     0     0     2     0     1     0     0     0     0     0]
 [    0     0     0     0  3998     0     0     0     0     6     0     0     0     2     0     0     0     0]
 [   83     2     3     0     0  3194    20     9     8     0     0     0     1     1     0     0     5     0]
 [    0     2     0    12     0     3 11943     6     1     0    18     0     0     0     1     0     2     0]
 [    0     0     1     3     0    13     1  1485     1     0     0     0     0     0     2     0     1     0]
 [    7     0    54     0     0     0     0     0   913     0     0     2     1     0     5     0     1     0]
 [    0    10     1     0     0     0     0     0     0   433     0     0     0    51     0     0     0     0]
 [    0     0     0     0     0     0     1     0     0     0  4015     0     0     0     0     0     0     0]
 [    0     0     1     0     0     0     1     0     1     0     0  1690     0     4     0     0     0     0]
 [    0     1     0     0     0     1     0     0     0     4     0     0   692     0     0     0     0     0]
 [   20     3   254     0     0     0     0     0     1   106     0     0     0  3094     1     0     0     0]
 [    9     0    18     0     0     0     0     0     0     0     0     0     0     0   976     0     0     0]
 [    0     3     3     0     0     0     0     0     0     0     0     0     0     0     0  1636     3     0]
 [    0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0  3905     0]
 [    0     0    13     0     0     0     0     0     0     0     5     0     0     0     0     0     1   994]]

Expected Observations:

  • Overall Accuracy: The retrained model (NN_Model_1H) achieves a high overall accuracy (0.9734) on the validation set.
  • Per-Class Accuracy: However, the per-class accuracies reveal significant performance disparities, confirming the impact of class imbalance: majority-class apps (e.g., WhatsApp with 1.0000 accuracy, Google App with 0.9962 accuracy) are classified excellently, while minority-class apps (e.g., Messages with 0.8747 accuracy, Photos with 0.8893 accuracy) perform much worse.
  • Confusion Matrix: The confusion matrix provides further detail about which classes are confused with one another.

The class imbalance of the sherlock_18apps dataset leads the model to prioritize predicting the majority classes, resulting in poor performance on minority classes.
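One common way to push back against this bias is to weight the loss so that mistakes on rare classes cost more. Below is a minimal sketch, assuming the train_labels, train_L_onehot, and NN_Model_1H objects from the solution above, and that train_L_onehot is a pandas DataFrame whose columns are the class names; class weighting is only one of several remedies (others include oversampling the minority classes or undersampling the majority ones):

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# One weight per class, inversely proportional to class frequency
classes = np.unique(train_labels)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=train_labels)

# Keras expects a {class_index: weight} dictionary; map class names to one-hot column positions
class_weight = {train_L_onehot.columns.get_loc(c): w for c, w in zip(classes, weights)}

model_1H_weighted = NN_Model_1H(18, 0.0003)
model_1H_weighted.fit(train_features, train_L_onehot,
                      epochs=10, batch_size=32,
                      validation_data=(val_features, val_L_onehot),
                      class_weight=class_weight,
                      verbose=2)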

Discussions

  • Why does the model perform poorly on minority classes despite high overall accuracy? How does the class imbalance in sherlock_18apps contribute?
  • In the confusion matrix, which classes are most frequently misclassified, and what does this suggest about the model’s performance?
  • Why is overall accuracy insufficient, and what additional metrics might help evaluate minority class performance?

Takeaways

This challenge shows that high overall accuracy can mask poor performance on minority classes in sherlock_18apps, as seen in the lower per-class accuracies and misclassifications in the confusion matrix. This underscores the need for metrics like precision, recall, and F1-score, which we’ll explore next to assess per-class performance more effectively.

Key Points

  • Data issues like missing values, imbalance, and errors lead to biased models and poor predictions.

  • Metrics like precision, recall, and F1 score provide class-specific insights into model performance.

  • Overfitting and underfitting can be detected by comparing training and validation metrics.