Dealing with Issues in Data and Model Training
Overview
Teaching: 20 min
Exercises: 40 min
Questions
What are common data issues in Machine Learning (ML)?
How do data issues impact ML model performance?
How to identify the effects of data issues on ML model performance?
How to address these data issues?
How to identify issues in ML model training caused by overfitting and underfitting?
How to address overfitting and underfitting?
Objectives
Understand common data issues and their effects.
Learn metrics for evaluating model performance.
Identify and diagnose the effects of data issues.
Explore methods to address data issues.
Recognize underfitting and overfitting in ML.
Learn approaches to mitigate overfitting and underfitting.
Introduction
In the previous episodes, we trained and validated NN models assuming that the dataset is ideal for the purpose of training the models. We also assumed that the training algorithm works flawlessly to produce a well-performing model. In the real world, these two assumptions do not always hold true. This episode introduces methods to address issues related to the dataset and to model training; they are intended to produce a reliable model for real-world deployment.
Real-world datasets often contain imperfections such as missing values, inaccurate labels, imbalance among classes, and many other issues. When such datasets are used to train and validate machine learning models, these issues can negatively affect the model’s performance. Understanding, detecting and mitigating these data issues are essential for ensuring that the trained models perform effectively in real-world applications.
In training a machine learning model, one must also be cautious of issues such as overfitting, underfitting, or gradient problems. These training-related problems likewise degrade the model's performance and reliability. We will show how to identify and address issues related to model training.
We will continue to use the smartphone app classification task
with the sherlock_18apps
dataset to illustrate the dataset-related problems.
In previous episodes,
we built and tuned neural network models with Keras,
and we were able to achieve an accuracy that appears impressive
(over 99%).
However, as we shall see in this episode,
there is more to it than a single-valued accuracy metric
if we aim to build a well-balanced classifier.
We will need to use other metrics that can
capture a model’s performance in greater detail,
and use them to guide improvements in our modeling.
Along the way, we will answer the following questions:
- What are the common data and training issues in ML?
- How do these issues affect model performance?
- What other metrics can help us to detect these issues?
- What techniques can we use to mitigate these issues to improve the model’s overall performance?
Common Data Issues and Their Impacts
Here are five common data issues:
Missing Values
One common issue in real-world datasets is missing values, which occur when certain pieces of data are simply absent—like questions left unanswered on a survey. This can happen for various reasons, such as hardware malfunctions, software errors, human oversight, or incomplete data collection. For example, in a fitness tracking app, heart rate data might be missing during periods when the wearable device loses contact with the skin, or the device runs out of battery, or the Bluetooth connection with the phone is temporarily interrupted.
Missing values can pose serious problems during model training.
Many machine learning models, including NN models,
would fail completely if they encounter missing values,
which are encoded as NaN
(short for Not a Number).
Even when training continues,
missing data can reduce the amount of usable information,
or lead to biased predictions if not handled carefully.
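As a quick, hedged sketch (the tiny DataFrame and its column names are invented for illustration), pandas can count missing values per feature and then either drop or impute them:
import pandas as pd

# Invented example frame; in practice this would be the loaded dataset
df_demo = pd.DataFrame({"battery_level": [0.91, None, 0.55],
                        "num_threads":   [12, 8, None]})

# Count missing (NaN) values per feature
print(df_demo.isna().sum())

# Option 1: drop every row that contains a missing value
df_dropped = df_demo.dropna()

# Option 2: impute missing values, e.g. with each column's median
df_imputed = df_demo.fillna(df_demo.median(numeric_only=True))
Dropping rows is simplest when only a small fraction of samples is affected (as in sherlock_18apps below); imputation preserves more data but can introduce bias of its own.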
Class Imbalance
In many real-world datasets, some classes may have significantly more samples than the others; this situation is known as class imbalance. The following graph shows the number of images (samples) in a dataset that may be used to train image classification models:
Open Long Tailed Datasets. Curated by Z. Liu, Z. Miao, X. Zhan, J. Wang, B. Gong, & S.X. Yu (2019). Published in this paper: “Large-Scale Long-Tailed Recognition in an Open World”. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2537-2546.
In this dataset, there are over 1200 samples for partridge (a kind of bird), whereas there are fewer than 20 samples for water snake. In general, there are a few common scenarios which lead to class imbalance in a dataset:
-
Rare events: Certain events naturally occur far less frequently than others. In fraud detection, for example, only a tiny fraction of transactions are fraudulent — often less than 0.1%. In computer networking, cyber attack events are also extremely rare and difficult to spot among the vast number of benign network packets. Similarly, rare diseases such as certain types of cancer or genetic disorders may affect only a small fraction of the population (e.g., 1 in 100,000 individuals). These naturally low-frequency events lead to class imbalance, where one class significantly outnumbers another.
-
Long-tail distribution: In many domains, data follows a long-tail distribution, where a few classes (the “head”) have a large number of samples, while many other classes (the “tail”) have very few samples. This phenomenon is illustrated in the Open Long Tailed Datasets graph shown earlier.
-
Data collection bias: During data collection, certain categories of samples may be overrepresented or underrepresented due to imperfect sampling methods, limitations of tools, or human factors, resulting in an imbalanced dataset.
More Real-World Class Imbalance
Can you think of more examples in real life where data is naturally imbalanced? Try to classify whether they are due to rare events, long-tail distribution, or data collection bias. Discuss with your peers and share your ideas.
Hint: Try to come up with examples from different domains like social media, education, or transportation.
Examples
- In social media platforms, the number of likes or followers per user is highly imbalanced — a few users have millions, while most have only a few.
- In education platforms, only a small percentage of students might drop out of a course, making “dropout” cases rare.
- In transportation, vehicle breakdowns are much less frequent compared to normal trips, creating an imbalance in incident reports.
- In industrial equipment monitoring, failures are infrequent compared to normal operations.
Class imbalance can pose serious challenges for machine learning models. When training a classification model with an imbalanced dataset where most samples belong to one class, the model tends to overly focus on that majority class and overlook the minority ones. This leads to poor performance in classifying the minority classes. This problem is often masked when we use the overall model accuracy as a metric during model training, since accuracy can be artificially high due to the model's correct predictions for the dominant class. We will need to use other metrics that are more sensitive to minority classes, such as precision, recall, and F1-score.
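The following sketch (with made-up class counts) shows how a useless "always predict the majority class" strategy can still score a high accuracy on an imbalanced dataset:
import pandas as pd

# Made-up labels: 950 "Social" samples vs. only 50 "Accessibility" samples
labels_demo = pd.Series(["Social"] * 950 + ["Accessibility"] * 50)

# A trivial "classifier" that always predicts the majority class
majority_class = labels_demo.value_counts().idxmax()
predictions = pd.Series([majority_class] * len(labels_demo))

accuracy = (predictions == labels_demo).mean()
print(f"Accuracy: {accuracy:.2%}")   # 95.00%, yet no "Accessibility" sample is ever detected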
Inaccurate Data
Another common issue in real-world datasets is inaccurate data, which refers to values that are recorded incorrectly, mislabeled, or corrupted. These inaccuracies can occur due to human errors (whether intentional or not), sensor glitches, miscommunication in labeling processes, or even software bugs.
For instance, in a smartphone app usage dataset, an app might be misclassified as a game when it’s actually a productivity tool. Similarly, sensor readings from a wearable device could contain spikes or flat lines due to poor contact or temporary hardware issues.
Inaccurate data can mislead the model during training by providing incorrect patterns to learn from. Unlike missing data, inaccuracies are harder to detect automatically: they may look like valid data but are actually wrong. This makes data cleaning and validation steps critical before training a model.
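A few simple, hedged sanity checks can catch some of these problems early (the toy values, and the assumption that CPU_USAGE is reported as a 0-100 percentage, are ours for illustration only):
import pandas as pd

df_demo = pd.DataFrame({"CPU_USAGE": [12.5, 250.0, 7.1],
                        "ApplicationName": ["Gmail", "Gmail", "Chrme"]})

# Range check: values outside an expected range deserve a closer look
print(df_demo[(df_demo["CPU_USAGE"] < 0) | (df_demo["CPU_USAGE"] > 100)])

# Label check: listing distinct label spellings helps spot typos such as "Chrme"
print(df_demo["ApplicationName"].unique())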
Outliers
Outliers are data points that deviate significantly from the rest of the dataset. They may occur due to genuine rare events. Suppose a smartwatch or fitness tracker records 100,000 steps in a day, whereas it typically registers under 10,000 steps per day. This measurement might be valid in very rare cases (e.g., for an extreme athlete on a competition day). Outlier values must be carefully examined to rule out the possibility of errors in the data. For example, in the case of a fitness tracker worn by an ordinary person, an outlier measurement of 100,000 steps in a day is most likely due to an error.
Outliers can distort training, especially in models sensitive to numerical scales like linear regression or neural networks. They may cause the model to place undue emphasis on extreme cases, which can shift decision boundaries in unhelpful ways. In classification tasks, a single mislabeled or extreme-value point can disproportionately affect the learned model, particularly if the dataset is small.
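One common screening rule is the interquartile-range (IQR) criterion; the step counts below are invented to mirror the fitness-tracker example:
import pandas as pd

# Invented daily step counts, with one extreme value
steps = pd.Series([4200, 6500, 7800, 9100, 100000])

# IQR rule: flag points more than 1.5 * IQR beyond the quartiles
q1, q3 = steps.quantile(0.25), steps.quantile(0.75)
iqr = q3 - q1
outliers = steps[(steps < q1 - 1.5 * iqr) | (steps > q3 + 1.5 * iqr)]
print(outliers)   # flags the 100,000-step day for manual review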
Irrelevant Features
Not all features in a dataset contribute meaningfully to the task at hand. Irrelevant features are features that have little or no predictive power for the target label, but are still included in the data. These may come from overly broad data collection, automatic logging, or simply poor feature selection.
For example, in our smartphone app classification task, a user’s phone wallpaper color or device serial number is unlikely to help predict the type of app being used. These features introduce noise and can confuse the model, causing it to learn patterns that are specific to the training set but do not generalize to real-world data.
The presence of irrelevant features increases the dimensionality of the input, which can slow down training, increase memory usage, and even worsen the model’s performance due to the curse of dimensionality: the phenomenon where the high dimensionality of the data makes it harder for models to learn meaningful patterns.
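In the simplest case, features judged irrelevant can be dropped before training; the column names in this sketch are hypothetical:
import pandas as pd

df_demo = pd.DataFrame({"CPU_USAGE":       [12.5, 3.1, 45.0],
                        "wallpaper_color": ["blue", "red", "green"],
                        "device_serial":   ["A1", "B2", "C3"]})

# Drop columns that carry no predictive information for the app label
irrelevant = ["wallpaper_color", "device_serial"]
df_reduced = df_demo.drop(columns=irrelevant)
print(df_reduced.columns.tolist())
More systematic feature-selection tools (for example, those in sklearn.feature_selection) can help when the irrelevant features are not obvious by inspection.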
Real-World Data Issues
For each of the five data issues discussed above, can you think of more real-world examples from different domains? Try to think about how each data issue might appear in our smartphone app classification scenario. Discuss your thoughts with your peers.
Examples
- Missing Values
- Different domain: In healthcare, patient records may have missing blood pressure readings due to skipped checkups or broken devices.
- Smartphone App: A fitness app may lose heart rate or step count data when the wearable device disconnects.
- Class Imbalance
- Different domain: In fraud detection, fraudulent transactions make up less than 1% of all transactions.
- Smartphone App: Most apps belong to popular categories like “Social” or “Entertainment”, while categories like “Accessibility” are rare.
- Inaccurate Data
- Different domain: In financial systems, a stock trade might be recorded with the wrong timestamp or price due to system errors.
- Smartphone App: An app used for reading the news might be mislabeled as a “Game” during data annotation.
- Irrelevant Features
- Different domain: In education, a student’s favorite color may be included in the data but is irrelevant to predicting course success.
- Smartphone App: The user’s device wallpaper or battery level is unlikely to help predict app type.
- Outliers
- Different domain: A person recording 10,000 calories in a diet app in one day might be an entry mistake or an extreme case.
- Smartphone App: A user opening an app 500 times in one day is an unusual usage pattern that may distort training.
Hands-On Activity: Exploring Data Issues in sherlock_18apps
The sherlock_18apps
dataset, which we’ve been using for smartphone app classification,
is sourced from real-world, authentically collected data.
Therefore, it may contain imperfections that can impact model performance.
In this hands-on activity, you’ll explore the sherlock_18apps
dataset to identify potential data issues.
This exploration will help you connect the theoretical data issues discussed earlier to practical observations,
leading to a better understanding.
Loading Required Python Libraries and Objects
Ensure the following libraries are loaded in your environment:
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import sklearn
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import tensorflow as tf
import tensorflow.keras as keras
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
Exploring Data Issues in sherlock_18apps Dataset
In this activity, we focus on missing values and class imbalance in the sherlock_18apps dataset because they are the most straightforward issues to identify by examining data and class distributions.
Load and check the sherlock_18apps dataset using sherlock_ML_toolbox and answer:
- Which features have missing values, and how many?
- Are classes in the dataset imbalanced?
Solution
1. Inspecting Missing Values:
Import the function load_prep_data_18apps from sherlock_ML_toolbox to load the dataset.
from sherlock_ML_toolbox import load_prep_data_18apps
Load the dataset using load_prep_data_18apps to check for missing values.
datafile = "sherlock/sherlock_18apps.csv"
df, df2, labels, df_labels_onehot, df_features = load_prep_data_18apps(datafile, print_summary=False)
Output:
Loading input data from: sherlock/sherlock_18apps.csv
Cleaning:
- dropped 2 columns: ['cminflt', 'guest_time']
- remaining missing data (per feature):
CPU_USAGE      52
cutime         52
num_threads    52
priority       52
rss            52
state          52
stime          52
utime          52
vsize          52
dtype: int64
- dropping the rest of missing data
- remaining shape: (273077, 17)
Step: Separating the labels (ApplicationName) from the features.
Step: Converting all non-numerical features to one-hot encoding.
Step: Feature scaling with StandardScaler
Analysis: The output shows that the
sherlock_18apps
dataset has missing values in multiple features: CPU_USAGE, cutime, num_threads, priority, rss, state, stime, utime, and vsize, each with 52 missing entries.
2. Inspecting Class Imbalance:
To investigate whether there is a class imbalance issue in the dataset, we define a function to print and visualize the label distribution.
def visualize_label_distribution(labels):
    """
    Visualize the distribution of labels and print the frequency distribution.

    Parameters:
    labels (pd.Series): A pandas Series containing the labels (e.g., ApplicationName).
    """
    # Check the frequency distribution of each category
    labels_distribution = labels.value_counts()

    # Print the frequency distribution
    print("Frequency Distribution of Labels:")
    print(labels_distribution)

    # Set the figure size
    plt.figure(figsize=(12, 8))

    # Create a bar plot
    plt.bar(labels_distribution.index, labels_distribution.values, color='skyblue')

    # Add title and labels
    plt.title('Distribution of Labels', fontsize=16)
    plt.xlabel('Labels', fontsize=14)
    plt.ylabel('Frequency', fontsize=14)

    # Rotate x-axis labels for better readability
    plt.xticks(rotation=45, ha='right')

    # Display the plot
    plt.tight_layout()
    plt.show()
Apply the function to the variable labels, which is derived from the output of load_prep_data_18apps.
visualize_label_distribution(labels)
Output:
Frequency Distribution of Labels:
Google App          60001
Chrome              28045
Facebook            20103
Geo News            19991
Messenger           19989
WhatsApp            19985
Photos              17380
ES File Explorer    16660
Gmail               16414
Calendar             8986
Moovit               8365
Waze                 8228
Hangouts             7601
YouTube              5173
Maps                 5157
Skype                4876
Moriarty             3616
Messages             2507
Name: ApplicationName, dtype: int64
Analysis: The output reveals a significant class imbalance in the
sherlock_18apps
dataset. The Google App is heavily over-represented with 60,001 samples, while apps like Messages (2,507 samples), Moriarty (3,616 samples), and Skype (4,876 samples) are under-represented.
Conclusion:
By exploring the dataset and answering the two questions above, we confirm the presence of both missing values and class imbalance in the
sherlock_18apps
dataset.
Common Model Training Issues and Their Impacts
Training a machine learning (ML) model involves optimizing its parameters to learn patterns from data. Ideally, the model’s complexity will align with the data’s complexity, allowing it to learn general patterns effectively and form a decision boundary that accurately separates different classes of samples, performing well on both training and validation sets.
However, when the model’s complexity does not align with the data’s complexity,
training problems such as overfitting and underfitting arise,
leading to poor performance or unstable training.
These issues can prevent models from generalizing effectively in tasks like sherlock_18apps
classification.
Understanding and addressing these training issues is crucial for building reliable ML models.



Overfitting
Overfitting occurs when a machine learning model becomes overly attuned to the noise and specific details of the training data, failing to capture generalizable patterns that apply to new, unseen data. A classic example is a student who memorizes every detail of practice questions, including irrelevant typos, and consequently fails to solve new problems that require genuine understanding.
This issue is often reflected in the divergence between training and validation metrics: while the model achieves extremely low training loss and high training accuracy (sometimes approaching perfection), its performance on the validation set deteriorates significantly, with rising loss and dropping accuracy. This gap signals that the model has memorized the training data’s idiosyncrasies rather than learning underlying trends, a critical flaw for real-world applications where data variability is inevitable.
Visually, overfitting can be understood through the lens of decision boundaries in simplified feature spaces. In a two-dimensional example, an overfitted model might produce a highly complex, “wiggly” decision boundary that tightly wraps around every training point, even those influenced by random noise. Unlike a smooth, generalized boundary that captures the core separation between classes, this overly intricate boundary reflects the model’s attempt to explain every nuance of the training data—including irrelevant or misleading details. The result is a model that performs flawlessly on the training set but struggles to classify new samples correctly, as shown in the side-by-side illustrations below.
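Beyond such illustrations, one practical way to spot this divergence is to plot the curves stored in a Keras History object (the helper below is our own sketch; in this episode, model_1H_history from the later exercise could be passed in as history):
import matplotlib.pyplot as plt

def plot_learning_curves(history):
    """Plot training vs. validation loss and accuracy from a Keras History object."""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.plot(history.history['loss'], label='train loss')
    ax1.plot(history.history['val_loss'], label='val loss')
    ax1.set_xlabel('Epoch')
    ax1.set_ylabel('Loss')
    ax1.legend()
    ax2.plot(history.history['accuracy'], label='train accuracy')
    ax2.plot(history.history['val_accuracy'], label='val accuracy')
    ax2.set_xlabel('Epoch')
    ax2.set_ylabel('Accuracy')
    ax2.legend()
    plt.tight_layout()
    plt.show()
A training loss that keeps falling while the validation loss turns upward is the classic signature of overfitting; curves that stay close together and flatten at poor values point to underfitting instead.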


Underfitting
Underfitting occurs when a machine learning model is too simple relative to the data’s complexity, failing to capture the underlying patterns. This typically happens when the model has insufficient complexity (e.g., too few layers or parameters) or is not trained long enough to learn meaningful trends in the data. A typical example is a weather app that uses only temperature to predict rain, ignoring important factors like humidity and pressure. Such a model will fail to produce accurate forecasts because it lacks the capacity to understand the true complexity of the problem.
This issue is often reflected in the convergence of training and validation metrics at suboptimal levels: the model exhibits high training loss and low training accuracy, with similar poor performance on the validation set. This lack of divergence between training and validation metrics signals that the model has not learned meaningful trends from the training data, rendering it ineffective for real-world applications where capturing essential patterns is crucial.
Visually, underfitting can be understood through the lens of decision boundaries in simplified feature spaces. In a two-dimensional example, an underfitted model might produce an overly simplistic decision boundary, such as a straight line, that fails to adequately separate the classes, leaving many points misclassified. Unlike a well-balanced boundary that captures the core separation between classes, this rudimentary boundary reflects the model’s inability to model the data’s complexity. The result is a model that performs poorly on both training and validation sets, as shown in the side-by-side illustrations below.
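As a hedged illustration of insufficient capacity (the layer sizes here are chosen arbitrarily for the sketch), a network with a single two-neuron hidden layer is almost certainly too simple for our 18-class problem with 19 input features:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

tiny_model = Sequential([
    Dense(2, activation='relu', input_shape=(19,)),   # very low capacity hidden layer
    Dense(18, activation='softmax')                   # output layer for the 18 apps
])
tiny_model.compile(optimizer='adam',
                   loss='categorical_crossentropy',
                   metrics=['accuracy'])
# Training this model would typically leave BOTH training and validation accuracy low,
# the hallmark of underfitting; adding neurons or layers is the usual first remedy.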


Alternative Evaluation Metrics for Model Performance
In this section, we will introduce several commonly used alternative evaluation metrics beyond accuracy,
including precision, recall and F1 Score,
to comprehensively assess the performance of ML models,
even when dealing with imbalanced datasets where accuracy alone can be misleading.
These metrics will guide us in improving our sherlock_18apps
classifier.
Why Alternative Metrics Matter
In previous episodes, we primarily used accuracy to evaluate ML model performance.
While intuitive, accuracy can be misleading in imbalanced datasets.
A model might achieve high overall accuracy by correctly predicting majority classes
while failing on minority classes.
For example, the baseline model we developed in previous episodes for sherlock_18apps
reports a high overall accuracy, even with the dataset’s significant imbalance.
However, we don’t know whether the model performs well across all classes,
particularly for minority classes.
To investigate whether class imbalance compromises the baseline model’s performance on minority classes, we first need to look beyond a single accuracy score. A powerful tool for this is the confusion matrix, which provides a detailed breakdown of a classifier’s performance.
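Before applying it to our model, here is a tiny, self-contained example of scikit-learn's confusion_matrix with made-up app labels, so the layout of rows (true classes) and columns (predicted classes) is clear:
from sklearn.metrics import confusion_matrix

y_true = ["Gmail", "Gmail", "Chrome", "Chrome", "Chrome", "Waze"]
y_pred = ["Gmail", "Chrome", "Chrome", "Chrome", "Waze", "Waze"]

print(confusion_matrix(y_true, y_pred, labels=["Chrome", "Gmail", "Waze"]))
# [[2 0 1]     <- true Chrome: 2 correct, 1 predicted as Waze
#  [1 1 0]     <- true Gmail: 1 predicted as Chrome, 1 correct
#  [0 0 1]]    <- true Waze: 1 correct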
Challenge: Investigating Confusion Matrix
In the previous episode (e.g., 24-keras-classify.md), we trained a baseline model for sherlock_18apps and printed its overall accuracy on the validation set.
def NN_Model_1H(hidden_neurons, learning_rate):
    """Definition of deep learning model with one dense hidden layer"""
    model = Sequential([
        # More hidden layers can be added here
        Dense(hidden_neurons, activation='relu', input_shape=(19,),
              kernel_initializer='random_normal'),  # Hidden Layer
        Dense(18, activation='softmax',
              kernel_initializer='random_normal')   # Output Layer
    ])
    adam_opt = Adam(lr=learning_rate, beta_1=0.9, beta_2=0.999, amsgrad=False)
    model.compile(optimizer=adam_opt,
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

model_1H = NN_Model_1H(18, 0.0003)
model_1H_history = model_1H.fit(train_features, train_L_onehot,
                                epochs=10, batch_size=32,
                                validation_data=(val_features, val_L_onehot),
                                verbose=2)
Epoch 1/10
6827/6827 - 7s - loss: 1.1037 - accuracy: 0.6752 - val_loss: 0.5488 - val_accuracy: 0.8702
Epoch 2/10
6827/6827 - 6s - loss: 0.4071 - accuracy: 0.9047 - val_loss: 0.3205 - val_accuracy: 0.9245
Epoch 3/10
6827/6827 - 6s - loss: 0.2743 - accuracy: 0.9319 - val_loss: 0.2425 - val_accuracy: 0.9385
Epoch 4/10
6827/6827 - 6s - loss: 0.2177 - accuracy: 0.9468 - val_loss: 0.1990 - val_accuracy: 0.9509
Epoch 5/10
6827/6827 - 6s - loss: 0.1818 - accuracy: 0.9592 - val_loss: 0.1692 - val_accuracy: 0.9628
Epoch 6/10
6827/6827 - 6s - loss: 0.1561 - accuracy: 0.9664 - val_loss: 0.1470 - val_accuracy: 0.9671
Epoch 7/10
6827/6827 - 6s - loss: 0.1363 - accuracy: 0.9703 - val_loss: 0.1296 - val_accuracy: 0.9708
Epoch 8/10
6827/6827 - 6s - loss: 0.1209 - accuracy: 0.9740 - val_loss: 0.1171 - val_accuracy: 0.9739
Epoch 9/10
6827/6827 - 6s - loss: 0.1089 - accuracy: 0.9769 - val_loss: 0.1058 - val_accuracy: 0.9770
Epoch 10/10
6827/6827 - 6s - loss: 0.0995 - accuracy: 0.9786 - val_loss: 0.0970 - val_accuracy: 0.9792
This code gives us a single accuracy score, but it doesn’t show whether the class imbalance issue affects the model’s performance, especially on minority classes.
Can you print the confusion matrix (like the Reference Confusion Matrix shown below) to show the counts of correct and incorrect predictions compared to the actual outcomes?
Reference Confusion Matrix (from episode
24-keras-classify.md
, decision tree classifier):1829 1 0 0 0 0 0 0 0 0 0 18 0 0 0 0 1 0 0 5477 0 0 0 69 0 0 0 0 0 0 0 5 0 0 2 0 1 610 2753 0 0 25 0 5 0 1 1 1 0 2 0 0 0 0 0 0 0 4029 0 0 15 0 0 0 0 0 0 0 0 0 0 10 0 0 0 0 4006 0 0 0 0 0 0 0 0 0 0 0 0 0 64 28 0 0 0 3183 1 0 0 0 0 1 0 49 0 0 0 0 0 143 0 0 0 2 10459 0 0 0 15 0 0 0 0 0 1369 0 0 58 0 0 0 24 4 1408 0 1 0 0 0 1 0 0 11 0 3 39 0 0 0 0 1 0 935 0 0 0 0 0 1 0 4 0 0 0 0 0 0 0 0 0 1 486 0 0 0 8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4016 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1697 0 0 0 0 0 0 0 13 0 0 4 0 0 0 0 0 0 0 680 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 6 0 3473 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1003 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 1642 0 0 0 4 0 0 0 0 4 0 0 0 0 0 0 0 0 0 3897 0 0 0 0 0 0 116 0 0 0 0 0 0 0 0 0 0 0 897
Solution
Firstly, we need to split the dataset and preprocess the data by importing the function split_data_18apps from sherlock_ML_toolbox.
from sherlock_ML_toolbox import split_data_18apps

train_features, val_features, train_labels, val_labels, train_L_onehot, val_L_onehot = split_data_18apps(df_features, labels, df_labels_onehot)
Then, we train the baseline model.
model_1H = NN_Model_1H(18, 0.0003)
model_1H_history = model_1H.fit(train_features, train_L_onehot,
                                epochs=10, batch_size=32,
                                validation_data=(val_features, val_L_onehot),
                                verbose=2)
To output the confusion matrix, we need to define a new model evaluation function.
def model_evaluate_custom(model, test_F, test_L_one_hot):
    # 1. Get model predictions
    test_L_pred_prob = model.predict(test_F)
    test_L_pred_indices = np.argmax(test_L_pred_prob, axis=1)  # Predicted class indices

    # 2. Convert true one-hot encoded labels to class indices
    if isinstance(test_L_one_hot, pd.DataFrame):
        test_L_true_values = test_L_one_hot.values
    else:  # Assuming NumPy array
        test_L_true_values = test_L_one_hot
    test_L_true_indices = np.argmax(test_L_true_values, axis=1)  # True class indices

    num_classes = test_L_true_values.shape[1]  # Get number of classes from the shape of the labels

    # 3. Calculate Confusion Matrix
    # The 'labels' parameter ensures the confusion matrix dimensions are consistent
    # with the number of classes (0 to num_classes-1)
    cm_labels = np.arange(num_classes)
    cm = confusion_matrix(test_L_true_indices, test_L_pred_indices, labels=cm_labels)

    return cm
Then we use the new model evaluation function to evaluate the trained baseline model, and print the confusion matrix.
conf_matrix = model_evaluate_custom(model_1H, val_features, val_L_onehot)
print("Confusion Matrix:")
print(conf_matrix)
Output:
Confusion Matrix:
[[ 1746 0 1 0 2 90 1 0 2 0 0 5 0 0 1 0 1 0]
 [ 0 5199 36 0 33 1 0 1 4 25 0 0 0 1 0 250 3 0]
 [ 2 21 3342 0 0 7 0 4 0 0 0 0 0 10 5 0 3 5]
 [ 0 0 1 4017 0 1 22 6 0 0 4 1 0 0 0 0 0 2]
 [ 0 0 0 0 4000 0 0 0 0 6 0 0 0 0 0 0 0 0]
 [ 82 1 10 0 0 3198 21 2 5 0 0 0 0 1 2 0 1 3]
 [ 4 1 1 34 0 3 11922 5 2 0 16 0 0 0 0 0 0 0]
 [ 0 4 2 1 0 22 3 1464 0 0 0 0 0 0 0 9 2 0]
 [ 11 2 11 0 0 0 0 0 951 0 0 0 0 2 4 0 1 1]
 [ 0 2 0 0 2 0 0 0 0 474 0 0 0 16 0 1 0 0]
 [ 0 0 1 0 0 0 1 0 0 0 4013 0 0 0 0 0 0 1]
 [ 1 0 0 0 0 0 0 0 0 0 0 1695 0 0 0 0 1 0]
 [ 0 1 0 0 4 0 0 1 0 0 0 0 692 0 0 0 0 0]
 [ 9 18 198 0 0 0 1 0 2 0 0 0 0 3231 19 0 1 0]
 [ 0 0 8 0 0 0 0 0 4 0 0 0 0 0 991 0 0 0]
 [ 0 5 0 0 0 0 0 0 0 0 0 0 0 0 0 1637 3 0]
 [ 0 1 0 0 0 0 0 0 2 0 0 0 0 0 0 0 3902 0]
 [ 0 2 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1007]]
Introducing Precision, Recall, and F1-Score
Indeed, the confusion matrix clearly displays the classification results, detailing the model’s predictions for each category. However, merely looking at these raw counts makes it difficult to intuitively judge whether the model’s performance on different classes is good or bad, especially when dealing with imbalanced datasets. To gain a deeper understanding of the model’s strengths and weaknesses, we need to convert these numbers into more insightful evaluation metrics. Here, we introduce three crucial alternative evaluation metrics: precision, recall, and the F1-score. These metrics provide a more granular view of classification performance beyond just correct versus incorrect predictions.

Precision
Definition: Precision measures how many of the samples predicted as a certain class actually belong to that class. It answers the question: “Of all the instances I predicted as positive, how many were actually positive?”
Formula:
\[Precision = \frac{TP}{TP + FP}\]
Where:
-
TP (True Positive): The number of correctly predicted positive samples.
-
FP (False Positive): The number of samples incorrectly predicted as positive.
Intuitive Explanation:
Precision can be understood as “how reliable the model’s positive predictions are.”
For example, if your model predicts 100 apps as “Moriarty” (positive class),
but only 80 of them are actually “Moriarty,” then the precision is 80%.
When to Use:
Precision is important when false positives are costly, such as:
-
Spam Detection: A low Precision would mean that too many legitimate emails are misclassified as spam.
-
Medical Diagnosis (e.g., cancer screening): If Precision is low, many healthy individuals might be misdiagnosed as having a disease, leading to unnecessary anxiety and medical costs.
Recall
Definition: Recall measures how many of the actual positive samples were correctly identified by the model. It answers the question: “Of all the actual positive instances, how many did I correctly predict?”
Formula:
\[Recall = \frac{TP}{TP + FN}\]
Where:
-
TP (True Positive): The number of correctly predicted positive samples.
-
FN (False Negative): The number of actual positive samples incorrectly classified as negative.
Intuitive Explanation:
Recall can be thought of as “how many real targets were successfully found.”
A high Recall means that the model rarely misses true positive cases.
For example, if there are 100 actual “Moriarty” apps in the dataset,
but your model only detects 60 of them, then the recall is 60%.
When to Use:
Recall is crucial when missing positive cases is costly, such as:
-
Medical Screening: If Recall is low, some actual patients may go undetected, delaying treatment.
-
Security Surveillance (e.g., intrusion detection): If Recall is low, many actual threats might be overlooked, increasing security risks.
F1-Score
Definition: F1-Score is the harmonic mean of Precision and Recall, providing a balance between both metrics.
Formula:
\[F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall}\]
F1-Score is useful when there is an imbalance between Precision and Recall, as it considers both metrics in a single number.
Intuitive Explanation:
F1-Score acts as a “compromise” between Precision and Recall. In scenarios where both false positives and false negatives are undesirable, F1-Score helps evaluate overall model performance.
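For instance, a classifier with Precision = 0.80 and Recall = 0.60 scores
\[F1 = 2 \times \frac{0.80 \times 0.60}{0.80 + 0.60} \approx 0.69,\]
slightly below the simple average of 0.70, because the harmonic mean always pulls the score toward the weaker of the two metrics.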
When to Use:
F1-Score is ideal when both Precision and Recall matter, such as:
-
Information Retrieval (e.g., search engines): The model should not only return accurate results (Precision) but also retrieve all relevant results (Recall).
-
Fraud Detection: The model must avoid both false alarms (Precision) and missed fraud cases (Recall).
Trade-off Between Precision and Recall
-
High Precision: Your model is very strict and only predicts “Moriarty” when it’s very confident. This reduces false positives but may miss some actual “Moriarty” apps (lower recall).
-
High Recall: Your model is more lenient and tries to catch as many “Moriarty” apps as possible. This reduces false negatives but may include more false positives (lower precision).
So, precision and recall often have an inverse relationship:
-
High Precision typically means lower Recall: Your model is very selective and only predicts positive when it’s very confident, which may cause it to miss some actual positives.
-
High Recall typically means lower Precision: Your model is more aggressive in predicting positives, which may include more false positives.
The F1-score helps you find the optimal balance between these two metrics.
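Rather than computing these metrics by hand from the confusion matrix, we can obtain a per-class report directly; the sketch below reuses the predictions of the baseline model from the earlier challenge (variable names follow that solution):
import numpy as np
from sklearn.metrics import classification_report

val_pred_prob = model_1H.predict(val_features)
val_pred_indices = np.argmax(val_pred_prob, axis=1)
val_true_indices = np.argmax(np.asarray(val_L_onehot), axis=1)

# One row of precision, recall, and F1-score per class, plus overall averages
print(classification_report(val_true_indices, val_pred_indices, digits=3))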
Key Points
Data issues like missing values, imbalance, and errors lead to biased models and poor predictions.
Metrics like precision, recall, and F1 score provide class-specific insights into model performance.
Overfitting and underfitting can be detected by comparing training and validation metrics.