Model Optimization

Cross-Validation Techniques

Learning Outcome

Understand what cross-validation is.

Understand why cross-validation is important in machine learning.

Describe different cross-validation techniques.

Implement K-Fold Cross-Validation using Scikit-Learn.

Understand the role of Stratified K-Fold for classification tasks.

Use cross-validation to evaluate machine learning models more reliably.

Recall: Train-Test Split

Basic evaluation method:

Divide dataset into two subsets

Train on one, test on the other

Example split: 80% training set, 20% testing set

Limitations:

Model performance may depend heavily on the specific split.

A lucky split may produce very high accuracy.

A poor split may produce low accuracy even if the model is good.

This issue motivates the use of cross-validation, which evaluates a model across multiple data splits.
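The split dependence described above can be observed directly. The sketch below is illustrative only: it assumes the Iris dataset and a logistic regression model (the same ones used later in this deck), and the seeds 0 to 9 and the helper name `holdout_accuracy` are arbitrary choices, not part of the original slides. It repeats an 80/20 split with different seeds and reports the spread in accuracy.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

def holdout_accuracy(seed):
    """Accuracy of one 80/20 train-test split made with the given seed."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = LogisticRegression(max_iter=200)
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)

# Seeds 0-9 are arbitrary; each one yields a different 80/20 split
accuracies = [holdout_accuracy(seed) for seed in range(10)]
print("Lowest accuracy :", min(accuracies))
print("Highest accuracy:", max(accuracies))
```

Any gap between the lowest and highest value comes only from which samples landed in the test set, since the model and data are identical in every run.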

Imagine you are preparing for an important exam.

You take one practice test:

You score: 85%

But does that really prove you're ready?

Maybe that test only covered topics you already knew well!

Now imagine taking many different practice tests...

Some focus on algebra

Some test probability

Some include harder questions

After taking several tests, your average score gives a much clearer picture of your true preparation

Machine learning models face the same challenge:

Evaluating a model once may give a misleading result

Evaluating it multiple times on different data splits reveals its real performance

That repeated evaluation process is called Cross-Validation

Cross-Validation is a statistical technique used to estimate how well a machine learning model will perform on unseen data

It splits data into multiple subsets and evaluates the model multiple times:

For each iteration:

The model is trained on a portion of the data.
The remaining portion is used for testing.
The performance score is recorded.
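The three steps of each iteration can be written out as an explicit loop. This is a minimal sketch, assuming the Iris dataset, a logistic regression model, and five shuffled splits produced by Scikit-Learn's `KFold` (covered later in this deck); the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)

# Assumption: 5 shuffled splits; random_state only fixes the shuffle
kf = KFold(n_splits=5, shuffle=True, random_state=0)

scores = []
for train_idx, test_idx in kf.split(X):
    model = LogisticRegression(max_iter=200)
    model.fit(X[train_idx], y[train_idx])            # train on a portion of the data
    score = model.score(X[test_idx], y[test_idx])    # test on the remaining portion
    scores.append(score)                             # record the performance score

print("Scores per iteration:", np.round(scores, 3))
print("Mean score:", np.mean(scores))
```

A fresh model is created inside the loop so that no iteration benefits from what an earlier iteration learned.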

Why Cross-Validation?

The Problem with Single Split

• Results highly dependent on random split

• Unreliable for small datasets

• Risk of overfitting to specific split

Overfitting Risk

Model memorizes training data quirks instead of learning patterns

Cross-Validation Solution

• Multiple train-test iterations

• Uses the entire dataset: every sample is used for training in some iterations and for testing in another

• Provides robust performance estimates

Holdout Validation

Simplest Evaluation Method

Dataset divided into two subsets for training and testing

Example split: 80% training set, 20% testing set

Advantages

• Simple to implement

• Fast computation

Limitations

• Results depend on split

• Unreliable for small datasets

These limitations lead to more reliable methods like K-Fold Cross-Validation.

K-Fold Cross-Validation

How It Works

Divide the dataset into k folds of (approximately) equal size

Example (k = 5)

Train on 4 folds → Test on 1 fold
Repeat 5 times, rotating the test fold
Average performance across all folds
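To make the fold rotation concrete, the sketch below runs Scikit-Learn's `KFold` on ten toy samples (an assumed example, not from the original slides) and prints which indices are used for training and testing in each of the 5 iterations.

```python
import numpy as np
from sklearn.model_selection import KFold

data = np.arange(10)      # ten toy samples with indices 0..9
kf = KFold(n_splits=5)    # no shuffling, so each fold is a contiguous block

folds = []
for i, (train_idx, test_idx) in enumerate(kf.split(data), start=1):
    folds.append((train_idx.tolist(), test_idx.tolist()))
    print(f"Iteration {i}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

Each sample appears in the test fold exactly once and in the training folds the other four times.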

Advantages

Uses entire dataset

Reliable evaluation

Reduces the bias introduced by any single split

Multiple iterations

Typical Values

k = 5 or 10

Implementation in Python (Scikit-Learn)

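Implementing cross-validation with Scikit-Learn follows the same steps as training a model: import the libraries, load the data, and create the model. The difference is that the model is passed to cross_val_score, which fits and scores it once per fold, instead of calling fit directly.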

Step 1 — Import required libraries

from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

Step 2 — Load the dataset

X, y = load_iris(return_X_y=True)

Step 3 — Initialize the model

model = LogisticRegression(max_iter=200)

Step 4 — Apply K-Fold Cross-Validation

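# Iris is ordered by class, so shuffle before splitting; random_state keeps the folds reproducible
kf = KFold(n_splits=5, shuffle=True, random_state=42)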

scores_kf = cross_val_score(model, X, y, cv=kf)

print("K-Fold Scores:", scores_kf)
print("Mean Accuracy:", scores_kf.mean())

Step 5 — Apply Stratified K-Fold Cross-Validation

skf = StratifiedKFold(n_splits=5)

scores_skf = cross_val_score(model, X, y, cv=skf)

print("Stratified K-Fold Scores:", scores_skf)
print("Mean Accuracy:", scores_skf.mean())

Stratified K-Fold Cross-Validation