Before training a machine learning model, it is essential to understand the dataset through Exploratory Data Analysis (EDA). This step helps identify data distributions, relationships between variables, missing values, and outliers.
In this section, we will explore different data visualization techniques, statistical methods, missing value handling, and outlier detection to prepare data for machine learning.
Data visualization is one of the most effective ways to understand the structure of the data. It helps in identifying data distributions, correlations, and anomalies.
Here are the most common types of visualizations used in data exploration:
A histogram is a bar graph that represents the distribution of numerical data. It helps in identifying the shape of the distribution, skewness, and potential outliers.
Example: Histogram for Age Distribution
import matplotlib.pyplot as plt
import seaborn as sns
# 'data' is assumed to be a pandas DataFrame loaded earlier
# Create histogram
plt.hist(data['age'], bins=20, color='blue', edgecolor='black')
# Labels and title
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.title("Age Distribution")
plt.show()
This histogram shows the frequency of different age groups in the dataset.
A box plot (also called a box-and-whisker plot) shows the median, the interquartile range, and potential outliers.
Example: Box Plot for Salary Distribution
sns.boxplot(x=data['salary'])
A scatter plot helps analyze whether two variables have a relationship (correlation).
Example: Relationship Between Age and Salary
sns.scatterplot(x=data['age'], y=data['salary'])
A heatmap visualizes correlations between numerical variables.
Example: Creating a Heatmap
# Compute the correlation matrix over numeric columns
corr_matrix = data.corr(numeric_only=True)
# Create heatmap
plt.figure(figsize=(10,6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title("Correlation Matrix")
plt.show()
A pair plot visualizes relationships between multiple numerical variables.
Example: Creating a Pair Plot
sns.pairplot(data[['age', 'salary', 'experience']])
While visualizations help us see patterns, statistical methods help us quantify these patterns.
Measures of central tendency (mean, median, and mode) describe the middle value of a dataset.
Example: Calculate Mean, Median, and Mode in Python
print("Mean Salary:", data['salary'].mean())
print("Median Salary:", data['salary'].median())
print("Mode Salary:", data['salary'].mode()[0])
Measures of dispersion (variance and standard deviation) describe how much the data varies around the center.
Example: Calculate Variance and Standard Deviation
print("Variance:", data['salary'].var())
print("Standard Deviation:", data['salary'].std())
Skewness measures the asymmetry of a distribution, and kurtosis measures how heavy its tails are.
Example: Check Skewness and Kurtosis
print("Skewness:", data['salary'].skew())
print("Kurtosis:", data['salary'].kurtosis())
Missing values can negatively impact model accuracy. They must be handled properly.
| Method | When to Use |
|---|---|
| Delete Rows | If missing values are very few and removing them won’t affect analysis. |
| Fill with Mean/Median | When data is numerical and missing values are randomly spread. |
| Fill with Mode | For categorical variables, filling with the most frequent value. |
| Predictive Imputation | Using machine learning to predict missing values based on other features. |
# Delete rows with any missing values
data_cleaned = data.dropna()
# Fill numerical missing values with the mean (assignment avoids the deprecated inplace pattern)
data['salary'] = data['salary'].fillna(data['salary'].mean())
# Fill categorical missing values with the mode
data['city'] = data['city'].fillna(data['city'].mode()[0])
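For predictive imputation, a minimal sketch uses scikit-learn's KNNImputer, which estimates each missing value from the most similar rows (the column names below are assumptions, matching the examples above):
Example: Predictive Imputation with KNNImputer
from sklearn.impute import KNNImputer
# Estimate each missing value from its 5 nearest neighbors (numeric columns only)
imputer = KNNImputer(n_neighbors=5)
numeric_cols = ['age', 'salary', 'experience']
data[numeric_cols] = imputer.fit_transform(data[numeric_cols])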
Outliers are extreme values that can distort model training.
from scipy.stats import zscore
# Flag rows whose salary lies more than 3 standard deviations from the mean
data['zscore'] = zscore(data['salary'])
outliers = data[data['zscore'].abs() > 3]
# IQR method: values beyond 1.5 * IQR from the quartiles are treated as outliers
Q1 = data['salary'].quantile(0.25)
Q3 = data['salary'].quantile(0.75)
IQR = Q3 - Q1
outliers = data[(data['salary'] < Q1 - 1.5 * IQR) | (data['salary'] > Q3 + 1.5 * IQR)]
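Once identified, outliers can be removed or capped. A minimal sketch that caps salaries at the IQR fences computed above, rather than dropping the rows:
# Cap extreme salaries at the lower and upper IQR fences
data['salary'] = data['salary'].clip(lower=Q1 - 1.5 * IQR, upper=Q3 + 1.5 * IQR)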
Once you have explored and preprocessed the data, the next step is to run experiments and evaluate the performance of various machine learning models. This process involves testing different algorithms, hyperparameters, and training strategies to find the most effective model for your data.
In this section, we will cover experimental design, methods for comparing models, evaluation metrics, and hyperparameter tuning.
Experimental design refers to how you organize and conduct your experiments to ensure reliable and consistent results. This step is crucial because the results of your experiments will guide the selection of the best model for your task.
The basic steps involved in experimental design are: define the problem, preprocess the data, train candidate models, evaluate the results, and iterate.
After training multiple models, it's essential to compare them based on their performance. This is where evaluation metrics come in, helping you understand which model best suits your data and problem.
Here are a few methods and techniques for comparing models:
Cross-validation helps you assess how well a model generalizes to unseen data. It splits the dataset into K folds (typically 5 or 10), and the model is trained and evaluated on each fold.
Example: K-Fold Cross-Validation in Python
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
# X and y are the feature matrix and target vector prepared earlier
model = RandomForestClassifier()
scores = cross_val_score(model, X, y, cv=5)
print(f"Cross-validation scores: {scores}")
print(f"Average score: {scores.mean()}")
The holdout method splits the dataset into two subsets: a training set used to fit the model and a test set used to evaluate it on unseen data.
Example: Train-Test Split in Python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Ensembling is the process of combining multiple models to improve performance. The idea is that a combination of models makes fewer errors than any single model. Common ensembling techniques include bagging, boosting, and stacking (see the voting sketch after the example below).
Example: Ensembling with Random Forest
from sklearn.ensemble import RandomForestClassifier
# A random forest is itself an ensemble: a bagged collection of decision trees
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
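To combine different model types rather than many copies of one, scikit-learn provides VotingClassifier. A minimal sketch (the choice of base models here is illustrative):
Example: Ensembling Heterogeneous Models
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
# 'soft' voting averages the predicted class probabilities of the base models
ensemble = VotingClassifier(
    estimators=[('lr', LogisticRegression(max_iter=1000)), ('rf', RandomForestClassifier())],
    voting='soft'
)
ensemble.fit(X_train, y_train)
ensemble_predictions = ensemble.predict(X_test)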
Model evaluation metrics vary based on the type of problem (classification, regression, etc.):
Example: Accuracy (Classification)
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy}")
Example: ROC-AUC (Classification)
from sklearn.metrics import roc_auc_score
# ROC-AUC expects probability scores rather than hard class predictions
probabilities = model.predict_proba(X_test)[:, 1]
roc_auc = roc_auc_score(y_test, probabilities)
print(f"ROC-AUC: {roc_auc}")
Example: Mean Squared Error (Regression)
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, predictions)
print(f"Mean Squared Error: {mse}")
Example: Mean Absolute Error (Regression)
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_test, predictions)
print(f"Mean Absolute Error: {mae}")
Example: R-Squared (Regression)
from sklearn.metrics import r2_score
r2 = r2_score(y_test, predictions)
print(f"R-Squared: {r2}")
Hyperparameters are settings that control the behavior of the machine learning model but are not learned from the data itself (e.g., learning rate, number of trees in a random forest). Properly tuning these can improve the model’s performance.
There are several techniques for tuning hyperparameters:
Grid Search is a brute-force approach where you specify a set of hyperparameters to test, and the algorithm tries all possible combinations.
Example: Grid Search in Python
from sklearn.model_selection import GridSearchCV
param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [10, 20, None]}
grid_search = GridSearchCV(estimator=RandomForestClassifier(), param_grid=param_grid, cv=5)
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
Random Search randomly selects combinations of hyperparameters within specified ranges, providing a faster but less exhaustive search compared to Grid Search.
Example: Random Search in Python
from sklearn.model_selection import RandomizedSearchCV
param_dist = {'n_estimators': [50, 100, 200], 'max_depth': [10, 20, None]}
# n_iter caps how many of the possible combinations are sampled
random_search = RandomizedSearchCV(estimator=RandomForestClassifier(), param_distributions=param_dist, n_iter=5, cv=5, random_state=42)
random_search.fit(X_train, y_train)
print(f"Best parameters: {random_search.best_params_}")
Bayesian Optimization is a more efficient method that models the performance of hyperparameters using probabilistic models. It’s particularly useful when searching large, complex spaces.
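scikit-learn does not include Bayesian optimization out of the box; one minimal sketch uses BayesSearchCV from the scikit-optimize package (an extra dependency assumed here):
Example: Bayesian Optimization with scikit-optimize
from skopt import BayesSearchCV
# Tuples define (low, high) ranges that the optimizer explores adaptively
bayes_search = BayesSearchCV(
    estimator=RandomForestClassifier(),
    search_spaces={'n_estimators': (50, 200), 'max_depth': (5, 30)},
    n_iter=20,
    cv=5
)
bayes_search.fit(X_train, y_train)
print(f"Best parameters: {bayes_search.best_params_}")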
In Azure Machine Learning, every training or evaluation execution is logged as a Run object. This object tracks:
Metrics (e.g., accuracy, loss)
Parameters
Artifacts (e.g., model files, charts)
Logs
You can use the Run object to log custom metrics inside your training script. This is especially useful for:
Monitoring during training
Comparing runs after training
Example: Logging Accuracy
from azureml.core import Run
run = Run.get_context()
run.log("accuracy", 0.91)
You can also log multiple values across epochs or iterations:
# Log a simulated accuracy value once per epoch
for epoch in range(5):
    accuracy = 0.9 + 0.01 * epoch
    run.log("accuracy", accuracy)
Exam scenarios often ask about how to monitor or compare experiments.
Mastering metric logging allows better use of Azure ML Studio for visual diagnostics.
Azure AutoML automates the model selection and tuning process by running multiple trials with different algorithms and hyperparameters.
AutoML can be configured for tasks such as classification, regression, and time series forecasting. Exploratory runs are especially useful when you're not sure which algorithm performs best.
from azureml.train.automl import AutoMLConfig
automl_config = AutoMLConfig(
    task='classification',
    training_data=train_data,
    label_column_name='target',
    iterations=10,
    primary_metric='AUC_weighted',
    compute_target='my-compute',
    experiment_timeout_minutes=30
)
iterations=10: Run 10 different combinations of algorithm + hyperparameters.
primary_metric='AUC_weighted': Optimizes for area under the ROC curve.
experiment_timeout_minutes=30: Limits exploration time.
from azureml.core import Experiment
# ws is the Workspace object for the Azure ML workspace
experiment = Experiment(ws, 'automl_experiment')
run = experiment.submit(automl_config, show_output=True)
Accelerates model prototyping.
Automatically logs and ranks runs by performance.
Enables easy access to best model artifacts and metrics.
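As noted above, the best model is easy to retrieve once the AutoML run completes. In the v1 SDK, get_output() returns the best child run together with its fitted model:
Example: Retrieving the Best AutoML Model
# best_run is the top-ranked child run; fitted_model is the trained pipeline
best_run, fitted_model = run.get_output()
print(best_run.get_metrics())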
While experiment design usually includes steps like define → preprocess → train → evaluate → iterate, the DP-100 exam also emphasizes workflow automation via pipelines.
Pipelines let you chain multiple ML steps (e.g., preprocessing → training → evaluation).
They support reusability, parameterization, and scheduling for production use.
| Step | Description |
|---|---|
| Step 1: Load Data | Use registered datasets or data assets |
| Step 2: Preprocess | Clean and transform data |
| Step 3: Train Model | Fit the model on prepared data |
| Step 4: Evaluate | Log metrics and visualize results |
| Step 5: Register | Store the best model |
These steps are wrapped as PipelineStep objects (e.g., PythonScriptStep) and combined into a pipeline.
from azureml.pipeline.steps import PythonScriptStep
from azureml.pipeline.core import Pipeline
train_step = PythonScriptStep(
    name='train_model',
    script_name='train.py',
    compute_target='my-compute',
    arguments=['--data-path', data_path],  # data_path is assumed to be defined earlier
    source_directory='scripts'
)
pipeline = Pipeline(workspace=ws, steps=[train_step])
pipeline_run = pipeline.submit('my-pipeline-run')
Automating retraining workflows.
Parameterized experiment runs.
Trigger-based training (e.g., new data availability).
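A minimal sketch of scheduling in the v1 SDK: publish the pipeline, then attach a recurrence-based schedule (the names below are illustrative):
Example: Publishing and Scheduling a Pipeline
from azureml.pipeline.core import Schedule, ScheduleRecurrence
# Publishing makes the pipeline callable outside the authoring environment
published = pipeline.publish(name='retraining-pipeline')
# Trigger the published pipeline once a day
recurrence = ScheduleRecurrence(frequency='Day', interval=1)
schedule = Schedule.create(
    ws,
    name='daily-retraining',
    pipeline_id=published.id,
    experiment_name='my-pipeline-run',
    recurrence=recurrence
)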
When should Automated Machine Learning (AutoML) be used instead of custom notebook training in Azure ML?
AutoML should be used when quickly identifying optimal models and hyperparameters for standard ML tasks such as classification, regression, or forecasting.
AutoML automates model selection, preprocessing, and hyperparameter optimization by testing multiple algorithms and configurations automatically. It is ideal during early experimentation or when domain experts want to evaluate baseline models without writing extensive training code.
Custom notebook training is preferred when models require specialized architectures, custom feature engineering pipelines, or frameworks not supported by AutoML.
Demand Score: 82
Exam Relevance Score: 84
What is the purpose of a sweep job in Azure Machine Learning?
A sweep job automates hyperparameter tuning by running multiple training jobs with different parameter combinations.
Sweep jobs evaluate multiple configurations of model hyperparameters to identify the best performing model. Azure ML supports search strategies such as random sampling, grid search, and Bayesian optimization. Each trial run executes a training script with a different parameter set, and the service tracks metrics such as accuracy or loss.
The job stops when predefined criteria are met, such as maximum trials or early termination policies. This process significantly improves model performance compared to manually testing parameters.
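In the v1 SDK, a sweep job is expressed as a HyperDriveConfig. A minimal sketch, assuming script_run_config is an existing ScriptRunConfig whose training script logs an 'accuracy' metric (the parameter names are illustrative):
Example: Sweep Job via HyperDrive
from azureml.train.hyperdrive import (
    HyperDriveConfig, RandomParameterSampling, BanditPolicy,
    PrimaryMetricGoal, choice, uniform
)
# Randomly sample parameter combinations from the search space
sampling = RandomParameterSampling({
    '--learning-rate': uniform(0.001, 0.1),
    '--n-estimators': choice(50, 100, 200)
})
hyperdrive_config = HyperDriveConfig(
    run_config=script_run_config,
    hyperparameter_sampling=sampling,
    policy=BanditPolicy(evaluation_interval=2, slack_factor=0.1),  # early termination
    primary_metric_name='accuracy',
    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
    max_total_runs=20
)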
Demand Score: 79
Exam Relevance Score: 86
Why are notebooks commonly used for experimentation in Azure Machine Learning?
Notebooks provide an interactive environment for data exploration, model development, and experiment tracking.
Azure ML notebooks integrate Python environments, datasets, and compute resources into a single interactive interface. Data scientists can explore datasets, visualize patterns, train models, and immediately evaluate results. Each experiment run can be logged and tracked within the workspace, allowing comparisons between experiments.
Notebooks are particularly useful during early stages of model development where frequent code changes and data exploration are required.
Demand Score: 74
Exam Relevance Score: 78
What advantage does Bayesian optimization provide in Azure ML hyperparameter tuning?
Bayesian optimization efficiently selects promising hyperparameter combinations based on previous trial results.
Unlike grid or random search, Bayesian optimization uses probabilistic models to predict which hyperparameters will improve model performance. Each trial informs the algorithm about which regions of the search space are more promising.
This approach typically requires fewer training runs to achieve optimal results, which reduces compute cost and training time.
Demand Score: 71
Exam Relevance Score: 82
Why should experiment runs be logged and tracked in Azure ML during model development?
Tracking experiment runs enables comparison of model configurations and supports reproducible research.
Each experiment run records parameters, metrics, and outputs such as model artifacts. This allows data scientists to compare model performance across runs and identify the most effective configurations.
Without experiment tracking, it becomes difficult to determine which hyperparameters, datasets, or preprocessing steps produced the best results. Azure ML’s experiment tracking ensures transparency and reproducibility during model development.
Demand Score: 70
Exam Relevance Score: 76