Before training a machine learning model, it is essential to understand the dataset through Exploratory Data Analysis (EDA). This step helps identify data distributions, relationships between variables, missing values, and outliers.
In this section, we will explore different data visualization techniques, statistical methods, missing value handling, and outlier detection to prepare data for machine learning.
Data visualization is one of the most effective ways to understand the structure of the data. It helps in identifying data distributions, correlations, and anomalies.
Here are the most common types of visualizations used in data exploration:
A histogram is a bar graph that represents the distribution of numerical data. It helps in identifying the shape of the distribution, skewness, and potential outliers.
Example: Histogram for Age Distribution
import matplotlib.pyplot as plt
import seaborn as sns
# 'data' is assumed to be a pandas DataFrame loaded earlier
# Create histogram
plt.hist(data['age'], bins=20, color='blue', edgecolor='black')
# Labels and title
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.title("Age Distribution")
plt.show()
This histogram shows the frequency of different age groups in the dataset.
A box plot (also called a box-and-whisker plot) shows the median, the interquartile range, and potential outliers.
Example: Box Plot for Salary Distribution
sns.boxplot(x=data['salary'])
A scatter plot helps analyze whether two variables have a relationship (correlation).
Example: Relationship Between Age and Salary
sns.scatterplot(x=data['age'], y=data['salary'])
A heatmap visualizes correlations between numerical variables.
Example: Creating a Heatmap
# Compute the correlation matrix over numeric columns
corr_matrix = data.corr(numeric_only=True)
# Create heatmap
plt.figure(figsize=(10,6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title("Correlation Matrix")
plt.show()
A pair plot visualizes relationships between multiple numerical variables.
Example: Creating a Pair Plot
sns.pairplot(data[['age', 'salary', 'experience']])
While visualizations help us see patterns, statistical methods help us quantify these patterns.
Measures of central tendency (mean, median, and mode) describe the middle value of a dataset.
Example: Calculate Mean, Median, and Mode in Python
print("Mean Salary:", data['salary'].mean())
print("Median Salary:", data['salary'].median())
print("Mode Salary:", data['salary'].mode()[0])
Measures of dispersion (variance and standard deviation) describe how much the data varies around the center.
Example: Calculate Variance and Standard Deviation
print("Variance:", data['salary'].var())
print("Standard Deviation:", data['salary'].std())
Skewness measures the asymmetry of a distribution, and kurtosis measures how heavy its tails are.
Example: Check Skewness and Kurtosis
print("Skewness:", data['salary'].skew())
print("Kurtosis:", data['salary'].kurtosis())
Missing values can negatively impact model accuracy. They must be handled properly.
| Method | When to Use |
|---|---|
| Delete Rows | If missing values are very few and removing them won’t affect analysis. |
| Fill with Mean/Median | When data is numerical and missing values are randomly spread. |
| Fill with Mode | For categorical variables, filling with the most frequent value. |
| Predictive Imputation | Using machine learning to predict missing values based on other features. |
# Delete rows with any missing values
data_cleaned = data.dropna()
# Fill numerical missing values with the mean (assignment avoids the deprecated inplace pattern)
data['salary'] = data['salary'].fillna(data['salary'].mean())
# Fill categorical missing values with the mode
data['city'] = data['city'].fillna(data['city'].mode()[0])
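For predictive imputation, a minimal sketch uses scikit-learn's KNNImputer, which estimates each missing value from the most similar rows (the column names below are assumptions, matching the examples above):
Example: Predictive Imputation with KNNImputer
from sklearn.impute import KNNImputer
# Estimate each missing value from its 5 nearest neighbors (numeric columns only)
imputer = KNNImputer(n_neighbors=5)
numeric_cols = ['age', 'salary', 'experience']
data[numeric_cols] = imputer.fit_transform(data[numeric_cols])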
Outliers are extreme values that can distort model training.
from scipy.stats import zscore
# Flag rows whose salary lies more than 3 standard deviations from the mean
data['zscore'] = zscore(data['salary'])
outliers = data[data['zscore'].abs() > 3]
# IQR method: values beyond 1.5 * IQR from the quartiles are treated as outliers
Q1 = data['salary'].quantile(0.25)
Q3 = data['salary'].quantile(0.75)
IQR = Q3 - Q1
outliers = data[(data['salary'] < Q1 - 1.5 * IQR) | (data['salary'] > Q3 + 1.5 * IQR)]
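Once identified, outliers can be removed or capped. A minimal sketch that caps salaries at the IQR fences computed above, rather than dropping the rows:
# Cap extreme salaries at the lower and upper IQR fences
data['salary'] = data['salary'].clip(lower=Q1 - 1.5 * IQR, upper=Q3 + 1.5 * IQR)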
Once you have explored and preprocessed the data, the next step is to run experiments and evaluate the performance of various machine learning models. This process involves testing different algorithms, hyperparameters, and training strategies to find the most effective model for your data.
In this section, we will cover experimental design, methods for comparing models, evaluation metrics, and hyperparameter tuning.
Experimental design refers to how you organize and conduct your experiments to ensure reliable and consistent results. This step is crucial because the results of your experiments will guide the selection of the best model for your task.
The basic steps involved in experimental design are: define the problem, preprocess the data, train candidate models, evaluate the results, and iterate.
After training multiple models, it's essential to compare them based on their performance. This is where evaluation metrics come in, helping you understand which model best suits your data and problem.
Here are a few methods and techniques for comparing models:
Cross-validation helps you assess how well a model generalizes to unseen data. It splits the dataset into K folds (typically 5 or 10), and the model is trained and evaluated on each fold.
Example: K-Fold Cross-Validation in Python
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
# X and y are the feature matrix and target vector prepared earlier
model = RandomForestClassifier()
scores = cross_val_score(model, X, y, cv=5)
print(f"Cross-validation scores: {scores}")
print(f"Average score: {scores.mean()}")
The holdout method splits the dataset into two subsets: a training set used to fit the model and a test set used to evaluate it on unseen data.
Example: Train-Test Split in Python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Ensembling is the process of combining multiple models to improve performance. The idea is that a combination of models makes fewer errors than any single model. Common ensembling techniques include bagging, boosting, and stacking (see the voting sketch after the example below).
Example: Ensembling with Random Forest
from sklearn.ensemble import RandomForestClassifier
# A random forest is itself an ensemble: a bagged collection of decision trees
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
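To combine different model types rather than many copies of one, scikit-learn provides VotingClassifier. A minimal sketch (the choice of base models here is illustrative):
Example: Ensembling Heterogeneous Models
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
# 'soft' voting averages the predicted class probabilities of the base models
ensemble = VotingClassifier(
    estimators=[('lr', LogisticRegression(max_iter=1000)), ('rf', RandomForestClassifier())],
    voting='soft'
)
ensemble.fit(X_train, y_train)
ensemble_predictions = ensemble.predict(X_test)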
Model evaluation metrics vary based on the type of problem (classification, regression, etc.):
Example: Accuracy (Classification)
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy}")
Example: ROC-AUC (Classification)
from sklearn.metrics import roc_auc_score
# ROC-AUC expects probability scores rather than hard class predictions
probabilities = model.predict_proba(X_test)[:, 1]
roc_auc = roc_auc_score(y_test, probabilities)
print(f"ROC-AUC: {roc_auc}")
Example: Mean Squared Error (Regression)
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, predictions)
print(f"Mean Squared Error: {mse}")
Example: Mean Absolute Error (Regression)
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_test, predictions)
print(f"Mean Absolute Error: {mae}")
Example: R-Squared (Regression)
from sklearn.metrics import r2_score
r2 = r2_score(y_test, predictions)
print(f"R-Squared: {r2}")
Hyperparameters are settings that control the behavior of the machine learning model but are not learned from the data itself (e.g., learning rate, number of trees in a random forest). Properly tuning these can improve the model’s performance.
There are several techniques for tuning hyperparameters:
Grid Search is a brute-force approach where you specify a set of hyperparameters to test, and the algorithm tries all possible combinations.
Example: Grid Search in Python
from sklearn.model_selection import GridSearchCV
param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [10, 20, None]}
grid_search = GridSearchCV(estimator=RandomForestClassifier(), param_grid=param_grid, cv=5)
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
Random Search randomly selects combinations of hyperparameters within specified ranges, providing a faster but less exhaustive search compared to Grid Search.
Example: Random Search in Python
from sklearn.model_selection import RandomizedSearchCV
param_dist = {'n_estimators': [50, 100, 200], 'max_depth': [10, 20, None]}
# n_iter caps how many of the possible combinations are sampled
random_search = RandomizedSearchCV(estimator=RandomForestClassifier(), param_distributions=param_dist, n_iter=5, cv=5, random_state=42)
random_search.fit(X_train, y_train)
print(f"Best parameters: {random_search.best_params_}")
Bayesian Optimization is a more efficient method that models the performance of hyperparameters using probabilistic models. It’s particularly useful when searching large, complex spaces.
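scikit-learn does not include Bayesian optimization out of the box; one minimal sketch uses BayesSearchCV from the scikit-optimize package (an extra dependency assumed here):
Example: Bayesian Optimization with scikit-optimize
from skopt import BayesSearchCV
# Tuples define (low, high) ranges that the optimizer explores adaptively
bayes_search = BayesSearchCV(
    estimator=RandomForestClassifier(),
    search_spaces={'n_estimators': (50, 200), 'max_depth': (5, 30)},
    n_iter=20,
    cv=5
)
bayes_search.fit(X_train, y_train)
print(f"Best parameters: {bayes_search.best_params_}")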
In Azure Machine Learning, every training or evaluation execution is logged as a Run object. This object tracks:
Metrics (e.g., accuracy, loss)
Parameters
Artifacts (e.g., model files, charts)
Logs
You can use the Run object to log custom metrics inside your training script. This is especially useful for:
Monitoring during training
Comparing runs after training
Example: Logging Accuracy
from azureml.core import Run
run = Run.get_context()
run.log("accuracy", 0.91)
You can also log multiple values across epochs or iterations:
# Log a simulated accuracy value once per epoch
for epoch in range(5):
    accuracy = 0.9 + 0.01 * epoch
    run.log("accuracy", accuracy)
Exam scenarios often ask about how to monitor or compare experiments.
Mastering metric logging allows better use of Azure ML Studio for visual diagnostics.
Azure AutoML automates the model selection and tuning process by running multiple trials with different algorithms and hyperparameters.
AutoML can be configured for tasks such as classification, regression, and time series forecasting. Exploratory runs are especially useful when you're not sure which algorithm performs best.
from azureml.train.automl import AutoMLConfig
automl_config = AutoMLConfig(
    task='classification',
    training_data=train_data,
    label_column_name='target',
    iterations=10,
    primary_metric='AUC_weighted',
    compute_target='my-compute',
    experiment_timeout_minutes=30
)
iterations=10: Run 10 different combinations of algorithm + hyperparameters.
primary_metric='AUC_weighted': Optimizes for area under the ROC curve.
experiment_timeout_minutes=30: Limits exploration time.
from azureml.core import Experiment
# ws is the Workspace object for the Azure ML workspace
experiment = Experiment(ws, 'automl_experiment')
run = experiment.submit(automl_config, show_output=True)
Accelerates model prototyping.
Automatically logs and ranks runs by performance.
Enables easy access to best model artifacts and metrics.
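As noted above, the best model is easy to retrieve once the AutoML run completes. In the v1 SDK, get_output() returns the best child run together with its fitted model:
Example: Retrieving the Best AutoML Model
# best_run is the top-ranked child run; fitted_model is the trained pipeline
best_run, fitted_model = run.get_output()
print(best_run.get_metrics())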
While experiment design usually includes steps like define → preprocess → train → evaluate → iterate, the DP-100 exam also emphasizes workflow automation via pipelines.
Pipelines let you chain multiple ML steps (e.g., preprocessing → training → evaluation).
They support reusability, parameterization, and scheduling for production use.
| Step | Description |
|---|---|
| Step 1: Load Data | Use registered datasets or data assets |
| Step 2: Preprocess | Clean and transform data |
| Step 3: Train Model | Fit the model on prepared data |
| Step 4: Evaluate | Log metrics and visualize results |
| Step 5: Register | Store the best model |
These steps are wrapped as PipelineStep objects (e.g., PythonScriptStep) and combined into a pipeline.
from azureml.pipeline.steps import PythonScriptStep
from azureml.pipeline.core import Pipeline
train_step = PythonScriptStep(
    name='train_model',
    script_name='train.py',
    compute_target='my-compute',
    arguments=['--data-path', data_path],  # data_path is assumed to be defined earlier
    source_directory='scripts'
)
pipeline = Pipeline(workspace=ws, steps=[train_step])
pipeline_run = pipeline.submit('my-pipeline-run')
Automating retraining workflows.
Parameterized experiment runs.
Trigger-based training (e.g., new data availability).
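A minimal sketch of scheduling in the v1 SDK: publish the pipeline, then attach a recurrence-based schedule (the names below are illustrative):
Example: Publishing and Scheduling a Pipeline
from azureml.pipeline.core import Schedule, ScheduleRecurrence
# Publishing makes the pipeline callable outside the authoring environment
published = pipeline.publish(name='retraining-pipeline')
# Trigger the published pipeline once a day
recurrence = ScheduleRecurrence(frequency='Day', interval=1)
schedule = Schedule.create(
    ws,
    name='daily-retraining',
    pipeline_id=published.id,
    experiment_name='my-pipeline-run',
    recurrence=recurrence
)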
When should Automated Machine Learning (AutoML) be used instead of custom notebook training in Azure ML?
AutoML should be used when quickly identifying optimal models and hyperparameters for standard ML tasks such as classification, regression, or forecasting.
AutoML automates model selection, preprocessing, and hyperparameter optimization by testing multiple algorithms and configurations automatically. It is ideal during early experimentation or when domain experts want to evaluate baseline models without writing extensive training code.
Custom notebook training is preferred when models require specialized architectures, custom feature engineering pipelines, or frameworks not supported by AutoML.
Demand Score: 82
Exam Relevance Score: 84
What is the purpose of a sweep job in Azure Machine Learning?
A sweep job automates hyperparameter tuning by running multiple training jobs with different parameter combinations.
Sweep jobs evaluate multiple configurations of model hyperparameters to identify the best performing model. Azure ML supports search strategies such as random sampling, grid search, and Bayesian optimization. Each trial run executes a training script with a different parameter set, and the service tracks metrics such as accuracy or loss.
The job stops when predefined criteria are met, such as maximum trials or early termination policies. This process significantly improves model performance compared to manually testing parameters.
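In the v1 SDK, a sweep job is expressed as a HyperDriveConfig. A minimal sketch, assuming script_run_config is an existing ScriptRunConfig whose training script logs an 'accuracy' metric (the parameter names are illustrative):
Example: Sweep Job via HyperDrive
from azureml.train.hyperdrive import (
    HyperDriveConfig, RandomParameterSampling, BanditPolicy,
    PrimaryMetricGoal, choice, uniform
)
# Randomly sample parameter combinations from the search space
sampling = RandomParameterSampling({
    '--learning-rate': uniform(0.001, 0.1),
    '--n-estimators': choice(50, 100, 200)
})
hyperdrive_config = HyperDriveConfig(
    run_config=script_run_config,
    hyperparameter_sampling=sampling,
    policy=BanditPolicy(evaluation_interval=2, slack_factor=0.1),  # early termination
    primary_metric_name='accuracy',
    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
    max_total_runs=20
)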
Demand Score: 79
Exam Relevance Score: 86
Why are notebooks commonly used for experimentation in Azure Machine Learning?
Notebooks provide an interactive environment for data exploration, model development, and experiment tracking.
Azure ML notebooks integrate Python environments, datasets, and compute resources into a single interactive interface. Data scientists can explore datasets, visualize patterns, train models, and immediately evaluate results. Each experiment run can be logged and tracked within the workspace, allowing comparisons between experiments.
Notebooks are particularly useful during early stages of model development where frequent code changes and data exploration are required.
Demand Score: 74
Exam Relevance Score: 78
What advantage does Bayesian optimization provide in Azure ML hyperparameter tuning?
Bayesian optimization efficiently selects promising hyperparameter combinations based on previous trial results.
Unlike grid or random search, Bayesian optimization uses probabilistic models to predict which hyperparameters will improve model performance. Each trial informs the algorithm about which regions of the search space are more promising.
This approach typically requires fewer training runs to achieve optimal results, which reduces compute cost and training time.
Demand Score: 71
Exam Relevance Score: 82
Why should experiment runs be logged and tracked in Azure ML during model development?
Tracking experiment runs enables comparison of model configurations and supports reproducible research.
Each experiment run records parameters, metrics, and outputs such as model artifacts. This allows data scientists to compare model performance across runs and identify the most effective configurations.
Without experiment tracking, it becomes difficult to determine which hyperparameters, datasets, or preprocessing steps produced the best results. Azure ML’s experiment tracking ensures transparency and reproducibility during model development.
Demand Score: 70
Exam Relevance Score: 76