Data Preparation for Machine Learning (ML)

Data Preparation for Machine Learning (ML) Detailed Explanation

Command usage note: AWS CLI snippets in this file are included to teach operational troubleshooting patterns. Always verify exact syntax, IAM permissions, regional availability, and service quotas against current AWS documentation before using commands in production.

Official task alignment for this domain:

Official MLA-C01 task	How this document covers it
Task 1.1: Ingest and store data	S3, EFS, FSx, Kinesis, Flink, Kafka, database extraction, file formats, partitions, and storage tradeoffs
Task 1.2: Transform data and perform feature engineering	Data Wrangler, AWS Glue, DataBrew, Spark, EMR, streaming transformation, encoding, scaling, Feature Store
Task 1.3: Ensure data integrity and prepare data for modeling	Data quality, SageMaker Clarify bias checks, class imbalance, encryption, masking, anonymization, dataset splitting

High-frequency service selection memory:

Scenario clue	Strong first choice	Common distractor
Repeated training reads only a subset of columns	S3 with Parquet or ORC and partitioning	Larger training instance before fixing scan pattern
Real-time event ingestion for feature updates	Kinesis, Kafka-compatible source, or Flink pipeline	Batch-only S3 upload path
Reusable online and offline features	SageMaker Feature Store	Notebook-generated CSV features
Human labels for supervised training	SageMaker Ground Truth or Mechanical Turk	Data Wrangler transformation flow
Protected data or encrypted S3 objects	IAM plus bucket policy plus KMS verification	Disable encryption or grant broad administrator access

S3, Streaming, and File Format Selection for ML Data Ingestion

Exam Radar

Core Priority: MLA-C01 often starts an ML scenario at the data boundary: where the data lands, how it is shaped, and whether the storage format supports the training or feature pipeline. Amazon S3 is the common data lake anchor, while Amazon Kinesis, Amazon Managed Service for Apache Flink, Apache Kafka, Amazon EFS, and Amazon FSx appear when access pattern, latency, or file-system semantics matter.

High Frequency: Expect questions that compare Parquet, JSON, CSV, ORC, Avro, and RecordIO against access patterns. Columnar formats are favored when analytics jobs read selected columns repeatedly; row or text formats appear when ingestion simplicity, interoperability, or stream payload structure is the dominant constraint.

Confusion Alert: Distractors commonly propose changing the model, endpoint, or training instance before proving that the data is reachable, correctly partitioned, and formatted for the consuming job. Another trap is choosing a streaming service for historical batch data or choosing CSV when schema evolution and column pruning are central requirements.

Scenario Logic: In a scalable ML pipeline, the first operational decision is whether the workload is batch, streaming, shared file-system, or low-latency transactional extraction. That choice determines the ingestion service, storage layer, object layout, and downstream validation method.

Version Delta: AWS documentation and the exam guide now use current service names such as Amazon SageMaker AI. Treat the command examples below as version-aware AWS CLI verification patterns and confirm syntax in the active AWS CLI and SageMaker API documentation before production use.

Failure Trigger: Ingestion failures usually surface as missing objects, malformed records, schema drift, throttled reads, insufficient IOPS, partition imbalance, or a training job that cannot mount or read the source path.

Operational Dependency: The data source must satisfy storage durability, access permission, throughput, schema compatibility, and cost requirements before feature engineering or model training can be reliable.

How the Exam Asks It: The stem may describe high-volume JSON events, recurring batch feature extraction from S3, file-system access required by a training framework, or a need to merge RDS and object-store data. The correct answer aligns the service and format with the access pattern.

How Distractors Are Designed: Wrong choices often mix adjacent AWS services: using EBS for shared training data, Lambda for heavy Spark transformations, or Transfer Acceleration to solve a schema problem.

Why the Correct Answer Works: The correct option resolves the first blocking constraint: data movement, storage access semantics, file format efficiency, or scalable read throughput.

High-Value Exam Focus: If the question mentions historical batch training, repeated scans, selected columns, or analytics-style preparation, check S3 layout and file format before changing model code or endpoint infrastructure. Parquet/ORC and partitioning usually beat raw CSV when the bottleneck is scan efficiency.

Practice Question: A team stores 3 TB of clickstream data in S3 and repeatedly trains models that read only 12 of 140 columns. Training is slow because each job scans entire CSV files. Which change best improves repeated training reads?

A. Convert the dataset to Apache Parquet and partition it by training date in S3.
B. Move the files to Amazon EBS Provisioned IOPS volumes attached to one notebook instance.
C. Stream the historical data through Amazon Kinesis Data Streams before training.
D. Increase the endpoint instance size used for online inference.

Correct Answer: A

Explanation: A is correct because Parquet supports column pruning and efficient repeated analytics reads from S3. B creates an attachment and sharing constraint and does not address column scans. C solves real-time ingestion, not historical batch training reads. D changes inference capacity, which is outside the ingestion bottleneck.

Exam Takeaway: Select the storage format and ingestion service from the data access pattern first; distractors often remediate compute or inference before proving the data path.
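The arithmetic behind the practice question can be sketched in a few lines. This is a rough estimate only: it assumes every column takes an equal share of the file and ignores compression, neither of which holds exactly for real Parquet data. The function name `estimated_scan_bytes` is illustrative and not part of any AWS tool.

```python
def estimated_scan_bytes(total_bytes: float, columns_read: int, columns_total: int) -> float:
    """Rough bytes scanned when a columnar reader prunes unused columns.

    Assumes equal-width columns and no compression (a simplification).
    """
    if not 0 < columns_read <= columns_total:
        raise ValueError("columns_read must be between 1 and columns_total")
    return total_bytes * columns_read / columns_total


TB = 1024 ** 4
csv_scan = 3 * TB  # CSV: each training job parses every byte of every row
columnar_scan = estimated_scan_bytes(3 * TB, columns_read=12, columns_total=140)
scan_fraction = columnar_scan / csv_scan
print(f"columnar read is about {scan_fraction:.1%} of the CSV scan")
```

Under these assumptions the columnar read touches under a tenth of the bytes, which is why option A targets the bottleneck and a larger instance does not.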

Atomic Deconstruction - Operational Level

Data ingestion for ML is not only copying bytes. It establishes a contract between the producing system and the training or feature pipeline. S3 object keys, partition prefixes, file format, compression, schema, encryption, and IAM permission all become dependencies for Glue crawlers, SageMaker Processing jobs, EMR Spark jobs, Data Wrangler flows, and Feature Store ingestion.

Batch pipelines usually start with durable storage such as S3 because the same dataset must be replayed for experiment repeatability. Streaming pipelines start with Kinesis or Kafka for ingestion, often with Flink for in-stream processing, because order, event time, and near-real-time processing are stronger requirements than historical replay alone. Shared file-system sources such as EFS or FSx are selected when the training framework expects POSIX-like paths or high-performance file access.

The format choice controls runtime behavior. CSV is easy to inspect but expensive to scan at scale because every read parses whole rows. Parquet and ORC store values column by column with embedded schema and statistics, so Spark, Glue, Athena, or training preparation jobs can read only the required columns. Avro and RecordIO suit record-oriented pipelines where sequential record processing is the dominant pattern.

If this step is skipped, later model failures can be misdiagnosed as algorithm errors. A training job that sees skewed partitions, partial files, wrong delimiters, or missing schema fields may produce poor metrics even when the model code is correct.

Component Specifications

Object	Attribute	Value Range	Default State	Dependency	Failure State
S3 training prefix	Partition layout	Date, tenant, label, feature group, or workload-specific prefix	Flat object listing if unmanaged	Downstream reader must use compatible prefix filters	Full scans, high cost, delayed jobs
File format	Serialization and layout	CSV, JSON, Parquet, ORC, Avro, RecordIO	Often raw CSV or JSON at first landing	Processing engine and schema evolution requirements	Schema mismatch, slow reads, parsing errors
Streaming source	Event ingestion mode	Kinesis, Managed Service for Apache Flink, Kafka-compatible source	No replay or checkpoint unless configured	Producer throughput, retention, checkpoint strategy	Data loss, duplicate processing, lag
Shared file system	Training access path	EFS or FSx mount target/path	Unmounted in isolated training environment	VPC, security group, subnet, IAM or file permissions	Training job cannot read data
IAM data role	Read and decrypt scope	S3 read, KMS decrypt, Glue catalog read where applicable	Deny unless granted	Execution role trust and resource policies	AccessDenied, empty dataset, failed job
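The partition layout row above depends on writers and readers agreeing on a key convention. A minimal sketch, assuming a Hive-style `dt=YYYY-MM-DD` prefix (one common convention, not the only one); the helper names `partition_key` and `keys_in_range` are hypothetical:

```python
from datetime import date


def partition_key(prefix: str, day: date, filename: str) -> str:
    """Build a Hive-style key such as training/dt=2024-01-15/part-0000.parquet."""
    return f"{prefix.rstrip('/')}/dt={day.isoformat()}/{filename}"


def keys_in_range(keys, start: date, end: date):
    """Keep keys whose dt= partition falls within [start, end]."""
    selected = []
    for key in keys:
        for segment in key.split("/"):
            if segment.startswith("dt="):
                day = date.fromisoformat(segment[3:])
                if start <= day <= end:
                    selected.append(key)
                break  # only the first dt= segment is considered
    return selected


keys = [partition_key("training", date(2024, 1, d), "part-0000.parquet") for d in (14, 15, 16)]
recent = keys_in_range(keys, date(2024, 1, 15), date(2024, 1, 16))
print(recent)
```

A reader that filters on the partition prefix lists and opens only the matching objects, which is what avoids the full-scan failure state in the table.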

Step-by-Step Execution Path

Classify the workload before selecting a service: batch replay, real-time stream, shared file-system training, or database extraction. This prevents choosing a service that solves a different latency or access pattern.
Verify the source location and object distribution.

#Official AWS CLI verification; validate command syntax against the active AWS CLI version.  
aws s3 ls s3://example-ml-bucket/training/ --recursive --summarize

Expected state: object count and total size match the upstream handoff. This validates that missing data is not being masked as a model issue.

Inspect schema and file format with a supported processing engine or catalog. For S3-based analytics, use AWS Glue Data Catalog, Glue crawlers, Athena, or Spark jobs as appropriate.
Check the execution role and encryption dependency before running transformation jobs.

#Official AWS CLI verification; account-specific ARNs required.  
aws iam get-role --role-name SageMakerExecutionRole  
aws kms describe-key --key-id alias/example-ml-data-key

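Expected state: the role exists with the expected trust policy, and the key is enabled. Both commands return metadata only; to confirm that the role can read the source path and decrypt the objects, review the role's attached policies and the key policy, or run a test read as that role. Passing this check unlocks processing and training access.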

Run a small validation read before full-scale training. A sample read catches delimiters, compression, schema drift, and corrupt objects before expensive compute is allocated.
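A small validation read can be as simple as parsing the header and the first rows of one file. The sketch below uses only the Python standard library and in-memory text in place of a real S3 object; the function name `validate_sample` and the column names are assumptions for illustration.

```python
import csv
import io


def validate_sample(stream, expected_columns, delimiter=",", max_rows=100):
    """Return a list of problems found in the header and first max_rows data rows."""
    reader = csv.reader(stream, delimiter=delimiter)
    header = next(reader, None)
    if header is None:
        return ["file is empty"]
    problems = []
    missing = [c for c in expected_columns if c not in header]
    if missing:
        problems.append(f"missing columns: {missing}")
    for line_no, row in enumerate(reader, start=2):
        if line_no - 1 > max_rows:
            break
        if len(row) != len(header):
            problems.append(f"line {line_no}: expected {len(header)} fields, got {len(row)}")
    return problems


good = io.StringIO("user_id,clicks,label\n1,3,0\n2,7,1\n")
wrong_delimiter = io.StringIO("user_id;clicks;label\n1;3;0\n")
print(validate_sample(good, ["user_id", "label"]))             # no problems expected
print(validate_sample(wrong_delimiter, ["user_id", "label"]))  # header collapses into one field
```

A wrong delimiter shows up as missing columns because the whole header is read as a single field, which is far cheaper to discover here than inside a failed training job.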

Technical Chain

The producer writes records to S3 or a stream. The storage layer preserves object bytes, metadata, encryption state, and path layout. A processing job then resolves the execution role, opens the source objects or stream shards, parses records through the declared format, and emits transformed data or features. If the role lacks KMS decrypt, parsing never starts. If the file format is inefficient for the access pattern, the job reads unnecessary data and training latency increases. If the stream lacks retention or checkpointing, the consumer cannot replay missed events and the feature pipeline becomes non-repeatable.

Operational Skills Matrix

Task	Precise Command or Path	Verification Standard
Validate S3 source completeness	`aws s3 ls s3://example-ml-bucket/training/ --recursive --summarize`	Object count and total size match the expected ingestion batch
Inspect execution role	`aws iam get-role --role-name SageMakerExecutionRole`	Role exists and is the role used by the processing or training job
Confirm encryption key metadata	`aws kms describe-key --key-id alias/example-ml-data-key`	Key is enabled and accessible to the intended account boundary
Validate cataloged schema	AWS Glue console > Data Catalog > Tables	Columns, partitions, and serialization match the training dataset

Feature Engineering with Glue, Data Wrangler, and SageMaker Feature Store

Exam Radar

Core Priority: Feature engineering turns raw attributes into training signals. MLA-C01 tests whether the candidate can select the right AWS tool for cleaning, encoding, scaling, joining, and storing reusable features.

High Frequency: Questions often mention SageMaker Data Wrangler for visual exploration and transformation, AWS Glue or Spark on EMR for scalable ETL, Glue DataBrew for preparation workflows, and SageMaker Feature Store for online or offline feature reuse.

Confusion Alert: A common distractor is storing engineered features only in a notebook output path when multiple models or online inference require consistent features. Another is using a labeling service for transformation work that belongs in Data Wrangler, Glue, or Spark.

Scenario Logic: The operational question is whether the feature is experimental, batch-transformed, reusable across teams, or needed at low latency during inference. That determines whether a temporary transformation output is enough or whether a feature group is required.

Version Delta: SageMaker Feature Store and Data Wrangler features evolve. Treat console paths and CLI snippets as validation patterns and verify exact API names in current AWS documentation.

Failure Trigger: Bad feature engineering appears as training-serving skew, missing columns, wrong encoding, leakage from target variables, or online inference using a different transformation than training.

Operational Dependency: A feature pipeline depends on source schema, transformation code, feature definitions, event time, record identifier, and the online/offline store selection.

How the Exam Asks It: The stem may describe repeated models needing the same customer features, inference latency requirements, or a team needing lineage from raw columns to transformed features.

How Distractors Are Designed: Wrong answers choose a storage service without feature metadata, run manual notebook transformations with no repeatability, or apply online-only stores when offline training history is required.

Why the Correct Answer Works: The correct choice keeps transformation repeatable and aligns feature storage with training and inference access needs.

High-Value Exam Focus: If the same feature must be used in both training and low-latency inference, think Feature Store with record identifier, event time, and online/offline store alignment. If the feature is a one-time batch cleanup, Glue, Data Wrangler, DataBrew, Spark, or EMR may be enough.

Practice Question: A fraud model needs the same engineered transaction-risk feature for nightly training and millisecond online inference. The team must avoid training-serving skew. What should they use?

A. Save transformed CSV files from a notebook to a local volume.
B. Create a SageMaker Feature Store feature group with offline and online stores.
C. Use Amazon Mechanical Turk to label the risk feature during inference.
D. Store raw transactions only in Amazon S3 and let each application compute features differently.

Correct Answer: B

Explanation: B is correct because a Feature Store feature group can centralize definitions and support both offline training access and online lookup. A is not repeatable or shared. C is for human labeling, not low-latency feature serving. D invites training-serving skew.

Exam Takeaway: When a feature must be reused across training and inference, choose a managed feature store pattern; distractors often leave transformations scattered across notebooks or applications.

Atomic Deconstruction - Operational Level

Feature engineering changes the data distribution the model sees. Scaling, normalization, encoding, binning, tokenization, joins, deduplication, and missing-value treatment must be deterministic enough that training and inference agree. A transformation is operationally valid only when the source columns, transformation code, output schema, and validation evidence are known.

AWS Glue and Spark are strong for large joins, data cleansing, and repeatable batch transformations. SageMaker Data Wrangler is useful when the practitioner needs to explore data, build transformation flows, and export processing logic. SageMaker Feature Store becomes important when feature definitions need identity, event time, online lookup, offline history, and cross-model reuse.

The reasoning matters: encoding categorical values differently at inference than at training changes the numeric representation and can break model behavior. Using different imputation rules can create drift that looks like model decay. Storing features without event time can leak future information into training.
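One way to keep encoding consistent is to learn the category mapping once from training data and reuse that exact mapping at inference. This is a minimal standard-library sketch, not Data Wrangler's or Glue's implementation; `fit_one_hot`, `one_hot`, and the channel values are hypothetical.

```python
def fit_one_hot(values):
    """Learn a stable category-to-index mapping from training values."""
    return {category: i for i, category in enumerate(sorted(set(values)))}


def one_hot(value, mapping):
    """Encode with the training mapping; unseen categories become all zeros."""
    vector = [0] * len(mapping)
    index = mapping.get(value)
    if index is not None:
        vector[index] = 1
    return vector


training_channels = ["web", "mobile", "web", "store"]
mapping = fit_one_hot(training_channels)  # persist this alongside the model artifact
print(one_hot("web", mapping))
print(one_hot("kiosk", mapping))  # category never seen in training
```

Rebuilding the mapping from inference traffic instead of loading the saved one would silently reorder the vector positions, which is the skew the paragraph above warns about.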

Component Specifications

Object	Attribute	Value Range	Default State	Dependency	Failure State
Feature group	Record identifier	Customer ID, transaction ID, device ID, or entity key	Undefined until feature group design	Unique lookup key and schema	Duplicate or missing feature retrieval
Feature group	Event time	Timestamp feature	Required for every feature group	Correct source timestamp and timezone handling	Training leakage or stale online value
Transformation job	Processing engine	Data Wrangler, Glue, Spark, EMR, Lambda for lightweight stream work	Manual notebook logic if unmanaged	Source scale and transformation complexity	Non-repeatable features
Encoding rule	Category handling	One-hot, label, binary, tokenization	Raw string values	Training/inference compatibility	Invalid feature vector or skew
Offline store	Historical feature data	S3-backed history	Disabled unless configured	Feature group storage and permissions	No repeatable training history
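The event time row exists to make point-in-time lookups possible: a training example should see only feature values recorded at or before its label timestamp. The sketch below illustrates that rule on an in-memory history; it is not the Feature Store API, and `point_in_time_value` and the sample values are hypothetical.

```python
from bisect import bisect_right


def point_in_time_value(history, as_of):
    """Return the latest feature value with event_time <= as_of, or None.

    history is a list of (event_time, value) pairs sorted by event_time.
    """
    times = [event_time for event_time, _ in history]
    position = bisect_right(times, as_of)
    return history[position - 1][1] if position else None


# Hypothetical risk-score history for one record identifier.
# Same-format ISO-8601 UTC strings sort chronologically, so string comparison is safe here.
history = [
    ("2024-01-01T00:00:00Z", 0.10),
    ("2024-01-05T00:00:00Z", 0.40),
    ("2024-01-09T00:00:00Z", 0.90),
]
label_time = "2024-01-06T12:00:00Z"
print(point_in_time_value(history, label_time))
```

Joining on the record identifier alone would return the 2024-01-09 value for a label observed on 2024-01-06, which is exactly the future leakage listed as the failure state.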

Step-by-Step Execution Path

Identify whether the feature is temporary, shared, or online-serving critical. This controls whether a batch output path or Feature Store is required.
Validate source schema before transformation.

#Version-aware AWS CLI verification pattern; confirm Glue command syntax for active CLI.  
aws glue get-table --database-name example_ml --name raw_transactions

Expected state: columns required for the feature exist with expected types.

Build or inspect the transformation flow in SageMaker Data Wrangler, AWS Glue, or Spark. Place the cleaning, encoding, and scaling logic before feature storage so the feature store receives consistent values.
Verify feature group metadata when reuse is required.

#Official AWS CLI verification pattern; feature group name is environment-specific.  
aws sagemaker describe-feature-group --feature-group-name transaction-risk-features

Expected state: FeatureGroupStatus is Created, the online/offline store configuration matches access requirements, and RecordIdentifierFeatureName and EventTimeFeatureName point to the intended features.

Run a small training/inference parity check by comparing one entity's offline feature row to its online lookup value through supported SDK/API calls.
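The parity check itself is a field-by-field comparison once both rows are in hand. This sketch assumes the offline and online values have already been fetched and parsed into Python types (online lookups may return values as strings that need casting first); `parity_diff` and the feature names are hypothetical.

```python
import math


def parity_diff(offline_row, online_row, rel_tol=1e-9):
    """Return {feature: (offline, online)} for every feature that disagrees."""
    mismatches = {}
    for name in sorted(set(offline_row) | set(online_row)):
        left, right = offline_row.get(name), online_row.get(name)
        both_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in (left, right)
        )
        # Compare numbers with a tolerance, everything else exactly.
        same = math.isclose(left, right, rel_tol=rel_tol) if both_numeric else left == right
        if not same:
            mismatches[name] = (left, right)
    return mismatches


offline = {"customer_id": "c-42", "txn_count_7d": 12, "risk_score": 0.37}
online = {"customer_id": "c-42", "txn_count_7d": 12, "risk_score": 0.41}
print(parity_diff(offline, online))
```

An empty result for a handful of sampled entities is reasonable evidence that both stores are fed by the same transformation; any mismatch points to divergent pipelines or stale ingestion.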

Technical Chain

The raw dataset enters a transformation engine. The engine applies deterministic cleaning and encoding rules, producing feature columns with defined types. If the features are written to a feature group, SageMaker stores metadata, record identifiers, event timestamps, and online/offline storage paths. Training jobs read historical offline values, while inference code retrieves current online values by record identifier. When transformation rules diverge, the model receives vectors that do not match training distribution, causing degraded predictions even when infrastructure is healthy.

Operational Skills Matrix

Task	Precise Command or Path	Verification Standard
Validate raw schema	`aws glue get-table --database-name example_ml --name raw_transactions`	Required source columns and types exist
Inspect feature group	`aws sagemaker describe-feature-group --feature-group-name transaction-risk-features`	FeatureGroupStatus is Created and store configuration matches use case
Confirm offline store path	SageMaker console > Feature Store > Feature group > Offline store	S3 location is configured for training history
Check transformation lineage	SageMaker Studio > Data Wrangler flow > Export or job details	Transformation steps match approved feature definitions

Data Quality, Bias Detection, and Secure Training Data Preparation

Exam Radar

Core Priority: MLA-C01 treats data integrity as a modeling dependency. Data quality, bias, encryption, anonymization, masking, PII/PHI handling, and data residency can determine whether a training dataset is usable.

High Frequency: Expect SageMaker Clarify for bias and explainability signals, AWS Glue Data Quality and DataBrew for validation, and AWS KMS, IAM, bucket policies, Macie, or classification controls for sensitive data.

Confusion Alert: A frequent trap is retraining or tuning hyperparameters before identifying class imbalance, missing labels, leakage, or protected data exposure. Another trap is encrypting data but leaving the execution role unable to decrypt it.

Scenario Logic: The exam usually gives a symptom: poor minority-class predictions, failed compliance review, rejected training job, or unexpected metric degradation. The correct answer identifies whether the root dependency is data quality, bias, split strategy, or security control.

Version Delta: Bias metric names and managed service capabilities change. Use the current SageMaker Clarify and AWS Glue Data Quality documentation when implementing production checks.

Failure Trigger: Failures appear as invalid rows, high null counts, skewed classes, data leakage, AccessDenied on encrypted objects, noncompliant sensitive fields, or test metrics that do not reflect production populations.

Operational Dependency: Model training requires a clean, representative, correctly split, and authorized dataset. Security and compliance controls must be satisfied before compute access is granted.

How the Exam Asks It: Stems may include class imbalance, PII in source data, data residency requirements, missing-value patterns, or model bias after deployment.

How Distractors Are Designed: Distractors often jump to endpoint monitoring, model registry approval, or larger instances when the actual issue is upstream data integrity.

Why the Correct Answer Works: The correct answer validates and repairs the data dependency before spending effort on modeling.

High-Value Exam Focus: If the stem mentions imbalance, protected groups, PII/PHI, data residency, missing values, leakage, or compliance, solve the data integrity and governance problem first. Tuning, scaling, or endpoint changes are usually downstream distractors.

Practice Question: A binary classifier performs well overall but poorly on a small protected segment. The dataset has severe class imbalance and the compliance team requires measurable pre-training bias evidence. What should the team do first?

A. Use SageMaker Clarify to compute pre-training bias metrics and apply a balancing strategy before retraining.
B. Increase the endpoint instance count to improve prediction quality.
C. Register the current model version and approve it for production.
D. Convert all files from Parquet to CSV to make the data easier to inspect.

Correct Answer: A

Explanation: A directly addresses bias measurement and data imbalance before training. B changes serving capacity, not bias. C promotes a model with known fairness risk. D may simplify viewing but does not provide bias metrics or mitigation.

Exam Takeaway: When the symptom is segment-level or compliance-driven, validate data quality and bias before modifying infrastructure or deployment.

Atomic Deconstruction - Operational Level

Data integrity controls answer whether the model should trust the training set. Quality rules check nulls, ranges, uniqueness, allowed values, freshness, and schema conformance. Bias checks compare label and feature distributions across groups to reveal whether the model may learn distorted relationships. Security checks verify that sensitive fields are protected, classified, masked when required, encrypted at rest, and accessed only by authorized roles.
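Two pre-training bias metrics can be computed by hand to see what a bias report measures. The formulas below follow the class imbalance (CI) and difference in proportions of labels (DPL) definitions as I understand them from the SageMaker Clarify documentation; confirm them there before relying on the numbers. The facet labels and counts are made up.

```python
def class_imbalance(n_a, n_d):
    """CI = (n_a - n_d) / (n_a + n_d); ranges from -1 to 1, 0 means equally sized facets."""
    return (n_a - n_d) / (n_a + n_d)


def difference_in_proportions_of_labels(labels_a, labels_d, positive=1):
    """DPL = q_a - q_d, where q is the share of positive labels in each facet."""
    q_a = sum(1 for y in labels_a if y == positive) / len(labels_a)
    q_d = sum(1 for y in labels_d if y == positive) / len(labels_d)
    return q_a - q_d


# Hypothetical labels: facet a is the larger group, facet d the small protected segment.
labels_a = [1] * 60 + [0] * 40
labels_d = [1] * 2 + [0] * 8
ci = class_imbalance(len(labels_a), len(labels_d))
dpl = difference_in_proportions_of_labels(labels_a, labels_d)
print(f"CI={ci:.3f} DPL={dpl:.3f}")
```

A CI near 1 says the segment is badly under-represented, and a large positive DPL says positive outcomes are much rarer in it; both are visible before any model is trained.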

Dataset splitting is an operational decision, not an academic one. Random splitting can leak entity history across train and test sets, while time-based splitting may be required for forecasting or fraud scenarios. Resampling, augmentation, and synthetic data can reduce imbalance, and shuffling removes ordering effects, but they must be applied only after the leakage and compliance boundaries are understood, and resampling should touch only the training split so that evaluation data still reflects the real class distribution.
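An entity-aware split can be sketched by hashing the entity key, so every row for one customer lands in the same set regardless of row order. This is one possible approach using the standard library, not a SageMaker feature; the function names and the `customer_id` column are assumptions.

```python
import hashlib


def split_for_entity(entity_id: str, test_fraction: float = 0.2) -> str:
    """Assign an entity to 'train' or 'test' deterministically from a hash of its ID."""
    digest = hashlib.sha256(entity_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2 ** 64  # value in [0, 1)
    return "test" if bucket < test_fraction else "train"


def entity_aware_split(rows, key="customer_id", test_fraction=0.2):
    """Split rows so that all rows for one entity land in the same set."""
    train, test = [], []
    for row in rows:
        target = test if split_for_entity(str(row[key]), test_fraction) == "test" else train
        target.append(row)
    return train, test


rows = [{"customer_id": f"c-{i % 50}", "amount": i} for i in range(500)]
train, test = entity_aware_split(rows)
overlap = {r["customer_id"] for r in train} & {r["customer_id"] for r in test}
print(len(train), len(test), len(overlap))
```

Because the assignment depends only on the ID, rerunning the split on refreshed data keeps each customer on the same side, and the overlap between sets stays empty.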

If encryption is configured without matching role permission, the training job fails before reading data. If PII is left in a feature column, the pipeline may violate policy even if the model metrics look strong. If bias is measured only after deployment, the team discovers a preventable data issue too late.
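Sensitive fields can be dropped, pseudonymized, or masked before they ever reach a feature column. The sketch below shows one option for each using the standard library; the field names, the key value, and the choice of HMAC-SHA256 are illustrative assumptions, and the appropriate control depends on the applicable PII/PHI policy.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Replace an identifier with a keyed HMAC-SHA256 digest (stable for joins, not reversible)."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_email(email: str) -> str:
    """Keep the first character and the domain; hide the rest of the local part."""
    local, _, domain = email.partition("@")
    return f"{local[:1]}***@{domain}" if domain else "***"


def protect_row(row, secret_key, drop=("ssn",)):
    """Drop, pseudonymize, or mask sensitive fields in one record."""
    safe = {k: v for k, v in row.items() if k not in drop}
    safe["customer_id"] = pseudonymize(safe["customer_id"], secret_key)
    safe["email"] = mask_email(safe["email"])
    return safe


key = b"example-only-key"  # assumption: a real key would come from a secrets store
row = {"customer_id": "c-42", "email": "alice@example.com", "ssn": "000-00-0000", "amount": 19.5}
print(protect_row(row, key))
```

Keyed hashing keeps the identifier usable as a join key across datasets while removing the raw value from training data.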

Component Specifications

Object	Attribute	Value Range	Default State	Dependency	Failure State
Data quality ruleset	Rule type	Null, range, uniqueness, schema, freshness	No validation unless configured	Glue/DataBrew profiling and expected schema	Corrupt or invalid training rows
Bias report	Metric category	Class imbalance (CI), difference in proportions of labels (DPL), label distribution, group disparity	Not generated by default	Sensitive attributes and label columns	Undetected unfairness or compliance failure
KMS key	Decrypt permission	Allowed or denied per principal	Deny unless granted	Execution role and key policy	Training job AccessDenied
Sensitive field	Protection action	Classify, mask, anonymize, exclude, encrypt	Raw field present	PII/PHI policy and data residency requirements	Policy violation or audit failure
Dataset split	Split method	Random, stratified, time-based, entity-aware	Often ad hoc in notebooks	Model objective and leakage risk	Inflated evaluation metrics

Step-by-Step Execution Path

Profile the dataset before training. This exposes missing values, invalid ranges, and schema drift before compute is allocated.
Run or inspect data quality rules through AWS Glue Data Quality, DataBrew, or a controlled validation job.

#Version-aware AWS CLI verification pattern; confirm Glue Data Quality commands for active CLI.  
aws glue get-data-quality-ruleset --name training_quality_rules

Expected state: rules exist for the columns that feed the model.

Generate bias evidence when the scenario mentions class imbalance, protected groups, or fairness.

#Supported SDK/API verification pattern; implementation commonly uses SageMaker Clarify processing jobs.  
aws sagemaker describe-processing-job --processing-job-name clarify-pretraining-bias-job

Expected state: processing job completed and produced bias report artifacts in the expected S3 path.

Verify encryption and access.

#Official AWS CLI verification pattern.  
aws s3api get-bucket-encryption --bucket example-ml-bucket  
aws kms describe-key --key-id alias/example-ml-data-key

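Expected state: default encryption is configured on the bucket and the key is enabled. These commands do not show role permissions, so confirm separately that the training execution role is allowed kms:Decrypt on the key.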

Confirm split logic and leakage control in the training pipeline code or experiment metadata before interpreting model metrics.

Technical Chain

A validation job reads the dataset schema and content, applies quality rules, and emits pass/fail evidence. A bias job groups labels and features by declared facets and computes metrics that reveal imbalance or disparity. Security controls then determine whether the execution role can read and decrypt the approved dataset. Only after those dependencies pass does the training job receive a trustworthy data matrix. If quality or bias checks are skipped, the model can encode missingness, leakage, or protected attribute imbalance as predictive signal.
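The pass/fail evidence described above can be pictured as a small rule runner. This sketch is plain Python rather than AWS Glue Data Quality syntax; the rule names, `run_quality_rules`, and the sample rows are hypothetical.

```python
def run_quality_rules(rows, required, ranges, unique_key):
    """Evaluate null, range, and uniqueness rules; return {rule_name: passed}."""
    results = {}
    for column in required:
        results[f"not_null:{column}"] = all(r.get(column) is not None for r in rows)
    for column, (low, high) in ranges.items():
        # Nulls are left to the not_null rule; only present values are range-checked.
        results[f"range:{column}"] = all(
            low <= r[column] <= high for r in rows if r.get(column) is not None
        )
    keys = [r.get(unique_key) for r in rows]
    results[f"unique:{unique_key}"] = len(keys) == len(set(keys))
    return results


rows = [
    {"txn_id": 1, "amount": 25.0, "label": 0},
    {"txn_id": 2, "amount": -4.0, "label": 1},     # violates the range rule
    {"txn_id": 2, "amount": 10.0, "label": None},  # duplicate key and null label
]
report = run_quality_rules(
    rows, required=["label"], ranges={"amount": (0, 10_000)}, unique_key="txn_id"
)
print(report)
```

Gating the training step on every rule passing keeps invalid rows from being interpreted later as a modeling problem.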

Operational Skills Matrix

Task	Precise Command or Path	Verification Standard
Inspect quality ruleset	`aws glue get-data-quality-ruleset --name training_quality_rules`	Rules cover required model input columns
Verify Clarify job status	`aws sagemaker describe-processing-job --processing-job-name clarify-pretraining-bias-job`	Job status is completed and output artifacts exist
Check bucket encryption	`aws s3api get-bucket-encryption --bucket example-ml-bucket`	Expected SSE-S3 or SSE-KMS configuration is present
Confirm sensitive data classification	Amazon Macie console > Findings or S3 bucket classification results	PII/PHI findings are reviewed and dispositioned
