Prepare and process data

Prepare and process data Detailed Explanation

Fast review map for this domain:

Exam signal	First object to inspect	Correct-answer pattern
Source data arrives with different shapes	Ingestion tool, extraction type, file type	Choose Lakeflow Connect, notebooks, ADF, SQL, CDC, streaming, or Auto Loader based on source and change pattern
Tables must support history and performance	Delta/Iceberg/Parquet choice, SCD, partition, clustering	Design model and storage layout before transformation code
Data must be transformed and loaded reliably	SQL/Python transformations, MERGE/INSERT/APPEND	Match operation semantics to dedupe, upsert, or append-only pattern
Bad data must be stopped or quarantined	Schema enforcement, schema drift, expectations	Use explicit quality constraints instead of downstream report fixes

flowchart LR  
    N1[Source system] --> N2  
    N2[Ingestion pattern] --> N3  
    N3[Unity Catalog model] --> N4  
    N4[Transformation logic] --> N5  
    N5[Quality control] --> N6  
    N6[Delta table]

Design Unity Catalog data modeling for ingestion, history, and performance

Exam Radar

Core Priority: Data preparation questions test whether source-change pattern, table model, transformation operation, and quality rule match the target data contract.
High Frequency: Expect extraction type, file type, Lakeflow Connect, notebooks, ADF, batch, streaming, CDC, Event Hubs, Auto Loader, CTAS, COPY INTO, MERGE, INSERT, APPEND, and expectations.
Confusion Alert: Do not replace a modeling or quality decision with compute tuning; a fast bad load is still a bad load.
Scenario Logic: Classify source arrival and target grain before selecting ingestion, transformation, loading, schema, or quality controls.
Version Delta: This topic remains in the Microsoft DP-750 skills measured from March 11, 2026 under Prepare and process data; answer choices should use current Azure Databricks, Unity Catalog, Lakeflow, Azure Monitor, and Microsoft Entra terminology.
Failure Trigger: The failure appears as missed CDC records, duplicate current rows, schema drift breaks, invalid values in curated tables, or queries scanning too much data.
Operational Dependency: Business key, checkpoint, schema location, table format, SCD design, validation rule, and target ownership decide the correct operation.
How the Exam Asks It: The exam asks which data operation changes the right state: discovery progress, target row state, table history, or quality outcome.
How Distractors Are Designed: Wrong answers use one-time SQL loads for streams, append for upserts, comments for enforcement, or joins for set-difference requirements.
Why the Correct Answer Works: The correct answer protects the data contract and proves it with row counts, table history, expectation metrics, or profile evidence.

Practice Question: A customer dimension must preserve previous address values and support point-in-time reporting. Which modeling choice is most important?
A. Use an SCD Type 2 or temporal history design with effective date columns.
B. Use CSV files because they are easy to inspect.
C. Increase the number of streaming triggers.
D. Remove all partition columns so writes are faster.
Correct Answer: A.
Explanation: A is correct because preserving historical versions is a modeling requirement. B changes file convenience, not history. C affects pipeline cadence. D may reduce write complexity but does not record prior versions. Exam Takeaway: Select the object that owns the dependency; the distractor pattern is an adjacent Databricks feature that is technically real but does not satisfy the scenario's first blocking condition.

Atomic Deconstruction - Operational Level

Data preparation questions are about semantics before syntax. The first decision is the shape of change: full load, incremental load, CDC, stream, upsert, append-only event, historical dimension, or quality-gated curated table.

The why-layer sits in the data contract. Wrong grain creates double counting. A missing checkpoint causes reprocessing or data loss. A weak MERGE key creates duplicate current records. Schema drift handling is not a substitute for quality rules, and table comments do not enforce constraints.

For exam reasoning, bind the operation to the data state it changes: Auto Loader records file discovery, MERGE changes matched and unmatched rows, expectations classify records, clustering changes file skipping, and deletion vectors affect row-level delete mechanics.

Component Specifications

Object	Attribute	Value Range	Default State	Dependency	Failure State
Extraction pattern	Change capture style	Full, incremental, CDC, streaming	Unknown until source analyzed	Source timestamp, key, or change feed	Pipeline reloads too much data or misses updates
Table format	Physical and transaction model	Delta, Parquet, CSV, JSON, Iceberg	Scenario dependent	Query engine support and transaction requirement	No ACID behavior or poor query capability
Partition scheme	Directory pruning attribute	Low to moderate cardinality time or domain column	No explicit partition unless defined	Query filter pattern and file-size strategy	Too many small partitions or no pruning benefit
SCD design	Dimension history type	Type 1 overwrite or Type 2 history most common	No history unless modeled	Business requirement for historical reporting	Reports cannot reconstruct prior state
Liquid clustering/Z-order	Data skipping strategy	Clustered columns or ZORDER columns	Unoptimized until configured/run	Delta table size and query filter pattern	High scan cost despite correct table schema

Step-by-Step Execution Path

Read the scenario for business grain, history requirement, source change signal, and most common query filter.
Choose extraction style before table design because full reload, incremental, CDC, and streaming create different merge and quality requirements.
Select Delta when ACID transactions, MERGE, schema enforcement, optimization, or Unity Catalog governance are required.
Choose SCD Type 1 when only current values matter; choose SCD Type 2 or temporal design when prior state must be queried.
Design partitioning only for stable, low-cardinality pruning patterns; use clustering features for high-cardinality query acceleration where appropriate.
Validate the resulting table with DESCRIBE DETAIL, query profile, and sample point-in-time queries.

Exam implementation pattern:

When to use: the stem asks for extraction type, file type, loading method, table format, partitioning, SCD, temporal history, or clustering strategy.
Minimal syntax: document source-change pattern and target grain, then validate table metadata with DESCRIBE DETAIL and history/grain checks.
What to verify: table format, partition columns, SCD fields, temporal columns, managed/external lifecycle, and query profile evidence.
Common wrong answer: choosing a load command before defining history, grain, and table format.

Command confidence note: Commands shown in this section are verification-oriented examples. Validate exact Databricks CLI syntax against the active CLI and workspace version before using it as an authoritative production procedure.

Technical Chain

The chain begins with source state: file arrival, CDC record, database extract, event offset, or source table snapshot. The ingestion tool records progress or creates a bounded load before transformation code acts on the data.

The modeled table then enforces grain, format, history, schema, and quality rules. MERGE uses a business key to change target rows; expectations accept, reject, or fail records; clustering and optimization alter file layout for query execution rather than business meaning.

If the source-change signal, row grain, or quality rule is wrong, later SQL can still run while producing incorrect analytics. DP-750 therefore rewards answers that validate the data contract before tuning presentation or compute.

Exam Trap Summary: Do not choose partitioning, SCD type, or file format before the source-change pattern, business grain, and history requirement are known.

Operational Skills Matrix

Task	Precise Command or Path	Verification Standard
Validate table format	SQL verification: `DESCRIBE DETAIL <catalog>.<schema>.<table>;`	Format and location match Delta or selected table requirement
Validate history columns	SQL verification: `DESCRIBE TABLE <catalog>.<schema>.<dimension_table>;`	Effective dates, current flag, or temporal fields exist when history is required
Validate partition choice	SQL verification: `DESCRIBE DETAIL <table>;` and inspect partitionColumns	Partition columns match frequent filters and avoid high-cardinality explosion
Validate clustering evidence	Query profile or table optimization history	Filtered queries skip data or scan fewer files after optimization

Choose managed versus unmanaged tables, granularity, liquid clustering, Z-ordering, and deletion vectors

Exam Radar

Core Priority: Table lifecycle, row grain, clustering, and deletion-vector choices determine whether Delta data is governed, queryable, and maintainable.
High Frequency: Expect design prompts about storage ownership, high-cardinality filters, point-in-time rows, privacy deletes, and file-layout cost.
Confusion Alert: Do not treat partitioning, Z-ordering, liquid clustering, and deletion vectors as interchangeable performance buttons.
Scenario Logic: Start with lifecycle and grain, then choose layout or Delta features from query predicates and delete/update behavior.
Version Delta: This topic remains in the Microsoft DP-750 skills measured from March 11, 2026 under Prepare and process data; answer choices should use current Azure Databricks, Unity Catalog, Lakeflow, Azure Monitor, and Microsoft Entra terminology.
Failure Trigger: The failure appears as retained files being dropped, double counting from unclear grain, excessive scans, or delete-heavy tables rewriting too much data.
Operational Dependency: External location, storage credential, table properties, downstream reader compatibility, and query filter pattern must match the design.
How the Exam Asks It: The exam asks which table type, grain, clustering strategy, or delete feature fits a concrete business and workload requirement.
How Distractors Are Designed: Wrong answers create namespace sprawl, remove optimization, or run MERGE without a valid business key.
Why the Correct Answer Works: The correct answer preserves lifecycle intent first and then chooses the layout feature that changes observable query or delete behavior.

Practice Question: A Delta table stores high-volume click events. Queries usually filter by customer_id and event_date, while deletes are frequent because of privacy requests. The team also needs a clear decision about whether table files are governed by Unity Catalog storage. Which design work is most important?
A. Document table grain, choose managed or unmanaged lifecycle, and select liquid clustering/Z-ordering or deletion vectors based on query and delete patterns.
B. Create a separate catalog for every customer_id.
C. Run a MERGE with no business key because Delta supports transactions.
D. Disable all table optimization so writes are simpler.
Correct Answer: A.
Explanation: A is correct because the scenario combines lifecycle ownership, row grain, query pruning, and delete behavior. B creates namespace explosion. C ignores merge keys and grain. D sacrifices performance and does not solve delete-heavy behavior. Exam Takeaway: Select the object that owns the dependency; the distractor pattern is an adjacent Databricks feature that is technically real but does not satisfy the scenario's first blocking condition.

Atomic Deconstruction - Operational Level

Component Specifications

Object	Attribute	Value Range	Default State	Dependency	Failure State
Managed table	Storage lifecycle	Unity Catalog-managed storage and metadata	Created in managed storage when no external location is specified	Catalog/schema storage location and table owner	DROP behavior removes data unexpectedly for teams expecting retained external files
Unmanaged table	External data lifecycle	Metadata points to external storage path	Requires explicit external location	Storage credential, external location, and cloud RBAC	Metadata exists but files cannot be read or retained correctly
Granularity	Business row grain	Transaction, event, daily snapshot, aggregate, or dimension version	Implicit and often undocumented	Reporting requirement and merge key	Facts double count or dimensions cannot answer point-in-time questions
Liquid clustering	Adaptive clustering strategy	Cluster by selected columns where supported	Not applied unless configured and optimized	Delta table feature support and query filter pattern	Queries scan excessive files even with correct columns
Deletion vectors	Delta row-level delete tracking	Enabled or disabled according to table feature support	Feature and runtime dependent	Delta feature compatibility and downstream readers	Delete-heavy workloads rewrite more files or external readers become incompatible

Step-by-Step Execution Path

State the row grain before creating the table: one event, one customer snapshot, one daily aggregate, or one dimension version. This prevents later joins and MERGE operations from using the wrong key.
Choose managed tables when Unity Catalog should own both metadata and storage lifecycle; choose unmanaged/external tables when an approved external path must retain independent lifecycle control.
Inspect query predicates before choosing partitioning, liquid clustering, or Z-ordering; high-cardinality filters usually need clustering-style data skipping rather than one directory per value.
Assess deletion vectors for delete-heavy Delta workloads only after confirming reader compatibility and runtime support.
Use an optimization rehearsal in a non-production table before applying layout changes to a large production dataset.
Validate table properties, location, partition columns, and query profile evidence after the design change.

Exam implementation pattern:

When to use: lifecycle ownership, external path retention, row grain, clustering, Z-ordering, or deletion-vector behavior is the design issue.
Minimal syntax: inspect table type and properties with DESCRIBE EXTENDED, DESCRIBE DETAIL, or SHOW TBLPROPERTIES.
What to verify: managed or external location, business grain, clustering columns, Delta table features, and reader compatibility.
Common wrong answer: using partitioning, liquid clustering, Z-ordering, and deletion vectors as interchangeable tuning switches.

Technical Chain

Exam Trap Summary: Do not choose clustering or deletion vectors before table lifecycle, grain, query predicates, and reader compatibility are known.

Operational Skills Matrix

Task	Precise Command or Path	Verification Standard
Validate managed/unmanaged lifecycle	SQL verification: `DESCRIBE EXTENDED <catalog>.<schema>.<table>;`	Location and table type match storage lifecycle requirement
Validate row grain	SQL verification against business key counts and timestamp grain	Expected key has one row per declared grain or intentional versioning columns
Validate clustering or Z-order evidence	Query profile and Delta table history after optimization	Filtered queries show reduced scanned files or data skipping evidence
Validate deletion-vector suitability	SQL verification: `SHOW TBLPROPERTIES <catalog>.<schema>.<table>;`	Delta table features and downstream compatibility match delete-heavy workload requirement

Ingest batch, streaming, CDC, and Event Hubs data into Unity Catalog

Exam Radar

Core Priority: Data preparation questions test whether source-change pattern, table model, transformation operation, and quality rule match the target data contract.
High Frequency: Expect extraction type, file type, Lakeflow Connect, notebooks, ADF, batch, streaming, CDC, Event Hubs, Auto Loader, CTAS, COPY INTO, MERGE, INSERT, APPEND, and expectations.
Confusion Alert: Do not replace a modeling or quality decision with compute tuning; a fast bad load is still a bad load.
Scenario Logic: Classify source arrival and target grain before selecting ingestion, transformation, loading, schema, or quality controls.
Version Delta: This topic remains in the Microsoft DP-750 skills measured from March 11, 2026 under Prepare and process data; answer choices should use current Azure Databricks, Unity Catalog, Lakeflow, Azure Monitor, and Microsoft Entra terminology.
Failure Trigger: The failure appears as missed CDC records, duplicate current rows, schema drift breaks, invalid values in curated tables, or queries scanning too much data.
Operational Dependency: Business key, checkpoint, schema location, table format, SCD design, validation rule, and target ownership decide the correct operation.
How the Exam Asks It: The exam asks which data operation changes the right state: discovery progress, target row state, table history, or quality outcome.
How Distractors Are Designed: Wrong answers use one-time SQL loads for streams, append for upserts, comments for enforcement, or joins for set-difference requirements.
Why the Correct Answer Works: The correct answer protects the data contract and proves it with row counts, table history, expectation metrics, or profile evidence.

Practice Question: JSON files continuously land in ADLS Gen2 and must be incrementally discovered with schema evolution handling in a Databricks pipeline. Which ingestion mechanism best fits?
A. Lakeflow Spark Declarative Pipelines using Auto Loader with checkpoint and schema locations.
B. A one-time CTAS statement without checkpointing.
C. Manual upload through the workspace UI for each file.
D. A SQL warehouse size increase.
Correct Answer: A.
Explanation: A is correct because Auto Loader is built for incremental file discovery and streaming-style ingestion with checkpoints. B is a one-time load pattern. C is manual and not reliable. D affects query serving, not file discovery. Exam Takeaway: Select the object that owns the dependency; the distractor pattern is an adjacent Databricks feature that is technically real but does not satisfy the scenario's first blocking condition.

Atomic Deconstruction - Operational Level

Component Specifications

Object	Attribute	Value Range	Default State	Dependency	Failure State
Lakeflow Connect	Managed connector ingestion	Supported source and batch/streaming mode	Not configured	Connector availability and source credentials	Source ingestion requires custom code unnecessarily
Notebook ingestion	Programmable ingestion logic	Python, SQL, or Scala operations	Author-created	Compute, libraries, and source access	Logic becomes hard to schedule or govern if not job-managed
Azure Data Factory	Orchestration and movement	Pipeline activity and linked service patterns	External to Databricks	Integration runtime, credentials, and workspace activity	ADF used for transformation better suited to Databricks
COPY INTO/CTAS	SQL load method	File-based loading or table creation	No load until executed	File path access and schema definition	Duplicates or schema mismatch when source evolves
Auto Loader	Incremental file discovery	cloudFiles source options	Checkpoint required for reliable streaming	Landing path, checkpoint path, schema location	Files are reprocessed or missed

Step-by-Step Execution Path

Classify the source: database, files, event stream, CDC feed, or orchestrated external movement.
Choose Lakeflow Connect when a supported connector can reduce custom ingestion code and preserve managed behavior.
Choose SQL methods such as CTAS, CREATE OR REPLACE TABLE, or COPY INTO for file-based batch loading when the pattern is bounded.
Choose Structured Streaming or Auto Loader when files or events arrive continuously and checkpointed progress matters.
For Event Hubs, validate connection details, consumer group, checkpointing, and schema parse logic before blaming downstream transformations.
Write to Unity Catalog tables only after validating checkpoint path, schema location, and target table ownership.

Exam implementation pattern:

When to use: files or events arrive repeatedly and the pipeline must track incremental progress.
Minimal syntax: spark.readStream.format("cloudFiles").option("cloudFiles.format", "json").option("cloudFiles.schemaLocation", schema_path).load(source_path).
What to verify: checkpoint path, schema location, stream offsets, target row counts, and pipeline/event log errors.
Common wrong answer: using a one-time CTAS or manual upload for a continuous landing-zone problem.

Technical Chain

Exam Trap Summary: Do not use one-time CTAS for continuous files or events; use checkpointed ingestion when the source keeps arriving.

Operational Skills Matrix

Task	Precise Command or Path	Verification Standard
Validate ingestion tool fit	Design review evidence: source type, change pattern, supported connector list	Selected tool matches batch, streaming, CDC, or orchestration requirement
Validate Auto Loader checkpoint	Pipeline or notebook configuration inspection	Checkpoint and schema location are stable and unique to the stream
Validate SQL load result	SQL verification: `SELECT COUNT(*) FROM <target_table>;` plus load history where available	Row counts and load metadata match expected source files
Validate Event Hubs stream health	Pipeline run details, Spark streaming query status, or Azure Monitor metrics	Offsets progress and errors are not accumulating

Cleanse, transform, and load data with SQL and Python operations

Exam Radar

Core Priority: Data preparation questions test whether source-change pattern, table model, transformation operation, and quality rule match the target data contract.
High Frequency: Expect extraction type, file type, Lakeflow Connect, notebooks, ADF, batch, streaming, CDC, Event Hubs, Auto Loader, CTAS, COPY INTO, MERGE, INSERT, APPEND, and expectations.
Confusion Alert: Do not replace a modeling or quality decision with compute tuning; a fast bad load is still a bad load.
Scenario Logic: Classify source arrival and target grain before selecting ingestion, transformation, loading, schema, or quality controls.
Version Delta: This topic remains in the Microsoft DP-750 skills measured from March 11, 2026 under Prepare and process data; answer choices should use current Azure Databricks, Unity Catalog, Lakeflow, Azure Monitor, and Microsoft Entra terminology.
Failure Trigger: The failure appears as missed CDC records, duplicate current rows, schema drift breaks, invalid values in curated tables, or queries scanning too much data.
Operational Dependency: Business key, checkpoint, schema location, table format, SCD design, validation rule, and target ownership decide the correct operation.
How the Exam Asks It: The exam asks which data operation changes the right state: discovery progress, target row state, table history, or quality outcome.
How Distractors Are Designed: Wrong answers use one-time SQL loads for streams, append for upserts, comments for enforcement, or joins for set-difference requirements.
Why the Correct Answer Works: The correct answer protects the data contract and proves it with row counts, table history, expectation metrics, or profile evidence.

Practice Question: A daily customer feed contains updates for existing customers and new customer rows. The target Delta table must update matching business keys and insert new ones. Which load operation fits best?
A. MERGE using the customer business key and matched/not-matched clauses.
B. APPEND all rows every day without deduplication.
C. INTERSECT the source and target and discard the result.
D. UNPIVOT all columns before loading.
Correct Answer: A.
Explanation: A is correct because the target needs upsert semantics. B creates duplicate current records. C finds overlap but does not update or insert. D changes shape and does not solve incremental load logic. Exam Takeaway: Select the object that owns the dependency; the distractor pattern is an adjacent Databricks feature that is technically real but does not satisfy the scenario's first blocking condition.

Atomic Deconstruction - Operational Level

Component Specifications

Object	Attribute	Value Range	Default State	Dependency	Failure State
Data profile	Distribution evidence	Counts, null rates, min/max, distinct counts	Unknown until profiled	Representative sample or full scan	Transformation rules target the wrong quality issue
Join operation	Row-combination logic	Inner, left, right, full, semi, anti	No join	Key uniqueness and null handling	Rows duplicate or disappear unexpectedly
Set operator	Dataset comparison	UNION, INTERSECT, EXCEPT	No set operation	Column compatibility	Scenario asks for difference but join is used incorrectly
Pivot/unpivot	Shape transformation	Wide-to-long or long-to-wide	Original shape	Known dimension values or dynamic logic	Analytics model has unusable granularity
MERGE operation	Upsert semantics	Matched update/delete and not matched insert	No target change until executed	Business key and change detection	Duplicates or stale records remain

Step-by-Step Execution Path

Profile source data first so nulls, duplicates, outliers, and distribution skew are known before writing transformation logic.
Choose data types based on semantic use, not only observed strings; timestamps, decimals, and identifiers have different failure modes.
Resolve duplicates with a deterministic business key and tie-breaker before MERGE or aggregation.
Use joins for enrichment, set operators for comparison, and pivot/unpivot for shape changes; do not substitute one because it is familiar.
Use MERGE for upserts, INSERT for explicit row creation, and APPEND for append-only event or fact loads.
Validate row counts, duplicate counts, null rates, and target change history after the operation.

Exam implementation pattern:

When to use: source rows must update existing target rows and insert new business keys.
Minimal syntax: MERGE INTO target t USING source s ON t.business_key = s.business_key WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *.
What to verify: duplicate key check, changed row counts, Delta history, and before/after target samples.
Common wrong answer: appending every daily feed row when the target is supposed to represent current state.

Technical Chain

Exam Trap Summary: Do not append a daily feed when the target needs upsert behavior; MERGE requires a stable business key and post-load duplicate checks.

Operational Skills Matrix

Task	Precise Command or Path	Verification Standard
Validate duplicate handling	SQL verification: `SELECT business_key, COUNT() FROM <table> GROUP BY business_key HAVING COUNT() > 1;`	No duplicate business keys remain when uniqueness is required
Validate null remediation	SQL verification: `SELECT COUNT(*) FROM <table> WHERE <required_col> IS NULL;`	Required columns have zero unexpected nulls
Validate MERGE outcome	Delta table history or row-count comparison before and after load	Updated and inserted row counts align with source change set
Validate transformation shape	SQL verification on expected columns and row grain	Output grain matches denormalized, pivoted, or unpivoted requirement

Implement schema enforcement, schema drift handling, and pipeline expectations

Exam Radar

Core Priority: Data preparation questions test whether source-change pattern, table model, transformation operation, and quality rule match the target data contract.
High Frequency: Expect extraction type, file type, Lakeflow Connect, notebooks, ADF, batch, streaming, CDC, Event Hubs, Auto Loader, CTAS, COPY INTO, MERGE, INSERT, APPEND, and expectations.
Confusion Alert: Do not replace a modeling or quality decision with compute tuning; a fast bad load is still a bad load.
Scenario Logic: Classify source arrival and target grain before selecting ingestion, transformation, loading, schema, or quality controls.
Version Delta: This topic remains in the Microsoft DP-750 skills measured from March 11, 2026 under Prepare and process data; answer choices should use current Azure Databricks, Unity Catalog, Lakeflow, Azure Monitor, and Microsoft Entra terminology.
Failure Trigger: The failure appears as missed CDC records, duplicate current rows, schema drift breaks, invalid values in curated tables, or queries scanning too much data.
Operational Dependency: Business key, checkpoint, schema location, table format, SCD design, validation rule, and target ownership decide the correct operation.
How the Exam Asks It: The exam asks which data operation changes the right state: discovery progress, target row state, table history, or quality outcome.
How Distractors Are Designed: Wrong answers use one-time SQL loads for streams, append for upserts, comments for enforcement, or joins for set-difference requirements.
Why the Correct Answer Works: The correct answer protects the data contract and proves it with row counts, table history, expectation metrics, or profile evidence.

Practice Question: A declarative pipeline must stop records with negative transaction amounts from entering the curated table while preserving evidence for troubleshooting. What should be designed?
A. A pipeline expectation or validation rule with a controlled reject/quarantine path.
B. A larger cluster with more workers.
C. A column comment explaining that amounts should be positive.
D. A UNION operation between valid and invalid records.
Correct Answer: A.
Explanation: A is correct because data quality must be enforced at pipeline execution with observable handling for invalid rows. B is capacity. C documents intent but does not enforce it. D combines data and can worsen contamination. Exam Takeaway: Select the object that owns the dependency; the distractor pattern is an adjacent Databricks feature that is technically real but does not satisfy the scenario's first blocking condition.

Atomic Deconstruction - Operational Level

Component Specifications

Object	Attribute	Value Range	Default State	Dependency	Failure State
Schema enforcement	Write-time contract	Column names and data types	Enforced for Delta writes according to operation	Target schema and write mode	Unexpected columns or types fail the write
Schema drift	Source evolution event	Added, missing, renamed, or type-changed column	No drift until source changes	Ingestion option and governance approval	Pipeline silently drops or fails on new fields
Validation check	Quality rule	Nullability, range, cardinality, uniqueness	Not active unless defined	Business rule and data profile evidence	Invalid records contaminate curated table
Pipeline expectation	Lakeflow quality rule	Expect, drop, fail, or quarantine behavior where supported	No expectation	Declarative pipeline configuration	Bad records are accepted or pipeline stops unexpectedly
Quarantine table	Rejected-record store	Invalid record payload and reason	Absent unless designed	Expectation action and storage target	Errors cannot be diagnosed after rejection

Step-by-Step Execution Path

Convert business rules into testable conditions: nullability, range, type, cardinality, uniqueness, or allowed values.
Decide whether a failed record should fail the pipeline, be dropped, or be quarantined; this is a business-risk choice.
Configure schema enforcement and controlled schema evolution separately because drift handling is not the same as quality validation.
Implement expectations in Lakeflow Spark Declarative Pipelines when the pipeline should own validation behavior.
Write invalid-record evidence to a quarantine or error table when troubleshooting and auditability are required.
Validate with known bad sample records and confirm the output table, pipeline event log, and quarantine path show the expected behavior.

Exam implementation pattern:

When to use: invalid values, nulls, out-of-range fields, drift, or quality failures must be handled during pipeline execution.
Minimal syntax: use a supported Lakeflow expectation such as a positive amount constraint with drop, fail, or quarantine behavior.
What to verify: expectation metrics, rejected-record evidence, target-table cleanliness, and pipeline run status.
Common wrong answer: adding comments that describe valid data but do not enforce any rule.

Technical Chain

Exam Trap Summary: Do not treat schema drift handling as data quality validation; schema evolution controls fields, while expectations control record correctness.

Operational Skills Matrix

Task	Precise Command or Path	Verification Standard
Validate schema contract	SQL verification: `DESCRIBE TABLE <catalog>.<schema>.<table>;`	Column names, data types, and nullable behavior match design
Validate quality rule result	Pipeline run details or event log quality metrics	Expectation counts show passed and failed records
Validate quarantine evidence	SQL verification: `SELECT error_reason, COUNT(*) FROM <quarantine_table> GROUP BY error_reason;`	Rejected records include actionable failure reason
Validate drift handling	Pipeline configuration and run history after a controlled schema-change test	New or changed fields follow approved drift behavior

Shopping cart

Subtotal:

DP-750 Prepare and process data

Detailed list of DP-750 knowledge points