Shopping cart

Subtotal:

$0.00

DP-750 Prepare and process data

Prepare and process data

Detailed list of DP-750 knowledge points

Prepare and process data Detailed Explanation

Fast review map for this domain:

Exam signal First object to inspect Correct-answer pattern
Source data arrives with different shapes Ingestion tool, extraction type, file type Choose Lakeflow Connect, notebooks, ADF, SQL, CDC, streaming, or Auto Loader based on source and change pattern
Tables must support history and performance Delta/Iceberg/Parquet choice, SCD, partition, clustering Design model and storage layout before transformation code
Data must be transformed and loaded reliably SQL/Python transformations, MERGE/INSERT/APPEND Match operation semantics to dedupe, upsert, or append-only pattern
Bad data must be stopped or quarantined Schema enforcement, schema drift, expectations Use explicit quality constraints instead of downstream report fixes
flowchart LR  
    N1[Source system] --> N2  
    N2[Ingestion pattern] --> N3  
    N3[Unity Catalog model] --> N4  
    N4[Transformation logic] --> N5  
    N5[Quality control] --> N6  
    N6[Delta table]  

Design Unity Catalog data modeling for ingestion, history, and performance

Exam Radar

  • Core Priority: Data preparation questions test whether source-change pattern, table model, transformation operation, and quality rule match the target data contract.
  • High Frequency: Expect extraction type, file type, Lakeflow Connect, notebooks, ADF, batch, streaming, CDC, Event Hubs, Auto Loader, CTAS, COPY INTO, MERGE, INSERT, APPEND, and expectations.
  • Confusion Alert: Do not replace a modeling or quality decision with compute tuning; a fast bad load is still a bad load.
  • Scenario Logic: Classify source arrival and target grain before selecting ingestion, transformation, loading, schema, or quality controls.
  • Version Delta: This topic remains in the Microsoft DP-750 skills measured from March 11, 2026 under Prepare and process data; answer choices should use current Azure Databricks, Unity Catalog, Lakeflow, Azure Monitor, and Microsoft Entra terminology.
  • Failure Trigger: The failure appears as missed CDC records, duplicate current rows, schema drift breaks, invalid values in curated tables, or queries scanning too much data.
  • Operational Dependency: Business key, checkpoint, schema location, table format, SCD design, validation rule, and target ownership decide the correct operation.
  • How the Exam Asks It: The exam asks which data operation changes the right state: discovery progress, target row state, table history, or quality outcome.
  • How Distractors Are Designed: Wrong answers use one-time SQL loads for streams, append for upserts, comments for enforcement, or joins for set-difference requirements.
  • Why the Correct Answer Works: The correct answer protects the data contract and proves it with row counts, table history, expectation metrics, or profile evidence.

Practice Question: A customer dimension must preserve previous address values and support point-in-time reporting. Which modeling choice is most important?
A. Use an SCD Type 2 or temporal history design with effective date columns.
B. Use CSV files because they are easy to inspect.
C. Increase the number of streaming triggers.
D. Remove all partition columns so writes are faster.
Correct Answer: A.
Explanation: A is correct because preserving historical versions is a modeling requirement. B changes file convenience, not history. C affects pipeline cadence. D may reduce write complexity but does not record prior versions. Exam Takeaway: Select the object that owns the dependency; the distractor pattern is an adjacent Databricks feature that is technically real but does not satisfy the scenario's first blocking condition.

Atomic Deconstruction - Operational Level

Data preparation questions are about semantics before syntax. The first decision is the shape of change: full load, incremental load, CDC, stream, upsert, append-only event, historical dimension, or quality-gated curated table.

The why-layer sits in the data contract. Wrong grain creates double counting. A missing checkpoint causes reprocessing or data loss. A weak MERGE key creates duplicate current records. Schema drift handling is not a substitute for quality rules, and table comments do not enforce constraints.

For exam reasoning, bind the operation to the data state it changes: Auto Loader records file discovery, MERGE changes matched and unmatched rows, expectations classify records, clustering changes file skipping, and deletion vectors affect row-level delete mechanics.

Component Specifications

Object Attribute Value Range Default State Dependency Failure State
Extraction pattern Change capture style Full, incremental, CDC, streaming Unknown until source analyzed Source timestamp, key, or change feed Pipeline reloads too much data or misses updates
Table format Physical and transaction model Delta, Parquet, CSV, JSON, Iceberg Scenario dependent Query engine support and transaction requirement No ACID behavior or poor query capability
Partition scheme Directory pruning attribute Low to moderate cardinality time or domain column No explicit partition unless defined Query filter pattern and file-size strategy Too many small partitions or no pruning benefit
SCD design Dimension history type Type 1 overwrite or Type 2 history most common No history unless modeled Business requirement for historical reporting Reports cannot reconstruct prior state
Liquid clustering/Z-order Data skipping strategy Clustered columns or ZORDER columns Unoptimized until configured/run Delta table size and query filter pattern High scan cost despite correct table schema

Step-by-Step Execution Path

  1. Read the scenario for business grain, history requirement, source change signal, and most common query filter.
  2. Choose extraction style before table design because full reload, incremental, CDC, and streaming create different merge and quality requirements.
  3. Select Delta when ACID transactions, MERGE, schema enforcement, optimization, or Unity Catalog governance are required.
  4. Choose SCD Type 1 when only current values matter; choose SCD Type 2 or temporal design when prior state must be queried.
  5. Design partitioning only for stable, low-cardinality pruning patterns; use clustering features for high-cardinality query acceleration where appropriate.
  6. Validate the resulting table with DESCRIBE DETAIL, query profile, and sample point-in-time queries.

Exam implementation pattern:

  • When to use: the stem asks for extraction type, file type, loading method, table format, partitioning, SCD, temporal history, or clustering strategy.
  • Minimal syntax: document source-change pattern and target grain, then validate table metadata with DESCRIBE DETAIL and history/grain checks.
  • What to verify: table format, partition columns, SCD fields, temporal columns, managed/external lifecycle, and query profile evidence.
  • Common wrong answer: choosing a load command before defining history, grain, and table format.

Command confidence note: Commands shown in this section are verification-oriented examples. Validate exact Databricks CLI syntax against the active CLI and workspace version before using it as an authoritative production procedure.

Technical Chain

The chain begins with source state: file arrival, CDC record, database extract, event offset, or source table snapshot. The ingestion tool records progress or creates a bounded load before transformation code acts on the data.

The modeled table then enforces grain, format, history, schema, and quality rules. MERGE uses a business key to change target rows; expectations accept, reject, or fail records; clustering and optimization alter file layout for query execution rather than business meaning.

If the source-change signal, row grain, or quality rule is wrong, later SQL can still run while producing incorrect analytics. DP-750 therefore rewards answers that validate the data contract before tuning presentation or compute.

Exam Trap Summary: Do not choose partitioning, SCD type, or file format before the source-change pattern, business grain, and history requirement are known.

Operational Skills Matrix

Task Precise Command or Path Verification Standard
Validate table format SQL verification: DESCRIBE DETAIL <catalog>.<schema>.<table>; Format and location match Delta or selected table requirement
Validate history columns SQL verification: DESCRIBE TABLE <catalog>.<schema>.<dimension_table>; Effective dates, current flag, or temporal fields exist when history is required
Validate partition choice SQL verification: DESCRIBE DETAIL <table>; and inspect partitionColumns Partition columns match frequent filters and avoid high-cardinality explosion
Validate clustering evidence Query profile or table optimization history Filtered queries skip data or scan fewer files after optimization

Choose managed versus unmanaged tables, granularity, liquid clustering, Z-ordering, and deletion vectors

Exam Radar

  • Core Priority: Table lifecycle, row grain, clustering, and deletion-vector choices determine whether Delta data is governed, queryable, and maintainable.
  • High Frequency: Expect design prompts about storage ownership, high-cardinality filters, point-in-time rows, privacy deletes, and file-layout cost.
  • Confusion Alert: Do not treat partitioning, Z-ordering, liquid clustering, and deletion vectors as interchangeable performance buttons.
  • Scenario Logic: Start with lifecycle and grain, then choose layout or Delta features from query predicates and delete/update behavior.
  • Version Delta: This topic remains in the Microsoft DP-750 skills measured from March 11, 2026 under Prepare and process data; answer choices should use current Azure Databricks, Unity Catalog, Lakeflow, Azure Monitor, and Microsoft Entra terminology.
  • Failure Trigger: The failure appears as retained files being dropped, double counting from unclear grain, excessive scans, or delete-heavy tables rewriting too much data.
  • Operational Dependency: External location, storage credential, table properties, downstream reader compatibility, and query filter pattern must match the design.
  • How the Exam Asks It: The exam asks which table type, grain, clustering strategy, or delete feature fits a concrete business and workload requirement.
  • How Distractors Are Designed: Wrong answers create namespace sprawl, remove optimization, or run MERGE without a valid business key.
  • Why the Correct Answer Works: The correct answer preserves lifecycle intent first and then chooses the layout feature that changes observable query or delete behavior.

Practice Question: A Delta table stores high-volume click events. Queries usually filter by customer_id and event_date, while deletes are frequent because of privacy requests. The team also needs a clear decision about whether table files are governed by Unity Catalog storage. Which design work is most important?
A. Document table grain, choose managed or unmanaged lifecycle, and select liquid clustering/Z-ordering or deletion vectors based on query and delete patterns.
B. Create a separate catalog for every customer_id.
C. Run a MERGE with no business key because Delta supports transactions.
D. Disable all table optimization so writes are simpler.
Correct Answer: A.
Explanation: A is correct because the scenario combines lifecycle ownership, row grain, query pruning, and delete behavior. B creates namespace explosion. C ignores merge keys and grain. D sacrifices performance and does not solve delete-heavy behavior. Exam Takeaway: Select the object that owns the dependency; the distractor pattern is an adjacent Databricks feature that is technically real but does not satisfy the scenario's first blocking condition.

Atomic Deconstruction - Operational Level

Data preparation questions are about semantics before syntax. The first decision is the shape of change: full load, incremental load, CDC, stream, upsert, append-only event, historical dimension, or quality-gated curated table.

The why-layer sits in the data contract. Wrong grain creates double counting. A missing checkpoint causes reprocessing or data loss. A weak MERGE key creates duplicate current records. Schema drift handling is not a substitute for quality rules, and table comments do not enforce constraints.

For exam reasoning, bind the operation to the data state it changes: Auto Loader records file discovery, MERGE changes matched and unmatched rows, expectations classify records, clustering changes file skipping, and deletion vectors affect row-level delete mechanics.

Component Specifications

Object Attribute Value Range Default State Dependency Failure State
Managed table Storage lifecycle Unity Catalog-managed storage and metadata Created in managed storage when no external location is specified Catalog/schema storage location and table owner DROP behavior removes data unexpectedly for teams expecting retained external files
Unmanaged table External data lifecycle Metadata points to external storage path Requires explicit external location Storage credential, external location, and cloud RBAC Metadata exists but files cannot be read or retained correctly
Granularity Business row grain Transaction, event, daily snapshot, aggregate, or dimension version Implicit and often undocumented Reporting requirement and merge key Facts double count or dimensions cannot answer point-in-time questions
Liquid clustering Adaptive clustering strategy Cluster by selected columns where supported Not applied unless configured and optimized Delta table feature support and query filter pattern Queries scan excessive files even with correct columns
Deletion vectors Delta row-level delete tracking Enabled or disabled according to table feature support Feature and runtime dependent Delta feature compatibility and downstream readers Delete-heavy workloads rewrite more files or external readers become incompatible

Step-by-Step Execution Path

  1. State the row grain before creating the table: one event, one customer snapshot, one daily aggregate, or one dimension version. This prevents later joins and MERGE operations from using the wrong key.
  2. Choose managed tables when Unity Catalog should own both metadata and storage lifecycle; choose unmanaged/external tables when an approved external path must retain independent lifecycle control.
  3. Inspect query predicates before choosing partitioning, liquid clustering, or Z-ordering; high-cardinality filters usually need clustering-style data skipping rather than one directory per value.
  4. Assess deletion vectors for delete-heavy Delta workloads only after confirming reader compatibility and runtime support.
  5. Use an optimization rehearsal in a non-production table before applying layout changes to a large production dataset.
  6. Validate table properties, location, partition columns, and query profile evidence after the design change.

Exam implementation pattern:

  • When to use: lifecycle ownership, external path retention, row grain, clustering, Z-ordering, or deletion-vector behavior is the design issue.
  • Minimal syntax: inspect table type and properties with DESCRIBE EXTENDED, DESCRIBE DETAIL, or SHOW TBLPROPERTIES.
  • What to verify: managed or external location, business grain, clustering columns, Delta table features, and reader compatibility.
  • Common wrong answer: using partitioning, liquid clustering, Z-ordering, and deletion vectors as interchangeable tuning switches.

Command confidence note: Commands shown in this section are verification-oriented examples. Validate exact Databricks CLI syntax against the active CLI and workspace version before using it as an authoritative production procedure.

Technical Chain

The chain begins with source state: file arrival, CDC record, database extract, event offset, or source table snapshot. The ingestion tool records progress or creates a bounded load before transformation code acts on the data.

The modeled table then enforces grain, format, history, schema, and quality rules. MERGE uses a business key to change target rows; expectations accept, reject, or fail records; clustering and optimization alter file layout for query execution rather than business meaning.

If the source-change signal, row grain, or quality rule is wrong, later SQL can still run while producing incorrect analytics. DP-750 therefore rewards answers that validate the data contract before tuning presentation or compute.

Exam Trap Summary: Do not choose clustering or deletion vectors before table lifecycle, grain, query predicates, and reader compatibility are known.

Operational Skills Matrix

Task Precise Command or Path Verification Standard
Validate managed/unmanaged lifecycle SQL verification: DESCRIBE EXTENDED <catalog>.<schema>.<table>; Location and table type match storage lifecycle requirement
Validate row grain SQL verification against business key counts and timestamp grain Expected key has one row per declared grain or intentional versioning columns
Validate clustering or Z-order evidence Query profile and Delta table history after optimization Filtered queries show reduced scanned files or data skipping evidence
Validate deletion-vector suitability SQL verification: SHOW TBLPROPERTIES <catalog>.<schema>.<table>; Delta table features and downstream compatibility match delete-heavy workload requirement

Ingest batch, streaming, CDC, and Event Hubs data into Unity Catalog

Exam Radar

  • Core Priority: Data preparation questions test whether source-change pattern, table model, transformation operation, and quality rule match the target data contract.
  • High Frequency: Expect extraction type, file type, Lakeflow Connect, notebooks, ADF, batch, streaming, CDC, Event Hubs, Auto Loader, CTAS, COPY INTO, MERGE, INSERT, APPEND, and expectations.
  • Confusion Alert: Do not replace a modeling or quality decision with compute tuning; a fast bad load is still a bad load.
  • Scenario Logic: Classify source arrival and target grain before selecting ingestion, transformation, loading, schema, or quality controls.
  • Version Delta: This topic remains in the Microsoft DP-750 skills measured from March 11, 2026 under Prepare and process data; answer choices should use current Azure Databricks, Unity Catalog, Lakeflow, Azure Monitor, and Microsoft Entra terminology.
  • Failure Trigger: The failure appears as missed CDC records, duplicate current rows, schema drift breaks, invalid values in curated tables, or queries scanning too much data.
  • Operational Dependency: Business key, checkpoint, schema location, table format, SCD design, validation rule, and target ownership decide the correct operation.
  • How the Exam Asks It: The exam asks which data operation changes the right state: discovery progress, target row state, table history, or quality outcome.
  • How Distractors Are Designed: Wrong answers use one-time SQL loads for streams, append for upserts, comments for enforcement, or joins for set-difference requirements.
  • Why the Correct Answer Works: The correct answer protects the data contract and proves it with row counts, table history, expectation metrics, or profile evidence.

Practice Question: JSON files continuously land in ADLS Gen2 and must be incrementally discovered with schema evolution handling in a Databricks pipeline. Which ingestion mechanism best fits?
A. Lakeflow Spark Declarative Pipelines using Auto Loader with checkpoint and schema locations.
B. A one-time CTAS statement without checkpointing.
C. Manual upload through the workspace UI for each file.
D. A SQL warehouse size increase.
Correct Answer: A.
Explanation: A is correct because Auto Loader is built for incremental file discovery and streaming-style ingestion with checkpoints. B is a one-time load pattern. C is manual and not reliable. D affects query serving, not file discovery. Exam Takeaway: Select the object that owns the dependency; the distractor pattern is an adjacent Databricks feature that is technically real but does not satisfy the scenario's first blocking condition.

Atomic Deconstruction - Operational Level

Data preparation questions are about semantics before syntax. The first decision is the shape of change: full load, incremental load, CDC, stream, upsert, append-only event, historical dimension, or quality-gated curated table.

The why-layer sits in the data contract. Wrong grain creates double counting. A missing checkpoint causes reprocessing or data loss. A weak MERGE key creates duplicate current records. Schema drift handling is not a substitute for quality rules, and table comments do not enforce constraints.

For exam reasoning, bind the operation to the data state it changes: Auto Loader records file discovery, MERGE changes matched and unmatched rows, expectations classify records, clustering changes file skipping, and deletion vectors affect row-level delete mechanics.

Component Specifications

Object Attribute Value Range Default State Dependency Failure State
Lakeflow Connect Managed connector ingestion Supported source and batch/streaming mode Not configured Connector availability and source credentials Source ingestion requires custom code unnecessarily
Notebook ingestion Programmable ingestion logic Python, SQL, or Scala operations Author-created Compute, libraries, and source access Logic becomes hard to schedule or govern if not job-managed
Azure Data Factory Orchestration and movement Pipeline activity and linked service patterns External to Databricks Integration runtime, credentials, and workspace activity ADF used for transformation better suited to Databricks
COPY INTO/CTAS SQL load method File-based loading or table creation No load until executed File path access and schema definition Duplicates or schema mismatch when source evolves
Auto Loader Incremental file discovery cloudFiles source options Checkpoint required for reliable streaming Landing path, checkpoint path, schema location Files are reprocessed or missed

Step-by-Step Execution Path

  1. Classify the source: database, files, event stream, CDC feed, or orchestrated external movement.
  2. Choose Lakeflow Connect when a supported connector can reduce custom ingestion code and preserve managed behavior.
  3. Choose SQL methods such as CTAS, CREATE OR REPLACE TABLE, or COPY INTO for file-based batch loading when the pattern is bounded.
  4. Choose Structured Streaming or Auto Loader when files or events arrive continuously and checkpointed progress matters.
  5. For Event Hubs, validate connection details, consumer group, checkpointing, and schema parse logic before blaming downstream transformations.
  6. Write to Unity Catalog tables only after validating checkpoint path, schema location, and target table ownership.

Exam implementation pattern:

  • When to use: files or events arrive repeatedly and the pipeline must track incremental progress.
  • Minimal syntax: spark.readStream.format("cloudFiles").option("cloudFiles.format", "json").option("cloudFiles.schemaLocation", schema_path).load(source_path).
  • What to verify: checkpoint path, schema location, stream offsets, target row counts, and pipeline/event log errors.
  • Common wrong answer: using a one-time CTAS or manual upload for a continuous landing-zone problem.

Command confidence note: Commands shown in this section are verification-oriented examples. Validate exact Databricks CLI syntax against the active CLI and workspace version before using it as an authoritative production procedure.

Technical Chain

The chain begins with source state: file arrival, CDC record, database extract, event offset, or source table snapshot. The ingestion tool records progress or creates a bounded load before transformation code acts on the data.

The modeled table then enforces grain, format, history, schema, and quality rules. MERGE uses a business key to change target rows; expectations accept, reject, or fail records; clustering and optimization alter file layout for query execution rather than business meaning.

If the source-change signal, row grain, or quality rule is wrong, later SQL can still run while producing incorrect analytics. DP-750 therefore rewards answers that validate the data contract before tuning presentation or compute.

Exam Trap Summary: Do not use one-time CTAS for continuous files or events; use checkpointed ingestion when the source keeps arriving.

Operational Skills Matrix

Task Precise Command or Path Verification Standard
Validate ingestion tool fit Design review evidence: source type, change pattern, supported connector list Selected tool matches batch, streaming, CDC, or orchestration requirement
Validate Auto Loader checkpoint Pipeline or notebook configuration inspection Checkpoint and schema location are stable and unique to the stream
Validate SQL load result SQL verification: SELECT COUNT(*) FROM <target_table>; plus load history where available Row counts and load metadata match expected source files
Validate Event Hubs stream health Pipeline run details, Spark streaming query status, or Azure Monitor metrics Offsets progress and errors are not accumulating

Cleanse, transform, and load data with SQL and Python operations

Exam Radar

  • Core Priority: Data preparation questions test whether source-change pattern, table model, transformation operation, and quality rule match the target data contract.
  • High Frequency: Expect extraction type, file type, Lakeflow Connect, notebooks, ADF, batch, streaming, CDC, Event Hubs, Auto Loader, CTAS, COPY INTO, MERGE, INSERT, APPEND, and expectations.
  • Confusion Alert: Do not replace a modeling or quality decision with compute tuning; a fast bad load is still a bad load.
  • Scenario Logic: Classify source arrival and target grain before selecting ingestion, transformation, loading, schema, or quality controls.
  • Version Delta: This topic remains in the Microsoft DP-750 skills measured from March 11, 2026 under Prepare and process data; answer choices should use current Azure Databricks, Unity Catalog, Lakeflow, Azure Monitor, and Microsoft Entra terminology.
  • Failure Trigger: The failure appears as missed CDC records, duplicate current rows, schema drift breaks, invalid values in curated tables, or queries scanning too much data.
  • Operational Dependency: Business key, checkpoint, schema location, table format, SCD design, validation rule, and target ownership decide the correct operation.
  • How the Exam Asks It: The exam asks which data operation changes the right state: discovery progress, target row state, table history, or quality outcome.
  • How Distractors Are Designed: Wrong answers use one-time SQL loads for streams, append for upserts, comments for enforcement, or joins for set-difference requirements.
  • Why the Correct Answer Works: The correct answer protects the data contract and proves it with row counts, table history, expectation metrics, or profile evidence.

Practice Question: A daily customer feed contains updates for existing customers and new customer rows. The target Delta table must update matching business keys and insert new ones. Which load operation fits best?
A. MERGE using the customer business key and matched/not-matched clauses.
B. APPEND all rows every day without deduplication.
C. INTERSECT the source and target and discard the result.
D. UNPIVOT all columns before loading.
Correct Answer: A.
Explanation: A is correct because the target needs upsert semantics. B creates duplicate current records. C finds overlap but does not update or insert. D changes shape and does not solve incremental load logic. Exam Takeaway: Select the object that owns the dependency; the distractor pattern is an adjacent Databricks feature that is technically real but does not satisfy the scenario's first blocking condition.

Atomic Deconstruction - Operational Level

Data preparation questions are about semantics before syntax. The first decision is the shape of change: full load, incremental load, CDC, stream, upsert, append-only event, historical dimension, or quality-gated curated table.

The why-layer sits in the data contract. Wrong grain creates double counting. A missing checkpoint causes reprocessing or data loss. A weak MERGE key creates duplicate current records. Schema drift handling is not a substitute for quality rules, and table comments do not enforce constraints.

For exam reasoning, bind the operation to the data state it changes: Auto Loader records file discovery, MERGE changes matched and unmatched rows, expectations classify records, clustering changes file skipping, and deletion vectors affect row-level delete mechanics.

Component Specifications

Object Attribute Value Range Default State Dependency Failure State
Data profile Distribution evidence Counts, null rates, min/max, distinct counts Unknown until profiled Representative sample or full scan Transformation rules target the wrong quality issue
Join operation Row-combination logic Inner, left, right, full, semi, anti No join Key uniqueness and null handling Rows duplicate or disappear unexpectedly
Set operator Dataset comparison UNION, INTERSECT, EXCEPT No set operation Column compatibility Scenario asks for difference but join is used incorrectly
Pivot/unpivot Shape transformation Wide-to-long or long-to-wide Original shape Known dimension values or dynamic logic Analytics model has unusable granularity
MERGE operation Upsert semantics Matched update/delete and not matched insert No target change until executed Business key and change detection Duplicates or stale records remain

Step-by-Step Execution Path

  1. Profile source data first so nulls, duplicates, outliers, and distribution skew are known before writing transformation logic.
  2. Choose data types based on semantic use, not only observed strings; timestamps, decimals, and identifiers have different failure modes.
  3. Resolve duplicates with a deterministic business key and tie-breaker before MERGE or aggregation.
  4. Use joins for enrichment, set operators for comparison, and pivot/unpivot for shape changes; do not substitute one because it is familiar.
  5. Use MERGE for upserts, INSERT for explicit row creation, and APPEND for append-only event or fact loads.
  6. Validate row counts, duplicate counts, null rates, and target change history after the operation.

Exam implementation pattern:

  • When to use: source rows must update existing target rows and insert new business keys.
  • Minimal syntax: MERGE INTO target t USING source s ON t.business_key = s.business_key WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *.
  • What to verify: duplicate key check, changed row counts, Delta history, and before/after target samples.
  • Common wrong answer: appending every daily feed row when the target is supposed to represent current state.

Command confidence note: Commands shown in this section are verification-oriented examples. Validate exact Databricks CLI syntax against the active CLI and workspace version before using it as an authoritative production procedure.

Technical Chain

The chain begins with source state: file arrival, CDC record, database extract, event offset, or source table snapshot. The ingestion tool records progress or creates a bounded load before transformation code acts on the data.

The modeled table then enforces grain, format, history, schema, and quality rules. MERGE uses a business key to change target rows; expectations accept, reject, or fail records; clustering and optimization alter file layout for query execution rather than business meaning.

If the source-change signal, row grain, or quality rule is wrong, later SQL can still run while producing incorrect analytics. DP-750 therefore rewards answers that validate the data contract before tuning presentation or compute.

Exam Trap Summary: Do not append a daily feed when the target needs upsert behavior; MERGE requires a stable business key and post-load duplicate checks.

Operational Skills Matrix

Task Precise Command or Path Verification Standard
Validate duplicate handling SQL verification: SELECT business_key, COUNT(*) FROM <table> GROUP BY business_key HAVING COUNT(*) > 1; No duplicate business keys remain when uniqueness is required
Validate null remediation SQL verification: SELECT COUNT(*) FROM <table> WHERE <required_col> IS NULL; Required columns have zero unexpected nulls
Validate MERGE outcome Delta table history or row-count comparison before and after load Updated and inserted row counts align with source change set
Validate transformation shape SQL verification on expected columns and row grain Output grain matches denormalized, pivoted, or unpivoted requirement

Implement schema enforcement, schema drift handling, and pipeline expectations

Exam Radar

  • Core Priority: Data preparation questions test whether source-change pattern, table model, transformation operation, and quality rule match the target data contract.
  • High Frequency: Expect extraction type, file type, Lakeflow Connect, notebooks, ADF, batch, streaming, CDC, Event Hubs, Auto Loader, CTAS, COPY INTO, MERGE, INSERT, APPEND, and expectations.
  • Confusion Alert: Do not replace a modeling or quality decision with compute tuning; a fast bad load is still a bad load.
  • Scenario Logic: Classify source arrival and target grain before selecting ingestion, transformation, loading, schema, or quality controls.
  • Version Delta: This topic remains in the Microsoft DP-750 skills measured from March 11, 2026 under Prepare and process data; answer choices should use current Azure Databricks, Unity Catalog, Lakeflow, Azure Monitor, and Microsoft Entra terminology.
  • Failure Trigger: The failure appears as missed CDC records, duplicate current rows, schema drift breaks, invalid values in curated tables, or queries scanning too much data.
  • Operational Dependency: Business key, checkpoint, schema location, table format, SCD design, validation rule, and target ownership decide the correct operation.
  • How the Exam Asks It: The exam asks which data operation changes the right state: discovery progress, target row state, table history, or quality outcome.
  • How Distractors Are Designed: Wrong answers use one-time SQL loads for streams, append for upserts, comments for enforcement, or joins for set-difference requirements.
  • Why the Correct Answer Works: The correct answer protects the data contract and proves it with row counts, table history, expectation metrics, or profile evidence.

Practice Question: A declarative pipeline must stop records with negative transaction amounts from entering the curated table while preserving evidence for troubleshooting. What should be designed?
A. A pipeline expectation or validation rule with a controlled reject/quarantine path.
B. A larger cluster with more workers.
C. A column comment explaining that amounts should be positive.
D. A UNION operation between valid and invalid records.
Correct Answer: A.
Explanation: A is correct because data quality must be enforced at pipeline execution with observable handling for invalid rows. B is capacity. C documents intent but does not enforce it. D combines data and can worsen contamination. Exam Takeaway: Select the object that owns the dependency; the distractor pattern is an adjacent Databricks feature that is technically real but does not satisfy the scenario's first blocking condition.

Atomic Deconstruction - Operational Level

Data preparation questions are about semantics before syntax. The first decision is the shape of change: full load, incremental load, CDC, stream, upsert, append-only event, historical dimension, or quality-gated curated table.

The why-layer sits in the data contract. Wrong grain creates double counting. A missing checkpoint causes reprocessing or data loss. A weak MERGE key creates duplicate current records. Schema drift handling is not a substitute for quality rules, and table comments do not enforce constraints.

For exam reasoning, bind the operation to the data state it changes: Auto Loader records file discovery, MERGE changes matched and unmatched rows, expectations classify records, clustering changes file skipping, and deletion vectors affect row-level delete mechanics.

Component Specifications

Object Attribute Value Range Default State Dependency Failure State
Schema enforcement Write-time contract Column names and data types Enforced for Delta writes according to operation Target schema and write mode Unexpected columns or types fail the write
Schema drift Source evolution event Added, missing, renamed, or type-changed column No drift until source changes Ingestion option and governance approval Pipeline silently drops or fails on new fields
Validation check Quality rule Nullability, range, cardinality, uniqueness Not active unless defined Business rule and data profile evidence Invalid records contaminate curated table
Pipeline expectation Lakeflow quality rule Expect, drop, fail, or quarantine behavior where supported No expectation Declarative pipeline configuration Bad records are accepted or pipeline stops unexpectedly
Quarantine table Rejected-record store Invalid record payload and reason Absent unless designed Expectation action and storage target Errors cannot be diagnosed after rejection

Step-by-Step Execution Path

  1. Convert business rules into testable conditions: nullability, range, type, cardinality, uniqueness, or allowed values.
  2. Decide whether a failed record should fail the pipeline, be dropped, or be quarantined; this is a business-risk choice.
  3. Configure schema enforcement and controlled schema evolution separately because drift handling is not the same as quality validation.
  4. Implement expectations in Lakeflow Spark Declarative Pipelines when the pipeline should own validation behavior.
  5. Write invalid-record evidence to a quarantine or error table when troubleshooting and auditability are required.
  6. Validate with known bad sample records and confirm the output table, pipeline event log, and quarantine path show the expected behavior.

Exam implementation pattern:

  • When to use: invalid values, nulls, out-of-range fields, drift, or quality failures must be handled during pipeline execution.
  • Minimal syntax: use a supported Lakeflow expectation such as a positive amount constraint with drop, fail, or quarantine behavior.
  • What to verify: expectation metrics, rejected-record evidence, target-table cleanliness, and pipeline run status.
  • Common wrong answer: adding comments that describe valid data but do not enforce any rule.

Command confidence note: Commands shown in this section are verification-oriented examples. Validate exact Databricks CLI syntax against the active CLI and workspace version before using it as an authoritative production procedure.

Technical Chain

The chain begins with source state: file arrival, CDC record, database extract, event offset, or source table snapshot. The ingestion tool records progress or creates a bounded load before transformation code acts on the data.

The modeled table then enforces grain, format, history, schema, and quality rules. MERGE uses a business key to change target rows; expectations accept, reject, or fail records; clustering and optimization alter file layout for query execution rather than business meaning.

If the source-change signal, row grain, or quality rule is wrong, later SQL can still run while producing incorrect analytics. DP-750 therefore rewards answers that validate the data contract before tuning presentation or compute.

Exam Trap Summary: Do not treat schema drift handling as data quality validation; schema evolution controls fields, while expectations control record correctness.

Operational Skills Matrix

Task Precise Command or Path Verification Standard
Validate schema contract SQL verification: DESCRIBE TABLE <catalog>.<schema>.<table>; Column names, data types, and nullable behavior match design
Validate quality rule result Pipeline run details or event log quality metrics Expectation counts show passed and failed records
Validate quarantine evidence SQL verification: SELECT error_reason, COUNT(*) FROM <quarantine_table> GROUP BY error_reason; Rejected records include actionable failure reason
Validate drift handling Pipeline configuration and run history after a controlled schema-change test New or changed fields follow approved drift behavior

Frequently Asked Questions

How should a data engineer decide between managed and external tables in Unity Catalog?

Answer:

Use managed tables when Unity Catalog should own the table storage lifecycle, and external tables when data must remain in an externally managed storage location.

Explanation:

Managed tables simplify lifecycle ownership because Databricks manages the storage location through the catalog and schema configuration. External tables are appropriate when files already exist in a controlled cloud path or another process owns the data lifecycle. The exam often tests whether the learner understands table ownership before choosing DDL, grants, or cleanup behavior.

Demand Score: 91

Exam Relevance Score: 97

When is Auto Loader a strong ingestion choice for Azure Databricks?

Answer:

Auto Loader is useful when files arrive continuously or incrementally and the pipeline needs scalable discovery, schema handling, and checkpointed processing.

Explanation:

Auto Loader is commonly used for cloud file ingestion into Delta or Unity Catalog tables. It can track discovered files, persist checkpoints, and handle evolving schemas when configured correctly. This makes it a better operational fit than manually listing directories for recurring ingestion workloads.

Demand Score: 88

Exam Relevance Score: 95

What should be done when incoming data may add new columns over time?

Answer:

Configure schema enforcement and schema evolution intentionally, then validate schema changes before downstream consumers rely on the data.

Explanation:

Schema drift can break transformations, dashboards, and data quality expectations. Azure Databricks supports controlled schema evolution patterns, but production pipelines should still define where schemas are stored, which changes are allowed, and how unexpected changes are handled. Accepting all drift blindly can make downstream failures harder to diagnose.

Demand Score: 87

Exam Relevance Score: 94

How should Change Data Capture data be prepared for analytics in Databricks?

Answer:

Ingest CDC records with ordering and key logic, then apply inserts, updates, and deletes into a reliable Delta target using the appropriate merge or pipeline pattern.

Explanation:

CDC data represents changes rather than simple append-only facts. A correct pipeline must understand keys, operation types, event ordering, late arrivals, and delete semantics. The target table should preserve the intended history or current-state model and provide verifiable results for downstream analytics.

Demand Score: 90

Exam Relevance Score: 96

Why are data quality expectations useful in Databricks pipelines?

Answer:

Expectations make quality rules explicit and allow the pipeline to track, quarantine, drop, or fail records based on defined conditions.

Explanation:

Operational pipelines need more than transformation code; they need observable quality decisions. Expectations help encode required fields, valid ranges, referential assumptions, or freshness checks. In exam scenarios, they are often the right answer when the requirement is to prevent bad data from silently reaching curated tables.

Demand Score: 89

Exam Relevance Score: 96

DP-750 Training Course