Fast review map for this domain:
| Exam signal | First object to inspect | Correct-answer pattern |
|---|---|---|
| Source data arrives with different shapes | Ingestion tool, extraction type, file type | Choose Lakeflow Connect, notebooks, ADF, SQL, CDC, streaming, or Auto Loader based on source and change pattern |
| Tables must support history and performance | Delta/Iceberg/Parquet choice, SCD, partition, clustering | Design model and storage layout before transformation code |
| Data must be transformed and loaded reliably | SQL/Python transformations, MERGE/INSERT/APPEND | Match operation semantics to dedupe, upsert, or append-only pattern |
| Bad data must be stopped or quarantined | Schema enforcement, schema drift, expectations | Use explicit quality constraints instead of downstream report fixes |
flowchart LR
N1[Source system] --> N2
N2[Ingestion pattern] --> N3
N3[Unity Catalog model] --> N4
N4[Transformation logic] --> N5
N5[Quality control] --> N6
N6[Delta table]
Practice Question: A customer dimension must preserve previous address values and support point-in-time reporting. Which modeling choice is most important?
A. Use an SCD Type 2 or temporal history design with effective date columns.
B. Use CSV files because they are easy to inspect.
C. Increase the number of streaming triggers.
D. Remove all partition columns so writes are faster.
Correct Answer: A.
Explanation: A is correct because preserving historical versions is a modeling requirement. B changes file convenience, not history. C affects pipeline cadence. D may reduce write complexity but does not record prior versions. Exam Takeaway: Select the object that owns the dependency; the distractor pattern is an adjacent Databricks feature that is technically real but does not satisfy the scenario's first blocking condition.
Data preparation questions are about semantics before syntax. The first decision is the shape of change: full load, incremental load, CDC, stream, upsert, append-only event, historical dimension, or quality-gated curated table.
The why-layer sits in the data contract. Wrong grain creates double counting. A missing checkpoint causes reprocessing or data loss. A weak MERGE key creates duplicate current records. Schema drift handling is not a substitute for quality rules, and table comments do not enforce constraints.
For exam reasoning, bind the operation to the data state it changes: Auto Loader records file discovery, MERGE changes matched and unmatched rows, expectations classify records, clustering changes file skipping, and deletion vectors affect row-level delete mechanics.
| Object | Attribute | Value Range | Default State | Dependency | Failure State |
|---|---|---|---|---|---|
| Extraction pattern | Change capture style | Full, incremental, CDC, streaming | Unknown until source analyzed | Source timestamp, key, or change feed | Pipeline reloads too much data or misses updates |
| Table format | Physical and transaction model | Delta, Parquet, CSV, JSON, Iceberg | Scenario dependent | Query engine support and transaction requirement | No ACID behavior or poor query capability |
| Partition scheme | Directory pruning attribute | Low to moderate cardinality time or domain column | No explicit partition unless defined | Query filter pattern and file-size strategy | Too many small partitions or no pruning benefit |
| SCD design | Dimension history type | Type 1 overwrite or Type 2 history most common | No history unless modeled | Business requirement for historical reporting | Reports cannot reconstruct prior state |
| Liquid clustering/Z-order | Data skipping strategy | Clustered columns or ZORDER columns | Unoptimized until configured/run | Delta table size and query filter pattern | High scan cost despite correct table schema |
Exam implementation pattern:
DESCRIBE DETAIL and history/grain checks. Command confidence note: Commands shown in this section are verification-oriented examples. Validate exact Databricks CLI syntax against the active CLI and workspace version before using it as an authoritative production procedure.
The chain begins with source state: file arrival, CDC record, database extract, event offset, or source table snapshot. The ingestion tool records progress or creates a bounded load before transformation code acts on the data.
The modeled table then enforces grain, format, history, schema, and quality rules. MERGE uses a business key to change target rows; expectations accept, reject, or fail records; clustering and optimization alter file layout for query execution rather than business meaning.
If the source-change signal, row grain, or quality rule is wrong, later SQL can still run while producing incorrect analytics. DP-750 therefore rewards answers that validate the data contract before tuning presentation or compute.
Exam Trap Summary: Do not choose partitioning, SCD type, or file format before the source-change pattern, business grain, and history requirement are known.
| Task | Precise Command or Path | Verification Standard |
|---|---|---|
| Validate table format | SQL verification: DESCRIBE DETAIL <catalog>.<schema>.<table>; |
Format and location match Delta or selected table requirement |
| Validate history columns | SQL verification: DESCRIBE TABLE <catalog>.<schema>.<dimension_table>; |
Effective dates, current flag, or temporal fields exist when history is required |
| Validate partition choice | SQL verification: DESCRIBE DETAIL <table>; and inspect partitionColumns |
Partition columns match frequent filters and avoid high-cardinality explosion |
| Validate clustering evidence | Query profile or table optimization history | Filtered queries skip data or scan fewer files after optimization |
Practice Question: A Delta table stores high-volume click events. Queries usually filter by customer_id and event_date, while deletes are frequent because of privacy requests. The team also needs a clear decision about whether table files are governed by Unity Catalog storage. Which design work is most important?
A. Document table grain, choose managed or unmanaged lifecycle, and select liquid clustering/Z-ordering or deletion vectors based on query and delete patterns.
B. Create a separate catalog for every customer_id.
C. Run a MERGE with no business key because Delta supports transactions.
D. Disable all table optimization so writes are simpler.
Correct Answer: A.
Explanation: A is correct because the scenario combines lifecycle ownership, row grain, query pruning, and delete behavior. B creates namespace explosion. C ignores merge keys and grain. D sacrifices performance and does not solve delete-heavy behavior. Exam Takeaway: Select the object that owns the dependency; the distractor pattern is an adjacent Databricks feature that is technically real but does not satisfy the scenario's first blocking condition.
Data preparation questions are about semantics before syntax. The first decision is the shape of change: full load, incremental load, CDC, stream, upsert, append-only event, historical dimension, or quality-gated curated table.
The why-layer sits in the data contract. Wrong grain creates double counting. A missing checkpoint causes reprocessing or data loss. A weak MERGE key creates duplicate current records. Schema drift handling is not a substitute for quality rules, and table comments do not enforce constraints.
For exam reasoning, bind the operation to the data state it changes: Auto Loader records file discovery, MERGE changes matched and unmatched rows, expectations classify records, clustering changes file skipping, and deletion vectors affect row-level delete mechanics.
| Object | Attribute | Value Range | Default State | Dependency | Failure State |
|---|---|---|---|---|---|
| Managed table | Storage lifecycle | Unity Catalog-managed storage and metadata | Created in managed storage when no external location is specified | Catalog/schema storage location and table owner | DROP behavior removes data unexpectedly for teams expecting retained external files |
| Unmanaged table | External data lifecycle | Metadata points to external storage path | Requires explicit external location | Storage credential, external location, and cloud RBAC | Metadata exists but files cannot be read or retained correctly |
| Granularity | Business row grain | Transaction, event, daily snapshot, aggregate, or dimension version | Implicit and often undocumented | Reporting requirement and merge key | Facts double count or dimensions cannot answer point-in-time questions |
| Liquid clustering | Adaptive clustering strategy | Cluster by selected columns where supported | Not applied unless configured and optimized | Delta table feature support and query filter pattern | Queries scan excessive files even with correct columns |
| Deletion vectors | Delta row-level delete tracking | Enabled or disabled according to table feature support | Feature and runtime dependent | Delta feature compatibility and downstream readers | Delete-heavy workloads rewrite more files or external readers become incompatible |
Exam implementation pattern:
DESCRIBE EXTENDED, DESCRIBE DETAIL, or SHOW TBLPROPERTIES. Command confidence note: Commands shown in this section are verification-oriented examples. Validate exact Databricks CLI syntax against the active CLI and workspace version before using it as an authoritative production procedure.
The chain begins with source state: file arrival, CDC record, database extract, event offset, or source table snapshot. The ingestion tool records progress or creates a bounded load before transformation code acts on the data.
The modeled table then enforces grain, format, history, schema, and quality rules. MERGE uses a business key to change target rows; expectations accept, reject, or fail records; clustering and optimization alter file layout for query execution rather than business meaning.
If the source-change signal, row grain, or quality rule is wrong, later SQL can still run while producing incorrect analytics. DP-750 therefore rewards answers that validate the data contract before tuning presentation or compute.
Exam Trap Summary: Do not choose clustering or deletion vectors before table lifecycle, grain, query predicates, and reader compatibility are known.
| Task | Precise Command or Path | Verification Standard |
|---|---|---|
| Validate managed/unmanaged lifecycle | SQL verification: DESCRIBE EXTENDED <catalog>.<schema>.<table>; |
Location and table type match storage lifecycle requirement |
| Validate row grain | SQL verification against business key counts and timestamp grain | Expected key has one row per declared grain or intentional versioning columns |
| Validate clustering or Z-order evidence | Query profile and Delta table history after optimization | Filtered queries show reduced scanned files or data skipping evidence |
| Validate deletion-vector suitability | SQL verification: SHOW TBLPROPERTIES <catalog>.<schema>.<table>; |
Delta table features and downstream compatibility match delete-heavy workload requirement |
Practice Question: JSON files continuously land in ADLS Gen2 and must be incrementally discovered with schema evolution handling in a Databricks pipeline. Which ingestion mechanism best fits?
A. Lakeflow Spark Declarative Pipelines using Auto Loader with checkpoint and schema locations.
B. A one-time CTAS statement without checkpointing.
C. Manual upload through the workspace UI for each file.
D. A SQL warehouse size increase.
Correct Answer: A.
Explanation: A is correct because Auto Loader is built for incremental file discovery and streaming-style ingestion with checkpoints. B is a one-time load pattern. C is manual and not reliable. D affects query serving, not file discovery. Exam Takeaway: Select the object that owns the dependency; the distractor pattern is an adjacent Databricks feature that is technically real but does not satisfy the scenario's first blocking condition.
Data preparation questions are about semantics before syntax. The first decision is the shape of change: full load, incremental load, CDC, stream, upsert, append-only event, historical dimension, or quality-gated curated table.
The why-layer sits in the data contract. Wrong grain creates double counting. A missing checkpoint causes reprocessing or data loss. A weak MERGE key creates duplicate current records. Schema drift handling is not a substitute for quality rules, and table comments do not enforce constraints.
For exam reasoning, bind the operation to the data state it changes: Auto Loader records file discovery, MERGE changes matched and unmatched rows, expectations classify records, clustering changes file skipping, and deletion vectors affect row-level delete mechanics.
| Object | Attribute | Value Range | Default State | Dependency | Failure State |
|---|---|---|---|---|---|
| Lakeflow Connect | Managed connector ingestion | Supported source and batch/streaming mode | Not configured | Connector availability and source credentials | Source ingestion requires custom code unnecessarily |
| Notebook ingestion | Programmable ingestion logic | Python, SQL, or Scala operations | Author-created | Compute, libraries, and source access | Logic becomes hard to schedule or govern if not job-managed |
| Azure Data Factory | Orchestration and movement | Pipeline activity and linked service patterns | External to Databricks | Integration runtime, credentials, and workspace activity | ADF used for transformation better suited to Databricks |
| COPY INTO/CTAS | SQL load method | File-based loading or table creation | No load until executed | File path access and schema definition | Duplicates or schema mismatch when source evolves |
| Auto Loader | Incremental file discovery | cloudFiles source options | Checkpoint required for reliable streaming | Landing path, checkpoint path, schema location | Files are reprocessed or missed |
Exam implementation pattern:
spark.readStream.format("cloudFiles").option("cloudFiles.format", "json").option("cloudFiles.schemaLocation", schema_path).load(source_path). Command confidence note: Commands shown in this section are verification-oriented examples. Validate exact Databricks CLI syntax against the active CLI and workspace version before using it as an authoritative production procedure.
The chain begins with source state: file arrival, CDC record, database extract, event offset, or source table snapshot. The ingestion tool records progress or creates a bounded load before transformation code acts on the data.
The modeled table then enforces grain, format, history, schema, and quality rules. MERGE uses a business key to change target rows; expectations accept, reject, or fail records; clustering and optimization alter file layout for query execution rather than business meaning.
If the source-change signal, row grain, or quality rule is wrong, later SQL can still run while producing incorrect analytics. DP-750 therefore rewards answers that validate the data contract before tuning presentation or compute.
Exam Trap Summary: Do not use one-time CTAS for continuous files or events; use checkpointed ingestion when the source keeps arriving.
| Task | Precise Command or Path | Verification Standard |
|---|---|---|
| Validate ingestion tool fit | Design review evidence: source type, change pattern, supported connector list | Selected tool matches batch, streaming, CDC, or orchestration requirement |
| Validate Auto Loader checkpoint | Pipeline or notebook configuration inspection | Checkpoint and schema location are stable and unique to the stream |
| Validate SQL load result | SQL verification: SELECT COUNT(*) FROM <target_table>; plus load history where available |
Row counts and load metadata match expected source files |
| Validate Event Hubs stream health | Pipeline run details, Spark streaming query status, or Azure Monitor metrics | Offsets progress and errors are not accumulating |
Practice Question: A daily customer feed contains updates for existing customers and new customer rows. The target Delta table must update matching business keys and insert new ones. Which load operation fits best?
A. MERGE using the customer business key and matched/not-matched clauses.
B. APPEND all rows every day without deduplication.
C. INTERSECT the source and target and discard the result.
D. UNPIVOT all columns before loading.
Correct Answer: A.
Explanation: A is correct because the target needs upsert semantics. B creates duplicate current records. C finds overlap but does not update or insert. D changes shape and does not solve incremental load logic. Exam Takeaway: Select the object that owns the dependency; the distractor pattern is an adjacent Databricks feature that is technically real but does not satisfy the scenario's first blocking condition.
Data preparation questions are about semantics before syntax. The first decision is the shape of change: full load, incremental load, CDC, stream, upsert, append-only event, historical dimension, or quality-gated curated table.
The why-layer sits in the data contract. Wrong grain creates double counting. A missing checkpoint causes reprocessing or data loss. A weak MERGE key creates duplicate current records. Schema drift handling is not a substitute for quality rules, and table comments do not enforce constraints.
For exam reasoning, bind the operation to the data state it changes: Auto Loader records file discovery, MERGE changes matched and unmatched rows, expectations classify records, clustering changes file skipping, and deletion vectors affect row-level delete mechanics.
| Object | Attribute | Value Range | Default State | Dependency | Failure State |
|---|---|---|---|---|---|
| Data profile | Distribution evidence | Counts, null rates, min/max, distinct counts | Unknown until profiled | Representative sample or full scan | Transformation rules target the wrong quality issue |
| Join operation | Row-combination logic | Inner, left, right, full, semi, anti | No join | Key uniqueness and null handling | Rows duplicate or disappear unexpectedly |
| Set operator | Dataset comparison | UNION, INTERSECT, EXCEPT | No set operation | Column compatibility | Scenario asks for difference but join is used incorrectly |
| Pivot/unpivot | Shape transformation | Wide-to-long or long-to-wide | Original shape | Known dimension values or dynamic logic | Analytics model has unusable granularity |
| MERGE operation | Upsert semantics | Matched update/delete and not matched insert | No target change until executed | Business key and change detection | Duplicates or stale records remain |
Exam implementation pattern:
MERGE INTO target t USING source s ON t.business_key = s.business_key WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *. Command confidence note: Commands shown in this section are verification-oriented examples. Validate exact Databricks CLI syntax against the active CLI and workspace version before using it as an authoritative production procedure.
The chain begins with source state: file arrival, CDC record, database extract, event offset, or source table snapshot. The ingestion tool records progress or creates a bounded load before transformation code acts on the data.
The modeled table then enforces grain, format, history, schema, and quality rules. MERGE uses a business key to change target rows; expectations accept, reject, or fail records; clustering and optimization alter file layout for query execution rather than business meaning.
If the source-change signal, row grain, or quality rule is wrong, later SQL can still run while producing incorrect analytics. DP-750 therefore rewards answers that validate the data contract before tuning presentation or compute.
Exam Trap Summary: Do not append a daily feed when the target needs upsert behavior; MERGE requires a stable business key and post-load duplicate checks.
| Task | Precise Command or Path | Verification Standard |
|---|---|---|
| Validate duplicate handling | SQL verification: SELECT business_key, COUNT(*) FROM <table> GROUP BY business_key HAVING COUNT(*) > 1; |
No duplicate business keys remain when uniqueness is required |
| Validate null remediation | SQL verification: SELECT COUNT(*) FROM <table> WHERE <required_col> IS NULL; |
Required columns have zero unexpected nulls |
| Validate MERGE outcome | Delta table history or row-count comparison before and after load | Updated and inserted row counts align with source change set |
| Validate transformation shape | SQL verification on expected columns and row grain | Output grain matches denormalized, pivoted, or unpivoted requirement |
Practice Question: A declarative pipeline must stop records with negative transaction amounts from entering the curated table while preserving evidence for troubleshooting. What should be designed?
A. A pipeline expectation or validation rule with a controlled reject/quarantine path.
B. A larger cluster with more workers.
C. A column comment explaining that amounts should be positive.
D. A UNION operation between valid and invalid records.
Correct Answer: A.
Explanation: A is correct because data quality must be enforced at pipeline execution with observable handling for invalid rows. B is capacity. C documents intent but does not enforce it. D combines data and can worsen contamination. Exam Takeaway: Select the object that owns the dependency; the distractor pattern is an adjacent Databricks feature that is technically real but does not satisfy the scenario's first blocking condition.
Data preparation questions are about semantics before syntax. The first decision is the shape of change: full load, incremental load, CDC, stream, upsert, append-only event, historical dimension, or quality-gated curated table.
The why-layer sits in the data contract. Wrong grain creates double counting. A missing checkpoint causes reprocessing or data loss. A weak MERGE key creates duplicate current records. Schema drift handling is not a substitute for quality rules, and table comments do not enforce constraints.
For exam reasoning, bind the operation to the data state it changes: Auto Loader records file discovery, MERGE changes matched and unmatched rows, expectations classify records, clustering changes file skipping, and deletion vectors affect row-level delete mechanics.
| Object | Attribute | Value Range | Default State | Dependency | Failure State |
|---|---|---|---|---|---|
| Schema enforcement | Write-time contract | Column names and data types | Enforced for Delta writes according to operation | Target schema and write mode | Unexpected columns or types fail the write |
| Schema drift | Source evolution event | Added, missing, renamed, or type-changed column | No drift until source changes | Ingestion option and governance approval | Pipeline silently drops or fails on new fields |
| Validation check | Quality rule | Nullability, range, cardinality, uniqueness | Not active unless defined | Business rule and data profile evidence | Invalid records contaminate curated table |
| Pipeline expectation | Lakeflow quality rule | Expect, drop, fail, or quarantine behavior where supported | No expectation | Declarative pipeline configuration | Bad records are accepted or pipeline stops unexpectedly |
| Quarantine table | Rejected-record store | Invalid record payload and reason | Absent unless designed | Expectation action and storage target | Errors cannot be diagnosed after rejection |
Exam implementation pattern:
Command confidence note: Commands shown in this section are verification-oriented examples. Validate exact Databricks CLI syntax against the active CLI and workspace version before using it as an authoritative production procedure.
The chain begins with source state: file arrival, CDC record, database extract, event offset, or source table snapshot. The ingestion tool records progress or creates a bounded load before transformation code acts on the data.
The modeled table then enforces grain, format, history, schema, and quality rules. MERGE uses a business key to change target rows; expectations accept, reject, or fail records; clustering and optimization alter file layout for query execution rather than business meaning.
If the source-change signal, row grain, or quality rule is wrong, later SQL can still run while producing incorrect analytics. DP-750 therefore rewards answers that validate the data contract before tuning presentation or compute.
Exam Trap Summary: Do not treat schema drift handling as data quality validation; schema evolution controls fields, while expectations control record correctness.
| Task | Precise Command or Path | Verification Standard |
|---|---|---|
| Validate schema contract | SQL verification: DESCRIBE TABLE <catalog>.<schema>.<table>; |
Column names, data types, and nullable behavior match design |
| Validate quality rule result | Pipeline run details or event log quality metrics | Expectation counts show passed and failed records |
| Validate quarantine evidence | SQL verification: SELECT error_reason, COUNT(*) FROM <quarantine_table> GROUP BY error_reason; |
Rejected records include actionable failure reason |
| Validate drift handling | Pipeline configuration and run history after a controlled schema-change test | New or changed fields follow approved drift behavior |
How should a data engineer decide between managed and external tables in Unity Catalog?
Use managed tables when Unity Catalog should own the table storage lifecycle, and external tables when data must remain in an externally managed storage location.
Managed tables simplify lifecycle ownership because Databricks manages the storage location through the catalog and schema configuration. External tables are appropriate when files already exist in a controlled cloud path or another process owns the data lifecycle. The exam often tests whether the learner understands table ownership before choosing DDL, grants, or cleanup behavior.
Demand Score: 91
Exam Relevance Score: 97
When is Auto Loader a strong ingestion choice for Azure Databricks?
Auto Loader is useful when files arrive continuously or incrementally and the pipeline needs scalable discovery, schema handling, and checkpointed processing.
Auto Loader is commonly used for cloud file ingestion into Delta or Unity Catalog tables. It can track discovered files, persist checkpoints, and handle evolving schemas when configured correctly. This makes it a better operational fit than manually listing directories for recurring ingestion workloads.
Demand Score: 88
Exam Relevance Score: 95
What should be done when incoming data may add new columns over time?
Configure schema enforcement and schema evolution intentionally, then validate schema changes before downstream consumers rely on the data.
Schema drift can break transformations, dashboards, and data quality expectations. Azure Databricks supports controlled schema evolution patterns, but production pipelines should still define where schemas are stored, which changes are allowed, and how unexpected changes are handled. Accepting all drift blindly can make downstream failures harder to diagnose.
Demand Score: 87
Exam Relevance Score: 94
How should Change Data Capture data be prepared for analytics in Databricks?
Ingest CDC records with ordering and key logic, then apply inserts, updates, and deletes into a reliable Delta target using the appropriate merge or pipeline pattern.
CDC data represents changes rather than simple append-only facts. A correct pipeline must understand keys, operation types, event ordering, late arrivals, and delete semantics. The target table should preserve the intended history or current-state model and provide verifiable results for downstream analytics.
Demand Score: 90
Exam Relevance Score: 96
Why are data quality expectations useful in Databricks pipelines?
Expectations make quality rules explicit and allow the pipeline to track, quarantine, drop, or fail records based on defined conditions.
Operational pipelines need more than transformation code; they need observable quality decisions. Expectations help encode required fields, valid ranges, referential assumptions, or freshness checks. In exam scenarios, they are often the right answer when the requirement is to prevent bad data from silently reaching curated tables.
Demand Score: 89
Exam Relevance Score: 96