Fast review map for this domain:
| Exam signal | First object to inspect | Correct-answer pattern |
|---|---|---|
| Load only new or changed rows | Watermark or source change signal | Choose incremental loading only when a reliable boundary exists |
| Prepare facts and dimensions | Grain, keys, dimensions, target model | Design dimensional loading before selecting the transformation tool |
| Use PySpark over files and Delta tables | Lakehouse, notebook, Spark runtime | Match distributed transformation to Lakehouse and Delta workflow |
| Reference data without copying it | OneLake shortcut | Validate path, credential, permission, and shortcut support before troubleshooting transformations |
| Keep supported operational data synchronized | Mirroring configuration | Confirm source compatibility and replication status before querying target data |
flowchart LR
A[Source change behavior] --> B[Loading pattern]
B --> C[Fabric store]
C --> D[Transformation engine]
D --> E[Target validation]
E --> F[Operational monitoring]
Practice Question: A source system contains 900 million rows and exposes a reliable lastModified timestamp. The lakehouse target must load only new and changed rows each day without replacing the whole table. What pattern should be designed?
A. Full truncate-and-reload each night.
B. Incremental load using a persisted watermark based on lastModified.
C. Manual CSV export into a warehouse.
D. Streaming load without an event source.
Correct Answer: B.
Explanation: B is correct because the source provides a change signal and the target needs daily changed-row processing. A is expensive and risks long replacement windows. C is not operational. D lacks an event stream. Exam Takeaway: Incremental load requires a reliable change indicator and stored progress; distractors ignore the source-state dependency.
A load pattern is a contract between source change behavior and target state. Full load replaces or rebuilds the target from the complete source. Incremental load reads only data beyond a stored watermark or change boundary. Dimensional loading prepares facts, dimensions, surrogate keys, slowly changing attributes, and conformed structures for analytical models. Streaming loading processes events continuously or micro-batches them with ordering and checkpoint logic.
The why-layer is correctness under change. A full load can be correct but expensive. Incremental loading is efficient but only if the watermark cannot miss updates. Dimensional preparation prevents analytical joins from using unstable operational keys. Streaming ingestion is necessary when the business question depends on event freshness and backlog control.
| Object | Attribute | Value Range | Default State | Dependency | Failure State |
|---|---|---|---|---|---|
| Watermark | Change boundary | Timestamp, monotonically increasing ID, version, offset | Not stored | Reliable source change column or event offset | Duplicate or missed changed rows |
| Fact table | Grain | Transaction, snapshot, accumulating snapshot | Undefined until modeled | Stable business process definition | Measures double count or join incorrectly |
| Dimension table | Key strategy | Natural key, surrogate key, SCD handling | Source key only | Conformed modeling rules | Historical attributes overwrite incorrectly |
| Streaming source | Event ordering and checkpoint | Event time, processing time, offset, checkpoint | No checkpoint | Eventstream or supported streaming source | Backlog grows or events are reprocessed |
| Target Delta table | Merge behavior | Append, overwrite, upsert, merge | Depends on operation | Table schema and key match | Incorrect overwrite or duplicate rows |
Verification can use source query counts, target Delta table history, pipeline run output, notebook validation cells, warehouse reconciliation queries, and streaming backlog or checkpoint evidence.
The source emits either a full data state, a detectable change set, or an event sequence. The load design chooses how to transform that source state into target state. A persisted watermark causes the next run to request only changes beyond the last successful boundary. Dimensional rules convert operational rows into analytics-ready facts and dimensions. Streaming checkpoints record progress through an event sequence. If the boundary is wrong, the target is not merely slow; it becomes logically incorrect.
| Task | Precise Command or Path | Verification Standard |
|---|---|---|
| Validate watermark continuity | Query control table or run metadata for last successful watermark | Current run starts after previous successful boundary |
| Validate target merge result | Supported SQL or Spark count comparison against expected changed keys | Inserted and updated row counts match source change set |
| Validate dimensional grain | Warehouse or lakehouse SQL query grouping by candidate key | No unintended duplicate fact rows at declared grain |
| Validate streaming progress | Eventstream or notebook checkpoint/backlog view | Offsets or backlog show forward progress without replay storm |
Practice Question: A team wants to transform semi-structured files in OneLake with PySpark, store the curated result in Delta tables, and later expose some tables to SQL consumers. Which starting store is most appropriate?
A. Lakehouse.
B. Sensitivity label.
C. Deployment pipeline.
D. Audit log.
Correct Answer: A.
Explanation: A is correct because Lakehouse supports open data engineering and Spark/Delta workflows. B classifies data. C promotes artifacts. D provides evidence. Exam Takeaway: Match the store to the processing engine and target table format; distractors often name governance or lifecycle features rather than ingestion architecture.
Batch ingestion in Fabric separates source connection, movement or reference, transformation engine, and target serving. Pipelines move and orchestrate data. Dataflows Gen2 provide low-code preparation. Notebooks execute PySpark or custom code. Warehouses use T-SQL for relational transformation. KQL is used where event or real-time analytics query patterns apply. Shortcuts expose external or internal data through OneLake without copying the data. Mirroring keeps supported operational data synchronized into Fabric.
The why-layer is tool fitness. A shortcut failure is usually a permission, path, or external location issue, not a Spark optimization problem. A T-SQL warehouse transformation should not be forced into KQL unless the target is an Eventhouse analytics pattern. Mirroring only works when the source is supported and configuration prerequisites are met.
| Object | Attribute | Value Range | Default State | Dependency | Failure State |
|---|---|---|---|---|---|
| Lakehouse | Table format | Delta tables, files, shortcuts | Empty until loaded | OneLake and Spark permissions | Files exist but tables are not queryable as expected |
| Warehouse | Transformation language | T-SQL | Empty schema | Warehouse item permissions | SQL consumers cannot access curated relational model |
| Dataflow Gen2 | Transformation mode | Low-code Power Query-style steps | No refresh | Connector credentials and destination | Refresh fails or schema drifts |
| OneLake shortcut | Target path and credential | Internal or supported external location | Not created | Source permissions and shortcut support | Shortcut error, inaccessible files, broken path |
| Mirroring | Source replication configuration | Supported mirrored source types | Not configured | Source compatibility and permissions | Replication lag or unsupported source error |
Evidence should come from Fabric item run history, Dataflow Gen2 refresh details, pipeline activity output, Lakehouse table preview, Warehouse SQL query output, shortcut status, and mirroring replication status.
The batch source is connected by credential or referenced by shortcut. A pipeline, dataflow, notebook, SQL script, or KQL query then transforms or moves data into a serving store. OneLake stores or references the data path. The target engine reads the data according to table metadata and permissions. If a shortcut credential fails, the engine never receives readable files. If the wrong transformation engine is chosen, the logic may be unsupported even though the data is present.
| Task | Precise Command or Path | Verification Standard |
|---|---|---|
| Validate shortcut health | Fabric portal > Lakehouse > Shortcuts | Shortcut resolves and files are browseable |
| Validate Dataflow Gen2 refresh | Fabric portal > Dataflow Gen2 > Refresh history | Latest refresh succeeds with expected row counts |
| Validate pipeline ingestion | Fabric portal > Pipeline > Run history > Copy activity output | Read/write counts and error rows match expectation |
| Validate warehouse serving | Warehouse SQL query against curated table | Schema and row counts match the batch load contract |
Practice Question: An Eventhouse contains high-volume operational events. A data engineer must query recent events by time window and aggregate failures by operation for monitoring analysis. Which language is most aligned?
A. KQL.
B. Dynamic data masking.
C. Git integration.
D. A deployment pipeline comparison.
Correct Answer: A.
Explanation: A is correct because KQL is designed for time-window event and log analytics patterns. B is a security feature. C and D are lifecycle features. Exam Takeaway: Select the language that belongs to the execution context; distractors often name valid Fabric features outside the transformation path.
Transformation design includes input schema, data quality rule, transformation language, output schema, and validation query. SQL is strong for relational joins, dimensional shaping, constraints, and warehouse serving. PySpark is strong for scalable file processing, Delta operations, and programmable transformations. KQL is strong for time-series, event, telemetry, and log-style analytics in supported real-time analytics contexts.
The why-layer is preserving analytical intent. A transformation is not correct just because it returns rows. Data types, nulls, keys, timestamps, partition columns, and aggregation grain must match the intended analytical question. If a PySpark job writes strings for numeric measures, downstream SQL reports may silently misaggregate. If a KQL query uses processing time instead of event time, monitoring conclusions can be wrong.
| Object | Attribute | Value Range | Default State | Dependency | Failure State |
|---|---|---|---|---|---|
| SQL transformation | Join and aggregation grain | T-SQL supported expressions | Query-only until materialized | Warehouse or SQL endpoint | Duplicate joins or incorrect measures |
| PySpark transformation | DataFrame schema and Delta write mode | append, overwrite, merge-style logic | In-memory until written | Spark session and OneLake permissions | Schema drift or duplicate output |
| KQL query | Time filter and summarize clause | Supported KQL operators | Query-only | Eventhouse or KQL database context | Wrong time window or high query cost |
| Quality rule | Null, range, uniqueness, referential check | Pass, warn, fail | Not enforced unless coded | Source schema and business rule | Bad rows enter curated layer |
| Output table | Schema contract | Column names, types, partitions | Created by first write or DDL | Target engine and permissions | Downstream model breaks on type or name change |
Use SQL query output, notebook cell output, KQL query results, table schema inspection, run history, and data-quality exception tables as verification evidence.
The engine reads source data according to its metadata and execution model. SQL optimizes relational operations against warehouse structures. PySpark distributes transformations across Spark executors and writes table files and metadata. KQL evaluates time-window and event analytics operators against the KQL store. The target schema becomes the contract for downstream consumers. If the selected engine does not match the storage or query pattern, the transformation can become slow, unsupported, or semantically wrong.
| Task | Precise Command or Path | Verification Standard |
|---|---|---|
| Validate SQL output schema | Warehouse query: inspect columns and data types with supported metadata views | Column names and types match target contract |
| Validate PySpark write result | Notebook displays Delta table schema and count after write | Schema and row count match expected transformation output |
| Validate KQL time-window logic | KQL query with explicit time filter and summarize output | Returned events fall inside required window and aggregation grain |
| Validate quality exceptions | Query exception table or notebook validation output | Bad rows are isolated and not silently loaded |
When should an incremental load pattern be chosen instead of a full load pattern?
Choose incremental loading when only new or changed data should be processed after an initial load.
Incremental loading reduces processing time, capacity consumption, and unnecessary data movement by using change indicators such as timestamps, watermarks, change tracking, or source-specific metadata. Full loads are simpler but can become inefficient or risky for large datasets. DP-700 scenarios commonly test whether the candidate can match load design to volume, freshness, and reliability requirements.
Demand Score: 94
Exam Relevance Score: 98
How should data be prepared before loading it into a dimensional model?
Clean, conform, deduplicate, and shape the data into fact and dimension structures before loading it into the analytical model.
Dimensional modeling depends on reliable keys, consistent grain, clean attributes, and predictable relationships. If raw operational data is loaded without preparation, reports can show duplicate facts, inconsistent dimension values, or incorrect aggregations. The exam expects candidates to recognize that modeling readiness is part of the ingestion and transformation design, not an afterthought.
Demand Score: 89
Exam Relevance Score: 95
When should a Lakehouse be preferred over a Warehouse for Fabric batch data workloads?
Prefer a Lakehouse when the workload needs open file-based storage, Spark processing, Delta tables, and flexible data engineering patterns.
A Lakehouse is well suited for Spark-based transformation, mixed structured and semi-structured data, and direct storage in OneLake. A Warehouse is better aligned with relational SQL modeling, T-SQL development, and traditional data warehousing experiences. DP-700 questions often provide workload clues that point to the correct data store.
Demand Score: 91
Exam Relevance Score: 96
What is the main value of OneLake shortcuts in a Fabric ingestion design?
OneLake shortcuts let Fabric items reference data in another location without physically copying it into the target lakehouse.
Shortcuts reduce duplication and simplify access to existing data, but they still depend on correct permissions, supported source behavior, and shortcut resolution. They are not the same as copying data with a pipeline or mirroring a database. Exam scenarios often use shortcuts when the goal is unified access to existing files or tables through OneLake.
Demand Score: 88
Exam Relevance Score: 94
How should duplicate, missing, and late-arriving data be handled in a Fabric transformation pipeline?
Apply explicit data quality logic such as deduplication keys, null handling rules, validation checks, and watermark or replay logic for late-arriving records.
Reliable analytics require deterministic handling of imperfect source data. Duplicate records can inflate measures, missing values can break joins or calculations, and late records can cause incomplete time-window results. DP-700 favors answers that preserve analytical correctness through repeatable transformation rules rather than manual cleanup.
Demand Score: 93
Exam Relevance Score: 97
When should Eventstreams or Spark Structured Streaming be considered instead of a batch pipeline?
Use streaming options when data must be processed continuously or with low latency instead of waiting for scheduled batch loads.
Batch pipelines are appropriate for periodic movement and transformation, but streaming workloads require engines that can process events as they arrive. Eventstreams support event ingestion and routing patterns, while Spark Structured Streaming supports code-driven streaming transformations. The exam tests whether the candidate can match data freshness and processing requirements to the correct Fabric capability.
Demand Score: 90
Exam Relevance Score: 96