Ingest and transform data

Ingest and transform data Detailed Explanation

Fast review map for this domain:

Exam signal	First object to inspect	Correct-answer pattern
Load only new or changed rows	Watermark or source change signal	Choose incremental loading only when a reliable boundary exists
Prepare facts and dimensions	Grain, keys, dimensions, target model	Design dimensional loading before selecting the transformation tool
Use PySpark over files and Delta tables	Lakehouse, notebook, Spark runtime	Match distributed transformation to Lakehouse and Delta workflow
Reference data without copying it	OneLake shortcut	Validate path, credential, permission, and shortcut support before troubleshooting transformations
Keep supported operational data synchronized	Mirroring configuration	Confirm source compatibility and replication status before querying target data

flowchart LR  
    A[Source change behavior] --> B[Loading pattern]  
    B --> C[Fabric store]  
    C --> D[Transformation engine]  
    D --> E[Target validation]  
    E --> F[Operational monitoring]

Design and implement full, incremental, dimensional, and streaming loading patterns

Exam Radar

Core Priority: Loading pattern selection determines cost, latency, correctness, and recoverability before the tool choice is made.
High Frequency: DP-700 frequently asks candidates to choose between full load, incremental load, dimensional preparation, and streaming ingestion patterns.
Confusion Alert: Incremental loading is not simply a smaller full load. It requires a reliable change signal, watermark, timestamp, version, event offset, or source change mechanism.
Scenario Logic: Use full loads when volume is small or replacement is required; incremental loads when changes can be detected; dimensional preparation when facts and dimensions must be conformed; streaming when low-latency event processing is required.
Version Delta: The current guide includes full and incremental data loads, dimensional model preparation, and streaming load pattern design.
Failure Trigger: Duplicate facts, missed late-arriving rows, dimension key mismatch, unbounded stream backlog, or overwrite of historical data.
Operational Dependency: The source must provide a trustworthy change indicator or event ordering signal for incremental and streaming designs.
How the Exam Asks It: The stem gives source behavior and target modeling requirements, then asks for the best load pattern.
How Distractors Are Designed: Distractors often choose a faster tool without satisfying change detection, keys, or latency requirements.
Why the Correct Answer Works: The correct pattern satisfies the data-state dependency first, then can be implemented with a suitable Fabric tool.

Practice Question: A source system contains 900 million rows and exposes a reliable lastModified timestamp. The lakehouse target must load only new and changed rows each day without replacing the whole table. What pattern should be designed?
A. Full truncate-and-reload each night.
B. Incremental load using a persisted watermark based on lastModified.
C. Manual CSV export into a warehouse.
D. Streaming load without an event source.
Correct Answer: B.
Explanation: B is correct because the source provides a change signal and the target needs daily changed-row processing. A is expensive and risks long replacement windows. C is not operational. D lacks an event stream. Exam Takeaway: Incremental load requires a reliable change indicator and stored progress; distractors ignore the source-state dependency.

Atomic Deconstruction - Operational Level

A load pattern is a contract between source change behavior and target state. Full load replaces or rebuilds the target from the complete source. Incremental load reads only data beyond a stored watermark or change boundary. Dimensional loading prepares facts, dimensions, surrogate keys, slowly changing attributes, and conformed structures for analytical models. Streaming loading processes events continuously or micro-batches them with ordering and checkpoint logic.

The why-layer is correctness under change. A full load can be correct but expensive. Incremental loading is efficient but only if the watermark cannot miss updates. Dimensional preparation prevents analytical joins from using unstable operational keys. Streaming ingestion is necessary when the business question depends on event freshness and backlog control.

Component Specifications

Object	Attribute	Value Range	Default State	Dependency	Failure State
Watermark	Change boundary	Timestamp, monotonically increasing ID, version, offset	Not stored	Reliable source change column or event offset	Duplicate or missed changed rows
Fact table	Grain	Transaction, snapshot, accumulating snapshot	Undefined until modeled	Stable business process definition	Measures double count or join incorrectly
Dimension table	Key strategy	Natural key, surrogate key, SCD handling	Source key only	Conformed modeling rules	Historical attributes overwrite incorrectly
Streaming source	Event ordering and checkpoint	Event time, processing time, offset, checkpoint	No checkpoint	Eventstream or supported streaming source	Backlog grows or events are reprocessed
Target Delta table	Merge behavior	Append, overwrite, upsert, merge	Depends on operation	Table schema and key match	Incorrect overwrite or duplicate rows

Step-by-Step Execution Path

Inspect source volume, latency requirement, change signal, and target model. Pattern selection happens before selecting a Fabric item.
For full load, validate that replacement windows, downstream dependencies, and target rebuild time are acceptable.
For incremental load, define the watermark and persist it outside the transient run state. The next run must know what the previous run processed.
For dimensional loading, define fact grain, dimension keys, late-arriving data handling, and attribute-change behavior before writing tables.
For streaming, validate source event ordering, checkpoint behavior, and target latency requirement.
Reconcile source counts, target counts, watermark values, and exception rows after each run.

Verification can use source query counts, target Delta table history, pipeline run output, notebook validation cells, warehouse reconciliation queries, and streaming backlog or checkpoint evidence.

Technical Chain

The source emits either a full data state, a detectable change set, or an event sequence. The load design chooses how to transform that source state into target state. A persisted watermark causes the next run to request only changes beyond the last successful boundary. Dimensional rules convert operational rows into analytics-ready facts and dimensions. Streaming checkpoints record progress through an event sequence. If the boundary is wrong, the target is not merely slow; it becomes logically incorrect.

Operational Skills Matrix

Task	Precise Command or Path	Verification Standard
Validate watermark continuity	Query control table or run metadata for last successful watermark	Current run starts after previous successful boundary
Validate target merge result	Supported SQL or Spark count comparison against expected changed keys	Inserted and updated row counts match source change set
Validate dimensional grain	Warehouse or lakehouse SQL query grouping by candidate key	No unintended duplicate fact rows at declared grain
Validate streaming progress	Eventstream or notebook checkpoint/backlog view	Offsets or backlog show forward progress without replay storm

Ingest and transform batch data with Lakehouse, Warehouse, Dataflows Gen2, notebooks, KQL, T-SQL, shortcuts, mirroring, and pipelines

Exam Radar

Core Priority: Batch ingestion questions test matching the data store and transformation engine to source format, engineering skill, target serving pattern, and governance requirements.
High Frequency: Expect choices between Lakehouse, Warehouse, Dataflows Gen2, notebooks, KQL, T-SQL, pipelines, shortcuts, and mirroring.
Confusion Alert: OneLake shortcuts reference data; they do not copy and transform it by themselves. Mirroring replicates supported source data into Fabric; it is not a generic notebook transformation.
Scenario Logic: Lakehouse fits open data engineering and Spark/Delta workflows. Warehouse fits relational SQL analytics. Dataflows Gen2 fits low-code transformations. Notebooks fit code-heavy transformation. KQL fits real-time analytics and Eventhouse-style query patterns. T-SQL fits warehouse transformations.
Version Delta: The current guide includes choosing data stores and transformation tools, managing shortcuts, implementing mirroring, using pipelines, and resolving shortcut errors.
Failure Trigger: Wrong engine for transformation language, shortcut authorization failure, unsupported mirrored source, or pipeline copy without schema reconciliation.
Operational Dependency: Source connectivity, credentials, OneLake permissions, supported mirroring source, and target schema must be validated before transformation.
How the Exam Asks It: The stem gives source, transformation language, and target consumption needs.
How Distractors Are Designed: Wrong answers select a familiar tool that cannot satisfy language, latency, storage, or support constraints.
Why the Correct Answer Works: The correct option aligns data store, transformation engine, and operational dependency.

Practice Question: A team wants to transform semi-structured files in OneLake with PySpark, store the curated result in Delta tables, and later expose some tables to SQL consumers. Which starting store is most appropriate?
A. Lakehouse.
B. Sensitivity label.
C. Deployment pipeline.
D. Audit log.
Correct Answer: A.
Explanation: A is correct because Lakehouse supports open data engineering and Spark/Delta workflows. B classifies data. C promotes artifacts. D provides evidence. Exam Takeaway: Match the store to the processing engine and target table format; distractors often name governance or lifecycle features rather than ingestion architecture.

Atomic Deconstruction - Operational Level

Batch ingestion in Fabric separates source connection, movement or reference, transformation engine, and target serving. Pipelines move and orchestrate data. Dataflows Gen2 provide low-code preparation. Notebooks execute PySpark or custom code. Warehouses use T-SQL for relational transformation. KQL is used where event or real-time analytics query patterns apply. Shortcuts expose external or internal data through OneLake without copying the data. Mirroring keeps supported operational data synchronized into Fabric.

The why-layer is tool fitness. A shortcut failure is usually a permission, path, or external location issue, not a Spark optimization problem. A T-SQL warehouse transformation should not be forced into KQL unless the target is an Eventhouse analytics pattern. Mirroring only works when the source is supported and configuration prerequisites are met.

Component Specifications

Object	Attribute	Value Range	Default State	Dependency	Failure State
Lakehouse	Table format	Delta tables, files, shortcuts	Empty until loaded	OneLake and Spark permissions	Files exist but tables are not queryable as expected
Warehouse	Transformation language	T-SQL	Empty schema	Warehouse item permissions	SQL consumers cannot access curated relational model
Dataflow Gen2	Transformation mode	Low-code Power Query-style steps	No refresh	Connector credentials and destination	Refresh fails or schema drifts
OneLake shortcut	Target path and credential	Internal or supported external location	Not created	Source permissions and shortcut support	Shortcut error, inaccessible files, broken path
Mirroring	Source replication configuration	Supported mirrored source types	Not configured	Source compatibility and permissions	Replication lag or unsupported source error

Step-by-Step Execution Path

Identify source type, target serving pattern, and transformation language. This determines the Fabric store and engine.
Choose Lakehouse for Spark/Delta engineering, Warehouse for relational SQL serving, Dataflows Gen2 for low-code preparation, notebook for PySpark/custom logic, KQL for real-time analytics patterns, and pipeline for movement or orchestration.
If using shortcuts, validate source path, credential, permission, and supported location before troubleshooting downstream transformations.
If using mirroring, confirm the source is supported and replication status is healthy before querying target data.
Use a pipeline when ingestion must be scheduled, parameterized, retried, or combined with other activities.
Validate schema, row counts, error rows, refresh status, and target accessibility after ingestion.

Evidence should come from Fabric item run history, Dataflow Gen2 refresh details, pipeline activity output, Lakehouse table preview, Warehouse SQL query output, shortcut status, and mirroring replication status.

Technical Chain

The batch source is connected by credential or referenced by shortcut. A pipeline, dataflow, notebook, SQL script, or KQL query then transforms or moves data into a serving store. OneLake stores or references the data path. The target engine reads the data according to table metadata and permissions. If a shortcut credential fails, the engine never receives readable files. If the wrong transformation engine is chosen, the logic may be unsupported even though the data is present.

Operational Skills Matrix

Task	Precise Command or Path	Verification Standard
Validate shortcut health	Fabric portal > Lakehouse > Shortcuts	Shortcut resolves and files are browseable
Validate Dataflow Gen2 refresh	Fabric portal > Dataflow Gen2 > Refresh history	Latest refresh succeeds with expected row counts
Validate pipeline ingestion	Fabric portal > Pipeline > Run history > Copy activity output	Read/write counts and error rows match expectation
Validate warehouse serving	Warehouse SQL query against curated table	Schema and row counts match the batch load contract

Transform data with SQL, PySpark, and KQL while preserving schema, quality, and analytical intent

Exam Radar

Core Priority: DP-700 expects candidates to manipulate and transform data with SQL, PySpark, and KQL. The exam tests when each language is operationally appropriate.
High Frequency: Scenarios compare code-heavy transformation, relational warehouse shaping, and real-time analytics querying.
Confusion Alert: SQL, PySpark, and KQL can all filter or aggregate data, but their execution contexts and target stores differ.
Scenario Logic: Use PySpark for distributed file and Delta transformations, SQL for relational warehouse modeling and query serving, and KQL for high-performance event or log analytics patterns.
Version Delta: The official audience profile names SQL, PySpark, and KQL skills.
Failure Trigger: Schema mismatch, wrong data type conversion, skipped null handling, non-idempotent transformation, or language used outside its supported item context.
Operational Dependency: Transformation logic must match the Fabric item, table schema, security model, and refresh or run orchestration.
How the Exam Asks It: A stem describes a transformation objective and a Fabric context, then asks which language or pattern is best.
How Distractors Are Designed: Distractors select a language because it can express the calculation, while ignoring the item context or serving target.
Why the Correct Answer Works: The correct option fits both transformation semantics and execution location.

Practice Question: An Eventhouse contains high-volume operational events. A data engineer must query recent events by time window and aggregate failures by operation for monitoring analysis. Which language is most aligned?
A. KQL.
B. Dynamic data masking.
C. Git integration.
D. A deployment pipeline comparison.
Correct Answer: A.
Explanation: A is correct because KQL is designed for time-window event and log analytics patterns. B is a security feature. C and D are lifecycle features. Exam Takeaway: Select the language that belongs to the execution context; distractors often name valid Fabric features outside the transformation path.

Atomic Deconstruction - Operational Level

Transformation design includes input schema, data quality rule, transformation language, output schema, and validation query. SQL is strong for relational joins, dimensional shaping, constraints, and warehouse serving. PySpark is strong for scalable file processing, Delta operations, and programmable transformations. KQL is strong for time-series, event, telemetry, and log-style analytics in supported real-time analytics contexts.

The why-layer is preserving analytical intent. A transformation is not correct just because it returns rows. Data types, nulls, keys, timestamps, partition columns, and aggregation grain must match the intended analytical question. If a PySpark job writes strings for numeric measures, downstream SQL reports may silently misaggregate. If a KQL query uses processing time instead of event time, monitoring conclusions can be wrong.

Component Specifications

Object	Attribute	Value Range	Default State	Dependency	Failure State
SQL transformation	Join and aggregation grain	T-SQL supported expressions	Query-only until materialized	Warehouse or SQL endpoint	Duplicate joins or incorrect measures
PySpark transformation	DataFrame schema and Delta write mode	append, overwrite, merge-style logic	In-memory until written	Spark session and OneLake permissions	Schema drift or duplicate output
KQL query	Time filter and summarize clause	Supported KQL operators	Query-only	Eventhouse or KQL database context	Wrong time window or high query cost
Quality rule	Null, range, uniqueness, referential check	Pass, warn, fail	Not enforced unless coded	Source schema and business rule	Bad rows enter curated layer
Output table	Schema contract	Column names, types, partitions	Created by first write or DDL	Target engine and permissions	Downstream model breaks on type or name change

Step-by-Step Execution Path

Read the target Fabric item and serving requirement before writing transformation logic. The item context narrows the language choices.
Profile source schema, nulls, keys, and timestamp fields. This establishes quality checks before transformation.
Choose SQL for warehouse relational transformations, PySpark for scalable lakehouse transformations, or KQL for event/log analytics queries.
Implement transformation with explicit output schema, key behavior, and error handling. Avoid implicit type conversion where downstream analytics depend on precision.
Validate output row counts, duplicate keys, null exceptions, and business-rule totals.
Attach the transformation to a pipeline or schedule only after the transformation passes validation on representative data.

Use SQL query output, notebook cell output, KQL query results, table schema inspection, run history, and data-quality exception tables as verification evidence.

Technical Chain

The engine reads source data according to its metadata and execution model. SQL optimizes relational operations against warehouse structures. PySpark distributes transformations across Spark executors and writes table files and metadata. KQL evaluates time-window and event analytics operators against the KQL store. The target schema becomes the contract for downstream consumers. If the selected engine does not match the storage or query pattern, the transformation can become slow, unsupported, or semantically wrong.

Operational Skills Matrix

Task	Precise Command or Path	Verification Standard
Validate SQL output schema	Warehouse query: inspect columns and data types with supported metadata views	Column names and types match target contract
Validate PySpark write result	Notebook displays Delta table schema and count after write	Schema and row count match expected transformation output
Validate KQL time-window logic	KQL query with explicit time filter and summarize output	Returned events fall inside required window and aggregation grain
Validate quality exceptions	Query exception table or notebook validation output	Bad rows are isolated and not silently loaded

Shopping cart

Subtotal:

DP-700 Ingest and transform data

Detailed list of DP-700 knowledge points