Shopping cart

Subtotal:

$0.00

DP-700 Ingest and transform data

Ingest and transform data

Detailed list of DP-700 knowledge points

Ingest and transform data Detailed Explanation

Fast review map for this domain:

Exam signal First object to inspect Correct-answer pattern
Load only new or changed rows Watermark or source change signal Choose incremental loading only when a reliable boundary exists
Prepare facts and dimensions Grain, keys, dimensions, target model Design dimensional loading before selecting the transformation tool
Use PySpark over files and Delta tables Lakehouse, notebook, Spark runtime Match distributed transformation to Lakehouse and Delta workflow
Reference data without copying it OneLake shortcut Validate path, credential, permission, and shortcut support before troubleshooting transformations
Keep supported operational data synchronized Mirroring configuration Confirm source compatibility and replication status before querying target data
flowchart LR  
    A[Source change behavior] --> B[Loading pattern]  
    B --> C[Fabric store]  
    C --> D[Transformation engine]  
    D --> E[Target validation]  
    E --> F[Operational monitoring]  

Design and implement full, incremental, dimensional, and streaming loading patterns

Exam Radar

  • Core Priority: Loading pattern selection determines cost, latency, correctness, and recoverability before the tool choice is made.
  • High Frequency: DP-700 frequently asks candidates to choose between full load, incremental load, dimensional preparation, and streaming ingestion patterns.
  • Confusion Alert: Incremental loading is not simply a smaller full load. It requires a reliable change signal, watermark, timestamp, version, event offset, or source change mechanism.
  • Scenario Logic: Use full loads when volume is small or replacement is required; incremental loads when changes can be detected; dimensional preparation when facts and dimensions must be conformed; streaming when low-latency event processing is required.
  • Version Delta: The current guide includes full and incremental data loads, dimensional model preparation, and streaming load pattern design.
  • Failure Trigger: Duplicate facts, missed late-arriving rows, dimension key mismatch, unbounded stream backlog, or overwrite of historical data.
  • Operational Dependency: The source must provide a trustworthy change indicator or event ordering signal for incremental and streaming designs.
  • How the Exam Asks It: The stem gives source behavior and target modeling requirements, then asks for the best load pattern.
  • How Distractors Are Designed: Distractors often choose a faster tool without satisfying change detection, keys, or latency requirements.
  • Why the Correct Answer Works: The correct pattern satisfies the data-state dependency first, then can be implemented with a suitable Fabric tool.

Practice Question: A source system contains 900 million rows and exposes a reliable lastModified timestamp. The lakehouse target must load only new and changed rows each day without replacing the whole table. What pattern should be designed?
A. Full truncate-and-reload each night.
B. Incremental load using a persisted watermark based on lastModified.
C. Manual CSV export into a warehouse.
D. Streaming load without an event source.
Correct Answer: B.
Explanation: B is correct because the source provides a change signal and the target needs daily changed-row processing. A is expensive and risks long replacement windows. C is not operational. D lacks an event stream. Exam Takeaway: Incremental load requires a reliable change indicator and stored progress; distractors ignore the source-state dependency.

Atomic Deconstruction - Operational Level

A load pattern is a contract between source change behavior and target state. Full load replaces or rebuilds the target from the complete source. Incremental load reads only data beyond a stored watermark or change boundary. Dimensional loading prepares facts, dimensions, surrogate keys, slowly changing attributes, and conformed structures for analytical models. Streaming loading processes events continuously or micro-batches them with ordering and checkpoint logic.

The why-layer is correctness under change. A full load can be correct but expensive. Incremental loading is efficient but only if the watermark cannot miss updates. Dimensional preparation prevents analytical joins from using unstable operational keys. Streaming ingestion is necessary when the business question depends on event freshness and backlog control.

Component Specifications

Object Attribute Value Range Default State Dependency Failure State
Watermark Change boundary Timestamp, monotonically increasing ID, version, offset Not stored Reliable source change column or event offset Duplicate or missed changed rows
Fact table Grain Transaction, snapshot, accumulating snapshot Undefined until modeled Stable business process definition Measures double count or join incorrectly
Dimension table Key strategy Natural key, surrogate key, SCD handling Source key only Conformed modeling rules Historical attributes overwrite incorrectly
Streaming source Event ordering and checkpoint Event time, processing time, offset, checkpoint No checkpoint Eventstream or supported streaming source Backlog grows or events are reprocessed
Target Delta table Merge behavior Append, overwrite, upsert, merge Depends on operation Table schema and key match Incorrect overwrite or duplicate rows

Step-by-Step Execution Path

  1. Inspect source volume, latency requirement, change signal, and target model. Pattern selection happens before selecting a Fabric item.
  2. For full load, validate that replacement windows, downstream dependencies, and target rebuild time are acceptable.
  3. For incremental load, define the watermark and persist it outside the transient run state. The next run must know what the previous run processed.
  4. For dimensional loading, define fact grain, dimension keys, late-arriving data handling, and attribute-change behavior before writing tables.
  5. For streaming, validate source event ordering, checkpoint behavior, and target latency requirement.
  6. Reconcile source counts, target counts, watermark values, and exception rows after each run.

Verification can use source query counts, target Delta table history, pipeline run output, notebook validation cells, warehouse reconciliation queries, and streaming backlog or checkpoint evidence.

Technical Chain

The source emits either a full data state, a detectable change set, or an event sequence. The load design chooses how to transform that source state into target state. A persisted watermark causes the next run to request only changes beyond the last successful boundary. Dimensional rules convert operational rows into analytics-ready facts and dimensions. Streaming checkpoints record progress through an event sequence. If the boundary is wrong, the target is not merely slow; it becomes logically incorrect.

Operational Skills Matrix

Task Precise Command or Path Verification Standard
Validate watermark continuity Query control table or run metadata for last successful watermark Current run starts after previous successful boundary
Validate target merge result Supported SQL or Spark count comparison against expected changed keys Inserted and updated row counts match source change set
Validate dimensional grain Warehouse or lakehouse SQL query grouping by candidate key No unintended duplicate fact rows at declared grain
Validate streaming progress Eventstream or notebook checkpoint/backlog view Offsets or backlog show forward progress without replay storm

Ingest and transform batch data with Lakehouse, Warehouse, Dataflows Gen2, notebooks, KQL, T-SQL, shortcuts, mirroring, and pipelines

Exam Radar

  • Core Priority: Batch ingestion questions test matching the data store and transformation engine to source format, engineering skill, target serving pattern, and governance requirements.
  • High Frequency: Expect choices between Lakehouse, Warehouse, Dataflows Gen2, notebooks, KQL, T-SQL, pipelines, shortcuts, and mirroring.
  • Confusion Alert: OneLake shortcuts reference data; they do not copy and transform it by themselves. Mirroring replicates supported source data into Fabric; it is not a generic notebook transformation.
  • Scenario Logic: Lakehouse fits open data engineering and Spark/Delta workflows. Warehouse fits relational SQL analytics. Dataflows Gen2 fits low-code transformations. Notebooks fit code-heavy transformation. KQL fits real-time analytics and Eventhouse-style query patterns. T-SQL fits warehouse transformations.
  • Version Delta: The current guide includes choosing data stores and transformation tools, managing shortcuts, implementing mirroring, using pipelines, and resolving shortcut errors.
  • Failure Trigger: Wrong engine for transformation language, shortcut authorization failure, unsupported mirrored source, or pipeline copy without schema reconciliation.
  • Operational Dependency: Source connectivity, credentials, OneLake permissions, supported mirroring source, and target schema must be validated before transformation.
  • How the Exam Asks It: The stem gives source, transformation language, and target consumption needs.
  • How Distractors Are Designed: Wrong answers select a familiar tool that cannot satisfy language, latency, storage, or support constraints.
  • Why the Correct Answer Works: The correct option aligns data store, transformation engine, and operational dependency.

Practice Question: A team wants to transform semi-structured files in OneLake with PySpark, store the curated result in Delta tables, and later expose some tables to SQL consumers. Which starting store is most appropriate?
A. Lakehouse.
B. Sensitivity label.
C. Deployment pipeline.
D. Audit log.
Correct Answer: A.
Explanation: A is correct because Lakehouse supports open data engineering and Spark/Delta workflows. B classifies data. C promotes artifacts. D provides evidence. Exam Takeaway: Match the store to the processing engine and target table format; distractors often name governance or lifecycle features rather than ingestion architecture.

Atomic Deconstruction - Operational Level

Batch ingestion in Fabric separates source connection, movement or reference, transformation engine, and target serving. Pipelines move and orchestrate data. Dataflows Gen2 provide low-code preparation. Notebooks execute PySpark or custom code. Warehouses use T-SQL for relational transformation. KQL is used where event or real-time analytics query patterns apply. Shortcuts expose external or internal data through OneLake without copying the data. Mirroring keeps supported operational data synchronized into Fabric.

The why-layer is tool fitness. A shortcut failure is usually a permission, path, or external location issue, not a Spark optimization problem. A T-SQL warehouse transformation should not be forced into KQL unless the target is an Eventhouse analytics pattern. Mirroring only works when the source is supported and configuration prerequisites are met.

Component Specifications

Object Attribute Value Range Default State Dependency Failure State
Lakehouse Table format Delta tables, files, shortcuts Empty until loaded OneLake and Spark permissions Files exist but tables are not queryable as expected
Warehouse Transformation language T-SQL Empty schema Warehouse item permissions SQL consumers cannot access curated relational model
Dataflow Gen2 Transformation mode Low-code Power Query-style steps No refresh Connector credentials and destination Refresh fails or schema drifts
OneLake shortcut Target path and credential Internal or supported external location Not created Source permissions and shortcut support Shortcut error, inaccessible files, broken path
Mirroring Source replication configuration Supported mirrored source types Not configured Source compatibility and permissions Replication lag or unsupported source error

Step-by-Step Execution Path

  1. Identify source type, target serving pattern, and transformation language. This determines the Fabric store and engine.
  2. Choose Lakehouse for Spark/Delta engineering, Warehouse for relational SQL serving, Dataflows Gen2 for low-code preparation, notebook for PySpark/custom logic, KQL for real-time analytics patterns, and pipeline for movement or orchestration.
  3. If using shortcuts, validate source path, credential, permission, and supported location before troubleshooting downstream transformations.
  4. If using mirroring, confirm the source is supported and replication status is healthy before querying target data.
  5. Use a pipeline when ingestion must be scheduled, parameterized, retried, or combined with other activities.
  6. Validate schema, row counts, error rows, refresh status, and target accessibility after ingestion.

Evidence should come from Fabric item run history, Dataflow Gen2 refresh details, pipeline activity output, Lakehouse table preview, Warehouse SQL query output, shortcut status, and mirroring replication status.

Technical Chain

The batch source is connected by credential or referenced by shortcut. A pipeline, dataflow, notebook, SQL script, or KQL query then transforms or moves data into a serving store. OneLake stores or references the data path. The target engine reads the data according to table metadata and permissions. If a shortcut credential fails, the engine never receives readable files. If the wrong transformation engine is chosen, the logic may be unsupported even though the data is present.

Operational Skills Matrix

Task Precise Command or Path Verification Standard
Validate shortcut health Fabric portal > Lakehouse > Shortcuts Shortcut resolves and files are browseable
Validate Dataflow Gen2 refresh Fabric portal > Dataflow Gen2 > Refresh history Latest refresh succeeds with expected row counts
Validate pipeline ingestion Fabric portal > Pipeline > Run history > Copy activity output Read/write counts and error rows match expectation
Validate warehouse serving Warehouse SQL query against curated table Schema and row counts match the batch load contract

Transform data with SQL, PySpark, and KQL while preserving schema, quality, and analytical intent

Exam Radar

  • Core Priority: DP-700 expects candidates to manipulate and transform data with SQL, PySpark, and KQL. The exam tests when each language is operationally appropriate.
  • High Frequency: Scenarios compare code-heavy transformation, relational warehouse shaping, and real-time analytics querying.
  • Confusion Alert: SQL, PySpark, and KQL can all filter or aggregate data, but their execution contexts and target stores differ.
  • Scenario Logic: Use PySpark for distributed file and Delta transformations, SQL for relational warehouse modeling and query serving, and KQL for high-performance event or log analytics patterns.
  • Version Delta: The official audience profile names SQL, PySpark, and KQL skills.
  • Failure Trigger: Schema mismatch, wrong data type conversion, skipped null handling, non-idempotent transformation, or language used outside its supported item context.
  • Operational Dependency: Transformation logic must match the Fabric item, table schema, security model, and refresh or run orchestration.
  • How the Exam Asks It: A stem describes a transformation objective and a Fabric context, then asks which language or pattern is best.
  • How Distractors Are Designed: Distractors select a language because it can express the calculation, while ignoring the item context or serving target.
  • Why the Correct Answer Works: The correct option fits both transformation semantics and execution location.

Practice Question: An Eventhouse contains high-volume operational events. A data engineer must query recent events by time window and aggregate failures by operation for monitoring analysis. Which language is most aligned?
A. KQL.
B. Dynamic data masking.
C. Git integration.
D. A deployment pipeline comparison.
Correct Answer: A.
Explanation: A is correct because KQL is designed for time-window event and log analytics patterns. B is a security feature. C and D are lifecycle features. Exam Takeaway: Select the language that belongs to the execution context; distractors often name valid Fabric features outside the transformation path.

Atomic Deconstruction - Operational Level

Transformation design includes input schema, data quality rule, transformation language, output schema, and validation query. SQL is strong for relational joins, dimensional shaping, constraints, and warehouse serving. PySpark is strong for scalable file processing, Delta operations, and programmable transformations. KQL is strong for time-series, event, telemetry, and log-style analytics in supported real-time analytics contexts.

The why-layer is preserving analytical intent. A transformation is not correct just because it returns rows. Data types, nulls, keys, timestamps, partition columns, and aggregation grain must match the intended analytical question. If a PySpark job writes strings for numeric measures, downstream SQL reports may silently misaggregate. If a KQL query uses processing time instead of event time, monitoring conclusions can be wrong.

Component Specifications

Object Attribute Value Range Default State Dependency Failure State
SQL transformation Join and aggregation grain T-SQL supported expressions Query-only until materialized Warehouse or SQL endpoint Duplicate joins or incorrect measures
PySpark transformation DataFrame schema and Delta write mode append, overwrite, merge-style logic In-memory until written Spark session and OneLake permissions Schema drift or duplicate output
KQL query Time filter and summarize clause Supported KQL operators Query-only Eventhouse or KQL database context Wrong time window or high query cost
Quality rule Null, range, uniqueness, referential check Pass, warn, fail Not enforced unless coded Source schema and business rule Bad rows enter curated layer
Output table Schema contract Column names, types, partitions Created by first write or DDL Target engine and permissions Downstream model breaks on type or name change

Step-by-Step Execution Path

  1. Read the target Fabric item and serving requirement before writing transformation logic. The item context narrows the language choices.
  2. Profile source schema, nulls, keys, and timestamp fields. This establishes quality checks before transformation.
  3. Choose SQL for warehouse relational transformations, PySpark for scalable lakehouse transformations, or KQL for event/log analytics queries.
  4. Implement transformation with explicit output schema, key behavior, and error handling. Avoid implicit type conversion where downstream analytics depend on precision.
  5. Validate output row counts, duplicate keys, null exceptions, and business-rule totals.
  6. Attach the transformation to a pipeline or schedule only after the transformation passes validation on representative data.

Use SQL query output, notebook cell output, KQL query results, table schema inspection, run history, and data-quality exception tables as verification evidence.

Technical Chain

The engine reads source data according to its metadata and execution model. SQL optimizes relational operations against warehouse structures. PySpark distributes transformations across Spark executors and writes table files and metadata. KQL evaluates time-window and event analytics operators against the KQL store. The target schema becomes the contract for downstream consumers. If the selected engine does not match the storage or query pattern, the transformation can become slow, unsupported, or semantically wrong.

Operational Skills Matrix

Task Precise Command or Path Verification Standard
Validate SQL output schema Warehouse query: inspect columns and data types with supported metadata views Column names and types match target contract
Validate PySpark write result Notebook displays Delta table schema and count after write Schema and row count match expected transformation output
Validate KQL time-window logic KQL query with explicit time filter and summarize output Returned events fall inside required window and aggregation grain
Validate quality exceptions Query exception table or notebook validation output Bad rows are isolated and not silently loaded

Frequently Asked Questions

When should an incremental load pattern be chosen instead of a full load pattern?

Answer:

Choose incremental loading when only new or changed data should be processed after an initial load.

Explanation:

Incremental loading reduces processing time, capacity consumption, and unnecessary data movement by using change indicators such as timestamps, watermarks, change tracking, or source-specific metadata. Full loads are simpler but can become inefficient or risky for large datasets. DP-700 scenarios commonly test whether the candidate can match load design to volume, freshness, and reliability requirements.

Demand Score: 94

Exam Relevance Score: 98

How should data be prepared before loading it into a dimensional model?

Answer:

Clean, conform, deduplicate, and shape the data into fact and dimension structures before loading it into the analytical model.

Explanation:

Dimensional modeling depends on reliable keys, consistent grain, clean attributes, and predictable relationships. If raw operational data is loaded without preparation, reports can show duplicate facts, inconsistent dimension values, or incorrect aggregations. The exam expects candidates to recognize that modeling readiness is part of the ingestion and transformation design, not an afterthought.

Demand Score: 89

Exam Relevance Score: 95

When should a Lakehouse be preferred over a Warehouse for Fabric batch data workloads?

Answer:

Prefer a Lakehouse when the workload needs open file-based storage, Spark processing, Delta tables, and flexible data engineering patterns.

Explanation:

A Lakehouse is well suited for Spark-based transformation, mixed structured and semi-structured data, and direct storage in OneLake. A Warehouse is better aligned with relational SQL modeling, T-SQL development, and traditional data warehousing experiences. DP-700 questions often provide workload clues that point to the correct data store.

Demand Score: 91

Exam Relevance Score: 96

What is the main value of OneLake shortcuts in a Fabric ingestion design?

Answer:

OneLake shortcuts let Fabric items reference data in another location without physically copying it into the target lakehouse.

Explanation:

Shortcuts reduce duplication and simplify access to existing data, but they still depend on correct permissions, supported source behavior, and shortcut resolution. They are not the same as copying data with a pipeline or mirroring a database. Exam scenarios often use shortcuts when the goal is unified access to existing files or tables through OneLake.

Demand Score: 88

Exam Relevance Score: 94

How should duplicate, missing, and late-arriving data be handled in a Fabric transformation pipeline?

Answer:

Apply explicit data quality logic such as deduplication keys, null handling rules, validation checks, and watermark or replay logic for late-arriving records.

Explanation:

Reliable analytics require deterministic handling of imperfect source data. Duplicate records can inflate measures, missing values can break joins or calculations, and late records can cause incomplete time-window results. DP-700 favors answers that preserve analytical correctness through repeatable transformation rules rather than manual cleanup.

Demand Score: 93

Exam Relevance Score: 97

When should Eventstreams or Spark Structured Streaming be considered instead of a batch pipeline?

Answer:

Use streaming options when data must be processed continuously or with low latency instead of waiting for scheduled batch loads.

Explanation:

Batch pipelines are appropriate for periodic movement and transformation, but streaming workloads require engines that can process events as they arrive. Eventstreams support event ingestion and routing patterns, while Spark Structured Streaming supports code-driven streaming transformations. The exam tests whether the candidate can match data freshness and processing requirements to the correct Fabric capability.

Demand Score: 90

Exam Relevance Score: 96

DP-700 Training Course