“Prepare data” in Microsoft Fabric means turning raw inputs into reliable, analysis-ready tables that other people (and semantic models) can trust. A useful mental model is a simple factory line:
Get data → bring it into your environment (often into OneLake-backed storage)
Transform data → clean, standardize, and shape it into stable tables
Query and analyze data → validate outcomes, explore, and confirm the data supports the questions you need to answer
Your goal is not “moving bytes.” Your goal is repeatable, auditable, and performant data preparation.
Most real Fabric solutions follow a familiar flow:
Source systems
Examples: operational databases, SaaS apps, files, APIs, on-prem systems.
Ingestion path (how data arrives)
Common Fabric items you’ll see:
Dataflow Gen2: great for Power Query-style ingestion and shaping.
Data Pipeline: orchestration and copy-style movement, scheduling, dependencies.
Notebook / Spark: code-driven ingestion, heavy transforms, complex parsing.
On-premises Data Gateway: the “bridge” when a source is on-prem or not directly reachable.
Storage and serving (where data lands)
Lakehouse: file + table experience (often great for bronze/silver-style layers).
Warehouse: SQL-first experience (great for dimensional modeling and many BI workloads).
OneLake: the underlying unified data layer that helps centralize data access patterns.
After each load, you query what you produced to confirm (a minimal validation sketch follows this checklist):
The data is complete (row counts, expected date ranges).
The data is clean (types, null rates, key uniqueness).
The data is usable (joins behave, performance is reasonable).
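A minimal sketch of these checks in a Fabric notebook, assuming PySpark and a hypothetical silver_sales table with illustrative column names (the spark session is predefined in Fabric notebooks):

```python
from pyspark.sql import functions as F

# Hypothetical table and column names, for illustration only
df = spark.read.table("silver_sales")

# Completeness: row count and expected date range
df.agg(
    F.count("*").alias("row_count"),
    F.min("order_date").alias("min_date"),
    F.max("order_date").alias("max_date"),
).show()

# Cleanliness: null rate on a key column
total = df.count()
nulls = df.filter(F.col("customer_id").isNull()).count()
print(f"customer_id null rate: {nulls / total:.2%}")

# Usability: duplicate business keys will break joins downstream
dupes = df.groupBy("order_id").count().filter("count > 1")
print(f"duplicate order_id values: {dupes.count()}")
```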
Scenario A: Ingest SaaS + on-prem into a Lakehouse
You pull daily customer updates from a SaaS source and transaction logs from an on-prem SQL Server.
SaaS: use Dataflow Gen2 for easy connector + shaping.
On-prem: use On-premises Data Gateway and schedule ingestion.
Land raw data into “raw/bronze,” then transform into “clean/silver” tables.
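A minimal sketch of that bronze-to-silver step, assuming a PySpark notebook and hypothetical bronze_customers/silver_customers tables:

```python
from pyspark.sql import functions as F

# Bronze table landed as-is by the ingestion path (names are illustrative)
raw = spark.read.table("bronze_customers")

clean = (
    raw.dropDuplicates(["customer_id"])                 # stabilize the key
       .withColumn("email", F.lower(F.trim("email")))   # standardize values
       .withColumn("loaded_at", F.current_timestamp())  # lineage/freshness
)

# Publish a stable, typed silver table for downstream consumers
clean.write.mode("overwrite").format("delta").saveAsTable("silver_customers")
```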
Scenario B: Build a repeatable pipeline for a Warehouse
You have a Warehouse used by many analysts via SQL.
Use a Data Pipeline to orchestrate ingestion steps (copy, then transformations, then validations).
Transform in SQL for consistency and governance in a SQL-centric environment.
Add checks: row counts by day, rejected rows table, and a simple “load status” output.
Scenario C: Heavy transformations with code
Your data arrives as nested JSON with inconsistent fields.
Use a Notebook to parse, normalize, and write standardized tables (a minimal sketch follows this list).
Keep transformation logic readable and versionable (clear functions, clear outputs).
Validate results with targeted queries and data-quality metrics.
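A minimal PySpark sketch for this scenario; the path, field names, and output table are assumptions. Note that spark.read.json infers a merged schema across files, so a field missing from some files simply surfaces as null:

```python
from pyspark.sql import functions as F

# Read nested JSON landed in the Lakehouse Files area (path is hypothetical)
raw = spark.read.option("multiline", "true").json("Files/raw/events/")

flat = raw.select(
    F.col("id").alias("event_id"),
    F.col("payload.user.id").alias("user_id"),
    # Tolerate inconsistent fields: prefer the new name, fall back to the old
    F.coalesce(F.col("payload.amount"), F.col("payload.total")).alias("amount"),
    F.to_timestamp("payload.ts").alias("event_time"),
)

# Write a standardized table that downstream steps can rely on
flat.write.mode("append").format("delta").saveAsTable("silver_events")
```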
Common pitfalls to watch for:
Choosing a tool that doesn’t match the work: trying to do heavy parsing in a low-code flow can become fragile; doing simple reshaping in a notebook can slow teams down. Pick the simplest tool that still stays maintainable.
Not designing for repeat runs: data preparation must be idempotent or at least predictable; re-running should not duplicate rows or corrupt partitions (a MERGE-based sketch follows this list).
Schema drift surprises: sources change. Protect yourself with explicit typing, validation steps, and “reject/quarantine” paths for unexpected columns/values.
Gateway issues feel like “random failures”: when on-prem data fails, verify gateway connectivity, credentials, and whether the scheduled identity has access to the source.
No validation layer: if you don’t query and validate after each load, you’ll only discover problems when reports are wrong—often days later.
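One common way to make re-runs safe is a Delta MERGE keyed on the business key, so reprocessing the same batch cannot duplicate rows. A minimal sketch, assuming hypothetical staging_orders and silver_orders tables:

```python
from delta.tables import DeltaTable

# Today's extracted batch (name is illustrative)
updates = spark.read.table("staging_orders")

# Upsert into the target: matched keys are updated, new keys inserted,
# so running the same batch twice leaves the table unchanged
target = DeltaTable.forName(spark, "silver_orders")
(
    target.alias("t")
    .merge(updates.alias("s"), "t.order_id = s.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```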
At a high level, DP-600 will expect you to:
Choose an ingestion approach (Dataflow Gen2 vs Data Pipeline vs Notebook) based on constraints like complexity, maintainability, and connectivity.
Describe how data moves from sources into Lakehouse/Warehouse and how you’d validate the result.
Recognize symptoms: duplicate loads, missing days, schema drift, slow queries, and “works in dev but fails on schedule.”
Preparing data is about building a reliable path from raw inputs to clean tables:
Use the right ingestion tool for the job.
Transform with clear, testable steps.
Validate by querying what you produced—every time.
Next, we’ll move into Implement and manage semantic models, where the focus shifts from “tables exist” to “business meaning, performance, and governed reuse.”
DP-600 scenarios often disguise the “right tool” behind constraints. Use these pivots:
Dataflow Gen2 when you need repeatable ingestion + shaping with Power Query-style steps, and the complexity stays moderate.
Data Pipeline when the problem is orchestration: dependencies, retries, scheduling, and multi-step movement/landing patterns (even if transforms are light).
Notebook (or Spark Job Definition) when parsing/normalization is complex (nested JSON, heavy enrichment, custom logic) or when you need scalable code-based processing.
On-premises Data Gateway when the source is on-prem or otherwise not directly reachable from the service and you must bridge connectivity.
Exam pattern: if the prompt emphasizes “schedule + retries + dependencies,” lean Data Pipeline; if it emphasizes “complex parsing or custom logic,” lean Notebook; if it emphasizes “business-friendly shaping and connectors,” lean Dataflow Gen2.
Ingesting and accessing are different commitments. When you “ingest,” you create a managed landing zone that you can re-run, audit, and optimize. When you “access,” you’re often pointing to an existing location or using shortcuts/patterns that reduce duplication.
A safe enterprise stance is:
Ingest when you need auditability, repeatability, schema control, or performance isolation.
Access when you need fast onboarding, shared data reuse, or single-source-of-truth behavior and governance already exists upstream.
If a scenario mentions “multiple teams reusing the same source,” check if “access” patterns can reduce duplicates—but keep validation and governance in mind.
Two discovery modes show up:
OneLake catalog: “What data assets already exist in my tenant/workspaces, and how do I find the right tables/files?”
Real-Time hub: “What streaming/real-time data is available right now, and how do I subscribe or explore it?”
Exam trap: discovery is not ingestion. The prompt may ask you to “discover” before you “ingest,” meaning you should reference the catalog/hub rather than immediately choosing a pipeline.
Use workload shape to choose the landing/serving surface:
Lakehouse when you want flexible file+table patterns, medallion layering, and a broad mix of engineering and analytics workflows.
Warehouse when you want SQL-first modeling, dimensional patterns, and consistent T-SQL governance for many BI consumers.
Eventhouse (KQL Database) when the core workload is log/telemetry/time-series or event analytics where KQL is a natural fit (fast filtering/aggregation on events).
A “wrong but tempting” exam choice is forcing everything into a Warehouse because it’s SQL; if the scenario is clearly event/log analytics, Eventhouse (KQL Database) is usually the intended answer.
Integration questions often aim at “how does this data become reusable?”
For Eventhouse (KQL Database): confirm your path to governed reuse (who can query with KQL, what is published/curated, and how downstream tools consume it).
For Semantic Model (Dataset): confirm the path from curated tables to a model that others can build on, without duplicating logic in each report.
Exam hint: if the prompt connects real-time/event data to BI consumption, you’ll likely need to mention both Eventhouse (KQL Database) + Semantic Model (Dataset) integration considerations, not just ingestion.
When a scenario asks “where should we do the transformation,” answer with maintainability and pushdown logic:
Dataflow Gen2 for transparent, stepwise transforms that business/BI teams can maintain.
Notebook / Spark Job Definition for scalable, code-heavy transforms and complex parsing.
Warehouse / SQL Analytics Endpoint for SQL-centric teams and governed, reviewable transformations (views, functions, stored procedures) that are easy to validate with queries.
Decision rule that holds in exam scenarios:
Put transformations closest to the team that must maintain them, as long as performance and governance are acceptable.
Prefer pushdown when possible (do work where it’s cheapest/fastest and easiest to govern).
Centralize transformations when many downstream consumers rely on the same standardized logic.
A common enterprise workflow:
Transform raw data into curated fact and dimension tables (Lakehouse/Warehouse).
Then build the semantic model on top with clear relationships and measures.
Why this matters: if you rely on the semantic model to “fix” messy shapes, performance and correctness issues multiply across reports.
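A sketch of the “curate first, model second” shape, assuming PySpark and hypothetical table/column names:

```python
from pyspark.sql import functions as F

sales = spark.read.table("silver_sales")  # curated input (illustrative)

# Dimension: enforce one row per customer before any model depends on it
dim_customer = (
    sales.select("customer_id", "customer_name", "segment")
         .dropDuplicates(["customer_id"])
)
dim_customer.write.mode("overwrite").format("delta").saveAsTable("dim_customer")

# Fact: measures at a declared grain, carrying only the dimension key
fact_sales = sales.select("order_id", "order_date", "customer_id", "amount")
fact_sales.write.mode("overwrite").format("delta").saveAsTable("fact_sales")
```

The semantic model then adds relationships and measures on top of these tables instead of repairing their shape inside every report.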
When a dataset grows 10x unexpectedly after a change, it’s usually one of these:
A many-to-many join was introduced without controls.
A dimension table lost uniqueness (duplicate keys).
A bridge/lookup table expanded the grain.
Advanced safety checks you should operationalize (a code sketch follows this list):
Enforce uniqueness on dimension keys (or at least validate it every load).
Validate row counts and distinct keys pre/post join.
Create “reject/quarantine” outputs for duplicates, missing keys, and type failures.
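A minimal sketch of the uniqueness and grain checks, reusing the hypothetical fact_sales/dim_customer tables from earlier:

```python
# 1) Dimension keys must be unique, or joins will multiply rows
dim = spark.read.table("dim_customer")
assert dim.count() == dim.select("customer_id").distinct().count(), \
    "dim_customer has duplicate keys"

# 2) Row counts must be stable across the join (same grain in = out)
fact = spark.read.table("fact_sales")
before = fact.count()
after = fact.join(dim, "customer_id", "left").count()
assert before == after, f"join changed the grain: {before} -> {after}"
```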
High-value patterns (and easy exam points):
Write load metrics: rows in/out, rejected count, null rate by key columns, max/min dates.
Persist reject tables (bad rows) rather than silently dropping them.
Create a small “data quality status” table that dashboards can read to show freshness and health.
This turns “we think it loaded” into “we can prove it loaded correctly.”
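A minimal sketch of writing load metrics to a status table, assuming PySpark and illustrative names:

```python
from pyspark.sql import functions as F

df = spark.read.table("silver_orders")  # the table this load produced

# Compute the metrics once, then stamp and persist them
metrics = (
    df.agg(
        F.count("*").alias("rows_loaded"),
        F.sum(F.col("customer_id").isNull().cast("int")).alias("null_keys"),
        F.max("order_date").alias("max_date"),
    )
    .withColumn("load_time", F.current_timestamp())
    .withColumn("status", F.lit("success"))
)

# Dashboards read this small table to show freshness and health
metrics.write.mode("append").format("delta").saveAsTable("load_status")
```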
DP-600 expects you to match the query tool to the analysis need:
Visual Query Editor: fast interactive exploration and validation when you want to see transformations and results without writing everything by hand.
SQL (Warehouse / SQL Analytics Endpoint): best for joins, dimensional validation, and reconciling counts/aggregates.
KQL (Eventhouse / KQL Database): best for event/time-series style slicing, filtering, and summarization.
DAX (often via DAX Query View): best for validating semantic model measures and filter context behavior.
Exam trap: using SQL to validate a DAX measure can miss filter-context behavior; when the question is “why is the KPI wrong in the report,” you often need DAX-level validation, not just SQL-level aggregates.
A layered validation sequence:
Validate raw ingestion: expected date range, row counts, and key distribution.
Validate transforms: uniqueness of dimensions, join grain, null/duplicate checks.
Validate serving layer: SQL aggregates match business expectations at the table level.
Validate semantic layer: measures under filters behave (DAX context) and security filters (RLS) don’t distort results.
This is the fastest way to separate “data problem” vs “model problem” vs “report problem.”
When performance regresses after a transform:
First, confirm whether you created more data (row explosion, wider tables, higher-cardinality columns).
Then, confirm whether the query path changed (more joins, less filtering, missing partition-pruning patterns).
Finally, confirm whether you’re testing in the right surface (SQL vs KQL vs DAX), since each has different performance drivers.
For exam responses, always name the “likely cause category” (data volume, join grain, filter selectivity, calculation complexity) and a concrete verification (row count checks, distinct key checks, explain/plan style reasoning, or a targeted subset query).
Two common “why don’t totals match?” patterns:
Date boundary drift: source is in one time zone, reporting in another; daily totals shift across midnight boundaries.
Late-arriving events/transactions: “yesterday” changes after the fact.
The correct exam posture:
Define the business rule (which time zone, what is “day”).
Use a consistent “as-of” cutoff or watermarking logic in your ingestion/transforms (a minimal sketch follows this list).
Validate aggregates with the same cutoff rule on both source and destination.
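A minimal sketch of applying one business rule consistently, assuming PySpark, UTC-stored event times, and hypothetical table/column names; the business time zone is an assumption:

```python
from pyspark.sql import functions as F

df = spark.read.table("silver_transactions")  # illustrative table

BUSINESS_TZ = "America/New_York"  # the agreed business time zone (assumption)

as_of = (
    # As-of cutoff: compare source and destination with the same watermark
    df.filter(F.col("ingested_at") <= F.lit("2024-06-01 00:00:00").cast("timestamp"))
      # Define "day" once: UTC event time rendered in the business time zone
      .withColumn("event_day", F.to_date(F.from_utc_timestamp("event_time", BUSINESS_TZ)))
)

as_of.groupBy("event_day").count().orderBy("event_day").show()
```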
Practice questions
When should a Lakehouse be used instead of a Warehouse in Microsoft Fabric?
A Lakehouse should be used when analytics workloads require flexible storage for structured and unstructured data in open formats such as Delta Lake.
Lakehouses combine the scalability of a data lake with structured query capabilities. They store data in open formats like Delta Lake, allowing integration with Spark, notebooks, and machine learning workloads. This architecture supports large-scale ingestion pipelines and schema evolution scenarios. Warehouses, in contrast, are optimized for relational analytics and structured SQL workloads. They enforce stronger schema constraints and typically serve business intelligence scenarios where predictable query performance is required. Organizations often use Lakehouses as the raw and transformation layer while Warehouses support curated analytical datasets used by reports and dashboards.
Demand Score: 88
Exam Relevance Score: 92
What is the purpose of Dataflows Gen2 in Microsoft Fabric?
Dataflows Gen2 provide a low-code environment for ingesting and transforming data before storing it in Fabric destinations such as Lakehouses or Warehouses.
Dataflows Gen2 extend the Power Query engine into Fabric’s data engineering environment. They allow users to connect to multiple sources, perform transformations, and load results into scalable storage platforms. Unlike earlier Power BI dataflows, which primarily fed Power BI datasets, Gen2 dataflows support enterprise data engineering scenarios and integrate with Fabric pipelines. Transformations can include filtering, joins, calculated columns, and schema reshaping. Because the transformations are declarative and reusable, they are often used for standardized ingestion processes across departments. This approach allows organizations to maintain consistent data preparation logic while reducing the need for custom code in notebooks.
Demand Score: 85
Exam Relevance Score: 90
How does the SQL analytics endpoint enable querying Lakehouse data?
The SQL analytics endpoint provides a T-SQL interface that allows users to query Delta tables stored in a Lakehouse.
Each Fabric Lakehouse automatically exposes a SQL endpoint that translates T-SQL queries into operations against the underlying Delta tables. This allows analysts who are familiar with SQL Server–style querying to analyze Lakehouse data without using Spark notebooks. For a Lakehouse, the endpoint is read-only: analysts can query data and create views, but writes to the Delta tables happen through Spark or other engines. The endpoint supports joins, aggregations, and other analytical queries while maintaining compatibility with tools such as Power BI and SQL clients. This capability bridges the gap between data engineering workflows and traditional business intelligence tooling. However, heavy transformations or large-scale processing tasks are typically better suited for Spark engines rather than the SQL endpoint.
Demand Score: 82
Exam Relevance Score: 90
What factors should determine whether data transformations occur in Dataflows Gen2 or notebooks?
Transformations should occur in Dataflows Gen2 for reusable, low-code data preparation, while notebooks are better suited for complex or large-scale programmatic transformations.
Dataflows Gen2 are optimized for declarative Power Query transformations and are ideal for standard ingestion pipelines that require repeatable logic. They are easier to maintain and accessible to analysts who may not write code. Notebooks, on the other hand, support Spark-based transformations using Python, Scala, or SQL. These are better suited for advanced processing tasks such as large-scale joins, machine learning preparation, or distributed data manipulation. Choosing between the two depends on complexity, scalability requirements, and the skill set of the team managing the transformation pipeline.
Demand Score: 80
Exam Relevance Score: 88
What is the role of Delta tables in Microsoft Fabric Lakehouses?
Delta tables provide transactional storage with schema enforcement and versioning for data stored in Fabric Lakehouses.
Delta tables are built on the Delta Lake format and enable reliable data storage in distributed environments. They maintain transaction logs that support ACID properties, allowing multiple processes to read and write data safely. This design ensures consistency even in high-concurrency environments common in analytics platforms. Delta tables also support features such as schema evolution, time travel, and optimized query performance through indexing and metadata management. These capabilities make them well suited for analytics pipelines where data is frequently updated and queried by multiple tools.
Demand Score: 84
Exam Relevance Score: 91
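To make the Delta table behaviors above concrete, a small PySpark sketch (the sales table is hypothetical and assumed to already exist with prior versions):

```python
from pyspark.sql import Row

# Schema evolution: append rows carrying a new column ("channel") and
# let Delta merge the schema instead of rejecting the write
new_rows = spark.createDataFrame([Row(id=1, amount=10.0, channel="web")])
new_rows.write.option("mergeSchema", "true").mode("append").saveAsTable("sales")

# Time travel: query the table as of an earlier version via Delta SQL
v0 = spark.sql("SELECT * FROM sales VERSION AS OF 0")
v0.show()
```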
Why might an organization stage raw data before transforming it in Fabric?
Staging raw data allows organizations to preserve the original source data while enabling controlled transformation and cleansing processes.
In many analytics architectures, raw data is first ingested into a landing or bronze layer within a Lakehouse. This layer preserves the exact state of incoming data and acts as a historical record. Transformations then occur in subsequent layers, such as silver or curated datasets, where data quality rules and business logic are applied. This layered architecture improves traceability, enables rollback if transformation errors occur, and supports auditing requirements. It also allows multiple downstream transformations to reuse the same raw dataset without repeatedly extracting data from source systems.
Demand Score: 81
Exam Relevance Score: 87