“Prepare data” in Microsoft Fabric means turning raw inputs into reliable, analysis-ready tables that other people (and semantic models) can trust. A useful mental model is a simple factory line:
Get data → bring it into your environment (often into OneLake-backed storage)
Transform data → clean, standardize, and shape it into stable tables
Query and analyze data → validate outcomes, explore, and confirm the data supports the questions you need to answer
Your goal is not “moving bytes.” Your goal is repeatable, auditable, and performant data preparation.
Most real Fabric solutions follow a familiar flow:
Source systems
Examples: operational databases, SaaS apps, files, APIs, on-prem systems.
Ingestion path (how data arrives)
Common Fabric items you’ll see:
Dataflow Gen2: great for Power Query-style ingestion and shaping.
Data Pipeline: orchestration and copy-style movement, scheduling, dependencies.
Notebook / Spark: code-driven ingestion, heavy transforms, complex parsing.
On-premises Data Gateway: the “bridge” when a source is on-prem or not directly reachable.
Storage and serving (where data lands)
Lakehouse: file + table experience (often great for bronze/silver-style layers).
Warehouse: SQL-first experience (great for dimensional modeling and many BI workloads).
OneLake: the underlying unified data layer that helps centralize data access patterns.
After each load, you query what you produced to confirm (a minimal validation sketch follows this checklist):
The data is complete (row counts, expected date ranges).
The data is clean (types, null rates, key uniqueness).
The data is usable (joins behave, performance is reasonable).
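A minimal sketch of these checks in a Fabric notebook, assuming PySpark and a hypothetical silver_sales table with illustrative column names (the spark session is predefined in Fabric notebooks):

```python
from pyspark.sql import functions as F

# Hypothetical table and column names, for illustration only
df = spark.read.table("silver_sales")

# Completeness: row count and expected date range
df.agg(
    F.count("*").alias("row_count"),
    F.min("order_date").alias("min_date"),
    F.max("order_date").alias("max_date"),
).show()

# Cleanliness: null rate on a key column
total = df.count()
nulls = df.filter(F.col("customer_id").isNull()).count()
print(f"customer_id null rate: {nulls / total:.2%}")

# Usability: duplicate business keys will break joins downstream
dupes = df.groupBy("order_id").count().filter("count > 1")
print(f"duplicate order_id values: {dupes.count()}")
```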
Scenario A: Ingest SaaS + on-prem into a Lakehouse
You pull daily customer updates from a SaaS source and transaction logs from an on-prem SQL Server.
SaaS: use Dataflow Gen2 for easy connector + shaping.
On-prem: use On-premises Data Gateway and schedule ingestion.
Land raw data into “raw/bronze,” then transform into “clean/silver” tables.
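A minimal sketch of that bronze-to-silver step, assuming a PySpark notebook and hypothetical bronze_customers/silver_customers tables:

```python
from pyspark.sql import functions as F

# Bronze table landed as-is by the ingestion path (names are illustrative)
raw = spark.read.table("bronze_customers")

clean = (
    raw.dropDuplicates(["customer_id"])                 # stabilize the key
       .withColumn("email", F.lower(F.trim("email")))   # standardize values
       .withColumn("loaded_at", F.current_timestamp())  # lineage/freshness
)

# Publish a stable, typed silver table for downstream consumers
clean.write.mode("overwrite").format("delta").saveAsTable("silver_customers")
```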
Scenario B: Build a repeatable pipeline for a Warehouse
You have a Warehouse used by many analysts via SQL.
Use a Data Pipeline to orchestrate ingestion steps (copy, then transformations, then validations).
Transform in SQL for consistency and governance in a SQL-centric environment.
Add checks: row counts by day, rejected rows table, and a simple “load status” output.
Scenario C: Heavy transformations with code
Your data arrives as nested JSON with inconsistent fields.
Use a Notebook to parse, normalize, and write standardized tables (a minimal sketch follows this list).
Keep transformation logic readable and versionable (clear functions, clear outputs).
Validate results with targeted queries and data-quality metrics.
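A minimal PySpark sketch for this scenario; the path, field names, and output table are assumptions. Note that spark.read.json infers a merged schema across files, so a field missing from some files simply surfaces as null:

```python
from pyspark.sql import functions as F

# Read nested JSON landed in the Lakehouse Files area (path is hypothetical)
raw = spark.read.option("multiline", "true").json("Files/raw/events/")

flat = raw.select(
    F.col("id").alias("event_id"),
    F.col("payload.user.id").alias("user_id"),
    # Tolerate inconsistent fields: prefer the new name, fall back to the old
    F.coalesce(F.col("payload.amount"), F.col("payload.total")).alias("amount"),
    F.to_timestamp("payload.ts").alias("event_time"),
)

# Write a standardized table that downstream steps can rely on
flat.write.mode("append").format("delta").saveAsTable("silver_events")
```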
Common pitfalls to watch for:
Choosing a tool that doesn’t match the work: trying to do heavy parsing in a low-code flow can become fragile; doing simple reshaping in a notebook can slow teams down. Pick the simplest tool that still stays maintainable.
Not designing for repeat runs: data preparation must be idempotent or at least predictable; re-running should not duplicate rows or corrupt partitions (a MERGE-based sketch follows this list).
Schema drift surprises: sources change. Protect yourself with explicit typing, validation steps, and “reject/quarantine” paths for unexpected columns/values.
Gateway issues feel like “random failures”: when on-prem data fails, verify gateway connectivity, credentials, and whether the scheduled identity has access to the source.
No validation layer: if you don’t query and validate after each load, you’ll only discover problems when reports are wrong—often days later.
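One common way to make re-runs safe is a Delta MERGE keyed on the business key, so reprocessing the same batch cannot duplicate rows. A minimal sketch, assuming hypothetical staging_orders and silver_orders tables:

```python
from delta.tables import DeltaTable

# Today's extracted batch (name is illustrative)
updates = spark.read.table("staging_orders")

# Upsert into the target: matched keys are updated, new keys inserted,
# so running the same batch twice leaves the table unchanged
target = DeltaTable.forName(spark, "silver_orders")
(
    target.alias("t")
    .merge(updates.alias("s"), "t.order_id = s.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```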
At a high level, DP-600 will expect you to:
Choose an ingestion approach (Dataflow Gen2 vs Data Pipeline vs Notebook) based on constraints like complexity, maintainability, and connectivity.
Describe how data moves from sources into Lakehouse/Warehouse and how you’d validate the result.
Recognize symptoms: duplicate loads, missing days, schema drift, slow queries, and “works in dev but fails on schedule.”
Preparing data is about building a reliable path from raw inputs to clean tables:
Use the right ingestion tool for the job.
Transform with clear, testable steps.
Validate by querying what you produced—every time.
Next, we’ll move into Implement and manage semantic models, where the focus shifts from “tables exist” to “business meaning, performance, and governed reuse.”
DP-600 scenarios often disguise the “right tool” behind constraints. Use these pivots:
Dataflow Gen2 when you need repeatable ingestion + shaping with Power Query-style steps, and the complexity stays moderate.
Data Pipeline when the problem is orchestration: dependencies, retries, scheduling, and multi-step movement/landing patterns (even if transforms are light).
Notebook (or Spark Job Definition) when parsing/normalization is complex (nested JSON, heavy enrichment, custom logic) or when you need scalable code-based processing.
On-premises Data Gateway when the source is on-prem or otherwise not directly reachable from the service and you must bridge connectivity.
Exam pattern: if the prompt emphasizes “schedule + retries + dependencies,” lean Data Pipeline; if it emphasizes “complex parsing or custom logic,” lean Notebook; if it emphasizes “business-friendly shaping and connectors,” lean Dataflow Gen2.
Ingesting and accessing are different commitments. When you “ingest,” you create a managed landing zone that you can re-run, audit, and optimize. When you “access,” you’re often pointing to an existing location or using shortcuts/patterns that reduce duplication.
A safe enterprise stance is:
Ingest when you need auditability, repeatability, schema control, or performance isolation.
Access when you need fast onboarding, shared data reuse, or single-source-of-truth behavior and governance already exists upstream.
If a scenario mentions “multiple teams reusing the same source,” check if “access” patterns can reduce duplicates—but keep validation and governance in mind.
Two discovery modes show up:
OneLake catalog: “What data assets already exist in my tenant/workspaces, and how do I find the right tables/files?”
Real-Time hub: “What streaming/real-time data is available right now, and how do I subscribe or explore it?”
Exam trap: discovery is not ingestion. The prompt may ask you to “discover” before you “ingest,” meaning you should reference the catalog/hub rather than immediately choosing a pipeline.
Use workload shape to choose the landing/serving surface:
Lakehouse when you want flexible file+table patterns, medallion layering, and a broad mix of engineering and analytics workflows.
Warehouse when you want SQL-first modeling, dimensional patterns, and consistent T-SQL governance for many BI consumers.
Eventhouse (KQL Database) when the core workload is log/telemetry/time-series or event analytics where KQL is a natural fit (fast filtering/aggregation on events).
A “wrong but tempting” exam choice is forcing everything into a Warehouse because it’s SQL; if the scenario is clearly event/log analytics, Eventhouse (KQL Database) is usually the intended answer.
Integration questions often aim at “how does this data become reusable?”
For Eventhouse (KQL Database): confirm your path to governed reuse (who can query with KQL, what is published/curated, and how downstream tools consume it).
For Semantic Model (Dataset): confirm the path from curated tables to a model that others can build on, without duplicating logic in each report.
Exam hint: if the prompt connects real-time/event data to BI consumption, you’ll likely need to mention both Eventhouse (KQL Database) + Semantic Model (Dataset) integration considerations, not just ingestion.
When a scenario asks “where should we do the transformation,” answer with maintainability and pushdown logic:
Dataflow Gen2 for transparent, stepwise transforms that business/BI teams can maintain.
Notebook / Spark Job Definition for scalable, code-heavy transforms and complex parsing.
Warehouse / SQL Analytics Endpoint for SQL-centric teams and governed, reviewable transformations (views, functions, stored procedures) that are easy to validate with queries.
Decision rule that holds in exam scenarios:
Put transformations closest to the team that must maintain them, as long as performance and governance are acceptable.
Prefer pushdown when possible (do work where it’s cheapest/fastest and easiest to govern).
Centralize transformations when many downstream consumers rely on the same standardized logic.
A common enterprise workflow:
Transform raw data into curated fact and dimension tables (Lakehouse/Warehouse).
Then build the semantic model on top with clear relationships and measures.
Why this matters: if you rely on the semantic model to “fix” messy shapes, performance and correctness issues multiply across reports.
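A sketch of the “curate first, model second” shape, assuming PySpark and hypothetical table/column names:

```python
from pyspark.sql import functions as F

sales = spark.read.table("silver_sales")  # curated input (illustrative)

# Dimension: enforce one row per customer before any model depends on it
dim_customer = (
    sales.select("customer_id", "customer_name", "segment")
         .dropDuplicates(["customer_id"])
)
dim_customer.write.mode("overwrite").format("delta").saveAsTable("dim_customer")

# Fact: measures at a declared grain, carrying only the dimension key
fact_sales = sales.select("order_id", "order_date", "customer_id", "amount")
fact_sales.write.mode("overwrite").format("delta").saveAsTable("fact_sales")
```

The semantic model then adds relationships and measures on top of these tables instead of repairing their shape inside every report.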
When a dataset grows 10x unexpectedly after a change, it’s usually one of these:
A many-to-many join was introduced without controls.
A dimension table lost uniqueness (duplicate keys).
A bridge/lookup table expanded the grain.
Advanced safety checks you should operationalize (a code sketch follows this list):
Enforce uniqueness on dimension keys (or at least validate it every load).
Validate row counts and distinct keys pre/post join.
Create “reject/quarantine” outputs for duplicates, missing keys, and type failures.
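A minimal sketch of the uniqueness and grain checks, reusing the hypothetical fact_sales/dim_customer tables from earlier:

```python
# 1) Dimension keys must be unique, or joins will multiply rows
dim = spark.read.table("dim_customer")
assert dim.count() == dim.select("customer_id").distinct().count(), \
    "dim_customer has duplicate keys"

# 2) Row counts must be stable across the join (same grain in = out)
fact = spark.read.table("fact_sales")
before = fact.count()
after = fact.join(dim, "customer_id", "left").count()
assert before == after, f"join changed the grain: {before} -> {after}"
```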
High-value patterns (and easy exam points):
Write load metrics: rows in/out, rejected count, null rate by key columns, max/min dates.
Persist reject tables (bad rows) rather than silently dropping them.
Create a small “data quality status” table that dashboards can read to show freshness and health.
This turns “we think it loaded” into “we can prove it loaded correctly.”
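A minimal sketch of writing load metrics to a status table, assuming PySpark and illustrative names:

```python
from pyspark.sql import functions as F

df = spark.read.table("silver_orders")  # the table this load produced

# Compute the metrics once, then stamp and persist them
metrics = (
    df.agg(
        F.count("*").alias("rows_loaded"),
        F.sum(F.col("customer_id").isNull().cast("int")).alias("null_keys"),
        F.max("order_date").alias("max_date"),
    )
    .withColumn("load_time", F.current_timestamp())
    .withColumn("status", F.lit("success"))
)

# Dashboards read this small table to show freshness and health
metrics.write.mode("append").format("delta").saveAsTable("load_status")
```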
DP-600 expects you to match the query tool to the analysis need:
Visual Query Editor: fast interactive exploration and validation when you want to see transformations and results without writing everything by hand.
SQL (Warehouse / SQL Analytics Endpoint): best for joins, dimensional validation, and reconciling counts/aggregates.
KQL (Eventhouse / KQL Database): best for event/time-series style slicing, filtering, and summarization.
DAX (often via DAX Query View): best for validating semantic model measures and filter context behavior.
Exam trap: using SQL to validate a DAX measure can miss filter-context behavior; when the question is “why is the KPI wrong in the report,” you often need DAX-level validation, not just SQL-level aggregates.
A layered validation sequence:
Validate raw ingestion: expected date range, row counts, and key distribution.
Validate transforms: uniqueness of dimensions, join grain, null/duplicate checks.
Validate serving layer: SQL aggregates match business expectations at the table level.
Validate semantic layer: measures under filters behave (DAX context) and security filters (RLS) don’t distort results.
This is the fastest way to separate “data problem” vs “model problem” vs “report problem.”
When performance regresses after a transform:
First, confirm whether you created more data (row explosion, wider tables, higher-cardinality columns).
Then, confirm whether the query path changed (more joins, less filtering, missing partition-pruning patterns).
Finally, confirm whether you’re testing in the right surface (SQL vs KQL vs DAX), since each has different performance drivers.
For exam responses, always name the “likely cause category” (data volume, join grain, filter selectivity, calculation complexity) and a concrete verification (row count checks, distinct key checks, explain/plan style reasoning, or a targeted subset query).
Two common “why don’t totals match?” patterns:
Date boundary drift: source is in one time zone, reporting in another; daily totals shift across midnight boundaries.
Late-arriving events/transactions: “yesterday” changes after the fact.
The correct exam posture:
Define the business rule (which time zone, what is “day”).
Use a consistent “as-of” cutoff or watermarking logic in your ingestion/transforms (a minimal sketch follows this list).
Validate aggregates with the same cutoff rule on both source and destination.
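A minimal sketch of applying one business rule consistently, assuming PySpark, UTC-stored event times, and hypothetical table/column names; the business time zone is an assumption:

```python
from pyspark.sql import functions as F

df = spark.read.table("silver_transactions")  # illustrative table

BUSINESS_TZ = "America/New_York"  # the agreed business time zone (assumption)

as_of = (
    # As-of cutoff: compare source and destination with the same watermark
    df.filter(F.col("ingested_at") <= F.lit("2024-06-01 00:00:00").cast("timestamp"))
      # Define "day" once: UTC event time rendered in the business time zone
      .withColumn("event_day", F.to_date(F.from_utc_timestamp("event_time", BUSINESS_TZ)))
)

as_of.groupBy("event_day").count().orderBy("event_day").show()
```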
Practice questions
When should a Lakehouse be used instead of a Warehouse in Microsoft Fabric?
A Lakehouse should be used when analytics workloads require flexible storage for structured and unstructured data in open formats such as Delta Lake.
Lakehouses combine the scalability of a data lake with structured query capabilities. They store data in open formats like Delta Lake, allowing integration with Spark, notebooks, and machine learning workloads. This architecture supports large-scale ingestion pipelines and schema evolution scenarios. Warehouses, in contrast, are optimized for relational analytics and structured SQL workloads. They enforce stronger schema constraints and typically serve business intelligence scenarios where predictable query performance is required. Organizations often use Lakehouses as the raw and transformation layer while Warehouses support curated analytical datasets used by reports and dashboards.
Demand Score: 88
Exam Relevance Score: 92
What is the purpose of Dataflows Gen2 in Microsoft Fabric?
Dataflows Gen2 provide a low-code environment for ingesting and transforming data before storing it in Fabric destinations such as Lakehouses or Warehouses.
Dataflows Gen2 extend the Power Query engine into Fabric’s data engineering environment. They allow users to connect to multiple sources, perform transformations, and load results into scalable storage platforms. Unlike earlier Power BI dataflows, which primarily fed Power BI datasets, Gen2 dataflows support enterprise data engineering scenarios and integrate with Fabric pipelines. Transformations can include filtering, joins, calculated columns, and schema reshaping. Because the transformations are declarative and reusable, they are often used for standardized ingestion processes across departments. This approach allows organizations to maintain consistent data preparation logic while reducing the need for custom code in notebooks.
Demand Score: 85
Exam Relevance Score: 90
How does the SQL analytics endpoint enable querying Lakehouse data?
The SQL analytics endpoint provides a T-SQL interface that allows users to query Delta tables stored in a Lakehouse.
Each Fabric Lakehouse automatically exposes a SQL endpoint that translates T-SQL queries into operations against the underlying Delta tables. This allows analysts who are familiar with SQL Server–style querying to analyze Lakehouse data without using Spark notebooks. For a Lakehouse, the endpoint is read-only: analysts can query data and create views, but writes to the Delta tables happen through Spark or other engines. The endpoint supports joins, aggregations, and other analytical queries while maintaining compatibility with tools such as Power BI and SQL clients. This capability bridges the gap between data engineering workflows and traditional business intelligence tooling. However, heavy transformations or large-scale processing tasks are typically better suited for Spark engines rather than the SQL endpoint.
Demand Score: 82
Exam Relevance Score: 90
What factors should determine whether data transformations occur in Dataflows Gen2 or notebooks?
Transformations should occur in Dataflows Gen2 for reusable, low-code data preparation, while notebooks are better suited for complex or large-scale programmatic transformations.
Dataflows Gen2 are optimized for declarative Power Query transformations and are ideal for standard ingestion pipelines that require repeatable logic. They are easier to maintain and accessible to analysts who may not write code. Notebooks, on the other hand, support Spark-based transformations using Python, Scala, or SQL. These are better suited for advanced processing tasks such as large-scale joins, machine learning preparation, or distributed data manipulation. Choosing between the two depends on complexity, scalability requirements, and the skill set of the team managing the transformation pipeline.
Demand Score: 80
Exam Relevance Score: 88
What is the role of Delta tables in Microsoft Fabric Lakehouses?
Delta tables provide transactional storage with schema enforcement and versioning for data stored in Fabric Lakehouses.
Delta tables are built on the Delta Lake format and enable reliable data storage in distributed environments. They maintain transaction logs that support ACID properties, allowing multiple processes to read and write data safely. This design ensures consistency even in high-concurrency environments common in analytics platforms. Delta tables also support features such as schema evolution, time travel, and optimized query performance through indexing and metadata management. These capabilities make them well suited for analytics pipelines where data is frequently updated and queried by multiple tools.
Demand Score: 84
Exam Relevance Score: 91
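To make the Delta table behaviors above concrete, a small PySpark sketch (the sales table is hypothetical and assumed to already exist with prior versions):

```python
from pyspark.sql import Row

# Schema evolution: append rows carrying a new column ("channel") and
# let Delta merge the schema instead of rejecting the write
new_rows = spark.createDataFrame([Row(id=1, amount=10.0, channel="web")])
new_rows.write.option("mergeSchema", "true").mode("append").saveAsTable("sales")

# Time travel: query the table as of an earlier version via Delta SQL
v0 = spark.sql("SELECT * FROM sales VERSION AS OF 0")
v0.show()
```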
Why might an organization stage raw data before transforming it in Fabric?
Staging raw data allows organizations to preserve the original source data while enabling controlled transformation and cleansing processes.
In many analytics architectures, raw data is first ingested into a landing or bronze layer within a Lakehouse. This layer preserves the exact state of incoming data and acts as a historical record. Transformations then occur in subsequent layers, such as silver or curated datasets, where data quality rules and business logic are applied. This layered architecture improves traceability, enables rollback if transformation errors occur, and supports auditing requirements. It also allows multiple downstream transformations to reuse the same raw dataset without repeatedly extracting data from source systems.
Demand Score: 81
Exam Relevance Score: 87