Data transformation refers to modifying, enriching, combining, or restructuring data inside Snowflake.
Snowflake offers multiple transformation mechanisms, ranging from classic SQL to advanced declarative pipelines like Dynamic Tables.
The exam expects you to understand what each mechanism does, how they differ, and when each is appropriate.
Classic ELT (Extract → Load → Transform) is Snowflake's recommended approach because the platform is built for in-database transformations rather than relying on external ETL engines.
Snowflake uses standard SQL for creating and managing objects:
CREATE SCHEMA
CREATE TABLE
CREATE VIEW
ALTER TABLE
DROP TABLE
DDL defines the structure of your transformed data.
Transformations frequently use DML:
INSERT
UPDATE
DELETE
MERGE (UPSERT operations)
COPY INTO for loading
Snowflake’s compute layer is optimized for these operations at scale.
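For example, MERGE can upsert staged rows into a target in one statement. A minimal sketch (table and column names are illustrative):
MERGE INTO customers AS t
USING staged_customers AS s
  ON t.customer_id = s.customer_id
WHEN MATCHED THEN
  UPDATE SET t.email = s.email        -- update rows that already exist
WHEN NOT MATCHED THEN
  INSERT (customer_id, email)         -- insert rows that are new
  VALUES (s.customer_id, s.email);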
CTAS is useful for:
Creating transformed tables in a single step
Materializing intermediate results
Fast prototyping of new models
Example:
CREATE TABLE sales_agg AS
SELECT region, SUM(amount) AS total_sales
FROM raw_sales
GROUP BY region;
Materialized views:
Store precomputed results based on a SQL query
Update incrementally as source data changes
Improve performance on expensive repetitive queries
Cost more (compute + storage)
Use cases:
Dashboards
Frequently accessed aggregates
Expensive joins/computations
Know the difference between:
Standard views (not stored; recomputed on every query)
Materialized views (stored; incrementally refreshed)
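A minimal materialized view sketch (object names are illustrative):
-- Precomputed aggregate over a single base table
CREATE MATERIALIZED VIEW daily_sales_mv AS
SELECT sale_date, SUM(amount) AS total_amount
FROM raw_sales
GROUP BY sale_date;
Queries against daily_sales_mv read the stored results, and Snowflake maintains them in the background as raw_sales changes.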
Streams and Tasks allow incremental, automated, scheduled, or event-driven transformations.
A Stream tracks row-level changes (INSERT, UPDATE, DELETE) on a table.
A stream acts like a delta view:
It captures changes made to a table since the last time the stream was consumed.
Once the stream is consumed in a committed DML transaction, its offset moves forward.
Streams expose metadata columns such as:
METADATA$ACTION (operation type: INSERT or DELETE)
METADATA$ISUPDATE (flags rows that are part of an UPDATE)
METADATA$ROW_ID (a unique row identifier)
Common use cases:
Incremental ETL/ELT
Slowly Changing Dimensions
Partitioned ingestion
Audit tracking
A stream must be consumed in a DML transaction for its offset to advance; a SELECT alone does not move it.
Streams do not store raw data copies—they store change metadata.
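A minimal stream sketch (names are illustrative):
-- Track row-level changes on a source table
CREATE STREAM raw_sales_stream ON TABLE raw_sales;
-- Inspect pending changes, including the METADATA$ columns;
-- a plain SELECT does not advance the offset
SELECT * FROM raw_sales_stream;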
Tasks allow Snowflake to run SQL on a schedule or as part of a dependency graph (DAG).
A Task executes:
A SQL statement
A stored procedure
A MERGE operation
An INSERT/UPDATE step
Tasks can be:
Time-based (cron-like schedules)
Dependency-based (triggered after another task finishes)
This is one of Snowflake’s most important transformation patterns:
Stream captures new or changed rows
Task picks up the changes and applies transformations
Example pipeline:
Stream detects changes in raw table
Task MERGEs changes into a cleaned table
Another Task aggregates the cleaned data
Typical applications:
Incremental pipelines
Daily/hourly/near-real-time updates
Multi-step DAG-based transformations
Streams + Tasks are powerful but must be maintained, especially in complex dependency chains.
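A minimal sketch of the pattern (warehouse, table, and stream names are illustrative):
-- Run every 5 minutes, but only when the stream has data
CREATE TASK merge_sales_task
  WAREHOUSE = transform_wh
  SCHEDULE = '5 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('RAW_SALES_STREAM')
AS
MERGE INTO clean_sales AS t
USING raw_sales_stream AS s     -- consuming the stream advances its offset
  ON t.sale_id = s.sale_id
WHEN MATCHED THEN UPDATE SET t.amount = s.amount
WHEN NOT MATCHED THEN INSERT (sale_id, amount) VALUES (s.sale_id, s.amount);
-- Tasks are created suspended; resume to start the schedule
ALTER TASK merge_sales_task RESUME;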
Dynamic Tables are Snowflake’s newer, declarative transformation feature, simplifying multi-step pipelines without requiring Streams + Tasks for every step.
You define:
A SQL query
A target freshness, or target lag (e.g., data must be within 15 minutes of upstream sources)
Snowflake takes care of:
Change tracking
Dependency resolution
Scheduling
Incremental updates
Dynamic Tables: Snowflake manages your pipeline automatically
Streams + Tasks: You manually orchestrate schedules, offsets, dependencies
Dynamic Tables simplify pipelines like:
Raw → Clean → Modeled → Aggregated
without writing many Task and Stream definitions.
Each Dynamic Table has a target freshness, telling Snowflake:
“How up-to-date must the data be?”
Examples:
1 minute
5 minutes
15 minutes
1 hour
Snowflake then computes how often to update the table based on upstream changes.
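A minimal Dynamic Table sketch (names are illustrative):
-- Snowflake schedules refreshes so results stay within 15 minutes of the source
CREATE DYNAMIC TABLE sales_clean
  TARGET_LAG = '15 minutes'
  WAREHOUSE = transform_wh
AS
SELECT sale_id, region, amount
FROM raw_sales
WHERE amount IS NOT NULL;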
Dynamic Tables are best for:
Multi-step transformation pipelines
Data models that must stay fresh
Simplifying DAG management
Streams + Tasks are best for:
Complex incremental logic
Custom scheduling
Workloads with branching, conditional logic
Materialized views are best for:
Simple aggregates
Expensive queries that must be cached
Dashboards with frequent access
The exam will likely ask you to identify which mechanism is appropriate in different scenarios.
SQL is often enough for transformations, but sometimes you need code, procedural algorithms, or machine learning.
Snowflake offers several programmable components.
Snowpark provides APIs in Python, Java, and Scala to define data transformations programmatically.
Code executes inside Snowflake, close to the data
No need to move data into external compute environments
Great for:
Data engineering
Feature engineering
ML pipelines
Complex transformations
Snowpark supports:
DataFrame operations
UDFs written in Python
ML training (Snowpark ML)
User-defined table functions
Stored procedures enable imperative logic, such as:
Loops
Conditionals
Branching workflows
Complex orchestration logic
Written in:
SQL (Snowflake Scripting)
JavaScript
Python, Java, or Scala (via Snowpark)
Useful when SQL alone is insufficient.
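A minimal Snowflake Scripting sketch (procedure and table names are illustrative):
CREATE OR REPLACE PROCEDURE purge_region(region_name STRING)
RETURNS STRING
LANGUAGE SQL
EXECUTE AS OWNER
AS
$$
DECLARE
  row_count INTEGER;
BEGIN
  DELETE FROM raw_sales WHERE region = :region_name;  -- :var binds a variable in SQL
  row_count := SQLROWCOUNT;                           -- rows affected by the last DML
  RETURN 'Deleted ' || row_count || ' rows';
END;
$$;
CALL purge_region('EMEA');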
User-defined functions extend Snowflake's compute logic:
Scalar UDFs operate on one input row → one output value.
Used for:
Custom formatting
Mathematical calculations
Business logic
User-defined table functions (UDTFs) return a set of rows, useful for:
Exploding arrays
Generating series
Complex parsing logic
UDFs/UDTFs can be written in:
SQL
JavaScript
Java
Python
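Two minimal SQL sketches (names are illustrative):
-- Scalar UDF: one row in, one value out
CREATE FUNCTION net_price(price NUMBER, tax_rate NUMBER)
RETURNS NUMBER
AS 'price / (1 + tax_rate)';
-- UDTF: returns a set of rows per call
CREATE FUNCTION sales_for_region(r STRING)
RETURNS TABLE(sale_id NUMBER, amount NUMBER)
AS 'SELECT sale_id, amount FROM raw_sales WHERE region = r';
-- UDTFs are invoked with TABLE(...)
SELECT * FROM TABLE(sales_for_region('EMEA'));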
SQL is the foundation of transformations.
Streams capture row-level changes; Tasks run scheduled logic.
Dynamic Tables automate pipelines with declarative freshness.
Materialized views precompute expensive queries.
Snowpark supports code-based transformations and ML.
UDFs and stored procedures extend transformation capabilities beyond SQL.
Streams provide Change Data Capture (CDC) tracking in Snowflake, but their behavior includes important constraints that commonly appear in exam questions.
Streams only retain change tracking for a window tied to the source table's data retention period; for standard tables, Snowflake can extend this up to 14 days by default (MAX_DATA_EXTENSION_TIME_IN_DAYS).
If a stream is not consumed within this window, it becomes stale and cannot be recovered.
Stream support depends on the source object:
Standard tables support full streams (INSERT, UPDATE, DELETE)
External Tables and Directory Tables support insert-only streams
Views require change tracking enabled on their underlying tables
Full streams rely on Snowflake-managed storage formats to track micro-partition versions.
Standard Streams: Track INSERT, UPDATE, and DELETE operations.
Append-Only Streams: Track INSERT operations only.
Useful for log-style ingestion where rows are only ever inserted, never updated.
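A minimal sketch (names are illustrative):
-- Records inserts only; updates and deletes are ignored
CREATE STREAM raw_events_stream ON TABLE raw_events APPEND_ONLY = TRUE;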
A stream's offset advances when the stream is consumed in a DML statement inside a committed transaction.
This means:
Consuming a stream via DML (for example, INSERT ... SELECT or MERGE) advances the offset
A subsequent read does not return the same changes again
A plain SELECT does not advance the offset
This behavior requires pipelines to be idempotent.
Because delivery is effectively at-least-once across failures and retries, changes may require reconciliation.
Using MERGE or deduplication logic ensures data correctness across retries.
Streams store metadata about changes, not copies of changed rows.
If the underlying table data ages out (due to retention expiration), the stream becomes invalid.
Tasks automate SQL execution in Snowflake but operate under several constraints.
Tasks must run using either:
A user-managed virtual warehouse, or
Serverless task execution (if configured)
Serverless tasks consume Snowflake-managed compute credits.
Tasks can form Directed Acyclic Graphs (DAGs).
If a parent task is:
Suspended
Disabled
Erroring persistently
then downstream tasks also pause automatically.
Scheduling behavior:
Cron syntax is supported
Minimum scheduling intervals apply
Slight delays may occur due to system resource scheduling
Tasks may retry automatically depending on configuration and task type
Warehouse-backed tasks and serverless tasks follow different retry semantics.
Tasks can run:
SQL statements
MERGE operations
Stored procedures
They cannot run external OS commands or arbitrary system code.
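A minimal DAG sketch (task and table names are illustrative):
-- Root task: serverless because no WAREHOUSE is specified; cron-scheduled hourly
CREATE TASK load_root_task
  SCHEDULE = 'USING CRON 0 * * * * UTC'
AS
  INSERT INTO clean_sales
  SELECT sale_id, region, amount FROM raw_sales WHERE amount IS NOT NULL;
-- Child task: runs only after the root task succeeds; no schedule of its own
CREATE TASK aggregate_task
  AFTER load_root_task
AS
  INSERT INTO sales_agg SELECT region, SUM(amount) FROM clean_sales GROUP BY region;
-- Resume children before the root so the whole DAG is active
ALTER TASK aggregate_task RESUME;
ALTER TASK load_root_task RESUME;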
Dynamic Tables provide declarative pipelines with automatic refresh rules but include important limitations.
Dynamic Tables use Snowflake-managed serverless compute.
Costs are driven by:
The frequency of refreshes
The complexity of upstream changes
The defined freshness requirement
You cannot perform:
INSERT
UPDATE
DELETE
MERGE
against a Dynamic Table.
They are fully managed by Snowflake.
Dynamic Tables cannot be used as upstream sources for streams.
This preserves clarity of dependency chains.
Non-deterministic functions are restricted, including:
RANDOM
CURRENT_TIMESTAMP (unless used with deterministic semantics)
Other functions that vary per execution
Queries that use them either fall back to full refresh or are rejected, which keeps incremental updates reproducible.
A tighter freshness target (for example, 1 minute of lag) causes more frequent refresh cycles, increasing cost.
A looser target reduces compute consumption.
Dynamic Tables analyze dependency graphs automatically and determine:
Compute order
Required refresh level
Efficient incremental updates
This reduces the need for manual orchestration.
Materialized Views (MVs) accelerate repeated queries but must follow strict rules.
Materialized Views cannot reference:
Another Materialized View
Non-deterministic functions
Certain complex SQL constructs, depending on system constraints
Examples include:
Some window functions
Certain joins
Nested subqueries
This ensures the engine can maintain incremental refresh efficiently.
An MV must reference only one base table.
Joins to other tables are not allowed.
Refresh cost depends on:
Number of micro-partitions changed
Frequency of updates
Table size
Large or frequently updated tables increase MV maintenance cost.
Snowflake may skip MV refresh when:
Changes are insignificant
Cost is too high relative to impact
The MV would not materially change
This behavior optimizes platform efficiency.
Search Optimization improves performance for highly selective searches.
Search Optimization benefits:
LIKE '%text%' pattern searches
IN list queries
Searches on semi-structured data (for example, JSON fields)
These cases typically cannot benefit from clustering.
Search Optimization does not help range-based pruning, such as:
WHERE date BETWEEN '2023-01-01' AND '2023-01-31'
Clustering keys are more effective for these queries.
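A minimal sketch (table and column names are illustrative; Search Optimization requires Enterprise Edition):
-- Enable for the whole table (equality and IN lookups)
ALTER TABLE raw_events ADD SEARCH OPTIMIZATION;
-- Or target substring searches on one column
ALTER TABLE raw_events ADD SEARCH OPTIMIZATION ON SUBSTRING(payload);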
Search Optimization introduces:
Additional persistent index-like structures
Additional maintenance compute costs
More overhead when tables are frequently updated
This affects transformation performance, especially during large MERGE operations.
Stored Procedures (SPs) allow procedural logic but operate within Snowflake’s controlled execution environment.
Stored Procedures can be written in:
SQL (Snowflake Scripting)
JavaScript
Python, Java, and Scala (via Snowpark)
JavaScript procedures return a single value (often a VARIANT), so tabular output must be constructed manually; Snowflake Scripting and Snowpark procedures can declare RETURNS TABLE(...) to return rows directly.
Two execution modes impact security:
EXECUTE AS CALLER (uses caller’s privileges)
EXECUTE AS OWNER (uses owner’s privileges)
This distinction is important for governance.
SPs use:
Warehouse compute, or
Serverless compute (depending on configuration)
SP cost is tied to underlying queries or operations.
Stored Procedures cannot make arbitrary external network calls unless routed through:
External functions
External access integrations
This prevents unauthorized data exfiltration.
User-defined Functions (UDFs) and User-defined Table Functions (UDTFs) extend SQL but with controlled behavior.
UDFs should not maintain state across calls.
Each call is treated as isolated and is expected to return results based solely on its input arguments (functions whose results can vary are marked VOLATILE).
Python UDFs:
Run in isolated, sandboxed environments
Have restricted library availability (curated packages)
Cannot access external network resources unless an external access integration is configured
This strengthens security and isolation.
UDTFs can output multiple rows per input row.
Two modes define how UDFs treat NULL input:
RETURNS NULL ON NULL INPUT
CALLED ON NULL INPUT
This is frequently tested on SnowPro exams.
UDFs cannot make external API calls or access remote systems directly; external access integrations are the controlled exception.
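A minimal sketch of the two NULL-handling modes (function names are illustrative):
-- NULL input short-circuits to NULL; the body never runs
CREATE FUNCTION clean_upper(s STRING)
RETURNS STRING
RETURNS NULL ON NULL INPUT
AS 'UPPER(TRIM(s))';
-- The body runs even for NULL input, so it can substitute a default
CREATE FUNCTION label_or_default(s STRING)
RETURNS STRING
CALLED ON NULL INPUT
AS 'COALESCE(s, ''unknown'')';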
Snowflake enforces governance policies automatically across transformation pipelines.
Masking Policies and Row Access Policies automatically apply to all downstream objects, including:
Views
Materialized Views
Dynamic Tables (depending on configuration)
This preserves governance guarantees.
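A minimal masking policy sketch (role, policy, and table names are illustrative):
-- Reveal email only to a privileged role; mask it for everyone else
CREATE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
  CASE WHEN CURRENT_ROLE() = 'PII_ADMIN' THEN val ELSE '***MASKED***' END;
ALTER TABLE customers MODIFY COLUMN email SET MASKING POLICY email_mask;
Once attached, the policy follows the column through views and other downstream objects.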
Tags applied to:
Databases
Schemas
Tables
Columns
may propagate through transformations such as:
Views
Materialized Views
Dynamic Tables
Propagation behavior can be configured.
If transformations violate governance rules, Snowflake may:
Reject the operation
Require explicit policy override
Enforce masking or filtering during execution
Understanding policy precedence and propagation is critical for real-world compliance scenarios.
What is a stream in Snowflake?
A stream is an object that tracks changes (inserts, updates, deletes) to a table using change data capture (CDC).
Streams record row-level changes since the last consumption. They do not store data themselves but reference table metadata. A common mistake is assuming streams persist historical data indefinitely—they only track changes until consumed.
Demand Score: 86
Exam Relevance Score: 94
What is a task in Snowflake?
A task is a scheduled or triggered job used to automate SQL execution, often for data transformations.
Tasks can run on a schedule or after another task completes. They are commonly used with streams for incremental processing. A common mistake is misconfiguring compute (no usable warehouse or serverless settings), causing execution failure.
Demand Score: 83
Exam Relevance Score: 92
How do streams and tasks work together?
Streams capture data changes, and tasks process those changes automatically by running transformation queries.
This enables incremental ETL pipelines. Tasks query streams to process only new data. A common mistake is running tasks without consuming the stream, leading to repeated processing.
Demand Score: 85
Exam Relevance Score: 95
What is the difference between a task and a stored procedure?
A task schedules execution, while a stored procedure contains procedural logic.
Tasks trigger execution, whereas procedures define reusable logic. They are often combined. A common mistake is expecting tasks to contain complex logic without procedures.
Demand Score: 80
Exam Relevance Score: 90
What is incremental data processing in Snowflake?
It is the process of transforming only new or changed data instead of reprocessing entire datasets.
Streams enable tracking changes, while tasks automate processing. This improves efficiency and reduces compute cost. A common mistake is performing full refresh transformations unnecessarily.
Demand Score: 82
Exam Relevance Score: 93