
COF-C02 Data Transformations

Data Transformations Detailed Explanation

Data transformation refers to modifying, enriching, combining, or restructuring data inside Snowflake.
Snowflake offers multiple transformation mechanisms, ranging from classic SQL to advanced declarative pipelines like Dynamic Tables.

The exam expects you to understand what each mechanism does, how they differ, and when each is appropriate.

1. Classic ELT with SQL

Classic ELT (Extract → Load → Transform) is Snowflake’s recommended approach: data is loaded first and transformed in-database, because Snowflake’s compute layer is built for large-scale SQL transformations rather than relying on external ETL engines.

1.1 SQL for Transformations

1.1.1 DDL (Data Definition Language)

Snowflake uses standard SQL for creating and managing objects:

  • CREATE SCHEMA

  • CREATE TABLE

  • CREATE VIEW

  • ALTER TABLE

  • DROP TABLE

DDL defines the structure of your transformed data.

1.1.2 DML (Data Manipulation Language)

Transformations frequently use DML:

  • INSERT

  • UPDATE

  • DELETE

  • MERGE (UPSERT operations)

  • COPY INTO for loading

Snowflake’s compute layer is optimized for these operations at scale.
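
As an illustration, MERGE combines UPDATE and INSERT in one statement; the table and column names below (customers, customers_stage) are hypothetical:

-- Upsert: update matching rows, insert new ones
MERGE INTO customers AS t
USING customers_stage AS s
  ON t.customer_id = s.customer_id
WHEN MATCHED THEN
  UPDATE SET t.email = s.email, t.updated_at = s.updated_at
WHEN NOT MATCHED THEN
  INSERT (customer_id, email, updated_at)
  VALUES (s.customer_id, s.email, s.updated_at);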

1.2 CTAS – CREATE TABLE AS SELECT

CTAS is useful for:

  • Creating transformed tables in a single step

  • Materializing intermediate results

  • Fast prototyping of new models

Example:

CREATE TABLE sales_agg AS
SELECT region, SUM(amount) AS total_sales
FROM raw_sales
GROUP BY region;

1.3 Materialized Views

Materialized views:

  • Store precomputed results based on a SQL query

  • Update incrementally as source data changes

  • Improve performance on expensive repetitive queries

  • Cost more (compute + storage)

Use cases:

  • Dashboards

  • Frequently accessed aggregates

  • Expensive joins/computations

Know the difference between:

  • Standard views (not stored; recomputed on every query)

  • Materialized views (stored; incrementally refreshed)
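
For instance, a minimal materialized view over the raw_sales table from the CTAS example above (column names assumed):

CREATE MATERIALIZED VIEW sales_by_region_mv AS
SELECT region, SUM(amount) AS total_sales
FROM raw_sales
GROUP BY region;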

2. Streams & Tasks (CDC and Scheduling)

Streams and Tasks allow incremental, automated, scheduled, or event-driven transformations.

2.1 Streams – Change Data Capture (CDC)

A Stream tracks row-level changes (INSERT, UPDATE, DELETE) on a table.

2.1.1 How Streams Work

A stream acts like a delta view:

  • It captures changes made to a table since the last time the stream was consumed.

  • Once read, the stream’s offset moves forward.

  • Streams expose metadata columns (see the sketch below):

    • METADATA$ACTION: the operation type (INSERT or DELETE)

    • METADATA$ISUPDATE: TRUE when the change is part of an UPDATE (recorded as a DELETE/INSERT pair)

    • METADATA$ROW_ID: a unique, immutable row identifier
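
A sketch of creating and inspecting a stream on the raw_sales table from section 1:

CREATE STREAM raw_sales_stream ON TABLE raw_sales;

-- A plain SELECT shows pending changes (including the METADATA$
-- columns) without advancing the stream's offset
SELECT * FROM raw_sales_stream;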

2.1.2 Use Cases
  • Incremental ETL/ELT

  • Slowly Changing Dimensions

  • Partitioned ingestion

  • Audit tracking

2.1.3 Important Rules
  • A stream’s offset advances only when the stream is consumed in a DML statement (for example, INSERT ... SELECT or MERGE) that commits; a plain SELECT does not advance it.

  • Streams do not store raw data copies; they store change metadata.

2.2 Tasks – Scheduling & DAG Execution

Tasks allow Snowflake to run SQL on a schedule or as part of a dependency graph (DAG).

2.2.1 How Tasks Work

A Task executes:

  • A SQL statement

  • A stored procedure

  • A MERGE operation

  • An INSERT/UPDATE step

Tasks can be:

  • Time-based (cron-like schedules)

  • Dependency-based (triggered after another task finishes)
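
A hedged sketch of both styles; the warehouse and task names (transform_wh, hourly_refresh, downstream_report) are illustrative:

-- Time-based: rebuild an aggregate every hour (cron is in UTC)
CREATE TASK hourly_refresh
  WAREHOUSE = transform_wh
  SCHEDULE = 'USING CRON 0 * * * * UTC'
AS
  CREATE OR REPLACE TABLE sales_agg AS
  SELECT region, SUM(amount) AS total_sales
  FROM raw_sales
  GROUP BY region;

-- Dependency-based: runs only after hourly_refresh succeeds
CREATE TASK downstream_report
  WAREHOUSE = transform_wh
  AFTER hourly_refresh
AS
  CREATE OR REPLACE TABLE sales_report AS
  SELECT * FROM sales_agg;

-- Tasks are created suspended; resume children before the root
ALTER TASK downstream_report RESUME;
ALTER TASK hourly_refresh RESUME;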

2.2.2 Combining Streams + Tasks

This is one of Snowflake’s most important transformation patterns:

  • Stream captures new or changed rows

  • Task picks up the changes and applies transformations

Example pipeline:

  1. Stream detects changes in raw table

  2. Task MERGEs changes into a cleaned table

  3. Another Task aggregates the cleaned data
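
A hedged sketch of step 2: the task wakes on a schedule but runs only when the stream has data (DELETE handling omitted for brevity; clean_sales and transform_wh are assumed names):

CREATE TASK apply_sales_changes
  WAREHOUSE = transform_wh
  SCHEDULE = '5 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('RAW_SALES_STREAM')
AS
  MERGE INTO clean_sales AS t
  USING raw_sales_stream AS s
    ON t.sale_id = s.sale_id
  WHEN MATCHED AND s.METADATA$ACTION = 'INSERT' THEN
    UPDATE SET t.region = s.region, t.amount = s.amount
  WHEN NOT MATCHED AND s.METADATA$ACTION = 'INSERT' THEN
    INSERT (sale_id, region, amount)
    VALUES (s.sale_id, s.region, s.amount);

Because the MERGE consumes the stream, the stream’s offset advances when the task’s transaction commits.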

2.2.3 Typical Use Cases
  • Incremental pipelines

  • Daily/hourly/near-real-time updates

  • Multi-step DAG-based transformations

Streams + Tasks are powerful but must be maintained, especially in complex dependency chains.

3. Dynamic Tables

Dynamic Tables are Snowflake’s newer, declarative transformation feature, simplifying multi-step pipelines without requiring Streams + Tasks for every step.

3.1 Key Characteristics

3.1.1 Declarative Definition

You define:

  • A SQL query

  • A target freshness (e.g., data must be within 15 minutes of upstream sources)

Snowflake takes care of:

  • Change tracking

  • Dependency resolution

  • Scheduling

  • Incremental updates
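
In SQL, the freshness target is expressed with the TARGET_LAG parameter; a minimal sketch, with warehouse and column names assumed:

CREATE DYNAMIC TABLE sales_clean
  TARGET_LAG = '15 minutes'
  WAREHOUSE = transform_wh
AS
  SELECT sale_id, region, amount
  FROM raw_sales
  WHERE amount IS NOT NULL;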

3.1.2 How They Differ from Streams + Tasks
  • Dynamic Tables: Snowflake manages your pipeline automatically

  • Streams + Tasks: You manually orchestrate schedules, offsets, dependencies

Dynamic Tables simplify pipelines like:

Raw → Clean → Modeled → Aggregated

without writing many Task and Stream definitions.

3.2 Target Freshness

Each Dynamic Table has a target freshness, telling Snowflake:

“How up-to-date must the data be?”

Examples:

  • 1 minute

  • 5 minutes

  • 15 minutes

  • 1 hour

Snowflake then computes how often to update the table based on upstream changes.

3.3 When to Use Dynamic Tables vs Streams + Tasks vs Materialized Views

3.3.1 Dynamic Tables

Best for:

  • Multi-step transformation pipelines

  • Data models that must stay fresh

  • Simplifying DAG management

3.3.2 Streams + Tasks

Best for:

  • Complex incremental logic

  • Custom scheduling

  • Workloads with branching, conditional logic

3.3.3 Materialized Views

Best for:

  • Simple aggregates

  • Expensive queries that must be cached

  • Dashboards with frequent access

The exam will likely ask you to identify which mechanism is appropriate in different scenarios.

4. Snowpark & Procedural Logic

SQL is often enough for transformations, but sometimes you need code, procedural algorithms, or machine learning.
Snowflake offers several programmable components.

4.1 Snowpark

Snowpark provides APIs in Python, Java, and Scala to define data transformations programmatically.

Key Benefits
  • Code executes inside Snowflake, close to the data

  • No need to move data into external compute environments

  • Great for:

    • Data engineering

    • Feature engineering

    • ML pipelines

    • Complex transformations

Snowpark supports:

  • DataFrame operations

  • UDFs written in Python

  • ML training (Snowpark ML)

  • User-defined table functions
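
As a hedged sketch, the DDL below registers a Python stored procedure whose handler uses the Snowpark DataFrame API; table names reuse the earlier CTAS example:

CREATE OR REPLACE PROCEDURE aggregate_sales()
  RETURNS STRING
  LANGUAGE PYTHON
  RUNTIME_VERSION = '3.10'
  PACKAGES = ('snowflake-snowpark-python')
  HANDLER = 'run'
AS
$$
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, sum as sum_

def run(session: Session) -> str:
    # Build the transformation lazily, then materialize it in-database
    (session.table("raw_sales")
        .group_by("region")
        .agg(sum_(col("amount")).alias("total_sales"))
        .write.save_as_table("sales_agg", mode="overwrite"))
    return "sales_agg refreshed"
$$;

Invoke it with CALL aggregate_sales(); the data never leaves Snowflake.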

4.2 Stored Procedures

Stored procedures enable imperative logic, such as:

  • Loops

  • Conditionals

  • Branching workflows

  • Complex orchestration logic

Written in:

  • SQL (Snowflake Scripting)

  • JavaScript

  • Python

  • Java and Scala (via Snowpark)

Useful when SQL alone is insufficient.
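
A minimal Snowflake Scripting sketch with a loop; the table names (daily_sales_agg, raw_sales) and columns are assumed:

CREATE OR REPLACE PROCEDURE backfill_daily(start_day DATE, end_day DATE)
  RETURNS STRING
  LANGUAGE SQL
AS
$$
DECLARE
  d DATE;
  n INTEGER DEFAULT 0;
BEGIN
  d := start_day;
  WHILE (d <= end_day) DO
    -- One aggregate slice per day; :d binds the scripting variable
    INSERT INTO daily_sales_agg
    SELECT :d, region, SUM(amount)
    FROM raw_sales
    WHERE sale_date = :d
    GROUP BY region;
    d := DATEADD('day', 1, d);
    n := n + 1;
  END WHILE;
  RETURN n || ' day(s) backfilled';
END;
$$;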

4.3 UDFs / UDTFs

User-defined functions extend Snowflake’s compute logic:

4.3.1 Scalar UDFs

Operate on one row → one output.

Used for:

  • Custom formatting

  • Mathematical calculations

  • Business logic

4.3.2 Table UDFs (UDTFs)

Return a set of rows, useful for:

  • Exploding arrays

  • Generating series

  • Complex parsing logic

4.3.3 Languages

UDFs/UDTFs can be written in:

  • SQL

  • JavaScript

  • Java

  • Python (via Snowpark)
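
Hedged sketches of a scalar SQL UDF and a SQL UDTF; all object names are illustrative:

-- Scalar UDF: one row in, one value out
CREATE OR REPLACE FUNCTION net_amount(amount NUMBER(12,2), tax_rate FLOAT)
  RETURNS NUMBER(12,2)
AS
$$
  amount * (1 - tax_rate)
$$;

-- UDTF: one row in, many rows out (explodes a comma-separated list)
CREATE OR REPLACE FUNCTION explode_tags(tags VARCHAR)
  RETURNS TABLE (tag VARCHAR)
AS
$$
  SELECT TRIM(value) AS tag
  FROM TABLE(SPLIT_TO_TABLE(tags, ','))
$$;

-- UDTFs are lateral-joined against each input row
SELECT o.order_id, t.tag
FROM orders AS o, TABLE(explode_tags(o.tags)) AS t;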

Summary of Concepts to Know for the Exam

  • SQL is the foundation of transformations.

  • Streams capture row-level changes; Tasks run scheduled logic.

  • Dynamic Tables automate pipelines with declarative freshness.

  • Materialized views precompute expensive queries.

  • Snowpark supports code-based transformations and ML.

  • UDFs and stored procedures extend transformation capabilities beyond SQL.

Data Transformations (Additional Content)

1. Stream Limitations and Behavior

Streams provide Change Data Capture (CDC) tracking in Snowflake, but their behavior includes important constraints that commonly appear in exam questions.

1.1 Retention and Offset Expiration

A stream remains usable only while Snowflake retains the change data of its source table; Snowflake can extend the table’s retention for this purpose up to MAX_DATA_EXTENSION_TIME_IN_DAYS (default 14 days).
If a stream is not consumed within this window, it becomes stale and its offset cannot be recovered.

1.2 Source Object Support

Stream support depends on the source object:

  • Standard streams require Snowflake-managed tables, because change tracking relies on micro-partition versions.

  • External tables support only insert-only streams.

  • Views can serve as stream sources only when change tracking is enabled on all underlying tables.

1.3 Stream Types

  • Standard Streams: Track INSERT, UPDATE, and DELETE operations.

  • Append-Only Streams: Track INSERT operations only.
    Useful for log-style ingestion where rows are appended but never updated (see the sketch below).

  • Insert-Only Streams: The equivalent of append-only tracking for external tables.
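
A sketch of the append-only variant, reusing the raw_sales table from earlier:

-- Tracks only INSERTs; UPDATE and DELETE activity is ignored
CREATE STREAM raw_sales_append_stream
  ON TABLE raw_sales
  APPEND_ONLY = TRUE;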

1.4 Stream Consumption

A stream’s offset advances when the stream is consumed in a committed DML transaction.
This means:

  • A plain SELECT against a stream does not consume its change records

  • Once a DML statement (for example, INSERT ... SELECT or MERGE) reads the stream and the transaction commits, those changes are consumed and are not returned again

This behavior requires pipelines to be idempotent.

1.5 No Exactly-Once Semantics

Change delivery is exactly-once only when the consuming transaction commits cleanly; task retries or partial failures can cause the same changes to be processed again.
Using MERGE or deduplication logic ensures data correctness across retries.

1.6 Streams Store Metadata, Not Data

Streams store metadata about changes, not copies of changed rows.
If the underlying change data ages out (because retention expires), the stream becomes stale and unusable.

2. Task Limitations and Scheduling Rules

Tasks automate SQL execution in Snowflake but operate under several constraints.

2.1 Execution Requirements

Tasks must run using either:

  • A user-managed (customer) warehouse, or

  • Serverless task execution (if configured)

Serverless tasks consume Snowflake-managed compute credits.

2.2 Task DAG Behavior

Tasks can form Directed Acyclic Graphs (DAGs).
If a parent task is:

  • Suspended

  • Disabled

  • Erroring persistently

then downstream tasks also pause automatically.

2.3 Scheduling Rules

  • Cron syntax is supported

  • Minimum scheduling intervals apply

  • Slight delays may occur due to system resource scheduling

  • Tasks may retry automatically depending on configuration and task type

Warehouse-backed tasks and serverless tasks follow different retry semantics.

2.4 Execution Capabilities

Tasks can run:

  • SQL statements

  • MERGE operations

  • Stored procedures

They cannot run external OS commands or arbitrary system code.

3. Dynamic Table Constraints and Behavior

Dynamic Tables provide declarative pipelines with automatic refresh rules but include important limitations.

3.1 Compute Model

Dynamic Table refreshes run on the virtual warehouse specified in the table definition (the WAREHOUSE parameter).
Costs are driven by:

  • The frequency of refreshes

  • The complexity of upstream changes

  • The defined freshness requirement

3.2 Dynamic Tables Are Read-Only

You cannot perform:

  • INSERT

  • UPDATE

  • DELETE

  • MERGE

against a Dynamic Table.
They are fully managed by Snowflake.

3.3 Unsupported as Stream Sources

Dynamic Tables cannot be used as upstream sources for streams.
This preserves clarity of dependency chains.

3.4 Deterministic Requirement

Non-deterministic functions limit refresh behavior, including:

  • RANDOM

  • CURRENT_TIMESTAMP

  • Other functions that vary per execution

Queries that depend on them generally cannot be maintained incrementally and fall back to a full refresh (or are rejected for some constructs).
Determinism ensures reproducible incremental updates.

3.5 Freshness and Cost

A tighter freshness target (for example, 1 minute) causes more frequent refresh cycles, increasing cost.
A looser target reduces compute consumption.
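
Tuning is a one-line change; a sketch using the Dynamic Table from section 3:

-- Relax the lag target to reduce refresh frequency and cost
ALTER DYNAMIC TABLE sales_clean SET TARGET_LAG = '1 hour';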

3.6 Automatic Dependency Resolution

Dynamic Tables analyze dependency graphs automatically and determine:

  • Compute order

  • Required refresh level

  • Efficient incremental updates

This reduces the need for manual orchestration.

4. Materialized View Restrictions

Materialized Views (MVs) accelerate repeated queries but must follow strict rules.

4.1 Unsupported References

Materialized Views cannot reference:

  • Another Materialized View

  • Non-deterministic functions

  • Certain complex SQL constructs, including:

    • Window functions

    • Joins (an MV can query only a single base table)

    • Some nested subqueries

This ensures the engine can maintain incremental refresh efficiently.

4.2 Single Base Table Requirement

An MV must reference only one base table.
Joins to other tables are not allowed.

4.3 Maintenance Cost Behavior

Refresh cost depends on:

  • Number of micro-partitions changed

  • Frequency of updates

  • Table size

Large or frequently updated tables increase MV maintenance cost.

4.4 Refresh Skipping

Snowflake may skip MV refresh when:

  • Changes are insignificant

  • Cost is too high relative to impact

  • The MV would not materially change

This behavior optimizes platform efficiency.

5. Search Optimization and Transformation Performance

Search Optimization improves performance for highly selective searches.

5.1 Appropriate Use Cases

Search Optimization benefits:

  • LIKE '%text%' pattern searches

  • IN list queries

  • Searches on semi-structured data (for example, JSON fields)

These cases typically cannot benefit from clustering.
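
Search optimization is enabled per table and can be scoped to columns and search methods; a hedged sketch with assumed columns on raw_sales:

-- Point lookups on customer_id; substring searches on notes
ALTER TABLE raw_sales ADD SEARCH OPTIMIZATION
  ON EQUALITY(customer_id), SUBSTRING(notes);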

5.2 Not a Replacement for Clustering

Search Optimization does not help range-based pruning, such as:

WHERE date BETWEEN '2023-01-01' AND '2023-01-31'

Clustering keys are more effective for these queries.

5.3 Storage and Compute Overhead

Search Optimization introduces:

  • Additional persistent index-like structures

  • Additional maintenance compute costs

  • More overhead when tables are frequently updated

This affects transformation performance, especially during large MERGE operations.

6. Stored Procedure Execution Behavior and Limits

Stored Procedures (SPs) allow procedural logic but operate within Snowflake’s controlled execution environment.

6.1 Supported Languages

Stored Procedures can be written in:

  • SQL (Snowflake Scripting)

  • JavaScript

  • Python

  • Java and Scala (via Snowpark)

6.2 Returned Results

JavaScript stored procedures return a single scalar or VARIANT value, so tabular output must be encoded manually (for example, as an array of objects).
Snowflake Scripting and Snowpark procedures can instead declare RETURNS TABLE(...) to return rows directly.

6.3 Execution Privileges

Two execution modes impact security:

  • EXECUTE AS CALLER (uses caller’s privileges)

  • EXECUTE AS OWNER (uses owner’s privileges)

This distinction is important for governance.
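
The mode is declared at creation time; a minimal sketch (staging_events is a hypothetical table):

CREATE OR REPLACE PROCEDURE cleanup_staging()
  RETURNS STRING
  LANGUAGE SQL
  EXECUTE AS CALLER   -- or EXECUTE AS OWNER (the default)
AS
$$
BEGIN
  DELETE FROM staging_events WHERE loaded = TRUE;
  RETURN 'staging cleaned';
END;
$$;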

6.4 Compute Consumption

SPs use:

  • Warehouse compute, or

  • Serverless compute (depending on configuration)

SP cost is tied to underlying queries or operations.

6.5 Network Restrictions

Stored Procedures cannot make arbitrary external network calls unless routed through:

  • External functions

  • External access integrations

This prevents unauthorized data exfiltration.

7. UDF and UDTF Execution Rules

User-defined Functions (UDFs) and User-defined Table Functions (UDTFs) extend SQL but with controlled behavior.

7.1 Stateless Nature

UDFs cannot maintain state across calls.
Each call is isolated, and results should depend only on the input arguments (functions can be declared VOLATILE or IMMUTABLE to describe this).

7.2 Python UDF Behavior

Python UDFs:

  • Run in isolated containerized environments

  • Have restricted library availability

  • Cannot access external network resources by default (external access integrations can open specific endpoints)

This strengthens security and isolation.

7.3 UDTF Row Emission

UDTFs can output multiple rows per input row, but:

  • Output ordering is not guaranteed unless explicitly ordered downstream

7.4 NULL-Handling Modes

Two modes define how UDFs treat NULL input:

  • RETURNS NULL ON NULL INPUT

  • CALLED ON NULL INPUT

This is frequently tested on SnowPro exams.
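
A hedged sketch of both modes (function names are illustrative):

-- Snowflake returns NULL without running the body on NULL input
CREATE OR REPLACE FUNCTION initials_strict(name VARCHAR)
  RETURNS VARCHAR
  RETURNS NULL ON NULL INPUT
AS
$$
  UPPER(LEFT(name, 1))
$$;

-- The body runs even for NULL input, so it can substitute a default
CREATE OR REPLACE FUNCTION initials_lenient(name VARCHAR)
  RETURNS VARCHAR
  CALLED ON NULL INPUT
AS
$$
  COALESCE(UPPER(LEFT(name, 1)), '?')
$$;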

7.5 External Call Restrictions

UDFs cannot make external API calls or access remote systems directly; outbound access must be mediated by external functions or external access integrations.

8. Governance and Policy Interaction with Transformations

Snowflake enforces governance policies automatically across transformation pipelines.

8.1 Automatic Policy Application

Masking Policies and Row Access Policies automatically apply to all downstream objects, including:

  • Views

  • Materialized Views

  • Dynamic Tables (depending on configuration)

This preserves governance guarantees.

8.2 Tag Propagation

Tags applied to:

  • Databases

  • Schemas

  • Tables

  • Columns

may propagate through transformations such as:

  • Views

  • Materialized Views

  • Dynamic Tables

Propagation behavior can be configured.

8.3 Enforcement

If transformations violate governance rules, Snowflake may:

  • Reject the operation

  • Require explicit policy override

  • Enforce masking or filtering during execution

Understanding policy precedence and propagation is critical for real-world compliance scenarios.

Frequently Asked Questions

What is a stream in Snowflake?

Answer:

A stream is an object that tracks changes (inserts, updates, deletes) to a table using change data capture (CDC).

Explanation:

Streams record row-level changes since the last consumption. They do not store data themselves but reference table metadata. A common mistake is assuming streams persist historical data indefinitely—they only track changes until consumed.


What is a task in Snowflake?

Answer:

A task is a scheduled or triggered job used to automate SQL execution, often for data transformations.

Explanation:

Tasks can run on a schedule or after another task completes. They are commonly used with streams for incremental processing. A common mistake is not assigning a warehouse, causing execution failure.


How do streams and tasks work together?

Answer:

Streams capture data changes, and tasks process those changes automatically by running transformation queries.

Explanation:

This enables incremental ETL pipelines. Tasks query streams to process only new data. A common mistake is running tasks without consuming the stream, leading to repeated processing.


What is the difference between a task and a stored procedure?

Answer:

A task schedules execution, while a stored procedure contains procedural logic.

Explanation:

Tasks trigger execution, whereas procedures define reusable logic. They are often combined. A common mistake is expecting tasks to contain complex logic without procedures.


What is incremental data processing in Snowflake?

Answer:

It is the process of transforming only new or changed data instead of reprocessing entire datasets.

Explanation:

Streams enable tracking changes, while tasks automate processing. This improves efficiency and reduces compute cost. A common mistake is performing full refresh transformations unnecessarily.

