
COF-C02 Data Transformations

Data Transformations Detailed Explanation

Data transformation refers to modifying, enriching, combining, or restructuring data inside Snowflake.
Snowflake offers multiple transformation mechanisms, ranging from classic SQL to advanced declarative pipelines like Dynamic Tables.

The exam expects you to understand what each mechanism does, how they differ, and when each is appropriate.

1. Classic ELT with SQL

Classic ELT (Extract → Load → Transform) is Snowflake’s recommended approach: data is loaded first and transformed in-database, because Snowflake’s compute layer is built for large-scale SQL transformations rather than relying on external ETL engines.

1.1 SQL for Transformations

1.1.1 DDL (Data Definition Language)

Snowflake uses standard SQL for creating and managing objects:

  • CREATE SCHEMA

  • CREATE TABLE

  • CREATE VIEW

  • ALTER TABLE

  • DROP TABLE

DDL defines the structure of your transformed data.

1.1.2 DML (Data Manipulation Language)

Transformations frequently use DML:

  • INSERT

  • UPDATE

  • DELETE

  • MERGE (UPSERT operations)

  • COPY INTO for loading

Snowflake’s compute layer is optimized for these operations at scale.
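
As an illustration, MERGE combines UPDATE and INSERT in one statement; the table and column names below (customers, customers_stage) are hypothetical:

-- Upsert: update matching rows, insert new ones
MERGE INTO customers AS t
USING customers_stage AS s
  ON t.customer_id = s.customer_id
WHEN MATCHED THEN
  UPDATE SET t.email = s.email, t.updated_at = s.updated_at
WHEN NOT MATCHED THEN
  INSERT (customer_id, email, updated_at)
  VALUES (s.customer_id, s.email, s.updated_at);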

1.2 CTAS – CREATE TABLE AS SELECT

CTAS is useful for:

  • Creating transformed tables in a single step

  • Materializing intermediate results

  • Fast prototyping of new models

Example:

CREATE TABLE sales_agg AS
SELECT region, SUM(amount) AS total_sales
FROM raw_sales
GROUP BY region;

1.3 Materialized Views

Materialized views:

  • Store precomputed results based on a SQL query

  • Update incrementally as source data changes

  • Improve performance on expensive repetitive queries

  • Cost more (compute + storage)

Use cases:

  • Dashboards

  • Frequently accessed aggregates

  • Expensive joins/computations

Know the difference between:

  • Standard views (not stored; recomputed on every query)

  • Materialized views (stored; incrementally refreshed)
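
For instance, a minimal materialized view over the raw_sales table from the CTAS example above (column names assumed):

CREATE MATERIALIZED VIEW sales_by_region_mv AS
SELECT region, SUM(amount) AS total_sales
FROM raw_sales
GROUP BY region;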

2. Streams & Tasks (CDC and Scheduling)

Streams and Tasks allow incremental, automated, scheduled, or event-driven transformations.

2.1 Streams – Change Data Capture (CDC)

A Stream tracks row-level changes (INSERT, UPDATE, DELETE) on a table.

2.1.1 How Streams Work

A stream acts like a delta view:

  • It captures changes made to a table since the last time the stream was consumed.

  • Once read, the stream’s offset moves forward.

  • Streams expose metadata columns (see the sketch below):

    • METADATA$ACTION: the operation type (INSERT or DELETE)

    • METADATA$ISUPDATE: TRUE when the change is part of an UPDATE (recorded as a DELETE/INSERT pair)

    • METADATA$ROW_ID: a unique, immutable row identifier
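
A sketch of creating and inspecting a stream on the raw_sales table from section 1:

CREATE STREAM raw_sales_stream ON TABLE raw_sales;

-- A plain SELECT shows pending changes (including the METADATA$
-- columns) without advancing the stream's offset
SELECT * FROM raw_sales_stream;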

2.1.2 Use Cases
  • Incremental ETL/ELT

  • Slowly Changing Dimensions

  • Partitioned ingestion

  • Audit tracking

2.1.3 Important Rules
  • A stream’s offset advances only when the stream is consumed in a DML statement (for example, INSERT ... SELECT or MERGE) that commits; a plain SELECT does not advance it.

  • Streams do not store raw data copies; they store change metadata.

2.2 Tasks – Scheduling & DAG Execution

Tasks allow Snowflake to run SQL on a schedule or as part of a dependency graph (DAG).

2.2.1 How Tasks Work

A Task executes:

  • A SQL statement

  • A stored procedure

  • A MERGE operation

  • An INSERT/UPDATE step

Tasks can be:

  • Time-based (cron-like schedules)

  • Dependency-based (triggered after another task finishes)
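
A hedged sketch of both styles; the warehouse and task names (transform_wh, hourly_refresh, downstream_report) are illustrative:

-- Time-based: rebuild an aggregate every hour (cron is in UTC)
CREATE TASK hourly_refresh
  WAREHOUSE = transform_wh
  SCHEDULE = 'USING CRON 0 * * * * UTC'
AS
  CREATE OR REPLACE TABLE sales_agg AS
  SELECT region, SUM(amount) AS total_sales
  FROM raw_sales
  GROUP BY region;

-- Dependency-based: runs only after hourly_refresh succeeds
CREATE TASK downstream_report
  WAREHOUSE = transform_wh
  AFTER hourly_refresh
AS
  CREATE OR REPLACE TABLE sales_report AS
  SELECT * FROM sales_agg;

-- Tasks are created suspended; resume children before the root
ALTER TASK downstream_report RESUME;
ALTER TASK hourly_refresh RESUME;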

2.2.2 Combining Streams + Tasks

This is one of Snowflake’s most important transformation patterns:

  • Stream captures new or changed rows

  • Task picks up the changes and applies transformations

Example pipeline:

  1. Stream detects changes in raw table

  2. Task MERGEs changes into a cleaned table

  3. Another Task aggregates the cleaned data
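
A hedged sketch of step 2: the task wakes on a schedule but runs only when the stream has data (DELETE handling omitted for brevity; clean_sales and transform_wh are assumed names):

CREATE TASK apply_sales_changes
  WAREHOUSE = transform_wh
  SCHEDULE = '5 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('RAW_SALES_STREAM')
AS
  MERGE INTO clean_sales AS t
  USING raw_sales_stream AS s
    ON t.sale_id = s.sale_id
  WHEN MATCHED AND s.METADATA$ACTION = 'INSERT' THEN
    UPDATE SET t.region = s.region, t.amount = s.amount
  WHEN NOT MATCHED AND s.METADATA$ACTION = 'INSERT' THEN
    INSERT (sale_id, region, amount)
    VALUES (s.sale_id, s.region, s.amount);

Because the MERGE consumes the stream, the stream’s offset advances when the task’s transaction commits.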

2.2.3 Typical Use Cases
  • Incremental pipelines

  • Daily/hourly/near-real-time updates

  • Multi-step DAG-based transformations

Streams + Tasks are powerful but must be maintained, especially in complex dependency chains.

3. Dynamic Tables

Dynamic Tables are Snowflake’s newer, declarative transformation feature, simplifying multi-step pipelines without requiring Streams + Tasks for every step.

3.1 Key Characteristics

3.1.1 Declarative Definition

You define:

  • A SQL query

  • A target freshness (e.g., data must be within 15 minutes of upstream sources)

Snowflake takes care of:

  • Change tracking

  • Dependency resolution

  • Scheduling

  • Incremental updates
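
In SQL, the freshness target is expressed with the TARGET_LAG parameter; a minimal sketch, with warehouse and column names assumed:

CREATE DYNAMIC TABLE sales_clean
  TARGET_LAG = '15 minutes'
  WAREHOUSE = transform_wh
AS
  SELECT sale_id, region, amount
  FROM raw_sales
  WHERE amount IS NOT NULL;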

3.1.2 How They Differ from Streams + Tasks
  • Dynamic Tables: Snowflake manages your pipeline automatically

  • Streams + Tasks: You manually orchestrate schedules, offsets, dependencies

Dynamic Tables simplify pipelines like:

Raw → Clean → Modeled → Aggregated

without writing many Task and Stream definitions.

3.2 Target Freshness

Each Dynamic Table has a target freshness, telling Snowflake:

“How up-to-date must the data be?”

Examples:

  • 1 minute

  • 5 minutes

  • 15 minutes

  • 1 hour

Snowflake then computes how often to update the table based on upstream changes.

3.3 When to Use Dynamic Tables vs Streams + Tasks vs Materialized Views

3.3.1 Dynamic Tables

Best for:

  • Multi-step transformation pipelines

  • Data models that must stay fresh

  • Simplifying DAG management

3.3.2 Streams + Tasks

Best for:

  • Complex incremental logic

  • Custom scheduling

  • Workloads with branching, conditional logic

3.3.3 Materialized Views

Best for:

  • Simple aggregates

  • Expensive queries that must be cached

  • Dashboards with frequent access

The exam will likely ask you to identify which mechanism is appropriate in different scenarios.

4. Snowpark & Procedural Logic

SQL is often enough for transformations, but sometimes you need code, procedural algorithms, or machine learning.
Snowflake offers several programmable components.

4.1 Snowpark

Snowpark provides APIs in Python, Java, and Scala to define data transformations programmatically.

Key Benefits
  • Code executes inside Snowflake, close to the data

  • No need to move data into external compute environments

  • Great for:

    • Data engineering

    • Feature engineering

    • ML pipelines

    • Complex transformations

Snowpark supports:

  • DataFrame operations

  • UDFs written in Python

  • ML training (Snowpark ML)

  • User-defined table functions
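
As a hedged sketch, the DDL below registers a Python stored procedure whose handler uses the Snowpark DataFrame API; table names reuse the earlier CTAS example:

CREATE OR REPLACE PROCEDURE aggregate_sales()
  RETURNS STRING
  LANGUAGE PYTHON
  RUNTIME_VERSION = '3.10'
  PACKAGES = ('snowflake-snowpark-python')
  HANDLER = 'run'
AS
$$
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, sum as sum_

def run(session: Session) -> str:
    # Build the transformation lazily, then materialize it in-database
    (session.table("raw_sales")
        .group_by("region")
        .agg(sum_(col("amount")).alias("total_sales"))
        .write.save_as_table("sales_agg", mode="overwrite"))
    return "sales_agg refreshed"
$$;

Invoke it with CALL aggregate_sales(); the data never leaves Snowflake.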

4.2 Stored Procedures

Stored procedures enable imperative logic, such as:

  • Loops

  • Conditionals

  • Branching workflows

  • Complex orchestration logic

Written in:

  • SQL (Snowflake Scripting)

  • JavaScript

  • Python

  • Java and Scala (via Snowpark)

Useful when SQL alone is insufficient.
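
A minimal Snowflake Scripting sketch with a loop; the table names (daily_sales_agg, raw_sales) and columns are assumed:

CREATE OR REPLACE PROCEDURE backfill_daily(start_day DATE, end_day DATE)
  RETURNS STRING
  LANGUAGE SQL
AS
$$
DECLARE
  d DATE;
  n INTEGER DEFAULT 0;
BEGIN
  d := start_day;
  WHILE (d <= end_day) DO
    -- One aggregate slice per day; :d binds the scripting variable
    INSERT INTO daily_sales_agg
    SELECT :d, region, SUM(amount)
    FROM raw_sales
    WHERE sale_date = :d
    GROUP BY region;
    d := DATEADD('day', 1, d);
    n := n + 1;
  END WHILE;
  RETURN n || ' day(s) backfilled';
END;
$$;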

4.3 UDFs / UDTFs

User-defined functions extend Snowflake’s compute logic:

4.3.1 Scalar UDFs

Operate on one row → one output.

Used for:

  • Custom formatting

  • Mathematical calculations

  • Business logic

4.3.2 Table UDFs (UDTFs)

Return a set of rows, useful for:

  • Exploding arrays

  • Generating series

  • Complex parsing logic

4.3.3 Languages

UDFs/UDTFs can be written in:

  • SQL

  • JavaScript

  • Java

  • Python (via Snowpark)
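
Hedged sketches of a scalar SQL UDF and a SQL UDTF; all object names are illustrative:

-- Scalar UDF: one row in, one value out
CREATE OR REPLACE FUNCTION net_amount(amount NUMBER(12,2), tax_rate FLOAT)
  RETURNS NUMBER(12,2)
AS
$$
  amount * (1 - tax_rate)
$$;

-- UDTF: one row in, many rows out (explodes a comma-separated list)
CREATE OR REPLACE FUNCTION explode_tags(tags VARCHAR)
  RETURNS TABLE (tag VARCHAR)
AS
$$
  SELECT TRIM(value) AS tag
  FROM TABLE(SPLIT_TO_TABLE(tags, ','))
$$;

-- UDTFs are lateral-joined against each input row
SELECT o.order_id, t.tag
FROM orders AS o, TABLE(explode_tags(o.tags)) AS t;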

Summary of Concepts to Know for the Exam

  • SQL is the foundation of transformations.

  • Streams capture row-level changes; Tasks run scheduled logic.

  • Dynamic Tables automate pipelines with declarative freshness.

  • Materialized views precompute expensive queries.

  • Snowpark supports code-based transformations and ML.

  • UDFs and stored procedures extend transformation capabilities beyond SQL.

Data Transformations (Additional Content)

1. Stream Limitations and Behavior

Streams provide Change Data Capture (CDC) tracking in Snowflake, but their behavior includes important constraints that commonly appear in exam questions.

1.1 Retention and Offset Expiration

A stream remains usable only while Snowflake retains the change data of its source table; Snowflake can extend the table’s retention for this purpose up to MAX_DATA_EXTENSION_TIME_IN_DAYS (default 14 days).
If a stream is not consumed within this window, it becomes stale and its offset cannot be recovered.

1.2 Source Object Support

Stream support depends on the source object:

  • Standard streams require Snowflake-managed tables, because change tracking relies on micro-partition versions.

  • External tables support only insert-only streams.

  • Views can serve as stream sources only when change tracking is enabled on all underlying tables.

1.3 Stream Types

  • Standard Streams: Track INSERT, UPDATE, and DELETE operations.

  • Append-Only Streams: Track INSERT operations only.
    Useful for log-style ingestion where rows are appended but never updated (see the sketch below).

  • Insert-Only Streams: The equivalent of append-only tracking for external tables.
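
A sketch of the append-only variant, reusing the raw_sales table from earlier:

-- Tracks only INSERTs; UPDATE and DELETE activity is ignored
CREATE STREAM raw_sales_append_stream
  ON TABLE raw_sales
  APPEND_ONLY = TRUE;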

1.4 Stream Consumption

A stream’s offset advances when the stream is consumed in a committed DML transaction.
This means:

  • A plain SELECT against a stream does not consume its change records

  • Once a DML statement (for example, INSERT ... SELECT or MERGE) reads the stream and the transaction commits, those changes are consumed and are not returned again

This behavior requires pipelines to be idempotent.

1.5 No Exactly-Once Semantics

Change delivery is exactly-once only when the consuming transaction commits cleanly; task retries or partial failures can cause the same changes to be processed again.
Using MERGE or deduplication logic ensures data correctness across retries.

1.6 Streams Store Metadata, Not Data

Streams store metadata about changes, not copies of changed rows.
If the underlying change data ages out (because retention expires), the stream becomes stale and unusable.

2. Task Limitations and Scheduling Rules

Tasks automate SQL execution in Snowflake but operate under several constraints.

2.1 Execution Requirements

Tasks must run using either:

  • A user-managed (customer) warehouse, or

  • Serverless task execution (if configured)

Serverless tasks consume Snowflake-managed compute credits.

2.2 Task DAG Behavior

Tasks can form Directed Acyclic Graphs (DAGs).
If a parent task is:

  • Suspended

  • Disabled

  • Erroring persistently

then downstream tasks also pause automatically.

2.3 Scheduling Rules

  • Cron syntax is supported

  • Minimum scheduling intervals apply

  • Slight delays may occur due to system resource scheduling

  • Tasks may retry automatically depending on configuration and task type

Warehouse-backed tasks and serverless tasks follow different retry semantics.

2.4 Execution Capabilities

Tasks can run:

  • SQL statements

  • MERGE operations

  • Stored procedures

They cannot run external OS commands or arbitrary system code.

3. Dynamic Table Constraints and Behavior

Dynamic Tables provide declarative pipelines with automatic refresh rules but include important limitations.

3.1 Compute Model

Dynamic Table refreshes run on the virtual warehouse specified in the table definition (the WAREHOUSE parameter).
Costs are driven by:

  • The frequency of refreshes

  • The complexity of upstream changes

  • The defined freshness requirement

3.2 Dynamic Tables Are Read-Only

You cannot perform:

  • INSERT

  • UPDATE

  • DELETE

  • MERGE

against a Dynamic Table.
They are fully managed by Snowflake.

3.3 Unsupported as Stream Sources

Dynamic Tables cannot be used as upstream sources for streams.
This preserves clarity of dependency chains.

3.4 Deterministic Requirement

Non-deterministic functions limit refresh behavior, including:

  • RANDOM

  • CURRENT_TIMESTAMP

  • Other functions that vary per execution

Queries that depend on them generally cannot be maintained incrementally and fall back to a full refresh (or are rejected for some constructs).
Determinism ensures reproducible incremental updates.

3.5 Freshness and Cost

A tighter freshness target (for example, 1 minute) causes more frequent refresh cycles, increasing cost.
A looser target reduces compute consumption.
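
Tuning is a one-line change; a sketch using the Dynamic Table from section 3:

-- Relax the lag target to reduce refresh frequency and cost
ALTER DYNAMIC TABLE sales_clean SET TARGET_LAG = '1 hour';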

3.6 Automatic Dependency Resolution

Dynamic Tables analyze dependency graphs automatically and determine:

  • Compute order

  • Required refresh level

  • Efficient incremental updates

This reduces the need for manual orchestration.

4. Materialized View Restrictions

Materialized Views (MVs) accelerate repeated queries but must follow strict rules.

4.1 Unsupported References

Materialized Views cannot reference:

  • Another Materialized View

  • Non-deterministic functions

  • Certain complex SQL constructs, including:

    • Window functions

    • Joins (an MV can query only a single base table)

    • Some nested subqueries

This ensures the engine can maintain incremental refresh efficiently.

4.2 Single Base Table Requirement

An MV must reference only one base table.
Joins to other tables are not allowed.

4.3 Maintenance Cost Behavior

Refresh cost depends on:

  • Number of micro-partitions changed

  • Frequency of updates

  • Table size

Large or frequently updated tables increase MV maintenance cost.

4.4 Refresh Skipping

Snowflake may skip MV refresh when:

  • Changes are insignificant

  • Cost is too high relative to impact

  • The MV would not materially change

This behavior optimizes platform efficiency.

5. Search Optimization and Transformation Performance

Search Optimization improves performance for highly selective searches.

5.1 Appropriate Use Cases

Search Optimization benefits:

  • LIKE '%text%' pattern searches

  • IN list queries

  • Searches on semi-structured data (for example, JSON fields)

These cases typically cannot benefit from clustering.
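
Search optimization is enabled per table and can be scoped to columns and search methods; a hedged sketch with assumed columns on raw_sales:

-- Point lookups on customer_id; substring searches on notes
ALTER TABLE raw_sales ADD SEARCH OPTIMIZATION
  ON EQUALITY(customer_id), SUBSTRING(notes);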

5.2 Not a Replacement for Clustering

Search Optimization does not help range-based pruning, such as:

WHERE date BETWEEN '2023-01-01' AND '2023-01-31'

Clustering keys are more effective for these queries.

5.3 Storage and Compute Overhead

Search Optimization introduces:

  • Additional persistent index-like structures

  • Additional maintenance compute costs

  • More overhead when tables are frequently updated

This affects transformation performance, especially during large MERGE operations.

6. Stored Procedure Execution Behavior and Limits

Stored Procedures (SPs) allow procedural logic but operate within Snowflake’s controlled execution environment.

6.1 Supported Languages

Stored Procedures can be written in:

  • SQL (Snowflake Scripting)

  • JavaScript

  • Python

  • Java and Scala (via Snowpark)

6.2 Returned Results

JavaScript stored procedures return a single scalar or VARIANT value, so tabular output must be encoded manually (for example, as an array of objects).
Snowflake Scripting and Snowpark procedures can instead declare RETURNS TABLE(...) to return rows directly.

6.3 Execution Privileges

Two execution modes impact security:

  • EXECUTE AS CALLER (uses caller’s privileges)

  • EXECUTE AS OWNER (uses owner’s privileges)

This distinction is important for governance.
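
The mode is declared at creation time; a minimal sketch (staging_events is a hypothetical table):

CREATE OR REPLACE PROCEDURE cleanup_staging()
  RETURNS STRING
  LANGUAGE SQL
  EXECUTE AS CALLER   -- or EXECUTE AS OWNER (the default)
AS
$$
BEGIN
  DELETE FROM staging_events WHERE loaded = TRUE;
  RETURN 'staging cleaned';
END;
$$;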

6.4 Compute Consumption

SPs use:

  • Warehouse compute, or

  • Serverless compute (depending on configuration)

SP cost is tied to underlying queries or operations.

6.5 Network Restrictions

Stored Procedures cannot make arbitrary external network calls unless routed through:

  • External functions

  • External access integrations

This prevents unauthorized data exfiltration.

7. UDF and UDTF Execution Rules

User-defined Functions (UDFs) and User-defined Table Functions (UDTFs) extend SQL but with controlled behavior.

7.1 Stateless Nature

UDFs cannot maintain state across calls.
Each call is isolated, and results should depend only on the input arguments (functions can be declared VOLATILE or IMMUTABLE to describe this).

7.2 Python UDF Behavior

Python UDFs:

  • Run in isolated containerized environments

  • Have restricted library availability

  • Cannot access external network resources by default (external access integrations can open specific endpoints)

This strengthens security and isolation.

7.3 UDTF Row Emission

UDTFs can output multiple rows per input row, but:

  • Output ordering is not guaranteed unless explicitly ordered downstream

7.4 NULL-Handling Modes

Two modes define how UDFs treat NULL input:

  • RETURNS NULL ON NULL INPUT

  • CALLED ON NULL INPUT

This is frequently tested on SnowPro exams.
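
A hedged sketch of both modes (function names are illustrative):

-- Snowflake returns NULL without running the body on NULL input
CREATE OR REPLACE FUNCTION initials_strict(name VARCHAR)
  RETURNS VARCHAR
  RETURNS NULL ON NULL INPUT
AS
$$
  UPPER(LEFT(name, 1))
$$;

-- The body runs even for NULL input, so it can substitute a default
CREATE OR REPLACE FUNCTION initials_lenient(name VARCHAR)
  RETURNS VARCHAR
  CALLED ON NULL INPUT
AS
$$
  COALESCE(UPPER(LEFT(name, 1)), '?')
$$;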

7.5 External Call Restrictions

UDFs cannot make external API calls or access remote systems directly; outbound access must be mediated by external functions or external access integrations.

8. Governance and Policy Interaction with Transformations

Snowflake enforces governance policies automatically across transformation pipelines.

8.1 Automatic Policy Application

Masking Policies and Row Access Policies automatically apply to all downstream objects, including:

  • Views

  • Materialized Views

  • Dynamic Tables (depending on configuration)

This preserves governance guarantees.

8.2 Tag Propagation

Tags applied to:

  • Databases

  • Schemas

  • Tables

  • Columns

may propagate through transformations such as:

  • Views

  • Materialized Views

  • Dynamic Tables

Propagation behavior can be configured.

8.3 Enforcement

If transformations violate governance rules, Snowflake may:

  • Reject the operation

  • Require explicit policy override

  • Enforce masking or filtering during execution

Understanding policy precedence and propagation is critical for real-world compliance scenarios.

Frequently Asked Questions

What is a stream in Snowflake?

Answer:

A stream is an object that tracks changes (inserts, updates, deletes) to a table using change data capture (CDC).

Explanation:

Streams record row-level changes since the last consumption. They do not store data themselves but reference table metadata. A common mistake is assuming streams persist historical data indefinitely—they only track changes until consumed.


What is a task in Snowflake?

Answer:

A task is a scheduled or triggered job used to automate SQL execution, often for data transformations.

Explanation:

Tasks can run on a schedule or after another task completes. They are commonly used with streams for incremental processing. A common mistake is not assigning a warehouse, causing execution failure.


How do streams and tasks work together?

Answer:

Streams capture data changes, and tasks process those changes automatically by running transformation queries.

Explanation:

This enables incremental ETL pipelines. Tasks query streams to process only new data. A common mistake is running tasks without consuming the stream, leading to repeated processing.


What is the difference between a task and a stored procedure?

Answer:

A task schedules execution, while a stored procedure contains procedural logic.

Explanation:

Tasks trigger execution, whereas procedures define reusable logic. They are often combined. A common mistake is expecting tasks to contain complex logic without procedures.


What is incremental data processing in Snowflake?

Answer:

It is the process of transforming only new or changed data instead of reprocessing entire datasets.

Explanation:

Streams enable tracking changes, while tasks automate processing. This improves efficiency and reduces compute cost. A common mistake is performing full refresh transformations unnecessarily.

