COF-C02 Data Loading and Unloading

Data Loading and Unloading Detailed Explanation

Loading and unloading are essential Snowflake operations: they define how data enters and leaves Snowflake.
The exam expects you to understand stages, file formats, COPY INTO behavior, and Snowpipe.

1. Stages

Stages are Snowflake’s file-storage areas used for:

  • Loading data into Snowflake

  • Unloading data from Snowflake

  • Staging files for pipelines

  • Interacting with cloud storage

Snowflake divides stages into internal and external.

1.1 Internal Stages

Internal stages are fully managed by Snowflake.
You do not need cloud credentials to use them.

1.1.1 User Stage (@~)
  • Each user automatically gets a personal stage.

  • Accessible only to that user.

  • Useful for quick tests, small file uploads, or development tasks.
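
A quick user-stage round trip from SnowSQL might look like this (local path hypothetical):

-- Upload a local file to your user stage, then list its contents
PUT file:///tmp/data.csv @~/staged/;
LIST @~;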

1.1.2 Table Stage (@%table_name)
  • Every table has its own table stage.

  • Path pattern: @%mytable

  • Useful for:

    • Temporary exports

    • Unloading for table-specific workflows

    • Loading data tied to a single table
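
A minimal table-stage flow, assuming a local file and a table named mytable (names hypothetical):

-- Upload to the table stage, then load from it
PUT file:///tmp/mytable_rows.csv @%mytable;

COPY INTO mytable
FROM @%mytable
FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);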

1.1.3 Named Internal Stage (@mystage)

Created using:

CREATE STAGE mystage;

Advantages:

  • Can be shared across teams

  • Can specify default file formats

  • Can apply encryption or compression options

  • Preferred for production pipelines
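
For example, a named stage with a default file format (stage and format names hypothetical) could be created as:

CREATE STAGE team_stage
FILE_FORMAT = (FORMAT_NAME = 'my_csv_format');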

1.2 External Stages

External stages point to cloud storage:

  • Amazon S3

  • Google Cloud Storage (GCS)

  • Azure Blob Storage

They require:

  • Storage location URL

  • Credentials or storage integration

  • Optional file format defaults

Example: S3 External Stage
CREATE STAGE my_s3_stage
URL='s3://my-bucket/path/'
CREDENTIALS=(AWS_KEY_ID='xxx' AWS_SECRET_KEY='yyy');

External stages are ideal when:

  • Data already exists in cloud storage

  • Multiple systems share the data lake

  • You want Snowflake to load from ETL outputs stored in the cloud
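
Embedding keys as shown above works, but a storage integration (covered in the additional content below) is the more secure pattern; a sketch, assuming an integration named my_s3_int already exists:

CREATE STAGE my_s3_stage_int
URL = 's3://my-bucket/path/'
STORAGE_INTEGRATION = my_s3_int;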

1.3 Key Exam Concepts About Stages

  • Internal vs external stages:

    • Internal = Snowflake-managed

    • External = Cloud-storage-based

  • Where files physically live:

    • Internal → inside Snowflake-managed storage

    • External → inside AWS/Azure/GCP buckets

  • Creating stages:
    Use CREATE STAGE with internal or external parameters.

2. File Formats

Snowflake must understand how to parse incoming or outgoing data.
File formats tell Snowflake how to interpret fields, delimiters, compression, and more.

2.1 Creating File Formats

You may define reusable file formats:

CREATE FILE FORMAT my_csv_format
TYPE = CSV
FIELD_DELIMITER = ','
SKIP_HEADER = 1;

Or you can specify formats inline in COPY INTO commands.

2.2 Common File Types

  • CSV (comma-delimited or other delimiters)

  • JSON

  • Parquet

  • Avro

  • ORC

  • XML

Snowflake supports structured and semi-structured formats equally well.

2.3 Important Options

File format options include:

  • FIELD_DELIMITER (e.g., comma, pipe, tab)

  • FIELD_OPTIONALLY_ENCLOSED_BY (for quoted fields)

  • ESCAPE

  • NULL_IF (NULL handling)

  • ENCODING

  • DATE_FORMAT / TIME_FORMAT / TIMESTAMP_FORMAT

  • COMPRESSION (e.g., GZIP, BZIP2, ZSTD)

These options determine how Snowflake parses files during COPY.
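
A fuller CSV format combining several of these options (values illustrative):

CREATE FILE FORMAT my_csv_full
TYPE = CSV
FIELD_DELIMITER = '|'
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
NULL_IF = ('NULL', '')
DATE_FORMAT = 'YYYY-MM-DD'
COMPRESSION = GZIP;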

3. Bulk Loading with COPY INTO

COPY INTO is Snowflake’s primary command for bulk data ingestion.

3.1 Typical Copy Pattern

COPY INTO my_table
FROM @mystage/path/
FILE_FORMAT = (FORMAT_NAME = my_csv_format)
ON_ERROR = 'CONTINUE';

COPY INTO:

  • Reads files from a stage

  • Applies the file format

  • Loads parsed rows into a table

3.2 COPY INTO Highlights

3.2.1 Parallel Loading

If there are many files:

  • Snowflake loads them in parallel

  • Data ingestion is fast and scalable

  • Parallelism depends on warehouse size

3.2.2 Loading Semi-Structured Data

Semi-structured formats (JSON, Parquet) can be loaded directly into:

  • VARIANT columns

  • Snowflake tables with automatic column parsing
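
A minimal sketch for JSON, assuming a table with a single VARIANT column (names hypothetical):

CREATE TABLE raw_json (v VARIANT);

COPY INTO raw_json
FROM @mystage/json/
FILE_FORMAT = (TYPE = JSON);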

3.2.3 Error Handling

ON_ERROR controls what happens during ingestion:

  • 'ABORT_STATEMENT' (default) → stop on error

  • 'CONTINUE' → skip bad rows

  • 'SKIP_FILE' → skip whole files with errors

3.2.4 Validation Mode

Useful for checking data before loading; it is added as an option on the COPY INTO statement:

VALIDATION_MODE = RETURN_ERRORS

This shows exactly which rows or files would cause issues, without loading any data.

3.2.5 Column Mapping

Two modes:

  • By position (default)

  • By name (MATCH_BY_COLUMN_NAME)

Mapping by name makes COPY more robust for changing schemas.
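
A sketch of name-based mapping, assuming Parquet files whose column names match the target table:

COPY INTO my_table
FROM @mystage/parquet/
FILE_FORMAT = (TYPE = PARQUET)
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;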

4. Continuous Loading – Snowpipe / Snowpipe Streaming

Snowflake offers two continuous-ingestion mechanisms:

  • Snowpipe (classic)

  • Snowpipe Streaming (lower latency)

4.1 Snowpipe (Classic)

4.1.1 What Snowpipe Does

Snowpipe loads data automatically after files land in a stage.

Key characteristics:

  • Serverless (no warehouse needed)

  • Continuous ingestion

  • Delay usually seconds to minutes

  • Ideal for micro-batch architectures

4.1.2 Pipe Object

Snowpipe uses a pipe, which contains a COPY INTO command:

CREATE PIPE mypipe AS
COPY INTO raw_events
FROM @mystage/events/
FILE_FORMAT = (TYPE = JSON);

4.1.3 Trigger Methods

Snowpipe can be triggered by:

  • Cloud storage notifications (S3 Events, GCS PubSub, Azure Events)

  • REST API calls

  • Manual refresh (ALTER PIPE ... REFRESH)
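
For event-driven ingestion, the pipe is created with AUTO_INGEST enabled (this assumes the stage is an external stage with cloud notifications configured), and a manual refresh can backfill files that arrived earlier:

CREATE PIPE mypipe_auto AUTO_INGEST = TRUE AS
COPY INTO raw_events
FROM @mystage/events/
FILE_FORMAT = (TYPE = JSON);

ALTER PIPE mypipe_auto REFRESH;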

4.2 Snowpipe Streaming

4.2.1 What It Is

Snowpipe Streaming provides:

  • Row-level ingestion

  • Very low latency (milliseconds → seconds)

  • Best for real-time pipelines

4.2.2 Usage Scenarios

Commonly used with:

  • Kafka connector

  • Custom streaming applications

  • Event-driven architectures

Snowpipe Streaming writes data directly into a target table without waiting for batch files.

5. Data Unloading (Export)

Unloading exports data from Snowflake into files in an internal or external stage.

5.1 COPY INTO (Unloading)

A typical unload pattern:

COPY INTO @mystage/unload/path/
FROM (SELECT * FROM my_table)
FILE_FORMAT = (TYPE = CSV);

This writes a set of output CSV files into the stage.

5.2 What You Do After Unloading

5.2.1 If It’s an Internal Stage

You download files using:

GET @mystage/unload/path/ file:///localfolder/;

(GET is run from a client such as SnowSQL; it is not supported in web-interface worksheets.)

5.2.2 If It’s an External Stage

You download files using:

  • AWS CLI (for S3)

  • Azure CLI / Storage Explorer

  • GCP Storage commands

Snowflake simply writes files to the bucket path; you manage them from the cloud provider tools.

Data Loading and Unloading (Additional Content)

1. Key COPY INTO Options (Advanced)

Snowflake’s COPY INTO <table> command supports several advanced options that give fine control over load behavior. These options are frequently tested because they influence both correctness and performance in production pipelines.

1.1 FORCE = TRUE

  • Forces Snowflake to reload files even if they are already recorded in the load history.

  • Normally, Snowflake prevents reloading previously loaded files to ensure idempotency.

  • FORCE overrides this safeguard.

  • Useful when reprocessing files after schema changes or corrections.

1.2 PURGE = TRUE

  • Deletes files from the stage after successful load.

  • Helps prevent duplicate processing.

  • Typically used with external stages when files are only needed once.

1.3 PATTERN = '<regex>'

  • Filters which files Snowflake loads from a stage path.

  • Supports full regular expressions.

  • Important when stage contains many files but only certain subsets should be loaded.

1.4 FILES = ('file1', 'file2')

  • Restricts the load to specific files.

  • Overrides directory scanning and pattern matching.

  • Ideal for retrying or testing a small subset of files.

1.5 TRUNCATECOLUMNS = TRUE

  • Automatically truncates values that exceed the target column size.

  • Prevents load failures and allows partial ingestion.

  • Useful in semi-trusted data ingestion scenarios.

1.6 ERROR_ON_COLUMN_COUNT_MISMATCH

Controls behavior when the number of fields in a row does not match the target table schema.
Common settings:

  • TRUE: error out

  • FALSE: pad with NULL or truncate depending on file format settings

This is required knowledge for handling CSV and semi-structured ingestion inconsistencies.
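
A combined sketch using several of these options together (pattern and names illustrative):

COPY INTO my_table
FROM @mystage/path/
PATTERN = '.*sales_.*[.]csv'
FILE_FORMAT = (FORMAT_NAME = 'my_csv_format')
FORCE = TRUE
PURGE = TRUE
TRUNCATECOLUMNS = TRUE;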

2. Load Monitoring and History

Snowflake provides extensive tooling to analyze, monitor, and validate bulk loading operations.

2.1 COPY_HISTORY Table Function

Returns historical load operations for a given table or stage.
Includes:

  • File names

  • Load status

  • Rows loaded

  • Load timestamps

Useful for debugging pipeline failures.
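
For example, to review the last 24 hours of loads into a table (table name hypothetical):

SELECT *
FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
  TABLE_NAME => 'RAW_EVENTS',
  START_TIME => DATEADD(hour, -24, CURRENT_TIMESTAMP())
));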

2.2 LOAD_HISTORY Views

Available via:

  • ACCOUNT_USAGE

  • INFORMATION_SCHEMA

These views store system-wide load audit data:

  • File-level status

  • Number of errors

  • Staged file metadata

  • First load and last load times

Often used in governance and monitoring dashboards.

2.3 VALIDATE() Function

Reports the errors encountered during a previous execution of COPY INTO (identified by its job ID), without re-running the load.
Returns details such as:

  • Parsing errors

  • Invalid rows

  • Incorrect column counts

Used during schema evolution and load testing.
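
A sketch that inspects errors from the most recent COPY into a table (the '_last' job reference is supported):

SELECT *
FROM TABLE(VALIDATE(my_table, JOB_ID => '_last'));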

2.4 RESULT_SCAN() After COPY

After a COPY INTO operation, results can be retrieved using:

SELECT * FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));

Provides row counts, skipped rows, and summary details.
Critical for automated ingestion monitoring.

3. Storage Integrations for External Stages

Storage integrations define secure connections between Snowflake and cloud storage platforms.

3.1 Supported Providers

  • Amazon S3

  • Azure Blob / ADLS

  • Google Cloud Storage

3.2 Purpose

  • Provide secure, credential-free access from Snowflake to external storage.

  • Remove the need to embed hard-coded credentials (keys, secrets).

  • Required when using event-based ingestion (Snowpipe auto-notifications).

3.3 Behavior

  • Snowflake stores and manages the integration metadata.

  • The cloud provider grants Snowflake IAM-level permissions via trust relationships.

  • Ensures secure, auditable, least-privilege access.

Storage integrations are essential for enterprises with strong security policies.
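
A sketch of an S3 integration (role ARN and bucket hypothetical):

CREATE STORAGE INTEGRATION my_s3_int
TYPE = EXTERNAL_STAGE
STORAGE_PROVIDER = 'S3'
ENABLED = TRUE
STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-access'
STORAGE_ALLOWED_LOCATIONS = ('s3://my-bucket/path/');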

4. Transformations During LOAD

Snowflake allows transformations to be embedded inside the COPY INTO command through an inline SELECT.

4.1 Supported Transformations

  • CAST or TRY_CAST

  • Date/time conversions such as TO_TIMESTAMP

  • JSON field extraction using $1:property syntax

  • Column reordering or selective extraction

  • Data type normalization

4.2 Pattern Example

COPY INTO target_table
FROM (
  SELECT
    $1::NUMBER,
    TO_TIMESTAMP($2),
    $3::STRING
  FROM @stage
)
FILE_FORMAT = (TYPE = CSV);

Note that the $1:property extraction syntax applies to semi-structured formats (e.g., JSON or Parquet); with CSV, $1, $2, $3 refer to positional fields.

This lets you preprocess data during ingestion, reducing or eliminating the need for post-load transformation steps.

5. MERGE for Upsert Pipelines

MERGE is a foundational command for incremental ingestion patterns.

5.1 Typical Pattern

  1. Load raw files into a staging table via COPY

  2. Execute MERGE into the final modeled table
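
A minimal sketch of step 2, assuming a staging table stg_customers and a target table dim_customers keyed on id (all names hypothetical):

MERGE INTO dim_customers AS t
USING stg_customers AS s
  ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET t.name = s.name, t.updated_at = s.updated_at
WHEN NOT MATCHED THEN
  INSERT (id, name, updated_at)
  VALUES (s.id, s.name, s.updated_at);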

5.2 What MERGE Handles

  • Inserts

  • Updates

  • Deduplication

  • Slowly Changing Dimensions (SCD) scenarios

5.3 Integration with Other Features

  • Can be paired with Streams and Tasks for automation

  • Works well for incremental ELT pipelines

  • Commonly used in production data warehousing workloads

MERGE is one of the most important transformation commands to know for the exam.

6. Snowpipe Cost and Behavior Details

Snowpipe provides continuous, serverless ingestion of staged files.

6.1 Compute Model

  • Does not use customer warehouses

  • Uses Snowflake-managed serverless compute

  • Billed for the serverless compute used, plus a small per-file overhead charge

6.2 Triggering Methods

  • Cloud provider event notifications

  • REST API calls

  • Manual refresh commands

6.3 Latency Profile

  • Designed for micro-batch ingestion

  • Typical latency: seconds to minutes

Snowpipe is ideal for continuous ingestion of small files.

7. Snowpipe Streaming Cost Model

Snowpipe Streaming provides ultra-low-latency ingestion without requiring staged files.

7.1 Characteristics

  • Ingests rows directly into Snowflake tables

  • Latency measured in milliseconds to seconds

  • Suitable for Kafka, CDC systems, and event buses

7.2 Cost Model

  • Billed on the volume of data ingested, plus a per-client charge

  • No per-file cost

  • No staging operations

Snowpipe Streaming is the preferred option for real-time pipelines.

8. Unloading Advanced Options

Snowflake supports flexible data export via COPY INTO <location>.

8.1 OVERWRITE = TRUE

  • Replaces existing files in the target stage

  • Useful for periodic full exports

  • Prevents directory accumulation

8.2 SINGLE = TRUE

  • Produces a single output file

  • Useful for downstream systems requiring single-file loads

  • Not recommended for very large datasets due to size limits

8.3 MAX_FILE_SIZE

  • Controls how large each output file may become

  • Enables fine control over parallelism and downstream file consumption

8.4 HEADER = TRUE

  • Includes column names in exported CSV files

  • Important for analysts or external consumption

8.5 COMPRESSION

Supports:

  • GZIP

  • NONE

  • AUTO

Compression affects both file size and export performance.
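
A combined unload sketch using several of these options (stage path hypothetical):

COPY INTO @mystage/export/daily.csv.gz
FROM my_table
FILE_FORMAT = (TYPE = CSV COMPRESSION = GZIP)
HEADER = TRUE
SINGLE = TRUE
OVERWRITE = TRUE;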

Frequently Asked Questions

What is the purpose of the COPY INTO command?

Answer:

COPY INTO is used to load data from staged files into Snowflake tables or unload data from tables into external locations.

Explanation:

It is the primary bulk data loading mechanism. It supports multiple file formats and parallel loading. A common mistake is confusing load vs unload syntax.

Demand Score: 83

Exam Relevance Score: 93

What is a stage in Snowflake?

Answer:

A stage is a location where data files are stored for loading or unloading.

Explanation:

Stages can be internal (managed by Snowflake) or external (e.g., S3, Azure Blob). They act as intermediaries between storage and tables. A common mistake is thinking data loads directly from local systems without staging.

Demand Score: 82

Exam Relevance Score: 92

What is the difference between internal and external stages?

Answer:

Internal stages are managed by Snowflake, while external stages reference cloud storage like S3 or Azure.

Explanation:

Internal stages are easier to use but less flexible. External stages enable integration with existing data lakes. A common mistake is not configuring credentials correctly for external stages.

Demand Score: 80

Exam Relevance Score: 91

What file formats does Snowflake support for loading?

Answer:

Snowflake supports formats like CSV, JSON, Avro, ORC, and Parquet.

Explanation:

Semi-structured formats like JSON are automatically parsed into VARIANT columns. A common mistake is not defining proper file format options.

Demand Score: 78

Exam Relevance Score: 88

What happens if data loading fails partially?

Answer:

Snowflake can continue loading valid rows while logging errors, depending on COPY options.

Explanation:

Options like ON_ERROR control behavior. A common mistake is not reviewing load error logs.

Demand Score: 79

Exam Relevance Score: 90
