COF-C02 Data Loading and Unloading

Data Loading and Unloading Detailed Explanation

Loading and unloading are essential Snowflake operations: they define how data enters and leaves Snowflake.
The exam expects you to understand stages, file formats, COPY INTO behavior, and Snowpipe.

1. Stages

Stages are Snowflake’s file-storage areas used for:

  • Loading data into Snowflake

  • Unloading data from Snowflake

  • Staging files for pipelines

  • Interacting with cloud storage

Snowflake divides stages into internal and external.

1.1 Internal Stages

Internal stages are fully managed by Snowflake.
You do not need cloud credentials to use them.

1.1.1 User Stage (@~)
  • Each user automatically gets a personal stage.

  • Accessible only to that user.

  • Useful for quick tests, small file uploads, or development tasks.
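
A quick user-stage round trip from SnowSQL might look like this (local path hypothetical):

-- Upload a local file to your user stage, then list its contents
PUT file:///tmp/data.csv @~/staged/;
LIST @~;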

1.1.2 Table Stage (@%table_name)
  • Every table has its own table stage.

  • Path pattern: @%mytable

  • Useful for:

    • Temporary exports

    • Unloading for table-specific workflows

    • Loading data tied to a single table
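
A minimal table-stage flow, assuming a local file and a table named mytable (names hypothetical):

-- Upload to the table stage, then load from it
PUT file:///tmp/mytable_rows.csv @%mytable;

COPY INTO mytable
FROM @%mytable
FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);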

1.1.3 Named Internal Stage (@mystage)

Created using:

CREATE STAGE mystage;

Advantages:

  • Can be shared across teams

  • Can specify default file formats

  • Can apply encryption or compression options

  • Preferred for production pipelines
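
For example, a named stage with a default file format (stage and format names hypothetical) could be created as:

CREATE STAGE team_stage
FILE_FORMAT = (FORMAT_NAME = 'my_csv_format');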

1.2 External Stages

External stages point to cloud storage:

  • Amazon S3

  • Google Cloud Storage (GCS)

  • Azure Blob Storage

They require:

  • Storage location URL

  • Credentials or storage integration

  • Optional file format defaults

Example: S3 External Stage
CREATE STAGE my_s3_stage
URL='s3://my-bucket/path/'
CREDENTIALS=(AWS_KEY_ID='xxx' AWS_SECRET_KEY='yyy');

External stages are ideal when:

  • Data already exists in cloud storage

  • Multiple systems share the data lake

  • You want Snowflake to load from ETL outputs stored in the cloud
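
Embedding keys as shown above works, but a storage integration (covered in the additional content below) is the more secure pattern; a sketch, assuming an integration named my_s3_int already exists:

CREATE STAGE my_s3_stage_int
URL = 's3://my-bucket/path/'
STORAGE_INTEGRATION = my_s3_int;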

1.3 Key Exam Concepts About Stages

  • Internal vs external stages:

    • Internal = Snowflake-managed

    • External = Cloud-storage-based

  • Where files physically live:

    • Internal → inside Snowflake-managed storage

    • External → inside AWS/Azure/GCP buckets

  • Creating stages:
    Use CREATE STAGE with internal or external parameters.

2. File Formats

Snowflake must understand how to parse incoming or outgoing data.
File formats tell Snowflake how to interpret fields, delimiters, compression, and more.

2.1 Creating File Formats

You may define reusable file formats:

CREATE FILE FORMAT my_csv_format
TYPE = CSV
FIELD_DELIMITER = ','
SKIP_HEADER = 1;

Or you can specify formats inline in COPY INTO commands.

2.2 Common File Types

  • CSV (comma-delimited or other delimiters)

  • JSON

  • Parquet

  • Avro

  • ORC

  • XML

Snowflake supports structured and semi-structured formats equally well.

2.3 Important Options

File format options include:

  • FIELD_DELIMITER (e.g., comma, pipe, tab)

  • FIELD_OPTIONALLY_ENCLOSED_BY (for quoted fields)

  • ESCAPE

  • NULL_IF (NULL handling)

  • ENCODING

  • DATE_FORMAT / TIME_FORMAT / TIMESTAMP_FORMAT

  • COMPRESSION (e.g., GZIP, BZIP2, ZSTD)

These options determine how Snowflake parses files during COPY.
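
A fuller CSV format combining several of these options (values illustrative):

CREATE FILE FORMAT my_csv_full
TYPE = CSV
FIELD_DELIMITER = '|'
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
NULL_IF = ('NULL', '')
DATE_FORMAT = 'YYYY-MM-DD'
COMPRESSION = GZIP;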

3. Bulk Loading with COPY INTO

COPY INTO is Snowflake’s primary command for bulk data ingestion.

3.1 Typical Copy Pattern

COPY INTO my_table
FROM @mystage/path/
FILE_FORMAT = (FORMAT_NAME = my_csv_format)
ON_ERROR = 'CONTINUE';

COPY INTO:

  • Reads files from a stage

  • Applies the file format

  • Loads parsed rows into a table

3.2 COPY INTO Highlights

3.2.1 Parallel Loading

If there are many files:

  • Snowflake loads them in parallel

  • Data ingestion is fast and scalable

  • Parallelism depends on warehouse size

3.2.2 Loading Semi-Structured Data

Semi-structured formats (JSON, Parquet) can be loaded directly into:

  • VARIANT columns

  • Snowflake tables with automatic column parsing
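
A minimal sketch for JSON, assuming a table with a single VARIANT column (names hypothetical):

CREATE TABLE raw_json (v VARIANT);

COPY INTO raw_json
FROM @mystage/json/
FILE_FORMAT = (TYPE = JSON);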

3.2.3 Error Handling

ON_ERROR controls what happens during ingestion:

  • 'ABORT_STATEMENT' (default) → stop on error

  • 'CONTINUE' → skip bad rows

  • 'SKIP_FILE' → skip whole files with errors

3.2.4 Validation Mode

Useful for checking data before loading; it is added as an option on the COPY INTO statement:

VALIDATION_MODE = RETURN_ERRORS

This shows exactly which rows or files would cause issues, without loading any data.

3.2.5 Column Mapping

Two modes:

  • By position (default)

  • By name (MATCH_BY_COLUMN_NAME)

Mapping by name makes COPY more robust for changing schemas.
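
A sketch of name-based mapping, assuming Parquet files whose column names match the target table:

COPY INTO my_table
FROM @mystage/parquet/
FILE_FORMAT = (TYPE = PARQUET)
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;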

4. Continuous Loading – Snowpipe / Snowpipe Streaming

Snowflake offers two continuous-ingestion mechanisms:

  • Snowpipe (classic)

  • Snowpipe Streaming (lower latency)

4.1 Snowpipe (Classic)

4.1.1 What Snowpipe Does

Snowpipe loads data automatically after files land in a stage.

Key characteristics:

  • Serverless (no warehouse needed)

  • Continuous ingestion

  • Delay usually seconds to minutes

  • Ideal for micro-batch architectures

4.1.2 Pipe Object

Snowpipe uses a pipe, which contains a COPY INTO command:

CREATE PIPE mypipe AS
COPY INTO raw_events
FROM @mystage/events/
FILE_FORMAT = (TYPE = JSON);

4.1.3 Trigger Methods

Snowpipe can be triggered by:

  • Cloud storage notifications (S3 Events, GCS PubSub, Azure Events)

  • REST API calls

  • Manual refresh (ALTER PIPE ... REFRESH)
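
For event-driven ingestion, the pipe is created with AUTO_INGEST enabled (this assumes the stage is an external stage with cloud notifications configured), and a manual refresh can backfill files that arrived earlier:

CREATE PIPE mypipe_auto AUTO_INGEST = TRUE AS
COPY INTO raw_events
FROM @mystage/events/
FILE_FORMAT = (TYPE = JSON);

ALTER PIPE mypipe_auto REFRESH;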

4.2 Snowpipe Streaming

4.2.1 What It Is

Snowpipe Streaming provides:

  • Row-level ingestion

  • Very low latency (milliseconds → seconds)

  • Best for real-time pipelines

4.2.2 Usage Scenarios

Commonly used with:

  • Kafka connector

  • Custom streaming applications

  • Event-driven architectures

Snowpipe Streaming writes data directly into a target table without waiting for batch files.

5. Data Unloading (Export)

Unloading exports data from Snowflake into files in an internal or external stage.

5.1 COPY INTO (Unloading)

A typical unload pattern:

COPY INTO @mystage/unload/path/
FROM (SELECT * FROM my_table)
FILE_FORMAT = (TYPE = CSV);

This writes a set of output CSV files into the stage.

5.2 What You Do After Unloading

5.2.1 If It’s an Internal Stage

You download files using:

GET @mystage/unload/path/ file:///localfolder/;

(GET is run from a client such as SnowSQL; it is not supported in web-interface worksheets.)

5.2.2 If It’s an External Stage

You download files using:

  • AWS CLI (for S3)

  • Azure CLI / Storage Explorer

  • GCP Storage commands

Snowflake simply writes files to the bucket path; you manage them from the cloud provider tools.

Data Loading and Unloading (Additional Content)

1. Key COPY INTO Options (Advanced)

Snowflake’s COPY INTO <table> command supports several advanced options that give fine control over load behavior. These options are frequently tested because they influence both correctness and performance in production pipelines.

1.1 FORCE = TRUE

  • Forces Snowflake to reload files even if they are already recorded in the load history.

  • Normally, Snowflake prevents reloading previously loaded files to ensure idempotency.

  • FORCE overrides this safeguard.

  • Useful when reprocessing files after schema changes or corrections.

1.2 PURGE = TRUE

  • Deletes files from the stage after successful load.

  • Helps prevent duplicate processing.

  • Typically used with external stages when files are only needed once.

1.3 PATTERN = '<regex>'

  • Filters which files Snowflake loads from a stage path.

  • Supports full regular expressions.

  • Important when stage contains many files but only certain subsets should be loaded.

1.4 FILES = ('file1', 'file2')

  • Restricts the load to specific files.

  • Overrides directory scanning and pattern matching.

  • Ideal for retrying or testing a small subset of files.

1.5 TRUNCATECOLUMNS = TRUE

  • Automatically truncates values that exceed the target column size.

  • Prevents load failures and allows partial ingestion.

  • Useful in semi-trusted data ingestion scenarios.

1.6 ERROR_ON_COLUMN_COUNT_MISMATCH

Controls behavior when the number of fields in a row does not match the target table schema.
Common settings:

  • TRUE: error out

  • FALSE: pad with NULL or truncate depending on file format settings

This is required knowledge for handling CSV and semi-structured ingestion inconsistencies.
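
A combined sketch using several of these options together (pattern and names illustrative):

COPY INTO my_table
FROM @mystage/path/
PATTERN = '.*sales_.*[.]csv'
FILE_FORMAT = (FORMAT_NAME = 'my_csv_format')
FORCE = TRUE
PURGE = TRUE
TRUNCATECOLUMNS = TRUE;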

2. Load Monitoring and History

Snowflake provides extensive tooling to analyze, monitor, and validate bulk loading operations.

2.1 COPY_HISTORY Table Function

Returns historical load operations for a given table or stage.
Includes:

  • File names

  • Load status

  • Rows loaded

  • Load timestamps

Useful for debugging pipeline failures.
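
For example, to review the last 24 hours of loads into a table (table name hypothetical):

SELECT *
FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
  TABLE_NAME => 'RAW_EVENTS',
  START_TIME => DATEADD(hour, -24, CURRENT_TIMESTAMP())
));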

2.2 LOAD_HISTORY Views

Available via:

  • ACCOUNT_USAGE

  • INFORMATION_SCHEMA

These views store system-wide load audit data:

  • File-level status

  • Number of errors

  • Staged file metadata

  • First load and last load times

Often used in governance and monitoring dashboards.

2.3 VALIDATE() Function

Reports the errors encountered during a previous execution of COPY INTO (identified by its job ID), without re-running the load.
Returns details such as:

  • Parsing errors

  • Invalid rows

  • Incorrect column counts

Used during schema evolution and load testing.
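
A sketch that inspects errors from the most recent COPY into a table (the '_last' job reference is supported):

SELECT *
FROM TABLE(VALIDATE(my_table, JOB_ID => '_last'));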

2.4 RESULT_SCAN() After COPY

After a COPY INTO operation, results can be retrieved using:

SELECT * FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));

Provides row counts, skipped rows, and summary details.
Critical for automated ingestion monitoring.

3. Storage Integrations for External Stages

Storage integrations define secure connections between Snowflake and cloud storage platforms.

3.1 Supported Providers

  • Amazon S3

  • Azure Blob / ADLS

  • Google Cloud Storage

3.2 Purpose

  • Provide secure, credential-free access from Snowflake to external storage.

  • Remove the need to embed hard-coded credentials (keys, secrets).

  • Required when using event-based ingestion (Snowpipe auto-notifications).

3.3 Behavior

  • Snowflake stores and manages the integration metadata.

  • The cloud provider grants Snowflake IAM-level permissions via trust relationships.

  • Ensures secure, auditable, least-privilege access.

Storage integrations are essential for enterprises with strong security policies.
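
A sketch of an S3 integration (role ARN and bucket hypothetical):

CREATE STORAGE INTEGRATION my_s3_int
TYPE = EXTERNAL_STAGE
STORAGE_PROVIDER = 'S3'
ENABLED = TRUE
STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-access'
STORAGE_ALLOWED_LOCATIONS = ('s3://my-bucket/path/');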

4. Transformations During LOAD

Snowflake allows transformations to be embedded inside the COPY INTO command through an inline SELECT.

4.1 Supported Transformations

  • CAST or TRY_CAST

  • Date/time conversions such as TO_TIMESTAMP

  • JSON field extraction using $1:property syntax

  • Column reordering or selective extraction

  • Data type normalization

4.2 Pattern Example

COPY INTO target_table
FROM (
  SELECT
    $1::NUMBER,
    TO_TIMESTAMP($2),
    $3::STRING
  FROM @stage
)
FILE_FORMAT = (TYPE = CSV);

Note that the $1:property extraction syntax applies to semi-structured formats (e.g., JSON or Parquet); with CSV, $1, $2, $3 refer to positional fields.

This lets you preprocess data during ingestion, reducing or eliminating the need for post-load transformation steps.

5. MERGE for Upsert Pipelines

MERGE is a foundational command for incremental ingestion patterns.

5.1 Typical Pattern

  1. Load raw files into a staging table via COPY

  2. Execute MERGE into the final modeled table
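
A minimal sketch of step 2, assuming a staging table stg_customers and a target table dim_customers keyed on id (all names hypothetical):

MERGE INTO dim_customers AS t
USING stg_customers AS s
  ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET t.name = s.name, t.updated_at = s.updated_at
WHEN NOT MATCHED THEN
  INSERT (id, name, updated_at)
  VALUES (s.id, s.name, s.updated_at);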

5.2 What MERGE Handles

  • Inserts

  • Updates

  • Deduplication

  • Slowly Changing Dimensions (SCD) scenarios

5.3 Integration with Other Features

  • Can be paired with Streams and Tasks for automation

  • Works well for incremental ELT pipelines

  • Commonly used in production data warehousing workloads

MERGE is one of the most important transformation commands to know for the exam.

6. Snowpipe Cost and Behavior Details

Snowpipe provides continuous, serverless ingestion of staged files.

6.1 Compute Model

  • Does not use customer warehouses

  • Uses Snowflake-managed serverless compute

  • Billed for the serverless compute used, plus a small per-file overhead charge

6.2 Triggering Methods

  • Cloud provider event notifications

  • REST API calls

  • Manual refresh commands

6.3 Latency Profile

  • Designed for micro-batch ingestion

  • Typical latency: seconds to minutes

Snowpipe is ideal for continuous ingestion of small files.

7. Snowpipe Streaming Cost Model

Snowpipe Streaming provides ultra-low-latency ingestion without requiring staged files.

7.1 Characteristics

  • Ingests rows directly into Snowflake tables

  • Latency measured in milliseconds to seconds

  • Suitable for Kafka, CDC systems, and event buses

7.2 Cost Model

  • Billed on the volume of data ingested, plus a per-client charge

  • No per-file cost

  • No staging operations

Snowpipe Streaming is the preferred option for real-time pipelines.

8. Unloading Advanced Options

Snowflake supports flexible data export via COPY INTO <location>.

8.1 OVERWRITE = TRUE

  • Replaces existing files in the target stage

  • Useful for periodic full exports

  • Prevents directory accumulation

8.2 SINGLE = TRUE

  • Produces a single output file

  • Useful for downstream systems requiring single-file loads

  • Not recommended for very large datasets due to size limits

8.3 MAX_FILE_SIZE

  • Controls how large each output file may become

  • Enables fine control over parallelism and downstream file consumption

8.4 HEADER = TRUE

  • Includes column names in exported CSV files

  • Important for analysts or external consumption

8.5 COMPRESSION

Supports:

  • GZIP

  • NONE

  • AUTO

Compression affects both file size and export performance.
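
A combined unload sketch using several of these options (stage path hypothetical):

COPY INTO @mystage/export/daily.csv.gz
FROM my_table
FILE_FORMAT = (TYPE = CSV COMPRESSION = GZIP)
HEADER = TRUE
SINGLE = TRUE
OVERWRITE = TRUE;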

Frequently Asked Questions

What is the purpose of the COPY INTO command?

Answer:

COPY INTO is used to load data from staged files into Snowflake tables or unload data from tables into external locations.

Explanation:

It is the primary bulk data loading mechanism. It supports multiple file formats and parallel loading. A common mistake is confusing load vs unload syntax.

Demand Score: 83

Exam Relevance Score: 93

What is a stage in Snowflake?

Answer:

A stage is a location where data files are stored for loading or unloading.

Explanation:

Stages can be internal (managed by Snowflake) or external (e.g., S3, Azure Blob). They act as intermediaries between storage and tables. A common mistake is thinking data loads directly from local systems without staging.

Demand Score: 82

Exam Relevance Score: 92

What is the difference between internal and external stages?

Answer:

Internal stages are managed by Snowflake, while external stages reference cloud storage like S3 or Azure.

Explanation:

Internal stages are easier to use but less flexible. External stages enable integration with existing data lakes. A common mistake is not configuring credentials correctly for external stages.

Demand Score: 80

Exam Relevance Score: 91

What file formats does Snowflake support for loading?

Answer:

Snowflake supports formats like CSV, JSON, Avro, ORC, and Parquet.

Explanation:

Semi-structured formats like JSON are automatically parsed into VARIANT columns. A common mistake is not defining proper file format options.

Demand Score: 78

Exam Relevance Score: 88

What happens if data loading fails partially?

Answer:

Snowflake can continue loading valid rows while logging errors, depending on COPY options.

Explanation:

Options like ON_ERROR control behavior. A common mistake is not reviewing load error logs.

Demand Score: 79

Exam Relevance Score: 90
