Snowflake uses the term AI Data Cloud to describe a unified, cloud-native platform that supports:
Data storage
Data processing
Data sharing
Data governance
Machine learning
AI capabilities
— all in a single logical system, even when running across multiple cloud providers such as AWS, Azure, and Google Cloud.
This vision is based on the idea that AI requires high-quality, well-governed, highly accessible data, and the best way to support this is by unifying all data workloads under one architecture.
“Unified platform” means:
Your Snowflake experience is the same everywhere
Snowflake abstracts away differences between AWS/Azure/GCP
You do not deal with cloud-specific services; Snowflake provides a standard interface
Key benefits:
Portability: replicate data across clouds or regions
Consistency: same SQL, same security model, same architecture
Reduced complexity: Snowflake manages infrastructure differences for you
A single company might run:
Sales analytics in AWS
Marketing data pipelines in Azure
AI models in GCP
…but Snowflake makes them feel like they’re running in one system.
Snowflake supports multiple workloads on the same platform, eliminating the need for disconnected tools:
Analytical SQL workloads
Reporting and dashboarding
Star/snowflake schema analysis
Large aggregations
Snowflake supports semi-structured data formats such as:
JSON
Avro
Parquet
ORC
XML
You can load raw data and query it directly through VARIANT and dedicated SQL functions.
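As a minimal sketch, loading raw JSON into a VARIANT column and querying nested fields might look like this (table and field names are illustrative):

```sql
-- A table with a VARIANT column to hold raw JSON as-is
CREATE TABLE raw_events (payload VARIANT);

-- Query nested fields with path notation, and explode an array
SELECT
    payload:user.id::NUMBER    AS user_id,
    payload:event_type::STRING AS event_type,
    f.value:sku::STRING        AS sku
FROM raw_events,
     LATERAL FLATTEN(input => payload:items) f;
```

The `::` casts convert VARIANT values to typed columns, and FLATTEN turns each element of the `items` array into its own row.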
Snowflake helps data engineers build pipelines using:
SQL (DDL, DML, CTAS)
Streams & Tasks (change data capture + scheduling)
Dynamic Tables (declarative pipelines)
Snowpark (Python/Scala/Java)
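A sketch of how these building blocks combine, with illustrative object names (a task must be resumed with ALTER TASK ... RESUME before it runs):

```sql
-- Change data capture: track inserts/updates/deletes on a source table
CREATE STREAM orders_stream ON TABLE raw_orders;

-- Scheduled task that runs only when the stream has new changes
CREATE TASK load_orders
  WAREHOUSE = etl_wh
  SCHEDULE = '5 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('orders_stream')
AS
  INSERT INTO clean_orders
  SELECT order_id, amount FROM orders_stream;

-- Declarative alternative: a Dynamic Table refreshed to a target lag
CREATE DYNAMIC TABLE daily_totals
  TARGET_LAG = '10 minutes'
  WAREHOUSE = etl_wh
AS
  SELECT order_date, SUM(amount) AS total
  FROM clean_orders
  GROUP BY order_date;
```

Streams + Tasks give imperative control; Dynamic Tables let Snowflake manage the refresh logic from the declared query alone.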
Snowflake supports zero-copy Secure Data Sharing, letting users share live data:
Without copying
Without ETL
With strict governance
Used for sharing data internally across teams or externally with partners.
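A sketch of both sides of a share, with hypothetical account and object names:

```sql
-- Provider side: create a share and grant access to specific objects
CREATE SHARE sales_share;
GRANT USAGE ON DATABASE sales_db TO SHARE sales_share;
GRANT USAGE ON SCHEMA sales_db.public TO SHARE sales_share;
GRANT SELECT ON TABLE sales_db.public.orders TO SHARE sales_share;
ALTER SHARE sales_share ADD ACCOUNTS = partner_account;

-- Consumer side: mount the share as a read-only database
CREATE DATABASE sales_from_partner FROM SHARE provider_account.sales_share;
```

No data is copied; the consumer queries the provider's live micro-partitions under the provider's governance rules.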
Snowflake enables ML workflows with:
Snowpark for Python
Feature engineering in SQL
Storing features/predictions
Integrations with ML platforms
In-database inference (depending on region/edition)
Developers can build applications backed by Snowflake, including:
Analytical applications
Data-intensive services
Snowflake Native Apps (installed into customer accounts)
Snowflake is adding native AI and ML features to keep AI next to the data.
Cortex offers:
Built-in LLM functions
Vector search for semantic retrieval
Embeddings generation
AI-powered SQL assistance
Must be understood conceptually (not deeply) for the exam.
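For a conceptual feel, Cortex LLM functions are called directly from SQL. A sketch (model availability varies by region; table and column names are illustrative):

```sql
-- Call a built-in LLM completion function
SELECT SNOWFLAKE.CORTEX.COMPLETE(
    'mistral-large',
    'Summarize: Snowflake separates storage and compute.'
);

-- Score sentiment over a column, in place, next to the data
SELECT review_text,
       SNOWFLAKE.CORTEX.SENTIMENT(review_text) AS sentiment
FROM product_reviews;
```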
Snowpark allows:
Python-based ML workflows
Pushing compute to Snowflake
Eliminating the need to move data out
A powerful tool for governed ML pipelines.
Snowflake Marketplace provides:
External datasets
Data applications
AI/ML services
Data enrichment tools
Allows easy integration of 3rd-party intelligence into your workflows.
For the exam:
Prioritize architecture and platform design
Know Snowflake supports multiple workloads in one system
Understand multi-cloud, governance, and unified access
AI-specific product features are less important
Snowflake’s architectural foundation is:
Storage is shared. Compute is independent and elastic.
Meaning:
All compute clusters read/write the same data
Compute can scale up/down/out independently
Workloads do not compete for local resources
You don't manage storage layout or indexes
This separation is essential for Snowflake’s performance, simplicity, and elasticity.
Snowflake consists of three logical layers, each with a clear role.
Snowflake stores data in micro-partitions, which are:
Immutable
Columnar
Compressed
Typically 50–500 MB of uncompressed data each (stored compressed, often around 16 MB)
Automatically created
Each partition contains metadata:
Min/max column values
Distinct count
Null count
Other statistics
This metadata enables partition pruning, allowing Snowflake to skip irrelevant micro-partitions during query execution.
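To illustrate pruning (table and column names assumed): if events arrive roughly in date order, each micro-partition's stored min/max for `event_date` lets Snowflake skip partitions entirely outside the filter range.

```sql
-- Only micro-partitions whose min/max event_date overlaps this range
-- are scanned; the rest are pruned using metadata alone
SELECT COUNT(*)
FROM events
WHERE event_date BETWEEN '2024-01-01' AND '2024-01-07';
```

The query profile shows this as "Partitions scanned" versus "Partitions total".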
Snowflake handles:
Partitioning
Compression
Statistics
Metadata organization
File lifecycle
Users do not manage:
Indexes
Vacuuming
Physical layout
Partition definitions
Snowflake fully abstracts these responsibilities.
A virtual warehouse is a compute engine responsible for:
Running queries
Executing DML (INSERT, UPDATE, DELETE, MERGE)
Performing COPY INTO loads
Running Tasks (scheduled operations)
Warehouses:
Do not share memory
Do not share local disk
Each has its own caching layer
All access the same central storage
This allows complete workload isolation.
Warehouses can scale:
Up → change size (XS → S → M, etc.)
Out → add clusters (multi-cluster warehouse)
Scale up = faster single queries
Scale out = better concurrency
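Both scaling directions are single ALTER statements. A sketch with an assumed warehouse name (multi-cluster warehouses require Enterprise Edition or above):

```sql
-- Scale up: a bigger cluster, so individual queries run faster
ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'LARGE';

-- Scale out: more clusters of the same size, for higher concurrency
ALTER WAREHOUSE analytics_wh SET
  MIN_CLUSTER_COUNT = 1,
  MAX_CLUSTER_COUNT = 4;
```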
The Cloud Services layer is the “control plane”, managing:
Authentication
Authorization
Query parsing
Query optimization
Metadata
Transactions
Result cache
Billing
Orchestration
Runs independently of warehouses, allowing some operations (such as metadata-only queries and result-cache hits) to proceed without compute.
All warehouses across the account:
Access the same micro-partitions
Operate on one version of truth
Avoid data duplication
Because compute does not store data locally:
No competition for I/O
Easier to scale compute elastically
Multi-cluster warehouses solve concurrency bottlenecks.
When queues form:
Cluster #1 is busy
Snowflake automatically adds Cluster #2
Then Cluster #3 if needed
When load decreases, extra clusters are automatically suspended.
Ideal for:
BI dashboards
Shared analyst workloads
Spiky workloads
Multiple warehouses for different workloads
Each warehouse scales independently
Auto-suspend and auto-resume optimize cost
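These practices can be combined in a single warehouse definition. A sketch with illustrative names and settings:

```sql
-- One warehouse per workload, with concurrency and cost controls
CREATE WAREHOUSE bi_wh
  WAREHOUSE_SIZE   = 'SMALL'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 3        -- scale out under dashboard load
  SCALING_POLICY   = 'STANDARD'
  AUTO_SUSPEND     = 60        -- suspend after 60 idle seconds
  AUTO_RESUME      = TRUE;     -- wake on the next query
```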
Three caches:
Result Cache: stored in Cloud Services; reused for up to 24 hours if the query text and underlying data are unchanged
Metadata Cache: stored in Cloud Services; enables pruning
Data Cache: stored on warehouse local SSD; lost on suspend
You must know where each cache lives.
Zero-copy cloning clones databases/schemas/tables instantly:
No data copied
New objects reference micro-partitions
Only changed data creates new partitions
Used for:
Dev/test
What-if analysis
Point-in-time recovery
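A sketch of both forms, with assumed object names:

```sql
-- Zero-copy clone: a metadata-only operation, completes in seconds
CREATE DATABASE dev_db CLONE prod_db;

-- Clone a single table as of a past point in time (via Time Travel)
CREATE TABLE orders_backup CLONE orders
  AT (OFFSET => -3600);  -- the table's state one hour ago
```

The clone shares all existing micro-partitions with the source; new partitions are created only as either side changes.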
Time Travel: restore/query data as of a past point in time; retention is 1 day by default, configurable up to 90 days (Enterprise Edition and above)
Fail-safe: an additional 7 days after Time Travel expires; recoverable only by Snowflake Support
Know the difference.
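Time Travel is user-accessible through SQL, which is the key contrast with Fail-safe. A sketch with assumed names:

```sql
-- Query a table as it was 30 minutes ago
SELECT * FROM orders AT (OFFSET => -1800);

-- Query as of a specific timestamp
SELECT * FROM orders
  AT (TIMESTAMP => '2024-06-01 09:00:00'::TIMESTAMP_LTZ);

-- Restore a dropped table within the retention period
UNDROP TABLE orders;
```

Fail-safe has no SQL interface at all: those 7 days are reachable only by contacting Snowflake Support.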
Secure Data Sharing: live data sharing without copying
Replication: cross-account, cross-region, cross-cloud
Failover/Failback: DR capability
Snowflake Marketplace
Snowgrid for cross-cloud interoperability
Integrations with ETL, BI, ML tools
Snowgrid is Snowflake’s global control and coordination layer that operates across all supported cloud providers and regions. It is one of the least understood but most important architectural components of the platform.
Snowgrid provides the metadata, governance, and orchestration backbone that enables Snowflake to function as a unified AI Data Cloud despite running on different clouds and regions.
Key capabilities include:
Global metadata orchestration
Cross-region and cross-cloud replication management
Governance consistency across all regions
Support for global services such as data sharing, Marketplace distribution, and application deployment
Without Snowgrid, Snowflake would behave like isolated deployments in each cloud. Snowgrid ensures:
Consistent semantics and APIs across AWS, Azure, and GCP
Interoperability for global organizations
The ability to seamlessly replicate data across clouds
Centralized policy enforcement and governance
Snowgrid is the foundation for:
Cross-cloud data sharing
Cross-region database and share replication
Failover and failback management
Snowflake Marketplace global distribution
Native Application Framework app deployment
For the SnowPro exam, it is essential to know that Snowgrid is the underlying layer that makes Snowflake a true multi-cloud unified platform.
Snowflake implements a modern transaction system that guarantees full ACID compliance without using locking mechanisms typical in traditional databases.
Snowflake guarantees:
Atomicity
Consistency
Isolation
Durability
All transactions operate on consistent snapshots of data.
Snowflake’s concurrency model relies on MVCC, which allows:
Multiple readers and writers to operate concurrently
Readers to see a consistent snapshot of data without being blocked by writers
Writers to generate new versions of micro-partitions
Snowflake does not use:
Row locks
Table locks
Page locks
Instead, updates create new micro-partitions, and queries read the correct version based on transaction timestamps.
MVCC enables:
High concurrency for analytics workloads
Isolation without blocking
Support for Time Travel by retaining old versions of partitions
Fast cloning using metadata pointers
Understanding MVCC is essential for interpreting Snowflake’s performance and behavior under concurrent workloads.
External Tables allow Snowflake to query data stored in external cloud storage without loading it into internal Snowflake-managed storage.
External Tables are used primarily in cloud data lake architectures where data remains in:
Amazon S3
Azure Data Lake Storage (ADLS)
Google Cloud Storage (GCS)
External Tables rely on:
External file metadata stored in Snowflake
A metadata cache for file characteristics
The external storage location for actual file content
External Tables do not automatically detect new files unless specifically refreshed:
ALTER EXTERNAL TABLE my_table REFRESH;
This updates Snowflake’s metadata cache to recognize new or removed data files.
Common use cases:
Querying data lakes without ingesting data
Blending data lake and warehouse architectures
Gradual migration to Snowflake from legacy lake architectures
Combining data from internal and external tables
External Tables support both structured and semi-structured data.
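A sketch of an External Table over a data lake, with hypothetical bucket, stage, and integration names:

```sql
-- External stage pointing at a data lake location
CREATE STAGE lake_stage
  URL = 's3://my-bucket/events/'
  STORAGE_INTEGRATION = my_s3_int;

-- External table over Parquet files; data stays in S3
-- (VALUE is the implicit VARIANT holding each row's content)
CREATE EXTERNAL TABLE ext_events (
  event_date DATE AS (value:event_date::DATE)
)
LOCATION = @lake_stage
FILE_FORMAT = (TYPE = PARQUET)
AUTO_REFRESH = FALSE;

-- Pick up newly added or removed files
ALTER EXTERNAL TABLE ext_events REFRESH;
```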
Snowflake provides support for Apache Iceberg, a high-performance table format widely used in modern data lake and lakehouse systems.
Snowflake supports two operational modes:
Externally managed Iceberg tables (external catalog):
Snowflake reads Iceberg metadata maintained outside of Snowflake
Data remains in external storage
Snowflake acts as a query engine without managing the table lifecycle
Snowflake-managed Iceberg tables:
Iceberg metadata and table lifecycle are fully managed by Snowflake
Data is stored in customer cloud storage
Provides consistent performance and Snowflake-level governance
Iceberg tables allow Snowflake to:
Interoperate with lakehouse ecosystems like Delta Lake and Hudi
Serve as a central data access layer for existing data lakes
Offer ACID-compliant operations on open formats
Iceberg support enables Snowflake to operate seamlessly in mixed architectures.
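A sketch of the Snowflake-managed mode; the external volume name is illustrative and must be configured beforehand:

```sql
-- Snowflake-managed Iceberg table; data files live in customer
-- cloud storage referenced by the external volume
CREATE ICEBERG TABLE sales_iceberg (
  id     NUMBER,
  amount NUMBER(10,2)
)
CATALOG = 'SNOWFLAKE'
EXTERNAL_VOLUME = 'my_ext_volume'
BASE_LOCATION = 'sales/';
```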
Materialized Views (MVs) are stored query results that are automatically refreshed by Snowflake.
Snowflake maintains MVs by:
Tracking changes at the micro-partition level
Incrementally updating the MV as source data changes
Storing precomputed results for fast access
Benefits:
Significantly faster query performance
Ideal for dashboards
Reduced compute for frequently repeated queries
Costs:
MVs consume storage for materialized results
Maintenance consumes compute credits
MVs have limitations:
Cannot reference another MV
Must reference a single base table
Limited support for complex SQL constructs
Understanding MV limitations is important for exam questions about architecture and cost.
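A sketch that stays within the limitations above — one base table, a simple aggregate (object names assumed; MVs require Enterprise Edition or above):

```sql
-- MV over a single base table; Snowflake maintains it
-- incrementally as micro-partitions in orders change
CREATE MATERIALIZED VIEW daily_revenue AS
SELECT order_date, SUM(amount) AS revenue
FROM orders
GROUP BY order_date;
```

Queries against `daily_revenue` read the precomputed results rather than re-aggregating the base table.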
Search Optimization Service (SOS) accelerates highly selective queries that would otherwise require scanning large numbers of micro-partitions.
SOS builds additional persistent search structures to enable faster evaluation of:
Equality predicates
IN list queries
Highly selective filters
Certain semi-structured search conditions
Improves performance without traditional indexing
Fully managed by Snowflake
Adds both compute and storage cost
Does not replace clustering for range-based pruning
Appropriate for:
High-selectivity lookups
Large tables frequently queried on low-cardinality filters
Text searches or semi-structured field lookups
Not appropriate for:
Range queries (improved by clustering keys instead)
Full-scan analytical queries
Search Optimization is a powerful but optional performance service.
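Enabling SOS is a table-level property rather than a query-level hint. A sketch with assumed names:

```sql
-- Enable search optimization for the whole table
ALTER TABLE events ADD SEARCH OPTIMIZATION;

-- Or target a specific column and access method
ALTER TABLE events ADD SEARCH OPTIMIZATION ON EQUALITY(user_id);
```

Once built, the search structures are used automatically by the optimizer for qualifying point-lookup queries; no query changes are needed.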
What are the three main layers of Snowflake architecture and their responsibilities?
The three layers are Database Storage, Compute (Virtual Warehouses), and Cloud Services. Storage holds data in compressed micro-partitions. Compute executes queries independently using virtual warehouses. Cloud Services manages metadata, authentication, and query optimization.
Snowflake separates storage and compute, enabling independent scaling. Storage is centralized and persistent, while compute clusters are transient and scalable. Cloud Services acts as the coordination layer, handling query parsing and access control. A common mistake is assuming compute stores data—it does not.
Demand Score: 82
Exam Relevance Score: 90
How do virtual warehouses scale in Snowflake?
Virtual warehouses scale either by resizing (increasing compute size) or by enabling multi-cluster mode to handle concurrency. Resizing increases resources per query, while multi-cluster adds parallel clusters for concurrent workloads.
Scaling up improves query performance, while scaling out handles multiple users. Snowflake allows auto-suspend and auto-resume to optimize cost. A common misunderstanding is using larger warehouses for concurrency instead of multi-cluster scaling.
Demand Score: 80
Exam Relevance Score: 88
What role does the Cloud Services layer play in query execution?
The Cloud Services layer handles query parsing, optimization, metadata management, and access control before sending execution tasks to the compute layer.
It acts as the brain of Snowflake, coordinating all operations. It does not execute queries itself but determines execution plans. A common mistake is assuming it consumes user credits—it generally does not for most operations.
Demand Score: 76
Exam Relevance Score: 85
How does Snowflake separate storage and compute, and why is it important?
Snowflake stores data in centralized storage while compute resources (virtual warehouses) are independent. This allows scaling compute without affecting storage and vice versa.
This separation enables concurrency, cost control, and performance tuning. Users can run multiple warehouses on the same data simultaneously. A common mistake is assuming scaling storage affects performance—it does not directly.
Demand Score: 79
Exam Relevance Score: 90