June 25, 2020

Spark-Snowflake Connector: In-Depth Analysis of Internal Mechanisms

Examine the internal workings of the Spark-Snowflake Connector with a clear breakdown of how the connector integrates Apache Spark with Snowflake for enhanced data processing capabilities. Gain insights into its architecture, key components, and techniques for optimizing performance during large-scale data operations.

Introduction

The Snowflake Connector for Spark enables the use of Snowflake as a Spark data source, similar to other data sources like PostgreSQL, HDFS, S3, etc. Though most data engineers use Snowflake, what happens internally is a mystery to many. Yet only by understanding the underlying architecture and how it functions can one fine-tune performance and troubleshoot issues that might arise. This blog therefore aims to explain in detail the internal architecture and functioning of the Snowflake Connector.

Before getting into the details, let us understand what happens when one does not use the Spark-Snowflake Connector.

Data Loading to Snowflake without the Spark-Snowflake Connector

Create a staging area -> Load local files to the staging area in the cloud -> Create a file format -> Load into Snowflake from the staging area using that file format

Loading Data from Local to Snowflake
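For illustration, this manual flow can be scripted with the snowflake-connector-python library. The sketch below is only a rough outline of those steps; every account value, path, stage, file format, and table name in it is a hypothetical placeholder.

```python
# A hedged sketch of the manual flow: stage -> upload -> file format -> COPY INTO.
# All names, paths, and credentials below are hypothetical placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="my_wh",
    database="my_db",
    schema="PUBLIC",
)
cur = conn.cursor()

# 1. Create the staging area.
cur.execute("CREATE STAGE IF NOT EXISTS my_stage")

# 2. Upload local files to the staging area in the cloud.
cur.execute("PUT file:///tmp/employees.csv @my_stage AUTO_COMPRESS=TRUE")

# 3. Create the file format.
cur.execute(
    "CREATE FILE FORMAT IF NOT EXISTS my_csv_format "
    "TYPE = 'CSV' FIELD_DELIMITER = ',' SKIP_HEADER = 1"
)

# 4. Load into Snowflake from the staging area using that file format.
cur.execute(
    "COPY INTO employees FROM @my_stage "
    "FILE_FORMAT = (FORMAT_NAME = 'my_csv_format')"
)

cur.close()
conn.close()
```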

Data Loading using Spark-Snowflake Connector

When we use the Spark-Snowflake connector to load data into Snowflake, a lot happens behind the scenes. The connector takes care of all the heavy lifting.

Spark Snowflake Connector

Spark Snowflake Connector (Source: https://docs.snowflake.net/manuals/user-guide/spark-connector-overview.html#interaction-between-snowflake-and-spark)

This blog illustrates one such example where the Spark-Snowflake Connector is used to read and write data in Databricks. Databricks has integrated the Snowflake Connector for Spark into the Databricks Unified Analytics Platform to provide native connectivity between Spark and Snowflake.

The Snowflake Spark Connector supports two transfer modes: Internal (a temporary location managed automatically by Snowflake) and External (a temporary location for data transfer managed by the user). Here is a brief description of the two modes:

Internal Data Transfer

This type of data transfer is facilitated through a Snowflake internal stage that is automatically created and managed by the Snowflake Connector.

External Data Transfer

External data transfer, on the other hand, is facilitated through a storage location that the user specifies. The storage location must be created and configured as part of the Spark connector installation/configuration.

Further, the files created by the connector during external transfer are intended to be temporary, but the connector does not automatically delete these files from the storage location.
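For context, switching to external transfer is typically just a matter of pointing the connector at a user-managed location. The snippet below is a hedged sketch: the bucket path and credentials are hypothetical placeholders, and the option names follow the connector's documented AWS external-transfer options.

```python
# Hedged sketch: extra options that put the connector into EXTERNAL transfer mode
# by supplying a user-managed S3 location. Without these options the connector
# defaults to an internal stage that it creates and manages itself.
external_transfer_options = {
    "tempDir": "s3a://my-bucket/spark-snowflake-temp/",  # user-managed temp location
    "awsAccessKey": "my_access_key",                     # hypothetical credentials
    "awsSecretKey": "my_secret_key",
}
# These are merged with the regular connection options (shown later in this post)
# when calling .options(...) on the Snowflake reader or writer.
```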

Use Cases

Below are the use cases we are going to run on Spark to see how the Spark-Snowflake connector works internally:

1. Initial Loading from Spark to Snowflake

2. Loading the same Snowflake table in Overwrite mode

3. Loading the same Snowflake table in Append mode

4. Reading the Snowflake table

Snowflake Connection Parameters

Snowflake Connection Parameters
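In PySpark, these parameters are typically supplied as an options dictionary along the lines of the minimal sketch below; every value is a hypothetical placeholder. The sf_options name is reused in the later sketches in this post.

```python
# Typical connection options for the Spark-Snowflake connector.
# Every value below is a hypothetical placeholder.
sf_options = {
    "sfURL": "myaccount.snowflakecomputing.com",
    "sfUser": "my_user",
    "sfPassword": "my_password",
    "sfDatabase": "my_db",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "my_wh",
    "sfRole": "my_role",
}
# In Databricks the connector is registered under the short source name
# "snowflake"; outside Databricks the full source name
# "net.snowflake.spark.snowflake" is passed to .format(...).
```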

1. Initial Loading from Spark to Snowflake

When a new table is loaded for the very first time from Spark to Snowflake, the following command is run on Spark. This command, in turn, executes a set of SQL queries in Snowflake through the connector.

spark snowflake overwrite mode
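For context, the screenshot above shows the actual command used; expressed in PySpark, such an initial write looks roughly like the hedged sketch below, where the sample data and the EMPLOYEES table name are hypothetical and sf_options is the connection dictionary sketched earlier.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("snowflake-initial-load").getOrCreate()

# Hypothetical sample data for the initial load.
df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["ID", "NAME"])

# Initial load from Spark to Snowflake in overwrite mode.
(
    df.write
    .format("snowflake")
    .options(**sf_options)
    # Optional: override the session time zone (see the note on sfTimezone below).
    .option("sfTimezone", "snowflake")
    .option("dbtable", "EMPLOYEES")
    .mode("overwrite")
    .save()
)
```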

The single Spark command above triggers the following 9 SQL queries in Snowflake.

Snowflake Initial Load Query History

Snowflake Initial Load Query History

i) Spark, by default, uses the local time zone. This can be changed with the sfTimezone option in the connector, as illustrated in the earlier sketch.

Initial Load Query 1

ii) The query below creates a temporary internal stage in Snowflake. Alternatively, a storage location on another cloud provider can be configured in Spark.

Initial Load Query 2

iii) The GET command downloads data files from a Snowflake stage to a local directory/folder on the client machine. Metadata checks happen at this stage.

Initial Load Query 3

iv) The PUT command uploads (i.e., stages) data files from a local directory/folder on the client machine to a Snowflake stage.

Initial Load Query 4

v) The DESC command fails because the table did not exist previously, but this is handled internally by the Snowflake connector; it does not throw any error in the Spark job.

Initial Load Query 5

vi) The IDENTIFIER keyword is used to identify objects by name, using string literals, session variables, or bind variables.

Initial Load Query 9

vii) The command below loads the data into a temporary table in Snowflake to maintain consistency. By doing so, Spark follows a write-all-or-nothing approach.

Initial Load Query 6

viii) The DESC command below also fails because the table did not exist previously, but this is handled internally by the Snowflake connector and does not stop the process. The metadata check at this point determines whether to use RENAME TO or SWAP WITH.

Initial Load Query 7

ix) The RENAME TO command renames the specified table with a new identifier that is not currently used by any other table in the schema. If the table is already present, SWAP WITH is used instead, and the intermediate table is then dropped.

Initial Load Query 8

2. Loading the same Snowflake table in Overwrite mode

Overwrite Mode

The above Spark command triggers the following 10 SQL queries in Snowflake. This time there is no DESC failure when the overwrite command is run a second time, because the table already exists.

Overwrite Mode Query History

Overwrite Mode Query 1

Overwrite Mode Query 2

i) Metadata checks are performed at the internal stages.

Overwrite Mode Query 3

Overwrite Mode Query 4

Overwrite Mode Query 5

ii) The SWAP WITH command swaps all content and metadata between two specified tables, including any integrity constraints defined for the tables. It also swaps all access control privilege grants. The two tables are essentially renamed in a single transaction.

The RENAME TO command is used when the target table is not present, because it is faster than swapping and then dropping the intermediate table. Since RENAME TO can only be used when the table does not already exist in Snowflake, it is only performed during the initial load.

Overwrite Mode Query 6

iii) The DROP command drops the intermediate staging table.

Overwrite Mode Query 7

3. Loading the same Snowflake table in Append mode

Append Mode

The above Spark command triggers the following 7 SQL queries in Snowflake.
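On the Spark side, the only change from the earlier write is the save mode; below is a hedged sketch using the same hypothetical names as before.

```python
# Append new rows to the existing Snowflake table instead of replacing it.
# df, sf_options, and EMPLOYEES are the same hypothetical names used earlier.
(
    df.write
    .format("snowflake")
    .options(**sf_options)
    .option("dbtable", "EMPLOYEES")
    .mode("append")
    .save()
)
```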

Append Mode Query History

Append Mode Query 1

Append Mode Query 2

Append Mode Query 3

Append Mode Query 4

Append Mode Query 5

Note: In OVERWRITE mode, the data is copied into the intermediate staging table, whereas in APPEND mode, the data is loaded directly into the actual table in Snowflake.

Append Mode Query 6

i) To maintain ACID compliance, this mode wraps all the statements inside BEGIN and COMMIT. If anything goes wrong, it issues a ROLLBACK so that the previous state of the table is left untouched.

Append Mode Query 7

4. Reading the Snowflake Table

Read Mode

The above Spark command triggers only the following SQL query in Snowflake. This is because Spark follows a lazy execution model: until an action is performed, it does not read the actual data, and it internally maintains lineage to process the data efficiently. The query below simply checks whether the table is present and retrieves only its schema. (A consolidated PySpark sketch of the read path appears at the end of this section.)

Read Mode Query 1

The Spark action below triggers the following 5 SQL queries in Snowflake.

Read Mode Query 2

i) First, it creates a temporary internal stage to hold the data read from Snowflake.

Read Mode Query 3

ii) Next, it downloads data files from the Snowflake internal stage to a local directory/folder on a client machine.

Read Mode Query 4

iii) The default timestamp data type mapping is TIMESTAMP_NTZ (no time zone), so you must explicitly set the TIMESTAMP_TYPE_MAPPING parameter to use TIMESTAMP_LTZ; one way to do this is shown in the sketch at the end of this section.

Read Mode Query 5

iv) The data is then copied from Snowflake to the internal stage.

Read Mode Query 6

v) Finally, it downloads data files from the Snowflake internal stage to a local directory/folder on a client machine.

Read Mode Query 7
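Putting the read path together, here is a minimal hedged PySpark sketch reusing the hypothetical names from earlier. Setting TIMESTAMP_TYPE_MAPPING through the connector's preactions option is my assumption of one way to apply the note in step iii; it is not shown in the original queries.

```python
# Hedged sketch of the read path, with the same hypothetical names as before.
df_read = (
    spark.read
    .format("snowflake")
    .options(**sf_options)
    .option("dbtable", "EMPLOYEES")
    # Assumed way to map timestamps to TIMESTAMP_LTZ instead of the default
    # TIMESTAMP_NTZ (see step iii above).
    .option("preactions",
            "ALTER SESSION SET TIMESTAMP_TYPE_MAPPING = 'TIMESTAMP_LTZ'")
    .load()
)

# Because of lazy execution, only the schema check has run so far.
df_read.printSchema()

# Triggering an action is what causes the stage creation, the COPY INTO the
# stage, and the GET of the result files described above.
print(df_read.count())
```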

Wrapping Up

The Spark-Snowflake connector comes with many benefits, such as query pushdown and column mapping. It acts as an abstraction layer and does a lot of the groundwork in the back end.

Happy Learning!!
