June 25, 2020

Spark-Snowflake Connector: In-Depth Analysis of Internal Mechanisms

Examine the internal workings of the Spark-Snowflake Connector with a clear breakdown of how the connector integrates Apache Spark with Snowflake for enhanced data processing capabilities. Gain insights into its architecture, key components, and techniques for optimizing performance during large-scale data operations.

Introduction

The Snowflake Connector for Spark enables the use of Snowflake as a Spark data source, similar to other data sources like PostgreSQL, HDFS, S3, etc. Though most data engineers use Snowflake, what happens internally is a mystery to many. Yet only by understanding the underlying architecture and how it functions can one fine-tune performance and troubleshoot issues that might arise. This blog therefore aims to explain in detail the internal architecture and functioning of the Snowflake Connector.

Before getting into the details, let us understand what happens when one does not use the Spark-Snowflake Connector.

Data Loading to Snowflake without the Spark-Snowflake Connector

Create a staging area -> Load local files to the staging area in the cloud -> Create a file format -> Load into Snowflake from the staging area using that file format

Loading Data from Local to Snowflake
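For illustration, this manual flow can be scripted with the snowflake-connector-python library. The sketch below is only a rough outline of those steps; every account value, path, stage, file format, and table name in it is a hypothetical placeholder.

```python
# A hedged sketch of the manual flow: stage -> upload -> file format -> COPY INTO.
# All names, paths, and credentials below are hypothetical placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="my_wh",
    database="my_db",
    schema="PUBLIC",
)
cur = conn.cursor()

# 1. Create the staging area.
cur.execute("CREATE STAGE IF NOT EXISTS my_stage")

# 2. Upload local files to the staging area in the cloud.
cur.execute("PUT file:///tmp/employees.csv @my_stage AUTO_COMPRESS=TRUE")

# 3. Create the file format.
cur.execute(
    "CREATE FILE FORMAT IF NOT EXISTS my_csv_format "
    "TYPE = 'CSV' FIELD_DELIMITER = ',' SKIP_HEADER = 1"
)

# 4. Load into Snowflake from the staging area using that file format.
cur.execute(
    "COPY INTO employees FROM @my_stage "
    "FILE_FORMAT = (FORMAT_NAME = 'my_csv_format')"
)

cur.close()
conn.close()
```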

Data Loading using Spark-Snowflake Connector

When we use the Spark-Snowflake connector to load data into Snowflake, a lot happens behind the scenes. The connector takes care of all the heavy lifting.

Spark Snowflake Connector

Spark Snowflake Connector (Source: https://docs.snowflake.net/manuals/user-guide/spark-connector-overview.html#interaction-between-snowflake-and-spark)

This blog illustrates one such example where the Spark-Snowflake Connector is used to read and write data in Databricks. Databricks has integrated the Snowflake Connector for Spark into the Databricks Unified Analytics Platform to provide native connectivity between Spark and Snowflake.

The Snowflake Spark Connector supports two transfer modes: Internal (a temporary location managed automatically by Snowflake) and External (a temporary location for data transfer managed by the user). Here is a brief description of the two modes:

Internal Data Transfer

This type of data transfer is facilitated through a Snowflake internal stage that is automatically created and managed by the Snowflake Connector.

External Data Transfer

External data transfer, on the other hand, is facilitated through a storage location that the user specifies. The storage location must be created and configured as part of the Spark connector installation/configuration.

Further, the files created by the connector during external transfer are intended to be temporary, but the connector does not automatically delete these files from the storage location.
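For context, switching to external transfer is typically just a matter of pointing the connector at a user-managed location. The snippet below is a hedged sketch: the bucket path and credentials are hypothetical placeholders, and the option names follow the connector's documented AWS external-transfer options.

```python
# Hedged sketch: extra options that put the connector into EXTERNAL transfer mode
# by supplying a user-managed S3 location. Without these options the connector
# defaults to an internal stage that it creates and manages itself.
external_transfer_options = {
    "tempDir": "s3a://my-bucket/spark-snowflake-temp/",  # user-managed temp location
    "awsAccessKey": "my_access_key",                     # hypothetical credentials
    "awsSecretKey": "my_secret_key",
}
# These are merged with the regular connection options (shown later in this post)
# when calling .options(...) on the Snowflake reader or writer.
```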

Use Cases

Below are the use cases we are going to run on Spark to see how the Spark-Snowflake connector works internally:

1. Initial Loading from Spark to Snowflake

2. Loading the same Snowflake table in Overwrite mode

3. Loading the same Snowflake table in Append mode

4. Reading the Snowflake table

Snowflake Connection Parameters

Snowflake Connection Parameters
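In PySpark, these parameters are typically supplied as an options dictionary along the lines of the minimal sketch below; every value is a hypothetical placeholder. The sf_options name is reused in the later sketches in this post.

```python
# Typical connection options for the Spark-Snowflake connector.
# Every value below is a hypothetical placeholder.
sf_options = {
    "sfURL": "myaccount.snowflakecomputing.com",
    "sfUser": "my_user",
    "sfPassword": "my_password",
    "sfDatabase": "my_db",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "my_wh",
    "sfRole": "my_role",
}
# In Databricks the connector is registered under the short source name
# "snowflake"; outside Databricks the full source name
# "net.snowflake.spark.snowflake" is passed to .format(...).
```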

1. Initial Loading from Spark to Snowflake

When a new table is loaded for the very first time from Spark to Snowflake, the following command is run on Spark. This command, in turn, executes a set of SQL queries in Snowflake through the connector.

spark snowflake overwrite mode
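For context, the screenshot above shows the actual command used; expressed in PySpark, such an initial write looks roughly like the hedged sketch below, where the sample data and the EMPLOYEES table name are hypothetical and sf_options is the connection dictionary sketched earlier.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("snowflake-initial-load").getOrCreate()

# Hypothetical sample data for the initial load.
df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["ID", "NAME"])

# Initial load from Spark to Snowflake in overwrite mode.
(
    df.write
    .format("snowflake")
    .options(**sf_options)
    # Optional: override the session time zone (see the note on sfTimezone below).
    .option("sfTimezone", "snowflake")
    .option("dbtable", "EMPLOYEES")
    .mode("overwrite")
    .save()
)
```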

The single Spark command above triggers the following 9 SQL queries in Snowflake.

Snowflake Initial Load Query History

Snowflake Initial Load Query History

i) Spark, by default, uses the local time zone. This can be changed with the sfTimezone option in the connector, as illustrated in the earlier sketch.

Initial Load Query 1

ii) The query below creates a temporary internal stage in Snowflake. Alternatively, a storage location on another cloud provider can be configured in Spark.

Initial Load Query 2

iii) The GET command downloads data files from a Snowflake stage to a local directory/folder on the client machine. Metadata checks happen at this stage.

Initial Load Query 3

iv) The PUT command uploads (i.e., stages) data files from a local directory/folder on the client machine to a Snowflake stage.

Initial Load Query 4

v) The DESC command fails because the table did not exist previously, but this is handled internally by the Snowflake connector; it does not throw any error in the Spark job.

Initial Load Query 5

vi) The IDENTIFIER keyword is used to identify objects by name, using string literals, session variables, or bind variables.

Initial Load Query 9

vii) The command below loads the data into a temporary table in Snowflake to maintain consistency. By doing so, Spark follows a write-all-or-nothing approach.

Initial Load Query 6

viii) The DESC command below also fails because the table did not exist previously, but this is handled internally by the Snowflake connector and does not stop the process. The metadata check at this point determines whether to use RENAME TO or SWAP WITH.

Initial Load Query 7

ix) The RENAME TO command renames the specified table with a new identifier that is not currently used by any other table in the schema. If the table is already present, SWAP WITH is used instead, and the intermediate table is then dropped.

Initial Load Query 8

2. Loading the same Snowflake table in Overwrite mode

Overwrite Mode

The above Spark command triggers the following 10 SQL queries in Snowflake. This time there is no DESC failure when the overwrite command is run a second time, because the table already exists.

Overwrite Mode Query History

Overwrite Mode Query 1

Overwrite Mode Query 2

i) Metadata checks are performed at the internal stages.

Overwrite Mode Query 3

Overwrite Mode Query 4

Overwrite Mode Query 5

ii) The SWAP WITH command swaps all content and metadata between two specified tables, including any integrity constraints defined for the tables. It also swaps all access control privilege grants. The two tables are essentially renamed in a single transaction.

The RENAME TO command is used when the target table is not present, because it is faster than swapping and then dropping the intermediate table. Since RENAME TO can only be used when the table does not already exist in Snowflake, it is only performed during the initial load.

Overwrite Mode Query 6

iii) The DROP command drops the intermediate staging table.

Overwrite Mode Query 7

3. Loading the same Snowflake table in Append mode

Append Mode

The above Spark command triggers the following 7 SQL queries in Snowflake.
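On the Spark side, the only change from the earlier write is the save mode; below is a hedged sketch using the same hypothetical names as before.

```python
# Append new rows to the existing Snowflake table instead of replacing it.
# df, sf_options, and EMPLOYEES are the same hypothetical names used earlier.
(
    df.write
    .format("snowflake")
    .options(**sf_options)
    .option("dbtable", "EMPLOYEES")
    .mode("append")
    .save()
)
```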

Append Mode Query History

Append Mode Query 1

Append Mode Query 2

Append Mode Query 3

Append Mode Query 4

Append Mode Query 5

Note: In OVERWRITE mode, the data is copied into the intermediate staging table, whereas in APPEND mode, the data is loaded directly into the actual table in Snowflake.

Append Mode Query 6

i) To maintain ACID compliance, this mode wraps all the statements inside BEGIN and COMMIT. If anything goes wrong, it issues a ROLLBACK so that the previous state of the table is left untouched.

Append Mode Query 7

4. Reading the Snowflake Table

Read Mode

The above Spark command triggers only the following SQL query in Snowflake. This is because Spark follows a lazy execution model: until an action is performed, it does not read the actual data, and it internally maintains lineage to process the data efficiently. The query below simply checks whether the table is present and retrieves only its schema. (A consolidated PySpark sketch of the read path appears at the end of this section.)

Read Mode Query 1

The Spark action below triggers the following 5 SQL queries in Snowflake.

Read Mode Query 2

i) First, it creates a temporary internal stage to hold the data read from Snowflake.

Read Mode Query 3

ii) Next, it downloads data files from the Snowflake internal stage to a local directory/folder on a client machine.

Read Mode Query 4

iii) The default timestamp data type mapping is TIMESTAMP_NTZ (no time zone), so you must explicitly set the TIMESTAMP_TYPE_MAPPING parameter to use TIMESTAMP_LTZ; one way to do this is shown in the sketch at the end of this section.

Read Mode Query 5

iv) The data is then copied from Snowflake to the internal stage.

Read Mode Query 6

v) Finally, it downloads data files from the Snowflake internal stage to a local directory/folder on a client machine.

Read Mode Query 7
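Putting the read path together, here is a minimal hedged PySpark sketch reusing the hypothetical names from earlier. Setting TIMESTAMP_TYPE_MAPPING through the connector's preactions option is my assumption of one way to apply the note in step iii; it is not shown in the original queries.

```python
# Hedged sketch of the read path, with the same hypothetical names as before.
df_read = (
    spark.read
    .format("snowflake")
    .options(**sf_options)
    .option("dbtable", "EMPLOYEES")
    # Assumed way to map timestamps to TIMESTAMP_LTZ instead of the default
    # TIMESTAMP_NTZ (see step iii above).
    .option("preactions",
            "ALTER SESSION SET TIMESTAMP_TYPE_MAPPING = 'TIMESTAMP_LTZ'")
    .load()
)

# Because of lazy execution, only the schema check has run so far.
df_read.printSchema()

# Triggering an action is what causes the stage creation, the COPY INTO the
# stage, and the GET of the result files described above.
print(df_read.count())
```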

Wrapping Up

The Spark-Snowflake connector comes with many benefits, such as query pushdown and column mapping. It acts as an abstraction layer and does a lot of the groundwork in the back end.

Happy Learning!!
