The Internal Workings of Spark-Snowflake Connector

Authors: Karthikeyan Siva Baskaran

Introduction

The Snowflake Connector for Spark enables using Snowflake as a Spark data source – similar to other data sources like PostgreSQL, HDFS, S3, etc. Though most data engineers use Snowflake, what happens internally is a mystery to many. But only if one understands the underlying architecture and its functioning, can they figure out how to fine-tune their performance and troubleshoot issues that might arise. This blog thus aims at explaining in detail the internal architecture and functioning of the Snowflake Connector.

Before getting into the details, let us understand what happens when one does not use the Spark-Snowflake Connector.

Data Loading to Snowflake without Spark- Snowflake Connector

Create Staging Area -> Load local files -> Staging area in cloud -> Create file format -> Load to Snowflake from staging area using the respective file format

Loading Data from Local to Snowflake

 

Data Loading using Spark-Snowflake Connector

When we use the Spark Snowflake connector to load the data into Snowflake, it does a lot of things that are abstract to us. The connector takes care of all the heavy lifting tasks.

This blog illustrates one such example where the Spark-Snowflake Connector is used to read and write data in databricks. Databricks has integrated the Snowflake Connector for Spark into the Databricks Unified Analytics Platform to provide native connectivity between Spark and Snowflake.
The Snowflake Spark Connector supports Internal (temp location managed by Snowflake automatically) and External (temp location for data transfer managed by user) transfer modes. Here is a brief description of the two modes of transfer-

Internal Data Transfer

This type of data transfer is facilitated through a Snowflake internal stage that is automatically created and managed by the Snowflake Connector.

External Data Transfer

External data transfer, on the other hand, is facilitated through a storage location that the user specifies. The storage location must be created and configured as part of the Spark connector installation/configuration.

Further, the files created by the connector during external transfer are intended to be temporary, but the connector does not automatically delete these files from the storage location. This type of data transfer is facilitated through a Snowflake internal stage that is automatically created and managed by the Snowflake Connector.

Use Cases

Below are the use cases we are going to run on Spark and see how the Spark Snowflake connector works internally-
1.  Initial Loading from Spark to Snowflake
2. Loading the same Snowflake table in Overwrite mode
3. Loading the same Snowflake table in Append mode
4. Read the Snowflake table

Snowflake Connection Parameters

Snowflake Connection Parameters

1. Initial Loading from Spark to Snowflake

When a new table is loaded for the very first time from Spark to Snowflake, the following command will be running on Spark. This command, in turn, starts to execute a set of SQL queries in Snowflake using the connector.

spark snowflake overwrite mode
The single Spark command above triggers the following 9 SQL queries in Snowflake

Snowflake Initial Load Query History

Snowflake Initial Load Query History

 

i) Spark, by default, uses the local time zone. This can be changed by using the sfTimezone option in the connector

ii) The below query creates a temporary internal stage in Snowflake. We can use other cloud providers, which we can configure in Spark.

iii) The GET command downloads data files from one of the following Snowflake stages to a local directory/folder on a client machine. We have metadata checks at this stage.


iv) The PUT command uploads (i.e., stages) data files from a local directory/folder onto a client machine to one of the Snowflake stages

v) The DESC command failed as the table did not exist previously, but this is now taken care of by the Snowflake connector internally. It won’t throw any error in the Spark Job

vi) The IDENTIFIER keyword is used to identify objects by name, using string literals, session variables, or bind variables.

vii) The command below loads data into Snowflake’s temporary table to maintain consistency. By doing so, Spark follows Write All or Write Nothing architecture.

viii) The DESC command below failed as the table did not exist previously, but this is now taken care of by the Snowflake connector internally. It didn’t stop the process. The metadata check at this point is to see whether to use RENAME TO or SWAP WITH


ix) The RENAME TO command renames the specified table with a new identifier that is not currently used by any other table in the schema. If the table is already present, we have to use SWAP WITH and then drop the identifier

 

2. Loading the same Snowflake table in Overwrite mode

The above Spark command triggers the following 10 SQL queries in Snowflake. This time there is no failure when we ran the overwrite command a second time because this time, the table already exists.

i) We have metadata checks at the internal stages

ii) The SWAP WITH command swaps all content and metadata between two specified tables, including any integrity constraints defined for the tables. It also swaps all access control privilege grants. The two tables are essentially renamed in a single transaction.
The RENAME TO command is used when the table is not present because it is faster than renaming and dropping the intermediate table. But this can only be used when the table does not exist in Snowflake. This means that RENAME TO is only performed during the Initial Load.

iii) The DROP command drops the intermediate staging table

 

3. Loading the same Spark table in Append mode

The above Spark command triggers the following 7 SQL queries in Snowflake.

Note: When we use OVERWRITE mode, the data is copied into the intermediate staged table, but during APPEND, the data is loaded into the actual table in Snowflake.


i) In order to maintain the ACID compliance, this mode uses all the transactions inside the BEGIN and COMMIT. If anything goes wrong, it uses ROLLBACK so that the previous state of the table is untouched.

4. Reading the Snowflake Table

The above Spark command triggers the following SQL query in Snowflake. The reason for this is that Spark follows the Lazy Execution pattern. So until an action is performed, it will not read the actual data. Spark internally maintains lineage to process it effectively. The following query is to check whether the table is present or not and to retrieve only the schema of the table.

The Spark action below trigger 5 SQL queries in Snowflake


i) First, it creates a temporary internal stage to load the read data from Snowflake.


ii) Next, it downloads data files from the Snowflake internal stage to a local directory/folder on a client machine.

iii) The default timestamp data type mapping is TIMESTAMP_NTZ (no time zone), so you must explicitly set the TIMESTAMP_TYPE_MAPPING parameter to use TIMESTAMP_LTZ.


iv) The data is then copied from Snowflake to the internal stage.

v) Finally, it downloads data files from the Snowflake internal stage to a local directory/folder on a client machine.

Wrapping Up

Spark Snowflake connector comes with lots of benefits like Query push down, column mapping, etc. This acts as an abstract layer and does a lot of groundwork in the back end.

Happy Learning!!

Tags:
2 Comments
  1. Sonal 2 months ago

    How can the schema conversion take place if we need to migrate Teradata database to Snowflake using Spark Snowflake connector

  2. Karthikeyan 2 months ago

    Usually, if we need to migrate data between two different database, datatype mapping is a hard thing. This datatype mapping is automatically handled by Spark Snowflake connector when loading the data to Snowflake. Once Spark reads the data from Teradata, the datatype will be inferred from the data. Based on the data inference, you can do initial load to Snowflake, where the schema will be automatically mapped based on the Spark Dataframe.

    But, there may be a scenario where the complete column may be null when you do initial load to Snowflake from Teradata, in this scenario Spark can’t infer the datatype from the data. Also, some datatype mapping done by connector between Spark dataframe(Teradata data) and snowflake cannot be as expected.

    These edge cases can be handled using Spark JDBC option – createTableOptions, you can specify the specific column datatype which overrides spark inferred datatype or create a table in Snowflake using python connector before loading the data.

    Snowflake customers can get the useful scritps from Snowflake like codes for datatype mapping between other DBs to Snowflake, SQL query conversion between different database to Snowflake SQL,etc…

    Using this script, you can leverage the exact datatype mapping and use this in createTableOptions(Spark) or create table statements before loading data.

    Hope it helps.

Leave a reply

Your email address will not be published. Required fields are marked *

*

©2017 Tiger Analytics. All rights reserved.

Log in with your credentials

Forgot your details?