Integrating Data from Amazon Redshift to Snowflake using AWS Glue’s Snowflake Native Connector
Amazon has launched a native Snowflake connector within AWS Glue, further strengthening the integration between Amazon’s data and analytics services and Snowflake. The connector streamlines data integration tasks: the official AWS documentation highlights its ETL capabilities, which are useful for consolidating datasets from AWS data sources into a Snowflake database, or for executing direct data migrations and synchronizations. In this blog post, we offer a detailed, step-by-step guide on how to use this connector.
The connector also lets you take advantage of Redshift Spectrum, which acts as a bridge to the AWS data lake. While we’ve chosen Redshift for illustration in this guide, the Snowflake connector in AWS Glue isn’t confined to it: it is designed to work just as well with a multitude of AWS data sources, from DynamoDB and RDS to other AWS purpose-built databases and data lake configurations.
For the purposes of this guide, we assume an Amazon Redshift Serverless instance. However, the outlined process applies equally to a Redshift managed cluster.
Step 1: Redshift Setup
Before we begin the integration, let’s make sure everything is ready. First, open the Redshift Query Editor V2, a straightforward tool for browsing your Spectrum setup and other database objects. In this example, we’re using a table named dev.public.city that holds some sample data. You can find more details about this data in the reference section.
To load the sample data from an S3 bucket into the Redshift table, use this command:
COPY dev.public.city
FROM 's3://curated/CityMaster.csv'
IAM_ROLE 'arn:aws:iam::1xxx6xx:role/service-role/AmazonRedshift-CommandsAccessRole-20230923T011958'
FORMAT AS CSV
DELIMITER ','
QUOTE '"'
IGNOREHEADER 1
REGION AS 'us-east-1';
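If you prefer to script the load rather than run it interactively, here is a minimal sketch using the boto3 Redshift Data API. The workgroup name is an assumption for a serverless setup, and the COPY statement is the one shown above; adjust both to your environment.
# Minimal sketch: run the COPY via the Redshift Data API and verify the row
# count afterwards. The workgroup name ("default") is an assumption.
import time
import boto3

rsd = boto3.client("redshift-data", region_name="us-east-1")

copy_sql = """
COPY dev.public.city
FROM 's3://curated/CityMaster.csv'
IAM_ROLE 'arn:aws:iam::1xxx6xx:role/service-role/AmazonRedshift-CommandsAccessRole-20230923T011958'
FORMAT AS CSV DELIMITER ',' QUOTE '"' IGNOREHEADER 1
REGION AS 'us-east-1'
"""

resp = rsd.execute_statement(
    WorkgroupName="default",   # assumed Redshift Serverless workgroup
    Database="dev",
    Sql=copy_sql,
)

# Poll until the COPY finishes before checking the result.
while rsd.describe_statement(Id=resp["Id"])["Status"] not in ("FINISHED", "FAILED", "ABORTED"):
    time.sleep(2)

rsd.execute_statement(
    WorkgroupName="default",
    Database="dev",
    Sql="SELECT COUNT(*) FROM public.city;",
)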
Step 2: Create AWS Glue Job
Go to the AWS Glue service → click Jobs → ETL jobs. For an easy setup, choose the “Visual Editor” in Glue; if you have more advanced needs, pick the Spark script editor so you can work with the Snowflake native library directly. In most real-world scenarios, the script editor is handy for handling complex tasks. Then choose Amazon Redshift as your Source and Snowflake as your Target, and click Create to proceed.
Step 3: Configuring the AWS Glue Job
- Open the Visual Editor.
- You’ll see boxes for ‘source’ and ‘target’. Click on the ‘source’ box.
- On the right side, pick ‘Amazon Redshift’.
- You’ll see two choices: “Direct data connection” and “Glue data catalog tables”. Choose “Direct data connection”.
- If you’re using AWS’s Data Lake and have everything set up, you might want to pick “Glue data catalog tables” instead.
- Click “Create a new connection” to set up the source and target.
Step 4: Create Redshift Connection
- In the “Create connection” screen, pick Amazon Redshift as the connection type.
- Fill in the connection name and select the instance and Database details.
- Click “Save”.
You’ve now made a connection to the Redshift instance. For this example, the connection is named “Redshift-Snowflake”.
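If you prefer to create the connection programmatically, the sketch below uses boto3 with a JDBC-style connection definition rather than the console’s “Amazon Redshift” connection type; both point Glue at the same endpoint. Every value here is a placeholder, so substitute your own endpoint, credentials, subnet, and security group.
# Minimal sketch: create a Glue connection to Redshift with boto3 using a
# JDBC-style definition. All values below are placeholders.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_connection(
    ConnectionInput={
        "Name": "Redshift-Snowflake",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:redshift://<workgroup-endpoint>:5439/dev",
            "USERNAME": "<redshift_user>",
            "PASSWORD": "<redshift_password>",
        },
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-xxxxxxxx",
            "SecurityGroupIdList": ["sg-xxxxxxxx"],
            "AvailabilityZone": "us-east-1a",
        },
    }
)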
Step 5: Create Snowflake Connection
- Next, set up a connection for Snowflake.
- Type in your Snowflake account details and target database and table.
- Click “Save” to finalize the connection.
- In this example, the connection is named “RD-SNF”.
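The Glue Snowflake connection reads its credentials from AWS Secrets Manager, so the username and password you enter in the wizard end up in a secret. If you want to create that secret ahead of time, here is a minimal sketch; the secret name is a placeholder, and the sfUser/sfPassword key names should be confirmed against what the connection wizard expects.
# Minimal sketch: store Snowflake credentials in Secrets Manager for the
# Glue connection to reference. Secret name, user, and password are
# placeholders; verify the expected key names in the Glue connection wizard.
import json
import boto3

sm = boto3.client("secretsmanager", region_name="us-east-1")

sm.create_secret(
    Name="glue/snowflake/rd-snf",       # hypothetical secret name
    SecretString=json.dumps({
        "sfUser": "GLUE_ETL_USER",      # placeholder Snowflake user
        "sfPassword": "********",       # placeholder password
    }),
)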
Step 6: Merge / Append Source to Target
- Go back to the Glue interface.
- Click on the Amazon Redshift box labeled ‘source’ on the right.
- Pick the “Redshift-Snowflake” connection we made earlier.
- Next, click on the ‘sink’ box on the right.
- Choose the “RD-SNF” connection on the left.
- Make sure to pick the right tables for your source and target.
- You’ll see different ways to move your data.
You can Append, Merge, Truncate and load, or Drop and re-create the target objects. For this blog we’re simply adding data with the “Append” option, but in real-world use cases “Merge” is usually the right choice to keep source and target in sync (a merge-style variant is sketched after the generated script below).
- Once everything is configured, Glue generates a script for the job. You can edit this script if needed, but once you start editing it you can no longer use the visual setup for this job. The script is where you make more detailed changes if you want them.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue import DynamicFrame

# Resolve job arguments and initialize the Glue job.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Source: read the city table from Redshift through the Glue connection
# created in Step 4.
AmazonRedshift_node1 = glueContext.create_dynamic_frame.from_options(
    connection_type="redshift",
    connection_options={
        "redshiftTmpDir": "s3://aws-glue-assets-1xxx3-us-east-1/temporary/",
        "useConnectionProperties": "true",
        "dbtable": "public.city",
        "connectionName": "Redshift-Snowflake",
    },
    transformation_ctx="AmazonRedshift_node1",
)

# Sink: append the rows into Snowflake through the connection created in
# Step 5, creating the target table first if it does not exist.
SnowflakeSink_node2 = glueContext.write_dynamic_frame.from_options(
    frame=AmazonRedshift_node1,
    connection_type="snowflake",
    connection_options={
        "autopushdown": "on",
        "dbtable": "city",
        "connectionName": "RD-SNF",
        "preactions": "CREATE TABLE IF NOT EXISTS ETL.city (city VARCHAR, no_records VARCHAR, state VARCHAR, city_no VARCHAR, city_code VARCHAR);",
        "sfDatabase": "BI",
        "sfSchema": "ETL",
    },
    transformation_ctx="SnowflakeSink_node2",
)

job.commit()
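The script above appends data. If you pick the “Merge” option instead, Glue typically routes the upsert through the connector’s preactions/postactions hooks: land the batch in a staging table, then MERGE it into the target. The following is a hedged sketch of what that sink call could look like; the staging-table name, the join key (city_no), and the exact SQL are assumptions for illustration, not the connector’s verbatim output.
# Hedged sketch of a merge-style sink: write the batch to a staging table,
# then MERGE it into the target via postactions. Table names and the join
# key (city_no) are assumptions for illustration.
SnowflakeMergeSink = glueContext.write_dynamic_frame.from_options(
    frame=AmazonRedshift_node1,
    connection_type="snowflake",
    connection_options={
        "autopushdown": "on",
        "dbtable": "city_stage",        # assumed staging table
        "connectionName": "RD-SNF",
        "sfDatabase": "BI",
        "sfSchema": "ETL",
        "preactions": "CREATE TABLE IF NOT EXISTS ETL.city_stage LIKE ETL.city;",
        "postactions": """
            MERGE INTO ETL.city AS t
            USING ETL.city_stage AS s
            ON t.city_no = s.city_no
            WHEN MATCHED THEN UPDATE SET
                t.city = s.city, t.no_records = s.no_records,
                t.state = s.state, t.city_code = s.city_code
            WHEN NOT MATCHED THEN INSERT (city, no_records, state, city_no, city_code)
                VALUES (s.city, s.no_records, s.state, s.city_no, s.city_code);
            DROP TABLE IF EXISTS ETL.city_stage;
        """,
    },
    transformation_ctx="SnowflakeMergeSink",
)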
Step 7: Running the Job
- Click on the “Job details” tab at the top.
- Choose the right IAM role (create one if needed). The role needs permission to access, and perform the required operations on, the source and target instances and objects.
- Now, pick a Glue worker type. Think of it as the compute node, which does the actual work.
Glue charges are based on DPUs (Data Processing Units); one DPU provides 4 vCPUs and 16 GB of memory. If your job is large, complex, or time-constrained, you may need to scale up the compute.
- Click “Save”.
- Finally, hit “Run”. If you want, you can also schedule this job to start at a certain time or have it run when another event occurs (event-triggered).
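The console’s “Run” button is enough for this walkthrough, but the same job can be started and scheduled from code. Below is a minimal sketch with boto3; the job name and cron expression are placeholders, and each G.1X worker corresponds to one DPU.
# Minimal sketch: start the Glue job on demand and put it on a schedule.
# Job name and cron expression are placeholders.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# On-demand run, sizing the workers explicitly (2 x G.1X = 2 DPUs here).
glue.start_job_run(
    JobName="redshift-to-snowflake",    # hypothetical job name
    WorkerType="G.1X",
    NumberOfWorkers=2,
)

# Scheduled trigger: run the job every day at 02:00 UTC.
glue.create_trigger(
    Name="redshift-to-snowflake-nightly",
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",
    Actions=[{"JobName": "redshift-to-snowflake"}],
    StartOnCreation=True,
)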
Once the job completes, open Snowsight (the Snowflake UI) and look for the database you configured in the AWS Glue sink connection. It will now contain the Redshift data. You’ve effectively set up a repeatable data transfer from Redshift to Snowflake.
Conclusion:
Using AWS Glue’s native Snowflake connector, integrating data between various AWS sources and Snowflake becomes a straightforward task. We demonstrated this using Redshift, but the possibilities don’t end there. By understanding and applying these steps, you can expand this capability to integrate other AWS data sources like DynamoDB or RDS. The seamless data syncing capability ensures that your datasets remain updated and offers immense potential for various data management and analytics tasks.
Reference: