Deploy A Data Lake With MongoDB Atlas And AWS S3 (Step By Step)
This tutorial guides you through the process of creating a MongoDB Atlas Data Lake and connecting it to data stored in AWS S3.
Before we get started with the nitty-gritty details, let me share some background to help set the context. If you are in a rush, feel free to skip the next four paragraphs and jump right to the Prerequisites.
MongoDB Atlas Data Lake is a low-cost solution that is quick and easy to set up for querying data stored in Amazon S3, allowing analytics applications to make use of archived data.
Bear in mind that the Atlas Data Lake solution is still in beta and is slower than a native AWS solution such as Amazon Redshift over S3. For those who are familiar with MongoDB, it opens the door to other hybrid innovations at an affordable cost, but I don't see it as a production-ready solution at this time.
Even though MongoDB is an easily configurable solution, gaps in the help/user guide set me back a few times when I started out, and I ended up contacting MongoDB support. They were ready to offer assistance and pointed me to some helpful resources online, which I'm sharing at the end in case you are interested. Now that I've successfully configured the Data Lake, I'm writing this step-by-step guide to help my fellow data professionals who are googling around like I was 😊. I hope this helps, and I look forward to your questions and comments.
Prerequisites
To create a data lake, you’ll need:
- At least one AWS S3 bucket.
- Access to the AWS Management Console with permission to create IAM roles.
Load Sample Data into the S3 Bucket
Here I created a bucket named “my-data-feed” and loaded the sample weather data from the link below:
https://atlas-data-lake.s3.amazonaws.com/json/sample_weatherdata/data.json
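If you prefer the command line over the S3 console for this step, something like the following should stage the file. This is only a rough sketch: the bucket name “my-data-feed” matches the one above, while the sample_weatherdata/ folder prefix is an assumption, so use whatever folder name you plan to map to a collection later.

# Download the sample weather data, then upload it to the S3 bucket
curl -o data.json https://atlas-data-lake.s3.amazonaws.com/json/sample_weatherdata/data.json
aws s3 cp data.json s3://my-data-feed/sample_weatherdata/data.json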
Create a MongoDB Atlas Account
Open a new tab in your browser to create an Atlas account. Keep the AWS console open in another tab; we will come back to it in the middle of the Data Lake configuration.
Create a MongoDB Cloud account at https://www.mongodb.com/cloud. All you need is your work email.
Create A Data Lake
- Click “Configure a New Data Lake”
- Name your Data Lake.
- Connect the S3 bucket created in step 1.
I prefer to store my query results in S3. This makes it easy to access the result set and schedule queries.
- Here is the important part to note: I used an External ID to connect my S3 bucket to the Data Lake, rather than the AWS CLI.
- Copy “Your unique External ID” and the “Atlas AWS IAM User ARN”. Keep this info handy, as we will need it in the next step.
Now it's time to jump to the AWS console you have open in the other tab.
Create an IAM Role in AWS
To obtain the role ARN from the AWS console:
- Click the Services dropdown menu on the upper left-hand side of the console.
- Under Security, Identity, & Compliance, select IAM.
- Select Roles from the left-hand navigation panel.
- Create a new role.
- Select “Another AWS account”.
- Type in the Account ID and the External ID (the CLI sketch after this list shows the equivalent trust policy):
External ID = “Your unique External ID” copied in the previous step.
Account ID = the Atlas account ID with AWS (the numeric account ID in the “Atlas AWS IAM User ARN” copied earlier).
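If you would rather script this step than click through the console, here is a rough AWS CLI equivalent, run from a bash-compatible shell (the quoting will not work as-is in the Windows Command Prompt). The role name atlas-data-lake-role is just an assumption for this sketch; replace <ATLAS_AWS_ACCOUNT_ID> and <YOUR_UNIQUE_EXTERNAL_ID> with the values you copied from the Atlas screen.

# Create a role the Atlas AWS account can assume, scoped by your unique External ID
aws iam create-role \
  --role-name atlas-data-lake-role \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::<ATLAS_AWS_ACCOUNT_ID>:root" },
      "Action": "sts:AssumeRole",
      "Condition": { "StringEquals": { "sts:ExternalId": "<YOUR_UNIQUE_EXTERNAL_ID>" } }
    }]
  }'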
Now it is time to attach an S3 access policy to the role. I would suggest assigning full access while you test the functionality; once you are comfortable with the configuration and flow, you can change the policy to read-only.
- Filter the policies by S3
- Select either Read Only or Full Access
Now click on the newly created role and copy the Role ARN.
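If you scripted the role creation above, the policy attachment and ARN lookup can also be done from the AWS CLI. AmazonS3ReadOnlyAccess and AmazonS3FullAccess are the AWS-managed policies referred to here, and atlas-data-lake-role is the assumed role name from the earlier sketch.

# Attach the AWS-managed S3 policy (use AmazonS3FullAccess while testing, then tighten)
aws iam attach-role-policy \
  --role-name atlas-data-lake-role \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess

# Print the Role ARN to paste into the Atlas Data Lake configuration
aws iam get-role --role-name atlas-data-lake-role --query "Role.Arn" --output text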
Data Lake Configuration
Jump back to the Data Lake configuration tab and paste the Role ARN you copied in the previous step.
- You've now successfully created the Data Lake and connected it to S3, so you can proceed to the storage configuration.
- Copy the following storageSetConfig command to your preferred text editor and update your account and region information as commented in the script.
- Replace <S3_FolderName> with the folder name you have in S3; the collectionName() function maps each subfolder to its own collection. For the region codes, see the AWS reference: https://docs.aws.amazon.com/general/latest/gr/rande.html
use admin
db.runCommand( { "storageSetConfig": {
  "stores": [{
    "name": "s3store",    // Creates an S3 store
    "provider": "s3",     // Specifies the provider
    "region": "",         // Update with the bucket region code
    "bucket": ""          // Update with your bucket name
  }],
  "databases": [{
    "name": "sample",     // Creates a database named sample
    "collections": [{
      "name": "*",        // Creates a collection for each directory
      "dataSources": [{
        "storeName": "s3store",    // Links to the S3 store above
        "path": "/<S3_FolderName>/{collectionName()}"    // Replace <S3_FolderName> with your folder name
      }]
    }]
  }]
}})
Download the MongoDB shell from the link below.
https://downloads.mongodb.org/win32/mongodb-shell-win32-x86_64-2012plus-4.2.6.zip
Extract the downloaded file and add the extracted bin folder to your system PATH for easy access.
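For example, assuming the extracted bin folder ends up at C:\mongodb-shell\bin (an arbitrary location for this sketch), you can make mongo available in the current Command Prompt session like this; use the System Environment Variables dialog to add it permanently.

set PATH=%PATH%;C:\mongodb-shell\bin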
Open your Command Prompt (Windows) and copy-paste the line below, changing <username> to your MongoDB account username.
mongo "mongodb+srv://cluster0-cah4i.mongodb.net/test" --username <username>
You will be prompted for the password.
In the mongo shell, paste the storageSetConfig command you updated in the step above.
Upon successful configuration, the mongo shell outputs the following:
{ "ok" : 1 }
Verify your database and collection mapping.
Run the following command to display the mapped database:
show dbs
Upon successful configuration, the mongo shell outputs the following:
sample (empty)
The (empty) in the output is expected.
Switch to the sample database:
use sample
Run the following command to display the mapped collections:
show collections
Sample Query
If you loaded the weather sample data in step 1, here is a sample query that finds documents in the weather collection where the pressure is higher than 900 millibars, sorts by timestamp, and limits the number of documents returned:
db.weather.find({"pressure": {$gt: 900}}).limit(5).sort({ "ts": 1})
Congratulations! You just set up an Atlas Data Lake, created a database and collections from data stored in an S3 bucket, and queried the data using MQL commands.
Links:
https://docs.mongodb.com/manual/tutorial/deploy-shard-cluster/
https://docs.aws.amazon.com/cli/latest/reference/s3/