Step 2 - Curated Datasets


EcommCo’s Data Lake uses Amazon S3 and Amazon Redshift to manage and transform the data ingested from Data Providers. EcommCo’s internal Data Providers include its CRM system and e-commerce platform. EcommCo also uses publicly available demographics datasets to gain further insight into its customers.

a. Understand principles of data organization within EcommCo’s Data Lake

Submissions

S3 Bucket

Submissions are provided to the Data Lake in their native format to keep the “price of entry” low for Data Providers. This lets business users commission transforms on an as-needed basis, for the analytics that offer business value. Business users need not know all of their requirements up front.

Submissions can be added to S3 Buckets through a variety of submission mechanisms, which can differ from provider to provider (e.g., SFTP or API).

Curated Datasets

S3 Bucket

Datasets under management are stored in S3 Buckets.

Curated Datasets are a key concept of Data Lakes. They can include:

  • Data that has been lightly transformed so that it can be easily consumed by downstream analytics
  • Results of analytics
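
As an illustration of what "lightly transformed" can mean, the sketch below normalizes the header names of a CSV submission and trims stray whitespace, while keeping every row and column. It is a minimal example under stated assumptions, not the transform the workshop actually runs: the function name lightly_transform, the column names, and the sample values are all made up for illustration.

```python
import csv
import io


def lightly_transform(raw_csv: str) -> str:
    """Normalize header names and trim whitespace; keep every row and column.

    Hypothetical helper for illustration only.
    """
    rows = list(csv.reader(io.StringIO(raw_csv)))
    if not rows:
        return ""

    # Header cleanup: "Zip Code" -> "zip_code"
    header = [name.strip().lower().replace(" ", "_") for name in rows[0]]

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    for row in rows[1:]:
        # Value cleanup: remove leading and trailing whitespace only
        writer.writerow([cell.strip() for cell in row])
    return out.getvalue()


if __name__ == "__main__":
    # Made-up sample submission, not real demographics data
    sample = "Zip Code, Median Income\n 98101 , 75000 \n"
    print(lightly_transform(sample), end="")
```

The transform changes form only, so downstream analytics see consistent column names while the content of the submission stays intact.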

b. Review datasets contributed by Data Providers

The diagram below illustrates submissions from EcommCo’s Data Providers and how they are transformed into Curated Datasets.

c. Transform submissions into Curated Datasets


When you click this button, the following steps will be performed within your AWS account:

  • Create Curated Datasets by copying data submissions into the Curated Datasets bucket:
    • Orders
    • Customers
    • Products
  • Create a Curated Dataset by transforming the Demographics data submission and copying it into the Curated Datasets bucket
  • Load Curated Datasets into Redshift tables:
    • Orders to orders table
    • Customers to customers table
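
The steps above can be sketched as a small plan builder. This is a rough outline only and does not call AWS. The bucket names, the per-dataset key prefixes, and the IAM role ARN are hypothetical placeholders, and the assumption that the curated files are CSV with a header row is ours, not something the workshop states. The generated statement uses the standard Redshift COPY syntax for loading from S3.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical names, for illustration only
SUBMISSIONS_BUCKET = "example-submissions"
CURATED_BUCKET = "example-curated-datasets"
IAM_ROLE_ARN = "arn:aws:iam::123456789012:role/example-redshift-copy"

COPIED_AS_IS = ["orders", "customers", "products"]
TRANSFORMED = ["demographics"]
REDSHIFT_TABLES = {"orders": "orders", "customers": "customers"}


@dataclass(frozen=True)
class Step:
    action: str  # "copy", "transform", or "load"
    source: str
    target: str


def build_plan() -> List[Step]:
    """List the steps in the order given in the text."""
    steps = []
    for name in COPIED_AS_IS + TRANSFORMED:
        action = "copy" if name in COPIED_AS_IS else "transform"
        steps.append(Step(action,
                          f"s3://{SUBMISSIONS_BUCKET}/{name}/",
                          f"s3://{CURATED_BUCKET}/{name}/"))
    for dataset, table in REDSHIFT_TABLES.items():
        steps.append(Step("load", f"s3://{CURATED_BUCKET}/{dataset}/", table))
    return steps


def copy_statement(step: Step) -> str:
    """Build a Redshift COPY statement for a load step (assumes CSV with header)."""
    if step.action != "load":
        raise ValueError("COPY statements apply to load steps only")
    return (f"COPY {step.target} FROM '{step.source}' "
            f"IAM_ROLE '{IAM_ROLE_ARN}' FORMAT AS CSV IGNOREHEADER 1;")


if __name__ == "__main__":
    for step in build_plan():
        print(step.action, step.source, "->", step.target)
        if step.action == "load":
            print("  ", copy_statement(step))
```

Keeping the plan as plain data makes it easy to review what will be copied, transformed, and loaded before anything runs against a real account.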

d. Observe Curated Datasets in S3

Visit S3 in your AWS Management Console and review the following Buckets: