Step 6 - Redshift Spectrum

AmazonRedshift

EcommCo’s Data Lake leverages Amazon Redshift and Amazon Redshift Spectrum capabilities to transform, aggregate and analyze Curated Datasets. Redshift Spectrum allows data to be analyzed directly from S3, thus it can scale to support querying exabytes of data.

a. Using Spectrum for queries

The diagram below illustrates how ECommCo uses Redshift Spectrum to join data that reside in S3 and Redshift.

Curated Datasets in S3 and Curated Datasets in Redshift can be transformed, joined, analyzed using Redshift Spectrum. Partitioned data on S3 can help optimize query performance by only scanning through partitions of interest.

b. Demonstrate Spectrum analytics

Customer Lifetime Value analytics

One of ECommCo’s business stakeholder’s key questions is, “What is our expected Customer Lifetime Value (CLV)?”

An analytics process queries orders and products to aggregate per customer revenue and show averages across different customer segments:

Product distributions across recent orders

EcommCo product catalogs are released annually. The products table is partitioned by date of a catalog release. EcommCo uses Redshift Spectrum to only analyze products for the most recent product catalog.

The Products table has 4 partitions. Since ECommCo’s business users are most interested in the most recent product catalog released at the end of 2016 year, Redshift Spectrum only analyzes 1 partition. This cuts down the time and required resources required to process a query.

Sales over time analytics

ECommCo’s business users are very interested in seeing sales over time. Spectrum is used to analyze and aggregate orders on a year and month basis to generate sales over time analytics.

Popular SKU analytics

ECommCo’s business users are very interested in tracking their most popular SKUs. This Spectrum analytics process takes advantage of data partitioning of products in order to access only most recent product catalog.

c. Run analytics using Amazon Redshift Spectrum

{% include 'error_box.html' %}

When you click this button, the following steps will be performed within your AWS account:

  • Amazon Redshift Spectrum external schema is created from AWS Glue Data Catalog database {{ curated_datasets_database_name }}
  • Orders, Customers and Demographics Curated Datasets are consumed by the analytics process
  • New Curated Datasets are created and added to the Data Lake:
    • Customer Lifetime Value
    • Spend Distribution
    • SKU Distribution
    • Product Category Distribution - This queries products data using Redshift Spectrum
    • State Population - This queries public data using Redshift Spectrum
    • Orders by Month
  • New Curated Datasets are copied from Redshift to a Curated Datasets S3 Bucket

d. Observe analytics output stored as Curated Datasets in S3

Visit S3 in the AWS Management Console to see Spectrum analytics outputs stored as new datasets in {{ curated_bucket_name }} Bucket.

Navigate to Elasticsearch in the AWS Management Console and review at Top Curated Datasets Summary chart to see new Curated Datasets resulting from the Redshift Spectrum analytics.