Business Use Case

Are the most viewed products also the most sold, or most searched for?

This insight can be derived from processing unstructured data stored in web logs and correlating it with structured data.

Since you can store unstructured and semi-structured data alongside structured data in the cloud without needing to remodel an entire database, you can easily ingest, store and process web log events.

In this section, we will:

Correlate Structured Data with Unstructured Data

Review web logs stored in the cloud

  1. Go to the browser and look at:
    {{ config['qubole_web_logs_s3_url'] }}
  2. Review structure of web logs

Now let’s create a Hive table and query the data using Analyze interface in QDS. Here we’ll take advantage of Hive's flexible SerDes (serializers / deserializers) to parse the unstructured logs into individual fields using a regular expression. Once the data is in this table, we will be able query it much faster and more interactively using QDS.

Create Hive table for semi-structured web logs

  1. Run query by clicking this button (this query does not provide any results, but just creates a table):
    USE {{ config['qubole_database_name'] }};
    
    CREATE EXTERNAL TABLE web_logs (
      ip STRING,
      `date` STRING,
      method STRING,
      url STRING,
      http_version STRING,
      code1 STRING,
      code2 STRING,
      dash STRING,
      user_agent STRING
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
      'input.regex' = '([^ ]*) - - \\[([^\\]]*)\\] "([^\ ]*) ([^\ ]*) ([^\ ]*)" (\\d*) (\\d*) "([^"]*)" "([^"]*)"',
      'output.format.string' = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
    )
    LOCATION '{{ config['qubole_web_logs_s3_path'] }}';
    
  2. Switch to Analyze interface and to confirm that the Hive table web_logs was created successfully, click on Tables tab on the top left and refresh {{ config['qubole_database_name'] }}. You can also run Hive commands ‘use {{ config['qubole_database_name'] }}’ followed by ‘show tables;’

Now let’s execute a SQL to find out which products have been viewed the most by inspecting the URLs found in web logs.

Top 10 most viewed products

  1. Run query by clicking this button:
    USE {{ config['qubole_database_name'] }};
    
    SELECT url AS URL, count(*) AS Views
    FROM web_logs
    WHERE url LIKE '%\/product\/%'
    GROUP BY url 
    ORDER BY Views DESC
    LIMIT 10;
    
  2. Switch to Analyze History tab and click on Results tab at the bottom to see query results

What you’ve learned

There is risk in looking for answers within partial data. Correlating two data sets for the same business use case shows value, and being able to do so within the same platform makes it easier for data team members and for the organization.