Are the most viewed products also the most sold, or most searched for?
This insight can be derived from processing unstructured data stored in web logs and correlating it with structured data.
Since you can store unstructured and semi-structured data alongside structured data in the cloud without needing to remodel an entire database, you can easily ingest, store and process web log events.
In this section, we will:
Now let’s create a Hive table and query the data using Analyze interface in QDS. Here we’ll take advantage of Hive's flexible SerDes (serializers / deserializers) to parse the unstructured logs into individual fields using a regular expression. Once the data is in this table, we will be able query it much faster and more interactively using QDS.
USE {{ config['qubole_database_name'] }}; CREATE EXTERNAL TABLE web_logs ( ip STRING, `date` STRING, method STRING, url STRING, http_version STRING, code1 STRING, code2 STRING, dash STRING, user_agent STRING ) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH SERDEPROPERTIES ( 'input.regex' = '([^ ]*) - - \\[([^\\]]*)\\] "([^\ ]*) ([^\ ]*) ([^\ ]*)" (\\d*) (\\d*) "([^"]*)" "([^"]*)"', 'output.format.string' = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s" ) LOCATION '{{ config['qubole_web_logs_s3_path'] }}';
Now let’s execute a SQL to find out which products have been viewed the most by inspecting the URLs found in web logs.
USE {{ config['qubole_database_name'] }}; SELECT url AS URL, count(*) AS Views FROM web_logs WHERE url LIKE '%\/product\/%' GROUP BY url ORDER BY Views DESC LIMIT 10;
There is risk in looking for answers within partial data. Correlating two data sets for the same business use case shows value, and being able to do so within the same platform makes it easier for data team members and for the organization.