Advanced Analytics using Apache Spark

For business and marketing teams, it is valuable to know which products are most commonly purchased together. For instance:

To answer such questions and gain insights, in this section we will use Notebooks in QDS to:

Notebooks are well suited to developing applications in Scala, Python, and R, running ETL jobs in Apache Spark, and visualizing SQL results, all in a single, collaborative environment.

Note: The default input language for a Spark Notebook is Scala, and the default context is SparkContext (sc).

Insights on Product Relationships

The engine best suited for quick analytics on object relationships is Apache Spark, which is available as a service on QDS.

Products purchased together

  1. Switch over to the Notebooks interface.
  2. Click the {{ config['spark_notebook_name'] }} notebook, which we created for you. You can find it on the right, in the Common space.
  3. Execute the paragraph named Initialize by clicking the run button in the top-right corner of the paragraph. This paragraph produces no output; it simply initializes the notebook.
  4. Execute the paragraph named Products purchased together by clicking the run button in the top-right corner of the paragraph. This paragraph produces no output; it creates the ProductCombinations instances that will be used later for further analysis.
    When you run this paragraph, Qubole executes the following actions:
    • Execute a Hive query that groups order items by their key (order_id) and uses the built-in aggregation function collect_set to collect the products purchased within each order into a single set. This lets us determine all the combinations of products across orders.
    • Import org.apache.spark.mllib.fpm.FPGrowth
    • Import scala.collection.mutable._
      • This lets us work with the WrappedArray created by Hive's collect_set aggregation function
    • Create a Scala class ProductCombinations with two attributes, products and freq
      • products holds a comma-separated list of products
      • freq holds the number of times those products appear together across all orders
      • For convenience, the toString() method is overridden to output a formatted string
      • To allow product combinations to be sorted by frequency, the class extends Scala's Ordered trait and implements the compare() method
    • Create an RDD from the result set of the Hive query run in the first step, converting each product list of type WrappedArray (produced by Hive's collect_set aggregation function) into a format suitable for the FPGrowth data-mining algorithm.
    • Use this new RDD to build a model by running the FPGrowth data-mining algorithm. For more details on setMinSupport, visit https://spark.apache.org/docs/1.5.0/mllib-frequent-pattern-mining.html
    • Create an empty list of type ProductCombinations; this will hold the product combinations and the number of times each combination occurs across orders.
    • Loop through the result set (an RDD of FreqItemset) produced by the FPGrowth algorithm and create instances of ProductCombinations.
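The steps above can be sketched in plain Scala, with no Spark cluster required. The ProductCombinations class follows the description in the text (its toString format and the sample orders are illustrative assumptions), and a brute-force subset count stands in for what the Hive query plus FPGrowth actually compute at scale:

```scala
// Plain-Scala sketch. The class name, fields, Ordered/compare, and toString
// override follow the notebook description; the exact output format and the
// toy order data are assumptions made for illustration only.
case class Order(orderId: Int, products: Set[String])

// Two attributes: products (comma-separated list) and freq (times the
// combination occurs across orders). Extending Ordered lets collections of
// ProductCombinations be sorted via compare().
class ProductCombinations(val products: String, val freq: Long)
    extends Ordered[ProductCombinations] {
  override def compare(that: ProductCombinations): Int = this.freq.compare(that.freq)
  override def toString: String = s"[$products] appears $freq time(s)"
}

// Toy stand-in for the Hive result set: one row per order_id, with the
// collect_set of products purchased in that order.
val orders = Seq(
  Order(1, Set("milk", "bread", "eggs")),
  Order(2, Set("milk", "bread")),
  Order(3, Set("bread", "eggs")),
  Order(4, Set("milk", "bread"))
)

// Brute-force frequent-itemset count: every non-empty product subset, counted
// across all orders. On real data, FPGrowth produces these FreqItemsets far
// more efficiently; here we only illustrate the result shape.
val combos: List[ProductCombinations] =
  orders
    .flatMap(o => o.products.subsets.filter(_.nonEmpty).map(_.toList.sorted.mkString(",")))
    .groupBy(identity)
    .map { case (combo, hits) => new ProductCombinations(combo, hits.size.toLong) }
    .toList
```

In the notebook, the same list is built by looping over FPGrowth's RDD of FreqItemset and instantiating ProductCombinations for each itemset and its frequency.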

Top 10 Two-product Combinations

  1. Execute the paragraph named Top 10 Two-product Combinations by clicking the run button in the top-right corner of the paragraph.
    When you run this paragraph, Qubole executes the following actions:
    • Filter the list of ProductCombinations to keep only those instances with exactly two products.
    • Sort the resulting list by frequency in descending order using sortWith(), which internally uses the compare() method implemented in ProductCombinations.
    • Loop through the first 10 entries and print them to the console.
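The filter/sort/print logic of this paragraph can be sketched in plain Scala. The class definition and the sample frequencies below are illustrative assumptions, not the notebook's exact code:

```scala
// Minimal stand-in for the class described earlier; toString format and the
// sample data are assumptions made for illustration.
class ProductCombinations(val products: String, val freq: Long)
    extends Ordered[ProductCombinations] {
  override def compare(that: ProductCombinations): Int = this.freq.compare(that.freq)
  override def toString: String = s"[$products] appears $freq time(s)"
}

val combos = List(
  new ProductCombinations("milk", 40L),
  new ProductCombinations("milk,bread", 25L),
  new ProductCombinations("bread,eggs", 30L),
  new ProductCombinations("milk,bread,eggs", 12L)
)

val top10 =
  combos
    .filter(_.products.split(",").length == 2) // keep only two-product combinations
    .sortWith(_ > _)                           // descending frequency, via compare()
    .take(10)                                  // first 10 entries

top10.foreach(println)                         // print each combination to the console
```

Because `>` comes from the Ordered trait, `sortWith(_ > _)` delegates to compare(), so the list comes out ordered by freq from highest to lowest.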

Note: This is one approach to creating a model that can be used for product recommendations.