In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
/kaggle/input/logical-rythm-2k20-game-price-prediction/steam_test.csv
/kaggle/input/logical-rythm-2k20-game-price-prediction/steam_sample_submission.csv
/kaggle/input/logical-rythm-2k20-game-price-prediction/steam_train.csv

Importing the required libraries

In [2]:
import datetime
import xgboost as xgb
from scipy import stats
import plotly.express as px
import pandas_profiling as pp
import plotly.graph_objects as go
import plotly.figure_factory as ff
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import train_test_split, GridSearchCV

Loading the training dataset

In [3]:
df_train = pd.read_csv('/kaggle/input/logical-rythm-2k20-game-price-prediction/steam_train.csv')
df_train.describe()
Out[3]:
appid english required_age achievements positive_ratings negative_ratings average_playtime median_playtime price
count 2.504400e+04 25044.000000 25044.000000 25044.000000 2.504400e+04 25044.000000 25044.000000 25044.000000 25044.000000
mean 5.959530e+05 0.981113 0.355574 45.303865 1.004463e+03 215.739818 146.385202 139.477599 6.055217
std 2.514216e+05 0.136128 2.408379 350.998233 1.956267e+04 4447.824914 1783.038628 2114.615241 7.284552
min 1.000000e+01 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000
25% 4.014875e+05 1.000000 0.000000 0.000000 6.000000e+00 2.000000 0.000000 0.000000 1.690000
50% 5.988250e+05 1.000000 0.000000 7.000000 2.400000e+01 9.000000 0.000000 0.000000 3.990000
75% 7.987450e+05 1.000000 0.000000 23.000000 1.240000e+02 41.000000 0.000000 0.000000 7.190000
max 1.069460e+06 1.000000 18.000000 5394.000000 2.644404e+06 487076.000000 190625.000000 190625.000000 209.990000
In [4]:
df_train.columns
Out[4]:
Index(['appid', 'name', 'release_date', 'english', 'developer', 'publisher',
       'platforms', 'required_age', 'categories', 'genres', 'steamspy_tags',
       'achievements', 'positive_ratings', 'negative_ratings',
       'average_playtime', 'median_playtime', 'owners', 'price'],
      dtype='object')

Creating a Profile Report

In [5]:
#pp.ProfileReport(df_train)

Studying the profile report, we can comment on several of the variables:

  1. appid: A unique identifier used only for identification; it does not contribute to price.

  2. name, developer and publisher: High-cardinality fields that cannot realistically be used for any kind of feature extraction useful for price prediction.

  3. median_playtime: Very highly correlated with average_playtime, so one of the two can be safely eliminated.

  4. platforms, categories, genres and steamspy_tags: Multi-valued columns that need to be split into their individual values, using ';' as the separator.
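As a sketch of the splitting described in item 4, pandas can split a ';'-separated column and one-hot encode it in a single call (toy data, not the actual dataset; the notebook builds its encoding manually later):

```python
import pandas as pd

# Toy frame mimicking a ';'-separated multi-value column.
df = pd.DataFrame({'platforms': ['windows', 'windows;mac', 'windows;mac;linux']})

# str.get_dummies splits on the separator and one-hot encodes in one step.
dummies = df['platforms'].str.get_dummies(sep=';')
print(dummies.columns.tolist())    # ['linux', 'mac', 'windows']
print(dummies['windows'].tolist()) # [1, 1, 1]
```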

Data preprocessing

Checking and removal of null values

In [6]:
for i in df_train.columns:
    print( i+" \t: " +str(df_train[i].isnull().sum()))
appid 	: 0
name 	: 0
release_date 	: 0
english 	: 0
developer 	: 0
publisher 	: 0
platforms 	: 0
required_age 	: 0
categories 	: 0
genres 	: 0
steamspy_tags 	: 0
achievements 	: 0
positive_ratings 	: 0
negative_ratings 	: 0
average_playtime 	: 0
median_playtime 	: 0
owners 	: 0
price 	: 0

Since there are no null values in the dataset, no data cleaning is needed, so we will move on to feature engineering.
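The per-column loop above can be collapsed into a single vectorised call; a sketch on a toy frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, np.nan], 'b': [1, 2, 3]})

# isnull().sum() gives the per-column null count in one call.
null_counts = df.isnull().sum()
print(null_counts['a'])  # 1
print(null_counts['b'])  # 0
```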

Data visualisation and removal of outliers:

Price:

Box Plot of the price over the whole dataset.

In [7]:
column = 'appid'
name = 'App-ID'
a= []
for i in df_train.index:
    a.append(name +' : '+ str(df_train[column][i]))
df_train[column+'_visual'] = a
In [8]:
fig = px.box(data_frame = df_train.reset_index(),hover_name = 'appid_visual',
             y = 'price',hover_data = ['name'],height = 800,color = 'owners',
             width = 1000,labels = {'price':'Sale Price in "$"',"owners":"Total Downloads", 
                                    'name':'Name'},
             title = 'Box plot of the sale price(Hover for details)')
fig.show()

As we can see from the box plot there are a lot of outliers, so we set a threshold for the price.

Here I have taken it to be $30 (a purely arbitrary choice, matching the threshold in the code below).

Removal of outliers in sale price

In [9]:
removed = 0
threshold = 30
for i in df_train.index:
    if df_train['price'][i]>threshold:
        df_train = df_train.drop(i)
        removed+=1
print('Total number of data points removed till now are: '+str(removed))
Total number of data points removed till now are: 296

Since we do not want to remove too many data points, we keep track of how many have been removed. Here we have dropped 296 rows, roughly 1% of the total dataset, which look like outliers.
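The row-by-row drop loop can equivalently be written as one boolean-mask filter, which is much faster on large frames; a sketch with a toy price column and the same threshold:

```python
import pandas as pd

df = pd.DataFrame({'price': [1.0, 5.0, 35.0, 12.0, 99.0]})
threshold = 30

removed = int((df['price'] > threshold).sum())  # rows that will be dropped
df = df[df['price'] <= threshold]               # keep rows at or below the threshold

print(removed)   # 2
print(len(df))   # 3
```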

In [10]:
fig = px.box(data_frame = df_train.reset_index(),hover_name = 'appid_visual',
             y = 'price',hover_data = ['name'],height = 800,color = 'owners',
             width = 1000,labels = {'price':'Sale Price in "$"',"owners":"Total Downloads", 
                                    'name':'Name'},
             title = 'Box plot of the sale price(Hover for details)')
fig.show()

As you can see, the resulting plot is much better scaled over the dataset.

Categorical data fields:

One-hot encoding of all purely categorical data columns

There are six categorical columns that we really care about, so we preprocess only those.

  1. The four columns platforms, categories, genres and steamspy_tags need to be split into the individual values they contain, using ';' as the separator.

  2. owners: For this column we take the number of owners as the midpoint of the upper and lower limits of the given range. This will not be completely accurate, but it gives us a single numeric value that is much more flexible to work with.

  3. release_date: For this column we take the difference in days between today and the release date, which tells us how many days the game has been on the Steam store.
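Items 2 and 3 above can be sketched in a vectorised style (toy rows; the loop-based version in the next cell is equivalent):

```python
import datetime
import pandas as pd

df = pd.DataFrame({'owners': ['0-20000', '20000-50000'],
                   'release_date': ['2017-09-28', '2019-02-19']})

# Item 2: midpoint of the owners range.
df['Owners'] = df['owners'].str.split('-').apply(
    lambda x: (int(x[0]) + int(x[1])) / 2)

# Item 3: days elapsed since the release date.
dates = pd.to_datetime(df['release_date']).dt.date
df['days_in_store'] = [(datetime.date.today() - d).days for d in dates]

print(df['Owners'].tolist())  # [10000.0, 35000.0]
```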

In [11]:
encoding_columns = ['platforms', 'categories','genres','steamspy_tags']
unique_sets = []

for column in encoding_columns:
    unique_vals = set()
    for i in df_train[column]:
        for j in i.split(";"):
            unique_vals.add(j)
    unique_vals = list(unique_vals)
    unique_set = []
    for i in unique_vals:
        encode = []
        for j in df_train[column]:
            if i in j.split(";"):
                encode.append(1)
            else:
                encode.append(0)
        if sum(encode)>250 and sum(encode)< 24500:
            unique_set.append(column+"_"+i)
            df_train[column+"_"+i] = encode
    unique_sets.append(unique_set)
encode = []
for i in df_train['owners']:
    x = i.split('-')
    encode.append(sum([int(i) for i in x])/len(x))
df_train['Owners'] = encode
encode = []
for i in df_train['release_date']:
    x,y,z = i.split('-')
    i = datetime.date(int(x),int(y),int(z))
    diff = (datetime.date.today() - i).days
    encode.append(diff)
df_train['days_in_store'] = encode
df_train
Out[11]:
appid name release_date english developer publisher platforms required_age categories genres ... steamspy_tags_VR steamspy_tags_Racing steamspy_tags_Multiplayer steamspy_tags_Puzzle steamspy_tags_Casual steamspy_tags_Gore steamspy_tags_Visual Novel steamspy_tags_Free to Play Owners days_in_store
0 708830 The Armament Project 2017-09-28 1 DysTop DysTop windows 0 Single-player;Partial Controller Support Action;Indie ... 0 0 0 0 0 0 0 0 10000.0 1118
1 538000 Hop Step Sing! Kisekiteki Shining! (HQ Edition) 2017-06-29 1 Kodansha;Polygon Pictures Inc.;Lantis Degica windows 0 Single-player;Partial Controller Support Casual ... 1 0 0 0 1 0 0 0 10000.0 1209
2 603530 A Robot Named Fight! 2017-09-07 1 Matt Bitner Games Matt Bitner Games windows;mac;linux 0 Single-player;Local Co-op;Shared/Split Screen;... Violent;Gore;Action;Indie ... 0 0 0 0 0 0 0 0 35000.0 1139
3 564040 ZANGEKI WARP 2017-02-15 1 ASTRO PORT Nyu Media windows;linux 0 Single-player;Steam Achievements;Steam Trading... Action;Indie ... 0 0 0 0 0 0 0 0 10000.0 1343
4 951940 Almost There: The Platformer 2019-02-19 1 Bony Yousuf The Quantum Astrophysicists Guild windows;mac;linux 0 Single-player;Steam Achievements;Full controll... Action;Casual;Indie;Strategy ... 0 0 0 0 1 0 0 0 10000.0 609
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
25039 838490 Theorem 2018-04-24 1 Geckoo1337 Geckoo1337 windows 0 Single-player;Steam Achievements Casual;Indie ... 0 0 0 1 1 0 0 0 10000.0 910
25040 363970 Clicker Heroes 2015-05-13 1 Playsaurus Playsaurus windows;mac 0 Single-player;Steam Achievements;Steam Trading... Adventure;Casual;Free to Play;Indie;RPG;Simula... ... 0 0 0 0 1 0 0 1 7500000.0 1987
25041 46540 Trapped Dead 2011-07-15 1 Headup Games / Crenetic Headup Games windows 0 Single-player;Multi-player;Co-op;Steam Trading... Action;Strategy ... 0 0 0 0 0 0 0 0 150000.0 3385
25042 671650 Mad Age & This Guy 2017-10-24 1 Atomic Wolf Atomic Wolf windows 0 Single-player;Steam Achievements;Partial Contr... Action;Casual;Indie ... 0 0 0 0 1 0 0 0 10000.0 1092
25043 900270 Lonk's Adventure 2018-10-22 1 Pixelatto Pixelatto windows 0 Single-player;Steam Achievements;Full controll... Adventure;Indie;Early Access ... 0 0 0 0 0 0 0 0 10000.0 729

24748 rows × 84 columns

In the encoding loop we dropped any one-hot column present in fewer than 250 rows or more than 24,500 rows, i.e. fields where more than roughly 99% of values are the same, as such near-constant columns are not useful for the model.
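The cut-offs of 250 and 24,500 on ~24.7k rows correspond roughly to keeping columns whose positive rate lies between about 1% and 99%. A minimal sketch of that frequency filter on toy one-hot data:

```python
import pandas as pd

# Toy one-hot frame; keep columns whose positive rate lies strictly
# between 1% and 99%.
df = pd.DataFrame({'rare':   [0] * 99 + [1],   # 1% positives  -> dropped
                   'common': [1] * 99 + [0],   # 99% positives -> dropped
                   'mixed':  [0, 1] * 50})     # 50% positives -> kept

lo, hi = 0.01, 0.99
keep = [c for c in df.columns if lo < df[c].mean() < hi]
print(keep)  # ['mixed']
```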

In [12]:
df_train.columns
useful_classes = set() #A set to store classes that show a significant relation with price.
correlation_threshold = 0 #With a threshold of 0 every column is kept; raise it (e.g. to 0.05) to keep only significant correlations.

Correlation matrix for the main variables in the dataset

In [13]:
height = 550
title = '<b>Correlation Matrix for the dataset:</b>'
colors = 'Viridis'
#-------------------------------------------------------------------------#
classes = ['required_age', 'achievements', 'days_in_store', 'positive_ratings',
           'negative_ratings','average_playtime','Owners','price']
#-------------------------------------------------------------------------#
correlation = df_train[classes].corr()
correlation_mat = df_train[classes].corr().to_numpy()
correlation_mat = (correlation_mat//0.0001)/10000
correlation_mat_norm = (correlation_mat//0.01)/100

fig = ff.create_annotated_heatmap(correlation_mat, x=classes, y=classes,
                                  annotation_text=correlation_mat_norm,
                                  colorscale=colors,text = correlation_mat,
                                  hovertemplate='Column: %{x}<br>'+
                                                'Row: %{y}<br>'+
                                                'Correlation: %{text}<extra></extra>')


fig.update_layout(title_text= title,width = (height*(1.618))//1,height = height,
                  xaxis = {'title':'Columns'},
                  yaxis = {'title':'Rows','autorange':'reversed'})
fig.update_traces(showscale = True)
fig.show()
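A note on the `(x // 0.0001) / 10000` expressions above: floor-dividing by a power of ten truncates each correlation to a fixed number of decimal places (four for the hover text, two for the annotations). A small sketch:

```python
import numpy as np

# Truncation to fixed decimals via floor-division, as used for the heatmap.
x = np.array([0.123456, -0.98765])
four = (x // 0.0001) / 10000  # 4 decimal places (floors toward -inf)
two  = (x // 0.01) / 100      # 2 decimal places, used for annotations

print(four.tolist())  # [0.1234, -0.9877]
```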

Now we can see that only two columns show a significant correlation with price: required_age and days_in_store. Plausible reasons for these correlations:

  1. required_age: Older players who earn an income are more willing to spend on a game, so companies tend to price age-restricted games higher.

  2. days_in_store: At first a positive correlation here might seem counter-intuitive considering inflation, but it makes sense given the rise of free-to-play games, which according to many studies make more money through in-app purchases than pay-to-play games do.

The columns positive_ratings and negative_ratings do not seem to have any significant correlation with price, but they are strongly correlated with the number of owners, so it seems reasonable to normalise these columns by dividing them by the owners count.

We can also introduce a new variable, likeliness, defined as the difference between the positive and negative ratings divided by the number of owners, and a normalised likeliness, obtained by dividing likeliness by the number of owners once more.
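As a quick numeric check of these definitions (toy numbers, not taken from the dataset):

```python
# likeliness = (positive - negative) / owners, and
# likeliness_norm divides by owners once more; toy numbers.
positive, negative, owners = 124, 41, 35000.0

likeliness = (positive - negative) / owners
likeliness_norm = likeliness / owners

print(round(likeliness, 6))  # 0.002371
```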

In [14]:
df_train['positive_ratings_norm'] = df_train['positive_ratings']/df_train['Owners']
df_train['negative_ratings_norm'] = df_train['negative_ratings']/df_train['Owners']

df_train['likeliness'] = (df_train['positive_ratings']-df_train['negative_ratings'])/df_train['Owners']
df_train['likeliness_norm'] = df_train['likeliness']/df_train['Owners']
In [15]:
height = 700
title = '<b>Correlation Matrix for the dataset:</b>'
colors = 'Viridis'
#-------------------------------------------------------------------------#
classes = ['required_age', 'achievements', 'days_in_store', 'positive_ratings', 
           'negative_ratings','positive_ratings_norm','negative_ratings_norm',
           'average_playtime','Owners','likeliness','likeliness_norm','price']
#-------------------------------------------------------------------------#
correlation = df_train[classes].corr()
correlation_mat = df_train[classes].corr().to_numpy()
correlation_mat = (correlation_mat//0.0001)/10000
correlation_mat_norm = (correlation_mat//0.01)/100

fig = ff.create_annotated_heatmap(correlation_mat, x=classes, y=classes,
                                  annotation_text=correlation_mat_norm,
                                  colorscale=colors,text = correlation_mat,
                                  hovertemplate='Column: %{x}<br>'+
                                                'Row: %{y}<br>'+
                                                'Correlation: %{text}<extra></extra>')

fig.update_layout(title_text= title,width = (height*(1.618))//1,height = height,
                  xaxis = {'title':'Columns'},
                  yaxis = {'title':'Rows','autorange':'reversed'})
fig.update_traces(showscale = True)
fig.show()

for i in classes:
    if abs(correlation[i]['price'])>= correlation_threshold:
        useful_classes.add(i)

We see four more columns that show a significant correlation with price: positive_ratings_norm, negative_ratings_norm, likeliness and likeliness_norm. Of these, the correlation for normalised positive ratings was expected. For normalised negative ratings it is positive because popular games receive more of both positive and negative ratings.

The positive correlation for normalised likeliness, I think, has more to do with the derived nature of the variable than with its significance as a whole, but the positive correlation for likeliness is something new, as neither positive_ratings nor negative_ratings correlates significantly with price directly.

In [16]:
values = sorted(list(df_train['Owners'].unique()))
encoding = []
for i in df_train['Owners']:
    encoding.append(values.index(i)+1)
df_train['owners_color'] = encoding    
In [17]:
fig = go.Figure()
classes = [x for x in classes if x != 'price']
for step in range (len(classes)):
    fig.add_trace(
        go.Scattergl(
            visible = False,mode = 'markers',
            marker = {'color' : df_train['owners_color']},
            text = df_train[['appid_visual']],x = df_train[classes[step]],y = df_train['price'],
            hovertemplate = '<b>%{text}</b><br>Sale Price in "$": %{y}<br>'+
                            " ".join([x.capitalize() for x in classes[step].split("_")]) + 
                            ': %{x}<extra></extra>' 
        ))
fig.data[0].visible = True

steps = []
for i in range(len(fig.data)):
    step = dict(
        method = "update",
        label  = " ".join([x.capitalize() for x in classes[i].split("_")]),
        args=[{"visible": [False] * len(fig.data)},
              {"title": "Slider switched to step: " + str(classes[i])}],
    )
    step["args"][0]["visible"][i] = True
    steps.append(step)
    
sliders = [dict(
    active=0,
    currentvalue={"prefix": "Scatter plot for Price vs "},
    pad={"t": 50},
    steps=steps
)]

fig.update_layout(
    sliders=sliders
)

fig.show()

Correlation matrix for platform variables

In [18]:
height = 500
title = '<b>Correlation Matrix for the dataset:</b>'
colors = 'Viridis'
#-------------------------------------------------------------------------#
classes = unique_sets[0]+['price']
#-------------------------------------------------------------------------#
correlation = df_train[classes].corr()
correlation_mat = df_train[classes].corr().to_numpy()
correlation_mat = (correlation_mat//0.0001)/10000
correlation_mat_norm = (correlation_mat//0.01)/100

fig = ff.create_annotated_heatmap(correlation_mat, x=classes, y=classes,
                                  annotation_text=correlation_mat_norm,
                                  colorscale=colors,text = correlation_mat,
                                  hovertemplate='Column: %{x}<br>'+
                                                'Row: %{y}<br>'+
                                                'Correlation: %{text}<extra></extra>')


fig.update_layout(title_text= title,width = (height*(1.618))//1,height = height,
                  xaxis = {'title':'Columns'},
                  yaxis = {'title':'Rows','autorange':'reversed'})
fig.update_traces(showscale = True)
fig.show()

for i in classes:
    if abs(correlation[i]['price'])>= correlation_threshold:
        useful_classes.add(i)

The variable price does not show any significant correlation with any of the platforms.

Correlation matrix for Categories

In [19]:
height = 700
title = '<b></b>'
colors = 'Viridis'
#-------------------------------------------------------------------------#
classes = unique_sets[1]+['price']
#-------------------------------------------------------------------------#
correlation = df_train[classes].corr()
correlation_mat = df_train[classes].corr().to_numpy()
correlation_mat = (correlation_mat//0.0001)/10000
correlation_mat_norm = (correlation_mat//0.01)/100

fig = ff.create_annotated_heatmap(correlation_mat, x=classes, y=classes,
                                  annotation_text=correlation_mat_norm,
                                  colorscale=colors,text = correlation_mat,
                                  hovertemplate='Column: %{x}<br>'+
                                                'Row: %{y}<br>'+
                                                'Correlation: %{text}<extra></extra>')


fig.update_layout(title_text= title,width = (height*(1.618))//1,height = height,
                  xaxis = {'title':'Columns'},
                  yaxis = {'title':'Rows','autorange':'reversed'})
fig.update_traces(showscale = True)
fig.show()

for i in classes:
    if abs(correlation[i]['price'])>= correlation_threshold:
        useful_classes.add(i)

Here, most columns do not have any significant relationship with price, but those that do have been added to useful_classes.

In [20]:
height = 700
title = '<b>Correlation Matrix for the dataset:</b>'
colors = 'Viridis'
#-------------------------------------------------------------------------#
classes = unique_sets[2]+['price']
#-------------------------------------------------------------------------#
correlation = df_train[classes].corr()
correlation_mat = df_train[classes].corr().to_numpy()
correlation_mat = (correlation_mat//0.0001)/10000
correlation_mat_norm = (correlation_mat//0.01)/100

fig = ff.create_annotated_heatmap(correlation_mat, x=classes, y=classes,
                                  annotation_text=correlation_mat_norm,
                                  colorscale=colors,text = correlation_mat,
                                  hovertemplate='Column: %{x}<br>'+
                                                'Row: %{y}<br>'+
                                                'Correlation: %{text}<extra></extra>')


fig.update_layout(title_text= title,width = (height*(1.618))//1,height = height,
                  xaxis = {'title':'Columns'},
                  yaxis = {'title':'Rows','autorange':'reversed'})
fig.update_traces(showscale = True)
fig.show()

for i in classes:
    if abs(correlation[i]['price'])>= correlation_threshold:
        useful_classes.add(i)

The columns that show the most significant correlation here are:

  1. genres_Free to Play: The most obvious correlation; these games are mostly free and therefore correlate negatively with price.

  2. genres_Casual: Also fairly self-explanatory. Casual games are generally meant for relaxed play rather than competitive gaming and therefore are not priced higher.

  3. genres_Indie: Games developed by individual developers or small teams; they tend to lack high production values and thus are not priced high.

In [21]:
height = 900
title = ''
colors = 'Viridis'
#-------------------------------------------------------------------------#
classes = unique_sets[3]+['price']
#-------------------------------------------------------------------------#
correlation = df_train[classes].corr()
correlation_mat = df_train[classes].corr().to_numpy()
correlation_mat = (correlation_mat//0.0001)/10000
correlation_mat_norm = (correlation_mat//0.01)/100

fig = ff.create_annotated_heatmap(correlation_mat, x=classes, y=classes,
                                  annotation_text=correlation_mat_norm,
                                  colorscale=colors,text = correlation_mat,
                                  hovertemplate='Column: %{x}<br>'+
                                                'Row: %{y}<br>'+
                                                'Correlation: %{text}<extra></extra>')


fig.update_layout(title_text= title,width = (height*(1.618))//1,height = height,
                  xaxis = {'title':'Columns'},
                  yaxis = {'title':'Rows','autorange':'reversed'})
fig.update_traces(showscale = True)
fig.show()

for i in classes:
    if abs(correlation[i]['price'])>= correlation_threshold:
        useful_classes.add(i)
In [22]:
useful_classes
Out[22]:
{'Owners',
 'achievements',
 'average_playtime',
 'categories_Captions available',
 'categories_Co-op',
 'categories_Cross-Platform Multiplayer',
 'categories_Full controller support',
 'categories_In-App Purchases',
 'categories_Includes level editor',
 'categories_Local Co-op',
 'categories_Local Multi-Player',
 'categories_MMO',
 'categories_Multi-player',
 'categories_Online Co-op',
 'categories_Online Multi-Player',
 'categories_Partial Controller Support',
 'categories_Shared/Split Screen',
 'categories_Single-player',
 'categories_Stats',
 'categories_Steam Achievements',
 'categories_Steam Cloud',
 'categories_Steam Leaderboards',
 'categories_Steam Trading Cards',
 'categories_Steam Workshop',
 'days_in_store',
 'genres_Action',
 'genres_Adventure',
 'genres_Casual',
 'genres_Early Access',
 'genres_Free to Play',
 'genres_Gore',
 'genres_Indie',
 'genres_Massively Multiplayer',
 'genres_RPG',
 'genres_Racing',
 'genres_Simulation',
 'genres_Sports',
 'genres_Strategy',
 'genres_Violent',
 'likeliness',
 'likeliness_norm',
 'negative_ratings',
 'negative_ratings_norm',
 'platforms_linux',
 'platforms_mac',
 'positive_ratings',
 'positive_ratings_norm',
 'price',
 'required_age',
 'steamspy_tags_Action',
 'steamspy_tags_Adventure',
 'steamspy_tags_Anime',
 'steamspy_tags_Casual',
 'steamspy_tags_Early Access',
 'steamspy_tags_FPS',
 'steamspy_tags_Free to Play',
 'steamspy_tags_Gore',
 'steamspy_tags_Hidden Object',
 'steamspy_tags_Horror',
 'steamspy_tags_Indie',
 'steamspy_tags_Massively Multiplayer',
 'steamspy_tags_Multiplayer',
 'steamspy_tags_Nudity',
 'steamspy_tags_Platformer',
 'steamspy_tags_Point & Click',
 'steamspy_tags_Puzzle',
 'steamspy_tags_RPG',
 'steamspy_tags_Racing',
 'steamspy_tags_Sexual Content',
 'steamspy_tags_Simulation',
 'steamspy_tags_Sports',
 'steamspy_tags_Strategy',
 'steamspy_tags_VR',
 'steamspy_tags_Violent',
 'steamspy_tags_Visual Novel'}
In [23]:
print ("The total number of useful classes are:",len(useful_classes))
The total number of useful classes are: 75

Preparing training and testing sets

1. Training features

In [24]:
df_train_x = df_train[useful_classes].drop(['price'],axis = 1)
df_train_x.describe()
Out[24]:
categories_Steam Cloud categories_Stats negative_ratings_norm likeliness_norm steamspy_tags_Nudity days_in_store genres_Adventure steamspy_tags_Strategy steamspy_tags_Adventure categories_Online Multi-Player ... genres_RPG genres_Indie steamspy_tags_FPS steamspy_tags_Visual Novel steamspy_tags_Puzzle categories_Online Co-op positive_ratings categories_MMO categories_Local Multi-Player genres_Massively Multiplayer
count 24748.000000 24748.000000 24748.000000 2.474800e+04 24748.000000 24748.000000 24748.000000 24748.000000 24748.000000 24748.000000 ... 24748.000000 24748.000000 24748.000000 24748.000000 24748.000000 24748.000000 2.474800e+04 24748.000000 24748.000000 24748.000000
mean 0.263577 0.069985 0.001113 1.582094e-07 0.020850 1392.879505 0.372515 0.153952 0.289195 0.088815 ... 0.159811 0.724745 0.014870 0.020042 0.041943 0.037862 9.547523e+02 0.015032 0.059156 0.026507
std 0.440581 0.255128 0.001960 4.259797e-07 0.142885 783.330462 0.483484 0.360910 0.453398 0.284483 ... 0.366438 0.446651 0.121035 0.140147 0.200462 0.190865 1.960579e+04 0.121681 0.235922 0.160641
min 0.000000 0.000000 0.000000 -2.490000e-06 0.000000 538.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000200 4.240816e-09 0.000000 868.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 6.000000e+00 0.000000 0.000000 0.000000
50% 0.000000 0.000000 0.000500 3.910249e-08 0.000000 1174.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 2.400000e+01 0.000000 0.000000 0.000000
75% 1.000000 0.000000 0.001300 1.400000e-07 0.000000 1664.000000 1.000000 0.000000 1.000000 0.000000 ... 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 1.182500e+02 0.000000 0.000000 0.000000
max 1.000000 1.000000 0.045300 9.170000e-06 1.000000 8513.000000 1.000000 1.000000 1.000000 1.000000 ... 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 2.644404e+06 1.000000 1.000000 1.000000

8 rows × 74 columns

2. Training target

In [25]:
df_train_y = df_train[['price']]
df_train_y.describe()
Out[25]:
price
count 24748.000000
mean 5.589131
std 5.400076
min 0.000000
25% 1.690000
50% 3.990000
75% 7.190000
max 29.990000

Machine Learning Model

Splitting into train and test sets

In [26]:
x_train,x_test,y_train,y_test = train_test_split(df_train_x, df_train_y,test_size=0.10,
                                                 random_state=42)

Defining a custom scorer function

In [27]:
def custom_scorer(y_true,y_pred):
    return mean_squared_error(y_true,y_pred,squared = False)
scorer = make_scorer(custom_scorer,greater_is_better = False)
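Note that with squared = False, mean_squared_error returns the root-mean-squared error (RMSE), not the MSE. The quantity the scorer computes can be checked by hand with a stdlib-only sketch:

```python
import math

def rmse(y_true, y_pred):
    # Same quantity as mean_squared_error(..., squared=False).
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
                     / len(y_true))

print(rmse([3.0, 5.0], [1.0, 5.0]))  # sqrt(4/2) ≈ 1.4142
```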

Hyper-parameter tuning for XGBoost

In [28]:
xg = xgb.XGBRegressor()
#parameters = {"max_depth": [5,6,7],
#              "eta": [0.03,0.05],
#              "alpha":[0],
#              'n_estimators': [500]}

#xg = GridSearchCV(xg,parameters,cv=5,verbose = 2 , scoring = scorer, n_jobs = -1)
xg.fit(x_train, y_train)
predictions_xg = xg.predict(x_test)
predictions_xg = [max(0,x) for x in predictions_xg]
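The list comprehension clips negative predicted prices to zero; with NumPy already imported, np.clip is an equivalent vectorised form (a sketch on toy values):

```python
import numpy as np

preds = np.array([-0.5, 2.3, 7.9])
clipped = np.clip(preds, 0, None)  # same as [max(0, x) for x in preds]
print(clipped.tolist())  # [0.0, 2.3, 7.9]
```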
In [29]:
#print("The best parameters for the model are:",xg.best_params_)
In [30]:
print("The RMSE obtained is:",mean_squared_error(y_test,predictions_xg,squared = False))
The RMSE obtained is: 3.9215450127027007

Predicting over the whole dataset

In [31]:
predictions = xg.predict(df_train_x)
predictions = [max(0,x) for x in predictions]
df_train['price_predicted'] = predictions

Plotting the residual plot for the model

In [32]:
df_train['Residuals'] = (df_train['price'] - df_train['price_predicted'])//0.01/100
df_train['mod_Residuals'] = abs(df_train['Residuals'])
In [33]:
dic_residuals = {'price_predicted':'Value predicted by the model',
                 'Residuals':'Residual value','mod_Residuals':'Divergence'}
fig = px.scatter(data_frame = df_train,x = 'price_predicted',y = 'Residuals',
                 hover_name ='appid_visual',hover_data = ['price'],opacity = 1,
                 trendline = 'ols',trendline_color_override = 'darkred',
                 color= 'mod_Residuals',marginal_y ='box',labels = dic_residuals,
                 marginal_x ='violin',
                 title = 'Residual value plot when using XGBoost Regression Model (Hover for more details.)')
fig.show()

Predicting over the test set

Reading the test file

In [34]:
df_test = pd.read_csv('/kaggle/input/logical-rythm-2k20-game-price-prediction/steam_test.csv')
df_test.describe()
Out[34]:
appid english required_age achievements positive_ratings negative_ratings average_playtime median_playtime
count 2.031000e+03 2031.000000 2031.000000 2031.000000 2031.000000 2031.000000 2031.000000 2031.000000
mean 5.992926e+05 0.981290 0.346627 44.570655 952.408666 152.915805 191.973412 227.173806
std 2.443364e+05 0.135532 2.377644 372.762350 9369.513578 904.502089 2301.593278 4327.346725
min 7.000000e+01 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 3.998350e+05 1.000000 0.000000 0.000000 7.000000 2.000000 0.000000 0.000000
50% 6.038100e+05 1.000000 0.000000 8.000000 25.000000 10.000000 0.000000 0.000000
75% 7.990200e+05 1.000000 0.000000 24.000000 153.000000 46.000000 1.000000 1.000000
max 1.062670e+06 1.000000 18.000000 9821.000000 363721.000000 25607.000000 95242.000000 190445.000000

Preprocessing of the test file

In [35]:
for column,unique_vals in zip (encoding_columns, unique_sets):
    for i in unique_vals:
        encode = []
        for j in df_test[column]:
            if i in [column+"_"+ x for x in j.split(";")]:
                encode.append(1)
            else:
                encode.append(0)
        df_test[i] = encode

encode = []
for i in df_test['owners']:
    x = i.split('-')
    encode.append(sum([int(i) for i in x])/len(x))
df_test['Owners'] = encode
encode = []
for i in df_test['release_date']:
    x,y,z = i.split('-')
    i = datetime.date(int(x),int(y),int(z))
    diff = (datetime.date.today() - i).days
    encode.append(diff)
df_test['days_in_store'] = encode
df_test
Out[35]:
appid name release_date english developer publisher platforms required_age categories genres ... steamspy_tags_VR steamspy_tags_Racing steamspy_tags_Multiplayer steamspy_tags_Puzzle steamspy_tags_Casual steamspy_tags_Gore steamspy_tags_Visual Novel steamspy_tags_Free to Play Owners days_in_store
0 514520 Sparky's Hunt 2016-08-18 1 Luke Cripps Fellowplayer windows 0 Single-player Indie ... 0 0 0 0 0 0 0 0 10000.0 1524
1 1012710 Endzeit 2019-04-03 1 RockyDev RockyDev windows 0 Single-player;Multi-player;Co-op;Full controll... Action;Early Access ... 0 0 0 0 0 0 0 0 10000.0 566
2 279260 Richard & Alice 2014-06-05 1 Owl Cave Owl Cave windows 0 Single-player;Steam Achievements;Steam Trading... Adventure;Indie ... 0 0 0 0 0 0 0 0 75000.0 2329
3 220090 The Journey Down: Chapter One 2013-01-09 1 SkyGoblin SkyGoblin windows;mac;linux 0 Single-player;Steam Trading Cards Adventure;Indie ... 0 0 0 0 0 0 0 0 350000.0 2841
4 788870 In The Long Run The Game 2018-07-02 1 Zerstoren Games Zerstoren Games windows 0 Single-player;Partial Controller Support Action;Adventure;Indie;Simulation;Strategy;Ear... ... 0 0 0 0 0 0 0 0 10000.0 841
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2026 9940 Blade Kitten 2014-05-22 1 Krome Studios Krome Studios windows 0 Single-player;Steam Achievements;Steam Trading... Action;Adventure ... 0 0 0 0 0 0 0 0 75000.0 2343
2027 983880 Arctic Fleet 2019-03-29 1 Coffee-Powered Games Coffee-Powered Games windows;mac 0 Single-player;Steam Achievements Indie;Strategy;Early Access ... 0 0 0 0 0 0 0 0 10000.0 571
2028 687760 Occultus - Mediterranean Cabal 2017-11-01 1 Sylphe Labs Microids windows;mac 0 Single-player Adventure ... 0 0 0 0 0 0 0 0 10000.0 1084
2029 391290 DeathCrank 2016-12-14 1 Kenny Roy Kenny Roy windows 0 Multi-player;Online Multi-Player;Local Multi-P... Action;Indie;Racing;Early Access ... 0 1 0 0 0 0 0 0 10000.0 1406
2030 944710 TurnTack 2018-12-10 1 Geuluteogi Geuluteogi windows 0 Single-player;Steam Cloud Adventure;Indie;Early Access ... 0 0 0 0 0 0 0 0 10000.0 680

2031 rows × 82 columns
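As an aside, the nested loops in the cell above can be condensed with pandas' built-in string handling: `Series.str.get_dummies` does the multi-hot split, and `str.split(expand=True)` vectorises the owners midpoint. A minimal sketch on toy data (the column values here are hypothetical, not rows from df_test):

```python
import pandas as pd

# Toy stand-in for df_test; values are hypothetical
toy = pd.DataFrame({
    "genres": ["Action;Indie", "Adventure", "Action;Adventure"],
    "owners": ["0-20000", "50000-100000", "0-20000"],
})

# Multi-hot encode a ';'-separated column in one call, prefixing the
# dummy columns as "genres_<value>" like the loop above does
dummies = toy["genres"].str.get_dummies(sep=";").add_prefix("genres_")
toy = toy.join(dummies)

# Midpoint of the owners band, vectorised: split into the two bounds and average
bounds = toy["owners"].str.split("-", expand=True).astype(int)
toy["Owners"] = bounds.mean(axis=1)
```

On the real frame this avoids one full pass over df_test per unique tag.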

In [36]:
# Scale ratings by audience size and playtime by time on the store,
# mirroring the derived features built for the training set
df_test['positive_ratings_norm'] = df_test['positive_ratings']/df_test['Owners']
df_test['negative_ratings_norm'] = df_test['negative_ratings']/df_test['Owners']
df_test['average_playtime_norm'] = df_test['average_playtime']/df_test['days_in_store']

# Net approval per owner, plus a second owner-scaled variant
df_test['likeliness'] = (df_test['positive_ratings']-df_test['negative_ratings'])/df_test['Owners']
df_test['likeliness_norm'] = df_test['likeliness']/df_test['Owners']
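To make the likeliness feature concrete, here is a tiny worked example on hypothetical counts (60 positive and 30 negative ratings over ~10k owners yield a net approval of 0.003 per owner):

```python
import pandas as pd

# Hypothetical rows, not taken from df_test
toy = pd.DataFrame({
    "positive_ratings": [60, 150],
    "negative_ratings": [30, 50],
    "Owners": [10000.0, 75000.0],
})

# Net approval per owner, as in the likeliness feature above
toy["likeliness"] = (toy["positive_ratings"] - toy["negative_ratings"]) / toy["Owners"]
```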

Predicting over the test set

In [37]:
# Predict with the fitted XGBoost regressor on the same feature set used
# in training, then clamp any negative predicted prices to zero
predictions = xg.predict(df_test[[x for x in useful_classes if x != 'price']])
predictions = [max(0, x) for x in predictions]
df_test['price'] = predictions
df_test
Out[37]:
appid name release_date english developer publisher platforms required_age categories genres ... steamspy_tags_Visual Novel steamspy_tags_Free to Play Owners days_in_store positive_ratings_norm negative_ratings_norm average_playtime_norm likeliness likeliness_norm price
0 514520 Sparky's Hunt 2016-08-18 1 Luke Cripps Fellowplayer windows 0 Single-player Indie ... 0 0 10000.0 1524 0.000600 0.000300 0.000000 0.000300 3.000000e-08 3.376132
1 1012710 Endzeit 2019-04-03 1 RockyDev RockyDev windows 0 Single-player;Multi-player;Co-op;Full controll... Action;Early Access ... 0 0 10000.0 566 0.000000 0.000100 0.155477 -0.000100 -1.000000e-08 7.941587
2 279260 Richard & Alice 2014-06-05 1 Owl Cave Owl Cave windows 0 Single-player;Steam Achievements;Steam Trading... Adventure;Indie ... 0 0 75000.0 2329 0.003520 0.001320 0.142550 0.002200 2.933333e-08 8.025722
3 220090 The Journey Down: Chapter One 2013-01-09 1 SkyGoblin SkyGoblin windows;mac;linux 0 Single-player;Steam Trading Cards Adventure;Indie ... 0 0 350000.0 2841 0.002577 0.000369 0.106653 0.002209 6.310204e-09 6.932395
4 788870 In The Long Run The Game 2018-07-02 1 Zerstoren Games Zerstoren Games windows 0 Single-player;Partial Controller Support Action;Adventure;Indie;Simulation;Strategy;Ear... ... 0 0 10000.0 841 0.000600 0.001100 0.000000 -0.000500 -5.000000e-08 5.819439
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2026 9940 Blade Kitten 2014-05-22 1 Krome Studios Krome Studios windows 0 Single-player;Steam Achievements;Steam Trading... Action;Adventure ... 0 0 75000.0 2343 0.009947 0.001987 0.094750 0.007960 1.061333e-07 6.466403
2027 983880 Arctic Fleet 2019-03-29 1 Coffee-Powered Games Coffee-Powered Games windows;mac 0 Single-player;Steam Achievements Indie;Strategy;Early Access ... 0 0 10000.0 571 0.000700 0.000800 0.000000 -0.000100 -1.000000e-08 6.688753
2028 687760 Occultus - Mediterranean Cabal 2017-11-01 1 Sylphe Labs Microids windows;mac 0 Single-player Adventure ... 0 0 10000.0 1084 0.000400 0.000600 0.000000 -0.000200 -2.000000e-08 5.005775
2029 391290 DeathCrank 2016-12-14 1 Kenny Roy Kenny Roy windows 0 Multi-player;Online Multi-Player;Local Multi-P... Action;Indie;Racing;Early Access ... 0 0 10000.0 1406 0.000500 0.000200 0.000000 0.000300 3.000000e-08 4.987775
2030 944710 TurnTack 2018-12-10 1 Geuluteogi Geuluteogi windows 0 Single-player;Steam Cloud Adventure;Indie;Early Access ... 0 0 10000.0 680 0.000600 0.000100 0.000000 0.000500 5.000000e-08 5.866871

2031 rows × 88 columns
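The clamping step above exists because an unconstrained regressor can emit slightly negative prices, which make no sense for this target. `np.clip` is the vectorised equivalent of the list comprehension; a small sketch on made-up raw predictions:

```python
import numpy as np

# Hypothetical raw regressor output; one value dips below zero
raw = np.array([3.37, -0.42, 7.94])

# np.clip(raw, 0, None) is the vectorised form of [max(0, x) for x in raw]
clipped = np.clip(raw, 0, None)
```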

Exporting output to CSV

In [38]:
df_test[['appid','price']].to_csv('submission.csv',index=False)
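The `index=False` argument matters here: without it, pandas prepends an unnamed index column, which would likely break the expected two-column submission format. A quick in-memory round-trip sketch (values hypothetical) confirms the file layout:

```python
import io
import pandas as pd

# Toy two-column submission; appids/prices are hypothetical
toy = pd.DataFrame({"appid": [514520, 1012710], "price": [3.38, 7.94]})

# Write with index=False and read back to inspect the resulting columns
buf = io.StringIO()
toy.to_csv(buf, index=False)
buf.seek(0)
sub = pd.read_csv(buf)
```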
In [ ]: