As a first step in this analysis, we identify the features that may change the cost of a tube assembly quoted by a supplier. We also ask questions about the dataset whose answers will help us choose the right machine learning algorithms. Since we are predicting the cost, a continuous variable, we will use algorithms from the regression family.
To achieve our goal of predicting the cost with good accuracy, we need to answer the following questions.
In this section, we will answer the question: which features are used to determine the cost, and which features should be excluded from the analysis?
As a supplier, we have to think about which tube features the cost will be based on. We know that a tube assembly is made of one or more components. Some numerical tube properties may be helpful to check.
In this section, we will explain which features we keep, which we exclude and why, and how we clean the dataset.
From the dataset, we note that there are a total of 2048 components. Each component appears in exactly one of the comp_[type].csv files. This means that we can create a single Component table by merging those files together. To avoid too many columns, we will remove some features that we do not want in this analysis.
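As a rough illustration, here is a minimal R sketch of this merge; the data directory, the dplyr dependency and the assumption that each file can simply be row-bound are mine, not the actual script.

```r
# Merge the comp_[type].csv files into a single Component table.
comp_files <- list.files("data", pattern = "^comp_.*\\.csv$", full.names = TRUE)
comp_list  <- lapply(comp_files, read.csv, stringsAsFactors = FALSE)

# Each component_id appears in exactly one file, so a row bind suffices;
# dplyr::bind_rows fills the columns missing from a given file with NA.
components <- dplyr::bind_rows(comp_list)
```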
The training and test sets are merged together, with the cost set to 0 for the test rows.
The file bill_of_materials.csv gives us the list of components, with their respective quantities, used to assemble a tube. Thus, to calculate the total weight of each tube, we use the formula \[W_T = \sum_{i = 1}^{n} W_i Q_i\] where \(W_T\) is the total weight of the tube \(T\), \(W = (W_1,\ldots,W_n)\) is the vector of component weights, \(Q = (Q_1,\ldots,Q_n)\) the vector of component quantities, and \(n \leq 8\) the number of possible components used to assemble a tube \(T\).
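A minimal sketch of this computation, assuming bill_of_materials.csv exposes component_id_1..8 and quantity_1..8 columns and that the merged components table has component_id and weight columns:

```r
bom <- read.csv("data/bill_of_materials.csv", stringsAsFactors = FALSE)

# W_T = sum_i W_i * Q_i over the (up to 8) component slots of each assembly.
total_weight <- sapply(seq_len(nrow(bom)), function(r) {
  ids  <- unlist(bom[r, paste0("component_id_", 1:8)])
  qty  <- as.numeric(bom[r, paste0("quantity_", 1:8)])
  keep <- !is.na(ids) & !is.na(qty)
  sum(components$weight[match(ids[keep], components$component_id)] * qty[keep],
      na.rm = TRUE)
})
```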
Let the estimated total volume of a tube assembly be denoted by \(V_T\). The volume is a function of the length, the wall thickness and the diameter of the tube: \[V_T = \pi L t(d - t),\] where \(t\) is the wall thickness, \(d\) the outside diameter and \(L\) the developed length of the tube.
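A corresponding sketch for the volume, assuming tube.csv exposes length, wall and diameter columns:

```r
tube <- read.csv("data/tube.csv", stringsAsFactors = FALSE)

# V_T = pi * L * t * (d - t): annular cross-section area times developed length.
tube$volume <- with(tube, pi * length * wall * (diameter - wall))
```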
Every ID stored as a string in the CSV files (e.g. tube_assembly_id, material, supplier) is converted to a positive integer without the leading zeros. The quote date is converted to a positive integer in the YYYYMMDD format. We keep only the year and month (YYYYMM), since the influence of the day of the month on the cost should be negligible.
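For illustration, a small sketch of both conversions; the helper names are hypothetical:

```r
# Strip the alphabetic prefix (and its leading zeros) from an ID string.
to_int_id <- function(x) as.integer(sub("^[A-Za-z]+-", "", x))
to_int_id("TA-19365")    # 19365

# Reduce a quote date to the YYYYMM integer kept for the analysis.
to_yyyymm <- function(d) as.integer(format(as.Date(d), "%Y%m"))
to_yyyymm("2013-07-15")  # 201307
```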
To query and explore the data efficiently, we create a Caterpillar database with tables, views and indexes. The script to query this database is given in the file DatabaseManipulation.R. The script to batch-insert the data from the CSV files into the database is given in the file DatabaseInsertions.R.
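Those scripts are not reproduced here; the following is only a sketch of what the setup could look like, assuming an SQLite backend via RSQLite (the backend and index are my assumptions):

```r
library(DBI)
library(RSQLite)

# Create (or open) the Caterpillar database and load one table.
con <- dbConnect(SQLite(), "Caterpillar.sqlite")
dbWriteTable(con, "Component", components, overwrite = TRUE)

# Index the lookup key used by the joins.
dbExecute(con, "CREATE INDEX IF NOT EXISTS idx_component_id
                ON Component (component_id)")
```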
We include libraries for the graphs and tables displayed in this document. We also connect to the Caterpillar database for the queries that follow.
We prepare the train and test datasets needed for the analysis.
In this section, we will answer the question: is there a unique mathematical model describing the cost as a function of the quantity for each supplier? The first objective is to check the existence of a mathematical model representing the cost as a function of the quantity. The second objective is to check whether the model is applied by a unique supplier. The last objective is to check whether each supplier has its own model. If uniqueness does not hold, then we have to check whether a model is applied by more than one supplier, or whether a supplier can apply more than one model depending on other features. We will then answer the question: if a mathematical model exists, is it linear or non-linear?
We denote by \(C_{T}(Q)\) the cost of a tube assembly \(T\) quoted by a supplier, with \(\beta\) our learned parameters and \(Q\) the purchase quantity.
We start with a few tube assemblies quoted by the supplier S-0066.
From the graphs, we see that the curves estimating the red points are clearly hyperbolas of equation \[C_{T}(Q) = \frac{\beta_0 - \beta_1}{Q} + \beta_1\] where \(Q \geq 1\) is the quantity for a tube assembly ID \(T\), \(\beta_1\) is the cost at the last purchase level based on quantity and supplier (most of the time \(Q = 250\)), and \(\beta_0\) is the cost at the first purchase level based on quantity and supplier (most of the time \(Q = 1\)). This equation indicates that the more tubes of the same ID Caterpillar buys, the cheaper the cost per tube. This establishes the existence of a mathematical model representing the cost as a function of the quantity.
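If we wanted to fit this hyperbola to the quotes of a single tube assembly, a nonlinear least-squares fit is a natural sketch; quotes_ta is a hypothetical data frame of (quantity, cost) pairs, not an object from the actual analysis:

```r
# Fit C(Q) = (b0 - b1)/Q + b1 to one tube assembly's quotes.
fit <- nls(cost ~ (b0 - b1) / quantity + b1,
           data  = quotes_ta,
           start = list(b0 = max(quotes_ta$cost),   # cost near Q = 1
                        b1 = min(quotes_ta$cost)))  # asymptotic cost
coef(fit)  # fitted beta_0 and beta_1
```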
If we take a look at the rightmost graph, we see that our curve does not seem to fit the points. However, the maximum quantity is 7 (not 250) for this tube, which makes the model less accurate assuming the same model is used. This assumption makes sense since \[\lim_{Q \to \infty} C_{T}(Q) = \beta_1,\] which means that we need to find the right \(\beta_1\) to match any quantity. We also have to find the cost of a single tube, which is \(\beta_0\).
For example, if we take the tube TA-19365, we have \(C_{T}(1) = \beta_0 = 298.7820145446\). Since \(C_{T}(2) = \frac{\beta_0 - \beta_1}{2} + \beta_1 = \frac{\beta_0 + \beta_1}{2} = 156.1959237271\), we get \(\beta_1 = 13.60983291\). Therefore, the model for the tube TA-19365 is \(C_{T}(Q) = \frac{285.172181635}{Q} + 13.60983291\). With \(Q = 7\), we obtain \(C_{T}(7) = 54.348716001\), which has a squared error of \(0.000001422\) from the original cost. With our estimated model, i.e. \(C_{T}(Q) = (244.434490855/Q) + 54.3475236892\), we have \(C_{T}(7) = 89.266736668\), which has a squared error of \(1219.351435073\). Thus, if \(Q\) is small (say \(Q < 25\)), the model may underfit.
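This arithmetic is easy to verify; a small R check of the closed-form recovery:

```r
c1 <- 298.7820145446  # quoted cost at Q = 1
c2 <- 156.1959237271  # quoted cost at Q = 2

b0 <- c1              # C(1) = beta_0
b1 <- 2 * c2 - c1     # from C(2) = (beta_0 + beta_1) / 2

C <- function(Q) (b0 - b1) / Q + b1
C(7)                  # ~54.348716, matching the quoted cost at Q = 7
```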
We verify with a few tube assemblies quoted by the supplier S-0054 whether the same model used for the supplier S-0066 applies.
The model used by the supplier S-0054 seems to be the same as the one used by the supplier S-0066; moreover, if we look carefully at the curves, we see that the greater the quantity, the more accurate the estimate. This means that the model follows the same behaviour as the model used by the supplier S-0066.
This does not seem to be the case for the tube TA-00384 from the supplier S-0064. This supplier provides 6 purchase levels, where the highest quantity is \(Q = 45\). We use the same model as before, but this time the model does not fit the points.
Since the first quantity level is \(Q_0 = 20\), we need to translate the model by subtracting \(Q_0 - 1 = 19\) from \(Q\). This gives the following model: \[C_{T}(Q) = \frac{\beta_0 - \beta_1}{Q - (Q_0 - 1)} + \beta_1\] for all \(Q \geq Q_0\). However, the model still underfits the data, because the cost decreases much more slowly than in our previous tests. Therefore, a model with specific parameters is used to estimate the cost by one or many suppliers, but not by all of them. The general model is nevertheless shared across suppliers, with parameters that must be adjusted to fit each one.
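A sketch of the translated model as an R function (the parameter values in the example call are made up):

```r
# Hyperbola translated so that the first purchase level Q0 maps back to 1.
C_shifted <- function(Q, b0, b1, Q0) (b0 - b1) / (Q - (Q0 - 1)) + b1

C_shifted(20, b0 = 100, b1 = 10, Q0 = 20)  # equals b0 at the first level
```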
We present the machine learning algorithms we will try after the analysis given in the four previous sections. We have seen that many decision trees can be built. Also, per section 4, many models, which are simplifications of the general model, can be used together to predict the cost. This leaves us with two possible ensemble algorithms: Random Forest for regression, or Extreme Gradient Boosting trees for regression (XGBoost).
We identify conditional paths which will tell us whether decision trees will be useful or not. In the previous section, we have seen that the model can underfit if there are not enough quantity purchase levels and if the quantity is small. Otherwise, we can use the model to estimate the cost given a quantity and a tube assembly. Here are a few points that identify some conditions.
Based only on those conditions, we can build many decision trees to help us estimate the cost. Given a feature \(x\), we have to establish a probability \(\mathbb{P}_{k,l}(x)\) for each edge \(l\) of each level \(k\) of the trees.
For example, we can set 3 edges at the second level: \(\mathbb{P}_{2,1}(x) = 0.65\) if \(Q_0 = 1\), \(\mathbb{P}_{2,2}(x) = 0.25\) if \(1 < Q_0 \leq 20\) and \(\mathbb{P}_{2,3}(x) = 0.1\) if \(Q_0 > 20\). Another example would be to set 2 edges at the fourth level: \(\mathbb{P}_{4,1}(x) = 0.7\) if the number of bends is less than 4, and \(\mathbb{P}_{4,2}(x) = 0.3\) if the number of bends is greater than or equal to 4.
Suppose we want to create a tree for each supplier. This gives us a forest of 68 trees. The next level can be whether the tubes have been bent (at least one bend) or not. The next level can be the material used to make a tube, where each material has a certain probability of usage. We could go deeper, but this is enough intuition to show that a random forest algorithm is a good choice to predict the cost.
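The caret call below is a sketch consistent with the resampling output that follows; the tuning grid, seed and data frame name are assumptions, not the exact code used:

```r
library(caret)

set.seed(1)
# 5-fold cross-validated random forest over mtry = 2, 11, 21.
rf_fit <- train(cost ~ ., data = train,
                method    = "rf",
                trControl = trainControl(method = "cv", number = 5),
                tuneGrid  = expand.grid(mtry = c(2, 11, 21)))
rf_fit
```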
## Random Forest
##
## 6045 samples
## 21 predictors
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 4837, 4836, 4837, 4836, 4834
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared RMSE SD Rsquared SD
## 2 18.42026 0.5831277 4.173131 0.07882579
## 11 18.32394 0.5946248 3.424209 0.05150506
## 21 18.74144 0.5999932 3.529462 0.05393226
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 11.
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 11
##
## Mean of squared residuals: 301.6091
## % Var explained: 60.93
##
## Call:
## randomForest(formula = log(cost + 1) ~ ., data = train, nTree = 20)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 7
##
## Mean of squared residuals: 0.04356533
## % Var explained: 93.57
## user system elapsed
## 728.545 0.512 728.786
## IncNodePurity
## fkTubeAssembly 288.7213
## supplierID 417.8784
## quoteDate 337.0331
## anualUsage 1276.4126
## minOrderQuantity 369.3894
## quantity 8663.2446
## totalWeight 1151.4635
## minWeight 575.7186
## maxWeight 1470.2637
## numberOfComponents 190.9143
## diameter 1637.6252
## wallThickness 220.9365
## length 498.0368
## bendRadius 688.3858
## materialID 176.5664
## specs 105.4682
## numberOfBends 174.2690
## maxQty 638.1472
## avgQty 773.4902
## cntQty 314.2224
## id 316.5171
Using all features of the train set as-is to predict the cost with the random forest regression algorithm gives a score of 80% of variance explained. Here is a list of tuning actions done to improve the prediction.
Converting the quote date to the YYYYMM format improved the score to 82.4% of variance explained. Some features did not improve the prediction.
Before training, we use cross-validation to evaluate our error rate (RMSE) and get a better prediction.
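A sketch of an xgb.cv call consistent with the log below; the hyperparameters, matrix names and the log(cost + 1) target are assumptions:

```r
library(xgboost)

# Target is log(cost + 1); train_matrix is the numeric feature matrix.
dtrain <- xgb.DMatrix(data = train_matrix, label = log(train$cost + 1))

params <- list(objective = "reg:linear", eta = 0.02,
               max_depth = 8, subsample = 0.8)

# 5-fold cross-validation over 4000 boosting rounds, reporting RMSE.
cv <- xgb.cv(params = params, data = dtrain,
             nrounds = 4000, nfold = 5, verbose = FALSE)
```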
## train.rmse.mean train.rmse.std test.rmse.mean test.rmse.std names
## 1: 1.853142 0.003802 1.853220 0.014968 1
## 2: 1.817760 0.003633 1.817940 0.014811 2
## 3: 1.783087 0.003550 1.783397 0.014657 3
## 4: 1.749700 0.003524 1.750102 0.014692 4
## 5: 1.717397 0.004220 1.717892 0.013880 5
## ---
## 3996: 0.042117 0.000773 0.169005 0.004636 3996
## 3997: 0.042104 0.000770 0.169002 0.004638 3997
## 3998: 0.042094 0.000769 0.169001 0.004637 3998
## 3999: 0.042084 0.000766 0.169000 0.004638 3999
## 4000: 0.042070 0.000771 0.168997 0.004637 4000
## user system elapsed
## 1321.923 1.040 339.383
We now train the model and predict on the test matrix. Then we inspect the features in ascending order of importance.
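A sketch of the training, prediction and tree-dump steps whose output follows; params, dtrain and the matrix names carry over from the cross-validation sketch above and remain assumptions:

```r
# Train on the full training matrix with the cross-validated round count.
model <- xgb.train(params = params, data = dtrain, nrounds = 4000)

# Predict the test set and undo the log(cost + 1) transform.
dtest <- xgb.DMatrix(data = test_matrix)
pred  <- exp(predict(model, dtest)) - 1

# Text dump of the first booster trees, with gain/cover statistics.
xgb.dump(model, with_stats = TRUE)[1:15]
```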
## Prediction with 4000 trees...
##
## [1] "booster[0]"
## [2] "0:[f5<9.5] yes=1,no=2,missing=1,gain=6314.54,cover=25728"
## [3] "1:[f8<0.5535] yes=3,no=4,missing=3,gain=1202.66,cover=12677"
## [4] "3:[f3<69.5] yes=7,no=8,missing=7,gain=819.873,cover=12108"
## [5] "7:[f5<4.5] yes=15,no=16,missing=15,gain=859.18,cover=10458"
## [6] "15:[f10<26.3] yes=31,no=32,missing=31,gain=305.792,cover=7083"
## [7] "31:[f17<1.5] yes=61,no=62,missing=61,gain=473.819,cover=6527"
## [8] "61:[f3<1.5] yes=101,no=102,missing=101,gain=61.2508,cover=1132"
## [9] "101:leaf=0.0261963,cover=228"
## [10] "102:leaf=0.0380719,cover=904"
## [11] "62:[f5<1.5] yes=103,no=104,missing=103,gain=403.832,cover=5395"
## [12] "103:leaf=0.0554786,cover=2718"
## [13] "104:leaf=0.0444522,cover=2677"
## [14] "32:[f3<15.5] yes=63,no=64,missing=64,gain=66.4415,cover=556"
## [15] "63:[f9<2.5] yes=105,no=106,missing=106,gain=39.4926,cover=365"
## user system elapsed
## 346.489 0.252 91.316
Which of the algorithms gives the best accuracy and why?
The objective of this section is to visualize which features have the biggest influence on the cost of a given tube assembly. We also want to see how accurately the model predicts the cost. We need to show that, when applying our model to the train set, our predictions are close to the real costs. By close, we mean with an accuracy greater than 95%.
The following graph shows, in descending order of importance, the features that influence the cost. As we can see, the purchase quantity has a major impact on the cost of tubes.
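A sketch of how such an importance plot can be produced with xgboost; train_matrix and model are the assumed objects from the sketches above:

```r
# Gain-based feature importance of the trained booster, then plot it.
imp <- xgb.importance(feature_names = colnames(train_matrix), model = model)
xgb.plot.importance(imp)
```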