1 Pre-analysis

As a first step of this analysis, we identify possible features that may change the cost of a tube assembly quoted by a supplier. We also ask questions about the dataset whose answers will help us choose the right machine learning algorithms. Since we are predicting the cost, a continuous variable, we will use algorithms from the regression family.

1.1 Questions

To achieve our goal of predicting the cost with good accuracy, we need to answer the following questions.

  1. What features are useful to determine the cost and what features to exclude from the analysis?
  2. Is there a unique mathematical model describing the cost as a function of the quantity for each supplier?
  3. If there is a mathematical model, is it a linear or non-linear model?
  4. Is there more than one model to estimate the cost, where each model is independent of the others?
  5. Are there weighted decisions that can be made among the features?

2 Candidate Features

In this section, we will answer the question: Which features are useful to determine the cost and which should be excluded from the analysis?

2.1 Tube Physical Properties

As a supplier, we have to think about which tube features the cost is based on. We know that a tube assembly is made of one or more components. Some numerical tube properties may be helpful to examine.

  • The total weight of the tube and some statistical measures like the maximum and average weights
  • The quantity to purchase with the mean, maximum and minimum quantities for each tube
  • The volume which depends on the diameter, the wall thickness and the length of the tube
  • The number of bends used, together with the bend radius. Logically, it is more difficult to bend a tube than to keep it straight, so a bent tube should be more expensive
  • The number of components to assemble a tube. Assembling many components requires welds and connectors, which should make the tube more expensive
  • The material used to make the tube. Some types of material can be much more expensive than others (e.g. Basic Metal Group vs Precious Metal Group)
  • The types of components used to assemble the tube. Some types of components may be more complicated to build because of their shape (e.g. a tee is more complicated to build than a straight tube)

2.2 Supplier Features

  • The date when the supplier quoted the price: when the cost is not adjusted, a quote from 20 years ago is certainly lower than one from today.
  • The suppliers may use different mathematical models to quote their prices. Some may set cheaper costs, some more expensive ones.
  • The supplier itself, since each supplier can use a different model and the suppliers may be treated as classes.

3 Preparing & Cleaning the Dataset

In this section, we will explain which features we chose to keep or exclude and how we will clean the dataset.

From the dataset, we note that there is a total of 2048 components, each appearing in exactly one of the comp_[type].csv files. This means that we can create a single table Component by merging those files together. To avoid too many columns, we will remove some features that we do not want in this analysis.

The training and test sets are merged together, with the cost set to 0 for the test set rows.

The file bill_of_materials.csv gives us the list of components, with their respective quantities, used to assemble a tube. Thus, to calculate the total weight of each tube, we use the formula \[W_T = \sum_{i = 1}^n W_i Q_i\] where \(W_T\) is the total weight of the tube \(T\), \(W = (W_1,\ldots,W_n)\) is the vector of component weights, \(Q = (Q_1,\ldots,Q_n)\) the vector of component quantities and \(n \leq 8\) the number of possible components used to assemble the tube \(T\).
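As a minimal sketch of this computation (the long format and the column names component_id, quantity and weight are assumptions for illustration; the actual schema of bill_of_materials.csv differs):

library(dplyr)

# Total weight per tube: W_T = sum_i W_i * Q_i over the components of T.
totalWeights <- bill_of_materials %>%
  inner_join(components, by = "component_id") %>%
  group_by(tube_assembly_id) %>%
  summarise(totalWeight = sum(weight * quantity, na.rm = TRUE))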

Let the total volume estimation of a tube assembly be denoted by \(V_T\). The volume is a function of the length, the wall thickness and the diameter of the tube: \[V_T = \pi L t(d - t),\] where \(t\) is the wall thickness, \(d\) the outside diameter and \(L\) the developed length of the tube. This follows from multiplying the annular cross-section area \(\frac{\pi}{4}\left(d^2 - (d - 2t)^2\right) = \pi t(d - t)\) by the length.
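A small helper makes the formula concrete (a sketch; units are whatever the dataset uses, presumably millimetres):

# Volume estimate V_T = pi * L * t * (d - t).
tubeVolume <- function(length, wallThickness, diameter) {
  pi * length * wallThickness * (diameter - wallThickness)
}
tubeVolume(1000, 2, 25)  # 1000 mm long, 2 mm wall, 25 mm OD: ~144513 mm^3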

Every ID (e.g. tube_assembly_id, material, supplier, etc.) stored as a string in the CSV files is converted to a positive integer without the leading zeros. The quote date is converted to a positive integer in the format YYYYMMDD; we then keep only the year and month (YYYYMM), since the influence of the day of the month on the cost should be negligible.
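These conversions can be sketched as follows; the ID pattern "<prefix>-<digits>" is an assumption based on the IDs appearing in this document (e.g. TA-19365, S-0066):

# Strip the alphabetic prefix and the leading zeros, keep the integer part.
idToInt <- function(id) as.integer(sub("^[A-Za-z]+-0*", "", id))
idToInt("TA-19365")  # 19365
idToInt("S-0066")    # 66

# Quote date to the integer format YYYYMM (the day is dropped as explained above).
dateToYearMonth <- function(d) as.integer(format(as.Date(d), "%Y%m"))
dateToYearMonth("2013-07-07")  # 201307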

To query and test the data efficiently, we create a database Caterpillar with tables, views and indexes. The script to query this database is given in the file DatabaseManipulation.R. The script to batch-insert the data from the CSV files into the database is given in the file DatabaseInsertions.R.

We include libraries for the graphs and tables displayed in this document. We also connect to the Caterpillar database for the queries that follow.

We prepare the train and test datasets needed for the analysis.

##           used (Mb) gc trigger (Mb) max used (Mb)
## Ncells  512817 27.4     940480 50.3   652250 34.9
## Vcells 1690712 12.9    2692307 20.6  2630291 20.1

4 Cost Models

In this section, we will answer the question: Is there a unique mathematical model describing the cost as a function of the quantity for each supplier? The first objective is to check the existence of a mathematical model representing the cost as a function of the quantity. The second objective is to show whether the model is applied by a unique supplier. The last objective is to show whether each supplier has its own model. If uniqueness does not hold, then we have to check whether a model is applied by more than one supplier or whether a supplier can apply more than one model depending on other features. We will then answer the question: If a mathematical model exists, is it linear or non-linear?

We denote by \(C_{T}(Q)\) the heuristic cost function of a tube assembly \(T\) quoted by a supplier, with \(\beta\) our learning parameters and \(Q\) the purchase quantity.

4.1 Existence of a Mathematical Model

We start with a few tube assemblies quoted by the supplier S-0066.

From the graphs, we see that the curves estimating the red points are clearly hyperbolas of equation \[C_{T}(Q) = \frac{\beta_0 - \beta_1}{Q} + \beta_1\] where \(Q \geq 1\) is the quantity for a tube assembly ID \(T\), \(\beta_1\) is the cost at the last purchase quantity level for the supplier (most of the time \(Q = 250\)), and \(\beta_0\) is the cost at the first purchase quantity level (most of the time \(Q = 1\)). This equation indicates that the more tubes of the same ID Caterpillar buys, the cheaper the cost per tube. This demonstrates the existence of a mathematical model representing the cost as a function of the quantity.
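Under this model, the parameters can be estimated per tube assembly with nonlinear least squares. A minimal sketch, assuming a data frame quotes holding the (quantity, cost) pairs of one tube:

# Fit C(Q) = (b0 - b1)/Q + b1 for one tube assembly.
fit <- nls(cost ~ (b0 - b1) / quantity + b1,
           data  = quotes,
           start = list(b0 = max(quotes$cost), b1 = min(quotes$cost)))
coef(fit)  # estimated beta_0 and beta_1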

If we take a look at the rightmost graph, we see that our curve doesn’t seem to fit the points. However, the maximum quantity is 7 (not 250) for this tube, which makes the model less accurate assuming the same model is used. This assumption makes sense since \[\lim_{Q \to \infty} C_{T}(Q) = \beta_1,\] which means that we need to find the right \(\beta_1\) to match any quantity. We also have to find the cost of one tube, which is \(\beta_0\).

For example, if we take the tube TA-19365, we have \(C_{T}(1) = \beta_0 = 298.7820145446\). We know that \(C_{T}(2) = \frac{\beta_0 + \beta_1}{2} = 156.1959237271 \Leftrightarrow \beta_1 = 13.60983291\). Therefore, the model for the tube TA-19365 is \(C_{T}(Q) = \frac{285.172181635}{Q} + 13.60983291\). With \(Q = 7\), we obtain \(C_{T}(7) = 54.348716001\), which has a squared error of \(0.000001422\) from the original cost. With our estimated model, i.e. \(C_{T}(Q) = (244.434490855/Q) + 54.3475236892\), we have \(C_{T}(7) = 89.266736668\), which has a squared error of \(1219.351435073\). Thus, if \(Q\) is small (say \(Q < 25\)), the model may underfit.
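The arithmetic above can be verified directly:

# Numerical check of the TA-19365 example.
b0   <- 298.7820145446            # C(1)
b1   <- 2 * 156.1959237271 - b0   # from C(2) = (b0 + b1)/2, gives 13.60983291
cost <- function(q) (b0 - b1) / q + b1
cost(7)                           # 54.348716, squared error ~1.4e-06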

4.2 Uniqueness of the Model per Supplier

We verify, with a few tube assemblies quoted by the supplier S-0054, whether the same model used for the supplier S-0066 applies.

The model used by the supplier S-0054 seems to be the same as the one used by the supplier S-0066, but if we look carefully at the curves, we see that the greater the quantity, the more accurate the estimate. This means that the model follows the same behaviour as the model used by the supplier S-0066.

This doesn’t seem to be the case for the tube TA-00384 from the supplier S-0064. This supplier provides 6 purchase levels, where the highest quantity is \(Q = 45\). We use the same model as before, but this time the model doesn’t fit the points.

Since the first quantity level is \(Q_0 = 20\), we need to translate the model by subtracting \(Q_0 - 1 = 19\) from \(Q\). This gives the following model \[C_{T}(Q) = \frac{\beta_0 - \beta_1}{Q - (Q_0 - 1)} + \beta_1\] for all \(Q \geq Q_0\). However, the model still underfits the data because the cost decreases much more slowly than in the model used for our previous tests. Therefore, we can assume that a model with specific parameters is used to estimate the cost by one or many suppliers, but not all. However, the general model is shared by the suppliers, and its parameters may need adjusting to fit the data.
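The translation only changes the denominator; a minimal sketch:

# Shifted hyperbola for quotes whose first quantity level is Q0 > 1
# (Q0 = 20 for TA-00384): the first level plays the role of Q = 1.
costShifted <- function(q, b0, b1, q0) (b0 - b1) / (q - (q0 - 1)) + b1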

5 Machine Learning Algorithms

We present the machine learning algorithms we will try, based on the analysis in the four previous sections. We have seen that many decision trees can be built. Also, per section 4, many models that are simplifications of the general model can be used together to predict the cost. This leaves us with two possible ensemble algorithms: random forests for regression or extreme gradient boosted trees for regression (XGBoost).

We identify conditional paths which will tell us whether decision trees will be useful or not. In the previous section, we have seen that the model can underfit if there are not enough purchase quantity levels and if the quantity is small. Otherwise, we can use the model to estimate the cost given a quantity and a tube assembly. Conditions like these define the splits of the trees.

With such conditions, we can build many decision trees to help us estimate the cost. Given a feature \(x\), we have to establish a probability \(\mathbb{P}_{k,l}(x)\) for each edge \(l\) of each level \(k\) of the trees.

For example, we can set 3 edges at the second level: \(\mathbb{P}_{2,1}(x) = 0.65\) if \(Q_0 = 1\), \(\mathbb{P}_{2,2}(x) = 0.25\) if \(1 < Q_0 \leq 20\) and \(\mathbb{P}_{2,3}(x) = 0.1\) if \(Q_0 > 20\). Another example would be to set 2 edges at the fourth level: \(\mathbb{P}_{4,1}(x) = 0.7\) if the number of bends is less than 4 and \(\mathbb{P}_{4,2}(x) = 0.3\) if the number of bends is greater than or equal to 4.
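These example probabilities can be written down directly; a minimal sketch, with the thresholds and values taken verbatim from the example above:

# P_{2,l}(x): probability of each second-level edge given the first
# quantity level Q0.
edgeProbLevel2 <- function(q0) {
  if (q0 == 1) 0.65 else if (q0 <= 20) 0.25 else 0.10
}
edgeProbLevel2(1)   # 0.65
edgeProbLevel2(45)  # 0.10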

5.1 Random Forest Algorithm

Suppose we want to create a tree for each supplier. This gives us a forest of 68 trees. The next level can be whether the tubes have been bent (at least one bend) or not. The next level can be the material used to make a tube, where each material has a certain probability of usage. We can go deeper, but this already gives a good intuition of why a random forest algorithm is a good choice to predict the cost.
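A minimal sketch of the training runs whose output follows; the data frame train and the 5-fold cross-validation match the output, but the tuning grid and other options are assumptions:

library(caret)
library(randomForest)

# Cross-validated random forest on the raw cost (first output block).
ctrl  <- trainControl(method = "cv", number = 5)
rfRaw <- train(cost ~ ., data = train, method = "rf", trControl = ctrl)

# Second run on the log scale. Note that randomForest() expects `ntree`;
# a misspelled `nTree` argument is silently ignored, which would explain
# why the output reports 500 trees despite nTree = 20 in the call.
rfLog <- randomForest(log(cost + 1) ~ ., data = train, ntree = 500)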

## Random Forest 
## 
## 6045 samples
##   21 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 4837, 4836, 4837, 4836, 4834 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   RMSE SD   Rsquared SD
##    2    18.42026  0.5831277  4.173131  0.07882579 
##   11    18.32394  0.5946248  3.424209  0.05150506 
##   21    18.74144  0.5999932  3.529462  0.05393226 
## 
## RMSE was used to select the optimal model using  the smallest value.
## The final value used for the model was mtry = 11.
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 11
## 
##           Mean of squared residuals: 301.6091
##                     % Var explained: 60.93
## 
## Call:
##  randomForest(formula = log(cost + 1) ~ ., data = train, nTree = 20) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 7
## 
##           Mean of squared residuals: 0.04356533
##                     % Var explained: 93.57
##    user  system elapsed 
## 728.545   0.512 728.786

##                    IncNodePurity
## fkTubeAssembly          288.7213
## supplierID              417.8784
## quoteDate               337.0331
## anualUsage             1276.4126
## minOrderQuantity        369.3894
## quantity               8663.2446
## totalWeight            1151.4635
## minWeight               575.7186
## maxWeight              1470.2637
## numberOfComponents      190.9143
## diameter               1637.6252
## wallThickness           220.9365
## length                  498.0368
## bendRadius              688.3858
## materialID              176.5664
## specs                   105.4682
## numberOfBends           174.2690
## maxQty                  638.1472
## avgQty                  773.4902
## cntQty                  314.2224
## id                      316.5171

Using only the original features of the train set to predict the cost with the random forest regression gives a score of 80% variance explained. Here is the list of feature-engineering and tuning actions done to improve the prediction.

  • Removing the day part of the quote date and keeping the date in the integer format YYYYMM improved the score to 82.4% variance explained.
  • Removing the bracket_pricing field improved the score to 84.56% variance explained.
  • Adding the total weight improved the score to 90.22% variance explained.
  • Adding the diameter of the tube improved the score to 91.96% variance explained.
  • Adding the wall thickness and length of the tube improved the score to 92.75% variance explained.
  • Adding the maximum and average quantity for each tube improved the score to 93.06% variance explained.
  • Adding the maximum weight of components for each tube improved the score to 93.17% variance explained.
  • Adding the component type for each tube improved the score to 93.31% variance explained.
  • Adding the typical bend radius for each tube improved the score to 93.40% variance explained.
  • Adding the number of quantities for each tube improved the score to 93.45% variance explained.
  • Adding the material ID for each tube improved the score to 93.59% variance explained.
  • Adding the number of specs for each tube improved the score to 93.61% variance explained.
  • Adding the number of components for each tube improved the score to 93.67% variance explained.
  • Adding the number of bends for each tube improved the score to 93.72% variance explained.

Some features didn’t improve the prediction.

  • end_a_1x
  • end_a_2x
  • end_x_1x
  • end_x_2x
  • end_a
  • end_x
  • num_bracket
  • other
  • num_boss
  • number of quantity levels per tube
  • bracket_pricing

5.2 Extreme Gradient Boosted Regression Trees

Before training, we use cross-validation to evaluate our error rate (RMSE) and obtain a better prediction.
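A minimal sketch of the cross-validation run whose output follows, assuming a numeric feature matrix trainMatrix; apart from the 4000 rounds visible in the output, the parameter values are assumptions:

library(xgboost)

# Cross-validation on the log-transformed cost.
dtrain <- xgb.DMatrix(data = trainMatrix, label = log(train$cost + 1))
cvRes  <- xgb.cv(params = list(objective = "reg:linear",
                               eta = 0.02, max_depth = 8),
                 data = dtrain, nrounds = 4000, nfold = 5)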

##       train.rmse.mean train.rmse.std test.rmse.mean test.rmse.std names
##    1:        1.853142       0.003802       1.853220      0.014968     1
##    2:        1.817760       0.003633       1.817940      0.014811     2
##    3:        1.783087       0.003550       1.783397      0.014657     3
##    4:        1.749700       0.003524       1.750102      0.014692     4
##    5:        1.717397       0.004220       1.717892      0.013880     5
##   ---                                                                  
## 3996:        0.042117       0.000773       0.169005      0.004636  3996
## 3997:        0.042104       0.000770       0.169002      0.004638  3997
## 3998:        0.042094       0.000769       0.169001      0.004637  3998
## 3999:        0.042084       0.000766       0.169000      0.004638  3999
## 4000:        0.042070       0.000771       0.168997      0.004637  4000
##     user   system  elapsed 
## 1321.923    1.040  339.383

We now train the model and predict with the test matrix. Then we will look at the important features in ascending order of importance.
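A sketch of this step, assuming the test features are in a matrix testMatrix; the hyperparameters are the same assumptions as in the cross-validation sketch above:

# Final training, prediction and tree dump.
bst  <- xgb.train(params = list(objective = "reg:linear",
                                eta = 0.02, max_depth = 8),
                  data = dtrain, nrounds = 4000)
pred <- exp(predict(bst, testMatrix)) - 1  # undo the log(cost + 1) transform

# Text dump of the boosted trees with gain/cover statistics, as printed
# below (the argument is named with.stats in older xgboost releases).
head(xgb.dump(bst, with_stats = TRUE), 15)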

## Prediction with 4000 trees...
## 
##  [1] "booster[0]"                                                    
##  [2] "0:[f5<9.5] yes=1,no=2,missing=1,gain=6314.54,cover=25728"      
##  [3] "1:[f8<0.5535] yes=3,no=4,missing=3,gain=1202.66,cover=12677"   
##  [4] "3:[f3<69.5] yes=7,no=8,missing=7,gain=819.873,cover=12108"     
##  [5] "7:[f5<4.5] yes=15,no=16,missing=15,gain=859.18,cover=10458"    
##  [6] "15:[f10<26.3] yes=31,no=32,missing=31,gain=305.792,cover=7083" 
##  [7] "31:[f17<1.5] yes=61,no=62,missing=61,gain=473.819,cover=6527"  
##  [8] "61:[f3<1.5] yes=101,no=102,missing=101,gain=61.2508,cover=1132"
##  [9] "101:leaf=0.0261963,cover=228"                                  
## [10] "102:leaf=0.0380719,cover=904"                                  
## [11] "62:[f5<1.5] yes=103,no=104,missing=103,gain=403.832,cover=5395"
## [12] "103:leaf=0.0554786,cover=2718"                                 
## [13] "104:leaf=0.0444522,cover=2677"                                 
## [14] "32:[f3<15.5] yes=63,no=64,missing=64,gain=66.4415,cover=556"   
## [15] "63:[f9<2.5] yes=105,no=106,missing=106,gain=39.4926,cover=365"
##    user  system elapsed 
## 346.489   0.252  91.316

5.3 Conclusion

Which of the algorithms gives the best accuracy and why? Comparing the errors on the \(\log(\text{cost} + 1)\) scale, the XGBoost cross-validated test RMSE reaches about 0.169, while the random forest’s mean squared residual of 0.0436 corresponds to an RMSE of about 0.209. XGBoost therefore gives the better accuracy here, likely because each boosted tree corrects the residual errors of the previous ones.

6 Data Visualization

The objective of this section is to visualize which features have the biggest importance on the cost of a given tube assembly. We also want to see how accurately the model predicts the cost. We need to show that, applying our model to the train set, our predictions are close to the real costs. By close, we mean with an accuracy greater than 95%.

The following graph shows the features, in descending order of importance, that influence the cost. As we can see, the purchase quantity has a major importance on the cost of tubes.
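A sketch of how such a graph can be produced from the fitted models of section 5 (the object names rfLog, bst and trainMatrix follow the earlier sketches):

# Random forest importance (IncNodePurity), drawn in decreasing order.
varImpPlot(rfLog, main = "Feature importance (IncNodePurity)")

# XGBoost importance for comparison.
imp <- xgb.importance(feature_names = colnames(trainMatrix), model = bst)
xgb.plot.importance(imp)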