Data Mining for Business Intelligence – Answers


Chapter 4: Dimension Reduction

 

4.1

a. The following variables are numerical/quantitative:

calories
protein
fat
sodium
fiber
carbo
sugars
potass
vitamins
weight
cups
rating

The following are ordinal:

shelf

The following are nominal:

mfr, type

b.

Statistic           calories  protein   fat       sodium    fiber     carbo     sugars    potass    vitamins
Mean                106.8831  2.545455  1.012987  159.6753  2.151948  14.80263  7.026316  98.66667  28.24675
Standard Error      2.220421  0.124763  0.114698  9.553577  0.27161   0.448201  0.502266  8.13032   2.546167
Median              110       3         1         180       2         14.5      7         90        25
Mode                110       3         1         0         0         15        3         35        25
Standard Deviation  19.48412  1.09479   1.006473  83.8323   2.383364  3.907326  4.378656  70.41064  22.34252
Sample Variance     379.6309  1.198565  1.012987  7027.854  5.680424  15.26719  19.17263  4957.658  499.1883
Kurtosis            2.370146  1.184656  2.044655  -0.34524  8.647492  -0.33724  -1.15336  1.963826  6.257233
Skewness            -0.44541  0.74583   1.165989  -0.57571  2.431675  0.112726  0.044445  1.400355  2.463704
Range               110       5         5         320       14        18        15        315       100
Minimum             50        1         0         0         0         5         0         15        0
Maximum             160       6         5         320       14        23        15        330       100
Sum                 8230      196       78        12295     165.7     1125      534       7400      2175
Count               77        77        77        77        77        76        76        75        77
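The same summary can be reproduced in a few lines of Python/pandas. A minimal sketch, assuming the book's cereals data is saved locally as Cereals.csv (the file name is an assumption):

import pandas as pd

df = pd.read_csv("Cereals.csv")  # file name assumed

num_cols = ["calories", "protein", "fat", "sodium", "fiber",
            "carbo", "sugars", "potass", "vitamins"]
print(df[num_cols].describe())  # count, mean, std, min/max, quartiles
print(df[num_cols].skew())      # skewness, as in the table above
print(df[num_cols].kurt())      # excess kurtosis, as in the table above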

c.

i)

d. A side-by-side box plot makes little sense here, since there are only three hot cereals in the data. It is not clear what the author intended with this question.

e.

Shelf heights 1 and 3 can be combined, since their distributions are very similar; the sketch below shows one way to check this.
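A minimal pandas/matplotlib sketch, again assuming the cereals data is in Cereals.csv with a shelf column:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("Cereals.csv")          # file name assumed, as above
df.boxplot(column="rating", by="shelf")  # side-by-side box plots per shelf
plt.show()                               # shelves 1 and 3 look alike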

f.

          calories  protein   fat       sodium    fiber     carbo     sugars    potass    vitamins
calories  1
protein   0.033992  1
fat       0.507373  0.202353  1
sodium    0.296247  0.011559  0.000822  1
fiber     -0.29521  0.514006  0.014036  -0.07073  1
carbo     0.270606  -0.03674  -0.28493  0.328409  -0.37908  1
sugars    0.569121  -0.28658  0.287152  0.037059  -0.15095  -0.45207  1
potass    -0.07136  0.578743  0.199637  -0.03944  0.911504  -0.365    0.001414  1
vitamins  0.259846  0.0548    -0.03051  0.331576  -0.03872  0.253579  0.072954  -0.00264  1

i) Potassium and Fiber are very strongly correlated (0.911)

ii) Use PCA to combine correlated variables into a smaller number of components, or simply drop one variable from each highly correlated pair (e.g., keep fiber and drop potass).

iii) Normalized Correlations

          calories  protein   fat       sodium    fiber     carbo     sugars    potass    vitamins
calories  1
protein   0.033992  1
fat       0.507373  0.202353  1
sodium    0.296247  0.011559  0.000822  1
fiber     -0.29521  0.514006  0.014036  -0.07073  1
carbo     0.270606  -0.03674  -0.28493  0.328409  -0.37908  1
sugars    0.569121  -0.28658  0.287152  0.037059  -0.15095  -0.45207  1
potass    -0.07136  0.578743  0.199637  -0.03944  0.911504  -0.365    0.001414  1
vitamins  0.259846  0.0548    -0.03051  0.331576  -0.03872  0.253579  0.072954  -0.00264  1

It is exactly the same. Correlations shouldn’t change when we normalize the data: z-score normalization (subtracting the mean and dividing by the standard deviation) is a linear transformation of each variable, and correlation is invariant to linear transformations. It rescales the data without changing the information it contains. (Note that normalization does not make the data Gaussian; it only puts the variables on a common scale.)
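A minimal sketch demonstrating the invariance, using random stand-in data (the numbers below are not the cereal data):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(77, 3)),
                  columns=["fiber", "potass", "sugars"])

z = (df - df.mean()) / df.std()          # z-score normalization (linear)

# Correlation is unchanged by a linear rescaling of each column.
print(np.allclose(df.corr(), z.corr()))  # True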

g. “The various variables”? The question is unclear as printed; it looks like an editing error in the book.

4.2

a. The variance of column 1 is so much greater because the data is not normalized and proline is measured on a much larger scale (values in the thousands) than the other variables, so it dominates the total variance.

b. Normalizing (z-scoring) would put all variables on the same scale (mean 0, standard deviation 1), so no single variable dominates the principal components.
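A minimal sketch of the effect, using synthetic stand-in data rather than the actual wine data (the column scales are chosen to mimic it):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(13, 1, 100),     # small-scale variable (like alcohol)
    rng.normal(2, 0.5, 100),    # small-scale variable (like malic acid)
    rng.normal(750, 300, 100),  # large-scale variable (like proline)
])

# Unnormalized: the large-scale column swamps the first component.
print(PCA().fit(X).explained_variance_ratio_)

# Normalized: variance is spread across the components.
print(PCA().fit(StandardScaler().fit_transform(X)).explained_variance_ratio_)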

4.3

a.

b. The data should be normalized, since the variables differ vastly in order of magnitude. The key variables are those with large positive or negative loadings in the first few principal components.
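A sketch of how to read the influential variables off the loadings, using hypothetical column names in place of the actual utilities data:

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(22, 4)),
                  columns=["fixed_charge", "RoR", "cost", "sales"])  # names assumed

pca = PCA().fit(StandardScaler().fit_transform(df))

# Rows are variables, columns are components; a large |loading| in the
# first few columns marks the variables that matter most.
loadings = pd.DataFrame(pca.components_.T, index=df.columns,
                        columns=[f"PC{i+1}" for i in range(df.shape[1])])
print(loadings.round(2))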

4.4

a. The categorical variables are Color, Automatic Transmission, No. of Gear Positions, etc. – variables whose values are categories (nominal or ordinal) rather than quantities.

b. The binary dummy values tell us which category each record belongs to.

c. N-1. One dummy can be dropped, because its category is implied when all the other dummies are 0.

d.

Color_Black  Color_Blue  Color_Green  Color_Grey  Color_Red  Color_Silver  Color_Violet  Color_White  Color_Yellow
0            1           0            0           0          0             0             0            0

This row shows that the color is Blue.
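In pandas this encoding is a single call; a minimal sketch with a made-up Color column:

import pandas as pd

cars = pd.DataFrame({"Color": ["Blue", "Black", "Red", "Blue"]})  # toy data

# m categories produce m dummies; drop_first=True keeps the N-1 dummies
# that are actually needed (the dropped category is implied by all zeros).
print(pd.get_dummies(cars["Color"], prefix="Color", drop_first=True))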

e.

Chapter 3: Data Visualization

3.1

a. Notice that the quarterly data is sorted alphabetically, placing all the Q1 data first. To get around this, create two new columns, Quarter and Year, and use the MID function to separate the quarter and year into them (for example, if A2 holds a label like Q1-1985, =MID(A2, 2, 1) returns the quarter and =MID(A2, 4, 4) the year; the exact positions depend on the label format). You could also do this manually, but the whole point is to automate and learn, and manual editing doesn’t scale to large data sets. Once you have the new columns, sort by year first and then by quarter using Excel’s sort, and create a line plot as below.

b. Every fourth quarter, there is a decline.

c. Generally, Q2 and Q3 are higher than Q1 and Q4

d.

e. Use JMP or any other software to do this.

f. Obviously, an interactive visualization tool is better, since you can slice and dice and zoom in and out. Creating the same views in Excel takes a lot more effort. But then you pay more for these specialized data mining and analysis tools.

3.2

b. Using JMP9

c. It is much easier with an interactive visualization tool.

3.3

a. I used Excel 2007 to do this. Here are the steps.

First, you’ll need a list of all unique stores. Since there are no store IDs, we’ll assume that there’s only one store per Store Postcode (a postcode is like a ZIP code in the US). To get this list, use Excel’s Advanced Filter on the Data ribbon.

Then filter on Store Postcode and save the result to an empty column (say column R). Be sure to select “Unique records only” and “Copy to another location”.

Column R will look like this

Store Postcode
SE1 2BN
SW12 9HD
E2 0RY
SW1V 4QQ
SE8 3JD
SW18 1NN
CR7 8LE
NW5 2QH
W4 3PH
SW1P 3AU
E7 8NW
N3 1DH
W10 6HQ
KT2 5AU
N17 6QA
S1P 3AU

Then use the Excel AVERAGEIF function to get the average for each store. The formula should look like this for the first store (here column D holds the Store Postcode, column E the retail price, and R2 the first unique postcode):

=AVERAGEIF($D$2:$D$7957, R2, $E$2:$E$7957)

Drag to copy the formula for the other stores. Now that you have the averages, use an Excel column chart to get

The store in N17 6QA has the highest average at 494.63 and the store in W4 3PH has the lowest at 481.
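For comparison, the same aggregation in Python/pandas; a minimal sketch in which the file name and column names are assumptions:

import pandas as pd

sales = pd.read_csv("LaptopSales.csv")  # file name assumed

# Average selling price per store, like AVERAGEIF above.
avg = sales.groupby("Store Postcode")["Retail Price"].mean()
print(avg.sort_values())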

b. Since XLMiner doesn’t let you plot box plots for more than 5 variables (which is very lame – I have no idea who pays 900 bucks for this software), I used JMP. Here’s what I got.

There is very little you can tell from the box plots, except that the lowest and highest prices for N17 6QA are a little higher than those for W4 3PH, and so is the mean.

3.4

a

i. Not sure what “actually” means here, but laptops are selling between 168 and 890 pounds. Most of them sell for around 500.

ii. Yes, the price changes with time.

iii. Prices are more or less consistent across retail outlets

iv. Price increases with each of the configuration variables chosen below.

Chapter 2: Overview of the data mining process

2.1

a) Supervised Learning

b) Supervised Learning

c) Supervised Learning

d) Unsupervised Learning

e) Supervised Learning

f) Supervised Learning (the assumption here is that similar trouble tickets with their estimates are available for learning, and the estimate is based on such learning)

g) Supervised Learning

h) Supervised Learning

2.2 The validation partition is used to pick the best model (when multiple models are trained on the training data), whereas the test partition provides an estimate of how the chosen model will perform on unknown data.

2.3 It appears to have been sampled randomly, as there are no consistencies within a column or across rows. However, it is not likely to be a useful sample, for three reasons:

  1. OBS# seems to be some sort of serial number; it shouldn’t be included for training and has no bearing on the outcome variable.
  2. There are too many predictor variables and not enough rows of data. The rule of thumb is 10 times the number of predictor variables times the number of outcome classes, which in this case gives 10 * 11 * 2 = 220 records (we’ve excluded OBS#).
  3. There aren’t enough responses with 0 in them (most of them are 1s).

2.4 Our next step should be to get more data where the personal loan was accepted; assuming 0 stands for rejection, every loan in this sample was rejected.

2.5 Zero error on the training data usually indicates that the model has fit the random noise in the training data as well. This means the model will not generalize well to new or unknown data.

2.6 “Refund issued” depends on the outcome variable, which in this case is a successful purchase. We cannot know whether a refund was issued until a purchase is made, so it cannot be used as a predictor.

2.7 The chance that a record has no missing values is (1 – 0.05)^50 ≈ 0.077. So out of 1000 records, only about 77 are likely to have all variables, which means we can expect about 923 records to be removed.
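A quick check of the arithmetic in Python:

p_complete = (1 - 0.05) ** 50           # probability a record has no missing values
print(round(p_complete, 4))             # 0.0769
print(round(1000 * (1 - p_complete)))   # about 923 of 1000 records dropped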

2.8 First calculate the mean and standard deviation of age and income.

Mean of Age = 44.67, Standard Deviation of Age = 14.97

Mean of Income = 98666.67, Standard Deviation of Income = 62867.06

Subtract the column mean from each column and divide by the respective standard deviation to get:

Age       Income
-1.31325  -0.79003
0.75679   0.911977
1.35777   0.005302
-0.84582  1.484614
-0.24484  -0.94909
0.289361  -0.66277
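The same calculation in Python/pandas; the six raw (Age, Income) pairs below are back-calculated from the z-scores and stats above and are an assumption, not values checked against the book:

import pandas as pd

df = pd.DataFrame({
    "Age":    [25, 56, 65, 32, 41, 49],                      # assumed raw values
    "Income": [49000, 156000, 99000, 192000, 39000, 57000],  # assumed raw values
})

# Subtract the column mean, divide by the (sample) standard deviation;
# pandas' std() uses ddof=1, matching the values above.
print(((df - df.mean()) / df.std()).round(6))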

2.9 It doesn’t. All normalizing does is bring the variables to similar scales. Records that are farthest from each other still stay the farthest.

2.10 Model B, because it generalizes better than model A. As a general rule, whichever model does better on the validation set is the one that is considered for deployment.

2.11

a. Manufacturing year is clearly related to age of the car, etc.

b. Change Fuel Type = Diesel to 1 and Petrol to 0. Metallic Color is already a 0 or 1 – nothing to do here.

c. Delete one of the dummy columns; for example, for Fuel Type you can keep Fuel Type = Petrol and delete Fuel Type = Diesel, since the deleted category is implied when the remaining dummies are 0.

ii. The training set is used to fit the various models. The validation set is used to compare the trained models and pick the one that predicts best. The test set gives an indication of how the chosen model will perform on unknown examples.
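A minimal sketch of the three-way partition with scikit-learn (the 50/30/20 split ratios are an assumption):

import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(100).reshape(50, 2), np.arange(50)  # toy data

# First peel off the training set, then split the rest into
# validation and test partitions.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.5, random_state=1)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rest, y_rest, test_size=0.4, random_state=1)
# 0.5 * 0.4 = 0.2 of the data ends up in the test set.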
