### Chapter 4: Dimension Reduction

4.1

a. The following variables are numerical/quantitative: calories, protein, fat, sodium, fiber, carbo, sugars, potass, vitamins, weight, cups, rating.

The following are ordinal: shelf.

The following are nominal: mfr, type.

b.

| Statistic | calories | protein | fat | sodium | fiber | carbo | sugars | potass | vitamins |
|---|---|---|---|---|---|---|---|---|---|
| Mean | 106.8831 | 2.545455 | 1.012987 | 159.6753 | 2.151948 | 14.80263 | 7.026316 | 98.66667 | 28.24675 |
| Standard Error | 2.220421 | 0.124763 | 0.114698 | 9.553577 | 0.27161 | 0.448201 | 0.502266 | 8.13032 | 2.546167 |
| Median | 110 | 3 | 1 | 180 | 2 | 14.5 | 7 | 90 | 25 |
| Mode | 110 | 3 | 1 | 0 | 0 | 15 | 3 | 35 | 25 |
| Standard Deviation | 19.48412 | 1.09479 | 1.006473 | 83.8323 | 2.383364 | 3.907326 | 4.378656 | 70.41064 | 22.34252 |
| Sample Variance | 379.6309 | 1.198565 | 1.012987 | 7027.854 | 5.680424 | 15.26719 | 19.17263 | 4957.658 | 499.1883 |
| Kurtosis | 2.370146 | 1.184656 | 2.044655 | -0.34524 | 8.647492 | -0.33724 | -1.15336 | 1.963826 | 6.257233 |
| Skewness | -0.44541 | 0.74583 | 1.165989 | -0.57571 | 2.431675 | 0.112726 | 0.044445 | 1.400355 | 2.463704 |
| Range | 110 | 5 | 5 | 320 | 14 | 18 | 15 | 315 | 100 |
| Minimum | 50 | 1 | 0 | 0 | 0 | 5 | 0 | 15 | 0 |
| Maximum | 160 | 6 | 5 | 320 | 14 | 23 | 15 | 330 | 100 |
| Sum | 8230 | 196 | 78 | 12295 | 165.7 | 1125 | 534 | 7400 | 2175 |
| Count | 77 | 77 | 77 | 77 | 77 | 76 | 76 | 75 | 77 |

c

i)

d. A side-by-side box plot is not very informative here, because one of the groups (the hot cereals) has only 3 observations. It is unclear what the question intends.

e.

Shelf heights 1 and 3 can be combined, since they are very similar.

f.

| | calories | protein | fat | sodium | fiber | carbo | sugars | potass | vitamins |
|---|---|---|---|---|---|---|---|---|---|
| calories | 1 | | | | | | | | |
| protein | 0.033992 | 1 | | | | | | | |
| fat | 0.507373 | 0.202353 | 1 | | | | | | |
| sodium | 0.296247 | 0.011559 | 0.000822 | 1 | | | | | |
| fiber | -0.29521 | 0.514006 | 0.014036 | -0.07073 | 1 | | | | |
| carbo | 0.270606 | -0.03674 | -0.28493 | 0.328409 | -0.37908 | 1 | | | |
| sugars | 0.569121 | -0.28658 | 0.287152 | 0.037059 | -0.15095 | -0.45207 | 1 | | |
| potass | -0.07136 | 0.578743 | 0.199637 | -0.03944 | 0.911504 | -0.365 | 0.001414 | 1 | |
| vitamins | 0.259846 | 0.0548 | -0.03051 | 0.331576 | -0.03872 | 0.253579 | 0.072954 | -0.00264 | 1 |

i) Potassium and Fiber are very strongly correlated (0.911)

ii) Use PCA to combine the highly correlated variables into a smaller set of components.

iii) Normalized Correlations

| | calories | protein | fat | sodium | fiber | carbo | sugars | potass | vitamins |
|---|---|---|---|---|---|---|---|---|---|
| calories | 1 | | | | | | | | |
| protein | 0.033992 | 1 | | | | | | | |
| fat | 0.507373 | 0.202353 | 1 | | | | | | |
| sodium | 0.296247 | 0.011559 | 0.000822 | 1 | | | | | |
| fiber | -0.29521 | 0.514006 | 0.014036 | -0.07073 | 1 | | | | |
| carbo | 0.270606 | -0.03674 | -0.28493 | 0.328409 | -0.37908 | 1 | | | |
| sugars | 0.569121 | -0.28658 | 0.287152 | 0.037059 | -0.15095 | -0.45207 | 1 | | |
| potass | -0.07136 | 0.578743 | 0.199637 | -0.03944 | 0.911504 | -0.365 | 0.001414 | 1 | |
| vitamins | 0.259846 | 0.0548 | -0.03051 | 0.331576 | -0.03872 | 0.253579 | 0.072954 | -0.00264 | 1 |

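The correlations are exactly the same. They should not change when we normalize the data: normalization only shifts and rescales each variable (to mean 0 and standard deviation 1), and the correlation coefficient is unaffected by that kind of linear rescaling. Normalization does not turn the data into a Gaussian curve, and it does not change the information contained in the data.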
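A quick way to check this claim is to compute a correlation before and after standardizing. The sketch below uses made-up values (they are not taken from the cereal data set), and the helper names `zscore` and `pearson` are just illustrative.

```python
from statistics import mean, stdev

def zscore(values):
    """Standardize a list of numbers to mean 0 and sample standard deviation 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# Made-up illustrative values, NOT the actual cereal measurements.
fiber = [0.0, 1.0, 2.0, 3.0, 5.0, 10.0]
potass = [25.0, 40.0, 95.0, 110.0, 160.0, 330.0]

r_raw = pearson(fiber, potass)                   # correlation on the raw scale
r_std = pearson(zscore(fiber), zscore(potass))   # correlation after z-scoring

# The two values agree up to floating-point rounding.
print(r_raw, r_std)
```

The z-scored lists have a different mean and spread than the raw ones, but their shape (for example, their skewness) is the same, which is why the correlation does not move.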

g. The question asks about "the various variables", which is too vague to answer. The wording of this exercise looks like it was not checked.

4.2

a. The variance in column 1 is so much greater because the data are not normalized, and proline is measured on a much larger scale (in the 1000s) than the other variables, so it dominates the total variance.

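b. Normalization would put all variables on the same scale (mean 0, standard deviation 1), so that no single variable dominates the principal components just because of its magnitude. It does not force the variables onto a normal curve.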
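The effect described in (a) and (b) can be sketched on synthetic data. The example below assumes NumPy is available and uses three invented variables, one of them on a proline-like scale. It is not the wine data set, and `explained_variance_ratio` is a hypothetical helper written for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins: two small-scale variables and one large-scale variable.
small_a = rng.normal(2.0, 0.5, n)
small_b = rng.normal(10.0, 1.0, n)
large = rng.normal(750.0, 300.0, n)  # hypothetical proline-like magnitude
X = np.column_stack([small_a, small_b, large])

def explained_variance_ratio(data):
    """Share of total variance captured by each principal component."""
    cov = np.cov(data, rowvar=False)
    eigvals = np.linalg.eigvalsh(cov)[::-1]  # eigvalsh is ascending; reverse it
    return eigvals / eigvals.sum()

# PCA on the raw data: the large-scale variable dominates the first component.
raw_ratio = explained_variance_ratio(X)

# PCA after standardizing each column to mean 0 and standard deviation 1.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
std_ratio = explained_variance_ratio(Z)

print("raw:         ", raw_ratio.round(4))
print("standardized:", std_ratio.round(4))
```

On the raw data nearly all of the variance lands in the first component, purely because of the units. After standardizing, the variance is spread far more evenly across components.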

4.3

a.

b. The data should be normalized, since the orders of magnitude of the variables are vastly different. The key variables for each component are those with a large positive or negative weight in the first few columns.
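As a rough illustration of reading the weights, the sketch below runs PCA on the correlation matrix (equivalent to PCA on normalized data) and ranks the variables by the absolute size of their weight in the first component. The data and the names `x1`, `x2`, `x3` are invented for the example, and NumPy is assumed.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
base = rng.normal(size=n)

# Hypothetical variables: x1 and x2 share a common factor but have very
# different scales; x3 is independent noise.
x1 = 100.0 * base + rng.normal(scale=10.0, size=n)
x2 = 0.5 * base + rng.normal(scale=0.05, size=n)
x3 = rng.normal(size=n)
names = ["x1", "x2", "x3"]
X = np.column_stack([x1, x2, x3])

# PCA on the correlation matrix, so scale differences do not matter.
corr = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1]   # largest eigenvalue first
weights = eigvecs[:, order]         # column j holds the weights of component j+1

# Rank variables by the absolute size of their weight in the first component.
pc1 = weights[:, 0]
ranked = sorted(zip(names, pc1), key=lambda pair: abs(pair[1]), reverse=True)
for name, weight in ranked:
    print(f"{name}: {weight:+.3f}")
```

The sign of a whole component is arbitrary, so only the relative signs and the magnitudes of the weights within a column carry meaning.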

4.4

a. The categorical variables include Color, Automatic Transmission, No. of Gear positions, etc. These are variables whose values are categories (ordinal or nominal).

b. Each binary (dummy) variable indicates whether a record belongs to the corresponding category (1) or not (0).

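c. N-1. For a variable with N categories, the last category is implied when all of the other N-1 dummy variables are 0.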

d.

| Color_Black | Color_Blue | Color_Green | Color_Grey | Color_Red | Color_Silver | Color_Violet | Color_White | Color_Yellow |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

This row shows that the color is Blue.
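The encoding above, and the N-1 rule from part (c), can be sketched in plain Python. The helper `one_hot` and its `drop_first` flag are hypothetical names for this illustration. The color list matches the column headers above.

```python
def one_hot(value, categories, drop_first=False):
    """Return a dict of 0/1 dummy variables for one categorical value."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    # With drop_first=True the first category becomes the implied baseline.
    kept = categories[1:] if drop_first else categories
    return {f"Color_{c}": int(c == value) for c in kept}

colors = ["Black", "Blue", "Green", "Grey", "Red",
          "Silver", "Violet", "White", "Yellow"]

full = one_hot("Blue", colors)                       # N = 9 dummies
reduced = one_hot("Blue", colors, drop_first=True)   # N-1 = 8 dummies

print(list(full.values()))   # [0, 1, 0, 0, 0, 0, 0, 0, 0]
print(len(reduced))          # 8
```

With `drop_first=True`, a Black car is encoded as all zeros, so the dropped category is still recoverable from the remaining eight columns.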

e.