Preface  vii

1  Introduction  1
|
2  Overview of Supervised Learning  9
  Introduction  9
  Variable Types and Terminology  9
  Two Simple Approaches to Prediction: Least Squares and Nearest Neighbors  11
    Linear Models and Least Squares  11
    Nearest-Neighbor Methods  14
    From Least Squares to Nearest Neighbors  16
  Statistical Decision Theory  18
  Local Methods in High Dimensions  22
  Statistical Models, Supervised Learning and Function Approximation  28
    A Statistical Model for the Joint Distribution Pr(X, Y)  28
    Supervised Learning  29
    Function Approximation  29
  Structured Regression Models  32
    Difficulty of the Problem  32
  Classes of Restricted Estimators  33
    Roughness Penalty and Bayesian Methods  34
    Kernel Methods and Local Regression  34
    Basis Functions and Dictionary Methods  35
  Model Selection and the Bias-Variance Tradeoff  37
  Bibliographic Notes  39
  Exercises  39
|
3  Linear Methods for Regression  41
  Introduction  41
  Linear Regression Models and Least Squares  42
    Example: Prostate Cancer  47
    The Gauss-Markov Theorem  49
  Multiple Regression from Simple Univariate Regression  50
    Multiple Outputs  54
  Subset Selection and Coefficient Shrinkage  55
    Subset Selection  55
    Prostate Cancer Data Example (Continued)  57
    Shrinkage Methods  59
    Methods Using Derived Input Directions  66
    Discussion: A Comparison of the Selection and Shrinkage Methods  68
    Multiple Outcome Shrinkage and Selection  73
  Computational Considerations  75
  Bibliographic Notes  75
  Exercises  75
|
4  Linear Methods for Classification  79
  Introduction  79
  Linear Regression of an Indicator Matrix  81
  Linear Discriminant Analysis  84
    Regularized Discriminant Analysis  90
    Computations for LDA  91
    Reduced-Rank Linear Discriminant Analysis  91
  Logistic Regression  95
    Fitting Logistic Regression Models  98
    Example: South African Heart Disease  100
    Quadratic Approximations and Inference  102
    Logistic Regression or LDA?  103
  Separating Hyperplanes  105
    Rosenblatt's Perceptron Learning Algorithm  107
    Optimal Separating Hyperplanes  108
  Bibliographic Notes  111
  Exercises  111
|
5  Basis Expansions and Regularization  115
  Introduction  115
  Piecewise Polynomials and Splines  117
    Natural Cubic Splines  120
    Example: South African Heart Disease (Continued)  122
    Example: Phoneme Recognition  124
  Filtering and Feature Extraction  126
  Smoothing Splines  127
    Degrees of Freedom and Smoother Matrices  129
  Automatic Selection of the Smoothing Parameters  134
    Fixing the Degrees of Freedom  134
    The Bias-Variance Tradeoff  134
  Nonparametric Logistic Regression  137
  Multidimensional Splines  138
  Regularization and Reproducing Kernel Hilbert Spaces  144
    Spaces of Functions Generated by Kernels  144
    Examples of RKHS  146
  Wavelet Smoothing  148
    Wavelet Bases and the Wavelet Transform  150
    Adaptive Wavelet Filtering  153
  Bibliographic Notes  155
  Exercises  155
  Appendix: Computational Considerations for Splines  160
    B-splines  160
  Appendix: Computations for Smoothing Splines  163
|
|
6  Kernel Methods  165
  One-Dimensional Kernel Smoothers  165
    Local Linear Regression  168
    Local Polynomial Regression  171
  Selecting the Width of the Kernel  172
  Local Regression in IR^p  174
  Structured Local Regression Models in IR^p  175
    Structured Kernels  177
    Structured Regression Functions  177
  Local Likelihood and Other Models  179
  Kernel Density Estimation and Classification  182
    Kernel Density Estimation  182
    Kernel Density Classification  184
    The Naive Bayes Classifier  184
  Radial Basis Functions and Kernels  186
  Mixture Models for Density Estimation and Classification  188
  Computational Considerations  190
  Bibliographic Notes  190
  Exercises  190
|
7  Model Assessment and Selection  193
  Introduction  193
  Bias, Variance and Model Complexity  193
  The Bias-Variance Decomposition  196
    Example: Bias-Variance Tradeoff  198
  Optimism of the Training Error Rate  200
  Estimates of In-Sample Prediction Error  203
  The Effective Number of Parameters  205
  The Bayesian Approach and BIC  206
  Minimum Description Length  208
  Vapnik-Chervonenkis Dimension  210
    Example (Continued)  212
  Cross-Validation  214
  Bootstrap Methods  217
    Example (Continued)  220
  Bibliographic Notes  222
  Exercises  222
|
8  Model Inference and Averaging  225
  Introduction  225
  The Bootstrap and Maximum Likelihood Methods  225
    A Smoothing Example  225
    Maximum Likelihood Inference  229
    Bootstrap versus Maximum Likelihood  231
  Bayesian Methods  231
  Relationship Between the Bootstrap and Bayesian Inference  235
  The EM Algorithm  236
    Two-Component Mixture Model  236
    The EM Algorithm in General  240
    EM as a Maximization-Maximization Procedure  241
  MCMC for Sampling from the Posterior  243
  Bagging  246
    Example: Trees with Simulated Data  247
  Model Averaging and Stacking  250
  Stochastic Search: Bumping  253
  Bibliographic Notes  254
  Exercises  255
|
9  Additive Models, Trees, and Related Methods  257
  Generalized Additive Models  257
    Fitting Additive Models  259
    Example: Additive Logistic Regression  261
    Summary  266
  Tree-Based Methods  266
    Background  266
    Regression Trees  267
    Classification Trees  270
    Other Issues  272
    Spam Example (Continued)  275
  PRIM: Bump Hunting  279
    Spam Example (Continued)  282
  MARS: Multivariate Adaptive Regression Splines  283
    Spam Example (Continued)  287
    Example (Simulated Data)  288
    Other Issues  289
  Hierarchical Mixtures of Experts  290
  Missing Data  293
  Computational Considerations  295
  Bibliographic Notes  295
  Exercises  296
|
10  Boosting and Additive Trees  299
  Boosting Methods  299
    Outline of this Chapter  302
  Boosting Fits an Additive Model  303
  Forward Stagewise Additive Modeling  304
  Exponential Loss and AdaBoost  305
  Why Exponential Loss?  306
  Loss Functions and Robustness  308
  "Off-the-Shelf" Procedures for Data Mining  312
  Example: Spam Data  314
  Boosting Trees  316
  Numerical Optimization  319
    Steepest Descent  320
    Gradient Boosting  320
    Implementations of Gradient Boosting  322
  Right-Sized Trees for Boosting  323
  Regularization  324
    Shrinkage  326
    Penalized Regression  328
    Virtues of the L1 Penalty (Lasso) over L2  330
  Interpretation  331
    Relative Importance of Predictor Variables  331
    Partial Dependence Plots  333
  Illustrations  335
    California Housing  335
    Demographics Data  339
  Bibliographic Notes  340
  Exercises  344
|
|
11  Neural Networks  347
  Introduction  347
  Projection Pursuit Regression  347
  Neural Networks  350
  Fitting Neural Networks  353
  Some Issues in Training Neural Networks  355
    Starting Values  355
    Overfitting  356
    Scaling of the Inputs  358
    Number of Hidden Units and Layers  358
    Multiple Minima  359
  Example: Simulated Data  359
  Example: ZIP Code Data  362
  Discussion  366
  Computational Considerations  367
  Bibliographic Notes  367
  Exercises  369
|
12  Support Vector Machines and Flexible Discriminants  371
  Introduction  371
  The Support Vector Classifier  371
    Computing the Support Vector Classifier  373
    Mixture Example (Continued)  375
  Support Vector Machines  377
    Computing the SVM for Classification  377
    The SVM as a Penalization Method  380
    Function Estimation and Reproducing Kernels  381
    SVMs and the Curse of Dimensionality  384
    Support Vector Machines for Regression  385
    Regression and Kernels  387
    Discussion  389
  Generalizing Linear Discriminant Analysis  390
  Flexible Discriminant Analysis  391
    Computing the FDA Estimates  394
  Penalized Discriminant Analysis  397
  Mixture Discriminant Analysis  399
    Example: Waveform Data  402
  Bibliographic Notes  406
  Exercises  406
|
13  Prototype Methods and Nearest-Neighbors  411
  Introduction  411
  Prototype Methods  411
    K-means Clustering  412
    Learning Vector Quantization  414
    Gaussian Mixtures  415
  k-Nearest-Neighbor Classifiers  415
    Example: A Comparative Study  420
    Example: k-Nearest-Neighbors and Image Scene Classification  422
    Invariant Metrics and Tangent Distance  423
  Adaptive Nearest-Neighbor Methods  427
    Example  430
    Global Dimension Reduction for Nearest-Neighbors  431
  Computational Considerations  432
  Bibliographic Notes  433
  Exercises  433
|
|
14  Unsupervised Learning  437
  Introduction  437
  Association Rules  439
    Market Basket Analysis  440
    The Apriori Algorithm  441
    Example: Market Basket Analysis  444
    Unsupervised as Supervised Learning  447
    Generalized Association Rules  449
    Choice of Supervised Learning Method  451
    Example: Market Basket Analysis (Continued)  451
  Cluster Analysis  453
    Proximity Matrices  455
    Dissimilarities Based on Attributes  455
    Object Dissimilarity  457
    Clustering Algorithms  459
    Combinatorial Algorithms  460
    K-means  461
    Gaussian Mixtures as Soft K-means Clustering  463
    Example: Human Tumor Microarray Data  463
    Vector Quantization  466
    K-medoids  468
    Practical Issues  470
    Hierarchical Clustering  472
  Self-Organizing Maps  480
  Principal Components, Curves and Surfaces  485
    Principal Components  485
    Principal Curves and Surfaces  491
  Independent Component Analysis and Exploratory Projection Pursuit  494
    Latent Variables and Factor Analysis  494
    Independent Component Analysis  496
    Exploratory Projection Pursuit  500
    A Different Approach to ICA  500
  Multidimensional Scaling  502
  Bibliographic Notes  503
  Exercises  504
References  509
Author Index  523
Index  527