
Sunday, October 11, 2009

Looks Like a Straight Line to Me

The graph below shows 40 Deflection (in) vs. Load (lb) measurements (open circles) and their least squares fit (blue line). The straight line seems to fit the behavior of the data well, with an almost perfect RSquare of 99.9989%.


(Data available from National Institute of Standards and Technology (NIST))

Do you think the straight line is a good fit for Deflection as a function of Load?


The RSquare tells us that 99.9989% of the variation observed in the Deflection data is explained by the linear relationship, so based on this criterion the fit looks pretty good. However, a single measure like RSquare does not give us the complete picture of how well a model approximates the data. In my previous post I wrote that a model is just a recipe for transforming data into noise. How do we check that what is left behind is noise? Residual plots provide a way to evaluate the residuals (= Data - Model), or what is left after the model is fit.

There are many types of residual plots that are used to assess the quality of a fit. A plot of the (studentized) residuals vs. predicted Deflection, for example, clearly shows that the linear model did not leave behind noise: it failed to account for a quadratic term.
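As a rough illustration (not part of the original JMP analysis), here is a minimal Python sketch using statsmodels. The data are synthetic stand-ins that mimic the load-cell behavior; the variable names and the simulated curvature are assumptions, not the NIST values. It fits the straight line and plots the studentized residuals against the predicted values.

```python
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Synthetic stand-in for the 40 load-cell measurements: a dominant linear
# term, a small quadratic term, and a little measurement noise (all made up).
rng = np.random.default_rng(1)
load = np.linspace(150_000, 3_000_000, 40)
deflection = 7e-7 * load - 6e-15 * load**2 + rng.normal(0, 1e-4, size=40)

# Straight-line model: Deflection ~ Load.
fit = sm.OLS(deflection, sm.add_constant(load)).fit()
print(f"RSquare: {fit.rsquared:.6f}")  # very close to 1 despite the curvature

# Studentized residuals vs. predicted Deflection; a curved pattern means the
# model left a signal behind instead of just noise.
studentized = fit.get_influence().resid_studentized_internal
plt.scatter(fit.fittedvalues, studentized, facecolors="none", edgecolors="b")
plt.axhline(0, linestyle="--", color="gray")
plt.xlabel("Predicted Deflection")
plt.ylabel("Studentized residual")
plt.show()
```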


But based on the RSquare the fit is almost perfect, you protest. A statistical analysis does not exist in isolation; it depends on the context of the data, the questions we need to answer, and the assumptions we make. This data was collected to develop a calibration curve for load cells, for which a highly accurate model is desired. The quadratic model explains 99.9999900179% of the variation in the Deflection data.


The quadratic model increases the precision of the coefficient estimates, and of predictions of future values, by reducing the Root Mean Square Error (RMSE) from 0.002171 to 0.0002052. A plot of the (studentized) residuals vs. Load now shows that what is left behind is just noise.
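To make the RMSE comparison concrete, here is a small self-contained sketch that fits both models and prints their RMSEs and the quadratic model's RSquare. It again uses synthetic stand-in data rather than the actual NIST measurements, so the printed numbers will differ from those quoted above.

```python
import numpy as np
import statsmodels.api as sm

# Same synthetic stand-in data as in the earlier sketch (not the NIST values).
rng = np.random.default_rng(1)
load = np.linspace(150_000, 3_000_000, 40)
deflection = 7e-7 * load - 6e-15 * load**2 + rng.normal(0, 1e-4, size=40)

# Straight-line fit vs. quadratic fit.
fit1 = sm.OLS(deflection, sm.add_constant(load)).fit()
fit2 = sm.OLS(deflection, sm.add_constant(np.column_stack([load, load**2]))).fit()

# RMSE = sqrt(mean squared error of the residuals); adding the quadratic term
# should shrink it by roughly an order of magnitude for data like these.
print(f"Linear RMSE:       {np.sqrt(fit1.mse_resid):.3e}")
print(f"Quadratic RMSE:    {np.sqrt(fit2.mse_resid):.3e}")
print(f"Quadratic RSquare: {fit2.rsquared:.10f}")
```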


For a complete analysis of the Deflection data see Chapter 7 of our book Analyzing and Interpreting Continuous Data Using JMP.


Sunday, September 27, 2009

Data - Model = Noise

A model is just a recipe for transforming data into noise.

We are used to thinking of a statistical model as a representation of our data that can be used to describe its behavior or to predict future values. We fit a statistical model with the hope that it does a good job of extracting the signals in our data. In other words, the goodness of a statistical model can be evaluated by how well it does at leaving behind "just noise".

How good is the model at transforming our data into noise? After the model is fit, the Residuals = Data - Model should behave like white noise, with no predominant signals left in them. Graphical residual analysis provides a way for us to verify our assumptions about the model and to make sure that no predominant signals remain in the residuals. It also lets us evaluate the model's lack of fit.
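As a bare-bones illustration of this idea (not from the original post), the following Python snippet generates made-up data as a known straight-line signal plus noise, fits the line by least squares, and checks that the residuals track the injected noise.

```python
import numpy as np

# Made-up data: a known straight-line signal plus white noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
noise = rng.normal(0, 0.5, size=x.size)
data = 3.0 + 2.0 * x + noise          # Data = signal + noise

# Fit the model by least squares and form Residuals = Data - Model.
coef = np.polyfit(x, data, 1)
model = np.polyval(coef, x)
residuals = data - model

# If the model captured the signal, the residuals should track the noise.
print(np.corrcoef(residuals, noise)[0, 1])  # close to 1
```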

In my next post I will show a calibration curve study in which residual plots helped discover an unaccounted-for signal even though the R-Square was almost 100%.