Monday, June 21, 2010

To "Log" or Not to "Log"

Many statistical techniques rely upon the assumption that the data follow a normal distribution. In fact, questions surrounding the impact of normality on the validity of a statistical analysis rank right up there with 'how many samples do I need for my results to be meaningful?'. Some statistical techniques derived from an assumption of normality can deal with slight to moderate departures quite well, while others are greatly impacted. For instance, if a response such as bond strength is slightly skewed to the right with a mean well above 0, then a control chart for subgroup averages derived in the typical manner, using 3 sigma limits, should have adequate performance in detecting a large shift in the mean. However, a response that is heavily skewed to the right, with a lower boundary close to 0, for example, will produce unreasonable results using techniques based upon normality.

Let's examine the impact of departures from normality when setting a '3 sigma' upper specification limit based upon process capability, for a response that inherently follows a lognormal distribution. In the figure below, we see a histogram for 50 measurements of a response, which was randomly generated using a lognormal distribution. At first glance it is obvious that the distribution is skewed to the right with a lower bound close to 0, putting the normality assumption in question. However, is it so skewed that it will call into question the accuracy of statistical techniques derived from a normal assumption?

Let's fit a normal distribution to the data and use the normal probability model to determine the value at which the probability of exceeding it is 0.135%, i.e., a '3 sigma' equivalent upper limit. This value is 4.501, with 2% of the actual readings (1 of the 50) above this upper limit.

Now, let's fit the more appropriate lognormal distribution to this data and use the lognormal probability model to determine the value at which the probability of exceeding it is 0.135%, i.e., a '3 sigma' equivalent upper limit. This value is 9.501, with no actual readings above this limit.
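For readers who want to try the two calculations themselves, here is a minimal sketch in Python using only the standard library. It is not the software or the data behind the numbers above: the sample is freshly simulated, and the seed, the lognormal parameters (mu = 0, sigma = 0.75) and the function names are all assumptions chosen for illustration. The lognormal fit here is simply a normal fit to the logs of the data, which is one common way to do it.

```python
import math
import random
from statistics import NormalDist, mean, stdev

TAIL_PROB = 0.00135  # 0.135% upper tail, the '3 sigma' equivalent


def upper_limit_normal(data, tail_prob=TAIL_PROB):
    """Value exceeded with probability tail_prob under a fitted normal."""
    fitted = NormalDist(mean(data), stdev(data))
    return fitted.inv_cdf(1.0 - tail_prob)


def upper_limit_lognormal(data, tail_prob=TAIL_PROB):
    """Same limit under a lognormal: fit a normal to log(data), then
    transform the quantile back to the original scale."""
    logs = [math.log(x) for x in data]
    fitted = NormalDist(mean(logs), stdev(logs))
    return math.exp(fitted.inv_cdf(1.0 - tail_prob))


def fraction_above(data, limit):
    """Share of the observed readings that exceed the limit."""
    return sum(x > limit for x in data) / len(data)


if __name__ == "__main__":
    # Hypothetical data: 50 lognormal readings (seed and parameters assumed).
    rng = random.Random(2010)
    sample = [rng.lognormvariate(0.0, 0.75) for _ in range(50)]

    for name, limit in [
        ("normal", upper_limit_normal(sample)),
        ("lognormal", upper_limit_lognormal(sample)),
    ]:
        share = fraction_above(sample, limit)
        print(f"{name:>9}: limit = {limit:.3f}, readings above = {share:.1%}")
```

Because the sample is simulated, the printed limits will not match 4.501 and 9.501, but the same pattern should generally appear: the lognormal limit sits well above the normal one.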

The upper limits derived from these two distributions are quite different from each other: 9.501 (lognormal) vs. 4.501 (normal). Since we know this response is lognormal, we would be significantly underestimating the '3 sigma' upper limit using the normal assumption, resulting in 2% out-of-spec readings, compared to the 0.135% that we set out to achieve. In a real-world application, this could result in chasing false alarms on a control chart or scrapping product that is well within the distribution of the data. Before moving forward with a statistical analysis of your data, determine the best-fitting distribution by evaluating the histogram, distribution-based quantile plots, and goodness-of-fit tests, and most importantly, by understanding the context of the data. The best distributions to use for many engineering and scientific applications are well known and well documented.
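As a rough numeric stand-in for comparing quantile plots, one can compute the correlation between the sorted readings and the matching standard normal quantiles, once for the raw data and once for their logs. A value closer to 1 means the points fall closer to a straight line. The sketch below is an illustration only, not a formal goodness-of-fit test; the function name and the (i - 0.5)/n plotting positions are assumptions, and the example data are idealized lognormal values constructed for the demonstration.

```python
import math
from statistics import NormalDist, mean


def probability_plot_correlation(data):
    """Correlation between the sorted data and standard normal quantiles.

    Uses plotting positions (i - 0.5) / n for i = 1..n, one common
    convention among several.
    """
    n = len(data)
    ordered = sorted(data)
    std_normal = NormalDist()
    quantiles = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]

    q_bar, x_bar = mean(quantiles), mean(ordered)
    s_qx = sum((q - q_bar) * (x - x_bar) for q, x in zip(quantiles, ordered))
    s_qq = sum((q - q_bar) ** 2 for q in quantiles)
    s_xx = sum((x - x_bar) ** 2 for x in ordered)
    return s_qx / math.sqrt(s_qq * s_xx)


if __name__ == "__main__":
    # Idealized example: 50 values placed exactly on lognormal quantiles.
    n = 50
    readings = [math.exp(NormalDist().inv_cdf((i + 0.5) / n)) for i in range(n)]

    r_normal = probability_plot_correlation(readings)
    r_lognormal = probability_plot_correlation([math.log(x) for x in readings])
    print(f"normal plot correlation:    {r_normal:.4f}")
    print(f"lognormal plot correlation: {r_lognormal:.4f}")
```

A number like this should support, not replace, looking at the plots themselves and understanding where the data came from.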