Stat Insights

Music and Data Visualization (What It's All About)

2010-12-07T21:19:00.000-05:00

A couple of weeks ago, mashup artist Girl Talk (Gregg Gillis) released his new album, All Day. If you are not familiar with Girl Talk's music, each track is a mashup of samples from different artists. How do you visualize one of these musical mashups? Benjamin Rahn has come up with an effective and clever way of doing it. If you love music and data visualization check out what Benjamin has done: All Day by Girl Talk - Mashup Breakdown (be aware that the lyrics of some of the sampled tracks are explicit).

Amazing, isn't it? You can not only hear the music but also visualize, in real time, which samples are playing within the mashup. This "little side project", as Benjamin put it on his Twitter feed, takes visualization to another dimension. Hats off to Benjamin! This mashup breakdown got me thinking about getting a copy of the data, and about the possible visualizations I could do with JMP. Fortunately, the data for All Day, as well as the data for Girl Talk's previous album, Feed the Animals, is available from Wikipedia (huge thanks to all the people who have contributed to these pages).

Feed the Animals was released on June 19, 2008, and there have been a few creative visualizations of the music and information in the mashups. Angela Watercutter (Wired 16.09) deconstructed track #4, What It's All About (you can listen to Feed the Animals tracks here), displaying in a circular graph the 35 sample names, their durations, and the pictures of the artists that are part of this 4:14-minute track. Although it is a nice visual of all the artists and songs, it is not easy to see how the sampled tracks line up within the mashup.

Bunny Greenhouse (Chris Beckman) has created video mashups for each of the tracks in Feed the Animals. Here is the one for What It's All About. You get flashes of Beyoncé at the beginning of the video, and around 0:21 seconds you see Busta Rhymes with The Police's "Every Little Thing She Does Is Magic" in the background. By 0:43 seconds The Police drummer Stewart Copeland appears playing his unmistakable driving drum beat, followed by Sting, Andy and Stewart dancing. At 3:33 we see a very young Michael Jackson singing "ABC", blending, around 4 minutes, with Queen's "Bohemian Rhapsody". Now you got me, I can enjoy the music, see which artists are part of the mashup, and when their tracks appear.

In order to contribute to this collection of visualizations I decided to use the Feed the Animals data and concentrate on track 4, What It's All About. For each of the 14 tracks in Feed the Animals, what does the distribution of sampled track lengths look like? Are they similar to each other? Did Girl Talk use mostly short samples? How long are the longest samples? The chart below shows the lengths of sampled tracks, as jittered points, for each of the 14 tracks, along with a boxplot to get a sense for the distribution. I've added a red line to show the overall median (23 seconds) of all the 329 sampled tracks. The distribution of each of the 14 tracks is skewed to the right, and about 2/3 of the samples are 30 seconds or less. The color of the points shows how many times a sample of a given length was used (1 to 4). For example, in "Like This" (track #7), he used four 1-second and four 16-second sampled tracks. Note that most of the red points are in tracks 6, 7 and 8. The outlier for Track 1, "Gimme Some Lovin'" (Spencer Davis Group), shows that Girl Talk favored this sampled track by giving it 2:11 minutes out of the total 4:45 minutes. What It's All About (Track #4) also has a long sample (Busta Rhymes' Woo Hah!! Got You All in Check) lasting 1:15 minutes, or about 30% of the total track.

A nice visualization tool from the genomics world, the cell plot, gives another perspective to the density of sample lengths within a track. A cell plot is a visual representation of a data table, with each cell in the plot representing a data point. The cell plot for What It's All About shows 35 cells, one for each sampled track, with a color shade, from white (short) to dark blue (long), denoting the length of the track. What It's All About kicks in with sampled tracks of lengths between 10 and 20 seconds, followed by the longest track (Woo Hah!! Got You All in Check), the darkest blue cell. Starting with "Every Little Thing She Does Is Magic" (remember second 21 in Bunny Greenhouse's video mashup?), there is a sequence of 6 sampled tracks with lengths between 30 and 55 seconds, the exception (white cell) being "Memory Band" with only 3 seconds. Towards the end we see a sequence of very short sampled tracks, the almost white strip between "What Up Gangsta?" and "Ms. Jackson", followed by the last 4 sampled tracks with lengths around 30 seconds.

What is missing in these plots is the time dimension. One of the nice things about Benjamin's visualization is that one can see where the sampled tracks fall in the overall time sequence of the track, and with respect to each other. We can use the Graph Builder in JMP to create a plot for What It's All About, with sampled track length on the x-axis, the song name on the y-axis, and the start and stop times in a stock-style bar chart. Now it is easier to see that the first 4 sampled tracks have similar lengths and that they occur around the same time. The longest sampled track, "Woo Hah!! Got You All in Check", starts around 0:15 together with "Every Little Thing She Does Is Magic" but it lasts almost twice as long. In the middle of the track there is another long sampled track, "Go!", lasting about 65 seconds. We also see the run of very short sampled tracks towards the end of the track, as we saw in the cell plot. The track ends with about 20 seconds (3:53 to 4:14) of Queen's "Bohemian Rhapsody".

The previous plot is an improvement but music happens over time, dynamically. In order to show the dynamic dimension of time, we can use a bubble plot with bubble trails as I illustrated in my previous post, Visualizing Change with Bubble Plots. Since this visualization does not include the music, I decided to speed things up so you don't have to watch it for the 4:14 minutes that What It's All About lasts. Ready? Hit play.

Now you can see bubbles appearing and disappearing in the order they show up in the track. At 0:21 you can see the red and cyan bubbles corresponding to "Woo Hah!! Got You All in Check" and "Every Little Thing She Does Is Magic". At 0:40 The Cure's "Close to Me" comes in, and at 1:04, when "Every Little Thing She Does Is Magic" drops out, two more sampled tracks appear: "Here Comes the Hotstepper" and "Land of a Thousand Dances". We can also see the 6 very short sampled tracks from the previous plots starting at 3:17. "Bohemian Rhapsody" enters at 3:53, riding the last 21 seconds of the track. Who knows? Maybe for the next version of JMP we'll be able to add sound to the bubble plot mix.

Visualizing Change with Bubble Plots

2010-11-29T21:29:00.000-05:00

In my last post I showed some of the features of JMP's Bubble Plot platform using the US crime rate data (How to Make Bubble Charts) to create a static bubble plot, and the 1973 to 1999 US crime rate data to generate a dynamic bubble plot. Describing the dynamic bubble plot I wrote:

Around 1976 Nevada starts to move away from the rest of the states, with both high burglary and murder rates, reaching a maximum around 1980, and returning to California and Florida levels by 1984. Around 1989 the murder rate in Louisiana starts to increase, reaching 20 per 100,000 by 1993 and staying between 15 and 20 per 100,000 all the way up to 1997, with a fairly constant burglary rate. We can also see that the crime rates for North Dakota are consistently low.

You can see these stories unfold in the animation, but after it is over our brains tend to forget the path a bubble took; the sequence of steps that led to its final position. Fortunately for us, the Bubble Plot platform has an option, Trail Lines, that can help our brains visualize motion. This option can be accessed from the bubble plot contextual menu:

Let's select the bubbles for Nevada, Louisiana, and North Dakota. If you run the animation a trail follows the motion of each of these bubbles. By the end of the sequence, 1999, the plot shows the paths taken by these 3 states.

Now we can clearly see Nevada, the green trail, shooting up to the upper right (high burglary and murder rates), and then coming back. Note how Louisiana (blue line) moves horizontally to the right (higher murder rate), without changing too much in the vertical direction (burglary rate). North Dakota's path (yellow line) is a short zigzag motion, keeping itself around a burglary rate of 435 per 100,000, and a murder rate of 1.18 per 100,000.

In Visualizing Change, data visualization expert Stephen Few discusses four meaningful characteristics of change through time: magnitude, shape, velocity and direction. These four characteristics are easier to visualize by using Trail Bubbles in addition to Trail Lines. The plot below shows the Trail Lines and Trail Bubbles for Louisiana, Nevada, and North Dakota. To help the eye, I've added labels for the starting year of 1973.

The magnitude of change can be assessed by looking at the difference between bubble locations. For Nevada, between 1973 and 1980, you can see big changes in the burglary rate, from about 2000 to 3000 per 100,000, and the murder rate, from 12 to 20 per 100,000. By 1999 Nevada's burglary rate had been cut in half to 1000 per 100,000. The shape of change is given by the overall shape of the bubbles, while the direction and velocity of change can be visualized by the trend of the trails and the rate at which a bubble moves from one place to the next. For Nevada, the shape of change is somewhat concave, with rapid changes (big jumps from one bubble to the next), trending upward and downward along the 45° diagonal.

Louisiana's burglary rate did not change much (vertical changes), but its murder rate went up to 20 per 100,000, ending at 10 per 100,000, lower than where it started (horizontal changes). The changes did not seem to occur rapidly, because the distance between the bubbles is small. As we saw before, there are not a lot of changes in North Dakota. Its shape is a circle with a small radius; i.e., neither big nor rapid changes in either murder or burglary rates (the last bubble is almost where it started).

A JMP bubble plot, with line and bubble trails, can really change the way you visualize change. Go ahead, give it a try.

Visualizing Data with Bubble Plots

2010-11-23T23:58:00.000-05:00

Bubble plots are a great way of displaying 3 or more variables using an X-Y scatter plot, and are a useful diagnostic tool for detecting outliers and influential observations in both logistic regression (Hosmer and Lemeshow used them in their 1989 book Applied Logistic Regression), and in multiple linear regression (What If Einstein Had JMP). New technologies have made it possible to animate the bubbles according to a given variable, such as time, as was masterfully demonstrated by Hans Rosling in his talk, New Insights on Poverty, at the March 2007 TED conference.

Today, Nathan Yau posted an entry in his data visualization blog, FlowingData, on How to Make Bubble Charts using R. He gives 5 steps (6 if you count step 0), and the corresponding R code, to create a static bubble plot that shows the 2008 US burglary rate vs murder rate for each of the 50 states, with red bubbles representing the state population.

(From http://flowingdata.com/2010/11/23/how-to-make-bubble-charts/5-edited-version-2/)

Let me show you how easy it is to create static and dynamic bubble plots in JMP. The 2008 crime rate data is available at http://datasets.flowingdata.com/crimeRatesByState2008.csv, and can be conveniently read into JMP version 9 using File>Internet Open, as shown below.

To create a bubble plot we select Graph>Bubble Plot to bring up the Bubble Plot launch panel. Here we select Burglary as the Y, Murder as the X, and Population as the Sizes. The bubbles in Nathan's plot are red and are labeled with the state name. In order to color and label the bubbles we select State for both ID and Coloring. These selections are shown below

Once you click OK our multicolor bubble plot appears (I have modified the axis to match Nathan's plot). We quickly see that Louisiana and Maryland have the highest murder rates, and similar population sizes, and that North Carolina has the highest burglary rate.

In Step 3 Nathan shows how to size the bubbles by making the radius a function of the area of the bubble. Below the X-axis in JMP's bubble plot there is a slider to dynamically control the size of the bubble. You can just move the slider to the right for larger bubbles, or to the left to decrease their size. Very easy; no code required.
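For readers who do want to see the idea behind the sizing step in code, here is a minimal Python sketch (not Nathan's R code, and not what JMP does internally). It assumes the goal is for bubble area, rather than radius, to be proportional to the value being displayed. The function name `bubble_radius`, the `scale` parameter, and the two population values are hypothetical and only for illustration.

```python
from math import pi, sqrt

def bubble_radius(value, scale=1.0):
    """Return the radius of a circle whose area equals scale * value."""
    return sqrt(scale * value / pi)

# Hypothetical populations, chosen so one is four times the other.
populations = {"State A": 1_000_000, "State B": 4_000_000}

radii = {state: bubble_radius(pop) for state, pop in populations.items()}

# Four times the population gives four times the area, so twice the radius.
ratio = radii["State B"] / radii["State A"]
print(f"radius ratio = {ratio:.2f}")
```

Scaling the radius directly by population would instead make the larger bubble look sixteen times bigger in area, which is why the square root is used.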

The static bubble plot above is a snapshot of the crime rates in 2008. What if we want to visualize how the burglary and murder rates changed over the years? In JMP, a time variable can be used to animate the bubbles. We use the same Bubble Plot selections as before but now we add Year as the Time variable.

Several stories now emerge from this dynamic plot. Around 1976 Nevada starts to move away from the rest of the states, with both high burglary and murder rates, reaching a maximum around 1980, and returning to California and Florida levels by 1984. Around 1989 the murder rate in Louisiana starts to increase, reaching 20 per 100,000 by 1993 and staying between 15 and 20 per 100,000 all the way up to 1997, with a fairly constant burglary rate. We can also see that the crime rates for North Dakota are consistently low, and that by 1999 all the states seem to form a more cohesive group moving towards the lower left corner.

Bubble plots can be animated using other variables, not necessarily time. I have used the dynamic bubble plot to show how the relationship between material degradation and time changes from linear to nonlinear as a function of temperature. In the video below you can see that as the temperature increases from 9°C to 50°C the material degrades faster, and that for higher temperatures, 40°C and 50°C, the degradation is nonlinear. This is a nice visual that helps convey the message without the need to show the model equations.

With JMP's static and dynamic bubble plots you can easily display up to 6 variables (seven using ID2) in the 2-dimensional space of a scatter plot. What an efficient way of visualizing data!

Analysis of Means: A Graphical Decision Tool

2010-10-25T23:06:00.000-04:00

JMP version 9 has been out for about two weeks now, and I hope you had a chance to play with it. If you are not ready to buy it you can give it a try by downloading a 30-day trial copy.

Today I want to share with you a new feature in JMP version 9: the Analysis of Means (ANOM). An analysis of means is a graphical decision tool for comparing a set of averages with respect to their overall average. You can think of it as a control chart but with decision limits instead of control limits, or as an alternative to an analysis of variance (ANOVA). In an ANOVA a significant F test just indicates that the means are different, but it does not reveal where the differences are coming from. By contrast, in an ANOM chart if an average falls outside the decision limits it is an indication that this average is statistically different, with a given risk α, from the overall average.

Prof. Ellis Ott introduced the analysis of means in 1967 as a logical extension of the Shewhart control chart. Let's look at an example. The plot below shows measurements of an electrical assembly as a function of six different types of ceramic sheets used in their construction. The data appears in Table 13.1 of the first edition of Prof. Ott's book Process Quality Control.

One can see some differences in the average performance of the six ceramic sheets. A Shewhart Xbar and R chart shows that the ranges are in control, indicating consistency within a ceramic sheet, but that the average of ceramic sheet #6 is below the lower control limit. Based on this we can say that there is probably an assignable cause responsible for this low average, but we cannot claim any statistical significance.

The question of interest, quoting from Prof. Ott's book, is: "Is there evidence from the sample data that some of the ceramic sheets are significantly different from their own group average?". We can perform an analysis of variance to test the hypothesis that the averages are different. The F test is significant at the 5% level, indicating that the average electrical performance of the six ceramic sheets is not the same. The F test, being an 'omnibus' type test, does not, however, tell us which one, or which ones, are different. For this we need to perform multiple comparison tests, or an analysis of means.

An ANOM chart can be easily generated by selecting Analysis of Means Methods > ANOM from within the Analyze>Fit Y by X>Oneway Analysis window.

The ANOM chart clearly reveals that the assemblies built using the ceramic sheet #6 have an average that is (statistically) lower than the overall average of 15.952. The other five averages are within the 5% risk decision limits, indicating that their electrical performance can be assumed to be similar.

The ANOM chart, with decision limits 15.15 and 16.76, provides a graphical test for simultaneously comparing the performance of these six averages. What a great way to perform the test and communicate its results. Next time you need to decide which average, or averages, are (statistically) different from the overall average, give the ANOM chart a try.

JMP Discovery 2010

2010-09-21T21:41:00.000-04:00

Last week I attended the JMP Discovery 2010 conference. What a great way to learn about the new features coming up in version 9, to network, to see old friends and make new ones, and to enjoy the many keynote speakers like Dan Ariely who talked about how predictably irrational we are. I also had the opportunity to lead a breakout session on Tailor-Made Split-Plot Designs, a short tutorial on JMP® Custom Design.

For us, the conference marked the one-year anniversary of the publication of our book, and a time for celebration. We have received very positive feedback from our readers; Prof. Phil Ramsey of the University of New Hampshire used our book in his summer course "Statistics for Engineers and Scientists"; and, to top it all off, we won the 2009-2010 International Technical Publications Competition (ITPC) Award of Excellence, awarded by the Society for Technical Communication.

John Sall, co-founder and Executive Vice President of SAS, gave a one-hour demo showcasing some of the new features of JMP 9. Windows users will discover a new look, while Excel users can now leverage the profiler to optimize and simulate worksheet formulas. R users will be able to run R code from within JMP, and data miners will find a new array of tools. From the engineering and science side there are many enhancements to existing tools, as well as some new ones. Here are some of the new things:

Fit Y by X platform. Analysis of Means (ANOM) has been added when the X variable is categorical. ANOM is a graphical way for comparing means to the overall mean. Think of it as a control chart with decision limits.
Sample Size calculator. A Reliability Test Plan is now available to design reliability studies. There is also a Reliability Demonstration calculator for planning a reliability demonstration study.
Life Distribution. Several new distributions have been added including a 3-parameter Weibull distribution.
Degradation. A new tool for performing degradation analyses within the Analyze>Reliability and Survival platform.
Accelerated Life Test Design. Within the DOE platform to design ALT plans for one and two accelerating factors.
Graph Builder. Several new additions including the ability to use custom maps.

I am looking forward to sharing with you examples of how to use these new tools. The scheduled date for the release of JMP 9 is October 12. Stay tuned.

Tolerance (Coverage) Intervals

2010-06-28T22:56:00.000-04:00

You are probably familiar with confidence intervals as a way to place bounds, with a given statistical confidence, on the parameters of a distribution. In my post I Am Confident That I Am 95% Confused!, I used the DC resistance data shown below to illustrate confidence intervals and the meaning of confidence.

From the JMP output we can see that the average resistance for the sample of 40 cables is 49.94 Ohm and that, with 95% confidence, the average DC resistance can be as low as 49.31 Ohm, or as large as 50.57 Ohm. We can also compute a confidence interval for the standard deviation. That's right, even the estimate of noise has noise. This is easily done by selecting Confidence Interval > 0.95 from the contextual menu next to DC Resistance

We can say that, with 95% confidence, the DC resistance standard deviation can be as low as 1.61 Ohm, or as large as 2.52 Ohm.

Although confidence intervals have many uses in engineering and scientific applications, a lot of times we need bounds not on the distribution parameters but on a given proportion of the population. In her post, 3 Is The Magic Number, Brenda discussed why the mean ± 3×(standard deviation) formula is popular for setting specifications. For a Normal distribution, if we know the mean and standard deviation without error, we expect about 99.73% of the population to fall within mean ± 3×(standard deviation). In fact, for any distribution Chebyshev's inequality guarantees that at least 88.89% of the population will be contained within ± 3×standard deviations of the mean. It gets better. If the distribution is unimodal we expect at least 95.06% of the population to be within ± 3×standard deviations.
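The three coverage figures quoted above can be checked with a few lines of Python. This is a minimal sketch using only the standard library. The 88.89% bound is Chebyshev's 1 - 1/k², and the 95.06% figure matches the Vysochanskij-Petunin bound 1 - 4/(9k²) for unimodal distributions, which is my reading of where that number comes from; the post does not name it. The function names are hypothetical.

```python
from statistics import NormalDist

def chebyshev_coverage(k):
    """Lower bound on the coverage within k standard deviations, any distribution."""
    return 1 - 1 / k**2

def unimodal_coverage(k):
    """Vysochanskij-Petunin lower bound for unimodal distributions (k > sqrt(8/3))."""
    return 1 - 4 / (9 * k**2)

def normal_coverage(k):
    """Exact coverage within k standard deviations for a Normal distribution."""
    return 2 * NormalDist().cdf(k) - 1

k = 3
print(f"Chebyshev (any distribution): {chebyshev_coverage(k):.2%}")
print(f"Unimodal bound:               {unimodal_coverage(k):.2%}")
print(f"Normal distribution:          {normal_coverage(k):.2%}")
```

All three assume the mean and standard deviation are known without error, which is exactly the assumption that a tolerance interval relaxes.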

In real applications, however, we seldom know the mean and standard deviation without error. These two parameters have to be estimated from a random and, usually small, representative sample. A tolerance interval is a statistical coverage interval that includes at least a given proportion of the population. This type of interval takes into account both the sample size and the noise in the estimates of the mean and standard deviation. For normally distributed data an approximate two-sided tolerance interval is given by

$$\bar{X} \pm g(1-\alpha/2,\, p,\, n) \cdot s,$$

where X̄ is the sample mean and s is the sample standard deviation.

Here g(1-α/2,p,n) takes the place of the magic "3", and is a function of the confidence, 1-α/2, the proportion p that we want the interval to cover (you can think of this as the yield that we want to bracket), and the sample size n. These intervals are readily available in JMP by selecting Tolerance Interval within the Distribution platform, and specifying the confidence and the proportion of the population to be covered by the interval.

For the DC resistance data the mean ± 3×(standard deviation) interval is 49.94±3×1.96 = [44.06 Ohm;55.82 Ohm], while the 95% tolerance interval to cover at least 99.73% of the population is [42.60 Ohm;57.28 Ohm]. The tolerance interval is wider than the mean ± 3×(standard deviation) interval, because it accounts for the error in the estimates of the mean and standard deviation, and the small sample size. Here we use a coverage of 99.73% because this is what is expected between ± 3×standard deviations, if we knew the mean and sigma.
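To get a feel for how large the g factor is for these data, here is a rough Python sketch. It is not JMP's implementation: it assumes Howe's approximation for the two-sided normal tolerance factor, and it replaces the chi-square quantile with the Wilson-Hilferty approximation so that only the standard library is needed. The function names are hypothetical; the mean, standard deviation, and sample size are the rounded values quoted above.

```python
from math import sqrt
from statistics import NormalDist

def chi2_quantile_wh(prob, df):
    """Wilson-Hilferty approximation to the chi-square quantile (assumption)."""
    z = NormalDist().inv_cdf(prob)
    c = 2 / (9 * df)
    return df * (1 - c + z * sqrt(c)) ** 3

def tolerance_factor(confidence, proportion, n):
    """Howe's approximate two-sided tolerance factor for normal data (assumption)."""
    df = n - 1
    z = NormalDist().inv_cdf((1 + proportion) / 2)
    chi2_low = chi2_quantile_wh(1 - confidence, df)
    return z * sqrt(df * (1 + 1 / n) / chi2_low)

# Rounded summary statistics for the DC resistance data quoted in the post.
mean, sd, n = 49.94, 1.96, 40

g = tolerance_factor(confidence=0.95, proportion=0.9973, n=n)
lower, upper = mean - g * sd, mean + g * sd
print(f"g = {g:.2f}, interval = [{lower:.2f} Ohm; {upper:.2f} Ohm]")
```

Under these assumptions the factor comes out noticeably larger than 3, and the resulting interval lands close to the [42.60 Ohm;57.28 Ohm] reported by JMP; small differences are expected from the rounded inputs and the approximations.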

Tolerance intervals are a great practical tool because they can be used to set specification limits, as a surrogate for specification limits, or to compare them to a given set of specification limits; e.g., from our customer, to see if we will be able to meet them.

To "Log" or Not to "Log"

2010-06-21T19:48:00.001-04:00

Many statistical techniques rely upon the assumption that the data follows a normal distribution. In fact, questions surrounding the impact of normality on the validity of a statistical analysis rank right up there with 'how many samples do I need for my results to be meaningful?'. Some statistical techniques derived from an assumption of normality of the data can deal with slight to moderate departures quite well, while others are greatly impacted. For instance, if a response such as bond strength is slightly skewed to the right with a mean well above 0 then a control chart for subgroup averages derived in the typical manner, using 3 sigma limits, should have adequate performance in detecting a large shift in the mean. However, a response which is heavily skewed to the right, with a lower boundary close to 0, for example, will produce unreasonable results using techniques based upon normality.

Let's examine the impact of departures from normality when setting a '3 sigma' upper specification limit based upon process capability, for a response that inherently follows a lognormal distribution. In the figure below, we see a histogram for 50 measurements of a response, which was randomly generated using a lognormal distribution. At first glance it is obvious that the distribution is skewed to the right with a lower bound close to 0, putting the normality assumption in question. However, is it so skewed that it will call into question the accuracy of statistical techniques derived from a normal assumption?

Let's fit a normal distribution to the data and use the normal probability model to determine the value at which the probability of exceeding it is 0.135%, i.e., a '3 sigma' equivalent upper limit. This value is 4.501, with 2% of the actual readings above this upper limit.

Now, let's fit the more appropriate lognormal distribution to this data and use the lognormal probability model to determine the value at which the probability of exceeding it is 0.135%, i.e., a '3 sigma' equivalent upper limit. This value is 9.501, with no actual readings above this limit.

The upper limits derived from these two distributions are quite different from each other; 9.501 (lognormal) vs. 4.501 (normal). Since we know this response is lognormal, we would be significantly underestimating the '3 sigma' upper limit using the normal assumption, resulting in 2% out-of-spec readings, compared to 0.135% that we set out to achieve. In a real world application, this could result in chasing false alarms on a control chart or scrapping product that is well within the distribution of the data. Before moving forward with a statistical analysis of your data, determine the best fitting distribution by evaluating the histogram, distribution based quantile plots, goodness of fit tests, and most importantly, understanding the context of the data. The best distributions to use for many engineering and scientific applications are well known and well documented.
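The same effect can be illustrated without any data at all. The Python sketch below takes a lognormal population with assumed parameters (mu = 0 and sigma = 0.75 on the log scale; these are my choices for illustration, not the parameters used to generate the data above) and compares the '3 sigma' upper limit from a normal model with the one from the lognormal model.

```python
from math import exp, log, sqrt
from statistics import NormalDist

# Assumed log-scale parameters of the lognormal population (illustrative only).
mu, sigma = 0.0, 0.75

# Normal quantile with 0.135% of the area above it (about 3).
z = NormalDist().inv_cdf(1 - 0.00135)

# Mean and standard deviation of the lognormal on the original scale.
mean = exp(mu + sigma**2 / 2)
sd = mean * sqrt(exp(sigma**2) - 1)

# Upper limit if we (wrongly) treat the response as normal.
normal_limit = mean + z * sd

# Upper limit from the lognormal model itself.
lognormal_limit = exp(mu + z * sigma)

# True fraction of the lognormal population above the normal-based limit.
exceed = 1 - NormalDist().cdf((log(normal_limit) - mu) / sigma)

print(f"normal-based limit:    {normal_limit:.2f}")
print(f"lognormal-based limit: {lognormal_limit:.2f}")
print(f"true fraction above the normal-based limit: {exceed:.2%}")
```

With these assumed parameters the normal-based limit is roughly half the lognormal one, and close to 2% of the population falls above it instead of the intended 0.135%, which is the same pattern seen in the example.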

Effective Sample Size

2010-05-24T22:48:00.000-04:00

It is Monday morning and you are traveling from home to work, a 20-mile journey, when you hit a traffic jam. Things are slow, traffic is doing about 20 mph; you start getting anxious about that meeting they scheduled for 8. After 10 miles of slow traffic the road suddenly clears up and you decide to travel the remaining 10 miles at a speed of 80 mph. You are late to the meeting and tell your boss what happened. She listens quietly and asks, "What was your average speed?"

Some people, when they hear the word average, automatically think of adding the two numbers, 20+80, and dividing the result by 2. The arithmetic mean of these two speeds is 50 mph. However, miles per hour (mph) is a ratio of two quantities for which the arithmetic average is not appropriate. The first 10 miles, traveling at 20 mph, took you 30 minutes, while the last 10 miles, traveling at 80 mph, took you 7.5 minutes. Since you traveled 20 miles in 37.5 minutes your average speed is 20*(60/37.5) = 32 mph.
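The arithmetic is easy to verify in Python. This small sketch recomputes the trip from distance and time, and cross-checks it against `statistics.harmonic_mean` from the standard library; the variable names are just for illustration.

```python
from statistics import harmonic_mean, mean

speeds_mph = [20, 80]   # speed on each leg of the trip
leg_miles = 10          # both legs have the same length

# Time spent on each leg, in hours: 0.5 h and 0.125 h.
hours = [leg_miles / s for s in speeds_mph]

# Average speed = total distance / total time.
avg_speed = leg_miles * len(speeds_mph) / sum(hours)

print(f"total time:      {sum(hours) * 60:.1f} minutes")
print(f"average speed:   {avg_speed:.0f} mph")
print(f"arithmetic mean: {mean(speeds_mph):.0f} mph")
print(f"harmonic mean:   {harmonic_mean(speeds_mph):.0f} mph")
```

The harmonic mean only matches the true average speed here because both legs cover the same distance; with unequal distances a distance-weighted version would be needed.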

The "average" of these two speeds is given not by the arithmetic mean but by the harmonic mean: the product of the two numbers divided by the arithmetic average of the two. The "average" speed is then [20x80] / [(20+80)/2] = 32 mph. In general, for a set of n positive numbers x₁, ..., xₙ, the harmonic mean H resembles the arithmetic average but in an inverted sort of way,

$$H = \frac{n}{\dfrac{1}{x_1} + \dfrac{1}{x_2} + \cdots + \dfrac{1}{x_n}}$$

The harmonic mean is useful for determining the effective sample size when comparing two population means. This is because the precision of the average, X̄, is given by its standard deviation (the standard error), which is inversely related to the square root of the sample size. The effective sample size for comparing two means with sample sizes n₁ and n₂ is given by

$$n_{\text{eff}} = \frac{n_1 n_2}{(n_1 + n_2)/2} = \frac{2\, n_1 n_2}{n_1 + n_2}$$

Let's say you are conducting a study to compare two products, A and B, using a two-sample t-test, and you take a random sample of 4 product A units, and a random sample of 12 product B units (since you want to be "sure" about product B you take "extra" samples). The effective sample size for comparing the average of the 4 product A samples with the average of the 12 product B samples is then [4x12]/[(4+12)/2] = 6. What this means is that your study with 16 samples is equivalent to a study with 6 samples each from populations A and B, for a total of 12 samples. In other words, you are using 4 extra samples.
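A short Python sketch makes the equivalence concrete. It assumes both populations share the same standard deviation, so the standard error of the difference between the two averages is proportional to sqrt(1/n₁ + 1/n₂). The function name `effective_sample_size` is hypothetical.

```python
from math import sqrt

def effective_sample_size(n1, n2):
    """Harmonic mean of the two sample sizes."""
    return 2 * n1 * n2 / (n1 + n2)

n_a, n_b = 4, 12
n_eff = effective_sample_size(n_a, n_b)

# Standard error multiplier for the difference of two means (common sigma).
se_unbalanced = sqrt(1 / n_a + 1 / n_b)
se_balanced = sqrt(1 / n_eff + 1 / n_eff)

print(f"effective sample size per group: {n_eff:.0f}")
print(f"SE factor with 4 and 12 units:   {se_unbalanced:.4f}")
print(f"SE factor with 6 and 6 units:    {se_balanced:.4f}")
```

The two standard error factors agree, so the unbalanced study with 16 units compares the means no more precisely than a balanced study with 12.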

As you can see, having a balanced number of samples (n₁=n₂) per population is not just a statistical nicety, but can save you materials, time, and money.

Statistics Is the New Grammar

2010-05-10T22:07:00.001-04:00

That is how Clive Thompson ends his article Do You Speak Statistics? in the May issue of Wired magazine. I like this sentence; it makes you think of statistics in a different way. Grammar has to do with language, with syntax, or "the principles or rules of an art, science, or technique" (Merriam-Webster). Statistics being the "New Grammar" implies a language that we may not totally understand, or even be aware of: a new way of looking at, and interpreting, the world around us. Through several examples Thompson makes the point that statistics is crucial for public life.

…our inability to grasp statistics — and the mother of it all, probability — makes us believe stupid things. (Clive Thompson)

Back in the late 80s Prof. John A. Paulos wrote a book, Innumeracy: Mathematical Illiteracy and Its Consequences, using a term, innumeracy, coined by Prof. Douglas R. Hofstadter to denote a "person's inability to make sense of numbers" (Thompson quotes Prof. Allen but does not mention innumeracy). In my post Lack of Statistical Reasoning I wrote about Prof. Pinker's observation that statistical reasoning is the most important scientific concept that lay people fail to understand. When someone asks me what I do for a living I tell them that I help people "make sense of data", and when I collaborate with, and teach, engineers and scientists I help them realize what may seem obvious:

variation exists in everything we do;
understanding and reducing variation is key for success.

Thompson argues that "thinking statistically is tricky". Perhaps, but Statistical Thinking starts with the realization that, as Six-Sigma practitioners know full well, variation is everywhere.

A lot of controversy has been generated after the hacking of emails of the Climatic Research Unit (CRU) at the University of East Anglia. The university set up a Scientific Appraisal Panel to "assess the integrity of the research published by the Climatic Research Unit in the light of various external assertions". The first conclusion of the report is "no evidence of any deliberate scientific malpractice in any of the work of the Climatic Research Unit". From the statistics point of view it is very interesting to read the second conclusion:

We cannot help remarking that it is very surprising that research in an area that depends so heavily on statistical methods has not been carried out in close collaboration with professional statisticians.

The panel's opinion is that the work of the scientists doing climate research "is fundamentally statistical", echoing Thompson's argument.

A few years ago Gregory F. Treverton, Director, RAND Center for Global Risk and Security, wrote an interesting article, Risks and Riddles, where he made a wonderful distinction between puzzles and mysteries: "Puzzles can be solved; they have answers", "A mystery cannot be answered; it can only be framed". But it is the connection he made between puzzles, mysteries, and information that is compelling:

Puzzle-solving is frustrated by a lack of information.
By contrast, mysteries often grow out of too much information. (Gregory F. Treverton)

There is so much information these days that another Wired magazine writer, Chris Anderson, calls it the Petabyte Age. A petabyte is a lot of data: a quadrillion bytes (10¹⁵), or the equivalent of about 13 years of HD-TV video. Google handles so much information that in 2008 it was processing over 20 petabytes of data per day. In this data deluge, how do we know what to keep (signal) and what to throw away (noise)? That is why statistics is the new grammar.

What If Einstein Had JMP?

2010-05-03T21:42:00.001-04:00

Last week I had the opportunity to speak at the third Mid-Atlantic JMP Users Group (MAJUG) conference, hosted by the University of Delaware in Newark, Delaware (UDaily). The opening remarks were given by the Dean of the College of Engineering, Michael Chajes, who used the theme of the event, "JMP as Catalyst for Engineering and Science", to emphasize that the government and the general public need to realize that science alone is not enough for the technological advances that are required in the future; we also need engineering. It was a nice lead-in to my opening slide, the quote by Theodore Von Karman:

Scientists discover the world that exists;
engineers create the world that never was.

I talked about some of the things that made Einstein famous (traveling on a beam of light, turning off gravity, mass being a form of energy), as well as some things that may not be well known. Do you know that he used statistics in his first published paper, Conclusions Drawn from the Phenomena of Capillarity (Annalen der Physik 4 (1901), 513-523)?

Einstein "started from the simple idea of attractive forces amongst the molecules", which allowed him to write the relative potential of two molecules, assuming they are the same, as a sum of all pair interactions P = P_∞ - ½ c²∑∑φ(r). In this equation "c is a characteristic constant of the molecule, φ(r), however, is a function of the distance of the molecules, which is independent of the nature of the molecule". "In analogy with gravitational forces", he postulated that the constant c is an additive function of the number of atoms in the molecule; i.e., c =∑c_α, giving a further simplification for the potential energy per unit volume as P = P_∞ - K(∑c_α)²/ν². In order to study the validity of this relationship Einstein "took all the data from W. Ostwald's book on general chemistry", and determined the values of c for 17 different carbon (C), hydrogen (H), and oxygen (O) molecules.

It is interesting that not a single graph was provided in the paper. Had he had JMP he could have used my favorite tool, the graphics canvas of the graph builder, to display the relationship between the constant c and the number of atoms in each of the elements in the molecule.

We can clearly see that c increases linearly with the number of atoms in C and H (the last point, corresponding to the molecule with 16 hydrogen atoms, seems to fall off the line), but not so much with the number of oxygen atoms. We can also see that for the most part, the 17 molecules chosen have more hydrogen atoms than carbon atoms, and carbon atoms than oxygen atoms. This "look test", as my friend Jim Ford calls it, is the catalyst that can spark new discoveries, or the generation of new hypotheses about our data.

In order to find the fitting constant in the linear equation c =∑c_α, he "used for the calculation of c_α for C, H, and O [by] the least squares method". Einstein was clearly familiar with the method and, as Iglewicz (2007) puts it, this was "an early use of statistical arguments in support of a scientific proposition". These days it is very easy for us to fit a linear model using least squares. We just "add" the data to JMP and with a few clicks voilà, we get the "fitting constants", or parameter estimates as we statisticians call them, of our linear equation.

For a given molecule we can, using the parameter estimates from the above table, write the constant c = 48.05×#Carbon Atoms + 3.63×#Hydrogen Atoms + 45.56×#Oxygen Atoms.
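As a quick illustration of how this fitted equation is used, the sketch below evaluates it for Limonene, whose formula is C10H16 (the 16 hydrogen atoms match the count mentioned in this post). The function name and layout are my own; only the three coefficients come from the parameter estimates quoted above.

```python
# Parameter estimates quoted in the text (contribution to c per atom).
COEFFICIENTS = {"C": 48.05, "H": 3.63, "O": 45.56}


def predicted_c(carbon: int, hydrogen: int, oxygen: int) -> float:
    """Return c = 48.05*#C + 3.63*#H + 45.56*#O for one molecule."""
    counts = {"C": carbon, "H": hydrogen, "O": oxygen}
    return sum(COEFFICIENTS[atom] * n for atom, n in counts.items())


# Limonene is C10H16 (no oxygen atoms): 480.5 + 58.08 + 0
limonene_c = predicted_c(10, 16, 0)
print(round(limonene_c, 2))
```

This is the fitted value for Limonene, which is not necessarily close to the value Einstein tabulated for it.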

For young Einstein it was another story. Calculations had to be done by hand, and round-off and arithmetic errors produced some mistakes. Iglewicz notes "simple arithmetic errors and clumsy data recordings, including use of the floor function rather than rounding". He also did not have available the wealth of regression diagnostics that have been developed to assess the goodness of the least squares fit. JMP provides many of them within the "Save Columns" menu.

One of the first plots we should look at is a plot of the studentized residuals vs. predicted values. Points that scatter around the zero line like "white noise" are an indication that the model has done a good job at extracting the signal from the data, leaving behind just noise. For Einstein's data the studentized residuals plot clearly shows a not so "white noise" pattern, putting into question the adequacy of the model. Points beyond ±3 on the Y-axis are a possible indication of response "outliers". Note that two points are more than 2 units away from the 0 center line. One of them corresponds to the high hydrogen point that we saw in the graph builder plot.

In addition to the studentized residuals and predicted values there are two diagnostics, the Hats (leverage) and the Cook's D Influence, that are particularly useful for determining if a particular observation is influential on the fitted model. A high Hat value indicates a point that is distant from the center of the independent variables (X outliers), while a large Cook's D indicates an influential point, in the sense that excluding it from the fit may cause changes in the parameter estimates.

A great way to combine the studentized residuals, the Hats, and the Cook's D is by means of a bubble plot. We plot the studentized residuals on the Y-axis, the Hats on the X-axis, and use the Cook's D to define the size of the bubble. We overlay a line at 0 in the Y-axis, to help the eye assess randomness and gauge distances from 0, and a line at 2×3/17 ≃ 0.35 in the X-axis, to assess how large the Hats are. Here 3 is the number of parameters or "fitting constants" in the model and 17 corresponds to the number of observations.
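To make the three diagnostics concrete, here is a minimal sketch that computes them by hand. It is a simplification, not JMP's implementation: it fits a straight line with an intercept and a single predictor (so p = 2), whereas Einstein's model has three predictors, and the x and y values are made up for illustration. The formulas are the textbook ones: leverage h = 1/n + (x - x̄)²/Sxx, internally studentized residual r = e/√(MSE(1 - h)), and Cook's D = r²h/(p(1 - h)).

```python
import math


def regression_diagnostics(x, y):
    """Leverage, studentized residuals and Cook's D for y = b0 + b1*x."""
    n, p = len(x), 2  # p = number of fitted parameters (intercept and slope)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in residuals) / (n - p)
    hats = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    studentized = [e / math.sqrt(mse * (1 - h)) for e, h in zip(residuals, hats)]
    cooks_d = [r ** 2 * h / (p * (1 - h)) for r, h in zip(studentized, hats)]
    return hats, studentized, cooks_d, 2 * p / n  # last item: leverage cutoff


# Hypothetical data: the last point sits far from the other x values.
x = [1, 2, 3, 4, 10]
y = [1.1, 1.9, 3.2, 3.9, 7.0]
hats, studentized, cooks_d, cutoff = regression_diagnostics(x, y)
print([round(h, 2) for h in hats], cutoff)

# The cutoff used in the text: 3 parameters, 17 observations.
print(round(2 * 3 / 17, 2))
```

The Hats always add up to the number of parameters, which is why 2p/n (twice the average Hat) is a convenient reference line.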

Right away we can see that molecule #1, Limonene, the one with the largest number of hydrogen atoms (16), is a possible outlier in both the Y (studentized residual=-3.02) and X (Hat=0.47) directions, and also an influential observation (Cook's D=1.99). Molecule #16, Valeraldehyde, could also be an influential observation. What to do? One can fit the model without Limonene and Valeraldehyde to study their effect on the parameter estimates, or perhaps consider a different model.

In the final paragraph of his paper Einstein notes that "Finally, it is noteworthy that the constants c_α in general increase with increasing atomic weight, but not always proportionally", and goes on to explain some of the questions that need further investigation, pointing out that "We also assume that the potential of the molecular forces is the same as if the matter were to be evenly distributed in space. This, however, is an assumption which we can expect to be only approximately true." Murrell and Grobert (2002) indicate that this is a very poor approximation because it "greatly overestimate[s] the attractive forces" between the molecules.

For more on Einstein's first published paper you can look at the papers by Murrell and Grobert (2002) and Iglewicz (2007), and Chapter 7 of our book Analyzing and Interpreting Continuous Data Using JMP, where we analyze Einstein's data in more detail.

Sample Size Statements

2010-04-12T19:34:00.001-04:00

One of our readers, Gary Kelly, who is carefully reading our book Analyzing and Interpreting Continuous Data Using JMP: A Step-by-Step Guide, had, what he called, a curious question:

I am wondering how you typically report out the sample size and power in a statement, given that it is a function of both alpha and beta.

He pointed out that in our book we don't explicitly state it, which is true. He wanted to know

Using your example in chapter 5, starting on page 244, how would you write a statement about the sample size output?

The example in Section 5.3.5, pages 243-248, describes the sample size calculations for comparing two sample means. Figure 5.12, page 247, shows the JMP output from the Sample Size and Power calculator.

In my previous post, How Many Do I Need?, I went over the four required "ingredients" for the calculations. We use JMP's default value of Alpha = 0.05 (5%) as the significance level for the test, we state that the noise in the system is Std Dev = 1 unit, or 1 sigma, that we want to detect a difference of at least 1.5 standard deviations (Difference to detect), and that we want the test to have Power of 0.9 (90%). The calculator indicates that we need a total of 21 samples, or 21/2 = 10.5, rounded up to 11 per group, for our study.

So how do we frame our statement about this sample size calculation? We can say that

A total sample size of 22 experimental units, 11 per group, provides a 90% chance of detecting a difference ≥ 1.5 standard deviations between the two population means with 95% confidence.
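For readers who want a rough cross-check of the 11-per-group figure without JMP, the sketch below uses a normal approximation with a commonly used small-sample correction (often attributed to Guenther). This is not the calculation JMP performs, which is based on the t distribution, so the raw number differs slightly; the function name and arguments are my own.

```python
import math
from statistics import NormalDist


def n_per_group(delta, sigma, alpha=0.05, power=0.90):
    """Approximate per-group n for a two-sided, two-sample comparison of means."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_power = z.inv_cdf(power)
    n_normal = 2 * ((z_alpha + z_power) * sigma / delta) ** 2
    # Small-sample correction: add z_alpha^2 / 4 because sigma is estimated.
    return n_normal + z_alpha ** 2 / 4


# Difference to detect = 1.5 standard deviations, Alpha = 0.05, Power = 0.9
n = n_per_group(delta=1.5, sigma=1.0)
per_group = math.ceil(n)
print(round(n, 2), per_group, 2 * per_group)
```

Under these assumptions the approximation gives about 10.3 per group, which rounds up to the same 11 per group, 22 in total, used in the statement above.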

Thanks Gary for bringing this up.

How Many Do I Need?

2010-04-05T22:40:00.009-04:00

This seems to be one of the most popular questions faced by statisticians, and one that, although it may seem simple, always requires additional information. Let's say we are designing a study to investigate if there is a statistically significant difference between the average performance of two populations, like the average mpg of two types of cars, or the average DC resistance of two cable designs. In this two-sample test of significance scenario the sample size calculation depends on four "ingredients":

1. The smallest difference between the two averages that we want to be able to detect
2. The estimate of the standard deviation of the two populations
3. The probability of declaring that there is a difference when there is none
4. The probability of detecting a difference when the difference exists

The third ingredient is known as the significance level of the test, and is the probability of making a Type I error; i.e., declaring that there is a difference between the populations when there is none. The value of the significance level (Alpha) is usually taken as 5%. It was Sir Ronald Fisher, one of the founding fathers of modern statistics, who suggested the value of 0.05 (1 in 20)

as a limit in judging whether a deviation ought to be considered significant or not.

However, I do not believe Fisher intended 5% to become the default value in tests of significance. Notice what he wrote in his 1956 book Statistical Methods and Scientific Inference.

No scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas. (3rd edition, Page 41)

The last ingredient reflects the ability of the test to detect a difference when it exists; i.e., its power. We want our studies to have good power; a suggested value is 80%. But be careful: the more power you want, the more samples you are going to need. In a future post I will show you how a Sample Size vs. Power plot is a great tool to evaluate how many samples are needed to achieve a certain power.

The research hypothesis under consideration should drive the sample size calculations. Let's say that we want to see if there is a difference in the average DC resistance performance of two cable designs. Given that the standard deviation for these cable designs is about 0.5 Ohm, the question of "how many samples do we need?" now becomes:

how many samples do we need to be able to detect a difference of at least 1 Ohm between the two cable designs, with a significance level of 5% and a power of 80%.

In JMP it is very easy to calculate sample sizes as a function of the four ingredients described above. From the DOE menu select Sample Size and Power, and then the type of significance test to be performed. The figure below shows the Sample Size and Power dialog window for the DC resistance two-sample test of significance. Note that by default JMP populates Alpha, the significance level of the test, with 0.05. The highlighted values are the required inputs and the "Sample Size" the total required sample size for the study.

The results indicate that we need about 11 samples, or 6 per cable design, to be able to detect a difference of at least 1 Ohm between the two cable designs.

Next time you ask yourself, or your local statistician, how many samples are needed, remember that additional information is required, and that the calculations only tell you how many samples you need but not how and where to take the samples, that is, the sampling scheme (more about sampling schemes in a future post).

Old Dog, New Tricks

2010-02-22T20:56:00.001-05:00

The statistics profession has gotten some good hype over the past year. In the summer the New York Times published "For Today's Graduate, Just One Word: Statistics". In this article, they discuss "the new breed of statisticians. . ." which ". . . use powerful computers and sophisticated mathematical models to hunt for meaningful patterns and insights in vast troves of data." Some of these statisticians can earn a whopping six-figure salary in their first year after graduating and they get to analyze data from areas which include ". . . sensor signals, surveillance tapes, social network chatter, public records and more." And, I have to agree with the chief economist at Google, Hal Varian, the job does sound kind of "sexy".

As a 20-year veteran supporting engineers and scientists in the industrial sector, I feel a little left behind when I read such articles or peruse the job openings section of journals and see the type of statisticians being recruited, month after month, and year after year. It is interesting to see how statistics, and statisticians, have adapted to the changing world around us. During the technology boom in the 1980s, the industrial statistician was king (or queen). I consider myself privileged to have actually worked in the semiconductor industry during its boom, where the need to look for patterns and signals in vast amounts of data was, and still is, commonplace. If you were pursuing a statistics degree in the past decade, you would be foolish not to consider the specialization of Biostatistics. With the explosion of direct-to-consumer marketing of drugs, drug companies needed these types of statisticians to design and analyze clinical trials to determine the efficacy and safety of the drugs. As we look to the past 5 years or so, we see a new hybrid of statistician, one that combines statistics, mathematics, and computer science to better deal with digital data in a variety of areas, such as finance, web traffic, and marketing.

As I already mentioned, it is hard not to want to "jump ship" to be part of the latest exciting surge of statistics to come along. That is, until I get a dose of reality which brings me back to center. I guess troves of data also require droves of statisticians and data analysts. If you go on to read the New York Times article mentioned above, you will see that these new super statisticians may work in a group with 250 other data analysts, all hoping for that big breakthrough mathematical/statistical algorithm that will better predict consumer behavior or web traffic patterns.

I should consider myself lucky that I actually get to interact with the engineers and scientists who run the experiments that I have designed and take action on the outcomes of the analysis that I presented. Unfortunately, the days when the industrial statistician reigned supreme are long past. But luckily, there is still enough manufacturing in the United States to keep the few of us who remain busy and, even though I'm an old dog, I can still learn some new tricks!

What Kind of Trouble Are You In?

2010-02-02T07:31:00.001-05:00

Well, I guess that depends on what you have done and more importantly if you have gotten caught! Many of you are probably wondering what this has to do with statistics. In his book, The Six Sigma Practitioner's Guide to Data Analysis (2005), Wheeler aptly describes the nature of "trouble" as it relates to the stability (predictability) and capability (product conformance) of manufacturing processes. Using these two dimensions, a process can be in one of four states:

1. No Trouble: Conforming product (capable) & predictable process (stable)
2. Process Trouble: Conforming product (capable) & unpredictable process (unstable)
3. Product Trouble: Nonconforming product (incapable) & predictable process (stable)
4. Double Trouble: Nonconforming product (incapable) & unpredictable process (unstable).

Hopefully, most of you have experienced a process that is in the 'No Trouble' zone, which is also referred to as the 'ideal state'. The focus of these processes should be on maintaining and sustaining a state of stability and capability. A process which is in 'Process Trouble' is unstable, but producing product which is within specification limits most of the time. That is, measurements may be out of the control limits but within specification limits. Unless this type of process requires heroic feats, by operators and engineers, to make conforming product, its instability most likely goes undetected because we have not yet gotten caught by a yield bust. Moving on, a process that is in 'Product Trouble' can be thought of as a process that is predictably "bad". In other words, because it is stable the process is doing the best that it can in its current state, but its best performance results in a consistent amount of nonconforming product. While nonconforming product is undesirable, if the losses are consistent, the job of planning and logistics will be much easier. Finally, the 'Double Trouble' process is both unstable and incapable, which can result in a big headache for the business and for those supporting it!

In order to determine the state of your process, you will first need to determine the key output attributes and measurements and then assess their process stability and capability. Recall from my last post, "Is a Control Chart Enough to Evaluate Process Stability?", that the stability of a process can be determined using a control chart and looking for nonrandom patterns or trends in the data and unusual points that plot outside of the control limits. In addition, the SR ratio can be added to provide a more objective assessment of the process stability, with a stable process producing an SR ratio close to 1 and an unstable process resulting in an SR ratio > 1. The control chart below is the Tensile strength data presented in my post, "Why Are My Control Limits So Narrow?", with the control limits adjusted for the large batch-to-batch variation. There are no points out of control for this process and the SR ratio = 2.27² / 2.245² = 1.02, indicating a stable process parameter.

How do we assess the capability of a process? This can be done by evaluating the process capability index, C_pk, and determining if it meets our stated goal of "at least 1". For the Tensile data, suppose that the specification limits for any individual measurement are LSL = 45 and USL = 61, with a target value of 53. From the output below, we see that C_pk = 0.832 with a 95% confidence interval of (0.713, 0.951). Since the upper bound of our confidence interval is less than 1, we have not shown that we can meet our goal of "at least 1". Therefore, by this definition, we would assess this process parameter as incapable.

Based upon the results from the process stability assessment (stable) and the process capability assessment (incapable), this process parameter falls in the "Product Trouble" zone. In other words, our process is predictably "bad" and makes out-of-spec product on a regular basis. If we recenter the average Tensile Strength closer to the target value of 53, then we can achieve a C_pk = 1.137, as is shown by C_p in the output above, and possibly achieve the "ideal state" or "no trouble" zone.
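The relationship between C_p, C_pk, and the four states can be sketched in a few lines. The formulas are the standard ones, C_p = (USL - LSL)/(6σ) and C_pk = min(USL - mean, mean - LSL)/(3σ). The mean and sigma below are hypothetical round numbers, not the Tensile data, chosen only to show that recentering on the target makes C_pk equal to C_p; the function names are my own.

```python
def cp(lsl, usl, sigma):
    """Potential capability: spec width relative to six sigma."""
    return (usl - lsl) / (6 * sigma)


def cpk(lsl, usl, mean, sigma):
    """Actual capability: distance to the nearest spec limit in 3-sigma units."""
    return min(usl - mean, mean - lsl) / (3 * sigma)


def trouble_state(stable, capable):
    """Map the stability and capability assessments to one of the four states."""
    if stable and capable:
        return "No Trouble"
    if capable:
        return "Process Trouble"
    if stable:
        return "Product Trouble"
    return "Double Trouble"


LSL, USL, TARGET = 45, 61, 53  # specification limits from the text
mean, sigma = 51.0, 2.0        # hypothetical process mean and standard deviation

off_center = cpk(LSL, USL, mean, sigma)
recentered = cpk(LSL, USL, TARGET, sigma)
print(round(off_center, 3), round(recentered, 3), round(cp(LSL, USL, sigma), 3))
print(trouble_state(stable=True, capable=False))
```

Note that in the post the capability call is based on the confidence interval for C_pk, not just the point estimate, so the `capable` flag should come from that assessment.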

Periodically conducting these assessments to understand the state of your processes is advisable. There is no reason to wait until you are in "double trouble" to pay attention to the health of your processes, because in this context, one of two scenarios has probably occurred. Either your customer was the recipient of bad product and informed you of the problem, or you discovered a rash of bad product through an unexpected yield bust at final inspection. Yes, you got caught! In either event, working through these types of process upsets is draining to the business and potentially dissatisfying to the customer. Remember, ignoring an unstable and incapable process will eventually catch up with you.

I Am Confident That I Am 95% Confused!

2010-01-24T20:16:00.001-05:00

You recently completed an analysis of DC resistance data that shows that the distribution is centered around 49.94 Ohm with a standard deviation of 1.96 Ohm. The JMP output also includes a 95% confidence interval for the mean DC resistance equal to [49.31; 50.57]. This is good news because you can now report to your customer that there is a 95% chance that the average DC resistance is between 49.31 and 50.57 Ohm, and therefore you are meeting the specified target of 50 Ohm.

Before you talk to the customer you decide to check with your local statistician, to be sure about the claim you are going to make. She tells you that we cannot really say that there is a 95% chance the mean DC resistance is between 49.31 and 50.57 Ohm. You see, she says, this is a long run type of statement in the sense that if you were to construct many such intervals, on average, 95 out of 100 will contain the true DC resistance mean. You leave her office totally confused because in your mind these two statements sound the same.

Imagine yourself sitting at a poker table. Depending on your imagination you can be sitting at a table at the Bellagio's Poker Room in Vegas, or at your friend's house on a Thursday night. For a 52-card deck, before the draw, you have about a 42% chance of being dealt a hand with "one pair". However, once the hand is dealt you either have "one pair" or you don't. In other words, the frequency of "one pair" showing up in a hand is about 42%. Frequency here means that if you play poker on a regular basis then, on average, in 100 games played you expect to get a hand with "one pair" in about 42 of those games.
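The 42% figure can be checked by counting hands, assuming a standard 52-card deck and a 5-card hand. Choose the rank of the pair (13 ways) and its two suits (6 ways), then three other distinct ranks (220 ways), each in any of four suits (4³ ways), and divide by the number of possible 5-card hands.

```python
from math import comb

# Rank of the pair, suits of the pair, three other ranks, suits of those three.
one_pair_hands = 13 * comb(4, 2) * comb(12, 3) * 4 ** 3
all_hands = comb(52, 5)
probability = one_pair_hands / all_hands

print(one_pair_hands, all_hands, round(probability, 4))
```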

The same is true for a confidence interval. Before you generate a 95% confidence interval for the mean, there is a 95% chance that the interval will contain the true mean value, but once the interval is generated, [49.31; 50.57] for example, the true mean value is either in the interval or outside it. And the fact is that we really don't know if the true mean value is in the interval because we don't know what the true mean value is! It is just like getting a poker hand without being able to turn the cards to see if you got the "one pair". All we have is the confidence that on average 95% of the intervals will in fact contain the true mean value. The confidence is a statement, not on a given interval, but on the procedure that is used to generate the interval.

Simulations can help us visualize and understand the meaning of statistical confidence. The video below shows a simulation that generates one hundred 95% confidence intervals for the mean. In the simulation we mimic the DC resistance data in the histogram above by using a sample size of 40 observations, from a population with true mean=50 and true standard deviation=2. For a 95% confidence interval for the mean we expect, on average, that 95% of the intervals will contain the true value of 50. In the simulation those intervals that do not contain 50 are colored red. You can see that for each new sample in the simulation sometimes 93% of the intervals contain 50, other times 97%, 95%, or 98%, but on average 95% of them do contain the true population mean of 50.
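If you cannot watch the video, a similar simulation can be sketched in a few lines. This is my own minimal version, not the JMP script behind the video: it draws samples of 40 from a normal population with mean 50 and standard deviation 2, builds a t-based 95% interval for each, and counts how often the interval contains 50. The t critical value for 39 degrees of freedom (about 2.0227) is hard-coded, and the seed is arbitrary.

```python
import math
import random
import statistics

T_CRIT_DF39 = 2.0227  # 97.5th percentile of Student's t with 39 degrees of freedom


def coverage(n_intervals, n=40, mu=50.0, sigma=2.0, seed=1):
    """Fraction of simulated 95% confidence intervals that contain mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_intervals):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        mean = statistics.fmean(sample)
        half_width = T_CRIT_DF39 * statistics.stdev(sample) / math.sqrt(n)
        if mean - half_width <= mu <= mean + half_width:
            hits += 1
    return hits / n_intervals


# One batch of 100 intervals, as in the video, then a much longer run.
print(coverage(100))
print(coverage(2000))
```

A single batch of 100 intervals will bounce around 95%, while the longer run should settle much closer to it, which is exactly the "long run" sense of confidence.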

I hope this helps dispel some of the confusion regarding the meaning of statistical confidence. You can find more details about the meaning of statistical confidence and statistical intervals in Chapter 2 of our book, or in the white paper Statistical Intervals: Confidence, Prediction, Enclosure.

Today Was a Good "Webinar" Day

2010-01-12T22:02:00.000-05:00

We did it! Brenda and I, sponsored by and with the help of the SAS Press team, gave our first webinar today. The SAS Press team said we reached a 50% attendance rate which, according to webinar attendance rate statistics, is pretty good. I must confess that it took a little bit of getting used to. When I give talks or teach I'm always in front of a live audience, and I pay close attention to the participants and their body language for cues as to how my delivery is going, if they are understanding the material, or if they need a break. In a webinar you pretty much talk to your computer screen without any visual or audio feedback from the audience.

For the webinar we used a semiconductor industry example involving the qualification of a temperature-controlled vertical furnace used for thin film deposition on wafers. The goal of the qualification was to show that the average thickness of the silicon dioxide layer, a key fitness-for-use parameter, meets the target value of 90 Angstrom, and to predict how much product will be outside the 90 ± 3 Å specifications.

We walked the participants through a 7-Step Method that includes clearly stating the questions or uncertainties to be answered, translating those into statistical hypotheses that can be tested with data, and the different aspects of data collection, analysis, interpretation of results, and recommendations (more details in Chapter 4 of our book). We featured JMP's Distribution and Control Chart platforms, as well as the Formula Editor to predict the expected yield loss using a normal distribution. Several interesting questions were raised by the participants, including what is the meaning of confidence level, what is a good C_pk value, how do we predict yield loss with respect to specifications, and the value of changing the specifications rather than centering the process. Great topics for future posts!

Today was a good day. We had the opportunity to deliver a well attended webinar and, to top it all off, the SAS Press team told us that our book, Analyzing and Interpreting Continuous Data Using JMP: A Step-by-Step Guide, just won the 2009-2010 Society for Technical Communications Distinguished award. For this, we are thankful to the judges, our readers, and the JMP and SAS Press teams. We are also very grateful to those of you who were in attendance today for giving us the chance to try this out.

Is a Control Chart Enough to Evaluate Process Stability?

2010-01-04T20:09:00.000-05:00

A control or process behavior chart is commonly used to determine if the output for a process is in a "state of statistical control", i.e., it is stable or predictable. A fun exercise is to generate random noise, plot it on a control chart and then ask users to interpret what they see. The range of answers is as diverse as asking someone to interpret the meaning behind a surrealist painting by Salvador Dalí. As a case in point, take a look at the control chart below and determine if the output of this process is stable or not.

I suppose a few of you would recognize this as white noise, while others may see some interesting patterns. What about those 2 points that are close to the control limits? Is there more variation in the first half of the series than the second half? Is there a shift in the process mean in the second half of the series? Is there a cycle?

How can we take some of the subjectivity out of interpreting control charts? Western Electric rules are often recommended for assessing process stability. Certainly, this is more reliable than merely eyeballing the chart ourselves (we humans tend to see patterns when there are none), and the rules can provide us with important insights about our data. For instance, the same data is shown below with 4 runs tests turned on. We see that we have two violations of the runs tests. Test 2 detects a shift in the process mean by looking for at least 8 points in a row falling on the same side of the center line, while Test 5 flags when at least 2 out of 3 successive points fall on the same side of, and more than 2 sigma units away from, the center line (Zone A or beyond). Does this mean our process output is unstable?

Remember, this data represents random noise. Some of you may be surprised that there are any violations in runs rules, but these are what we call 'false alarms'. Yes, even random data will occasionally violate runs rules with some expected frequency. False alarms add to the complexity of identifying truly unstable processes. Once again, how can we take some of the subjectivity out of interpreting control charts?
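To show how mechanical a runs rule is, here is a minimal sketch of Test 2 as described above: at least 8 points in a row on the same side of the center line. The function names are my own, the data are made up, and a point exactly on the center line is treated as breaking the run, which is an assumption about a detail the rule leaves open.

```python
def longest_same_side_run(values, center):
    """Length of the longest run of consecutive points on one side of center."""
    longest = current = 0
    previous_side = 0
    for v in values:
        side = (v > center) - (v < center)  # +1 above, -1 below, 0 on the line
        if side != 0 and side == previous_side:
            current += 1
        elif side != 0:
            current = 1
        else:
            current = 0
        previous_side = side
        longest = max(longest, current)
    return longest


def violates_test_2(values, center, run_length=8):
    """True if at least run_length points in a row fall on one side of center."""
    return longest_same_side_run(values, center) >= run_length


shifted = [0.4, -0.2, 0.1, 0.9, 1.2, 0.7, 1.5, 0.3, 0.8, 1.1, 0.6]
print(longest_same_side_run(shifted, center=0.0), violates_test_2(shifted, center=0.0))
```

In the made-up series the last nine points sit above the center line, so the rule fires; with pure noise the same check will still fire now and then, which is the false alarm discussed above.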

Method 1 and Method 2 to the rescue! In José's last post, he described 3 ways of computing the standard deviation. Recall, Method 1 uses all of the data to calculate a global estimate of the standard deviation using the formula for the sample standard deviation. Method 2, however, uses a local estimate of variation by averaging the subgroup ranges, or in this case, moving ranges, and dividing the overall range average by the scaling factor d2. When the process is stable, these two estimates will be close in value, and the ratio of their squared values (SR ratio) will be close to 1. If our process is unstable, then the standard deviation estimate from Method 1 will most likely be larger than the estimate from Method 2, and the ratio of their squared values will be greater than 1.

For the random data in the control chart shown above, the SR ratio = 1.67²/1.62² = 1.06, which is close to 1, suggesting a stable process, that is, one in a state of statistical control. As a counterpoint, let's calculate the SR ratio for the control chart shown in my last post, which is reproduced below. The SR ratio = 2.35²/0.44² = 28.52, which is way bigger than 1. This suggests an unstable process; however, in this case, it is due to the inappropriate control limits for this data.
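For individual measurements the SR ratio can be computed directly from the two estimates. The sketch below is a minimal version under my own naming: Method 1 is the ordinary sample standard deviation, Method 2 is the average moving range divided by d2 = 1.128 (the constant for ranges of two consecutive points), and the series is made up, with a deliberate shift halfway through.

```python
import statistics

D2_MOVING_RANGE = 1.128  # d2 for ranges of 2 consecutive observations


def sr_ratio(values):
    """Squared ratio of the global (Method 1) to the local (Method 2) estimate."""
    global_sd = statistics.stdev(values)
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    local_sd = statistics.fmean(moving_ranges) / D2_MOVING_RANGE
    return (global_sd / local_sd) ** 2


# Hypothetical series with a mean shift after the sixth point.
shifted = [10, 11, 10, 11, 10, 11, 20, 21, 20, 21, 20, 21]
print(round(sr_ratio(shifted), 2))
```

The shift inflates the global estimate but barely touches the moving ranges, so the ratio lands far above 1.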

The SR ratio is a very useful statistic to complement the visual assessment of the stability of a process. It also provides a consistent metric for classifying a process as stable or unstable and, in conjunction with the C_pk, can be used to assess the health of a process (more in a future post). For the two examples shown, it was easy to interpret the SR ratios of 1.06 and 28.52, which represent the two extremes of stability and instability. But what happens if we obtain an SR ratio of 1.5 or 2? Is it close to 1 or not? For these situations, we need to obtain the p-value for the SR ratio and determine if it is statistically significant at a given significance level. To learn more about the SR ratio and other stability assessment criteria, see the paper I co-authored with Professor George Runger, Quantitative Assessment to Evaluate Process Stability.

SPC and ANOVA, What's the Connection?

2009-12-15T06:08:00.000-05:00

The plot below shows 3 subgroups of size 8 for each of two different processes. For Process 1 the 3 subgroups look similar, while for Process 2 subgroup 2 has lower readings than subgroups 1 and 3.

Data from Dr. Donald J. Wheeler's SPC Workbook (1994).

Three Estimates of Standard Deviation

For each process, there are three ways we can obtain an estimate of the standard deviation of the population that generated this data. Method 1 consists of computing a global estimate of the standard deviation using all the 8x3 = 24 observations. The standard deviation of Process 2 is almost twice as large as the standard deviation of Process 1.

In Method 2 we first calculate the range of each of the 3 subgroups, compute the average of the 3 ranges, and then compute an estimate of standard deviation using Rbar/d2, where d2 is a correction factor that depends on the subgroup size. For subgroups of size 8, d2 = 2.847. This is the local estimate from an R chart that is used to compute the control limits for an Xbar chart.

Since for each process the 3 subgroups have the same ranges (5, 5, and 3), they have the same Rbar = 4.3333, giving the same estimate of standard deviation, 4.3333/2.847 = 1.5221.

Finally, for Method 3 we first compute the standard deviation of the 3 subgroup averages, and then scale up the resulting standard deviation by the square root of the number of observations per subgroup, √8 = 2.8284. For Process 1 the estimate is given by 0.5774×√8 = 1.6331, while for Process 2, 3×√8 = 8.485.

The table below shows the Method 1, 2, and 3 standard deviation estimates for Processes 1 and 2. Readers familiar with ANalysis Of VAriance (ANOVA) will recognize Method 2 as the estimate based on the within sum-of-squares, while Method 3 is the estimate coming from the between sum-of-squares.
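For readers who want to try the three methods outside of JMP, here is a minimal Python sketch. The subgroup values are made up for illustration (they are not Wheeler's data); they were only chosen so that the three ranges are 5, 5, and 3. The function name three_estimates is hypothetical, and the d2 value is the one quoted above for subgroups of size 8.

```python
from math import sqrt
from statistics import mean, stdev

D2_N8 = 2.847  # d2 for subgroups of size 8, as quoted in the text


def three_estimates(subgroups, d2=D2_N8):
    """Return the Method 1, 2, and 3 estimates for equal-size subgroups."""
    n = len(subgroups[0])
    all_obs = [x for g in subgroups for x in g]
    method1 = stdev(all_obs)                                # global estimate
    rbar = mean([max(g) - min(g) for g in subgroups])       # average range
    method2 = rbar / d2                                     # local estimate, Rbar/d2
    method3 = stdev([mean(g) for g in subgroups]) * sqrt(n) # from subgroup averages
    return method1, method2, method3


# Hypothetical data: 3 subgroups of size 8 with ranges 5, 5, and 3.
subgroups = [
    [4, 5, 5, 4, 8, 4, 3, 7],
    [5, 6, 4, 7, 9, 5, 6, 6],
    [5, 6, 7, 6, 8, 5, 6, 5],
]
m1, m2, m3 = three_estimates(subgroups)
print(round(m1, 4), round(m2, 4), round(m3, 4))
```

When the subgroups come from the same stable process, the three numbers should be of similar magnitude; shifting one subgroup up or down leaves Method 2 unchanged but inflates Methods 1 and 3.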

You can quickly see that for Process 1 all 3 estimates are similar in magnitude. This is a consequence of Process 1 being stable or in a state of statistical control. Process 2, on the other hand, is out-of-control and therefore the 3 estimates are quite different.

In SPC an R chart answers the question "Is the within subgroup variation consistent across subgroups?", while the XBar chart answers the question "Allowing for the amount of variation within subgroups, are there detectable differences between the subgroup averages?" In an ANOVA the signal-to-noise ratio, the F ratio, is a function of Method 3/Method 2, and signals are detected whenever the F ratio is statistically significant. As you can see there is a one-to-one correspondence between an XBar-R chart and the oneway ANOVA.
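To make the connection concrete, the F ratio for equal-size subgroups can be sketched in Python as follows. The between mean square is exactly the square of the Method 3 estimate. One caveat: the ANOVA uses the pooled within-subgroup variance for the noise term, not Rbar/d2, so the denominator below plays the role of Method 2 rather than reproducing it. The function name and the data are made up for illustration.

```python
from statistics import mean, variance


def oneway_f_ratio(subgroups):
    """F ratio of a oneway ANOVA, assuming all subgroups have the same size."""
    n = len(subgroups[0])
    # Between mean square: n times the variance of the subgroup averages
    # (the square of the Method 3 estimate).
    ms_between = n * variance([mean(g) for g in subgroups])
    # Within mean square: pooled variance, which for equal sizes is the
    # average of the subgroup variances.
    ms_within = mean([variance(g) for g in subgroups])
    return ms_between / ms_within


# Hypothetical data: the second subgroup sits lower than the other two.
similar = [[4, 5, 6, 5], [5, 6, 4, 5], [6, 5, 4, 5]]
shifted = [[4, 5, 6, 5], [1, 2, 0, 1], [6, 5, 4, 5]]
print(oneway_f_ratio(similar), oneway_f_ratio(shifted))
```

The significance of the F ratio would then be judged against an F distribution with k - 1 and k(n - 1) degrees of freedom, where k is the number of subgroups.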

A process that is in a state of statistical control is a process with no signals from the ANOVA point of view.

In an upcoming post Brenda will talk about how we can use Method 1 and Method 2 to evaluate process stability.

JMP Summary Statistics Without The Statistics

2009-12-07T21:22:00.000-05:00

One of my favorite, and most used, JMP commands is the Summary command within the Tables menu (Tables > Summary). The Summary command can generate several summary statistics (Mean, Std. Dev., Min, Max, etc.) for the continuous variables in your data table according to the different levels of grouping (classification) variables. But did you know that you can just use the Group variable list in the Summary dialog without requesting any summary statistics?

To illustrate, the Cars sample table, from the JMP Sample Library, contains 352 observations from trials in which stock automobiles are crashed into a wall at 35MPH with dummies in the driver and front passenger seats. The sample table also contains several classification variables including Make, Number of Doors, and Size.

I was curious to know how many different brands were used in the study. We can answer this question by selecting Tables > Summary, placing the variable Make in the Group area of the Summary dialog, and clicking OK.

The resulting table contains a list of the unique makes that were used in the study along with the number of observations belonging to each make. There were 37 different brands used in the study, with 42 Chevrolet cars and only 2 BMW. Another (very) nice feature is that the summary table is linked to the active data table, the source table, so clicking on 'Row 6: Make = Chevrolet' selects in the source table the corresponding 42 rows where Make = Chevrolet.
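For readers who think in code, the same idea (group by a classification variable, request no statistics, get the unique levels and their row counts) looks roughly like this in Python. This is only an analogy to the JMP dialog, not how JMP works internally, and the handful of rows below are made up; they are not the Cars sample table.

```python
from collections import Counter

# Hypothetical rows standing in for a data table with a Make column.
rows = [
    {"Make": "Chevrolet", "Doors": 4},
    {"Make": "BMW", "Doors": 2},
    {"Make": "Chevrolet", "Doors": 2},
    {"Make": "Ford", "Doors": 4},
    {"Make": "Chevrolet", "Doors": 4},
]

# Group by Make with no summary statistics: unique levels plus row counts.
n_rows_by_make = Counter(row["Make"] for row in rows)

# "Clicking" a level of the summary and subsetting the source rows.
chevrolet_rows = [row for row in rows if row["Make"] == "Chevrolet"]

for make, count in sorted(n_rows_by_make.items()):
    print(make, count)
```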

You can now select the Tables > Subset command to create a subset table with only the Chevrolet observations. This is very handy if you have a table with thousands of observations and you need to create subset tables according to the levels of one classification variable, or the combinations of levels of classification variables.

What if you want to add summary statistics to one of these summary tables? No need to go back to the Tables > Summary menu. Just click the contextual menu (red triangle) in the upper left-hand corner, the columns area, of the summary table and select Add Statistics Column. This brings up the Summary dialog for you to select the variable, or variables, and the summary statistics you want.

If you use pivot tables in Excel to summarize your data, I encourage you to try the powerful data manipulation tools in the Tables menu of JMP, including the Tabulate platform, which is a fully drag-and-drop interface for creating summary tables.

Why Are My Control Limits So Narrow?

2009-11-30T20:50:00.001-05:00

Statistical Process Control (SPC) charts are widely used in engineering applications to help us determine if our processes are predictable (in control). Below are Xbar and Range charts showing 25 subgroup averages and ranges for 5 Tensile Strength values (ksi) taken from each of 25 heats of steel. The Range chart tells us if our within subgroup variation is consistent from subgroup-to-subgroup and the Xbar chart tells us if our subgroup averages are similar. The Xbar chart has 19 out of 25 points outside of the limits. This process looks totally out-of-control, or does it?

Data taken from Wheeler and Chambers (1992), Understanding Statistical Process Control, 2nd edition, table 9.5, page 222.

The limits for Xbar are calculated using the within subgroup ranges, Rbar/d2. In other words, the within subgroup variation, which is a local measure of variation, is used as a yardstick to determine if the subgroup averages are predictable. In the context of our data, the within subgroup variation represents the variation among 5 samples of steel within one heat (batch) of the steel and the between subgroup variation represents the heat-to-heat variation. While the details are limited, we can imagine that every time we have to heat a batch of steel, we may be changing raw material lots, tweaking the oven conditions, or running them on a different shift, which can lead to more than one basic source of variation in the process.

Having multiple sources of variation is quite common for processes which are batch driven and the batch-to-batch variation is often the larger source of variation. For the Tensile Strength data, the heat-to-heat variation accounts for 89% of the total variation in the data. When we form rational subgroups based upon a batch, the control limits for the Xbar chart will only reflect the within batch variation and may result in control limits which are unusually tight and many points will be outside of the control limits.

In order to make the Xbar chart more useful for this type of data we need to adjust the control limits to incorporate the batch-to-batch variation. While there are several ways to appropriately adjust the limits on the Xbar chart, the easiest way is to treat the subgroup averages as individual measurements and use an Individuals and Moving Range chart to calculate the control limits.

The plot below shows the Tensile Strength data for the 25 heats of steel and was created using a JMP script for a 3-Way control chart. The first chart is the Xbar chart with the adjusted limits using the moving ranges for the subgroup averages and the chart below it is the moving range chart for the subgroup averages. The third chart (not shown here) is the Range chart already presented earlier. Note, the limits on the Range chart do not require any adjustments. Now what do we conclude about the predictability of this process?

Indeed, the picture now looks quite different. No points are outside of the limits and there are no runs-rule violations. The Range chart shows 3 points above the upper control limit, suggesting that these three heats of steel had higher within subgroup variation. As Wheeler and Chambers point out, "this approach should not be used indiscriminately, and should only be used when the physical situation warrants its use".

Lack of Statistical Reasoning

2009-11-20T08:23:00.000-05:00

In the Up Front: Steven Pinker section of the New York Times Sunday Book Review, it was interesting to read Malcolm Gladwell's comment on "getting a master's degree in statistics" in order "to break into journalism today". This has been a great year for statistics, considering the comment made earlier this year by Google's chief economist, Hal Varian: “I keep saying that the sexy job in the next 10 years will be statisticians”, and the Wall Street Journal's The Best and Worst Jobs survey, which has Mathematician as number 1 and Statistician as number 3.

What really caught my attention in Sunday's Up Front was the remark by Prof. Steven Pinker, who wrote the review of Gladwell's new book "What the Dog Saw", when asked "what is the most important scientific concept that lay people fail to understand". He said: “Statistical reasoning. A difficulty in grasping probability underlies fallacies from medical quackery and stock-market scams to misinterpreting sex differences and the theory of evolution.”

I agree with him, but I believe that it is not only lay people who lack statistical reasoning; as scientists and engineers we sometimes forget about Statistical Thinking. Statistical Thinking is a philosophy of learning and action that recognizes that:

All work occurs in a system of interconnected processes,
Variation exists in all processes, and
Understanding and reducing variation is key for success

Globalization and a focus on environmental issues are helping us to "think globally", or look at systems rather than individual processes. When it comes to realizing that variation exists in everything we do, we lose sight of it as if we were in a "physics lab where there is no friction". We may believe that if we do things in "exactly" the same way, we'll get the same result. Process engineers know first hand that doing things "exactly" the same way is a challenge because of variation in raw materials, equipment, methods, operators, environmental conditions, etc. They understand the need for operating "on target with minimum variation". Understanding and minimizing variation brings about consistency, gives more "elbow room" to move within specifications, and makes it possible to achieve six sigma levels of quality.

This understanding of variation is key in other disciplines as well. I am waiting for the day when financial reports do not just compare a given metric with the previous year, but utilize process behavior (control) charts to show the distribution of the metric over time, giving us a picture of its trends, of its variation, helping us not to confuse the signals with the noise.

Happy Birthday JMP!

2009-11-16T20:07:00.000-05:00

We know we're late, JMP's birthday was October 5, but we have been busy with PR activities for our book, which include creating and maintaining this blog. That said, JMP is 20 years old and, in those 20 years, JMP has become one of our favorite software packages, one that we use daily.

John Sall, co-founder and Executive Vice President of SAS, who leads the JMP business division, recently wrote about JMP's 20th birthday in his blog, bLog-Normal Distribution. John describes the events that led up to the first release of JMP on October 5, 1989 and the niche that it filled for engineers and scientists as a desktop point-and-click software tool that takes full advantage of the graphical user interface.

As we reflect upon using JMP, both in our own work as statisticians and in collaborating with engineers and scientists, our experiences mirror, almost exactly, what is described in JMP is 20 Years Old. John wrote, "We learned that engineers and scientists were our most important customer segment. These people were smart, motivated and in a hurry - too impatient to spend time learning languages, and eager to just point and click on their data." Things have not changed much. Engineers and scientists are busier than ever, and want to be able to get quick answers to the challenges they face. They really value JMP's powerful and easy-to-use features.

"What was missing was the exploratory role, like a detective, whose job is to discover things we didn't already know", writes John. JMP has made detectives of all of us by giving us the ability to easily conduct Exploratory Data Analysis (EDA) using features such as linked graphs and data tables, excluding/including observations from plots and analysis on the fly, and drag-and-drop tools, such as the Graph Builder and the Table Builder (Tabulate).

Here are some of our old and new JMP favorites that we find ourselves using over and over again.

- Graph Builder: new drag and drop canvas for creating a variety of graphs allowing us to display many data dimensions in one plot.
- Profiler Simulator: awesome tool that gives us the ability to use simulation techniques to define and evaluate process operating windows.
- Variability/Gauge Chart: one of our all time favorites to study and quantify sources of variability and look for systematic patterns or profiles in the data.
- Distribution: a real workhorse. Great to examine and fit different distributions to our data, calculate statistical intervals (confidence, prediction, tolerance), conduct simple tests of significance on the mean and standard deviation of a population, and perform capability analysis.
- Control Chart > Presummarize: this function makes it even easier to fit more appropriate control limits to Xbar charts for data from a batch process, which contains multiple sources of variation.
- Bubble Plot: a dynamic visualization tool that shows a scatter plot in motion and is sure to wow your friends.
- Reliability Platform: new and improved reliability tools that make it easy to fit and compare different distributions, as well as predict product performance.

Happy Birthday JMP. We look forward to 20 more years of discoveries and insights in engineering and science!

Brenda and José

Normal Calculus Scores

2009-11-10T21:57:00.001-05:00

A few weeks ago I was reading the post Double Calculus on the Learning Curves blog, and the histogram of the grade distribution of the calculus scores really caught my attention. For starters, the histogram was generated using JMP and I'm always glad to see other users of JMP, but most of all, the distribution looked quite normal. Quoting from the blog: "Can you believe this grade distribution? Way more normal than anything that comes out of my class. Skewness of 0.03."

Images of grading by the "curve", as well as "normal scores", came to mind, and this made me think of my favorite tool for assessing normality: the normal probability plot. The normal probability plot is a plot of the ordered data against the expected normal scores (Z scores) such that, if the normal distribution is a good approximation for the data, the points follow an approximately straight line.
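For the curious, the coordinates of a normal probability plot can be sketched in a few lines of Python. The plotting positions used here, (i - 0.375)/(n + 0.25), are one common choice (Blom's); this is an assumption for illustration and is not necessarily the formula JMP uses. The function names and the scores are made up.

```python
from statistics import NormalDist


def normal_scores(n):
    """Approximate expected normal scores for a sample of size n.

    Assumption: Blom plotting positions (i - 0.375) / (n + 0.25).
    """
    std_normal = NormalDist()
    return [std_normal.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]


def normal_plot_points(data):
    """Pair each ordered observation with its normal score."""
    return list(zip(normal_scores(len(data)), sorted(data)))


# Hypothetical scores, not the calculus grades from the post.
scores = [62, 71, 75, 78, 80, 83, 85, 88, 93]
for z, x in normal_plot_points(scores):
    print(round(z, 2), x)
```

Plotting the observations against their normal scores and checking whether the points hug a straight line is all there is to it.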

A normal probability plot is easily generated in JMP using the Distribution platform by clicking the contextual menu to the right of the histogram title.

In a normal probability plot the points do not have to fall exactly on a straight line, just hover around it so that a "fat pen" will cover them (the "fat pen" test). JMP also provides confidence bands around the line to facilitate interpretation.

We can clearly see that the calculus scores closely follow the straight line, indicating that the data can be well approximated by a normal distribution. These calculus scores are in fact normal scores!

Practical Significance Always Wins Out

2009-11-03T19:43:00.000-05:00

Engineers and scientists are the most pragmatic people that I know when it comes to analyzing and extracting key information with the statistical tools they have at hand. It is this level of pragmatism that often leads me to recommend equivalence tests for comparing one population mean to a standard value k, in place of the more common test of significance. Think about how a Student's t-test plays out in an analysis to test the hypothesis, Null: μ = 50 ohm vs. Alternative: μ ≠ 50 ohm. If we reject the null hypothesis in favor of the alternative then we say that we have a statistically significant result. Once this is established, the next question is how far is the mean off from the target value of 50? In some cases, this difference is small, say 0.05 ohm, and is of no practical consequence.

The other possible outcome for this test of significance is that we do not reject the null hypothesis and, although we can never prove that μ = 50 ohm, we sometimes behave like we did and assume that the mean is no different from our standard value of 50. The natural question that arises is usually, "can I say that the average resistance = 50 ohm?" to which I reply "not really".

My secret weapon for combining statistical and practical significance in one fell swoop is the Equivalence Test. Equivalence tests allow us to prove that our mean is equivalent to a standard value within a stated bound. For instance, we can prove that the average DC resistance of a cable is 50 ohm within ± 0.25 ohm. This is accomplished by using two one-sided t-tests (TOST), one at each equivalence boundary, and we must simultaneously reject both null hypotheses to conclude equivalence. These two sets of hypotheses are:

a) H0: μ ≤ 49.75 vs. H1: μ > 49.75 and
b) H0: μ ≥ 50.25 vs. H1: μ < 50.25.

The equivalence test output for this scenario is shown below. Notice that, at the 5% level of significance, both p-values for the 2 one-sided t-tests are not statistically significant and therefore, we have NOT shown that our mean is 50 ± 0.25 ohm. But why not? The test-retest error for our measurement device is 0.2 ohm, which is close to the equivalence bound of 0.25 ohm. As a general rule, the equivalence bound should be larger than the test-retest error.

Let's look at one more example using this data to show that our mean is equivalent to 50 ohm within ± 0.6 ohm. We have chosen our equivalence bound to be 3 times the measurement error of 0.2 ohm. The JMP output below now shows that, at the 5% level of significance, both p-values from the 2 one-sided t-tests are statistically significant. Therefore, we have shown equivalence of the average resistance to the stated bounds of 49.4 and 50.6 ohm and therefore, equivalent to 50 ohm performance.
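The logic of the two one-sided tests can be sketched in Python as follows. Everything numeric here is an assumption for illustration: the summary statistics (mean 50.0, standard deviation 1.0, n = 40) are made up and are not the data behind the JMP output, and the critical value 1.685 is the approximate one-sided 5% point of Student's t with 39 degrees of freedom taken from standard tables. The function name is hypothetical.

```python
from math import sqrt

T_CRIT_39 = 1.685  # approx. one-sided 5% critical value of t with 39 df (from tables)


def tost_one_sample(xbar, s, n, lower, upper, t_crit):
    """Two one-sided t-tests: True only if both null hypotheses are rejected."""
    se = s / sqrt(n)
    t_lower = (xbar - lower) / se  # tests H0: mu <= lower vs. H1: mu > lower
    t_upper = (xbar - upper) / se  # tests H0: mu >= upper vs. H1: mu < upper
    return t_lower > t_crit and t_upper < -t_crit


# Hypothetical summary statistics for 40 cables.
xbar, s, n = 50.0, 1.0, 40
tight = tost_one_sample(xbar, s, n, 49.75, 50.25, T_CRIT_39)  # bound of 0.25 ohm
wide = tost_one_sample(xbar, s, n, 49.4, 50.6, T_CRIT_39)     # bound of 0.6 ohm
print(tight, wide)
```

With these made-up numbers the tight bound fails to show equivalence while the wider bound succeeds, which mirrors the pattern in the two examples above: the tighter the equivalence bound relative to the noise, the harder it is to reject both null hypotheses.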

To learn more about comparing average performance to a standard, and one-sample equivalence tests, see Chapter 4 of our book, Analyzing and Interpreting Continuous Data Using JMP: A Step-by-Step Guide.

Different or Equivalent?

2009-10-25T16:25:00.002-04:00

When we show that the results of our study are "statistically significant" we feel that the study was worth the effort, that we have met our objectives. This is because the current meaning of the word "significant" implies that something is important or consequential but, unfortunately, that was not its intended meaning. (See John Cook's blog "The Endeavor" for a nice post on the Origin of “statistically significant”).

Let's say we need to make a claim about the average DC resistance of a certain type of cable we manufacture. We set up the null hypothesis as μ=50 Ohm vs. the alternative hypothesis μ≠50 Ohm, and measure the resistance of 40 such cables. If the one-sample t-test, based on the sample of 40 cables, is statistically significant we can claim that the average DC resistance is different from 50 Ohm. Our claim does not imply that this difference is of any practical importance (that depends on the size of the difference), just that the average DC resistance is not 50 Ohm. A test of significance is a test of difference. This is the operational definition given to the term "statistical significance" by Sir Ronald Fisher in his 1925 book Statistical Methods for Research Workers: “Critical tests of this kind may be called tests of significance, and when such tests are available we may discover whether a second sample is or is not significantly different from the first” (emphasis mine).

What if we do not reject the null hypothesis μ=50 Ohm? Although tests of significance are set up to demonstrate difference, not equality, we sometimes take this lack of evidence as evidence that the average DC resistance is in fact 50 Ohm. This is because in practice we encounter situations where we need to demonstrate to a customer, or government agency, that the average DC resistance is "close" to 50 Ohm. In the context of significance testing what we need to do is to swap the null and alternative hypotheses and test for equivalence within a given bound; i.e., test the null hypothesis |μ-50 Ohm| ≥ δ vs. the alternative |μ-50 Ohm| < δ, where δ is a small number. In the next post Brenda discusses how a test of equivalence is a great way of combining statistical with practical significance.