Stat Insights

Tuesday, December 7, 2010

Music and Data Visualization (What It's All About)

A couple of weeks ago, mashup artist Girl Talk (Gregg Gillis) released his new album, All Day. If you are not familiar with Girl Talk's music, each track is a mashup of samples from different artists. How do you visualize one of these musical mashups? Benjamin Rahn has come up with an effective and clever way of doing it. If you love music and data visualization check out what Benjamin has done: All Day by Girl Talk - Mashup Breakdown (be aware that the lyrics of some of the sampled tracks are explicit).

Amazing, isn't it? You can not only hear the music but also visualize, in real time, which samples are playing within the mashup. This "little side project", as Benjamin put it on his Twitter feed, takes visualization to another dimension. Hats off to Benjamin! This mashup breakdown got me thinking about getting a copy of the data, and about the possible visualizations I could do with JMP. Fortunately, the data for All Day, as well as the data for Girl Talk's previous album, Feed the Animals, is available from Wikipedia (huge thanks to all the people who have contributed to these pages).

Feed the Animals was released on June 19, 2008, and there have been a few creative visualizations of the music and information in the mashups. Angela Watercutter (Wired 16.09) deconstructed track #4, What It's All About (you can listen to Feed the Animals tracks here), displaying in a circular graph the 35 sample names, their durations, and pictures of the artists that are part of this 4:14 track. Although it is a nice visual of all the artists and songs, it is not easy to see how the sampled tracks line up within the mashup.

Bunny Greenhouse (Chris Beckman) has created video mashups for each of the tracks in Feed the Animals. Here is the one for What It's All About. You get flashes of Beyoncé at the beginning of the video, and around 0:21 seconds you see Busta Rhymes with The Police's "Every Little Thing She Does Is Magic" in the background. By 0:43 seconds The Police drummer Stewart Copeland appears playing his unmistakable driving drum beat, followed by Sting, Andy and Stewart dancing. At 3:33 we see a very young Michael Jackson singing "ABC", blending, around 4 minutes, with Queen's "Bohemian Rhapsody". Now you got me, I can enjoy the music, see which artists are part of the mashup, and when their tracks appear.

In order to contribute to this collection of visualizations I decided to use the Feed the Animals data and concentrate on track 4, What It's All About. For each of the 14 tracks in Feed the Animals, what does the distribution of sampled track lengths look like? Are they similar to each other? Did Girl Talk use mostly short samples? How long are the longest samples? The chart below shows the lengths of sampled tracks, as jittered points, for each of the 14 tracks, along with a boxplot to get a sense for the distribution. I've added a red line to show the overall median (23 seconds) of all the 329 sampled tracks. The distribution of each of the 14 tracks is skewed to the right, and about 2/3 of the samples are 30 seconds or less. The color of the points shows how many times a sample of a given length was used (1 to 4). For example, in "Like This" (track #7), he used four 1-second and four 16-second sampled tracks. Note that most of the red points are in tracks 6, 7 and 8. The outlier for Track 1, "Gimme Some Lovin'" (Spencer Davis Group), shows that Girl Talk favored this sampled track by giving it 2:11 minutes out of the total 4:45 minutes. What It's All About (Track #4) also has a long sample (Busta Rhymes' Woo Hah!! Got You All in Check) lasting 1:15 minutes, or about 30% of the total track.
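
The shares quoted above are simple arithmetic on m:ss durations. Here is a minimal Python sketch that reproduces them from the durations given in the text; the helper names `to_seconds` and `share_of_track` are my own and not part of JMP or any other tool mentioned here.

```python
def to_seconds(mmss: str) -> int:
    """Convert an 'm:ss' duration string to whole seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def share_of_track(sample: str, track: str) -> float:
    """Fraction of a track's running time taken up by one sample."""
    return to_seconds(sample) / to_seconds(track)


# Durations quoted in the text above.
woo_hah = share_of_track("1:15", "4:14")   # Track 4, What It's All About
gimme = share_of_track("2:11", "4:45")     # Track 1

print(f"Woo Hah!! Got You All in Check: {woo_hah:.1%} of track 4")
print(f"Gimme Some Lovin': {gimme:.1%} of track 1")
```

The first share comes out just under 30%, and the second close to 46%, which is why "Gimme Some Lovin'" stands out as an outlier.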

A nice visualization tool from the genomics world, the cell plot, gives another perspective to the density of sample lengths within a track. A cell plot is a visual representation of a data table, with each cell in the plot representing a data point. The cell plot for What It's All About shows 35 cells, one for each sampled track, with a color shade, from white (short) to dark blue (long), denoting the length of the track. What It's All About kicks in with sampled tracks of lengths between 10 and 20 seconds, followed by the longest track (Woo Hah!! Got You All in Check), the darkest blue cell. Starting with "Every Little Thing She Does Is Magic" (remember second 21 in Bunny Greenhouse's video mashup?), there is a sequence of 6 sampled tracks with lengths between 30 and 55 seconds, the exception (white cell) being "Memory Band" with only 3 seconds. Towards the end we see a sequence of very short sampled tracks, the almost white strip between "What Up Gangsta?" and "Ms. Jackson", followed by the last 4 sampled tracks with lengths around 30 seconds.

What is missing in these plots is the time dimension. One of the nice things about Benjamin's visualization is that one can see where the sampled tracks fall in the overall time sequence of the track, and with respect to each other. We can use the Graph Builder in JMP to create a plot for What It's All About, with sampled track length on the x-axis, the song name on the y-axis, and the start and stop times in a stock-style bar chart. Now it is easier to see that the first 4 sampled tracks have similar lengths and that they occur around the same time. The longest sampled track, "Woo Hah!! Got You All in Check", starts around 0:15 seconds together with "Every Little Thing She Does Is Magic" but it lasts almost twice as long. In the middle of the track there is another long sampled track, "Go!", lasting about 65 seconds. We also see the run of very short sampled tracks towards the end of the track, as we saw in the cell plot. The track ends with about 20 seconds (3:53 to 4:14) of Queen's "Bohemian Rhapsody".
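
To make the start/stop idea concrete outside JMP, here is a rough text-only sketch in Python. It is not the Graph Builder plot: it only draws each sample as a bar on a shared time axis. The three start and stop times are approximate values read from this post (the Woo Hah!! stop time is its 0:15 start plus its 1:15 length), not the full data set, and the function names are hypothetical.

```python
def to_seconds(mmss: str) -> int:
    """Convert an 'm:ss' duration string to whole seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def timeline_row(start: int, stop: int, total: int, width: int = 50) -> str:
    """Draw one sample as a bar positioned on a fixed-width time axis."""
    first = min(width - 1, round(start / total * width))
    last = min(width, max(first + 1, round(stop / total * width)))
    return " " * first + "#" * (last - first) + " " * (width - last)


TRACK_LENGTH = to_seconds("4:14")

# (name, start, stop): approximate times taken from the text, for illustration.
samples = [
    ("Woo Hah!! Got You All in Check", "0:15", "1:30"),
    ("Every Little Thing She Does Is Magic", "0:15", "1:04"),
    ("Bohemian Rhapsody", "3:53", "4:14"),
]

for name, start, stop in samples:
    bar = timeline_row(to_seconds(start), to_seconds(stop), TRACK_LENGTH)
    print(f"|{bar}| {name}")
```

Every row has the same width, so bars that overlap vertically are playing at the same time, which is the information the jittered points and the cell plot could not show.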

The previous plot is an improvement but music happens over time, dynamically. In order to show the dynamic dimension of time, we can use a bubble plot with bubble trails as I illustrated in my previous post, Visualizing Change with Bubble Plots. Since this visualization does not include the music, I decided to speed things up so you don't have to watch it for the 4:14 minutes that What It's All About lasts. Ready? Hit play.

Now you can see bubbles appearing and disappearing in the order they show up in the track. At 0:21 seconds you can see the red and cyan bubbles corresponding to "Woo Hah!! Got You All in Check" and "Every Little Thing She Does Is Magic". At 0:40 The Cure's "Close to Me" comes in, and at 1:04 minutes, when "Every Little Thing She Does Is Magic" drops out, two more sampled tracks appear: "Here Comes the Hotstepper" and "Land of a Thousand Dances". We can also see the 6 very short sampled tracks from the previous plots starting at 3:17 minutes. "Bohemian Rhapsody" enters at 3:53 minutes, riding the last 21 seconds of the track. Who knows? Maybe for the next version of JMP we'll be able to add sound to the bubble plot mix.

Monday, November 29, 2010

Visualizing Change with Bubble Plots

In my last post I showed some of the features of JMP's Bubble Plot platform using the US crime rate data (How to Make Bubble Charts) to create a static bubble plot, and the 1973 to 1999 US crime rate data to generate a dynamic bubble plot. Describing the dynamic bubble plot, I wrote:

Around 1976 Nevada starts to move away from the rest of the states, with both high burglary and murder rates, reaching a maximum around 1980, and returning to California and Florida levels by 1984. Around 1989 the murder rate in Louisiana starts to increase, reaching 20 per 100,000 by 1993, staying between 15 and 20 per 100,000 all the way up to 1997, with a fairly constant burglary rate. We can also see that the crime rates for North Dakota are consistently low.

You can see these stories unfold in the animation, but after it is over our brains tend to forget the path a bubble took; the sequence of steps that led to its final position. Fortunately for us, the Bubble Plot platform has an option, Trail Lines, that can help our brains visualize motion. This option can be accessed from the bubble plot contextual menu:

Let's select the bubbles for Nevada, Louisiana, and North Dakota. If you run the animation a trail follows the motion of each of these bubbles. By the end of the sequence, 1999, the plot shows the paths taken by these 3 states.

Now we can clearly see Nevada, the green trail, shooting up to the upper right (high burglary and murder rates), and then coming back. Note how Louisiana (blue line) moves horizontally to the right (higher murder rate), without changing too much in the vertical direction (burglary rate). North Dakota's path (yellow line) is a short zigzag motion, keeping itself around a burglary rate of 435 per 100,000, and a murder rate of 1.18 per 100,000.

In Visualizing Change, data visualization expert Stephen Few discusses four meaningful characteristics of change through time: magnitude, shape, velocity and direction. These four characteristics are easier to visualize by using Trail Bubbles in addition to Trail Lines. The plot below shows the Trail Lines and Trail Bubbles for Louisiana, Nevada, and North Dakota. To help the eye, I've added labels for the starting year of 1973.

The magnitude of change can be assessed by looking at the difference between bubble locations. For Nevada, between 1973 and 1980, you can see big changes in the burglary rate, from about 2000 to 3000 per 100,000, and the murder rate, from 12 to 20 per 100,000. By 1999 Nevada's burglary rate had been cut in half to 1000 per 100,000. The shape of change is given by the overall shape of the bubbles, while the direction and velocity of change can be visualized by the trend of the trails and the rate at which a bubble moves from one place to the next. For Nevada, the shape of change is somewhat concave, with rapid changes (big jumps from one bubble to the next), trending upward and downward along the 45° diagonal.

Louisiana's burglary rate did not change much (vertical changes), but its murder rate went up to 20 per 100,000, ending at 10 per 100,000, lower than where it started (horizontal changes). The changes did not seem to occur rapidly, because the distance between the bubbles is small. As we saw before, there are not a lot of changes in North Dakota. Its shape is a circle with a small radius; i.e., neither big nor rapid changes in either murder or burglary rates (the last bubble is almost where it started).

A JMP bubble plot, with line and bubble trails, can really change the way you visualize change. Go ahead, give it a try.

Tuesday, November 23, 2010

Visualizing Data with Bubble Plots

Bubble plots are a great way of displaying 3 or more variables using an X-Y scatter plot, and are a useful diagnostic tool for detecting outliers and influential observations in both logistic regression (Hosmer and Lemeshow used them in their 1989 book Applied Logistic Regression), and in multiple linear regression (What If Einstein Had JMP). New technologies have made it possible to animate the bubbles according to a given variable, such as time, as was masterfully demonstrated by Hans Rosling in his talk, New Insights on Poverty, at the March 2007 TED conference.

Today, Nathan Yau posted an entry in his data visualization blog, FlowingData, on How to Make Bubble Charts using R. He gives 5 steps (6 if you count step 0), and the corresponding R code, to create a static bubble plot that shows the 2008 US burglary rate vs murder rate for each of the 50 states, with red bubbles representing the state population.

(From http://flowingdata.com/2010/11/23/how-to-make-bubble-charts/5-edited-version-2/)

Let me show you how easy it is to create static and dynamic bubble plots in JMP. The 2008 crime rate data is available at http://datasets.flowingdata.com/crimeRatesByState2008.csv, and can be conveniently read into JMP version 9 using File>Internet Open, as shown below.

To create a bubble plot we select Graph>Bubble Plot to bring up the Bubble Plot launch panel. Here we select Burglary as the Y, Murder as the X, and Population as the Sizes. The bubbles in Nathan's plot are red and are labeled with the state name. In order to color and label the bubbles we select State for both ID and Coloring. These selections are shown below.

Once you click OK our multicolor bubble plot appears (I have modified the axis to match Nathan's plot). We quickly see that Louisiana and Maryland have the highest murder rates, and similar population sizes, and that North Carolina has the highest burglary rate.

In Step 3 Nathan shows how to size the bubbles by making the radius a function of the area of the bubble. Below the X-axis in JMP's bubble plot there is a slider to dynamically control the size of the bubble. You can just move the slider to the right for larger bubbles, or to the left to decrease their size. Very easy; no code required.
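
The reason for Step 3 is that the eye reads a bubble's area, so the radius has to grow with the square root of the value being displayed. A minimal Python sketch of that relationship follows; the function name, the `scale` parameter, and the two populations are hypothetical, and this is neither Nathan's R code nor anything JMP exposes.

```python
import math


def bubble_radius(value: float, scale: float = 1.0) -> float:
    """Radius of a circle whose area is proportional to `value`."""
    if value < 0:
        raise ValueError("value must be non-negative")
    return scale * math.sqrt(value / math.pi)


# Hypothetical populations, for illustration only.
small, large = 1_000_000, 4_000_000
r_small, r_large = bubble_radius(small), bubble_radius(large)

# Four times the population gives twice the radius, hence four times the area.
print(f"radius ratio: {r_large / r_small:.1f}")
print(f"area ratio:   {(r_large / r_small) ** 2:.1f}")
```

Had the radius been made proportional to the population itself, the larger state would have been drawn with sixteen times the area, which exaggerates the difference.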

The static bubble plot above is a snapshot of the crime rates in 2008. What if we want to visualize how the burglary and murder rates changed over the years? In JMP, a time variable can be used to animate the bubbles. We use the same Bubble Plot selections as before but now we add Year as the Time variable.

Several stories now emerge from this dynamic plot. Around 1976 Nevada starts to move away from the rest of the states, with both high burglary and murder rates, reaching a maximum around 1980, and returning to California and Florida levels by 1984. Around 1989 the murder rate in Louisiana starts to increase, reaching 20 per 100,000 by 1993, staying between 15 and 20 per 100,000 all the way up to 1997, with a fairly constant burglary rate. We can also see that the crime rates for North Dakota are consistently low, and that by 1999 all the states seem to form a more cohesive group moving towards the lower left corner.

Bubble plots can be animated using other variables, not necessarily a time one. I have used the dynamic bubble plot to show how the relationship between material degradation and time changes from linear to nonlinear as a function of temperature. In the video below you can see that as the temperature increases from 9°C to 50°C the material degrades faster, and that for higher temperatures, 40°C and 50°C, the degradation is nonlinear. This is a nice visual that helps convey the message without the need to show the model equations.

With JMP's static and dynamic bubble plots you can easily display up to 6 variables (seven using ID2) in the 2-dimensional space of a scatter plot. What an efficient way of visualizing data!

Monday, October 25, 2010

Analysis of Means: A Graphical Decision Tool

JMP version 9 has been out for about two weeks now, and I hope you had a chance to play with it. If you are not ready to buy it you can give it a try by downloading a 30-day trial copy.

Today I want to share with you a new feature in JMP version 9: the Analysis of Means (ANOM). An analysis of means is a graphical decision tool for comparing a set of averages with respect to their overall average. You can think of it as a control chart but with decision limits instead of control limits, or as an alternative to an analysis of variance (ANOVA). In an ANOVA a significant F test just indicates that the means are different, but it does not reveal where the differences are coming from. By contrast, in an ANOM chart if an average falls outside the decision limits it is an indication that this average is statistically different, with a given risk α, from the overall average.

Prof. Ellis Ott introduced the analysis of means in 1967 as a logical extension of the Shewhart control chart. Let's look at an example. The plot below shows measurements of an electrical assembly as a function of six different types of ceramic sheets used in their construction. The data appears in Table 13.1 of the first edition of Prof. Ott's book Process Quality Control.

One can see some differences in the average performance of the six ceramic sheets. A Shewhart Xbar and R chart shows that the ranges are in control, indicating consistency within a ceramic sheet, but that the average of ceramic sheet #6 is outside the lower control limit. Based on this we can say that there is probably an assignable cause responsible for this low average, but we cannot claim any statistical significance.

The question of interest, quoting from Prof. Ott's book, is: "Is there evidence from the sample data that some of the ceramic sheets are significantly different from their own group average?". We can perform an analysis of variance to test the hypothesis that the averages are different. The F test is significant at the 5% level, indicating that the average electrical performance differs among the six ceramic sheets. The F test, being an 'omnibus' type test, does not, however, tell us which one, or which ones, are different. For this we need to perform multiple comparison tests, or an analysis of means.

An ANOM chart can be easily generated by selecting Analysis of Means Methods > ANOM from within the Analyze>Fit Y by X>Oneway Analysis window.

The ANOM chart clearly reveals that the assemblies built using the ceramic sheet #6 have an average that is (statistically) lower than the overall average of 15.952. The other five averages are within the 5% risk decision limits, indicating that their electrical performance can be assumed to be similar.

The ANOM chart, with decision limits 15.15 and 16.76, provides a graphical test for simultaneously comparing the performance of these six averages. What a great way to perform the test and communicate its results. Next time you need to decide which average, or averages, are (statistically) different from the overall average, give the ANOM chart a try.
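
For readers curious about where the decision limits come from: with equal group sizes, the usual ANOM limits are the overall average plus or minus h × s × sqrt((k-1)/(k×n)), where s is the pooled within-group standard deviation, k the number of groups, n the observations per group, and h a tabulated critical value. The Python sketch below only illustrates that formula. Apart from the overall average of 15.952 and the six groups, every input is a placeholder that I made up, so the printed limits will not match the chart above, and I am not claiming this is how JMP computes them.

```python
import math


def anom_limits(grand_mean: float, pooled_sd: float, n_per_group: int,
                n_groups: int, h: float) -> tuple[float, float]:
    """Lower and upper ANOM decision limits for equal group sizes.

    `h` is the ANOM critical value for the chosen risk, number of groups,
    and error degrees of freedom; it must come from a table or software.
    """
    half_width = h * pooled_sd * math.sqrt(
        (n_groups - 1) / (n_groups * n_per_group)
    )
    return grand_mean - half_width, grand_mean + half_width


# Only the overall average (15.952) and the six groups come from the text.
# pooled_sd, n_per_group, and h are placeholder values for illustration.
lower, upper = anom_limits(grand_mean=15.952, pooled_sd=1.0,
                           n_per_group=7, n_groups=6, h=2.7)
print(f"decision limits: {lower:.2f} to {upper:.2f}")
```

The limits are symmetric around the overall average and tighten as the number of observations per group grows, just like control limits for averages.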

Tuesday, September 21, 2010

JMP Discovery 2010

Last week I attended the JMP Discovery 2010 conference. What a great way to learn about the new features coming up in version 9, to network, to see old friends and make new ones, and to enjoy the many keynote speakers like Dan Ariely who talked about how predictably irrational we are. I also had the opportunity to lead a breakout session on Tailor-Made Split-Plot Designs, a short tutorial on JMP® Custom Design.

For us, the conference marked the one-year anniversary of the publication of our book, and a time for celebration. We have received very positive feedback from our readers; Prof. Phil Ramsey of the University of New Hampshire used our book in his summer course "Statistics for Engineers and Scientists"; and, to top it all off, we won the 2009-2010 International Technical Publications Competition (ITPC) Award of Excellence awarded by the Society for Technical Communication.

John Sall, co-founder and Executive Vice President of SAS, gave a one-hour demo showcasing some of the new features of JMP 9. Windows users will discover a new look, while Excel users can now leverage the profiler to optimize and simulate worksheet formulas. R users will be able to run R code from within JMP, and data miners will find a new array of tools. From the engineering and science side there are many enhancements to existing tools, as well as some new ones. Here are some of the new things:

Fit Y by X platform. Analysis of Means (ANOM) has been added when the X variable is categorical. ANOM is a graphical way for comparing means to the overall mean. Think of it as a control chart with decision limits.
Sample Size calculator. A Reliability Test Plan is now available to design reliability studies. There is also a Reliability Demonstration calculator for planning a reliability demonstration study.
Life Distribution. Several new distributions have been added including a 3-parameter Weibull distribution.
Degradation. A new tool for performing degradation analyses within the Analyze>Reliability and Survival platform.
Accelerated Life Test Design. Within the DOE platform to design ALT plans for one and two accelerating factors.
Graph Builder. Several new additions including the ability to use custom maps.

I am looking forward to sharing with you examples of how to use these new tools. The scheduled date for the release of JMP 9 is October 12. Stay tuned.

Monday, June 28, 2010

Tolerance (Coverage) Intervals

You are probably familiar with confidence intervals as a way to place bounds, with a given statistical confidence, on the parameters of a distribution. In my post I Am Confident That I Am 95% Confused!, I used the DC resistance data shown below to illustrate confidence intervals and the meaning of confidence.

From the JMP output we can see that the average resistance for the sample of 40 cables is 49.94 Ohm and that, with 95% confidence, the average DC resistance can be as low as 49.31 Ohm, or as large as 50.57 Ohm. We can also compute a confidence interval for the standard deviation. That's right, even the estimate of noise has noise. This is easily done by selecting Confidence Interval > 0.95 from the contextual menu next to DC Resistance.

We can say that, with 95% confidence, the DC resistance standard deviation can be as low as 1.61 Ohm, or as large as 2.52 Ohm.
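
If you want to check this interval by hand, the standard chi-square interval for a normal standard deviation is s×sqrt((n-1)/χ²) evaluated at the upper and lower chi-square quantiles. The Python sketch below uses n = 40 and the sample standard deviation of 1.96 Ohm for these cables. Because the standard library has no chi-square quantile, I substitute the Wilson-Hilferty approximation; that is my shortcut, not what JMP does, so expect agreement only to about two decimals.

```python
from statistics import NormalDist


def chi2_quantile(p: float, df: int) -> float:
    """Wilson-Hilferty approximation to the chi-square quantile (an assumption)."""
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * c ** 0.5) ** 3


def sd_confidence_interval(s: float, n: int, confidence: float = 0.95):
    """Two-sided confidence interval for a normal standard deviation."""
    alpha = 1.0 - confidence
    df = n - 1
    lower = s * (df / chi2_quantile(1.0 - alpha / 2.0, df)) ** 0.5
    upper = s * (df / chi2_quantile(alpha / 2.0, df)) ** 0.5
    return lower, upper


# n = 40 cables, s = 1.96 Ohm (the sample standard deviation of the DC resistance data).
lower, upper = sd_confidence_interval(s=1.96, n=40)
print(f"{lower:.3f} Ohm to {upper:.3f} Ohm")
```

The result agrees, to within rounding, with the 1.61 Ohm and 2.52 Ohm reported above. Note that the interval is not symmetric around 1.96, because the chi-square distribution is skewed.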

Although confidence intervals have many uses in engineering and scientific applications, a lot of times we need bounds not on the distribution parameters but on a given proportion of the population. In her post, 3 Is The Magic Number, Brenda discussed why the mean ± 3×(standard deviation) formula is popular for setting specifications. For a Normal distribution, if we know the mean and standard deviation without error, we expect about 99.73% of the population to fall within mean ± 3×(standard deviation). In fact, for any distribution Chebyshev's inequality guarantees that at least 88.89% of the population will be contained between ± 3×standard deviations. It gets better. If the distribution is unimodal we expect at least 95.06% of the population to be between ± 3×standard deviations.
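
These three percentages are easy to verify. Chebyshev's bound for k standard deviations is 1 - 1/k², the unimodal figure matches the Vysochanskij-Petunin bound 1 - 4/(9k²), and the normal coverage comes from the normal cumulative distribution function. A short Python check (the function names are mine):

```python
from statistics import NormalDist


def chebyshev_bound(k: float) -> float:
    """Minimum coverage within k standard deviations, any distribution."""
    return 1.0 - 1.0 / k ** 2


def unimodal_bound(k: float) -> float:
    """Vysochanskij-Petunin minimum coverage for unimodal distributions.

    Valid for k > sqrt(8/3), which includes k = 3.
    """
    return 1.0 - 4.0 / (9.0 * k ** 2)


def normal_coverage(k: float) -> float:
    """Exact coverage within k standard deviations for a normal distribution."""
    z = NormalDist()
    return z.cdf(k) - z.cdf(-k)


for label, value in [("any distribution (lower bound)", chebyshev_bound(3)),
                     ("unimodal (lower bound)", unimodal_bound(3)),
                     ("normal (exact)", normal_coverage(3))]:
    print(f"{label}: {value:.2%}")
```

The first two are guarantees that hold no matter what the distribution looks like, which is why they are so much lower than the normal figure.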

In real applications, however, we seldom know the mean and standard deviation without error. These two parameters have to be estimated from a random and, usually small, representative sample. A tolerance interval is a statistical coverage interval that includes at least a given proportion of the population. This type of interval takes into account both the sample size and the noise in the estimates of the mean and standard deviation. For normally distributed data an approximate two-sided tolerance interval is given by

mean ± g(1-α/2,p,n)×(standard deviation)

Here g(1-α/2,p,n) takes the place of the magic "3", and is a function of the confidence, 1-α/2; the proportion that we want the interval to cover, p (you can think of this as the yield that we want to bracket); and the sample size, n. These intervals are readily available in JMP by selecting Tolerance Interval within the Distribution platform, and specifying the confidence and the proportion of the population to be covered by the interval.

For the DC resistance data the mean ± 3×(standard deviation) interval is 49.94±3×1.96 = [44.06 Ohm;55.82 Ohm], while the 95% tolerance interval to cover at least 99.73% of the population is [42.60 Ohm;57.28 Ohm]. The tolerance interval is wider than the mean ± 3×(standard deviation) interval, because it accounts for the error in the estimates of the mean and standard deviation, and the small sample size. Here we use a coverage of 99.73% because this is what is expected between ± 3×standard deviations, if we knew the mean and sigma.
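
To see how a multiplier larger than 3 arises, here is a Python sketch of one well-known approximation to the two-sided normal tolerance factor, usually attributed to Howe: g ≈ z × sqrt((n-1)(1+1/n)/χ²), where z is the normal quantile for (1+p)/2 and χ² is the lower chi-square quantile for the chosen confidence with n-1 degrees of freedom. Both Howe's formula and the Wilson-Hilferty chi-square approximation used below are my choices for illustration; I am not asserting that JMP uses either.

```python
from statistics import NormalDist


def chi2_quantile(p: float, df: int) -> float:
    """Wilson-Hilferty approximation to the chi-square quantile (an assumption)."""
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * c ** 0.5) ** 3


def tolerance_factor(n: int, coverage: float, confidence: float) -> float:
    """Howe's approximate two-sided tolerance factor for normal data."""
    df = n - 1
    z = NormalDist().inv_cdf((1.0 + coverage) / 2.0)
    chi2 = chi2_quantile(1.0 - confidence, df)
    return z * (df * (1.0 + 1.0 / n) / chi2) ** 0.5


# DC resistance data from the text: n = 40, mean = 49.94 Ohm, s = 1.96 Ohm.
mean, s, n = 49.94, 1.96, 40
g = tolerance_factor(n, coverage=0.9973, confidence=0.95)
lower, upper = mean - g * s, mean + g * s

print(f"g = {g:.2f} (instead of 3)")
print(f"tolerance interval: [{lower:.2f} Ohm; {upper:.2f} Ohm]")
```

With these inputs g comes out near 3.74, and the interval lands within about 0.01 Ohm of the [42.60 Ohm;57.28 Ohm] reported by JMP. The extra 0.74 standard deviations on each side is the price of estimating the mean and sigma from only 40 cables.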

Tolerance intervals are a great practical tool because they can be used to set specification limits, as a surrogate for specification limits, or to compare them to a given set of specification limits; e.g., from our customer, to see if we will be able to meet them.

Monday, June 21, 2010

To "Log" or Not to "Log"

Many statistical techniques rely upon the assumption that the data follows a normal distribution. In fact, questions surrounding the impact of normality on the validity of a statistical analysis rank right up there with 'how many samples do I need for my results to be meaningful?'. Some statistical techniques derived from an assumption of normality of the data can deal with slight to moderate departures quite well, while others are greatly impacted. For instance, if a response such as bond strength is slightly skewed to the right with a mean well above 0, then a control chart for subgroup averages derived in the typical manner, using 3 sigma limits, should have adequate performance in detecting a large shift in the mean. However, a response which is heavily skewed to the right, with a lower boundary close to 0, for example, will produce unreasonable results using techniques based upon normality.

Let's examine the impact of departures from normality when setting a '3 sigma' upper specification limit based upon process capability, for a response that inherently follows a lognormal distribution. In the figure below, we see a histogram for 50 measurements of a response, which was randomly generated using a lognormal distribution. At first glance it is obvious that the distribution is skewed to the right with a lower bound close to 0, putting the normality assumption in question. However, is it so skewed that it will call into question the accuracy of statistical techniques derived from a normal assumption?

Let's fit a normal distribution to the data and use the normal probability model to determine the value at which the probability of exceeding it is 0.135%, i.e., a '3 sigma' equivalent upper limit. This value is 4.501, with 2% of the actual readings above this upper limit.

Now, let's fit the more appropriate lognormal distribution to this data and use the lognormal probability model to determine the value at which the probability of exceeding it is 0.135%, i.e., a '3 sigma' equivalent upper limit. This value is 9.501, with no actual readings above this limit.

The upper limits derived from these two distributions are quite different from each other: 9.501 (lognormal) vs. 4.501 (normal). Since we know this response is lognormal, we would be significantly underestimating the '3 sigma' upper limit using the normal assumption, resulting in 2% out-of-spec readings, compared to the 0.135% that we set out to achieve. In a real world application, this could result in chasing false alarms on a control chart or scrapping product that is well within the distribution of the data. Before moving forward with a statistical analysis of your data, determine the best fitting distribution by evaluating the histogram, distribution-based quantile plots, goodness of fit tests, and most importantly, understanding the context of the data. The best distributions to use for many engineering and scientific applications are well known and well documented.
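
The same comparison can be sketched in a few lines of Python. The data here are synthetic: 50 evenly spaced quantiles of a lognormal distribution whose parameters (mu = 0, sigma = 0.75) I assumed for illustration. They are not the 50 measurements behind the figures in this post, so the limits will differ from 4.501 and 9.501, but the pattern is the same.

```python
import math
from statistics import NormalDist, fmean, stdev

# Synthetic right-skewed data: 50 evenly spaced quantiles of a lognormal
# distribution with assumed parameters mu = 0 and sigma = 0.75.
std_normal = NormalDist()
n = 50
data = [math.exp(0.75 * std_normal.inv_cdf((i + 0.5) / n)) for i in range(n)]

# Quantile with an upper-tail probability of 0.135% (the '3 sigma' point).
z = std_normal.inv_cdf(1 - 0.00135)

# Normal fit: the limit is computed directly on the original scale.
normal_limit = fmean(data) + z * stdev(data)

# Lognormal fit: fit a normal to log(data), then back-transform the limit.
logs = [math.log(x) for x in data]
lognormal_limit = math.exp(fmean(logs) + z * stdev(logs))

above = sum(x > normal_limit for x in data)
print(f"normal limit:    {normal_limit:.2f} ({above} of {n} readings above)")
print(f"lognormal limit: {lognormal_limit:.2f}")
```

The limit based on the lognormal fit sits well above the one based on the normal fit, and above every reading in the sample, which mirrors the behavior described above.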