Statistics Tip: Formulas for Degrees of Freedom vary by the Statistic and by the test in which it is used.
A Statistic is a numerical property of a Sample, for example, the Sample Mean or Sample Variance. A Statistic is an estimate of the corresponding property (“Parameter”) in the Population or Process from which the Sample was drawn. Being an estimate, it will likely not have the exact same value as its corresponding population Parameter. The difference is the error in the estimation.
So, if we calculate a Statistic entirely from data values, there is a certain amount of error. For example, the Sample Mean is calculated entirely from the values of the Sample data. It is the sum of all the data values in the Sample divided by the number, n, of items in the Sample. There is one source of error in its formula – the fact that it is an estimate because it does not use all the data in the Population or Process.
Another way that Degrees of Freedom is described is "The number of independent pieces of information that go into the calculation of a Statistic." To illustrate, let's say we have a Sample of n = 5 data values: 2, 4, 6, 8, and 10.
When we calculate the Sample Mean, we have 5 independent pieces of information – the five values of the data. They are independent because none of the values depends on any of the others. So, for the Mean, df = 5.
Sample Mean = (2 + 4 + 6 + 8 + 10) / 5 = 30 / 5 = 6
But, when we calculate the Sample Variance, we use the Mean as well as the 5 data values. The Mean is not an independent piece of information, because it is dependent on the 5 data values.
Also, when we include the Mean, we only have 4 independent pieces of information left. If we know that the Mean is 6 (so the five values must sum to 5 × 6 = 30), and we have the data values 2, 4, 6, and 8, then we can calculate that the last data value has to be 10. So, 10 no longer brings independent information to the table.
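Here is a minimal Python sketch of that idea, using the same five data values and the Sample Mean of 6. The variable names are just for illustration. It shows that once the Mean and any four of the values are known, the fifth value is forced.

```python
# The Sample from the text.
data = [2, 4, 6, 8, 10]
n = len(data)

# Sample Mean: sum of the values divided by n (30 / 5).
sample_mean = sum(data) / n

# Pretend we know only the Mean and the first four values.
known_values = data[:4]

# All n values must sum to n * mean, so the last value is not free to vary.
last_value = n * sample_mean - sum(known_values)

print(sample_mean)  # 6.0
print(last_value)   # 10.0
```

Only 4 of the 5 values are free to vary once the Mean is fixed, which is where df = n – 1 comes from.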
If we then use that Statistic to calculate another Statistic, it brings its own estimation error into the calculation of the second Statistic. This error is in addition to the second Statistic's estimation error. This happens in the case of the Sample Variance.
Example: Sample Variance
Numerator for Sample Variance:
The numerator of the formula for Sample Variance includes the Sample Mean. It takes each data value (the x's) in the Sample, subtracts the Sample Mean from it, and squares the difference. Then it sums all those squared differences.
So, the Sample Variance has two sources of error:
We don't need to make this adjustment for the Sample Mean, but we do need to do so for the Sample Variance. We divide by n – 1, instead of n.
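To make the n – 1 adjustment concrete, here is a short Python sketch using the same five data values as before. It computes the numerator (the sum of squared differences from the Sample Mean) and then divides it both ways. The variable names are illustrative. The last line uses the standard library's statistics.variance, which also divides by n – 1.

```python
import statistics

data = [2, 4, 6, 8, 10]
n = len(data)
sample_mean = sum(data) / n  # 6.0

# Numerator: sum of squared differences from the Sample Mean.
sum_of_squares = sum((x - sample_mean) ** 2 for x in data)

# Dividing by n ignores the error carried in by the Sample Mean.
variance_divide_by_n = sum_of_squares / n

# Dividing by n - 1 (the Degrees of Freedom) adjusts for it.
variance_divide_by_n_minus_1 = sum_of_squares / (n - 1)

print(sum_of_squares)                # 40.0
print(variance_divide_by_n)          # 8.0
print(variance_divide_by_n_minus_1)  # 10.0

# Cross-check against the standard library's Sample Variance.
print(statistics.variance(data) == variance_divide_by_n_minus_1)  # True
```

Dividing by the smaller number, n – 1, gives a slightly larger Variance, which compensates for using an estimated Mean.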
I uploaded a new video: Design of Experiments (DOE) Part 3 of 3
p is the Probability of an Alpha (False Positive) Error. Alpha (α) is the Level of Significance; its value is selected by the person performing the statistical test. If p ≤ α (some say if p < α), then we Reject the Null Hypothesis. That is, we conclude that any difference, change, or effect observed in the Sample data is Statistically Significant.
The p-value contains the same information as the Test Statistic Value, say z. That is because the value of z is used to determine the p-value. As shown in the following concept flow diagram,
Similarly, α contains the same information as the Critical Value, because the value of α is used to determine the Critical Value.
So comparing p and Alpha is the same as comparing the Test Statistic value and the Critical Value. But the comparison symbols (">" and "<") point in opposite directions: p ≤ α corresponds to z ≥ Critical Value. That's because p and the Test Statistic have an inverse relation. A smaller value for p means that the Test Statistic value must be larger.
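A small Python sketch can show that the two comparisons always give the same answer. It assumes a right-tailed z test on the standard Normal distribution. The function name and the example z values are hypothetical, chosen only for illustration.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def decisions(z, alpha):
    """Return (reject using p vs alpha, reject using z vs Critical Value).

    Assumes a right-tailed z test; this is an illustrative sketch only.
    """
    # p is the area under the curve to the right of the Test Statistic.
    p = 1 - STANDARD_NORMAL.cdf(z)
    # The Critical Value is the z that leaves an area of alpha to its right.
    critical_value = STANDARD_NORMAL.inv_cdf(1 - alpha)
    return (p <= alpha, z >= critical_value)


alpha = 0.05
print(decisions(2.10, alpha))  # (True, True): Reject the Null Hypothesis
print(decisions(1.00, alpha))  # (False, False): Fail to Reject
```

Note that the symbols point in opposite directions: a small p goes with a large z.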
I just uploaded a new video: Design of Experiments (DOE) Part 2 of 3
In an earlier Tip, we said that a Histogram was good for picturing the shape of the data. What a Histogram is not good for is picturing Variation -- as measured by Standard Deviation or Variance. The size of the range for each bar is purely arbitrary. Larger ranges would make for fewer bars and a narrower picture. Also, the width of the bars in the picture can be varied, making the spread appear wider or narrower.
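To see how arbitrary the bar ranges are, here is a short Python sketch that bins the same data two ways. The data values and the helper function bin_counts are made up for illustration.

```python
from collections import Counter


def bin_counts(values, bin_width):
    """Count how many values fall into each bin, keyed by the bin's lower edge."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


# Hypothetical measurements in centimeters.
data = [41, 44, 47, 52, 55, 58, 63, 66]

narrow_bins = bin_counts(data, 5)
wide_bins = bin_counts(data, 10)

print(len(narrow_bins), narrow_bins)
# 6 {40: 2, 45: 1, 50: 1, 55: 2, 60: 1, 65: 1}
print(len(wide_bins), wide_bins)
# 3 {40: 3, 50: 3, 60: 2}
```

The data and its Variation are identical in both cases, but one Histogram would have 6 bars and the other only 3.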
A Dot Plot can be used to picture Variation if the number of data points is relatively small. Each individual point is shown as a dot, and you can show exactly how many go into each bin.
Boxplots, also known as Box and Whiskers Plots, can very effectively provide a detailed picture of Variation. In an earlier Statistics Tip, we showed how several Box and Whiskers Plots can enable you to visually choose the most effective of several treatments. Here's an illustration of the anatomy of a Box and Whiskers Plot:
In the example above, the IQR box represents the InterQuartile Range, which is a useful measure of Variation. This plot shows us that 50% of the data points (those between the 25th and 75th Percentiles) were within the range of 40 – 60 centimeters. 25% were below 40 and 25% were above 60. The Median, denoted by the vertical line in the box, is about 48 cm.
Any data point more than 1.5 box lengths (1.5 times the IQR) beyond either end of the box is called an Outlier. Here, the outlier with a value of 2 cm is shown by a circle. Not shown above, but some plots define an Extreme Outlier as one that is more than 3 box lengths outside the box. Those can be shown by an asterisk.
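The following Python sketch shows one way these numbers could be calculated. The nine data values are hypothetical, chosen so that the box runs from 40 to 60 cm with a Median of 48 cm. Software packages differ in how they compute quartiles; this sketch assumes the "inclusive" method of the standard library's statistics.quantiles.

```python
import statistics

# Hypothetical measurements in centimeters (not the data behind the plot).
data = [2, 35, 40, 44, 48, 52, 60, 66, 75]

# Quartiles: 25th Percentile, Median, 75th Percentile.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1  # the length of the box

# Outliers lie more than 1.5 box lengths beyond either end of the box.
outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

# Extreme Outliers lie more than 3 box lengths beyond either end of the box.
extreme_outliers = [x for x in data if x < q1 - 3 * iqr or x > q3 + 3 * iqr]

print(q1, median, q3, iqr)  # 40.0 48.0 60.0 20.0
print(outliers)             # [2]
print(extreme_outliers)     # []
```

With a box length of 20, anything below 10 or above 90 is an Outlier, so the value of 2 cm qualifies. It would have to be below –20 to count as an Extreme Outlier.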
I just uploaded a new video to my channel on YouTube: Design of Experiments -- Part 1 of 3.
I just uploaded a new video to YouTube: Margin of Error. It's part of a playlist on Errors in Statistics.
Both Bar Charts and Histograms use the height of bars (rectangles of the same width) to visually depict data. So, they look similar.
1. Separated or contiguous
2. Types of data
3. How Used
I just uploaded a new video: Alpha and Beta Errors.
And previously, I had uploaded a video on Statistical Errors -- Types, Uses, and Interrelationships.
See the Videos page on this site for a list of my videos previously uploaded.
All other things being equal, an increase in Sample Size (n) reduces all types of Sampling Errors, including Alpha and Beta Errors and the Margin of Error.
A Sampling "Error" is not a mistake. It is simply the reduction in accuracy to be expected when one makes an estimate based on a portion – a Sample – of the data in the Population or Process. There are several types of Sampling Error.
Two types of Sampling Errors are described in terms of their Probabilities:
All three types of Sampling Error are reduced when the Sample Size is increased.
This makes intuitive sense, because a very small Sample is less likely to be a good representative of the properties of the larger Population or Process. But the values of Statistics calculated from a much larger Sample are likely to be much closer to the values of the corresponding Population or Process Parameters.
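As a rough illustration for one of these errors, the Margin of Error, here is a Python sketch. It assumes the common formula for the Margin of Error of a Mean when the Standard Deviation is known, MOE = z × σ / √n. The function name, the Standard Deviation of 10, and the Sample Sizes are hypothetical.

```python
import math
from statistics import NormalDist


def margin_of_error(sigma, n, confidence=0.95):
    """Margin of Error for a Mean, assuming a known Standard Deviation sigma."""
    # Two-sided critical value of z for the chosen confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sigma / math.sqrt(n)


sigma = 10.0  # hypothetical Standard Deviation
for n in (25, 100, 400):
    print(n, round(margin_of_error(sigma, n), 2))
# 25 3.92
# 100 1.96
# 400 0.98
```

Because n appears under a square root, quadrupling the Sample Size only cuts the Margin of Error in half.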
For more on the statistical concepts mentioned here (p, β, MOE, Confidence Intervals, Statistical Errors, Samples and Sampling), please see my book or my YouTube channel -- both are titled Statistics from A to Z -- Confusing Concepts Clarified.
Andrew A. (Andy) Jawlik is the author of the book, Statistics from A to Z -- Confusing Concepts Clarified, published by Wiley.