The Central Limit Theorem (CLT) is a powerful concept, because it enables us to use the known Probabilities of the Normal Distribution in statistical analyses of data which are not Normally distributed. It is most commonly known as applying to the Means of Samples of data.
The data can be distributed in any way. For example -- as shown above -- it can be double-peaked and asymmetrical, or it can have the same number of points for every value of x. If we take many sufficiently large Samples of data with any Distribution, the Distribution of the Means (the x-bars) of these Samples will approximate the Normal Distribution.
There is something intuitive about the CLT. The Mean of a Sample taken from any Distribution is very unlikely to be at the far left or far right of the range of the Distribution. Means (averages), by their very definition, tend to average-out extremes. So, their Probabilities would be highest in the center of a Distribution and lowest at the extreme left or right.
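To see this for yourself, here is a minimal simulation sketch. It assumes a uniform population on 0 to 10 (the same chance for every value of x), a Sample Size of 40, and 5,000 Samples; those numbers and the function names are mine, chosen only for illustration. It checks roughly what share of the Sample Means fall within 1 and 2 Standard Deviations of their center, which for a Normal Distribution would be about 68% and 95%.

```python
import random
import statistics


def sample_means(draw, n, num_samples, seed=0):
    """Return the Means of num_samples Samples, each of size n.

    draw is a function that takes a random generator and returns one value.
    """
    rng = random.Random(seed)
    return [
        statistics.fmean(draw(rng) for _ in range(n))
        for _ in range(num_samples)
    ]


def draw_uniform(rng):
    # A deliberately non-Normal population (assumed for this sketch):
    # every value between 0 and 10 is equally likely.
    return rng.uniform(0.0, 10.0)


means = sample_means(draw_uniform, n=40, num_samples=5000)

center = statistics.fmean(means)
spread = statistics.stdev(means)

# Share of Sample Means within 1 and 2 Standard Deviations of the center.
within_1sd = sum(abs(m - center) <= spread for m in means) / len(means)
within_2sd = sum(abs(m - center) <= 2 * spread for m in means) / len(means)

print(f"center of the Sample Means: {center:.3f}")
print(f"share within 1 SD: {within_1sd:.3f}")
print(f"share within 2 SD: {within_2sd:.3f}")
```

Swapping draw_uniform for a skewed or double-peaked population is an easy way to explore how the Sample Size affects the approximation.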
Less intuitively obvious is that the CLT applies to Proportions as well as to Means.
Let's say that p is the Proportion of the count of a category of items in a Sample, say the Proportion of green jelly beans in a candy bin. We take many Samples, with replacement, of the same size n, and we calculate the Proportion for each Sample. When we graph these Proportions, they will approximate a Normal Distribution.
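The jelly bean example can be sketched as a small simulation. The true share of green jelly beans (0.2), the Sample Size (100), and the number of Samples (5,000) are assumptions I picked for illustration. The sketch compares the spread of the simulated Proportions with the standard formula for the Standard Error of a Proportion, the square root of p(1 - p)/n.

```python
import math
import random
import statistics


def sample_proportions(p, n, num_samples, seed=0):
    """Draw num_samples Samples of size n (with replacement) and return
    the Proportion of 'green' items found in each Sample."""
    rng = random.Random(seed)
    return [
        sum(rng.random() < p for _ in range(n)) / n
        for _ in range(num_samples)
    ]


p_true = 0.2   # assumed share of green jelly beans in the bin
n = 100        # assumed Sample Size

proportions = sample_proportions(p_true, n, num_samples=5000)

center = statistics.fmean(proportions)
spread = statistics.stdev(proportions)
expected_spread = math.sqrt(p_true * (1 - p_true) / n)

print(f"center of the Sample Proportions: {center:.4f}")
print(f"observed spread:  {spread:.4f}")
print(f"predicted spread: {expected_spread:.4f}")
```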
How large of a Sample Size, n, is "sufficiently large"? It depends on the use and the statistic. For Means and most uses n > 30 is considered large enough. But for Proportions, it's a little more complicated -- it depends on what the value of p is. n is large enough if np > 5 and n(1 - p) > 5.
The practical effect of this is:
This table gives us the specifics; the minimum Sample Size, n, is shown in the middle row.
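The rule for Proportions, np > 5 and n(1 - p) > 5, can also be sketched in code. The function names below are hypothetical, and this is only an illustration of the rule as stated here, not a replacement for the table.

```python
def is_large_enough(n, p):
    """Apply the rule of thumb: np > 5 and n(1 - p) > 5."""
    return n * p > 5 and n * (1 - p) > 5


def min_sample_size(p):
    """Find the smallest whole-number n that satisfies the rule for a given p."""
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    n = 1
    while not is_large_enough(n, p):
        n += 1
    return n


# The further p is from 0.5, the larger the Sample Size the rule requires.
for p in (0.5, 0.3, 0.1, 0.05):
    print(f"p = {p}: minimum n = {min_sample_size(p)}")
```

Note that n = 30, which is usually enough for Means, fails this rule when p = 0.1, because np is only 3.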
6 Keys to understanding and plenty of concept flow diagrams and other visual aids help the viewer gain a good understanding of these concepts. https://youtu.be/9llhdO8pB-4. For the latest status of available and planned videos, see the videos page in this website.
In determining which Distribution to use in analyzing Discrete (Count) data, we need to know whether we are interested in Occurrences or Units.
Let's say we are inspecting shirts at the end of the manufacturing line. We may be interested in the number of defective Units – shirts, because any defective shirt is likely to be rejected by our customer. However, one defective shirt can contain more than one defect. So, we are also interested in the Count of individual defects – the Occurrences – because that tells us how much of a quality problem we have in our manufacturing process.
For example, if 1 shirt has 3 defects, that would be 3 Occurrences of a defect, but only 1 Unit counted as defective.
We would use the Poisson Distribution in analyzing Probabilities of Occurrences of defects. To analyze the Probability of Units, we could use the Binomial or the Hypergeometric Distribution.
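Here is a small sketch of the difference between counting Occurrences and counting Units. The inspection data, the rate, and the probability are all made up for illustration. The two functions are the standard Probability formulas for the Poisson and Binomial Distributions. (The Hypergeometric, not shown, is the usual choice when sampling without replacement from a small lot.)

```python
import math

# Hypothetical inspection results: number of defects found on each of 10 shirts.
defects_per_shirt = [0, 3, 0, 1, 0, 0, 2, 0, 0, 0]

# Occurrences: every individual defect counts.
occurrences = sum(defects_per_shirt)

# Units: a shirt counts once if it has at least one defect.
defective_units = sum(1 for d in defects_per_shirt if d > 0)

print(f"Occurrences (defects): {occurrences}")
print(f"Units (defective shirts): {defective_units}")


def poisson_pmf(k, rate):
    """Probability of exactly k Occurrences when the average is rate."""
    return math.exp(-rate) * rate**k / math.factorial(k)


def binomial_pmf(k, n, p):
    """Probability of exactly k defective Units among n Units."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


# Estimates taken from the made-up data above (an assumption of this sketch).
rate = occurrences / len(defects_per_shirt)          # average defects per shirt
p_defective = defective_units / len(defects_per_shirt)

print(f"P(a shirt has exactly 2 defects): {poisson_pmf(2, rate):.4f}")
print(f"P(exactly 2 of 10 shirts are defective): {binomial_pmf(2, 10, p_defective):.4f}")
```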
There is an article in the book focusing on the Poisson Distribution. There is also a video, on my YouTube channel, Statistics from A to Z.
These are all terms used in Correlation and Linear Regression (Simple and Multiple). And some of these terms have several names. I don't know about you, but I get confused trying to keep them all straight. So I wrote this compare-and-contrast table, which should help.
First in a playlist on Statistical Tests. 5 Keys to Understanding and compare-and-contrast tables help the viewer understand the 3 different types of parametric t-tests. https://youtu.be/ZJlrF_yfiPo. For a complete listing of available and planned videos, please see the Videos page on this website.
In Regression, we attempt to fit a line or curve to the data. Let's say we're doing Simple Linear Regression in which we are trying to fit a straight line to a set of (x,y) data.
We test a number of subjects with dosages from 0 to 3 pills. And we find a straight line relationship, y = 3x, between the number of pills (x) and a measure of health of the subjects. So, we can say this.
But we cannot make a statement like the following:
This is called extrapolating the conclusions of your Regression Model beyond the range of the data used to create it. There is no mathematical basis for doing that, and it can have negative consequences, as this little cartoon from my book illustrates.
In the graphs below, the dots are data points. In the graph on the left, it is clear that there is a linear correlation between the drug dosage (x) and the health outcome (y) for the range we tested, 0 to 3 pills. And we can interpolate between the measured points. For example, we might reasonably expect that 1.5 pills would yield a health outcome halfway between that of 1 pill and 2 pills.
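One way to guard against extrapolation in practice is to have the Model refuse to predict outside the range of the data used to build it. The sketch below assumes the y = 3x relationship and the 0 to 3 pill range from this example; the function name and the error message are hypothetical.

```python
def predict_health(pills, slope=3.0, tested_range=(0.0, 3.0)):
    """Predict the health measure from y = slope * x, but only within the
    range of dosages that was actually tested."""
    low, high = tested_range
    if not (low <= pills <= high):
        raise ValueError(
            f"{pills} pills is outside the tested range {low} to {high}; "
            "the Model says nothing about this dosage"
        )
    return slope * pills


# Interpolating: 1.5 pills is inside the tested range.
print(predict_health(1.5))

# Extrapolating: 10 pills is outside the tested range, so we get an error
# instead of a prediction.
try:
    predict_health(10)
except ValueError as err:
    print(f"Refused: {err}")
```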
For more on this and other aspects of Regression, you can see the YouTube videos in my playlist on Regression. (See my channel: Statistics from A to Z - Confusing Concepts Clarified.)
This is the 9th and final video in my playlist on Regression. Residuals represent the error in a Regression Model. That is, Residuals represent the Variation in the outcome Variable y, which is not explained by the Regression Model. Residuals must be analyzed in several ways to ensure that they are random, and that they do not represent the Variation caused by some unidentified x-factor.
See the videos page in this website for a listing of available and planned videos.
The Binomial Distribution is used with Count data. It displays the Probabilities of Count data from Binomial Experiments. In a Binomial Experiment,
There are many Binomial Distributions. Each one is defined by a pair of values for two Parameters, n and p. n is the number of trials, and p is the Probability of success on each trial.
The graphs below show the effect of varying n, while keeping the Probability the same at 50%. The Distribution retains its shape as n varies. But obviously, the Mean gets larger.
The effect of varying the Probability, p, is more dramatic.
For small values of p, the bulk of the Distribution is heavier on the left. However, as described in my post of July 25, 2018, statistics describes this as being skewed to the right, that is, having a positive skew. (The skew is in the direction of the long tail.) For large values of p, the skew is to the left, because the bulk of the Distribution is on the right.
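The effects of n and p can be checked numerically. The sketch below uses the standard Binomial formulas: the Probability of exactly k successes, the Mean np, and the skewness (1 - 2p) / sqrt(np(1 - p)). The specific values of n and p are ones I chose for illustration.

```python
import math


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def binomial_mean(n, p):
    """Mean of the Binomial Distribution: n times p."""
    return n * p


def binomial_skewness(n, p):
    """Skewness: positive means a long tail to the right, negative to the left."""
    return (1 - 2 * p) / math.sqrt(n * p * (1 - p))


# Varying n with p fixed at 50%: the Mean grows, and the skew stays at zero.
for n in (5, 10, 20):
    print(f"n = {n}, p = 0.5: Mean = {binomial_mean(n, 0.5)}, "
          f"skewness = {binomial_skewness(n, 0.5):.3f}")

# Varying p with n fixed at 10: small p skews right, large p skews left.
for p in (0.1, 0.5, 0.9):
    print(f"n = 10, p = {p}: skewness = {binomial_skewness(10, p):.3f}")

# The Probabilities for n = 10, p = 0.1 pile up on the left.
for k in range(4):
    print(f"P(k = {k}) = {binomial_pmf(k, 10, 0.1):.4f}")
```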
New video: Simple Nonlinear Regression.
This is the 7th in a playlist on Regression. For a complete list of my available and planned videos, please see the Videos page on this website.
Andrew A. (Andy) Jawlik is the author of the book, Statistics from A to Z -- Confusing Concepts Clarified, published by Wiley.