Let's say we are inspecting shirts at the end of the manufacturing line. We may be interested in the number of defective Units (shirts), because any defective shirt is likely to be rejected by our customer. However, one defective shirt can contain more than one defect. So, we are also interested in the Count of individual defects (the Occurrences), because that tells us how much of a quality problem we have in our manufacturing process.

For example, if 1 shirt has 3 defects, that would be 3 Occurrences of a defect, but only 1 Unit counted as defective.

We would use the Poisson Distribution in analyzing Probabilities of Occurrences of defects. To analyze the Probability of Units, we could use the Binomial or the Hypergeometric Distribution.
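To make the Units-versus-Occurrences distinction concrete, here is a minimal sketch in Python. The defect rate, the defective-shirt Probability, and the sample size are made-up numbers chosen only for illustration, and the function names are my own.

```python
import math


def poisson_pmf(k: int, mean_count: float) -> float:
    """Probability of exactly k Occurrences when the average Count is mean_count."""
    return math.exp(-mean_count) * mean_count ** k / math.factorial(k)


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k defective Units among n Units, each defective with Probability p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


# Hypothetical inputs (assumptions, not measured data).
MEAN_DEFECTS_PER_SHIRT = 0.5   # average Occurrences per shirt
P_SHIRT_DEFECTIVE = 0.05       # chance that any one shirt is a defective Unit
SHIRTS_INSPECTED = 20

# Occurrences: chance that a single shirt carries exactly 3 defects.
p_three_defects = poisson_pmf(3, MEAN_DEFECTS_PER_SHIRT)

# Units: chance that none of the 20 inspected shirts is defective.
p_no_defective_units = binomial_pmf(0, SHIRTS_INSPECTED, P_SHIRT_DEFECTIVE)

print(f"P(3 defects on one shirt)    = {p_three_defects:.4f}")
print(f"P(0 defective shirts in 20)  = {p_no_defective_units:.4f}")
```

The Poisson function counts defects, so one shirt can contribute several; the Binomial function counts shirts, so each shirt contributes at most one.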

There is an article in the book focusing on the Poisson Distribution. There is also a __video__ on my YouTube channel, Statistics from A to Z.


We test a number of subjects with dosages from 0 to 3 pills. And we find a straight line relationship, y = 3x, between the number of pills (x) and a measure of health of the subjects. So, we

But we __cannot__ make a statement like the following:

This is called extrapolating the conclusions of your Regression Model beyond the range of the data used to create it. There is no mathematical basis for doing that, and it can have negative consequences, as this little cartoon from my book illustrates.

In the graphs below, the dots are data points. In the graph on the left, it is clear that there is a linear correlation between the drug dosage (x) and the health outcome (y) for the range we tested, 0 to 3 pills. And we can interpolate between the measured points. For example, we might reasonably expect that 1.5 pills would yield a health outcome halfway between that of 1 pill and 2 pills.
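One way to keep ourselves honest is to build the tested range into the prediction itself. The sketch below assumes the y = 3x relationship and the 0-to-3-pill range from the example; the function name and the choice to raise an error are my own illustration, not a standard practice from any library.

```python
def predict_health(pills: float, slope: float = 3.0,
                   tested_range: tuple = (0.0, 3.0)) -> float:
    """Predict the health measure from y = slope * x, but only inside the tested range."""
    low, high = tested_range
    if not low <= pills <= high:
        # Outside the data used to build the model: refuse to extrapolate.
        raise ValueError(
            f"{pills} pills is outside the tested range {low} to {high}; "
            "the Regression Model says nothing about this dosage."
        )
    return slope * pills


# Interpolation: 1.5 pills lies between measured points, so a prediction is reasonable.
print(predict_health(1.5))

# Extrapolation: 10 pills was never tested, so the function refuses.
try:
    predict_health(10)
except ValueError as err:
    print(err)
```

With y = 3x, 1 pill gives 3 and 2 pills gives 6, so the interpolated value for 1.5 pills is halfway between them.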

For more on this and other aspects of Regression, you can see the YouTube videos in my playlist on Regression. (See my channel: Statistics from A to Z - Confusing Concepts Clarified.)

This is the 9th and final __video__ on my channel on Regression. Residuals represent the error in a Regression Model. That is, Residuals represent the Variation in the outcome Variable y which is not explained by the Regression Model. Residuals must be analyzed in several ways to ensure that they are random, and that they do not represent the Variation caused by some unidentified x-factor.
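As a rough illustration of what a Residual is, the sketch below fits a least-squares line to a few made-up (x, y) points and subtracts the fitted values from the observed ones. The data and function names are hypothetical, and this only computes the Residuals; the checks for randomness are a separate step.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def residuals(xs, ys):
    """Observed y minus the y predicted by the fitted line, for each data point."""
    slope, intercept = fit_line(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Made-up data for illustration only.
xs = [0, 1, 2, 3]
ys = [0.2, 2.9, 6.1, 8.8]

res = residuals(xs, ys)
for x, r in zip(xs, res):
    print(f"x = {x}: Residual = {r:+.3f}")
```

For a least-squares line with an intercept, the Residuals always sum to zero (up to rounding), so a zero sum by itself says nothing about randomness. Plotting them against x and against the fitted values is what reveals patterns.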

See the videos page in this website for a listing of available and planned videos.


- There is a fixed number of trials (e.g. coin flips).
- Each trial can have only 1 of 2 outcomes.
- The Probability of a given outcome is the same for each trial.
- Each trial is Independent of the others.

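There are many Binomial Distributions. Each one is defined by a pair of values for two Parameters: *n*, the number of trials, and *p*, the Probability of a given outcome on each trial.

The graphs below show the effect of varying the number of trials, *n*.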

The effect of varying the Probability, *p*, is more dramatic.

For small values of *p*, the bulk of the Distribution lies on the left. However, as described in __my post of July 25, 2018__, statistics describes this as being skewed to the right, that is, having a positive skew. (The skew is in the direction of the long tail.) For large values of *p*, the skew is to the left, because the bulk of the Distribution is on the right.
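The direction of the skew can be checked numerically. The sketch below computes the skewness of a Binomial Distribution directly from its probabilities and compares it with the standard closed-form expression (1 - 2p) / sqrt(np(1 - p)), where n is the number of trials and p is the Probability of the outcome on each trial. The choice of n = 10 and the three values of p are arbitrary.

```python
import math


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k outcomes of interest in n trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


def skewness_from_pmf(n: int, p: float) -> float:
    """Third standardized moment, computed term by term from the Distribution."""
    probs = [binomial_pmf(k, n, p) for k in range(n + 1)]
    mean = sum(k * pr for k, pr in enumerate(probs))
    variance = sum((k - mean) ** 2 * pr for k, pr in enumerate(probs))
    third_moment = sum((k - mean) ** 3 * pr for k, pr in enumerate(probs))
    return third_moment / variance ** 1.5


def skewness_closed_form(n: int, p: float) -> float:
    """Closed-form skewness of a Binomial Distribution."""
    return (1 - 2 * p) / math.sqrt(n * p * (1 - p))


# Arbitrary illustrative settings.
for p in (0.1, 0.5, 0.9):
    print(f"p = {p}: skewness = {skewness_from_pmf(10, p):+.4f}")
```

A positive result means skewed to the right (long tail on the right), a negative result means skewed to the left, and p = 0.5 gives a symmetric Distribution with zero skew.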

This is the 7th in a playlist on Regression. For a complete list of my available and planned videos, please see the Videos page on this website.

There are a number of different measures of Variation. This compare-and-contrast table shows the relative merits of each.

- The Range is probably the least useful in statistics. It just tells you the highest and lowest values of a data set, and nothing about what's in between.
- The Interquartile Range (IQR) can be quite useful for visualizing the distribution of the data and for comparing several data sets -- as described in a __recent post on this blog__.
- Variance is the square of the Standard Deviation, and it is used as an interim step in the calculation of the latter. This squaring overly emphasizes the effects of very high or very low values. Another drawback is that it is in units of the data squared (e.g. square kilograms, which can be meaningless). There is a Chi-Square Test for the Variance, and Variances are used in F tests and the calculations in ANOVA.
- The Mean Absolute Deviation is the average (unsquared) distance of the data points from the Mean. It is used when it is desirable to avoid emphasizing the effects of high and low values.
- The Standard Deviation, being the square root of the Variance, does not overly emphasize the high and low values as the Variance does. Another major benefit is that it is in the same units as the data.
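The five measures above can be computed side by side with Python's standard `statistics` module. This is a minimal sketch on a made-up data set; it uses the sample Variance and sample Standard Deviation, and the module's default quartile method, which may differ slightly from the quartile convention used in other tools.

```python
import statistics


def variation_measures(data):
    """Return the five measures of Variation discussed above for one data set."""
    mean = statistics.fmean(data)
    q1, _median, q3 = statistics.quantiles(data, n=4)  # quartile cut points
    return {
        "range": max(data) - min(data),
        "iqr": q3 - q1,
        "variance": statistics.variance(data),           # sample Variance, units squared
        "mean_absolute_deviation": sum(abs(x - mean) for x in data) / len(data),
        "standard_deviation": statistics.stdev(data),    # same units as the data
    }


# Made-up data for illustration only.
data = [2, 4, 4, 4, 5, 5, 7, 9]

for name, value in variation_measures(data).items():
    print(f"{name:>24}: {value:.3f}")
```

Replacing the 9 with a much larger value and rerunning shows the Variance reacting far more strongly than the Mean Absolute Deviation, which is the squaring effect described in the list.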