Statistics for Machine Learning — Comprehensive Guide Part-2 of 4
The first part of this series is available at this link.
There’s quite a close relationship between probability and statistics. A lot of statistics has its origins in probability theory. Probability theory can help you make predictions about your data and see patterns. It can help you make sense of apparent randomness.
The basics of probability theory are available here.
Statistics deals with data, but where does it come from?
A statistical population refers to the entire group of things that you’re trying to measure, study, or analyze. A census is a study or survey involving the entire population. A census can provide you with accurate information about your population, but it’s not always practical. When populations are large or infinite, it’s just not possible to include every member.
A sample is a subset of data drawn from a larger dataset (the population). Random sampling is a process in which each available member of the population being sampled has an equal chance of being chosen for the sample at each draw.
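As a sketch, simple random sampling can be done with Python's standard library; the customer-age population below is made up purely for illustration:

```python
import random

random.seed(42)  # fixed seed so the example is reproducible

# Hypothetical population: ages of 1,000 customers
population = [random.randint(18, 80) for _ in range(1000)]

# Simple random sample of 50: every member has an equal chance of being drawn
sample = random.sample(population, k=50)

print(len(sample))  # 50
```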
Data Quality matters more than data quantity when making an estimate or a model based on a sample. Data quality in data science involves completeness, consistency of format, cleanliness and accuracy of individual data points.
The key to creating a good sample is to choose one that matches the population as closely as possible. If our sample is representative, we can use it to predict the characteristics of the population. Not every sample closely resembles its population, and using the wrong sample could lead us to draw wrong conclusions about population parameters. Unless we are very careful, some sort of bias can creep into the sample and distort our results. Bias is a kind of favoritism that we can unwittingly (or maybe knowingly) introduce into our sample, meaning that the sample is no longer randomly selected from the population.
Unbiased Sample
An unbiased sample is representative of the target population. This means that it has similar characteristics to the population, and we can use these to make inferences about the population itself. The shape of the distribution of an unbiased sample is similar to the shape of the population it comes from. If we know the shape of the sample distribution, we can use it to predict that of the population to a reasonable level of confidence.
Biased samples
A biased sample is not representative of the target population. We can’t use it to make inferences about the population because the sample and population have different characteristics. If we try to predict the shape of the population distribution from that of the sample, we’d end up with the wrong result.
The population mean is denoted µ, and the sample mean is denoted x̄.
Bias is an indicator that a statistical or machine learning model has been mis-specified or that an important variable has been left out. Drawing elements into a sample at random is random sampling, and it is the key way to achieve an unbiased sample. Random sampling can reduce bias and facilitate data quality improvement.
Selection Bias refers to bias resulting from the way in which observations are selected.
Random Sampling without replacement — Sampling without replacement means that once a unit has been selected, it is not returned to the population, so it cannot be drawn again.
Random Sampling with replacement — Sampling with replacement means that after we select each unit and record the relevant information about it, we put it back into the population, so it may be drawn again.
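The difference between the two schemes can be seen with Python's standard library, where `random.sample` draws without replacement and `random.choices` draws with replacement (the ten-unit population is a toy example):

```python
import random

random.seed(0)
population = list(range(10))  # a toy population of 10 labelled units

# Without replacement: a drawn unit stays out, so no repeats are possible
without = random.sample(population, k=5)

# With replacement: each unit goes back after being recorded, so repeats can occur
with_repl = random.choices(population, k=5)

print(without, with_repl)
```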
Stratified Sampling
An alternative to random sampling is stratified sampling. With this type of sampling, the population is split into groups whose members share similar characteristics. These groups are called strata, and each individual group is called a stratum. We then perform simple random sampling on each stratum to ensure that every group is represented in our overall sample.
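A minimal sketch of stratified sampling, assuming a made-up population of customers split evenly across three regions:

```python
import random
from collections import defaultdict

random.seed(1)

# Hypothetical population: (customer_id, region) pairs, 100 customers per region
regions = ["north", "south", "east"]
population = [(i, regions[i % 3]) for i in range(300)]

# Group the population into strata by region
strata = defaultdict(list)
for customer_id, region in population:
    strata[region].append(customer_id)

# Simple random sample within each stratum so every group is represented
sample = {region: random.sample(members, k=10) for region, members in strata.items()}

total = sum(len(v) for v in sample.values())
print(total)  # 30 sampled customers, 10 per region
```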
Cluster Sampling
Cluster sampling is useful if the population has a number of similar groups or clusters. With cluster sampling, instead of taking a random sample of units, you draw a random sample of clusters, and then survey everything within each of these clusters. Cluster sampling works because each cluster is similar to the others. The problem with cluster sampling is that it might not be entirely random.
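A rough sketch of cluster sampling, with hypothetical schools as the clusters and students' test scores as the units:

```python
import random

random.seed(5)

# Hypothetical population: 20 schools (clusters), each with 30 students' scores
schools = {s: [random.gauss(70, 10) for _ in range(30)] for s in range(20)}

# Draw a random sample of clusters, then survey everyone inside them
chosen = random.sample(list(schools), k=4)
sample = [score for s in chosen for score in schools[s]]

print(len(sample))  # 4 clusters x 30 students = 120
```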
So with stratified sampling, we make each stratum as different from the others as possible, and with cluster sampling, we make each cluster as similar to the others as possible.
Systematic Sampling
In systematic sampling, we list the population in some sort of order and then survey every kth item, where k is some fixed number. Systematic sampling is relatively quick and easy, but it has one key disadvantage: if there is some sort of cyclic pattern in the population, our sample will be biased. This sort of sampling can only be used effectively when there are no repetitive patterns.
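The procedure might look like this in Python, with a made-up ordered population of 100 houses and k = 10:

```python
import random

random.seed(3)

population = list(range(1, 101))  # e.g. 100 houses on a street, in order
k = 10                            # survey every 10th house

# Pick a random start within the first interval, then take fixed steps of k
start = random.randrange(k)
sample = population[start::k]

print(sample)
```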
Types of Statistical Distributions
Depending on the type of data, distributions fall into two categories: discrete distributions for discrete data (countable outcomes) and continuous distributions for continuous data (uncountably many possible outcomes).
Discrete Distributions
1. Discrete uniform distribution
In statistics, a uniform distribution is one in which all outcomes are equally likely. Consider rolling a six-sided die: on your next roll you have an equal probability, 1/6, of obtaining each of the numbers 1 through 6. This is an example of a discrete uniform distribution.
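The die example can be written out directly; exact fractions make it easy to check that the probabilities sum to 1:

```python
from fractions import Fraction

# Discrete uniform over the faces of a fair die: each outcome has probability 1/6
outcomes = [1, 2, 3, 4, 5, 6]
pmf = {x: Fraction(1, len(outcomes)) for x in outcomes}

# Probabilities sum to 1, and the mean is (1 + 6) / 2 = 3.5
mean = sum(x * p for x, p in pmf.items())
print(pmf[3], mean)  # 1/6 7/2
```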
2. Bernoulli Distribution
Any event with a single trial and only two possible outcomes follows a Bernoulli distribution. Flipping a coin or choosing between True and False in a quiz are examples of a Bernoulli distribution.
The graph of a Bernoulli distribution is simple to read. It consists of only two bars, one rising to the success probability p and the other to 1 − p.
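A Bernoulli trial is one line of code; the simulation below, with an assumed success probability of 0.3, shows the empirical success rate converging on p:

```python
import random

random.seed(7)
p = 0.3  # assumed probability of success

def bernoulli_trial(p):
    """Return 1 (success) with probability p, otherwise 0 (failure)."""
    return 1 if random.random() < p else 0

# Over many trials the observed success rate converges on p
trials = [bernoulli_trial(p) for _ in range(100_000)]
rate = sum(trials) / len(trials)
print(rate)  # close to 0.3
```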
3. Binomial Distribution
The binomial distribution can be thought of as the sum of the outcomes of events that follow a Bernoulli distribution. It is used for binary-outcome events in which the probability of success stays the same across all successive trials. An example of a binomial event would be flipping a coin multiple times to count the number of heads and tails.
Consider attempting a quiz that contains 10 True/False questions. Answering a single T/F question would be considered a Bernoulli trial, whereas attempting the entire quiz of 10 T/F questions would be categorized as a binomial experiment.
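The binomial probability mass function has a closed form, comb(n, k) · p^k · (1 − p)^(n − k); the sketch below applies it to the 10-question quiz, assuming pure guessing (p = 0.5):

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials, success probability p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# 10 True/False questions answered by guessing (p = 0.5):
# probability of getting exactly 5 right
print(round(binomial_pmf(5, 10, 0.5), 4))  # 0.2461
```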
4. Poisson Distribution
The Poisson distribution deals with the frequency with which an event occurs within a specific interval. Rather than the probability of an event on a single trial, the Poisson distribution requires knowing the average rate at which the event happens over a particular period or distance. It is used to model count data, like the number of emails arriving in your mailbox in one hour or the number of customers walking into a shop in one day.
The main characteristics that describe a Poisson process are:
- The events are independent of each other.
- An event can occur any number of times (within the defined period).
- Two events can’t take place simultaneously.
The graph of Poisson distribution plots the number of instances an event occurs in the standard interval of time and the probability of each one.
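The Poisson probability mass function, λ^k · e^(−λ) / k!, can be coded directly; the email rate of 4 per hour below is an assumed figure:

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(exactly k events in an interval with average rate lam)."""
    return lam**k * exp(-lam) / factorial(k)

# Assuming a mailbox receives 4 emails per hour on average:
# probability of exactly 2 emails in the next hour
print(round(poisson_pmf(2, 4), 4))  # 0.1465
```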
Continuous Distributions
1. Normal Distribution
The normal distribution, also called the Gaussian distribution, is the most common distribution for independent, randomly generated variables. It is a symmetrical, bell-shaped distribution, with increasingly fewer observations the further you move from the center. Its graph is characterized by two parameters: the mean, which locates the peak of the curve, and the standard deviation, which determines the amount of dispersion away from the mean.
- 68% of values are within 1 standard deviation of the mean ([µ−σ, µ+σ])
- 95% of values are within 2 standard deviations of the mean ([µ−2σ, µ+2σ])
- 99.7% of values are within 3 standard deviations of the mean ([µ−3σ, µ+3σ])
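These three figures can be verified with `statistics.NormalDist` from Python's standard library:

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1
z = NormalDist(mu=0, sigma=1)

# Probability of falling within 1, 2 and 3 standard deviations of the mean
for k in (1, 2, 3):
    p = z.cdf(k) - z.cdf(-k)
    print(f"within {k} sd: {p:.4f}")
```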
2. Student's t-Distribution
The Student's t-distribution, also known simply as the t-distribution, is similar in its bell shape to the normal distribution but has heavier tails. The t-distribution is used instead of the normal distribution when you have small sample sizes.
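The heavier tails can be seen by comparing the two density functions directly; the closed-form t PDF below uses the gamma function, and 5 degrees of freedom is an arbitrary choice for illustration:

```python
from math import gamma, sqrt, pi, exp

def t_pdf(x, df):
    """PDF of Student's t-distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + x**2 / df) ** (-(df + 1) / 2)

def normal_pdf(x):
    """PDF of the standard normal distribution."""
    return exp(-x**2 / 2) / sqrt(2 * pi)

# Out in the tail (x = 3), the t-distribution with 5 degrees of freedom
# carries noticeably more probability density than the normal
print(t_pdf(3, 5) > normal_pdf(3))  # True
```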
3. Exponential distribution
The exponential distribution is one of the most widely used continuous distributions. It models the time between events. The exponential distribution is memoryless: the probability of an event occurring within a given timeframe does not depend on how long you have already been waiting for it.
An exponential graph is a curved line showing how the probability falls away exponentially as the waiting time grows. Exponential distributions are commonly used in calculations of product reliability, such as the length of time a product lasts.
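The memoryless property can be checked numerically; the event rate of 0.5 per minute below is an assumed value:

```python
from math import exp

RATE = 0.5  # assumed event rate: 0.5 events per minute

def survival(t, rate=RATE):
    """P(waiting time > t) for an exponential distribution."""
    return exp(-rate * t)

# Memorylessness: having already waited s minutes does not change
# the probability of waiting at least t more minutes
s, t = 2.0, 3.0
conditional = survival(s + t) / survival(s)  # P(T > s + t | T > s)
print(abs(conditional - survival(t)) < 1e-12)  # True
```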
Sampling Distribution
The term sampling distribution refers to the distribution of some sample statistic over many samples or resamples drawn from the same population. A sample statistic is a metric calculated for a sample of data drawn from a larger population. The distribution of the individual data points is the data distribution, while the distribution of a sample statistic is the sampling distribution. The sampling distribution is likely to be more regular and bell-shaped than the data distribution.
Central Limit Theorem
The central limit theorem states that if we take sufficiently large samples from a population, the samples’ means will be normally distributed, even if the population isn’t normally distributed.
The sampling distribution of the mean approaches a normal distribution under the following conditions:
- The sample size is sufficiently large.
- The samples are independent and identically distributed random variables. This condition is usually met if the sampling is random.
- The population's distribution has finite variance; the central limit theorem doesn't apply to distributions with infinite variance.
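A quick simulation illustrates the theorem, using die rolls as a decidedly non-normal population (the sample size of 30 and 2,000 resamples are arbitrary choices):

```python
import random
from statistics import mean

random.seed(42)

# A decidedly non-normal population: the flat distribution of fair die rolls
def sample_mean(n):
    return mean(random.randint(1, 6) for _ in range(n))

# Build the sampling distribution of the mean for samples of size 30
means = [sample_mean(30) for _ in range(2000)]

# The sample means cluster tightly and symmetrically around the
# population mean of 3.5, even though the population itself is uniform
print(round(mean(means), 1))  # 3.5
```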
Standard Error
Standard error is a key metric that sums up the variability in the sampling distribution of a statistic. It can be estimated from the standard deviation of the sample values (s) and the sample size (n) as s/√n: as the sample size increases, the standard error decreases.
Standard deviation measures the variability of individual data points, while standard error measures the variability of a sample statistic.
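A short sketch of estimating the standard error of the mean from a sample; the test-score data below is simulated purely for illustration:

```python
import random
from statistics import stdev
from math import sqrt

random.seed(0)

# Hypothetical sample: 400 test scores drawn from a population
# with mean 100 and standard deviation 15
data = [random.gauss(100, 15) for _ in range(400)]

# Estimated standard error of the sample mean: s / sqrt(n)
se = stdev(data) / sqrt(len(data))
print(round(se, 2))  # roughly 15 / sqrt(400) = 0.75
```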
Resampling
Resampling is the process of taking repeated samples from observed data. It includes bootstrap and permutation (shuffling) procedures.
In a permutation procedure, multiple samples are combined and the sampling may be done without replacement.
Bootstrap — A bootstrap sample is a sample taken with replacement from an observed data set. The bootstrap is a powerful tool for assessing the variability of a sample statistic, and it can be applied without extensive study of mathematical approximations to sampling distributions.
The algorithm for a bootstrap resampling of the mean, for a sample of size n:
- Draw a sample value, record it and then replace it
- Repeat n times
- Record the mean of the n resampled values
- Repeat steps 1–3 R times, where R is an arbitrary number specifying how many iterations of the bootstrap to run.
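The steps above can be sketched as follows, with a made-up observed sample and R = 5,000; the spread of the resampled means should land near the classical s/√n estimate (roughly 1.1 to 1.2 for this particular sample):

```python
import random
from statistics import mean, stdev

random.seed(42)

# Hypothetical observed sample of 10 measurements
observed = [12, 15, 9, 22, 17, 14, 11, 19, 16, 13]

def bootstrap_means(data, R):
    """Resample the data with replacement R times; return each resample's mean."""
    n = len(data)
    return [mean(random.choices(data, k=n)) for _ in range(R)]

boot = bootstrap_means(observed, R=5000)

# The spread of the bootstrap means estimates the standard error of the mean
print(round(mean(boot), 1), round(stdev(boot), 2))
```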
The bootstrap does not compensate for a small sample size; it neither creates new data nor fills in holes in an existing data set. It merely tells us how lots of additional samples would behave if drawn from a population like our original sample.
Confidence intervals, hypothesis testing, and A/B testing will be covered in the next part.
To Be Continued…….
References:
- Practical Statistics for Data Scientists: 50+ Essential Concepts Using R and Python by Peter Bruce and Andrew Bruce
- Head First Statistics by Dawn Griffiths
- https://www.w3schools.com/statistics/statistics_normal_distribution.php
- https://datasciencedojo.com/blog/types-of-statistical-distributions-in-ml/
- https://statisticsbyjim.com/probability/exponential-distribution/