Statistics for Machine Learning — Comprehensive Guide Part-1 of 4
Statistics are numbers that summarize raw facts and figures in some meaningful way.
Study of statistics:
Gather data: At the root of statistics is data. Data can be gathered from existing sources, by running experiments, or by conducting surveys.
Analyze: Once you have data, you can analyze it and generate statistics. You can calculate probabilities to see how likely certain events are, test ideas, and indicate how confident you are in your results.
Draw conclusions: Once you’ve analyzed your data, you can make decisions and predictions.
Statistics are based on facts, but even so, they can sometimes be misleading. They can be used to tell the truth — or to lie. Studying statistics is a good way of making sure you don’t get fooled by inaccurate or misleading statistics.
Data
Data comes from many sources: measurements, events, text, images, and videos. To apply statistical concepts, unstructured raw data must first be processed into a structured form, most commonly a table with rows and columns. The data type matters because it determines which visual displays, analyses, and statistical models are appropriate. There are two basic types of structured data:
Numeric: Data represented on a numeric scale
- Continuous: Data that can take on any value in an interval (speed, money)
- Discrete: Data that can take only integer values, such as counts (the number of occurrences of an event)
Categorical: Data that can take on only a specific set of values representing a set of possible categories (types of cars)
- Nominal: Categorical data that has no measure or explicit order (Types of Cars, Marital Status). Binary is a special case of categorical data with just two categories of values (True/False, 0/1, Dead/Alive)
- Ordinal: Categorical data that has an explicit ordering (Movie rating)
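As a rough illustration of why the data type matters, the sketch below shows which basic operations make sense for each type. All sample values, the variable names, and the `RATING_ORDER` ranking are invented for this example and are not taken from any real dataset.

```python
from collections import Counter
from statistics import mean

# Hypothetical sample values, invented purely for illustration.
speeds_kmh = [62.5, 71.0, 58.3, 66.2]                    # numeric, continuous
visit_counts = [0, 2, 1, 0, 3]                           # numeric, discrete
car_types = ["sedan", "suv", "sedan", "truck"]           # categorical, nominal
ratings = ["good", "poor", "excellent", "good", "fair"]  # categorical, ordinal

# Ordinal data needs its order spelled out; this ranking is an assumption.
RATING_ORDER = ["poor", "fair", "good", "excellent"]
rank = {label: position for position, label in enumerate(RATING_ORDER)}

# Numeric data supports arithmetic.
mean_speed = mean(speeds_kmh)
total_visits = sum(visit_counts)

# Nominal data can only be counted and compared for equality.
car_frequencies = Counter(car_types)
most_common_car = car_frequencies.most_common(1)[0][0]

# Ordinal data can additionally be ranked.
ratings_sorted = sorted(ratings, key=rank.get)
best_rating = max(ratings, key=rank.get)

print(mean_speed, total_visits)
print(most_common_car, dict(car_frequencies))
print(ratings_sorted, best_rating)
```

Averaging the car types would be meaningless, and sorting them alphabetically would say nothing about the cars themselves, which is why the type has to be known before choosing an analysis.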
Analyze the data:
1. Location analysis
- Mean: The sum of all values divided by the number of values
- Weighted Mean: The sum of each value multiplied by its weight, divided by the sum of the weights
- Trimmed Mean: The average of the values that remain after dropping a fixed number of extreme values from each end of the sorted data
- Median: The middle value of the sorted data, such that one-half of the data lies above it and one-half below
- Weighted Median: The value in the sorted data such that one-half of the sum of the weights lies above it and one-half below
- Mode: The most popular value in a set of data, that is, the value with the highest frequency
- Outlier: A data value that is very different from most of the data; an extreme high or low value that stands out from the rest
- Skewed data: When outliers pull the data to the left or right, the data is skewed. Data that is skewed to the right has a “tail” of high outliers that trails off to the right, which is visible on a chart; these outliers pull the mean up, so the mean is higher than the majority of values. In data that is skewed to the left, the outliers are low and pull the mean over to the left, so the mean is lower than the majority of values.
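The location estimates above can be sketched in a few lines of Python using the standard `statistics` module. The helper functions `weighted_mean`, `trimmed_mean`, and `weighted_median` are hypothetical names written for this example, and the sample values and weights are made up. The weighted median here uses one simple convention (the first sorted value at which the cumulative weight reaches half the total); other conventions exist.

```python
from statistics import mean, median, mode


def weighted_mean(values, weights):
    """Sum of value * weight, divided by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def trimmed_mean(values, k):
    """Mean after dropping the k smallest and the k largest values."""
    ordered = sorted(values)
    if 2 * k >= len(ordered):
        raise ValueError("k is too large for the amount of data")
    return mean(ordered[k:len(ordered) - k])


def weighted_median(values, weights):
    """First sorted value at which the cumulative weight reaches half the total."""
    half = sum(weights) / 2
    cumulative = 0
    for value, weight in sorted(zip(values, weights)):
        cumulative += weight
        if cumulative >= half:
            return value


# Made-up sample with one high outlier (40).
values = [2, 3, 3, 4, 5, 6, 40]
weights = [1, 2, 2, 1, 1, 1, 1]

loc_mean = mean(values)
loc_weighted_mean = weighted_mean(values, weights)
loc_trimmed_mean = trimmed_mean(values, k=1)
loc_median = median(values)
loc_weighted_median = weighted_median(values, weights)
loc_mode = mode(values)

print(loc_mean, loc_weighted_mean, loc_trimmed_mean)
print(loc_median, loc_weighted_median, loc_mode)
```

With this sample the single outlier lifts the mean well above the median, while trimming one value from each end brings the average back close to the bulk of the data.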
Which average to use:
- Mean: When the data is fairly symmetric and shows only one trend
- Median: When the data is skewed because of outliers
- Mode: When the data is categorical; the mode is the only type of average that applies to categories with no inherent order
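To see why the choice of average depends on the shape of the data, the sketch below adds a single outlier to a symmetric sample and compares how far the mean and the median move. The data is invented, and `skew_direction` is a hypothetical helper that applies only a rough rule of thumb (comparing the mean with the median), not a formal skewness statistic.

```python
from statistics import mean, median, mode

# Made-up, roughly symmetric sample.
symmetric = [48, 49, 50, 50, 51, 52]
right_skewed = symmetric + [200]     # one high outlier
left_skewed = [-100] + symmetric     # one low outlier

# How far does each average move when the high outlier is added?
mean_shift = mean(right_skewed) - mean(symmetric)
median_shift = median(right_skewed) - median(symmetric)


def skew_direction(data):
    """Rough heuristic: outliers pull the mean toward the tail."""
    m, med = mean(data), median(data)
    if m > med:
        return "right"
    if m < med:
        return "left"
    return "none"


# For categorical data the mode is the natural choice.
colours = ["red", "blue", "red", "green"]
favourite_colour = mode(colours)

print(mean_shift, median_shift)
print(skew_direction(symmetric), skew_direction(right_skewed), skew_direction(left_skewed))
print(favourite_colour)
```

The mean moves by more than 20 units while the median does not move at all, which is the behaviour that makes the median the safer summary for skewed data.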
2. Variability analysis
- Deviations (Residuals): The differences between the observed values and the estimate of location
- Variance (Mean Squared Error): The sum of squared deviations from the mean divided by n-1, where n is the number of data values (this is the sample variance; dividing by n instead gives the population variance)
- Standard Deviation (σ): The square root of the variance. The standard deviation is much easier to interpret than the variance because it is on the same scale as the original data
- Mean Absolute Deviation: The mean of the absolute values of the deviations from the mean
- Median Absolute Deviation (MAD): The median of the absolute deviations from the median, that is, the median of |xi - median(x)| over all values xi. It is used instead of the mean absolute deviation when the measure needs to be less affected by extreme values in the tails
- Range: The difference between the largest and smallest values in a data set
- Percentile (Quantile): The value such that P percent of the values take on this value or less and (100-P) percent take on this value or more
- Interquartile Range (IQR): The difference between the 75th percentile and the 25th percentile, and a common measure of variability. Quartiles are the values that split the sorted data into quarters: the lowest is called the lower quartile (25th percentile), the middle quartile is the median (50th percentile), and the highest is called the upper quartile (75th percentile)
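The variability measures above can be computed with Python's standard `statistics` module, as in the minimal sketch below. The sample data is made up. Note that `statistics.quantiles` is used with its default settings, and different percentile conventions (in other libraries or with other options) can give slightly different quartiles for the same data.

```python
from statistics import mean, median, quantiles, stdev, variance

# Made-up sample.
data = [2, 4, 4, 5, 7, 8, 12]

center = mean(data)
deviations = [x - center for x in data]   # residuals around the mean

sample_variance = variance(data)          # divides by n-1
sample_std = stdev(data)                  # square root of the sample variance

mean_abs_dev = mean([abs(d) for d in deviations])

med = median(data)
median_abs_dev = median([abs(x - med) for x in data])   # MAD

data_range = max(data) - min(data)

# n=4 returns the three cut points: lower quartile, median, upper quartile.
q1, q2, q3 = quantiles(data, n=4)
iqr = q3 - q1

print(sample_variance, sample_std)
print(mean_abs_dev, median_abs_dev)
print(data_range, (q1, q2, q3), iqr)
```

The standard deviation, both absolute deviations, the range, and the IQR are all in the same units as the data, whereas the variance is in squared units.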
The variance and standard deviation are the most widespread and routinely reported statistics of variability. Both are sensitive to outliers because they are based on squared deviations. More robust metrics include the mean absolute deviation, the median absolute deviation from the median, and percentiles (quantiles).
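A small experiment makes this sensitivity concrete. The sketch below replaces one value of a made-up sample with an extreme one and compares the standard deviation with two robust measures. The helper names `median_abs_dev` and `iqr` are hypothetical, and the quartiles rely on the default method of `statistics.quantiles`.

```python
from statistics import median, quantiles, stdev


def median_abs_dev(values):
    """Median of the absolute deviations from the median (MAD)."""
    med = median(values)
    return median([abs(x - med) for x in values])


def iqr(values):
    """Difference between the upper and lower quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 - q1


# Made-up sample, then the same sample with its largest value made extreme.
clean = [10, 11, 12, 13, 14, 15, 16]
with_outlier = clean[:-1] + [160]

std_ratio = stdev(with_outlier) / stdev(clean)

print("std:", stdev(clean), stdev(with_outlier))
print("MAD:", median_abs_dev(clean), median_abs_dev(with_outlier))
print("IQR:", iqr(clean), iqr(with_outlier))
```

In this example the standard deviation grows by more than a factor of ten, while the MAD and the IQR stay exactly the same, because neither depends on how far the most extreme value lies from the rest.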
To be continued in Part-2