Statistics

#+STARTED: 09-Oct-2022

Basic terms

Symmetry (probability): a case where multiple viewpoints are essentially the same
- https://math.stackexchange.com/questions/2062947/what-is-symmetry

Average (Arithmetic mean)

     1  ⎛ N-1   ⎞
x̅ = ─── ⎜  ∑  xᵢ⎟
     N  ⎝ i=0   ⎠

There are many kinds of mean.

'Average' usually refers to the arithmetic mean.

https://www.cuemath.com/data/difference-between-average-and-mean/
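
A minimal Python sketch of the arithmetic mean formula above. The sample values are made up for illustration, and the function name `arithmetic_mean` is just a label chosen here.

```python
def arithmetic_mean(xs):
    """Sum the values and divide by their count N."""
    if not xs:
        raise ValueError("mean of an empty sample is undefined")
    return sum(xs) / len(xs)

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data
mean_value = arithmetic_mean(sample)
print(mean_value)  # 5.0
```

The standard library's `statistics.mean` should give the same value for this sample.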

Standard deviation

A measure of 'spread' of data.

aka SD; usually written σ for a population and s for a sample

         ⎡ 1  ⎛ N-1          ⎞⎤
σ² =     ⎢─── ⎜  ∑  (xᵢ - μ)²⎟⎥
         ⎣ N  ⎝ i=0          ⎠⎦



       --------------------------                    
      /  ⎡ 1  ⎛ N-1          ⎞⎤
σ =  /   ⎢─── ⎜  ∑  (xᵢ - μ)²⎟⎥
    √    ⎣ N  ⎝ i=0          ⎠⎦

σ² is variance.

To get a value in the same units as the xᵢ values, we take the square root of σ², which gives the standard deviation σ.
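
A small Python sketch of the two formulas above, dividing by N (the population form used in these notes). The data and function names are hypothetical, chosen so the arithmetic is easy to follow by hand.

```python
import math

def population_variance(xs):
    """Mean of the squared deviations from the mean (divides by N)."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)

def population_std(xs):
    """Square root of the variance, in the same units as the data."""
    return math.sqrt(population_variance(xs))

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data, mean is 5
variance = population_variance(sample)
sigma = population_std(sample)
print(variance)  # 4.0
print(sigma)     # 2.0
```

Note that `statistics.pstdev` matches this N-denominator form, while `statistics.stdev` divides by N-1 (the sample form) and would give a slightly larger value.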

Mode

Most frequently occurring value.

Eg:

In

10, 23, 42, 23, 20, 24, 19, 39, 24, 28, 24

24 is the mode (it occurs three times).

DOUBT: What if there are multiple values which occur most frequently?

https://www150.statcan.gc.ca/n1/edu/power-pouvoir/ch11/mode/5214873-eng.htm
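
A sketch in Python of finding the mode by counting. As an assumption made here, ties are handled by returning every value that reaches the highest count (a data set with two such values is commonly called bimodal). The helper name `modes` is hypothetical.

```python
from collections import Counter

def modes(xs):
    """Return every value that attains the highest count, in first-seen order."""
    counts = Counter(xs)
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]

data = [10, 23, 42, 23, 20, 24, 19, 39, 24, 28, 24]
print(modes(data))             # [24]
print(modes([1, 1, 2, 2, 3]))  # [1, 2]
```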

Median

The middle value when the values are arranged from smallest to largest.
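
A minimal Python sketch of the median. For an even number of values there is no single middle value; the code assumes the common convention of averaging the two middle values, which is not stated in these notes.

```python
def median(xs):
    """Middle value of the sorted data; mean of the two middle values if N is even."""
    ordered = sorted(xs)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty sample is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

data = [10, 23, 42, 23, 20, 24, 19, 39, 24, 28, 24]
print(median(data))          # 24
print(median([1, 3, 5, 7]))  # 4.0
```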

From Britannica:

(mean, mode and median are) the three principal ways of designating the average value of a list of numbers.

DOUBT: How can we get average value from median or mode?

Regression

From Spiegelhalter's popsci book:

any process of fitting lines or curves to data

Difference (or error) of a point from the line: residual

Response variable:

the variable whose values we wish to predict
dependent on the explanatory variable
usually plotted on y-axis

Explanatory variable:

independent variable
used to predict/explain value of response variable
usually plotted on x-axis

The gradient/slope of the regression curve/line: regression coefficient
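
A sketch of these terms in Python, assuming ordinary least squares as the fitting process (the notes only say 'any process of fitting lines or curves'). The data points and the name `fit_line` are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares straight line: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx                    # regression coefficient
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [1, 2, 3, 4, 5]              # explanatory variable (x-axis)
ys = [2.1, 3.9, 6.2, 7.8, 10.1]   # response variable (y-axis)

slope, intercept = fit_line(xs, ys)

# Residual: observed value minus the value predicted by the line.
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]

print(slope, intercept)  # roughly 1.99 and 0.05
print(residuals)
```

For a least-squares line with an intercept, the residuals sum to (approximately) zero, which is a quick sanity check on the fit.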

Statistical model

Model built using available data.
Could be used to predict further data points.

Errors

Type I error	False positive
Type II error	False negative

Algorithm performance

In a classification problem.

Error matrix aka confusion matrix.

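Percentage of actual positives correctly identified (true positive rate): sensitivity
Percentage of actual negatives correctly identified (true negative rate): specificity
Percentage of all cases correctly classified: accuracy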
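
A small Python sketch that builds the four cells of a confusion matrix for a binary classifier and derives the three measures from them. The labels (1 = positive, 0 = negative) and the example predictions are made up.

```python
def confusion_counts(actual, predicted):
    """Return (tp, fp, tn, fn) for binary labels where 1 means positive."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if a and p:
            tp += 1
        elif a and not p:
            fn += 1  # false negative (Type II error)
        elif not a and p:
            fp += 1  # false positive (Type I error)
        else:
            tn += 1
    return tp, fp, tn, fn

actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # hypothetical ground truth
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]  # hypothetical classifier output

tp, fp, tn, fn = confusion_counts(actual, predicted)
sensitivity = tp / (tp + fn)           # share of actual positives found
specificity = tn / (tn + fp)           # share of actual negatives found
accuracy = (tp + tn) / len(actual)     # share of all cases classified correctly

print(tp, fp, tn, fn)  # 3 1 5 1
print(sensitivity, specificity, accuracy)
```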

Markov process

A process where the next state depends only on the current state.

'The future is independent of the past, given the present' in some sense. ˡ
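
A toy illustration in Python: a two-state weather chain where the next state is drawn using only the current state. The states, probabilities, and function names are all invented for this sketch.

```python
import random

# Hypothetical transition probabilities: TRANSITIONS[current][next]
TRANSITIONS = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def next_state(current, rng):
    """Draw the next state; only `current` is consulted, never earlier history."""
    options = TRANSITIONS[current]
    return rng.choices(list(options), weights=list(options.values()))[0]

def simulate(start, steps, seed=0):
    """Run the chain for `steps` transitions with a seeded generator."""
    rng = random.Random(seed)
    path = [start]
    for _ in range(steps):
        path.append(next_state(path[-1], rng))
    return path

path = simulate("sunny", 10)
print(path)
```

The seed only makes runs repeatable; the Markov property shows up in `next_state` taking the current state as its sole input about the chain.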

Monte Carlo methods

https://cermics.enpc.fr/~bl/Halmstad/monte-carlo/lecture-1.pdf

Law of large numbers ??

There is strength in numbers. :-D
https://en.wikipedia.org/wiki/Law_of_large_numbers

Gamma function

Generalization of factorial to non-integer arguments
A function on positive real numbers (Γ: ℝ⁺ → ℝ⁺)
- Can be extended to all complex numbers except zero and the negative integers
Often appears in the normalizing constants of probability distributions like chi-square and gamma.
Can be seen as a smooth curve through the factorial values: the points (n, (n-1)!) lie on it for n = 1, 2, 3, ... (ie, interpolation)
The notation Γ is from the French mathematician Legendre

Γ(z) = (z-1) * Γ(z-1)

Another definitionʳ:

       ∞
Γ(z) = ∫(xᶻ⁻¹). e⁻ˣ. dx
       0

—

For positive integers x, Γ(x) can be expressed in terms of factorial,

∀x ∈ ℕ, x ≥ 1,
  Γ(x) = (x-1)!

—

Handy valuesʷ:

x	Γ(x)	Comment
1/2	√π
1	1	Γ(1) = 0!
3/2	√π/2
-3/2	4√π/3
2	1	Γ(2) = 1!
3	2	Γ(3) = 2!
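
A quick numerical check of the values and identities above using Python's `math.gamma`. Floating-point results are compared with `math.isclose` rather than exact equality; the variable names are just labels for this sketch.

```python
import math

sqrt_pi = math.sqrt(math.pi)

# Handy values from the table
half_ok = math.isclose(math.gamma(0.5), sqrt_pi)
three_halves_ok = math.isclose(math.gamma(1.5), sqrt_pi / 2)
neg_three_halves_ok = math.isclose(math.gamma(-1.5), 4 * sqrt_pi / 3)

# Γ(n) = (n-1)! for positive integers n
factorial_ok = all(
    math.isclose(math.gamma(n), math.factorial(n - 1)) for n in range(1, 10)
)

# Recurrence Γ(z) = (z-1) * Γ(z-1), tried at a non-integer z
z = 4.5
recurrence_ok = math.isclose(math.gamma(z), (z - 1) * math.gamma(z - 1))

print(half_ok, three_halves_ok, neg_three_halves_ok, factorial_ok, recurrence_ok)
```

`math.gamma` raises `ValueError` at zero and the negative integers, where Γ is not defined.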

Range: Difference between largest and smallest values in the sample.
ANOVA: Analysis of Variance
ANCOVA: Analysis of covariance
MANCOVA: Multivariate analysis of covariance

Geometric mean

Useful for data where growth/decline is exponential.
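
A sketch in Python, taking the geometric mean as the N-th root of the product of N positive values (a definition not spelled out in these notes). The yearly growth factors are made up; the computation goes through logarithms to avoid overflow on long products.

```python
import math

def geometric_mean(xs):
    """N-th root of the product of N positive values, computed via logs."""
    if not xs or any(x <= 0 for x in xs):
        raise ValueError("geometric mean needs a non-empty list of positive values")
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

# Hypothetical growth factors: +10%, +20%, -10% over three years
growth_factors = [1.10, 1.20, 0.90]

gm = geometric_mean(growth_factors)
am = sum(growth_factors) / len(growth_factors)

print(gm)  # about 1.059
print(am)  # about 1.067
```

Applying `gm` three times reproduces the total growth (1.10 * 1.20 * 0.90 = 1.188), while the arithmetic mean `am` would overstate it.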