Statistics and probability


Basic terms

Average (Arithmetic mean)

     1  ⎛ N-1   ⎞
x̅ = ─── ⎜  ∑  xᵢ⎟
     N  ⎝ i=0   ⎠

There are many kinds of mean (arithmetic, geometric, harmonic, etc).

'Average' usually refers to the arithmetic mean.

https://www.cuemath.com/data/difference-between-average-and-mean/

Standard deviation

A measure of 'spread' of data.

aka σ, s, SD

         ⎡ 1  ⎛ N-1          ⎞⎤
σ² =     ⎢─── ⎜  ∑  (xᵢ - μ)²⎟⎥
         ⎣ N  ⎝ i=0          ⎠⎦



       --------------------------                    
      /  ⎡ 1  ⎛ N-1          ⎞⎤
σ =  /   ⎢─── ⎜  ∑  (xᵢ - μ)²⎟⎥
    √    ⎣ N  ⎝ i=0          ⎠⎦

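σ² is the variance (μ is the mean of the xᵢ values).

Its unit is the square of the unit of the xᵢ values. To get a value in the same unit as the xᵢ values, we take the square root of σ², which is the standard deviation σ.

Dividing by N as above gives the population variance. The sample variance (usually written s²) divides by N-1 instead.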
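
A minimal Python sketch of the mean, variance and standard deviation formulas above (the population form, dividing by N). The function names and the sample data are made up for illustration.

```python
import math

def mean(xs):
    # Arithmetic mean: sum of the values divided by their count.
    return sum(xs) / len(xs)

def population_variance(xs):
    # Average squared distance from the mean (divides by N).
    mu = mean(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)

def population_std(xs):
    # Square root of the variance, so it has the same unit as the data.
    return math.sqrt(population_variance(xs))

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample
print(mean(data), population_variance(data), population_std(data))  # 5.0 4.0 2.0
```

The standard library's statistics.pvariance and statistics.pstdev compute the same population quantities.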

Mode

Most frequently occurring value.

Eg:

In

10, 23, 42, 23, 20, 24, 19, 39, 24, 28, 24

24 is the mode (it occurs three times; 23 occurs twice and the rest occur once).

DOUBT: What if there are multiple values which occur most frequently?

https://www150.statcan.gc.ca/n1/edu/power-pouvoir/ch11/mode/5214873-eng.htm
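
A small Python sketch for finding the mode of the list above, assuming Python 3.8 or later for statistics.multimode. The list 'tied' is a made-up example with two equally frequent values, to show what the library does in that case.

```python
from statistics import mode, multimode

values = [10, 23, 42, 23, 20, 24, 19, 39, 24, 28, 24]

print(mode(values))       # 24
print(multimode(values))  # [24]

# Made-up list where 1 and 2 both occur twice.
tied = [1, 1, 2, 2, 3]
print(multimode(tied))    # [1, 2]
```

multimode returns every value that shares the highest count, in the order first seen, so a tie shows up as a list with more than one entry.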

Median

The middle value when the values are arranged from smallest to largest. When the number of values is even, it is usually taken as the mean of the two middle values.
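
A sketch of computing the median by sorting, in Python. The helper name median_by_sorting is made up; for an even number of values it averages the two middle values, which is the usual convention and also what statistics.median does.

```python
from statistics import median

def median_by_sorting(xs):
    # Sort, then pick the middle value (or average the two middle values).
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2

values = [10, 23, 42, 23, 20, 24, 19, 39, 24, 28, 24]
print(median_by_sorting(values))        # 24
print(median(values))                   # 24
print(median_by_sorting([1, 2, 3, 4]))  # 2.5
```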

From Britannica:

(mean, mode and median are) the three principal ways of designating the average value of a list of numbers.

DOUBT: How can we get average value from median or mode?

Regression

From Spiegelhalter's popsci book:

any process of fitting lines or curves to data

Difference (or error) of a point from the line: residual

Response variable:

Explanatory variable:

The gradient/slope of the regression curve/line: regression coefficient
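
As one possible instance of 'fitting a line to data', here is a sketch of ordinary least squares for a straight line in Python. Least squares is only one such fitting process (it is an assumption here, not something the definition above requires), and the function name and data points are made up.

```python
def fit_line(xs, ys):
    # Ordinary least squares for y = slope * x + intercept.
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx                      # regression coefficient
    intercept = y_mean - slope * x_mean
    return slope, intercept

xs = [1, 2, 3, 4, 5]   # explanatory variable (made-up values)
ys = [2, 4, 5, 4, 5]   # response variable (made-up values)

slope, intercept = fit_line(xs, ys)

# Residual: observed value minus the value predicted by the line.
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
print(slope, intercept)
print(residuals)
```

For a least squares line with an intercept, the residuals sum to (approximately) zero.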

Statistical model

Errors

Type I error:   False positive
Type II error:  False negative

Algorithm performance

For a classification problem, performance can be summarised with an error matrix, aka confusion matrix: a table of counts of predicted class against actual class.
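
A sketch of counting the four cells of a confusion matrix for a two-class problem, in Python. The function name, the cell labels (TP, FP, FN, TN) and the labels/predictions are made up for illustration.

```python
def confusion_counts(actual, predicted):
    # Count each combination of actual class and predicted class.
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if p and a:
            counts["TP"] += 1   # true positive
        elif p and not a:
            counts["FP"] += 1   # false positive (Type I error)
        elif not p and a:
            counts["FN"] += 1   # false negative (Type II error)
        else:
            counts["TN"] += 1   # true negative
    return counts

actual    = [True, True, False, False, True, False]
predicted = [True, False, True, False, True, False]

counts = confusion_counts(actual, predicted)
print(counts)  # {'TP': 2, 'FP': 1, 'FN': 1, 'TN': 2}
```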

Markov process

A process where the next state depends only on the current state.

'The future is independent of the past, given the present' in some sense. ˡ
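
A sketch of simulating a Markov process in Python. The two weather states and their transition probabilities are invented for illustration. Note that next_state looks only at the current state, never at the earlier history.

```python
import random

# Hypothetical transition probabilities: TRANSITIONS[current][next].
TRANSITIONS = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def next_state(current, rng):
    # The next state is drawn using the current state only.
    options = TRANSITIONS[current]
    return rng.choices(list(options), weights=list(options.values()))[0]

def simulate(start, steps, rng):
    path = [start]
    for _ in range(steps):
        path.append(next_state(path[-1], rng))
    return path

rng = random.Random(0)  # fixed seed so the run is repeatable
path = simulate("sunny", 10, rng)
print(path)
```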

Probability distribution models

Poisson distribution

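Models the number of times an event happens in a fixed time interval, when occurrences are independent of each other and happen at a constant average rate.

Probability of exactly k events in the interval (probability mass function), where λ is the average number of events per interval: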

        λᵏ · e^(-λ)
P(k) = ─────────────
            k!

The Poisson mass function is not continuous. It's defined only for non-negative integer values of k (k = 0, 1, 2, ...).

See: https://brilliant.org/wiki/poisson-distribution/
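
A sketch of the Poisson probability mass function in Python, using the standard form λᵏ · e^(-λ) / k!. The function name and the rate λ = 3 are made up for illustration.

```python
import math

def poisson_pmf(k, lam):
    # P(k) = lam^k * e^(-lam) / k!, for k = 0, 1, 2, ...
    return lam ** k * math.exp(-lam) / math.factorial(k)

lam = 3.0  # assumed average number of events per interval
probs = [poisson_pmf(k, lam) for k in range(50)]

for k in range(6):
    print(k, round(probs[k], 4))

# The probabilities over all k add up to 1 (the tail past k = 49 is negligible here).
print(sum(probs))
```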

Gamma distribution

DBT: Mean is past the midpoint in the graph always??

Normal distribution

Categorical distribution

Examples:

More

Bayes' theorem

Derivationʳ:

P(A ∩ B) is the probability of A times the probability of B given that A has already happened.

P(A ∩ B) = P(A) * P(B/A)

It could also be defined as the probability of B times the probability of A given that B has already happened.

P(A ∩ B) = P(B) * P(A/B)

Equating the two,

P(A) * P(B/A) = P(B) * P(A/B)


                 P(A) * P(B/A)
 =>    P(A/B) = ──────────────
                     P(B)
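
A worked numeric sketch of the formula above in Python. The scenario (A = 'has the condition', B = 'test is positive') and all three input probabilities are hypothetical. P(B) is expanded with the law of total probability, P(B) = P(A) * P(B/A) + P(not A) * P(B/not A), which is an extra step not in the derivation above.

```python
def posterior(p_a, p_b_given_a, p_b_given_not_a):
    # P(B) by the law of total probability.
    p_b = p_a * p_b_given_a + (1 - p_a) * p_b_given_not_a
    # Bayes' theorem: P(A/B) = P(A) * P(B/A) / P(B)
    return p_a * p_b_given_a / p_b

p_a = 0.01               # hypothetical P(A)
p_b_given_a = 0.95       # hypothetical P(B/A)
p_b_given_not_a = 0.05   # hypothetical P(B/not A)

p_a_given_b = posterior(p_a, p_b_given_a, p_b_given_not_a)
print(round(p_a_given_b, 3))  # 0.161
```

Even with a fairly accurate test, P(A/B) stays low here because A itself is rare.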

Monte Carlo methods

https://cermics.enpc.fr/~bl/Halmstad/monte-carlo/lecture-1.pdf

Law of large numbers ??
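
A common illustration of a Monte Carlo method, sketched in Python: estimating π from random points in the unit square. The π example is an assumption chosen for illustration, not taken from the lecture linked above. The fraction of points falling inside the quarter circle approaches π/4 as the number of points grows, which is the law of large numbers at work.

```python
import math
import random

def estimate_pi(n, rng):
    # Fraction of random points in the unit square that fall inside
    # the quarter circle of radius 1, times 4.
    inside = 0
    for _ in range(n):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4 * inside / n

rng = random.Random(42)  # fixed seed so the run is repeatable
for n in (100, 10_000, 100_000):
    estimate = estimate_pi(n, rng)
    print(n, estimate, abs(estimate - math.pi))
```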

Gamma function

Γ(z) = (z-1) * Γ(z-1)

Another definitionʳ (for z with positive real part):

       ∞
Γ(z) = ∫(xᶻ⁻¹). e⁻ˣ. dx
       0

For positive integers x, Γ(x) can be expressed in terms of factorial,

∀x ∈ ℕ, x ≥ 1,
  Γ(x) = (x-1)!

Handy valuesʷ:

x      Γ(x)     Comment
1/2    √π
1      1        Γ(1) = 0!
3/2    √π/2
-3/2   4√π/3
2      1        Γ(2) = 1!
3      2        Γ(3) = 2!
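
A quick numerical check of the handy values, the factorial relation and the recurrence, sketched in Python with the standard library's math.gamma. The dictionary name 'table' and the test point z = 4.5 are made up.

```python
import math

root_pi = math.sqrt(math.pi)

# x -> expected Γ(x), taken from the table of handy values.
table = {
    0.5: root_pi,
    1: 1,
    1.5: root_pi / 2,
    -1.5: 4 * root_pi / 3,
    2: 1,
    3: 2,
}

for x, expected in table.items():
    print(x, math.gamma(x), expected)

# Γ(x) = (x-1)! for positive integers x.
for x in range(1, 8):
    print(x, math.gamma(x), math.factorial(x - 1))

# Recurrence Γ(z) = (z-1) * Γ(z-1), checked at an arbitrary point.
z = 4.5
print(math.gamma(z), (z - 1) * math.gamma(z - 1))
```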

Central limit theorem

Given a collection of points (from any probability distribution with finite variance; it need not be normal), if we repeatedly select k points with replacement (ie, the points are considered to be 'put back' after each trial) and take the mean of each selection, those means will be approximately normally distributed. The approximation gets better as k grows.
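
A simulation sketch of this in Python (3.8 or later, for statistics.fmean). The population is drawn from an exponential distribution, which is clearly not normal. The distribution, k = 50 and the number of trials are arbitrary choices for illustration.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# A skewed, non-normal collection of points.
population = [rng.expovariate(1.0) for _ in range(10_000)]

k = 50          # points selected per trial
trials = 2_000  # number of trials

# rng.choices samples with replacement.
sample_means = [statistics.fmean(rng.choices(population, k=k)) for _ in range(trials)]

centre = statistics.fmean(sample_means)
spread = statistics.stdev(sample_means)

# The means cluster around the population mean, with spread close to
# (population standard deviation) / sqrt(k).
print(statistics.fmean(population), centre)
print(statistics.pstdev(population) / k ** 0.5, spread)

# For a normal distribution, roughly 68% of values lie within one
# standard deviation of the centre.
within_one_sd = sum(abs(m - centre) <= spread for m in sample_means) / trials
print(within_one_sd)
```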

See:

More

Geometric mean

The n-th root of the product of n values. Useful for data where growth/decline is exponential (multiplicative), eg averaging growth rates.
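
A sketch in Python (3.8 or later, for math.prod and statistics.geometric_mean). The yearly growth factors are made up: +10%, +20% and -10%.

```python
import math
from statistics import geometric_mean

factors = [1.10, 1.20, 0.90]  # made-up growth factors for three years

# n-th root of the product of the n values.
manual = math.prod(factors) ** (1 / len(factors))
library = geometric_mean(factors)
print(manual, library)

# Growing by the geometric mean every year gives the same overall growth.
print(manual ** len(factors), math.prod(factors))
```

The arithmetic mean of the same factors would overstate the overall growth, which is why the geometric mean is preferred for this kind of data.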