Probabilistic programming from the ground up


D1

PPLs are of interest in PL and AI

Programs which incorporate an element of uncertainty

  1. You want to verify programs
  2. Programs have uncertainty in them
    • Randomized algorithms
    • Distributed systems and networking
    • Differential privacy: guarantees a program doesn't leak sensitive data

Programs should have probabilistic semantics.

Kozen (1981) coined the term "probabilistic programming" (Semantics of Probabilistic Programs).

Why PPL from an AI perspective?

  1. Need to create agents that reason rationally about the world
  2. Need to represent the world to the computer

          Encode               Reasoning
World ------------> Computer ------------> Action
      (modeling
       paradigm)

The encoding: Sounds like computer programs!

  1. But the world is too complex to encode directly.
  2. Use probabilities to simplify the world model.

Judea Pearl: a Turing Award winner

A complex probabilistic graphical model (PGM) is difficult to understand; can PPLs do better?

Goal of this lecture series: learn to build a series of PPLs.

Lectures:

  1. Syntax and semantics of PPLs
  2. Sampling
  3. Tractability and expressivity

How does PP make reasoning better?

Probabilistic programs: Programs that denote probability distributions

Pseudo-random number generators

// x and y would get tt with probability 1/2
x <- flip 1/2;
y <- flip 1/2;
return x ∨ y

The probability that the return value is true is 3/4.

[tt ↦ 3/4, ff ↦ 1/4]

For now, consider a finite sample space Ω.

Ω = {tt, ff}

A probability distribution is a function Pr: Ω → [0, 1] such that Σ_{ω ∈ Ω} Pr(ω) = 1.

Semantic brackets ⟦⋅⟧

Tiny PPL

Syntax

# Pure terms
p := true | false
   | x
   | if p then p else p
   | p ∧ p
   | p ∨ p

# Effectful or probabilistic terms
e := x <- e; e   # bind: sequence probabilistic computations
   | return p    # lift a pure term to a probabilistic term
   | flip θ      # introduce randomness; θ ∈ [0, 1] is a probability

This is a valid program as per above:

x <- return true;
return x

Semantics

Semantics of pure terms:

Let's use an environment-based semantics: free variables are looked up in an environment.

⟦p⟧ₚ: Env -> Bool

This is a partial function: if a variable is not in the environment, its value under this function is undefined.

Example: ⟦x ∨ false⟧ₚ([x ↦ tt]) = tt, while ⟦y⟧ₚ([x ↦ tt]) is undefined.

Semantics of probabilistic terms:

⟦e⟧: Env -> ({tt, ff} -> [0, 1])

Given an environment and a value, there is an associated probability.

⟦flip r⟧(ρ) = [tt ↦ r, ff ↦ 1−r], where r ∈ [0, 1]

A Bernoulli distribution.

⟦return p⟧(ρ) = v ↦ (if ⟦p⟧ₚ(ρ) = v then 1 else 0)

Now for bind, the main thing.

⟦x ← e₁; e₂⟧(ρ)

The semantics must be compositional: the meaning of the whole must be defined in terms of the meanings of e₁ and e₂.

An example:

⟦x ← flip 1/2; e⟧(ρ)

Do case analysis on the value bound to x, weighting each case by its probability:

⟦x ← flip 1/2; e⟧(ρ) = v ↦ 1/2 · ⟦e⟧(ρ[x ↦ tt])(v) + 1/2 · ⟦e⟧(ρ[x ↦ ff])(v)

Now the real one:

⟦x ← e₁; e₂⟧(ρ) = v ↦ Σ_{v′ ∈ {tt, ff}} ⟦e₁⟧(ρ)(v′) · ⟦e₂⟧(ρ[x ↦ v′])(v)

(Why a product? Each flip is fresh randomness, so the weight e₁ assigns to v′ and the weight e₂ assigns to v under that binding are independent.)

This is a monadic semantics (the distribution monad). A small interpreter sketch follows.
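A minimal Python sketch of these semantics (illustrative only: the tuple encoding of terms and the names eval_pure/denote are ours, not the course's):

# Pure terms: True/False, "x" (variable), ('if', p1, p2, p3), ('and', p1, p2), ('or', p1, p2)
def eval_pure(p, env):
    if isinstance(p, bool):
        return p
    if isinstance(p, str):
        return env[p]                      # partial: fails if p is unbound
    tag = p[0]
    if tag == 'if':
        return eval_pure(p[2], env) if eval_pure(p[1], env) else eval_pure(p[3], env)
    if tag == 'and':
        return eval_pure(p[1], env) and eval_pure(p[2], env)
    if tag == 'or':
        return eval_pure(p[1], env) or eval_pure(p[2], env)

# Probabilistic terms: ('flip', theta), ('return', p), ('bind', x, e1, e2)
def denote(e, env):
    """⟦e⟧(env): a distribution {True: pr, False: pr}."""
    tag = e[0]
    if tag == 'flip':
        return {True: e[1], False: 1 - e[1]}
    if tag == 'return':
        v = eval_pure(e[1], env)
        return {True: 1.0 if v else 0.0, False: 0.0 if v else 1.0}
    if tag == 'bind':                      # x <- e1; e2
        _, x, e1, e2 = e
        out = {True: 0.0, False: 0.0}
        for v1, p1 in denote(e1, env).items():   # sum over values bound to x
            for v2, p2 in denote(e2, {**env, x: v1}).items():
                out[v2] += p1 * p2
        return out

# x <- flip 1/2; y <- flip 1/2; return x ∨ y
prog = ('bind', 'x', ('flip', 0.5),
        ('bind', 'y', ('flip', 0.5),
         ('return', ('or', 'x', 'y'))))
print(denote(prog, {}))   # {True: 0.75, False: 0.25}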

Tractability

Complexity of exact inference: #P-hard.

Shown by reducing #SAT to TinyPPL inference.

Example: φ = (x ∨ y) ∧ (y ∨ z)

x <- flip 1/2;
y <- flip 1/2;
z <- flip 1/2;
return (x ∨ y) ∧ (y ∨ z)

#SAT(φ) = 2^(#vars) · ⟦encoded prog⟧(∅)(tt)

Check: φ has 5 satisfying assignments, and the program above returns tt with probability 5/8, so 2³ · 5/8 = 5.
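We can check this reduction numerically with the interpreter sketch from earlier (assumes our illustrative denote is in scope):

# Encode φ = (x ∨ y) ∧ (y ∨ z) as a TinyPPL program and recover #SAT(φ).
phi = ('bind', 'x', ('flip', 0.5),
       ('bind', 'y', ('flip', 0.5),
        ('bind', 'z', ('flip', 0.5),
         ('return', ('and', ('or', 'x', 'y'), ('or', 'y', 'z'))))))
print(2 ** 3 * denote(phi, {})[True])   # 5.0 = #SAT(φ)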

Systems in the wild

Tradeoffs in design decisions.

HW: TinyPPL + TinyCond; see also WebPPL

D2: Conditioning and sampling

Reachability in a graph/network

     R2
     /\
    /  \
   /    \
R1      R4
   \    /
    \  /
     \/
     R3

Each link has a failure probability; lᵢⱼ is true when the link Rᵢ–Rⱼ is up (e.g., flip 0.99 models a 1% failure rate).

# Probability that R1 reaches R4:
route <- flip 0.5;
l12 <- flip 0.99;
l24 <- flip _;
l13 <- flip _;
l34 <- flip _;
return
  if route then l12 ∧ l24
  else l13 ∧ l34

TinyCond

A PPL with conditionals.

Syntax:

p := true | false
   | x
   | if p then p else p
   | p ∧ p
   | p ∨ p

# Effectful or probabilistic terms
e := x <- e; e  
   | return p
   | flip θ
   | observe p; e  # condition the rest of the program on p being true

Bayesian conditioning.

Example 2: We observe that R1 cannot reach R4. What's the probability that link l12 failed?

# Observe no route from R1 to R4; query whether l12 is down.
# (No ¬ in the grammar: encode ¬p as `if p then false else true`.)
route <- flip 0.5;
l12 <- flip 0.99;
l24 <- flip _;
l13 <- flip _;
l34 <- flip _;
observe (if (if route then l12 ∧ l24 else l13 ∧ l34) then false else true);
return (if l12 then false else true)

Example 3: Flip two coins; observe that at least one of them is heads. What's the probability that the first one is heads?

# Observe at least one heads; query the first coin:
x <- flip 0.5;
y <- flip 0.5;
observe (x ∨ y);
return x

Answer: 2/3. Working:

The outcome x=F, y=F is eliminated, because at least one coin is true. That leaves 3 equally likely outcomes, but their probabilities now sum to 3/4, so we no longer have a probability distribution. Normalize by dividing each remaining probability by 3/4:

x=F, y=T:  1/4 becomes (1/4)/(3/4) = 1/3
x=T, y=F:  1/4 becomes (1/4)/(3/4) = 1/3
x=T, y=T:  1/4 becomes (1/4)/(3/4) = 1/3

This is Bayes' rule.

We are interested in x being true, so rule out x=F, y=T and sum what's left:

=> Result of the above program = 1/3 + 1/3 = 2/3

Semantics of TinyCond

Unnormalized semantics ⟦·⟧ᵤ: doesn't necessarily sum to 1. A sub-distribution has all the features of a probability distribution except that the total mass may be less than 1.

The key case: ⟦observe p; e⟧ᵤ(ρ)(v) = if ⟦p⟧ₚ(ρ) = tt then ⟦e⟧ᵤ(ρ)(v) else 0.

Subscript u for unnormalized.

Now, the normalized semantics. From the unnormalized version.

                  ⟦e⟧ᵤ(ρ)(v)
⟦e⟧(ρ)(v) = ──────────────────────────
             ⟦e⟧ᵤ(ρ)(tt) + ⟦e⟧ᵤ(ρ)(ff)

(The denominator can be 0, e.g. under `observe false`; then the normalized semantics is undefined. See below.)
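Here's a sketch extending the earlier interpreter with observe and normalization (denote_u and denote_norm are our illustrative names; assumes eval_pure from before):

# Unnormalized semantics: observe p; e keeps e's mass only when p holds.
def denote_u(e, env):
    tag = e[0]
    if tag == 'flip':
        return {True: e[1], False: 1 - e[1]}
    if tag == 'return':
        v = eval_pure(e[1], env)
        return {True: 1.0 if v else 0.0, False: 0.0 if v else 1.0}
    if tag == 'observe':                   # observe p; e
        _, p, e1 = e
        if eval_pure(p, env):
            return denote_u(e1, env)
        return {True: 0.0, False: 0.0}     # failed observation: drop the mass
    if tag == 'bind':
        _, x, e1, e2 = e
        out = {True: 0.0, False: 0.0}
        for v1, p1 in denote_u(e1, env).items():
            for v2, p2 in denote_u(e2, {**env, x: v1}).items():
                out[v2] += p1 * p2
        return out

def denote_norm(e, env):
    d = denote_u(e, env)
    z = d[True] + d[False]    # normalizing constant; 0 iff no observation ever succeeds
    return {v: p / z for v, p in d.items()}

# x <- flip 1/2; y <- flip 1/2; observe x ∨ y; return x
prog = ('bind', 'x', ('flip', 0.5),
        ('bind', 'y', ('flip', 0.5),
         ('observe', ('or', 'x', 'y'),
          ('return', 'x'))))
print(denote_norm(prog, {}))   # {True: 0.666…, False: 0.333…}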

observe, as a program transformation:

x <- flip 1/2;       x <- flip 2/3;       x <- flip 2/3;
y <- flip 1/2;   ≡   y <- flip 2/3;   ≡   return x
observe x ∨ y;       return x
return x

Are these the same? For the return value, yes: all three denote [tt ↦ 2/3, ff ↦ 1/3], since after conditioning Pr(x = tt) = (1/2)/(3/4) = 2/3.

Compare:

while true: pass

and

observe false

Both denote the zero sub-distribution in the unnormalized semantics: at this level, a failed observation and nontermination are indistinguishable.


Will conditioning still make things more complex?

Operational sampling semantics

The expectation of a random function is the average value of the function.

We can approximate the expectation with a finite number of samples:

E_{Pr}[f] = Σ_ω Pr(ω) · f(ω) ≈ (1/N) Σ_{i=1}^{N} f(ωᵢ),   ωᵢ ~ Pr

This is Monte Carlo estimation.

ω ~ Pr means ω is drawn from the probability distribution Pr.

Concentration inequalities (e.g., Hoeffding's) bound the number of samples needed to approximate the mean of a random variable. But this works well only for some classes of problems. Which classes? (See D3.)
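A quick Monte Carlo sketch (illustrative): f is the indicator of tt, and Pr is the output distribution of the x ∨ y program above, i.e. Pr(tt) = 3/4.

# Approximate E_Pr[f] by (1/N) Σ f(ω_i), ω_i ~ Pr.
import random

rng = random.Random(0)
f = lambda w: 1.0 if w else 0.0       # indicator function of tt
pr_true = 0.75                        # Pr(ω = tt) for the x ∨ y program

for N in (100, 10_000, 1_000_000):
    est = sum(f(rng.random() < pr_true) for _ in range(N)) / N
    print(N, est)                     # estimates tighten around 0.75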

Efficiency

To prove soundness of our runtime strategy, we relate it to an expectation.


Semantics: we try a big-step style; small-step is also possible.

PAPER: Culpepper and Cobb, Contextual Equivalence (ESOP 2017)

σ ⊢ ⟨e, ρ⟩ ⇓ v means: given random bit-stream σ, term e under environment ρ evaluates to v.

Externalize the randomness: put it all in one place, the stream σ. For now, consider only flip 1/2.

πₗ(σ) ⊢ ⟨e₁, ρ⟩ ⇓ v        πᵣ(σ) ⊢ ⟨e₂, ρ[x ↦ v]⟩ ⇓ v′
──────────────────────────────────────────────────────
            σ ⊢ ⟨x ← e₁; e₂, ρ⟩ ⇓ v′

πₗ(σ) splits σ in two and takes the left part; likewise πᵣ(σ) for the right. This splits the randomness between e₁ and e₂.

     ⟦p⟧ₚ(ρ) = v
──────────────────────
σ ⊢ ⟨return p, ρ⟩ ⇓ v

(And flip 1/2 consumes the first bit of σ: σ ⊢ ⟨flip 1/2, ρ⟩ ⇓ head(σ).)

OT:


E_{σ~Pr}[𝟙(eval(e, σ, ρ) = tt)] = ⟦e⟧(ρ)(tt)

where Pr is uniform distr on infinite bit streams.
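A sketch of the sampler and an empirical check of this equation (the PRNG stands in for the bit stream σ; term encoding and eval_pure as in the earlier sketches):

# Big-step sampler: eval(e, σ, ρ), with σ modeled lazily by a PRNG.
import random

def eval_sample(e, env, rng):
    tag = e[0]
    if tag == 'flip':
        return rng.random() < e[1]     # consume randomness from σ
    if tag == 'return':
        return eval_pure(e[1], env)
    if tag == 'bind':                  # conceptually: split σ between e1 and e2
        _, x, e1, e2 = e
        v = eval_sample(e1, env, rng)
        return eval_sample(e2, {**env, x: v}, rng)

# E[1(eval(e, σ, ρ) = tt)] ≈ ⟦e⟧(ρ)(tt) for the x ∨ y program:
prog = ('bind', 'x', ('flip', 0.5),
        ('bind', 'y', ('flip', 0.5),
         ('return', ('or', 'x', 'y'))))
rng = random.Random(0)
N = 100_000
print(sum(eval_sample(prog, {}, rng) for _ in range(N)) / N)   # ≈ 0.75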

Bind case:

E[f · g] = E[f] · E[g]

which holds when f and g are independent; here independence comes from splitting σ into the disjoint parts πₗ(σ) and πᵣ(σ).


OT:

x := 0
while flip 1/2:
    x := x + 1

The semantics of the loop should be the limit of its finite unrollings (here x ends up geometrically distributed).
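A quick simulation of this loop (our sketch; under the geometric distribution the mean is 1):

# x := 0; while flip 1/2: x := x + 1
import random

def run(rng):
    x = 0
    while rng.random() < 0.5:
        x += 1
    return x

rng = random.Random(0)
N = 100_000
print(sum(run(rng) for _ in range(N)) / N)   # ≈ 1.0: the empirical mean approaches a limit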

References for the measure-theoretic treatment:

D3: Tractability and expressivity

When does sampling work well and poorly?

Eg:

x <- flip 10^-6
y <- flip 10^-6
return x ∧ y
// Question: what if we returned x ∨ y instead?

Each flip here comes up true only about once in 10⁶ samples, and Pr(x ∧ y) = 10⁻¹², so it takes on the order of 10¹² samples before the program ever returns true.

I.e., sampling works poorly for estimating low-probability events.

Conditioning and sampling don't play well together. Why? Consider:

x <- flip 10^-6
y <- flip 10^-6
observe x ∧ y
return x

With rejection sampling, every run whose observation fails is discarded; here the observation succeeds with probability 10⁻¹², so essentially all samples are rejected. Let's explore alternatives to sampling.

Search-based:

x <- flip 1/2
y <- if x then flip 1/3 else flip 1/4
z <- if y then flip 1/6 else flip 1/7
return z

It's a probabilistic if (which we haven't defined yet).
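Search-based inference enumerates the branches of this program, multiplying probabilities along each path (a sketch; the probabilistic if is unrolled by hand):

# Enumerate the search tree of the x/y/z program and sum the z = true leaves.
def pr_z_true():
    total = 0.0
    for x, px in [(True, 0.5), (False, 0.5)]:
        py = 1/3 if x else 1/4                 # y <- if x then flip 1/3 else flip 1/4
        for y, pyv in [(True, py), (False, 1 - py)]:
            pz = 1/6 if y else 1/7             # z <- if y then flip 1/6 else flip 1/7
            total += px * pyv * pz             # path probability times Pr(z = tt)
    return total

print(pr_z_true())   # ≈ 0.1498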

Building a search tree:


                  x
                 / \
                /   \
               y     y
              / \   / \
             z   z z   z

(Each edge carries the probability of taking that branch; a leaf's probability is the product along its path.)

Work inductively from leaves.

But there's some redundancy: the z subtrees are identical, because z doesn't depend on x, so the tree contains copies.

Could we do better and reuse the shared substructure? Let's try.

<graph here>

This is what's called Binary Decision Diagrams (BDDs).

Can we build the BDD without explicitly building the full search tree first?

Tractability and expressivity

Language-design landscape:

<figure: the language-design landscape, plotting tractability against expressivity>

We could compile a more expressive language to a more tractable one.


Knowledge compilation - Adnan Darwiche

A DNF: AB + ACD′ + …

SAT is NP-complete in general, but DNF satisfiability is easy: any single term yields a satisfying assignment.

What kind of structure enables efficient reasoning?

Succinctness:

PAPER: On the expressive power of programming languages - Matthias Felleisen

Compiling TinyPPL to BDD

PAPER: Holtzen et al., OOPSLA 2020. Desired: compositional compilation, which allows locality.

Composing BDDs

Base cases:

<diagram: a BDD node A with two children B1 and B2; two such BDDs are combined with ∧ via 'apply'>

BDDs need some structural invariants:

  1. Ordering: variables are encountered in the same order along every path, and each at most once
  2. Reduction: no isomorphic subgraphs
    • This also means no vacuous nodes (nodes whose two children are equal)

These two together: Canonicity

⟦ψ₁⟧ = ⟦ψ₂⟧ ⟺ BDD(ψ₁) = BDD(ψ₂)   (for a fixed variable order)

See: https://github.com/SHoltzen/oplss24-code/blob/main/6-tinybdd/bdd.ml

This is the bottom-up way to combine BDDs (the 'apply' operation); a sketch follows.
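A minimal Python sketch of reduced, ordered BDDs with a unique table and the ∧ 'apply' (illustrative; the real course implementation is the OCaml code linked above):

# Nodes are ints: 0 = False leaf, 1 = True leaf; others index into `nodes`.
nodes = {}        # node id -> (var, lo, hi)
unique = {}       # (var, lo, hi) -> node id (hash-consing table)

def mk(var, lo, hi):
    if lo == hi:                   # reduction: no vacuous nodes
        return lo
    key = (var, lo, hi)
    if key not in unique:          # sharing: no isomorphic subgraphs
        unique[key] = len(nodes) + 2
        nodes[unique[key]] = key
    return unique[key]

def apply_and(u, v, memo={}):
    """Combine two BDDs bottom-up with ∧ (the classic 'apply')."""
    if u == 0 or v == 0: return 0
    if u == 1: return v
    if v == 1: return u
    if (u, v) in memo: return memo[(u, v)]
    (xu, lu, hu), (xv, lv, hv) = nodes[u], nodes[v]
    if xu == xv:
        r = mk(xu, apply_and(lu, lv), apply_and(hu, hv))
    elif xu < xv:                  # respect the global variable order
        r = mk(xu, apply_and(lu, v), apply_and(hu, v))
    else:
        r = mk(xv, apply_and(u, lv), apply_and(u, hv))
    memo[(u, v)] = r
    return r

# Variables are ints in branching order: x = 0, y = 1.
x = mk(0, 0, 1)                    # BDD for the formula "x"
y = mk(1, 0, 1)                    # BDD for "y"
print(apply_and(x, y) == mk(0, 0, mk(1, 0, 1)))   # True: canonical form is shared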

An advantage of BDDs: a lot of efficient BDD engineering already exists, so we can piggyback on it.

D4: Separation logic, perspectives

Probability + mutable states

Disjoint parts of the heap

{⊤}
  x := alloc 10
{x ↦ 10}
  y := alloc 20
{x ↦ 10 * y ↦ 20}

* (separating conjunction) means the two assertions hold over disjoint parts of the heap.

{⊤}
  x := flip 1/2
{x~Bernoulli(1/2)}
  y := flip 1/2
{x~Bernoulli(1/2) * y~Bernoulli(1/2)}

The frame rule in seplog, for triples {P} c {Q}:

  {P} c {Q}
--------------
{P*F} c {Q*F}

Seplog for heap programs: assertions are interpreted over a stack and a heap (s, h).

(s,h) ⊨ P      (s,h) ⊨ Q
─────────────────────────
      (s,h) ⊨ P ∧ Q

(s,h) ⊨ P * Q  iff  ∃ h₁, h₂.  h = h₁ ⊎ h₂  and  (s,h₁) ⊨ P  and  (s,h₂) ⊨ Q

Table of analogies:

Heap (seplog)                  Probability
─────────────                  ───────────
disjoint union h₁ ⊎ h₂         independent combination
heap subset                    sub-probability space
heap disjunction               ?
ownership: l is in the heap    ownership: x is measurable

Models are a store plus a probability space:

(S, (Ω, E, μ)) ⊨ ⊤

    Ω
+------+
|      |
| 1/2  | ---------> T
|      |
+------+
|      |
| 1/2  | ---------> F
|      |
+------+

We've got to carve out a part of Ω that is independent of what's already been allocated.

    Ω
+------+------+          (for x)
|      |      |
|      |      | ---------> T
|      |      |
+------+------+
|      |      |
|      |      | ---------> F
|      |      |
+------+------+
   |      |
   |      |
   v      v
   T      F      (for y)

Statistical independence: x ⟂ y iff Pr(x=T ∧ y=T) = Pr(x=T) · Pr(y=T) (and similarly for the other value combinations).
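A tiny check of this definition on the 2×2 picture (our sketch):

# The 2×2 sample space from the picture: x reads the first coordinate, y the second.
omega = [(bx, by) for bx in (True, False) for by in (True, False)]
mu = {w: 0.25 for w in omega}        # uniform measure on the grid

def prob(event):
    return sum(p for w, p in mu.items() if event(w))

px, py = prob(lambda w: w[0]), prob(lambda w: w[1])
pxy = prob(lambda w: w[0] and w[1])
print(pxy == px * py)                # True: the grid construction makes x ⟂ y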

Independent combination

How do we show this models x ~ Bern(1/2) * y ~ Bern(1/2)?


We should be able to split Ω (pictured above) into two independent spaces. Let's do that.

For x~Bern 1/2

    Ω
+-------------+
|             |
+-------------|
|             |
+-------------+

For y~Bern 1/2

    Ω
+------+------+
|      |      |
|      |      |
|      |      |
+------+------+
+-------------+     +------+------+     +------+------+
|             |     |      |      |     |      |      |
+-------------+  ⋅  |      |      |  =  +------+------+
|             |     |      |      |     |      |      |
+-------------+     +------+------+     +------+------+

(horizontal split for x)  ⋅  (vertical split for y)  =  the 2×2 grid


(Ω, E₁, μ₁) ⋅ (Ω, E₂, μ₂) = (Ω, E₁ ⊔ E₂, μ)

This was like ⊎ (disjoint union), but for probability.

(S, (Ω, E, μ)) ⊨ x ~ Bern θ iff … (roughly: the sets {x = tt} and {x = ff} are events in E, and μ{x = tt} = θ)

Measure theory note: the first two conditions say exactly that x is measurable.

Ownership is measurability.

All of this is in close analogy with separation logic.

NQ: Conditional independence

Perspectives and future

OT: