D1
PPLs are of interest in PL and AI
Programs that incorporate an element of uncertainty
- You want to verify programs
- Programs have uncertainty in them
- Randomized algorithms
- Distributed systems and networking
- Differential privacy: doesn't leak sensitive data
Programs should have probabilistic semantics.
1982 Kozen: Coined the term probabilistic programming.
Why PPL from an AI perspective?
- Need to create agents that reason rationally about the world
- Need to represent the world to computer
         encode                 reasoning
World ------------> Computer ------------> Action
   (modeling paradigm)
The encoding: Sounds like computer programs!
- But the world is too complex.
- Probabilities to simplify the world model
Judea Pearl: A Turing award winner
- Ended up starting a new conference: Uncertainty in Artificial Intelligence (UAI) ??
- Pearl's framework: probabilistic graphical models (PGMs)
- Relations between unknown and random quantities ??
Complex PGMs are difficult to understand ??
- => PPL: An expressive way to model the world
Goal of this lecture series: Learning to build a series of PPLs
Lectures:
- Syntax and semantics of PPLs
- Sampling
- Tractability and expressivity
How does PP make reasoning better?
- Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference - Judea Pearl
- Ordinary logic is monotonic; not sufficient for a changing world?
- World is not deterministic
- CYC
- PPLs
Probabilistic programs: Programs that denote probability distributions
Pseudo-random number generators
- tt: semantic true
- true: syntactic true
// x and y each get tt with probability 1/2
x <- flip 1/2;
y <- flip 1/2;
return x ∨ y
Probability that return value is true is 3/4.
[tt ↦ 3/4, ff ↦ 1/4]
Currently considering a finite sample space Ω.
Ω = {tt, ff}
Prob distr is a function Pr: Ω → [0, 1]
such that Σ_{ω∈Ω} Pr(ω) = 1
Semantic brackets ⟦⋅⟧
Tiny PPL
Syntax
# Pure terms
p := true | false
| x
| if p then p else p
| p ∧ p
| p ∨ p
# Effectful or probabilistic terms
e := x <- e; e
| return p # Take a pure thing and make it impure. ie, lift to a probabilistic term
| flip θ # Introduce randomness. Where θ is a probability ∈ ℝ
This is a valid program as per above:
x <- return true;
return x
Semantics
Semantics of pure terms:
Let's go with an environment-based approach: free variables live in the environment.
⟦p⟧ₚ: Env -> Bool
This is a partial function: if a variable is not in the env, its value under this function is undefined.
Example:
- ⟦x⟧ₚ(x ↦ tt) = tt
- There's no variable mutation, so x won't be bound to multiple values here
- ⟦true⟧ₚ(ρ) = tt, where ρ is any env
- ⟦p₁ ∧ p₂⟧ₚ(ρ) = ⟦p₁⟧ₚ(ρ) ∧ ⟦p₂⟧ₚ(ρ), where the ∧ on the RHS is semantic and the one on the LHS is syntactic
—
Semantics of probabilistic terms:
⟦e⟧: Env -> ({tt, ff} -> [0, 1])
Given an env and value, there is an associated probability with it.
⟦flip r⟧(ρ) = {tt -> r, ff -> 1-r}, where r ∈ [0,1]
A Bernoulli distribution.
⟦return p⟧(ρ) = v ↦ (if ⟦p⟧ₚ(ρ) == v then 1 else 0)
Now for bind, the main thing.
⟦x ← e₁; e₂⟧(ρ)
Got to be compositional. Meaning needs to be in terms of e₁ and e₂.
An example:
⟦x ← flip 1/2; e⟧(ρ)
Got to do case analysis on the value bound to x:
⟦x ← flip 1/2; e⟧(ρ) = v ↦ 1/2 · ⟦e⟧(ρ[x ↦ tt])(v) + 1/2 · ⟦e⟧(ρ[x ↦ ff])(v)
Now the real one:
⟦x ← e₁; e₂⟧(ρ) = v' ↦ Σ_{v ∈ {tt, ff}} ⟦e₁⟧(ρ)(v) · ⟦e₂⟧(ρ[x ↦ v])(v')
This multiplication assumes the events are independent ??
- because the only way to introduce randomness here is flip, and each flip is independent
Monadic semantics.
- PAPER: Ramsey and Pfeffer 2002
- PAPER: M. Giry 1982, categorical semantics of probability distributions
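A minimal sketch of these semantics as a Python interpreter (my own hypothetical encoding of TinyPPL terms as nested tuples; none of these names come from the lecture): pure terms evaluate to booleans under an environment, and probabilistic terms evaluate to a dict mapping tt/ff to probabilities, following the flip/return/bind equations above.

```python
# Hypothetical term encoding (not from the lecture):
#   pure terms:          True, False, "x", ("and", p, p), ("or", p, p), ("if", p, p, p)
#   probabilistic terms: ("flip", theta), ("return", p), ("bind", "x", e1, e2)

def eval_pure(p, env):
    """[[p]]_p : Env -> Bool (partial: KeyError if a variable is unbound)."""
    if isinstance(p, bool):
        return p
    if isinstance(p, str):
        return env[p]
    tag = p[0]
    if tag == "and":
        return eval_pure(p[1], env) and eval_pure(p[2], env)
    if tag == "or":
        return eval_pure(p[1], env) or eval_pure(p[2], env)
    if tag == "if":
        return eval_pure(p[2], env) if eval_pure(p[1], env) else eval_pure(p[3], env)
    raise ValueError(p)

def denote(e, env):
    """[[e]] : Env -> ({tt, ff} -> [0, 1]), represented as a dict."""
    tag = e[0]
    if tag == "flip":                          # Bernoulli(theta)
        return {True: e[1], False: 1 - e[1]}
    if tag == "return":                        # point mass on the value of the pure term
        v = eval_pure(e[1], env)
        return {True: float(v), False: float(not v)}
    if tag == "bind":                          # sum over the value bound to x
        _, x, e1, e2 = e
        d1 = denote(e1, env)
        out = {True: 0.0, False: 0.0}
        for v, pr in d1.items():
            d2 = denote(e2, {**env, x: v})
            for w in out:
                out[w] += pr * d2[w]
        return out
    raise ValueError(e)

# The opening example: x <- flip 1/2; y <- flip 1/2; return x ∨ y
prog = ("bind", "x", ("flip", 0.5),
        ("bind", "y", ("flip", 0.5),
         ("return", ("or", "x", "y"))))
print(denote(prog, {}))   # {True: 0.75, False: 0.25}
```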
Tractability
- The sum-of-products formula is not efficient to evaluate: it enumerates exponentially many assignments ??
Complexity: #P-Hard
- Toda's theorem
- Input: boolean formula φ
- Output: number of solns to φ
- Example: x ∨ y
- #SAT(x ∨ y) = 3
—
Reduction from #SAT to TinyPPL inference (showing #P-hardness).
Example: φ = (x ∨ y) ∧ (y ∨ z)
x <- flip 1/2
y <- flip 1/2
z <- flip 1/2
return (x ∨ y) ∧ (y ∨ z)
#SAT(φ) = 2^(#vars) · ⟦encoded prog⟧(∅)(tt)
(Here: ⟦encoded prog⟧(∅)(tt) = 5/8, and 2³ · 5/8 = 5 = #SAT(φ).)
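A quick check of the reduction, reusing the hypothetical `denote` sketch from above: brute-force counting gives #SAT((x∨y)∧(y∨z)) = 5, and the encoded program's probability of tt is 5/8, so 2³ · 5/8 = 5.

```python
from itertools import product

phi = lambda x, y, z: (x or y) and (y or z)

# Brute-force model count: 5 of the 8 assignments satisfy phi.
num_models = sum(phi(x, y, z) for x, y, z in product([True, False], repeat=3))

# The encoding: flip each variable with probability 1/2, return phi.
prog = ("bind", "x", ("flip", 0.5),
        ("bind", "y", ("flip", 0.5),
         ("bind", "z", ("flip", 0.5),
          ("return", ("and", ("or", "x", "y"), ("or", "y", "z"))))))

assert abs(2**3 * denote(prog, {})[True] - num_models) < 1e-9   # 8 * 5/8 == 5
```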
—
- Why do we care about the complexity of the denotational semantics? What the computer runs is an operational semantics.
- It gives a lower bound on how efficient any operational semantics can be ??
- What's the price of adding new features? Like tuples/functions.
- TinyPPL has almost no features, yet it is already computationally expensive. It will only get worse.
Systems in the wild
- Stan from Columbia: https://mc-stan.org/
- Doesn't have conditions
- Pyro
- Tensorflow probability
- Something from Microsoft
Tradeoffs in design decisions.
HW: TinyPPL + Tiny Cond, WebPPL
—
- See something about the world and then update our beliefs about it
- Observe an event and then infer the probability of some other event => conditioning
observe p; e
observe in effect goes back in time and changes probabilities; it is non-local.
—
- This is the Bayesian learning model: a different style from deep learning
- PPLs are very interpretable: we can tell why a decision was made, which is very difficult with DL.
D2: Conditioning and sampling
Reachability in a graph/network
- e.g., a packet being routed from one router to another when there are two possible paths
        R2
       /  \
      /    \
    R1      R4
      \    /
       \  /
        R3
Each link lᵢⱼ is up with some probability.
# Prob. that a packet gets from R1 to R4:
route <- flip 0.5;
l12 <- flip 0.99;
l24 <- flip _;
l13 <- flip _;
l34 <- flip _;
return (if route then l12 ∧ l24
        else l13 ∧ l34)
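As a sanity check of the model (the link probabilities below are made up, since the notes leave them as `_`): the reachability probability is just a weighted sum over the route choice.

```python
# Hypothetical up-probabilities for the four links (left as "_" placeholders in the notes).
p_l12, p_l24, p_l13, p_l34 = 0.99, 0.95, 0.90, 0.98

# Pr(R1 reaches R4) = Pr(route via R2)*Pr(l12 up)*Pr(l24 up)
#                   + Pr(route via R3)*Pr(l13 up)*Pr(l34 up)
pr_reach = 0.5 * p_l12 * p_l24 + 0.5 * p_l13 * p_l34
print(pr_reach)   # 0.91125 with these made-up numbers
```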
TinyCond
A PPL with conditionals.
Syntax:
p := true | false
| x
| if p then p else p
| p ∧ p
| p ∨ p
# Effectful or probabilistic terms
e := x <- e; e
| return p
| flip θ
| observe p; e # condition the rest of the program (e) on p being true
Bayesian conditioning.
Example 2: R1 cannot reach R4. What's the probability that link l12 failed?
# Prob. that l12 is down, given that R1 cannot reach R4:
route <- flip 0.5;
l12 <- flip 0.99;
l24 <- flip _;
l13 <- flip _;
l34 <- flip _;
# no ¬ in the grammar, so negation is encoded as (if _ then false else true)
observe (if (if route then l12 ∧ l24 else l13 ∧ l34) then false else true);
return (if l12 then false else true)
Example 3: Flip two coins and observe that at least one of them is heads. What's the probability that the first one is heads?
# Prob. that the first coin is heads, given at least one is heads:
x <- flip 0.5
y <- flip 0.5
observe (x ∨ y);
return x
Before conditioning, each of the four outcomes has probability 1/4.
The outcome x=ff, y=ff is eliminated, because at least one coin is heads. We're left with 3 outcomes, but that's no longer a probability distribution: the total mass is 3/4, not 1. Got to normalize: divide each remaining probability by 3/4.
- x=ff, y=tt: 1/4 becomes (1/4)/(3/4) = 1/3
- x=tt, y=ff: 1/4 becomes (1/4)/(3/4) = 1/3
- x=tt, y=tt: 1/4 becomes (1/4)/(3/4) = 1/3
Bayes' rule
We are interested in just x being true.
Of the remaining outcomes, that rules out x=ff, y=tt; the other two each contribute 1/3.
=> Result of the above program = 1/3 + 1/3 = 2/3
- ⟦prog⟧(∅)(tt) = 2/3
Semantics of TinyCond
Unnormalized semantics => doesn't sum to 1.
A sub-distribution: has all the features of a probability distribution except that it doesn't sum to 1.
Subscript u for unnormalized.
- ⟦flip 1/2⟧ᵤ(ρ)(tt) = 1/2
- ⟦true⟧ᵤ(ρ)(tt) = 1
- ⟦observe p;e⟧ᵤ(ρ)(v)
- if ⟦p⟧ₚ(ρ) = tt then ⟦e⟧ᵤ(ρ)(v)
- otherwise 0
—
Now, the normalized semantics, defined from the unnormalized one:
                     ⟦e⟧ᵤ(ρ)(v)
⟦e⟧(ρ)(v) = ─────────────────────────────
             ⟦e⟧ᵤ(ρ)(tt) + ⟦e⟧ᵤ(ρ)(ff)
deterministic ??
- pure
- Probability monad
observe:
- kind of goes back in time and changes probabilities
- non-locality
- Still #P-hard: the only extra work is the addition needed to compute the normalizing constant ??
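A sketch of the unnormalized and normalized semantics, extending the earlier hypothetical interpreter (it reuses `eval_pure` from that sketch; `("observe", p, e)` is my own encoding): compute the unnormalized mass, then divide by the total.

```python
def denote_u(e, env):
    """Unnormalized semantics [[e]]_u : Env -> ({tt, ff} -> [0, 1])."""
    tag = e[0]
    if tag == "observe":                       # zero out worlds where the observed term is false
        _, p, rest = e
        return denote_u(rest, env) if eval_pure(p, env) else {True: 0.0, False: 0.0}
    if tag == "flip":
        return {True: e[1], False: 1 - e[1]}
    if tag == "return":
        v = eval_pure(e[1], env)
        return {True: float(v), False: float(not v)}
    if tag == "bind":
        _, x, e1, e2 = e
        d1 = denote_u(e1, env)
        out = {True: 0.0, False: 0.0}
        for v, pr in d1.items():
            d2 = denote_u(e2, {**env, x: v})
            for w in out:
                out[w] += pr * d2[w]
        return out
    raise ValueError(e)

def denote_cond(e, env):
    """Normalized semantics: divide by the total mass (assumed nonzero)."""
    d = denote_u(e, env)
    z = d[True] + d[False]
    return {v: pr / z for v, pr in d.items()}

# The two-coins example: observe (x ∨ y); return x
prog = ("bind", "x", ("flip", 0.5),
        ("bind", "y", ("flip", 0.5),
         ("observe", ("or", "x", "y"), ("return", "x"))))
print(denote_cond(prog, {}))   # {True: 0.666..., False: 0.333...}
```

Running the three programs compared just below through `denote_cond` gives the same distribution for the returned value.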
(A) x <- flip 1/2;      (B) x <- flip 2/3;      (C) x <- flip 2/3;
    y <- flip 1/2;          y <- flip 2/3;          return x
    observe x ∨ y;          return x
    return x
Are these the same ???? As far as the returned value goes, yes: all three denote [tt ↦ 2/3, ff ↦ 1/3].
Are
  while true: pass
and
  observe false
the same ??? (Both leave zero probability mass, so normalization breaks.)
Conditioning will still make things more complex ??
Operational sampling semantics
Expectation of a random fn = average val of the fn
We can approximate the E with a finite number of samples.
E_{ω~Pr}[f] = Σ_{ω} Pr(ω) · f(ω) ≈ (1/N) Σ_{i=1}^{N} f(ωᵢ),   where ωᵢ ~ Pr
This is Monte Carlo estimation ??
ω ~ Pr
means ω is drawn from the probability distribution Pr.
Concentration inequalities give a bound on the number of samples needed to approximate the mean of a random variable. They only give useful guarantees for some classes of problems though. Which classes ???
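A minimal Monte Carlo sketch (my own helper names; it reuses `eval_pure` from the earlier interpreter sketch): run the program N times with a PRNG and average the indicator of the returned value.

```python
import random

def sample_once(e, env):
    """Run a TinyPPL term once, resolving each flip with the PRNG."""
    tag = e[0]
    if tag == "flip":
        return random.random() < e[1]
    if tag == "return":
        return eval_pure(e[1], env)
    if tag == "bind":
        _, x, e1, e2 = e
        v = sample_once(e1, env)
        return sample_once(e2, {**env, x: v})
    raise ValueError(e)

def monte_carlo(e, n=100_000):
    """Estimate [[e]](∅)(tt) as the average of indicator samples."""
    return sum(sample_once(e, {}) for _ in range(n)) / n

prog = ("bind", "x", ("flip", 0.5),
        ("bind", "y", ("flip", 0.5),
         ("return", ("or", "x", "y"))))
print(monte_carlo(prog))   # ≈ 0.75
```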
Efficiency
To prove soundness of our runtime strategy, we relate it to an expectation.
Semantics: we'll try a big-step style; small-step is also possible.
PAPER: Culpepper and Cobb, Contextual Equivalence (2017)
- σ ∈ 𝔹^ℕ: an infinite stream of random boolean values
- ρ: env
σ ⊢ <c,ρ> ⇓ v means in the context of σ, c under env ρ reduces to v
Externalize the randomness: put it all in one place. Considering only flip 1/2 for now.
- v::σ ⊢ <flip 1/2, ρ> ⇓ v
πₗ(σ) ⊢ <e1, ρ> ⇓ v     πᵣ(σ) ⊢ <e2, ρ[x ↦ v]> ⇓ v'
─────────────────────────────────────────────────────
            σ ⊢ <x ← e1; e2, ρ> ⇓ v'
πₗ splits σ into two and takes the left part; likewise πᵣ takes the right part. This splits the randomness between e1 and e2.
─────────────────────────────
σ ⊢ <return p, ρ> ⇓ ⟦p⟧ₚ(ρ)
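A sketch of this big-step evaluator in Python (my own assumptions: σ is represented as a function from index to bit, πₗ/πᵣ split it into even- and odd-indexed bits, and flip ignores θ since only flip 1/2 is considered; it reuses `eval_pure` from the earlier sketch):

```python
import random

def pi_l(sigma):                    # even-indexed bits go to the left premise
    return lambda n: sigma(2 * n)

def pi_r(sigma):                    # odd-indexed bits go to the right premise
    return lambda n: sigma(2 * n + 1)

def big_step(e, env, sigma):
    """sigma |- <e, env> ⇓ v  (flip ignores theta: only flip 1/2 here)."""
    tag = e[0]
    if tag == "flip":               # v::sigma |- flip 1/2 ⇓ v
        return sigma(0)
    if tag == "return":             # sigma |- <return p, env> ⇓ [[p]]_p(env)
        return eval_pure(e[1], env)
    if tag == "bind":
        _, x, e1, e2 = e
        v = big_step(e1, env, pi_l(sigma))
        return big_step(e2, {**env, x: v}, pi_r(sigma))
    raise ValueError(e)

# One example stream: a fresh pseudo-random bit per index (deterministic in the index).
sigma = lambda n: random.Random(n).getrandbits(1) == 1
prog = ("bind", "x", ("flip", 0.5),
        ("bind", "y", ("flip", 0.5),
         ("return", ("or", "x", "y"))))
print(big_step(prog, {}, sigma))
```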
OT:
- Hilbert cube: [0,1]^ℕ
- v ∈ [0,1]^ℕ
E_{σ~Pr}[𝟙(eval(e, σ, ρ) = tt)] = ⟦e⟧(ρ)(tt)
where Pr is the uniform distribution on infinite bit streams.
–
Bind:
E[f · g] = E[f] · E[g]
assuming f and g are independent ??
OT:
x := 0
while flip 1/2: x := x + 1
Should approach a limit
Measure theory stuff refs:
- Measure, Integration & Real Analysis - Sheldon Axler
D3: Tractability and expressivity
When does sampling work well and poorly?
Eg:
x <- flip 10^-6
y <- flip 10^-6
return x ∧ y
// Doubt: what if it were x ∨ y ??
Here, it would take around a million samples before we get a true.
i.e., sampling works poorly for estimating low-probability events.
Conditioning and sampling don't play well together. Why??
x <- flip 10^-6
y <- flip 10^-6
observe x ∧ y
return x
- Rejection sampling: run the program and throw away runs where the observed condition fails.
- Downside of sampling: with low-probability evidence, almost every run gets rejected.
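A sketch of rejection sampling on this program (hypothetical helper, not the lecture's code): with Pr(evidence) = 10⁻¹², essentially every run is rejected, so no estimate comes out.

```python
import random

def rejection_sample(n=1_000_000, p=1e-6):
    """Rejection sampling for: x <- flip p; y <- flip p; observe (x ∧ y); return x."""
    accepted = []
    for _ in range(n):
        x = random.random() < p
        y = random.random() < p
        if x and y:                 # keep the run only if the observed event happened
            accepted.append(x)
    return accepted                 # expected number of accepted runs: n * p^2 = 1e-6

print(len(rejection_sample()))      # almost certainly 0: nothing to estimate from
```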
Let's explore alternatives to sampling.
- Worst-case hardness
- Approximation: sampling
- Problem: approximating under conditioning
Search-based:
x <- flip 1/2
y <- if x then flip 1/3 else flip 1/4
z <- if y then flip 1/6 else flip 1/7
return z
It's a probabilistic if
(which we haven't defined yet).
Building a search tree:
           x
         /   \
        y     y
       / \   / \
      z   z z   z
Work inductively from leaves.
But there's redundancy: the z subtrees are identical, because z doesn't depend on x, so there are duplicate copies.
Could we do better and reuse the shared substructure? Let's try.
<graph here>
This is what's called Binary Decision Diagrams (BDDs).
- Adding a new variable => just another layer; it doesn't grow that fast.
- Knuth described BDDs as one of the only really fundamental data structures introduced in the last 25 years.
Can we build the BDD without explicitly building the other tree?
Tractability and expressivity
- Computing semantics is #P-Hard
- But given a BDD, computing the probability is easy: one pass, linear in the size of the BDD (building the BDD can still be expensive) ??
- Languages on which this kind of reasoning is poly-time: d-DNNF, BDD, SDD, trees
Language design landscape:
(figure: blank axes sketched in the lecture)
We could compile a more expressive language to a more tractable language ????
- Think of the tractable languages as a kind of assembly language.
- The compilation has a cost
- A syntactic way to study this cost ??
Knowledge compilation - Adnan Darwiche
- Hardness of propositional/boolean reasoning tasks is related to formulae syntax
- PAPER: A Knowledge Compilation Map (Darwiche & Marquis, 2002)
A DNF: AB + ACD' + …
SAT is NP-complete in general, but for a DNF it's easy ??
What kinds of structure enable efficient reasoning?
- Darwiche identified BDDs as one such structure ??
Succinctness:
- L1 is more succinct than L2 if every L2 program can be translated into an equivalent L1 program with at most polynomial blow-up
- E.g., BDD is not more succinct than TinyPPL
- ∵ there can't be an efficient conversion from TinyPPL (inference #P-hard) to BDDs (inference poly-time)
PAPER: On the expressive power of programming languages - Matthias Felleisen
Compiling TinyPPL to BDD
PAPER: Holtzen et al., OOPSLA 2020
Desired: compositional compilation; this would give locality.
Composing BDDs
Base cases (conjoining with a constant):
- tt ∧ B = B
- ff ∧ B = ff
Recursive case (sketched as a diagram in the lecture): a node A with branches B1 and B2, conjoined (∧) with another BDD by recursing into the branches.
BDD needs some structural invariants
- Ordering: variables are encountered in the same order along every path, and at most once
- Reduction: no isomorphic sub-trees (they get shared)
- Also means no vacuous nodes (nodes whose two branches are identical)
These two together: Canonicity
⟦ψ₁⟧= ⟦ψ₂⟧ <=> BDD(ψ₁) = BDD(ψ₂)
- assuming the 2 BDDs have same var order
See: https://github.com/SHoltzen/oplss24-code/blob/main/6-tinybdd/bdd.ml
This was the bottom-up way to combine BDDs ??
An advantage of BDDs: a lot of efficient BDD engineering has already been done; we can piggyback on it.
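A minimal BDD sketch in Python (my own simplification of the ideas above, only loosely related to the linked OCaml code): hash-consed (var, lo, hi) nodes with a fixed numeric variable order, `bdd_and`/`bdd_or` for composition, and a weighted model count that is linear in the number of BDD nodes.

```python
# Terminals are the Python booleans True/False; internal nodes are (var, lo, hi) triples.
_unique = {}                                   # hash-consing table: gives the reduction invariant

def make_node(var, lo, hi):
    if lo is hi:                               # vacuous node: both branches identical
        return lo
    key = (var, id(lo), id(hi))
    if key not in _unique:
        _unique[key] = (var, lo, hi)
    return _unique[key]

def var_node(v):
    return make_node(v, False, True)

def bdd_and(a, b):
    # Base cases: tt ∧ B = B, ff ∧ B = ff.
    if a is True:  return b
    if b is True:  return a
    if a is False or b is False: return False
    top = min(a[0], b[0])                      # respect the shared variable order
    a_lo, a_hi = (a[1], a[2]) if a[0] == top else (a, a)
    b_lo, b_hi = (b[1], b[2]) if b[0] == top else (b, b)
    return make_node(top, bdd_and(a_lo, b_lo), bdd_and(a_hi, b_hi))

def bdd_or(a, b):
    if a is True or b is True: return True
    if a is False: return b
    if b is False: return a
    top = min(a[0], b[0])
    a_lo, a_hi = (a[1], a[2]) if a[0] == top else (a, a)
    b_lo, b_hi = (b[1], b[2]) if b[0] == top else (b, b)
    return make_node(top, bdd_or(a_lo, b_lo), bdd_or(a_hi, b_hi))

def wmc(node, weights, memo=None):
    """Weighted model count: one bottom-up pass, linear in the number of BDD nodes."""
    if node is True:  return 1.0
    if node is False: return 0.0
    if memo is None:  memo = {}
    if id(node) not in memo:
        var, lo, hi = node
        w = weights[var]                       # probability that var is true
        memo[id(node)] = (1 - w) * wmc(lo, weights, memo) + w * wmc(hi, weights, memo)
    return memo[id(node)]

# phi = (x ∨ y) ∧ (y ∨ z), each variable flipped with probability 1/2
x, y, z = var_node(0), var_node(1), var_node(2)
phi = bdd_and(bdd_or(x, y), bdd_or(y, z))
print(wmc(phi, {0: 0.5, 1: 0.5, 2: 0.5}))      # 0.625 = 5/8, matching the #SAT example
```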
D4: Separation logic, perspectives
Probability + mutable states
- PAPER: POPL'20 Barthe, Hsu, Liao
- PAPER: Lilac: John Li, Ahmed, Holtzen
- PAPER: Bluebell
—
Disjoint parts of the heap
{⊤}
x := alloc 10
{x ↦ 10}
y := alloc 20
{x ↦ 10 * y ↦ 20}
* (separating conjunction)
means the two assertions hold over disjoint parts of the heap.
—
{⊤}
x := flip 1/2
{x~Bernoulli(1/2)}
y := flip 1/2
{x~Bernoulli(1/2) * y~Bernoulli(1/2)}
- x is a random variable that follows Bernoulli(1/2)
- x and y are statistically independent rand vars
—
Frame rule ?? in seplog, for a triple {P} c {Q}:
- Composition with other rules is difficult due to possible aliasing.
- But given disjointness, composition is possible:
{P} c {Q}
--------------
{P*F} c {Q*F}
(side condition: c does not modify any variable free in F)
—
Seplog for heap programs
(s,h)
store = s: names → heap locations
heap = h: loc → value
(s,h) ⊨ ⊤
(s,h) ⊭ ⊥
(s,h) ⊨ [x↦v] iff h(s(x))=v
(s,h) ⊨ P (s,h) ⊨ Q
--------------------------
(s,h) ⊨ P∧Q
(s1,h1) ⟂ (s2,h2) if dom(h1) ∩ dom(h2) = ∅
(s1,h1) ⊎ (s2,h2): the combined heap maps
- l ↦ h1(l) if l ∈ dom(h1)
- l ↦ h2(l) if l ∈ dom(h2)
- It is a partial operation (only defined when the heaps are disjoint) ??
(s1,h1) ⊑ (s2,h2) if dom(h1) ⊆ dom(h2)
- ∀l ∈ dom(h1), h1(l) = h2(l)
- A notion of sub-heap: the two heaps agree on everything in the smaller one.
Table of analogies
| Symbol | Heap (seplog) | Probability |
|---|---|---|
| ⊎ | Disjoint union | Independent combination |
| ⊑ | Heap subset | Sub-probability space |
| ⟂ | Heap disjointness | Statistical independence |
| ownership | location l is in the heap | measurability (see below) |
—
(S, (Ω, E, μ)) ⊨ ⊤
- Ω: Sample space
- Kind of like the heap in seplog, in that it's ambient, in the background
- An ambient space with enough room for all the randomness that wish to allocate
- E: collection of subsets of Ω
- μ: Probability measure
- assigns a probability to each element of E
- OT: σ-algebra
- x~Bern 1/2
- x is behaving like a pointer/location
- x is like a fn. x: Ω → {⊤,⊥}
- Ω = [0,1]² (unit square)
- Area of ⊤ and ⊥ is one half each
Ω
+--------+
|        |
|  1/2   | ---------> T
|        |
+--------+
|        |
|  1/2   | ---------> F
|        |
+--------+
Got to carve some space from Ω that's independent from what's already allocated.
Ω
+------+------+  (for x)
|      |      |
|      |      | ---------> T
|      |      |
+------+------+
|      |      |
|      |      | ---------> F
|      |      |
+------+------+
   |      |
   v      v
   T      F   (for y)
Statistical independence: x ⟂ y iff Pr(x=T ∧ y=T) = Pr(x=T) · Pr(y=T) (and similarly for the other value combinations).
Independent combination
How to show this models x~Bern(1/2) * y~Bern(1/2)?
Ω
+------+------+  (for x)
|      |      |
|      |      | ---------> T
|      |      |
+------+------+
|      |      |
|      |      | ---------> F
|      |      |
+------+------+
   |      |
   v      v
   T      F   (for y)
We should be able to split Ω into two independent spaces.
- One where x is Bern(1/2)
- One where y is Bern(1/2)
Let's split Ω into the two independent spaces.
For x~Bern 1/2
Ω
+-------------+
|             |
+-------------+
|             |
+-------------+
For y~Bern 1/2
Ω
+------+------+
|      |      |
|      |      |
|      |      |
+------+------+
+-------------+     +------+------+     +------+------+
|             |     |      |      |     |      |      |
+-------------+  ⋅  |      |      |  =  +------+------+
|             |     |      |      |     |      |      |
+-------------+     +------+------+     +------+------+
(Ω, E1, μ1) ⋅ (Ω, E2, μ2) = (Ω, E1⊔E2, ρ)
- Ω isn't changing, but the subset of Ω that we are allowed to talk about does.
- E1 ⊔ E2 is the smallest σ-algebra that contains E1 and E2
- ρ needs to make sense
- ∀e1 ∈ E1, e2 ∈ E2, ρ(e1 ∩ e2) = μ1(e1) * μ2(e2)
This was like ⊎ (disjoint union), but for probability.
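A small concrete check of the product property (my own toy example, not from the lecture: Ω discretized to a 2×2 grid of equally likely points, with the x-events generated by rows and the y-events by columns):

```python
from itertools import product

omega = set(product(["top", "bottom"], ["left", "right"]))      # 4 points, mass 1/4 each

def mu(event):
    """Uniform measure on the 4-point sample space."""
    return len(event) / len(omega)

x_true = frozenset(w for w in omega if w[0] == "top")            # "x = tt" is the top row
y_true = frozenset(w for w in omega if w[1] == "left")           # "y = tt" is the left column

# Events measurable using x alone, resp. y alone (finite sigma-algebras).
E1 = {frozenset(), x_true, frozenset(omega) - x_true, frozenset(omega)}
E2 = {frozenset(), y_true, frozenset(omega) - y_true, frozenset(omega)}

# rho(e1 ∩ e2) = mu1(e1) * mu2(e2) for every e1 in E1 and e2 in E2.
for e1 in E1:
    for e2 in E2:
        assert abs(mu(e1 & e2) - mu(e1) * mu(e2)) < 1e-9
print("product property holds on all pairs of events")
```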
(S, (Ω, E, μ)) ⊨ x~Bern θ iff
- S(x)⁻¹(tt) ∈ E
- S(x)⁻¹(ff) ∈ E
- μ(S(x)⁻¹(tt)) = θ ???
Measure theory stuff: The first two items mean x is measurable
Ownership is measurability.
—
These are in close analogy with the corresponding notions in seplog.
—
NQ: Conditional independence
Perspectives and future
- Scalability: rare to find PPL programs longer than 1k lines
- Usability: Error messages, debuggers
- Profilers
- PPLs still used only by domain experts??
- Semantics
Defining ⟦e⟧ as a probability distribution
is not easy when there are features like higher-order functions. TinyPPL was simple enough that it was possible there.
- Making PL more compatible with probability stuff
OT:
- Dirac distribution