PPLs are of interest in PL and AI
Programs which incorporate an element of uncertainity
Programs should have probabilistic semantics.
1982 Kozen: Coined the term probabilistic programming.
Why PPL from an AI perspective?
World Encode Reasoning
----------> Computer ------------> Action
Modeling
paradigm
The encoding: Sounds like computer programs!
Juden Pearl: A Turing award winner
Complex PGM is difficult to understand ??
Goal of this lecture series: Learning to build a series of PPLs
Lectures:
How does PP make reasoning better?
Probabilistic programs: Programs that denote probability distributions
Pseudo-random number generators
// x and y would get tt with probability 1/2
x <- flip 1/2;
y <- flip 1/2;
return x ∨ y
Probability that return value is true is 3/4.
[tt ↦ 3/4, ff ↦ 1/4]
Currently considering countably finite sample space Ω.
Ω = {tt, ff}
Prob distr is a function: Ω → [0, 1]
such that Σ Pr(ω) = 1
ω∈Ω
Semantic brackets ⟦⋅⟧
# Pure terms
p := true | false
| x
| if p then p else p
| p ∧ p
| p ∨ p
# Effectful or probabilistic terms
e := x <- e; e
| return p # Take a pure thing and make it impure. ie, lift to a probabilistic term
| flip θ # Introduce randomness. Where θ is a probability ∈ ℝ
This is a valid program as per above:
x <- return true
return x
Semantics of pure terms:
Let's go with an env way. Free variables are in env.
⟦p⟧ₚ: Env -> Bool
This is a partial function. If a var is not in env, it's value under this function is undefined.
Example:
—
Semantics of probabilistic terms:
⟦e⟧: Env -> ({tt, ff} -> [0, 1])
Given an env and value, there is an associated probability with it.
⟦flip r⟧(ρ) = {tt -> r, ff -> 1-r}, where r ∈ [0,1]
A Bernoulli distribution.
⟦return p⟧(ρ) = ∀v, v ↦ {if ⟦p⟧ₚ == v then 1, else 0}
Now for bind, the main thing.
⟦x ← e₁; e₂⟧(ρ)
Got to be compositional. Meaning needs to be in terms of e₁ and e₂.
An example:
⟦x ← flip 1/2⟧(ρ)
Got to do case analysis on x, which would be a pure term
⟦x ← flip 1/2; e⟧(ρ)
= if ⟦x⟧ₚ == tt then ⟦e⟧(ρ[x ↦ tt])
else ⟦e⟧(ρ[x ↦ ff])
Now the real one:
⟦x ← e₁; e₂⟧(ρ) =
Σ
v ↦ ⟦e₁⟧(ρ)(v) *
⟦e₂⟧(ρ[x ↦ v])
Considering that events are independent ??
Monadic semantics.
Tractability
Complexity: #P-Hard
—
Redn of TinyPPL to #P.
Example: φ = (x ∨ y) ∧ (y ∨ z)
x <- flip 1/2
y <- flip 1/2
z <- flip 1/2
return (x ∨ y) ∧ (y ∨ z)
#SAT(φ) = 2(number of vars)
2(#vars) * ⟦encoded prog⟧(∅)(tt)
??
—
Systems in the wild
Tradeoffs in design decisions.
HW: TinyPPL + Tiny Cond, WebPPL
—
observe e; e
: goes back in time and change stuff.
Non-local—
Reachability in a graph/network
R2
/\
/ \
/ \
R1 R4
\ /
\ /
\/
R3
Each link has a failure probability lᵢ.
# Prob that R1 to R4:
route <- flip 0.5
l1 <- flip 0.99
l2 <- flip _
l3 <- flip _
l4 <- flip _
return
if route then l12 * l24
else l13 * l34
A PPL with conditionals.
Syntax:
p := true | false
| x
| if p then p else p
| p ∧ p
| p ∨ p
# Effectful or probabilistic terms
e := x <- e; e
| return p
| flip θ
| observe p; e # tells the prob that e happens provided p happened
Bayesian conditioning.
Example 2: R1 to R4 doesn't happen. What's the probability that l12 failed?
# Prob that R1 to R4:
route <- flip 0.5
l1 <- flip 0.99
l2 <- flip _
l3 <- flip _
l4 <- flip _
observe
(if route then l12 * l24
else l13 * l34);
return l1
Example 3: Flip two coins. At least one of them is head. What's the prob that first one is head?
# Prob that R1 to R4:
x <- flip 0.5
y <- flip 0.5
observe (x ∨ y);
return x
Probability is 1/4 ???
x=F, y=F can be eliminated. Because at least one is true. Left with 3 choices?? Left with something that's not a prob distr? Got to normalize. Becasue prob sum is no longer 1. Left with 3/4. Divied each possibility by 3/4.
x=F, y=T => 1/4 becomes (1/4)/(3/4) = 1/3 x=T, y=F => 1/4 becomes (1/4)/(3/4) = 1/3 x=T, y=T => 1/4 becomes (1/4)/(3/4) = 1/3
Bayes' rule
We are interested in just x being true.
So rule out x=F,y=T.
=> Result of the above prog = 2/3
Unnormalized semantics => Doesn't sum to 1 Sub-sth: Has all features of a prob distr except that it doesn't sum to 1.
Subscript u for unnormalized.
—
Now, the normalized semantics. From the unnormalized version.
⟦e⟧ᵤ(ρ)(v)
⟦e⟧(ρ)(v) = ────────────────────────
⟦e⟧(ρ)(tt) + ⟦e⟧(ρ)(ff)
deterministic ??
observe:
x <- flip 1/2 | x <- flip 2/3 | x <- flip 2/3
y <- flip 1/2 | y <- flip 2/3 | return x
-----------------> | return x |
observe x ∨ y | |
return x | |
Are these same ????
while true: pass
and
observe false
???
Conditioning will still make things more complex ??
Expectation of a random fn = average val of the fn
We can approximate the E with a finite number of samples.
E(Pr)[f] = Σ Pr(ω) * f(ω)
ω
1 N
≈ --- Σ f(ω)
N ω~Pr
Monte Carlo ??
w ~ Pr
means w is drawn from a prob
distr Pr
.
There is a bound on the number of samples needed to approximate mean of a random var Concentration inequalities. Will only work on some classes of problems though. Which classes ???
Efficiency
To prove soundness of our runtime strategy, we relate it to an expectation.
Semantics. Trying a big step. Small-step too possible.
PAPER: Culpepper and Cobb: Contextual Eq 2017)
σ ⊢ <c,ρ> ⇓ v means in the context of σ, c under env ρ reduces to v
Externalize randomness. Put it all in one place. Considering only
flip 1/2
πₗ(σ) ⊢ <e1, ρ> ⇓ v πᵣ(σ) ⊢ <e2, ρ> ⇓ v'
──────────────────────────────────────────────
σ ⊢ <x←e1;e2, ρ> ⇓ v'
πₗ splits σ into two and takes left part. Likewise for right with πᵣ To split between e1 and e2.
──────────────────────
σ ⊢ <return p, ρ> ⇓ v
OT:
E [𝟙 (eval(e, σ, ρ)=tt)] = ⟦e⟧(ρ)(tt)
σ~Pr
where Pr is uniform distr on infinite bit streams.
–
Bind:
E[f g] = E[f] * E[g]
considering f and g are independent ??
OT:
x := 0 while flip 1/2: x := x+1
Should approach a limit
Measure theory stuff refs:
When does sampling work well and poorly?
Eg:
x <- flip 10^-6
y <- flip 10^-6
return x ∧ y
// DBT: what if x ∨ y ??
Here, it would take around a million samples before we get a true.
ie, sampling works poorly for low probabilty estimates.
Conditioning and sampling doesn't play well together. Why??
x <- flip 10^-6
y <- flip 10^-6
observe x ∧ y
return x
Let's explore alternatives to sampling.
Search-based:
x <- flip 1/2
y <- if x then flip 1/3 else flip 1/4
z <- if y then flip 1/6 else flip 1/7
return z
It's a probabilistic if
(which we haven't defined
yet).
Builing a search tree:
x
/ \
/ \
y y
/ \ / \
/ \ / \
+----
|
x--+────
Work inductively from leaves.
But there's some redundancy. z trees are identical. ∵ it doesn't depend on x, so there are copies.
Could we do better and use reuse the substructure? Some sharing. Let's try.
<graph here>
This is what's called Binary Decision Diagrams (BDDs).
Can we build the BDD without explicitly building the other tree?
Lang desing landscape:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
+-----------------------------------------------------------------
We could compile a more expressive lang to a more tractable lang ????
Knowledge compilation - Adron Darwiche
A DNF: AB + ACD' + …
SAT is NP-C but this one is easy ??
What kind of struct on this enable efficient reasoning?
Succinctness:
PAPER: On the expressive power of programming languages - Matthias Felleisen
PAPER: Holtzen 2020 OOPSLA Desired: compositional compilaton: This would allow locality
Base cases:
A──B1
|
+--B2
∧
BDD needs some structural invariants
These two together: Canonicity
⟦ψ₁⟧= ⟦ψ₂⟧ <=> BDD(ψ₁) = BDD(ψ₂)
See: https://github.com/SHoltzen/oplss24-code/blob/main/6-tinybdd/bdd.ml
This was bottom-up way to combine BDDs ??
An advantage of BDD: Lot of efficient BDD engineering has already been done. Can piggyback.
Probability + mutable states
—
Disjoint parts of the heap
{⊤}
x := alloc 10
{x ↦ 10}
y := alloc 20
{x ↦ 10 * y ↦ 20}
*
means it holds over disjoint spaces in program.
—
{⊤}
x := flip 1/2
{x~Bernoulli(1/2)}
y := flip 1/2
{x~Bernoulli(1/2) * y~Bernoulli(1/2)}
—
Frame rule?? in seplog: {P} c {Q}
{P} c {Q}
--------------
{P*F} c {Q*F}
—
Seplog for heap programs
(s,h)
store = s: names → heap locations
heap = h: loc → value
(s,h) ⊨ ⊤
(s,h) ⊭ ⊥
(s,h) ⊨ [x↦v] iff h(s(x))=v
(s,h) ⊨ P (s,h) ⊨ Q
--------------------------
(s,h) ⊨ P∧Q
(s1,h1) ⟂ (s2,h2) if dom(h1) ∩ dom(h2) = ∅
(s1,h1) ⊎ (s2,h2)
(s1,h1) ⊑ (s2,h2) if dom(h1) ⊆ dom(h2)
Heap (seplog) | Probability | |
---|---|---|
⊎ | Disjoint union | Independent combination |
⊑ | Heap subset | Sub-probability space |
⟂ | Heap disjunction | |
ownership | loc l is in heap |
—
(S, (Ω, E, μ)) ⊨ ⊤
Ω
+------+
| |
| 1/2 | ---------> T
| |
+------+
| |
| 1/2 | ---------> F
| |
+------+
Got to carve some space from Ω that's independent from what's already allocated.
Ω
+------+------+ (for x)
| | |
| | | ---------> T
| | |
+------+------+
| | |
| | | ---------> F
| | |
+------+------+
| |
| |
v v
T F (for y)
Statistic independence between Pr of x and y means x ⟂ y if Pr(x=T ∧ y=T) = Pr(x)*Pr(y)
How to show this models {x~Bern 1/2, y~Bern 1/2}
Ω
+------+------+ (for x)
| | |
| | | ---------> T
| | |
+------+------+
| | |
| | | ---------> F
| | |
+------+------+
| |
| |
v v
T F (for y)
We should be able to split Ω into two independent spaces.
Le'ts split the Ω into the two independent spaces.
For x~Bern 1/2
Ω
+-------------+
| |
+-------------|
| |
+-------------+
For y~Bern 1/2
Ω
+------+------+
| | |
| | |
| | |
+------+------+
+-------------+ +------+------+ +----+----+
| | | | | | | |
+-------------| ⋅ | | | = | | |
| | | | | +----+----+
+-------------+ +------+------+ | | |
| | |
+----+----+
(Ω, E1, μ1) ⋅ (Ω, E2, μ2) = (Ω, E1⊔E2, ρ)
This was like ⊎ (disjoint union), but for probability.
(S, (ω, E, μ)) ⊨ x~Bern θ iff
Measure theory stuff: The first two items mean x is measurable
Ownership is measurability.
—
These stuff are in close anology with stuff in seplog.
—
NQ: Conditional independence
⟦e⟧ ↦ probability
is not easy if there are things like
higher order fns. TinyPPL was too simple, so it was possible there.OT: