D1
PPLs are of interest in PL and AI
Programs that incorporate an element of uncertainty
- You want to verify programs
- Programs have uncertainty in them
- Randomized algorithms
- Distributed systems and networking
- Differential privacy: doesn't leak sensitive data
Programs should have probabilistic semantics.
1982 Kozen: Coined the term probabilistic programming.
Why PPL from an AI perspective?
- Need to create agents that reason rationally about the world
- Need to represent the world to computer
         encode                 reasoning
World ------------> Computer ------------> Action
   (modeling paradigm)
The encoding: Sounds like computer programs!
- But the world is too complex.
- Probabilities to simplify the world model
Judea Pearl: A Turing award winner
- Ended up starting a new conference: Uncertainty in Artificial Intelligence (UAI) ??
- Pearl's framework: probabilistic graphical models (PGMs)
- Relations between unknown and random quantities ??
Complex PGMs are difficult to understand ??
- => PPL: An expressive way to model the world
Goal of this lecture series: Learning to build a series of PPLs
Lectures:
- Syntax and semantics of PPLs
- Sampling
- Tractability and expressivity
How does PP make reasoning better?
- Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference - Judea Pearl
- Ordinary logic is monotonic; not sufficient for a changing world?
- World is not deterministic
- CYC
- PPLs
Probabilistic programs: Programs that denote probability distributions
Pseudo-random number generators
- tt: semantic true
- true: syntactic true
// x and y each get tt with probability 1/2
x <- flip 1/2;
y <- flip 1/2;
return x ∨ y
Probability that return value is true is 3/4.
[tt ↦ 3/4, ff ↦ 1/4]
Currently considering a finite sample space Ω.
Ω = {tt, ff}
Prob distr is a function Pr: Ω → [0, 1]
such that Σ_{ω∈Ω} Pr(ω) = 1
Semantic brackets ⟦⋅⟧
Tiny PPL
Syntax
# Pure terms
p := true | false
| x
| if p then p else p
| p ∧ p
| p ∨ p
# Effectful or probabilistic terms
e := x <- e; e
| return p # Take a pure thing and make it impure. ie, lift to a probabilistic term
| flip θ # Introduce randomness. Where θ is a probability ∈ ℝ
This is a valid program as per above:
x <- return true;
return x
Semantics
Semantics of pure terms:
Let's go with an environment-based approach: free variables live in the environment.
⟦p⟧ₚ: Env -> Bool
This is a partial function: if a variable is not in the env, its value under this function is undefined.
Example:
- ⟦x⟧ₚ(x ↦ tt) = tt
- There's no variable mutation, so x won't be bound to multiple values here
- ⟦true⟧ₚ(ρ) = tt, where ρ is any env
- ⟦p₁ ∧ p₂⟧ₚ(ρ) = ⟦p₁⟧ₚ(ρ) ∧ ⟦p₂⟧ₚ(ρ), where the ∧ on the RHS is semantic and the one on the LHS is syntactic
—
Semantics of probabilistic terms:
⟦e⟧: Env -> ({tt, ff} -> [0, 1])
Given an env and value, there is an associated probability with it.
⟦flip r⟧(ρ) = {tt -> r, ff -> 1-r}, where r ∈ [0,1]
A Bernoulli distribution.
⟦return p⟧(ρ) = v ↦ (if ⟦p⟧ₚ(ρ) == v then 1 else 0)
Now for bind, the main thing.
⟦x ← e₁; e₂⟧(ρ)
Got to be compositional. Meaning needs to be in terms of e₁ and e₂.
An example:
⟦x ← flip 1/2; e⟧(ρ)
Got to do case analysis on the value bound to x:
⟦x ← flip 1/2; e⟧(ρ) = v ↦ 1/2 · ⟦e⟧(ρ[x ↦ tt])(v) + 1/2 · ⟦e⟧(ρ[x ↦ ff])(v)
Now the real one:
⟦x ← e₁; e₂⟧(ρ) = v' ↦ Σ_{v ∈ {tt, ff}} ⟦e₁⟧(ρ)(v) · ⟦e₂⟧(ρ[x ↦ v])(v')
This multiplication assumes the events are independent ??
- because the only way to introduce randomness here is flip, and each flip is independent
Monadic semantics.
- PAPER: Ramsey and Pfeffer 2002
- PAPER: M. Giry 1982, categorical semantics of probability distributions
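A minimal sketch of these semantics as a Python interpreter (my own hypothetical encoding of TinyPPL terms as nested tuples; none of these names come from the lecture): pure terms evaluate to booleans under an environment, and probabilistic terms evaluate to a dict mapping tt/ff to probabilities, following the flip/return/bind equations above.

```python
# Hypothetical term encoding (not from the lecture):
#   pure terms:          True, False, "x", ("and", p, p), ("or", p, p), ("if", p, p, p)
#   probabilistic terms: ("flip", theta), ("return", p), ("bind", "x", e1, e2)

def eval_pure(p, env):
    """[[p]]_p : Env -> Bool (partial: KeyError if a variable is unbound)."""
    if isinstance(p, bool):
        return p
    if isinstance(p, str):
        return env[p]
    tag = p[0]
    if tag == "and":
        return eval_pure(p[1], env) and eval_pure(p[2], env)
    if tag == "or":
        return eval_pure(p[1], env) or eval_pure(p[2], env)
    if tag == "if":
        return eval_pure(p[2], env) if eval_pure(p[1], env) else eval_pure(p[3], env)
    raise ValueError(p)

def denote(e, env):
    """[[e]] : Env -> ({tt, ff} -> [0, 1]), represented as a dict."""
    tag = e[0]
    if tag == "flip":                          # Bernoulli(theta)
        return {True: e[1], False: 1 - e[1]}
    if tag == "return":                        # point mass on the value of the pure term
        v = eval_pure(e[1], env)
        return {True: float(v), False: float(not v)}
    if tag == "bind":                          # sum over the value bound to x
        _, x, e1, e2 = e
        d1 = denote(e1, env)
        out = {True: 0.0, False: 0.0}
        for v, pr in d1.items():
            d2 = denote(e2, {**env, x: v})
            for w in out:
                out[w] += pr * d2[w]
        return out
    raise ValueError(e)

# The opening example: x <- flip 1/2; y <- flip 1/2; return x ∨ y
prog = ("bind", "x", ("flip", 0.5),
        ("bind", "y", ("flip", 0.5),
         ("return", ("or", "x", "y"))))
print(denote(prog, {}))   # {True: 0.75, False: 0.25}
```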
Tractability
- The sum-of-products formula is not efficient to evaluate: it enumerates exponentially many assignments ??
Complexity: #P-Hard
- Toda's theorem
- Input: boolean formula φ
- Output: number of solns to φ
- Example: x ∨ y
- #SAT(x ∨ y) = 3
—
Reduction from #SAT to TinyPPL inference (showing #P-hardness).
Example: φ = (x ∨ y) ∧ (y ∨ z)
x <- flip 1/2
y <- flip 1/2
z <- flip 1/2
return (x ∨ y) ∧ (y ∨ z)
#SAT(φ) = 2^(#vars) · ⟦encoded prog⟧(∅)(tt)
(Here: ⟦encoded prog⟧(∅)(tt) = 5/8, and 2³ · 5/8 = 5 = #SAT(φ).)
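A quick check of the reduction, reusing the hypothetical `denote` sketch from above: brute-force counting gives #SAT((x∨y)∧(y∨z)) = 5, and the encoded program's probability of tt is 5/8, so 2³ · 5/8 = 5.

```python
from itertools import product

phi = lambda x, y, z: (x or y) and (y or z)

# Brute-force model count: 5 of the 8 assignments satisfy phi.
num_models = sum(phi(x, y, z) for x, y, z in product([True, False], repeat=3))

# The encoding: flip each variable with probability 1/2, return phi.
prog = ("bind", "x", ("flip", 0.5),
        ("bind", "y", ("flip", 0.5),
         ("bind", "z", ("flip", 0.5),
          ("return", ("and", ("or", "x", "y"), ("or", "y", "z"))))))

assert abs(2**3 * denote(prog, {})[True] - num_models) < 1e-9   # 8 * 5/8 == 5
```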
—
- Why do we care about the complexity of the denotational semantics? What the computer runs is an operational semantics.
- It gives a lower bound on how efficient any operational semantics can be ??
- What's the price of adding new features? Like tuples/functions.
- TinyPPL has almost no features, yet it is already computationally expensive. It will only get worse.
Systems in the wild
- Stan from Columbia: https://mc-stan.org/
- Doesn't have conditions
- Pyro
- Tensorflow probability
- Something from Microsoft
Tradeoffs in design decisions.
HW: TinyPPL + Tiny Cond, WebPPL
—
- See something about the world and then update our beliefs about it
- Observe an event and then infer the probability of some other event => conditioning
observe p; e
observe in effect goes back in time and changes probabilities; it is non-local.
—
- This is the Bayesian learning model: a different style from deep learning
- PPLs are very interpretable: we can tell why a decision was made, which is very difficult with DL.
D2: Conditioning and sampling
Reachability in a graph/network
- e.g., a packet being routed from one router to another when there are two possible paths
        R2
       /  \
      /    \
    R1      R4
      \    /
       \  /
        R3
Each link lᵢⱼ is up with some probability.
# Prob. that a packet gets from R1 to R4:
route <- flip 0.5;
l12 <- flip 0.99;
l24 <- flip _;
l13 <- flip _;
l34 <- flip _;
return (if route then l12 ∧ l24
        else l13 ∧ l34)
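As a sanity check of the model (the link probabilities below are made up, since the notes leave them as `_`): the reachability probability is just a weighted sum over the route choice.

```python
# Hypothetical up-probabilities for the four links (left as "_" placeholders in the notes).
p_l12, p_l24, p_l13, p_l34 = 0.99, 0.95, 0.90, 0.98

# Pr(R1 reaches R4) = Pr(route via R2)*Pr(l12 up)*Pr(l24 up)
#                   + Pr(route via R3)*Pr(l13 up)*Pr(l34 up)
pr_reach = 0.5 * p_l12 * p_l24 + 0.5 * p_l13 * p_l34
print(pr_reach)   # 0.91125 with these made-up numbers
```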
TinyCond
A PPL with conditionals.
Syntax:
p := true | false
| x
| if p then p else p
| p ∧ p
| p ∨ p
# Effectful or probabilistic terms
e := x <- e; e
| return p
| flip θ
| observe p; e # condition the rest of the program (e) on p being true
Bayesian conditioning.
Example 2: R1 cannot reach R4. What's the probability that link l12 failed?
# Prob. that l12 is down, given that R1 cannot reach R4:
route <- flip 0.5;
l12 <- flip 0.99;
l24 <- flip _;
l13 <- flip _;
l34 <- flip _;
# no ¬ in the grammar, so negation is encoded as (if _ then false else true)
observe (if (if route then l12 ∧ l24 else l13 ∧ l34) then false else true);
return (if l12 then false else true)
Example 3: Flip two coins and observe that at least one of them is heads. What's the probability that the first one is heads?
# Prob. that the first coin is heads, given at least one is heads:
x <- flip 0.5
y <- flip 0.5
observe (x ∨ y);
return x
Before conditioning, each of the four outcomes has probability 1/4.
The outcome x=ff, y=ff is eliminated, because at least one coin is heads. We're left with 3 outcomes, but that's no longer a probability distribution: the total mass is 3/4, not 1. Got to normalize: divide each remaining probability by 3/4.
- x=ff, y=tt: 1/4 becomes (1/4)/(3/4) = 1/3
- x=tt, y=ff: 1/4 becomes (1/4)/(3/4) = 1/3
- x=tt, y=tt: 1/4 becomes (1/4)/(3/4) = 1/3
Bayes' rule
We are interested in just x being true.
Of the remaining outcomes, that rules out x=ff, y=tt; the other two each contribute 1/3.
=> Result of the above program = 1/3 + 1/3 = 2/3
- ⟦prog⟧(∅)(tt) = 2/3
Semantics of TinyCond
Unnormalized semantics => doesn't sum to 1.
A sub-distribution: has all the features of a probability distribution except that it doesn't sum to 1.
Subscript u for unnormalized.
- ⟦flip 1/2⟧ᵤ(ρ)(tt) = 1/2
- ⟦true⟧ᵤ(ρ)(tt) = 1
- ⟦observe p;e⟧ᵤ(ρ)(v)
- if ⟦p⟧ₚ(ρ) = tt then ⟦e⟧ᵤ(ρ)(v)
- otherwise 0
—
Now, the normalized semantics, defined from the unnormalized one:
                     ⟦e⟧ᵤ(ρ)(v)
⟦e⟧(ρ)(v) = ─────────────────────────────
             ⟦e⟧ᵤ(ρ)(tt) + ⟦e⟧ᵤ(ρ)(ff)
deterministic ??
- pure
- Probability monad
observe:
- kind of goes back in time and changes probabilities
- non-locality
- Still #P-hard: the only extra work is the addition needed to compute the normalizing constant ??
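A sketch of the unnormalized and normalized semantics, extending the earlier hypothetical interpreter (it reuses `eval_pure` from that sketch; `("observe", p, e)` is my own encoding): compute the unnormalized mass, then divide by the total.

```python
def denote_u(e, env):
    """Unnormalized semantics [[e]]_u : Env -> ({tt, ff} -> [0, 1])."""
    tag = e[0]
    if tag == "observe":                       # zero out worlds where the observed term is false
        _, p, rest = e
        return denote_u(rest, env) if eval_pure(p, env) else {True: 0.0, False: 0.0}
    if tag == "flip":
        return {True: e[1], False: 1 - e[1]}
    if tag == "return":
        v = eval_pure(e[1], env)
        return {True: float(v), False: float(not v)}
    if tag == "bind":
        _, x, e1, e2 = e
        d1 = denote_u(e1, env)
        out = {True: 0.0, False: 0.0}
        for v, pr in d1.items():
            d2 = denote_u(e2, {**env, x: v})
            for w in out:
                out[w] += pr * d2[w]
        return out
    raise ValueError(e)

def denote_cond(e, env):
    """Normalized semantics: divide by the total mass (assumed nonzero)."""
    d = denote_u(e, env)
    z = d[True] + d[False]
    return {v: pr / z for v, pr in d.items()}

# The two-coins example: observe (x ∨ y); return x
prog = ("bind", "x", ("flip", 0.5),
        ("bind", "y", ("flip", 0.5),
         ("observe", ("or", "x", "y"), ("return", "x"))))
print(denote_cond(prog, {}))   # {True: 0.666..., False: 0.333...}
```

Running the three programs compared just below through `denote_cond` gives the same distribution for the returned value.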
(A) x <- flip 1/2;      (B) x <- flip 2/3;      (C) x <- flip 2/3;
    y <- flip 1/2;          y <- flip 2/3;          return x
    observe x ∨ y;          return x
    return x
Are these the same ???? As far as the returned value goes, yes: all three denote [tt ↦ 2/3, ff ↦ 1/3].
Are
  while true: pass
and
  observe false
the same ??? (Both leave zero probability mass, so normalization breaks.)
Conditioning will still make things more complex ??
Operational sampling semantics
Expectation of a random fn = average val of the fn
We can approximate the E with a finite number of samples.
E_{ω~Pr}[f] = Σ_{ω} Pr(ω) · f(ω) ≈ (1/N) Σ_{i=1}^{N} f(ωᵢ),   where ωᵢ ~ Pr
This is Monte Carlo estimation ??
ω ~ Pr
means ω is drawn from the probability distribution Pr.
Concentration inequalities give a bound on the number of samples needed to approximate the mean of a random variable. They only give useful guarantees for some classes of problems though. Which classes ???
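A minimal Monte Carlo sketch (my own helper names; it reuses `eval_pure` from the earlier interpreter sketch): run the program N times with a PRNG and average the indicator of the returned value.

```python
import random

def sample_once(e, env):
    """Run a TinyPPL term once, resolving each flip with the PRNG."""
    tag = e[0]
    if tag == "flip":
        return random.random() < e[1]
    if tag == "return":
        return eval_pure(e[1], env)
    if tag == "bind":
        _, x, e1, e2 = e
        v = sample_once(e1, env)
        return sample_once(e2, {**env, x: v})
    raise ValueError(e)

def monte_carlo(e, n=100_000):
    """Estimate [[e]](∅)(tt) as the average of indicator samples."""
    return sum(sample_once(e, {}) for _ in range(n)) / n

prog = ("bind", "x", ("flip", 0.5),
        ("bind", "y", ("flip", 0.5),
         ("return", ("or", "x", "y"))))
print(monte_carlo(prog))   # ≈ 0.75
```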
Efficiency
To prove soundness of our runtime strategy, we relate it to an expectation.
Semantics: we'll try a big-step style; small-step is also possible.
PAPER: Culpepper and Cobb, Contextual Equivalence (2017)
- σ ∈ 𝔹^ℕ: an infinite stream of random boolean values
- ρ: env
σ ⊢ <c,ρ> ⇓ v means in the context of σ, c under env ρ reduces to v
Externalize the randomness: put it all in one place. Considering only flip 1/2 for now.
- v::σ ⊢ <flip 1/2, ρ> ⇓ v
πₗ(σ) ⊢ <e1, ρ> ⇓ v     πᵣ(σ) ⊢ <e2, ρ[x ↦ v]> ⇓ v'
─────────────────────────────────────────────────────
            σ ⊢ <x ← e1; e2, ρ> ⇓ v'
πₗ splits σ into two and takes the left part; likewise πᵣ takes the right part. This splits the randomness between e1 and e2.
─────────────────────────────
σ ⊢ <return p, ρ> ⇓ ⟦p⟧ₚ(ρ)
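A sketch of this big-step evaluator in Python (my own assumptions: σ is represented as a function from index to bit, πₗ/πᵣ split it into even- and odd-indexed bits, and flip ignores θ since only flip 1/2 is considered; it reuses `eval_pure` from the earlier sketch):

```python
import random

def pi_l(sigma):                    # even-indexed bits go to the left premise
    return lambda n: sigma(2 * n)

def pi_r(sigma):                    # odd-indexed bits go to the right premise
    return lambda n: sigma(2 * n + 1)

def big_step(e, env, sigma):
    """sigma |- <e, env> ⇓ v  (flip ignores theta: only flip 1/2 here)."""
    tag = e[0]
    if tag == "flip":               # v::sigma |- flip 1/2 ⇓ v
        return sigma(0)
    if tag == "return":             # sigma |- <return p, env> ⇓ [[p]]_p(env)
        return eval_pure(e[1], env)
    if tag == "bind":
        _, x, e1, e2 = e
        v = big_step(e1, env, pi_l(sigma))
        return big_step(e2, {**env, x: v}, pi_r(sigma))
    raise ValueError(e)

# One example stream: a fresh pseudo-random bit per index (deterministic in the index).
sigma = lambda n: random.Random(n).getrandbits(1) == 1
prog = ("bind", "x", ("flip", 0.5),
        ("bind", "y", ("flip", 0.5),
         ("return", ("or", "x", "y"))))
print(big_step(prog, {}, sigma))
```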
OT:
- Hilbert cube: [0,1]^ℕ
- v ∈ [0,1]^ℕ
E_{σ~Pr}[𝟙(eval(e, σ, ρ) = tt)] = ⟦e⟧(ρ)(tt)
where Pr is the uniform distribution on infinite bit streams.
–
Bind:
E[f · g] = E[f] · E[g]
assuming f and g are independent ??
OT:
x := 0
while flip 1/2: x := x + 1
Should approach a limit
Measure theory stuff refs:
- Measure, Integration & Real Analysis - Sheldon Axler
D3: Tractability and expressivity
When does sampling work well and poorly?
Eg:
x <- flip 10^-6
y <- flip 10^-6
return x ∧ y
// Doubt: what if it were x ∨ y ??
Here, it would take around a million samples before we get a true.
i.e., sampling works poorly for estimating low-probability events.
Conditioning and sampling don't play well together. Why??
x <- flip 10^-6
y <- flip 10^-6
observe x ∧ y
return x
- Rejection sampling: run the program and throw away runs where the observed condition fails.
- Downside of sampling: with low-probability evidence, almost every run gets rejected.
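A sketch of rejection sampling on this program (hypothetical helper, not the lecture's code): with Pr(evidence) = 10⁻¹², essentially every run is rejected, so no estimate comes out.

```python
import random

def rejection_sample(n=1_000_000, p=1e-6):
    """Rejection sampling for: x <- flip p; y <- flip p; observe (x ∧ y); return x."""
    accepted = []
    for _ in range(n):
        x = random.random() < p
        y = random.random() < p
        if x and y:                 # keep the run only if the observed event happened
            accepted.append(x)
    return accepted                 # expected number of accepted runs: n * p^2 = 1e-6

print(len(rejection_sample()))      # almost certainly 0: nothing to estimate from
```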
Let's explore alternatives to sampling.
- Worst-case hardness
- Approximation: sampling
- Problem: approximating under conditioning
Search-based:
x <- flip 1/2
y <- if x then flip 1/3 else flip 1/4
z <- if y then flip 1/6 else flip 1/7
return z
It's a probabilistic if
(which we haven't defined yet).
Building a search tree:
           x
         /   \
        y     y
       / \   / \
      z   z z   z
Work inductively from leaves.
But there's redundancy: the z subtrees are identical, because z doesn't depend on x, so there are duplicate copies.
Could we do better and reuse the shared substructure? Let's try.
<graph here>
This is what's called Binary Decision Diagrams (BDDs).
- Adding a new variable => just another layer; it doesn't grow that fast.
- Knuth described BDDs as one of the only really fundamental data structures introduced in the last 25 years.
Can we build the BDD without explicitly building the other tree?
Tractability and expressivity
- Computing semantics is #P-Hard
- But given a BDD, computing the probability is easy: one pass, linear in the size of the BDD (building the BDD can still be expensive) ??
- Languages on which this kind of reasoning is poly-time: d-DNNF, BDD, SDD, trees
Language design landscape:
(figure: blank axes sketched in the lecture)
We could compile a more expressive language to a more tractable language ????
- Think of the tractable languages as a kind of assembly language.
- The compilation has a cost
- A syntactic way to study this cost ??
Knowledge compilation - Adnan Darwiche
- Hardness of propositional/boolean reasoning tasks is related to formulae syntax
- PAPER: A Knowledge Compilation Map (Darwiche & Marquis, 2002)
A DNF: AB + ACD' + …
SAT is NP-complete in general, but for a DNF it's easy ??
What kinds of structure enable efficient reasoning?
- Darwiche identified BDDs as one such structure ??
Succinctness:
- L1 is more succinct than L2 if every L2 program can be translated into an equivalent L1 program with at most polynomial blow-up
- E.g., BDD is not more succinct than TinyPPL
- ∵ there can't be an efficient conversion from TinyPPL (inference #P-hard) to BDDs (inference poly-time)
PAPER: On the expressive power of programming languages - Matthias Felleisen
Compiling TinyPPL to BDD
PAPER: Holtzen et al., OOPSLA 2020
Desired: compositional compilation; this would give locality.
Composing BDDs
Base cases (conjoining with a constant):
- tt ∧ B = B
- ff ∧ B = ff
Recursive case (sketched as a diagram in the lecture): a node A with branches B1 and B2, conjoined (∧) with another BDD by recursing into the branches.
BDD needs some structural invariants
- Ordering: variables are encountered in the same order along every path, and at most once
- Reduction: no isomorphic sub-trees (they get shared)
- Also means no vacuous nodes (nodes whose two branches are identical)
These two together: Canonicity
⟦ψ₁⟧= ⟦ψ₂⟧ <=> BDD(ψ₁) = BDD(ψ₂)
- assuming the 2 BDDs have same var order
See: https://github.com/SHoltzen/oplss24-code/blob/main/6-tinybdd/bdd.ml
This was the bottom-up way to combine BDDs ??
An advantage of BDDs: a lot of efficient BDD engineering has already been done; we can piggyback on it.
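A minimal BDD sketch in Python (my own simplification of the ideas above, only loosely related to the linked OCaml code): hash-consed (var, lo, hi) nodes with a fixed numeric variable order, `bdd_and`/`bdd_or` for composition, and a weighted model count that is linear in the number of BDD nodes.

```python
# Terminals are the Python booleans True/False; internal nodes are (var, lo, hi) triples.
_unique = {}                                   # hash-consing table: gives the reduction invariant

def make_node(var, lo, hi):
    if lo is hi:                               # vacuous node: both branches identical
        return lo
    key = (var, id(lo), id(hi))
    if key not in _unique:
        _unique[key] = (var, lo, hi)
    return _unique[key]

def var_node(v):
    return make_node(v, False, True)

def bdd_and(a, b):
    # Base cases: tt ∧ B = B, ff ∧ B = ff.
    if a is True:  return b
    if b is True:  return a
    if a is False or b is False: return False
    top = min(a[0], b[0])                      # respect the shared variable order
    a_lo, a_hi = (a[1], a[2]) if a[0] == top else (a, a)
    b_lo, b_hi = (b[1], b[2]) if b[0] == top else (b, b)
    return make_node(top, bdd_and(a_lo, b_lo), bdd_and(a_hi, b_hi))

def bdd_or(a, b):
    if a is True or b is True: return True
    if a is False: return b
    if b is False: return a
    top = min(a[0], b[0])
    a_lo, a_hi = (a[1], a[2]) if a[0] == top else (a, a)
    b_lo, b_hi = (b[1], b[2]) if b[0] == top else (b, b)
    return make_node(top, bdd_or(a_lo, b_lo), bdd_or(a_hi, b_hi))

def wmc(node, weights, memo=None):
    """Weighted model count: one bottom-up pass, linear in the number of BDD nodes."""
    if node is True:  return 1.0
    if node is False: return 0.0
    if memo is None:  memo = {}
    if id(node) not in memo:
        var, lo, hi = node
        w = weights[var]                       # probability that var is true
        memo[id(node)] = (1 - w) * wmc(lo, weights, memo) + w * wmc(hi, weights, memo)
    return memo[id(node)]

# phi = (x ∨ y) ∧ (y ∨ z), each variable flipped with probability 1/2
x, y, z = var_node(0), var_node(1), var_node(2)
phi = bdd_and(bdd_or(x, y), bdd_or(y, z))
print(wmc(phi, {0: 0.5, 1: 0.5, 2: 0.5}))      # 0.625 = 5/8, matching the #SAT example
```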
D4: Separation logic, perspectives
Probability + mutable states
- PAPER: POPL'20 Barthe, Hsu, Liao
- PAPER: Lilac: John Li, Ahmed, Holtzen
- PAPER: Bluebell
—
Disjoint parts of the heap
{⊤}
x := alloc 10
{x ↦ 10}
y := alloc 20
{x ↦ 10 * y ↦ 20}
* (separating conjunction)
means the two assertions hold over disjoint parts of the heap.
—
{⊤}
x := flip 1/2
{x~Bernoulli(1/2)}
y := flip 1/2
{x~Bernoulli(1/2) * y~Bernoulli(1/2)}
- x is a random variable that follows Bernoulli(1/2)
- x and y are statistically independent rand vars
—
Frame rule ?? in seplog, for a triple {P} c {Q}:
- Composition with other rules is difficult due to possible aliasing.
- But given disjointness, composition is possible:
{P} c {Q}
--------------
{P*F} c {Q*F}
(side condition: c does not modify any variable free in F)
—
Seplog for heap programs
(s,h)
store = s: names → heap locations
heap = h: loc → value
(s,h) ⊨ ⊤
(s,h) ⊭ ⊥
(s,h) ⊨ [x↦v] iff h(s(x))=v
(s,h) ⊨ P (s,h) ⊨ Q
--------------------------
(s,h) ⊨ P∧Q
(s1,h1) ⟂ (s2,h2) if dom(h1) ∩ dom(h2) = ∅
(s1,h1) ⊎ (s2,h2): the combined heap maps
- l ↦ h1(l) if l ∈ dom(h1)
- l ↦ h2(l) if l ∈ dom(h2)
- It is a partial operation (only defined when the heaps are disjoint) ??
(s1,h1) ⊑ (s2,h2) if dom(h1) ⊆ dom(h2)
- ∀l ∈ dom(h1), h1(l) = h2(l)
- A notion of sub-heap: the two heaps agree on everything in the smaller one.
Table of analogies
| Symbol | Heap (seplog) | Probability |
|---|---|---|
| ⊎ | Disjoint union | Independent combination |
| ⊑ | Heap subset | Sub-probability space |
| ⟂ | Heap disjointness | Statistical independence |
| ownership | location l is in the heap | measurability (see below) |
—
(S, (Ω, E, μ)) ⊨ ⊤
- Ω: Sample space
- Kind of like the heap in seplog, in that it's ambient, in the background
- An ambient space with enough room for all the randomness that wish to allocate
- E: collection of subsets of Ω
- μ: Probability measure
- assigns a probability to each element of E
- OT: σ-algebra
- x~Bern 1/2
- x is behaving like a pointer/location
- x is like a fn. x: Ω → {⊤,⊥}
- Ω = [0,1]² (unit square)
- Area of ⊤ and ⊥ is one half each
Ω
+--------+
|        |
|  1/2   | ---------> T
|        |
+--------+
|        |
|  1/2   | ---------> F
|        |
+--------+
Got to carve some space from Ω that's independent from what's already allocated.
Ω
+------+------+  (for x)
|      |      |
|      |      | ---------> T
|      |      |
+------+------+
|      |      |
|      |      | ---------> F
|      |      |
+------+------+
   |      |
   v      v
   T      F   (for y)
Statistical independence: x ⟂ y iff Pr(x=T ∧ y=T) = Pr(x=T) · Pr(y=T) (and similarly for the other value combinations).
Independent combination
How to show this models x~Bern(1/2) * y~Bern(1/2)?
Ω
+------+------+  (for x)
|      |      |
|      |      | ---------> T
|      |      |
+------+------+
|      |      |
|      |      | ---------> F
|      |      |
+------+------+
   |      |
   v      v
   T      F   (for y)
We should be able to split Ω into two independent spaces.
- One where x is Bern(1/2)
- One where y is Bern(1/2)
Let's split Ω into the two independent spaces.
For x~Bern 1/2
Ω
+-------------+
|             |
+-------------+
|             |
+-------------+
For y~Bern 1/2
Ω
+------+------+
|      |      |
|      |      |
|      |      |
+------+------+
+-------------+     +------+------+     +------+------+
|             |     |      |      |     |      |      |
+-------------+  ⋅  |      |      |  =  +------+------+
|             |     |      |      |     |      |      |
+-------------+     +------+------+     +------+------+
(Ω, E1, μ1) ⋅ (Ω, E2, μ2) = (Ω, E1⊔E2, ρ)
- Ω isn't changing, but the subset of Ω that we are allowed to talk about does.
- E1 ⊔ E2 is the smallest σ-algebra that contains E1 and E2
- ρ needs to make sense
- ∀e1 ∈ E1, e2 ∈ E2, ρ(e1 ∩ e2) = μ1(e1) * μ2(e2)
This was like ⊎ (disjoint union), but for probability.
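A small concrete check of the product property (my own toy example, not from the lecture: Ω discretized to a 2×2 grid of equally likely points, with the x-events generated by rows and the y-events by columns):

```python
from itertools import product

omega = set(product(["top", "bottom"], ["left", "right"]))      # 4 points, mass 1/4 each

def mu(event):
    """Uniform measure on the 4-point sample space."""
    return len(event) / len(omega)

x_true = frozenset(w for w in omega if w[0] == "top")            # "x = tt" is the top row
y_true = frozenset(w for w in omega if w[1] == "left")           # "y = tt" is the left column

# Events measurable using x alone, resp. y alone (finite sigma-algebras).
E1 = {frozenset(), x_true, frozenset(omega) - x_true, frozenset(omega)}
E2 = {frozenset(), y_true, frozenset(omega) - y_true, frozenset(omega)}

# rho(e1 ∩ e2) = mu1(e1) * mu2(e2) for every e1 in E1 and e2 in E2.
for e1 in E1:
    for e2 in E2:
        assert abs(mu(e1 & e2) - mu(e1) * mu(e2)) < 1e-9
print("product property holds on all pairs of events")
```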
(S, (Ω, E, μ)) ⊨ x~Bern θ iff
- S(x)⁻¹(tt) ∈ E
- S(x)⁻¹(ff) ∈ E
- μ(S(x)⁻¹(tt)) = θ ???
Measure theory stuff: The first two items mean x is measurable
Ownership is measurability.
—
These are in close analogy with the corresponding notions in seplog.
—
NQ: Conditional independence
Perspectives and future
- Scalability: rare to find PPL programs longer than 1k lines
- Usability: Error messages, debuggers
- Profilers
- PPLs still used only by domain experts??
- Semantics
Defining ⟦e⟧ as a probability distribution
is not easy when there are features like higher-order functions. TinyPPL was simple enough that it was possible there.
- Making PL more compatible with probability stuff
OT:
- Dirac distribution