knitr::opts_chunk$set(dev = "ragg_png")
options(digits = 3) # print numbers with 3 significant digits
# tidyverse
library(tidyverse)
# others
library(gridExtra)
library(pracma)
library(comprehenr)
library(stats)
library(latex2exp)
library(highcharter)
library(glue)
I want to start with a simple discrete exponential distribution with a finite number of possible outcomes: \[ P(X=x) = \frac{e^{-\mu x}}{Z} \quad , \quad x\in \{1, 2, ..., N\} \] First of all, given \(\mu\), we need to find the normalization factor \(Z\), which is given by \[ Z=\sum_{x=1}^{N}e^{-\mu x} \]
# parameters
mu <- 0.1
N <- 8
# probability distribution
possible_outcomes <- seq(1, N)
Z <- sum(exp(-mu*possible_outcomes))
p <- function(x){return(exp(-mu*x)/Z)}
# check
cat('Is the distribution normalized? Total sum =', sum(p(possible_outcomes)))
## Is the distribution normalized? Total sum = 1
df_distrib <- data.frame('x'=possible_outcomes, 'P'=p(possible_outcomes))
hc_pdf <- df_distrib %>%
  hchart('column', hcaes(x = x, y = P)) %>%
  hc_title(text = 'Probability distribution') %>%
  hc_xAxis(title = list(text = 'x')) %>%
  hc_yAxis(title = list(text = 'P(x)'))
hc_pdf
Let’s generate some data from this distribution.
set.seed(111)
ndata <- 100000
x <- sample(x=possible_outcomes, size=ndata, replace=TRUE, prob=p(possible_outcomes))
# frequency table of the sampled values
hc_counts <- as.data.frame(table(x)) %>%
  hchart('column', hcaes(x = x, y = Freq)) %>%
  hc_title(text = 'Sampling results from the defined distribution') %>%
  hc_xAxis(title = list(text = 'x')) %>%
  hc_yAxis(title = list(text = 'Count'))
hc_counts
The MaxEnt principle can be used to find the most general probability distribution compatible with a set of constraints, by maximizing the Shannon entropy \(S_I\) subject to these constraints.
Given a discrete event space \(E=\{i\}\), \(i=1, ..., N\), with probabilities \(p_i\geq0\) such that \(\sum_i p_i =1\), we define the Shannon entropy (also called information entropy) of this probability distribution as
\[ S_I(p_1, ..., p_N) = - \sum_{i=1}^{N} p_i \ln p_i \]
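As a quick numerical aside (not part of the derivation), we can evaluate \(S_I\) for the distribution defined above; the helper shannon_entropy below is a name introduced here only for this check, and it uses the df_distrib data frame created earlier.
# Shannon entropy of a discrete probability vector (natural logarithm)
shannon_entropy <- function(p_vec){
  return(-sum(p_vec * log(p_vec)))
}
cat('Entropy of the defined distribution:', shannon_entropy(df_distrib$P), '\n')
cat('Maximum entropy for N outcomes (uniform case):', log(N), '\n')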
In this case, the constraints come from measurements rather than from some conservation law. Assume that the data are sampled from an unknown distribution \(\{p_i\}\), which is what we want to infer. We can constrain it by measuring its \(k\)-th moments for \(k = 1, ..., m\), where \(m\) is a given number; here we choose \(m=3\).
We collect the moments in a vector \(\mathbf{g} = (g^{(1)}, ..., g^{(m)})\), defined component-wise, for \(a = 1, ..., m\), by
\[ E[X^a] \equiv \sum_{i=1}^{N}x_i^{a}\cdot p(x_i) = g^{(a)} \]
# number of moment constraints
m <- 3
# sample moments estimated from the data
g <- to_vec(for (a in 1:m) mean(x^a))
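As a sanity check (an addition to the original text), we can compare these sample moments with the exact moments of the generating distribution; g_theory is a name introduced here only for the comparison.
# exact moments of the generating distribution: E[X^a] = sum_x x^a * p(x)
g_theory <- to_vec(for (a in 1:m) sum(possible_outcomes^a * p(possible_outcomes)))
for (a in 1:m) {
  cat('a =', a, ': sample moment =', g[a], ', exact moment =', g_theory[a], '\n')
}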
The application of the MaxEnt principle leads to finding the probabilities \(\mathbf{p}=(p_1, ..., p_N)\) that maximize \(S_I(\mathbf{p})\) and satisfy the \(m+1\) constraints (\(m\) from the moments and \(1\) from the normalization). This is done using the method of Lagrange multipliers.
By doing the calculation (see [2], pages 73-74), it is possible to find that \(\mathbf{p_{max}}\) is given by
\[ p_j^{\text{max}} \equiv p_j(\mathbf{\lambda}) \equiv \frac{1}{Z(\mathbf{\lambda})} \exp\left(-\sum_{a=1}^m \lambda_a x_j^a \right) \]
where
\[ Z(\mathbf{\lambda}) \equiv \sum_{j=1}^N \exp\left(-\sum_{a=1}^m \lambda_a x_j^a \right) = \exp\left(1 + \lambda_0 \right) \]
with \(\mathbf{\lambda} = (\lambda_1, \dots, \lambda_m)^T\).
To find \(\mathbf{\lambda}\) we must impose the constraints defined above. Following Manzali’s notes ([2], page 74) we end up with the optimization problem \[ \mathbf{\nabla_x}h(\mathbf{x})|_{\mathbf{x}=\mathbf{\lambda}} = 0 \] where \[ h(\mathbf{x}) \equiv \mathbf{g}\cdot\mathbf{x} + \ln Z(\mathbf{x}) \] Since \(h\) is convex (its second term is a log-sum-exp of linear functions), the stationary point is a minimum, so we can find \(\mathbf{\lambda}\) by minimizing \(h\) numerically.
# function to be minimized: h(lambda) = g . lambda + ln Z(lambda)
h <- function(lambda_vec){
  if (length(lambda_vec) != m) {
    stop('lambda_vec must be a vector of size m')
  }
  # partition function Z(lambda) = sum_j exp(-sum_a lambda_a * x_j^a)
  Z <- sum(
    to_vec(
      for (j in 1:N) exp(-dot(lambda_vec, to_vec(for (a in 1:m) possible_outcomes[j]^a)))
    )
  )
  return(dot(g, lambda_vec) + log(Z))
}
# minimize h to obtain the Lagrange multipliers (Nelder-Mead by default)
lambdas <- optim(numeric(m), h)
if (lambdas$convergence != 0) {
  cat('An error in convergence occurred')
} else {
  cat('Lagrange multipliers:\n')
  for (i in seq_along(lambdas$par)) {
    cat('lambda', i, '=', lambdas$par[i], '\n')
  }
}
## Lagrange multipliers:
## lambda 1 = 0.106
## lambda 2 = -0.000554
## lambda 3 = -5.22e-06
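Before interpreting the multipliers, it is worth verifying the stationarity condition \(\nabla h(\mathbf{\lambda}) \approx 0\): since \(\partial h / \partial \lambda_a = g^{(a)} - E_{p(\mathbf{\lambda})}[X^a]\), a small gradient means the moment constraints are satisfied. This check is an addition of mine and relies on the numerical gradient provided by pracma::grad.
# numerical gradient of h at the optimum; each component equals g^(a) - E[X^a] under p(lambda)
grad_at_optimum <- grad(h, lambdas$par)
cat('Gradient of h at the optimum (expected to be close to zero):\n')
print(grad_at_optimum)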
As expected, we found that \(\lambda_1\) is close to \(\mu=0.1\), while \(\lambda_2\) and \(\lambda_3\) are much smaller and essentially zero. The resulting distribution is therefore very close to the one the data were generated from, and the constraints on \(\langle x^2 \rangle\) and \(\langle x^3 \rangle\) do not provide any further information about it. Indeed, in this specific case, using only the value of \(\langle x \rangle\) we would have obtained a very similar result, even closer to the original distribution.
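To make this comparison concrete, we can reconstruct the MaxEnt probabilities \(p_j(\mathbf{\lambda})\) from the fitted multipliers and put them side by side with the generating distribution; the names p_maxent and df_compare are introduced here only for this sketch.
# MaxEnt probabilities: p_j(lambda) = exp(-sum_a lambda_a * x_j^a) / Z(lambda)
p_maxent <- function(lambda_vec){
  w <- to_vec(
    for (j in 1:N) exp(-sum(to_vec(for (a in 1:m) lambda_vec[a] * possible_outcomes[j]^a)))
  )
  return(w / sum(w))
}
df_compare <- data.frame(
  'x' = possible_outcomes,
  'original' = p(possible_outcomes),
  'maxent' = p_maxent(lambdas$par)
)
df_compare
cat('Maximum absolute difference:', max(abs(df_compare$original - df_compare$maxent)), '\n')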