knitr::opts_chunk$set(dev = "ragg_png")
options(digits = 3) # print numbers with 3 significant digits
# tidyverse
library(tidyverse)
# others
library(gridExtra)
library(pracma)
library(comprehenr)
library(stats)
library(latex2exp)
library(highcharter)
library(glue)
I want to start with a simple discrete exponential distribution with a finite number of possible outcomes: \[ P(X=x) = \frac{e^{-\mu x}}{Z} \quad , \quad x\in \{1, 2, ..., N\} \] First of all, given \(\mu\), we need to find the normalization factor \(Z\), which is given by \[ Z=\sum_{x=1}^{N}e^{-\mu x} \]
# parameters
mu <- 0.1
N <- 8
# probability distribution
possible_outcomes <- seq(1, N)
Z <- sum(exp(-mu*possible_outcomes))
p <- function(x){return(exp(-mu*x)/Z)}
# check
cat('Is the distribution normalized? Total sum =', sum(p(possible_outcomes)))
## Is the distribution normalized? Total sum = 1
df_distrib <- data.frame('x'=possible_outcomes, 'P'=p(possible_outcomes))
hc_pdf <- df_distrib %>%
  hchart('column', hcaes(x = x, y = P)) %>%
  hc_title(text = 'Probability distribution') %>%
  hc_xAxis(title = list(text = 'x')) %>%
  hc_yAxis(title = list(text = 'P(x)'))
hc_pdf
Let’s generate some data from this distribution.
set.seed(111)
ndata <- 100000
x <- sample(x=possible_outcomes, size=ndata, replace=TRUE, prob=p(possible_outcomes))
# frequency table of the sampled values
hc_counts <- as.data.frame(table(x)) %>%
  hchart('column', hcaes(x = x, y = Freq)) %>%
  hc_title(text = 'Sampling results from the defined distribution') %>%
  hc_xAxis(title = list(text = 'x')) %>%
  hc_yAxis(title = list(text = 'Count'))
hc_counts
The MaxEnt principle can be used to find the most general probability distribution compatible with a set of constraints, by maximizing the Shannon entropy \(S_I\) subject to these constraints.
Given a discrete event space \(E=\{i\}\), \(i=1, ..., N\), with probabilities \(p_i\geq0\) such that \(\sum_i p_i =1\), we define the Shannon entropy (also called information entropy) of this probability distribution as
\[ S_I(p_1, ..., p_N) = - \sum_{i=1}^{N} p_i \ln p_i \]
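As a quick numerical aside (not part of the derivation), we can evaluate \(S_I\) for the distribution defined above; the helper shannon_entropy below is a name introduced here only for this check, and it uses the df_distrib data frame created earlier.
# Shannon entropy of a discrete probability vector (natural logarithm)
shannon_entropy <- function(p_vec){
  return(-sum(p_vec * log(p_vec)))
}
cat('Entropy of the defined distribution:', shannon_entropy(df_distrib$P), '\n')
cat('Maximum entropy for N outcomes (uniform case):', log(N), '\n')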
In this case, the constraints come from measurements rather than from some conservation law. Assume that the data are sampled from an unknown distribution \(\{p_i\}\), which is what we want to infer. We can constrain it by measuring its \(k\)-th moments for \(k = 1, ..., m\), where \(m\) is a given number; here we choose \(m=3\).
We collect the moments in a vector \(\mathbf{g} = (g^{(1)}, ..., g^{(m)})\), defined component-wise, for \(a = 1, ..., m\), by
\[ E[X^a] \equiv \sum_{i=1}^{N}x_i^{a}\cdot p(x_i) = g^{(a)} \]
# number of moment constraints
m <- 3
# sample moments estimated from the data
g <- to_vec(for (a in 1:m) mean(x^a))
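As a sanity check (an addition to the original text), we can compare these sample moments with the exact moments of the generating distribution; g_theory is a name introduced here only for the comparison.
# exact moments of the generating distribution: E[X^a] = sum_x x^a * p(x)
g_theory <- to_vec(for (a in 1:m) sum(possible_outcomes^a * p(possible_outcomes)))
for (a in 1:m) {
  cat('a =', a, ': sample moment =', g[a], ', exact moment =', g_theory[a], '\n')
}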
The application of the MaxEnt principle leads to finding the probabilities \(\mathbf{p}=(p_1, ..., p_N)\) that maximize \(S_I(\mathbf{p})\) and satisfy the \(m+1\) constraints (\(m\) from the moments and \(1\) from the normalization). This is done using the method of Lagrange multipliers.
By doing the calculation (see [2], pages 73-74), it is possible to find that \(\mathbf{p_{max}}\) is given by
\[ p_j^{\text{max}} \equiv p_j(\mathbf{\lambda}) \equiv \frac{1}{Z(\mathbf{\lambda})} \exp\left(-\sum_{a=1}^m \lambda_a x_j^a \right) \]
where
\[ Z(\mathbf{\lambda}) \equiv \sum_{j=1}^N \exp\left(-\sum_{a=1}^m \lambda_a x_j^a \right) = \exp\left(1 + \lambda_0 \right) \]
with \(\mathbf{\lambda} = (\lambda_1, \dots, \lambda_m)^T\).
To find \(\mathbf{\lambda}\) we must impose the constraints defined above. Following Manzali’s notes ([2], page 74) we end up with the optimization problem \[ \mathbf{\nabla_x}h(\mathbf{x})|_{\mathbf{x}=\mathbf{\lambda}} = 0 \] where \[ h(\mathbf{x}) \equiv \mathbf{g}\cdot\mathbf{x} + \ln Z(\mathbf{x}) \] Since \(h\) is convex (its second term is a log-sum-exp of linear functions), the stationary point is a minimum, so we can find \(\mathbf{\lambda}\) by minimizing \(h\) numerically.
# function to be minimized: h(lambda) = g . lambda + ln Z(lambda)
h <- function(lambda_vec){
  if (length(lambda_vec) != m) {
    stop('lambda_vec must be a vector of size m')
  }
  # partition function Z(lambda) = sum_j exp(-sum_a lambda_a * x_j^a)
  Z <- sum(
    to_vec(
      for (j in 1:N) exp(-dot(lambda_vec, to_vec(for (a in 1:m) possible_outcomes[j]^a)))
    )
  )
  return(dot(g, lambda_vec) + log(Z))
}
# minimize h to obtain the Lagrange multipliers (Nelder-Mead by default)
lambdas <- optim(numeric(m), h)
if (lambdas$convergence != 0) {
  cat('An error in convergence occurred')
} else {
  cat('Lagrange multipliers:\n')
  for (i in seq_along(lambdas$par)) {
    cat('lambda', i, '=', lambdas$par[i], '\n')
  }
}
## Lagrange multipliers:
## lambda 1 = 0.106
## lambda 2 = -0.000554
## lambda 3 = -5.22e-06
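Before interpreting the multipliers, it is worth verifying the stationarity condition \(\nabla h(\mathbf{\lambda}) \approx 0\): since \(\partial h / \partial \lambda_a = g^{(a)} - E_{p(\mathbf{\lambda})}[X^a]\), a small gradient means the moment constraints are satisfied. This check is an addition of mine and relies on the numerical gradient provided by pracma::grad.
# numerical gradient of h at the optimum; each component equals g^(a) - E[X^a] under p(lambda)
grad_at_optimum <- grad(h, lambdas$par)
cat('Gradient of h at the optimum (expected to be close to zero):\n')
print(grad_at_optimum)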
As expected, we found that \(\lambda_1\) is close to \(\mu=0.1\), while \(\lambda_2\) and \(\lambda_3\) are much smaller and essentially zero. The resulting distribution is therefore very close to the one the data were generated from, and the constraints on \(\langle x^2 \rangle\) and \(\langle x^3 \rangle\) do not provide any further information about it. Indeed, in this specific case, using only the value of \(\langle x \rangle\) we would have obtained a very similar result, even closer to the original distribution.
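To make this comparison concrete, we can reconstruct the MaxEnt probabilities \(p_j(\mathbf{\lambda})\) from the fitted multipliers and put them side by side with the generating distribution; the names p_maxent and df_compare are introduced here only for this sketch.
# MaxEnt probabilities: p_j(lambda) = exp(-sum_a lambda_a * x_j^a) / Z(lambda)
p_maxent <- function(lambda_vec){
  w <- to_vec(
    for (j in 1:N) exp(-sum(to_vec(for (a in 1:m) lambda_vec[a] * possible_outcomes[j]^a)))
  )
  return(w / sum(w))
}
df_compare <- data.frame(
  'x' = possible_outcomes,
  'original' = p(possible_outcomes),
  'maxent' = p_maxent(lambdas$par)
)
df_compare
cat('Maximum absolute difference:', max(abs(df_compare$original - df_compare$maxent)), '\n')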