Formalization of optimization algorithms

2.Β Some examplesπŸ”—

In this page, we present two examples of algorithms encompassed by our framework: the Pure Random Search (PRS) and the LIPO algorithms. The formalization of these algorithms in our framework relies on the definition of the initial probability measure and the Markov kernels that define how to sample the next element based on the previous ones.

The Pure Random Search (PRS) algorithm is a simple stochastic iterative global optimization algorithm that samples uniformly from the input space at each iteration using a fixed probability measure ΞΌ. It can be represented in our framework as follows:

πŸ”—def
PRS.{u_1, u_2} {Ξ± : Type u_1} {Ξ² : Type u_2} [MeasurableSpace Ξ±] [MeasurableSpace Ξ²] (ΞΌ : MeasureTheory.Measure Ξ±) [MeasureTheory.IsProbabilityMeasure ΞΌ] : Algorithm Ξ± Ξ²
PRS.{u_1, u_2} {Ξ± : Type u_1} {Ξ² : Type u_2} [MeasurableSpace Ξ±] [MeasurableSpace Ξ²] (ΞΌ : MeasureTheory.Measure Ξ±) [MeasureTheory.IsProbabilityMeasure ΞΌ] : Algorithm Ξ± Ξ²

The Pure Random Search (PRS) algorithm for global optimization. This baseline algorithm samples uniformly from the input space at each iteration using a fixed probability measure ΞΌ.

noncomputable def PRS : Algorithm Ξ± Ξ² where Ξ½ := ΞΌ kernel_iter _ := Kernel.const _ ΞΌ

2.2.Β Decision-based methodsπŸ”—

We provide an interface for decision-based optimization algorithms. Such algorithm are described by a sequence of decision rules to decide wether to accept or reject a candidate solution at each iteration, based on the observed data. Decision rules are functions with the signature D_n : \Omega^{n + 1} \times \beta^{n + 1} \to [0, 1], where \Omega is the search space and \beta is the space of function values (most likely \mathbb{R}). More formally, given a function to optimize V : \Omega \to \beta at each iteration n > 0, the algorithm samples a candidate x from a distribution \mu on \Omega and accepts it with probability D_n(C_n, V(C_n), x).

We define Decision as a special case of the Algorithm structure, for which the Markov kernel at iteration n is defined through the decision rules (D_n)_{n \in \mathbb{N}}. Measurability assumptions are required on the decision rules for the kernels to be well-defined and we only support \{0, 1\}-valued decision rules at the moment.

πŸ”—def
Decision.{u_1, u_2} {Ξ± : Type u_1} {Ξ² : Type u_2} [MeasurableSpace Ξ±] [MeasurableSpace Ξ²] (ΞΌ : MeasureTheory.Measure Ξ±) [MeasureTheory.IsProbabilityMeasure ΞΌ] {decision : (n : β„•) β†’ prod_iter_image Ξ± Ξ² n β†’ Set Ξ±} (measurableSet_decision_prod : βˆ€ (n : β„•), MeasurableSet {p | p.2 ∈ decision n p.1}) (h : βˆ€ (n : β„•) (data : prod_iter_image Ξ± Ξ² n), ΞΌ (decision n data) β‰  0) : Algorithm Ξ± Ξ²
Decision.{u_1, u_2} {Ξ± : Type u_1} {Ξ² : Type u_2} [MeasurableSpace Ξ±] [MeasurableSpace Ξ²] (ΞΌ : MeasureTheory.Measure Ξ±) [MeasureTheory.IsProbabilityMeasure ΞΌ] {decision : (n : β„•) β†’ prod_iter_image Ξ± Ξ² n β†’ Set Ξ±} (measurableSet_decision_prod : βˆ€ (n : β„•), MeasurableSet {p | p.2 ∈ decision n p.1}) (h : βˆ€ (n : β„•) (data : prod_iter_image Ξ± Ξ² n), ΞΌ (decision n data) β‰  0) : Algorithm Ξ± Ξ²

The interface for decision-based optimization algorithms.

2.3.Β LIPOπŸ”—

The LIPO algorithm has been introduced in Β (Malherbe and Vayatis,Β 2017)CΓ©dric Malherbe and Nicolas Vayatis,Β 2017. β€œGlobal optimization of Lipschitz functions”. InΒ International conference on machine learning. and is a decision-based global optimization algorithm made for Lipschitz functions. It uses the Lipschitz constant to adaptively construct an upper bound that guides the sampling of the search space. More formally, at each iteration, LIPO samples from the set of potential maximizers defined as \left \{ x \in \alpha \; \middle| \; \max_{1 \le i \le n} f(X_i) \le \min_{1 \le i \le n} f(X_i) + \kappa \|X_i - x \|_2 \right\}, where Ξ± is a pseudo-metric, measurable, Borel, and second-countable space and \kappa is the Lipschitz constant of the function f : \alpha \to \mathbb{R}.

It can be represented in our framework as a special case of the Decision structure, where the decision rules return 1 if the candidate solution is in the set of potential maximizers and 0 otherwise:

noncomputable def LIPO : Algorithm Ξ± ℝ := Decision ΞΌ (fun _ ↦ measurableSet_potential_max_prod ΞΊ) h

Note that LIPO requires the set of potential maximizers to have non-zero measure at each iteration, ensuring that the algorithm can sample from it. This is a non-trivial assumption that depends on the choice of the initial probability measure ΞΌ, the function to optimize, and the \sigma-algebra on the search space.

πŸ”—def
LIPO.{u_1} {Ξ± : Type u_1} [PseudoMetricSpace Ξ±] [MeasurableSpace Ξ±] [BorelSpace Ξ±] [SecondCountableTopology Ξ±] (ΞΌ : MeasureTheory.Measure Ξ±) [MeasureTheory.IsProbabilityMeasure ΞΌ] (ΞΊ : NNReal) (h : βˆ€ (n : β„•) (data : prod_iter_image Ξ± ℝ n), ΞΌ (LIPO.potential_max ΞΊ data) β‰  0) : Algorithm Ξ± ℝ
LIPO.{u_1} {Ξ± : Type u_1} [PseudoMetricSpace Ξ±] [MeasurableSpace Ξ±] [BorelSpace Ξ±] [SecondCountableTopology Ξ±] (ΞΌ : MeasureTheory.Measure Ξ±) [MeasureTheory.IsProbabilityMeasure ΞΌ] (ΞΊ : NNReal) (h : βˆ€ (n : β„•) (data : prod_iter_image Ξ± ℝ n), ΞΌ (LIPO.potential_max ΞΊ data) β‰  0) : Algorithm Ξ± ℝ

The LIPO (LIPschitz Optimization) algorithm for global optimization. This algorithm optimizes an unknown function assuming only that it has a finite Lipschitz constant ΞΊ. It starts with an arbitrary probability measure ΞΌ as initial distribution and iteratively samples from the set of potential maximizers, ensuring consistency and convergence to the global optimum (Malherbe et al., 2017).

2.4.Β RankOptπŸ”—

The RankOpt algorithm has been introduced in Β (Malherbe and Vayatis,Β 2017)CΓ©dric Malherbe and Nicolas Vayatis,Β 2017. β€œA Ranking Approach to Global Optimization”. InΒ International conference on machine learning. and is a sophisticated decision-based global optimization algorithm. It is based on the notion of a ranking rule, which is a function induced by another function f and is defined as r_f (x, y) = \begin{cases} 1 & \text{if } f(x) > f(y) \\ 0 & \text{if } f(x) = f(y) \\ -1 & \text{if } f(x) < f(y) \end{cases}. Ranking rules define equivalence classes of functions that share the same induced ranking. To use RankOpt, one needs to define a countable set of ranking rules \mathcal{R} (e.g. a subset of the ranking rules induced by continuous functions). The algorithm samples points x such that, for any ranking rule r \in \mathcal{R} that is consistent with the observed data, 0 \le r(x, \arg \max_{1 \le i \le n} f(X_i)). The set of such points is the set of potential maximizers of the algorithm and is defined as \left \{x \in \alpha \; \middle| \; \exists r \in \mathcal{R}, \; \mathcal{L}_n(r) = 0 \land 0 \le r(x, \arg \max_{1 \le i \le n} f(X_i))\right\}, where Ξ± is a measurable space and \mathcal{L}_n(r) is the ranking loss of r on the observed data, defined as \mathcal{L}_n(r) \triangleq \frac{2}{n (n + 1)} \sum_{1 \le i \le j \le n} \mathbb{I}\left[r(X_i, X_j) \neq r_f(X_i, X_j)\right]. Note that RankOpt does not require to know r_f explicitly as it is evaluated only on observed data points.

It can be represented in our framework as a special case of the Decision structure, where the decision rules return 1 if the candidate solution is in the set of potential maximizers and 0 otherwise:

noncomputable def RankOpt : Algorithm Ξ± Ξ² := Decision ΞΌ (fun _ ↦ measurableSet_potential_max_prod h𝓑) h

Note that RankOpt requires the set of potential maximizers to have non-zero measure at each iteration, ensuring that the algorithm can sample from it. This is a non-trivial assumption that depends on the choice of the initial probability measure ΞΌ, the function to optimize, and the \sigma-algebra on the search space.

πŸ”—def
RankOpt.{u_1, u_2} {Ξ± : Type u_1} {Ξ² : Type u_2} [MeasurableSpace Ξ±] (ΞΌ : MeasureTheory.Measure Ξ±) [MeasureTheory.IsProbabilityMeasure ΞΌ] [TopologicalSpace Ξ²] [MeasurableSpace Ξ²] [BorelSpace Ξ²] [LinearOrder Ξ²] [SecondCountableTopology Ξ²] [OpensMeasurableSpace Ξ²] [OrderClosedTopology Ξ²] {𝓑 : Set (RankRule Ξ±)} (h𝓑 : 𝓑.Countable) (h : βˆ€ (n : β„•) (data : prod_iter_image Ξ± Ξ² n), ΞΌ (RankOpt.potential_max data 𝓑) β‰  0) : Algorithm Ξ± Ξ²
RankOpt.{u_1, u_2} {Ξ± : Type u_1} {Ξ² : Type u_2} [MeasurableSpace Ξ±] (ΞΌ : MeasureTheory.Measure Ξ±) [MeasureTheory.IsProbabilityMeasure ΞΌ] [TopologicalSpace Ξ²] [MeasurableSpace Ξ²] [BorelSpace Ξ²] [LinearOrder Ξ²] [SecondCountableTopology Ξ²] [OpensMeasurableSpace Ξ²] [OrderClosedTopology Ξ²] {𝓑 : Set (RankRule Ξ±)} (h𝓑 : 𝓑.Countable) (h : βˆ€ (n : β„•) (data : prod_iter_image Ξ± Ξ² n), ΞΌ (RankOpt.potential_max data 𝓑) β‰  0) : Algorithm Ξ± Ξ²

The RankOpt algorithm for global optimization. This algorithm uses a ranking approach to optimize an unknown function. It maintains a hypothesis class 𝓑 of ranking rules. It starts with an arbitrary probability measure ΞΌ as initial distribution and samples from the set of points that could be optimal according to ranking rules consistent with the observed data (Malherbe et al., 2017).

The term RankRule is defined as the subtype of \{-1, 0, 1\}-valued functions that are jointly measurable:

def RankRule (Ξ± : Type*) [MeasurableSpace Ξ±] := {f : Ξ± β†’ Ξ± β†’ ({-1, 0, 1} : Set ℝ) // Measurable <| Function.uncurry f}

2.5.Β CMA-ESπŸ”—

A general implementation of the CMA-ES algorithm in any dimension. As CMA-ES samples \lambda points at each iteration, the input space of the algorithm is \mathbb{R}^{d \times \lambda}, which represents a sequence of \lambda points in \mathbb{R}^{d}. The initial measure is the product of \lambda standard multivariate Gaussian measures on \mathbb{R}^{d}, and the kernel is defined as a product of \lambda multivariate Gaussian measures, where the mean and covariance matrix are given by measurable functions of the past evaluations. These functions can be anything as long as they are measurable w.r.t. the history of the algorithm, thus allowing for any CMA-ES variant to be implemented in this framework.

πŸ”—def
CMA_ES (d lam : β„•) {mean : (n : β„•) β†’ prod_iter_image (ℝ_ d lam) (ℝ_ d lam) n β†’ EuclideanSpace ℝ (Fin d)} (hmean : βˆ€ (n : β„•), Measurable (mean n)) {covar : (n : β„•) β†’ prod_iter_image (ℝ_ d lam) (ℝ_ d lam) n β†’ Matrix (Fin d) (Fin d) ℝ} (hcovar : βˆ€ (n : β„•), Measurable (covar n)) (m : EuclideanSpace ℝ (Fin d)) (S : Matrix (Fin d) (Fin d) ℝ) : Algorithm (ℝ_ d lam) (ℝ_ d lam)
CMA_ES (d lam : β„•) {mean : (n : β„•) β†’ prod_iter_image (ℝ_ d lam) (ℝ_ d lam) n β†’ EuclideanSpace ℝ (Fin d)} (hmean : βˆ€ (n : β„•), Measurable (mean n)) {covar : (n : β„•) β†’ prod_iter_image (ℝ_ d lam) (ℝ_ d lam) n β†’ Matrix (Fin d) (Fin d) ℝ} (hcovar : βˆ€ (n : β„•), Measurable (covar n)) (m : EuclideanSpace ℝ (Fin d)) (S : Matrix (Fin d) (Fin d) ℝ) : Algorithm (ℝ_ d lam) (ℝ_ d lam)

The Covariance Matrix Adaptation - Evolution Strategy (CMA-ES) algorithm for global optimization, given the mean and covariance update rules as measurable functions of the history.

noncomputable def CMA_ES : Algorithm (ℝ_ d lam) (ℝ_ d lam) where Ξ½ := Measure.pi (fun _ ↦ multivariateGaussian m S) kernel_iter := CMAKernel d lam hmean hcovar markov_kernel n := ⟨fun a => d:β„•lam:β„•mean:(n : β„•) β†’ prod_iter_image (ℝ_ d lam) (ℝ_ d lam) n β†’ EuclideanSpace ℝ (Fin d)hmean:βˆ€ (n : β„•), Measurable (mean n)covar:(n : β„•) β†’ prod_iter_image (ℝ_ d lam) (ℝ_ d lam) n β†’ Matrix (Fin d) (Fin d) ℝhcovar:βˆ€ (n : β„•), Measurable (covar n)m:EuclideanSpace ℝ (Fin d)S:Matrix (Fin d) (Fin d) ℝn:β„•a:prod_iter_image (ℝ_ d lam) (ℝ_ d lam) n⊒ IsProbabilityMeasure ((CMAKernel d lam hmean hcovar n) a) d:β„•lam:β„•mean:(n : β„•) β†’ prod_iter_image (ℝ_ d lam) (ℝ_ d lam) n β†’ EuclideanSpace ℝ (Fin d)hmean:βˆ€ (n : β„•), Measurable (mean n)covar:(n : β„•) β†’ prod_iter_image (ℝ_ d lam) (ℝ_ d lam) n β†’ Matrix (Fin d) (Fin d) ℝhcovar:βˆ€ (n : β„•), Measurable (covar n)m:EuclideanSpace ℝ (Fin d)S:Matrix (Fin d) (Fin d) ℝn:β„•a:prod_iter_image (ℝ_ d lam) (ℝ_ d lam) n⊒ IsProbabilityMeasure (Measure.pi fun x => multivariateGaussian (mean n a) (covar n a)); All goals completed! πŸ™βŸ©