ensmallen
flexible C++ library for efficient mathematical optimization

# ensmallen documentation

ensmallen is a library for function optimization. It is devoted to efficiently solving the problem

for many different types of $f(x)$. The documentation is split into three parts:

Everything in ensmallen is in the ens namespace so it is often useful to put a using directive in your code:

using namespace ens;


# Citation

Please cite the following paper if you use ensmallen in your research and/or software.
Citations are useful for the continued development and maintenance of the library.

# Function type documentation

## Arbitrary functions

The least restrictive type of function that can be implemented in ensmallen is a function for which only the objective can be evaluated. For this, a class with the following API must be implemented:

class ArbitraryFunctionType
{
public:
// This should return f(x).
double Evaluate(const arma::mat& x);
};


For this type of function, we assume that the gradient f'(x) is not computable. If it is, see differentiable functions.

The Evaluate() method is allowed to have additional cv-modifiers (static, const, etc.).

The following optimizers can be used to optimize an arbitrary function:

Each of these optimizers has an Optimize() function that is called as Optimize(f, x) where f is the function to be optimized (which implements Evaluate()) and x holds the initial point of the optimization. After Optimize() is called, x will hold the final result of the optimization (that is, the best x found that minimizes f(x)).

#### Example: Linear Regression

An example program that implements the objective function f(x) = 2 |x|^2 is shown below, using the simulated annealing optimizer.

#include <ensmallen.hpp>

class SquaredFunction
{
public:
// This returns f(x) = 2 |x|^2.
double Evaluate(const arma::mat& x)
{
return 2 * std::pow(arma::norm(x), 2.0);
}
};

int main()
{
// The minimum is at x = [0 0 0].  Our initial point is chosen to be
// [1.0, -1.0, 1.0].
arma::mat x("1.0 -1.0 1.0");

// Create simulated annealing optimizer with default options.
// The ens::SA<> type can be replaced with any suitable ensmallen optimizer
// that is able to handle arbitrary functions.
ens::SA<> optimizer;
SquaredFunction f; // Create function to be optimized.
optimizer.Optimize(f, x);

std::cout << "Minimum of squared function found with simulated annealing is "
<< x;
}


## Differentiable functions

Probably the most common type of function that can be optimized with ensmallen is a differentiable function, where both f(x) and f’(x) can be calculated. To optimize a differentiable function with ensmallen, a class must be implemented that follows the API below:

class DifferentiableFunctionType
{
public:
// Given parameters x, return the value of f(x).
double Evaluate(const arma::mat& x);

// Given parameters x and a matrix g, store f'(x) in the provided matrix g.
// g should have the same size (rows, columns) as x.

// of---Evaluate() and Gradient().  If this is the only function implemented,
// implementations of Evaluate() and Gradient() will be automatically
// generated using template metaprogramming.  Often, implementing
// EvaluateWithGradient() can result in more efficient optimizations.
//
// Given parameters x and a matrix g, return the value of f(x) and store
// f'(x) in the provided matrix g.  g should have the same size (rows,
// columns) as x.
double EvaluateWithGradient(const arma::mat& x, arma::mat& g);
};


Note that you may implement either Evaluate() and Gradient() or EvaluateWithGradient(), but it is not mandatory to implement both. (Of course, supplying both is okay too.) It often results in faster code when EvaluateWithGradient() is implemented, especially for functions where f(x) and f’(x) compute some of the same intermediate quantities.

Each of the implemented methods is allowed to have additional cv-modifiers (static, const, etc.).

The following optimizers can be used with differentiable functions:

Each of these optimizers has an Optimize() function that is called as Optimize(f, x) where f is the function to be optimized and x holds the initial point of the optimization. After Optimize() is called, x will hold the final result of the optimization (that is, the best x found that minimizes f(x)).

An example program is shown below. In this program, we optimize a linear regression model. In this setting, we have some matrix data of data points that we’ve observed, and some vector responses of the observed responses to this data. In our model, we assume that each response is the result of a weighted linear combination of the data:

where $x$ is a vector of parameters. This gives the objective function $f(x) = (\operatorname{responses} - x’\operatorname{data})^2$.

In the example program, we optimize this objective function and compare the runtime of an implementation that uses Evaluate() and Gradient(), and the runtime of an implementation that uses EvaluateWithGradient().

#include <ensmallen.hpp>

// Define a differentiable objective function by implementing both Evaluate()
class LinearRegressionFunction
{
public:
// Construct the object with the given data matrix and responses.
LinearRegressionFunction(const arma::mat& dataIn,
const arma::rowvec& responsesIn) :
data(dataIn), responses(responsesIn) { }

// Return the objective function for model parameters x.
double Evaluate(const arma::mat& x)
{
return std::pow(arma::norm(responses - x.t() * data), 2.0);
}

// Compute the gradient for model parameters x.
void Gradient(const arma::mat& x, arma::mat& g)
{
g = -2 * data * (responses - x.t() * data);
}

private:
// The data.
const arma::mat& data;
// The responses to each data point.
const arma::rowvec& responses;
};

// Define the same function, but only implement EvaluateWithGradient().
class LinearRegressionEWGFunction
{
public:
// Construct the object with the given data matrix and responses.
LinearRegressionEWGFunction(const arma::mat& dataIn,
const arma::rowvec& responsesIn) :
data(dataIn), responses(responsesIn) { }

// Simultaneously compute both the objective function and gradient for model
// parameters x.  Note that this is faster than implementing Evaluate() and
// Gradient() individually because it caches the computation of
// (responses - x.t() * data)!
double EvaluateWithGradient(const arma::mat& x, arma::mat& g)
{
const arma::rowvec v = (responses - x.t() * data);
g = -2 * data * v;
return arma::accu(v % v); // equivalent to \| v \|^2
}
};

int main()
{
// We'll run a simple speed comparison between both objective functions.

// First, generate some random data, with 10000 points and 10 dimensions.
// This data has no pattern and as such will make a model that's not very
// useful---but the purpose here is just demonstration. :)
//
// For a more "real world" situation, load a dataset from file using X.load()
// and y.load() (but make sure the matrix is column-major, so that each
// observation/data point corresponds to a *column*, *not* a row.
arma::mat data(10, 10000, arma::fill::randn);
arma::rowvec responses(10000, arma::fill::randn);

// Create a starting point for our optimization randomly.  The model has 10
// parameters, so the shape is 10x1.
arma::mat startingPoint(10, 1, arma::fill::randn);

// We'll use Armadillo's wall_clock class to do a timing comparison.
arma::wall_clock clock;

// Construct the first objective function.
LinearRegressionFunction lrf1(data, responses);
arma::mat lrf1Params(startingPoint);

// Create the L_BFGS optimizer with default parameters.
// The ens::L_BFGS type can be replaced with any ensmallen optimizer that can
// handle differentiable functions.
ens::L_BFGS lbfgs;

// Time how long L-BFGS takes for the first Evaluate() and Gradient()
// objective function.
clock.tic();
lbfgs.Optimize(lrf1, lrf1Params);
const double time1 = clock.toc();

std::cout << "LinearRegressionFunction with Evaluate() and Gradient() took "
<< time1 << " seconds to converge to the model: " << std::endl;
std::cout << lrf1Params.t();

// Create the second objective function, which uses EvaluateWithGradient().
LinearRegressionEWGFunction lrf2(data, responses);
arma::mat lrf2Params(startingPoint);

// Time how long L-BFGS takes for the EvaluateWithGradient() objective
// function.
clock.tic();
lbfgs.Optimize(lrf2, lrf2Params);
const double time2 = clock.toc();

std::cout << "LinearRegressionEWGFunction with EvaluateWithGradient() took "
<< time2 << " seconds to converge to the model: " << std::endl;
std::cout << lrf2Params.t();

// When this runs, the output parameters will be exactly on the same, but the
// LinearRegressionEWGFunction will run more quickly!
}


### Partially differentiable functions

Some differentiable functions have the additional property that the gradient f'(x) can be decomposed along a different axis j such that the gradient is sparse. This makes the most sense in machine learning applications where the function f(x) is being optimized for some dataset data, which has d dimensions. A partially differentiable separable function is partially differentiable with respect to each dimension j of data. This property is useful for coordinate descent type algorithms.

To use ensmallen optimizers to minimize these types of functions, only two functions needs to be added to the differentiable function type:

// Compute the partial gradient f'_j(x) with respect to data coordinate j and
// store it in the sparse matrix g.
void Gradient(const arma::mat& x, const size_t j, arma::sp_mat& g);

// Get the number of features that f(x) can be partially differentiated with.
size_t NumFeatures();


Note: many partially differentiable function optimizers do not require a regular implementation of the Gradient(), so that function may be omitted.

If these functions are implemented, the following partially differentiable function optimizers can be used:

## Arbitrary separable functions

Often, an objective function f(x) may be represented as the sum of many functions:

f(x) = f_0(x) + f_1(x) + ... + f_N(x).


In this function type, we assume the gradient f'(x) is not computable. If it is, see [#Differentiable-separable-functions].

For machine learning tasks, the objective function may be, e.g., the sum of a function taken across many data points. Implementing an arbitrary separable function type in ensmallen is similar to implementing an arbitrary objective function, but with a few extra utility methods:

class ArbitrarySeparableFunctionType
{
public:
// Given parameters x, return the sum of the individual functions
// f_i(x) + ... + f_{i + batchSize - 1}(x).  i will always be greater than 0,
// and i + batchSize will be less than or equal to the value of NumFunctions().
double Evaluate(const arma::mat& x, const size_t i, const size_t batchSize);

// Shuffle the ordering of the functions f_i(x).
// (For machine learning problems, this would be equivalent to shuffling the
// data points, e.g., before an epoch of learning.)
void Shuffle();

// Get the number of functions f_i(x).
// (For machine learning problems, this is often just the number of points in
// the dataset.)
size_t NumFunctions();
};


Each of the implemented methods is allowed to have additional cv-modifiers (static, const, etc.).

The following optimizers can be used with arbitrary separable functions:

Each of these optimizers has an Optimize() function that is called as Optimize(f, x) where f is the function to be optimized and x holds the initial point of the optimization. After Optimize() is called, x will hold the final result of the optimization (that is, the best x found that minimizes f(x)).

Note: using an arbitrary non-separable function optimizer will call Evaluate(x, 0, NumFunctions() - 1); if this is a very computationally intensive operation for your objective function, it may be best to avoid using a non-separable arbitrary function optimizer.

Note: if possible, it’s often better to try and use a gradient-based approach. See [#Differentiable-separable-functions] for separable f(x) where the gradient f’(x) can be computed.

The example program below demonstrates the implementation and use of an arbitrary separable function. The function used is the linear regression objective function, described in differentiable functions. Given some dataset data and responses responses, the linear regression objective function is separable as

where $\operatorname{data}(i)$ represents the data point indexed by $i$ and $\operatorname{responses}(i)$ represents the observed response indexed by $i$.

#include <ensmallen.hpp>

// This class implements the linear regression objective function as an
// arbitrary separable function type.
class LinearRegressionFunction
{
public:
// Create the linear regression function with the given data and the given
// responses.
LinearRegressionFunction(const arma::mat& dataIn,
const arma::rowvec& responsesIn) :
data(data), responses(responses) { }

// Given parameters x, compute the sum of the separable objective
// functions starting with f_i(x) and ending with
// f_{i + batchSize - 1}(x).
double Evaluate(const arma::mat& x, const size_t i, const size_t batchSize)
{
// A more complex implementation could avoid the for loop and use
// submatrices, but it is easier to understand when implemented this way.
double objective = 0.0;
for (size_t j = i; j < i + batchSize; ++j)
{
objective += std::pow(responses[j] - x.t() * data.col(j), 2.0);
}
}

// Shuffle the ordering of the functions f_i(x).  We do this by simply
// shuffling the data and responses.
void Shuffle()
{
// Generate a random ordering of data points.
arma::uvec ordering = arma::shuffle(
arma::linspace<arma::uvec>(0, data.n_cols - 1, data.n_cols));

// This reorders the data and responses with our randomly-generated
// ordering above.
data = data.cols(ordering);
responses = responses.cols(ordering);
}

// Return the number of functions f_i(x).  In our case this is simply the
// number of data points.
size_t NumFunctions() { return data.n_cols; }
};

int main()
{
// First, generate some random data, with 10000 points and 10 dimensions.
// This data has no pattern and as such will make a model that's not very
// useful---but the purpose here is just demonstration. :)
//
// For a more "real world" situation, load a dataset from file using X.load()
// and y.load() (but make sure the matrix is column-major, so that each
// observation/data point corresponds to a *column*, *not* a row.
arma::mat data(10, 10000, arma::fill::randn);
arma::rowvec responses(10000, arma::fill::randn);

// Create a starting point for our optimization randomly.  The model has 10
// parameters, so the shape is 10x1.
arma::mat params(10, 1, arma::fill::randn);

// Use the CMAES optimizer with default parameters to minimize the
// LinearRegressionFunction.
// The ens::CMAES type can be replaced with any suitable ensmallen optimizer
// that can handle arbitrary separable functions.
ens::CMAES cmaes;
LinearRegressionFunction lrf(data, responses);
cmaes.Optimize(lrf, params);

std::cout << "The optimized linear regression model found by CMAES has the "
<< "parameters " << params.t();
}


## Differentiable separable functions

Likely the most important type of function to be optimized in machine learning and some other fields is the differentiable separable function. A differentiable separable function f(x) can be represented as the sum of many functions:

f(x) = f_0(x) + f_1(x) + ... + f_N(x).


And it also has a computable separable gradient f'(x):

f'(x) = f'_0(x) + f'_1(x) + ... + f'_N(x).


For machine learning tasks, the objective function may be, e.g., the sum of a function taken across many data points. Implementing a differentiable separable function type in ensmallen is similar to implementing an ordinary differentiable function, but with a few extra utility methods:

class ArbitrarySeparableFunctionType
{
public:
// Given parameters x, return the sum of the individual functions
// f_i(x) + ... + f_{i + batchSize - 1}(x).  i will always be greater than 0,
// and i + batchSize will be less than or equal to the value of NumFunctions().
double Evaluate(const arma::mat& x, const size_t i, const size_t batchSize);

// Given parameters x and a matrix g, store the sum of the gradient of
// individual functions f'_i(x) + ... + f'_{i + batchSize - 1}(x) into g.  i
// will always be greater than 0, and i + batchSize will be less than or
// equal to the value of NumFunctions().
const size_t i,
arma::mat& g,
const size_t batchSize);

// Shuffle the ordering of the functions f_i(x).
// (For machine learning problems, this would be equivalent to shuffling the
// data points, e.g., before an epoch of learning.)
void Shuffle();

// Get the number of functions f_i(x).
// (For machine learning problems, this is often just the number of points in
// the dataset.)
size_t NumFunctions();

// of---Evaluate() and Gradient().  If this is the only function implemented,
// implementations of Evaluate() and Gradient() will be automatically
// generated using template metaprogramming.  Often, implementing
// EvaluateWithGradient() can result in more efficient optimizations.
//
// Given parameters x and a matrix g, return the sum of the individual
// functions f_i(x) + ... + f_{i + batchSize - 1}(x), and store the sum of
// the gradient of individual functions f'_i(x) + ... +
// f'_{i + batchSize - 1}(x) into the provided matrix g.  g should have the
// same size (rows, columns) as x.  i will always be greater than 0, and i +
// batchSize will be less than or equal to the value of NumFunctions().
const size_t i,
arma::mat& g,
const size_t batchSize);
};


Note that you may implement either Evaluate() and Gradient() or EvaluateWithGradient(), but it is not mandatory to implement both. (Of course, supplying both is okay too.) It often results in faster code when EvaluateWithGradient() is implemented, especially for functions where f(x) and f’(x) compute some of the same intermediate quantities.

Each of the implemented methods is allowed to have additional cv-modifiers (static, const, etc.).

The following optimizers can be used with differentiable functions:

The example program below demonstrates the implementation and use of an arbitrary separable function. The function used is the linear regression objective function, described in differentiable functions. Given some dataset data and responses responses, the linear regression objective function is separable as

f_i(x) = (responses(i) - x' * data(i))^2


where data(i) represents the data point indexed by i and responses(i) represents the observed response indexed by i. This example implementation only implements EvaluateWithGradient() in order to avoid redundant calculations.

#include <ensmallen.hpp>

// This class implements the linear regression objective function as an
// arbitrary separable function type.
class LinearRegressionFunction
{
public:
// Create the linear regression function with the given data and the given
// responses.
LinearRegressionFunction(const arma::mat& dataIn,
const arma::rowvec& responsesIn) :
data(data), responses(responses) { }

// Given parameters x, compute the sum of the separable objective
// functions starting with f_i(x) and ending with
// f_{i + batchSize - 1}(x), and also compute the gradient of those functions
// and store them in g.
const size_t i,
arma::mat& g,
const size_t batchSize)
{
// This slightly complex implementation uses Armadillo submatrices to
// compute the objective functions and gradients simultaneously for
// multiple points.
//
// The shared computation between the objective and gradient is the term
// (response - x * data) so we compute that first, only for points in the
// batch.
const arma::rowvec v = (responses.cols(i, i + batchSize - 1) - x.t() *
data.cols(i, i + batchSize - 1));
g = -2 * data.cols(i, i + batchSize - 1) * v;
return arma::accu(v % v); // equivalent to |v|^2
}

// Shuffle the ordering of the functions f_i(x).  We do this by simply
// shuffling the data and responses.
void Shuffle()
{
// Generate a random ordering of data points.
arma::uvec ordering = arma::shuffle(
arma::linspace<arma::uvec>(0, data.n_cols - 1, data.n_cols));

// This reorders the data and responses with our randomly-generated
// ordering above.
data = data.cols(ordering);
responses = responses.cols(ordering);
}

// Return the number of functions f_i(x).  In our case this is simply the
// number of data points.
size_t NumFunctions() { return data.n_cols; }
};

int main()
{
// First, generate some random data, with 10000 points and 10 dimensions.
// This data has no pattern and as such will make a model that's not very
// useful---but the purpose here is just demonstration. :)
//
// For a more "real world" situation, load a dataset from file using X.load()
// and y.load() (but make sure the matrix is column-major, so that each
// observation/data point corresponds to a *column*, *not* a row.
arma::mat data(10, 10000, arma::fill::randn);
arma::rowvec responses(10000, arma::fill::randn);

// Create a starting point for our optimization randomly.  The model has 10
// parameters, so the shape is 10x1.
arma::mat params(10, 1, arma::fill::randn);

// Use RMSprop to find the best parameters for the linear regression model.
// The type 'ens::RMSprop' can be changed for any ensmallen optimizer able to
// handle differentiable separable functions.
ens::RMSProp rmsprop;
LinearRegressionFunction lrf(data, responses);
rmsprop.Optimize(lrf, params);

std::cout << "The optimized linear regression model found by RMSprop has the"
<< " parameters " << params.t();
}


### Sparse differentiable separable functions

Some differentiable separable functions have the additional property that the gradient f'_i(x) is sparse. When this is true, one additional method can be implemented as part of the class to be optimized:

// Add this definition to use sparse differentiable separable function
// optimizers.  Given x, store the sum of the sparse gradient f'_i(x) + ... +
// f'_{i + batchSize - 1}(x) into the provided matrix g.
const size_t i,
arma::sp_mat& g,
const size_t batchSize);


It’s also possible to instead use templates to provide only one Gradient() function for both sparse and non-sparse optimizers:

// This provides Gradient() for both sparse and non-sparse optimizers.
const size_t i,
const size_t batchSize);


If either of these methods are available, then any ensmallen optimizer that optimizes sparse separable differentiable functions may be used. This includes:

## Categorical functions

A categorical function is a function f(x) where some of the values of x are categorical variables (i.e. they take integer values [0, c - 1] and each value 0, 1, …, c - 1 is unrelated). In this situation, the function is not differentiable, and so only the objective f(x) can be implemented. Therefore the class requirements for a categorical function are exactly the same as for an ArbitraryFunctionType—but for any categorical dimension x_i in x, the value will be in the range [0, c_i - 1] where c_i is the number of categories in dimension x_i.

class CategoricalFunction
{
public:
// Return the objective function for the given parameters x.
double Evaluate(const arma::mat& x);
};


However, when an optimizer’s Optimize() method is called, two additional parameters must be specified, in addition to the function to optimize and the matrix holding the parameters:

• const std::vector<bool>& categoricalDimensions: a vector of length equal to the number of elements in x (the number of dimensions). If an element is true, then the dimension is categorical.

• const arma::Row<size_t>& numCategories: a vector of length equal to the number of elements in x (the number of dimensions). If a dimension is categorical, then the corresponding value in numCategories should hold the number of categories in that dimension.

The following optimizers can be used in this way to optimize a categorical function:

An example program showing usage of categorical optimization is shown below.

#include <ensmallen.hpp>

// An implementation of a simple categorical function.  The parameters can be
// understood as x = [c1 c2 c3].  When c1 = 0, c2 = 2, and c3 = 1, the value of
// f(x) is 0.  In any other case, the value of f(x) is 10.  Therefore, the
// optimum is found at [0, 2, 1].
class SimpleCategoricalFunction
{
public:
// Return the objective function f(x) as described above.
double Evaluate(const arma::mat& x)
{
if (size_t(x[0]) == 0 &&
size_t(x[1]) == 2 &&
size_t(x[2]) == 1)
return 0.0;
else
return 10.0;
}
};

int main()
{
// Create and optimize the categorical function with the GridSearch
// optimizer.  We must also create a std::vector<bool> that holds the types
// of each dimension, and an arma::Row<size_t> that holds the number of
// categories in each dimension.
SimpleCategoricalFunction c;

// We have three categorical dimensions only.
std::vector<bool> categoricalDimensions;
categoricalDimensions.push_back(true);
categoricalDimensions.push_back(true);
categoricalDimensions.push_back(true);

// The first category can take 5 values; the second can take 3; the third can
// take 12.
arma::Row<size_t> numCategories("5 3 12");

// The initial point for our optimization will be to set all categories to 0.
arma::mat params("0 0 0");

// Now create the GridSearch optimizer with default parameters, and run the
// optimization.
// The ens::GridSearch type can be replaced with any ensmallen optimizer that
// is able to handle categorical functions.
ens::GridSearch gs;
gs.Optimize(c, params, categoricalDimensions, numCategories);

std::cout << "The ens::GridSearch optimizer found the optimal parameters to "
<< "be " << params;
}


## Constrained functions

A constrained function is an objective function f(x) that is also subject to some constraints on x. (For instance, perhaps a constraint could be that x is a positive semidefinite matrix.) ensmallen is able to handle differentiable objective functions of this type—so, f'(x) must also be computable. Given some set of constraints c_0(x), … c_M(x), we can re-express our constrained objective function as

f_C(x) = f(x) + c_0(x) + ... + c_M(x)


where the constraint c_i(x) is DBL_MAX if it is not satisfied, and otherwise takes some real value. For a “hard constraint”, we can simply take c_i(x) = 0 when it is satisfied. But allowing c_i(x) to return anything allows us to handle “soft” constraints also.

In order to optimize a constrained function with ensmallen, a class implementing the API below is required.

class ConstrainedFunctionType
{
public:
// Return the objective function f(x) for the given x.
double Evaluate(const arma::mat& x);

// Compute the gradient of f(x) for the given x and store the result in g.
void Gradient(const arma::mat& x, arma::mat& g);

// Get the number of constraints on the objective function.
size_t NumConstraints();

// Evaluate constraint i at the parameters x.  If the constraint is
// unsatisfied, DBL_MAX should be returned.  If the constraint is satisfied,
// any real value can be returned.  The optimizer will add this value to its
// overall objective that it is trying to minimize.  (So, a hard constraint
// can just return 0 if it's satisfied.)
double EvaluateConstraint(const size_t i, const arma::mat& x);

// Evaluate the gradient of constraint i at the parameters x, storing the
// result in the given matrix g.  If this is a hard constraint you can set
// the gradient to 0.  If the constraint is not satisfied, it could be
// direction where the constraint would be satisfied.
void GradientConstraint(const size_t i, const arma::mat& x, arma::mat& g);
};


A constrained function can be optimized with the following optimizers:

### Semidefinite programs

A special class of constrained function is a semidefinite program. ensmallen has support for creating and optimizing semidefinite programs via the ens::SDP<> class. For this, the SDP must be expressed in the primal form:

min_x dot(C, x) such that

dot(A_i, x) = b_i, for i = 1, ..., M;
x >= 0


In this case, A_i and C are square matrices (sparse or dense), and b_i is a scalar.

Once the problem is expressed in this form, a class of type ens::SDP<> can be created. SDP<arma::mat> indicates an SDP with a dense C, and SDP<arma::sp_mat> indicates an SDP with a sparse C. The class has the following useful methods:

• SDP(cMatrixSize, numDenseConstraints, numSparseConstraints): create a new SDP
• size_t NumSparseConstraints(): get number of sparse constraint matrices A_i
• size_t NumDenseConstraints(): get number of dense constraint matrices A_i
• std::vector<arma::mat>& DenseA(): get vector of dense A_i matrices
• std::vector<arma::sp_mat>& SparseA(): get vector of sparse A_i matrices
• arma::vec& DenseB(): get vector of b_i values for dense A_i constraints
• arma::vec& SparseB(): get vector of b_i values for sparse A_i constraints

Once these methods are used to set each A_i matrix and corresponding b_i value, and C objective matrix, the SDP object can be used with any ensmallen SDP solver. The list of SDP solvers is below:

Example code showing how to solve an SDP is given below.

int main()
{
// We will build a toy semidefinite program and then use the PrimalDualSolver to find a solution

// The semi-definite constraint looks like:
//
// [ 1  x_12  x_13  0  0  0  0 ]
// [     1    x_23  0  0  0  0 ]
// [            1   0  0  0  0 ]
// [               s1  0  0  0 ]  >= 0
// [                  s2  0  0 ]
// [                     s3  0 ]
// [                        s4 ]

// x_11 == 0
arma::sp_mat A0(7, 7); A0.zeros();
A0(0, 0) = 1.;

// x_22 == 0
arma::sp_mat A1(7, 7); A1.zeros();
A1(1, 1) = 1.;

// x_33 == 0
arma::sp_mat A2(7, 7); A2.zeros();
A2(2, 2) = 1.;

// x_12 <= -0.1  <==>  x_12 + s1 == -0.1, s1 >= 0
arma::sp_mat A3(7, 7); A3.zeros();
A3(1, 0) = A3(0, 1) = 1.; A3(3, 3) = 2.;

// -0.2 <= x_12  <==>  x_12 - s2 == -0.2, s2 >= 0
arma::sp_mat A4(7, 7); A4.zeros();
A4(1, 0) = A4(0, 1) = 1.; A4(4, 4) = -2.;

// x_23 <= 0.5  <==>  x_23 + s3 == 0.5, s3 >= 0
arma::sp_mat A5(7, 7); A5.zeros();
A5(2, 1) = A5(1, 2) = 1.; A5(5, 5) = 2.;

// 0.4 <= x_23  <==>  x_23 - s4 == 0.4, s4 >= 0
arma::sp_mat A6(7, 7); A6.zeros();
A6(2, 1) = A6(1, 2) = 1.; A6(6, 6) = -2.;

std::vector<arma::sp_mat> ais({A0, A1, A2, A3, A4, A5, A6});

SDP<arma::sp_mat> sdp(7, 7 + 4 + 4 + 4 + 3 + 2 + 1, 0);

for (size_t j = 0; j < 3; j++)
{
// x_j4 == x_j5 == x_j6 == x_j7 == 0
for (size_t i = 0; i < 4; i++)
{
arma::sp_mat A(7, 7); A.zeros();
A(i + 3, j) = A(j, i + 3) = 1;
ais.emplace_back(A);
}
}

// x_45 == x_46 == x_47 == 0
for (size_t i = 0; i < 3; i++)
{
arma::sp_mat A(7, 7); A.zeros();
A(i + 4, 3) = A(3, i + 4) = 1;
ais.emplace_back(A);
}

// x_56 == x_57 == 0
for (size_t i = 0; i < 2; i++)
{
arma::sp_mat A(7, 7); A.zeros();
A(i + 5, 4) = A(4, i + 5) = 1;
ais.emplace_back(A);
}

// x_67 == 0
arma::sp_mat A(7, 7); A.zeros();
A(6, 5) = A(5, 6) = 1;
ais.emplace_back(A);

std::swap(sdp.SparseA(), ais);

sdp.SparseB().zeros();
sdp.SparseB()[0] = sdp.SparseB()[1] = sdp.SparseB()[2] = 1.;
sdp.SparseB()[3] = -0.2; sdp.SparseB()[4] = -0.4;
sdp.SparseB()[5] = 1.; sdp.SparseB()[6] = 0.8;

sdp.C().zeros();
sdp.C()(0, 2) = sdp.C()(2, 0) = 1.;

// That took a long time but we finally set up the problem right!  Now we can
// use the PrimalDualSolver to solve it.
// ens::PrimalDualSolver could be replaced with ens::LRSDP or other ensmallen
// SDP solvers.
PrimalDualSolver<SDP<arma::sp_mat>> solver(sdp);
arma::mat X, Z;
arma::vec ysparse, ydense;
// ysparse, ydense, and Z hold the primal and dual variables found during the
// optimization.
const double obj = solver.Optimize(X, ysparse, ydense, Z);

std::cout << "SDP optimized with objective " << obj << "." << std::endl;
}


# Optimizer documentation

An optimizer for differentiable separable functions.

#### Constructors

• AdaDelta()
• AdaDelta(stepSize)
• AdaDelta(stepSize, batchSize)
• AdaDelta(stepSize, batchSize, rho, epsilon, maxIterations, tolerance, shuffle)
• AdaDelta(stepSize, batchSize, rho, epsilon, maxIterations, tolerance, shuffle, resetPolicy)

#### Attributes

type name description default
double stepSize Step size for each iteration. 1.0
size_t batchSize Number of points to process in one step. 32
double rho Smoothing constant. Corresponding to fraction of gradient to keep at each time step. 0.95
double epsilon Value used to initialise the mean squared gradient parameter. 1e-6
size_t maxIterations Maximum number of iterations allowed (0 means no limit). 100000
double tolerance Maximum absolute tolerance to terminate algorithm. 1e-5
bool shuffle If true, the function order is shuffled; otherwise, each function is visited in linear order. true
bool resetPolicy If true, parameters are reset before every Optimize call; otherwise, their values are retained. true

Attributes of the optimizer may also be changed via the member methods StepSize(), BatchSize(), Rho(), Epsilon(), MaxIterations(), Shuffle(), and ResetPolicy().

#### Examples:

AdaDelta optimizer(1.0, 1, 0.99, 1e-8, 1000, 1e-9, true);

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();
optimizer.Optimize(f, coordinates);


An optimizer for differentiable separable functions.

AdaGrad is an optimizer with parameter-specific learning rates, which are adapted relative to how frequently a parameter gets updated during training. Larger updates for more sparse parameters and smaller updates for less sparse parameters.

#### Constructors

• AdaGrad()
• AdaGrad(stepSize)
• AdaGrad(stepSize, batchSize)
• AdaGrad(stepSize, batchSize, epsilon, maxIterations, tolerance, shuffle)
• AdaGrad(stepSize, batchSize, epsilon, maxIterations, tolerance, shuffle, resetPolicy)

#### Attributes

type name description default
double stepSize Step size for each iteration. 0.01
size_t batchSize Number of points to process in one step. 32
double epsilon Value used to initialise the mean squared gradient parameter. 1e-8
size_t maxIterations Maximum number of iterations allowed (0 means no limit). 100000
double tolerance Maximum absolute tolerance to terminate algorithm. tolerance
bool shuffle If true, the function order is shuffled; otherwise, each function is visited in linear order. true
bool resetPolicy If true, parameters are reset before every Optimize call; otherwise, their values are retained. true

Attributes of the optimizer may also be changed via the member methods StepSize(), BatchSize(), Epsilon(), MaxIterations(), Tolerance(), Shuffle(), and ResetPolicy().

#### Examples:

AdaGrad optimizer(1.0, 1, 1e-8, 1000, 1e-9, true);

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();
optimizer.Optimize(f, coordinates);


An optimizer for differentiable separable functions.

Adam is an an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments.

#### Constructors

• Adam()
• Adam(stepSize, batchSize)
• Adam(stepSize, batchSize, beta1, beta2, eps, maxIterations, tolerance, shuffle)
• Adam(stepSize, batchSize, beta1, beta2, eps, maxIterations, tolerance, shuffle, resetPolicy)

Note that the Adam class is based on the AdamType<UpdateRule> class with UpdateRule = AdamUpdate.

#### Attributes

type name description default
double stepSize Step size for each iteration. 0.001
size_t batchSize Number of points to process in a single step. 32
double beta1 Exponential decay rate for the first moment estimates. 0.9
double beta2 Exponential decay rate for the weighted infinity norm estimates. 0.999
double eps Value used to initialize the mean squared gradient parameter. 1e-8
size_t max_iterations Maximum number of iterations allowed (0 means no limit). 100000
double tolerance Maximum absolute tolerance to terminate algorithm. 1e-5
bool shuffle If true, the function order is shuffled; otherwise, each function is visited in linear order. true
bool resetPolicy If true, parameters are reset before every Optimize call; otherwise, their values are retained. true

The attributes of the optimizer may also be modified via the member methods StepSize(), BatchSize(), Beta1(), Beta2(), Eps(), MaxIterations(), Tolerance(), Shuffle(), and ResetPolicy().

#### Examples

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

Adam optimizer(0.001, 32, 0.9, 0.999, 1e-8, 100000, 1e-5, true);
optimizer.Optimize(f, coordinates);


An optimizer for differentiable separable functions.

AdaMax is simply a variant of Adam based on the infinity norm.

#### Constructors

• AdaMax()
• AdaMax(stepSize, batchSize)
• AdaMax(stepSize, batchSize, beta1, beta2, eps, maxIterations, tolerance, shuffle)
• AdaMax(stepSize, batchSize, beta1, beta2, eps, maxIterations, tolerance, shuffle, resetPolicy)

Note that the AdaMax class is based on the AdamType<UpdateRule> class with UpdateRule = AdaMaxUpdate.

#### Attributes

type name description default
double stepSize Step size for each iteration. 0.001
size_t batchSize Number of points to process in a single step. 32
double beta1 Exponential decay rate for the first moment estimates. 0.9
double beta2 Exponential decay rate for the weighted infinity norm estimates. 0.999
double eps Value used to initialize the mean squared gradient parameter. 1e-8
size_t max_iterations Maximum number of iterations allowed (0 means no limit). 100000
double tolerance Maximum absolute tolerance to terminate algorithm. 1e-5
bool shuffle If true, the function order is shuffled; otherwise, each function is visited in linear order. true
bool resetPolicy If true, parameters are reset before every Optimize call; otherwise, their values are retained. true

The attributes of the optimizer may also be modified via the member methods StepSize(), BatchSize(), Beta1(), Beta2(), Eps(), MaxIterations(), Tolerance(), Shuffle(), and ResetPolicy().

#### Examples

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

AdaMax optimizer(0.001, 32, 0.9, 0.999, 1e-8, 100000, 1e-5, true);
optimizer.Optimize(f, coordinates);


An optimizer for differentiable separable functions.

#### Constructors

• AMSGrad()
• AMSGrad(stepSize, batchSize)
• AMSGrad(stepSize, batchSize, beta1, beta2, eps, maxIterations, tolerance, shuffle)
• AMSGrad(stepSize, batchSize, beta1, beta2, eps, maxIterations, tolerance, shuffle, resetPolicy)

Note that the AMSGrad class is based on the AdamType<UpdateRule> class with UpdateRule = AMSGradUpdate.

#### Attributes

type name description default
double stepSize Step size for each iteration. 0.001
size_t batchSize Number of points to process in a single step. 32
double beta1 Exponential decay rate for the first moment estimates. 0.9
double beta2 Exponential decay rate for the weighted infinity norm estimates. 0.999
double eps Value used to initialize the mean squared gradient parameter. 1e-8
size_t max_iterations Maximum number of iterations allowed (0 means no limit). 100000
double tolerance Maximum absolute tolerance to terminate algorithm. 1e-5
bool shuffle If true, the function order is shuffled; otherwise, each function is visited in linear order. true
bool resetPolicy If true, parameters are reset before every Optimize call; otherwise, their values are retained. true

The attributes of the optimizer may also be modified via the member methods StepSize(), BatchSize(), Beta1(), Beta2(), Eps(), MaxIterations(), Tolerance(), Shuffle(), and ResetPolicy().

#### Examples

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

AMSGrad optimizer(0.001, 32, 0.9, 0.999, 1e-8, 100000, 1e-5, true);
optimizer.Optimize(f, coordinates);


## Augmented Lagrangian

An optimizer for differentiable constrained functions.

The AugLagrangian class implements the Augmented Lagrangian method of optimization. In this scheme, a penalty term is added to the Lagrangian. This method is also called the “method of multipliers”. Internally, the optimizer uses L-BFGS.

#### Constructors

• AugLagrangian(maxIterations, penaltyThresholdFactor sigmaUpdateFactor)

#### Attributes

type name description default
size_t maxIterations Maximum number of iterations allowed (0 means no limit). 1000
double penaltyThresholdFactor When penalty threshold is updated, set it to this multiplied by the penalty. 10.0
double sigmaUpdateFactor When sigma is updated, multiply it by this. 0.25
L_BFGS& lbfgs Internal l-bfgs optimizer. L_BFGS()

The attributes of the optimizer may also be modified via the member methods MaxIterations(), PenaltyThresholdFactor(), SigmaUpdateFactor() and L_BFGS().

/**
* Optimize the function.  The value '1' is used for the initial value of each
* Lagrange multiplier.  To set the Lagrange multipliers yourself, use the
*
* @tparam LagrangianFunctionType Function which can be optimized by this
*     class.
* @param function The function to optimize.
* @param coordinates Output matrix to store the optimized coordinates in.
*/
template<typename LagrangianFunctionType>
bool Optimize(LagrangianFunctionType& function,
arma::mat& coordinates);

/**
* Optimize the function, giving initial estimates for the Lagrange
* multipliers.  The vector of Lagrange multipliers will be modified to
* contain the Lagrange multipliers of the final solution (if one is found).
*
* @tparam LagrangianFunctionType Function which can be optimized by this
*      class.
* @param function The function to optimize.
* @param coordinates Output matrix to store the optimized coordinates in.
* @param initLambda Vector of initial Lagrange multipliers.  Should have
*     length equal to the number of constraints.
* @param initSigma Initial penalty parameter.
*/
template<typename LagrangianFunctionType>
bool Optimize(LagrangianFunctionType& function,
arma::mat& coordinates,
const arma::vec& initLambda,
const double initSigma);


#### Examples

GockenbachFunction f;
arma::mat coordinates = f.GetInitialPoint();

AugLagrangian optimizer;
optimizer.Optimize(f, coords);


## Big Batch SGD

An optimizer for differentiable separable functions.

Big-batch stochastic gradient descent adaptively grows the batch size over time to maintain a nearly constant signal-to-noise ratio in the gradient approximation, so the Big Batch SGD optimizer is able to adaptively adjust batch sizes without user oversight.

#### Constructors

• BigBatchSGD<UpdatePolicy>()
• BigBatchSGD<UpdatePolicy>(stepSize)
• BigBatchSGD<UpdatePolicy>(stepSize, batchSize)
• BigBatchSGD<UpdatePolicy>(stepSize, batchSize, epsilon, maxIterations, tolerance, shuffle)

The UpdatePolicy template parameter refers to the way that a new step size is computed. The AdaptiveStepsize and BacktrackingLineSearch classes are available for use; custom behavior can be achieved by implementing a class with the same method signatures.

For convenience the following typedefs have been defined:

• BBS_Armijo = BigBatchSGD<BacktrackingLineSearch>
• BBS_BB = BigBatchSGD<AdaptiveStepsize>

#### Attributes

type name description default
size_t batchSize Initial batch size. 1000
double stepSize Step size for each iteration. 0.01
double batchDelta Factor for the batch update step. 0.1
size_t maxIterations Maximum number of iterations allowed (0 means no limit). 100000
double tolerance Maximum absolute tolerance to terminate algorithm. 1e-5
bool shuffle If true, the batch order is shuffled; otherwise, each batch is visited in linear order. true

Attributes of the optimizer may also be changed via the member methods BatchSize(), StepSize(), BatchDelta(), MaxIterations(), Tolerance(), and Shuffle().

#### Examples:

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

// Big-Batch SGD with the adaptive stepsize policy.
BBS_BB optimizer(batchSize, 0.01, 0.1, 8000, 1e-4);
optimizer.Optimize(f, coordinates);

// Big-Batch SGD with backtracking line search.
BBS_Armijo optimizer2(batchSize, 0.01, 0.1, 8000, 1e-4);
optimizer2.Optimize(f, coordinates);


## CMAES

An optimizer for separable functions.

CMA-ES (Covariance Matrix Adaptation Evolution Strategy) is a stochastic search algorithm. CMA-ES is a second order approach estimating a positive definite matrix within an iterative procedure using the covariance matrix.

#### Constructors

• CMAES<SelectionPolicyType>()
• CMAES<SelectionPolicyType>(lambda, lowerBound, upperBound)
• CMAES<SelectionPolicyType>(lambda, lowerBound, upperBound, batchSize)
• CMAES<SelectionPolicyType>(lambda, lowerBound, upperBound, batchSize, maxIterations, tolerance, selectionPolicy)

The SelectionPolicyType template parameter refers to the strategy used to compute the (approximate) objective function. The FullSelection and RandomSelection classes are available for use; custom behavior can be achieved by implementing a class with the same method signatures.

For convenience the following types can be used:

• CMAES<> (equivalent to CMAES<FullSelection>): uses all separable functions to compute objective
• ApproxCMAES (equivalent to CMAES<RandomSelection>): uses a small amount of separable functions to compute approximate objective

#### Attributes

type name description default
size_t lambda The population size (0 uses a default size). 0
double lowerBound Lower bound of decision variables. -10.0
double upperBound Upper bound of decision variables. 10.0
size_t batchSize Batch size to use for the objective calculation. 32
size_t maxIterations Maximum number of iterations. 1000
double tolerance Maximum absolute tolerance to terminate algorithm. 1e-5
SelectionPolicyType selectionPolicy Instantiated selection policy used to calculate the objective. SelectionPolicyType()

Attributes of the optimizer may also be changed via the member methods Lambda(), LowerBound(), UpperBound(), BatchSize(), MaxIterations(), Tolerance(), and SelectionPolicy().

The selectionPolicy attribute allows an instantiated SelectionPolicyType to be given. The FullSelection policy has no need to be instantiated and thus the option is not relevant when the CMAES<> optimizer type is being used; the RandomSelection policy has the constructor RandomSelection(fraction) where fraction specifies the percentage of separable functions to use to estimate the objective function.

#### Examples:

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

// CMAES with the FullSelection policy.
CMAES<> optimizer(0, -1, 1, 32, 200, 0.1e-4);
optimizer.Optimize(f, coordinates);

// CMAES with the RandomSelection policy.
ApproxCMAES<> approxOptimizer(batchSize, 0.01, 0.1, 8000, 1e-4);
approxOptimizer.Optimize(f, coordinates);


## CNE

An optimizer for arbitrary functions.

Conventional Neural Evolution is an optimizer that works like biological evolution which selects best candidates based on their fitness scores and creates new generation by mutation and crossover of population.

#### Constructors

• CNE()
• CNE(populationSize, maxGenerations)
• CNE(populationSize, maxGenerations, mutationProb, mutationSize)
• CNE(populationSize, maxGenerations, mutationProb, mutationSize, selectPercent, tolerance, objectiveChange)

#### Attributes

type name description default
size_t populationSize The number of candidates in the population. This should be at least 4 in size. 500
size_t maxGenerations The maximum number of generations allowed for CNE. 5000
double mutationProb Probability that a weight will get mutated. 0.1
double mutationSize The range of mutation noise to be added. This range is between 0 and mutationSize. 0.02
double selectPercent The percentage of candidates to select to become the the next generation. 0.2
double tolerance The final value of the objective function for termination. If set to negative value, tolerance is not considered. 1e-5
double objectiveChange Minimum change in best fitness values between two consecutive generations should be greater than threshold. If set to negative value, objectiveChange is not considered. 1e-5

Attributes of the optimizer may also be changed via the member methods PopulationSize(), MaxGenerations(), MutationProb(), SelectPercent(), Tolerance(), and ObjectiveChange().

#### Examples:

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

CNE optimizer(200, 10000, 0.2, 0.2, 0.3, 65, 0.1e-4);
optimizer.Optimize(f, coordinates);


## Eve

An optimizer for differentiable separable functions.

Eve is a stochastic gradient based optimization method with locally and globally adaptive learning rates.

#### Constructors

• Eve()
• Eve(stepSize, batchSize)
• Eve(stepSize, batchSize, beta1, beta2, beta3, epsilon, clip, maxIterations, tolerance, shuffle)

#### Attributes

type name description default
double stepSize Step size for each iteration. 0.001
size_t batchSize Number of points to process in a single step. 32
double beta1 Exponential decay rate for the first moment estimates. 0.9
double beta2 Exponential decay rate for the weighted infinity norm estimates. 0.999
double beta3 Exponential decay rate for relative change. 0.999
double epsilon Value used to initialize the mean squared gradient parameter. 1e-8
double clip Clipping range to avoid extreme valus. 10
size_t max_iterations Maximum number of iterations allowed (0 means no limit). 100000
double tolerance Maximum absolute tolerance to terminate algorithm. 1e-5
bool shuffle If true, the function order is shuffled; otherwise, each function is visited in linear order. true

The attributes of the optimizer may also be modified via the member methods StepSize(), BatchSize(), Beta1(), Beta2(), Beta3(), Epsilon(), Clip(), MaxIterations(), Tolerance(), and Shuffle().

#### Examples

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

Eve optimizer(0.001, 32, 0.9, 0.999, 0.999, 10, 1e-8, 100000, 1e-5, true);
optimizer.Optimize(f, coordinates);


## Frank-Wolfe

An optimizer for differentiable functions that may also be constrained.

Frank-Wolfe is a technique to minimize a continuously differentiable convex function f over a compact convex subset D of a vector space. It is also known as conditional gradient method.

#### Constructors

• FrankWolfe<LinearConstrSolverType, UpdateRuleType>(linearConstrSolver, updateRule)
• FrankWolfe<LinearConstrSolverType, UpdateRuleType>(linearConstrSolver, updateRule, maxIterations, tolerance)

The LinearConstrSolverType template parameter specifies the constraint domain D for the problem. The ConstrLpBallSolver and ConstrStructGroupSolver<GroupLpBall> classes are available for use; the former restricts D to the unit ball of the specified l-p norm. Other constraint types may be implemented as a class with the same method signatures as either of the existing classes.

The UpdateRuleType template parameter specifies the update rule used by the optimizer. The UpdateClassic and UpdateLineSearch classes are available for use and represent a simple update step rule and a line search based update rule, respectively. The UpdateSpan and UpdateFulLCorrection classes are also available and may be used with the FuncSq function class (which is a squared matrix loss).

For convenience the following typedefs have been defined:

• OMP (equivalent to FrankWolfe<ConstrLpBallSolver, UpdateSpan>): a solver for the orthogonal matching pursuit problem
• StandardFrankWolfe (equivalent to FrankWolfe<ConstrLpBallSolver, ClassicUpdate>): the standard Frank-Wolfe algorithm with the solution restricted to lie within the unit ball

#### Attributes

type name description default
LinearConstrSolverType linearConstrSolver Solver for linear constrained problem. n/a
UpdateRuleType updateRule Rule for updating solution in each iteration. n/a
size_t maxIterations Maximum number of iterations allowed (0 means no limit). 100000
size_t tolerance Maximum absolute tolerance to terminate algorithm. 1e-10

Attributes of the optimizer may also be changed via the member methods LinearConstrSolver(), UpdateRule(), MaxIterations(), and Tolerance().

#### Examples:

TODO

An optimizer for differentiable separable functions.

Follow the Moving Leader (FTML) is an optimizer where recent samples are weighted more heavily in each iteration, so FTML can adapt more quickly to changes.

#### Constructors

• FTML()
• FTML(stepSize, batchSize)
• FTML(stepSize, batchSize, beta1, beta2, epsilon, maxIterations, tolerance, shuffle)
• FTML(stepSize, batchSize, beta1, beta2, epsilon, maxIterations, tolerance, shuffle, resetPolicy)

#### Attributes

type name description default
double stepSize Step size for each iteration. 0.001
size_t batchSize Number of points to process in a single step. 32
double beta1 Exponential decay rate for the first moment estimates. 0.9
double beta2 Exponential decay rate for the weighted infinity norm estimates. 0.999
double eps Value used to initialize the mean squared gradient parameter. 1e-8
size_t max_iterations Maximum number of iterations allowed (0 means no limit). 100000
double tolerance Maximum absolute tolerance to terminate algorithm. 1e-5
bool shuffle If true, the function order is shuffled; otherwise, each function is visited in linear order. true
bool resetPolicy If true, parameters are reset before every Optimize call; otherwise, their values are retained. true

The attributes of the optimizer may also be modified via the member methods StepSize(), BatchSize(), Beta1(), Beta2(), Epsilon(), MaxIterations(), Tolerance(), Shuffle(), and ResetPolicy().

#### Examples

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

FTML optimizer(0.001, 32, 0.9, 0.999, 1e-8, 100000, 1e-5, true);
optimizer.Optimize(f, coordinates);


An optimizer for differentiable functions.

Gradient Descent is a technique to minimize a function. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient of the function at the current point.

#### Constructors

• GradientDescent()
• GradientDescent(stepSize)
• GradientDescent(stepSize, maxIterations, tolerance)

#### Attributes

type name description default
double stepSize Step size for each iteration. 0.01
size_t maxIterations Maximum number of iterations allowed (0 means no limit). 100000
size_t tolerance Maximum absolute tolerance to terminate algorithm. 1e-5

Attributes of the optimizer may also be changed via the member methods StepSize(), MaxIterations(), and Tolerance().

#### Examples:

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

optimizer.Optimize(f, coordinates);


An optimizer for categorical functions.

An optimizer that finds the minimum of a given function by iterating through points on a multidimensional grid.

#### Constructors

• GridSearch()

#### Attributes

The GridSearch class has no configurable attributes.

Note: the GridSearch class can only optimize categorical functions where every parameter is categorical.

## Hogwild! (Parallel SGD)

An optimizer for sparse differentiable separable functions.

An implementation of parallel stochastic gradient descent using the lock-free HOGWILD! approach. This implementation requires OpenMP to be enabled during compilation (i.e., -fopenmp specified as a compiler flag).

Note that the requirements for Hogwild! are slightly different than for most differentiable separable functions but it is often possible to use Hogwild! by implementing Gradient() with a template parameter. See the sparse differentiable separable functions documentation for more details.

#### Constructors

• ParallelSGD<DecayPolicyType>(maxIterations, threadShareSize)
• ParallelSGD<DecayPolicyType>(maxIterations, threadShareSize, tolerance, shuffle, decayPolicy)

The DecayPolicyType template parameter specifies the policy used to update the step size after each iteration. The ConstantStep class is available for use. Custom behavior can be achieved by implementing a class with the same method signatures.

The default type for DecayPolicyType is ConstantStep, so the shorter type ParallelSGD<> can be used instead of the equivalent ParallelSGD<ConstantStep>.

#### Attributes

type name description default
size_t maxIterations Maximum number of iterations allowed (0 means no limit). n/a
size_t threadShareSize Number of datapoints to be processed in one iteration by each thread. n/a
double tolerance Maximum absolute tolerance to terminate the algorithm. 1e-5
bool shuffle If true, the function order is shuffled; otherwise, each function is visited in linear order. true
DecayPolicyType decayPolicy An instantiated step size update policy to use. DecayPolicyType()

Attributes of the optimizer may also be modified via the member methods MaxIterations(), ThreadShareSize(), Tolerance(), Shuffle(), and DecayPolicy().

Note that the default value for decayPolicy is the default constructor for the DecayPolicyType.

#### Examples

GeneralizedRosenbrockFunction f(50); // 50-dimensional Rosenbrock function.
arma::mat coordinates = f.GetInitialPoint();

ParallelSGD<> optimizer(100000, f.NumFunctions(), 1e-5, true);
optimizer.Optimize(f, coordinates);


## IQN

An optimizer for differentiable separable functions.

The Incremental Quasi-Newton belongs to the family of stochastic and incremental methods that have a cost per iteration independent of n. IQN iterations are a stochastic version of BFGS iterations that use memory to reduce the variance of stochastic approximations.

#### Constructors

• IQN()
• IQN(stepSize)
• IQN(stepSize, batchSize, maxIterations, tolerance)

#### Attributes

type name description default
double stepSize Step size for each iteration. 0.01
size_t batchSize Size of each batch. 10
size_t maxIterations Maximum number of iterations allowed (0 means no limit). 100000
size_t tolerance Maximum absolute tolerance to terminate algorithm. 1e-5

Attributes of the optimizer may also be changed via the member methods StepSize(), BatchSize(), MaxIterations(), and Tolerance().

#### Examples:

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

IQN optimizer(0.01, 1, 5000, 1e-5);
optimizer.Optimize(f, coordinates);


## Katyusha

An optimizer for differentiable separable functions.

Katyusha is a direct, primal-only stochastic gradient method which uses a “negative momentum” on top of Nesterovâ€™s momentum. Two types are available—one that uses a proximal update step, and one that uses the standard update step.

#### Constructors

• KatyushaType<proximal>()
• KatyushaType<proximal>(_convexity, lipschitz_)
• KatyushaType<proximal>(_convexity, lipschitz, batchSize_)
• KatyushaType<proximal>(convexity, lipschitz, batchSize, maxIterations, innerIterations, tolerance, shuffle)

The proximal template parameter is a boolean value (true or false) that specifies whether or not the proximal update should be used.

For convenience the following typedefs have been defined:

• Katyusha (equivalent to KatyushaType<false>): Katyusha with the standard update step
• KatyushaProximal (equivalent to KatyushaType<true>): Katyusha with the proximal update step

#### Attributes

type name description default
double convexity The regularization parameter. 1.0
double lipschitz The Lipschitz constant. 10.0
size_t batchSize Batch size to use for each step. 32
size_t maxIterations Maximum number of iterations allowed (0 means no limit). 1000
size_t innerIterations The number of inner iterations allowed (0 means n / batchSize). Note that the full gradient is only calculated in the outer iteration. 0
double tolerance Maximum absolute tolerance to terminate algorithm. 1e-5
bool shuffle If true, the function order is shuffled; otherwise, each function is visited in linear order. true

Attributes of the optimizer may also be changed via the member methods Convexity(), Lipschitz(), BatchSize(), MaxIterations(), InnerIterations(), Tolerance(), and Shuffle().

#### Examples:

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

// Without proximal update.
Katyusha optimizer(1.0, 10.0, 1, 100, 0, 1e-10, true);
optimizer.Optimize(f, coordinates);

// With proximal update.
KatyushaProximal proximalOptimizer(1.0, 10.0, 1, 100, 0, 1e-10, true);
proximalOptimizer.Optimize(f, coordinates);


## L-BFGS

An optimizer for differentiable functions

L-BFGS is an optimization algorithm in the family of quasi-Newton methods that approximates the Broydenâ€“Fletcherâ€“Goldfarbâ€“Shanno (BFGS) algorithm using a limited amount of computer memory.

#### Constructors

• L_BFGS()
• L_BFGS(numBasis, maxIterations)
• L_BFGS(numBasis, maxIterations, armijoConstant, wolfe, minGradientNorm, factr, maxLineSearchTrials)
• L_BFGS(numBasis, maxIterations, armijoConstant, wolfe, minGradientNorm, factr, maxLineSearchTrials, minStep, maxStep)

#### Attributes

type name description default
size_t numBasis Number of memory points to be stored (default 10). 10
size_t maxIterations Maximum number of iterations for the optimization (0 means no limit and may run indefinitely). 10000
double armijoConstant Controls the accuracy of the line search routine for determining the Armijo condition. 1e-4
double wolfe Parameter for detecting the Wolfe condition. 0.9
double minGradientNorm Minimum gradient norm required to continue the optimization. 1e-6
double factr Minimum relative function value decrease to continue the optimization. 1e-15
size_t maxLineSearchTrials The maximum number of trials for the line search (before giving up). 50
double minStep The minimum step of the line search. 1e-20
double maxStep The maximum step of the line search. 1e20

Attributes of the optimizer may also be changed via the member methods NumBasis(), MaxIterations(), ArmijoConstant(), Wolfe(), MinGradientNorm(), Factr(), MaxLineSearchTrials(), MinStep(), and MaxStep().

#### Examples:

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

L_BFGS optimizer(20);
optimizer.Optimize(f, coordinates);


## LRSDP (low-rank SDP solver)

An optimizer for semidefinite programs.

LRSDP is the implementation of Monteiro and Burer’s formulation of low-rank semidefinite programs (LR-SDP). This solver uses the augmented Lagrangian optimizer to solve low-rank semidefinite programs.

The assumption here is that the solution matrix for the SDP is low-rank. If this assumption is not true, the algorithm should not be expected to converge.

#### Constructors

• LRSDP<SDPType>()

The SDPType template parameter specifies the type of SDP to solve. The SDP<arma::mat> and SDP<arma::sp_mat> classes are available for use; these represent SDPs with dense and sparse C matrices, respectively. The SDP<> class is detailed in the semidefinite program documentation.

Once the LRSDP<> object is constructed, the SDP may be specified by calling the SDP() member method, which returns a reference to the SDPType.

#### Attributes

The attributes of the LRSDP optimizer may only be accessed via member methods.

type method name description default
size_t MaxIterations() Maximum number of iterations before termination. 1000
AugLagrangian AugLag() The internally-held Augmented Lagrangian optimizer. n/a

## Momentum SGD

An optimizer for differentiable separable functions.

Stochastic Gradient Descent is a technique for minimizing a function which can be expressed as a sum of other functions. This is an SGD variant that uses momentum for its updates. Using momentum updates for parameter learning can accelerate the rate of convergence, specifically in the cases where the surface curves much more steeply (a steep hilly terrain with high curvature).

#### Constructors

• MomentumSGD()
• MomentumSGD(stepSize, batchSize)
• MomentumSGD(stepSize, batchSize, maxIterations, tolerance, shuffle)
• MomentumSGD(stepSize, batchSize, maxIterations, tolerance, shuffle, momentumPolicy)

Note that MomentumSGD is based on the templated type SGD<UpdatePolicyType, DecayPolicyType> with UpdatePolicyType = MomentumUpdate and DecayPolicyType = NoDecay.

#### Attributes

type name description default
double stepSize Step size for each iteration. 0.01
size_t batchSize Batch size to use for each step. 32
size_t maxIterations Maximum number of iterations allowed (0 means no limit). 100000
double tolerance Maximum absolute tolerance to terminate algorithm. 1e-5
bool shuffle If true, the function order is shuffled; otherwise, each function is visited in linear order. true
MomentumUpdate updatePolicy An instantiated MomentumUpdate. MomentumUpdate()

Attributes of the optimizer may also be modified via the member methods StepSize(), BatchSize(), MaxIterations(), Tolerance(), Shuffle(), and UpdatePolicy().

Note that the MomentumUpdate class has the constructor MomentumUpdate(momentum) with a default value of 0.5 for the momentum.

#### Examples

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

MomentumSGD optimizer(0.01, 32, 100000, 1e-5, true, MomentumUpdate(0.5));
optimizer.Optimize(f, coordinates);


An optimizer for differentiable separable functions.

Nadam is a variant of Adam based on NAG (Nesterov accelerated gradient). It uses Nesterov momentum for faster convergence.

#### Constructors

• Nadam()
• Nadam(stepSize, batchSize)
• Nadam(stepSize, batchSize, beta1, beta2, eps, maxIterations, tolerance, shuffle)
• Nadam(stepSize, batchSize, beta1, beta2, eps, maxIterations, tolerance, shuffle, resetPolicy)

Note that the Nadam class is based on the AdamType<UpdateRule> class with UpdateRule = NadamUpdate.

#### Attributes

type name description default
double stepSize Step size for each iteration. 0.001
size_t batchSize Number of points to process in a single step. 32
double beta1 Exponential decay rate for the first moment estimates. 0.9
double beta2 Exponential decay rate for the weighted infinity norm estimates. 0.999
double eps Value used to initialize the mean squared gradient parameter. 1e-8
size_t max_iterations Maximum number of iterations allowed (0 means no limit). 100000
double tolerance Maximum absolute tolerance to terminate algorithm. 1e-5
bool shuffle If true, the function order is shuffled; otherwise, each function is visited in linear order. true
bool resetPolicy If true, parameters are reset before every Optimize call; otherwise, their values are retained. true

The attributes of the optimizer may also be modified via the member methods StepSize(), BatchSize(), Beta1(), Beta2(), Eps(), MaxIterations(), Tolerance(), Shuffle(), and ResetPolicy().

#### Examples

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

Nadam optimizer(0.001, 32, 0.9, 0.999, 1e-8, 100000, 1e-5, true);
optimizer.Optimize(f, coordinates);


An optimizer for differentiable separable functions.

NadaMax is a variant of AdaMax based on NAG (Nesterov accelerated gradient). It uses Nesterov momentum for faster convergence.

#### Constructors

• NadaMax()
• NadaMax(stepSize, batchSize)
• NadaMax(stepSize, batchSize, beta1, beta2, eps, maxIterations, tolerance, shuffle)
• NadaMax(stepSize, batchSize, beta1, beta2, eps, maxIterations, tolerance, shuffle, resetPolicy)

Note that the NadaMax class is based on the AdamType<UpdateRule> class with UpdateRule = NadaMaxUpdate.

#### Attributes

type name description default
double stepSize Step size for each iteration. 0.001
size_t batchSize Number of points to process in a single step. 32
double beta1 Exponential decay rate for the first moment estimates. 0.9
double beta2 Exponential decay rate for the weighted infinity norm estimates. 0.999
double eps Value used to initialize the mean squared gradient parameter. 1e-8
size_t max_iterations Maximum number of iterations allowed (0 means no limit). 100000
double tolerance Maximum absolute tolerance to terminate algorithm. 1e-5
bool shuffle If true, the function order is shuffled; otherwise, each function is visited in linear order. true
bool resetPolicy If true, parameters are reset before every Optimize call; otherwise, their values are retained. true

The attributes of the optimizer may also be modified via the member methods StepSize(), BatchSize(), Beta1(), Beta2(), Eps(), MaxIterations(), Tolerance(), Shuffle(), and ResetPolicy().

#### Examples

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

NadaMax optimizer(0.001, 32, 0.9, 0.999, 1e-8, 100000, 1e-5, true);
optimizer.Optimize(f, coordinates);


## Nesterov Momentum SGD

An optimizer for differentiable separable functions.

Stochastic Gradient Descent is a technique for minimizing a function which can be expressed as a sum of other functions. This is an SGD variant that uses Nesterov momentum for its updates. Nesterov Momentum application can accelerate the rate of convergence to O(1/k^2).

#### Constructors

• NesterovMomentumSGD()
• NesterovMomentumSGD(stepSize, batchSize)
• NesterovMomentumSGD(stepSize, batchSize, maxIterations, tolerance, shuffle)
• NesterovMomentumSGD(stepSize, batchSize, maxIterations, tolerance, shuffle, momentumPolicy)

Note that MomentumSGD is based on the templated type SGD<UpdatePolicyType, DecayPolicyType> with UpdatePolicyType = NesterovMomentumUpdate and DecayPolicyType = NoDecay.

#### Attributes

type name description default
double stepSize Step size for each iteration. 0.01
size_t batchSize Batch size to use for each step. 32
size_t maxIterations Maximum number of iterations allowed (0 means no limit). 100000
double tolerance Maximum absolute tolerance to terminate algorithm. 1e-5
bool shuffle If true, the function order is shuffled; otherwise, each function is visited in linear order. true
NesterovMomentumUpdate updatePolicy An instantiated MomentumUpdate. NesterovMomentumUpdate()

Attributes of the optimizer may also be modified via the member methods StepSize(), BatchSize(), MaxIterations(), Tolerance(), Shuffle(), and UpdatePolicy().

Note that the NesterovMomentumUpdate class has the constructor MomentumUpdate(momentum) with a default value of 0.5 for the momentum.

#### Examples

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

NesterovMomentumSGD optimizer(0.01, 32, 100000, 1e-5, true,
MomentumUpdate(0.5));
optimizer.Optimize(f, coordinates);


An optimizer for differentiable separable functions.

OptimisticAdam is an optimizer which implements the Optimistic Adam algorithm which uses Optmistic Mirror Descent with the Adam Optimizer. It addresses the problem of limit cycling while training GANs (generative adversarial networks). It uses OMD to achieve faster regret rates in solving the zero sum game of training a GAN. It consistently achieves a smaller KL divergnce with~ respect to the true underlying data distribution. The implementation here can be used with any differentiable separable function, not just GAN training.

#### Constructors

• OptimisticAdam()
• OptimisticAdam(stepSize, batchSize)
• OptimisticAdam(stepSize, batchSize, beta1, beta2, eps, maxIterations, tolerance, shuffle)
• OptimisticAdam(stepSize, batchSize, beta1, beta2, eps, maxIterations, tolerance, shuffle, resetPolicy)

Note that the OptimisticAdam class is based on the AdamType<UpdateRule> class with UpdateRule = OptimisticAdamUpdate.

#### Attributes

type name description default
double stepSize Step size for each iteration. 0.001
size_t batchSize Number of points to process in a single step. 32
double beta1 Exponential decay rate for the first moment estimates. 0.9
double beta2 Exponential decay rate for the weighted infinity norm estimates. 0.999
double eps Value used to initialize the mean squared gradient parameter. 1e-8
size_t max_iterations Maximum number of iterations allowed (0 means no limit). 100000
double tolerance Maximum absolute tolerance to terminate algorithm. 1e-5
bool shuffle If true, the function order is shuffled; otherwise, each function is visited in linear order. true
bool resetPolicy If true, parameters are reset before every Optimize call; otherwise, their values are retained. true

The attributes of the optimizer may also be modified via the member methods StepSize(), BatchSize(), Beta1(), Beta2(), Eps(), MaxIterations(), Tolerance(), Shuffle(), and ResetPolicy().

#### Examples

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

OptimisticAdam optimizer(0.001, 32, 0.9, 0.999, 1e-8, 100000, 1e-5, true);
optimizer.Optimize(f, coordinates);


An optimizer for differentiable separable functions.

#### Constructors

• Padam()
• Padam(stepSize, batchSize)
• Padam(stepSize, batchSize, beta1, beta2, partial, epsilon, maxIterations, tolerance, shuffle, resetPolicy)

#### Attributes

type name description default
double stepSize Step size for each iteration. 0.001
size_t batchSize Number of points to process in a single step. 32
double beta1 Exponential decay rate for the first moment estimates. 0.9
double beta2 Exponential decay rate for the weighted infinity norm estimates. 0.999
double partial Partially adaptive parameter. 0.25
double eps Value used to initialize the mean squared gradient parameter. 1e-8
size_t max_iterations Maximum number of iterations allowed (0 means no limit). 100000
double tolerance Maximum absolute tolerance to terminate algorithm. 1e-5
bool shuffle If true, the function order is shuffled; otherwise, each function is visited in linear order. true
bool resetPolicy If true, parameters are reset before every Optimize call; otherwise, their values are retained. true

The attributes of the optimizer may also be modified via the member methods StepSize(), BatchSize(), Beta1(), Beta2(), Partial(), Epsilon(), MaxIterations(), Tolerance(), Shuffle(), and ResetPolicy().

#### Examples

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

Padam optimizer(0.001, 32, 0.9, 0.999, 0.25, 1e-8, 100000, 1e-5, true);
optimizer.Optimize(f, coordinates);


## Primal-dual SDP Solver

An optimizer for semidefinite programs.

A primal-dual interior point method solver. This can solve semidefinite programs.

#### Constructors

• PrimalDualSolver<SDPType>(sdp)
• PrimalDualSolver<SDPType>(sdp, initialX, initialYSparse, initialYDense, initialZ)

The SDPType template parameter specifies the type of SDP to solve. The SDP<arma::mat> and SDP<arma::sp_mat> classes are available for use; these represent SDPs with dense and sparse C matrices, respectively. The SDP<> class is detailed in the semidefinite program documentation.

#### Attributes

The PrimalDualSolver<> class has several attributes that are only modifiable as member methods.

type method name description default
double Tau() Value of tau used to compute alpha_hat. 0.99
double NormXZTol() Tolerance for the norm of X*Z. 1e-7
double PrimalInfeasTol() Tolerance for primal infeasibility. 1e-7
double DualInfeasTol() Tolerance for dual infeasibility. 1e-7
size_t MaxIterations() Maximum number of iterations before convergence. 1000

#### Optimization

The PrimalDualSolver<> class offers two overloads of Optimize() that optionally return the converged values for the dual variables.

/**
* Invoke the optimization procedure, returning the converged values for the
* primal and dual variables.
*/
double Optimize(arma::mat& X,
arma::vec& ySparse,
arma::vec& yDense,
arma::mat& Z);

/**
* Invoke the optimization procedure, and only return the primal variable.
*/
double Optimize(arma::mat& X);


## RMSProp

An optimizer for differentiable separable functions.

#### Constructors

• RMSProp()
• RMSProp(stepSize, batchSize)
• RMSProp(stepSize, batchSize, alpha, epsilon, maxIterations, tolerance, shuffle)
• RMSProp(stepSize, batchSize, alpha, epsilon, maxIterations, tolerance, shuffle, resetPolicy)

#### Attributes

type name description default
double stepSize Step size for each iteration. 0.01
size_t batchSize Number of points to process in each step. 32
double alpha Smoothing constant, similar to that used in AdaDelta and momentum methods. 0.99
double epsilon Value used to initialise the mean squared gradient parameter. 1e-8
size_t maxIterations Maximum number of iterations allowed (0 means no limit). 100000
double tolerance Maximum absolute tolerance to terminate algorithm.
bool shuffle If true, the function order is shuffled; otherwise, each function is visited in linear order. true
bool resetPolicy If true, parameters are reset before every Optimize call; otherwise, their values are retained. true

Attributes of the optimizer can also be modified via the member methods StepSize(), BatchSize(), Alpha(), Epsilon(), MaxIterations(), Tolerance(), Shuffle(), and ResetPolicy().

#### Examples:

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

RMSProp optimizer(1e-3, 1, 0.99, 1e-8, 5000000, 1e-9, true);
optimizer.Optimize(f, coordinates);


## Simulated Annealing (SA)

An optimizer for arbitrary functions.

Simulated Annealing is an stochastic optimization algorithm which is able to deliver near-optimal results quickly without knowing the gradient of the function being optimized. It has a unique hill climbing capability that makes it less vulnerable to local minima. This implementation uses exponential cooling schedule and feedback move control by default, but the cooling schedule can be changed via a template parameter.

#### Constructors

• SA<CoolingScheduleType>(coolingSchedule)
• SA<CoolingScheduleType>(coolingSchedule, maxIterations)
• SA<CoolingScheduleType>(coolingSchedule, maxIterations, initT, initMoves, moveCtrlSweep, tolerance, maxToleranceSweep, maxMoveCoef, initMoveCoef, gain)

The CoolingScheduleType template parameter implements a policy to update the temperature. The ExponentialSchedule class is available for use; it has a constructor ExponentialSchedule(lambda) where lambda is the cooling speed (default 0.001). Custom schedules may be created by implementing a class with at least the single member method below:

// Return the next temperature given the current system status.
double NextTemperature(const double currentTemperature,
const double currentEnergy);


For convenience, the default cooling schedule is ExponentialSchedule, so the shorter type SA<> may be used instead of the equivalent SA<ExponentialSchedule>.

#### Attributes

type name description default
CoolingScheduleType coolingSchedule Instantiated cooling schedule (default ExponentialSchedule). n/a
size_t maxIterations Maximum number of iterations allowed (0 indicates no limit). 1000000
double initT Initial temperature. 10000.0
size_t initMoves Number of initial iterations without changing temperature. 1000
size_t moveCtrlSweep Sweeps per feedback move control. 100
double tolerance Tolerance to consider system frozen. 1e-5
size_t maxToleranceSweep Maximum sweeps below tolerance to consider system frozen. 3
double maxMoveCoef Maximum move size. 20
double initMoveCoef Initial move size. 0.3
double gain Proportional control in feedback move control. 0.3

Attributes of the optimizer may also be changed via the member methods CoolingSchedule(), MaxIterations(), InitT(), InitMoves(), MoveCtrlSweep(), Tolerance(), MaxToleranceSweep(), MaxMoveCoef(), InitMoveCoef(), and Gain().

#### Examples:

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

SA<> optimizer(ExponentialSchedule(), 1000000, 1000., 1000, 100, 1e-10, 3, 1.5,
0.5, 0.3);
optimizer.Optimize(f, coordinates);


## Simultaneous Perturbation Stochastic Approximation (SPSA)

An optimizer for differentiable separable functions.

The SPSA algorithm approximates the gradient of the function by finite differences along stochastic directions.

### Constructors

• SPSA(alpha, batchSize, gamma, stepSize, evaluationStepSize, maxIterations, tolerance, shuffle)

#### Attributes

type name description default
double alpha Scaling exponent for the step size. 0.602
size_t batchSize Batch size to use for each step. 32
double gamma Scaling exponent for evaluation step size. 0.101
double stepSize Scaling parameter for step size (named as ‘a’ in the paper). 0.16
double evaluationStepSize Scaling parameter for evaluation step size (named as ‘c’ in the paper). 0.3
size_t maxIterations Maximum number of iterations allowed (0 means no limit). 100000
double tolerance Maximum absolute tolerance to terminate algorithm. 1e-5
bool shuffle If true, the function order is shuffled; otherwise, each function is visited in linear order. true

Attributes of the optimizer may also be changed via the member methods Alpha(), BatchSize(), Gamma(), StepSize(), EvaluationStepSize(), and MaxIterations().

#### Examples:

SphereFunction f(2);
arma::mat coordinates = f.GetInitialPoint();

SPSA optimizer(0.1, 2, 0.102, 0.16, 0.3, 100000, 0);
optimizer.Optimize(f, coordinates);


## Stochastic Recursive Gradient Algorithm (SARAH/SARAH+)

An optimizer for differentiable separable functions.

StochAstic Recusive gRadient algoritHm (SARAH), is a variance reducing stochastic recursive gradient algorithm which employs the stochastic recursive gradient, for solving empirical loss minimization for the case of nonconvex losses.

#### Constructors

• SARAHType<UpdatePolicyType>()
• SARAHType<UpdatePolicyType>(stepSize, batchSize)
• SARAHType<UpdatePolicyType>(stepSize, batchSize, maxIterations, innerIterations, tolerance, shuffle, updatePolicy)

The UpdatePolicyType template parameter specifies the update step used for the optimizer. The SARAHUpdate and SARAHPlusUpdate classes are available for use, and implement the standard SARAH update and SARAH+ update, respectively. A custom update rule can be used by implementing a class with the same method signatures.

For convenience the following typedefs have been defined:

• SARAH (equivalent to SARAHType<SARAHUpdate>): the standard SARAH optimizer
• SARAH_Plus (equivalent to SARAHType<SARAHPlusUpdate>): the SARAH+ optimizer

#### Attributes

type name description default
double stepSize Step size for each iteration. 0.01
size_t batchSize Batch size to use for each step. 32
size_t maxIterations Maximum number of iterations allowed (0 means no limit). 1000
size_t innerIterations The number of inner iterations allowed (0 means n / batchSize). Note that the full gradient is only calculated in the outer iteration. 0
double tolerance Maximum absolute tolerance to terminate algorithm. 1e-5
bool shuffle If true, the function order is shuffled; otherwise, each function is visited in linear order. true
UpdatePolicyType updatePolicy Instantiated update policy used to adjust the given parameters. UpdatePolicyType()

Attributes of the optimizer may also be changed via the member methods StepSize(), BatchSize(), MaxIterations(), InnerIterations(), Tolerance(), Shuffle(), and UpdatePolicy().

Note that the default value for updatePolicy is the default constructor for the UpdatePolicyType.

#### Examples:

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

// Standard stochastic variance reduced gradient.
SARAH optimizer(0.01, 1, 5000, 0, 1e-5, true);
optimizer.Optimize(f, coordinates);

// Stochastic variance reduced gradient with Barzilai-Borwein.
SARAH_Plus optimizerPlus(0.01, 1, 5000, 0, 1e-5, true);
optimizerPlus.Optimize(f, coordinates);


## Standard SGD

An optimizer for differentiable separable functions.

Stochastic Gradient Descent is a technique for minimizing a function which can be expressed as a sum of other functions. It’s likely better to use any of the other variants of SGD than this class; however, this standard SGD implementation may still be useful in some situations.

#### Constructors

• StandardSGD()
• StandardSGD(stepSize, batchSize)
• StandardSGD(stepSize, batchSize, maxIterations, tolerance, shuffle)

Note that StandardSGD is based on the templated type SGD<UpdatePolicyType, DecayPolicyType> with UpdatePolicyType = VanillaUpdate and DecayPolicyType = NoDecay.

#### Attributes

type name description default
double stepSize Step size for each iteration. 0.01
size_t batchSize Batch size to use for each step. 32
size_t maxIterations Maximum number of iterations allowed (0 means no limit). 100000
double tolerance Maximum absolute tolerance to terminate algorithm. 1e-5
bool shuffle If true, the function order is shuffled; otherwise, each function is visited in linear order. true

Attributes of the optimizer may also be modified via the member methods StepSize(), BatchSize(), MaxIterations(), Tolerance(), and Shuffle().

#### Examples

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

StandardSGD optimizer(0.01, 32, 100000, 1e-5, true);
optimizer.Optimize(f, coordinates);


## Stochastic Coordinate Descent (SCD)

An optimizer for partially differentiable functions.

Stochastic Coordinate descent is a technique for minimizing a function by doing a line search along a single direction at the current point in the iteration. The direction (or “coordinate”) can be chosen cyclically, randomly or in a greedy fashion.

#### Constructors

• SCD<DescentPolicyType>()
• SCD<DescentPolicyType>(stepSize, maxIterations)
• SCD<DescentPolicyType>(stepSize, maxIterations, tolerance, updateInterval)
• SCD<DescentPolicyType>(stepSize, maxIterations, tolerance, updateInterval, descentPolicy)

The DescentPolicyType template parameter specifies the behavior of SCD when selecting the next coordinate to descend with. The RandomDescent, GreedyDescent, and CyclicDescent classes are available for use. Custom behavior can be achieved by implementing a class with the same method signatures.

For convenience, the following typedefs have been defined:

• RandomSCD (equivalent to SCD<RandomDescent>): selects coordinates randomly
• GreedySCD (equivalent to SCD<GreedyDescent>): selects the coordinate with the maximum guaranteed descent according to the Gauss-Southwell rule
• CyclicSCD (equivalent to SCD<CyclicDescent>): selects coordinates sequentially

#### Attributes

type name description default
double stepSize Step size for each iteration. 0.01
size_t maxIterations Maximum number of iterations allowed (0 means no limit). 100000
double tolerance Maximum absolute tolerance to terminate the algorithm. 1e-5
size_t updateInterval The interval at which the objective is to be reported and checked for convergence. 1e3
DescentPolicyType descentPolicy The policy to use for selecting the coordinate to descend on. DescentPolicyType()

Attributes of the optimizer may also be modified via the member methods StepSize(), MaxIterations(), Tolerance(), UpdateInterval(), and DescentPolicy().

Note that the default value for descentPolicy is the default constructor for DescentPolicyType.

#### Examples

SparseTestFunction f;
arma::mat coordinates = f.GetInitialPoint();

RandomSCD randomscd(0.01, 100000, 1e-5, 1e3);
randomscd.Optimize(f, coordinates);

GreedySCD greedyscd(0.01, 100000, 1e-5, 1e3);
greedyscd.Optimize(f, coordinates);

CyclicSCD cyclicscd(0.01, 100000, 1e-5, 1e3);
cyclicscd.Optimize(f, coordinates);


## Stochastic Gradient Descent with Restarts (SGDR)

An optimizer for differentiable separable functions.

SGDR is based on Mini-batch Stochastic Gradient Descent class and simulates a new warm-started run/restart once a number of epochs are performed.

#### Constructors

• SGDR<UpdatePolicyType>()
• SGDR<UpdatePolicyType>(epochRestart, multFactor, batchSize, stepSize)
• SGDR<UpdatePolicyType>(epochRestart, multFactor, batchSize, stepSize, maxIterations, tolerance, shuffle, updatePolicy)
• SGDR<UpdatePolicyType>(epochRestart, multFactor, batchSize, stepSize, maxIterations, tolerance, shuffle, updatePolicy, resetPolicy)

The UpdatePolicyType template parameter controls the update policy used during the iterative update process. The MomentumUpdate class is available for use, and custom behavior can be achieved by implementing a class with the same method signatures as MomentumUpdate.

For convenience, the default type of UpdatePolicyType is MomentumUpdate, so the shorter type SGDR<> can be used instead of the equivalent SGDR<MomentumUpdate>.

#### Attributes

type name description default
size_t epochRestart Initial epoch where decay is applied. 50
double multFactor Batch size multiplication factor. 2.0
size_t batchSize Size of each mini-batch. 1000
double stepSize Step size for each iteration. 0.01
size_t maxIterations Maximum number of iterations allowed (0 means no limit). 100000
double tolerance Maximum absolute tolerance to terminate algorithm. 1e-5
bool shuffle If true, the mini-batch order is shuffled; otherwise, each mini-batch is visited in linear order. true
UpdatePolicyType updatePolicy Instantiated update policy used to adjust the given parameters. UpdatePolicyType()
bool resetPolicy If true, parameters are reset before every Optimize call; otherwise, their values are retained. true

Attributes of the optimizer can also be modified via the member methods EpochRestart(), MultFactor(), BatchSize(), StepSize(), MaxIterations(), Tolerance(), Shuffle(), UpdatePolicy(), and ResetPolicy().

Note that the default value for updatePolicy is the default constructor for the UpdatePolicyType.

#### Examples:

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

SGDR<> optimizer(50, 2.0, 1, 0.01, 10000, 1e-3);
optimizer.Optimize(f, coordinates);


## Snapshot Stochastic Gradient Descent with Restarts (SnapshotSGDR)

An optimizer for differentiable separable functions.

SnapshotSGDR simulates a new warm-started run/restart once a number of epochs are performed using the Snapshot Ensembles technique.

#### Constructors

• SnapshotSGDR<UpdatePolicyType>()
• SnapshotSGDR<UpdatePolicyType>(epochRestart, multFactor, batchSize, stepSize)
• SnapshotSGDR<UpdatePolicyType>(epochRestart, multFactor, batchSize, stepSize, maxIterations, tolerance, shuffle, snapshots, accumulate, updatePolicy)
• SnapshotSGDR<UpdatePolicyType>(epochRestart, multFactor, batchSize, stepSize, maxIterations, tolerance, shuffle, snapshots, accumulate, updatePolicy, resetPolicy)

The UpdatePolicyType template parameter controls the update policy used during the iterative update process. The MomentumUpdate class is available for use, and custom behavior can be achieved by implementing a class with the same method signatures as MomentumUpdate.

For convenience, the default type of UpdatePolicyType is MomentumUpdate, so the shorter type SnapshotSGDR<> can be used instead of the equivalent SnapshotSGDR<MomentumUpdate>.

#### Attributes

type name description default
size_t epochRestart Initial epoch where decay is applied. 50
double multFactor Batch size multiplication factor. 2.0
size_t batchSize Size of each mini-batch. 1000
double stepSize Step size for each iteration. 0.01
size_t maxIterations Maximum number of iterations allowed (0 means no limit). 100000
double tolerance Maximum absolute tolerance to terminate algorithm. 1e-5
bool shuffle If true, the mini-batch order is shuffled; otherwise, each mini-batch is visited in linear order. true
size_t snapshots Maximum number of snapshots. 5
bool accumulate Accumulate the snapshot parameter. true
UpdatePolicyType updatePolicy Instantiated update policy used to adjust the given parameters. UpdatePolicyType()
bool resetPolicy If true, parameters are reset before every Optimize call; otherwise, their values are retained. true

Attributes of the optimizer can also be modified via the member methods EpochRestart(), MultFactor(), BatchSize(), StepSize(), MaxIterations(), Tolerance(), Shuffle(), Snapshots(), Accumulate(), UpdatePolicy(), and ResetPolicy().

The Snapshots() function returns a std::vector<arma::mat>& (a vector of snapshots of the parameters), not a size_t representing the maximum number of snapshots.

Note that the default value for updatePolicy is the default constructor for the UpdatePolicyType.

#### Examples:

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

SnapshotSGDR<> optimizer(50, 2.0, 1, 0.01, 10000, 1e-3);
optimizer.Optimize(f, coordinates);


## SMORMS3

An optimizer for differentiable separable functions.

SMORMS3 is a hybrid of RMSprop, which is trying to estimate a safe and optimal distance based on curvature or perhaps just normalizing the stepsize in the parameter space.

#### Constructors

• SMORMS3()
• SMORMS3(stepSize, batchSize)
• SMORMS3(stepSize, batchSize, epsilon, maxIterations, tolerance)
• SMORMS3(stepSize, batchSize, epsilon, maxIterations, tolerance, shuffle, resetPolicy)

#### Attributes

type name description default
double stepSize Step size for each iteration. 0.001
size_t batchSize Number of points to process at each step. 32
double epsilon Value used to initialise the mean squared gradient parameter. 1e-16
size_t maxIterations Maximum number of iterations allowed (0 means no limit). 100000
double tolerance Maximum absolute tolerance to terminate algorithm. 1e-5
bool shuffle If true, the mini-batch order is shuffled; otherwise, each mini-batch is visited in linear order. true
bool resetPolicy If true, parameters are reset before every Optimize call; otherwise, their values are retained. true

Attributes of the optimizer can also be modified via the member methods StepSize(), BatchSize(), Epsilon(), MaxIterations(), Tolerance(), Shuffle(), and ResetPolicy().

#### Examples:

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

SMORMS3 optimizer(0.001, 1, 1e-16, 5000000, 1e-9, true);
optimizer.Optimize(f, coordinates);


## Standard stochastic variance reduced gradient (SVRG)

An optimizer for differentiable separable functions.

Stochastic Variance Reduced Gradient is a technique for minimizing smooth and strongly convex problems.

#### Constructors

• SVRGType<UpdatePolicyType, DecayPolicyType>()
• SVRGType<UpdatePolicyType, DecayPolicyType>(stepSize)
• SVRGType<UpdatePolicyType, DecayPolicyType>(stepSize, batchSize, maxIterations, innerIterations)
• SVRGType<UpdatePolicyType, DecayPolicyType>(stepSize, batchSize, maxIterations, innerIterations, tolerance, shuffle, updatePolicy, decayPolicy, resetPolicy)

The UpdatePolicyType template parameter controls the update step used by SVRG during the optimization. The SVRGUpdate class is available for use and custom update behavior can be achieved by implementing a class with the same method signatures as SVRGUpdate.

The DecayPolicyType template parameter controls the decay policy used to adjust the step size during the optimization. The BarzilaiBorweinDecay and NoDecay classes are available for use. Custom decay functionality can be achieved by implementing a class with the same method signatures.

For convenience the following typedefs have been defined:

• SVRG (equivalent to SVRGType<SVRGUpdate, NoDecay>): the standard SVRG technique
• SVRG_BB (equivalent to SVRGType<SVRGUpdate, BarzilaiBorweinDecay>): SVRG with the Barzilai-Borwein decay policy

#### Attributes

type name description default
double stepSize Step size for each iteration. 0.01
size_t batchSize Initial batch size. 32
size_t maxIterations Maximum number of iterations allowed (0 means no limit). 1000
size_t innerIterations The number of inner iterations allowed (0 means n / batchSize). Note that the full gradient is only calculated in the outer iteration. 0
double tolerance Maximum absolute tolerance to terminate algorithm. 1e-5
bool shuffle If true, the batch order is shuffled; otherwise, each batch is visited in linear order. true
UpdatePolicyType updatePolicy Instantiated update policy used to adjust the given parameters. UpdatePolicyType()
DecayPolicyType decayPolicy Instantiated decay policy used to adjust the step size. DecayPolicyType()
bool resetPolicy Flag that determines whether update policy parameters are reset before every Optimize call. true

Attributes of the optimizer may also be modified via the member methods StepSize(), BatchSize(), MaxIterations(), InnerIterations(), Tolerance(), Shuffle(), UpdatePolicy(), DecayPolicy(), and ResetPolicy().

Note that the default values for the updatePolicy and decayPolicy parameters are simply the default constructors of the UpdatePolicyType and DecayPolicyType classes.

#### Examples:

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

// Standard stochastic variance reduced gradient.
SVRG optimizer(0.005, 1, 300, 0, 1e-10, true);
optimizer.Optimize(f, coordinates);

// Stochastic variance reduced gradient with Barzilai-Borwein.
SVRG_BB bbOptimizer(0.005, batchSize, 300, 0, 1e-10, true, SVRGUpdate(),
BarzilaiBorweinDecay(0.1));
bbOptimizer.Optimize(f, coordinates);


## SPALeRA Stochastic Gradient Descent (SPALeRASGD)

An optimizer for differentiable separable functions.

SPALeRA involves two components: a learning rate adaptation scheme, which ensures that the learning system goes as fast as it can; and a catastrophic event manager, which is in charge of detecting undesirable behaviors and getting the system back on track.

#### Constructors

• SPALeRASGD<DecayPolicyType>()
• SPALeRASGD<DecayPolicyType>(stepSize, batchSize)
• SPALeRASGD<DecayPolicyType>(stepSize, batchSize, maxIterations, tolerance)
• SPALeRASGD<DecayPolicyType>(stepSize, batchSize, maxIterations, tolerance, lambda, alpha, epsilon, adaptRate, shuffle, decayPolicy, resetPolicy)

The DecayPolicyType template parameter controls the decay in the step size during the course of the optimization. The NoDecay class is available for use; custom behavior can be achieved by implementing a class with the same method signatures.

By default, DecayPolicyType is set to NoDecay, so the shorter type SPALeRASGD<> can be used instead of the equivalent SPALeRASGD<NoDecay>.

#### Attributes

type name description default
double stepSize Step size for each iteration. 0.01
size_t batchSize Initial batch size. 32
size_t maxIterations Maximum number of iterations allowed (0 means no limit). 100000
double tolerance Maximum absolute tolerance to terminate algorithm. 1e-5
double lambda Page-Hinkley update parameter. 0.01
double alpha Memory parameter of the Agnostic Learning Rate adaptation. 0.001
double epsilon Numerical stability parameter. 1e-6
double adaptRate Agnostic learning rate update rate. 3.10e-8
bool shuffle If true, the batch order is shuffled; otherwise, each batch is visited in linear order. true
DecayPolicyType decayPolicy Instantiated decay policy used to adjust the step size. DecayPolicyType()
bool resetPolicy Flag that determines whether update policy parameters are reset before every Optimize call. true

Attributes of the optimizer may also be modified via the member methods StepSize(), BatchSize(), MaxIterations(), Tolerance(), Lambda(), Alpha(), Epsilon(), AdaptRate(), Shuffle(), DecayPolicy(), and ResetPolicy().

#### Examples

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

SPALeRASGD<> optimizer(0.05, 1, 10000, 1e-4);
optimizer.Optimize(f, coordinates);


## SWATS

An optimizer for differentiable separable functions.

SWATS is an optimizer that uses a simple strategy to switch from Adam to standard SGD when a triggering condition is satisfied. The condition relates to the projection of Adam steps on the gradient subspace.

#### Constructors

• SWATS()
• SWATS(stepSize, batchSize)
• SWATS(stepSize, batchSize, beta1, beta2, epsilon, maxIterations, tolerance)

#### Attributes

type name description default
double stepSize Step size for each iteration. 0.001
size_t batchSize Number of points to process at each step. 32
double beta1 Exponential decay rate for the first moment estimates. 0.9
double beta2 Exponential decay rate for the weighted infinity norm estimates. 0.999
double epsilon Value used to initialise the mean squared gradient parameter. 1e-16
size_t maxIterations Maximum number of iterations allowed (0 means no limit). 100000
double tolerance Maximum absolute tolerance to terminate algorithm. 1e-5
bool shuffle If true, the mini-batch order is shuffled; otherwise, each mini-batch is visited in linear order. true

Attributes of the optimizer can also be modified via the member methods StepSize(), BatchSize(), Beta1(), Beta2(), Epsilon(), MaxIterations(), Tolerance(), and Shuffle().

#### Examples:

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

SWATS optimizer(0.001, 1, 0.9, 0.999, 1e-16, 5000000, 1e-9, true);
optimizer.Optimize(f, coordinates);


An optimizer for differentiable separable functions.

WNGrad is a general nonlinear update rule for the learning rate. WNGrad has near-optimal convergence rates in both the batch and stochastic settings.

#### Constructors

• WNGrad()
• WNGrad(stepSize, batchSize)
• WNGrad(stepSize, batchSize, maxIterations, tolerance, shuffle)
• WNGrad(stepSize, batchSize, maxIterations, tolerance, shuffle, resetPolicy)

#### Attributes

type name description default
double stepSize Step size for each iteration. 0.562
size_t batchSize Initial batch size. 32
size_t maxIterations Maximum number of iterations allowed (0 means no limit). 100000
double tolerance Maximum absolute tolerance to terminate algorithm. 1e-5
bool shuffle If true, the batch order is shuffled; otherwise, each batch is visited in linear order. true
bool resetPolicy If true, parameters are reset before every Optimize call; otherwise, their values are retained. true

Attributes of the optimizer may also be modified via the member methods StepSize(), BatchSize(), MaxIterations(), Tolerance(), Shuffle(), and ResetPolicy().

#### Examples

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

optimizer.Optimize(f, coordinates);


# History/changelog

### ensmallen 1.13.1

###### 2019-01-24
• Fix -Wreorder warning (#75).

### ensmallen 1.13.0

###### 2019-01-14
• Enhance options for AugLagrangian optimizer (#66).

### ensmallen 1.12.2

###### 2019-01-05
• Fix list of contributors.

### ensmallen 1.12.1

###### 2019-01-03
• Make sure all files end with newlines.

### ensmallen 1.12.0

###### 2018-12-30

• Minor documentation fixes. Remove too-verbose documentation from source for each optimizer (#61).

• Add ResetPolicy() to SGD-like optimizers (#60).

### ensmallen 1.11.1

###### 2018-11-29
• Minor documentation fixes.

### ensmallen 1.11.0

###### 2018-11-28

• Fix header name in documentation samples.

### ensmallen 1.10.1

###### 2018-11-16
• Fixes for GridSearch optimizer.

• Include documentation with release.

### ensmallen 1.10.0

###### 2018-10-20
• Initial release.

• Includes the ported optimization framework from mlpack (http://www.mlpack.org/).