ensmallen
ensmallen
flexible C++ library for efficient mathematical optimization

ensmallen documentation

ensmallen is a library for function optimization. It is devoted to efficiently solving the problem

for many different types of $f(x)$. The documentation is split into four parts:

Everything in ensmallen is in the ens namespace so it is often useful to put a using directive in your code:

using namespace ens;

Citation

Please cite the following paper if you use ensmallen in your research and/or software.
Citations are useful for the continued development and maintenance of the library.

Contents

Function type documentation

Arbitrary functions

The least restrictive type of function that can be implemented in ensmallen is a function for which only the objective can be evaluated. For this, a class with the following API must be implemented:

class ArbitraryFunctionType
{
 public:
  // This should return f(x).
  double Evaluate(const arma::mat& x);
};

For this type of function, we assume that the gradient f'(x) is not computable. If it is, see differentiable functions.

The Evaluate() method is allowed to have additional cv-modifiers (static, const, etc.).

The following optimizers can be used to optimize an arbitrary function:

Each of these optimizers has an Optimize() function that is called as Optimize(f, x) where f is the function to be optimized (which implements Evaluate()) and x holds the initial point of the optimization. After Optimize() is called, x will hold the final result of the optimization (that is, the best x found that minimizes f(x)).

Example: squared function optimization

An example program that implements the objective function f(x) = 2 |x|^2 is shown below, using the simulated annealing optimizer.

#include <ensmallen.hpp>

class SquaredFunction
{
 public:
  // This returns f(x) = 2 |x|^2.
  double Evaluate(const arma::mat& x)
  {
    return 2 * std::pow(arma::norm(x), 2.0);
  }
};

int main()
{
  // The minimum is at x = [0 0 0].  Our initial point is chosen to be
  // [1.0, -1.0, 1.0].
  arma::mat x("1.0 -1.0 1.0");

  // Create simulated annealing optimizer with default options.
  // The ens::SA<> type can be replaced with any suitable ensmallen optimizer
  // that is able to handle arbitrary functions.
  ens::SA<> optimizer;
  SquaredFunction f; // Create function to be optimized.
  optimizer.Optimize(f, x);

  std::cout << "Minimum of squared function found with simulated annealing is "
      << x;
}

Differentiable functions

Probably the most common type of function that can be optimized with ensmallen is a differentiable function, where both f(x) and f’(x) can be calculated. To optimize a differentiable function with ensmallen, a class must be implemented that follows the API below:

class DifferentiableFunctionType
{
 public:
  // Given parameters x, return the value of f(x).
  double Evaluate(const arma::mat& x);

  // Given parameters x and a matrix g, store f'(x) in the provided matrix g.
  // g should have the same size (rows, columns) as x.
  void Gradient(const arma::mat& x, arma::mat& gradient);

  // OPTIONAL: this may be implemented in addition to---or instead
  // of---Evaluate() and Gradient().  If this is the only function implemented,
  // implementations of Evaluate() and Gradient() will be automatically
  // generated using template metaprogramming.  Often, implementing
  // EvaluateWithGradient() can result in more efficient optimizations.
  //
  // Given parameters x and a matrix g, return the value of f(x) and store
  // f'(x) in the provided matrix g.  g should have the same size (rows,
  // columns) as x.
  double EvaluateWithGradient(const arma::mat& x, arma::mat& g);
};

Note that you may implement either Evaluate() and Gradient() or EvaluateWithGradient(), but it is not mandatory to implement both. (Of course, supplying both is okay too.) It often results in faster code when EvaluateWithGradient() is implemented, especially for functions where f(x) and f’(x) compute some of the same intermediate quantities.

Each of the implemented methods is allowed to have additional cv-modifiers (static, const, etc.).

The following optimizers can be used with differentiable functions:

Each of these optimizers has an Optimize() function that is called as Optimize(f, x) where f is the function to be optimized and x holds the initial point of the optimization. After Optimize() is called, x will hold the final result of the optimization (that is, the best x found that minimizes f(x)).

An example program is shown below. In this program, we optimize a linear regression model. In this setting, we have some matrix data of data points that we’ve observed, and some vector responses of the observed responses to this data. In our model, we assume that each response is the result of a weighted linear combination of the data:

where $x$ is a vector of parameters. This gives the objective function $f(x) = (\operatorname{responses} - x’\operatorname{data})^2$.

In the example program, we optimize this objective function and compare the runtime of an implementation that uses Evaluate() and Gradient(), and the runtime of an implementation that uses EvaluateWithGradient().

#include <ensmallen.hpp>

// Define a differentiable objective function by implementing both Evaluate()
// and Gradient() separately.
class LinearRegressionFunction
{
 public:
  // Construct the object with the given data matrix and responses.
  LinearRegressionFunction(const arma::mat& dataIn,
                           const arma::rowvec& responsesIn) :
      data(dataIn), responses(responsesIn) { }

  // Return the objective function for model parameters x.
  double Evaluate(const arma::mat& x)
  {
    return std::pow(arma::norm(responses - x.t() * data), 2.0);
  }

  // Compute the gradient for model parameters x.
  void Gradient(const arma::mat& x, arma::mat& g)
  {
    g = -2 * data * (responses - x.t() * data);
  }

 private:
  // The data.
  const arma::mat& data;
  // The responses to each data point.
  const arma::rowvec& responses;
};

// Define the same function, but only implement EvaluateWithGradient().
class LinearRegressionEWGFunction
{
 public:
  // Construct the object with the given data matrix and responses.
  LinearRegressionEWGFunction(const arma::mat& dataIn,
                              const arma::rowvec& responsesIn) :
      data(dataIn), responses(responsesIn) { }

  // Simultaneously compute both the objective function and gradient for model
  // parameters x.  Note that this is faster than implementing Evaluate() and
  // Gradient() individually because it caches the computation of
  // (responses - x.t() * data)!
  double EvaluateWithGradient(const arma::mat& x, arma::mat& g)
  {
    const arma::rowvec v = (responses - x.t() * data);
    g = -2 * data * v;
    return arma::accu(v % v); // equivalent to \| v \|^2
  }
};

int main()
{
  // We'll run a simple speed comparison between both objective functions.

  // First, generate some random data, with 10000 points and 10 dimensions.
  // This data has no pattern and as such will make a model that's not very
  // useful---but the purpose here is just demonstration. :)
  //
  // For a more "real world" situation, load a dataset from file using X.load()
  // and y.load() (but make sure the matrix is column-major, so that each
  // observation/data point corresponds to a *column*, *not* a row.
  arma::mat data(10, 10000, arma::fill::randn);
  arma::rowvec responses(10000, arma::fill::randn);

  // Create a starting point for our optimization randomly.  The model has 10
  // parameters, so the shape is 10x1.
  arma::mat startingPoint(10, 1, arma::fill::randn);

  // We'll use Armadillo's wall_clock class to do a timing comparison.
  arma::wall_clock clock;

  // Construct the first objective function.
  LinearRegressionFunction lrf1(data, responses);
  arma::mat lrf1Params(startingPoint);

  // Create the L_BFGS optimizer with default parameters.
  // The ens::L_BFGS type can be replaced with any ensmallen optimizer that can
  // handle differentiable functions.
  ens::L_BFGS lbfgs;

  // Time how long L-BFGS takes for the first Evaluate() and Gradient()
  // objective function.
  clock.tic();
  lbfgs.Optimize(lrf1, lrf1Params);
  const double time1 = clock.toc();

  std::cout << "LinearRegressionFunction with Evaluate() and Gradient() took "
    << time1 << " seconds to converge to the model: " << std::endl;
  std::cout << lrf1Params.t();

  // Create the second objective function, which uses EvaluateWithGradient().
  LinearRegressionEWGFunction lrf2(data, responses);
  arma::mat lrf2Params(startingPoint);

  // Time how long L-BFGS takes for the EvaluateWithGradient() objective
  // function.
  clock.tic();
  lbfgs.Optimize(lrf2, lrf2Params);
  const double time2 = clock.toc();

  std::cout << "LinearRegressionEWGFunction with EvaluateWithGradient() took "
    << time2 << " seconds to converge to the model: " << std::endl;
  std::cout << lrf2Params.t();

  // When this runs, the output parameters will be exactly on the same, but the
  // LinearRegressionEWGFunction will run more quickly!
}

Partially differentiable functions

Some differentiable functions have the additional property that the gradient f'(x) can be decomposed along a different axis j such that the gradient is sparse. This makes the most sense in machine learning applications where the function f(x) is being optimized for some dataset data, which has d dimensions. A partially differentiable separable function is partially differentiable with respect to each dimension j of data. This property is useful for coordinate descent type algorithms.

To use ensmallen optimizers to minimize these types of functions, only two functions needs to be added to the differentiable function type:

// Compute the partial gradient f'_j(x) with respect to data coordinate j and
// store it in the sparse matrix g.
void Gradient(const arma::mat& x, const size_t j, arma::sp_mat& g);

// Get the number of features that f(x) can be partially differentiated with.
size_t NumFeatures();

Note: many partially differentiable function optimizers do not require a regular implementation of the Gradient(), so that function may be omitted.

If these functions are implemented, the following partially differentiable function optimizers can be used:

Arbitrary separable functions

Often, an objective function f(x) may be represented as the sum of many functions:

f(x) = f_0(x) + f_1(x) + ... + f_N(x).

In this function type, we assume the gradient f'(x) is not computable. If it is, see differentiable separable functions.

For machine learning tasks, the objective function may be, e.g., the sum of a function taken across many data points. Implementing an arbitrary separable function type in ensmallen is similar to implementing an arbitrary objective function, but with a few extra utility methods:

class ArbitrarySeparableFunctionType
{
 public:
  // Given parameters x, return the sum of the individual functions
  // f_i(x) + ... + f_{i + batchSize - 1}(x).  i will always be greater than 0,
  // and i + batchSize will be less than or equal to the value of NumFunctions().
  double Evaluate(const arma::mat& x, const size_t i, const size_t batchSize);

  // Shuffle the ordering of the functions f_i(x).
  // (For machine learning problems, this would be equivalent to shuffling the
  // data points, e.g., before an epoch of learning.)
  void Shuffle();

  // Get the number of functions f_i(x).
  // (For machine learning problems, this is often just the number of points in
  // the dataset.)
  size_t NumFunctions();
};

Each of the implemented methods is allowed to have additional cv-modifiers (static, const, etc.).

The following optimizers can be used with arbitrary separable functions:

Each of these optimizers has an Optimize() function that is called as Optimize(f, x) where f is the function to be optimized and x holds the initial point of the optimization. After Optimize() is called, x will hold the final result of the optimization (that is, the best x found that minimizes f(x)).

Note: using an arbitrary non-separable function optimizer will call Evaluate(x, 0, NumFunctions() - 1); if this is a very computationally intensive operation for your objective function, it may be best to avoid using a non-separable arbitrary function optimizer.

Note: if possible, it’s often better to try and use a gradient-based approach. See differentiable separable functions for separable f(x) where the gradient f’(x) can be computed.

The example program below demonstrates the implementation and use of an arbitrary separable function. The function used is the linear regression objective function, described in differentiable functions. Given some dataset data and responses responses, the linear regression objective function is separable as

where $\operatorname{data}(i)$ represents the data point indexed by $i$ and $\operatorname{responses}(i)$ represents the observed response indexed by $i$.

#include <ensmallen.hpp>

// This class implements the linear regression objective function as an
// arbitrary separable function type.
class LinearRegressionFunction
{
 public:
  // Create the linear regression function with the given data and the given
  // responses.
  LinearRegressionFunction(const arma::mat& dataIn,
                           const arma::rowvec& responsesIn) :
      data(data), responses(responses) { }

  // Given parameters x, compute the sum of the separable objective
  // functions starting with f_i(x) and ending with
  // f_{i + batchSize - 1}(x).
  double Evaluate(const arma::mat& x, const size_t i, const size_t batchSize)
  {
    // A more complex implementation could avoid the for loop and use
    // submatrices, but it is easier to understand when implemented this way.
    double objective = 0.0;
    for (size_t j = i; j < i + batchSize; ++j)
    {
      objective += std::pow(responses[j] - x.t() * data.col(j), 2.0);
    }
  }

  // Shuffle the ordering of the functions f_i(x).  We do this by simply
  // shuffling the data and responses.
  void Shuffle()
  {
    // Generate a random ordering of data points.
    arma::uvec ordering = arma::shuffle(
        arma::linspace<arma::uvec>(0, data.n_cols - 1, data.n_cols));

    // This reorders the data and responses with our randomly-generated
    // ordering above.
    data = data.cols(ordering);
    responses = responses.cols(ordering);
  }

  // Return the number of functions f_i(x).  In our case this is simply the
  // number of data points.
  size_t NumFunctions() { return data.n_cols; }
};

int main()
{
  // First, generate some random data, with 10000 points and 10 dimensions.
  // This data has no pattern and as such will make a model that's not very
  // useful---but the purpose here is just demonstration. :)
  //
  // For a more "real world" situation, load a dataset from file using X.load()
  // and y.load() (but make sure the matrix is column-major, so that each
  // observation/data point corresponds to a *column*, *not* a row.
  arma::mat data(10, 10000, arma::fill::randn);
  arma::rowvec responses(10000, arma::fill::randn);

  // Create a starting point for our optimization randomly.  The model has 10
  // parameters, so the shape is 10x1.
  arma::mat params(10, 1, arma::fill::randn);

  // Use the CMAES optimizer with default parameters to minimize the
  // LinearRegressionFunction.
  // The ens::CMAES type can be replaced with any suitable ensmallen optimizer
  // that can handle arbitrary separable functions.
  ens::CMAES cmaes;
  LinearRegressionFunction lrf(data, responses);
  cmaes.Optimize(lrf, params);

  std::cout << "The optimized linear regression model found by CMAES has the "
      << "parameters " << params.t();
}

Differentiable separable functions

Likely the most important type of function to be optimized in machine learning and some other fields is the differentiable separable function. A differentiable separable function f(x) can be represented as the sum of many functions:

f(x) = f_0(x) + f_1(x) + ... + f_N(x).

And it also has a computable separable gradient f'(x):

f'(x) = f'_0(x) + f'_1(x) + ... + f'_N(x).

For machine learning tasks, the objective function may be, e.g., the sum of a function taken across many data points. Implementing a differentiable separable function type in ensmallen is similar to implementing an ordinary differentiable function, but with a few extra utility methods:

class ArbitrarySeparableFunctionType
{
 public:
  // Given parameters x, return the sum of the individual functions
  // f_i(x) + ... + f_{i + batchSize - 1}(x).  i will always be greater than 0,
  // and i + batchSize will be less than or equal to the value of NumFunctions().
  double Evaluate(const arma::mat& x, const size_t i, const size_t batchSize);

  // Given parameters x and a matrix g, store the sum of the gradient of
  // individual functions f'_i(x) + ... + f'_{i + batchSize - 1}(x) into g.  i
  // will always be greater than 0, and i + batchSize will be less than or
  // equal to the value of NumFunctions().
  void Gradient(const arma::mat& x,
                const size_t i,
                arma::mat& g,
                const size_t batchSize);

  // Shuffle the ordering of the functions f_i(x).
  // (For machine learning problems, this would be equivalent to shuffling the
  // data points, e.g., before an epoch of learning.)
  void Shuffle();

  // Get the number of functions f_i(x).
  // (For machine learning problems, this is often just the number of points in
  // the dataset.)
  size_t NumFunctions();

  // OPTIONAL: this may be implemented in addition to---or instead
  // of---Evaluate() and Gradient().  If this is the only function implemented,
  // implementations of Evaluate() and Gradient() will be automatically
  // generated using template metaprogramming.  Often, implementing
  // EvaluateWithGradient() can result in more efficient optimizations.
  //
  // Given parameters x and a matrix g, return the sum of the individual
  // functions f_i(x) + ... + f_{i + batchSize - 1}(x), and store the sum of
  // the gradient of individual functions f'_i(x) + ... +
  // f'_{i + batchSize - 1}(x) into the provided matrix g.  g should have the
  // same size (rows, columns) as x.  i will always be greater than 0, and i +
  // batchSize will be less than or equal to the value of NumFunctions().
  double EvaluateWithGradient(const arma::mat& x,
                              const size_t i,
                              arma::mat& g,
                              const size_t batchSize);
};

Note that you may implement either Evaluate() and Gradient() or EvaluateWithGradient(), but it is not mandatory to implement both. (Of course, supplying both is okay too.) It often results in faster code when EvaluateWithGradient() is implemented, especially for functions where f(x) and f’(x) compute some of the same intermediate quantities.

Each of the implemented methods is allowed to have additional cv-modifiers (static, const, etc.).

The following optimizers can be used with differentiable functions:

The example program below demonstrates the implementation and use of an arbitrary separable function. The function used is the linear regression objective function, described in differentiable functions. Given some dataset data and responses responses, the linear regression objective function is separable as

f_i(x) = (responses(i) - x' * data(i))^2

where data(i) represents the data point indexed by i and responses(i) represents the observed response indexed by i. This example implementation only implements EvaluateWithGradient() in order to avoid redundant calculations.

#include <ensmallen.hpp>

// This class implements the linear regression objective function as an
// arbitrary separable function type.
class LinearRegressionFunction
{
 public:
  // Create the linear regression function with the given data and the given
  // responses.
  LinearRegressionFunction(const arma::mat& dataIn,
                           const arma::rowvec& responsesIn) :
      data(data), responses(responses) { }

  // Given parameters x, compute the sum of the separable objective
  // functions starting with f_i(x) and ending with
  // f_{i + batchSize - 1}(x), and also compute the gradient of those functions
  // and store them in g.
  double EvaluateWithGradient(const arma::mat& x,
                              const size_t i,
                              arma::mat& g,
                              const size_t batchSize)
  {
    // This slightly complex implementation uses Armadillo submatrices to
    // compute the objective functions and gradients simultaneously for
    // multiple points.
    //
    // The shared computation between the objective and gradient is the term
    // (response - x * data) so we compute that first, only for points in the
    // batch.
    const arma::rowvec v = (responses.cols(i, i + batchSize - 1) - x.t() *
        data.cols(i, i + batchSize - 1));
    g = -2 * data.cols(i, i + batchSize - 1) * v;
    return arma::accu(v % v); // equivalent to |v|^2
  }

  // Shuffle the ordering of the functions f_i(x).  We do this by simply
  // shuffling the data and responses.
  void Shuffle()
  {
    // Generate a random ordering of data points.
    arma::uvec ordering = arma::shuffle(
        arma::linspace<arma::uvec>(0, data.n_cols - 1, data.n_cols));

    // This reorders the data and responses with our randomly-generated
    // ordering above.
    data = data.cols(ordering);
    responses = responses.cols(ordering);
  }

  // Return the number of functions f_i(x).  In our case this is simply the
  // number of data points.
  size_t NumFunctions() { return data.n_cols; }
};

int main()
{
  // First, generate some random data, with 10000 points and 10 dimensions.
  // This data has no pattern and as such will make a model that's not very
  // useful---but the purpose here is just demonstration. :)
  //
  // For a more "real world" situation, load a dataset from file using X.load()
  // and y.load() (but make sure the matrix is column-major, so that each
  // observation/data point corresponds to a *column*, *not* a row.
  arma::mat data(10, 10000, arma::fill::randn);
  arma::rowvec responses(10000, arma::fill::randn);

  // Create a starting point for our optimization randomly.  The model has 10
  // parameters, so the shape is 10x1.
  arma::mat params(10, 1, arma::fill::randn);

  // Use RMSprop to find the best parameters for the linear regression model.
  // The type 'ens::RMSprop' can be changed for any ensmallen optimizer able to
  // handle differentiable separable functions.
  ens::RMSProp rmsprop;
  LinearRegressionFunction lrf(data, responses);
  rmsprop.Optimize(lrf, params);

  std::cout << "The optimized linear regression model found by RMSprop has the"
      << " parameters " << params.t();
}

Sparse differentiable separable functions

Some differentiable separable functions have the additional property that the gradient f'_i(x) is sparse. When this is true, one additional method can be implemented as part of the class to be optimized:

// Add this definition to use sparse differentiable separable function
// optimizers.  Given x, store the sum of the sparse gradient f'_i(x) + ... +
// f'_{i + batchSize - 1}(x) into the provided matrix g.
void Gradient(const arma::mat& x,
              const size_t i,
              arma::sp_mat& g,
              const size_t batchSize);

It’s also possible to instead use templates to provide only one Gradient() function for both sparse and non-sparse optimizers:

// This provides Gradient() for both sparse and non-sparse optimizers.
template<typename GradType>
void Gradient(const arma::mat& x,
              const size_t i,
              GradType& g,
              const size_t batchSize);

If either of these methods are available, then any ensmallen optimizer that optimizes sparse separable differentiable functions may be used. This includes:

Categorical functions

A categorical function is a function f(x) where some of the values of x are categorical variables (i.e. they take integer values [0, c - 1] and each value 0, 1, …, c - 1 is unrelated). In this situation, the function is not differentiable, and so only the objective f(x) can be implemented. Therefore the class requirements for a categorical function are exactly the same as for an ArbitraryFunctionType—but for any categorical dimension x_i in x, the value will be in the range [0, c_i - 1] where c_i is the number of categories in dimension x_i.

class CategoricalFunction
{
 public:
  // Return the objective function for the given parameters x.
  double Evaluate(const arma::mat& x);
};

However, when an optimizer’s Optimize() method is called, two additional parameters must be specified, in addition to the function to optimize and the matrix holding the parameters:

  • const std::vector<bool>& categoricalDimensions: a vector of length equal to the number of elements in x (the number of dimensions). If an element is true, then the dimension is categorical.

  • const arma::Row<size_t>& numCategories: a vector of length equal to the number of elements in x (the number of dimensions). If a dimension is categorical, then the corresponding value in numCategories should hold the number of categories in that dimension.

The following optimizers can be used in this way to optimize a categorical function:

An example program showing usage of categorical optimization is shown below.

#include <ensmallen.hpp>

// An implementation of a simple categorical function.  The parameters can be
// understood as x = [c1 c2 c3].  When c1 = 0, c2 = 2, and c3 = 1, the value of
// f(x) is 0.  In any other case, the value of f(x) is 10.  Therefore, the
// optimum is found at [0, 2, 1].
class SimpleCategoricalFunction
{
 public:
  // Return the objective function f(x) as described above.
  double Evaluate(const arma::mat& x)
  {
    if (size_t(x[0]) == 0 &&
        size_t(x[1]) == 2 &&
        size_t(x[2]) == 1)
      return 0.0;
    else
      return 10.0;
  }
};

int main()
{
  // Create and optimize the categorical function with the GridSearch
  // optimizer.  We must also create a std::vector<bool> that holds the types
  // of each dimension, and an arma::Row<size_t> that holds the number of
  // categories in each dimension.
  SimpleCategoricalFunction c;

  // We have three categorical dimensions only.
  std::vector<bool> categoricalDimensions;
  categoricalDimensions.push_back(true);
  categoricalDimensions.push_back(true);
  categoricalDimensions.push_back(true);

  // The first category can take 5 values; the second can take 3; the third can
  // take 12.
  arma::Row<size_t> numCategories("5 3 12");

  // The initial point for our optimization will be to set all categories to 0.
  arma::mat params("0 0 0");

  // Now create the GridSearch optimizer with default parameters, and run the
  // optimization.
  // The ens::GridSearch type can be replaced with any ensmallen optimizer that
  // is able to handle categorical functions.
  ens::GridSearch gs;
  gs.Optimize(c, params, categoricalDimensions, numCategories);

  std::cout << "The ens::GridSearch optimizer found the optimal parameters to "
      << "be " << params;
}

Constrained functions

A constrained function is an objective function f(x) that is also subject to some constraints on x. (For instance, perhaps a constraint could be that x is a positive semidefinite matrix.) ensmallen is able to handle differentiable objective functions of this type—so, f'(x) must also be computable. Given some set of constraints c_0(x), … c_M(x), we can re-express our constrained objective function as

f_C(x) = f(x) + c_0(x) + ... + c_M(x)

where the constraint c_i(x) is DBL_MAX if it is not satisfied, and otherwise takes some real value. For a “hard constraint”, we can simply take c_i(x) = 0 when it is satisfied. But allowing c_i(x) to return anything allows us to handle “soft” constraints also.

In order to optimize a constrained function with ensmallen, a class implementing the API below is required.

class ConstrainedFunctionType
{
 public:
  // Return the objective function f(x) for the given x.
  double Evaluate(const arma::mat& x);

  // Compute the gradient of f(x) for the given x and store the result in g.
  void Gradient(const arma::mat& x, arma::mat& g);

  // Get the number of constraints on the objective function.
  size_t NumConstraints();

  // Evaluate constraint i at the parameters x.  If the constraint is
  // unsatisfied, DBL_MAX should be returned.  If the constraint is satisfied,
  // any real value can be returned.  The optimizer will add this value to its
  // overall objective that it is trying to minimize.  (So, a hard constraint
  // can just return 0 if it's satisfied.)
  double EvaluateConstraint(const size_t i, const arma::mat& x);

  // Evaluate the gradient of constraint i at the parameters x, storing the
  // result in the given matrix g.  If this is a hard constraint you can set
  // the gradient to 0.  If the constraint is not satisfied, it could be
  // helpful to set the gradient in such a way that the gradient points in the
  // direction where the constraint would be satisfied.
  void GradientConstraint(const size_t i, const arma::mat& x, arma::mat& g);
};

A constrained function can be optimized with the following optimizers:

Semidefinite programs

A special class of constrained function is a semidefinite program. ensmallen has support for creating and optimizing semidefinite programs via the ens::SDP<> class. For this, the SDP must be expressed in the primal form:

min_x dot(C, x) such that

dot(A_i, x) = b_i, for i = 1, ..., M;
x >= 0

In this case, A_i and C are square matrices (sparse or dense), and b_i is a scalar.

Once the problem is expressed in this form, a class of type ens::SDP<> can be created. SDP<arma::mat> indicates an SDP with a dense C, and SDP<arma::sp_mat> indicates an SDP with a sparse C. The class has the following useful methods:

  • SDP(cMatrixSize, numDenseConstraints, numSparseConstraints): create a new SDP
  • size_t NumSparseConstraints(): get number of sparse constraint matrices A_i
  • size_t NumDenseConstraints(): get number of dense constraint matrices A_i
  • std::vector<arma::mat>& DenseA(): get vector of dense A_i matrices
  • std::vector<arma::sp_mat>& SparseA(): get vector of sparse A_i matrices
  • arma::vec& DenseB(): get vector of b_i values for dense A_i constraints
  • arma::vec& SparseB(): get vector of b_i values for sparse A_i constraints

Once these methods are used to set each A_i matrix and corresponding b_i value, and C objective matrix, the SDP object can be used with any ensmallen SDP solver. The list of SDP solvers is below:

Example code showing how to solve an SDP is given below.

int main()
{
  // We will build a toy semidefinite program and then use the PrimalDualSolver to find a solution

  // The semi-definite constraint looks like:
  //
  // [ 1  x_12  x_13  0  0  0  0 ]
  // [     1    x_23  0  0  0  0 ]
  // [            1   0  0  0  0 ]
  // [               s1  0  0  0 ]  >= 0
  // [                  s2  0  0 ]
  // [                     s3  0 ]
  // [                        s4 ]

  // x_11 == 0
  arma::sp_mat A0(7, 7); A0.zeros();
  A0(0, 0) = 1.;

  // x_22 == 0
  arma::sp_mat A1(7, 7); A1.zeros();
  A1(1, 1) = 1.;

  // x_33 == 0
  arma::sp_mat A2(7, 7); A2.zeros();
  A2(2, 2) = 1.;

  // x_12 <= -0.1  <==>  x_12 + s1 == -0.1, s1 >= 0
  arma::sp_mat A3(7, 7); A3.zeros();
  A3(1, 0) = A3(0, 1) = 1.; A3(3, 3) = 2.;

  // -0.2 <= x_12  <==>  x_12 - s2 == -0.2, s2 >= 0
  arma::sp_mat A4(7, 7); A4.zeros();
  A4(1, 0) = A4(0, 1) = 1.; A4(4, 4) = -2.;

  // x_23 <= 0.5  <==>  x_23 + s3 == 0.5, s3 >= 0
  arma::sp_mat A5(7, 7); A5.zeros();
  A5(2, 1) = A5(1, 2) = 1.; A5(5, 5) = 2.;

  // 0.4 <= x_23  <==>  x_23 - s4 == 0.4, s4 >= 0
  arma::sp_mat A6(7, 7); A6.zeros();
  A6(2, 1) = A6(1, 2) = 1.; A6(6, 6) = -2.;

  std::vector<arma::sp_mat> ais({A0, A1, A2, A3, A4, A5, A6});

  SDP<arma::sp_mat> sdp(7, 7 + 4 + 4 + 4 + 3 + 2 + 1, 0);

  for (size_t j = 0; j < 3; j++)
  {
    // x_j4 == x_j5 == x_j6 == x_j7 == 0
    for (size_t i = 0; i < 4; i++)
    {
      arma::sp_mat A(7, 7); A.zeros();
      A(i + 3, j) = A(j, i + 3) = 1;
      ais.emplace_back(A);
    }
  }

  // x_45 == x_46 == x_47 == 0
  for (size_t i = 0; i < 3; i++)
  {
    arma::sp_mat A(7, 7); A.zeros();
    A(i + 4, 3) = A(3, i + 4) = 1;
    ais.emplace_back(A);
  }

  // x_56 == x_57 == 0
  for (size_t i = 0; i < 2; i++)
  {
    arma::sp_mat A(7, 7); A.zeros();
    A(i + 5, 4) = A(4, i + 5) = 1;
    ais.emplace_back(A);
  }

  // x_67 == 0
  arma::sp_mat A(7, 7); A.zeros();
  A(6, 5) = A(5, 6) = 1;
  ais.emplace_back(A);

  std::swap(sdp.SparseA(), ais);

  sdp.SparseB().zeros();
  sdp.SparseB()[0] = sdp.SparseB()[1] = sdp.SparseB()[2] = 1.;
  sdp.SparseB()[3] = -0.2; sdp.SparseB()[4] = -0.4;
  sdp.SparseB()[5] = 1.; sdp.SparseB()[6] = 0.8;

  sdp.C().zeros();
  sdp.C()(0, 2) = sdp.C()(2, 0) = 1.;

  // That took a long time but we finally set up the problem right!  Now we can
  // use the PrimalDualSolver to solve it.
  // ens::PrimalDualSolver could be replaced with ens::LRSDP or other ensmallen
  // SDP solvers.
  PrimalDualSolver<SDP<arma::sp_mat>> solver(sdp);
  arma::mat X, Z;
  arma::vec ysparse, ydense;
  // ysparse, ydense, and Z hold the primal and dual variables found during the
  // optimization.
  const double obj = solver.Optimize(X, ysparse, ydense, Z);

  std::cout << "SDP optimized with objective " << obj << "." << std::endl;
}

Alternate matrix types

All of the examples above (and throughout the rest of the documentation) generally assume that the matrix being optimized has type arma::mat. But ensmallen’s optimizers are capable of optimizing more types than just dense Armadillo matrices. In fact, the full signature of each optimizer’s Optimize() method is this:

template<typename FunctionType, typename MatType>
typename MatType::elem_type Optimize(FunctionType& function,
                                     MatType& coordinates);

The return type, typename MatType::elem_type, is just the numeric type held by the given matrix type. So, for arma::mat, the return type is just double. In addition, optimizers for differentiable functions have a third template parameter, GradType, which specifies the type of the gradient. GradType can be manually specified in the situation where, e.g., a sparse gradient is desired.

It is easy to write a function to optimize, e.g., an arma::fmat. Here is an example, adapted from the SquaredFunction example from the arbitrary function documentation.

#include <ensmallen.hpp>

class SquaredFunction
{
 public:
  // This returns f(x) = 2 |x|^2.
  float Evaluate(const arma::fmat& x)
  {
    return 2 * std::pow(arma::norm(x), 2.0);
  }

  void Gradient(const arma::fmat& x, arma::fmat& gradient)
  {
    gradient = 4 * x;
  }
};

int main()
{
  // The minimum is at x = [0 0 0].  Our initial point is chosen to be 
  // [1.0, -1.0, 1.0].
  arma::fmat x("1.0 -1.0 1.0");

  // Create simulated annealing optimizer with default options.
  // The ens::SA<> type can be replaced with any suitable ensmallen optimizer
  // that is able to handle arbitrary functions.
  ens::L_BFGS<> optimizer;
  SquaredFunction f; // Create function to be optimized.
  optimizer.Optimize(f, x); // The optimizer will infer arma::fmat!

  std::cout << "Minimum of squared function found with simulated annealing is "
      << x;
}

Note that we have simply changed the SquaredFunction to accept arma::fmat instead of arma::mat as parameters to Evaluate(), and the return type has accordingly been changed to float from double. It would even be possible to optimize functions with sparse coordinates by having Evaluate() take a sparse matrix (i.e. arma::sp_mat).

If it were desired to represent the gradient as a sparse type, the Gradient() function would need to be modified to take a sparse matrix (i.e. arma::sp_mat or similar), and then you could call optimizer.Optimize<SquaredFunction, arma::mat, arma::sp_mat>(f, x); to perform the optimization while using sparse matrix types to represent the gradient. Using sparse MatType or GradType should only be done when it is known that the objective matrix and/or gradients will be sparse; otherwise the code may run very slow!

ensmallen will automatically infer MatType from the call to Optimize(), and check that the given FunctionType has all of the necessary functions for the given MatType, throwing a static_assert error if not. If you would like to disable these checks, define the macro ENS_DISABLE_TYPE_CHECKS before including ensmallen:

#define ENS_DISABLE_TYPE_CHECKS
#include <ensmallen.hpp>

This can be useful for situations where you know that the checks should be ignored. However, be aware that the code may fail to compile and give more confusing and difficult error messages!

Optimizer documentation

AdaDelta

An optimizer for differentiable separable functions.

AdaDelta is an extension of AdaGrad that adapts learning rates based on a moving window of gradient updates, instead of accumulating all past gradients. Instead of accumulating all past squared gradients, the sum of gradients is recursively defined as a decaying average of all past squared gradients.

Constructors

  • AdaDelta()
  • AdaDelta(stepSize)
  • AdaDelta(stepSize, batchSize)
  • AdaDelta(stepSize, batchSize, rho, epsilon, maxIterations, tolerance, shuffle)
  • AdaDelta(stepSize, batchSize, rho, epsilon, maxIterations, tolerance, shuffle, resetPolicy, exactObjective)

Attributes

type name description default
double stepSize Step size for each iteration. 1.0
size_t batchSize Number of points to process in one step. 32
double rho Smoothing constant. Corresponding to fraction of gradient to keep at each time step. 0.95
double epsilon Value used to initialise the mean squared gradient parameter. 1e-6
size_t maxIterations Maximum number of iterations allowed (0 means no limit). 100000
double tolerance Maximum absolute tolerance to terminate algorithm. 1e-5
bool shuffle If true, the function order is shuffled; otherwise, each function is visited in linear order. true
bool resetPolicy If true, parameters are reset before every Optimize call; otherwise, their values are retained. true
bool exactObjective Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). false

Attributes of the optimizer may also be changed via the member methods StepSize(), BatchSize(), Rho(), Epsilon(), MaxIterations(), Shuffle(), ResetPolicy(), and ExactObjective().

Examples:

AdaDelta optimizer(1.0, 1, 0.99, 1e-8, 1000, 1e-9, true);

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();
optimizer.Optimize(f, coordinates);

See also:

Adagrad

An optimizer for differentiable separable functions.

AdaGrad is an optimizer with parameter-specific learning rates, which are adapted relative to how frequently a parameter gets updated during training. Larger updates for more sparse parameters and smaller updates for less sparse parameters.

Constructors

  • AdaGrad()
  • AdaGrad(stepSize)
  • AdaGrad(stepSize, batchSize)
  • AdaGrad(stepSize, batchSize, epsilon, maxIterations, tolerance, shuffle)
  • AdaGrad(stepSize, batchSize, epsilon, maxIterations, tolerance, shuffle, resetPolicy, exactObjective)

Attributes

type name description default
double stepSize Step size for each iteration. 0.01
size_t batchSize Number of points to process in one step. 32
double epsilon Value used to initialise the mean squared gradient parameter. 1e-8
size_t maxIterations Maximum number of iterations allowed (0 means no limit). 100000
double tolerance Maximum absolute tolerance to terminate algorithm. tolerance
bool shuffle If true, the function order is shuffled; otherwise, each function is visited in linear order. true
bool resetPolicy If true, parameters are reset before every Optimize call; otherwise, their values are retained. true
bool exactObjective Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). false

Attributes of the optimizer may also be changed via the member methods StepSize(), BatchSize(), Epsilon(), MaxIterations(), Tolerance(), Shuffle(), ResetPolicy(), and ExactObjective().

Examples:

AdaGrad optimizer(1.0, 1, 1e-8, 1000, 1e-9, true);

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();
optimizer.Optimize(f, coordinates);

See also:

Adam

An optimizer for differentiable separable functions.

Adam is an an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments.

Constructors

  • Adam()
  • Adam(stepSize, batchSize)
  • Adam(stepSize, batchSize, beta1, beta2, eps, maxIterations, tolerance, shuffle)
  • Adam(stepSize, batchSize, beta1, beta2, eps, maxIterations, tolerance, shuffle, resetPolicy, exactObjective)

Note that the Adam class is based on the AdamType<UpdateRule> class with UpdateRule = AdamUpdate.

Attributes

type name description default
double stepSize Step size for each iteration. 0.001
size_t batchSize Number of points to process in a single step. 32
double beta1 Exponential decay rate for the first moment estimates. 0.9
double beta2 Exponential decay rate for the weighted infinity norm estimates. 0.999
double eps Value used to initialize the mean squared gradient parameter. 1e-8
size_t max_iterations Maximum number of iterations allowed (0 means no limit). 100000
double tolerance Maximum absolute tolerance to terminate algorithm. 1e-5
bool shuffle If true, the function order is shuffled; otherwise, each function is visited in linear order. true
bool resetPolicy If true, parameters are reset before every Optimize call; otherwise, their values are retained. true
bool exactObjective Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). false

The attributes of the optimizer may also be modified via the member methods StepSize(), BatchSize(), Beta1(), Beta2(), Eps(), MaxIterations(), Tolerance(), Shuffle(), ResetPolicy(), and ExactObjective().

Examples

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

Adam optimizer(0.001, 32, 0.9, 0.999, 1e-8, 100000, 1e-5, true);
optimizer.Optimize(f, coordinates);

See also:

AdaMax

An optimizer for differentiable separable functions.

AdaMax is simply a variant of Adam based on the infinity norm.

Constructors

  • AdaMax()
  • AdaMax(stepSize, batchSize)
  • AdaMax(stepSize, batchSize, beta1, beta2, eps, maxIterations, tolerance, shuffle)
  • AdaMax(stepSize, batchSize, beta1, beta2, eps, maxIterations, tolerance, shuffle, exactObjective, resetPolicy)

Note that the AdaMax class is based on the AdamType<UpdateRule> class with UpdateRule = AdaMaxUpdate.

Attributes

type name description default
double stepSize Step size for each iteration. 0.001
size_t batchSize Number of points to process in a single step. 32
double beta1 Exponential decay rate for the first moment estimates. 0.9
double beta2 Exponential decay rate for the weighted infinity norm estimates. 0.999
double eps Value used to initialize the mean squared gradient parameter. 1e-8
size_t max_iterations Maximum number of iterations allowed (0 means no limit). 100000
double tolerance Maximum absolute tolerance to terminate algorithm. 1e-5
bool shuffle If true, the function order is shuffled; otherwise, each function is visited in linear order. true
bool exactObjective Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). false
bool resetPolicy If true, parameters are reset before every Optimize call; otherwise, their values are retained. true

The attributes of the optimizer may also be modified via the member methods StepSize(), BatchSize(), Beta1(), Beta2(), Eps(), MaxIterations(), Tolerance(), Shuffle(), ExactObjective(), and ResetPolicy().

Examples

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

AdaMax optimizer(0.001, 32, 0.9, 0.999, 1e-8, 100000, 1e-5, true);
optimizer.Optimize(f, coordinates);

See also:

AMSGrad

An optimizer for differentiable separable functions.

AMSGrad is a variant of Adam with guaranteed convergence.

Constructors

  • AMSGrad()
  • AMSGrad(stepSize, batchSize)
  • AMSGrad(stepSize, batchSize, beta1, beta2, eps, maxIterations, tolerance, shuffle)
  • AMSGrad(stepSize, batchSize, beta1, beta2, eps, maxIterations, tolerance, shuffle, exactObjective, resetPolicy)

Note that the AMSGrad class is based on the AdamType<UpdateRule> class with UpdateRule = AMSGradUpdate.

Attributes

type name description default
double stepSize Step size for each iteration. 0.001
size_t batchSize Number of points to process in a single step. 32
double beta1 Exponential decay rate for the first moment estimates. 0.9
double beta2 Exponential decay rate for the weighted infinity norm estimates. 0.999
double eps Value used to initialize the mean squared gradient parameter. 1e-8
size_t max_iterations Maximum number of iterations allowed (0 means no limit). 100000
double tolerance Maximum absolute tolerance to terminate algorithm. 1e-5
bool shuffle If true, the function order is shuffled; otherwise, each function is visited in linear order. true
bool exactObjective Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). false
bool resetPolicy If true, parameters are reset before every Optimize call; otherwise, their values are retained. true

The attributes of the optimizer may also be modified via the member methods StepSize(), BatchSize(), Beta1(), Beta2(), Eps(), MaxIterations(), Tolerance(), Shuffle(), ExactObjective(), and ResetPolicy().

Examples

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

AMSGrad optimizer(0.001, 32, 0.9, 0.999, 1e-8, 100000, 1e-5, true);
optimizer.Optimize(f, coordinates);

See also:

Augmented Lagrangian

An optimizer for differentiable constrained functions.

The AugLagrangian class implements the Augmented Lagrangian method of optimization. In this scheme, a penalty term is added to the Lagrangian. This method is also called the “method of multipliers”. Internally, the optimizer uses L-BFGS.

Constructors

  • AugLagrangian(maxIterations, penaltyThresholdFactor sigmaUpdateFactor)

Attributes

type name description default
size_t maxIterations Maximum number of iterations allowed (0 means no limit). 1000
double penaltyThresholdFactor When penalty threshold is updated, set it to this multiplied by the penalty. 10.0
double sigmaUpdateFactor When sigma is updated, multiply it by this. 0.25
L_BFGS& lbfgs Internal l-bfgs optimizer. L_BFGS()

The attributes of the optimizer may also be modified via the member methods MaxIterations(), PenaltyThresholdFactor(), SigmaUpdateFactor() and L_BFGS().

/**
 * Optimize the function.  The value '1' is used for the initial value of each
 * Lagrange multiplier.  To set the Lagrange multipliers yourself, use the
 * other overload of Optimize().
 *
 * @tparam LagrangianFunctionType Function which can be optimized by this
 *     class.
 * @param function The function to optimize.
 * @param coordinates Output matrix to store the optimized coordinates in.
 */
template<typename LagrangianFunctionType>
bool Optimize(LagrangianFunctionType& function,
              arma::mat& coordinates);

/**
 * Optimize the function, giving initial estimates for the Lagrange
 * multipliers.  The vector of Lagrange multipliers will be modified to
 * contain the Lagrange multipliers of the final solution (if one is found).
 *
 * @tparam LagrangianFunctionType Function which can be optimized by this
 *      class.
 * @param function The function to optimize.
 * @param coordinates Output matrix to store the optimized coordinates in.
 * @param initLambda Vector of initial Lagrange multipliers.  Should have
 *     length equal to the number of constraints.
 * @param initSigma Initial penalty parameter.
 */
template<typename LagrangianFunctionType>
bool Optimize(LagrangianFunctionType& function,
              arma::mat& coordinates,
              const arma::vec& initLambda,
              const double initSigma);

Examples

GockenbachFunction f;
arma::mat coordinates = f.GetInitialPoint();

AugLagrangian optimizer;
optimizer.Optimize(f, coords);

See also:

Big Batch SGD

An optimizer for differentiable separable functions.

Big-batch stochastic gradient descent adaptively grows the batch size over time to maintain a nearly constant signal-to-noise ratio in the gradient approximation, so the Big Batch SGD optimizer is able to adaptively adjust batch sizes without user oversight.

Constructors

  • BigBatchSGD<UpdatePolicy>()
  • BigBatchSGD<UpdatePolicy>(stepSize)
  • BigBatchSGD<UpdatePolicy>(stepSize, batchSize)
  • BigBatchSGD<UpdatePolicy>(stepSize, batchSize, epsilon, maxIterations, tolerance, shuffle, exactObjective)

The UpdatePolicy template parameter refers to the way that a new step size is computed. The AdaptiveStepsize and BacktrackingLineSearch classes are available for use; custom behavior can be achieved by implementing a class with the same method signatures.

For convenience the following typedefs have been defined:

  • BBS_Armijo = BigBatchSGD<BacktrackingLineSearch>
  • BBS_BB = BigBatchSGD<AdaptiveStepsize>

Attributes

type name description default
size_t batchSize Initial batch size. 1000
double stepSize Step size for each iteration. 0.01
double batchDelta Factor for the batch update step. 0.1
size_t maxIterations Maximum number of iterations allowed (0 means no limit). 100000
double tolerance Maximum absolute tolerance to terminate algorithm. 1e-5
bool shuffle If true, the batch order is shuffled; otherwise, each batch is visited in linear order. true
bool exactObjective Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). false

Attributes of the optimizer may also be changed via the member methods BatchSize(), StepSize(), BatchDelta(), MaxIterations(), Tolerance(), Shuffle(), and ExactObjective().

Examples:

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

// Big-Batch SGD with the adaptive stepsize policy.
BBS_BB optimizer(batchSize, 0.01, 0.1, 8000, 1e-4);
optimizer.Optimize(f, coordinates);

// Big-Batch SGD with backtracking line search.
BBS_Armijo optimizer2(batchSize, 0.01, 0.1, 8000, 1e-4);
optimizer2.Optimize(f, coordinates);

See also:

CMAES

An optimizer for separable functions.

CMA-ES (Covariance Matrix Adaptation Evolution Strategy) is a stochastic search algorithm. CMA-ES is a second order approach estimating a positive definite matrix within an iterative procedure using the covariance matrix.

Constructors

  • CMAES<SelectionPolicyType>()
  • CMAES<SelectionPolicyType>(lambda, lowerBound, upperBound)
  • CMAES<SelectionPolicyType>(lambda, lowerBound, upperBound, batchSize)
  • CMAES<SelectionPolicyType>(lambda, lowerBound, upperBound, batchSize, maxIterations, tolerance, selectionPolicy)

The SelectionPolicyType template parameter refers to the strategy used to compute the (approximate) objective function. The FullSelection and RandomSelection classes are available for use; custom behavior can be achieved by implementing a class with the same method signatures.

For convenience the following types can be used:

  • CMAES<> (equivalent to CMAES<FullSelection>): uses all separable functions to compute objective
  • ApproxCMAES (equivalent to CMAES<RandomSelection>): uses a small amount of separable functions to compute approximate objective

Attributes

type name description default
size_t lambda The population size (0 uses a default size). 0
double lowerBound Lower bound of decision variables. -10.0
double upperBound Upper bound of decision variables. 10.0
size_t batchSize Batch size to use for the objective calculation. 32
size_t maxIterations Maximum number of iterations. 1000
double tolerance Maximum absolute tolerance to terminate algorithm. 1e-5
SelectionPolicyType selectionPolicy Instantiated selection policy used to calculate the objective. SelectionPolicyType()

Attributes of the optimizer may also be changed via the member methods Lambda(), LowerBound(), UpperBound(), BatchSize(), MaxIterations(), Tolerance(), and SelectionPolicy().

The selectionPolicy attribute allows an instantiated SelectionPolicyType to be given. The FullSelection policy has no need to be instantiated and thus the option is not relevant when the CMAES<> optimizer type is being used; the RandomSelection policy has the constructor RandomSelection(fraction) where fraction specifies the percentage of separable functions to use to estimate the objective function.

Examples:

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

// CMAES with the FullSelection policy.
CMAES<> optimizer(0, -1, 1, 32, 200, 0.1e-4);
optimizer.Optimize(f, coordinates);

// CMAES with the RandomSelection policy.
ApproxCMAES<> approxOptimizer(batchSize, 0.01, 0.1, 8000, 1e-4);
approxOptimizer.Optimize(f, coordinates);

See also:

CNE

An optimizer for arbitrary functions.

Conventional Neural Evolution is an optimizer that works like biological evolution which selects best candidates based on their fitness scores and creates new generation by mutation and crossover of population. The initial population is generated based on a random normal distribution centered at the given starting point.

Constructors

  • CNE()
  • CNE(populationSize, maxGenerations)
  • CNE(populationSize, maxGenerations, mutationProb, mutationSize)
  • CNE(populationSize, maxGenerations, mutationProb, mutationSize, selectPercent, tolerance)

Attributes

type name description default
size_t populationSize The number of candidates in the population. This should be at least 4 in size. 500
size_t maxGenerations The maximum number of generations allowed for CNE. 5000
double mutationProb Probability that a weight will get mutated. 0.1
double mutationSize The range of mutation noise to be added. This range is between 0 and mutationSize. 0.02
double selectPercent The percentage of candidates to select to become the the next generation. 0.2
double tolerance The final value of the objective function for termination. If set to negative value, tolerance is not considered. 1e-5

Attributes of the optimizer may also be changed via the member methods PopulationSize(), MaxGenerations(), MutationProb(), SelectPercent() and Tolerance().

Examples:

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

CNE optimizer(200, 10000, 0.2, 0.2, 0.3, 1e-5);
optimizer.Optimize(f, coordinates);

See also:

DE

An optimizer for arbitrary functions.

Differential Evolution is an evolutionary optimization algorithm which selects best candidates based on their fitness scores and creates new generation by mutation and crossover of population.

Constructors

  • DE()
  • DE(populationSize, maxGenerations)
  • DE(populationSize, maxGenerations, crossoverRate)
  • DE(populationSize, maxGenerations, crossoverRate, differentialWeight)
  • DE(populationSize, maxGenerations, crossoverRate, differentialWeight, tolerance)

Attributes

type name description default
size_t populationSize The number of candidates in the population. This should be at least 3 in size. 100
size_t maxGenerations The maximum number of generations allowed for DE. 2000
double crossoverRate Probability that a candidate will undergo crossover. 0.6
double differentialWeight Amplification factor for differentiation. 0.8
double tolerance The final value of the objective function for termination. If set to negative value, tolerance is not considered. 1e-5

Attributes of the optimizer may also be changed via the member methods PopulationSize(), MaxGenerations(), CrossoverRate(), DifferentialWeight() and Tolerance().

Examples:

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

DE optimizer(200, 1000, 0.6, 0.8, 1e-5);
optimizer.Optimize(f, coordinates);

See also:

Eve

An optimizer for differentiable separable functions.

Eve is a stochastic gradient based optimization method with locally and globally adaptive learning rates.

Constructors

  • Eve()
  • Eve(stepSize, batchSize)
  • Eve(stepSize, batchSize, beta1, beta2, beta3, epsilon, clip, maxIterations, tolerance, shuffle)

Attributes

type name description default
double stepSize Step size for each iteration. 0.001
size_t batchSize Number of points to process in a single step. 32
double beta1 Exponential decay rate for the first moment estimates. 0.9
double beta2 Exponential decay rate for the weighted infinity norm estimates. 0.999
double beta3 Exponential decay rate for relative change. 0.999
double epsilon Value used to initialize the mean squared gradient parameter. 1e-8
double clip Clipping range to avoid extreme valus. 10
size_t max_iterations Maximum number of iterations allowed (0 means no limit). 100000
double tolerance Maximum absolute tolerance to terminate algorithm. 1e-5
bool shuffle If true, the function order is shuffled; otherwise, each function is visited in linear order. true
bool exactObjective Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). false

The attributes of the optimizer may also be modified via the member methods StepSize(), BatchSize(), Beta1(), Beta2(), Beta3(), Epsilon(), Clip(), MaxIterations(), Tolerance(), Shuffle(), and ExactObjective().

Examples

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

Eve optimizer(0.001, 32, 0.9, 0.999, 0.999, 10, 1e-8, 100000, 1e-5, true);
optimizer.Optimize(f, coordinates);

See also:

Frank-Wolfe

An optimizer for differentiable functions that may also be constrained.

Frank-Wolfe is a technique to minimize a continuously differentiable convex function f over a compact convex subset D of a vector space. It is also known as conditional gradient method.

Constructors

  • FrankWolfe<LinearConstrSolverType, UpdateRuleType>(linearConstrSolver, updateRule)
  • FrankWolfe<LinearConstrSolverType, UpdateRuleType>(linearConstrSolver, updateRule, maxIterations, tolerance)

The LinearConstrSolverType template parameter specifies the constraint domain D for the problem. The ConstrLpBallSolver and ConstrStructGroupSolver<GroupLpBall> classes are available for use; the former restricts D to the unit ball of the specified l-p norm. Other constraint types may be implemented as a class with the same method signatures as either of the existing classes.

The UpdateRuleType template parameter specifies the update rule used by the optimizer. The UpdateClassic and UpdateLineSearch classes are available for use and represent a simple update step rule and a line search based update rule, respectively. The UpdateSpan and UpdateFulLCorrection classes are also available and may be used with the FuncSq function class (which is a squared matrix loss).

For convenience the following typedefs have been defined:

  • OMP (equivalent to FrankWolfe<ConstrLpBallSolver, UpdateSpan>): a solver for the orthogonal matching pursuit problem
  • StandardFrankWolfe (equivalent to FrankWolfe<ConstrLpBallSolver, ClassicUpdate>): the standard Frank-Wolfe algorithm with the solution restricted to lie within the unit ball

Attributes

type name description default
LinearConstrSolverType linearConstrSolver Solver for linear constrained problem. n/a
UpdateRuleType updateRule Rule for updating solution in each iteration. n/a
size_t maxIterations Maximum number of iterations allowed (0 means no limit). 100000
size_t tolerance Maximum absolute tolerance to terminate algorithm. 1e-10

Attributes of the optimizer may also be changed via the member methods LinearConstrSolver(), UpdateRule(), MaxIterations(), and Tolerance().

Examples:

TODO

See also:

FTML (Follow the Moving Leader)

An optimizer for differentiable separable functions.

Follow the Moving Leader (FTML) is an optimizer where recent samples are weighted more heavily in each iteration, so FTML can adapt more quickly to changes.

Constructors

  • FTML()
  • FTML(stepSize, batchSize)
  • FTML(stepSize, batchSize, beta1, beta2, epsilon, maxIterations, tolerance, shuffle)
  • FTML(stepSize, batchSize, beta1, beta2, epsilon, maxIterations, tolerance, shuffle, resetPolicy, exactObjective)

Attributes

type name description default
double stepSize Step size for each iteration. 0.001
size_t batchSize Number of points to process in a single step. 32
double beta1 Exponential decay rate for the first moment estimates. 0.9
double beta2 Exponential decay rate for the weighted infinity norm estimates. 0.999
double eps Value used to initialize the mean squared gradient parameter. 1e-8
size_t max_iterations Maximum number of iterations allowed (0 means no limit). 100000
double tolerance Maximum absolute tolerance to terminate algorithm. 1e-5
bool shuffle If true, the function order is shuffled; otherwise, each function is visited in linear order. true
bool resetPolicy If true, parameters are reset before every Optimize call; otherwise, their values are retained. true
bool exactObjective Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). false

The attributes of the optimizer may also be modified via the member methods StepSize(), BatchSize(), Beta1(), Beta2(), Epsilon(), MaxIterations(), Tolerance(), Shuffle(), ResetPolicy(), and ExactObjective().

Examples

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

FTML optimizer(0.001, 32, 0.9, 0.999, 1e-8, 100000, 1e-5, true);
optimizer.Optimize(f, coordinates);

See also:

Gradient Descent

An optimizer for differentiable functions.

Gradient Descent is a technique to minimize a function. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient of the function at the current point.

Constructors

  • GradientDescent()
  • GradientDescent(stepSize)
  • GradientDescent(stepSize, maxIterations, tolerance)

Attributes

type name description default
double stepSize Step size for each iteration. 0.01
size_t maxIterations Maximum number of iterations allowed (0 means no limit). 100000
size_t tolerance Maximum absolute tolerance to terminate algorithm. 1e-5

Attributes of the optimizer may also be changed via the member methods StepSize(), MaxIterations(), and Tolerance().

Examples:

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

GradientDescent optimizer(0.001, 0, 1e-15);
optimizer.Optimize(f, coordinates);

See also:

An optimizer for categorical functions.

An optimizer that finds the minimum of a given function by iterating through points on a multidimensional grid.

Constructors

  • GridSearch()

Attributes

The GridSearch class has no configurable attributes.

Note: the GridSearch class can only optimize categorical functions where every parameter is categorical.

See also:

Hogwild! (Parallel SGD)

An optimizer for sparse differentiable separable functions.

An implementation of parallel stochastic gradient descent using the lock-free HOGWILD! approach. This implementation requires OpenMP to be enabled during compilation (i.e., -fopenmp specified as a compiler flag).

Note that the requirements for Hogwild! are slightly different than for most differentiable separable functions but it is often possible to use Hogwild! by implementing Gradient() with a template parameter. See the sparse differentiable separable functions documentation for more details.

Constructors

  • ParallelSGD<DecayPolicyType>(maxIterations, threadShareSize)
  • ParallelSGD<DecayPolicyType>(maxIterations, threadShareSize, tolerance, shuffle, decayPolicy)

The DecayPolicyType template parameter specifies the policy used to update the step size after each iteration. The ConstantStep class is available for use. Custom behavior can be achieved by implementing a class with the same method signatures.

The default type for DecayPolicyType is ConstantStep, so the shorter type ParallelSGD<> can be used instead of the equivalent ParallelSGD<ConstantStep>.

Attributes

type name description default
size_t maxIterations Maximum number of iterations allowed (0 means no limit). n/a
size_t threadShareSize Number of datapoints to be processed in one iteration by each thread. n/a
double tolerance Maximum absolute tolerance to terminate the algorithm. 1e-5
bool shuffle If true, the function order is shuffled; otherwise, each function is visited in linear order. true
DecayPolicyType decayPolicy An instantiated step size update policy to use. DecayPolicyType()

Attributes of the optimizer may also be modified via the member methods MaxIterations(), ThreadShareSize(), Tolerance(), Shuffle(), and DecayPolicy().

Note that the default value for decayPolicy is the default constructor for the DecayPolicyType.

Examples

GeneralizedRosenbrockFunction f(50); // 50-dimensional Rosenbrock function.
arma::mat coordinates = f.GetInitialPoint();

ParallelSGD<> optimizer(100000, f.NumFunctions(), 1e-5, true);
optimizer.Optimize(f, coordinates);

See also:

IQN

An optimizer for differentiable separable functions.

The Incremental Quasi-Newton belongs to the family of stochastic and incremental methods that have a cost per iteration independent of n. IQN iterations are a stochastic version of BFGS iterations that use memory to reduce the variance of stochastic approximations.

Constructors

  • IQN()
  • IQN(stepSize)
  • IQN(stepSize, batchSize, maxIterations, tolerance)

Attributes

type name description default
double stepSize Step size for each iteration. 0.01
size_t batchSize Size of each batch. 10
size_t maxIterations Maximum number of iterations allowed (0 means no limit). 100000
size_t tolerance Maximum absolute tolerance to terminate algorithm. 1e-5

Attributes of the optimizer may also be changed via the member methods StepSize(), BatchSize(), MaxIterations(), and Tolerance().

Examples:

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

IQN optimizer(0.01, 1, 5000, 1e-5);
optimizer.Optimize(f, coordinates);

See also:

Katyusha

An optimizer for differentiable separable functions.

Katyusha is a direct, primal-only stochastic gradient method which uses a “negative momentum” on top of Nesterov’s momentum. Two types are available—one that uses a proximal update step, and one that uses the standard update step.

Constructors

  • KatyushaType<proximal>()
  • KatyushaType<proximal>(convexity, lipschitz)
  • KatyushaType<proximal>(convexity, lipschitz, batchSize)
  • KatyushaType<proximal>(convexity, lipschitz, batchSize, maxIterations, innerIterations, tolerance, shuffle, exactObjective)

The proximal template parameter is a boolean value (true or false) that specifies whether or not the proximal update should be used.

For convenience the following typedefs have been defined:

  • Katyusha (equivalent to KatyushaType<false>): Katyusha with the standard update step
  • KatyushaProximal (equivalent to KatyushaType<true>): Katyusha with the proximal update step

Attributes

type name description default
double convexity The regularization parameter. 1.0
double lipschitz The Lipschitz constant. 10.0
size_t batchSize Batch size to use for each step. 32
size_t maxIterations Maximum number of iterations allowed (0 means no limit). 1000
size_t innerIterations The number of inner iterations allowed (0 means n / batchSize). Note that the full gradient is only calculated in the outer iteration. 0
double tolerance Maximum absolute tolerance to terminate algorithm. 1e-5
bool shuffle If true, the function order is shuffled; otherwise, each function is visited in linear order. true
bool exactObjective Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). false

Attributes of the optimizer may also be changed via the member methods Convexity(), Lipschitz(), BatchSize(), MaxIterations(), InnerIterations(), Tolerance(), Shuffle(), and ExactObjective().

Examples:

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

// Without proximal update.
Katyusha optimizer(1.0, 10.0, 1, 100, 0, 1e-10, true);
optimizer.Optimize(f, coordinates);

// With proximal update.
KatyushaProximal proximalOptimizer(1.0, 10.0, 1, 100, 0, 1e-10, true);
proximalOptimizer.Optimize(f, coordinates);

See also:

L-BFGS

An optimizer for differentiable functions

L-BFGS is an optimization algorithm in the family of quasi-Newton methods that approximates the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm using a limited amount of computer memory.

Constructors

  • L_BFGS()
  • L_BFGS(numBasis, maxIterations)
  • L_BFGS(numBasis, maxIterations, armijoConstant, wolfe, minGradientNorm, factr, maxLineSearchTrials)
  • L_BFGS(numBasis, maxIterations, armijoConstant, wolfe, minGradientNorm, factr, maxLineSearchTrials, minStep, maxStep)

Attributes

type name description default
size_t numBasis Number of memory points to be stored (default 10). 10
size_t maxIterations Maximum number of iterations for the optimization (0 means no limit and may run indefinitely). 10000
double armijoConstant Controls the accuracy of the line search routine for determining the Armijo condition. 1e-4
double wolfe Parameter for detecting the Wolfe condition. 0.9
double minGradientNorm Minimum gradient norm required to continue the optimization. 1e-6
double factr Minimum relative function value decrease to continue the optimization. 1e-15
size_t maxLineSearchTrials The maximum number of trials for the line search (before giving up). 50
double minStep The minimum step of the line search. 1e-20
double maxStep The maximum step of the line search. 1e20

Attributes of the optimizer may also be changed via the member methods NumBasis(), MaxIterations(), ArmijoConstant(), Wolfe(), MinGradientNorm(), Factr(), MaxLineSearchTrials(), MinStep(), and MaxStep().

Examples:

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

L_BFGS optimizer(20);
optimizer.Optimize(f, coordinates);

See also:

LRSDP (low-rank SDP solver)

An optimizer for semidefinite programs.

LRSDP is the implementation of Monteiro and Burer’s formulation of low-rank semidefinite programs (LR-SDP). This solver uses the augmented Lagrangian optimizer to solve low-rank semidefinite programs.

The assumption here is that the solution matrix for the SDP is low-rank. If this assumption is not true, the algorithm should not be expected to converge.

Constructors

  • LRSDP<SDPType>()

The SDPType template parameter specifies the type of SDP to solve. The SDP<arma::mat> and SDP<arma::sp_mat> classes are available for use; these represent SDPs with dense and sparse C matrices, respectively. The SDP<> class is detailed in the semidefinite program documentation.

Once the LRSDP<> object is constructed, the SDP may be specified by calling the SDP() member method, which returns a reference to the SDPType.

Attributes

The attributes of the LRSDP optimizer may only be accessed via member methods.

type method name description default
size_t MaxIterations() Maximum number of iterations before termination. 1000
AugLagrangian AugLag() The internally-held Augmented Lagrangian optimizer. n/a

See also:

Momentum SGD

An optimizer for differentiable separable functions.

Stochastic Gradient Descent is a technique for minimizing a function which can be expressed as a sum of other functions. This is an SGD variant that uses momentum for its updates. Using momentum updates for parameter learning can accelerate the rate of convergence, specifically in the cases where the surface curves much more steeply (a steep hilly terrain with high curvature).

Constructors

  • MomentumSGD()
  • MomentumSGD(stepSize, batchSize)
  • MomentumSGD(stepSize, batchSize, maxIterations, tolerance, shuffle)
  • MomentumSGD(stepSize, batchSize, maxIterations, tolerance, shuffle, momentumPolicy, decayPolicy, resetPolicy, exactObjective)

Note that MomentumSGD is based on the templated type SGD<UpdatePolicyType, DecayPolicyType> with UpdatePolicyType = MomentumUpdate and DecayPolicyType = NoDecay.

Attributes

type name description default
double stepSize Step size for each iteration. 0.01
size_t batchSize Batch size to use for each step. 32
size_t maxIterations Maximum number of iterations allowed (0 means no limit). 100000
double tolerance Maximum absolute tolerance to terminate algorithm. 1e-5
bool shuffle If true, the function order is shuffled; otherwise, each function is visited in linear order. true
MomentumUpdate updatePolicy An instantiated MomentumUpdate. MomentumUpdate()
DecayPolicyType decayPolicy Instantiated decay policy used to adjust the step size. DecayPolicyType()
bool resetPolicy Flag that determines whether update policy parameters are reset before every Optimize call. true
bool exactObjective Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). false

Attributes of the optimizer may also be modified via the member methods StepSize(), BatchSize(), MaxIterations(), Tolerance(), Shuffle(), UpdatePolicy(), DecayPolicy(), ResetPolicy(), and ExactObjective().

Note that the MomentumUpdate class has the constructor MomentumUpdate(momentum) with a default value of 0.5 for the momentum.

Examples

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

MomentumSGD optimizer(0.01, 32, 100000, 1e-5, true, MomentumUpdate(0.5));
optimizer.Optimize(f, coordinates);

See also:

Nadam

An optimizer for differentiable separable functions.

Nadam is a variant of Adam based on NAG (Nesterov accelerated gradient). It uses Nesterov momentum for faster convergence.

Constructors

  • Nadam()
  • Nadam(stepSize, batchSize)
  • Nadam(stepSize, batchSize, beta1, beta2, eps, maxIterations, tolerance, shuffle)
  • Nadam(stepSize, batchSize, beta1, beta2, eps, maxIterations, tolerance, shuffle, resetPolicy)

Note that the Nadam class is based on the AdamType<UpdateRule> class with UpdateRule = NadamUpdate.

Attributes

type name description default
double stepSize Step size for each iteration. 0.001
size_t batchSize Number of points to process in a single step. 32
double beta1 Exponential decay rate for the first moment estimates. 0.9
double beta2 Exponential decay rate for the weighted infinity norm estimates. 0.999
double eps Value used to initialize the mean squared gradient parameter. 1e-8
size_t max_iterations Maximum number of iterations allowed (0 means no limit). 100000
double tolerance Maximum absolute tolerance to terminate algorithm. 1e-5
bool shuffle If true, the function order is shuffled; otherwise, each function is visited in linear order. true
bool resetPolicy If true, parameters are reset before every Optimize call; otherwise, their values are retained. true

The attributes of the optimizer may also be modified via the member methods StepSize(), BatchSize(), Beta1(), Beta2(), Eps(), MaxIterations(), Tolerance(), Shuffle(), and ResetPolicy().

Examples

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

Nadam optimizer(0.001, 32, 0.9, 0.999, 1e-8, 100000, 1e-5, true);
optimizer.Optimize(f, coordinates);

See also:

NadaMax

An optimizer for differentiable separable functions.

NadaMax is a variant of AdaMax based on NAG (Nesterov accelerated gradient). It uses Nesterov momentum for faster convergence.

Constructors

  • NadaMax()
  • NadaMax(stepSize, batchSize)
  • NadaMax(stepSize, batchSize, beta1, beta2, eps, maxIterations, tolerance, shuffle)
  • NadaMax(stepSize, batchSize, beta1, beta2, eps, maxIterations, tolerance, shuffle, resetPolicy)

Note that the NadaMax class is based on the AdamType<UpdateRule> class with UpdateRule = NadaMaxUpdate.

Attributes

type name description default
double stepSize Step size for each iteration. 0.001
size_t batchSize Number of points to process in a single step. 32
double beta1 Exponential decay rate for the first moment estimates. 0.9
double beta2 Exponential decay rate for the weighted infinity norm estimates. 0.999
double eps Value used to initialize the mean squared gradient parameter. 1e-8
size_t max_iterations Maximum number of iterations allowed (0 means no limit). 100000
double tolerance Maximum absolute tolerance to terminate algorithm. 1e-5
bool shuffle If true, the function order is shuffled; otherwise, each function is visited in linear order. true
bool resetPolicy If true, parameters are reset before every Optimize call; otherwise, their values are retained. true

The attributes of the optimizer may also be modified via the member methods StepSize(), BatchSize(), Beta1(), Beta2(), Eps(), MaxIterations(), Tolerance(), Shuffle(), and ResetPolicy().

Examples

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

NadaMax optimizer(0.001, 32, 0.9, 0.999, 1e-8, 100000, 1e-5, true);
optimizer.Optimize(f, coordinates);

See also:

Nesterov Momentum SGD

An optimizer for differentiable separable functions.

Stochastic Gradient Descent is a technique for minimizing a function which can be expressed as a sum of other functions. This is an SGD variant that uses Nesterov momentum for its updates. Nesterov Momentum application can accelerate the rate of convergence to O(1/k^2).

Constructors

  • NesterovMomentumSGD()
  • NesterovMomentumSGD(stepSize, batchSize)
  • NesterovMomentumSGD(stepSize, batchSize, maxIterations, tolerance, shuffle)
  • NesterovMomentumSGD(stepSize, batchSize, maxIterations, tolerance, shuffle, momentumPolicy, decayPolicy, resetPolicy, exactObjective)

Note that MomentumSGD is based on the templated type SGD<UpdatePolicyType, DecayPolicyType> with UpdatePolicyType = NesterovMomentumUpdate and DecayPolicyType = NoDecay.

Attributes

type name description default
double stepSize Step size for each iteration. 0.01
size_t batchSize Batch size to use for each step. 32
size_t maxIterations Maximum number of iterations allowed (0 means no limit). 100000
double tolerance Maximum absolute tolerance to terminate algorithm. 1e-5
bool shuffle If true, the function order is shuffled; otherwise, each function is visited in linear order. true
NesterovMomentumUpdate updatePolicy An instantiated MomentumUpdate. NesterovMomentumUpdate()
DecayPolicyType decayPolicy Instantiated decay policy used to adjust the step size. DecayPolicyType()
bool resetPolicy Flag that determines whether update policy parameters are reset before every Optimize call. true
bool exactObjective Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). false

Attributes of the optimizer may also be modified via the member methods StepSize(), BatchSize(), MaxIterations(), Tolerance(), Shuffle(), UpdatePolicy(), DecayPolicy(), ResetPolicy(), and ExactObjective().

Note that the NesterovMomentumUpdate class has the constructor MomentumUpdate(momentum) with a default value of 0.5 for the momentum.

Examples

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

NesterovMomentumSGD optimizer(0.01, 32, 100000, 1e-5, true,
    MomentumUpdate(0.5));
optimizer.Optimize(f, coordinates);

See also:

OptimisticAdam

An optimizer for differentiable separable functions.

OptimisticAdam is an optimizer which implements the Optimistic Adam algorithm which uses Optmistic Mirror Descent with the Adam Optimizer. It addresses the problem of limit cycling while training GANs (generative adversarial networks). It uses OMD to achieve faster regret rates in solving the zero sum game of training a GAN. It consistently achieves a smaller KL divergnce with~ respect to the true underlying data distribution. The implementation here can be used with any differentiable separable function, not just GAN training.

Constructors

  • OptimisticAdam()
  • OptimisticAdam(stepSize, batchSize)
  • OptimisticAdam(stepSize, batchSize, beta1, beta2, eps, maxIterations, tolerance, shuffle)
  • OptimisticAdam(stepSize, batchSize, beta1, beta2, eps, maxIterations, tolerance, shuffle, resetPolicy)

Note that the OptimisticAdam class is based on the AdamType<UpdateRule> class with UpdateRule = OptimisticAdamUpdate.

Attributes

type name description default
double stepSize Step size for each iteration. 0.001
size_t batchSize Number of points to process in a single step. 32
double beta1 Exponential decay rate for the first moment estimates. 0.9
double beta2 Exponential decay rate for the weighted infinity norm estimates. 0.999
double eps Value used to initialize the mean squared gradient parameter. 1e-8
size_t max_iterations Maximum number of iterations allowed (0 means no limit). 100000
double tolerance Maximum absolute tolerance to terminate algorithm. 1e-5
bool shuffle If true, the function order is shuffled; otherwise, each function is visited in linear order. true
bool resetPolicy If true, parameters are reset before every Optimize call; otherwise, their values are retained. true

The attributes of the optimizer may also be modified via the member methods StepSize(), BatchSize(), Beta1(), Beta2(), Eps(), MaxIterations(), Tolerance(), Shuffle(), and ResetPolicy().

Examples

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

OptimisticAdam optimizer(0.001, 32, 0.9, 0.999, 1e-8, 100000, 1e-5, true);
optimizer.Optimize(f, coordinates);

See also:

Padam

An optimizer for differentiable separable functions.

Padam is a variant of Adam with a partially adaptive momentum estimation method.

Constructors

  • Padam()
  • Padam(stepSize, batchSize)
  • Padam(stepSize, batchSize, beta1, beta2, partial, epsilon, maxIterations, tolerance, shuffle, resetPolicy, exactObjective)

Attributes

type name description default
double stepSize Step size for each iteration. 0.001
size_t batchSize Number of points to process in a single step. 32
double beta1 Exponential decay rate for the first moment estimates. 0.9
double beta2 Exponential decay rate for the weighted infinity norm estimates. 0.999
double partial Partially adaptive parameter. 0.25
double eps Value used to initialize the mean squared gradient parameter. 1e-8
size_t max_iterations Maximum number of iterations allowed (0 means no limit). 100000
double tolerance Maximum absolute tolerance to terminate algorithm. 1e-5
bool shuffle If true, the function order is shuffled; otherwise, each function is visited in linear order. true
bool resetPolicy If true, parameters are reset before every Optimize call; otherwise, their values are retained. true
bool exactObjective Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). false

The attributes of the optimizer may also be modified via the member methods StepSize(), BatchSize(), Beta1(), Beta2(), Partial(), Epsilon(), MaxIterations(), Tolerance(), Shuffle(), ResetPolicy(), and ExactObjective().

Examples

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

Padam optimizer(0.001, 32, 0.9, 0.999, 0.25, 1e-8, 100000, 1e-5, true);
optimizer.Optimize(f, coordinates);

See also:

PSO

An optimizer for arbitrary functions.

PSO is an evolutionary approach to optimization that is inspired by flocks or birds or fishes. The fundamental analogy is that every creature (particle in a swarm) is at a measurable position of goodness or fitness, and this information can be shared amongst the creatures in the flock, so that iteratively, the entire flock can get close to the global optimum.

Constructors

  • PSOType<VelocityUpdatePolicy, InitPolicy>()
  • PSOType<VelocityUpdatePolicy, InitPolicy>(numParticles)
  • PSOType<VelocityUpdatePolicy, InitPolicy>(numParticles, lowerBound, upperBound)
  • PSOType<VelocityUpdatePolicy, InitPolicy>(numParticles, lowerBound, upperBound, maxIterations)
  • PSOType<VelocityUpdatePolicy, InitPolicy>(numParticles, lowerBound, upperBound, maxIterations, horizonSize)
  • PSOType<VelocityUpdatePolicy, InitPolicy>(numParticles, lowerBound, upperBound, maxIterations, horizonSize, impTolerance)
  • PSOType<VelocityUpdatePolicy, InitPolicy>(numParticles, lowerBound, upperBound, maxIterations, horizonSize, impTolerance, exploitationFactor, explorationFactor)

Attributes

type name description default
size_t numParticles numParticles Number of particles in the swarm. 64
double, arma::mat lowerBound Lower bound of the coordinates of the initial population. 1
double, arma::mat upperBound Upper bound of the coordinates of the initial population. 1
size_t maxIterations Maximum number of iterations allowed. 3000
size_t horizonSize Size of the lookback-horizon for computing improvement. 350
double impTolerance The final value of the objective function for termination. If set to negative value, tolerance is not considered. 1e-5
double exploitationFactor Influence of the personal best of the particle. 2.05
double explorationFactor Influence of the neighbours of the particle. 2.05

Note that the parameters lowerBound and upperBound are overloaded. Data types of double or arma::mat may be used. If they are initialized as single values of double, then the same value of the bound applies to all the axes, resulting in an initialization following a uniform distribution in a hypercube. If they are initialized as matrices of arma::mat, then the value of lowerBound[i] applies to axis [i]; similarly, for values in upperBound. This results in an initialization following a uniform distribution in a hyperrectangle within the specified bounds.

Attributes of the optimizer may also be changed via the member methods NumParticles(), LowerBound(), UpperBound(), MaxIterations(), HorizonSize(), ImpTolerance(),ExploitationFactor(), and ExplorationFactor().

At present, only the local-best variant of PSO is present in ensmallen. The optimizer may be initialized using the class type LBestPSO, which is an alias for PSOType<LBestUpdate, DefaultInit>.

Examples:

SphereFunction f(4);
arma::vec coordinates = f.GetInitialPoint();

LBestPSO s;
const double result = s.Optimize(f, coordinates)
RosenbrockFunction f;
arma::vec coordinates = f.GetInitialPoint();

// Setting bounds for the initial swarm population of size 2.
arma::vec lowerBound("50 50");
arma::vec upperBound("60 60");

LBestPSO s(200, lowerBound, upperBound, 3000, 600, 1e-30, 2.05, 2.05);
const double result = s.Optimize(f, coordinates)
RosenbrockFunction f;
arma::vec coordinates = f.GetInitialPoint();

// Setting bounds for the initial swarm population as type double.
double lowerBound = 50;
double upperBound = 60;

LBestPSO s(64, lowerBound, upperBound, 3000, 400, 1e-30, 2.05, 2.05);
const double result = s.Optimize(f, coordinates)

See also:

Primal-dual SDP Solver

An optimizer for semidefinite programs.

A primal-dual interior point method solver. This can solve semidefinite programs.

Constructors

  • PrimalDualSolver<>(maxIterations)
  • PrimalDualSolver<>(maxIterations, tau, normXzTol, primalInfeasTol, dualInfeasTol)

Attributes

The PrimalDualSolver<> class has several attributes that are only modifiable as member methods.

type method name description default
double Tau() Value of tau used to compute alpha_hat. 0.99
double NormXZTol() Tolerance for the norm of X*Z. 1e-7
double PrimalInfeasTol() Tolerance for primal infeasibility. 1e-7
double DualInfeasTol() Tolerance for dual infeasibility. 1e-7
size_t MaxIterations() Maximum number of iterations before convergence. 1000

Optimization

The PrimalDualSolver<> class offers two overloads of Optimize() that optionally return the converged values for the dual variables.

/**
 * Invoke the optimization procedure, returning the converged values for the
 * primal and dual variables.
 */
template<typename SDPType>
double Optimize(SDPType& s,
                arma::mat& X,
                arma::vec& ySparse,
                arma::vec& yDense,
                arma::mat& Z);

/**
 * Invoke the optimization procedure, and only return the primal variable.
 */
template<typename SDPType>
double Optimize(SDPType& s, arma::mat& X);

The SDPType template parameter specifies the type of SDP to solve. The SDP<arma::mat> and SDP<arma::sp_mat> classes are available for use; these represent SDPs with dense and sparse C matrices, respectively. The SDP<> class is detailed in the semidefinite program documentation. SDPType is automatically inferred when Optimize() is called with an SDP.

See also:

Quasi-Hyperbolic Momentum Update SGD (QHSGD)

An optimizer for differentiable separable functions.

Quasi-hyperbolic momentum update SGD (QHSGD) is an SGD-like optimizer with momentum where quasi-hyperbolic terms are added to the parametrization. The update rule for this optimizer is a weighted average of momentum SGD and vanilla SGD.

Constructors

  • QHSGD()
  • QHSGD(stepSize, batchSize)
  • QHSGD(stepSize, batchSize, maxIterations, tolerance, shuffle, exactObjective)

Note that QHSGD is based on the templated type SGD<UpdatePolicyType, DecayPolicyType> with UpdatePolicyType = QHUpdate and DecayPolicyType = NoDecay.

Attributes

type name description default
double stepSize Step size for each iteration. 0.01
size_t batchSize Batch size to use for each step. 32
size_t maxIterations Maximum number of iterations allowed (0 means no limit). 100000
double tolerance Maximum absolute tolerance to terminate algorithm. 1e-5
bool shuffle If true, the function order is shuffled; otherwise, each function is visited in linear order. true
bool exactObjective Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). false

Attributes of the optimizer may also be modified via the member methods StepSize(), BatchSize(), MaxIterations(), Tolerance(), Shuffle(), and ExactObjective().

Note that the QHUpdate class has the constructor QHUpdate(v, momentum) with a default value of 0.7 for the quasi-hyperbolic term v and 0.999 for the momentum term.

Examples

 RosenbrockFunction f;
 arma::mat coordinates = f.GetInitialPoint();

 QHSGD optimizer(0.01, 32, 100000, 1e-5, true);
 optimizer.Optimize(f, coordinates);

See also:

QHAdam

An optimizer for differentiable separable functions.

QHAdam is an optimizer that uses quasi-hyperbolic descent with the Adam optimizer. This replaces the moment estimators of Adam with quasi-hyperbolic terms, and different values of the v1 and v2 parameters are equivalent to the following other optimizers:

  • When v1 = v2 = 1, QHAdam is equivalent to Adam.

  • When v1 = 0 and v2 = 1, QHAdam is equivalent to RMSProp.

  • When v1 = beta1 and v2 = 1, QHAdam is equivalent to Nadam.

Constructors

  • QHAdam()
  • QHAdam(stepSize, batchSize)
  • QHAdam(stepSize, batchSize, v1, v2, beta1, beta2, eps, maxIterations)
  • QHAdam(stepSize, batchSize, v1, v2, beta1, beta2, eps, maxIterations, tolerance, shuffle, resetPolicy, exactObjective)

Attributes

type name description default
double stepSize Step size for each iteration. 0.001
size_t batchSize Number of points to process in a single step. 32
double v1 The First Quasi Hyperbolic Term. 0.7
double v2 The Second Quasi Hyperbolic Term. 1.00
double beta1 Exponential decay rate for the first moment estimates. 0.9
double beta2 Exponential decay rate for the weighted infinity norm estimates. 0.999
double eps Value used to initialize the mean squared gradient parameter. 1e-8
size_t max_iterations Maximum number of iterations allowed (0 means no limit). 100000
double tolerance Maximum absolute tolerance to terminate algorithm. 1e-5
bool shuffle If true, the function order is shuffled; otherwise, each function is visited in linear order. true
bool resetPolicy If true, parameters are reset before every Optimize call; otherwise, their values are retained. true
bool exactObjective Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). false

The attributes of the optimizer may also be modified via the member methods StepSize(), BatchSize(), Beta1(), Beta2(), Eps(), MaxIterations(), Tolerance(), Shuffle(), V1(), V2(), ResetPolicy(), and ExactObjective().

Examples

 RosenbrockFunction f;
 arma::mat coordinates = f.GetInitialPoint();

 QHAdam optimizer(0.001, 32, 0.7, 0.9, 0.9, 0.999, 1e-8, 100000, 1e-5, true);
 optimizer.Optimize(f, coordinates);

See also:

RMSProp

An optimizer for differentiable separable functions.

RMSProp utilizes the magnitude of recent gradients to normalize the gradients.

Constructors

  • RMSProp()
  • RMSProp(stepSize, batchSize)
  • RMSProp(stepSize, batchSize, alpha, epsilon, maxIterations, tolerance, shuffle)
  • RMSProp(stepSize, batchSize, alpha, epsilon, maxIterations, tolerance, shuffle, resetPolicy, exactObjective)

Attributes

type name description default
double stepSize Step size for each iteration. 0.01
size_t batchSize Number of points to process in each step. 32
double alpha Smoothing constant, similar to that used in AdaDelta and momentum methods. 0.99
double epsilon Value used to initialise the mean squared gradient parameter. 1e-8
size_t maxIterations Maximum number of iterations allowed (0 means no limit). 100000
double tolerance Maximum absolute tolerance to terminate algorithm.  
bool shuffle If true, the function order is shuffled; otherwise, each function is visited in linear order. true
bool resetPolicy If true, parameters are reset before every Optimize call; otherwise, their values are retained. true
bool exactObjective Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). false

Attributes of the optimizer can also be modified via the member methods StepSize(), BatchSize(), Alpha(), Epsilon(), MaxIterations(), Tolerance(), Shuffle(), ResetPolicy(), and ExactObjective().

Examples:

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

RMSProp optimizer(1e-3, 1, 0.99, 1e-8, 5000000, 1e-9, true);
optimizer.Optimize(f, coordinates);

See also:

Simulated Annealing (SA)

An optimizer for arbitrary functions.

Simulated Annealing is an stochastic optimization algorithm which is able to deliver near-optimal results quickly without knowing the gradient of the function being optimized. It has a unique hill climbing capability that makes it less vulnerable to local minima. This implementation uses exponential cooling schedule and feedback move control by default, but the cooling schedule can be changed via a template parameter.

Constructors

  • SA<CoolingScheduleType>(coolingSchedule)
  • SA<CoolingScheduleType>(coolingSchedule, maxIterations)
  • SA<CoolingScheduleType>(coolingSchedule, maxIterations, initT, initMoves, moveCtrlSweep, tolerance, maxToleranceSweep, maxMoveCoef, initMoveCoef, gain)

The CoolingScheduleType template parameter implements a policy to update the temperature. The ExponentialSchedule class is available for use; it has a constructor ExponentialSchedule(lambda) where lambda is the cooling speed (default 0.001). Custom schedules may be created by implementing a class with at least the single member method below:

// Return the next temperature given the current system status.
double NextTemperature(const double currentTemperature,
                       const double currentEnergy);

For convenience, the default cooling schedule is ExponentialSchedule, so the shorter type SA<> may be used instead of the equivalent SA<ExponentialSchedule>.

Attributes

type name description default
CoolingScheduleType coolingSchedule Instantiated cooling schedule (default ExponentialSchedule). n/a
size_t maxIterations Maximum number of iterations allowed (0 indicates no limit). 1000000
double initT Initial temperature. 10000.0
size_t initMoves Number of initial iterations without changing temperature. 1000
size_t moveCtrlSweep Sweeps per feedback move control. 100
double tolerance Tolerance to consider system frozen. 1e-5
size_t maxToleranceSweep Maximum sweeps below tolerance to consider system frozen. 3
double maxMoveCoef Maximum move size. 20
double initMoveCoef Initial move size. 0.3
double gain Proportional control in feedback move control. 0.3

Attributes of the optimizer may also be changed via the member methods CoolingSchedule(), MaxIterations(), InitT(), InitMoves(), MoveCtrlSweep(), Tolerance(), MaxToleranceSweep(), MaxMoveCoef(), InitMoveCoef(), and Gain().

Examples:

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

SA<> optimizer(ExponentialSchedule(), 1000000, 1000., 1000, 100, 1e-10, 3, 1.5,
    0.5, 0.3);
optimizer.Optimize(f, coordinates);

See also:

Simultaneous Perturbation Stochastic Approximation (SPSA)

An optimizer for arbitrary functions.

The SPSA algorithm approximates the gradient of the function by finite differences along stochastic directions.

Constructors

  • SPSA(alpha, gamma, stepSize, evaluationStepSize, maxIterations, tolerance)

Attributes

type name description default
double alpha Scaling exponent for the step size. 0.602
double gamma Scaling exponent for evaluation step size. 0.101
double stepSize Scaling parameter for step size (named as ‘a’ in the paper). 0.16
double evaluationStepSize Scaling parameter for evaluation step size (named as ‘c’ in the paper). 0.3
size_t maxIterations Maximum number of iterations allowed (0 means no limit). 100000
double tolerance Maximum absolute tolerance to terminate algorithm. 1e-5

Attributes of the optimizer may also be changed via the member methods Alpha(), Gamma(), StepSize(), EvaluationStepSize(), and MaxIterations().

Examples:

SphereFunction f(2);
arma::mat coordinates = f.GetInitialPoint();

SPSA optimizer(0.1, 0.102, 0.16, 0.3, 100000, 1e-5);
optimizer.Optimize(f, coordinates);

See also:

Stochastic Recursive Gradient Algorithm (SARAH/SARAH+)

An optimizer for differentiable separable functions.

StochAstic Recusive gRadient algoritHm (SARAH), is a variance reducing stochastic recursive gradient algorithm which employs the stochastic recursive gradient, for solving empirical loss minimization for the case of nonconvex losses.

Constructors

  • SARAHType<UpdatePolicyType>()
  • SARAHType<UpdatePolicyType>(stepSize, batchSize)
  • SARAHType<UpdatePolicyType>(stepSize, batchSize, maxIterations, innerIterations, tolerance, shuffle, updatePolicy, exactObjective)

The UpdatePolicyType template parameter specifies the update step used for the optimizer. The SARAHUpdate and SARAHPlusUpdate classes are available for use, and implement the standard SARAH update and SARAH+ update, respectively. A custom update rule can be used by implementing a class with the same method signatures.

For convenience the following typedefs have been defined:

  • SARAH (equivalent to SARAHType<SARAHUpdate>): the standard SARAH optimizer
  • SARAH_Plus (equivalent to SARAHType<SARAHPlusUpdate>): the SARAH+ optimizer

Attributes

type name description default
double stepSize Step size for each iteration. 0.01
size_t batchSize Batch size to use for each step. 32
size_t maxIterations Maximum number of iterations allowed (0 means no limit). 1000
size_t innerIterations The number of inner iterations allowed (0 means n / batchSize). Note that the full gradient is only calculated in the outer iteration. 0
double tolerance Maximum absolute tolerance to terminate algorithm. 1e-5
bool shuffle If true, the function order is shuffled; otherwise, each function is visited in linear order. true
UpdatePolicyType updatePolicy Instantiated update policy used to adjust the given parameters. UpdatePolicyType()
bool exactObjective Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). false

Attributes of the optimizer may also be changed via the member methods StepSize(), BatchSize(), MaxIterations(), InnerIterations(), Tolerance(), Shuffle(), UpdatePolicy(), and ExactObjective().

Note that the default value for updatePolicy is the default constructor for the UpdatePolicyType.

Examples:

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

// Standard stochastic variance reduced gradient.
SARAH optimizer(0.01, 1, 5000, 0, 1e-5, true);
optimizer.Optimize(f, coordinates);

// Stochastic variance reduced gradient with Barzilai-Borwein.
SARAH_Plus optimizerPlus(0.01, 1, 5000, 0, 1e-5, true);
optimizerPlus.Optimize(f, coordinates);

See also:

Standard SGD

An optimizer for differentiable separable functions.

Stochastic Gradient Descent is a technique for minimizing a function which can be expressed as a sum of other functions. It’s likely better to use any of the other variants of SGD than this class; however, this standard SGD implementation may still be useful in some situations.

Constructors

  • StandardSGD()
  • StandardSGD(stepSize, batchSize)
  • StandardSGD(stepSize, batchSize, maxIterations, tolerance, shuffle, updatePolicy, decayPolicy, resetPolicy, exactObjective)

Note that StandardSGD is based on the templated type SGD<UpdatePolicyType, DecayPolicyType> with UpdatePolicyType = VanillaUpdate and DecayPolicyType = NoDecay.

Attributes

type name description default
double stepSize Step size for each iteration. 0.01
size_t batchSize Batch size to use for each step. 32
size_t maxIterations Maximum number of iterations allowed (0 means no limit). 100000
double tolerance Maximum absolute tolerance to terminate algorithm. 1e-5
bool shuffle If true, the function order is shuffled; otherwise, each function is visited in linear order. true
UpdatePolicyType updatePolicy Instantiated update policy used to adjust the given parameters. UpdatePolicyType()
DecayPolicyType decayPolicy Instantiated decay policy used to adjust the step size. DecayPolicyType()
bool resetPolicy Flag that determines whether update policy parameters are reset before every Optimize call. true
bool exactObjective Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). false

Attributes of the optimizer may also be modified via the member methods StepSize(), BatchSize(), MaxIterations(), Tolerance(), Shuffle(), UpdatePolicy(), DecayPolicy(), ResetPolicy(), and ExactObjective().

Examples

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

StandardSGD optimizer(0.01, 32, 100000, 1e-5, true);
optimizer.Optimize(f, coordinates);

See also:

Stochastic Coordinate Descent (SCD)

An optimizer for partially differentiable functions.

Stochastic Coordinate descent is a technique for minimizing a function by doing a line search along a single direction at the current point in the iteration. The direction (or “coordinate”) can be chosen cyclically, randomly or in a greedy fashion.

Constructors

  • SCD<DescentPolicyType>()
  • SCD<DescentPolicyType>(stepSize, maxIterations)
  • SCD<DescentPolicyType>(stepSize, maxIterations, tolerance, updateInterval)
  • SCD<DescentPolicyType>(stepSize, maxIterations, tolerance, updateInterval, descentPolicy)

The DescentPolicyType template parameter specifies the behavior of SCD when selecting the next coordinate to descend with. The RandomDescent, GreedyDescent, and CyclicDescent classes are available for use. Custom behavior can be achieved by implementing a class with the same method signatures.

For convenience, the following typedefs have been defined:

  • RandomSCD (equivalent to SCD<RandomDescent>): selects coordinates randomly
  • GreedySCD (equivalent to SCD<GreedyDescent>): selects the coordinate with the maximum guaranteed descent according to the Gauss-Southwell rule
  • CyclicSCD (equivalent to SCD<CyclicDescent>): selects coordinates sequentially

Attributes

type name description default
double stepSize Step size for each iteration. 0.01
size_t maxIterations Maximum number of iterations allowed (0 means no limit). 100000
double tolerance Maximum absolute tolerance to terminate the algorithm. 1e-5
size_t updateInterval The interval at which the objective is to be reported and checked for convergence. 1e3
DescentPolicyType descentPolicy The policy to use for selecting the coordinate to descend on. DescentPolicyType()

Attributes of the optimizer may also be modified via the member methods StepSize(), MaxIterations(), Tolerance(), UpdateInterval(), and DescentPolicy().

Note that the default value for descentPolicy is the default constructor for DescentPolicyType.

Examples

SparseTestFunction f;
arma::mat coordinates = f.GetInitialPoint();

RandomSCD randomscd(0.01, 100000, 1e-5, 1e3);
randomscd.Optimize(f, coordinates);

GreedySCD greedyscd(0.01, 100000, 1e-5, 1e3);
greedyscd.Optimize(f, coordinates);

CyclicSCD cyclicscd(0.01, 100000, 1e-5, 1e3);
cyclicscd.Optimize(f, coordinates);

See also:

Stochastic Gradient Descent with Restarts (SGDR)

An optimizer for differentiable separable functions.

SGDR is based on Mini-batch Stochastic Gradient Descent class and simulates a new warm-started run/restart once a number of epochs are performed.

Constructors

  • SGDR<UpdatePolicyType>()
  • SGDR<UpdatePolicyType>(epochRestart, multFactor, batchSize, stepSize)
  • SGDR<UpdatePolicyType>(epochRestart, multFactor, batchSize, stepSize, maxIterations, tolerance, shuffle, updatePolicy)
  • SGDR<UpdatePolicyType>(epochRestart, multFactor, batchSize, stepSize, maxIterations, tolerance, shuffle, updatePolicy, resetPolicy, exactObjective)

The UpdatePolicyType template parameter controls the update policy used during the iterative update process. The MomentumUpdate class is available for use, and custom behavior can be achieved by implementing a class with the same method signatures as MomentumUpdate.

For convenience, the default type of UpdatePolicyType is MomentumUpdate, so the shorter type SGDR<> can be used instead of the equivalent SGDR<MomentumUpdate>.

Attributes

type name description default
size_t epochRestart Initial epoch where decay is applied. 50
double multFactor Batch size multiplication factor. 2.0
size_t batchSize Size of each mini-batch. 1000
double stepSize Step size for each iteration. 0.01
size_t maxIterations Maximum number of iterations allowed (0 means no limit). 100000
double tolerance Maximum absolute tolerance to terminate algorithm. 1e-5
bool shuffle If true, the mini-batch order is shuffled; otherwise, each mini-batch is visited in linear order. true
UpdatePolicyType updatePolicy Instantiated update policy used to adjust the given parameters. UpdatePolicyType()
bool resetPolicy If true, parameters are reset before every Optimize call; otherwise, their values are retained. true
bool exactObjective Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). false

Attributes of the optimizer can also be modified via the member methods EpochRestart(), MultFactor(), BatchSize(), StepSize(), MaxIterations(), Tolerance(), Shuffle(), UpdatePolicy(), ResetPolicy(), and ExactObjective().

Note that the default value for updatePolicy is the default constructor for the UpdatePolicyType.

Examples:

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

SGDR<> optimizer(50, 2.0, 1, 0.01, 10000, 1e-3);
optimizer.Optimize(f, coordinates);

See also:

Snapshot Stochastic Gradient Descent with Restarts (SnapshotSGDR)

An optimizer for differentiable separable functions.

SnapshotSGDR simulates a new warm-started run/restart once a number of epochs are performed using the Snapshot Ensembles technique.

Constructors

  • SnapshotSGDR<UpdatePolicyType>()
  • SnapshotSGDR<UpdatePolicyType>(epochRestart, multFactor, batchSize, stepSize)
  • SnapshotSGDR<UpdatePolicyType>(epochRestart, multFactor, batchSize, stepSize, maxIterations, tolerance, shuffle, snapshots, accumulate, updatePolicy)
  • SnapshotSGDR<UpdatePolicyType>(epochRestart, multFactor, batchSize, stepSize, maxIterations, tolerance, shuffle, snapshots, accumulate, updatePolicy, resetPolicy, exactObjective)

The UpdatePolicyType template parameter controls the update policy used during the iterative update process. The MomentumUpdate class is available for use, and custom behavior can be achieved by implementing a class with the same method signatures as MomentumUpdate.

For convenience, the default type of UpdatePolicyType is MomentumUpdate, so the shorter type SnapshotSGDR<> can be used instead of the equivalent SnapshotSGDR<MomentumUpdate>.

Attributes

type name description default
size_t epochRestart Initial epoch where decay is applied. 50
double multFactor Batch size multiplication factor. 2.0
size_t batchSize Size of each mini-batch. 1000
double stepSize Step size for each iteration. 0.01
size_t maxIterations Maximum number of iterations allowed (0 means no limit). 100000
double tolerance Maximum absolute tolerance to terminate algorithm. 1e-5
bool shuffle If true, the mini-batch order is shuffled; otherwise, each mini-batch is visited in linear order. true
size_t snapshots Maximum number of snapshots. 5
bool accumulate Accumulate the snapshot parameter. true
UpdatePolicyType updatePolicy Instantiated update policy used to adjust the given parameters. UpdatePolicyType()
bool resetPolicy If true, parameters are reset before every Optimize call; otherwise, their values are retained. true
bool exactObjective Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). false

Attributes of the optimizer can also be modified via the member methods EpochRestart(), MultFactor(), BatchSize(), StepSize(), MaxIterations(), Tolerance(), Shuffle(), Snapshots(), Accumulate(), UpdatePolicy(), ResetPolicy(), and ExactObjective().

The Snapshots() function returns a std::vector<arma::mat>& (a vector of snapshots of the parameters), not a size_t representing the maximum number of snapshots.

Note that the default value for updatePolicy is the default constructor for the UpdatePolicyType.

Examples:

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

SnapshotSGDR<> optimizer(50, 2.0, 1, 0.01, 10000, 1e-3);
optimizer.Optimize(f, coordinates);

See also:

SMORMS3

An optimizer for differentiable separable functions.

SMORMS3 is a hybrid of RMSprop, which is trying to estimate a safe and optimal distance based on curvature or perhaps just normalizing the stepsize in the parameter space.

Constructors

  • SMORMS3()
  • SMORMS3(stepSize, batchSize)
  • SMORMS3(stepSize, batchSize, epsilon, maxIterations, tolerance)
  • SMORMS3(stepSize, batchSize, epsilon, maxIterations, tolerance, shuffle, resetPolicy, exactObjective)

Attributes

type name description default
double stepSize Step size for each iteration. 0.001
size_t batchSize Number of points to process at each step. 32
double epsilon Value used to initialise the mean squared gradient parameter. 1e-16
size_t maxIterations Maximum number of iterations allowed (0 means no limit). 100000
double tolerance Maximum absolute tolerance to terminate algorithm. 1e-5
bool shuffle If true, the mini-batch order is shuffled; otherwise, each mini-batch is visited in linear order. true
bool resetPolicy If true, parameters are reset before every Optimize call; otherwise, their values are retained. true
bool exactObjective Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). false

Attributes of the optimizer can also be modified via the member methods StepSize(), BatchSize(), Epsilon(), MaxIterations(), Tolerance(), Shuffle(), ResetPolicy(), and ExactObjective().

Examples:

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

SMORMS3 optimizer(0.001, 1, 1e-16, 5000000, 1e-9, true);
optimizer.Optimize(f, coordinates);

See also:

Standard stochastic variance reduced gradient (SVRG)

An optimizer for differentiable separable functions.

Stochastic Variance Reduced Gradient is a technique for minimizing smooth and strongly convex problems.

Constructors

  • SVRGType<UpdatePolicyType, DecayPolicyType>()
  • SVRGType<UpdatePolicyType, DecayPolicyType>(stepSize)
  • SVRGType<UpdatePolicyType, DecayPolicyType>(stepSize, batchSize, maxIterations, innerIterations)
  • SVRGType<UpdatePolicyType, DecayPolicyType>(stepSize, batchSize, maxIterations, innerIterations, tolerance, shuffle, updatePolicy, decayPolicy, resetPolicy, exactObjective)

The UpdatePolicyType template parameter controls the update step used by SVRG during the optimization. The SVRGUpdate class is available for use and custom update behavior can be achieved by implementing a class with the same method signatures as SVRGUpdate.

The DecayPolicyType template parameter controls the decay policy used to adjust the step size during the optimization. The BarzilaiBorweinDecay and NoDecay classes are available for use. Custom decay functionality can be achieved by implementing a class with the same method signatures.

For convenience the following typedefs have been defined:

  • SVRG (equivalent to SVRGType<SVRGUpdate, NoDecay>): the standard SVRG technique
  • SVRG_BB (equivalent to SVRGType<SVRGUpdate, BarzilaiBorweinDecay>): SVRG with the Barzilai-Borwein decay policy

Attributes

type name description default
double stepSize Step size for each iteration. 0.01
size_t batchSize Initial batch size. 32
size_t maxIterations Maximum number of iterations allowed (0 means no limit). 1000
size_t innerIterations The number of inner iterations allowed (0 means n / batchSize). Note that the full gradient is only calculated in the outer iteration. 0
double tolerance Maximum absolute tolerance to terminate algorithm. 1e-5
bool shuffle If true, the batch order is shuffled; otherwise, each batch is visited in linear order. true
UpdatePolicyType updatePolicy Instantiated update policy used to adjust the given parameters. UpdatePolicyType()
DecayPolicyType decayPolicy Instantiated decay policy used to adjust the step size. DecayPolicyType()
bool resetPolicy Flag that determines whether update policy parameters are reset before every Optimize call. true
bool exactObjective Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). false

Attributes of the optimizer may also be modified via the member methods StepSize(), BatchSize(), MaxIterations(), InnerIterations(), Tolerance(), Shuffle(), UpdatePolicy(), DecayPolicy(), ResetPolicy(), and ExactObjective().

Note that the default values for the updatePolicy and decayPolicy parameters are simply the default constructors of the UpdatePolicyType and DecayPolicyType classes.

Examples:

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

// Standard stochastic variance reduced gradient.
SVRG optimizer(0.005, 1, 300, 0, 1e-10, true);
optimizer.Optimize(f, coordinates);

// Stochastic variance reduced gradient with Barzilai-Borwein.
SVRG_BB bbOptimizer(0.005, batchSize, 300, 0, 1e-10, true, SVRGUpdate(),
    BarzilaiBorweinDecay(0.1));
bbOptimizer.Optimize(f, coordinates);

See also:

SPALeRA Stochastic Gradient Descent (SPALeRASGD)

An optimizer for differentiable separable functions.

SPALeRA involves two components: a learning rate adaptation scheme, which ensures that the learning system goes as fast as it can; and a catastrophic event manager, which is in charge of detecting undesirable behaviors and getting the system back on track.

Constructors

  • SPALeRASGD<DecayPolicyType>()
  • SPALeRASGD<DecayPolicyType>(stepSize, batchSize)
  • SPALeRASGD<DecayPolicyType>(stepSize, batchSize, maxIterations, tolerance)
  • SPALeRASGD<DecayPolicyType>(stepSize, batchSize, maxIterations, tolerance, lambda, alpha, epsilon, adaptRate, shuffle, decayPolicy, resetPolicy, exactObjective)

The DecayPolicyType template parameter controls the decay in the step size during the course of the optimization. The NoDecay class is available for use; custom behavior can be achieved by implementing a class with the same method signatures.

By default, DecayPolicyType is set to NoDecay, so the shorter type SPALeRASGD<> can be used instead of the equivalent SPALeRASGD<NoDecay>.

Attributes

type name description default
double stepSize Step size for each iteration. 0.01
size_t batchSize Initial batch size. 32
size_t maxIterations Maximum number of iterations allowed (0 means no limit). 100000
double tolerance Maximum absolute tolerance to terminate algorithm. 1e-5
double lambda Page-Hinkley update parameter. 0.01
double alpha Memory parameter of the Agnostic Learning Rate adaptation. 0.001
double epsilon Numerical stability parameter. 1e-6
double adaptRate Agnostic learning rate update rate. 3.10e-8
bool shuffle If true, the batch order is shuffled; otherwise, each batch is visited in linear order. true
DecayPolicyType decayPolicy Instantiated decay policy used to adjust the step size. DecayPolicyType()
bool resetPolicy Flag that determines whether update policy parameters are reset before every Optimize call. true
bool exactObjective Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). false

Attributes of the optimizer may also be modified via the member methods StepSize(), BatchSize(), MaxIterations(), Tolerance(), Lambda(), Alpha(), Epsilon(), AdaptRate(), Shuffle(), DecayPolicy(), ResetPolicy(), and ExactObjective().

Examples

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

SPALeRASGD<> optimizer(0.05, 1, 10000, 1e-4);
optimizer.Optimize(f, coordinates);

See also:

SWATS

An optimizer for differentiable separable functions.

SWATS is an optimizer that uses a simple strategy to switch from Adam to standard SGD when a triggering condition is satisfied. The condition relates to the projection of Adam steps on the gradient subspace.

Constructors

  • SWATS()
  • SWATS(stepSize, batchSize)
  • SWATS(stepSize, batchSize, beta1, beta2, epsilon, maxIterations, tolerance)
  • SWATS(stepSize, batchSize, beta1, beta2, epsilon, maxIterations, tolerance, shuffle, resetPolicy, exactObjective)

Attributes

type name description default
double stepSize Step size for each iteration. 0.001
size_t batchSize Number of points to process at each step. 32
double beta1 Exponential decay rate for the first moment estimates. 0.9
double beta2 Exponential decay rate for the weighted infinity norm estimates. 0.999
double epsilon Value used to initialise the mean squared gradient parameter. 1e-16
size_t maxIterations Maximum number of iterations allowed (0 means no limit). 100000
double tolerance Maximum absolute tolerance to terminate algorithm. 1e-5
bool shuffle If true, the mini-batch order is shuffled; otherwise, each mini-batch is visited in linear order. true
bool resetPolicy If true, parameters are reset before every Optimize call; otherwise, their values are retained. true
bool exactObjective Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). false

Attributes of the optimizer can also be modified via the member methods StepSize(), BatchSize(), Beta1(), Beta2(), Epsilon(), MaxIterations(), Tolerance(), Shuffle(), ResetPolicy(), and ExactObjective().

Examples:

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

SWATS optimizer(0.001, 1, 0.9, 0.999, 1e-16, 5000000, 1e-9, true);
optimizer.Optimize(f, coordinates);

See also:

WNGrad

An optimizer for differentiable separable functions.

WNGrad is a general nonlinear update rule for the learning rate. WNGrad has near-optimal convergence rates in both the batch and stochastic settings.

Constructors

  • WNGrad()
  • WNGrad(stepSize, batchSize)
  • WNGrad(stepSize, batchSize, maxIterations, tolerance, shuffle)
  • WNGrad(stepSize, batchSize, maxIterations, tolerance, shuffle, resetPolicy, exactObjective)

Attributes

type name description default
double stepSize Step size for each iteration. 0.562
size_t batchSize Initial batch size. 32
size_t maxIterations Maximum number of iterations allowed (0 means no limit). 100000
double tolerance Maximum absolute tolerance to terminate algorithm. 1e-5
bool shuffle If true, the batch order is shuffled; otherwise, each batch is visited in linear order. true
bool resetPolicy If true, parameters are reset before every Optimize call; otherwise, their values are retained. true
bool exactObjective Calculate the exact objective (Default: estimate the final objective obtained on the last pass over the data). false

Attributes of the optimizer may also be modified via the member methods StepSize(), BatchSize(), MaxIterations(), Tolerance(), Shuffle(), ResetPolicy(), and ExactObjective().

Examples

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

WNGrad<> optimizer(0.562, 1, 10000, 1e-4);
optimizer.Optimize(f, coordinates);

See also:

Callback documentation

Callbacks in ensmallen are methods that are called at various states during the optimization process, which can be used to implement and control behaviors such as:

  • Changing the learning rate.
  • Printing of the current objective.
  • Sending a message when the optimization hits a specific state such us a minimal objective.

Callbacks can be passed as an argument to the Optimize() function:

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

MomentumSGD optimizer(0.01, 32, 100000, 1e-5, true, MomentumUpdate(0.5));

// Pass the built-in *PrintLoss* callback as the last argument to the
// *Optimize()* function.
optimizer.Optimize(f, coordinates, PrintLoss());

Passing multiple callbacks is just the same as passing a single callback:

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

MomentumSGD optimizer(0.01, 32, 100000, 1e-5, true, MomentumUpdate(0.5));

// Pass the built-in *PrintLoss* and *EarlyStopAtMinLoss* callback as the last
// argument to the *Optimize()* function.
optimizer.Optimize(f, coordinates, PrintLoss(), EarlyStopAtMinLoss());

It is also possible to pass a callback instantiation that allows accessing of internal callback parameters at a later state:

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

MomentumSGD optimizer(0.01, 32, 100000, 1e-5, true, MomentumUpdate(0.5));

// Create an instantiation of the built-in *StoreBestCoordinates* callback,
// which will store the best objective and the corresponding model parameter
// that can be accessed later.
StoreBestCoordinates<> callback;

// Pass an instantiation of the built-in *StoreBestCoordinates* callback as the
// last argument to the *Optimize()* function.
optimizer.Optimize(f, coordinates, callback);

// Print the minimum objective that is stored inside the *StoreBestCoordinates*
// callback that was passed to the *Optimize()* call.
std::cout << callback.BestObjective() << std::endl;

Built-in Callbacks

EarlyStopAtMinLoss

Stops the optimization process if the loss stops decreasing or no improvement has been made.

Constructors

  • EarlyStopAtMinLoss()
  • EarlyStopAtMinLoss(patience)

Attributes

type name description default
size_t patience The number of epochs to wait after the minimum loss has been reached. 10

Examples:

AdaDelta optimizer(1.0, 1, 0.99, 1e-8, 1000, 1e-9, true);

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();
optimizer.Optimize(f, coordinates, EarlyStopAtMinLoss());

PrintLoss

Callback that prints loss to stdout or a specified output stream.

Constructors

  • PrintLoss()
  • PrintLoss(output)

Attributes

type name description default
std::ostream output Ostream which receives output from this object. stdout

Examples:

AdaDelta optimizer(1.0, 1, 0.99, 1e-8, 1000, 1e-9, true);

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();
optimizer.Optimize(f, coordinates, PrintLoss());

ProgressBar

Callback that prints a progress bar to stdout or a specified output stream.

Constructors

  • ProgressBar()
  • ProgressBar(width)
  • ProgressBar(width, output)

Attributes

type name description default
size_t width Width of the bar. 70
std::ostream output Ostream which receives output from this object. stdout

Examples:

AdaDelta optimizer(1.0, 1, 0.99, 1e-8, 1000, 1e-9, true);

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();
optimizer.Optimize(f, coordinates, ProgressBar());

StoreBestCoordinates

Callback that stores the model parameter after every epoch if the objective decreased.

Constructors

  • StoreBestCoordinates<ModelMatType>()

The ModelMatType template parameter refers to the matrix type of the model parameter.

Attributes

The stored model parameter can be accessed via the member method BestCoordinates() and the best objective via BestObjective().

Examples:

AdaDelta optimizer(1.0, 1, 0.99, 1e-8, 1000, 1e-9, true);

RosenbrockFunction f;
arma::mat coordinates = f.GetInitialPoint();

StoreBestCoordinates<arma::mat> cb;
optimizer.Optimize(f, coordinates, cb);

std::cout << "The optimized model found by AdaDelta has the "
      << "parameters " << cb.BestCoordinatest();

Callback States

Callbacks are called at different states during the optimization process:

  • At the beginning and end of the optimization process.
  • After any call to Evaluate() and EvaluateConstraint.
  • After any call to Gradient() and GradientConstraint.
  • At the start and end of an epoch.

Each callback provides optimization relevant information that can be accessed or modified.

BeginOptimization

Called at the beginning of the optimization process.

  • BeginOptimization(optimizer, function, coordinates)

Attributes

type name description
OptimizerType optimizer The optimizer used to update the function.
FunctionType function The function to be optimized.
MatType coordinates The current function parameter.

EndOptimization

Called at the end of the optimization process.

  • EndOptimization(optimizer, function, coordinates)

Attributes

type name description
OptimizerType optimizer The optimizer used to update the function.
FunctionType function The function to be optimized.
MatType coordinates The current function parameter.

Evaluate

Called after any call to Evaluate().

  • Evaluate(optimizer, function, coordinates, objective)

Attributes

type name description
OptimizerType optimizer The optimizer used to update the function.
FunctionType function The function to be optimized.
MatType coordinates The current function parameter.
double objective Objective value of the current point.

EvaluateConstraint

Called after any call to EvaluateConstraint().

  • EvaluateConstraint(optimizer, function, coordinates, constraint, constraintValue)

Attributes

type name description
OptimizerType optimizer The optimizer used to update the function.
FunctionType function The function to be optimized.
MatType coordinates The current function parameter.
size_t constraint The index of the constraint.
double constraintValue Constraint value of the current point.

Gradient

Called after any call to Gradient().

  • Gradient(optimizer, function, coordinates, gradient)

Attributes

type name description
OptimizerType optimizer The optimizer used to update the function.
FunctionType function The function to be optimized.
MatType coordinates The current function parameter.
GradType gradient Matrix that holds the gradient.

GradientConstraint

Called after any call to GradientConstraint().

  • GradientConstraint(optimizer, function, coordinates, constraint, gradient)

Attributes

type name description
OptimizerType optimizer The optimizer used to update the function.
FunctionType function The function to be optimized.
MatType coordinates The current function parameter.
size_t constraint The index of the constraint.
GradType gradient Matrix that holds the gradient.

BeginEpoch

Called at the beginning of a pass over the data. The objective may be exact or an estimate depending on exactObjective value.

  • BeginEpoch(optimizer, function, coordinates, epoch, objective)

Attributes

type name description
OptimizerType optimizer The optimizer used to update the function.
FunctionType function The function to be optimized.
MatType coordinates The current function parameter.
size_t epoch The index of the current epoch.
double objective Objective value of the current point.

EndEpoch

Called at the end of a pass over the data. The objective may be exact or an estimate depending on exactObjective value.

  • EndEpoch(optimizer, function, coordinates, epoch, objective)

Attributes

type name description
OptimizerType optimizer The optimizer used to update the function.
FunctionType function The function to be optimized.
MatType coordinates The current function parameter.
size_t epoch The index of the current epoch.
double objective Objective value of the current point.

Custom Callbacks

Learning rate scheduling

Setting the learning rate is crucially important when training because it controls both the speed of convergence and the ultimate performance of the model. One of the simplest learning rate strategies is to have a fixed learning rate throughout the training process. Choosing a small learning rate allows the optimizer to find good solutions, but this comes at the expense of limiting the initial speed of convergence. To overcome this tradeoff, changing the learning rate as more epochs have passed is commonly done in model training. The Evaluate method in combination with the StepSize method of the optimizer can be used to update the variables.

Example code showing how to implement a custom callback to change the learning rate is given below.

class ExponentialDecay
{
  // Set up the exponential decay learning rate scheduler with the user
  // specified decay value.
  ExponentialDecay(const double decay) : decay(decay), learningRate(0) { }


  // Callback function called at the start of the optimization process.
  // In this example we will use this to save the initial learning rate.
  template<typename OptimizerType, typename FunctionType, typename MatType>
  void BeginOptimization(OptimizerType& /* optimizer */,
                         FunctionType& /* function */,
                         MatType& /* coordinates */)
  {
    // Save the initial learning rate.
    learningRate = optimizer.StepSize();
  }

  // Callback function called at the end of a pass over the data. We are only
  // interested in the current epoch and the optimizer, we ignore the rest.
  template<typename OptimizerType, typename FunctionType, typename MatType>
  void EndEpoch(OptimizerType& optimizer,
                FunctionType& /* function */,
                const MatType& /* coordinates */,
                const size_t epoch,
                const double objective)
  {
    // Update the learning rate.
    optimizer.StepSize() = learningRate * (1.0 - std::pow(decay,
        (double) epoch));
  }

  double learningRate;
};

int main()
{
  // First, generate some random data, with 10000 points and 10 dimensions.
  // This data has no pattern and as such will make a model that's not very
  // useful---but the purpose here is just demonstration. :)
  //
  // For a more "real world" situation, load a dataset from file using X.load()
  // and y.load() (but make sure the matrix is column-major, so that each
  // observation/data point corresponds to a *column*, *not* a row.
  arma::mat data(10, 10000, arma::fill::randn);
  arma::rowvec responses(10000, arma::fill::randn);

  // Create a starting point for our optimization randomly.  The model has 10
  // parameters, so the shape is 10x1.
  arma::mat startingPoint(10, 1, arma::fill::randn);

  // Construct the objective function.
  LinearRegressionFunction lrf(data, responses);
  arma::mat lrfParams(startingPoint);

  // Create the StandardSGD optimizer with specified parameters.
  // The ens::StandardSGD type can be replaced with any ensmallen optimizer
  //that can handle differentiable functions.
  StandardSGD optimizer(0.001, 1, 0, 1e-15, true);

  // Use the StandardSGD optimizer with specified parameters to minimize the
  // LinearRegressionFunction and pass the *exponential decay*
  // callback function from above.
  optimizer.Optimize(lrf, lrfParams, ExponentialDecay(0.01));

  // Print the trained model parameter.
  std::cout << lrfParams.t();
}

Early stopping at minimum loss

Early stopping is a technique for controlling overfitting in machine learning models, especially neural networks, by stopping the optimization process before the model has trained for the maximum number of iterations.

Example code showing how to implement a custom callback to stop the optimization when the minimum of loss has been reached is given below.

#include <ensmallen.hpp>

// This class implements early stopping at minimum loss callback function to
// terminate the optimization process early if the loss stops decreasing.
class EarlyStop
{
 public:
  // Set up the early stop at min loss class, which keeps track of the minimum
  // loss.
  EarlyStop() : bestObjective(std::numeric_limits<double>::max()) { }

  // Callback function called at the end of a pass over the data, which provides
  // the current objective. We are only interested in the objective and ignore
  // the rest.
  template<typename OptimizerType, typename FunctionType, typename MatType>
  void EndEpoch(OptimizerType& /* optimizer */,
                FunctionType& /* function */,
                const MatType& /* coordinates */,
                const size_t /* epoch */,
                const double objective)
  {
    // Check if the given objective is lower as the previous objective.
    if (objective < bestObjective)
    {
      // Update the local objective.
      bestObjective = objective;
    }
    else
    {
      // Stop the optimization process.
      return true;
    }

    // Do not stop the optimization process.
    return false;
  }

  // Locally-stored best objective.
  double bestObjective;
};

int main()
{
  // First, generate some random data, with 10000 points and 10 dimensions.
  // This data has no pattern and as such will make a model that's not very
  // useful---but the purpose here is just demonstration. :)
  //
  // For a more "real world" situation, load a dataset from file using X.load()
  // and y.load() (but make sure the matrix is column-major, so that each
  // observation/data point corresponds to a *column*, *not* a row.
  arma::mat data(10, 10000, arma::fill::randn);
  arma::rowvec responses(10000, arma::fill::randn);

  // Create a starting point for our optimization randomly.  The model has 10
  // parameters, so the shape is 10x1.
  arma::mat startingPoint(10, 1, arma::fill::randn);

  // Construct the objective function.
  LinearRegressionFunction lrf(data, responses);
  arma::mat lrfParams(startingPoint);

  // Create the L_BFGS optimizer with default parameters.
  // The ens::L_BFGS type can be replaced with any ensmallen optimizer that can
  // handle differentiable functions.
  ens::L_BFGS lbfgs;

  // Use the L_BFGS optimizer with default parameters to minimize the
  // LinearRegressionFunction and pass the *early stopping at minimum loss*
  // callback function from above.
  lbfgs.Optimize(lrf, lrfParams, EarlyStop());

  // Print the trained model parameter.
  std::cout << lrfParams.t();
}

Note that we have simply passed an instantiation of EarlyStop the rest is handled inside the optimizer.

ensmallen provides a more complete and general implementation of a early stopping at minimum loss callback function.

History/changelog

ensmallen 2.10.2: “Fried Chicken”

2019-09-11
  • Add release script to rel/ for maintainers (#128).

  • Fix Armadillo version check (#133).

ensmallen 2.10.1: “Fried Chicken”

2019-09-10
  • Documentation fix for callbacks (#129.

  • Compatibility fixes for ensmallen 1.x (#131).

ensmallen 2.10.0: “Fried Chicken”

2019-09-07
  • All Optimize() functions now take any matrix type; so, e.g., arma::fmat or arma::sp_mat can be used for optimization. See the documentation for more details (#113, #119).

  • Introduce callback support. Callbacks can be appended as the last arguments of an Optimize() call, and can perform custom behavior at different points during the optimization. See the documentation for more details (#119).

  • Slight speedups for FrankWolfe optimizer (#127).

ensmallen 1.16.2: “Loud Alarm Clock”

2019-08-12
  • Fix PSO return type bug (#126).

ensmallen 1.16.1: “Loud Alarm Clock”

2019-08-11
  • Update HISTORY.md to use Markdown links to the PR and add release names.

  • Fix PSO return type bug (#124).

ensmallen 1.16.0: “Loud Alarm Clock”

2019-08-09
  • Add option to avoid computing exact objective at the end of the optimization (#109).

  • Fix handling of curvature for BigBatchSGD (#118).

  • Reduce runtime of tests (#118).

  • Introduce local-best particle swarm optimization, LBestPSO, for unconstrained optimization problems (#86).

ensmallen 1.15.1: “Wrong Side Of The Road”

2019-05-22
  • Fix -Wreorder in qhadam warning (#115).

  • Fix -Wunused-private-field warning in spsa (#115).

  • Add more warning output for gcc/clang (#116).

ensmallen 1.15.0: “Wrong Side Of The Road”

2019-05-14
  • Added QHAdam and QHSGD optimizers (#81).

ensmallen 1.14.4: “Difficult Crimp”

2019-05-12
  • Fixes for BigBatchSGD (#91).

ensmallen 1.14.3: “Difficult Crimp”

2019-05-06
  • Handle eig_sym() failures correctly (#100).

ensmallen 1.14.2: “Difficult Crimp”

2019-03-14
  • SPSA test tolerance fix (#97).

  • Minor documentation fixes (#95, #98).

  • Fix newlines at end of file (#92).

ensmallen 1.14.1: “Difficult Crimp”

2019-03-09
  • Fixes for SPSA (#87).

  • Optimized CNE and DE (#90). Changed initial population generation in CNE to be a normal distribution about the given starting point, which should accelerate convergence.

ensmallen 1.14.0: “Difficult Crimp”

2019-02-20
  • Add DE optimizer (#77).

  • Fix for Cholesky decomposition in CMAES (#83).

ensmallen 1.13.2: “Coronavirus Invasion”

2019-02-18
  • Minor documentation fixes (#82).

ensmallen 1.13.1: “Coronavirus Invasion”

2019-01-24
  • Fix -Wreorder warning (#75).

ensmallen 1.13.0: “Coronavirus Invasion”

2019-01-14
  • Enhance options for AugLagrangian optimizer (#66).

  • Add SPSA optimizer (#69).

ensmallen 1.12.2: “New Year’s Party”

2019-01-05
  • Fix list of contributors.

ensmallen 1.12.1: “New Year’s Party”

2019-01-03
  • Make sure all files end with newlines.

ensmallen 1.12.0: “New Year’s Party”

2018-12-30
  • Add link to ensmallen PDF to README.md.

  • Minor documentation fixes. Remove too-verbose documentation from source for each optimizer (#61).

  • Add FTML optimizer (#48).

  • Add SWATS optimizer (#42).

  • Add Padam optimizer (#46).

  • Add Eve optimizer (#45).

  • Add ResetPolicy() to SGD-like optimizers (#60).

ensmallen 1.11.1: “Jet Lag”

2018-11-29
  • Minor documentation fixes.

ensmallen 1.11.0: “Jet Lag”

2018-11-28
  • Add WNGrad optimizer.

  • Fix header name in documentation samples.

ensmallen 1.10.1: “Corporate Catabolism”

2018-11-16
  • Fixes for GridSearch optimizer.

  • Include documentation with release.

ensmallen 1.10.0: “Corporate Catabolism”

2018-10-20
  • Initial release.

  • Includes the ported optimization framework from mlpack (http://www.mlpack.org/).