Hamiltonian sampling

One of the problems I am interested is the power spectrum estimation of cosmological data in the sky. Hamiltonian can be very effective here in sampling the large dimensional (~10^6) posterior probability distributions. If the posterior distribution is unimodal this technique can be very effective. I have written several codes (in C,fortran and CUDA) which can be found here. CUDA version actually outperforms CPU version as long as as one can keep all the data inside the GPU memory.

Here are some papers I found very useful for understanding the fundamental concepts of Hamiltonian Monte Carlo (HMC)

  1. Neal, 2011, MCMC using Hamiltonian dynamics
  2. Beskos et al. 2010, The Acceptance Probability of the Hybrid Monte Carlo Method in High-Dimensional Problems
  3. Girolami & Calderhead, 2011, Riemann manifold Langevin and Hamiltonian Monte Carlo methods
  4. Hanson, 2001, Markov chain Monte Carlo posterior sampling with the Hamiltonian method
  5. Hoffman & Gelman, 2014, The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo

I started with a fortran code which can be found here, then moved to a c version called Ellipsis and finally created a CUDA version. For the CUDA version I get really big speed up. You can see the plots below.

Numerical optimization codes

I was looking for a code that can handle an optimization problem with a million dimension. What I know is that my function has a single peak. I can also get a fairly good starting point. There are correlations between parameters. Given these conditions, my question was what software is publicly available for solving the equations numerically. A huge list of open source minimization software can be found in Wikipedia. Eigen library provides a set of non-linear optimization methods. It is ported from MINPACK. Its starting point was the c version of MINPACK cminpack. Then there is lfbgs++ and HLFBGS. There are routines in Python and R to do this but my code was in C++. I came across a M1QN3 algorithm and was very impressed by its performance.

MINPACK

MINPACK is a library of FORTRAN subroutines for the solving of systems of nonlinear equations, or the least squares minimization of the residual of a set of linear or nonlinear equations. One of the famous subroutines in this code is lmder.

subroutine lmder(fcn,m,n,x,fvec,fjac,ldfjac,ftol,xtol,gtol,
     *                 maxfev,diag,mode,factor,nprint,info,nfev,njev,
     *                 ipvt,qtf,wa1,wa2,wa3,wa4)

A C/C++ rewrite of the MINPACK is available in GitHub. This portable and easy to use. Here is an example from cminpack. Note the definition of __minpack_real__ and the use macro __minpack_func__()

/*      driver for lmder example. */


#include <stdio.h>
#include <math.h>
#include <assert.h>
#include <minpack.h>
#define real __minpack_real__

void fcn(const int *m, const int *n, const real *x, real *fvec, real *fjac,
     const int *ldfjac, int *iflag);

int main()
{
  int i, j, m, n, ldfjac, maxfev, mode, nprint, info, nfev, njev;
  int ipvt[3];
  real ftol, xtol, gtol, factor, fnorm;
  real x[3], fvec[15], fjac[15*3], diag[3], qtf[3],
    wa1[3], wa2[3], wa3[3], wa4[15];
  int one=1;
  real covfac;

  m = 15;
  n = 3;

/*      the following starting values provide a rough fit. */

  x[1-1] = 1.;
  x[2-1] = 1.;
  x[3-1] = 1.;

  ldfjac = 15;

  /*      set ftol and xtol to the square root of the machine */
  /*      and gtol to zero. unless high solutions are */
  /*      required, these are the recommended settings. */

  ftol = sqrt(__minpack_func__(dpmpar)(&one));
  xtol = sqrt(__minpack_func__(dpmpar)(&one));
  gtol = 0.;

  maxfev = 400;
  mode = 1;
  factor = 1.e2;
  nprint = 0;

  __minpack_func__(lmder)(&fcn, &m, &n, x, fvec, fjac, &ldfjac, &ftol, &xtol, &gtol,
    &maxfev, diag, &mode, &factor, &nprint, &info, &nfev, &njev,
    ipvt, qtf, wa1, wa2, wa3, wa4);
  fnorm = __minpack_func__(enorm)(&m, fvec);
  printf("      final l2 norm of the residuals%15.7g\n\n", (double)fnorm);
  printf("      number of function evaluations%10i\n\n", nfev);
  printf("      number of Jacobian evaluations%10i\n\n", njev);
  printf("      exit parameter                %10i\n\n", info);
  printf("      final approximate solution\n");
  for (j=1; j<=n; j++) {
    printf("%s%15.7g", j%3==1?"\n     ":"", (double)x[j-1]);
  }
  printf("\n");
  ftol = __minpack_func__(dpmpar)(&one);
  covfac = fnorm*fnorm/(m-n);
  __minpack_func__(covar)(&n, fjac, &ldfjac, ipvt, &ftol, wa1);
  printf("      covariance\n");
  for (i=1; i<=n; i++) {
    for (j=1; j<=n; j++) {
      printf("%s%15.7g", j%3==1?"\n     ":"", (double)fjac[(i-1)*ldfjac+j-1]*covfac);
    }
  }
  printf("\n");
  return 0;
}

void fcn(const int *m, const int *n, const real *x, real *fvec, real *fjac,
     const int *ldfjac, int *iflag)
{

  /*      subroutine fcn for lmder example. */

  int i;
  real tmp1, tmp2, tmp3, tmp4;
  real y[15]={1.4e-1, 1.8e-1, 2.2e-1, 2.5e-1, 2.9e-1, 3.2e-1, 3.5e-1,
        3.9e-1, 3.7e-1, 5.8e-1, 7.3e-1, 9.6e-1, 1.34, 2.1, 4.39};
  assert(*m == 15 && *n == 3);

  if (*iflag == 0) {
    /*      insert print statements here when nprint is positive. */
    /* if the nprint parameter to lmder is positive, the function is
       called every nprint iterations with iflag=0, so that the
       function may perform special operations, such as printing
       residuals. */
    return;
  }

  if (*iflag != 2) {
    /* compute residuals */
    for (i=1; i <= 15; i++)
    {
      tmp1 = i;
      tmp2 = 16 - i;
      tmp3 = tmp1;
      if (i > 8) {
        tmp3 = tmp2;
      }
      fvec[i-1] = y[i-1] - (x[1-1] + tmp1/(x[2-1]*tmp2 + x[3-1]*tmp3));
    }
  } else {
    /* compute Jacobian */
    for (i=1; i<=15; i++) {
      tmp1 = i;
      tmp2 = 16 - i;
      tmp3 = tmp1;
      if (i > 8) {
        tmp3 = tmp2;
      }
      tmp4 = (x[2-1]*tmp2 + x[3-1]*tmp3); tmp4 = tmp4*tmp4;
      fjac[i-1 + *ldfjac*(1-1)] = -1.;
      fjac[i-1 + *ldfjac*(2-1)] = tmp1*tmp2/tmp4;
      fjac[i-1 + *ldfjac*(3-1)] = tmp1*tmp3/tmp4;
    };
  }
  return;
}

Eigen library

Eigen has its own port of MINPACK. However, it is in the unsupported modules and the documentation can be found here. Here is a snippet from of the examples provided by Eigen.

struct lmder_functor : Functor<double>
{
    lmder_functor(void): Functor<double>(3,15) {}
    int operator()(const VectorXd &x, VectorXd &fvec) const
    {
        double tmp1, tmp2, tmp3;
        static const double y[15] = {1.4e-1, 1.8e-1, 2.2e-1, 2.5e-1, 2.9e-1, 3.2e-1, 3.5e-1,
            3.9e-1, 3.7e-1, 5.8e-1, 7.3e-1, 9.6e-1, 1.34, 2.1, 4.39};

        for (int i = 0; i < values(); i++)
        {
            tmp1 = i+1;
            tmp2 = 16 - i - 1;
            tmp3 = (i>=8)? tmp2 : tmp1;
            fvec[i] = y[i] - (x[0] + tmp1/(x[1]*tmp2 + x[2]*tmp3));
        }
        return 0;
    }

    int df(const VectorXd &x, MatrixXd &fjac) const
    {
        double tmp1, tmp2, tmp3, tmp4;
        for (int i = 0; i < values(); i++)
        {
            tmp1 = i+1;
            tmp2 = 16 - i - 1;
            tmp3 = (i>=8)? tmp2 : tmp1;
            tmp4 = (x[1]*tmp2 + x[2]*tmp3); tmp4 = tmp4*tmp4;
            fjac(i,0) = -1;
            fjac(i,1) = tmp1*tmp2/tmp4;
            fjac(i,2) = tmp1*tmp3/tmp4;
        }
        return 0;
    }
};

void testLmder1()
{
  int n=3, info;

  VectorXd x;

  /* the following starting values provide a rough fit. */
  x.setConstant(n, 1.);

  // do the computation
  lmder_functor functor;
  LevenbergMarquardt<lmder_functor> lm(functor);
  info = lm.lmder1(x);

  // check return value
  VERIFY_IS_EQUAL(info, 1);
  VERIFY_IS_EQUAL(lm.nfev, 6);
  VERIFY_IS_EQUAL(lm.njev, 5);

  // check norm
  VERIFY_IS_APPROX(lm.fvec.blueNorm(), 0.09063596);

  // check x
  VectorXd x_ref(n);
  x_ref << 0.08241058, 1.133037, 2.343695;
  VERIFY_IS_APPROX(x, x_ref);
}

HLBFGS

HLBFGS is a hybrid L-BFGS(Limited Memory Broyden–Fletcher–Goldfarb–Shanno Method) optimization framework which unifies L-BFGS method, Preconditioned L-BFGS method and Preconditioned Conjugate Gradient method. With different options, HLBFGS also works like gradient-decent method, Newton method and Conjugate-gradient method. The original purpose I wrote this routine is for fast-computing large-scaled Centroidal Voronoi diagram. This is one of my favorite libraries and offers a big set of algorithms, for example, Gradient Decent, Conjugate Gradient, L-BFGS and M1QN3. I am making use of this library in my code and find it very stable. Here is an example that shows the easiness of the API

#ifdef _DEBUG
#define _CRTDBG_MAP_ALLOC
#include <stdlib.h>
#include <crtdbg.h>
#endif

#include "HLBFGS.h"

#include "Lite_Sparse_Matrix.h"

#include <iostream>

Lite_Sparse_Matrix* m_sparse_matrix = 0;

//////////////////////////////////////////////////////////////////////////
void evalfunc(int N, double* x, double *prev_x, double* f, double* g)
{
    *f = 0;
    for (int i = 0; i < N; i+=2)
    {
        double T1 = 1 - x[i];
        double T2 = 10*(x[i+1]-x[i]*x[i]);
        *f += T1*T1+T2*T2;
        g[i+1]   = 20*T2;
        g[i] = -2*(x[i]*g[i+1]+T1);
    }
}
//////////////////////////////////////////////////////////////////////////
void newiteration(int iter, int call_iter, double *x, double* f, double *g,  double* gnorm)
{
    std::cout << iter <<": " << call_iter <<" " << *f <<" " << *gnorm  << std::endl;
}
//////////////////////////////////////////////////////////////////////////
void evalfunc_h(int N, double *x, double *prev_x, double *f, double *g, HESSIAN_MATRIX& hessian)
{
    //the following code is not optimal if the pattern of hessian matrix is fixed.
    if (m_sparse_matrix)
    {
        delete m_sparse_matrix;
    }

    m_sparse_matrix = new Lite_Sparse_Matrix(N, N, SYM_LOWER, CCS, FORTRAN_TYPE, true);

    m_sparse_matrix->begin_fill_entry();

    static bool first = true;
    double *diag = m_sparse_matrix->get_diag();

    if (first)
    {
        // you need to update f and g
        *f = 0;
        double tmp;
        for (int i = 0; i < N; i+=2)
        {
            tmp = x[i]*x[i];
            double T1 = 1 - x[i];
            double T2 = 10*(x[i+1]-tmp);
            *f += T1*T1+T2*T2;
            g[i+1]   = 20*T2;
            g[i] = -2*(x[i]*g[i+1]+T1);
            diag[i] = 2+1200*tmp-400*x[i+1];
            diag[i+1] = 200;
            m_sparse_matrix->fill_entry(i, i+1, -400*x[i]);
        }
    }
    else
    {
        for (int i = 0; i < N; i+=2)
        {
            diag[i] = 2+1200*x[i]*x[i]-400*x[i+1];
            diag[i+1] = 200;
            m_sparse_matrix->fill_entry(i, i+1, -400*x[i]);
        }
    }

    m_sparse_matrix->end_fill_entry();

    hessian.set_diag(m_sparse_matrix->get_diag());
    hessian.set_values(m_sparse_matrix->get_values());
    hessian.set_rowind(m_sparse_matrix->get_rowind());
    hessian.set_colptr(m_sparse_matrix->get_colptr());
    hessian.set_nonzeros(m_sparse_matrix->get_nonzero());
    first = false;
}
//////////////////////////////////////////////////////////////////////////
void Optimize_by_HLBFGS(int N, double *init_x, int num_iter, int M, int T, bool with_hessian)
{
    double parameter[20];
    int info[20];
    //initialize
    INIT_HLBFGS(parameter, info);
    info[4] = num_iter;
    info[6] = T;
    info[7] = with_hessian?1:0;
    info[10] = 0;
    info[11] = 1;

    if (with_hessian)
    {
        HLBFGS(N, M, init_x, evalfunc, evalfunc_h, HLBFGS_UPDATE_Hessian, newiteration, parameter, info);
    }
    else
    {
        HLBFGS(N, M, init_x, evalfunc, 0, HLBFGS_UPDATE_Hessian, newiteration, parameter, info);
    }

}
//////////////////////////////////////////////////////////////////////////
void main()
{
#ifdef _DEBUG
    _CrtSetDbgFlag(_CRTDBG_ALLOC_MEM_DF | _CRTDBG_LEAK_CHECK_DF);
#endif
    std::cout.precision(16);
    std::cout << std::scientific;

    int N = 1000;
    std::vector<double> x(N);

    for (int i = 0; i < N/2; i++)
    {
        x[2*i]   = -1.2;
        x[2*i+1] =  1.0;
    }

    int M = 7;
    int T = 0;

    //use Hessian
    // if M = 0, T = 0, it is Newton
    //Optimize_by_HLBFGS(N, &x[0], 1000, M, T, true);

    //without Hessian
    Optimize_by_HLBFGS(N, &x[0], 1000, M, T, false);  // it is LBFGS(M) actually, T is not used

    if (m_sparse_matrix)
        delete m_sparse_matrix;
}

M1QN3

While using HLFBGS I noticed the excellent performance of MIQN3 algorithm. So started looking further and came across the original fortran version. It is a part of MODULOPT, a library for solving optimization problems and testing optimization software. The API in fortran for M1QN3 is shown below

subroutine m1qn3 (simul, prosca, ctonb, ctcab, n, x, f, g, &
    dxmin, df1, epsg, normtype, impres, io, &
    imode, omode, niter, nsim, iz, dz, ndz, &
    reverse, indic, izs, rzs, dzs)

Here is what they say about this software in the website

The routine M1QN3 has been designed to minimize functions depending on a very large number of variables (several hundred million is sometimes possible) no subject to constraints. It implements a limited memory quasi-Newton technique (the L-BFGS method of J. Nocedal) with a dynamically updated scalar or diagonal preconditioner. It uses line-search to enforce global convergence; more precisely, the step-size is determined by the Fletcher-Lemaréchal algorithm and realizes the Wolfe conditions.

I find it true. I have tried it with a problem of 1e6 dimensions and it worked. My problem was unimodal and with minor correlations between parameters. Analytical gradients were available as well.

Spline libraries for scientific computation

Splines are extremely useful to represent data when the underlying model is unknown or expensive to compute. I use them as a part of my cosmological calculations. In this post I explore the publicly available codes for computing splines. Here is an example of a noisy sine curve in [0,4pi] and smoothed spline curve obtained with GSL B-splines.

sine spline

Gnu Scientific Library (GSL)

GSL offer B-splines as a part of the library. In c/c++ we can use the functionality by

#include <gsl/gsl_bspline.h>

An example program can be found here

Eigen library

Eigen is known for its excellence in linear algebra. It also offers a large set of unsupported modules. Splines is one of them. Below you can find a simple code I found in KDE forum

#include <unsupported/Eigen/Splines>

typedef Spline<double,2> Spline2d;
typedef Spline2d::PointType PointType;
typedef Spline2d::KnotVectorType KnotVectorType;
typedef Spline2d::ControlPointVectorType ControlPointVectorType;

ControlPointVectorType points = ControlPointVectorType::Random(2,100);
const Spline2d spline = SplineFitting<Spline2d>::Interpolate(points,3);

KnotVectorType chord_lengths; // knot parameters
Eigen::ChordLengths(points, chord_lengths);

for (Eigen::DenseIndex i=0; i<points.cols(); ++i)
{
    PointType pt = spline( chord_lengths(i) );
    PointType ref = points.col(i);
    VERIFY( (pt - ref).matrix().norm() < 1e-14 );
}

Einspline library

Einspline library offers excellent spline fitting facilities. There are several bases available, for example, linear, logarithmic and general. The code is written in c and runs fast. We can get the 1-D splines by including

#include <nubspline.h>

Benchmarking splines

I needed to see how each of these libraries compare to each other and decided to code up a benchmarking repository in github. It can be found here

Open source numerical integration codes

Numerical integration is one of the widely used computational techniques employed in scientific computation. These method can be traced back to mathematicians of ancient Greece. Availability of stable integration routine depends on what programming language you use. For example in Python there is scipy.integrate and it provides a stable code for many differentness computational problems. In R there is the integrate function for one-diemnstional integral really well. Since I work mainly in c, c++ and fortran I was looking for a publicly available code that can be used with my code. I found that QUADPACK and Quadpack++ is the only stable codes publicly available (correct me if I am wrong). There are numerical integration methods available as a part of the QuantLib and AlgLib. Cuba library offers mutli-dimenstional integration routines.

QUADPACK

QUADPACK has been around for a while. Written in fortran it remains as one of most stable and widely used codes. It contains several routines for integrating functions of different characteristics. For example the routine QAG provides a 1D globally adaptive integrator using Gauss-Kronrod quadrature.

subroutine qag(f,a,b,epsabs,epsrel,key,result,abserr,neval,ier,limit,lenw,last,iwork,work)

GSL

GSL has numerical integration module. It provides most of algorithms in QUADPACK. API is straightforward. The example below is from GSL

#include <stdio.h>
#include <math.h>
#include <gsl/gsl_integration.h>

double f (double x, void * params)
{
    double alpha = *(double *) params;
    double f = log(alpha*x) / sqrt(x);
    return f;
}

int main (void)
{
    gsl_integration_workspace * w
        = gsl_integration_workspace_alloc (1000);

    double result, error;
    double expected = -4.0;
    double alpha = 1.0;

    gsl_function F;
    F.function = &f;
    F.params = &alpha;

    gsl_integration_qags (&F, 0, 1, 0, 1e-7, 1000,
        w, &result, &error);

    printf ("result          = % .18f\n", result);
    printf ("exact result    = % .18f\n", expected);
    printf ("estimated error = % .18f\n", error);
    printf ("actual error    = % .18f\n", result - expected);
    printf ("intervals =  %d\n", w->size);

    gsl_integration_workspace_free (w);

    return 0;
}

quadpack++

quadpack++ implements adaptive quadrature routines from the GNU Scientific Library for an arbitrary floating-point type using C++ templates. The only requirement is that standard arithmetic operators, including mixed arithmetic with standard types such as “int”, be overloaded for the floating-point type. The core routines are implemented in header files, requiring nothing to be built or linked against. The library is intended for use with multiple-precision calculations. One package supporting this is MPFR, whose website has links to wrappers that define a multiple-precision floating-point type as a C++ class. This last point is extremely significant as both QUADPACK and GSL read the weights from a table and can only do certain predefined rules. quadpack++ implements codes to compute the weights for any arbitrary rule! Below is an example of what it can do

#include <cmath>
#include <cstdlib>
#include <iostream>
#include "workspace.hpp"

typedef double Real;

#define ABS(x) (((x) < Real(0)) ? -(x) : (x))

Real integrand(Real x, Real* alpha)
{
    return pow(x, *alpha) * log(1/x);
}

Real exact(Real alpha)
{
    return pow(alpha + 1, -2);
}

int main(int argc, char** argv)
{
    size_t limit = 128;
    size_t m_deg = 10; // sets (2m+1)-point Gauss-Kronrod

    Workspace<Real> Work(limit, m_deg);

    // Set parameters and function class...
    Real alpha = (argc > 1) ? Real(atof(argv[1])) : Real(1);

    std::cout << "# alpha : " << alpha << std::endl;

    Function<Real, Real> F(integrand, &alpha);

    // Set default quadrature tolerance from machine epsilon...
    Real epsabs =  Work.get_eps() * 1e2;
    Real epsrel = Real(0);
    std::cout << "# epsabs : " << epsabs << std::endl;

    int status = 0;
    Real result, abserr;

    try {
        status = Work.qag(F, Real(0), Real(1), epsabs, epsrel, result, abserr);
    }
    catch (const char* reason) {
        std::cerr << reason << std::endl;
        return status;
    }

    std::cout << "result: " << result;
    std::cout << " error: " << ABS(result - exact(alpha));
    std::cout << std::endl;

    return 0;
}
Open source spherical harmonics codes

I deal with data on the (shperical) sky every day and one of the easiest way of computing statistics of the data is to reduce them into a set of power spectra. This technique relies on the availability of fast accurate Spherical Harmonic Transforms (SHT). Their applications range from the analysis of geographical and atmospherical data to cosmological surveys. Here is a small list of SHT codes that I tested as a part of my data analysis pipeline.

LibSHARP

LibSHARP is written in c99 with OpenMP and SSE2/AVX. There is also MPI support. It is used as a backend for the widely used HEALPix library. More information can be found in ArXiv. It supports many different grid-geometries such as HEALPix, Gauss-Legendre, Fejer and Clenshaw-Curtis. Both scalar and vector transforms are supported. Note that this library is the successor to the libpsht library.

Most of the functionality can be loaded by three header files as shown below.

#include <sharp_lowlevel.h>
#include <sharp_almhelpers.h>
#include <sharp_geomhelpers.h>

The helpers are useful in creating grids on the sky. For example a Gauss-Legendre grid can be created using the helper function.

void sharp_make_gauss_geom_info (int nrings, int nphi, double phi0,
  int stride_lon, int stride_lat, sharp_geom_info **geom_info);

SHTns

SHTns is a high performance library for Spherical Harmonic Transform written in c. The spatial data can be stored in latitude-major or longitude-major order arrays. More details and benchmarks can be found in ArXiv. The code is parallelized by OpenMP. The main dependency is FFTW. API can be load by

#include <shtns.h>

The library interface is extremely light, for example, the synthesis can be called by

void SH_to_spat (shtns_cfg shtns,cplx * Qlm,double * Vr );

SHTOOLS

SHTOOLS is an archive of Fortran 95 and Python software that can be used to perform spherical harmonic transforms and reconstructions, rotations of data expressed in spherical harmonics, and multitaper spectral analyses on the sphere. There is a good amount of documentation here. API is in fortran is straightforward, for example a synthesis can be performed by

call MakeGridGLQ (gridglq, cilm, lmax, plx, zero, norm, csphase, lmax_calc)

The above code will create a 2D map from a set of spherical harmonic coefficients sampled on the Gauss-Legendre quadrature nodes.

S2HAT

S2HAT is derived from HEALPIx package but since v2.5 it is an independent library. The main features include memory scalability, linear speed-up, memory and load-balance optimization and applicability to any iso-latitude pixelization. Both scalar and vector transforms are supported. What is fascinating about this code is that it has a CUDA version. An example CUDA synthesis API is shown below

void s2hat_alm2map_cuda (
    int plms, // no use
    s2hat_pixeltype pixelization,
    s2hat_scandef scan,
    int nlmax,
    int nmmax,
    int nmvals,
    int* mvals,
    int nmaps, // no use
    int nstokes, // no use
    int first_ring,
    int last_ring,
    int map_size,
    double *local_map,
    int lda,
    s2hat_dcomplex *local_alm,
    int nplm, // no use
    double* local_plm, // no use
    int numprocs,
    int myid,
    MPI_Comm comm);