Code coverage gives us an idea about what parts of the code is covered by the unit tests.

gcov

There are several open source tools available for analysing coverage. In this post, however, I will be using gcov.

Let us consider a simple example of a calculator program. My calculator class is shown below

template<class scalarType>
class calculator
{
public:
    static scalarType add(scalarType const a , scalarType const b)
    {
        return a+b;
    }

    static scalarType subtract(scalarType const a , scalarType const b)
    {
        return a-b;
    }

    static scalarType multiply(scalarType const a , scalarType const b)
    {
        return a*b;
    }

    static scalarType divide(scalarType const a , scalarType const b)
    {
        return a/b;
    }

    static scalarType abs(scalarType const a)
    {
        if(a<scalarType(0))
        {
            return -a;
        }

        return a;
    }
};

There are several public functions that emulates the functionality of a calculator. Ideally, one would like to write a unit test corresponding to each of these functions so that the code is robust. The code coverage tools help you understand how much of your code is covered by the test programs. Now let us write a test program for the add function as shown below

#include <calculator.hpp>

int main(void)
{
    if( calculator<double>::add(double(1.),double(2.)) !=  double(1.)+double(2.) )
    {
        return 1;
    }
    return 0;
}

If this is the only test program you have, then the code coverage tools will tell you that the your code is not completely covered. Now let us consider unit tests for another function abs.

if( calculator<double>::abs(double(-1.)) != double(1.) )
{
    return 1;
}

Imagine that this is the only test function for the abs function. We then know that the test will not go beyond the if clause. This is where code coverage tools can be helpful. Let us see what the codecov.io tells us about the testing. Below is a screenshot of the coverage report.

calculator code coverage

As we can see the tests missed the return a statement at the bottom abs function. Therefore, ideally, we require another test function like

if( calculator<double>::abs(double(1.)) != double(1.) )
{
    return 1;
}

If both test functions are present we can be sure that the abs function is reliable.

Setting up a codecov report using TravisCI

If your code is under a continuous integration platform, we can easily integrate the code to the a coverage platform like codecov. All you have to do is add a few lines in the .travis.yml file. Below is an example

script:
    - make testCalculator
    - ./Calculator/testCalculator
    - gcov Calculator/testCalculator --object-directory=Calculator/CMakeFiles/testCalculator.dir/testCalculator.cpp.gcno
    - ls -la

after_success:
    - bash <(curl -s https://codecov.io/bash)

The next steps are

  1. make an account in codecov and
  2. add your repository to the dashboard.

If you are linking your github, then adding a public repository is straightforward. My example repository can be found here.

Last week I bought an Intel Xeon Phi off EBay. My plan is to create a desktop supercomputer! This is an extremely challenging but exciting project.

The excitement

One of my areas of my research is spherical harmonics (used in power spectrum estimation of cosmological data) and two of the codes that are available for fast Spherical Harmonic Transform (SHT) is parallelized by OpenMP. I believe Xeon Phi is ideal for this computation. People in Caltech did a performance analysis a few years ago on Xeon Phi and their paper gives a good insight. They do caution about what can be achieved.

While most heritage software will run on the Xeon Phi with little effort, there’s no guarantee it will run optimally to fully exploit its architecture, even if the segments that carry out intensive numerical computations have been highly parallelized.

I found a sample code in their paper which shows how “easy” it is to offload the highly parallelizable parts of the code to Xeon Phi. Take a look below

#ifndef MIC_DEV
#define MIC_DEV 0
#endif
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <mkl.h> /* needed for the dsecnd() timing function. */
#include <math.h>
/* Program test1.c: multiply two matrices using explicit looping of elements. */
/*---------------------------------------------------------------------*/
/* Simple "naive" method to multiply two square matrices A and B
 to generate matrix C. */
void myMult(int size,
    float (* restrict A)[size],
    float (* restrict B)[size],
    float (* restrict C)[size])
{
    #pragma offload target(mic:MIC_DEV) in(A:length(size*size)) \
    in( B:length(size*size)) \
    out(C:length(size*size))
    {
        /* Initialize the C matrix with zeroes. */
        #pragma omp parallel for default(none) shared(C,size)
        for(int i = 0; i < size; ++i)
            for(int j = 0; j < size; ++j)
                C[i][j] = 0.f;

        /* Compute matrix multiplication. */
        #pragma omp parallel for default(none) shared(A,B,C,size)
        for(int i = 0; i < size; ++i)
            for(int k = 0; k < size; ++k)
                for(int j = 0; j < size; ++j)
                    C[i][j] += A[i][k] * B[k][j];
    }
}
/*---------------------------------------------------------------------*/
/* Read input parameters; set-up dummy input data; multiply matrices using
 the myMult() function above; average repeated runs. */
int main(int argc, char *argv[])
{
    if(argc != 4)
    {
        fprintf(stderr,"Use: %s size nThreads nIter\n",argv[0]);
        return -1;
    }
    int i,j,nt;
    int size=atoi(argv[1]);
    int nThreads=atoi(argv[2]);
    int nIter=atoi(argv[3]);

    omp_set_num_threads(nThreads);
    /* when compiled in "mic-offload" mode, this memory gets allocated on host,
    when compiled in "mic-native" mode, it gets allocated on mic. */
    float (*restrict A)[size] = malloc(sizeof(float)*size*size);
    float (*restrict B)[size] = malloc(sizeof(float)*size*size);
    float (*restrict C)[size] = malloc(sizeof(float)*size*size);
    /* this first pragma is just to get the actual #threads used
    (sanity check). */
    #pragma omp parallel
    {
        nt = omp_get_num_threads();

        /* Fill the A and B arrays with dummy test data. */
        #pragma omp parallel for default(none) shared(A,B,size) private(i,j)
        for(i = 0; i < size; ++i)
        {
            for(j = 0; j < size; ++j)
            {
                A[i][j] = (float)i + j;
                B[i][j] = (float)i - j;
            }
        }
    }

    /* warm up run to overcome setup overhead in benchmark runs below. */
    myMult(size, A,B,C);
    double aveTime,minTime=1e6,maxTime=0.;
    /* run the matrix multiplication function nIter times and compute
    average runtime. */
    for(i=0; i < nIter; i++)
    {
        double startTime = dsecnd();
        myMult(size, A,B,C);
        double endTime = dsecnd();
        double runtime = endTime-startTime;
        maxTime=(maxTime > runtime)?maxTime:runtime;
        minTime=(minTime < runtime)?minTime:runtime;
        aveTime += runtime;
    }
    aveTime /= nIter;

    printf("%s nThreads %d matrix %d maxRT %g minRT %g aveRT %g GFlop/s %g\n",
    argv[0],nt,size,maxTime,minTime,aveTime, 2e-9*size*size*size/aveTime);
    free(A);
    free(B);
    free(C);
    return 0;
}

I found an interesting talk by Hatem Ltaief about performance of BLAS (MAGA library) on Xeon Phi. Here is a slide from his talk.

MAGMA

Specs, Mother-board, Air-supply, …

There is a lot of info about the mother-board support and heat dissipation on forums. For example see this thread. I have a 5110P which is passively cooled. The specs of which from the Intel page is below

   
L2 Cache 30 MB
# of Cores 60
Processor Base Frequency 1.053 GHz
Max Memory Size 8 GB
Max Memory Bandwidth 320 GB/s

XeonPhi

The card also requires sufficient air supply to make things work!. Puget-Systems has a case in which they have 3-D printed an air duct for Xeon Phi! From the thread, I decided to go for the specification suggested by Zibi Holka. Here are the specs

  1. ASUS GRYPHON (TUF) Z97 mATX motherboard! BIOS:2205 -standard BIOS
  2. Intel Core i7 i7-4790K CPU (Quad Core 4GHz, Socket H3 LGA-1150)
  3. Corsair Platinum 16GB 2x8GB Memory Kit (CMD32GX3M4A2133C9)
  4. Xeon PHI 31s1p
  5. Fedora 21
  6. Kernel (std. fedora): 3.18.7
  7. MPSS: 3.4.2 MPSS-Modules recompiled from https://github.com/xdsopl/mpss-modules

I may not get the exact CPU/Memory config and I have a 5110P instead of 31S1P. So far I have the first 4 in the list. I will update this blog as I make progress :)

FPGA or Field-Programmable Gate Array is the latest addition to crazy world of HPC. It is an an integrated circuit designed to be configured by a customer or a designer after manufacturing. The FPGA configuration is generally specified using a hardware description language (HDL). However, several vendors are now offering OpenCL based API. The idea that you have custom hardware for a specific algorithm, for example, Fast Fourier Transform or Markov Chain Monte Carlo is breathtaking.

MCMC

I am seeing more and more papers on this approach recently. For example, take this paper by Mingas & Bouganis. They report 1000X speedup for a Parallel Tempering Algorithm. They have implemented the code in VHDL on a Virtex-6 FPGA. Here is a plot from the paper

speedup

The code however is not public yet.

BLAS

Another interesting paper was on a BLAS implementation on FPGA by Kestur, Davis and Williams. Their conclusion is extremely interesting: “Results show that FPGAs offer comparable performance as well as 2.7 to 293 times better energy efficiency for the test cases that we implemented on all three platforms.”. Here is a plot from their paper

time-energy

Recently I came across this paper on Hamiltonian Monte Carlo with stochastic gradients. Since it is very relevant to my research, I thought I will have a look. Here are my thoughts and comments on the paper.

ArXiv

Hamiltonian Monte Carlo(HMC) is one of my research areas. One of the most annoying thing about the HMC is that gradients of the probability distribution is required for the sampling. A MATLAB code is available in [GitHub](https://github.com/tqchen/ML-SGHMC. Below is a plot from this paper which compare the true input distribution with the one recovered by HMC.

SGHMC

The MATLAB code is simple to understand. See the gist below

function [ newx ] = sghmc( U, gradU, m, dt, nstep, x, C, V )
%% SGHMC using gradU, for nstep, starting at position x

p = randn( size(x) ) * sqrt( m );
B = 0.5 * V * dt;
D = sqrt( 2 * (C-B) * dt );

for i = 1 : nstep
    p = p - gradU( x ) * dt  - p * C * dt  + randn(1)*D;
    x = x + p./m * dt;
end
newx = x;
end

What I would like to see is its application to large-scale problems. My concern is that the choice of parameters, for example the friction B,

can have huge effect on the performance of the sampler.

OpenACC is getting a lot of attention recently and especially in the application to scientific computing. The main reason I believe is that these are compiler directives and hence reasonably fast to learn as long as you have pieces of code that are parallelisable. I wanted to see what compilers are publicly available for OpenACC.

Compiler support

The following compilers support the OpenACC directives (as of today)

  1. PGI
  2. CAPS
  3. OpenUH
  4. accULL
  5. OpenARC

OpenUH and accULL are open source and are available for download now. On a linux platform you can get these to build and check sone cool features of OpenACC.

/*
 * Simple test
 *  allocates two vectors, fills the vectors, does axpy a couple of times,
 *  compares the results for correctness.
 *  This is the same as version s1.cpp with 'int' instead of 'double'
 */

#include <iostream>
#include <cstdio>
#include <openacc.h>

/* Depending on the GCC version you have installed, this cause warnings */
extern "C" int atoi(const char*);

static void
axpy( int* y, int* x, int a, int n ){
    #pragma acc parallel loop present(x[0:n],y[0:n])
    for( int i = 0; i < n; ++i )
        y[i] += a*x[i];
}

static void
test( int* a, int* b, int n ){
    #pragma acc data copy(a[0:n]) copyin(b[0:n])
    {
    axpy( a, b, 2, n );

    for( int i = 0; i < n; ++i ) b[i] = 2;
    #pragma acc update device(b[0:n])

    axpy( a, b, 1.0, n );
    }
}


int main( int argc, char* argv[] ){
    int n = 1000;
    if( argc > 1 ) n = atoi(argv[1]);

    int* a, *b;

    a = new int[n];
    b = new int[n];

    for( int i = 0; i < n; ++i ) a[i] = i;
    for( int i = 0; i < n; ++i ) b[i] = n-i;

    test( a, b, n );

    int sum = 0;
    for( int i = 0; i < n; ++i ) sum += a[i];
    int exp = 0;
    for( int i = 0; i < n; ++i ) exp += (int)i + 2*(int)(n-i) + 2;
    std::cout << "Checksum is " << sum << std::endl;
    if( exp != sum )
    std::cout << "Difference is " << exp - sum << std::endl;
    else
    std::cout << "PASSED" << std::endl;
    return 0;
}

The most difficult bit is to get a compiler installed. The rest is very exciting!