Code examples are in test/dataframe_tester.cc and test/dataframe_tester_2.cc. You should find at least one example for each feature.

Visitors are all defined in include/DataFrameStatsVisitors.h, include/DataFrameMLVisitors.h, include/DataFrameFinancialVisitors.h, and include/DataFrameTransformVisitors.h. Also see test/dataframe_tester.cc for example usage.
Most (but not all) visitors share some common interfaces. For example:
get_result(): Returns the result of the visitor/algo.
pre(): Called by DataFrame each time before it starts passing data to the visitor. pre() is the place to initialize the process.
post(): Called by DataFrame each time it is done passing data to the visitor.
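
The calling sequence described above can be sketched without the library. This is only an illustration: MeanSketchVisitor and run_visit are hypothetical names, and run_visit merely stands in for what DataFrame does internally when it drives a visitor. The real visitors may differ in their details.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical visitor, not part of DataFrame: computes the mean of a column.
struct MeanSketchVisitor {
    // Reset state so the same visitor object can be reused.
    void pre() { sum_ = 0.0; count_ = 0; result_ = 0.0; }

    // Called once per data point with the index value and the column value.
    void operator()(const unsigned long &, const double &val) {
        sum_ += val;
        ++count_;
    }

    // Finalize the result after the last data point has been passed.
    void post() {
        result_ = count_ ? sum_ / static_cast<double>(count_) : 0.0;
    }

    double get_result() const { return result_; }

private:
    double      sum_ { 0.0 };
    std::size_t count_ { 0 };
    double      result_ { 0.0 };
};

// Hypothetical stand-in for the sequence DataFrame follows:
// pre(), then one call per data point, then post().
template<typename V>
V &run_visit(const std::vector<unsigned long> &idx,
             const std::vector<double> &col,
             V &visitor) {
    assert(idx.size() == col.size());
    visitor.pre();
    for (std::size_t i = 0; i < col.size(); ++i)
        visitor(idx[i], col[i]);
    visitor.post();
    return visitor;
}
```

Because pre() resets the state, calling run_visit twice on the same visitor object yields the same result both times.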

Random Generators are a set of convenient routines to generate random numbers.
For the definition and defaults of RandGenParams, see this document and file DataFrameTypes.h.



Table of Features

DataFrame Methods DataFrame Types DataFrame Built-in Visitors Random Generators
append_column( 2 ) enum class concat_policy{ } struct AffinityPropVisitor{ } gen_bernoulli_dist( )
append_index( 2 ) enum class drop_policy{ } struct AutoCorrVisitor{ } gen_binomial_dist( )
bucketize( ) enum class fill_policy{ } struct BetaVisitor{ } gen_cauchy_dist( )
bucketize_async( ) enum class exponential_decay_spec{ } struct BollingerBand{ } gen_chi_squared_dist( )
concat( ) struct CorrVisitor{ } gen_exponential_dist( )
create_column( ) enum class io_format{ } struct CovVisitor{ } gen_extreme_value_dist( )
drop_missing( ) enum class join_policy{ } struct CumMaxVisitor{ } gen_fisher_f_dist( )
fill_missing( ) enum class mad_type{ } struct CumMinVisitor{ } gen_gamma_dist( )
gen_datetime_index( ) enum class nan_policy{ } struct CumProdVisitor{ } gen_geometric_dist( )
gen_sequence_index( ) enum class quantile_policy{ } struct CumSumVisitor{ } gen_lognormal_dist( )
get_col_unique_values( ) enum class random_policy{ } struct DotProdVisitor{ } gen_negative_binomial_dist( )
get_column( 2 ) enum class return_policy{ } struct DoubleCrossOver{ } gen_normal_dist( )
get_data_by_idx( 2 ) enum class shift_policy{ } struct ExpandingRollAdopter{ } gen_poisson_dist( )
get_data_by_loc( 2 ) enum class sort_spec{ } struct ExponentialRollAdopter{ } gen_student_t_dist( )
get_data_by_rand( ) enum class sort_state{ } struct GeometricMeanVisitor{ } gen_uniform_int_dist( )
get_data_by_sel( 3 ) enum class time_frequency{ } struct GroupbySum{ } gen_uniform_real_dist( )
get_index( 2 ) struct BadRange{ } (an exception) struct HarmonicMeanVisitor{ } gen_weibull_dist( )
get_memory_usage( ) struct ColNotFound{ } (an exception) struct KMeansVisitor{ }
get_reindexed( ) struct DataFrameError{ } (an exception) struct KthValueVisitor{ }
get_reindexed_view( ) struct InconsistentData{ } (an exception) struct MACDVisitor{ }
get_row( ) enum class Index2D{ } struct MADVisitor{ }
get_view_by_idx( 2 ) enum class MemUsage{ } struct MaxVisitor{ }
get_view_by_loc( 2 ) struct NotFeasible{ } (an exception) struct MeanVisitor{ }
get_view_by_rand( ) struct NotImplemented{ } (an exception) struct MedianVisitor{ }
get_view_by_sel( 3 ) enum class pattern_spec{ } struct MinVisitor{ }
groupby( ) df_plus operator struct ModeVisitor{ }
groupby_async( ) df_minus operator struct NLargestVisitor{ }
has_column( ) df_multiplies operator struct NSmallestVisitor{ }
is_equal( ) df_divides operator struct ProdVisitor{ }
join_by_column( ) struct QuantileVisitor{ }
join_by_index( ) struct ReturnVisitor{ }
load_column( 3 ) struct SampleZScoreVisitor{ }
load_data( ) struct SEMVisitor{ }
load_index( 2 ) struct SimpleRollAdopter{ }
make_consistent( ) struct SLRegressionVisitor{ }
modify_by_idx( ) struct StatsVisitor{ }
multi_visit( ) struct StdVisitor{ }
read( ) struct SumVisitor{ }
read_async( ) struct TrackingErrorVisitor{ }
remove_column( ) struct VWAPVisitor{ }
remove_data_by_idx( ) struct VWBASVisitor{ }
remove_data_by_loc( ) struct ZScoreVisitor{ }
remove_data_by_sel( 3 ) struct CategoryVisitor{ }
remove_lock( ) struct FactorizeVisitor{ }
rename_column( ) struct ClipVisitor{ }
replace( 2 ) struct SharpeRatioVisitor{ }
replace_async( 2 )
replace_index( )
rotate( )
self_bucketize( )
self_concat( )
self_rotate( )
self_shift( )
shape( )
set_lock( )
shift( )
shrink_to_fit( )
shuffle( )
single_act_visit( 2 )
single_act_visit_async( 2 )
sort( 5 )
sort_async( 5 )
transpose( )
value_counts( )
visit( 5 )
visit_async( 5 )
write( )
write_async( )
retype_column( )
load_align_column( )
get_columns_info( )
pattern_match( )


Motivation

Although Pandas has a spot-on interface and it is full of useful functionalities, it lacks performance and scalability. For example, it is hard to decipher high-frequency intraday data such as Options data or S&P500 constituents tick-by-tick data using Pandas.
Another issue I have often encountered is that research is done in Python, because it has tools such as Pandas, while execution in production is done in C++ for its efficiency, reliability, and scalability. As a result, there is a translation step, or sometimes a bridge, between research and execution.
Also, in this day and age, C++ needs a heterogeneous data container.
Mainly because of these factors, I implemented the C++ DataFrame.
I welcome all contributions from people with expertise, interest, and time to do it. I will add more functionalities from time to time, but currently my spare time is limited.

Code Sample

using namespace hmdf;
 
// Defines a DataFrame with unsigned long index type that uses std::vector
using MyDataFrame = StdDataFrame<unsigned long>;
 
MyDataFrame  df;
std::vector<int>  intvec = { 1, 2, 3, 4, 5 };
std::vector<double>  dblvec = { 1.2345, 2.2345, 3.2345, 4.2345, 5.2345 };
std::vector<double>  dblvec2 = { 0.998, 0.3456, 0.056, 0.15678, 0.00345, 0.923, 0.06743, 0.1 };
std::vector<std::string>  strvec = { "Some string", "some string 2", "some string 3", "some string 4", "some string 5" };
std::vector<unsigned long>  ulgvec = { 1UL, 2UL, 3UL, 4UL, 5UL, 8UL, 7UL, 6UL };
std::vector<unsigned long>  xulgvec = ulgvec;
 
// This is only one way of loading data into the DataFrame. There are
// many different ways of doing it. Please see DataFrame.h and
// dataframe_tester.cc
int rc = df.load_data(std::move(ulgvec),
                      std::make_pair("int_col", intvec),
                      std::make_pair("dbl_col", dblvec),
                      std::make_pair("dbl_col_2", dblvec2),
                      std::make_pair("str_col", strvec),
                      std::make_pair("ul_col", xulgvec));

// Sort the Frame by index
df.sort<MyDataFrame::IndexType, int, double, std::string>("INDEX", sort_spec::ascen);
// Sort the Frame by column “dbl_col_2”
df.sort<double, int, double, std::string>("dbl_col_2", sort_spec::desce);
 
// A functor to calculate mean, variance, skew, kurtosis, defined in 
// DataFrameStatsVisitors.h file
StatsVisitor<double>  stats_visitor;
 
// Calculate the stats on column “dbl_col”
df.visit<double>("dbl_col", stats_visitor);

//
// Example code with Views
//

std::vector<unsigned long>  idx = { 123450, 123451, 123452, 123450, 123455, 123450, 123449 };
std::vector<double>  d1 = { 1, 2, 3, 4, 5, 6, 7 };
std::vector<double>  d2 = { 8, 9, 10, 11, 12, 13, 14 };
std::vector<double>  d3 = { 15, 16, 17, 18, 19, 20, 21 };
std::vector<double>  d4 = { 22, 23, 24, 25 };
std::vector<std::string>  s1 = { "11", "22", "33", "xx", "yy", "gg", "string" };
MyDataFrame  df2;
 
df2.load_data(std::move(idx),
              std::make_pair("col_1", d1),
              std::make_pair("col_2", d2),
              std::make_pair("col_3", d3),
              std::make_pair("col_4", d4),
              std::make_pair("col_str", s1));

using MyDataFrameView = DataFrameView<unsigned long>;
 
MyDataFrameView  dfv = df2.get_view_by_loc<double, std::string>(Index2D<long> { 3, 6 });
 
dfv.get_column<double>("col_3")[0] = 88.0;
std::cout << "After changing a value on view: "
          << dfv.get_column<double>("col_3")[0]
          << " == " << df2.get_column<double>("col_3")[3]
          << std::endl;

//
// Example with multithreaded-safe code
//

const size_t    vec_size = 100000;
auto            do_work = [vec_size]() {
    MyDataFrame         df;
    std::vector<size_t> vec;

    for (size_t i = 0; i < vec_size; ++i)
        vec.push_back(i);

    df.load_data(MyDataFrame::gen_sequence_index(0, vec_size, 1),
                 std::make_pair("col1", vec));

    // This is an extremely inefficient way of doing it, especially in
    // a multithreaded program. Each “get_column” is a hash table
    // look up and in multithreaded programs requires a lock.
    // It is much more efficient to call “get_column” outside the loop
    // and loop over the referenced vector.
    // Here I am doing it this way to make sure synchronization
    // between threads is bulletproof.
    for (size_t i = 0; i < vec_size; ++i)  {
        const size_t    j = df.get_column<size_t>("col1")[i];

        assert(i == j);
    }
    df.shrink_to_fit();
};

SpinLock                    lock;
std::vector<std::thread>    thr_vec;

// Use this lock to protect internal DataFrame static members
MyDataFrame::set_lock(&lock);
for (size_t i = 0; i < 20; ++i)
    thr_vec.push_back(std::thread(do_work));
for (size_t i = 0; i < 20; ++i)
    thr_vec[i].join();
MyDataFrame::remove_lock();

Views

Views were added in the second wave. They are a very useful concept with practical use-cases. A view is a slice of a DataFrame that refers to the original DataFrame. It looks exactly like a DataFrame, but if you modify any data in the view, the corresponding data point(s) in the original DataFrame will also be modified.
There are certain things you cannot do in views. For example, you cannot add or delete columns, extend the index column, ...
For more understanding, look at this document further and/or the test files.
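
The reference semantics of a view can be illustrated without the library. The sketch below is not how DataFrame implements views. ColumnViewSketch and make_view_sketch are hypothetical names, and the half-open [begin, end) range is an assumption made only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical view type: holds references to elements of another vector,
// so it owns no data of its own.
using ColumnViewSketch = std::vector<std::reference_wrapper<double>>;

// Build a view over column[begin, end). The view must not outlive the
// column, and the column must not be resized while the view is in use.
inline ColumnViewSketch
make_view_sketch(std::vector<double> &column,
                 std::size_t begin,
                 std::size_t end) {
    assert(begin <= end && end <= column.size());
    return ColumnViewSketch(
        column.begin() + static_cast<std::ptrdiff_t>(begin),
        column.begin() + static_cast<std::ptrdiff_t>(end));
}
```

Given a view built with begin = 3, writing `view[0].get() = 88.0` changes element 3 of the underlying vector, which mirrors the behavior described for DataFrame views. Like the restriction on views above, the sketch offers no way to grow or shrink the underlying column.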

Visitors

Visitors are the mechanism by which DataFrame passes data points to an algorithm, and most of DataFrame's statistical algorithms are implemented as visitors. You can add your own algorithms as visitor functors and extend DataFrame easily. There are two kinds of visitation mechanisms in DataFrame:

  1. Regular visit (visit()). In this case DataFrame passes the given column(s) data points one-by-one to the visitor functor. This is convenient for algorithms that can operate one data point at a time (e.g. correlation, variance).
  2. Single-action visit (single_act_visit()). In this case a reference to each given column is passed to the visitor functor at once. This is necessary for algorithms that need the whole data together (e.g. return, median).
See this document, DataFrameStatsVisitors.h, DataFrameMLVisitors.h, DataFrameFinancialVisitors.h, DataFrameTransformVisitors.h, and test/dataframe_tester.cc for more examples and documentation.
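
The difference between the two visitation styles can be sketched with two library-independent functors. SumSketch and MedianSketch are hypothetical names. The signatures are deliberately simplified and are not the library's visitor interface; for instance, the index argument is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Regular-visit style: state is updated one data point at a time.
struct SumSketch {
    void operator()(const double &val) { result += val; }
    double result { 0.0 };
};

// Single-action style: the whole column arrives in one call, because a
// median cannot be computed from a single running value.
struct MedianSketch {
    void operator()(const std::vector<double> &column) {
        assert(!column.empty());

        // Work on a copy so the caller's data stays untouched.
        std::vector<double>     tmp(column);
        const std::ptrdiff_t    mid =
            static_cast<std::ptrdiff_t>(tmp.size() / 2);

        std::nth_element(tmp.begin(), tmp.begin() + mid, tmp.end());
        result = tmp[static_cast<std::size_t>(mid)];
        if (tmp.size() % 2 == 0) {
            // Even count: average the two middle values. After
            // nth_element, the lower middle is the largest element
            // in the first half.
            const double lower =
                *std::max_element(tmp.begin(), tmp.begin() + mid);
            result = (lower + result) / 2.0;
        }
    }
    double result { 0.0 };
};
```

A caller would invoke SumSketch once per element in a loop, but would invoke MedianSketch exactly once with the full vector.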

Random Generators

Random generators were added as a series of convenient stand-alone functions to generate random numbers (they cover all C++ standard distributions). You can seamlessly use these routines to generate random DataFrame columns.
See this document, RandGen.h, and dataframe_tester.cc.
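
As a rough idea of what such a routine does, the sketch below builds a column of normally distributed values using only the C++ standard library. gen_normal_sketch is a hypothetical helper, not the library's gen_normal_dist(). Its parameter list, including the explicit seed, is an assumption made for illustration and does not reproduce RandGenParams.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: returns n draws from a normal distribution with
// the given mean and standard deviation, using a caller-supplied seed.
inline std::vector<double>
gen_normal_sketch(std::size_t n,
                  double mean,
                  double std_dev,
                  unsigned int seed) {
    assert(std_dev > 0.0);  // required by std::normal_distribution

    std::mt19937                        gen(seed);
    std::normal_distribution<double>    dist(mean, std_dev);
    std::vector<double>                 result;

    result.reserve(n);
    for (std::size_t i = 0; i < n; ++i)
        result.push_back(dist(gen));
    return result;
}
```

The returned vector has the same shape as the column vectors in the code sample, so it could be handed to a loading call such as load_data() as one column.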

Code Structure

The DataFrame library is “almost” a header-only library. The exceptions are a few boilerplate source files: HeteroVector.cc, HeteroView.cc, and a few others. There is also DateTime.cc.

Starting from the root directory:
The include directory contains most of the code. It includes .h and .tcc files. The latter are C++ template code files (they are mostly located in the Internals subdirectory). The main header file is DataFrame.h. It contains the DataFrame class and its interface. There are comprehensive comments for each public interface call in that file. The rest of the files there will show you how the sausage is made.
The include directory also contains subdirectories that hold mostly internal DataFrame implementation.
One exception is DateTime.h, which is located in the Utils subdirectory.

The src directory contains Linux-only make files and a few subdirectories that contain various source files.

The test directory contains all the test source files, mocked data files, and test output files. The main test source file is dataframe_tester.cc. It contains test cases for all functionalities of DataFrame. It is not very well organized yet; I plan to make the test cases more organized.

Build Instructions

Using plain make and make-files
Go to the root of the repository, where the license file is, and execute build_all.sh. This will build the library and test executables for Linux flavors.
Using cmake
Please see the README file. Thanks to @justinjk007, you should be able to build this on Linux, Windows, Mac, and more.