Code examples are in test/dataframe_tester.cc and test/dataframe_tester_2.cc. You should find at least one example for each feature.
Visitors are all defined in the files include/DataFrameStatsVisitors.h, include/DataFrameMLVisitors.h, include/DataFrameFinancialVisitors.h, and include/DataFrameTransformVisitors.h. Also see test/dataframe_tester.cc for example usage.
Most visitors share some common interfaces. For example, the following are common to most (but not all) visitors:
get_result(): Returns the result of the visitor/algorithm.
pre(): Called by DataFrame each time before it starts passing data to the visitor. pre() is the place to initialize the process.
post(): Called by DataFrame each time it is done passing data to the visitor.
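The calling sequence above can be sketched without the library. This is a minimal illustration, not one of the library's visitors: MeanSketchVisitor and drive_visitor are hypothetical names, and the (index, value) signature of operator() is an assumption made for the example. The real visitor requirements are in the visitor header files listed above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical visitor, written only to illustrate the
// pre() / operator() / post() / get_result() protocol.
struct MeanSketchVisitor {
    using result_type = double;

    // Reset all state before a new pass over the data.
    void pre() { sum_ = 0.0; count_ = 0; result_ = 0.0; }

    // Receive one data point per call (the index value is ignored here).
    void operator()(const unsigned long &, const double &val) {
        sum_ += val;
        ++count_;
    }

    // Finalize the result after the last data point.
    void post() {
        result_ = (count_ > 0) ? sum_ / static_cast<double>(count_) : 0.0;
    }

    const result_type &get_result() const { return result_; }

private:
    double sum_ { 0.0 };
    std::size_t count_ { 0 };
    result_type result_ { 0.0 };
};

// Stand-in for what a DataFrame does when it visits a column
// (assumption: pre(), then one call per data point, then post()).
template <typename V>
void drive_visitor(const std::vector<unsigned long> &index,
                   const std::vector<double> &column,
                   V &visitor) {
    assert(index.size() == column.size());
    visitor.pre();
    for (std::size_t i = 0; i < column.size(); ++i)
        visitor(index[i], column[i]);
    visitor.post();
}
```

Because pre() resets the state, the same visitor object can be driven over several columns in turn, and get_result() always reflects the most recent pass.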
Random generators are a set of convenient routines to generate random numbers.
For the definition and defaults of RandGenParams, see this document and file DataFrameTypes.h.
Although Pandas has a spot-on interface and it is full of useful functionalities, it lacks performance and scalability. For example, it is hard to decipher high-frequency intraday data such as Options data or S&P500 constituents tick-by-tick data using Pandas.
Another issue I have often encountered is that research is done in Python, because it has tools such as Pandas, but execution in production is in C++ for its efficiency, reliability, and scalability. Therefore, there is a translation step, or sometimes a bridge, between research and execution.
Also, in this day and age, C++ needs a heterogeneous data container.
Mainly because of these factors, I implemented the C++ DataFrame.
I welcome contributions from anyone with the expertise, interest, and time. I will add more functionality from time to time, but currently my spare time is limited.
// Requires DataFrame.h and DataFrameStatsVisitors.h (see the include
// directory), plus <cassert>, <iostream>, <string>, <thread>, and <vector>.
using namespace hmdf;

// Defines a DataFrame with unsigned long index type that uses std::vector
using MyDataFrame = StdDataFrame<unsigned long>;

MyDataFrame df;
std::vector<int> intvec = { 1, 2, 3, 4, 5 };
std::vector<double> dblvec = { 1.2345, 2.2345, 3.2345, 4.2345, 5.2345 };
std::vector<double> dblvec2 =
    { 0.998, 0.3456, 0.056, 0.15678, 0.00345, 0.923, 0.06743, 0.1 };
std::vector<std::string> strvec =
    { "Some string", "some string 2", "some string 3",
      "some string 4", "some string 5" };
std::vector<unsigned long> ulgvec = { 1UL, 2UL, 3UL, 4UL, 5UL, 8UL, 7UL, 6UL };
std::vector<unsigned long> xulgvec = ulgvec;

// This is only one way of loading data into the DataFrame. There are
// many different ways of doing it. Please see DataFrame.h and
// dataframe_tester.cc
int rc = df.load_data(std::move(ulgvec),
                      std::make_pair("int_col", intvec),
                      std::make_pair("dbl_col", dblvec),
                      std::make_pair("dbl_col_2", dblvec2),
                      std::make_pair("str_col", strvec),
                      std::make_pair("ul_col", xulgvec));

// Sort the Frame by index
df.sort<MyDataFrame::IndexType, int, double, std::string>("INDEX", sort_spec::ascen);
// Sort the Frame by column "dbl_col_2"
df.sort<double, int, double, std::string>("dbl_col_2", sort_spec::desce);

// A functor to calculate mean, variance, skew, kurtosis, defined in
// DataFrameStatsVisitors.h file
StatsVisitor<double> stats_visitor;
// Calculate the stats on column "dbl_col"
df.visit<double>("dbl_col", stats_visitor);

//
// Example code with Views
//
std::vector<unsigned long> idx =
    { 123450, 123451, 123452, 123450, 123455, 123450, 123449 };
std::vector<double> d1 = { 1, 2, 3, 4, 5, 6, 7 };
std::vector<double> d2 = { 8, 9, 10, 11, 12, 13, 14 };
std::vector<double> d3 = { 15, 16, 17, 18, 19, 20, 21 };
std::vector<double> d4 = { 22, 23, 24, 25 };
std::vector<std::string> s1 = { "11", "22", "33", "xx", "yy", "gg", "string" };
MyDataFrame df2;

df2.load_data(std::move(idx),
              std::make_pair("col_1", d1),
              std::make_pair("col_2", d2),
              std::make_pair("col_3", d3),
              std::make_pair("col_4", d4),
              std::make_pair("col_str", s1));

using MyDataFrameView = DataFrameView<unsigned long>;

MyDataFrameView dfv =
    df2.get_view_by_loc<double, std::string>(Index2D<long> { 3, 6 });

dfv.get_column<double>("col_3")[0] = 88.0;
// The view refers to df2, so the change is visible in df2 as well
std::cout << "After changing a value on view: "
          << dfv.get_column<double>("col_3")[0]
          << " == " << df2.get_column<double>("col_3")[3]
          << std::endl;

//
// Example with multithreaded-safe code
//
const size_t vec_size = 100000;

auto do_work = [vec_size]() {
    MyDataFrame df;
    std::vector<size_t> vec;

    for (size_t i = 0; i < vec_size; ++i)
        vec.push_back(i);
    df.load_data(MyDataFrame::gen_sequence_index(0, vec_size, 1),
                 std::make_pair("col1", vec));

    // This is an extremely inefficient way of doing it, especially in
    // a multithreaded program. Each "get_column" is a hash table
    // look up and in multithreaded programs requires a lock.
    // It is much more efficient to call "get_column" outside the loop
    // and loop over the referenced vector.
    // Here I am doing it this way to make sure synchronization
    // between threads is bulletproof.
    for (size_t i = 0; i < vec_size; ++i) {
        const size_t j = df.get_column<size_t>("col1")[i];

        assert(i == j);
    }
    df.shrink_to_fit();
};

SpinLock lock;
std::vector<std::thread> thr_vec;

// Use this lock to protect internal DataFrame static members
MyDataFrame::set_lock(&lock);

for (size_t i = 0; i < 20; ++i)
    thr_vec.push_back(std::thread(do_work));
for (size_t i = 0; i < 20; ++i)
    thr_vec[i].join();

MyDataFrame::remove_lock();
Views were added in the second wave. They are a very useful concept with practical use-cases. A view is a slice of a DataFrame that is a reference to the original DataFrame. It appears exactly the same as a DataFrame, but if you modify any data in the view, the corresponding data point(s) in the original DataFrame will also be modified.
There are certain things you cannot do in views. For example, you cannot add or delete columns, extend the index column, and so on.
For more understanding, look at this document further and/or the test files.
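The reference semantics of a view can be illustrated with plain standard-library types. This is only a sketch of the idea, not how DataFrameView is implemented: make_view_sketch is a hypothetical helper, and the half-open range [begin, end) is an assumption chosen to mirror the Index2D<long> { 3, 6 } call in the example code.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-in for a view over one column: it stores references
// to elements [begin, end) of the original vector instead of copies.
// The references stay valid only while the original vector is not resized.
std::vector<std::reference_wrapper<double>>
make_view_sketch(std::vector<double> &column,
                 std::size_t begin,
                 std::size_t end) {
    assert(begin <= end);

    std::vector<std::reference_wrapper<double>> view;

    for (std::size_t i = begin; i < end && i < column.size(); ++i)
        view.push_back(std::ref(column[i]));
    return view;
}
```

Writing through view[0].get() changes the element at position begin in the original vector, which is the behavior the view example in the code above prints. The sketch also hints at why a view cannot add or delete columns or extend the index: it owns no storage of its own.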
Visitors are the mechanism for running statistical algorithms; most of DataFrame's statistical algorithms are implemented as visitors. A visitor is the means by which DataFrame passes data points to your algorithm. You can put your own algorithms in a visitor functor and extend DataFrame easily. There are two kinds of visitation mechanisms in DataFrame:
Random generators were added as a series of convenient stand-alone functions to generate random numbers (they cover all C++ standard distributions). You can seamlessly use these routines to generate random DataFrame columns.
See this document and file RandGen.h and dataframe_tester.cc.
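As a rough illustration of what generating a random column involves, the sketch below uses only the C++ standard <random> facilities. It does not use the functions in RandGen.h: make_uniform_column and its parameters are hypothetical names chosen for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper (not the RandGen.h API): returns n doubles drawn
// from a uniform distribution over [lo, hi), using a fixed seed so the
// same arguments always produce the same column.
std::vector<double> make_uniform_column(std::size_t n,
                                        double lo,
                                        double hi,
                                        unsigned int seed) {
    assert(lo < hi);

    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> dist(lo, hi);
    std::vector<double> col;

    col.reserve(n);
    for (std::size_t i = 0; i < n; ++i)
        col.push_back(dist(gen));
    return col;
}
```

A vector produced this way could then be passed to load_data() with std::make_pair, in the same way as the hand-written vectors in the example code.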
The DataFrame library is “almost” a header-only library. The exceptions are a few boilerplate source files, such as HeteroVector.cc and HeteroView.cc, plus DateTime.cc.
Starting from the root directory:
include directory contains most of the code. It includes .h and .tcc files. The latter are C++ template code files (they are mostly located in the Internals subdirectory). The main header file is DataFrame.h. It contains the DataFrame class and its interface. There are comprehensive comments for each public interface call in that file. The rest of the files there will show you how the sausage is made.
The include directory also contains subdirectories that hold mostly internal DataFrame implementation.
One exception is DateTime.h, which is located in the Utils subdirectory.
src directory contains Linux-only make files and a few subdirectories that contain various source files.
test directory contains all the test source files, mocked data files, and test output files. The main test source file is dataframe_tester.cc. It contains test cases for all functionalities of DataFrame. It is not in a very organized structure. I plan to make the test cases more organized.
Using plain make and make-files
Go to the root of the repository, where the license file is, and execute build_all.sh. This will build the library and test executables for Linux flavors.
Using cmake
Please see the README file. Thanks to @justinjk007, you should be able to build this on Linux, Windows, Mac, and more.