In the method descriptions, “I” stands for the index type and “T” or “Ts” stands for the data type(s).
All visitors are defined in the files include/DataFrameStatsVisitors.h, include/DataFrameMLVisitors.h and include/DataFrameFinancialVisitors.h. Also see test/dataframe_tester.cc for example usage.
Most (but not all) visitors share a few common interfaces:
get_result() It returns the result of the visitor/algo.
pre() It is called by DataFrame each time before it starts passing data to the visitor. pre() is the place to initialize the process.
post() It is called by DataFrame each time it is done passing data to the visitor.
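The call sequence described above can be illustrated with a minimal, stand-alone sketch. It does not use the DataFrame headers. The names MeanSketchVisitor and run_visitor are hypothetical, and the operator() signature (index value, data value) is an assumption; only pre(), post() and get_result() come from the text above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical visitor that computes the mean of one column.
// pre(), post() and get_result() follow the text above; the operator()
// signature is an assumption made for this sketch.
template<typename T, typename I = unsigned long>
struct MeanSketchVisitor  {

    using result_type = T;

    // Reset all state so the same visitor object can be reused.
    void pre()  { sum_ = T(); count_ = 0; result_ = T(); }

    // Called once per data point with its index value and data value.
    void operator()(const I &, const T &value)  {
        sum_ += value;
        ++count_;
    }

    // Finalize the result after the last data point has been passed.
    void post()  {
        if (count_ > 0)  result_ = sum_ / static_cast<T>(count_);
    }

    const result_type &get_result() const  { return result_; }

private:
    T           sum_ { };
    std::size_t count_ { 0 };
    result_type result_ { };
};

// Stand-in for what the container does: pre(), one call per row, post().
template<typename I, typename T, typename V>
V &run_visitor(const std::vector<I> &index,
               const std::vector<T> &column,
               V &visitor)  {
    assert(index.size() == column.size());
    visitor.pre();
    for (std::size_t i = 0; i < column.size(); ++i)
        visitor(index[i], column[i]);
    visitor.post();
    return visitor;
}
```

Because pre() resets the state, running the same visitor object twice over the same data gives the same result both times.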
Random generators are a set of convenient routines to generate random numbers.
For the definition and defaults of RandGenParams, see this document and file DataFrameTypes.h
These are the arithmetic operators currently declared in include/DataFrame.h. Because they all have to be templated, they cannot be defined as overloads of the built-in operators.
template<typename DF, typename ... Ts>
inline DF df_plus(const DF &lhs, const DF &rhs);
template<typename DF, typename ... Ts>
inline DF df_minus(const DF &lhs, const DF &rhs);
template<typename DF, typename ... Ts>
inline DF df_multiplies(const DF &lhs, const DF &rhs);
template<typename DF, typename ... Ts>
inline DF df_divides(const DF &lhs, const DF &rhs);
These arithmetic operations operate on the columns of lhs and rhs that have the same name and the same type. An operation is applied to a pair of entries only if they have the same index value.
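The index-matching rule can be illustrated on a single column with a stand-alone sketch. IndexedColumn and aligned_op are hypothetical names, not part of the library. The sketch also assumes that index values are unique and that rows without a matching index on the other side are left out of the result; the text above does not say what the library does with such rows.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <unordered_map>
#include <vector>

// One column paired with its index; a stand-in, not the DataFrame type.
template<typename I, typename T>
struct IndexedColumn  {
    std::vector<I> index;
    std::vector<T> data;
};

// Apply op to each pair of entries that share the same index value.
// Assumption: rows of lhs with no match in rhs are left out of the result.
template<typename I, typename T, typename Op>
IndexedColumn<I, T>
aligned_op(const IndexedColumn<I, T> &lhs,
           const IndexedColumn<I, T> &rhs,
           Op op)  {
    assert(lhs.index.size() == lhs.data.size());
    assert(rhs.index.size() == rhs.data.size());

    // Map each rhs index value to its row position.
    // Assumption: index values are unique (first occurrence is kept).
    std::unordered_map<I, std::size_t> rhs_pos;
    for (std::size_t i = 0; i < rhs.index.size(); ++i)
        rhs_pos.emplace(rhs.index[i], i);

    IndexedColumn<I, T> result;
    for (std::size_t i = 0; i < lhs.index.size(); ++i)  {
        const auto citer = rhs_pos.find(lhs.index[i]);
        if (citer != rhs_pos.end())  {
            result.index.push_back(lhs.index[i]);
            result.data.push_back(op(lhs.data[i], rhs.data[citer->second]));
        }
    }
    return result;
}
```

Passing std::plus, std::minus, std::multiplies or std::divides as op gives the single-column analogue of the four functions declared above.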
Although Pandas has a spot-on interface and is full of useful functionality, it lacks performance and scalability. For example, it is hard to analyze high-frequency intraday data, such as Options data or S&P500 constituents tick-by-tick data, using Pandas.
Another issue I have often encountered is that research is done in Python, because it has tools such as Pandas, while execution in production is done in C++ for its efficiency, reliability and scalability. As a result, there is always a translation, or sometimes a bridge, between research and execution.
Also, in this day and age, C++ needs a heterogeneous data container.
Mainly because of these factors, I implemented the C++ DataFrame.
I welcome all contributions from people with expertise, interest, and time to do it. I will add more functionalities from time to time, but currently my spare time is limited.
Views were added in the second wave. They are a very useful concept with practical use-cases. A view is a slice of a DataFrame that is a reference to the original DataFrame. It appears exactly the same as a DataFrame, but if you modify any data in the view, the corresponding data point(s) in the original DataFrame will also be modified.
There are certain things you cannot do in views. For example, you cannot add or delete columns, extend the index column, and so on.
For more understanding, look at this document further and/or the test files.
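The reference semantics of a view can be illustrated with a stand-alone sketch over a single column. ColumnViewSketch is a hypothetical class written for this illustration; it is not the library's view type, and the library's actual implementation may work differently.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical read/write view over a contiguous slice of one column.
// It stores references, not copies, so writes reach the original data.
template<typename T>
class ColumnViewSketch  {
public:
    // View rows [begin, end) of the given column.
    ColumnViewSketch(std::vector<T> &column,
                     std::size_t begin,
                     std::size_t end)  {
        assert(begin <= end && end <= column.size());
        for (std::size_t i = begin; i < end; ++i)
            refs_.push_back(std::ref(column[i]));
    }

    std::size_t size() const  { return refs_.size(); }

    // Element access goes through to the original column.
    T &operator[](std::size_t i)  { return refs_[i].get(); }
    const T &operator[](std::size_t i) const  { return refs_[i].get(); }

    // Deliberately no push_back() or erase(): the view cannot change the
    // shape of the data it refers to.

private:
    std::vector<std::reference_wrapper<T>> refs_;
};
```

Writing through the view changes the original column, and changes to the original column are visible through the view. In this sketch the view becomes invalid if the original vector reallocates, which is one reason a view should not outlive or resize the data it refers to.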
Visitors are the mechanism for running statistical algorithms. Most of DataFrame's statistical algorithms are implemented as “visitors”. A visitor is the mechanism by which DataFrame passes data points to your algorithm. You can put your own algorithm in a visitor functor and extend DataFrame easily. There are two kinds of visitation mechanisms in DataFrame:
Random generators were added as a series of convenient stand-alone functions to generate random numbers (they cover all C++ standard distributions). You can seamlessly use these routines to generate random DataFrame columns.
See this document and file RandGen.h and dataframe_tester.cc.
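As a rough illustration of the idea, a stand-alone function that fills a column with normally distributed values could look like the sketch below. It uses only the C++ standard library. NormalParamsSketch and gen_normal_column are hypothetical names; the actual functions in RandGen.h and the fields of RandGenParams may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical parameter bundle; the library's RandGenParams may differ.
struct NormalParamsSketch  {
    unsigned int seed { 0 };
    double       mean { 0.0 };
    double       std_dev { 1.0 };
};

// Generate n normally distributed values, ready to be used as a column.
inline std::vector<double>
gen_normal_column(std::size_t n,
                  const NormalParamsSketch &params = NormalParamsSketch())  {
    assert(params.std_dev > 0.0);

    std::mt19937                     engine(params.seed);
    std::normal_distribution<double> dist(params.mean, params.std_dev);

    std::vector<double> column;
    column.reserve(n);
    for (std::size_t i = 0; i < n; ++i)
        column.push_back(dist(engine));
    return column;
}
```

The sketch uses a fixed default seed so that repeated calls are reproducible; a real generator would more likely seed from std::random_device unless the caller supplies a seed.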
The DataFrame library is “almost” a header-only library. The exceptions are a few boilerplate source files, such as HeteroVector.cc, HeteroView.cc and a few others, plus DateTime.cc.
Starting from the root directory:
include directory contains most of the code. It includes .h and .tcc files. The latter are C++ template code files (they are mostly located in the Internals subdirectory). The main header file is DataFrame.h. It contains the DataFrame class and its interface. There are comprehensive comments for each public interface call in that file. The rest of the files there will show you how the sausage is made.
The include directory also contains subdirectories that hold mostly internal DataFrame implementation.
One exception: DateTime.h is located in the Utils subdirectory.
src directory contains Linux-only make files and a few subdirectories that contain various source files.
test directory contains all the test source files, mocked data files, and test output files. The main test source file is dataframe_tester.cc. It contains test cases for all functionalities of DataFrame. It is not in a very organized structure. I plan to make the test cases more organized.
Using plain make and make-files
Go to the root of the repository, where the license file is, and execute build_all.sh. This will build the library and the test executables for Linux flavors.
Using cmake
Please see the README file. Thanks to @justinjk007, you should be able to build this on Linux, Windows, Mac, and more.