Pandas Usage

Overview

The main goal of datacompy is to provide a human-readable output describing differences between two dataframes. For example, if you have two dataframes containing data like:

df1

acct_id      dollar_amt  name              float_fld   date_fld
10000001234  123.45      George Maharis    14530.1555  2017-01-01
10000001235  0.45        Michael Bluth     1           2017-01-01
10000001236  1345        George Bluth                  2017-01-01
10000001237  123456      Bob Loblaw        345.12      2017-01-01
10000001238  1.05        Lucille Bluth                 2017-01-01
10000001238  1.05        Loose Seal Bluth              2017-01-01

df2

acct_id      dollar_amt  name                  float_fld
10000001234  123.4       George Michael Bluth  14530.155
10000001235  0.45        Michael Bluth
10000001236  1345        George Bluth          1
10000001237  123456      Robert Loblaw         345.12
10000001238  1.05        Loose Seal Bluth      111

Set up like:

from io import StringIO
import pandas as pd
import datacompy

data1 = """acct_id,dollar_amt,name,float_fld,date_fld
10000001234,123.45,George Maharis,14530.1555,2017-01-01
10000001235,0.45,Michael Bluth,1,2017-01-01
10000001236,1345,George Bluth,,2017-01-01
10000001237,123456,Bob Loblaw,345.12,2017-01-01
10000001238,1.05,Lucille Bluth,,2017-01-01
10000001238,1.05,Loose Seal Bluth,,2017-01-01
"""

data2 = """acct_id,dollar_amt,name,float_fld
10000001234,123.4,George Michael Bluth,14530.155
10000001235,0.45,Michael Bluth,
10000001236,1345,George Bluth,1
10000001237,123456,Robert Loblaw,345.12
10000001238,1.05,Loose Seal Bluth,111
"""

df1 = pd.read_csv(StringIO(data1))
df2 = pd.read_csv(StringIO(data2))

Compare Object Setup

There are currently two supported methods for joining your dataframes: by join column(s) or by index.

compare = datacompy.Compare(
    df1,
    df2,
    join_columns='acct_id',  # You can also specify a list of columns
    abs_tol=0.0001,
    rel_tol=0,
    df1_name='original',
    df2_name='new')

# OR

compare = datacompy.Compare(df1, df2, join_columns=['acct_id', 'name'])

# OR

compare = datacompy.Compare(df1, df2, on_index=True)
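
Once the Compare object is built, a quick pass/fail check is available via compare.matches(), which should return True only when the two dataframes fully match within the configured tolerances. A minimal sketch, continuing from the setup above:

compare = datacompy.Compare(df1, df2, join_columns='acct_id')

# True only if the joined rows and all compared columns match
if compare.matches():
    print('DataFrames match within the configured tolerances')
else:
    print('DataFrames do not match; see compare.report() for details')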

Reports

A report is generated by calling Compare.report(), which returns a string. Note that if you don’t specify df1_name and/or df2_name, any instance of “original” or “new” in the report is replaced with “df1” and/or “df2”.
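
Continuing from the setup above, the report string can simply be printed, or written out to a file (the output file name below is only illustrative):

print(compare.report())

# Or persist the report for later review
with open('comparison_report.txt', 'w') as report_file:
    report_file.write(compare.report())

Here is a sample report generated by datacompy for the two tables above, joined on acct_id: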

DataComPy Comparison
--------------------

DataFrame Summary
-----------------

  DataFrame  Columns  Rows
0  original        5     6
1       new        4     5

Column Summary
--------------

Number of columns in common: 4
Number of columns in original but not in new: 1
Number of columns in new but not in original: 0

Row Summary
-----------

Matched on: acct_id
Any duplicates on match values: Yes
Absolute Tolerance: 0.0001
Relative Tolerance: 0
Number of rows in common: 5
Number of rows in original but not in new: 1
Number of rows in new but not in original: 0

Number of rows with some compared columns unequal: 5
Number of rows with all compared columns equal: 0

Column Comparison
-----------------

Number of columns compared with some values unequal: 3
Number of columns compared with all values equal: 1
Total number of values which compare unequal: 7

Columns with Unequal Values or Types
------------------------------------

       Column original dtype new dtype  # Unequal  Max Diff  # Null Diff
0  dollar_amt        float64   float64          1    0.0500            0
1   float_fld        float64   float64          4    0.0005            3
2        name         object    object          2    0.0000            0

Sample Rows with Unequal Values
-------------------------------

       acct_id  dollar_amt (original)  dollar_amt (new)
0  10000001234                 123.45             123.4

       acct_id  float_fld (original)  float_fld (new)
0  10000001234            14530.1555        14530.155
5  10000001238                   NaN          111.000
2  10000001236                   NaN            1.000
1  10000001235                1.0000              NaN

       acct_id name (original)            name (new)
0  10000001234  George Maharis  George Michael Bluth
3  10000001237      Bob Loblaw         Robert Loblaw

Sample Rows Only in original (First 10 Columns)
-----------------------------------------------

       acct_id  dollar_amt           name  float_fld    date_fld
4  10000001238        1.05  Lucille Bluth        NaN  2017-01-01

Convenience Methods

There are a few convenience methods available after the comparison has been run:

print(compare.intersect_rows[['name_df1', 'name_df2', 'name_match']])
#            name_df1              name_df2  name_match
# 0    George Maharis  George Michael Bluth       False
# 1     Michael Bluth         Michael Bluth        True
# 2      George Bluth          George Bluth        True
# 3        Bob Loblaw         Robert Loblaw       False
# 5  Loose Seal Bluth      Loose Seal Bluth        True

print(compare.df1_unq_rows)
#        acct_id  dollar_amt           name  float_fld    date_fld
# 4  10000001238        1.05  Lucille Bluth        NaN  2017-01-01

print(compare.df2_unq_rows)
# Empty DataFrame
# Columns: [acct_id, dollar_amt, name, float_fld]
# Index: []

print(compare.intersect_columns())
# {'float_fld', 'acct_id', 'name', 'dollar_amt'}

print(compare.df1_unq_columns())
# {'date_fld'}

print(compare.df2_unq_columns())
# set()
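
If you need every mismatching row rather than just the samples shown in the report, compare.all_mismatch() returns a dataframe of the rows with at least one unequal compared value, showing the df1 and df2 versions of the mismatching columns side by side (availability and exact output may vary by datacompy version):

# All rows where at least one compared column is unequal
mismatches = compare.all_mismatch()
print(mismatches)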

Limitations

There are a number of limitations with datacompy:

  1. The dataframes you’re comparing have to fit in memory. Unlike SAS PROC COMPARE, which can operate on datasets stored on disk, this could be a constraint if you’re working with very large dataframes.

  2. If you only need to check whether two dataframes are exactly the same, you should look at the testing utilities within pandas and NumPy:

    from pandas.testing import assert_series_equal
    from pandas.testing import assert_frame_equal
    import numpy.testing as npt

    # Compare two series
    assert_series_equal(df1['some_field'], df2['some_field'])

    # Compare two dataframes
    assert_frame_equal(df1, df2)

    # Numpy testing
    npt.assert_array_equal(arr1, arr2)
    npt.assert_almost_equal(obj1, obj2)