datacompy.core module

Compare two Pandas DataFrames

Originally this package was meant to provide similar functionality to PROC COMPARE in SAS - i.e. human-readable reporting on the difference between two dataframes.

class datacompy.core.Compare(df1, df2, join_columns=None, on_index=False, abs_tol=0, rel_tol=0, df1_name='df1', df2_name='df2')

Bases: object

Comparison class used to check whether two dataframes are equal.

Both df1 and df2 should be dataframes containing all of the join_columns, with unique column names. Differences between values are compared to abs_tol + rel_tol * abs(df2['value']).

Parameters:

df1 : pandas DataFrame

First dataframe to check

df2 : pandas DataFrame

Second dataframe to check

join_columns : list or str, optional

Column(s) to join dataframes on. If a string is passed in, that one column will be used.

on_index : bool, optional

If True, the index will be used to join the two dataframes. If both join_columns and on_index are provided, an exception will be raised.

abs_tol : float, optional

Absolute tolerance between two values.

rel_tol : float, optional

Relative tolerance between two values.

df1_name : str, optional

A string name for the first dataframe. This allows the reporting to print out an actual name instead of “df1”, and allows human users to more easily track the dataframes.

df2_name : str, optional

A string name for the second dataframe

Attributes

df1_unq_rows : pandas DataFrame

All records that are only in df1 (based on a join on join_columns)

df2_unq_rows : pandas DataFrame

All records that are only in df2 (based on a join on join_columns)
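
A minimal usage sketch (the dataframe contents, column names, and tolerances below are illustrative, not part of the API); the method examples later in this section reuse this compare object:

    import pandas as pd
    import datacompy

    df1 = pd.DataFrame({"id": [1, 2, 3], "value": [1.0, 2.0, 3.0]})
    df2 = pd.DataFrame({"id": [1, 2, 4], "value": [1.0, 2.1, 4.0]})

    compare = datacompy.Compare(
        df1,
        df2,
        join_columns="id",   # or a list, e.g. ["id", "date"]
        abs_tol=0.01,        # absolute tolerance for numeric comparisons
        rel_tol=0,           # relative tolerance for numeric comparisons
        df1_name="original",
        df2_name="new",
    )
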
all_columns_match()

Whether the columns all match in the dataframes

all_rows_overlap()

Whether the rows are all present in both dataframes

Returns:

bool

True if all rows in df1 are in df2 and vice versa (based only on the existence of the join keys, not on whether values match)

count_matching_rows()

Count the number of rows that match (on overlapping fields)

Returns:

int

Number of matching rows
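
Continuing the illustrative compare object constructed above, these checks give quick pass/fail summaries:

    # True only when df1 and df2 have exactly the same set of columns
    compare.all_columns_match()

    # True only when every join key in df1 appears in df2 and vice versa
    compare.all_rows_overlap()

    # Number of joined rows whose shared, non-join columns all match
    # within the configured tolerances
    compare.count_matching_rows()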

df1
df1_unq_columns()

Get columns that are unique to df1

df2
df2_unq_columns()

Get columns that are unique to df2

intersect_columns()

Get columns that are shared between the two dataframes

intersect_rows_match()

Check whether the rows in the intersection of the two dataframes all match
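
A short sketch of the column- and row-introspection helpers, again using the illustrative compare object from above:

    only_in_df1 = compare.df1_unq_columns()    # columns present only in df1
    only_in_df2 = compare.df2_unq_columns()    # columns present only in df2
    shared_cols = compare.intersect_columns()  # columns present in both

    # True when every shared column matches for all rows in the intersection
    compare.intersect_rows_match()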

matches(ignore_extra_columns=False)

Return True if the dataframes match, False otherwise.

Parameters:

ignore_extra_columns : bool

If True, ignores any columns that are in one dataframe but not in the other.
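
For example, to treat the comparison as a pass even when one dataframe carries extra columns (reusing the illustrative compare object):

    if compare.matches(ignore_extra_columns=True):
        print("DataFrames match within the configured tolerances")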

report(sample_count=10)

Returns a string representation of a report. The representation can then be printed or saved to a file.

Parameters:

sample_count : int, optional

The number of sample records to return. Defaults to 10.

Returns:

str

The report, formatted for readability.
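
A typical pattern is to print the report or persist it to disk (the file name here is illustrative):

    report_text = compare.report(sample_count=10)
    print(report_text)

    with open("comparison_report.txt", "w") as report_file:
        report_file.write(report_text)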

sample_mismatch(column, sample_count=10, for_display=False)

Returns a sample sub-dataframe which contains the identifying columns and the df1 and df2 versions of the column.

Parameters:

column : str

The raw column name (i.e. without _df1 appended)

sample_count : int, optional

The number of sample records to return. Defaults to 10.

for_display : bool, optional

Whether the sample is intended only for display (if True, the column names are overwritten with display-friendly names)

Returns:

Pandas.DataFrame

A sample of the intersection dataframe, containing only the “pertinent” columns, for rows that don’t match on the provided column.
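
For example, to inspect a handful of rows where the illustrative "value" column disagrees between the two dataframes:

    mismatches = compare.sample_mismatch("value", sample_count=5, for_display=True)
    print(mismatches)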

subset()

Return True if dataframe 2 is a subset of dataframe 1.

Dataframe 2 is considered a subset if all of its columns are in dataframe 1, and all of its rows match rows in dataframe 1 for the shared columns.
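
For example (with the illustrative frames above this returns False, since df2 contains a row that df1 does not):

    compare.subset()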

datacompy.core.columns_equal(col_1, col_2, rel_tol=0, abs_tol=0)

Compares two columns from a dataframe, returning a True/False series, with the same index as column 1.

  • Two nulls (np.nan) will evaluate to True.
  • A null and a non-null value will evaluate to False.
  • Numeric values will use the relative and absolute tolerances.
  • Decimal values (decimal.Decimal) will be converted to floats, where possible, before comparing.
  • Non-numeric values (i.e. where np.isclose can’t be used) will only evaluate to True on two nulls or an exact match.
Parameters:

col_1 : Pandas.Series

The first column to look at

col_2 : Pandas.Series

The second column

rel_tol : float, optional

Relative tolerance

abs_tol : float, optional

Absolute tolerance

Returns:

pandas.Series

A series of Boolean values. True == the values match, False == the values don’t match.
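
A minimal sketch of this module-level helper on plain Series (the values and tolerance are illustrative):

    import numpy as np
    import pandas as pd
    from datacompy.core import columns_equal

    col_1 = pd.Series([1.0, 2.0, np.nan])
    col_2 = pd.Series([1.001, 2.5, np.nan])

    # Expected: [True, False, True] -- 1.0 vs 1.001 is within abs_tol,
    # 2.0 vs 2.5 is not, and the pair of nulls compares as True.
    columns_equal(col_1, col_2, rel_tol=0, abs_tol=0.01)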

datacompy.core.compare_string_and_date_columns(col_1, col_2)

Compare a string column and date column, value-wise. This tries to convert a string column to a date column and compare that way.

Parameters:

col_1 : Pandas.Series

The first column to look at

col_2 : Pandas.Series

The second column

Returns:

pandas.Series

A series of Boolean values. True == the values match, False == the values don’t match.
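
A sketch, assuming one column holds date strings and the other holds actual datetimes (values are illustrative):

    import pandas as pd
    from datacompy.core import compare_string_and_date_columns

    as_strings = pd.Series(["2019-01-01", "2019-06-30"])
    as_dates = pd.to_datetime(pd.Series(["2019-01-01", "2019-07-01"]))

    # Expected: True for the first element (same date once converted),
    # False for the second (different dates).
    compare_string_and_date_columns(as_strings, as_dates)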

datacompy.core.get_merged_columns(original_df, merged_df, suffix)

Gets the columns of the original dataframe as they appear in the merged dataframe (with suffixes applied where names overlapped).

Parameters:

original_df : Pandas.DataFrame

The original, pre-merge dataframe

merged_df : Pandas.DataFrame

The post-merge dataframe, with suffixes added to overlapping column names.

suffix : str

The suffix that was used to distinguish columns of the original dataframe that overlapped with columns of the other dataframe in the merge.
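
A sketch of how the suffix resolution works, using a pandas merge with the "_df1"/"_df2" suffixes that the library's reporting uses (dataframe contents are illustrative):

    import pandas as pd
    from datacompy.core import get_merged_columns

    original = pd.DataFrame({"id": [1], "value": [10]})
    other = pd.DataFrame({"id": [1], "value": [11]})
    merged = original.merge(other, on="id", suffixes=("_df1", "_df2"))

    # Expected: ["id", "value_df1"] -- "id" kept its name, while the
    # overlapping "value" column picked up the "_df1" suffix.
    get_merged_columns(original, merged, suffix="_df1")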

datacompy.core.render(filename, *fields)

Renders out an individual template. This reads in a template file and applies .format() with the given fields.

Parameters:

filename : str

The file that contains the template. The templates directory is automatically prepended before opening.

fields : list

Fields to be rendered out in the template

Returns:

str

The fully rendered template.

datacompy.core.temp_column_name(*dataframes)

Gets a temporary column name that is not present in the columns of any of the given dataframes

Parameters:

dataframes : list of Pandas.DataFrame

The DataFrames to create a temporary column name for

Returns:

str

String column name that looks like ‘_temp_x’ for some integer x
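
A sketch (dataframe contents are illustrative):

    import pandas as pd
    from datacompy.core import temp_column_name

    df_a = pd.DataFrame({"_temp_0": [1]})
    df_b = pd.DataFrame({"x": [2]})

    # Expected: "_temp_1", since "_temp_0" is already taken by df_a.
    temp_column_name(df_a, df_b)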