File Comparator
1. Description
This binary compares two files (primary and secondary) and reports differences: missing rows, mismatches, and invalid/duplicate keys. It supports native `.cf` files, plain text files, and Parquet inputs. The tool can compare files with the same structure or perform mapped comparisons when the two sources have different schemas.
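The kinds of differences reported can be illustrated with a minimal Python sketch. It is not the tool's implementation: `compare_keyed` is a hypothetical helper, and it assumes both inputs have already been loaded into `{key: record}` dictionaries with the same column names.

```python
def compare_keyed(primary, secondary):
    """Compare two {key: {column: value}} mappings (hypothetical helper).

    Returns keys missing on each side and, for shared keys, the columns
    whose values differ.
    """
    missing_in_primary = sorted(k for k in secondary if k not in primary)
    missing_in_secondary = sorted(k for k in primary if k not in secondary)
    mismatches = {}
    for key in sorted(primary.keys() & secondary.keys()):
        p, s = primary[key], secondary[key]
        # Only columns present on both sides are compared in this sketch.
        diff = [col for col in p if col in s and p[col] != s[col]]
        if diff:
            mismatches[key] = diff
    return missing_in_primary, missing_in_secondary, mismatches


primary = {"A1": {"balance": "100", "ccy": "USD"}, "A2": {"balance": "5", "ccy": "EUR"}}
secondary = {"A1": {"balance": "101", "ccy": "USD"}, "A3": {"balance": "7", "ccy": "GBP"}}
print(compare_keyed(primary, secondary))
```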
2. Supported input types and combinations
- `.cf`: dynamic (`.cf`) and static (`.cf` + `.peo`) variants.
- `.txt` (delimited text).
- `.pq` (Parquet files or directories containing Parquet data).
Allowed comparison combinations:
- cf vs cf (dynamic or static) — supported.
- cf vs txt — supported for dynamic .cf and txt.
- txt vs txt — supported without metadata.
- parquet vs parquet — supported.
- parquet vs cf / parquet vs txt — supported.
Notes:
- For comparisons involving `.txt`, all fields are treated as strings.
- When using V5 metadata resolution, the tool can load `.cf` metadata from a directory of JSON config files (see metadata options below).
3. How to run
Required parameters:
- `--primary-file`: Path to the primary input file.
- `--secondary-file`: Path to the secondary input file.
- `--output-directory`: Directory where comparison outputs will be written.
Key/value mapping parameters (one of these forms is required):
Same-structure comparisons:
- `--keys`: Comma-separated list of key columns present in both sources.
- Optional `--values`: Comma-separated list of value columns to compare (if not given, all columns are compared).
Mapped comparisons (different schemas):
- `--primary-keys` and `--secondary-keys`: Comma-separated key columns for primary and secondary respectively (must have same length and order).
- `--primary-values` and `--secondary-values`: Comma-separated value columns for primary and secondary (must have same length and order).
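The "same length and order" rule for mapped comparisons can be sketched in Python as follows. `pair_columns` is a hypothetical helper, not part of the tool; it only shows how two comma-separated arguments would line up position by position.

```python
def pair_columns(primary_arg, secondary_arg, what="keys"):
    """Pair comma-separated column lists by position (hypothetical helper)."""
    primary_cols = [c.strip() for c in primary_arg.split(",")]
    secondary_cols = [c.strip() for c in secondary_arg.split(",")]
    if len(primary_cols) != len(secondary_cols):
        raise ValueError(
            f"primary and secondary {what} must have the same length: "
            f"{len(primary_cols)} vs {len(secondary_cols)}"
        )
    # Position i in the primary list maps to position i in the secondary list.
    return list(zip(primary_cols, secondary_cols))


print(pair_columns("id,region", "acct_id,area"))
```

Because the pairing is positional, swapping the order on one side silently changes which columns are compared against each other.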
Other options:
- `--client-mode`: Optional client mode label such as `V4` or `V5`. When `V5`, the program can auto-resolve `.cf` metadata from a config directory.
- `--primary-metadata-path`: Explicit path to a metadata JSON for primary `.cf` input.
- `--secondary-metadata-path`: Explicit path to a metadata JSON for secondary `.cf` input.
- `--metadata-config-directory`: Directory of JSON metadata configs used to resolve `.cf` metadata by input file name (used when `--client-mode V5`).
- `--tolerance-value`: Numeric tolerance used for numeric comparisons (defaults to exact match if not provided).
- `--header-count`: Number of header rows to skip in text files (default `0`).
- `--delimiter`: Delimiter for text files. Accepts a single character, `TAB`, or `\t`. Default is `|`.
- `--rows-to-check`: Optional limit on how many rows are read from the primary input (useful for quick checks).
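To make the text-file options concrete, here is a small Python sketch of how `--delimiter`, `--header-count`, and `--rows-to-check` might interact. The function names are hypothetical, the split is a plain string split with no quoting rules, and applying the row limit after skipping headers is an assumption.

```python
def normalize_delimiter(value="|"):
    """Map the accepted spellings of a delimiter to one character."""
    if value in ("TAB", "\\t"):  # the literal word TAB or a backslash followed by t
        return "\t"
    if len(value) != 1:
        raise ValueError(f"delimiter must be a single character, TAB, or \\t: {value!r}")
    return value


def read_rows(lines, delimiter="|", header_count=0, rows_to_check=None):
    """Split text lines into fields, skipping headers (hypothetical helper)."""
    delim = normalize_delimiter(delimiter)
    data = lines[header_count:]          # drop the header rows
    if rows_to_check is not None:
        data = data[:rows_to_check]      # assumed: limit applies after headers
    return [line.rstrip("\n").split(delim) for line in data]


print(read_rows(["id,name\n", "1,a\n", "2,b\n"], delimiter=",", header_count=1))
```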
Behavior notes:
- If using `--keys`/`--values` (shared columns), these names/positions are applied to both inputs.
- If the program cannot resolve keys or value mappings, it will fail with a helpful message.
- Duplicate keys are treated as invalid rows and the run will fail (non-zero exit) if duplicates are present — invalid-key reports are still written to the output directory.
- For a dynamic `.cf` file, column positions can be given for comparison instead of metadata field names.
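The invalid-key handling described above can be sketched in Python. This is an illustration under assumptions, not the tool's logic: `index_by_key` is a hypothetical name, key positions are zero-based here, and the first occurrence of a key is kept as valid while later occurrences are reported as duplicates.

```python
def index_by_key(rows, key_positions):
    """Split rows into valid keyed records and invalid-key report lines."""
    valid = {}
    invalid = []
    for row in rows:
        key = tuple(row[i] for i in key_positions)
        display = ",".join(row)
        if any(part == "" for part in key):
            invalid.append(f"Empty key|{display}")
        elif key in valid:
            # Assumed policy: the first row wins, repeats are flagged.
            invalid.append(f"Duplicate key|{display}")
        else:
            valid[key] = row
    return valid, invalid


rows = [["1", "a"], ["1", "b"], ["", "c"], ["2", "d"]]
valid, invalid = index_by_key(rows, [0])
print(valid)
print(invalid)
# A caller mirroring the documented behavior would write `invalid` to the
# report file and then exit non-zero if any duplicate was found.
```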
4. Metadata handling for .cf inputs
- You can provide explicit metadata files with `--primary-metadata-path` or `--secondary-metadata-path` (JSON describing fields and types).
- If `--client-mode V5` and `--metadata-config-directory` are provided, the program will try to resolve metadata automatically:
  - It searches JSON files in the directory for an `input` field matching the full path or filename of the `.cf` being compared.
  - If a single match is found, that metadata is used.
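The automatic resolution step can be approximated with the Python sketch below. It assumes each config is a JSON object with a top-level string field `input`; the function name `resolve_metadata` is hypothetical, and returning `None` for zero or several matches is a placeholder, since the behavior in those cases is not described here.

```python
import json
from pathlib import Path


def resolve_metadata(config_dir, cf_path):
    """Return the single config whose `input` names the .cf file, else None."""
    cf = Path(cf_path)
    matches = []
    for config_file in sorted(Path(config_dir).glob("*.json")):
        try:
            config = json.loads(config_file.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            continue  # unreadable or malformed configs are skipped in this sketch
        declared = config.get("input") if isinstance(config, dict) else None
        # Match on either the full path or just the file name.
        if declared in (str(cf), cf.name):
            matches.append(config_file)
    return matches[0] if len(matches) == 1 else None
```

For example, `resolve_metadata("./configs", "/data/trades.cf")` would return the one config file whose `input` is `/data/trades.cf` or `trades.cf`.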
5. Output files and meaning
All outputs are written into the directory provided by --output-directory.
- `missing_in_primary.txt`: rows present in the secondary file but not in primary. Each line is a display string for the missing record.
- `missing_in_secondary.txt`: rows present in the primary file but not in secondary. Each line is a display string for the missing record.
- `invalid_keys_primary.txt`: invalid or duplicate primary rows. Each line is of the form `REASON|RECORD_DISPLAY` (e.g. `Empty key|...` or `Duplicate key|...`).
- `invalid_keys_secondary.txt`: same as above for secondary.
- `mismatch.txt`: list of mismatched rows with a header row. There are two header variants depending on whether a TXT input is involved:
- When a TXT input is involved, the header is:
key|mismatch_columns|primary_record|secondary_record
- `key`: the configured key values.
- `mismatch_columns`: a bracketed list of the column labels that differ, e.g. `[colA, colB]`.
- `primary_record` and `secondary_record`: the entire record from each file, displayed with a comma as the delimiter.
- When no TXT input is involved (both sources are structured, e.g. `.cf`), the header is:
key|mismatch_columns|primary_record|secondary_record|missing_in_primary_cashflow|missing_in_secondary_cashflow
The extra two fields contain nested cashflow differences for cashflow-type columns (if any). Each is an encoded list like `[(interest,principal,date), ...]`. Empty if not applicable.
- `summary.txt`: summary file that includes:
- Count in primary: total rows read from primary.
- Count in secondary: total rows read from secondary.
- Valid rows primary: total number of valid records in primary file.
- Valid rows secondary: total number of valid records in secondary file.
- Invalid rows primary: total number of invalid records in primary file.
- Invalid rows secondary: total number of invalid records in secondary file.
- Mismatch count: total number of records that have mismatched values.
- Missing in primary count: total number of records missing in primary file.
- Missing in secondary count: total number of records missing in secondary file.
- Distinct mismatch columns: [col1, col2, ...] — list of columns that had mismatches.
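For downstream processing, a line of `mismatch.txt` can be read back with a short Python sketch like the one below. It works from the header row, so it handles both header variants, but it assumes that no field contains a literal `|`; `parse_mismatch_line` is a hypothetical helper and the sample line is made up.

```python
def parse_mismatch_line(header, line):
    """Turn one mismatch.txt line into a dict keyed by the header names."""
    names = header.rstrip("\n").split("|")
    values = line.rstrip("\n").split("|", len(names) - 1)
    if len(values) != len(names):
        raise ValueError("field count does not match header")
    row = dict(zip(names, values))
    # mismatch_columns is a bracketed list such as "[colA, colB]".
    cols = row["mismatch_columns"].strip()
    inner = cols[1:-1] if cols.startswith("[") and cols.endswith("]") else cols
    row["mismatch_columns"] = [c.strip() for c in inner.split(",") if c.strip()]
    return row


header = "key|mismatch_columns|primary_record|secondary_record"
line = "A1|[balance, ccy]|A1,100,USD|A1,101,EUR"
print(parse_mismatch_line(header, line))
```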
6. Common usage examples
- Basic same-structure compare (keys by name):
--primary-file primary.cf
--secondary-file secondary.cf
--keys accountId
--output-directory ./out
- Text file compare with header and comma delimiter:
--primary-file primary.txt
--secondary-file secondary.txt
--keys 1,2
--delimiter ,
--header-count 1
--output-directory ./out
- Mapped compare when columns differ:
--primary-file a.cf
--secondary-file b.cf
--primary-keys id
--secondary-keys acct_id
--primary-values balance
--secondary-values bal
--output-directory ./out
- Use a tolerance for numeric comparison (example: 0.01):
--primary-file a.cf
--secondary-file b.cf
--keys id
--tolerance-value 0.01
--output-directory ./out
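What a tolerance means for a single pair of values can be sketched in Python. This assumes an absolute, inclusive tolerance (`|a - b| <= tolerance`) and a fallback to exact comparison for non-numeric values; whether the tool uses these exact semantics is not stated here, and `values_match` is a hypothetical name.

```python
from decimal import Decimal, InvalidOperation


def values_match(primary_value, secondary_value, tolerance=None):
    """Compare two values, numerically within `tolerance` when possible."""
    if tolerance is None:
        return primary_value == secondary_value  # exact match by default
    try:
        # Decimal avoids binary floating-point surprises such as
        # 100.01 - 100.00 being slightly larger than 0.01.
        diff = abs(Decimal(str(primary_value)) - Decimal(str(secondary_value)))
        return diff <= Decimal(str(tolerance))
    except InvalidOperation:
        # Non-numeric values fall back to exact comparison.
        return primary_value == secondary_value


print(values_match("100.00", "100.01", 0.01))
print(values_match("100.00", "100.02", 0.01))
```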