tts_data_utils.core

Submodules

tts_data_utils.core.container_history

class tts_data_utils.core.container_history.DataContainerHistoryContainer(name, metadata)

Bases: DataContainer

A special container that is added by default to all data containers (except history containers to avoid infinite recursion).

As containers are filtered, appended, and sliced, it can become difficult to trace how a particular container came to be. The DataContainerHistoryContainer tracks the actions that have been taken to build a container, starting from its initial metadata and including each manipulation that happens going forward.

Data container history is not meant to make previous states of the container reproducible, but just to offer a crutch in debugging code.

The number of records is also provided to enable some level of data analysis based on filtering values.

Parameters:

name (str) – The name of this instantiation of the data container.
metadata (dict) – Dictionary of open-ended metadata that can be provided in extensions of DataContainer

DATA_ITEM_CLS: alias of DataContainerHistoryItem

NAME = 'data container history'

property repr_cols: Columns to be used for representations (terminal, HTML, etc.).

class tts_data_utils.core.container_history.DataContainerHistoryItem(source, subcontainers=None, copy_data=False, cast_fields=False, fill=False, is_django=False, validate=True, default_dispo=None)

Bases: DataItem

DataItem to go with DataContainerHistoryContainer

Parameters:

Action (str) – Which action was taken at the step represented in this row?
Description (str) – Description of the action taken at the step represented in this row
Ending Count – Number of records after the action was taken
Starting Count – Number of records before the action was taken
Percent Remaining – Percentage of records at this step relative to previous step

DICT_VALID_KEYS = [('Action', <class 'str'>), ('Description', <class 'str'>), ('Ending Count', (<class 'int'>, <class 'float'>, <class 'str'>)), ('Starting Count', (<class 'int'>, <class 'float'>, <class 'str'>)), ('Percent Remaining', (<class 'int'>, <class 'float'>, <class 'str'>))]
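To make the relationship between the three count columns concrete, here is a minimal sketch of how one history row could be assembled. The helper name history_row is hypothetical, and the 0 to 100 percentage scale and one-decimal rounding are assumptions; only the dictionary keys come from DICT_VALID_KEYS above.

```python
def history_row(action, description, starting_count, ending_count):
    """Build one history record using the keys listed in DICT_VALID_KEYS.

    Hypothetical helper: the real library may compute these values differently.
    """
    # Avoid dividing by zero when the starting container is empty.
    if starting_count:
        percent = 100.0 * ending_count / starting_count
    else:
        percent = 0.0
    return {
        "Action": action,
        "Description": description,
        "Ending Count": ending_count,
        "Starting Count": starting_count,
        "Percent Remaining": round(percent, 1),
    }


row = history_row("eq", "Kept rows where status == 'OK'", 200, 50)
print(row["Percent Remaining"])
```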

time(): Must define some way to time-tag each data item.

tts_data_utils.core.data_container

class tts_data_utils.core.data_container.DataContainer(raw_data=None, subcontainers=None, csv_path=None, xlsx_path=None, django_records=None, metadata=None, name=None, cast_fields=False, validate=True, lorem=None, **kwargs)

Bases: ABC

Primary (abstract) class for this library. Provides representation of 2D data with extension hooks for easy definition of quality-of-life features for any bespoke data type across projects.

Concept: Allows for easy tabular representation in terminals and HTML, playing nicely with html_utils to provide easy reporting of tabular data and nested tabular data.

When defining an extension of this class, a DataItem class is also provided, which controls the expected columns in each row.

Each row of the 2D data is represented by an instance of the associated DataItem class, stored in self.records. Most dunder methods have been defined such that this class behaves like a list (mapping to self.records), but carries the container’s metadata and history along with it.
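The list-like behavior described above can be pictured with a small stand-in class. This is only a sketch: MiniContainer is hypothetical and is not the real DataContainer, which also tracks history and validates rows against its DataItem class.

```python
class MiniContainer:
    """Hypothetical stand-in that forwards list behavior to self.records."""

    def __init__(self, records, metadata=None, name=None):
        self.records = list(records)
        self.metadata = metadata or {}
        self.name = name

    def __len__(self):
        return len(self.records)

    def __iter__(self):
        return iter(self.records)

    def __getitem__(self, index):
        if isinstance(index, slice):
            # A slice returns a new container so the metadata travels with the rows.
            return MiniContainer(self.records[index], self.metadata, self.name)
        return self.records[index]


box = MiniContainer([{"id": 1}, {"id": 2}, {"id": 3}], metadata={"source": "demo"})
head = box[:2]
```

Here head is still a MiniContainer carrying the same metadata, while box[0] returns a single row.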

TO DO: Provide gallery of examples of outputs (see ticket #34 TO DO: Migrate out of JPL-internal issues)

Parameters:

raw_data (list[dict], optional) – 2D data to be transformed into DataContainer.
subcontainers (list[dict[str, DataContainer]], optional) – List of dictionaries where key is a label and value is a DataContainer. Must match length of raw_data.
csv_path (Path | str, optional) – Path for CSV to be transformed into DataContainer.
xlsx_path (Path | str, optional) – Path for XLSX to be transformed into DataContainer.
django_records (QuerySet, optional) – Django object containing data to be transformed.
metadata (dict, optional) – Arbitrary user information to be carried with the container.
name (str, optional) – Name of the DataContainer instance.
cast_fields (bool) – If True, attempts to force data into types defined in DataItem.
validate (bool) – If True, validates inputs against DataItem’s valid keys/types.
lorem (int, optional) – If provided as an integer, generates that many rows of dummy data.

DATA_ITEM_CLS = None: Associated DataItem that must be defined alongside a DataContainer.

DO_NOT_DIFF = []: Keys to ignore when running self.diff.

abstract class property NAME: Name of the data type being contained, e.g. ‘evr’, ‘transpire_commands’.

after(time, time_label=None, inclusive=False, minimum=None, maximum=None, exactly=None)

Return a decimated version of this DataContainer containing only the rows whose time column occurs after the value in the “time” parameter.

Unlike most other filter methods, time MUST be a datetime.

Note that “key” is not required since DataItems have default time columns. If an object has multiple time columns (or if using something like GenericContainer with no default time label), use the time_label kwarg to select one.

Parameters:

time (datetime) – Time to compare against
time_label (str) – Name of time column to use if not the default
inclusive (bool) – If a row’s time matches “time” exactly, should it be included?
minimum (int) – Minimum number of records to return. Will raise an exception if too few records match
maximum (int) – Maximum number of records to return. Will raise an exception if too many records match
exactly (int) – Exact number of records to return. Will raise an exception if any other number of records match

Returns:

A new DataContainer identical to this one, but with updated history and filtered outputs (except if exactly=1, in which case a single DataItem is returned).

Return type:

DataContainer or DataItem
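The minimum, maximum, and exactly arguments recur on every filter method, so a sketch of their behavior may help. The function below works on plain dictionaries rather than DataItems, and the choice of ValueError is an assumption: the documentation only says that an exception is raised.

```python
from datetime import datetime


def filter_after(records, time, time_label="time", inclusive=False,
                 minimum=None, maximum=None, exactly=None):
    """Hypothetical stand-in for DataContainer.after, using plain dicts."""
    if not isinstance(time, datetime):
        raise TypeError("time must be a datetime")
    kept = [
        r for r in records
        if r[time_label] > time or (inclusive and r[time_label] == time)
    ]
    count = len(kept)
    # Count guards: fail loudly when the result size is not what the caller expects.
    if minimum is not None and count < minimum:
        raise ValueError(f"expected at least {minimum} records, found {count}")
    if maximum is not None and count > maximum:
        raise ValueError(f"expected at most {maximum} records, found {count}")
    if exactly is not None and count != exactly:
        raise ValueError(f"expected exactly {exactly} records, found {count}")
    # With exactly=1 the single matching row is returned instead of a list.
    return kept[0] if exactly == 1 else kept


rows = [
    {"time": datetime(2024, 1, 1), "v": 1},
    {"time": datetime(2024, 1, 2), "v": 2},
    {"time": datetime(2024, 1, 3), "v": 3},
]
cutoff = datetime(2024, 1, 2)
late = filter_after(rows, cutoff)
```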

append(items, cast_fields=False, fill=False)

Adds one or more items to the end of the container’s records.

Parameters:

items – A dictionary, DataItem, or a list of either to append.
cast_fields – If True, attempts to force data into types defined in DataItem.
fill – If True, fills in missing keys with default values.

assert_records_match_hash(expected_hash): Validates the integrity of the records against a known hash.

before(time, time_label=None, inclusive=False, minimum=None, maximum=None, exactly=None)

Return a decimated version of this DataContainer containing only the rows whose time column occurs before the value in the “time” parameter.

Unlike most other filter methods, time MUST be a datetime.

Note that “key” is not required since DataItems have default time columns. If an object has multiple time columns (or if using something like GenericContainer with no default time label), use the time_label kwarg to select one.

Parameters:

time (datetime) – Time to compare against
time_label (str) – Name of time column to use if not the default
inclusive (bool) – If a row’s time matches “time” exactly, should it be included?
minimum (int) – Minimum number of records to return. Will raise an exception if too few records match
maximum (int) – Maximum number of records to return. Will raise an exception if too many records match
exactly (int) – Exact number of records to return. Will raise an exception if any other number of records match

Returns:

A new DataContainer identical to this one, but with updated history and filtered outputs (except if exactly=1, in which case a single DataItem is returned).

Return type:

DataContainer or DataItem

between(key, lower, upper, inclusive='both', minimum=None, maximum=None, exactly=None)

Return a decimated version of this DataContainer containing only the rows whose “key” column falls between the values in the “lower” and “upper” parameters.

Unlike most other filter methods, “lower” and “upper” MUST be datetimes.

Note that “key” is required on this method, unlike the before and after methods. This is an error by the developer and is slated to be fixed at the next major release, since it will be a breaking change: issue #32 (TO DO: Migrate out of JPL-internal issues)

Parameters:

key (str) – Name of time column to use
lower (datetime) – Lower bound to compare against
upper (datetime) – Upper bound to compare against
inclusive (str) – One of ‘both’, ‘neither’, ‘lower’, or ‘upper’. Controls whether a row whose time exactly matches a bound is included
minimum (int) – Minimum number of records to return. Will raise an exception if too few records match
maximum (int) – Maximum number of records to return. Will raise an exception if too many records match
exactly (int) – Exact number of records to return. Will raise an exception if any other number of records match

Returns:

A new DataContainer identical to this one, but with updated history and filtered outputs (except if exactly=1, in which case a single DataItem is returned).

Return type:

DataContainer or DataItem

calculate_records_hash()

Generates a SHA256 hash representing the current state of all records.

The Process: Normalizes timestamps based on DataItem time formats to ensure consistent string representation before hashing the JSON-encoded record set.
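The process described above might look roughly like the following. This is a sketch under assumptions: the function name records_hash, the single time_format argument, and the use of sort_keys are all hypothetical, and the real method takes its formats from the DataItem class.

```python
import hashlib
import json
from datetime import datetime


def records_hash(records, time_format="%Y-%m-%dT%H:%M:%S"):
    """Hash a list of dict records after normalizing datetime values to strings."""
    normalized = []
    for record in records:
        normalized.append({
            key: value.strftime(time_format) if isinstance(value, datetime) else value
            for key, value in record.items()
        })
    # sort_keys gives a stable JSON encoding regardless of key insertion order.
    payload = json.dumps(normalized, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


baseline = [{"t": datetime(2024, 1, 1, 12, 0, 0), "v": 1}]
digest = records_hash(baseline)
```

A digest computed this way can be stored alongside a vetted baseline and compared later, which is the role assert_records_match_hash plays in the library.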

compare_rows(l, r)

Calculates the similarity between two DataItems by counting matching values.

Concept: This is used by the visual diff engine to determine if two rows are similar enough to be considered a ‘replacement’ rather than an ‘insertion’ and ‘deletion’. It iterates through keys in the left item and checks for equality in the right item.

Parameters:

l (DataItem) – The left DataItem.
r (DataItem) – The right DataItem.

Returns:

Integer count of identical fields.

Return type:

int
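As an illustration of the counting described above, here is a minimal sketch that uses plain dictionaries in place of DataItems. The function name count_matching_fields is hypothetical.

```python
def count_matching_fields(left, right):
    """Count keys in left whose values are equal in right."""
    missing = object()  # unique marker so a missing key never equals a real value
    return sum(
        1 for key, value in left.items()
        if right.get(key, missing) == value
    )


score = count_matching_fields(
    {"id": 7, "mode": "SAFE", "note": None},
    {"id": 7, "mode": "NOMINAL"},
)
```

Only "id" matches here: "mode" differs, and "note" is absent on the right, so it is not counted even though its value on the left is None.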

contains(key, substring, case_sensitive=True, minimum=None, maximum=None, exactly=None)

Return a decimated version of this DataContainer containing only the rows where the “key” column contains the value in the “substring” parameter as a substring.

Parameters:

key (str) – Name of column to filter on
substring (str) – Value to compare against
case_sensitive (bool) – Should the substring match be case sensitive?
minimum (int) – Minimum number of records to return. Will raise an exception if too few records match
maximum (int) – Maximum number of records to return. Will raise an exception if too many records match
exactly (int) – Exact number of records to return. Will raise an exception if any other number of records match

Returns:

A new DataContainer identical to this one, but with updated history and filtered outputs (except if exactly=1, in which case a single DataItem is returned).

Return type:

DataContainer or DataItem

property default_html_row_style: Returns default CSS dictionary for HTML rows.

property default_time_label: Returns the primary key used for time-based operations.

diff(left='48vf34VD)$', right='48vf34VD)$', name='', ancestors='', diff_container=None, summarize=False, debug=False, do_not_diff_keys=[], ignore=[], float_tol=1e-10)

Generates a DiffContainer with a comprehensive comparison between two objects. Recursively descends through all attributes until the structures are fully diffed.

The Concept: This method is the backbone of the library’s regression testing suite. It is designed to compare a runtime DataContainer against a “vetted” baseline (typically from a CSV). It identifies missing keys, mismatched values, and type discrepancies across nested lists and dictionaries.

Handling Differently Ordered Data: Note that this method does not yet handle reordered containers gracefully; it is optimized for structures that are expected to be very similar in sequence.

The Null Guard: The default value ‘48vf34VD)$’ is used instead of None to allow None to be passed as a valid value to be diffed without triggering the “missing argument” logic.

Parameters:

left – The primary value or container to compare.
right – The second value or container to compare. If omitted, self is treated as left and the first argument is treated as right.
name – Internal tracker for the current field name (used in recursion).
ancestors – Internal tracker for the breadcrumb path (used in recursion).
diff_container – The accumulator for diff results.
summarize – If True, returns a boolean (True if all match) instead of the container.
do_not_diff_keys – Keys to skip (useful for history or dynamic IDs).
ignore – Output paths to prune from the final results.
float_tol – Maximum allowance for floating-point precision drift.

Returns:

A DiffContainer object or a boolean result.
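The null guard and the argument shifting described for left and right can be sketched as follows. The helper values_to_compare and its current argument (standing in for self) are hypothetical; only the sentinel string comes from the signature above.

```python
_NOT_PROVIDED = "48vf34VD)$"  # sentinel string taken from the diff signature


def values_to_compare(current, left=_NOT_PROVIDED, right=_NOT_PROVIDED):
    """Decide which two values are being diffed.

    Hypothetical helper: `current` plays the role of self in the real method.
    """
    if right == _NOT_PROVIDED:
        # Only one argument was passed, so it is treated as the right-hand side.
        return current, left
    return left, right


one_arg = values_to_compare("self", None)
two_args = values_to_compare("self", 1, None)
```

Because the guard is not None, passing None explicitly is still recognized as a real value to diff.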

docx_table(template=None)

Produces a Microsoft Word table representation.

Parameters:

template – Path to an optional template docx for styling.

Returns:

Rendered DocxTable object.

doesnotcontain(key, substring, case_sensitive=True, minimum=None, maximum=None, exactly=None)

Return a decimated version of this DataContainer containing only the rows where the “key” column does not contain the value in the “substring” parameter as a substring.

Parameters:

key (str) – Name of column to filter on
substring (str) – Value to compare against
case_sensitive (bool) – Should the substring match be case sensitive?
minimum (int) – Minimum number of records to return. Will raise an exception if too few records match
maximum (int) – Maximum number of records to return. Will raise an exception if too many records match
exactly (int) – Exact number of records to return. Will raise an exception if any other number of records match

Returns:

A new DataContainer identical to this one, but with updated history and filtered outputs (except if exactly=1, in which case a single DataItem is returned).

Return type:

DataContainer or DataItem

doesnotmatch(key, pattern, minimum=None, maximum=None, exactly=None)

Return a decimated version of this DataContainer containing only the rows where the “key” column does not match the regex in the “pattern” parameter.

Parameters:

key (str) – Name of column to not match against
pattern (r-string) – Regex pattern
minimum (int) – Minimum number of records to return. Will raise an exception if too few records match
maximum (int) – Maximum number of records to return. Will raise an exception if too many records match
exactly (int) – Exact number of records to return. Will raise an exception if any other number of records match

Returns:

A new DataContainer identical to this one, but with updated history and filtered outputs (except if exactly=1, in which case a single DataItem is returned).

Return type:

DataContainer or DataItem

eq(key, value, minimum=None, maximum=None, exactly=None, tolerance=0)

Return a decimated version of this DataContainer containing only the rows where the “key” column equals the value in the “value” parameter.

Parameters:

key (str) – Name of column to filter on
value (Varies depending on contents of "key" column) – Value to compare against
minimum (int) – Minimum number of records to return. Will raise an exception if too few records match
maximum (int) – Maximum number of records to return. Will raise an exception if too many records match
exactly (int) – Exact number of records to return. Will raise an exception if any other number of records match

Returns:

A new DataContainer identical to this one, but with updated history and filtered outputs (except if exactly=1, in which case a single DataItem is returned).

Return type:

DataContainer or DataItem

file_contents_as_string(*args, **kwargs): Placeholder for breaking tests. One day at a time here…

gt(key, value, minimum=None, maximum=None, exactly=None)

Return a decimated version of this DataContainer containing only the rows where the “key” column is greater than the value in the “value” parameter.

Parameters:

key (str) – Name of column to filter on
value (Varies depending on contents of "key" column) – Value to compare against
minimum (int) – Minimum number of records to return. Will raise an exception if too few records match
maximum (int) – Maximum number of records to return. Will raise an exception if too many records match
exactly (int) – Exact number of records to return. Will raise an exception if any other number of records match

Returns:

A new DataContainer identical to this one, but with updated history and filtered outputs (except if exactly=1, in which case a single DataItem is returned).

Return type:

DataContainer or DataItem

gte(key, value, minimum=None, maximum=None, exactly=None)

Return a decimated version of this DataContainer containing only the rows where the “key” column is greater than or equal to the value in the “value” parameter.

Parameters:

key (str) – Name of column to filter on
value (Varies depending on contents of "key" column) – Value to compare against
minimum (int) – Minimum number of records to return. Will raise an exception if too few records match
maximum (int) – Maximum number of records to return. Will raise an exception if too many records match
exactly (int) – Exact number of records to return. Will raise an exception if any other number of records match

Returns:

A new DataContainer identical to this one, but with updated history and filtered outputs (except if exactly=1, in which case a single DataItem is returned).

Return type:

DataContainer or DataItem

inject_error(lamb)

Iterates through records and applies a transformation lambda. Useful for error injection or data simulation.

Parameters:

lamb – A function that accepts a record and returns (bool, key, value).
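One plausible reading of the (bool, key, value) contract is sketched below: the boolean says whether to modify the record, and key and value say what to write. That interpretation is an assumption, as is returning copies instead of mutating in place; the function here is a stand-in that works on plain dictionaries.

```python
def inject_error(records, lamb):
    """Apply lamb to each record and overwrite one field where it says so."""
    updated = []
    for record in records:
        should_change, key, value = lamb(record)
        new_record = dict(record)  # copy so the input rows stay untouched
        if should_change:
            new_record[key] = value
        updated.append(new_record)
    return updated


rows = [{"id": 1, "volts": 5.0}, {"id": 2, "volts": 5.1}]
# Simulate a dropped reading on the second row only.
corrupted = inject_error(rows, lambda r: (r["id"] == 2, "volts", None))
```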

insert(index, record): Inserts a record at the specified index and returns a new container instance.

isin(key, values, minimum=None, maximum=None, exactly=None)

Return a decimated version of this DataContainer containing only the rows where the “key” column matches any of the values in the list “values”.

Parameters:

key (str) – Name of column to filter on
values (list) – Values to compare against
minimum (int) – Minimum number of records to return. Will raise an exception if too few records match
maximum (int) – Maximum number of records to return. Will raise an exception if too many records match
exactly (int) – Exact number of records to return. Will raise an exception if any other number of records match

Returns:

A new DataContainer identical to this one, but with updated history and filtered outputs (except if exactly=1, in which case a single DataItem is returned).

Return type:

DataContainer or DataItem

lt(key, value, minimum=None, maximum=None, exactly=None)

Return a decimated version of this DataContainer containing only the rows where the “key” column is less than the value in the “value” parameter.

Parameters:

key (str) – Name of column to filter on
value (Varies depending on contents of "key" column) – Value to compare against
minimum (int) – Minimum number of records to return. Will raise an exception if too few records match
maximum (int) – Maximum number of records to return. Will raise an exception if too many records match
exactly (int) – Exact number of records to return. Will raise an exception if any other number of records match

Returns:

A new DataContainer identical to this one, but with updated history and filtered outputs (except if exactly=1, in which case a single DataItem is returned).

Return type:

DataContainer or DataItem

lte(key, value, minimum=None, maximum=None, exactly=None)

Return a decimated version of this DataContainer containing only the rows where the “key” column is less than or equal to the value in the “value” parameter.

Parameters:

key (str) – Name of column to filter on
value (Varies depending on contents of "key" column) – Value to compare against
minimum (int) – Minimum number of records to return. Will raise an exception if too few records match
maximum (int) – Maximum number of records to return. Will raise an exception if too many records match
exactly (int) – Exact number of records to return. Will raise an exception if any other number of records match

Returns:

A new DataContainer identical to this one, but with updated history and filtered outputs (except if exactly=1, in which case a single DataItem is returned).

Return type:

DataContainer or DataItem

matches(key, pattern, minimum=None, maximum=None, exactly=None)

Return a decimated version of this DataContainer containing only the rows where the “key” column matches the regex in the “pattern” parameter.

Parameters:

key (str) – Name of column to match against
pattern (r-string) – Regex pattern
minimum (int) – Minimum number of records to return. Will raise an exception if too few records match
maximum (int) – Maximum number of records to return. Will raise an exception if too many records match
exactly (int) – Exact number of records to return. Will raise an exception if any other number of records match

Returns:

A new DataContainer identical to this one, but with updated history and filtered outputs (except if exactly=1, in which case a single DataItem is returned).

Return type:

DataContainer or DataItem

ne(key, value, minimum=None, maximum=None, exactly=None)

Return a decimated version of this DataContainer containing only the rows where the “key” column does not equal the value in the “value” parameter.

Parameters:

key (str) – Name of column to filter on
value (Varies depending on contents of "key" column) – Value to compare against
minimum (int) – Minimum number of records to return. Will raise an exception if too few records match
maximum (int) – Maximum number of records to return. Will raise an exception if too many records match
exactly (int) – Exact number of records to return. Will raise an exception if any other number of records match

Returns:

A new DataContainer identical to this one, but with updated history and filtered outputs (except if exactly=1, in which case a single DataItem is returned).

Return type:

DataContainer or DataItem

notin(key, values, minimum=None, maximum=None, exactly=None)

Return a decimated version of this DataContainer containing only the rows where the “key” column does not match any value in the list “values”.

Parameters:

key (str) – Name of column to filter on
values (list) – Values to compare against
minimum (int) – Minimum number of records to return. Will raise an exception if too few records match
maximum (int) – Maximum number of records to return. Will raise an exception if too many records match
exactly (int) – Exact number of records to return. Will raise an exception if any other number of records match

Returns:

A new DataContainer identical to this one, but with updated history and filtered outputs (except if exactly=1, in which case a single DataItem is returned).

Return type:

DataContainer or DataItem

on_change(key, minimum=None, maximum=None, exactly=None)

Return a decimated version of this DataContainer containing only the rows where the “key” column differs from its value in the previous row. The first row is always included.

Parameters:

key (str) – Name of column to inspect for changes
minimum (int) – Minimum number of records to return. Will raise an exception if too few records match
maximum (int) – Maximum number of records to return. Will raise an exception if too many records match
exactly (int) – Exact number of records to return. Will raise an exception if any other number of records match

Returns:

A new DataContainer identical to this one, but with updated history and filtered outputs (except if exactly=1, in which case a single DataItem is returned).

Return type:

DataContainer or DataItem

power_table(superheader=None, columns=None, bypass_styles=False, row_styles=None, cell_styles=None, **kwargs)

Produce a rich, interactive HTML table representation of this DataContainer.

Concept: This method integrates with html_utils to translate the 2D records into a PowerTable. It handles complex nesting by recursively calling power_table on any subcontainers linked to specific rows.

Parameters:

superheader (str) – Title row spanning the full width of the table.
columns (list[str]) – Labels to include. Defaults to self.repr_cols.
bypass_styles (bool) – If True, default CSS and row-level styles are ignored.
row_styles (list[dict[str, str]]) – Custom CSS for each row. Must match self.records length.
cell_styles (list[list[dict[str, str]]]) – Custom CSS for each cell. Must match self.records length.
kwargs – Passthrough arguments for PowerTable (e.g., id, add_filters).

Returns:

A rendered PowerTable component.

read_csv(csv_path): Reads a CSV file into a list of record dictionaries.

read_xlsx(xlsx_path): Reads an Excel file into a list of record dictionaries, handling NaN values as None.

property repr_cols: Columns to be used for representations (terminal, HTML, etc.).

simple_record_table(*args, **kwargs): Placeholder for breaking tests. One day at a time here…

sort(by=None, lam=None, reverse=False)

Return a version of the DataContainer with rows sorted by the column named in the “by” kwarg or by the lambda function in the “lam” kwarg.

Always sorts by ascending (for now, see https://github.jpl.nasa.gov/teamtools-studio/data_utils/issues/33)

Parameters:

by (str) – Name of column to sort by
lam (lambda) – Lambda to control how values are sorted
reverse (bool) – By default, sorts like Python list sort. This works the same as reverse kwarg on default list sort

Returns:

A new DataContainer identical to this one, but with updated history and sorted outputs.

Return type:

DataContainer

property source: Returns a list of raw source dictionaries for all contained records.

stamp_all(dispo_choice, dispo_format): Iterates through all data and applies a disposition stamp.

classmethod subdivide(sub_data, bypass_validation=False): Class method to create a new ‘sub-container’ instance.

subdivide_f(sub_f): Returns a subdivided container based on a filter function.

summarize(key, expected_values=None, include_times=True)

Generates a summary table counting occurrences and time ranges for unique values in a specific column.

The Concept: This method transforms the current data into a frequency report. If expected_values are provided, it validates the data against them and ensures the output table follows the user’s preferred ordering, while still appending any unexpected “rogue” values at the end of the list.

Parameters:

key (str) – The column name to summarize.
expected_values (list, optional) – Optional list of values to check for and order by.
include_times (bool) – If True, adds “First Occurrence” and “Last Occurrence” columns.

Returns:

A GenericContainer containing the summary records.
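The ordering rule described above (expected values first, rogue values appended) can be sketched with the standard library. The function name frequency_report is hypothetical, it omits the occurrence-time columns, and reporting expected-but-absent values with a count of 0 is an assumption.

```python
from collections import Counter


def frequency_report(values, expected_values=None):
    """Count occurrences, listing expected values first and rogue values last."""
    counts = Counter(values)
    ordered = list(expected_values or [])
    # Rogue values are appended in the order they were first seen.
    ordered += [value for value in counts if value not in ordered]
    return [{"Value": value, "Count": counts.get(value, 0)} for value in ordered]


report = frequency_report(["B", "A", "B", "Z"], expected_values=["A", "B", "C"])
```

In this example the report lists A, B, and C in the requested order, followed by the unexpected value Z.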

table(columns=None)

Explicitly prints the ASCII grid table representation to standard output.

Concept: While __repr__ handles automatic display in the terminal, this method allows for programmatic printing with an optional subset of columns.

Parameters:

columns (list[str], optional) – List of column labels to include. Defaults to self.repr_cols.

to_csv(csv_path, mkdirs=False)

Writes the container’s records to a CSV file.

Parameters:

csv_path – Target file path.
mkdirs – If True, creates the target directory if it does not exist.

unique(key, exclude=[], sort=True)

Returns a list of unique values found in a specific column.

Parameters:

key (str) – Name of the column to inspect.
exclude (list) – List of values to filter out of the final unique list.
sort (bool) – If True, the resulting list is sorted ascending.

Returns:

A list of unique values.

property valid: Returns True if all records pass validation (or if validation is bypassed).

visual_diff(right, ignore_cols=[], tolerance={})

Generates a side-by-side visual alignment between this container and another.

The Concept: This uses SequenceMatcher to find the best horizontal alignment between two datasets. It identifies identical rows, modified rows (replace), and inserted/deleted rows. It then injects “empty” placeholders into the resulting containers so that matching records stay horizontally synchronized when rendered.

Parameters:

right (DataContainer) – The DataContainer to compare against.
ignore_cols (list[str]) – Columns to exclude from the row-matching signature.
tolerance (dict[str, float]) – Drift allowance for numeric or datetime columns.

Returns:

A tuple of two VisualDiffContainers (left, right).
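The alignment idea can be sketched with difflib.SequenceMatcher from the standard library. The function align below is hypothetical and works on lists of hashable row signatures (for example tuples of cell values); it ignores ignore_cols and tolerance, and the real method returns VisualDiffContainers instead of lists.

```python
from difflib import SequenceMatcher


def align(left, right, placeholder=None):
    """Pad two sequences so that matching rows end up at the same index."""
    matcher = SequenceMatcher(a=left, b=right, autojunk=False)
    out_left, out_right = [], []
    for _tag, i1, i2, j1, j2 in matcher.get_opcodes():
        left_chunk, right_chunk = left[i1:i2], right[j1:j2]
        width = max(len(left_chunk), len(right_chunk))
        # The shorter side of each block is padded with "empty" placeholders.
        out_left += left_chunk + [placeholder] * (width - len(left_chunk))
        out_right += right_chunk + [placeholder] * (width - len(right_chunk))
    return out_left, out_right


aligned_left, aligned_right = align(["a", "b", "c"], ["a", "c", "d"])
```

Both outputs have the same length, so they can be rendered side by side with deleted or inserted rows showing as gaps on the opposite side.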

with_cols(columns)

Returns a new version of this container with the display columns changed. Will add new (empty) columns if they do not currently exist in the records.

Parameters:

columns (list[str]) – List of column names to display/return.

Returns:

A new DataContainer instance with updated column settings.

tts_data_utils.core.data_container.after(time, time_label, inclusive=False)

Returns a predicate for datetime comparison (later than).

Parameters:

time (datetime) – Time for comparison
time_label (str) – Label for time column
inclusive (bool) – Should we include a time that is exactly equal? Defaults to False

tts_data_utils.core.data_container.before(time, time_label, inclusive=False)

Returns a predicate for datetime comparison (earlier than).

Parameters:

time (datetime) – Time for comparison
time_label (str) – Label for time column
inclusive (bool) – Should we include a time that is exactly equal? Defaults to False

tts_data_utils.core.data_container.between(key, lower, upper, inclusive='both')

Returns a predicate for range comparison.

Parameters:

key (str) – column to filter on
lower – Lower value for range comparison
upper – Upper value for range comparison
inclusive – One of ‘both’, ‘neither’, ‘lower’, or ‘upper’.
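A predicate factory of this shape could be written as below. This is a sketch, not the library's source: it assumes records can be indexed by key like dictionaries and that an unknown inclusive value should raise ValueError.

```python
def between(key, lower, upper, inclusive="both"):
    """Return a function that tests whether record[key] lies in the range."""
    if inclusive not in ("both", "neither", "lower", "upper"):
        raise ValueError(f"unknown inclusive option: {inclusive!r}")

    def predicate(record):
        value = record[key]
        # Each bound is closed or open depending on the inclusive option.
        above = value >= lower if inclusive in ("both", "lower") else value > lower
        below = value <= upper if inclusive in ("both", "upper") else value < upper
        return above and below

    return predicate


rows = [{"x": 1}, {"x": 2}, {"x": 3}]
closed = [r["x"] for r in rows if between("x", 1, 3)(r)]
open_ends = [r["x"] for r in rows if between("x", 1, 3, inclusive="neither")(r)]
```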

tts_data_utils.core.data_container.contains(key, substring, case_sensitive=True)

Returns a predicate for substring matching.

Parameters:

key (str) – column to filter on
substring (str) – Substring to check values for
case_sensitive (bool) – whether to check with case sensitivity or not. Defaults to True

tts_data_utils.core.data_container.doesnotcontain(key, substring, case_sensitive=True)

Returns a predicate for negative substring matching.

Parameters:

key (str) – column to filter on
substring (str) – Substring to check values for
case_sensitive (bool) – whether to check with case sensitivity or not. Defaults to True

tts_data_utils.core.data_container.doesnotmatch(key, pattern)

Returns a predicate for negative regex matching.

Parameters:

key (str) – column to filter on
pattern (str) – regex pattern to match with

tts_data_utils.core.data_container.eq(key, value, tolerance=0)

Returns a predicate for: field == value.

Parameters:

key (str) – column to filter on
value (any) – value to check against

tts_data_utils.core.data_container.find_bad_utf8_characters(filepath)

Helper function to identify non-UTF-8 characters in CSV files, common when interacting with Windows-generated Microsoft Excel files.

The Problem: When CSVs are saved via Excel on Windows, non-UTF-8 characters—like curled quotation marks, single-character arrows, and degree symbols—are often added. Attempting to read these into Pandas causes a UnicodeDecodeError.

The Solution: This script allows developers to catch that exception, read the file in binary mode, and report the exact line and byte offset of the first error to facilitate cleaning.

Future Improvements:

Amend to repair the file automatically (referencing the M20 dictionary input management logic).
Report all encoding errors instead of just the first.

See ticket #31 (TO DO: Migrate out of JPL-internal issues)

Parameters:

filepath (str or pathlib.Path) – Path to the CSV file to be checked.
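The binary-mode scan described above can be approximated with the standard library alone. The function name first_bad_utf8 and its return shape are hypothetical; the byte offset reported here is relative to the start of the offending line.

```python
def first_bad_utf8(filepath):
    """Return (line_number, byte_offset_in_line, bad_bytes) or None if clean."""
    with open(filepath, "rb") as handle:
        for line_number, raw_line in enumerate(handle, start=1):
            try:
                raw_line.decode("utf-8")
            except UnicodeDecodeError as error:
                # error.start and error.end bound the undecodable bytes in raw_line.
                return line_number, error.start, raw_line[error.start:error.end]
    return None
```

Bytes such as 0x93 and 0x94 (curled quotation marks in Windows-1252) are typical of what this reports for Excel-exported files.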

tts_data_utils.core.data_container.gt(key, value)

Returns a predicate for: field > value.

Parameters:

key (str) – column to filter on
value (int or float) – value to check against

tts_data_utils.core.data_container.gte(key, value)

Returns a predicate for: field >= value.

Parameters:

key (str) – column to filter on
value (int or float) – value to check against

tts_data_utils.core.data_container.isin(key, values)

Returns a predicate for: field in list_of_values.

Parameters:

key (str) – column to filter on
values (list) – values to check against

tts_data_utils.core.data_container.lt(key, value)

Returns a predicate for: field < value.

Parameters:

key (str) – column to filter on
value (int or float) – value to check against

tts_data_utils.core.data_container.lte(key, value)

Returns a predicate for: field <= value.

Parameters:

key (str) – column to filter on
value (int or float) – value to check against

tts_data_utils.core.data_container.matches(key, pattern)

Returns a predicate for regex matching.

Parameters:

key (str) – column to filter on
pattern (str) – regex pattern to match with
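
As a rough illustration of the regex predicates (matches and its negation, doesnotmatch), the sketch below builds both from Python's re module. It assumes dict-style records and uses re.search; whether the library anchors the pattern is not stated here. The names matches_sketch, doesnotmatch_sketch, and the Command column are hypothetical.

```python
import re


def matches_sketch(key, pattern):
    """Hypothetical stand-in for matches(); not the library's implementation."""
    compiled = re.compile(pattern)

    def predicate(record):
        # Assumption: the field is coerced to str and searched anywhere in the text.
        return compiled.search(str(record[key])) is not None

    return predicate


def doesnotmatch_sketch(key, pattern):
    """Negation of matches_sketch for the same key and pattern."""
    positive = matches_sketch(key, pattern)
    return lambda record: not positive(record)


rows = [{"Command": "PWR_ON"}, {"Command": "HTR_OFF"}, {"Command": "PWR_OFF"}]
power = [r["Command"] for r in rows if matches_sketch("Command", r"^PWR_")(r)]
other = [r["Command"] for r in rows if doesnotmatch_sketch("Command", r"^PWR_")(r)]
```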

tts_data_utils.core.data_container.ne(key, value)

Returns a predicate for: field != value.

Parameters:

key (str) – column to filter on
value (any) – value to check against

tts_data_utils.core.data_container.notin(key, values)

Returns a predicate for: field not in list_of_values.

Parameters:

key (str) – column to filter on
values (list) – values to check against

tts_data_utils.core.data_container.on_change(key)

Returns a stateful predicate that triggers when the value in a column changes relative to the previous record.

Parameters:: key (str) – Column to check for changes in
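
A stateful predicate of this kind can be sketched as a closure that remembers the last value it saw. The sketch assumes dict-style records and assumes the first record never triggers, since it has no predecessor; the library's behavior on the first record is not stated here. The name on_change_sketch and the Mode column are hypothetical.

```python
def on_change_sketch(key):
    """Hypothetical stand-in for on_change(); not the library's implementation."""
    unset = object()  # Sentinel so that None remains a legitimate column value.
    previous = unset

    def predicate(record):
        nonlocal previous
        current = record[key]
        changed = previous is not unset and current != previous
        previous = current
        return changed

    return predicate


rows = [{"Mode": "SAFE"}, {"Mode": "SAFE"}, {"Mode": "NOMINAL"},
        {"Mode": "NOMINAL"}, {"Mode": "SAFE"}]
changed = on_change_sketch("Mode")
flags = [changed(row) for row in rows]
```

Because the predicate carries state, a fresh one should be created for each pass over the data.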

tts_data_utils.core.data_item

class tts_data_utils.core.data_item.DataItem(source, subcontainers=None, copy_data=False, cast_fields=False, fill=False, is_django=False, validate=True, default_dispo=None)

Bases: ABC

The fundamental atomic unit of a DataContainer, representing a single row.

The Concept: A DataItem acts as a “Smart Dictionary” with a memory. It separates data into two layers: Source and Derived.

Think of it like a piece of trace paper over an original document. The source (the document) remains untouched for auditability. Any edits, calculations, or augmentations are written on the derived_values (the trace paper). When you ask for a value, the DataItem looks at the trace paper first; if nothing is there, it reads from the original document.

Traceability & Integrity: This architecture ensures that the original raw data is never destroyed or altered by processing logic, which is critical for engineering applications requiring data lineage.
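
The two-layer lookup can be illustrated with a small standalone class. This is a sketch of the idea only, assuming dict-style access; LayeredRow is a hypothetical name and does not reproduce DataItem's real interface.

```python
class LayeredRow:
    """Hypothetical illustration of source/derived layering; not DataItem itself."""

    def __init__(self, source):
        self.source = source        # The original document; never written to.
        self.derived_values = {}    # The trace paper; all edits land here.

    def __getitem__(self, key):
        # Look at the derived layer first, then fall back to the source.
        if key in self.derived_values:
            return self.derived_values[key]
        return self.source[key]

    def __setitem__(self, key, value):
        self.derived_values[key] = value


row = LayeredRow({"Status": "raw", "Count": 3})
row["Status"] = "reviewed"
```

After the assignment, row["Status"] reads from the derived layer while row.source still holds the original value.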

Parameters:

source (dict) – Dictionary of raw data for the row.
subcontainers (dict[str, DataContainer], optional) – Nested containers attached to this row.
copy_data (bool) – If True, deep-copies the source data to prevent mutation.
cast_fields (bool) – If True, attempts to force fields into canonical datatypes.
fill (bool) – If True, adds None for missing columns defined in DICT_VALID_KEYS.
is_django (bool) – Is the source coming from a Django model?
validate (bool) – Should we raise an exception if fields are the wrong type?
default_dispo (str) – Default disposition object (Dexter only).

DEFAULT_DISPO = None

DO_NOT_DIFF = []: Fields to ignore in diffs

FLOAT_FORMAT = {}

NAME = 'data item': Name of the data item to be printed in some situations

TIME_FORMAT_PRECISION = {}

add_dispo(disposition)

Dexter only, and should be broken out into its own extension if possible.

Add a disposition to a dexter object.

TBD. Ask Nick Peper FMI

Parameters:: disposition (Dexter Disposition) – Disposition for whatever has happened to this DataItem

any_batches()

Dexter only, and should be broken out into its own extension if possible.

TBD. Ask Nick Peper FMI

choose_and_stamp(dispo_choice, dispo_format)

Dexter only, and should be broken out into its own extension if possible.

Does the same as choose_dispo(), but also stamps the original DataItem with either a string or a list of dispositions.

Parameters:

dispo_choice (DISPO_CHOICE) – FIRST, LAST, ALL
dispo_format (DISPO_FORMAT) – HTML, EXCEL, TEXT

choose_dispo(dispo_choice)

Dexter only, and should be broken out into its own extension if possible.

In Dexter, any DataItem can receive many dispositions. This method chooses which to present to the user.

Parameters:: dispo_choice (DISPO_CHOICE) – How would you like to roll up dispositions? FIRST, LAST, ALL?

property default_html_cell_styles: A mapping of keys to CSS styles for individual HTML cells.

property default_html_row_style: Default CSS styles for an HTML table row.

property default_rich_text_row_style: Default styles for rich-text terminal output.

classmethod empty(keys=[]): Returns an instance with all columns set to empty strings or None.

in_batch(batch)

Dexter only, and should be broken out into its own extension if possible.

TBD. Ask Nick Peper FMI

Parameters:: batch (DataUtils Batch) – Pointer to the data on which a disposition should be checked

new_dispo()

Dexter only, and should be broken out into its own extension if possible.

Add a disposition to a dexter object. Like add_dispo, but just adds an empty disposition.

TBD. Ask Nick Peper FMI

property printable_values: Same as values, but with formatting applied:

* TIME_FORMATS used to convert datetimes to strings
* FLOAT_FORMAT used to format float values (if defined)
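
The formatting step might look roughly like the sketch below. The shapes of the two mappings are assumptions: TIME_FORMATS is taken to map a column name to a strftime pattern, and FLOAT_FORMAT to map a column name to a str.format pattern. The function name and the column names are hypothetical.

```python
from datetime import datetime

# Assumed shapes for illustration only.
TIME_FORMATS = {"Time": "%Y-%m-%d %H:%M:%S"}
FLOAT_FORMAT = {"Voltage": "{:.2f}"}


def printable_values_sketch(values):
    """Hypothetical illustration; not the library's printable_values."""
    printable = {}
    for key, value in values.items():
        if isinstance(value, datetime) and key in TIME_FORMATS:
            printable[key] = value.strftime(TIME_FORMATS[key])
        elif isinstance(value, float) and key in FLOAT_FORMAT:
            printable[key] = FLOAT_FORMAT[key].format(value)
        else:
            printable[key] = value  # Left unformatted when no rule applies.
    return printable


formatted = printable_values_sketch(
    {"Time": datetime(2024, 2, 1, 12, 30, 0), "Voltage": 4.98765, "Name": "bus A"}
)
```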

row_signature(ignore_cols=[])

Generates a hashable representation of the row.

Concept: Used to determine if two items are effectively identical. It recursively converts mutable types (lists, dicts) into immutable tuples to ensure the signature can be hashed.
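
The recursive conversion described above can be sketched as follows. This is an illustration of the technique, assuming string keys (so that items can be sorted) and hashable leaf values; the names freeze and row_signature_sketch are hypothetical.

```python
def freeze(value):
    """Recursively convert dicts and lists into nested tuples."""
    if isinstance(value, dict):
        # Sort by key so that insertion order does not affect the signature.
        return tuple(sorted((k, freeze(v)) for k, v in value.items()))
    if isinstance(value, (list, tuple)):
        return tuple(freeze(v) for v in value)
    return value


def row_signature_sketch(values, ignore_cols=()):
    """Hypothetical illustration; not the library's row_signature."""
    kept = {k: v for k, v in values.items() if k not in ignore_cols}
    return freeze(kept)


a = {"id": 1, "tags": ["x", "y"], "meta": {"b": 2, "a": 1}}
b = {"meta": {"a": 1, "b": 2}, "tags": ["x", "y"], "id": 1}
same = row_signature_sketch(a) == row_signature_sketch(b)
unique_rows = {row_signature_sketch(a), row_signature_sketch(b)}
```

Because the result is a tuple of hashable parts, signatures can be placed in a set or used as dictionary keys for de-duplication.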

stamp(dispo_value)

Dexter only, and should be broken out into its own extension if possible.

Adds a disposition value to source. Not used in isolation, and should probably be made private in a future version.

tag_with_batch(batch)

Dexter only, and should be broken out into its own extension if possible.

TBD. Ask Nick Peper FMI

Parameters:: batch (DataUtils Batch) – Pointer to the data on which a disposition should be checked

abstract property time: Must define some way to time-tag each data item.

property valid: Runs self.validate, but returns a simple bool instead of a list of invalid records.

validate(): Compares values in each field to those provided in DICT_VALID_KEYS, which must be provided in each extension of DataItem.

property values: Property to return all values, be they from the original source, newly added, or altered.

tts_data_utils.core.diff

class tts_data_utils.core.diff.DiffContainer(name, metadata, csv_path=None)

Bases: DataContainer

The results object generated by a DataContainer.diff() operation.

The Concept: A DiffContainer is a specialized collection that holds DiffItems. Its primary purpose is to act as a “Filterable Report.” By default, when you print or view this container, it filters itself to show only the differences (Same == False), allowing you to ignore identical data and focus on what changed.

Usage Note: You generally do not instantiate this class manually. It is returned automatically when you call .diff() on any standard DataContainer.

Parameters:

name (str) – The identifier for this diff report.
metadata (dict) – Contextual information about the two containers being compared.
csv_path (str, optional) – Optional path to load/save the diff results.

DATA_ITEM_CLS: alias of DiffItem

DO_NOT_DIFF = ['left', 'right']: Keys to ignore when running self.diff.

NAME = 'diff'

is_same()

Filters the report to show only matching records.

Concept: Useful for “sanity checking” to confirm which values were verified as identical.

left(key)

Retrieves the baseline value for a specific key.

Parameters:: key (str) – The field name to look up.
Returns:: The value from the ‘Left’ side of the comparison.

not_same()

Filters the report to show only mismatched records.

Concept: This is the primary troubleshooting view, isolating only the discrepancies between datasets.

property repr_cols: Returns the standard list of columns for display.

right(key)

Retrieves the comparison value for a specific key.

Parameters:: key (str) – The field name to look up.
Returns:: The value from the ‘Right’ side of the comparison.

class tts_data_utils.core.diff.DiffItem(source, subcontainers=None, copy_data=False, cast_fields=False, fill=False, is_django=False, validate=True, default_dispo=None)

Bases: DataItem

An individual record representing a field-level comparison between two objects.

The Concept: Think of a DiffItem as a single row in a “Spot the Difference” report. It doesn’t just store a value; it stores the value from the “Left” side, the value from the “Right” side, and a pre-calculated verdict on whether they are identical.

How it works: When two DataContainers are compared, each shared key is packed into a DiffItem. If the data types differ between the sides (e.g., comparing a string to an integer), the Type field is automatically set to “various” to alert the user of a schema mismatch.
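
A field-level comparison of this kind can be sketched with plain dictionaries. The output keys follow the DiffItem fields listed below; everything else is an assumption for illustration, including the function name diff_rows_sketch and the use of the bare type name (for example 'int') in the Type field.

```python
def diff_rows_sketch(left, right):
    """Hypothetical illustration of packing shared keys into diff rows."""
    rows = []
    for key in sorted(left.keys() & right.keys()):  # Shared keys only.
        left_value, right_value = left[key], right[key]
        same_type = type(left_value) is type(right_value)
        rows.append({
            "Key": key,
            "Same": left_value == right_value,
            # Mixed types are flagged rather than coerced.
            "Type": type(left_value).__name__ if same_type else "various",
            "Left": left_value,
            "Right": right_value,
        })
    return rows


report = diff_rows_sketch(
    {"name": "probe", "count": 3, "rev": "7"},
    {"name": "probe", "count": 4, "rev": 7},
)
mismatches = [row["Key"] for row in report if not row["Same"]]
```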

Parameters:

Key (str) – Name of the attribute or field being compared.
Same (bool) – A boolean flag indicating if Left and Right values match.
Type (str) – The Python data type of the values (or “various”).
Left (int | float | str | None) – The value found in the original (baseline) container.
Right (int | float | str | None) – The value found in the comparison (target) container.

DICT_VALID_KEYS = [('Key', <class 'str'>), ('Same', <class 'bool'>), ('Type', <class 'str'>), ('Left', (<class 'int'>, <class 'float'>, <class 'str'>, None)), ('Right', (<class 'int'>, <class 'float'>, <class 'str'>, None))]

DO_NOT_DIFF = {'source': ['left', 'right']}: Fields to ignore in diffs

TIME_FORMATS = {}

time()

[Disabled]

DiffItems represent logical differences rather than temporal events, so time parsing is disabled by default.

tts_data_utils.core.generic

class tts_data_utils.core.generic.GenericContainer(raw_data=None, name=None, **kwargs)

Bases: DataContainer

Container for hooking data into the DataContainer infrastructure when no extended class exists for that data type.

Usually this class shouldn’t be used in production, but is nice to have in development, especially when prototyping behavior for data types for which you are not certain you will want to eventually define an extension.

Some behavior will be missing without data validation defined, and no default styles for html or text repr will be available.

For parameters, see DataContainer

DATA_ITEM_CLS: alias of GenericItem

NAME = 'Generic Items'

property default_time_label: Returns the primary key used for time-based operations.

property repr_cols: Columns to be used for representations (terminal, HTML, etc.).

class tts_data_utils.core.generic.GenericItem(source, subcontainers=None, copy_data=False, cast_fields=False, fill=False, is_django=False, validate=True, default_dispo=None)

Bases: DataItem

DataItem to go with GenericContainer

No validation requirements on these items.

DICT_VALID_KEYS = []

NAME = 'Generic Data Item': Name of the data item to be printed in some situations

TIME_FORMATS = {}

property default_html_row_style: Default CSS styles for an HTML table row.

property time: Must define some way to time-tag each data item.

property time_str

tts_data_utils.core.lorem_utils

Utilities for generating lorem ipsum dummy data for DataContainer instances.

This module provides functions to generate realistic dummy data based on the structure defined in a DataItem class.

tts_data_utils.core.lorem_utils.generate_lorem_data_for_item(data_item_cls, num_records=10)

Generate lorem ipsum data for a specific DataItem class.

Parameters:

data_item_cls – The DataItem class to generate data for
num_records – Number of records to generate

Returns:: List of dictionaries with dummy data matching the DataItem structure

tts_data_utils.core.lorem_utils.generate_lorem_text(min_words=3, max_words=10): Generate lorem ipsum text with a random number of words.

tts_data_utils.core.lorem_utils.generate_smart_value(key, type_spec)

Generate a value for a specific key, using naming conventions to create more realistic data.

Parameters:

key – The field name
type_spec – The type specification

Returns:: A value appropriate for the field name and type

tts_data_utils.core.lorem_utils.generate_value_for_type(type_spec)

Generate a random value that matches the given type specification.

Parameters:: type_spec – A type or tuple of types (from DICT_VALID_KEYS)
Returns:: A random value of the appropriate type
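
One plausible way to turn a type specification into a random value is sketched below. It assumes the specification is either a single type or a tuple that may contain None, as in the DICT_VALID_KEYS entries shown in this reference. The function name, the value ranges, and the placeholder text are hypothetical.

```python
import random


def generate_value_for_type_sketch(type_spec, rng=random):
    """Hypothetical illustration; not the library's generator."""
    candidates = type_spec if isinstance(type_spec, tuple) else (type_spec,)
    # Some DICT_VALID_KEYS tuples include None; skip it when picking a type.
    concrete = [t for t in candidates if t is not None]
    if not concrete:
        return None
    chosen = rng.choice(concrete)
    if chosen is bool:
        return rng.choice([True, False])
    if chosen is int:
        return rng.randint(0, 100)
    if chosen is float:
        return rng.uniform(0.0, 100.0)
    if chosen is str:
        return "lorem ipsum"
    if chosen is list:
        return []
    return None  # Unknown types are left empty in this sketch.


seeded = random.Random(0)  # A seeded generator keeps demo output repeatable.
number = generate_value_for_type_sketch(int, rng=seeded)
mixed = generate_value_for_type_sketch((int, float, str, None), rng=seeded)
```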

tts_data_utils.core.visual_diff

class tts_data_utils.core.visual_diff.VisualDiffContainer(raw_data=None, name=None, **kwargs)

Bases: DataContainer

A collection of VisualDiffItems, representing a full comparison report.

Concept: The VisualDiffContainer acts as the manager for a set of diffed records. It distinguishes between “Display” columns (what the user needs to see) and “Metadata” columns (the internal flags used to calculate differences).

Usage Note: By default, any column starting with an underscore (_) is treated as internal metadata and is hidden from the standard repr_cols view, though it remains accessible for CSV exports and logic processing.
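
The underscore convention amounts to a simple filter over column names. The sketch below illustrates that rule only; visible_columns is a hypothetical helper, and the Time and Value columns are made up for the example.

```python
def visible_columns(columns):
    """Hypothetical helper: drop columns that start with an underscore."""
    return [name for name in columns if not name.startswith("_")]


all_columns = ["Time", "Value", "_visdiff_index", "_visdiff_match", "_mismatched_keys"]
shown = visible_columns(all_columns)
```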

Parameters:

raw_data (list[dict]) – A list of dictionaries representing the diffed rows.
name (str, optional) – The name of the container/report. Defaults to ‘Generic Container’.
kwargs – Additional arguments passed to the parent DataContainer.

DATA_ITEM_CLS: alias of VisualDiffItem

NAME = 'Visual Diff Container'

property default_time_label: The primary time-based column label used for sorting or indexing.

property repr_cols: The list of columns intended for visual display (excludes internal metadata).

class tts_data_utils.core.visual_diff.VisualDiffItem(source, subcontainers=None, copy_data=False, cast_fields=False, fill=False, is_django=False, validate=True, default_dispo=None)

Bases: DataItem

A specialized data record designed for side-by-side comparison.

The Concept: When comparing two versions of a dataset (a “Diff”), we don’t just care about the raw values—we care about the state of each row relative to its counterpart. Is it new? Was it deleted? Was it modified?

A VisualDiffItem carries the data values along with metadata (prefixed with _visdiff) that tells a renderer exactly how to style the row (e.g., green for an insert, red for a delete) to make differences immediately obvious to a human reviewer.

Styling Logic:

* Row Level: The entire background color is determined by the _visdiff_match status.
* Cell Level: Individual cells are highlighted if their specific key exists in the _mismatched_keys list.
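
The two styling levels can be sketched with plain dictionaries. The row colors are copied from MATCH_STATUS_COLORS below; the cell highlight style, the function names, and the Value column are assumptions made for illustration, since the actual cell style is not shown in this reference.

```python
ROW_STYLES = {
    "insert": {"background-color": "#95FB95", "color": "#333333"},
    "delete": {"background-color": "#FF614A", "color": "#333333"},
    "replace": {"background-color": "#ADC4DF", "color": "#333333"},
}
CELL_HIGHLIGHT = {"font-weight": "bold"}  # Hypothetical; not the library's style.


def row_style_sketch(item):
    """Row level: pick one style for the whole row from its match status."""
    return ROW_STYLES.get(item.get("_visdiff_match"), {})


def cell_styles_sketch(item):
    """Cell level: highlight only the keys listed in _mismatched_keys."""
    return {key: CELL_HIGHLIGHT for key in item.get("_mismatched_keys", [])}


item = {"Value": 5, "_visdiff_match": "replace", "_mismatched_keys": ["Value"]}
row_css = row_style_sketch(item)
cell_css = cell_styles_sketch(item)
```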

DICT_VALID_KEYS = [('_visdiff_index', (<class 'int'>, None)), ('_visdiff_match', (<class 'str'>, None)), ('_mismatched_keys', <class 'list'>)]

MATCH_STATUS_COLORS = {'delete': {'background-color': '#FF614A', 'color': '#333333'}, 'empty_from_delete': {'background-color': '#BDBEBD', 'color': '#BDBEBD'}, 'empty_from_insert': {'background-color': '#BDBEBD', 'color': '#BDBEBD'}, 'equal': {'background-color': '#FFFFF', 'color': '#333333'}, 'insert': {'background-color': '#95FB95', 'color': '#333333'}, 'replace': {'background-color': '#ADC4DF', 'color': '#333333'}, 'unmatched': {'background-color': '#FF614A', 'color': '#333333'}}

NAME = 'Generic Data Item': Name of the data item to be printed in some situations

PALETTE = {'blue': {'background-color': '#ADC4DF', 'color': '#333333'}, 'error': {'background-color': '#000000', 'color': '#FF614A'}, 'green': {'background-color': '#95FB95', 'color': '#333333'}, 'grey': {'background-color': '#BDBEBD', 'color': '#BDBEBD'}, 'red': {'background-color': '#FF614A', 'color': '#333333'}, 'white': {'background-color': '#FFFFF', 'color': '#333333'}}

TIME_FORMATS = {'Time': '%Y-%jT%H:%M:%S.%f'}
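
In the format string above, %j is the zero-padded day of the year, so timestamps take a year and day-of-year form rather than year, month, and day. A short standard-library example, using a made-up timestamp:

```python
from datetime import datetime

TIME_FORMAT = "%Y-%jT%H:%M:%S.%f"

# Day 032 of 2024 is 1 February.
parsed = datetime.strptime("2024-032T12:30:00.250000", TIME_FORMAT)
round_trip = parsed.strftime(TIME_FORMAT)
```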

property default_html_cell_styles: Returns a mapping of column keys to CSS styles, highlighting specific mismatched cells.

property default_html_row_style: Returns the CSS style dictionary for the table row based on the match status.

property time: Returns the native time object for the item.

property time_str: Returns a formatted string representation of the item’s timestamp.