tts_data_utils.core
Submodules
tts_data_utils.core.container_history
- class tts_data_utils.core.container_history.DataContainerHistoryContainer(name, metadata)
Bases:
DataContainer
A special container that is added by default to all data containers (except history containers to avoid infinite recursion).
As containers are filtered, appended, and sliced, it can become difficult to trace how a particular container came to be. The DataContainerHistoryContainer tracks the actions taken to build a container, starting from its initial metadata and including each subsequent manipulation.
Data container history is not meant to make previous states of the container reproducible; it is only an aid for debugging code.
The number of records is also provided to enable some level of data analysis based on filtering values.
- Parameters:
name (str) – The name of this instantiation of the data container.
metadata (dict) – Dictionary of open-ended metadata that can be provided in extensions of DataContainer
- DATA_ITEM_CLS
alias of
DataContainerHistoryItem
- NAME = 'data container history'
- property repr_cols
Columns to be used for representations (terminal, HTML, etc.).
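As a rough illustration of the bookkeeping described above, the following minimal sketch filters a plain list of dictionaries and appends one history row per step. It does not use the library: `filter_with_history` is a hypothetical helper, and only the column names are borrowed from DICT_VALID_KEYS of DataContainerHistoryItem.

```python
def filter_with_history(records, history, predicate, description):
    """Filter ``records`` and append one row describing the step to ``history``."""
    kept = [r for r in records if predicate(r)]
    start, end = len(records), len(kept)
    history.append({
        "Action": "filter",
        "Description": description,
        "Starting Count": start,
        "Ending Count": end,
        # Guard against division by zero when the input is empty.
        "Percent Remaining": round(100.0 * end / start, 1) if start else 0.0,
    })
    return kept


history = []
rows = [{"v": 1}, {"v": 2}, {"v": 3}, {"v": 4}]
rows = filter_with_history(rows, history, lambda r: r["v"] > 1, "v > 1")
rows = filter_with_history(rows, history, lambda r: r["v"] % 2 == 0, "v is even")
```

Reading `history` top to bottom then shows how many records each step removed, which is the debugging use case the class description has in mind.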
- class tts_data_utils.core.container_history.DataContainerHistoryItem(source, subcontainers=None, copy_data=False, cast_fields=False, fill=False, is_django=False, validate=True, default_dispo=None)
Bases:
DataItem
DataItem to go with DataContainerHistoryContainer
- Parameters:
Action (str) – Which action was taken at the step represented in this row?
Description (str) – Description of the action taken at the step represented in this row
Starting Count (int | float | str) – Number of records before the action was taken
Ending Count (int | float | str) – Number of records after the action was taken
Percent Remaining (int | float | str) – Percentage of records at this step relative to the previous step
- DICT_VALID_KEYS = [('Action', <class 'str'>), ('Description', <class 'str'>), ('Ending Count', (<class 'int'>, <class 'float'>, <class 'str'>)), ('Starting Count', (<class 'int'>, <class 'float'>, <class 'str'>)), ('Percent Remaining', (<class 'int'>, <class 'float'>, <class 'str'>))]
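A list of (key, allowed types) pairs like the one above lends itself to a simple `isinstance` check per field. The sketch below shows one way such validation could work; `VALID_KEYS` and `find_invalid_fields` are hypothetical names, and the library's actual validation logic may differ.

```python
# Hypothetical stand-in for DICT_VALID_KEYS: (key, allowed type or tuple of types).
VALID_KEYS = [
    ("Action", str),
    ("Description", str),
    ("Ending Count", (int, float, str)),
    ("Starting Count", (int, float, str)),
    ("Percent Remaining", (int, float, str)),
]


def find_invalid_fields(row, valid_keys=VALID_KEYS):
    """Return a list of human-readable problems found in ``row``."""
    problems = []
    for key, allowed in valid_keys:
        if key not in row:
            problems.append(f"missing key: {key}")
        elif not isinstance(row[key], allowed):
            problems.append(f"bad type for {key}: {type(row[key]).__name__}")
    return problems


good = {"Action": "filter", "Description": "eq(status, OK)",
        "Starting Count": 10, "Ending Count": 4, "Percent Remaining": 40.0}
bad = {"Action": "filter", "Description": None, "Starting Count": 10}
```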
- time()
Must define some way to time-tag each data item.
tts_data_utils.core.data_container
- class tts_data_utils.core.data_container.DataContainer(raw_data=None, subcontainers=None, csv_path=None, xlsx_path=None, django_records=None, metadata=None, name=None, cast_fields=False, validate=True, lorem=None, **kwargs)
Bases:
ABC
Primary (abstract) class for this library. Provides representation of 2D data with extension hooks for easy definition of quality-of-life features for any bespoke data type across projects.
Concept: Allows for easy tabular representation in terminals and HTML, playing nicely with html_utils to provide easy reporting of tabular data and nested tabular data.
When defining an extension of this class, a DataItem class is also provided, which controls the expected columns in each row.
Each row of the 2D data is represented by an instance of the associated DataItem class, stored in self.records. Most dunder methods have been defined such that this class behaves like a list (mapping to self.records), but carries the container’s metadata and history along with it.
TO DO: Provide gallery of examples of outputs (see ticket #34; TO DO: Migrate out of JPL-internal issues)
- Parameters:
raw_data (list[dict], optional) – 2D data to be transformed into DataContainer.
subcontainers (list[dict[str, DataContainer]], optional) – List of dictionaries where key is a label and value is a DataContainer. Must match length of raw_data.
csv_path (Path | str, optional) – Path for CSV to be transformed into DataContainer.
xlsx_path (Path | str, optional) – Path for XLSX to be transformed into DataContainer.
django_records (QuerySet, optional) – Django object containing data to be transformed.
metadata (dict, optional) – Arbitrary user information to be carried with the container.
name (str, optional) – Name of the DataContainer instance.
cast_fields (bool) – If True, attempts to force data into types defined in DataItem.
validate (bool) – If True, validates inputs against DataItem’s valid keys/types.
lorem (int, optional) – If provided as an integer, generates that many rows of dummy data.
- DATA_ITEM_CLS = None
Associated DataItem that must be defined alongside a DataContainer.
- DO_NOT_DIFF = []
Keys to ignore when running self.diff.
- abstract class property NAME
Name of the data type being contained, e.g. ‘evr’, ‘transpire_commands’.
- after(time, time_label=None, inclusive=False, minimum=None, maximum=None, exactly=None)
Return a decimated version of this DataContainer containing only the rows whose time column occurs after the value in the “time” parameter.
Unlike most other filter methods, time MUST be a datetime.
Note that “key” is not required since DataItems have default time columns. If an object has multiple time columns (or if using something like GenericContainer with no default time label), use the time_label kwarg to select one.
- Parameters:
time (datetime) – Time to compare against
time_label (str) – Name of time column to use if not the default
inclusive (bool) – If a row’s time matches “time” exactly, should it be included?
minimum (int) – Minimum number of records to return. Will raise an exception if too few records match
maximum (int) – Maximum number of records to return. Will raise an exception if too many records match
exactly (int) – Exact number of records to return. Will raise an exception if any other number of records match
- Returns:
A new DataContainer exactly the same as this one, but with updated history and filtered outputs (except if exactly=1, in which case it will return a DataItem only).
- Return type:
DataContainer or DataItem
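The count guards (minimum, maximum, exactly) and the exactly=1 special case are shared by most filter methods. The sketch below mimics that behavior on a plain list of dictionaries; it is an assumption-laden stand-in, not the library's code, and the choice of ValueError and the default "time" label are hypothetical.

```python
from datetime import datetime


def after(rows, time, time_label="time", inclusive=False,
          minimum=None, maximum=None, exactly=None):
    """Keep rows whose ``time_label`` value is later than ``time``."""
    if not isinstance(time, datetime):
        raise TypeError("time must be a datetime")
    kept = [r for r in rows
            if r[time_label] > time or (inclusive and r[time_label] == time)]
    count = len(kept)
    # Count guards: fail loudly when the result is not the expected size.
    if minimum is not None and count < minimum:
        raise ValueError(f"expected at least {minimum} rows, got {count}")
    if maximum is not None and count > maximum:
        raise ValueError(f"expected at most {maximum} rows, got {count}")
    if exactly is not None and count != exactly:
        raise ValueError(f"expected exactly {exactly} rows, got {count}")
    # With exactly=1 the single matching row is returned instead of a list.
    return kept[0] if exactly == 1 else kept


rows = [
    {"time": datetime(2024, 1, 1), "event": "boot"},
    {"time": datetime(2024, 1, 2), "event": "deploy"},
    {"time": datetime(2024, 1, 3), "event": "shutdown"},
]
late = after(rows, datetime(2024, 1, 2))
single = after(rows, datetime(2024, 1, 2), exactly=1)
```

Here `late` is a one-element list, while `single` is the row itself, mirroring the DataContainer versus DataItem return types.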
- append(items, cast_fields=False, fill=False)
Adds one or more items to the end of the container’s records.
- Parameters:
items – A dictionary, DataItem, or a list of either to append.
cast_fields – If True, attempts to force data into types defined in DataItem.
fill – If True, fills in missing keys with default values.
- assert_records_match_hash(expected_hash)
Validates the integrity of the records against a known hash.
- before(time, time_label=None, inclusive=False, minimum=None, maximum=None, exactly=None)
Return a decimated version of this DataContainer containing only the rows whose time column occurs before the value in the “time” parameter.
Unlike most other filter methods, time MUST be a datetime.
Note that “key” is not required since DataItems have default time columns. If an object has multiple time columns (or if using something like GenericContainer with no default time label), use the time_label kwarg to select one.
- Parameters:
time (datetime) – Time to compare against
time_label (str) – Name of time column to use if not the default
inclusive (bool) – If a row’s time matches “time” exactly, should it be included?
minimum (int) – Minimum number of records to return. Will raise an exception if too few records match
maximum (int) – Maximum number of records to return. Will raise an exception if too many records match
exactly (int) – Exact number of records to return. Will raise an exception if any other number of records match
- Returns:
A new DataContainer exactly the same as this one, but with updated history and filtered outputs (except if exactly=1, in which case it will return a DataItem only).
- Return type:
DataContainer or DataItem
- between(key, lower, upper, inclusive='both', minimum=None, maximum=None, exactly=None)
Return a decimated version of this DataContainer containing only the rows where the “key” column falls between the values in the “lower” and “upper” parameters.
Unlike most other filter methods, “lower” and “upper” MUST be datetimes.
Note that “key” is required on this method, unlike the before and after methods. This inconsistency is a developer error and is slated to be fixed at the next major release, since the fix will be a breaking change: issue #32 (TO DO: Migrate out of JPL-internal issues)
- Parameters:
key (str) – Name of time column to use
lower (datetime) – Lower bound of the time range
upper (datetime) – Upper bound of the time range
inclusive (str) – One of ‘both’, ‘neither’, ‘lower’, or ‘upper’; controls whether a row whose time matches a bound exactly is included
minimum (int) – Minimum number of records to return. Will raise an exception if too few records match
maximum (int) – Maximum number of records to return. Will raise an exception if too many records match
exactly (int) – Exact number of records to return. Will raise an exception if any other number of records match
- Returns:
A new DataContainer exactly the same as this one, but with updated history and filtered outputs (except if exactly=1, in which case it will return a DataItem only).
- Return type:
DataContainer or DataItem
- calculate_records_hash()
Generates a SHA256 hash representing the current state of all records.
The Process: Normalizes timestamps based on DataItem time formats to ensure consistent string representation before hashing the JSON-encoded record set.
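The process described above can be approximated with the standard library alone. This is a sketch under stated assumptions: `records_hash` and its `time_format` default are hypothetical, and the library's exact normalization and JSON settings are not documented here.

```python
import hashlib
import json
from datetime import datetime


def records_hash(records, time_format="%Y-%m-%dT%H:%M:%S"):
    """Return a SHA256 hex digest of ``records`` with normalized timestamps."""
    normalized = [
        {key: (value.strftime(time_format) if isinstance(value, datetime) else value)
         for key, value in record.items()}
        for record in records
    ]
    # sort_keys makes the digest independent of dictionary key order.
    payload = json.dumps(normalized, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


baseline = [{"t": datetime(2024, 1, 1, 12, 0, 0), "v": 1}]
reordered = [{"v": 1, "t": datetime(2024, 1, 1, 12, 0, 0)}]
changed = [{"t": datetime(2024, 1, 1, 12, 0, 0), "v": 2}]
```

A digest stored alongside a vetted dataset can later be compared against a fresh one, which is presumably the role of assert_records_match_hash.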
- compare_rows(l, r)
Calculates the similarity between two DataItems by counting matching values.
Concept: This is used by the visual diff engine to determine if two rows are similar enough to be considered a ‘replacement’ rather than an ‘insertion’ and ‘deletion’. It iterates through keys in the left item and checks for equality in the right item.
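A minimal sketch of such a similarity score, using plain dictionaries in place of DataItems (the standalone function form is an assumption):

```python
def compare_rows(left, right):
    """Count the keys of ``left`` whose values are equal in ``right``."""
    return sum(1 for key in left if key in right and left[key] == right[key])


old_row = {"id": 7, "state": "ON", "temp": 21.5}
new_row = {"id": 7, "state": "OFF", "temp": 21.5}
score = compare_rows(old_row, new_row)
```

A diff engine could then treat two rows as a replacement when the score exceeds some threshold, and as a separate insertion and deletion otherwise.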
- contains(key, substring, case_sensitive=True, minimum=None, maximum=None, exactly=None)
Return a decimated version of this DataContainer containing only the rows where the “key” column contains the value in the “substring” parameter as a substring.
- Parameters:
key (str) – Name of column to filter on
substring (str) – Value to compare against
case_sensitive (bool) – Should the substring match be case sensitive?
minimum (int) – Minimum number of records to return. Will raise an exception if too few records match
maximum (int) – Maximum number of records to return. Will raise an exception if too many records match
exactly (int) – Exact number of records to return. Will raise an exception if any other number of records match
- Returns:
A new DataContainer exactly the same as this one, but with updated history and filtered outputs (except if exactly=1, in which case it will return a DataItem only).
- Return type:
DataContainer or DataItem
- property default_html_row_style
Returns default CSS dictionary for HTML rows.
- property default_time_label
Returns the primary key used for time-based operations.
- diff(left='48vf34VD)$', right='48vf34VD)$', name='', ancestors='', diff_container=None, summarize=False, debug=False, do_not_diff_keys=[], ignore=[], float_tol=1e-10)
Generates a DiffContainer with a comprehensive comparison between two objects. Recursively descends through all attributes until the structures are fully diffed.
The Concept: This method is the backbone of the library’s regression testing suite. It is designed to compare a runtime DataContainer against a “vetted” baseline (typically from a CSV). It identifies missing keys, mismatched values, and type discrepancies across nested lists and dictionaries.
Handling Differently Ordered Data: Note that this method does not yet handle reordered containers gracefully; it is optimized for structures that are expected to be very similar in sequence.
The Null Guard: The default value ‘48vf34VD)$’ is used instead of None to allow None to be passed as a valid value to be diffed without triggering the “missing argument” logic.
- Parameters:
left – The primary value or container to compare.
right – The second value or container to compare. If omitted, self is treated as left and the first argument is treated as right.
name – Internal tracker for the current field name (used in recursion).
ancestors – Internal tracker for the breadcrumb path (used in recursion).
diff_container – The accumulator for diff results.
summarize – If True, returns a boolean (True if all match) instead of the container.
do_not_diff_keys – Keys to skip (useful for history or dynamic IDs).
ignore – Output paths to prune from the final results.
float_tol – Maximum allowance for floating-point precision drift.
- Returns:
A DiffContainer object or a boolean result.
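The null guard and the one-argument calling convention can be illustrated in isolation. In this sketch the `Value` class and its `operands` method are hypothetical; only the sentinel string and the left/right rule come from the description above.

```python
NOT_PROVIDED = "48vf34VD)$"  # sentinel string taken from the signature above


class Value:
    """Hypothetical holder used only to illustrate the argument handling."""

    def __init__(self, payload):
        self.payload = payload

    def operands(self, left=NOT_PROVIDED, right=NOT_PROVIDED):
        """Return the (left, right) pair that would be compared."""
        if left == NOT_PROVIDED:
            raise TypeError("at least one value to compare is required")
        if right == NOT_PROVIDED:
            # Only one argument given: self is left, the argument is right.
            return self.payload, left
        return left, right


one_arg = Value(1).operands(None)      # None is a real value, not "missing"
two_args = Value(1).operands(None, 2)
```

Because the default is not None, passing None explicitly is distinguishable from omitting the argument.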
- docx_table(template=None)
Produces a Microsoft Word table representation.
- Parameters:
template – Path to an optional template docx for styling.
- Returns:
Rendered DocxTable object.
- doesnotcontain(key, substring, case_sensitive=True, minimum=None, maximum=None, exactly=None)
Return a decimated version of this DataContainer containing only the rows where the “key” column does not contain the value in the “substring” parameter as a substring.
- Parameters:
key (str) – Name of column to filter on
substring (str) – Value to compare against
case_sensitive (bool) – Should the substring match be case sensitive?
minimum (int) – Minimum number of records to return. Will raise an exception if too few records match
maximum (int) – Maximum number of records to return. Will raise an exception if too many records match
exactly (int) – Exact number of records to return. Will raise an exception if any other number of records match
- Returns:
A new DataContainer exactly the same as this one, but with updated history and filtered outputs (except if exactly=1, in which case it will return a DataItem only).
- Return type:
DataContainer or DataItem
- doesnotmatch(key, pattern, minimum=None, maximum=None, exactly=None)
Return a decimated version of this DataContainer containing only the rows where the “key” column does not match the regex in the “pattern” parameter.
- Parameters:
key (str) – Name of column to not match against
pattern (r-string) – Regex pattern
minimum (int) – Minimum number of records to return. Will raise an exception if too few records match
maximum (int) – Maximum number of records to return. Will raise an exception if too many records match
exactly (int) – Exact number of records to return. Will raise an exception if any other number of records match
- Returns:
A new DataContainer exactly the same as this one, but with updated history and filtered outputs (except if exactly=1, in which case it will return a DataItem only).
- Return type:
DataContainer or DataItem
- eq(key, value, minimum=None, maximum=None, exactly=None, tolerance=0)
Return a decimated version of this DataContainer containing only the rows where the “key” column equals the value in the “value” parameter.
- Parameters:
key (str) – Name of column to filter on
value (Varies depending on contents of "key" column) – Value to compare against
minimum (int) – Minimum number of records to return. Will raise an exception if too few records match
maximum (int) – Maximum number of records to return. Will raise an exception if too many records match
exactly (int) – Exact number of records to return. Will raise an exception if any other number of records match
- Returns:
A new DataContainer exactly the same as this one, but with updated history and filtered outputs (except if exactly=1, in which case it will return a DataItem only).
- Return type:
DataContainer or DataItem
- file_contents_as_string(*args, **kwargs)
Placeholder for breaking tests. One day at a time here…
- gt(key, value, minimum=None, maximum=None, exactly=None)
Return a decimated version of this DataContainer containing only the rows where the “key” column is greater than the value in the “value” parameter.
- Parameters:
key (str) – Name of column to filter on
value (Varies depending on contents of "key" column) – Value to compare against
minimum (int) – Minimum number of records to return. Will raise an exception if too few records match
maximum (int) – Maximum number of records to return. Will raise an exception if too many records match
exactly (int) – Exact number of records to return. Will raise an exception if any other number of records match
- Returns:
A new DataContainer exactly the same as this one, but with updated history and filtered outputs (except if exactly=1, in which case it will return a DataItem only).
- Return type:
DataContainer or DataItem
- gte(key, value, minimum=None, maximum=None, exactly=None)
Return a decimated version of this DataContainer containing only the rows where the “key” column is greater than or equal to the value in the “value” parameter.
- Parameters:
key (str) – Name of column to filter on
value (Varies depending on contents of "key" column) – Value to compare against
minimum (int) – Minimum number of records to return. Will raise an exception if too few records match
maximum (int) – Maximum number of records to return. Will raise an exception if too many records match
exactly (int) – Exact number of records to return. Will raise an exception if any other number of records match
- Returns:
A new DataContainer exactly the same as this one, but with updated history and filtered outputs (except if exactly=1, in which case it will return a DataItem only).
- Return type:
DataContainer or DataItem
- inject_error(lamb)
Iterates through records and applies a transformation lambda. Useful for error injection or data simulation.
- Parameters:
lamb – A function that accepts a record and returns (bool, key, value).
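One plausible reading of the (bool, key, value) contract is "if the flag is true, overwrite that key with that value". The sketch below implements this interpretation on plain dictionaries; the interpretation itself, and the decision to return copies rather than mutate in place, are assumptions rather than documented behavior.

```python
def inject_error(rows, lamb):
    """Apply ``lamb`` to each row; overwrite row[key] when it returns True."""
    out = []
    for row in rows:
        should_apply, key, value = lamb(row)
        new_row = dict(row)  # shallow copy so the input rows stay untouched
        if should_apply:
            new_row[key] = value
        out.append(new_row)
    return out


rows = [{"id": 1, "volts": 5.0}, {"id": 2, "volts": 5.1}]
# Corrupt the reading of the row with id 2 only.
corrupted = inject_error(rows, lambda r: (r["id"] == 2, "volts", -999.0))
```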
- insert(index, record)
Inserts a record at the specified index and returns a new container instance.
- isin(key, values, minimum=None, maximum=None, exactly=None)
Return a decimated version of this DataContainer containing only the rows where the “key” column matches any of the values in the list “values”.
- Parameters:
key (str) – Name of column to filter on
values (list) – Values to compare against
minimum (int) – Minimum number of records to return. Will raise an exception if too few records match
maximum (int) – Maximum number of records to return. Will raise an exception if too many records match
exactly (int) – Exact number of records to return. Will raise an exception if any other number of records match
- Returns:
A new DataContainer exactly the same as this one, but with updated history and filtered outputs (except if exactly=1, in which case it will return a DataItem only).
- Return type:
DataContainer or DataItem
- lt(key, value, minimum=None, maximum=None, exactly=None)
Return a decimated version of this DataContainer containing only the rows where the “key” column is less than the value in the “value” parameter.
- Parameters:
key (str) – Name of column to filter on
value (Varies depending on contents of "key" column) – Value to compare against
minimum (int) – Minimum number of records to return. Will raise an exception if too few records match
maximum (int) – Maximum number of records to return. Will raise an exception if too many records match
exactly (int) – Exact number of records to return. Will raise an exception if any other number of records match
- Returns:
A new DataContainer exactly the same as this one, but with updated history and filtered outputs (except if exactly=1, in which case it will return a DataItem only).
- Return type:
DataContainer or DataItem
- lte(key, value, minimum=None, maximum=None, exactly=None)
Return a decimated version of this DataContainer containing only the rows where the “key” column is less than or equal to the value in the “value” parameter.
- Parameters:
key (str) – Name of column to filter on
value (Varies depending on contents of "key" column) – Value to compare against
minimum (int) – Minimum number of records to return. Will raise an exception if too few records match
maximum (int) – Maximum number of records to return. Will raise an exception if too many records match
exactly (int) – Exact number of records to return. Will raise an exception if any other number of records match
- Returns:
A new DataContainer exactly the same as this one, but with updated history and filtered outputs (except if exactly=1, in which case it will return a DataItem only).
- Return type:
DataContainer or DataItem
- matches(key, pattern, minimum=None, maximum=None, exactly=None)
Return a decimated version of this DataContainer containing only the rows where the “key” column matches the regex in the “pattern” parameter.
- Parameters:
key (str) – Name of column to match against
pattern (r-string) – Regex pattern
minimum (int) – Minimum number of records to return. Will raise an exception if too few records match
maximum (int) – Maximum number of records to return. Will raise an exception if too many records match
exactly (int) – Exact number of records to return. Will raise an exception if any other number of records match
- Returns:
A new DataContainer exactly the same as this one, but with updated history and filtered outputs (except if exactly=1, in which case it will return a DataItem only).
- Return type:
DataContainer or DataItem
- ne(key, value, minimum=None, maximum=None, exactly=None)
Return a decimated version of this DataContainer containing only the rows where the “key” column does not equal the value in the “value” parameter.
- Parameters:
key (str) – Name of column to filter on
value (Varies depending on contents of "key" column) – Value to compare against
minimum (int) – Minimum number of records to return. Will raise an exception if too few records match
maximum (int) – Maximum number of records to return. Will raise an exception if too many records match
exactly (int) – Exact number of records to return. Will raise an exception if any other number of records match
- Returns:
A new DataContainer exactly the same as this one, but with updated history and filtered outputs (except if exactly=1, in which case it will return a DataItem only).
- Return type:
DataContainer or DataItem
- notin(key, values, minimum=None, maximum=None, exactly=None)
Return a decimated version of this DataContainer containing only the rows where the “key” column does not match any value in the list “values”.
- Parameters:
key (str) – Name of column to filter on
values (list) – Values to compare against
minimum (int) – Minimum number of records to return. Will raise an exception if too few records match
maximum (int) – Maximum number of records to return. Will raise an exception if too many records match
exactly (int) – Exact number of records to return. Will raise an exception if any other number of records match
- Returns:
A new DataContainer exactly the same as this one, but with updated history and filtered outputs (except if exactly=1, in which case it will return a DataItem only).
- Return type:
DataContainer or DataItem
- on_change(key, minimum=None, maximum=None, exactly=None)
Return a decimated version of this DataContainer containing only the rows where the “key” column differs from the row before. The first row is always included.
- Parameters:
key (str) – Name of column to inspect for changes
minimum (int) – Minimum number of records to return. Will raise an exception if too few records match
maximum (int) – Maximum number of records to return. Will raise an exception if too many records match
exactly (int) – Exact number of records to return. Will raise an exception if any other number of records match
- Returns:
A new DataContainer exactly the same as this one, but with updated history and filtered outputs (except if exactly=1, in which case it will return a DataItem only).
- Return type:
DataContainer or DataItem
- power_table(superheader=None, columns=None, bypass_styles=False, row_styles=None, cell_styles=None, **kwargs)
Produce a rich, interactive HTML table representation of this DataContainer.
Concept: This method integrates with html_utils to translate the 2D records into a PowerTable. It handles complex nesting by recursively calling power_table on any subcontainers linked to specific rows.
- Parameters:
superheader (str) – Title row spanning the full width of the table.
columns (list[str]) – Labels to include. Defaults to self.repr_cols.
bypass_styles (bool) – If True, default CSS and row-level styles are ignored.
row_styles (list[dict[str, str]]) – Custom CSS for each row. Must match self.records length.
cell_styles (list[list[dict[str, str]]]) – Custom CSS for each cell. Must match self.records length.
kwargs – Passthrough arguments for PowerTable (e.g., id, add_filters).
- Returns:
A rendered PowerTable component.
- read_csv(csv_path)
Reads a CSV file into a list of record dictionaries.
- read_xlsx(xlsx_path)
Reads an Excel file into a list of record dictionaries, handling NaN values as None.
- property repr_cols
Columns to be used for representations (terminal, HTML, etc.).
- simple_record_table(*args, **kwargs)
Placeholder for breaking tests. One day at a time here…
- sort(by=None, lam=None, reverse=False)
Return a version of the DataContainer with rows sorted by the column named in the “by” kwarg or by the lambda function in the “lam” kwarg.
Always sorts by ascending (for now, see https://github.jpl.nasa.gov/teamtools-studio/data_utils/issues/33)
- Parameters:
by (str) – Name of column to sort by
lam (lambda) – Lambda to control how values are sorted
reverse (bool) – By default, sorts like Python list sort. This works the same as reverse kwarg on default list sort
- Returns:
A new DataContainer exactly the same as this one, but with updated history and sorted outputs.
- Return type:
DataContainer
- property source
Returns a list of raw source dictionaries for all contained records.
- stamp_all(dispo_choice, dispo_format)
Iterates through all data and applies a disposition stamp.
- classmethod subdivide(sub_data, bypass_validation=False)
Class method to create a new ‘sub-container’ instance.
- subdivide_f(sub_f)
Returns a subdivided container based on a filter function.
- summarize(key, expected_values=None, include_times=True)
Generates a summary table counting occurrences and time ranges for unique values in a specific column.
The Concept: This method transforms the current data into a frequency report. If expected_values are provided, it validates the data against them and ensures the output table follows the user’s preferred ordering, while still appending any unexpected “rogue” values at the end of the list.
- Parameters:
key (str) – The column name to summarize.
expected_values (list, optional) – Optional list of values to check for and order by.
include_times (bool) – If True, adds “First Occurrence” and “Last Occurrence” columns.
- Returns:
A GenericContainer containing the summary records.
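The ordering rule (expected values first, in the caller's order, then any unexpected values) can be sketched with collections.Counter. This example works on plain dictionaries, omits the time columns, and uses hypothetical output column names "Value" and "Count".

```python
from collections import Counter


def summarize(rows, key, expected_values=None):
    """Count occurrences of each value in ``key``, expected values first."""
    counts = Counter(row[key] for row in rows)
    expected = list(expected_values or [])
    # "Rogue" values are those seen in the data but not expected;
    # Counter preserves first-seen order.
    rogue = [value for value in counts if value not in expected]
    return [{"Value": value, "Count": counts.get(value, 0)}
            for value in expected + rogue]


rows = [{"state": "ON"}, {"state": "OFF"}, {"state": "ON"}, {"state": "FAULT"}]
report = summarize(rows, "state", expected_values=["OFF", "ON", "STANDBY"])
```

Expected values that never occur still appear with a count of zero, which makes missing states easy to spot in the report.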
- table(columns=None)
Explicitly prints the ASCII grid table representation to standard output.
Concept: While __repr__ handles automatic display in the terminal, this method allows for programmatic printing with an optional subset of columns.
- Parameters:
columns (list[str], optional) – List of column labels to include. Defaults to self.repr_cols.
- to_csv(csv_path, mkdirs=False)
Writes the container’s records to a CSV file.
- Parameters:
csv_path – Target file path.
mkdirs – If True, creates the target directory if it does not exist.
- unique(key, exclude=[], sort=True)
Returns a list of unique values found in a specific column.
- Parameters:
key (str) – Name of the column to inspect.
exclude (list) – List of values to filter out of the final unique list.
sort (bool) – If True, the resulting list is sorted ascending.
- Returns:
A list of unique values.
- property valid
Returns True if all records pass validation (or if validation is bypassed).
- visual_diff(right, ignore_cols=[], tolerance={})
Generates a side-by-side visual alignment between this container and another.
The Concept: This uses SequenceMatcher to find the best horizontal alignment between two datasets. It identifies identical rows, modified rows (replace), and inserted/deleted rows. It then injects “empty” placeholders into the resulting containers so that matching records stay horizontally synchronized when rendered.
- Parameters:
right (DataContainer) – The DataContainer to compare against.
ignore_cols (list[str]) – Columns to exclude from the row-matching signature.
tolerance (dict[str, float]) – Drift allowance for numeric or datetime columns.
- Returns:
A tuple of two VisualDiffContainers (left, right).
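Assuming the SequenceMatcher mentioned above is the one from Python's difflib, the alignment idea can be sketched as follows. The `align` helper is hypothetical and operates on hashable row signatures (here, short strings) rather than on containers.

```python
from difflib import SequenceMatcher


def align(left, right, placeholder=None):
    """Pad two sequences so that matching items share the same index."""
    matcher = SequenceMatcher(a=left, b=right, autojunk=False)
    out_left, out_right = [], []
    for _tag, i1, i2, j1, j2 in matcher.get_opcodes():
        left_chunk, right_chunk = list(left[i1:i2]), list(right[j1:j2])
        width = max(len(left_chunk), len(right_chunk))
        # Pad the shorter side so both outputs stay the same length.
        out_left.extend(left_chunk + [placeholder] * (width - len(left_chunk)))
        out_right.extend(right_chunk + [placeholder] * (width - len(right_chunk)))
    return out_left, out_right


aligned_left, aligned_right = align(["a", "b", "c"], ["a", "c", "d"])
```

Each opcode ('equal', 'replace', 'delete', 'insert') contributes a chunk of equal width to both sides, so rendering the two outputs side by side keeps matching rows on the same line.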
- with_cols(columns)
Returns a new version of this container with the display columns changed. Will add new (empty) columns if they do not currently exist in the records.
- Parameters:
columns (list[str]) – List of column names to display/return.
- Returns:
A new DataContainer instance with updated column settings.
- tts_data_utils.core.data_container.after(time, time_label, inclusive=False)
Returns a predicate for datetime comparison (later than).
- Parameters:
time (datetime) – Time for comparison
time_label (str) – Label for time column
inclusive (bool) – Should we include a time that is exactly equal? Defaults to False
- tts_data_utils.core.data_container.before(time, time_label, inclusive=False)
Returns a predicate for datetime comparison (earlier than).
- Parameters:
time (datetime) – Time for comparison
time_label (str) – Label for time column
inclusive (bool) – Should we include a time that is exactly equal? Defaults to False
- tts_data_utils.core.data_container.between(key, lower, upper, inclusive='both')
Returns a predicate for range comparison.
- Parameters:
key (str) – column to filter on
lower – Lower value for range comparison
upper – Upper value for range comparison
inclusive – One of ‘both’, ‘neither’, ‘lower’, or ‘upper’.
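The module-level functions in this section all return predicates, i.e. callables that take a row and return a boolean. A minimal sketch of the range predicate with its four inclusive modes, using dictionaries as rows (the error type for an invalid mode is an assumption):

```python
import operator


def between(key, lower, upper, inclusive="both"):
    """Return a predicate testing whether row[key] lies in a range."""
    if inclusive not in ("both", "neither", "lower", "upper"):
        raise ValueError(f"unknown inclusive mode: {inclusive!r}")
    # Pick >= or > for the lower bound and <= or < for the upper bound.
    lower_ok = operator.ge if inclusive in ("both", "lower") else operator.gt
    upper_ok = operator.le if inclusive in ("both", "upper") else operator.lt

    def predicate(row):
        value = row[key]
        return lower_ok(value, lower) and upper_ok(value, upper)

    return predicate


rows = [{"x": 1}, {"x": 2}, {"x": 3}]
inner = [r["x"] for r in rows if between("x", 1, 3, inclusive="neither")(r)]
```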
- tts_data_utils.core.data_container.contains(key, substring, case_sensitive=True)
Returns a predicate for substring matching.
- Parameters:
key (str) – column to filter on
substring (str) – Substring to check values for
case_sensitive (bool) – whether the check is case sensitive. Defaults to True
- tts_data_utils.core.data_container.doesnotcontain(key, substring, case_sensitive=True)
Returns a predicate for negative substring matching.
- Parameters:
key (str) – column to filter on
substring (str) – Substring to check values for
case_sensitive (bool) – whether the check is case sensitive. Defaults to True
- tts_data_utils.core.data_container.doesnotmatch(key, pattern)
Returns a predicate for negative regex matching.
- Parameters:
key (str) – column to filter on
pattern (str) – regex pattern to match with
- tts_data_utils.core.data_container.eq(key, value, tolerance=0)
Returns a predicate for: field == value.
- Parameters:
key (str, int, float, datetime) – column to filter on
value (any) – value to check against
- tts_data_utils.core.data_container.find_bad_utf8_characters(filepath)
Helper function to identify non-UTF-8 characters in CSV files, common when interacting with Windows-generated Microsoft Excel files.
The Problem: When CSVs are saved via Excel on Windows, non-UTF-8 characters—like curled quotation marks, single-character arrows, and degree symbols—are often added. Attempting to read these into Pandas causes a UnicodeDecodeError.
The Solution: This script allows developers to catch that exception, read the file in binary mode, and report the exact line and byte offset of the first error to facilitate cleaning.
Future Improvements:
* Amend to repair the file automatically (referencing the M20 dictionary input management logic).
* Report all encoding errors instead of just the first.
See ticket #31 (TO DO: Migrate out of JPL-internal issues)
- Parameters:
filepath (str or pathlib.Path) – Path to the CSV file to be checked.
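The approach described under "The Solution" can be sketched with the standard library: read the file in binary mode, decode line by line, and use the attributes of UnicodeDecodeError to locate the offending byte. The function name and the shape of the return value are assumptions; the real helper may report its findings differently.

```python
import os
import tempfile


def find_first_bad_utf8(filepath):
    """Return (line number, byte offset in line, bad bytes) or None."""
    with open(filepath, "rb") as handle:
        for line_number, raw_line in enumerate(handle, start=1):
            try:
                raw_line.decode("utf-8")
            except UnicodeDecodeError as exc:
                # exc.start / exc.end delimit the undecodable bytes.
                return line_number, exc.start, raw_line[exc.start:exc.end]
    return None


# Demo: 0x93 and 0x94 are Windows-1252 curly quotes, invalid as UTF-8.
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write(b"name,note\nvalve,\x93open\x94\n")
result = find_first_bad_utf8(tmp.name)
os.remove(tmp.name)
```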
- tts_data_utils.core.data_container.gt(key, value)
Returns a predicate for: field > value.
- Parameters:
key (str) – column to filter on
value (int or float) – value to check against
- tts_data_utils.core.data_container.gte(key, value)
Returns a predicate for: field >= value.
- Parameters:
key (str) – column to filter on
value (int or float) – value to check against
- tts_data_utils.core.data_container.isin(key, values)
Returns a predicate for: field in list_of_values.
- Parameters:
key (str) – column to filter on
values (list) – values to check against
- tts_data_utils.core.data_container.lt(key, value)
Returns a predicate for: field < value.
- Parameters:
key (str) – column to filter on
value (int or float) – value to check against
- tts_data_utils.core.data_container.lte(key, value)
Returns a predicate for: field <= value.
- Parameters:
key (str) – column to filter on
value (int or float) – value to check against
- tts_data_utils.core.data_container.matches(key, pattern)
Returns a predicate for regex matching.
- Parameters:
key (str) – column to filter on
pattern (str) – regex pattern to match with
- tts_data_utils.core.data_container.ne(key, value)
Returns a predicate for: field != value.
- Parameters:
key (str) – column to filter on
value (any) – value to check against
- tts_data_utils.core.data_container.notin(key, values)
Returns a predicate for: field not in list_of_values.
- Parameters:
key (str) – column to filter on
values (list) – values to check against
- tts_data_utils.core.data_container.on_change(key)
Returns a stateful predicate that triggers when the value in a column changes relative to the previous record.
- Parameters:
key (str) – Column to check for changes in
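The predicate helpers in this module all follow one pattern: a factory takes a column and a comparison value and returns a callable that is applied to each record. The sketch below illustrates that pattern with stand-in functions. The `_pred` names are hypothetical, records are assumed to be dict-like, the regex helper assumes search (not full-match) semantics, and the stateful helper assumes the first record counts as a change. None of these details are confirmed by the documentation above.

```python
import re


def gt_pred(key, value):
    """Stand-in predicate factory for: record[key] > value."""
    return lambda record: record[key] > value


def isin_pred(key, values):
    """Stand-in predicate factory for: record[key] in values."""
    return lambda record: record[key] in values


def matches_pred(key, pattern):
    """Stand-in predicate factory for a regex search on record[key]."""
    compiled = re.compile(pattern)
    return lambda record: compiled.search(str(record[key])) is not None


def on_change_pred(key):
    """Stateful predicate: True when record[key] differs from the previous record."""
    unset = object()  # marker meaning "no record seen yet"
    previous = unset

    def predicate(record):
        nonlocal previous
        current = record[key]
        # Assumption: the very first record is reported as a change.
        changed = previous is unset or current != previous
        previous = current
        return changed

    return predicate


rows = [
    {"mode": "IDLE", "temp": 10},
    {"mode": "IDLE", "temp": 25},
    {"mode": "SCAN", "temp": 30},
    {"mode": "SCAN", "temp": 5},
]
hot = list(filter(gt_pred("temp", 20), rows))
transitions = list(filter(on_change_pred("mode"), rows))
```

Because the stateful predicate remembers the last value it saw, a fresh one should be created for each pass over the data.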
tts_data_utils.core.data_item
- class tts_data_utils.core.data_item.DataItem(source, subcontainers=None, copy_data=False, cast_fields=False, fill=False, is_django=False, validate=True, default_dispo=None)
Bases:
ABC
The fundamental atomic unit of a DataContainer, representing a single row.
The Concept: A DataItem acts as a “Smart Dictionary” with a memory. It separates data into two layers: Source and Derived.
Think of it like a piece of trace paper over an original document. The source (the document) remains untouched for auditability. Any edits, calculations, or augmentations are written on the derived_values (the trace paper). When you ask for a value, the DataItem looks at the trace paper first; if nothing is there, it reads from the original document.
Traceability & Integrity: This architecture ensures that the original raw data is never destroyed or altered by processing logic, which is critical for engineering applications requiring data lineage.
- Parameters:
source (dict) – Dictionary of raw data for the row.
subcontainers (dict[str, DataContainer], optional) – Nested containers attached to this row.
copy_data (bool) – If True, deep-copies the source data to prevent mutation.
cast_fields (bool) – If True, attempts to force fields into canonical datatypes.
fill (bool) – If True, adds None for missing columns defined in DICT_VALID_KEYS.
is_django (bool) – Is the source coming from a Django model?
validate (bool) – Should we raise an exception if fields are the wrong type?
default_dispo (str) – Default disposition object (Dexter only).
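The two-layer lookup described above can be sketched with a small stand-in class. `LayeredRow` is hypothetical and is not the library's DataItem; it only borrows the names `source`, `derived_values`, `copy_data`, and `values` from this page to show how reads fall through from the derived layer to the untouched source.

```python
import copy


class LayeredRow:
    """Hypothetical stand-in for the source/derived split (not the real DataItem)."""

    def __init__(self, source, copy_data=False):
        # Deep-copy on request so later mutation of the caller's dict cannot leak in.
        self.source = copy.deepcopy(source) if copy_data else source
        self.derived_values = {}

    def __getitem__(self, key):
        # Look at the "trace paper" first, then fall back to the original document.
        if key in self.derived_values:
            return self.derived_values[key]
        return self.source[key]

    def __setitem__(self, key, value):
        # Writes never touch the source layer.
        self.derived_values[key] = value

    @property
    def values(self):
        # Merged view: derived entries override source entries.
        return {**self.source, **self.derived_values}


raw = {"Voltage": "3.3", "Channel": "A"}
row = LayeredRow(raw, copy_data=True)
row["Voltage"] = 3.3   # overrides the raw string in the derived layer only
row["Power"] = 0.5     # a new, derived column
```

After these writes, `row["Voltage"]` returns the derived float while `row.source["Voltage"]` still holds the original string, which is the lineage property the paragraph above describes.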
- DEFAULT_DISPO = None
- DO_NOT_DIFF = []
Fields to ignore in diffs
- FLOAT_FORMAT = {}
- NAME = 'data item'
Name of the data item to be printed in some situations
- TIME_FORMAT_PRECISION = {}
- add_dispo(disposition)
Dexter only, and should be broken out into its own extension if possible.
Add a disposition to a dexter object.
TBD. Ask Nick Peper FMI
- Parameters:
disposition (Dexter Disposition) – Disposition for whatever has happened to this DataItem
- any_batches()
Dexter only, and should be broken out into its own extension if possible.
TBD. Ask Nick Peper FMI
- choose_and_stamp(dispo_choice, dispo_format)
Dexter only, and should be broken out into its own extension if possible.
Does the same as choose_dispo(), but also stamps the original DataItem with either a string or a list of dispositions.
- Parameters:
dispo_choice (DISPO_CHOICE) – FIRST, LAST, ALL
dispo_format (DISPO_FORMAT) – HTML, EXCEL, TEXT
- choose_dispo(dispo_choice)
Dexter only, and should be broken out into its own extension if possible.
In Dexter, any DataItem can receive many dispositions. This method chooses which to present to the user.
- Parameters:
dispo_choice (DISPO_CHOICE) – How would you like to roll up dispositions? FIRST, LAST, ALL?
- property default_html_cell_styles
A mapping of keys to CSS styles for individual HTML cells.
- property default_html_row_style
Default CSS styles for an HTML table row.
- property default_rich_text_row_style
Default styles for rich-text terminal output.
- classmethod empty(keys=[])
Returns an instance with all columns set to empty strings or None.
- in_batch(batch)
Dexter only, and should be broken out into its own extension if possible.
TBD. Ask Nick Peper FMI
- Parameters:
batch (DataUtils Batch) – Pointer to the data on which a disposition should be checked
- new_dispo()
Dexter only, and should be broken out into its own extension if possible.
Add a disposition to a dexter object. Like add_dispo, but just adds an empty disposition.
TBD. Ask Nick Peper FMI
- property printable_values
Same as values(), but with formatting applied:
* TIME_FORMATS used to convert datetimes to strings
* FLOAT_FORMAT used to format float values (if defined)
- row_signature(ignore_cols=[])
Generates a hashable representation of the row.
Concept: Used to determine if two items are effectively identical. It recursively converts mutable types (lists, dicts) into immutable tuples to ensure the signature can be hashed.
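One way to build such a signature is to freeze nested values recursively and then collect the frozen key/value pairs into a tuple. The sketch below assumes rows are plain dicts with string keys; `freeze` and `row_signature_sketch` are hypothetical names, and the library's actual output format may differ.

```python
def freeze(value):
    """Recursively convert mutable containers into hashable equivalents."""
    if isinstance(value, dict):
        # Sort by key so that insertion order does not affect the signature.
        return tuple(sorted(((k, freeze(v)) for k, v in value.items()),
                            key=lambda pair: str(pair[0])))
    if isinstance(value, (list, tuple)):
        return tuple(freeze(v) for v in value)
    if isinstance(value, set):
        return frozenset(freeze(v) for v in value)
    return value


def row_signature_sketch(row, ignore_cols=()):
    """Hashable signature of a dict-like row, skipping any ignored columns."""
    kept = {k: v for k, v in row.items() if k not in ignore_cols}
    return freeze(kept)


a = {"id": 1, "tags": ["x", "y"], "meta": {"b": 2, "a": [1, 2]}}
b = {"meta": {"a": [1, 2], "b": 2}, "tags": ["x", "y"], "id": 1, "note": "ignored"}
same = row_signature_sketch(a) == row_signature_sketch(b, ignore_cols=["note"])
```

Since the result is hashable, signatures can be placed in a set or used as dict keys to detect duplicate rows.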
- stamp(dispo_value)
Dexter only, and should be broken out into its own extension if possible.
Adds a disposition value to source. Not used in isolation, and should probably be made private in a future version.
- tag_with_batch(batch)
Dexter only, and should be broken out into its own extension if possible.
TBD. Ask Nick Peper FMI
- Parameters:
batch (DataUtils Batch) – Pointer to the data on which a disposition should be checked
- abstract property time
Must define some way to time-tag each data item.
- property valid
Runs self.validate, but returns a simple bool instead of a list of invalid records.
- validate()
Compares values in each field to those provided in DICT_VALID_KEYS, which must be provided in each extension of DataItem.
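The DICT_VALID_KEYS listings on this page are lists of (key, type) pairs in which the type may be a single class or a tuple that can include None. A type check against that shape might look like the sketch below. `find_invalid_fields` is a hypothetical name, treating a missing key as invalid is an assumption, and the real method may report problems differently.

```python
def normalize_types(type_spec):
    """Turn a type or tuple of types into a tuple usable with isinstance.

    A literal None in the spec is mapped to NoneType, because isinstance
    raises TypeError when given None as a class.
    """
    specs = type_spec if isinstance(type_spec, tuple) else (type_spec,)
    return tuple(type(None) if t is None else t for t in specs)


def find_invalid_fields(record, valid_keys):
    """Return a list of (key, problem) pairs; an empty list means the record passes."""
    problems = []
    for key, type_spec in valid_keys:
        if key not in record:
            problems.append((key, "missing"))  # assumption: absent keys are invalid
        elif not isinstance(record[key], normalize_types(type_spec)):
            problems.append((key, type(record[key]).__name__))
    return problems


# Same shape as the DiffItem DICT_VALID_KEYS shown on this page.
SPEC = [
    ("Key", str),
    ("Same", bool),
    ("Type", str),
    ("Left", (int, float, str, None)),
    ("Right", (int, float, str, None)),
]
good = {"Key": "gain", "Same": False, "Type": "float", "Left": 1.5, "Right": None}
bad = {"Key": "gain", "Same": "no", "Type": "float", "Left": [1.5]}
```

A boolean convenience check in the spirit of the `valid` property would then be `not find_invalid_fields(record, SPEC)`.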
- property values
Property to return all values, be they from the original source, newly added, or altered.
tts_data_utils.core.diff
- class tts_data_utils.core.diff.DiffContainer(name, metadata, csv_path=None)
Bases:
DataContainer
The results object generated by a DataContainer.diff() operation.
The Concept: A DiffContainer is a specialized collection that holds DiffItems. Its primary purpose is to act as a “Filterable Report.” By default, when you print or view this container, it filters itself to show only the differences (Same == False), allowing you to ignore identical data and focus on what changed.
Usage Note: You generally do not instantiate this class manually. It is returned automatically when you call .diff() on any standard DataContainer.
- Parameters:
name (str) – The identifier for this diff report.
metadata (dict) – Contextual information about the two containers being compared.
csv_path (str, optional) – Optional path to load/save the diff results.
- DO_NOT_DIFF = ['left', 'right']
Keys to ignore when running self.diff.
- NAME = 'diff'
- is_same()
Filters the report to show only matching records.
Concept: Useful for “sanity checking” to confirm which values were verified as identical.
- left(key)
Retrieves the baseline value for a specific key.
- Parameters:
key (str) – The field name to look up.
- Returns:
The value from the ‘Left’ side of the comparison.
- not_same()
Filters the report to show only mismatched records.
Concept: This is the primary troubleshooting view, isolating only the discrepancies between datasets.
- property repr_cols
Returns the standard list of columns for display.
- right(key)
Retrieves the comparison value for a specific key.
- Parameters:
key (str) – The field name to look up.
- Returns:
The value from the ‘Right’ side of the comparison.
- class tts_data_utils.core.diff.DiffItem(source, subcontainers=None, copy_data=False, cast_fields=False, fill=False, is_django=False, validate=True, default_dispo=None)
Bases:
DataItem
An individual record representing a field-level comparison between two objects.
The Concept: Think of a DiffItem as a single row in a “Spot the Difference” report. It doesn’t just store a value; it stores the value from the “Left” side, the value from the “Right” side, and a pre-calculated verdict on whether they are identical.
How it works: When two DataContainers are compared, each shared key is packed into a DiffItem. If the data types differ between the sides (e.g., comparing a string to an integer), the Type field is automatically set to “various” to alert the user of a schema mismatch.
- Parameters:
Key (str) – Name of the attribute or field being compared.
Same (bool) – A boolean flag indicating if Left and Right values match.
Type (str) – The Python data type of the values (or “various”).
Left (int | float | str | None) – The value found in the original (baseline) container.
Right (int | float | str | None) – The value found in the comparison (target) container.
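To make the row layout concrete, the sketch below builds Key/Same/Type/Left/Right rows from two plain dicts and then filters them down to the mismatches, much like the default diff view. `diff_fields` is a hypothetical function working on dicts, not the library's `.diff()` method, and the alphabetical key ordering is a choice made for this example.

```python
def diff_fields(left, right):
    """Compare the keys shared by two dicts and return one row per key."""
    rows = []
    for key in sorted(left.keys() & right.keys()):
        lval, rval = left[key], right[key]
        same_type = type(lval) is type(rval)
        rows.append({
            "Key": key,
            "Same": lval == rval,
            # A type mismatch between the two sides is flagged as "various".
            "Type": type(lval).__name__ if same_type else "various",
            "Left": lval,
            "Right": rval,
        })
    return rows


baseline = {"id": 7, "gain": 1.5, "mode": "A"}
target = {"id": "7", "gain": 1.5, "mode": "B", "extra": True}

report = diff_fields(baseline, target)
mismatches = [row for row in report if not row["Same"]]
```

Here `id` is reported with Type "various" because an integer is being compared to a string, while `extra` is skipped because it exists on only one side.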
- DICT_VALID_KEYS = [('Key', <class 'str'>), ('Same', <class 'bool'>), ('Type', <class 'str'>), ('Left', (<class 'int'>, <class 'float'>, <class 'str'>, None)), ('Right', (<class 'int'>, <class 'float'>, <class 'str'>, None))]
- DO_NOT_DIFF = {'source': ['left', 'right']}
Fields to ignore in diffs
- TIME_FORMATS = {}
- time()
[Disabled]
DiffItems represent logical differences rather than temporal events, so time parsing is disabled by default.
tts_data_utils.core.generic
- class tts_data_utils.core.generic.GenericContainer(raw_data=None, name=None, **kwargs)
Bases:
DataContainer
A container for hooking data into the DataContainer infrastructure when no extended class exists for it.
Usually this class shouldn’t be used in production, but is nice to have in development, especially when prototyping behavior for data types for which you are not certain you will want to eventually define an extension.
Some behavior will be missing without data validation defined, and no default styles for html or text repr will be available.
For parameters, see DataContainer
- DATA_ITEM_CLS
alias of
GenericItem
- NAME = 'Generic Items'
- property default_time_label
Returns the primary key used for time-based operations.
- property repr_cols
Columns to be used for representations (terminal, HTML, etc.).
- class tts_data_utils.core.generic.GenericItem(source, subcontainers=None, copy_data=False, cast_fields=False, fill=False, is_django=False, validate=True, default_dispo=None)
Bases:
DataItem
DataItem to go with GenericContainer
No validation requirements on these items.
- DICT_VALID_KEYS = []
- NAME = 'Generic Data Item'
Name of the data item to be printed in some situations
- TIME_FORMATS = {}
- property default_html_row_style
Default CSS styles for an HTML table row.
- property time
Must define some way to time-tag each data item.
- property time_str
tts_data_utils.core.lorem_utils
Utilities for generating lorem ipsum dummy data for DataContainer instances.
This module provides functions to generate realistic dummy data based on the structure defined in a DataItem class.
- tts_data_utils.core.lorem_utils.generate_lorem_data_for_item(data_item_cls, num_records=10)
Generate lorem ipsum data for a specific DataItem class.
- Args:
data_item_cls: The DataItem class to generate data for
num_records: Number of records to generate
- Returns:
List of dictionaries with dummy data matching the DataItem structure
- tts_data_utils.core.lorem_utils.generate_lorem_text(min_words=3, max_words=10)
Generate lorem ipsum text with a random number of words.
- tts_data_utils.core.lorem_utils.generate_smart_value(key, type_spec)
Generate a value for a specific key, using naming conventions to create more realistic data.
- Args:
key: The field name
type_spec: The type specification
- Returns:
A value appropriate for the field name and type
- tts_data_utils.core.lorem_utils.generate_value_for_type(type_spec)
Generate a random value that matches the given type specification.
- Args:
type_spec: A type or tuple of types (from DICT_VALID_KEYS)
- Returns:
A random value of the appropriate type
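A type-driven generator of this kind can be sketched as below. The names `random_value` and `fake_records`, the word list, and the numeric ranges are all assumptions made for illustration; only the idea of picking a value that satisfies a type or tuple of types from DICT_VALID_KEYS comes from the descriptions above.

```python
import random

LOREM_WORDS = "lorem ipsum dolor sit amet consectetur adipiscing elit".split()


def random_value(type_spec, rng=random):
    """Return a random value matching a type or a tuple of types (sketch only)."""
    candidates = type_spec if isinstance(type_spec, tuple) else (type_spec,)
    # None entries mean "may be empty"; pick among the concrete types instead.
    concrete = [t for t in candidates if t is not None]
    if not concrete:
        return None
    chosen = rng.choice(concrete)
    if chosen is bool:
        return rng.random() < 0.5
    if chosen is int:
        return rng.randint(0, 100)
    if chosen is float:
        return rng.uniform(0.0, 100.0)
    if chosen is str:
        return " ".join(rng.choice(LOREM_WORDS) for _ in range(rng.randint(3, 10)))
    if chosen is list:
        return []
    return None  # unknown types are left empty in this sketch


def fake_records(valid_keys, num_records=10, rng=random):
    """Build a list of dicts with one random value per (key, type_spec) pair."""
    return [
        {key: random_value(type_spec, rng) for key, type_spec in valid_keys}
        for _ in range(num_records)
    ]
```

Passing a seeded `random.Random` instance as `rng` makes the generated records repeatable, which is convenient in tests.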
tts_data_utils.core.visual_diff
- class tts_data_utils.core.visual_diff.VisualDiffContainer(raw_data=None, name=None, **kwargs)
Bases:
DataContainer
A collection of VisualDiffItems, representing a full comparison report.
Concept: The VisualDiffContainer acts as the manager for a set of diffed records. It distinguishes between “Display” columns (what the user needs to see) and “Metadata” columns (the internal flags used to calculate differences).
Usage Note: By default, any column starting with an underscore (_) is treated as internal metadata and is hidden from the standard repr_cols view, though it remains accessible for CSV exports and logic processing.
- Parameters:
raw_data (list[dict]) – A list of dictionaries representing the diffed rows.
name (str, optional) – The name of the container/report. Defaults to ‘Generic Container’.
kwargs – Additional arguments passed to the parent DataContainer.
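The underscore convention from the usage note can be illustrated with a short helper that derives the display columns from a list of row dicts. `display_columns` is a hypothetical name, and the real repr_cols property may select or order columns differently.

```python
def display_columns(rows):
    """Return column names in first-seen order, hiding underscore-prefixed metadata."""
    seen = {}
    for row in rows:
        for key in row:
            seen.setdefault(key, None)  # dicts preserve insertion order
    return [key for key in seen if not key.startswith("_")]


rows = [
    {"Name": "gain", "Value": 2, "_visdiff_match": "replace", "_mismatched_keys": ["Value"]},
    {"Name": "mode", "Value": "A", "Units": "", "_visdiff_match": "equal", "_mismatched_keys": []},
]
visible = display_columns(rows)
```

The metadata keys stay in the row dicts, so they remain available to export or comparison logic even though they are left out of the visible list.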
- DATA_ITEM_CLS
alias of
VisualDiffItem
- NAME = 'Visual Diff Container'
- property default_time_label
The primary time-based column label used for sorting or indexing.
- property repr_cols
The list of columns intended for visual display (excludes internal metadata).
- class tts_data_utils.core.visual_diff.VisualDiffItem(source, subcontainers=None, copy_data=False, cast_fields=False, fill=False, is_django=False, validate=True, default_dispo=None)
Bases:
DataItem
A specialized data record designed for side-by-side comparison.
The Concept: When comparing two versions of a dataset (a “Diff”), we don’t just care about the raw values—we care about the state of each row relative to its counterpart. Is it new? Was it deleted? Was it modified?
A VisualDiffItem carries the data values along with metadata (prefixed with _visdiff) that tells a renderer exactly how to style the row (e.g., green for an insert, red for a delete) to make differences immediately obvious to a human reviewer.
Styling Logic:
* Row Level: The entire background color is determined by the _visdiff_match status.
* Cell Level: Individual cells are highlighted if their specific key exists in the _mismatched_keys list.
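The row-level and cell-level rules can be sketched as a small HTML renderer. The three row styles below are copied from the MATCH_STATUS_COLORS listing on this page; the cell highlight style, the function names, and the exact HTML produced are assumptions made for this example and are not taken from the library.

```python
import html

# Subset of the documented MATCH_STATUS_COLORS values.
ROW_STYLES = {
    "insert": {"background-color": "#95FB95", "color": "#333333"},
    "delete": {"background-color": "#FF614A", "color": "#333333"},
    "replace": {"background-color": "#ADC4DF", "color": "#333333"},
}
CELL_HIGHLIGHT = {"font-weight": "bold"}  # hypothetical highlight style


def css(style):
    """Serialize a style dict into an inline CSS string."""
    return "; ".join(f"{prop}: {value}" for prop, value in style.items())


def style_attr(style):
    """Return a style attribute, or an empty string when there is nothing to apply."""
    return f' style="{css(style)}"' if style else ""


def render_row(item):
    """Render one diff row: row style from the match status, cell style per mismatched key."""
    row_style = ROW_STYLES.get(item.get("_visdiff_match"), {})
    mismatched = set(item.get("_mismatched_keys") or [])
    cells = []
    for key, value in item.items():
        if key.startswith("_"):
            continue  # internal metadata is not displayed
        cell_style = CELL_HIGHLIGHT if key in mismatched else {}
        cells.append(f"<td{style_attr(cell_style)}>{html.escape(str(value))}</td>")
    return f"<tr{style_attr(row_style)}>{''.join(cells)}</tr>"


item = {"Name": "gain", "Value": 2,
        "_visdiff_match": "replace", "_mismatched_keys": ["Value"]}
html_row = render_row(item)
```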
- DICT_VALID_KEYS = [('_visdiff_index', (<class 'int'>, None)), ('_visdiff_match', (<class 'str'>, None)), ('_mismatched_keys', <class 'list'>)]
- MATCH_STATUS_COLORS = {'delete': {'background-color': '#FF614A', 'color': '#333333'}, 'empty_from_delete': {'background-color': '#BDBEBD', 'color': '#BDBEBD'}, 'empty_from_insert': {'background-color': '#BDBEBD', 'color': '#BDBEBD'}, 'equal': {'background-color': '#FFFFF', 'color': '#333333'}, 'insert': {'background-color': '#95FB95', 'color': '#333333'}, 'replace': {'background-color': '#ADC4DF', 'color': '#333333'}, 'unmatched': {'background-color': '#FF614A', 'color': '#333333'}}
- NAME = 'Generic Data Item'
Name of the data item to be printed in some situations
- PALETTE = {'blue': {'background-color': '#ADC4DF', 'color': '#333333'}, 'error': {'background-color': '#000000', 'color': '#FF614A'}, 'green': {'background-color': '#95FB95', 'color': '#333333'}, 'grey': {'background-color': '#BDBEBD', 'color': '#BDBEBD'}, 'red': {'background-color': '#FF614A', 'color': '#333333'}, 'white': {'background-color': '#FFFFF', 'color': '#333333'}}
- TIME_FORMATS = {'Time': '%Y-%jT%H:%M:%S.%f'}
- property default_html_cell_styles
Returns a mapping of column keys to CSS styles, highlighting specific mismatched cells.
- property default_html_row_style
Returns the CSS style dictionary for the table row based on the match status.
- property time
Returns the native time object for the item.
- property time_str
Returns a formatted string representation of the item’s timestamp.
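The TIME_FORMATS entry listed for VisualDiffItem uses `%j`, the zero-padded day of the year, instead of month and day. The short sketch below shows how that format string round-trips through the standard library; how the class applies it internally is not shown on this page, so treat this only as an illustration of the format itself.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%jT%H:%M:%S.%f"  # year, day-of-year, time with microseconds

stamp = datetime(2024, 2, 1, 13, 5, 9, 250000)
text = stamp.strftime(TIME_FORMAT)            # "2024-032T13:05:09.250000"
parsed = datetime.strptime(text, TIME_FORMAT)
```

February 1 is the 32nd day of the year, so the date portion renders as `2024-032`, and parsing the string with the same format recovers the original datetime.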