Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DBE-3085 added Dockerfile and files, scripts for testing #4

Closed
wants to merge 2 commits into from

Conversation

ChaosHour
Copy link

DBE-3085
Add_functionality_chunk_level

The --stop-at-top-level functionality is primarily implemented in two files:

work/GannettDigital/data-diff/data-check.py

This file adds the command-line argument:

def create_parser():
    parser = argparse.ArgumentParser()
    # ...existing code...
    parser.add_argument(
        "--stop-at-top-level",
        action="store_true",
        help="Stop as soon as any difference is found at the top level (faster but less detailed)",
        default=False,
    )


work/GannettDigital/data-diff/data_diff/hashdiff_tables.py
This file contains the core implementation in the `_diff_segments` method of the `HashDiffer` class:


Made changes.

@attrs.define(frozen=False)
class HashDiffer(TableDiffer):
    # ...existing code...
    stop_at_top_level: bool = False

    def _diff_segments(self, ti: ThreadedYielder, table1: TableSegment, table2: TableSegment, info_tree: InfoTree, max_rows: int, level=0, segment_index=None, segment_count=None):
        # ...existing code...

        if checksum1 == checksum2:
            info_tree.info.is_diff = False
            return

        info_tree.info.is_diff = True

        # If at top level and stop_at_top_level is True, just return
        if level == 0 and self.stop_at_top_level:
            logger.info("Differences found at top level. Stopping as requested.")
            return

        return self._bisect_and_diff_segments(ti, table1, table2, info_tree, level=level, max_rows=max(count1, count2))


The key part is the check if level == 0 and self.stop_at_top_level: 

which prevents further bisection if differences are found at the top level. This makes the diff operation much faster when you only need to know if differences exist, without needing to know what those differences are.

@ChaosHour
Copy link
Author

Will add some example of usage to help with the reviews

Copy link

@kantselovich kantselovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes are good as POC to demonstrate the concept, however

  1. complete implication should involve modifying data-diff code , so it would be possible to use it as

data-diff DB1 TABLE1 DB2 TABLE2 [options]

or

# Optional: Set logging to display the progress of the diff
import logging
logging.basicConfig(level=logging.INFO)

from data_diff import connect_to_table, diff_tables

table1 = connect_to_table("postgresql:///", "table_name", "id")
table2 = connect_to_table("mysql:///", "table_name", "id")

for different_row in diff_tables(table1, table2):
    plus_or_minus, columns = different_row
    print(plus_or_minus, columns)

and the --stop-at-top-level flag should be possible to add as one of the command line options or class parameters, or config file option.

Current implementation involves creating a separate utility data-check.py that adds its own parameters parsing and uses subset of data-diff functionality.

In short, this should be implemented as a change to data-diff utility itself, not as an additional utility.

  1. Testing
    The data-diff code is covered by tests, for example parsing config file options is covered in test_config.py and actual checking of the difference in tables is implemented in test_diff_tables.py . We testes should be extended to cover new functionality that is being added.

PS: caveat is that tests currently do not run on our repo, perhaps we need to work on a PR to add workflow that would enable tests

@ChaosHour
Copy link
Author

ChaosHour commented Dec 17, 2024

Modified the test file to use HashDiffer instead of the non-existent diff_json function.

import pytest
from data_diff.hashdiff_tables import HashDiffer

def test_stop_at_top_level():
    differ = HashDiffer(stop_at_top_level=True)
    # Test data with nested structures
    data1 = [
        {"top1": {"nested": "value1"},
         "top2": {"nested": "value2"}}
    ]
    data2 = [
        {"top1": {"nested": "changed1"},
         "top2": {"nested": "value2"}}
    ]
    
    # Test without stop_at_top_level
    differ.stop_at_top_level = False
    full_diff = differ.diff_tables(data1, data2)
    assert any("nested" in str(d) for d in full_diff)
    
    # Test with stop_at_top_level
    differ.stop_at_top_level = True
    top_level_diff = differ.diff_tables(data1, data2)
    assert any("top1" in str(d) for d in top_level_diff)
    assert not any("nested" in str(d) for d in top_level_diff)

def test_stop_at_top_level_equal_nested():
    differ = HashDiffer(stop_at_top_level=True)
    # Test case where nested values are equal
    data1 = [
        {"top": {"nested": "same"}}
    ]
    data2 = [
        {"top": {"nested": "same"}}
    ]
    
    diff = differ.diff_tables(data1, data2)
    assert len(diff) == 0  # Should show no differences

Key changes made:

  1. Changed import from diff_json to HashDiffer
  2. Modified test data to be lists of dictionaries (as required by HashDiffer)
  3. Used diff_tables method instead of diff_json
  4. Updated assertions to work with HashDiffer's output format
  5. Added proper HashDiffer initialization

@kantselovich
Copy link

@ChaosHour , good news I was able to fix the tests and they actually work
#5

@ChaosHour ChaosHour force-pushed the DBE-3085-Add_functionality_chunk_level branch from 60d060f to 0edc927 Compare December 23, 2024 17:48
@@ -0,0 +1,38 @@
import pytest

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-import  pytest
+import unittest

@ChaosHour ChaosHour closed this Dec 23, 2024
@ChaosHour ChaosHour deleted the DBE-3085-Add_functionality_chunk_level branch December 24, 2024 19:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants