Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

not_blank test #632

Closed
epapineau opened this issue Aug 4, 2022 · 6 comments · Fixed by #634
Closed

not_blank test #632

epapineau opened this issue Aug 4, 2022 · 6 comments · Fixed by #634
Assignees
Labels
enhancement New feature or request good first issue

Comments

@epapineau
Copy link
Contributor

Describe the feature

A clear and concise description of what you want to happen.
dbt has the not_null test to monitor for null values, but in certain datasets blank values are also problematic. However, there is no not_blank test that I found when searching utils or dbt-expectations.

Describe alternatives you've considered

A clear and concise description of any alternative solutions or features you've considered.
Creating a custom test that errors if there are blanks.

Additional context

Is this feature database-specific? Which database(s) is/are relevant? Please include any other relevant context here.
I think blank entries are a possibility on all databases

Who will this benefit?

What kind of use case will this feature be useful for? Please be specific and provide examples, this will help us prioritize properly.
Anyone needing to check data integrity to ensure there are no blank values in their dataset

Are you interested in contributing this feature?

Yes plz 🙏🏻

@epapineau epapineau added enhancement New feature or request triage labels Aug 4, 2022
@epapineau
Copy link
Contributor Author

@dbeatty10 and / or @joellabes does this seem like a valuable addition to utils? I need it for a client project I'm working on, but I've had several other projects where I would have benefited from it in the past as well.

@joellabes
Copy link
Contributor

joellabes commented Aug 4, 2022

Yes I'm into it!

Would this also trim whitespace before evaluating? I.e. a totally empty string would fail, but what about a string that contains a single space?

I'm asking based on this similar method, IsNullOrWhitespace() from my C# days, which I preferred over its sibling IsNullOrEmpty().

I don't have strooong feelings about whether it's necessary or not. Perhaps it needs an extra argument that says whether to trim or not?
Something like

{% test not_blank(model, column, ignore_whitespace) %}

select * from {{ model }} where 
{% if ignore_whitespace %}
    len(column) = 0 --remember to use {{ dbt.length() }} instead of a hardcoded one, natch https://docs.getdbt.com/reference/dbt-jinja-functions/cross-database-macros#length
{% else %}
    len(ltrim(rtrim(column))) = 0
{% endif %}

{% endtest %}

@tigitz
Copy link

tigitz commented Aug 5, 2022

Some food for thoughts on this:

In Java:

We consider a string to be empty if it's either null or a string without any length. If a string only consists of whitespace, then we call it blank.

Source: https://www.baeldung.com/java-blank-empty-strings#empty-blank

Also a popular validation framework in PHP says this:

Validates that a value is not blank - meaning not equal to a blank string, a blank array, false or null (null behavior is configurable).

Source: https://symfony.com/doc/current/reference/constraints/NotBlank.html

Which raise the question on how the rule should behave when encountering a blank array for database that support them.

My suggestion would be to be explicit in the rules name and avoid grouping different kind of tests into a single one using parameter flags. That would cause confusion.

Meaning introducing in a first iteration:

  • not_blank_string
  • not_empty_string

Following the "Java" meaning but not testing null as it's already provided by the not_null rule.

And then think about how to support

  • not_empty_array
  • not_blank_array
  • not_empty
  • not_blank

@epapineau epapineau self-assigned this Aug 5, 2022
@joellabes
Copy link
Contributor

@tigitz thanks for the feedback and the links to prior art!

Which raise the question on how the rule should behave when encountering a blank array for database that support them.

I would be surprised if we could validate both strings and arrays in the same test without knowing their type in advance (which I think is your point in recommending _string and _array variants?)

My suggestion would be to be explicit in the rules name and avoid grouping different kind of tests into a single one using parameter flags. That would cause confusion.

Can you say more about the confusion element? I'm pretty sure I disagree with you on this one, at least on tests of the same column type (e.g. empty string vs blank string). I consider it to be analogous to the inclusive param on accepted_range, and think that it'd be tidier/less cognitive overhead to have a single test whose behaviour you tweak as opposed to having two almost identical tests with suspiciously similar names.

(Accidental proving of my point: I wrote a lot of this response with blank and empty the wrong way around, and only discovered my mistake when I scrolled back up to your original comment before posting).

What shall we name the baby test?

A meme that gets in the way of my point

I would prefer that we go with not_empty_string, which conforms to the Java definition of empty (apart from the null stuff) and makes semantic sense at a glance. Then I'd have an optional parameter on it, trim_whitespace 1 which is false by default.

1 In an earlier version, I suggested ignore_whitespace. Coming to this with fresh eyes, that's unclear - what's doing the ignoring and how does ignore manifest?

Arrays

I could go either way on building a test to check whether arrays are empty or not - if either you or @epapineau wants to build one then I wouldn't be against it, but I'm also happy to punt for a while.

@tigitz
Copy link

tigitz commented Aug 8, 2022

which I think is your point in recommending _string and _array variants?

Yes exactly

Can you say more about the confusion element?

Yes, if I'm looking to avoid blank strings in my data, I'll look for the related rule. Having to go through the whole documentation to understand that a non corresponding not_empty rule could actually achieve my not_blank goals by the use of a specific parameter is the confusion I'm talking about.

Having explicit and straightforward not_empty and not_blank rules seems a better DX from my POV.

Also, mixing both tests into a single not_empty_string rule makes testing strictly blank string impossible.

Say there are 3 scenarios you would like to test, "not_empty", "not_blank", "not_empty and not blank":

Case empty '' blank ' ' - not_empty_string - not_blank_string - not_empty_string(trim_whitespace)
not empty and not blank - not_empty_string - not_blank_string - not_empty_string(trim_whitespace = true)
strictly not empty - not_empty_string - not_empty_string(trim_whitespace = false)
strictly not blank - not_blank_string

@epapineau
Copy link
Contributor Author

epapineau commented Aug 9, 2022

Thanks for all the incredible thought poured into this @joellabes and @tigitz. A quick first pass at the not_empty_string(trim_whitespace) version of the test is here.

Is there a preference on using length = 0 over column_name = ''?

I can add an array counterpart + tests to round out the PR in the next day or two :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request good first issue
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants