Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CT-1216] [Bug] {{ concat }} macro not working on Spark / Databricks #5888

Closed
2 tasks done
jelstongreen opened this issue Sep 20, 2022 · 12 comments
Closed
2 tasks done
Labels
bug Something isn't working Team:Adapters Issues designated for the adapter area of the code

Comments

@jelstongreen
Copy link

Is this a new bug in dbt-core?

  • I believe this is a new bug in dbt-core
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

hash(concat(['a','b','c']))

compiles to:

md5(cast(a, b, c as string))

Expected Behavior

The behaviour should mimic the previously used :

dbt_utils.hash(dbt_utils.concat(['a','b','c']))

which compiles too:

md5(cast(concat(a, b, c) as string))

Steps To Reproduce

Create an example model containing:

{% set native_query = hash(concat(['a','b','c'])) %}

{{ log(native_query, info=True) }}

{% set dbt_utils_query = dbt_utils.hash(dbt_utils.concat(['a','b','c'])) %}

{{ log(dbt_utils_query, info=True) }}

SELECT 1

Relevant log output

17:00:51  1 of 1 START view model global_update_metrics.test ............................. [RUN]
17:00:51  md5(cast(a, b, c as string))
17:00:51  Warning: the `hash` macro is now provided in dbt Core. It is no longer available in dbt_utils and backwards compatibility will be removed in a future version of the package. Use `hash` (no prefix) instead. The datalake_models.test model triggered this warning.
17:00:51  md5(cast(concat(a, b, c) as string))
17:00:51  1 of 1 OK created view model global_update_metrics.test ........................ [OK in 0.87s]


### Environment

```markdown
- OS: Mac 12.6
- Python: 3.8.12
- dbt: 1.2.1

Which database adapter are you using with dbt?

spark

Additional Context

Databricks

@jelstongreen jelstongreen added bug Something isn't working triage labels Sep 20, 2022
@github-actions github-actions bot changed the title [Bug] {{ concat }} macro not working on Spark / Databricks [CT-1216] [Bug] {{ concat }} macro not working on Spark / Databricks Sep 20, 2022
@jtcohen6
Copy link
Contributor

jtcohen6 commented Sep 20, 2022

@jelstongreen Thanks for opening, and providing the clear reproduction case. Given the log message you're seeing, it looks like the concat() call gets skipped over entirely. I'm not sure why that would be...

Which version of dbt_utils do you have installed? I'm not able to reproduce this locally using the latest version. I see dbt correctly calling spark__concat, and including it in the rendered code:

# packages
packages:
  - package: dbt-labs/dbt_utils
    version: 0.9.2
{% set native_query = hash(concat(['a','b','c'])) %}

{{ log("Without namespace: " ~ native_query, info=True) }}

{% set dbt_utils_query = dbt_utils.hash(dbt_utils.concat(['a','b','c'])) %}

{{ log("With dbt_utils namespace: " ~ dbt_utils_query, info=True) }}

SELECT 1
$ dbt parse
17:37:38  Running with dbt=1.2.1
17:37:38  Start parsing.
17:37:38  Dependencies loaded
17:37:38  ManifestLoader created
17:37:38  Without namespace: md5(cast(concat(a, b, c) as string))
17:37:38  Warning: the `concat` macro is now provided in dbt Core. It is no longer available in dbt_utils and backwards compatibility will be removed in a future version of the package. Use `concat` (no prefix) instead. The test.my_model model triggered this warning.
17:37:38  Warning: the `hash` macro is now provided in dbt Core. It is no longer available in dbt_utils and backwards compatibility will be removed in a future version of the package. Use `hash` (no prefix) instead. The test.my_model model triggered this warning.
17:37:38  With dbt_utils namespace: md5(cast(concat(a, b, c) as string))
17:37:38  Manifest loaded
17:37:38  Manifest checked
17:37:38  Flat graph built
17:37:38  Manifest loaded
17:37:38  Performance info: target/perf_info.json
17:37:38  Done.
...

@jtcohen6 jtcohen6 added Team:Adapters Issues designated for the adapter area of the code awaiting_response and removed triage labels Sep 20, 2022
@jelstongreen
Copy link
Author

jelstongreen commented Sep 20, 2022

Hey @jtcohen6 thanks for looking into this! These are my packages:

packages:
  - package: dbt-labs/dbt_utils
    version: 0.9.2
  - package: dbt-labs/spark_utils
    version: 0.3.0
  - package: dbt-labs/codegen
    version: 0.8.0
  - package: brooklyn-data/dbt_artifacts
    version: 2.0.0

I'm thinking this could be do with my dispatch config which I've not really updated since first starting out with dbt - does this need to be amended?

dispatch:
  - macro_namespace: dbt_utils
    search_order:
      - my_project
      - spark_utils
      - dbt_utils

@jtcohen6
Copy link
Contributor

@jelstongreen Yeah! What happens if you remove that dispatch config entirely? The migration of utility macros into dbt-core should mean that "shim" packages like spark_utils are no longer needed here

@jelstongreen
Copy link
Author

@jtcohen6 Unfortunately no change with that section removed:

23:47:14  Running with dbt=1.2.1
23:47:14  Start parsing.
23:47:14  Dependencies loaded
23:47:14  ManifestLoader created
23:47:17  Without namespace md5(cast(a, b, c as string))
23:47:17  Warning: the `concat` macro is now provided in dbt Core. It is no longer available in dbt_utils and backwards compatibility will be removed in a future version of the package. Use `concat` (no prefix) instead. The datalake_models.test model triggered this warning.
23:47:17  Warning: the `hash` macro is now provided in dbt Core. It is no longer available in dbt_utils and backwards compatibility will be removed in a future version of the package. Use `hash` (no prefix) instead. The datalake_models.test model triggered this warning.
23:47:17  With dbt_utils namespace: md5(cast(concat(a, b, c) as string))
23:47:31  Manifest loaded
23:47:31  Manifest checked
23:47:32  Flat graph built
23:47:32  Manifest loaded
23:47:32  Performance info: target/perf_info.json
23:47:32  Done.

@jelstongreen
Copy link
Author

I've worked this one out:

We had an internal macro called concat which when we were using dbt_utils.concat would not conflict because of the explicit namespace. This was the same for all packages which prefixed their use of concat with dbt_utils. However, as these packages now use concat with no package prefix it was conflicting with our internal macro. I'm not sure why our internal macro would just not provide any output at all in this case but I have renamed the macro to something more unique to our use case (concatenation with affixes).

Probably no need for any resolution here although it would be handy to know if an internal macro is conflicting with a built in macro of the same name, particularly after a version change as can be difficult to determine otherwise. Thanks for your help @jtcohen6 !

@jtcohen6 jtcohen6 removed the triage label Sep 21, 2022
@jtcohen6
Copy link
Contributor

@jelstongreen Ah, thanks for the update!

Definitely open to thinking about how it could be easier to debug these ones in the future. It's true that there's more possibility for namespace collision when calling macros from the global namespace, though with the benefit of making it simpler for folks to override specific macro behavior when they see the need.

Unfortunately, because this override happens during macro resolution, rather than dispatch resolution (subtle distinction), there's no way to identify at parse time / via depends_on.macros that dbt_utils.surrogate_key will be calling my_project.concat instead of dbt_utils.concat. This is what appears in manifest.json:

    "macro.dbt_utils.default__surrogate_key": {
      "unique_id": "macro.dbt_utils.default__surrogate_key",
      "package_name": "dbt_utils",
      "root_path": "/Users/jerco/dev/scratch/testy/dbt_packages/dbt_utils",
      "path": "macros/sql/surrogate_key.sql",
      "original_file_path": "macros/sql/surrogate_key.sql",
      "name": "default__surrogate_key",
      "macro_sql": "\n\n{%- macro default__surrogate_key(field_list) -%}\n\n{%- if varargs|length >= 1 or field_list is string %}\n\n{%- set error_message = '\nWarning: the `surrogate_key` macro now takes a single list argument instead of \\\nmultiple string arguments. Support for multiple string arguments will be \\\ndeprecated in a future release of dbt-utils. The {}.{} model triggered this warning. \\\n'.format(model.package_name, model.name) -%}\n\n{%- do exceptions.warn(error_message) -%}\n\n{# first argument is not included in varargs, so add first element to field_list_xf #}\n{%- set field_list_xf = [field_list] -%}\n\n{%- for field in varargs %}\n{%- set _ = field_list_xf.append(field) -%}\n{%- endfor -%}\n\n{%- else -%}\n\n{# if using list, just set field_list_xf as field_list #}\n{%- set field_list_xf = field_list -%}\n\n{%- endif -%}\n\n\n{%- set fields = [] -%}\n\n{%- for field in field_list_xf -%}\n\n    {%- set _ = fields.append(\n        \"coalesce(cast(\" ~ field ~ \" as \" ~ type_string() ~ \"), '')\"\n    ) -%}\n\n    {%- if not loop.last %}\n        {%- set _ = fields.append(\"'-'\") -%}\n    {%- endif -%}\n\n{%- endfor -%}\n\n{{ hash(concat(fields)) }}\n\n{%- endmacro -%}",
      "resource_type": "macro",
      "tags": [],
      "depends_on": {
        "macros": [
          "macro.dbt_utils.type_string",
          "macro.dbt_utils.hash",
          "macro.dbt_utils.concat"
        ]
      },

FYI @dbeatty10 @joellabes - this is an issue that would have been prevented by calling the macro explicitly as dbt.concat, instead of just concat, within dbt_utils.surrogate_key. I know we've seen other issues with the former approach, though: #5720.

I'm going to close this specific issue as resolved for now.

@joellabes
Copy link
Contributor

Oh boy, I had no idea that dbt.something() and something() would give different results! I thought it was purely down to clarity or not.

Personally I prefer {{ dbt.something() }} because it makes it more obvious where the magic is coming from (and I just had an issue where someone else was confused by this: dbt-labs/dbt-utils#666). I think you and Doug decided the opposite in my absence and I was OK with that when I thought it didn't change anything.

Now that it does... 🤔 which failure case is more common/difficult to work around? This or the one raised in 5720?

@jtcohen6
Copy link
Contributor

jtcohen6 commented Sep 22, 2022

Oh boy, I had no idea that dbt.something() and something() would give different results! I thought it was purely down to clarity or not.

@joellabes To be clear, they yield different results only if you've defined a macro in your own project named {{ concat() }}, which then wins the race for global macro resolution. Whereas, if you've defined a macro in your project named {{ default__concat() }}, that wins the race for dispatch in the dbt namespace. I think the latter is preferable, because it's a bit harder to find yourself there accidentally, and the dependency shows up in depends_on.macros.

So, I also find myself leaning toward explicit namespaces... I don't remember exactly why @dbeatty10 and I opted for prefixless, but knowing us, there's definitely a thread somewhere.

The bug in #5720 is decidedly a bug, not just a point of user confusion — it's something we can try to fix. It's a gnarly bug, though. We can invest more time in trying to resolve it, which would clear the way toward us being able to recommend explicit namespaces unambiguously, but it won't be fixed by dbt-core v1.3 + dbt-utils v1.0. At the same time, the issue will just go away when we remove the deprecated back-compat macros from dbt_utils.

@dbeatty10
Copy link
Contributor

Wholeheartedly agree with using explicit namespaces IFF we can't avoid bugs otherwise! i.e., {{ dbt.type_string() }} rather than {{ type_string() }}

@jtcohen6 our original discussion was basically this:

  • Include dbt. prefix for macros? Or drop?
    • The result feels more readable without the prefix to me.

I prefer without the prefix, too.

An underlying assumption for me was that with/without prefix was equivalent. We should be explicit if we will have bugs otherwise. We will need to add dbt. prefixes to all the code examples here.


To fully unpack my ordered preferences for syntax (if we could create our ideal world):

  1. select concat(['d', 'b', 't'])
  2. select {{ concat(['d', 'b', 't']) }}
  3. select {{ dbt.concat(['d', 'b', 't']) }}

In the first instance, all the Jinja has melted away as-if concat were a SQL standard function implemented by all databases (or if there existed some kind of black magic that made it appear that way 😉 ).

@jtcohen6
Copy link
Contributor

@dbeatty10 I think I may have a fix for the bug that crops up when users explicitly specify dbt.*: #5907

Given that, I'm inclined to encourage explicit dbt.* in user projects and packages, to avoid unintended namespace collisions. (We're not the only people who have this problem: https://stackoverflow.com/questions/70458086/how-to-correctly-import-pyspark-sql-functions)

We will need to add dbt. prefixes to all the code examples here.

✅ We can open an issue for this.

@joellabes I imagine we'd also want to update the references in dbt-utils, which might be a much bigger lift, and warrants opening a separate issue.

@dbeatty10
Copy link
Contributor

We will need to add dbt. prefixes to all the code examples here.

✅ We can open an issue for this.

I already have a PR open for this page in our documentation, so I'll add it there:

@joellabes
Copy link
Contributor

To fully unpack my ordered preferences for syntax (if we could create our ideal world):

My preferences are actually the exact inverse of yours! If we are being clever, then I want to see where the magic is coming from. Mostly because:

as-if concat were a SQL standard function implemented by all databases

is a great greenfield stance to take, but it's already a SQL standard function implemented by a lot of databases, so this would open us up to wild amounts of clashes - it would really only be possible with the dbt proxy server rewriting everything before the warehouse saw it. Which isn't impossible I guess! But certainly a surprise for the new employee when they discover that something is swizzling their functions from under them. Maybe I'm thinking too small? I'm open to the scales falling from my eyes

I think I may have a fix for the bug that crops up when users explicitly specify dbt.*: #5907

:infinitypartyparrot:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Team:Adapters Issues designated for the adapter area of the code
Projects
None yet
Development

No branches or pull requests

4 participants