Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add argument to get_column_values allow for alternate sorting of column values #289

Closed
wants to merge 7 commits into from

Conversation

clausherther
Copy link
Contributor

@clausherther clausherther commented Oct 12, 2020

This is a:

  • bug fix PR with no breaking changes (please change the base branch to main)
  • new functionality
  • a breaking change

Description & motivation

This PR adds a sort_column and sort_direction argument to get_column_values to allow for alternate sorting of column values. This based on work from @joshpeng-quibi
Closes #288

Checklist

  • I have verified that these changes work locally on the following warehouses (Note: it's okay if you do not have access to all warehouses, this helps us understand what has been covered)
    • BigQuery
    • Postgres
    • Redshift
    • Snowflake
  • I have updated the README.md (if applicable)
  • I have added tests & descriptions to my models (and macros if applicable)
  • I have added an entry to the changelog

@clausherther clausherther requested a review from clrcrl as a code owner October 12, 2020 23:18
@shedd
Copy link

shedd commented Dec 21, 2020

@clausherther thanks for implementing! was looking for this exact feature just now!

from {{ target_relation }}
group by 1
order by 2 {{ sort_direction if sort_direction else "desc" }}
{% endif %}
Copy link

@shedd shedd Dec 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fyi - I created a local copy of this and hit an error on this endif in DBT v18:

Encountered an error:
Compilation Error in macro (macros/sorted_get_column_values.sql)
  Got an unexpected control flow end tag, got endif but never saw a preceeding if (@ 63:8)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for highlighting that! Just pushed a fix to remove that extra endif

@clausherther
Copy link
Contributor Author

@clrcrl anything else we need to do here to get this merged?

{% else %}
{# We take the max sort value for each value to make sure
there are no duplicate rows for each value #}
max({{ sort_column }}) as sort_column
Copy link
Contributor

@clrcrl clrcrl Dec 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

huh, this is interesting. I thought this PR would primarily for the use-case of it alpha-sorting, i.e. "order by {{ column }}" (which I typically achieve through a Jinja sort)

What kind of ways do you typically want to sort these results?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is intended to let you sort by frequency in ascending (or descending) order, which I think was was not possible previously and is not possible in Jinja with just the results since the counts aren't returned.

(This is based on an internal version of this we had to implement at Quibi as a result.)

macros/sql/get_column_values.sql Outdated Show resolved Hide resolved
macros/sql/get_column_values.sql Outdated Show resolved Hide resolved
{% endif %}
from {{ target_relation }}
group by 1
order by 2 {{ sort_direction if sort_direction else "desc" }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we go with the above comment, it may be easier to do something like this:

Suggested change
order by 2 {{ sort_direction if sort_direction else "desc" }}
order by {{ order_by }} {{ sort_direction or "desc" }}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still trying to figure out how to make this work with non-aggregate columns. E.g. you pass in column_2 instead count(*) we need to wrap in an aggregation, like max(column_2), or group by it as well.

@clausherther
Copy link
Contributor Author

clausherther commented Dec 24, 2020

@clrcrl lmk what you think of the way I implemented the order_by arg. Basically, if we're ordering by the column, I use max to force the aggregation to be 1:1, otherwise we use count(*) or whatever the caller specified (e.g. max(created_at)

Copy link
Contributor

@clrcrl clrcrl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I played around with this locally:

Here's my testing file:

{% set list_default_order = dbt_utils.get_column_values(
        table=ref('stg_payments'),
        column='payment_method')
%}

{% do log('list_default_order: ' ~ list_default_order, info=True) %}

{% set list_alpha_order = dbt_utils.get_column_values(
        table=ref('stg_payments'),
        column='payment_method',
        order_by='payment_method')
%}

{% do log('list_alpha_order: ' ~ list_alpha_order, info=True) %}

{% set list_recency_order = dbt_utils.get_column_values(
        table=ref('stg_payments'),
        column='payment_method',
        order_by='max(created_at)',
        sort_direction='desc')
%}

{% do log('list_recency_order: ' ~ list_recency_order, info=True) %}

It all worked as expected!

list_default_order: ['credit_card', 'bank_transfer', 'coupon', 'gift_card']
list_alpha_order: ['bank_transfer', 'coupon', 'credit_card', 'gift_card']
list_recency_order: ['credit_card', 'bank_transfer', 'gift_card', 'coupon']

I did have a slightly different implementation in mind, which I think makes the sort order a little less confusing for the two cases — lmk what you think:

    {%- set order_by = order_by or 'count(*) desc' -%}
...
            with sorted_column_values as (

                select
                    {{ column }} as value -- note: no second column declared
                from {{ target_relation }}
                group by 1
                order by {{ order_by }}
                {% if max_records is not none %}
                
                limit {{ max_records }}
                {% endif %}
            )
            select value from sorted_column_values

By removing the second column in the query, we don't have to get tricky with the order_by column. I also just ended up
removing the sort_direction input as I figured it could be pretty easily packaged up in the order_by definition.

The above implementations would end up looking like so (i.e. no "sort_direction" parameter)

{% set list_default_order = dbt_utils.get_column_values(
        table=ref('stg_payments'),
        column='payment_method')
%}

{% do log('list_default_order: ' ~ list_default_order, info=True) %}

{% set list_alpha_order = dbt_utils.get_column_values(
        table=ref('stg_payments'),
        column='payment_method',
        order_by='payment_method')
%}

{% do log('list_alpha_order: ' ~ list_alpha_order, info=True) %}

{% set list_recency_order = dbt_utils.get_column_values(
        table=ref('stg_payments'),
        column='payment_method',
        order_by='max(created_at) desc'
%}

{% do log('list_recency_order: ' ~ list_recency_order, info=True) %}

This does mean there's fewer built-in defaults, but may be a little easier to document and reason about?

@clrcrl clrcrl mentioned this pull request Mar 30, 2021
10 tasks
@clrcrl
Copy link
Contributor

clrcrl commented Mar 30, 2021

Closing in favor of #349 — I let this get out of date and it was easier for me to open a new PR to get the code in a mergeable state.

@clrcrl clrcrl closed this Mar 30, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

The order by clause in get_column_values should be configurable
3 participants