Questions about influence of factor vars on Prophet and waterfall contribution #384

Closed
Steven-Livingstone opened this issue May 11, 2022 · 7 comments
@Steven-Livingstone

Steven-Livingstone commented May 11, 2022

Project Robyn

Describe issue

Border lock-downs due to COVID have had an obvious negative impact on the dependent variable. To measure this impact, border lock-downs have been encoded as a categorical context variable in Robyn (0 = open_border, 1 = closed_border). Let's call this Border_Closure_Control.

I know (from eyeballing the graph) that closing the border has had a prolonged negative impact on dep_var, so I added Border_Closure_Control as a control var. However, my results differ significantly depending on whether I also add Border_Closure_Control as a factor_var.

Why would excluding Border_Closure_Control from factor_vars give the expected negative contribution (see examples below), and how are context and factor vars used in Prophet's decomposition?

I also ran tests with Border_Closure_Control as the only context variable (alongside several media vars), which produced similar results. Perhaps I have misunderstood how Prophet uses context and factor vars?

Thank you for any help or insight you can offer! Cheers :)

Further details

Note:

  • I considered the border lock-down start date to be early March 2020.

Screenshot 2022-05-10 at 2 34 49 PM

I followed these documentation recommendations when including Border_Closure_Control:

"If there are organic or context variables which are categorical, you will need to specify which variables are categorical in the factor_vars parameter" (from ProphetSeasonalityDecomposition).

"If you need to measure the impact of a specific day or event as a separate variable, consider adding that information into context_vars. In this case, make sure that you do not provide duplicate information across the two sources (i.e. Prophet holiday and additional column)" (from Model Design).

"Factor_vars is to specify which variables are factorial/categorical. For example if you have variable "offline_events" that contains only 0 and 1, you should use it in context_var or organic_var AND specify it in factor_vars." (from Issue #214)

The following models differ only in their use of Border_Closure_Control. Almost all candidate models from each Robyn run share the same characteristics concerning Border_Closure_Control.

Example: Adding Border_Closure_Control as both a context_var and factor_var

  • In prophet_decomp.png, I see that Border_Closure_Control has been included in Prophet's deseasonalization plot (above).
  • In the waterfall chart (below), Border_Closure_Control has a very positive contribution towards sales. This does not make sense to me.
  • Border_Closure_Control used default in paid_media_signs.

Screenshot 2022-05-10 at 1 41 12 PM

Example: Adding Border_Closure_Control as just a context_var

  • In prophet_decomp.png, I can see that Border_Closure_Control is missing from Prophet's deseasonalization plot.
  • In the waterfall chart (below), Border_Closure_Control now has a very negative contribution towards sales.
  • Border_Closure_Control used default in paid_media_signs.

Screenshot 2022-05-10 at 1 43 28 PM

Example: Adding Border_Closure_Control as both a context_var and factor_var AND forcing Border_Closure_Control to be negative.

  • In prophet_decomp.png, because Border_Closure_Control is included as a factor_var, it is also included in Prophet's deseasonalization plot.
  • In the waterfall chart (below), Border_Closure_Control now has zero contribution towards sales.
  • Border_Closure_Control used negative in paid_media_signs.

Screenshot 2022-05-10 at 2 51 39 PM

Investigations

  • Looking in the function prophet_decomp, here seems to be the determining logic that decides if factor_vars gets one-hot-encoded and added to Prophet's deseasonalization process.
  • Here is code that determines if factor_vars are to be included in the prophet_decomp.png plot. I.e. empty factor_vars list means no Border_Closure_Control in prophet_decomp.png.

I understand that trend and seasonality are used as extra coefficients in the ridge regression model in order to give equal opportunity to explain the dependent variable, but how does deseasonalizing the dep_var relate to the context and factor vars?
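
For reference, the one-hot encoding step mentioned in the first bullet can be reproduced in base R. This is a minimal sketch on made-up data (the `toy` data frame and its values are assumptions for illustration, not the real dataset); it only shows what `model.matrix` produces for a 0/1 factor before the resulting columns are registered with Prophet.

```r
# Toy weekly data: y is the dependent variable,
# Border_Closure_Control is a 0/1 flag stored as a factor (made-up values).
toy <- data.frame(
  y = c(100, 105, 98, 40, 42, 39),
  Border_Closure_Control = factor(c(0, 0, 0, 1, 1, 1))
)

# Same pattern as in Robyn's prophet_decomp: build the design matrix
# and drop the intercept column.
ohe <- model.matrix(y ~ ., toy)[, -1, drop = FALSE]

colnames(ohe)                                  # "Border_Closure_Control1"
as.vector(ohe[, "Border_Closure_Control1"])    # 0 0 0 1 1 1
```

With a two-level factor, the encoding yields a single dummy column named after the variable plus the non-reference level.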

Environment & Robyn version

Robyn version = 3.6.4 (R 4.1.1, per the session info below)

Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS

Matrix products: default
BLAS/LAPACK: /opt/conda/envs/race_mmm/lib/libopenblasp-r0.3.18.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] reticulate_1.22 Robyn_3.6.4   

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7         here_1.0.1         lubridate_1.8.0    lattice_0.20-45   
 [5] png_0.1-7          rprojroot_2.0.2    assertthat_0.2.1   glmnet_4.1-3      
 [9] digest_0.6.29      foreach_1.5.1      utf8_1.2.2         IRdisplay_1.0     
[13] R6_2.5.1           plyr_1.8.6         repr_1.1.3         ggridges_0.5.3    
[17] evaluate_0.14      ggplot2_3.3.5      pillar_1.6.4       rlang_0.4.12      
[21] lazyeval_0.2.2     uuid_1.0-3         data.table_1.14.2  nloptr_1.2.2.3    
[25] Matrix_1.3-4       splines_4.1.1      stringr_1.4.0      igraph_1.2.9      
[29] munsell_0.5.0      compiler_4.1.1     pkgconfig_2.0.3    base64enc_0.1-3   
[33] shape_1.4.6        rPref_1.3          htmltools_0.5.2    tidyselect_1.1.1  
[37] tibble_3.1.6       codetools_0.2-18   fansi_1.0.0        crayon_1.4.2      
[41] dplyr_1.0.7        grid_4.1.1         jsonlite_1.7.2     gtable_0.3.0      
[45] lifecycle_1.0.1    DBI_1.1.1          magrittr_2.0.1     scales_1.1.1      
[49] RcppParallel_5.1.4 stringi_1.7.6      doRNG_1.8.2        doParallel_1.0.16 
[53] ellipsis_0.3.2     generics_0.1.1     vctrs_0.3.8        IRkernel_1.2      
[57] iterators_1.0.13   tools_4.1.1        prophet_1.0        glue_1.6.0        
[61] purrr_0.3.4        rngtools_1.5.2     parallel_4.1.1     fastmap_1.1.0     
[65] survival_3.2-13    colorspace_2.0-2   minpack.lm_1.2-1   pbdZMQ_0.3-6      
[69] patchwork_1.1.1 
@Leonelsentana
Contributor

Leonelsentana commented May 13, 2022

Hey @Steven-Livingstone, Prophet's components and factor vars are included in the model as extra coefficients right after Prophet's decomposition, just as you said at the end of your post. In this case it may be worth changing the intercept_sign parameter from its default "non_negative" to "unconstrained"; forcing the intercept to be positive can sometimes harm the effects of context_vars. Please check how the results change with that. As a workaround, you could also try adding the border control event as a regressor in Prophet or as a holiday entry in the holiday file for Prophet. Cheers!

@Leonelsentana Leonelsentana self-assigned this May 13, 2022
@Steven-Livingstone
Author

Hi @Leonelsentana, thank you for those suggestions!

While I work on them, could you link to where I would be able to "add the border control event as a regressor in prophet"? Is this what you're referring to in the "Pro-tip: Customize holiday & event information" section of the analysts-guide-to-MMM?

Add an additional context variable: If you use the ‘holiday’ information from Prophet, all the various holidays/events will be aggregated and modelled as one variable. If you need to measure the impact of a specific day or event as a separate variable, consider adding that information into context_vars. In this case, make sure that you do not provide duplicate information across the two sources (i.e. Prophet holiday and additional column). Alternatively, if good quality data has been collected to explain certain different holidays and events, this can be modelled as a specific independent variable and turn off the holiday option in Prophet.

So far Border_Closure_Control has always been included in Robyn via context_vars. Is there another way to include it as a Prophet regressor, as you suggested?

@Leonelsentana
Contributor

Leonelsentana commented May 18, 2022

Please find the link about how to do it on prophet here: https://facebook.github.io/prophet/docs/seasonality,_holiday_effects,_and_regressors.html#additional-regressors
On Robyn, you can set custom Prophet parameters through the robyn_inputs() function; see ?robyn_inputs for more help. The problem is that add_regressor() is a function applied to the prophet() model object rather than a parameter, so you would need to tweak the code to pass a function call instead of a parameter through robyn_inputs' custom_params. We usually recommend that users at your level first sanity-check whether adding regressors helps by running Prophet separately. Once you are sure that it really helps and that you prefer to handle the event there instead of as a context_var, you can modify Robyn's code to include the add_regressor() call.
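
As a rough illustration of that standalone check, the sketch below fits Prophet outside Robyn with one extra regressor. Everything about the data is an assumption: the dates, the simulated `y`, and the column name `closed` are made up, and the block only runs if the prophet package is installed.

```r
# Sketch only: runs when the prophet package is available.
if (requireNamespace("prophet", quietly = TRUE)) {
  library(prophet)

  set.seed(1)
  ds <- seq(as.Date("2019-01-07"), by = "week", length.out = 156)
  closed <- as.numeric(ds >= as.Date("2020-03-02"))  # hypothetical 0/1 closure flag
  df <- data.frame(
    ds = ds,
    closed = closed,
    y = 100 - 40 * closed + rnorm(length(ds), sd = 5)  # simulated dependent variable
  )

  # With no data frame supplied, prophet() returns an unfitted model object.
  m <- prophet(yearly.seasonality = TRUE, weekly.seasonality = FALSE,
               daily.seasonality = FALSE)
  m <- add_regressor(m, "closed")   # register the extra regressor before fitting
  m <- fit.prophet(m, df)

  fc <- predict(m, df)
  # The forecast column named after the regressor holds its estimated effect.
  summary(fc$closed)
}
```

Comparing the fit with and without the `add_regressor()` line is one way to judge whether the regressor is worth carrying into a modified copy of Robyn.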

@Steven-Livingstone
Author

Steven-Livingstone commented May 25, 2022

@Leonelsentana

Change intercept_sign parameter to "unconstrained"

Unfortunately this did not have much of an effect: the waterfall chart still shows counter-intuitive results, albeit with a slightly negative intercept. These tests were also repeated with Border_Closure_Control as the only factor and/or context var added to Robyn, and the resulting waterfall charts also show a massive positive contribution for Border_Closure_Control.

Attempts to add additional regressors to Prophet

Looking through Prophet's reference manual on CRAN, the prophet function supports only the following parameters (below). Robyn passes custom_params to the prophet function here.

prophet(
    df = NULL,
    growth = "linear",
    changepoints = NULL,
    n.changepoints = 25,
    changepoint.range = 0.8,
    yearly.seasonality = "auto",
    weekly.seasonality = "auto",
    daily.seasonality = "auto",
    holidays = NULL,
    seasonality.mode = "additive",
    seasonality.prior.scale = 10,
    holidays.prior.scale = 10,
    changepoint.prior.scale = 0.05,
    mcmc.samples = 0,
    interval.width = 0.8,
    uncertainty.samples = 1000,
    fit = TRUE,
    ... # Additional arguments, passed to fit.prophet
)

The only way (that I can see) to include additional regressors in Prophet is via the add_regressor function that you linked to. However, this requires access to the actual model object returned by the prophet() call shown above, which I'm not sure can be dug out of Robyn. EDIT: I see you came to the same conclusion ;)

Nonetheless, I can see that Robyn already adds all factor vars as Prophet regressors by using the same add_regressor function mentioned above.

if (!is.null(factor_vars) && length(factor_vars) > 0) {
    dt_ohe <- as.data.table(model.matrix(y ~ ., dt_regressors[, c("y", factor_vars), with = FALSE]))[, -1]
    ohe_names <- names(dt_ohe)
    # Adds factor_vars
    for (addreg in ohe_names) modelRecurrence <- add_regressor(modelRecurrence, addreg)
    # Adds context_vars and paid_media_spends
    dt_ohe <- cbind(dt_regressors[, !factor_vars, with = FALSE], dt_ohe)
    mod_ohe <- fit.prophet(modelRecurrence, dt_ohe)

context_vars and paid_media_spends also appear to be included as Prophet regressors via dt_regressors:

dt_regressors <- cbind(recurrence, subset(dt_transform, select = c(context_vars, paid_media_spends)))

Add Border_Closure_Control as a Prophet Holiday

This had the effect of swinging the holidays waterfall contribution negative, similar to using Border_Closure_Control as just a context_var and excluding it from factor_vars.
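
For anyone trying the same thing, the holiday rows could be built along these lines. This is a sketch under assumptions: it presumes the holiday table uses the columns ds, holiday, country and year (as in Robyn's bundled dt_prophet_holidays), and the end date, event label and country code are placeholders rather than the values actually used.

```r
# Hypothetical closure window: the start follows the "early March 2020" note
# in the original post; the end date is a placeholder assumption.
closure_days <- seq(as.Date("2020-03-02"), as.Date("2020-03-29"), by = "day")

border_holidays <- data.frame(
  ds = closure_days,
  holiday = "border_closure",  # hypothetical event label
  country = "XX",              # placeholder; should match prophet_country
  year = as.integer(format(closure_days, "%Y")),
  stringsAsFactors = FALSE
)

nrow(border_holidays)  # 28
```

Rows like these would then be appended to the holiday table (for example with `rbind`) before it is passed to robyn_inputs(), keeping in mind that Prophet aggregates all holidays into one variable.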

Follow on Questions

From analyzing the above code, it seems that Robyn uses the add_regressor function to always pass all context_vars, paid_media_spends, and factor_vars to Prophet. Should Robyn, by default, pass all regressors to Prophet for deseasonalization? Naively, it seems that Prophet is deseasonalizing all the "explainability" out of Border_Closure_Control, leaving Robyn's ridge regression little to work with. And what's left happens to show as a positive contribution in the waterfall chart?

Is this reasonable (I'm willing to be completely off the mark though :P )?

@Leonelsentana
Contributor

Hi @Steven-Livingstone, apologies for the delay in my response; I have been pretty busy lately. context_vars are added as Prophet regressors only when they are factors; this is how we convert factors to numeric vars for easy processing, leveraging Prophet. We believe that continuous numeric context_vars can reflect variance over time better than factor context_vars, so you may want to try that. Such a variable will not go through Prophet's regressors, just directly into the ridge regression. Hope it helps!

@Steven-Livingstone
Author

Hi @Leonelsentana, no problem :) Honestly the support you (and the Robyn dev team) are providing is simply awesome! Thank you all so much.

Your suggestion seems to be working as intended in my testing (see the graph above in "Example: Adding Border_Closure_Control as just a context_var"). There, Border_Closure_Control is unsurprisingly very negative. I think we'll go ahead with this for now, given your explanation.

The only nitpick I have is that I can't seem to find the reason for it working in the code. Below is the code that I believe "convert[s] factors to numeric vars for easy processing leveraging prophet", with some of my comments added.

  # If there exists factor_vars, convert to numeric vars and add them using add_regressor fn
  if (!is.null(factor_vars) && length(factor_vars) > 0) {
    dt_ohe <- as.data.table(model.matrix(y ~ ., dt_regressors[, c("y", factor_vars), with = FALSE]))[, -1]
    ohe_names <- names(dt_ohe)
    for (addreg in ohe_names) modelRecurrence <- add_regressor(modelRecurrence, addreg)
    dt_ohe <- cbind(dt_regressors[, !factor_vars, with = FALSE], dt_ohe)
    mod_ohe <- fit.prophet(modelRecurrence, dt_ohe)
    dt_forecastRegressor <- predict(mod_ohe, dt_ohe)
    forecastRecurrence <- dt_forecastRegressor[, str_detect(
      names(dt_forecastRegressor), "_lower$|_upper$",
      negate = TRUE
    ), with = FALSE]
    for (aggreg in factor_vars) {
      oheRegNames <- na.omit(str_extract(names(forecastRecurrence), paste0("^", aggreg, ".*")))
      forecastRecurrence[, (aggreg) := rowSums(.SD), .SDcols = oheRegNames]
      get_reg <- forecastRecurrence[, get(aggreg)]
      dt_transform[, (aggreg) := scale(get_reg, center = min(get_reg), scale = FALSE)]
    }
  # Else, simply use dt_regressors for prophet decomp
  } else {
    mod <- fit.prophet(modelRecurrence, dt_regressors)
    forecastRecurrence <- predict(mod, dt_regressors)
  }
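
One detail in the block above may bear on the sign question: the last line of the `for` loop shifts each factor's Prophet effect so that its minimum becomes zero. A small base R sketch with made-up numbers (the effect size of -50 is an assumption for illustration) shows what that shift does to a regressor whose estimated effect is negative.

```r
# Hypothetical Prophet regressor effect for a 0/1 border flag:
# 0 while the border is open, -50 while it is closed (made-up numbers).
get_reg <- c(0, 0, 0, -50, -50, -50)

# The same call as in the snippet: subtract the minimum, do not rescale.
shifted <- as.vector(scale(get_reg, center = min(get_reg), scale = FALSE))

shifted  # 50 50 50 0 0 0
```

Under this reading, the variable passed on to the ridge regression is positive while the border is open and zero while it is closed, which might be why the waterfall chart shows a positive bar. This is an interpretation of the snippet, not something the maintainers have confirmed.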

Looking above at line 688, where dt_regressors is assigned:

  dt_regressors <- cbind(recurrence, subset(dt_transform, select = c(context_vars, paid_media_spends)))

Here we see context_vars is included along with paid_media_spends. This means that should it fall into the above "else" block due to no factor_vars, context_vars will still be used as prophet regressors.

Perhaps it's my inexperience with Prophet, but is my interpretation of this code correct?

@Leonelsentana
Contributor

Hi @Steven-Livingstone, apologies for the delay again. We are adding those columns, but we are not using them at any point in the Prophet fit. Prophet will just take ds and y from the data frame, so no context_vars are used. Apologies for the confusion!
