Consider adding a custom 'label' to allow more flexible batching #92

Open
alexmturner opened this issue Sep 8, 2023 · 4 comments

@alexmturner
Collaborator

Currently, the aggregation service only allows each 'shared ID' to be present in one query. A set of reports with the same shared ID cannot be split for separate queries, even if the resulting batches are disjoint.

One option to add more flexibility is to support an optional, custom field (a ‘label’) that is factored into the shared ID generation. We could consider a few different options:
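As a rough illustration of what "factored into the shared ID generation" could mean, here is a minimal sketch. The hashing scheme and the exact list of `shared_info` fields that feed the ID are assumptions made for this example; this issue does not specify how the Aggregation Service derives shared IDs. `sharedId` is a hypothetical helper name.

```javascript
const { createHash } = require("node:crypto");

// Hypothetical: derive a shared ID by hashing selected shared_info fields.
// The field list and the hashing scheme are assumptions for illustration.
function sharedId(sharedInfo, label = null) {
  const parts = [
    sharedInfo.api,
    sharedInfo.version,
    sharedInfo.reporting_origin,
    sharedInfo.scheduled_report_time,
  ];
  // Only factor the label in when one is set, so unlabeled reports
  // keep the shared ID they would have had without this feature.
  if (label !== null) parts.push(`label=${label}`);
  return createHash("sha256").update(JSON.stringify(parts)).digest("hex");
}

const info = {
  api: "shared-storage",
  version: "0.1",
  reporting_origin: "https://reporter.example",
  scheduled_report_time: "1694131200",
};

const unlabeled = sharedId(info);
const ml = sharedId(info, "ml");
const monitoring = sharedId(info, "monitoring");
console.log({ unlabeled, ml, monitoring });
```

With the label folded in, otherwise identical reports fall under different shared IDs, so each label could be batched and budgeted separately.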

  1. Putting the field in the shared_info: The reporting origin would be able to easily split reports into separate batches based on the label. However, this approach would require the label to be set outside the isolated (Shared Storage or Protected Audience) context. It would also require reports to be sent deterministically, as with the context ID, i.e. a null report is sent even if no contributions are made. This approach is therefore unlikely to work for Protected Audience bidders (see related discussion) and could increase the number of reports sent.
  2. Putting the field in the payload: This avoids the deterministic report requirement and would allow the label to be based on cross-site data, i.e. set from inside the isolated contexts. However, it also prevents the reporting origin from directly determining the label embedded in the report. The reporting origin may therefore have to send a larger number of reports to the aggregation service and ask it to filter based on a given set of labels. For certain use cases, the reporting origin may be able to maintain a context ID to label mapping that would avoid this increased scale, albeit less ergonomically than with option 1.
  3. Allowing bucket range filtering: Instead of using an explicit label, we could allow filtering based on a range of buckets, with budget only used for that range. This could be more flexible but also increases the complexity of the Aggregation Service’s privacy budgeting implementation.
  4. A combination of the above: We could implement several of the above options and allow them to be used together or in different situations.
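To make option 2 more concrete, the following is a minimal sketch of query-time filtering on labels stored in the payload. The decrypted report shape, the `label` field and the `aggregateByLabels` helper are all hypothetical, and noise addition and budget checks are deliberately left out.

```javascript
// Hypothetical decrypted payloads: each contribution carries its own label.
const reports = [
  {
    contributions: [
      { bucket: 1n, value: 10, label: "ml" },
      { bucket: 2n, value: 5, label: "monitoring" },
    ],
  },
  {
    contributions: [
      { bucket: 1n, value: 7, label: "ml" },
      { bucket: 3n, value: 2, label: "reporting" },
    ],
  },
];

// Sum values per bucket, keeping only contributions whose label was requested.
function aggregateByLabels(allReports, allowedLabels) {
  const allowed = new Set(allowedLabels);
  const histogram = new Map();
  for (const report of allReports) {
    for (const { bucket, value, label } of report.contributions) {
      if (!allowed.has(label)) continue; // filtered out of this query
      histogram.set(bucket, (histogram.get(bucket) ?? 0) + value);
    }
  }
  return histogram;
}

const mlOnly = aggregateByLabels(reports, ["ml"]);
console.log(mlOnly.get(1n)); // 17
```

Note that both reports are sent to the service even though only some of their contributions match, which is the scale cost described in option 2.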

For all of the above approaches, we’ll also need a mechanism to limit the scale impact on the Privacy Budget Service. For example, we want to prevent developers from specifying a unique ‘label’ per report. There are a few options we could consider, including:

  1. The Aggregation Service could limit the number of labels/bucket ranges or shared IDs per query
  2. We could limit the space of allowed labels/bucket ranges directly, e.g. only allowing integer labels up to a maximum value.
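Both limits could be enforced with a simple check when a query is submitted. The sketch below assumes integer labels; the constants `MAX_LABEL_VALUE` and `MAX_LABELS_PER_QUERY` and the `validateQueryLabels` helper are hypothetical, and the actual limits (if any) are not specified in this issue.

```javascript
// Hypothetical limits chosen only for illustration.
const MAX_LABEL_VALUE = 255;
const MAX_LABELS_PER_QUERY = 16;

// Returns the de-duplicated labels, or throws if either limit is exceeded.
function validateQueryLabels(labels) {
  const unique = new Set(labels);
  if (unique.size > MAX_LABELS_PER_QUERY) {
    throw new RangeError(`at most ${MAX_LABELS_PER_QUERY} labels per query`);
  }
  for (const label of unique) {
    if (!Number.isInteger(label) || label < 0 || label > MAX_LABEL_VALUE) {
      throw new RangeError(`label ${label} is outside 0..${MAX_LABEL_VALUE}`);
    }
  }
  return [...unique];
}

console.log(validateQueryLabels([1, 2, 2]));
```

Bounding the label space also bounds the number of distinct budget keys per shared ID, which rules out a unique label per report.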

This functionality would also be useful for the Attribution Reporting API, so we may want to align on an approach. (For example, bucket range filtering has been proposed earlier.) Note that Attribution Reporting does not currently support making deterministic reports.

@csharrison

Thanks Alex, I want to note that the context ID / deterministic reports approach is compatible with this related proposal WICG/attribution-reporting-api#974, although it isn't clear all deployments could use that option.

@alexmturner alexmturner added the enhancement New feature or request label Sep 11, 2023
@michal-kalisz

Thank you for proposing this solution. It seems to be very interesting.

I'm wondering what exactly assigning a label to PAA data would look like. Would it be possible to assign a label to each key-value pair separately, or only once per entire auction?

We have several use cases in which we would like to use PAA: machine learning, monitoring, and reporting. For example, we would like to report:

privateAggregation.contributeToHistogram({bucket: key1, value: val1, label: "ml"})
privateAggregation.contributeToHistogram({bucket: key2, value: val2, label: "ml"})
privateAggregation.contributeToHistogram({bucket: key3, value: val3, label: "monitoring"})
privateAggregation.contributeToHistogram({bucket: key4, value: val4, label: "monitoring"})
privateAggregation.contributeToHistogram({bucket: key5, value: val5, label: "reporting"})

This is related to the fact that each of these cases has different requirements:

  • ML expects a large amount of data with low noise - we would like to wait a few hours for this data and query the Aggregation Service for aggregated results.
  • Monitoring expects data as quickly as possible to diagnose problems rapidly.
  • Reporting is in between - it expects data broken down by hour but can wait a bit longer for it.

It seems that this can also be achieved using proposal 3 - "bucket range filtering". However, if a label can be attached per individual histogram, this solution seems more convenient.
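For comparison, here is a minimal sketch of how the same split might look with bucket range filtering. It assumes, purely for illustration, that the top 8 bits of the 128-bit bucket identify the use case; the prefix values and the helper names are hypothetical.

```javascript
// Hypothetical layout: the top 8 bits of a 128-bit bucket identify the use case.
const PREFIX_SHIFT = 120n;
const USE_CASE_PREFIX = { ml: 1n, monitoring: 2n, reporting: 3n };

function prefixFor(useCase) {
  const prefix = USE_CASE_PREFIX[useCase];
  if (prefix === undefined) throw new RangeError(`unknown use case: ${useCase}`);
  return prefix;
}

// Combine a use-case prefix with a key that fits in the remaining 120 bits.
function makeBucket(useCase, key) {
  if (key < 0n || key >= 1n << PREFIX_SHIFT) throw new RangeError("key out of range");
  return (prefixFor(useCase) << PREFIX_SHIFT) | key;
}

// Inclusive bucket range covering every key of one use case.
function bucketRange(useCase) {
  const start = prefixFor(useCase) << PREFIX_SHIFT;
  return { start, end: start + (1n << PREFIX_SHIFT) - 1n };
}

function inRange(bucket, { start, end }) {
  return bucket >= start && bucket <= end;
}

const monitoringBucket = makeBucket("monitoring", 42n);
console.log(inRange(monitoringBucket, bucketRange("monitoring"))); // true
console.log(inRange(monitoringBucket, bucketRange("ml"))); // false
```

The trade-off is that the use case has to be encoded into every bucket key up front, whereas a per-contribution label leaves the key space untouched.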

@kwanmacher

This is a very interesting proposal, thank you!

The support that would be most useful to us is very similar to what @michal-kalisz described above, but applied to ARA summary reporting rather than PAA. We have several use cases with different latency requirements that operate on data aggregates whose aggregation keys have very different cardinalities. For example, a reporting use case has many different breakdowns and can wait longer, while a real-time monitoring use case might have far fewer breakdowns but require data to be batched with minimal latency.

Considering that these different use cases will have their values set under different aggregation keys ("reporting", "monitoring") and will collectively share the same total L1 budget for the report, it would be great if we could have the "label" attached to each of the aggregation keys (i.e. option 2 + per-key label), and have the ability to include the same aggregatable report in multiple summary reports, as long as each query uses a disjoint set of labels.
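The "disjoint set of labels" condition could be checked by tracking budget per (shared ID, label) pair instead of per shared ID. The ledger below is a hypothetical in-memory sketch of that idea, not a description of how the Privacy Budget Service actually works.

```javascript
// Hypothetical ledger keyed by (shared ID, label) rather than shared ID alone.
class LabelBudgetLedger {
  constructor() {
    this.consumed = new Set();
  }

  // Consume budget for all requested labels, or for none of them:
  // the query is rejected if any (shared ID, label) pair was already used.
  tryConsume(sharedId, labels) {
    const keys = [...new Set(labels)].map((label) => JSON.stringify([sharedId, label]));
    if (keys.some((key) => this.consumed.has(key))) return false;
    keys.forEach((key) => this.consumed.add(key));
    return true;
  }
}

const ledger = new LabelBudgetLedger();
console.log(ledger.tryConsume("shared-id-1", ["monitoring"])); // true
console.log(ledger.tryConsume("shared-id-1", ["reporting"])); // true, disjoint label
console.log(ledger.tryConsume("shared-id-1", ["monitoring"])); // false, already queried
```

Under this scheme the same report can appear in a low-latency monitoring query and a later reporting query, while repeating a label for the same shared ID is still rejected.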

A secondary optimization (which could be built on top) is to go with option 1 and store the set of labels in the shared_info to allow for more efficient batching of reports, but this is more of a nice-to-have.

alexmturner added a commit that referenced this issue Dec 15, 2023
Details a proposal for allowing more flexible querying. See
#92
for earlier discussion.
@alexmturner
Collaborator Author

Thanks for all the feedback! We've put up a proposal that we hope satisfies your use cases: https://github.com/patcg-individual-drafts/private-aggregation-api/blob/main/flexible_filtering.md.

Note that the proposal uses different terminology from this issue, but it aligns with Option 2 (with a possible extension of adding Option 1 later). It allows a separate label for each contribution within a report. While the proposal focuses on Private Aggregation, we plan to explore extending it to Attribution Reporting in a separate GitHub issue.
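As a rough sketch of how the use-case labels from the earlier comments might map onto a per-contribution integer ID, assuming the field is named `filteringId` and takes a BigInt as in the linked explainer (please check the explainer for the exact shape). The `FILTERING_IDS` mapping and the `contribute` helper are hypothetical, and `privateAggregation` is replaced by a recording stand-in so the example runs outside a worklet.

```javascript
// Hypothetical mapping from use-case labels to integer filtering IDs.
const FILTERING_IDS = { ml: 1n, monitoring: 2n, reporting: 3n };

// Stand-in for the browser-provided object; it only records the calls.
const recorded = [];
const privateAggregation = {
  contributeToHistogram: (contribution) => {
    recorded.push(contribution);
  },
};

// Attach the filtering ID for the given use case to a single contribution.
function contribute(useCase, bucket, value) {
  const filteringId = FILTERING_IDS[useCase];
  if (filteringId === undefined) throw new RangeError(`unknown use case: ${useCase}`);
  privateAggregation.contributeToHistogram({ bucket, value, filteringId });
}

contribute("ml", 1n, 10);
contribute("monitoring", 2n, 1);
console.log(recorded);
```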

alexmturner added a commit that referenced this issue May 6, 2024
Specs the ability to set a filtering ID (and modify the default ID space). See https://github.com/patcg-individual-drafts/private-aggregation-api/blob/main/flexible_filtering.md#proposal-filtering-id-in-the-encrypted-payload and issue #92.

To support this new functionality, we increase the report version. Note that this also requires aggregation service versions to support the new version.

4 participants