Privacy guarantees of Ranked Granular Report #16

Open
michaelkleber opened this issue Jul 14, 2020 · 9 comments

@michaelkleber

Prompt for discussion in SPARROW technical workshop July 16th (see #14):

Can you explain what makes a variable un-protected? It seems that it's supposed to be something that can be known only to one party (either the advertiser or the publisher). But how can the reporting system be sure of that?

It's certainly not enough to say that a value is un-protected if one party comes up with it without talking to the other party. To take an obvious example, the publisher's code knows the time when the page loaded, and the advertiser's code knows the time when the ad rendered, but if these were both un-protected variables, then of course they would immediately let the two parties join up their reports. (As I mentioned in #9, I don't believe that an agreement to not collude or share data is a viable protection.)

I think that without un-protected variables, this is very nearly the same as aggregate reporting, though you're proposing k-anonymity instead of differential privacy. That is, a report containing each row, with k-anonymity used to redact rare values, is very similar to a report that lists each sufficiently-popular event and the number of times that event occurred. (The ranking preference idea here matches up with the Aggregated Reporting section "Grouping multiple keys in one query")
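To illustrate the similarity described above, here is a minimal sketch, not taken from the SPARROW proposal. The function names, the event values, and the threshold k=3 are hypothetical. It compares a row-level report with rare values redacted against an aggregate report of sufficiently popular events.

```python
from collections import Counter


def redact_rare(rows, k):
    """Row-level report: replace values seen fewer than k times with None."""
    counts = Counter(rows)
    return [value if counts[value] >= k else None for value in rows]


def popular_counts(rows, k):
    """Aggregate report: count of each value seen at least k times."""
    return {value: n for value, n in Counter(rows).items() if n >= k}


# Hypothetical protected values, one per ad event.
events = ["sports"] * 5 + ["news"] * 3 + ["rare-topic"]

redacted = redact_rare(events, k=3)
aggregate = popular_counts(events, k=3)

# Counting the rows that survive redaction recovers the aggregate report.
surviving = Counter(value for value in redacted if value is not None)
print(dict(surviving) == aggregate)
```

In this toy setting the two reports carry the same information about the protected values, which is the sense in which they are "very similar" when no un-protected variables are attached to the rows.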

But joining these with un-protected variables seems like a pretty substantial change in the reporting model.

@BasileLeparmentier
Collaborator

Following our workshop, please find below a summary of the discussions around this issue. The full video can be found here, password:0Y$y0.R$. We invite participants to comment if they want to add a specific point or want to correct the following summary of the discussions.

Unprotected vs protected variable.

What are unprotected variables:

Those are variables that cannot be used to identify a user on the publisher website and therefore do not need to be hidden in the report. They include, for instance:

  • Click: the ad is rendered in an opaque iframe, so the publisher is not aware of the presence or absence of a click and cannot use it to identify a user.
  • Interest group: again, the publisher is not aware of which interest group an ad was served for, and therefore cannot use this information to link a particular display to a particular user.
  • In general, any variable that cannot be used to guess the content of a variable that is available in the bid request should be fine. However, this means that the set of unprotected variables is actually likely not to be large.

Some work is still needed to reach a consensus that these "unprotected" variables do not introduce a vulnerability in the privacy protections.

Differential privacy vs ranked privacy-preserving granular report

The ranked granular report in SPARROW relies on k-anonymity on protected variables, whereas Chrome reporting relies on differential privacy.

One reason is the presence of unprotected variables, which are not easily introduced within a differential privacy framework.

According to Charlie Harrison, Chrome engineers are working on a version of differential privacy that would allow handling such cases, but the framework is not complete yet.

Another reason is that we have reservations about differential privacy as the correct tool for online advertising, as explained here (https://github.com/Pl-Mrcy/privacysandbox-reporting-analyses/blob/master/differential-privacy-for-online-advertising.md ).

No consensus was reached during the meeting about the best tool to provide anonymous reporting between differential privacy and k-anonymity. Further discussion about trade-offs will be needed.

@michaelkleber
Author

Thanks for the summary, Basile.

As @csharrison brought up in the meeting, the Chrome proposal for an event-level conversion measurement API does have a little of the "unprotected variable" nature to it. In that proposal, a report contains an actual event ID which can be associated with the ad auction. But in exchange for making the auction-time signals "unprotected", we place substantial limits on what other information can be joined with the event — just a few bits of information from conversion time, and even those bits involve DP-style noise.

In the same way, the existence of "unprotected variables" in your proposal triggers the need for substantial protections, otherwise it runs the risk of letting a large amount of user-specific information travel from the publisher site to the advertiser site.

To illustrate the risk here, let's consider what it would take for a colluding publisher and advertiser to join actual user identifiers using this report.

Suppose every user on Large Publisher P has a unique user ID. In the signals it contributes to the ad auction, the publisher includes the first bit, first 2 bits, first 3 bits, etc., of the user's ID. These are all protected variables, so probably "first 16 bits" would not pass the k-anonymity threshold, and would be suppressed. But in your proposal, the Gatekeeper would allow a report which contained as many bits as k-anonymity allows.

Now suppose Large Advertiser A also has a unique ID for each of their customers, and associates that with the click ID whenever a customer clicks on their ad and visits their site. If the Gatekeeper's report includes the click ID as an unprotected variable, then Advertiser A can learn many bits of the Publisher P ID for its customers. Of course it's not a unique ID, due to k-anonymity.

But the following week, Publisher P could switch to sending the last 1,2,3,etc bits of its user IDs. Since both IDs are stable over time, a person who clicks on two different ads from P to A leaks twice as many bits of identity. At that rate it doesn't take long to join IDs.
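The bit-leaking scheme above can be sketched as follows. This is a toy model under loud assumptions: 16-bit IDs, every possible ID in use exactly once, a hypothetical threshold k = 256, and a Gatekeeper that reports the longest prefix (or suffix) still shared by at least k users. None of these names or values come from the proposal.

```python
ID_BITS = 16  # hypothetical ID width, matching the "first 16 bits" example


def to_bits(user_id):
    """Fixed-width binary string for a publisher-side user ID."""
    return format(user_id, f"0{ID_BITS}b")


def longest_k_anonymous_piece(all_ids, user_id, k, from_end=False):
    """Longest prefix (or suffix) of user_id shared by at least k IDs."""
    def piece(bits, n):
        return bits[-n:] if from_end else bits[:n]

    target = to_bits(user_id)
    population = [to_bits(i) for i in all_ids]
    best = ""
    for n in range(1, ID_BITS + 1):
        wanted = piece(target, n)
        if sum(1 for bits in population if piece(bits, n) == wanted) < k:
            break  # longer pieces can only be rarer
        best = wanted
    return best


all_ids = range(2 ** ID_BITS)  # assumption: every 16-bit ID is in use
k = 256                        # hypothetical k-anonymity threshold
user_id = 0xBEEF               # hypothetical user who clicks twice

week_1 = longest_k_anonymous_piece(all_ids, user_id, k)                 # first bits
week_2 = longest_k_anonymous_piece(all_ids, user_id, k, from_end=True)  # last bits

# In this setup each week leaks 8 bits, so the two halves cover the whole ID.
recovered = int(week_1 + week_2, 2)
print(len(week_1), len(week_2), recovered == user_id)
```

Each report on its own respects the threshold; it is the click ID joining the two reports to the same advertiser-side customer that lets the pieces be concatenated.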

@BasileLeparmentier
Collaborator

Thank you for this example.

Comparing it to the one you gave in the first version of SPARROW reporting, where only a display was needed to pass user ids between two colluding actors, we take the fact there now needs to be at least two clicks and a very specific configuration as a testimony to the tremendous progress we made!

Let us detail exactly what it would take to run this attack.

  1. Let us assume that the publisher_user_id can indeed be encoded in 16 bits. In order to report on the first 8 bits, as it is a protected variable, you need at least k users in each bucket to get the report. This means that you must have at least k * 2^8 = 256 * k users in every combination of the other dimensions (excluding the user id) you want a report on. This would force the advertiser to opt for a low granularity in terms of non-user-centric dimensions.

  2. The advertiser gets one report per interest group (not publisher-centric). Therefore, either all publishers collude in exactly the same way, or the advertiser would have access to extremely limited reporting for all publishers that do not collude, just to run this attack on the colluding publishers. Even on colluding publishers, k-anonymity would limit its reporting on non-user-centric dimensions.

  3. For the advertiser to run this attack successfully on a user, the user must click twice on the same publisher, on two separate occasions where the reports differ in which bits are reported. Clicks are rare events (approximately 5 out of a thousand displays), so two clicks on the same publisher are even rarer.
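The orders of magnitude in the three points above can be worked through with a short sketch. The value of k is hypothetical (the thread does not fix one), and the two-click figure assumes that clicks on two given displays are independent, which is a simplification.

```python
def users_needed(k, reported_bits):
    """Users needed per report cell so that each of the 2**reported_bits
    ID buckets still holds at least k users (assumes IDs spread evenly)."""
    return k * 2 ** reported_bits


K = 100               # hypothetical k-anonymity threshold
CLICK_RATE = 5 / 1000  # "5 out of a thousand displays", from the comment above

users_per_cell = users_needed(K, reported_bits=8)

# Probability that two given displays to the same user are both clicked,
# assuming the two clicks are independent events.
p_two_clicks = CLICK_RATE ** 2

print(users_per_cell, p_two_clicks)
```

With these assumed values, each reported combination of the other dimensions needs 25,600 users, and a given pair of displays is both clicked roughly 25 times per million.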

Technically, the example you gave, albeit convoluted, is indeed feasible. But we think that the requirements to run such an attack are too costly (the attacker would need to buy many impressions) and too rarely met (both in terms of success rate and coverage) to make it even remotely practical, particularly at scale. The cost of the attack exceeds the potential benefits several times over. On the other hand, reporting such as what we propose allows for practical advertising, which would result in publisher spend. I guess this is a case of weighing the privacy impact against other considerations (as mentioned by your team during one of the W3C IWABG calls), and in that case, we believe that the potential privacy gain is far too small compared to the impact on publisher revenues.

Also to note: a similar, much simpler attack could be mounted on contextual requests, affecting all proposals, where one would just need to pass the user_id in the link descriptor to get exactly the same level of information you described. In that case, why would anyone run a complex attack such as the one you described when there are easier ways to proceed? We also want to point out that similar attacks relying on persistent ids could also be conducted in a differentially private framework. Alternatively, to guarantee differential privacy over multiple reports, you would need to strongly increase the added noise, making the reporting unusable.

@michaelkleber
Author

First, I definitely agree that we've been making tremendous progress! I'm sorry if I haven't made that clear enough.

Second, my point is that the "unprotected variable" idea lets a pretty large amount of information flow across sites.

It's large enough that I described a way to transmit a whole user ID. But even if that particular use is unlikely, it's still a way for the advertiser to get an awful lot of information that is (a) about a specific user of their site, and (b) about behavior that happens while that person is not on their site.

As I wrote in the Potential Privacy Model for the Web at the beginning of the Chrome Privacy Sandbox work, we acknowledge that some use cases rely on one site learning a little bit about some user's off-site behavior.

But the Click Through Conversion Measurement Event-Level API is an example of what "a little bit" of information might look like: it's limited to 3 bits of information, with 5% noise on the value of those bits. By contrast, you're proposing a potentially unbounded amount of cross-site information with no noise.
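As a rough illustration of what "3 bits of information, with 5% noise" could look like, here is a minimal sketch. The thread does not spell out the noise mechanism, so this assumes that with probability 0.05 the true value is replaced by a uniformly random 3-bit value; the function name and parameters are hypothetical and this is not the API's actual implementation.

```python
import random


def noisy_3_bits(value, noise_rate=0.05, rng=random):
    """Report a 3-bit value, replaced by a random one with probability noise_rate."""
    if not 0 <= value < 8:
        raise ValueError("value must fit in 3 bits (0..7)")
    if rng.random() < noise_rate:
        return rng.randrange(8)  # assumed noise: uniform over all 3-bit values
    return value


rng = random.Random(0)  # seeded only to make the sketch repeatable
true_value = 5          # hypothetical conversion-time value
reports = [noisy_3_bits(true_value, rng=rng) for _ in range(10_000)]
share_correct = reports.count(true_value) / len(reports)
print(share_correct)
```

Under this assumption a single report can never be trusted with certainty, while the advertiser can still estimate aggregate rates over many reports.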

@BasileLeparmentier
Collaborator

We are glad to know that you welcome this progress.

Do you see ways of putting numbers behind “an awful lot” or “a little bit” of information? I do think that quantifying the information leak potential as an "awful lot” is too strong. Besides the interest group (which is already known by the advertiser thanks to its first-party id), most unprotected variables would relate to the ad, not to the user.

The status of "unprotected variable" would be awarded on a case-by-case basis to make sure that the variable doesn't reveal any sensitive information. We think labels (click and view) are important unprotected variables, and we don't think they are particularly sensitive.

Taking a step back from these technicalities, it seems that we disagree on the appropriate method to solve the last mile (or, dare I say, meter) issues on user privacy. You wish to design a system in which users are shielded from any attack, however costly and convoluted, via technical means.
We think that:

  1. There are much simpler attacks outside of this system
  2. We won't meet the guarantee you aim at without crippling this entire system, and eventually killing it.

From our point of view, the Click Through Conversion Measurement Event-Level API is way too constrained to model the conversion flow. The average conversion rate (conversions / clicks) is in the range of a percent. We would lose crucial information on all intermediate events (page-viewed, etc.) and on the conversion itself (price, etc.). Advertisers simply won't be able to use this report.

Pulling the thread, they will eventually transfer their money where they can actually still measure something: walled gardens, youtube videos, search ads, etc.

On the June 2, 2020 IWABG call, @ablanchard1138 asked you for tentative values for the parameters that would be used in the reports, to which you answered:
kleber: I'll say something on TURTLEDOVE
... design rules for what level of privacy is not something for chrome to answer
... we want the answer to come from conversation, among stakeholders
... including here
and @csharrison:
charlieharrison:
For aggregate reporting, differential privacy seems the way to go
... we want to see some analysis of the tradeoffs
... what happens as you adjust epsilon
... and where's the graceful degradation

That’s what I am trying to show here, and in the piece I wrote here, and through the scripts we published here to simulate differentially private reporting: the level of noise needed to get the privacy you’re requiring is not compatible with actionable reporting for advertisers and publishers.

@michaelkleber
Author

Let me clarify what I meant when I said unprotected variables seem to me like "a way for the advertiser to get an awful lot of information that is (a) about a specific user of their site, and (b) about behavior that happens while that person is not on their site." Maybe I misunderstand something about your proposal.

(a) Suppose that we somehow ensure that unprotected variables reveal data known only to the advertiser. For example, this could include things like "did the user convert?", or "dollar value of the user's conversion", or even "encrypted form of a unique ID of the user on the advertiser's side".

(b) The unprotected variables can include arbitrary data known to the publisher if it reaches the report's k-anonymity threshold. For example, suppose the publisher's ad network logs the behavior of a user just on the publisher's site, and remembers the IAB Tech Lab Content Taxonomy that each user interacted with the most. Levels 1+2 give 371 categories, but most sites concentrate in many fewer, so it's quite reasonable to think that many people's most-interacted-with-category-per-site would be moderately popular.

Now when a user visits the advertiser's page, what information will the granular report allow the advertiser to learn about this specific user? It seems to me that the advertiser gets to learn their favorite content taxonomies from all web sites where that user saw an ad for that advertiser.

That definitely seems like "an awful lot of information" to me. The behavior I describe is neither costly nor convoluted. Am I misunderstanding some kind of limit that would prevent this, or even make it unlikely?

@BasileLeparmentier
Collaborator

Hi Michael,

I think there is some misunderstanding here. Please excuse us if it originates from a lack of clarity on our side.

What I am about to say might have to be amended when incorporating the RTB House proposal. For the sake of this conversation, I assume that interest groups are defined as per TURTLEDOVE/SPARROW.

I think that the misunderstanding comes from a confusion about who has access to what information/variable.
Unprotected variables are a very specific (and likely small) subset of reported variables.
We defined them as: any available variable that cannot be used to link an ad to a publisher-side user_id. Those are variables that publishers have no way to obtain or infer, and that therefore cannot be used to identify a user on their website.
There are two types of variables that could fit this definition.

  • User-related variables that are by design unavailable to the publisher,
  • Technical variables about the display that do not rely on publisher data.

An example of the first type is the interest group, and an example of the second could be the background color of the ad (the gatekeeper will have to ensure that no publisher variables are used to define it; should any publisher variable be used, the variable would become protected).
One important point is that this report is made by the gatekeeper, which has very limited access to user data.
This means that the only user related unprotected variables I can think of are:

  • the interest group (grouping enough users so that we cannot tie it to a specific user),
  • an AB test ID,
  • Some variables like impression count to avoid flooding the user with the same ads.

All those variables are only available thanks to the browser and are therefore very limited by design.
The "technical" unprotected variables I can think of are:

  • The impression and click id,
  • The click and related metadata (click position for instance)
  • View information,
  • Potential technical variables that are randomized for AB testing without any publisher input (e.g. background color)

The gatekeeper has no access to things like "did the user convert?", or "dollar value of the user's conversion", or even "encrypted form of a unique ID of the user on the advertiser's side".
When a click and conversion happen, the advertiser might get the variables you are talking about (conversion, conversion value and potentially an encrypted unique id) and will be able to join them with the proposed report, but this is ONLY if there is a click. If there isn't, the advertiser and gatekeeper have no way of identifying a specific user inside an interest group, as only the browser has this information.
When a click occurs, the advertiser is able to join the user's activity on its site with the reports thanks to a click_id, transferred via the click url. This allows it to model the conversion flow accurately, and to compare the performance of different supply sources (hence the need for the origin publisher). Again, none of this is possible without a click.
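The click-only join described above can be sketched as follows. The field names, IDs, and values are hypothetical; the only element taken from the discussion is that a click_id carried in the click url is the join key, so report rows without a click cannot be matched to on-site activity.

```python
# Hypothetical Gatekeeper report rows: one per impression.
report = [
    {"impression": "imp-1", "publisher": "news.example", "click_id": "c-42"},
    {"impression": "imp-2", "publisher": "blog.example", "click_id": None},
]

# Hypothetical advertiser-side log, keyed by the click_id seen in the click url.
advertiser_log = {"c-42": {"converted": True, "conversion_value": 30.0}}


def join_on_click(report_rows, site_log):
    """Attach on-site activity to report rows; rows without a click are dropped."""
    joined = []
    for row in report_rows:
        click_id = row["click_id"]
        if click_id is not None and click_id in site_log:
            joined.append({**row, **site_log[click_id]})
    return joined


joined = join_on_click(report, advertiser_log)
print(joined)
```

Only the clicked impression ends up linked to a conversion; the unclicked impression on blog.example stays unattached to any advertiser-side user.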

The sentence below is thus inaccurate in the SPARROW reporting we envision.

"It seems to me that the advertiser gets to learn their favorite content taxonomies from all web sites where that user saw an ad for that advertiser"

To compare websites with train stations: the "advertiser" train station never gets to know all the stations the user went to in the past (as it is not its business). However, when the user arrives on the platform from a train they chose to board via a click, the train station knows where the train comes from.
We think that the user "agreeing" to the ad by clicking on it makes this small privacy leak acceptable, as it relies on an action on their side.
Is it clearer, or do you think I missed something?

@michaelkleber
Author

Thanks, Basile. You're right, I had been assuming that the "unprotected variables" alone were enough to join with a user identifier on the publisher site; I didn't realize that you thought this would only become possible after a click.

But the unprotected variables can join over all impressions shown in the same browser, right? The Gatekeeper-chosen "AB test ID" persists over time and across sites. And something like the "background color" could be randomly selected at ad serving time, and would likewise persist across impressions.

So it seems to me that if unprotected variables join up all impressions shown to a single browser, then one click would be enough for the advertiser to learn about all impressions shown in that browser, not just the one impression that was clicked on.

@BasileLeparmentier
Collaborator

Hi Michael,

What you are describing cannot be done in the current proposal, except by using the ABTestID. This is why we are proposing a strict limit on this ABTestID (we have proposed that it be a number between 1 and 10, mostly stable but with some random resets).

Let's assume, for example, that the gatekeeper wants to keep the background color for user_1 constant across websites, so that the advertiser can link all displays shown to user_1 for a specific interest group (IG) once a click occurs.
The gatekeeper never knows which particular user an ad request it receives is for; it only has access to the user's interest group. This scheme would therefore not work.

As long as there is more than one user in each ABTestID (and as there are only 10 different ABTestIDs, there should always be many users) this should not be possible within our design, and therefore a click should only give information on the impression the user clicked on.
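The difference between a coarse ABTestID and a per-browser value can be sketched with toy data. Everything here is hypothetical: 100 browsers, 3 impressions each, and an ABTestID derived from the browser index. The "browser" field is ground truth that the advertiser would not see; it stands in for a hypothetical unprotected variable that is unique per browser.

```python
def linkable(impressions, clicked, key):
    """Impressions sharing the clicked impression's value for an unprotected key."""
    return [imp for imp in impressions if imp[key] == clicked[key]]


# Toy data: 100 browsers, 3 impressions each, ABTestID between 1 and 10.
impressions = [
    {"browser": b, "site": f"site-{s}.example", "ab_test_id": b % 10 + 1}
    for b in range(100)
    for s in range(3)
]

clicked = impressions[0]  # the one impression the user clicked on

# Joining on the coarse ABTestID mixes impressions from many browsers.
by_ab_test = linkable(impressions, clicked, "ab_test_id")
browsers_by_ab_test = {imp["browser"] for imp in by_ab_test}

# Joining on a per-browser value isolates every impression of that browser.
by_unique_value = linkable(impressions, clicked, "browser")

print(len(by_ab_test), len(browsers_by_ab_test), len(by_unique_value))
```

In this toy data the ABTestID join returns 30 impressions spread over 10 browsers, so the click does not single out one browser's history, whereas a per-browser value returns exactly that browser's 3 impressions.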

The leak that you describe could happen if we were to add more variables on the user (e.g. by including the RTB House proposal, or by allowing additional information about the number of ads served, etc.). The proposal as it currently stands doesn't allow it.

To allow more information on the user to be transferred by the browser (e.g. with the RTB House proposal), we will indeed need to update the proposal accordingly.

K-anonymity at the user-feature level would be the way to go for this specific report: we would have two sets of protected features, publisher and advertiser, with the k-anonymity computed on different scopes.

This would need to be investigated more in-depth to make sure it would work.
