Publish the meaning of cohorts to even the playing field, enable transparency features, and enable scrutiny of protection against sensitive targeting #104
Comments
(cue GDPR, Art. 15 and 22)
FLoC cohorts do not have any "meanings". They are simply projections onto randomly selected vectors. There is no meaning or interpretation of what they connote. Chrome is simply creating a vector space that represents browsing histories and performing a clustering algorithm to group similar browsing histories into randomly selected groupings based on randomly selected projection vectors. The closest thing to a "meaning" that I can think of would be a histogram. For each FLoC cohort, Chrome should be capable of producing a histogram which shows, for each of the browsers in that cohort, how often each domain appears in the browsing history (with differential privacy noise added of course, and with rare outliers removed... in fact, just the top 20 domains is probably all you really need to get some kind of an impression of it). Is this what you had in mind?
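To make the histogram idea above concrete, here is a minimal sketch of a per-cohort "top domains" summary. Everything in it is an assumption for illustration: the function names, the `min_count` outlier cutoff, and the Laplace noise scale are hypothetical, and this is not a rigorous differential privacy mechanism (a real one would have to bound how many domains each browser contributes and choose the threshold privately).

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    # The difference of two exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def cohort_histogram(histories, top_k=20, epsilon=1.0, min_count=2, seed=0):
    """Summarize one cohort as its top_k domains with noisy browser counts.

    histories: iterable of per-browser collections of visited domains.
    All parameters are illustrative, not taken from any FLoC specification.
    """
    rng = random.Random(seed)
    counts = Counter()
    for domains in histories:
        counts.update(set(domains))  # each browser counts a domain at most once

    noisy = {}
    for domain, n in counts.items():
        if n < min_count:
            continue  # drop rare outliers before publishing anything
        noisy[domain] = max(0.0, n + laplace_noise(1.0 / epsilon, rng))

    ranked = sorted(noisy.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]
```

A smaller `epsilon` adds more noise; the default `top_k=20` mirrors the "top 20 domains" suggestion in the comment.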
I too believe this is a desirable outcome. Although there is no "meaning" as such to the cohorts, here are a few things we could try to do, to try to detect cohorts that might be highly correlated with "sensitive" characteristics not detectable from browsing history alone (and which should be invalidated for this reason). Basically, we could try to perform the same type of "t-closeness" approach, looking at other types of sensitive data to see if FLoC cohort IDs might inadvertently be exposing such data.
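As a rough sketch of what such a check could look like, assume a hypothetical panel of `(cohort_id, sensitive_label)` pairs gathered outside the browser. The code flags cohorts whose label distribution differs from the overall distribution by more than a threshold `t`. Total variation distance is used here as a simple stand-in for the distance measure in t-closeness; the data shape, the threshold, and the function names are all assumptions.

```python
from collections import Counter


def distribution(labels):
    """Return the relative frequency of each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def variational_distance(p, q):
    """Total variation distance between two categorical distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def cohorts_violating_t_closeness(panel, t=0.1):
    """panel: list of (cohort_id, sensitive_label) pairs (hypothetical data).

    Returns the sorted cohort IDs whose label distribution is farther
    than t from the distribution over the whole panel.
    """
    overall = distribution([label for _, label in panel])
    by_cohort = {}
    for cohort_id, label in panel:
        by_cohort.setdefault(cohort_id, []).append(label)
    return sorted(
        cohort_id
        for cohort_id, labels in by_cohort.items()
        if variational_distance(distribution(labels), overall) > t
    )
```

Cohorts returned by this check would be candidates for invalidation under the approach described above.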
@benjaminsavage If cohorts are designed not to have any meaning, does this mean they are designed to have limited utility? If any cohort discovered to have meaning ("demographic skew" in your example) would be invalidated and made unavailable, exactly which marketer use case(s) are they meant to address? If I understand the proposal correctly, it seems they are not meant to address measurement, attribution or optimization use cases. If they are not meant to provide utility for focusing marketers' limited budgets when advertising across publishers, then what exactly are the success criteria by which we should be evaluating FLoCs?
@joshuakoran Because the cohort is available to all sites, the set of personal attributes revealed by the cohort has to be the intersection of all the sets of personal attributes that the user would choose to reveal to each site they visit. For example, a user might be willing to reveal A/S/L to an online fashion retailer, but not to a local blogger. But you don't have a separate cohort for shopping and blog reading, so FLoC ends up having to treat a personal attribute that is sensitive in any web context as a sensitive attribute. (Measurement, attribution and conversion tracking are handled by other systems.)
The original issue raised was to ensure FLoCs support a "level playing field." @dmarti your answer is about how FLoCs are generated, rather than which marketer use cases they are designed to support or even how well. I agree with @benjaminsavage that FLoCs are not intentionally designed to have any meaning, due in large part to the unsupervised clustering and large "crowd" of people grouped into each one. So to my open question, what exactly are the success criteria by which we should be evaluating FLoCs?
Not necessarily. I cannot speak from experience - having not myself tested the utility of FLoC. But in principle, even if cohorts all contain a roughly equivalent distribution of people (as measured by things like demographics) that doesn't mean this information will be irrelevant from the perspective of selecting a relevant ad. What will matter is to see if there is any correlation between membership in a given cohort and the likelihood to make a purchase on a given ad. Advertisers can simply try to run their ads to all people in all cohorts and see if there is a higher conversion rate from some cohorts compared to others. If there is, they can use that to "bid more" for the higher-converting cohorts and "bid less" for the lower-converting cohorts.
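The "bid more / bid less" idea can be sketched as follows. This assumes the advertiser has per-cohort impression and conversion counts from an untargeted campaign; the function name, the input shape, and the `floor`/`cap` limits are hypothetical choices, not part of any FLoC or ad platform API.

```python
def bid_multipliers(stats, floor=0.5, cap=2.0):
    """Scale bids by each cohort's conversion rate relative to the average.

    stats: {cohort_id: (impressions, conversions)} (hypothetical campaign data).
    Returns {cohort_id: multiplier}, clamped to the range [floor, cap].
    """
    total_impressions = sum(impr for impr, _ in stats.values())
    total_conversions = sum(conv for _, conv in stats.values())
    if total_impressions == 0 or total_conversions == 0:
        # No signal yet: bid the same everywhere.
        return {cohort: 1.0 for cohort in stats}

    baseline = total_conversions / total_impressions
    multipliers = {}
    for cohort, (impr, conv) in stats.items():
        if impr == 0:
            multipliers[cohort] = 1.0  # unseen cohort: keep the default bid
            continue
        ratio = (conv / impr) / baseline
        multipliers[cohort] = min(cap, max(floor, ratio))
    return multipliers
```

Note that this uses the cohort ID only as an opaque key: no "meaning" is needed for the optimization to work, only a measurable difference in conversion rates. A real system would also want to account for small sample sizes before trusting a cohort's rate.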
Well, the Google ads team used them to try to help with "interest based ads" shown on their 3rd party publisher ad network. So let's say you're NOT Google search, or Facebook. You're just a small publisher. You know little to nothing about the visitors to your website. How will you select a relevant ad to show them? One option is "retargeting ads", which Chrome is proposing TURTLEDOVE / FLEDGE to try to accomplish. The Google ads team ran a test that aimed to measure how much the performance of these "interest based ads" would suffer for publishers on their 3rd party ad network if instead of using full browsing histories, they just clustered all browsers into cohorts of thousands of people. It worked reasonably well as a replacement for that specific use case.
Agree this proposal doesn't help with measurement or attribution. It does help with a specific "optimization" use-case (the small publisher use-case outlined above).
@benjaminsavage There are also some interesting dynamic pricing use cases. Retail sites will likely be able to identify more or less price-sensitive and price-insensitive cohorts in order to optimize discount offers.
It doesn't have to be a histogram but some representation of the browsing history that forms the cohort. It could be based on the website categorization that underlies the t-closeness analysis or even higher level labels. Nit:
Seems like the vector space of browsing histories, the projection, and the clustering algorithm, if made public, would allow mapping a FLoC ID back to a set of projections in the vector space, and then inferring differences in likely browsing history for users with this FLoC ID relative to others, i.e. a meaning. Is there enough public information shared (either per spec, or in Chrome's current implementation) to do this kind of analysis?
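To illustrate the kind of analysis being asked about, here is a toy SimHash-style sketch. It assumes each bit of a cohort ID is the sign of a browsing history's projection onto a pseudo-random vector derived from a public hash. The hashing scheme, the 8-bit length, and the function names are assumptions for illustration and do not reproduce Chrome's implementation, which also applies further steps to keep cohorts large.

```python
import hashlib
import random


def domain_coordinate(domain, bit_index):
    """Deterministic pseudo-random Gaussian coordinate for (domain, bit)."""
    digest = hashlib.sha256(f"{bit_index}:{domain}".encode()).digest()
    seed = int.from_bytes(digest[:8], "big")
    return random.Random(seed).gauss(0.0, 1.0)


def simhash(domains, num_bits=8):
    """Bit i is set when the history's projection onto vector i is positive."""
    unique = sorted(set(domains))
    bits = 0
    for i in range(num_bits):
        projection = sum(domain_coordinate(d, i) for d in unique)
        if projection > 0:
            bits |= 1 << i
    return bits


def bits_agreeing(domain, cohort_id, num_bits=8):
    """Count the bits where this domain alone pulls toward the cohort ID."""
    agree = 0
    for i in range(num_bits):
        domain_bit = domain_coordinate(domain, i) > 0
        cohort_bit = bool((cohort_id >> i) & 1)
        agree += domain_bit == cohort_bit
    return agree
```

Under these assumptions, anyone who knows the hash can score every domain against a given ID with `bits_agreeing` and rank the domains that are most consistent with it, which is one way a public algorithm could yield a rough "meaning" for a cohort.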
I was highlighting the need for transparency of interpretations of cohorts in part because the algorithm that generates the cohort identifier from browsing history (or even from some other set of data, like a user selecting topics of interest) won't tell the user all that can or will be inferred about them from the identifier, even if the browser's code is open source. My expectation is that under a widely-deployed cohort system, some firms (especially in market research) will survey or otherwise gather various information from a panel of people with their cohorts, and then sell access to mappings of cohort identifiers to marketing categories. e.g. "to target Cooking Enthusiasts, buy ads for cohorts 12345, 45678, 98765 and 40404 for version chrome.1.1", or "women ages 25-34 are most heavily represented in cohorts 34567 and 87654". It definitely provides an ongoing incentive to gather data from a population of users to 1) enrich the cohort identifiers and 2) subdivide the cohorts (e.g. cohort 12345 merges two distinctive groups, but if this particular user has visited X.example, they're probably the first kind). The browser vendor publishing some data publicly (about the distinctive domains or other data sources for each cohort, or some market survey data) could help with review (by policymakers or researchers), transparency (to users) and lower the barrier to using cohorts for targeting, but I suspect it would always be incomplete.
Hi folks, sorry for the delay in joining into the conversation. It was a busy week for FLoC. I do think the discussion in @npdoty's #101 is of great relevance here. In essence, @johnwilander asked for "the browser vendor approving the cohorts to make their meaning public", while #101 is about asking the same thing from parties who want to use FLoC. These both seem like reasonable things to ask for, but perhaps ones where it's hard to know whether to be satisfied with the answer. Do you have thoughts on how to decide what constitutes a good or bad answer to the question of what a particular FLoC id means? I see that John floated the proposal that "the browser vendor [...] make all its own knowledge about cohort IDs public", but that doesn't seem plausible to me. That would make sense if mere aggregation across a cohort were enough to protect privacy, but in fact the privacy properties here depend on a lot more than aggregation alone.
Does this mean the privacy properties do not hold with respect to the browser vendor or other party that generates the IDs?
I'm not sure what you mean. Of course my browser knows all the URLs I've visited; the "History" menu in Chrome or Safari gives access to that information. But equally obviously, we can't make that information public, even on a cohort level. I'm just pointing out that the answer to "What should we make public about cohort X?" is definitely not "Everything the browser knows about everyone in cohort X."
Issue #101 argues that the browser should "offer transparency to users about cohort interpreted meanings." An even better way is for the browser vendor approving the cohorts to make their meaning public.
If the meaning of cohort IDs is not made public, these things seem to hold:
However, if the meaning of cohort IDs is made public, we'd get this:
I'd be surprised if listing the meaning of cohort IDs would be deemed sensitive in any way. If so, the whole premise of ad tech "deciphering" cohort IDs is equally sensitive and the privacy analysis doesn't hold up. It would then be "privacy by obscurity."
The only way to prove that the browser vendor believes in the privacy aspects of FLoC would be to make all its own knowledge about cohort IDs public.