SimHash may leak information about aggregate traffic of specific publishers #90
Comments
Hi Bennett, thanks for writing this out, it's an interesting question. We posted some details of the locality-sensitive hash we're using in the first Origin Trial here: https://www.chromium.org/Home/chromium-privacy/privacy-sandbox/floc.
A couple of observed statistics seem potentially relevant to the attack you're considering. First, FYI each cohort is defined by an LSH prefix whose length is between 13 and 20 bits. Second, as you already knew, each cohort has at least 2000 people in it (forced by the clustering design). But also noteworthy is that each cohort has at least 735 different sets of domains mapping to it. This number is measured, not forced by the clustering, but it gives a sense of how much hash collision is going on.
Of course you're right that a Bayesian could write down a model that used the revealed bits to update priors. My intuition is that there is very little signal left after disentangling competitor.example from the many other likely-correlated contributions in the hundreds of browsing histories in even a single flock. And of course any single other domain that's correlated with either yours or competitor.example's would introduce bias, and removing that bias would require knowing an answer to the very problem that your attack was about trying to answer in the first place. (The "among different demographics" part seems harder to believe, though: wouldn't those also correlate with new sets of noise for each slice?)
Your conclusion "maybe it's so fuzzy as to be useless, but I think it's worth looking into" seems plausible; I would be happy to read an analysis. But overall, I suspect that you would get much better data by running a survey where you ask 1% of your users "What do you think of competitor.example?"
Thanks for the reply. I'll try to game this out a little more and post something that looks more like code -- I also think there's a good chance that this wouldn't be particularly useful.
I'm assuming that the site doing the learning is using something other than floc to get demographics. Say you're Facebook and you already know the self-reported gender of each user. You can easily segment the whole population into M/F/other and run this analysis on each segment individually.
you're probably right. but if this works, it's silent and free!
Hi @michaelkleber, could you please tell me where I can find the details of the chrome.1.1 algorithm?
Based on my tests, this version works differently from chrome.2.1. For example, I noticed that my browser received a FLoC ID after visiting only one website (from your link: "An individual browser instance's cohort is filtered if the inputs to the cohort id calculation has fewer than seven domain names."). Is there any way to test version chrome.2.1?
@millengustavo What did you do to get a If you want to start up your own browser in a way that filters cohorts until you've been on seven domains, you could change your command-line flags to include the string
I just followed the instructions on floc.glitch.me, running canary from terminal with flags:
After visiting any domain, this "chrome.1.1." version appears as my browser's FLoC cohort version.
Oh, that makes sense, thanks.
Just to add more clarification: the version is another parameter that is configurable server-side when we roll out a new configuration, and we will use an exclusive version for each new configuration. To see the correct version during testing, you can specify it on the command line under the same FederatedLearningOfCohorts feature, like:
Sounds related to #100
I'd like to point out that for any traffic statistics that might be exposed through this, the browser vendor (Chrome/Google) already has access to them. This scheme is inherently extracting a pattern from the history training data, and therefore that pattern must already exist in the training data.
Short version
Since SimHash floc IDs are just sums of vectors that correspond to individual domains, I think this version of floc could let large actors estimate traffic volume and aggregate demographics of visitors to other websites.
Related to #41 and #45, but I haven't seen this particular attack described yet.
disclaimer: I am not a chrome developer or a mathematician. if one of my assumptions here is off, please let me know!
Long version
The experimental version of FLoC uses SimHash, which is a deterministic mapping of browsing history -> floc ID. One of the project goals is to prevent sites/trackers from learning too much about any individual's browsing history. It should be impossible to use a single floc ID to determine with high likelihood whether a user visited a particular site. (longitudinal privacy is different, but leave that aside.)
But each floc ID will carry some information about the sites that are likely to make it up.
As best I can tell from here, SimHash in floc works like this:
At a high level, each site has its own floc vector. A user's floc vector is the sum of the floc vectors of all the sites they've visited, and the floc ID is a coarser version of that. You could also say each site has its own floc ID.
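To make the description above concrete, here is a minimal sketch of a SimHash-style computation. It is not Chrome's implementation: the 8-bit width, the SHA-256-seeded Gaussian vectors, and the names `site_vector`, `user_vector`, `to_bits`, and `simhash` are all assumptions chosen for illustration.

```python
import hashlib
import random

NUM_BITS = 8  # illustrative width only; not the real FLoC hash length


def site_vector(domain, num_bits=NUM_BITS):
    """Deterministic pseudo-random Gaussian vector for a domain (assumed scheme)."""
    seed = int.from_bytes(hashlib.sha256(domain.encode("utf-8")).digest()[:8], "big")
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(num_bits)]


def user_vector(domains, num_bits=NUM_BITS):
    """Sum the vectors of every distinct domain in a browsing history."""
    total = [0.0] * num_bits
    for domain in sorted(set(domains)):  # sorted so the sum order is fixed
        for i, x in enumerate(site_vector(domain, num_bits)):
            total[i] += x
    return total


def to_bits(vector):
    """Coarsen a vector to a bit string: 1 where the coordinate is positive."""
    return "".join("1" if x > 0 else "0" for x in vector)


def simhash(domains, num_bits=NUM_BITS):
    """Hypothetical floc-style ID for a set of visited domains."""
    return to_bits(user_vector(domains, num_bits))


if __name__ == "__main__":
    history = ["news.example", "shop.example", "competitor.example"]
    print("user ID:", simhash(history))
    print("site ID:", to_bits(site_vector("competitor.example")))
```

Under this sketch a history containing a single domain gets exactly that domain's ID, which is the sense in which each site "has its own floc ID".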
For each bit in a user's floc ID, and for each site they visited, there is a higher-than-50% probability that the bit in their floc ID matches the bit in the site's ID. For example, if you know a user visited a site with the 4-bit floc ID 1111, without knowing what else they visited, you know each bit in their floc ID is (slightly) more likely than not to be 1. Some sites might even have dramatic floc vectors -- with several vector values more than a couple standard deviations away from 0 -- which will have a higher impact on user floc IDs.
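A rough way to see how large that "higher-than-50%" edge is: assume (my simplification, not something the floc docs state) that every site's vector coordinate is an independent standard normal and that all sites count equally. Then for a user who visited n sites, the chance that one bit of the summed vector has the same sign as a given site's coordinate is 1/2 + arcsin(1/sqrt(n))/pi. The sketch below checks that formula against a simulation; the function names are made up for this example.

```python
import math
import random


def match_probability_theory(n_sites):
    """P(user bit == site bit) for n i.i.d. N(0,1) contributions (assumed model)."""
    return 0.5 + math.asin(1.0 / math.sqrt(n_sites)) / math.pi


def match_probability_sim(n_sites, trials=20000, seed=0):
    """Monte Carlo estimate of the same probability for a single bit position."""
    rng = random.Random(seed)
    matches = 0
    for _ in range(trials):
        coords = [rng.gauss(0.0, 1.0) for _ in range(n_sites)]
        target = coords[0]   # the site we know the user visited
        total = sum(coords)  # the user's summed coordinate
        if (target > 0) == (total > 0):
            matches += 1
    return matches / trials


if __name__ == "__main__":
    for n in (1, 2, 9, 50):
        print(n, round(match_probability_theory(n), 3), round(match_probability_sim(n), 3))
```

Under these assumptions the edge shrinks roughly like 1/sqrt(n), so it is small for users with long histories but never exactly zero.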
Now suppose you're the admin of a large site, and you see millions of floc IDs per day. You want to estimate how many of your readers also visit competitor.example. You might have an idea of competitor.example's traffic from a source like Alexa, which can serve as your prior belief.
Each floc ID you observe lets you perform a Bayesian update on your prior belief about how your readership overlaps with competitor.example. Say floc ID 11011 is slightly more likely than average to contain competitor.example, while ID 01100 is slightly less likely than average. Seeing a 11011 will boost your estimate of competitor.example's traffic, and an 01100 will deflate it. Each ID carries very little information, but millions of them could give you a pretty accurate idea of a specific site's volume.
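Here is a toy sketch of that update, to show the shape of the computation. Everything numeric in it is hypothetical: the two-ID world, the likelihood tables `VISITOR_TABLE` and `OTHER_TABLE`, and the simulated readership. In a real attack those per-ID likelihoods would themselves have to be estimated, which may be the hard part.

```python
import math
import random
from collections import Counter

# Hypothetical likelihoods: P(ID | reader visits competitor.example) and P(ID | does not).
VISITOR_TABLE = {"11011": 0.55, "01100": 0.45}
OTHER_TABLE = {"11011": 0.45, "01100": 0.55}


def posterior_over_overlap(observed_ids, p_id_given_visitor, p_id_given_other,
                           prior=None, grid_size=101):
    """Grid posterior over the fraction of readers who also visit the competitor."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    if prior is None:
        prior = [1.0 / grid_size] * grid_size  # flat prior; must be positive everywhere
    log_post = [math.log(w) for w in prior]
    for fid, count in Counter(observed_ids).items():
        pv, po = p_id_given_visitor[fid], p_id_given_other[fid]
        for k, p in enumerate(grid):
            # Each observed ID comes from a mixture of visitors and non-visitors.
            log_post[k] += count * math.log(p * pv + (1.0 - p) * po)
    top = max(log_post)
    weights = [math.exp(lp - top) for lp in log_post]
    norm = sum(weights)
    return grid, [w / norm for w in weights]


def posterior_mean(grid, posterior):
    return sum(p * w for p, w in zip(grid, posterior))


def simulate_ids(n_readers, true_overlap, p_id_given_visitor, p_id_given_other, seed=0):
    """Draw toy floc IDs for a readership with a known overlap fraction."""
    rng = random.Random(seed)
    ids = []
    for _ in range(n_readers):
        table = p_id_given_visitor if rng.random() < true_overlap else p_id_given_other
        ids.append(rng.choices(list(table), weights=list(table.values()))[0])
    return ids


if __name__ == "__main__":
    ids = simulate_ids(200000, 0.3, VISITOR_TABLE, OTHER_TABLE, seed=1)
    grid, post = posterior_over_overlap(ids, VISITOR_TABLE, OTHER_TABLE)
    print("posterior mean overlap:", round(posterior_mean(grid, post), 3))
```

A prior taken from an outside traffic estimate could be passed in through the `prior` argument in place of the flat default.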
If this works, you could also segment your own readership to figure out cross-traffic to competitor.example among different demographics. For example, U.S. readers of your site might be twice as likely to visit your competitor as other nationalities.
This would leak information about visitorship of all sites that are included in floc calculations. You could run experiments to find out just how accurate this method would be -- maybe it's so fuzzy as to be useless, but I think it's worth looking into.
This will also be a more valuable tool for actors who observe lots of traffic in lots of different contexts. Since floc only uses information about top-level frame navigations, it will only leak information about first-party traffic. Websites that don't own ad networks will reveal information about their traffic, while actors that receive lots of third-party requests will learn more information than they expose about themselves.