Ad Topic Hints #26
Comments
I have been spending some time on this topic lately and reached a similar conclusion. We put together our early thoughts in this new bird spec: https://github.com/AdRoll/privacy/blob/main/PAURAQUE.md
Also some historical context for a company/browser extension called 'Bynamite' that attempted something similar: The extension would
There are some images here: https://bynamite.en.softonic.com/mac Last I checked, one of the authors went to work for Facebook and the other for Google I believe. https://www.nytimes.com/2010/07/18/business/18unboxed.html (1/2 way down the article)
The AdNostic paper (also from 2010) lays out different architectures which would reveal different amounts of information in order to provide behaviorally-targeted advertising. Proposals like this one would fall in the "Reveal the behavioral profile" category, and the authors speculated that a system could include ad interests as HTTP request headers. That's less revealing than disclosing the clickstream/browsing history (which many would consider the status quo), but more revealing than client-operated auctions. That's clearly not the only dimension of privacy that we should consider for these proposals: there's how disclosure of interests can be used for identification/fingerprinting; transparency and control for the end user; etc. Browser-provided UI to manage interests has the potential for more meaningful transparency and direct control. Rotating and fuzzing interests in order to limit fingerprinting is worth considering, although how effective that would be would require deeper analysis. Paper reference for those who haven't already seen it: Toubiana, Vincent, Arvind Narayanan, Dan Boneh, Helen Nissenbaum, and Solon Barocas. “Adnostic: Privacy Preserving Targeted Advertising.” In Proceedings Network and Distributed System Symposium, 2010. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2567076. |
Recap of feedback from June 24 Privacy CG
Thank you everyone for all of the valuable feedback, and the great discussion. Overall, the proposal was met with positivity. Many expressed a desire to put people more in control of the type of ads they see, and an interest in exploring user-stated / controlled interests. You can see all the feedback in the minutes: (link). In this comment I’d like to respond specifically to the feedback from Johann Hofmann and John Wilander. I am proposing a significant update to the proposal based on their comments.
Concern about default state
Johann expressed a concern about the proposal as currently written. When people take no action to configure their preferences, if automatic inferences are returned based on behavioural data, there is a risk of sensitive information disclosure. I think this is a fair point. The alternative is for the user agent to just return a fully random “Ad Topic Hint” when no explicit preferences have been configured. First let me say, I think it best to incorporate this feedback from Johann. As such, I’m amending this proposal to take the alternative route. So; when no ad preferences are configured, let’s assume the user agent returns fully random “Ad Topic Hint” vectors. I previously suggested behaviorally derived default behaviour in the original proposal because I’m concerned that if this API is returning random vectors for 95% of people, it will not be useful for the purpose of ad selection. As such, ad-tech vendors would just ignore this API, leading to low adoption and user frustration when their stated preferences then appear to have no control over the ads they see. However, we can try to devise alternative solutions to address this risk. Here are two ideas:
Null-state behavior
John Wilander asked that it be impossible to distinguish between browsers where people had, and had not curated a set of ad interests. I agree with this. The API should always return a vector, never null. If no better alternative is available, the API should just return a random vector. This should make sure websites cannot block people from using a website until they enter preferences.
You should be able to lie
John stated that people should be able to lie about themselves, specifying interests that are NOT in fact their own interests. I agree, and this is also aligned with my original intent in this proposal. This API does not aim to capture any information about a person’s attributes, characteristics, or behaviour. The only feedback people would be providing would be “More like this” and “Less like this” clicks on ads. The internal storage would just be two lists, a list of “Ads they provided positive feedback on” and “Ads they provided negative feedback on”. This is 100% within the user’s control. There is nothing to stop people from providing feedback which is NOT actually aligned with their own interests. If I dislike seeing ads for credit cards, I could still say “More like this” on every ad for a credit card I see in order to see what kind of credit card ads are out there on the internet.
On-device auction
John Wilander also asked if we could move to a fully on-device auction so that these ad topic hints never leave the device at all. I don’t think this is realistic. Ad networks are choosing between millions of ads. These cannot all be sent down to the device. My opinion is that it is OK for “Ad Topic Hints” to be saved by websites and associated with PII, so long as we can reach the design goal that these do not enable cross site tracking. This will require thoughtful design, and is critical to the proposal.
First of all, there is so much random noise inherent in the API that they do not even reveal the specific ads the person provided positive / negative feedback upon. Secondly, to the extent that these stated preferences are being shared, this is OK (in my opinion) because expectations should be clearly set with people that their preferences will be shared with websites for the purpose of selecting ads more to their liking. While the API is designed to share these only in approximate form to prevent fingerprinting attacks, the concept that the data is shared is aligned with user expectations.
Thank you!
Once again, thank you everyone for all the valuable feedback. I really appreciate all of the positive shows of support and interest in this concept. I’d like to discuss these changes to the proposal next Privacy CG to see if they address the last set of concerns, and to see what additional concerns people might raise.
Thanks for capturing my concern Ben! I agree that disallowing automatic inference by default is a good step, but as you said this drastically reduces the usefulness of the API, depending on how much the majority of end users actually choose to interact with the UI provided by browsers. Most "regular" folks I know aren't going to interact with nonessential UI unless they're being coerced through dark patterns. I'm genuinely curious whether there's a "sweet spot" for such a UI where we can ensure both informed user choice and massive usage. Regarding your point 2), is there data/research that shows that a significant number of users are interested in carrying over personalized advertisement choices? How can we verify that the ad data is truly from the user and not built by the website to improve their ad campaigns? Finally I think it would be helpful if the proposal laid out the noising/DP aspect of the API more clearly. Is noising achieved through adding random ads to the list or is it a side-effect of using embedding vectors somehow? Will the specification enforce a level of noising? As far as I can see the same incentives mentioned above also apply to noising, where a lower level of noise yields better results for advertisers. |
Ben, |
To what degree were you proposing that inferred or explicit user preferences should determine what advertising a user sees? Or, were you proposing that, potentially, a system of behavioral targeting that used embeddings and user input to incorporate user preference while limiting privacy risk could effectively compete against other methods of behavioral targeting, such as those relying on centralized records of user behavior, perhaps making the old methods unnecessary/obsolete? Also, wanted to endorse your suggestion of enabling plug-ins, so competing alternative solutions could solve problems without needing to rely on browsers to build all of the features and anticipate every need. Michael, ShareThis |
I assume it is implicit that this processing (behavior data into a feature vector) will not be enabled without explicit consent from the subject? And how is consent to be provably (in court) acquired if several users share the same device? |
I'd like to suggest an extension to the proposal, building on John Wilander's idea of an on-device auction and having a feature vector on each ad. Currently we are in a world of nested auctions (ad server, ad exchanges, SSPs, DSPs) which in concept return the ad that will pay the highest price to the publisher - in essence, a one-dimensional objective function that is something of a proxy to advertiser expected value, but seen through the lens of layers of ad tech and attribution games and general confusion and disinterest. Consider a different objective function: "choose the ad that pays a high-enough price to the publisher and maximizes user engagement." User engagement might be defined as ("more like this" + clicks* + lingers - 10 x "fewer like this"). * I am not sure we want "clickier" ads, but I'm thinking that an annoying or inappropriate ad will get downvotes that will outweigh the clicks. This enables a new model where the publisher returns the browser a list of say 100 ads. The browser takes this list of ads plus the ad topic hints data, runs a local ML model, and selects the ad that maximizes user engagement. If the model returns a low likelihood of engagement, the browser will not serve an ad at all, which prevents gaming and encourages publishers to send back a broad range of ads. The browser is now able to infer the topics that the user is interested in from the features in the ads, and this data is kept locally. The user can edit and manage this information in the browser as per the original proposal, but we don't need to send it to the ad tech ecosystem (where I worry that it gets fingerprinted and persisted). Here's where it gets kind of fun. The publisher now starts to build a model of what ads their users are actually choosing - effectively building an aggregated descriptor of their audience. 
When they want to get ads from ad networks, they can share this model (its coefficients or whatever) and use this model to select the ads to return to the browser. I am sure I am missing some things here, but I think this is a way to effectively change the objective function for advertising on the internet. To quote Seth Godin in his blog post today (which inspired this idea), "It’s beyond dispute that industry is an efficient way to produce more. The question is: More of what?" |
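To make the objective function in the comment above a bit more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the names `engagement_score` and `select_ad`, the candidate fields (`id`, `price`, `predicted`), and the thresholds are all hypothetical, and the "local ML model" is replaced by a simple lookup of made-up predictions.

```python
FEWER_LIKE_THIS_WEIGHT = 10  # the "10 x" penalty from the comment above


def engagement_score(more_like_this, clicks, lingers, fewer_like_this):
    """Observed engagement for one ad, using the formula from the comment."""
    return more_like_this + clicks + lingers - FEWER_LIKE_THIS_WEIGHT * fewer_like_this


def select_ad(candidates, predict_engagement, min_price, min_engagement):
    """Pick the eligible ad with the highest predicted engagement.

    Returns None (serve no ad) if no candidate meets the price floor, or if
    even the best candidate falls below the engagement threshold.
    """
    eligible = [ad for ad in candidates if ad["price"] >= min_price]
    if not eligible:
        return None
    best = max(eligible, key=predict_engagement)
    if predict_engagement(best) < min_engagement:
        return None
    return best


# Toy data: the "model" here is just a lookup of made-up predictions.
candidates = [
    {"id": "ad-1", "price": 2.00, "predicted": 0.10},
    {"id": "ad-2", "price": 0.50, "predicted": 0.90},  # below the price floor
    {"id": "ad-3", "price": 1.50, "predicted": 0.40},
]
chosen = select_ad(
    candidates, lambda ad: ad["predicted"], min_price=1.0, min_engagement=0.25
)
```

With these toy numbers `ad-2` is dropped for paying too little, and `ad-3` beats `ad-1` on predicted engagement. Raising `min_engagement` above 0.40 would make the browser show no ad at all, which is the anti-gaming behaviour the comment describes.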
In reply to @michaelsgorman
I am just proposing a new web-API, that I suggest we add to browsers. There are a lot of factors which might have an impact on the extent to which it is used in practice. I think it's hard to predict. One major factor might be the existence of alternatives. As browsers like Safari and Firefox take actions to curtail tracking, there might not be a lot of alternatives left.
In reply to @sandsmark
It's hard to follow this proposal in the current form (threaded conversation on a single issue). Perhaps soon I will move this proposal into an actual folder within this repo. But in a follow up comment I noted that in response to @johannhof's expressed concerns about the use of behavioral data, I was amending this proposal to NOT use any behavioral data, including in the initial default stage.
In response to @bokelley
This is an interesting idea, but out of scope for this proposal. I'd like to keep this proposal very simple: just a signal provided from the browser that is 100% driven by explicit user choices about what type of ads they would like to see. As for what ad-tech vendors do with that signal, and what ML algorithms they run, and what they optimize for, that's out of scope for now.
In response to @johannhof
I too hope we can find this "sweet spot"! The good news is, people will not need to provide feedback on all ads, or even a majority of ads for this to be useful. It's early days and hard to say, but if, over time, people provide feedback on a total of ~20 ads I suspect that will provide a pretty useful signal. Do you think that, over time, "regular" people might click a "relevant topic" or "not relevant topic" button on ~20 ads? There would be diminishing returns with more and more feedback, still useful, just not as incrementally so as those first few pieces of feedback. Unless people's interests change and they'd like to start seeing ads for different topics, they should be able to stop providing feedback after a certain point.
Today, I don't think any website collects this type of data, so there isn't the possibility of carrying it over. If such an API were available though, websites could design new experiences to collect feedback on ads (relevant topic / irrelevant topic). To ensure that this is truly indicative of user choices, and not just made up by the website, I'm proposing an API that shows the person the ads and tells them: "This website claims you told them this ad is a "relevant topic" for you. (Show picture of the ad inline) Is this correct?" I'm hoping that by involving the user in the flow, showing them the exact ad in a visual form, and asking for them to positively affirm that it is a representation of their preferences that we can prevent folks from gaming this.
I wrote some code =). That seems like an easier way to explain my idea than with words =).
This plot imagines that the person has provided positive feedback on 3 ads. Those 3 ads have the following embedding vectors:
It also imagines that the person has provided negative feedback on 2 ads. Those 2 ads have the following embedding vectors:
It also imagines that we will try to select a random "Ad Topic Hint" 20% of the time. (The level of "exploration" built into the system) This plot shows 500 "Ad Topic Hint" vectors generated from just these 5 pieces of feedback.
That's all that really matters I think. The specific algorithm isn't as important. Here is the super simple algorithm I implemented:
It's up to the browser to decide how often to give the website an updated "Ad Topic Hint" vector. Every page load? Every session? Perhaps time-limited? If you change the number in the for-loop to say 10, and keep refreshing the graph by saving, you'll see there is an awful lot of randomness. It would be very hard to uniquely identify someone from even 10 "Ad Topic Hints". My goal is for this signal to NOT be usable for fingerprinting. The hard-coded "0.05" factor I used as the "location parameter" in the Laplacian noise generation is just there for illustrative purposes. I imagine the selection of that specific factor would be up to the browser vendor. The 20% random vector selection as well is just a hard-coded constant for the purposes of this illustration. The default would be up to the browser vendor, but I think this is a good one to let people customize should they want. People might want more or less randomness in their ad selection. Again, the browser vendor could select limits on this that they felt were appropriate. |
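The original code and plot are not reproduced in this thread, so the following is NOT the author's implementation. It is a minimal Python sketch of one algorithm that fits the description above: with 20% probability return a fully random vector, otherwise start from a randomly chosen liked ad, add Laplacian noise, and reject results that land closer to a disliked ad. All function names, the rejection rule, and the 2-D toy embeddings are assumptions made for illustration. The sketch also assumes the 0.05 factor controls the spread (scale) of zero-centered noise.

```python
import math
import random

EXPLORE_PROBABILITY = 0.2  # share of hints that are fully random (illustrative)
NOISE_SCALE = 0.05         # spread of the Laplace noise (illustrative)


def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    if length == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / length for x in v]


def cosine(a, b):
    """Cosine similarity of two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def random_unit_vector(dim, rng):
    """Uniformly random direction: normalized Gaussian components."""
    return normalize([rng.gauss(0.0, 1.0) for _ in range(dim)])


def laplace_noise(scale, rng):
    """Zero-centered Laplace sample (difference of two exponentials)."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def ad_topic_hint(liked, disliked, dim, rng, max_tries=20):
    """Return one hint vector of length `dim`; never returns None."""
    if not liked or rng.random() < EXPLORE_PROBABILITY:
        return random_unit_vector(dim, rng)
    for _ in range(max_tries):
        base = normalize(rng.choice(liked))
        hint = normalize([x + laplace_noise(NOISE_SCALE, rng) for x in base])
        # Keep the hint only if it is closer to the liked ad it came from
        # than to every disliked ad.
        if all(cosine(hint, d) < cosine(hint, base) for d in disliked):
            return hint
    # Fall back to a random vector rather than returning nothing.
    return random_unit_vector(dim, rng)


# Made-up 2-D embeddings: 3 liked ads, 2 disliked ads, 500 hints.
liked = [[1.0, 0.2], [0.8, 0.6], [0.1, 1.0]]
disliked = [[-1.0, 0.0], [0.0, -1.0]]
rng = random.Random(0)
hints = [ad_topic_hint(liked, disliked, 2, rng) for _ in range(500)]
```

Because a browser with no feedback stored takes the same random path as an "exploration" draw, a single hint does not show whether the person has configured anything, which matches the null-state behaviour discussed earlier in the thread.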
I feel like there could be a place where PAURAQUE and Ad Topic Hints merge some of their better ideas to create an |
I'd think making this user configurable, with a default set by the browser, would create a useful tension. It takes one useful feature, discovery, and pits it against another useful feature, privacy. And since this is user configurable rather than always fixed by the browser, users can weight the trade-off in line with their own values if they so choose. So for example, if they set the randomness of their "Ad Topic Hint" to 100%, the browser generates a random "Ad Topic Hint" all of the time, and if it's set to 0% they get perfectly targeted ads based only on their preferences. The advantage here is that because this is user configured (but defaulted by browsers in a sane way - say 20% as you suggest) it also creates an interesting competition for advertisers that's no longer all-or-nothing (support the feature or don't). Instead, advertisers have to compete on their ability to deliver enough value that the user chooses not to keep their data private (although they still have the option to if they want to).
I like this proposal and am looking forward to us digging into it! I have a few things I'd like to note.
7: I think there is a particular issue here that remains unaddressed but is a serious one if we want to solve the algorithmic discrimination that can be present in ML-driven user-targeting systems. Specifically: how does a user get to see ads that are targeted in such a way that they are unlikely to ever see them through normal browsing, and that are therefore exclusionary through their absence and the opportunity the user never gets? The big examples here are job and real estate listings. If these ads are targeted to specific vectors and the user is never given the opportunity to acquire those vectors we are replicating the current bad state of algorithmic red-lining. However, I think we can take the lead from some previous work on this in the ad space. There's an opportunity here to not just lie to the algorithm when given the opportunity to do so on specific ads (as noted above) but to look at a way to emulate the behaviors and approvals of a very different ad targeting profile, and this is a clear opportunity derived from the suggestion above that sites be able to store and later push user choices of ads into the browser. I propose that we have explicit methods to suspend, switch, export, and import a user's ad topic hints from within their own browser, and to exchange them with others or with sites that might choose to request, share and create such "profiles" and make them available to others. This would allow a UX like Firefox's Track This project to work, or allow users to lie about their interests more effectively and with less friction. To be clear, I think that allowing users to lie intentionally and easily about their interests to drive different ad outcomes is and should continue to be a goal of this proposal. It would also allow users to easily switch between profiles depending on what they want presented to them at that moment and potentially acquire a profile they might not ever have presented to them due to other targeting factors.
Further, to add an additional solution on point 2, it would allow a user or site or browser to define specific categories of disliked ads and ask the user if they would like to accept a list of ads that would allow them to exclude themselves from targeting based on sensitive categories. This might be especially helpful for sites that are focused on sensitive categories, as they can work with their administrators and users in order to define a list of ads that would normally be targeted to them contextually and allow their users to opt out of being targeted on the basis of those ads. I think this makes sense as a further extension of the data-portability discussion above. |
Sharing data for a particular purpose (to select more relevant ads) is much narrower than accepting that data on ad preferences will be stored by every website, combined with other personally-identifying information and used for other purposes. We should absolutely consider the threat model of using this data for fingerprinting or re-identification and mitigate, but I also think that if we are designing an API for a very specific use and choosing to expose new and potentially sensitive information for that use, that we should make that use limitation explicit. There may be both technical and non-technical ways to enforce those limitations. |
What if we flipped digital advertising around?
Today, when you visit a website, each ad network roughly follows three steps:
What if we flipped the script entirely, to make the web more private, but also to put people in control:
This not only skips over the resolution of the user-identity step (which is poised to break in light of browser tracking prevention efforts), it also means the ad networks no longer need to keep a profile of behavioral data about individuals.
But perhaps most interesting of all, it moves that decision of “what ad topics would you be interested in seeing” into a place where people can exert control over the process.
Through a combination of sensible automatic defaults, with the opportunity for people to manually override the system (if they so desire) perhaps we can have both relevant ads about interesting topics, and also preserve user privacy and autonomy.
Addressing the risk of fingerprinting
People have multiple interests, and these interests change over time. What’s more, people don’t necessarily know what they like. An important function of ads is to help people discover new products and services that they’d love, but didn’t know about before.
As such, the “Ad Topic Hints” returned by the browser should change constantly. Some topics of interest may show up more frequently than others, and the user might express the desire to see other topics less. And finally, there ought to be some randomness thrown in - to mix things up and explore topics they haven’t seen before.
This is great news from a privacy perspective, because it means these “Ad Topic Hints” couldn’t be used as some kind of tracking vector, or fingerprinting surface. If the “Ad Topic Hints” returned by the browser include a lot of random variation and change over time, not only across sites, but even across multiple sessions on the same site, we should be able to ensure they can’t be used for fingerprinting. This is one of the major points of criticism about FLoC that this “Ad Topic Hints” proposal seeks to address.
Addressing the risk of sensitive information disclosure
These ad interests aren’t communicating data about what websites a person has visited, their attributes or characteristics. FLoC indirectly does this (to some extent), and this is another piece of criticism this proposal seeks to address. Since we’ve flipped the script, this proposed API would instead be sending out information about characteristics of ads, not people.
But perhaps more importantly, this API would, by design, provide the user with the ability to inspect (and if they so desire, override) the set of “Ad Topic Hints” their browser is telling sites to show to them. Any inferences being made about what ad topics their browser thinks they may find interesting would be clearly displayed. Rather than have the browser vendor determine what is “sensitive” or not, if the person felt that a given “Ad Topic” revealed something they didn’t want revealed, they could ask their browser to stop requesting ads of that topic.
Ad topics as vectors of numbers
Rather than describe an “Ad Topic” with a categorical label, we propose using a neural network to convert ads into embedding vectors (introductory explanation here if you're not familiar with the concept). This has numerous benefits. It’s precise, can be universally applied without the need for human annotation, smoothly captures concepts that don’t have simple names, works across all languages, and allows us to compute the numerical “similarity” of any two ads.
Imagine an open-source ML system into which you feed an ad. It analyses the image / video as well as text, and emits a list of 64 numbers. Like this:
1.56, -3.82, 3.91, -2.27, -7.16, …, 1.81
Anyone can run this code on any ad to instantly get the list of numbers that are the “embedding” for that ad. We can design a system which can deal with all kinds of inputs, so that it works for image ads, video ads, text ads, anything.
This ML system would be designed so that ads about similar topics result in nearby points. In this way, we can easily compare any two ads to see how “similar” they are. We just compute the cosine of the angle between these two vectors. It’s as simple as just computing the dot-product of both embedding vectors and dividing by both magnitudes. It’s computationally fast and cheap.
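As a minimal sketch of that comparison (plain Python, no dependencies), the function below computes the dot product of two embedding vectors and divides by both magnitudes. The function name and the short 4-dimensional toy vectors are made up for illustration; real embeddings in this proposal would have 64 numbers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 4-dimensional vectors standing in for real 64-dimensional embeddings.
ad_a = [1.0, 2.0, 0.0, -1.0]
ad_b = [2.0, 4.0, 0.0, -2.0]   # same direction as ad_a
ad_c = [-1.0, -2.0, 0.0, 1.0]  # opposite direction to ad_a
```

Vectors pointing the same way score close to 1, unrelated (perpendicular) vectors score close to 0, and opposite vectors score close to -1, regardless of their lengths.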
Now that we have a simple, standard way to understand the “topic” of an ad, and a way to compare the similarity of two ads, let’s describe how it would be used.
The browser can select a vector that’s “similar” to other ads the person finds interesting / relevant. It can avoid selecting vectors similar to ads for which the person has expressed dislike. And every now and again, the browser should just pick a random area it hasn’t tried before - to explore, and learn if the person is interested in that topic or not.
Sensible defaults
Most people will not want to take the time to interact with an interface that asks them about their ad interests. That’s fine, so long as we have a reasonable default for people who don’t bother to configure this themselves.
The browser can collect information about ads the person is shown across the web, ads they click on, and ad conversion events.
Based on this information, the browser can infer what ad topics the person seems to engage with (and what they do not).
Autonomy through centralized transparency and control
Unlike much behavioural advertising today, where inferences derived from behavioural data are often invisible and unknowable - the browser can make all of this available to the user. It can show them not only the inferred interests it has generated, but also the raw data used to generate that prediction.
This leads to the second big difference with most forms of behavioural advertising. The user may choose to modify or override these inferred interests.
The fact that these inferences are all centralised within the browser is what makes this a tractable user experience. It’s not realistic for people to identify all the ad networks which may be making inferences about them based on behavioural data. It’s even less realistic to imagine that people will modify / override these inferences across all those networks. Centralisation gives the user a realistic form of control.
This should also address concerns about “autonomy”. When it’s possible to see all the data, and all the inferences, and to override / modify them in one place, we can say that this puts people in control over the ads they want to see and what information their browser transmits about those interests.
What’s more, the browser should allow people to configure how much “exploration” they’d like. Some people might desire more variety, while others might prefer their browser to stick to a narrower range of ad topics.
This proposal isn’t prescriptive about the exact algorithm the browser should use to select the ad interest vector to be sent to a given page, as this should be a great opportunity for browser vendors to compete with one another, in terms of ease of use and relevance of ads, as well as ease of user understanding and control.
Ideas about ways to incorporate user-stated/controlled interests
Several important proposals about ads and privacy involve labeling ads in a way that the browser can understand. While these proposals are primarily about attribution / measurement use-cases, we could utilize this here as well.
Once a browser understands what pieces of content are ads, it could potentially introduce a universal control that allows people to tell the browser how they feel about the “Ad Topic” of that ad. Perhaps a “right click” or a long-press on mobile could reveal options like “More ads of this topic” or “Fewer ads of this topic”.
Another idea would be for the browser to have a special UI somewhere with an infinite feed of ads. These could either be a hard-coded list, or could be fetched through ad requests to networks that wanted to participate in such a UI. People could go through this “ad feed” selecting “More ads of this topic” or “Fewer ads of this topic” on each. This would help the browser quickly understand more about what this person did / didn’t want to see ads about.
There are no doubt many other ideas out there which merit experimentation. This is just the beginning of this conversation.
Concern about centralized browser control
But there are also downsides to this level of centralization within the browser. Browser vendors who operate “Search Ads” that rely on first-party data would be able to personalize ads with or without this “Ad Topic Hints” API. They wouldn’t have much incentive to make this system work particularly well (from the perspective of ad monetization). As such, they might under-invest in this “Ad Topic Hints” API.
How can we stimulate more competition in this space? One possible approach would be to make this API “pluggable”. Such browser plugins would need to be reviewed / vetted to ensure user privacy and stop abuse. Plugins would have access to the ad-interaction data described in the “sensible defaults” section as well as user feedback on ads, and could design their own user-interfaces as well as algorithms to generate the “Ad Topic Hints” returned.
Making “Ad Topic Hints” pluggable is just one idea. There may be even better solutions available.
Understanding Ad Topic Hints
Advertisers will naturally want to develop some understanding of these “Ad Topic Hints” and map them to concepts they already understand, like the IAB taxonomy of ad topics.
The easiest way to understand these “Ad Topic Hints” would be to take a sample of ads that represent all the various categories in the IAB taxonomy of ad topics, and run them through the ML system. Ideally one would produce mappings for multiple examples of each category.
Then, for any “Ad Topic Hint” vector, one could compare it to these reference points. A simple approach would be to just consider the topic of the ad with the “closest” vector. A more sophisticated approach might consider the actual “distance”. If the closest reference point is sufficiently far away, this may be an unlabelled part of the ad topic spectrum. We may discover that additional categories need to be added to existing taxonomy systems.
To help illustrate this mapping process, imagine these embedding vectors were just two dimensional. By coloring the space which is closest to a given reference point all the same color you’d wind up with a Voronoi diagram like this:
Image of a Voronoi diagram from Wikipedia
Imagine that each of those black dots represents a “reference ad” deemed to belong to a particular “Ad Topic” in the IAB’s taxonomy. Any “Ad Topic” vector would fall into one of these colored regions. A simple approach would be to deem that topic the same as the reference point within that region.
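The mapping described in this section can be sketched in a few lines of Python. This is only an illustration under assumptions: `label_hint`, the `min_similarity` threshold of 0.5, the 2-D vectors, and the category labels are placeholders, not part of the proposal or of any real taxonomy file. Similarity is measured with the same cosine measure described earlier, so "closest" here means "highest cosine similarity".

```python
import math


def cosine_similarity(a, b):
    """Dot product divided by both magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def label_hint(hint, reference_ads, min_similarity=0.5):
    """Return the label of the most similar reference ad.

    `reference_ads` is a list of (label, vector) pairs, so a category may
    have several reference ads. Returns None when even the best match is
    below `min_similarity`, i.e. the hint falls in an unlabelled region.
    """
    best_label, best_sim = None, -math.inf
    for label, vector in reference_ads:
        sim = cosine_similarity(hint, vector)
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label if best_sim >= min_similarity else None


# Placeholder reference ads in two dimensions.
reference_ads = [
    ("Automotive", [1.0, 0.0]),
    ("Automotive", [0.9, 0.1]),
    ("Travel", [0.0, 1.0]),
]
```

Returning `None` for far-away hints is one simple way to surface the unlabelled parts of the ad topic spectrum mentioned above, which might point to categories missing from an existing taxonomy.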