Protecting ML models #67
No definitive conclusion, but this was deemed difficult in the case of 3D assets, at least from a technical perspective: applications need to manipulate 3D assets in a number of ways for rendering, and it seems hard to split the code that touches these assets into a separate sandbox. It was also noted that 3D assets are not protected per se in native applications. The issue is particularly relevant in Web scenarios, where it is easier to extract them simply by going through dev tools. This may suggest a mode where content is not protected per se but "hidden" from users, which could perhaps also apply to ML models. That would be at odds with the usual ability to copy and paste web content, though. For 3D assets, some hacks (such as splitting assets into pieces and re-assembling the pieces on the fly) can be used to make extraction more challenging.
Indeed my thoughts at least were along the lines of:
Note the secureExecution attribute (or whatever it would end up being called), indicating that the browser should retrieve this special one-time link, download the resource, and store it in private memory for execution away from DevTools and the web page's JS scope.
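To make the idea a bit more concrete, here is a minimal sketch of what the page-side usage might look like; the secureexecution attribute, the one-time URL, and the browser behaviour described in the comments are all hypothetical, nothing like this exists today:

```js
// Hypothetical sketch only: the "secureexecution" attribute and the one-time
// download URL are made-up illustrations of the idea, not an existing API.
const loader = document.createElement('script');
loader.src = 'https://models.example.com/one-time/abc123'; // single-use link (assumption)
loader.setAttribute('secureexecution', '');                // hint: fetch into protected memory
document.head.appendChild(loader);
// The browser would then keep the downloaded bytes in private memory,
// out of reach of DevTools and regular page JavaScript.
```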
These are all just ideas I am riffing on right now, and I welcome further discussion from others as I am just brain-dumping some initial thoughts, so this is by no means a final proposition or anything like that :-) Would love to hear your thoughts on this topic though, as we get requests for this very frequently from users of TFJS, especially businesses. Adding @pyu10055 @dsmilkov @nsthorat @annxingyuan @tafsiri @lina128 for any thoughts related to this topic too (TF.js team) and for visibility.
thank you for bringing in your more detailed thoughts. I think your point 4 would be worthy of a separate dedicated issue (since it is overall independent of the question of protecting models) - would you mind creating one? A potential risk of bringing in this kind of black box is that we already know that e.g. WebAssembly is being abused for cryptomining, and making this harder to detect or debug would likely not be an improvement. This, I guess, links to some of the discussion in #72. Stepping back a bit from specific proposals, how is this being managed in the native space? Are there clear requirements on the level of protection that would be sought? (I assume that not all model providers would have the same requirements, but getting a somewhat clearer landscape of which kinds of model providers would have which kinds of requirements would help get a sense of the needs in this space.)
some discussion on this in the ML Loader API explainer /cc @jbingham
In essence, for TensorFlow.js at least, when you make an ML system you have:
1. pre-processing code (JS) that turns raw input into the form the model expects;
2. the trained model itself (its topology and weights);
3. post-processing code (JS) that turns the model's raw output into something meaningful for the app.
For this reason, all 3 parts of this would probably be desirable to execute in a secure context, which is why I was suggesting the JS being run in its own environment instead of just the model data being kept there. That way it is flexible for ML devs to reveal as much or as little as they are comfortable with. It also means that, if we are talking about JS, it can apply more generically to other areas too - e.g. authentication data, API keys, etc. for other forms of JS use cases, which could be kept securely away from the inspectable code of the web app.
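A rough sketch of the shape this could take from the page's side, assuming a purely hypothetical worker-like context whose script, memory, and loaded weights are opaque to DevTools; the file name, selector, and message format are made up for illustration:

```js
// Hypothetical: imagine a Worker variant whose script, memory, and loaded
// model weights cannot be inspected from the page or from DevTools.
const protectedWorker = new Worker('protected-inference.js'); // placeholder name

// Only the raw input crosses into the sandbox...
const img = document.querySelector('img#input'); // placeholder input element
createImageBitmap(img).then((bitmap) => {
  protectedWorker.postMessage({ bitmap }, [bitmap]);
});

// ...and only the final, post-processed result comes back out, so the
// pre-processing code, the weights, and the post-processing code stay hidden together.
protectedWorker.onmessage = (event) => {
  console.log('Prediction:', event.data.label, event.data.score);
};
```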
This could be similar to a content protection standard like DRM for video; the difference is that it needs a protected execution context (PEC):
From a developer's point of view:
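By analogy with how EME exposes key systems to pages (navigator.requestMediaKeySystemAccess), one could imagine the developer-facing side of a PEC looking roughly like the following; every name here is made up for illustration and no such API exists:

```js
// Purely hypothetical API, sketched by analogy with EME; none of these names exist today.
const inputTensor = new Float32Array(224 * 224 * 3); // example input (assumption)

const pec = await navigator.requestProtectedExecutionContext({
  licenseServer: 'https://license.example.com/models', // model owner's license server (made up)
});
const model = await pec.loadModel('https://models.example.com/encrypted/model.bin');
const result = await model.run(inputTensor); // only the final output ever leaves the PEC
console.log(result);
```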
Reflecting on yesterday's live session discussion, and on top of other considerations raised already, I think it would be useful to clarify how the lack of content protection would affect ML businesses, especially for people outside of ML circles (like me) who may struggle to evaluate the impacts. For instance, one possible way to present some of the media dimensions around content protection could be: movies may cost $300 million to produce. Most people view movies only once. A substantial portion of the income comes from the first few days/weeks of distribution, so an early leak has a huge impact. Also, the companies that distribute media content may not be the ones that produce it, and distributors are required per contract to protect content. There are most likely other dimensions to consider. Back to ML models, some possible questions:
As I noted in the discussion session, we need to consider serious privacy and security concerns here. If end-users are being asked to download and run code they can't inspect, and to permit it to send data they can't control, both open wide gaps in privacy and security expectations. The scope -- and hence the potential harms -- seems much less constrained than was the case for EME CDMs and encrypted video.
My thoughts: if one is gathering custom data for an ML model, it is certainly not unheard of for costs to reach hundreds of thousands of dollars to buy the time of many humans to do very niche tasks (especially if complex) and repeat that enough times to get a suitable quantity of high-quality data. Of course you could use something like Amazon Mechanical Turk for some tasks, but whilst this is cheaper, the quality of the results coming back may be lower, so you end up needing more repetition to weed out incorrect labelling and a lot of data sanitization, so it sort of balances out. Obviously this can fluctuate a lot depending on the complexity of the task at hand. Some may be much cheaper - especially if existing datasets exist or if the task is easy to understand and fast to do.

This cost is just for the data collection, however. You must then add the cost of training - which can take weeks if you have terabytes of data from that collection effort, using many servers in the cloud concurrently - plus the cost of hiring ML engineers to design and make the model (each developer is probably on a 6-figure salary in the main cities, and there may be multiple working on one project), and then of course the ongoing cost of optimizing and refining to improve and iterate. I could see the end cost to the company for a very robust and niche model easily hitting the millions, depending on the size of the task at hand and the complexity of the model, and this is why companies are so protective over such models.

Many production use cases for business right now are accessed via a Cloud API, but this is mainly because model security cannot be guaranteed in the browser, so a locked-down remote API is the only option for a business to give access to its model. But then of course you lose the benefits of executing client side - offline inference, privacy, lower latency, and potentially lower cost too, as less server usage is needed - just a CDN to deliver the model vs all the GPU/CPU/RAM you would otherwise need to run inference.

I think leaking a model is much like leaking a movie. If you have it, you can certainly distribute it, and the original owner has no control if it ends up on BitTorrent or whatever service it may get shared on, to then be downloaded by thousands of others. Of course, if the model could somehow be traced and verified to be an illegal copy, then legal action could be taken, but that is a slow and costly process in itself to prove, and is probably enough to put off smaller/mid-sized companies from getting into the situation in the first place - not to mention the legal system has not really caught up with the finer points of the ML industry yet either.

With regard to security, there may be some middle ground where only certain things are processed in this way - e.g. the sandboxed environment has no networking ability, some features of JS are disabled, etc. - so anything that could ultimately be sent is still inspectable: whatever is passed back from this black box is inspectable, and when inside the black box the only way out is by returning the result, which is then ultimately inspectable client side after it has run through all the weights of the model to be transformed. Though obviously this needs more thought as to what such a sandboxed environment would look like and whether it would still be useful enough to do the various operations required by models today.
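One existing building block along the "no networking inside the box" line (not a full protection solution, and only a sketch under that assumption): a classic worker fetched from its own URL is governed by the CSP delivered with the worker script itself, so the server could ship the inference worker with a policy that removes its network access. File names and the runModel helper below are placeholders:

```js
// inference-worker.js -- assumed to be served with the response header:
//   Content-Security-Policy: connect-src 'none'
// so fetch()/XHR/WebSocket calls made inside this worker are blocked,
// and the only way data leaves is via postMessage back to the page.
self.onmessage = (event) => {
  const { input } = event.data;
  const output = runModel(input); // placeholder for the actual inference
  self.postMessage({ output });   // the returned result stays fully inspectable by the page
};

function runModel(input) {
  // stand-in for real pre-processing + model execution + post-processing
  return input;
}
```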
Maybe @pyu10055 can chime in on what is needed to execute a model these days on the lower-level side of things, as I am not working at that level right now, and on whether any of those ops would need networking ability or any features that could be deemed insecure. One of the key points of bringing this client side is that data does not need to be sent to a server from device sensors, thus actually increasing privacy for the end user; and if the app does send anything over the network, that would happen outside the sandbox environment, so it would be inspectable as normal? I may have missed something here though, so feel free to let me know as it is rather late here :-)
thanks @jasonmayes for helping quantify the issue! During the live session, we discussed the potential usage of "split" models, where the latency/privacy-sensitive inferencing would be done on the client, while some of the IPR-critical pieces might be kept on the server. @pyu10055 suggested there had been work in that direction - could you share relevant pointers in this space? (In my superficial exploration, I've seen work on distributed/split learning, not so much on distributed/split inferencing.)
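To make the split-inference idea concrete, here is a rough sketch using real TF.js loading/prediction APIs but an entirely hypothetical split and server endpoint: the client runs the first, latency/privacy-sensitive layers locally and ships only the intermediate activations (not raw sensor data, not the protected weights) to a server holding the IP-critical layers. The URLs and the single-output "head" model are assumptions for illustration:

```js
import * as tf from '@tensorflow/tfjs';

// Hypothetical split: a public "head" model runs locally, the rest stays server-side.
const head = await tf.loadGraphModel('https://example.com/model-head/model.json');

async function splitInference(inputTensor) {
  // Run the privacy-sensitive early layers on-device (assuming a single-output head)...
  const activations = head.predict(inputTensor);

  // ...and send only the intermediate activations to the server,
  // which runs the remaining, IP-critical layers.
  const response = await fetch('https://api.example.com/model-tail/infer', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ activations: await activations.array() }),
  });
  return response.json(); // final prediction computed server-side
}
```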
@jasonmayes highlights the need from some ML providers to ensure their ML models cannot be extracted from a browser app.
This need is similar to what was raised by some media providers (which led, not without controversy, to the definition of Encrypted Media Extensions).
A similar need was also expressed at the Games workshop for 3D assets last year - @tidoust, was there any conclusion there that may apply here as well?
It would be useful to understand exactly what level of protection would be needed for what use cases, since these types of considerations are known to be both technically challenging and at odds with the role of the browser acting on behalf of the end-user.