Disallow the use of "OTHER" as a declared license #7836
Comments
@sschuberth I absolutely see your points. This was a request from our curation community. I'll let @ariel11 weigh in. |
@sschuberth and @fossygirl - ClearlyDefined uses several non-SPDX identifiers: OTHER, NONE, and NOASSERTION. These are further explained in the curation guidelines here. The challenge we have is that a fair number of projects/components do not have a license with an SPDX match. When there is license information but the tooling is not able to tell what the license is, it says NOASSERTION. We want a way to be able to say "hey, a human looked at this, there is license text/language here, but there's no SPDX identifier for it." That's when we use "OTHER." I agree, "OTHER" becomes a large bucket that can cover multiple scenarios: it could be for projects that say they are in the public domain, or projects with proprietary or commercial licenses, or projects that modified their license text so it no longer fits the SPDX matching guidelines. Another goal of the project is to resolve as many NOASSERTIONs as we can. Sometimes we look at these and are able to identify an SPDX identifier for the license, so we update the definition. Other times, we find license info but there's no SPDX match, so we put OTHER. If we just left these as NOASSERTION, we would not have a way to track which definitions we've already looked at. In my opinion, we need OTHER or an equivalent (or better) solution. Maybe we want to add a "public domain" option, knowing it does not equate to any specific text; rather, it just means the author(s) have dedicated their project in some fashion to the public domain. We've also talked about creating ClearlyDefined-specific extensions to SPDX. I think that would be potentially great. I think @pombredanne raised this idea. Thanks |
Please note that out of these only
I must say I'm having a bit of a hard time following that rationale. I guess when you say "tooling" you mean license scanners, like ScanCode. But in my view ClearlyDefined data is not supposed to correct license findings from tooling like ScanCode, but it's supposed to amend a license that was (or was not) declared as part of a software package's meta data. So, either there is license meta data in a software package, or there is not (at the example of NPM, either the license was filled out as part of If a license was declared in package meta data where there is no SPDX ID for it, in my view ClearlyDefined has two options: Not curate that license at all, or come up with a
Exactly, and in my particular example using data from ClearlyDefined ("OTHER") actually gives you less specific information than just looking at the original package ("Public Domain"). To be quite frank, I believe this is not acceptable for the goal that (I thought that) ClearlyDefined has.
I agree that we need an equivalent that is fully SPDX compliant.
Here, I disagree. Creating "extensions" to a standard that is not really meant to be dynamically extended is not a good idea IMO. But in my view we also don't need to do that, as SPDX already provides all the means to provide an equivalent to "OTHER" by using |
@sschuberth thanks for the feedback. As @ariel11 mentioned, the Are there other ways to do this? Absolutely. If this is a pain point, I think working with the engineering side of the project to meet both the human curator needs (essentially answering the question - have we looked at this before?) and preserve valid SPDX expressions might be helpful. Maybe a flag somewhere in the data? The A few responses to your specific points below. Happy to continue the conversation.
It does both - scan tools sometimes misidentify licenses or erroneously throw a
Some package ecosystems don't have good metadata, don't use metadata (like a git repo) or use a license link that is proprietary. The curators are not just solving for NPM, they're also solving for a swath of ecosystems - gits, NuGets, etc. As Ariel pointed out, you can review the curation community's process here. Like everything, it's not perfect. I know the community has been very happy to take feedback on that process.
I would love for SPDX to have a |
This is a great conversation. I'm a fan of using
IMO the best solution here is to use the SPDX text normalization and then hash in a standard way that all tools can use. Then all licenses can have a hash and some licenses will also have a more readable name/id as we see today in SPDX. |
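To make the "normalize, then hash" idea concrete, here is a minimal Python sketch. It is only an illustration under loud assumptions: the normalization below (lowercasing and collapsing whitespace) is a deliberately simplified stand-in and is not the SPDX matching guidelines, and the function names are made up for this example.

```python
import hashlib
import re


def normalize_license_text(text: str) -> str:
    """Simplified stand-in for real normalization: collapse whitespace, lowercase."""
    return re.sub(r"\s+", " ", text).strip().lower()


def license_text_hash(text: str) -> str:
    """Return the SHA-256 hex digest of the normalized license text."""
    normalized = normalize_license_text(text)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Two texts that differ only in case and whitespace get the same hash.
a = "This work is dedicated to the\n  Public Domain."
b = "this work is dedicated to the public domain."
print(license_text_hash(a) == license_text_hash(b))
```

A real implementation would have to handle far more than whitespace and case (punctuation variants, copyright lines, optional text), which is where the coordination with the SPDX community would come in.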
I totally agree here. But I'm not aware of any "SPDX text normalization". Does SPDX define an algorithm on how to normalize text? If so, would you have a link? I'd be very happy if you could implement this today rather than tomorrow. Because if we had this, we could avoid having |
Agreed. IIRC while not rocket science, it also was not quite as easy as trim, hash, go. There were lots of corners/edges to think about technically as well as coordination with the SPDX community and an aliasing strategy for when licenses eventually do get an ID. All to say, I doubt that it's going to happen soon unless some people pitch in and help drive. |
Hello everyone! I'm the new Microsoft Principal Engineer responsible for Clearly Defined. Lots of good discussion here; I'm going to attempt to summarize it as requirements. I would love your feedback on whether they sound correct or not.
Problem
Whenever possible, Clearly Defined matches project licenses with an SPDX license expression. However, Clearly Defined must sometimes curate projects that have license information, but the license information does not match an SPDX expression. In this case, our tooling currently defines the license as "NOASSERTION." When a project has a license identifier of "NOASSERTION" in the Clearly Defined database, human curators attempt to discover the license and manually update it to an SPDX identifier. When there is a matching SPDX identifier, the curator updates the license for the project in the Clearly Defined DB. However, there is not always a matching SPDX identifier. Sometimes this is because the project is using a non-OSI approved license, sometimes it is because the license is unclear, etc. In that case, the human curator needs a way to indicate that the project has been reviewed and no SPDX identifier match exists. At the moment, the human curator updates the project license to "OTHER" to indicate that a human has reviewed it, but no SPDX matching license identifier exists. The problem is that "OTHER" is not specified in the SPDX standard, which can cause confusion when Clearly Defined data is consumed by other tools.
Requirements for a Solution
Possible solution
I don't believe tracking non-SPDX licenses is within the scope of Clearly Defined at this time or within the near future. We could certainly look into doing it at some point, but this strikes me as a big undertaking - not necessarily on the technical level (that is pretty straightforward) but on the coordination level with OSI, other tools we use, etc. Something we could do is provide another way to indicate that a project has been reviewed by a human curator and confirmed to not have a matching SPDX license expression. I propose leaving those as "NOASSERTION", but adding a boolean field which indicates whether it has been reviewed by a curator, then surfacing that field wherever appropriate in the project. |
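As a rough illustration of the boolean-field proposal, a definition could carry the flag next to the declared license. This is a hypothetical Python sketch only: the `Definition` class and the `curator_reviewed` field name are invented for this example and are not part of the ClearlyDefined schema.

```python
from dataclasses import dataclass


@dataclass
class Definition:
    """Hypothetical, heavily reduced model of a definition."""
    coordinates: str
    declared_license: str = "NOASSERTION"
    curator_reviewed: bool = False  # hypothetical field name


def needs_human_review(definition: Definition) -> bool:
    """True only for NOASSERTION definitions that no curator has looked at yet."""
    return (
        definition.declared_license == "NOASSERTION"
        and not definition.curator_reviewed
    )


unreviewed = Definition("npm/npmjs/-/jsonify/0.0.0")
reviewed = Definition("npm/npmjs/-/jsonify/0.0.0", curator_reviewed=True)
print(needs_human_review(unreviewed), needs_human_review(reviewed))
```

With this shape the license value stays a valid SPDX value, and the "has someone already looked at this?" question is answered by the flag alone.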
Thanks Nell. Great summary of the situation. Your discussion triggered a thought that I'm not sure why we didn't have before. We can make up our own I like this over a separate boolean because it
|
BTW, the normalizer I was thinking is essentially an embodiment of https://wiki.spdx.org/view/Legal_Team/Templatizing/tags-matching which talks about the significance (or not) of various parts of a license text. Then the SPDX tools (e.g., https://github.com/spdx/tools/blob/master/src/org/spdx/tools/MatchingStandardLicenses.java) use that via https://github.com/spdx/tools/blob/master/src/org/spdx/compare/CompareHelper.java to normalize the text during templatization/compare. At least that's my understanding. |
On note on the |
While that, also to me, sounded like an elegant solution at first, it unfortunately does not solve the "data quality" problem I've mentioned initially. In ORT, we "blindly" consume curations from all our configured (=trusted) curation providers. That is, whenever we come across jsonify 0.0.0 (in this example) in an NPM project's dependency tree, the ORT analyzer already knows its license is "Public Domain", and we would pass that string down to ORT's evaluator. However, after consuming the ClearlyDefined curation, the license that the ORT evaluator sees is changed to "OTHER" (or "LicenseRef-CD.Other"). And from the perspective of someone who writes ORT policy rules, "Public Domain" is a meaningful string we can act on, but "OTHER" is not; we no longer even know that this is some public domain license. That's why I believe @nellshamrell's proposal of a separate flag is the better solution. But going one step back, I'm asking myself why ClearlyDefined even has a curation for jsonify 0.0.0 at all. Looking at the file it actually adds no valuable information (in terms of automating license compliance checks) and at least for ORT it would have been better if that curation wasn't there to begin with, so we wouldn't consume it, and thus not shade "Public Domain" with anything else. BTW, if you have some comments about how ORT is supposed to consume ClearlyDefined curations, or if you believe ORT currently does it in the wrong way, please tell me. ORT came up with its own concept of curations about the same time when ClearlyDefined started. So we ended up adding ClearlyDefined just as another (compatible) provider of curations for ORT. But maybe the concept of ClearlyDefined is not (fully) compatible with ORT's ideas after all. |
I'm likely missing something about the ORT scenario and how you are thinking about curated and non-curated data. For us, we don't really draw a distinction. Some definitions are completely automated, some had to be fixed up. Perhaps the gap here is looking at ClearlyDefined "curations" rather than just ClearlyDefined definitions. For example, one could view ClearlyDefined as a backstop such that where ORT can't figure things out, use ClearlyDefined. Or inverted. Or as more authoritative (e.g., if ClearlyDefined has different data then favor it). It might be that CD runs some tools that ORT does not and gets a "better" answer even without human curation. Put another way, how ClearlyDefined gets to a license determination is, to a certain degree, an implementation detail. From a licensing/compliance point of view NOASSERTION, NOASSERTION + curated=true, OTHER and LicenseRef-CD.Other are all the same. As a compliance officer, I need to do more work. The additional information carried by OTHER, the boolean, or LicenseRef-CD.Other is that a human did some more work and verified that indeed the license could not be figured out. As a compliance officer I can skip that part and dive right into what to do about my team using components with unknown licensing. Ideally we'd never have to use this and there would be proper ids for all the licenses and the tools or humans would be able to figure it out and assign an id. For example, Public Domain came up a few times there. In that scenario is the core issue that SPDX doesn't have a generic public domain identifier (at least I don't think it does) or the specific ones needed? What would/do you do differently if you see NOASSERTION with or without knowing that it is curated (recall that a curator can assign NOASSERTION as the value as well)? |
That sentence probably is the important bit, at least for me. I always believed that CD would "simply" be a database of human-curated metadata for software packages. Emphasis on "human" here because I'm not interested in metadata collected by some tool here. (Note that I'm not talking about scan results from a license scanner here, that's a totally different story.) And when I say "metadata", I primarily mean source code location and declared license. So, when the ORT analyzer has determined all transitive dependencies of a project, and then fails to download one of the source packages because the URL is wrong, ORT would query CD to see if it knows better where the source code is located. Similarly, if CD has a declared license for a package, we "blindly" take that one and it overrides the original license declared in the package's metadata, if any. That license is then basically fed into the ORT evaluator which checks against policy rules. Next, there might be rules that know how to handle a license of "Public Domain", as for a human (who has written the rules) there are semantics attached to that string. And that's where we run into problems when we let CD override "Public Domain" with "OTHER", because "OTHER" cannot be handled in rules in a meaningful way. That's why, sticking to the "Public Domain" example for jsonify, I actually would have expected there to be no curation at all for this package in CD, because CD simply cannot turn "Public Domain" into something that is a better / standardized representation of a Public Domain license, as there is no SPDX identifier for it. In this case, saying nothing would have been better than saying "OTHER", "NOASSERTION", or anything else that is less telling than "Public Domain". |
Ideally ClearlyDefined would never need human intervention. All the tools would run perfectly and discover the required info with 100% certainty and accuracy. It would still serve a purpose as a one stop shop for all that info that originated in disparate forms and locations. So it's better to look at ClearlyDefined as a "really good source of compliance info" rather than a place for curating the data. As it is, only a vanishingly small fraction of the 10+ M definitions in ClearlyDefined have human curations. In your scenarios it's still not clear why you draw a distinction between machine or human generated info. With the above stated goal, ClearlyDefined would have better and better tools, the input projects would be better and better, and fewer and fewer humans would be involved. The information is still (potentially) better than what you have. If ORT goes to ClearlyDefined for a source location or a declared license, does it matter to your scenarios if that was determined through automation or human intervention? As to the specifics of OTHER and Public Domain, if the value weren't OTHER, it would be NOASSERTION. Either way it's not Public Domain. This is a consequence of us deciding only to traffic in SPDX ids (with the now regrettable exception of OTHER). Since NOASSERTION and OTHER (or LicenseRef-CD.Other) are essentially ways of us saying "I dunno", would it make sense for you to filter those out and basically say, "If ClearlyDefined doesn't have a definitive answer, ignore them"? In essence that's our intention. NOASSERTION is a flag to humans saying "we don't know, you better figure it out". OTHER is a flag saying "we don't know and a human tried to figure it out but couldn't. You better look". I still prefer the LicenseRef approach over adding a boolean
Either way it seems you'll have code to the effect
|
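For readers following along, a consumer-side filter of the kind discussed in this exchange might look like the following hypothetical Python sketch. It is not ORT or ClearlyDefined code; the function name and the marker set are assumptions made up for illustration, and it simply ignores any ClearlyDefined value that amounts to "I dunno".

```python
from typing import Optional

# Values that carry no definitive license answer (assumed set, for illustration).
UNKNOWN_MARKERS = {"NOASSERTION", "OTHER", "LicenseRef-CD.Other"}


def effective_license(original: Optional[str],
                      clearly_defined: Optional[str]) -> Optional[str]:
    """Prefer the ClearlyDefined value unless it is missing or an 'unknown' marker."""
    if clearly_defined is None or clearly_defined in UNKNOWN_MARKERS:
        return original
    return clearly_defined


# "Public Domain" from the package metadata survives an "OTHER" curation.
print(effective_license("Public Domain", "OTHER"))
# A definitive ClearlyDefined answer still wins over missing metadata.
print(effective_license(None, "MIT"))
```

The same check works whether the "reviewed but unknown" state is spelled OTHER or as a LicenseRef, which is why the choice between those two mostly matters for SPDX validity rather than for the filtering itself.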
Indeed the data source would not matter in the end if the data gathered through tool automation was of the same quality as the data gathered by humans. But it isn't. The ORT analyzer already is an automation tool to gather data, but it fails sometimes, for example if no license is declared in package metadata, but only in prose on the project's home page. Sometimes it's really forensic effort to determine the license, and that currently requires a human. ORT is looking for such human-created curations to fixup its automatically determined metadata. We have our own ORT-specific database with human-created curations of high quality for that purpose, but I was hoping that CD would be another source in this regard.
Again my question is: Why does it have to be anything? Can't it be just nothing? I believe I start to grasp my own confusion here: CD seems to have the goal to "comment on" / "review" each and every software package out there, and needs a way to "mark it as reviewed". Whereas ORT's curation database only contains entries for those packages where metadata needed to be fixed up by a human in order to make it further processable by automation.
That's exactly what we started doing recently. |
We do have a goal of broad coverage (automated or manual). We do not have a goal of "reviewing". Curating is only done when it is needed and we really hope that it's not. The whole issue here is that we want to capture the work done when someone investigates a NOASSERTION state and is unable to resolve it, so that others don't repeat the work (unknowingly).
|
Hello again all! I was just catching up on the comments in this issue. Here is my summary of what we have established so far: License Expressions
Human Curations
Problems with our current approach
Proposed solutions
If a license cannot be determined, why can't it be nothing/undefined?
Can we add a boolean to a definition so, when its license is NOASSERTION, we can still tell whether it has been reviewed by a human?
Use a LicenseRef
My Current Conclusion
Using a LicenseRef seems to be the approach most in line with Clearly Defined's intentions - a machine curated definition should not have higher weight than a human curated definition. If we were to add a human_curated? boolean we would be implying that we expect all definitions to be human reviewed, which is not the case. Using a LicenseRef with Clearly Defined does need to be scoped out (and that is work I will do next if there are no objections; the work itself will likely not start until next year). However, should it be achievable, it would allow us to indicate that a definition has been reviewed and the license determined to be OTHER (rather than NOASSERTION) but still only use SPDX recognized expressions for licenses on definitions. Does this make sense? |
LGTM. Thanks for recapping all the various discussions. Off the top of my head, here are some of the points to be looked at in implementation
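To illustrate why a LicenseRef stays within SPDX syntax while a bare "OTHER" does not, here is a small Python sketch of a syntax check. It assumes the SPDX license expression grammar in which an idstring consists of letters, digits, "." and "-"; the helper name is made up, and this checks syntax only, not whether the reference is defined anywhere.

```python
import re

# idstring = 1*(ALPHA / DIGIT / "-" / "."), optionally prefixed by a DocumentRef.
LICENSE_REF = re.compile(
    r"(DocumentRef-[A-Za-z0-9.-]+:)?LicenseRef-[A-Za-z0-9.-]+"
)


def is_license_ref(identifier: str) -> bool:
    """Return True if the whole string is a syntactically valid license reference."""
    return LICENSE_REF.fullmatch(identifier) is not None


for candidate in ("LicenseRef-CD.Other", "OTHER"):
    print(candidate, is_license_ref(candidate))
```

Note that "OTHER" fails this check simply because it is not a LicenseRef at all; it is also not an identifier on the SPDX license list, which is the underlying complaint in this issue.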
|
Several curations use "OTHER" as the declared license, e.g.
curated-data/curations/npm/npmjs/-/jsonify.yaml
Line 8 in da3aff5
First of all, there is a general problem as "OTHER" is not a valid SPDX expression. Secondly, at the concrete example of jsonify, consuming the ClearlyDefined curation worsens the meta data from "Public Domain" as declared in its package.json to "OTHER", which is even less telling, and causes ORT (which has a mapping from "Public Domain" to "LicenseRef-scancode-public-domain-disclaimer") to run into issues.
That's why I'd like to propose to not use "OTHER" at all. What do you think @capfei @fossygirl?