[Request for discussion] What software project metadata OMB should ask agencies for #117

konklone · 2016-04-08T20:00:22Z

(I’m Eric, an engineer at 18F, an office in the U.S. General Services Administration (GSA) that provides in-house digital services consulting for the federal government. I’m commenting on behalf of 18F; we’re an open source team and happy to share our thoughts and experiences. This comment represents only the views of 18F, not necessarily those of the GSA or its Chief Information Officer.)

The Implementation section of the policy asks agencies to inventory their open- and closed-source software projects, so that OMB and the public can increase the discoverability of agency software. This seems very similar to what agencies do with their datasets, in support of M-13-13 and Project Open Data.

These fields will be maintained by CIO shops, likely mostly manually. Our premise is that the fewer the fields, and the easier it is for staff to maintain the data by hand, the more complete and timely the data will be.

We wanted to throw out a first pass at the fields we think are the highest priority, and we're trying to stay on the minimal side. We'd love to hear others' thoughts about what fields should be in here.

At the overall listing level:

Date on which this listing was last updated (required, date)

And at the per-project level:

Name of project (required, string, free-form)
Access level (required, string, “public” or “non-public”)
Description of project (optional, string, free-form)
Date of creation (required, date)
URL (optional, URL)
License / copyright status (required, string, free-form but with suggested values)
- This can also be noting public domain status, or referencing a public domain dedication, so "license" isn't the perfect term.
License / copyright status outside the U.S. (required only if different than inside the U.S., string, free-form but with suggested values)
Point of contact (required, string, email address)
Which OMB exception is being used to justify closing the repository (required if applicable, string or number [depends how OMB structures it])
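As a rough illustration of how small this field list is, here is a minimal Python sketch of the listing and per-project records with a basic validator. The attribute names (`access_level`, `omb_exception`, and so on) are hypothetical, since the comment above gives only field descriptions, and the rule that a non-public project needs an exception is an assumed reading of "required if applicable".

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Project:
    # Attribute names are hypothetical; only the descriptions come from the proposal.
    name: str
    access_level: str                          # "public" or "non-public"
    created: date
    license: str                               # free-form, with suggested values
    contact_email: str
    description: Optional[str] = None
    url: Optional[str] = None
    license_outside_us: Optional[str] = None   # only if different from `license`
    omb_exception: Optional[str] = None        # required if applicable


@dataclass
class Listing:
    updated: date                              # date the listing was last updated
    projects: List[Project] = field(default_factory=list)


def validate(project: Project) -> List[str]:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    if not project.name.strip():
        problems.append("name is required")
    if project.access_level not in ("public", "non-public"):
        problems.append('access level must be "public" or "non-public"')
    if not project.license.strip():
        problems.append("license / copyright status is required")
    if "@" not in project.contact_email:
        problems.append("point of contact must be an email address")
    # Assumption: "required if applicable" means "required when non-public".
    if project.access_level == "non-public" and not project.omb_exception:
        problems.append("non-public projects must cite an OMB exception")
    return problems
```

A record with only the five required fields filled in passes `validate`, which is the point of keeping the list short.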

This is just a starting point, and there are surely things we haven't thought of, or have discounted. However, we are definitely of the mind that fewer fields will produce more complete results.

benbalter · 2016-04-10T17:03:22Z

I'd also suggest the project metadata include a brief, one- to two-sentence description. Given that the open source community loves creative names, having the metadata of "GMan" does not aid discoverability of the project as much as "A Ruby gem to detect government domains" does.

konklone · 2016-04-10T17:15:17Z

We meant to include that field, actually -- not sure how it got dropped. I went and edited our above comment to include:

Description of project (optional, string, free-form)

jhourcle · 2016-04-12T05:18:38Z

Language(s) used -- will affect if you have the expertise to extend it.
Some sort of classification of the projects -- to allow people to more easily filter out stuff they don't care about, or find related projects. (scientific vs. administrative vs. general use, desktop vs. server vs. website ... stuff like that)

Also see discussions in #132 .

Also note that 'URL' is prone to link rot and is ambiguous. At the very least, I could see:

URL to obtain the software
URL to a website describing the software / project / whatever.
URL(s) (or DOIs) to publications & presentations about the software.
URL to documentation of the software.
URL to an example installation of the project (if web-based and publicly accessible)

(insert disclaimer here about these being personal comments, and not those of the agency I work for, blah blah blah).

philipashlock · 2016-04-15T21:59:01Z

I don't think it's seen broad adoption outside the context it's used by European governments for this purpose, but I should note ADMS.SW as a precedent for this. ADMS.SW builds on related schemas including DOAP, SPDX, ISO 19770-2, ADMS, and the Trove software map. cc: @makxdekkers

JJediny · 2016-04-19T03:26:24Z

Per related comments #116 and #40:

In Issue #40 @dsmorgan77 states:

FAR Subpart 27.4 defines "Data" as follows: “Data” means recorded information, regardless of form or the media on which it may be recorded. The term includes technical data and computer software. The term does not include information incidental to contract administration, such as financial, administrative, cost or pricing, or management information.

Source code is information structured in a way that computers can interpret, so I agree that source code, as structured information, is data. Given that, I suggest ProjectOpenSource start with a minimal but extendable/semi-optional metadata schema based on W3C's Data Catalog Vocabulary (DCAT). The most recent attempt to codify metadata for software was the European Commission's ADMS v2.0, which was drastically updated to remap to DCAT; ProjectOpenData's data.json v1.1 is likewise based on DCAT. Basing the TBD code.json schema on DCAT would ensure a controlled vocabulary and high levels of interoperability, and would promote semantic web / linked data efforts. Other metadata standards/specs/schemas reviewed and referenced included:

| Org | Schema |
| --- | --- |
| European Commission | Asset Description Metadata Schema (ADMS) v2.0 |
| World Wide Web Consortium (W3C) | DCAT |
| OpenKnowledgeFoundation | Datapackage |
| Code for America | civic.json and Team API Project |
| 18F | about.yml and Team API |
| NIST | Specification for Asset Identification v1.1 & Asset Summary Reporting (ASR) |

Beyond the general suggestion to use the DCAT controlled vocabulary, I suggest the specific use of three fields:

dcat:Identifier - should be a UUID / GUID if the record is the original source project (i.e., not forked); if the project is a fork or a derived work, the field should use a URI to the source repository.
dcat:describedBy - should reference the TBD JSON schema at the dataset/source code level (as in the DCAT hierarchy), not at the aggregate catalog level. This ensures that future projects can reference newer versions of the schema without requiring updates to past records, which will eventually reference a superseded version of the TBD schema.
dcat:conformsTo - should be used to encourage *extending* the TBD shared (i.e., common core) schema. This secondary schema would allow agencies/programs to extend and/or override enumerated lists/options. Allowing for a custom/extendable schema also addresses concerns that specific/unique data points (i.e., fields) with specific meaning/purpose to particular agencies/orgs can be carried along with a minimal common core that ensures interoperability.
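To make the three fields concrete, here is a minimal Python sketch of how a single record might carry them. The key names simply drop the `dcat:` prefix from the names used above, and both schema URLs are placeholders, since no code.json schema had been defined at this point in the thread.

```python
import json
import uuid
from typing import Optional

# Placeholder URL: the shared ("common core") schema was still TBD.
CORE_SCHEMA = "https://example.gov/schema/code/v1.0/schema.json"


def make_record(name: str,
                upstream_uri: Optional[str] = None,
                extension_schema: Optional[str] = None) -> dict:
    """Build one per-project record carrying the three suggested fields."""
    record = {
        "name": name,
        # Original projects get a fresh UUID; forks and derived works
        # point at the repository they came from.
        "identifier": upstream_uri or str(uuid.uuid4()),
        # Set per record, so older records can keep citing the schema
        # version they were written against.
        "describedBy": CORE_SCHEMA,
    }
    if extension_schema:
        # Optional agency-specific schema that extends the common core.
        record["conformsTo"] = extension_schema
    return record


if __name__ == "__main__":
    fork = make_record(
        "example-fork",
        upstream_uri="https://github.com/example/upstream",
        extension_schema="https://example.gov/schema/agency-extension.json",
    )
    print(json.dumps(fork, indent=2))
```

Keeping `describedBy` on each record rather than on the catalog is what lets old and new records coexist in one listing.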

ADMS v 2.0 mapped to DCAT:

ADMS v2.0 Overview

ceefour · 2016-06-30T08:06:30Z

jbjonesjr · 2016-08-10

Other formats

Description of a Project was an XML/RDF based format meant to better describe and document software (specifically open source projects). The format itself may not be of much use to you, but the logic and reasoning behind the schema may be of use to this project.
ISA², Developing and sharing IT solutions for the EU, is another attempt at this. This is meant to be able to go across the EU, so similar sharing and discovery problems.
@ceefour above already mentioned the NSF-funded Codemeta as something you might be interested in taking a look at.

It acknowledges that there are already a very large number of metadata standards for describing data & software in the academic community and it's an attempt to make all of them inter-operable.

In-house with this sort of stuff: GSAblog: Category Management

Maybe not a format or file at all...

Why use a metadata file when you can get much of this same information from a project's pre-existing files and the source control host itself?

Native formats are a huge advantage: you can collect metadata from language/platform-specific package files (Gemfile, package.json, pom/ant.xml, build.sh, etc.). These will often provide a wealth of metadata about the project. Similarly, https://libraries.io has done a bunch of work in this space. Much like self-documenting code is a huge advantage for understanding, self-documenting projects can be a big help in this space as well.
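As a small illustration of the package-file idea, here is a Python sketch that pulls inventory-style fields out of a package.json manifest. The output key names and the sample manifest are made up for the example; `name`, `description`, `license`, `homepage`, `repository`, and `keywords` are standard package.json fields.

```python
import json


def metadata_from_package_json(text: str) -> dict:
    """Extract inventory-style fields from the text of a package.json file."""
    pkg = json.loads(text)
    repo = pkg.get("repository")
    # npm allows "repository" to be a plain string or an object with a "url" key.
    if isinstance(repo, dict):
        repo = repo.get("url")
    return {
        "name": pkg.get("name"),
        "description": pkg.get("description"),
        "license": pkg.get("license"),
        # Prefer the project homepage; fall back to the repository location.
        "url": pkg.get("homepage") or repo,
        "keywords": pkg.get("keywords", []),
    }


# Hypothetical manifest, used only to exercise the function.
SAMPLE = """
{
  "name": "example-project",
  "description": "An illustrative package manifest",
  "license": "CC0-1.0",
  "repository": {"type": "git", "url": "https://github.com/example/example-project.git"},
  "keywords": ["example", "metadata"]
}
"""

if __name__ == "__main__":
    print(metadata_from_package_json(SAMPLE))
```

Other ecosystems would need their own small readers (a Gemfile or pom.xml is not JSON), so this covers only one of the formats listed.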

In general, there's little to no widespread adoption of a software metadata file within repos outside of aforementioned package files.

Instead of defining facts in a file, you could let the repository provide some of this information for you (and of course a plug for the advantages of using a modern version control system as well to house these projects ;) ).

Some examples of how the GitHub API can provide more details for you:

### Repo host
Implied by GitHub URL, provided by our API: https://developer.github.com/v3/repos/#get

### Title
Provided by GitHub UI/API: https://developer.github.com/v3/repos/#get

### License
Provided by GitHub UI/API: https://developer.github.com/v3/licenses/

### Language
Provided by GitHub UI/API: https://developer.github.com/v3/repos/#list-languages

### Last update
Provided by GitHub API: https://developer.github.com/v3/repos/#get

### Keywords
Pull general information about a project from the description or the README with keyword matching (APIs: https://developer.github.com/v3/repos/#get https://developer.github.com/v3/repos/contents/#get-the-readme)
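The mapping above can be sketched in a few lines of Python. This is only an illustration: the output key names are hypothetical, `fetch_repo` makes an unauthenticated (rate-limited) call to the public "get a repository" endpoint, and `to_inventory_fields` reads only keys that endpoint returns (`name`, `description`, `html_url`, `license`, `language`, `pushed_at`, `private`).

```python
import json
import urllib.request


def fetch_repo(owner: str, repo: str) -> dict:
    """Fetch one repository's metadata from the GitHub REST API."""
    url = f"https://api.github.com/repos/{owner}/{repo}"
    req = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github.v3+json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def to_inventory_fields(repo: dict) -> dict:
    """Map a repository API response onto inventory-style fields."""
    # "license" is null in the response when GitHub detects no license file.
    license_info = repo.get("license") or {}
    return {
        "repo_host": "github.com",
        "title": repo.get("name"),
        "description": repo.get("description"),
        "url": repo.get("html_url"),
        "license": license_info.get("spdx_id"),
        "language": repo.get("language"),       # primary language only
        "last_update": repo.get("pushed_at"),   # time of the most recent push
        "access_level": "non-public" if repo.get("private") else "public",
    }
```

Calling `to_inventory_fields(fetch_repo(owner, name))` for each repository in an organization would produce a first-cut listing without anyone filling in a form; the full per-language breakdown needs the separate list-languages endpoint linked above.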

JJediny · 2016-08-15T16:09:28Z

Related to #116

While I'm still for an established metadata standard as a single .yml file, I agree it should first be made practical, and I agree with @jbjonesjr's KISS approach as a first cut of the proudly public repos/orgs... though it wouldn't work for self-hosted or externally managed source code repositories (Bitbucket or GitLab, for example). I put in some work a few months ago to compare some of these KISS-based inventories already out there for US Federal OSS:
https://gist.github.com/JJediny/9bbaaafdfe205d184cbb902d86c82f74

But I failed to include one of the better lists/approaches in...

Govcode - Git Repo
GovCode.org

API: https://api.govcode.org

working off of Github

Repositories: https://api.govcode.org/repos
Organizations: https://api.govcode.org/orgs
Users: https://api.govcode.org/users
Stats: https://api.govcode.org/stats
Issues: https://api.govcode.org/issues

skybristol · 2016-08-15T22:40:38Z

+1 to the KISS and start with a simple registry approach discussed here from @jbjonesjr and @JJediny. There's a ton of stuff we can get in terms of inventory information through simple deliberate registration in some fashion of the official repos for Federal agencies. Connecting the dots to the budget line item Programs that will put things in context for OMB below the agency/bureau/office/etc level will be a little challenging but could be an interesting linked data problem hooking together GitHub and other repo accounts and identifiers.

USGS (where I'm from) has a couple of orgs here, USGS and USGS-R, that contain our officially sanctioned repos. A few smart people working together to bring those most visible projects together into a registry/index of some kind would go a long way to baselining where we are and to get the ball rolling. It would be a heck of a lot better than the usual gov fare of data calls, forms, and (gasp!) spreadsheets.

You might also check out some recent work from @yolandagil and others on something called OntoSoft under the NSF EarthCube project that has been working on software registries and documentation methods for scientific software, specifically. They've worked up some interesting tools for introspecting a repository and providing a report on its viability for reuse.

rafael5 · 2016-08-27T04:56:53Z

The largest web search and discovery engines (Google, Yahoo, Microsoft, Yandex) have collaborated on creating a Linked Data web schema such that all search engines can index and semantically search all structured data on the web using a single common schema.

This is at https://schema.org

Recommendation: adopt the W3C Linked Data standard (JSON-LD) for the code.gov software catalog using the schema.org metadata model for Software Application:

https://schema.org/SoftwareApplication

Note that the U.S. Department of Veterans Affairs, U.S. National Library of Congress, and many other federal agencies and knowledge organizations have already adopted the W3C Linked Data standard and JSON-LD as their metadata standard, making them fully W3C compliant for web search engines. See, for example the VA's VISTA Data Project, which is all JSON-LD based:

http://vistadataproject.info
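For a sense of what the JSON-LD recommendation would look like per project, here is a minimal Python sketch that emits a schema.org SoftwareApplication description. The project values are placeholders; `name`, `description`, `url`, `license`, and `dateCreated` are properties that SoftwareApplication inherits from schema.org's Thing and CreativeWork types.

```python
import json


def to_json_ld(name: str, description: str, url: str,
               license_url: str, date_created: str) -> str:
    """Serialize one project as a schema.org SoftwareApplication in JSON-LD."""
    doc = {
        "@context": "https://schema.org",
        "@type": "SoftwareApplication",
        "name": name,
        "description": description,
        "url": url,
        "license": license_url,
        "dateCreated": date_created,   # ISO 8601 date
    }
    return json.dumps(doc, indent=2)


if __name__ == "__main__":
    # Placeholder values, not a real project.
    print(to_json_ld(
        name="example-project",
        description="An illustrative catalog entry",
        url="https://example.gov/example-project",
        license_url="https://creativecommons.org/publicdomain/zero/1.0/",
        date_created="2016-04-08",
    ))
```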

mattbailey0 · 2016-08-29T17:44:04Z

Closing - please move further discussion to GSA/code-gov-web#30

This was referenced Apr 12, 2016

[meta] Consider extending the comment period? #155

Closed

[Request for discussion] How agencies should inventory their software #116

Closed

This was referenced Jun 30, 2016

Should we crosswalk to ADMS.SW? codemeta/codemeta#41

Closed

(JSON-LD) Metadata for software discovery mozillascience/code-research-object#15

Open

david-a-wheeler mentioned this issue Aug 10, 2016

Examine other potential ways to get data about OSS projects ossf/census#41

Open

mattbailey0 mentioned this issue Aug 29, 2016

Required content: Metadata schema to help agencies fill out enterprise code inventory (7.2) GSA/code-gov-web#30

Closed

mattbailey0 closed this as completed Aug 29, 2016

philipashlock mentioned this issue Oct 12, 2016

[Request for Discussion] Software inventory metadata schema and inventory collection GSA/code-gov-web#41

Open

wslack mentioned this issue Jul 26, 2017

Comment on OMB Source Code Policy: [Request for discussion] What software project metadata OMB should ask agencies for #117 18F/tts-public-comments#15

Open
