This repository has been archived by the owner on Mar 3, 2022. It is now read-only.

[Request for discussion] What software project metadata OMB should ask agencies for #117

Closed
konklone opened this issue Apr 8, 2016 · 11 comments

Comments

@konklone
Contributor

konklone commented Apr 8, 2016

(I’m Eric, an engineer at 18F, an office in the U.S. General Services Administration (GSA) that provides in-house digital services consulting for the federal government. I’m commenting on behalf of 18F; we’re an open source team and happy to share our thoughts and experiences. This comment represents only the views of 18F, not necessarily those of the GSA or its Chief Information Officer.)

The Implementation section of the policy asks agencies to inventory their open- and closed-source software projects, so that OMB and the public can increase the discoverability of agency software. This seems very similar to what agencies do with their datasets, in support of M-13-13 and Project Open Data.

These fields will be maintained by CIO shops, likely mostly manually. Our premise is that the fewer the fields, and the easier it is for staff to maintain the data by hand, the more complete and timely the data will be.

We wanted to throw out a first pass at the fields we think are the highest priority, and we're trying to stay on the minimal side. We'd love to hear others' thoughts about what fields should be in here.

At the overall listing level:

  • Date on which this listing was last updated (required, date)

And at the per-project level:

  • Name of project (required, string, free-form)
  • Access level (required, string, “public” or “non-public”)
  • Description of project (optional, string, free-form)
  • Date of creation (required, date)
  • URL (optional, URL)
  • License / copyright status (required, string, free-form but with suggested values)
    • This can also be noting public domain status, or referencing a public domain dedication, so "license" isn't the perfect term.
  • License / copyright status outside the U.S. (required only if different than inside the U.S., string, free-form but with suggested values)
  • Point of contact (required, string, email address)
  • Which OMB exception is being used to justify closing the repository (required if applicable, string or number [depends how OMB structures it])

This is just a starting point, and there are surely things we haven't thought of or have discounted. However, we are definitely of the mind that fewer fields will produce more complete results.
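A minimal sketch of the field list above as a data structure, written in Python. The class and attribute names are hypothetical, and the rule that a "non-public" project must name an OMB exception is an assumption about how "required if applicable" might be enforced, not something the proposal specifies.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional

ACCESS_LEVELS = {"public", "non-public"}


@dataclass
class Project:
    # Required fields from the proposed per-project list.
    name: str
    access_level: str            # "public" or "non-public"
    created: date
    license_status: str          # free-form; may also note public domain status
    contact_email: str
    # Optional or conditionally required fields.
    description: Optional[str] = None
    url: Optional[str] = None
    license_status_outside_us: Optional[str] = None  # only if it differs from the U.S. status
    omb_exception: Optional[str] = None              # only if the repository is closed

    def problems(self) -> List[str]:
        """Return human-readable validation problems (empty list if none)."""
        found = []
        if not self.name.strip():
            found.append("name is required")
        if self.access_level not in ACCESS_LEVELS:
            found.append("access_level must be 'public' or 'non-public'")
        if "@" not in self.contact_email:
            found.append("contact_email must be an email address")
        # Assumption: every non-public project must cite an exception.
        if self.access_level == "non-public" and not self.omb_exception:
            found.append("omb_exception is required for non-public projects")
        return found


@dataclass
class Listing:
    # The single listing-level field, plus the projects themselves.
    last_updated: date
    projects: List[Project] = field(default_factory=list)
```

Keeping validation to a short list of plain-language problems fits the premise that the data will mostly be maintained by hand.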

@benbalter

I'd also suggest the project metadata include a brief, one- to two-sentence description. Given that the open source community loves creative names, metadata of "GMan" does not aid discoverability of the project as much as "A Ruby gem to detect government domains" does.

@konklone
Contributor Author

We meant to include that field, actually -- not sure how it got dropped. I went and edited our above comment to include:

  • Description of project (optional, string, free-form)

@jhourcle

  • Language(s) used -- will affect if you have the expertise to extend it.
  • some sort of classification of the projects -- to allow people to more easily filter out stuff they don't care about, or find related projects. (scientific vs. administrative vs. general use, desktop vs. server vs. website ... stuff like that)

Also see discussions in #132.

Also note that 'URL' is ambiguous and prone to link rot. At the very least, I could see:

  • URL to obtain the software
  • URL to a website describing the software / project / whatever.
  • URL(s) (or DOIs) to publications & presentations about the software.
  • URL to documentation of the software.
  • URL to an example installation of the project (if web-based and publicly accessible)

(insert disclaimer here about these being personal comments, and not those of the agency I work for, blah blah blah).

@philipashlock

I don't think it has seen broad adoption beyond its use by European governments for this purpose, but I should note ADMS.SW as a precedent. ADMS.SW builds on related schemas including DOAP, SPDX, ISO 19770-2, ADMS, and the Trove software map. cc: @makxdekkers

@JJediny

JJediny commented Apr 19, 2016

Per related comments #116 and #40:

In Issue #40 @dsmorgan77 states:

FAR Subpart 27.4 defines "Data" as follows: “Data” means recorded information, regardless of form or the media on which it may be recorded. The term includes technical data and computer software. The term does not include information incidental to contract administration, such as financial, administrative, cost or pricing, or management information.

Source code is information structured so that computers can interpret it, so I agree that source code, as structured information, is data. To that end, I suggest ProjectOpenSource start with a minimal but extendable/semi-optional metadata schema based on W3C's Data Catalog Vocabulary (DCAT). The most recent attempt to codify metadata for software was the European Commission's ADMS v2.0, which was drastically updated to remap to DCAT; ProjectOpenData's data.json v1.1 was likewise based on DCAT. Basing the TBD code.json schema on DCAT would ensure a controlled vocabulary and high levels of interoperability, and would promote semantic web / linked data efforts. Other metadata standards/specs/schemas reviewed and referenced included:

Org Schema
European Commission Asset Description Metadata Schema ADMS v 2.0
World Wide Web Consortium's (W3C) DCAT
OpenKnowledgeFoundation Datapackage
Code for America's civic.json and Team API Project
18F about.yml and Team API
NIST Specification for Asset Identification v1.1 & Asset Summary Reporting (ASR)

Beyond the general suggestion to use the DCAT controlled vocabulary, I suggest the specific use of three fields:

dcat:Identifier - should be a UUID / GUID if the record is the original source project (i.e. not forked); if the project is forked or a derived work, the field should use a URI to the source repository
dcat:describedBy - should reference the TBD JSON schema at the dataset/source code tier/level (as in the DCAT hierarchy), not at the aggregate catalog level, so that future projects can reference newer versions of a schema without requiring updates to past records that will eventually reference a superseded version of the TBD schema.
dcat:conformsTo - should be used to encourage the extending of the TBD shared (i.e. common core) schema. This secondary schema would allow agencies/programs to extend and/or override enumerated lists/options. Allowing for a custom/extendable schema also addresses concerns that specific/unique data points (i.e. fields) with specific meaning/purpose to any specific agencies/orgs can be carried along with a minimal common core that ensures interoperability.
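As a rough illustration of how those three fields might sit on a single record, here is a Python sketch. The schema URL, the function name, and the plain `identifier` / `describedBy` / `conformsTo` keys are assumptions made for illustration; the code.json schema is still TBD in this thread.

```python
import uuid
from typing import Optional

# Hypothetical location of the shared (common core) schema.
CORE_SCHEMA = "https://example.gov/code-schema/v1.0/schema.json"


def make_record(name: str,
                forked_from: Optional[str] = None,
                extension_schema: Optional[str] = None) -> dict:
    """Build one project record carrying the three suggested fields."""
    record = {
        "name": name,
        # Original projects get a fresh UUID; forks and derived works
        # point at the source repository by URI instead.
        "identifier": forked_from or str(uuid.uuid4()),
        # The schema reference lives on each record, not on the catalog,
        # so older records can keep naming the schema version they used.
        "describedBy": CORE_SCHEMA,
    }
    if extension_schema:
        # Agency-specific schema that extends the common core.
        record["conformsTo"] = extension_schema
    return record
```

For example, `make_record("example-fork", forked_from="https://github.com/example/upstream")` yields a record whose identifier is the upstream URI rather than a new UUID.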

[Image: ADMS v 2.0 mapped to DCAT]

[Image: ADMS v2.0 Overview]

@ceefour

ceefour commented Jun 30, 2016

See also Mozilla Science's CodeMeta:

@jbjonesjr

jbjonesjr commented Aug 10, 2016

I'm Jamie Jones, a Solutions Engineer at GitHub :octocat: supporting the Federal Government. I'm a big fan of this thing called GitHub, and a big fan of sharing code and making it open source. The opinions within are my own, but ideas are often from many conversations I've had around the community (both OSS, Govt, and commercial). Before coming to GitHub, I wrote software for the govt, and would have loved to not make YAMA (Yet Another Map App) but instead be able to reuse some better code.


Of course, the most important question to answer about an inventory mechanism is: does it help agencies with similar needs find each other's projects and collaborate? But close behind it, you need to ensure it can grow as your requirements evolve and is maintainable going forward. So it's a balance between being complete and not being too much of a burden. Not all problems are technical.

When we look at metadata done by other Open Source projects, a simple example that comes to mind is Netflix's OSSMETADATA standard, used to describe the status of a project. More details can be found in this presentation

Other formats

  • Description of a Project (DOAP) was an XML/RDF-based format meant to better describe and document software (specifically open source projects). The format itself may not be of much use to you, but the logic and reasoning behind the schema may be of use to this project.
  • ISA², Developing and sharing IT solutions for the EU, is another attempt at this. It is meant to work across the EU, so it faces similar sharing and discovery problems.
  • @ceefour above already mentioned the NSF-funded CodeMeta as something you might be interested in taking a look at.

It acknowledges that there are already a very large number of metadata standards for describing data & software in the academic community and it's an attempt to make all of them inter-operable.

Maybe not a format or file at all...

Why use a metadata file when you can get much of this same information from a project's pre-existing files and the source control host itself?

  • Native formats are a huge advantage: collect metadata from language/platform-specific package files (Gemfile, package.json, pom/ant.xml, build.sh, etc). These will often provide a wealth of metadata about the project. Similarly, https://libraries.io has done a bunch of work in this space. Much like self-documenting code is a huge advantage for understanding, self-documenting projects can be a big help in this space as well.

In general, there's little to no widespread adoption of a software metadata file within repos outside of aforementioned package files.
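To make the package-file idea concrete, here is a small Python sketch that pulls inventory-style fields out of a package.json file. The sample file contents and the output field names are made up for illustration; other package formats would each need their own reader.

```python
import json

# Inline stand-in for a package.json file; every value here is hypothetical.
SAMPLE_PACKAGE_JSON = """
{
  "name": "example-tool",
  "description": "A hypothetical tool used only for illustration",
  "license": "CC0-1.0",
  "homepage": "https://example.gov/example-tool",
  "repository": {"type": "git", "url": "https://github.com/example/example-tool.git"},
  "keywords": ["example", "inventory"]
}
"""


def metadata_from_package_json(text: str) -> dict:
    """Extract inventory-style fields from the text of a package.json file."""
    pkg = json.loads(text)
    repo = pkg.get("repository")
    # npm allows "repository" to be a plain string or an object with a "url".
    repo_url = repo.get("url") if isinstance(repo, dict) else repo
    return {
        "name": pkg.get("name"),
        "description": pkg.get("description"),
        "license": pkg.get("license"),
        # Prefer the project homepage; fall back to the repository URL.
        "url": pkg.get("homepage") or repo_url,
        "keywords": pkg.get("keywords", []),
    }
```

Calling `metadata_from_package_json(SAMPLE_PACKAGE_JSON)` returns a dictionary with the name, description, license, URL, and keywords already filled in, with no hand-maintained metadata file involved.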

  • Instead of defining facts in a file, you could let the repository provide some of this information for you (and of course a plug for the advantages of using a modern version control system as well to house these projects ;) ).

Some examples of how the GitHub API can provide more details for you:

### Repo host
Implied by GitHub URL, provided by our API: https://developer.github.com/v3/repos/#get

### Title
Provided by GitHub UI/API: https://developer.github.com/v3/repos/#get

### License
Provided by GitHub UI/API: https://developer.github.com/v3/licenses/

### Language
Provided by GitHub UI/API: https://developer.github.com/v3/repos/#list-languages

### Last update
Provided by GitHub API: https://developer.github.com/v3/repos/#get

### Keywords
Pull general information about a project from the description or the README with keyword matching (APIs: https://developer.github.com/v3/repos/#get https://developer.github.com/v3/repos/contents/#get-the-readme)
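A rough Python sketch of the mapping above, using only the standard library. The output field names are hypothetical, and the choice of `pushed_at` for "last update" and the license's `spdx_id` for "license" are assumptions; the input keys are ones returned by the repository endpoint of the GitHub REST API.

```python
import json
import urllib.request


def fetch_repo(owner: str, repo: str) -> dict:
    """Fetch repository details from the GitHub REST API (unauthenticated)."""
    url = f"https://api.github.com/repos/{owner}/{repo}"
    request = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def repo_to_metadata(repo: dict) -> dict:
    """Map a repository API response onto inventory-style fields."""
    # "license" is null in the response when GitHub detects no license.
    license_info = repo.get("license") or {}
    return {
        "repo_host": "github.com",
        "title": repo.get("name"),
        "description": repo.get("description"),
        "url": repo.get("html_url"),
        "license": license_info.get("spdx_id"),
        "language": repo.get("language"),    # primary language only
        "last_update": repo.get("pushed_at"),
    }
```

Usage would look like `repo_to_metadata(fetch_repo("some-org", "some-repo"))`. Unauthenticated requests are rate limited, so a real inventory crawl would likely need a token.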

@JJediny

JJediny commented Aug 15, 2016

Related to #116

While I'm still for an established metadata standard as a single .yml file, I agree it should first be made practical, and I agree with @jbjonesjr's KISS approach as a first cut of the proudly public repos/orgs... though it wouldn't work for self-hosted or externally managed source code repositories (Bitbucket or GitLab, for example). I put in some work a few months ago to compare some of these KISS-based inventories already out there for US Federal OSS:
https://gist.github.com/JJediny/9bbaaafdfe205d184cbb902d86c82f74

  • But I failed to include one of the better lists/approaches in...

Govcode - Git Repo
GovCode.org

API: https://api.govcode.org

working off of Github

@skybristol

+1 to the KISS and start with a simple registry approach discussed here from @jbjonesjr and @JJediny. There's a ton of stuff we can get in terms of inventory information through simple deliberate registration in some fashion of the official repos for Federal agencies. Connecting the dots to the budget line item Programs that will put things in context for OMB below the agency/bureau/office/etc level will be a little challenging but could be an interesting linked data problem hooking together GitHub and other repo accounts and identifiers.

USGS (where I'm from) has a couple of orgs here, USGS and USGS-R, that contain our officially sanctioned repos. A few smart people working together to bring those most visible projects together into a registry/index of some kind would go a long way to baselining where we are and to get the ball rolling. It would be a heck of a lot better than the usual gov fare of data calls, forms, and (gasp!) spreadsheets.

You might also check out some recent work from @yolandagil and others on something called OntoSoft under the NSF EarthCube project that has been working on software registries and documentation methods for scientific software, specifically. They've worked up some interesting tools for introspecting a repository and providing a report on its viability for reuse.

@rafael5

rafael5 commented Aug 27, 2016

The largest web search and discovery engines (Google, Yahoo, Microsoft, Yandex) have collaborated on creating a Linked Data web schema such that all search engines can index and semantically search all structured data on the web using a single common schema.

This is at https://schema.org

Recommendation: adopt the W3C Linked Data standard (JSON-LD) for the code.gov software catalog using the schema.org metadata model for Software Application:

https://schema.org/SoftwareApplication
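As a sketch of what such a catalog entry could look like, the Python below builds a JSON-LD document typed as a schema.org SoftwareApplication. The function name and the sample values are hypothetical, and the choice of properties (`name`, `description`, `url`, `license`, `dateCreated`) is an assumption about which schema.org terms an inventory would need.

```python
import json


def to_json_ld(name: str, description: str, url: str,
               license_url: str, date_created: str) -> dict:
    """Describe one project as a schema.org SoftwareApplication in JSON-LD."""
    return {
        "@context": "https://schema.org",
        "@type": "SoftwareApplication",
        "name": name,
        "description": description,
        "url": url,
        "license": license_url,
        "dateCreated": date_created,   # ISO 8601 date string
    }


if __name__ == "__main__":
    # Hypothetical entry, printed the way it might appear in a catalog file.
    entry = to_json_ld(
        name="example-tool",
        description="A hypothetical tool used only for illustration",
        url="https://example.gov/example-tool",
        license_url="https://creativecommons.org/publicdomain/zero/1.0/",
        date_created="2016-08-27",
    )
    print(json.dumps(entry, indent=2))
```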

Note that the U.S. Department of Veterans Affairs, U.S. National Library of Congress, and many other federal agencies and knowledge organizations have already adopted the W3C Linked Data standard and JSON-LD as their metadata standard, making them fully W3C compliant for web search engines. See, for example, the VA's VISTA Data Project, which is all JSON-LD based:

http://vistadataproject.info

@mattbailey0
Contributor

Closing - please move further discussion to GSA/code-gov-web#30


10 participants