Add extensible support for handle systems and metadata #56

Merged 13 commits on Sep 14, 2023

@J535D165 (Owner) commented Sep 12, 2023

This PR adds a new API for handles, adds metadata extraction with citations in 3000+ citation styles, and refactors the API. Documentation will follow; in the meantime, the examples below show the main features.

Get metadata from resource

import datahugger

# resolve the service from a DOI
service = datahugger.info("10.7910/DVN/KBHLOD")

# get the Citation Style Language metadata of the DOI class in the resolved service
service.resource.metadata.cls()
{'type': 'dataset',
 'id': 'https://doi.org/10.7910/dvn/kbhlod',
 'author': [{'family': 'Han', 'given': 'Hyemin'}],
 'issued': {'date-parts': [[2017]]},
 'DOI': '10.7910/DVN/KBHLOD',
 'publisher': 'Harvard Dataverse',
 'title': 'Markov-Learning',
 'URL': 'https://dataverse.harvard.edu/citation?persistentId=doi:10.7910/DVN/KBHLOD'}
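
The returned value is a plain dictionary in CSL-JSON layout, so individual fields can be read with ordinary dictionary access. The sketch below works on an abridged copy of the output above rather than calling datahugger; the helper names author_names and issued_year are illustrative and not part of datahugger.

```python
# Abridged CSL-JSON record copied from the output above.
record = {
    "type": "dataset",
    "id": "https://doi.org/10.7910/dvn/kbhlod",
    "author": [{"family": "Han", "given": "Hyemin"}],
    "issued": {"date-parts": [[2017]]},
    "DOI": "10.7910/DVN/KBHLOD",
    "publisher": "Harvard Dataverse",
    "title": "Markov-Learning",
}


def author_names(csl):
    """Return authors as 'Family, Given' strings; tolerate missing keys."""
    names = []
    for person in csl.get("author", []):
        parts = [person.get("family"), person.get("given")]
        names.append(", ".join(p for p in parts if p))
    return names


def issued_year(csl):
    """Return the first year found in issued -> date-parts, or None."""
    date_parts = csl.get("issued", {}).get("date-parts", [])
    if date_parts and date_parts[0]:
        return int(date_parts[0][0])
    return None


print(author_names(record))
print(issued_year(record))
```

Using .get() with defaults keeps the helpers usable on records that lack an author or an issue date.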

The following metadata standards are available:

The citation() method returns styled citations (see all 3000+ styles on the CSL website: https://editor.citationstyles.org/searchByName/).

service.resource.metadata.citation(style="apa", locale="nl_NL")
Han, H. (2017). <i>Markov-Learning</i> [Data set]. Harvard Dataverse. https://doi.org/10.7910/DVN/KBHLOD
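
The styled citation contains inline HTML such as <i>...</i>. For plain-text output (logs, README files), the markup can be removed afterwards. This is a minimal sketch using only the standard library; strip_markup is a hypothetical helper, not a datahugger function, and the regular expression assumes simple tags like those shown above.

```python
import html
import re


def strip_markup(citation):
    """Remove simple HTML tags and decode entities in a styled citation."""
    without_tags = re.sub(r"<[^>]+>", "", citation)
    return html.unescape(without_tags).strip()


# Citation string copied from the output above.
styled = (
    "Han, H. (2017). <i>Markov-Learning</i> [Data set]. "
    "Harvard Dataverse. https://doi.org/10.7910/DVN/KBHLOD"
)
plain = strip_markup(styled)
print(plain)
```

A regular expression is enough for short inline tags; text that contains literal angle brackets would need a real HTML parser.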

Metadata without resolving service

It is also possible to get the metadata before resolving the service:

import datahugger

# get the metadata
resource = datahugger.parse_resource_identifier("10.7910/DVN/KBHLOD")
resource.metadata.cls()

# the class of the resource
type(resource)

The type of the resource is DOI (it can also be Handle or ArXiv):

datahugger.handles.DOI

# resolve the service
service = datahugger.info(resource)

This returns a datahugger.services.DataverseDataset instance (<datahugger.services.DataverseDataset at 0x106693d60>).

Metadata for unsupported identifiers

The benefit of this pipeline is that you can extract metadata even when the identifier is not supported by datahugger. For example:

import datahugger

# get the metadata
resource = datahugger.parse_resource_identifier("arxiv:astro-ph/9802301")
resource.metadata.cls()
{'type': 'article',
 'id': 'https://doi.org/10.48550/arxiv.astro-ph/9802301',
 'categories': ['Astrophysics (astro-ph)',
  'FOS: Physical sciences',
  'FOS: Physical sciences'],
 'author': [{'family': 'Motch', 'given': 'C.'},
  {'family': 'Haberl', 'given': 'F.'}],
 'issued': {'date-parts': [[1998]]},
 'abstract': 'Deep optical B band images of the ROSAT HRI error region of RX J0720.4-3125 reveal the presence of two faint stellar-like objects with B = 26.1 +/- 0.25 and B = 26.5 +/- 0.30. Exposures obtained through U, V and I filters are not sensitive enough to detect the two candidates and provide upper limits of U = 24.9, V = 23.2 and I = 21.9. These new observations virtually establish that RX J0720.4-3125 is a slowly rotating, probably completely isolated neutron star. The absence of an optical counterpart brighter than B = 26.1 seems incompatible with a neutron star atmosphere having a chemical composition dominated by Hydrogen or Helium. UBI photometry of field stars shows astonishingly little interstellar reddening in the direction of the X-ray source. Together with the small column density detected by the ROSAT PSPC, this suggests a mean particle density in the range of n = 0.1 - 0.4 cm-3. Such average densities would imply very low velocities relative to interstellar medium (Vrel &lt; 10 km/s) if the source were powered by accretion. These stringent constraints may be relaxed if the neutron star is presently crossing a small size structure of higher density or if the effective temperature of the heated atmosphere is overestimated by the blackbody approximation. Alternatively, RX J0720.4-3125 could be a young and highly magnetized cooling neutron star.',
 'DOI': '10.48550/ARXIV.ASTRO-PH/9802301',
 'publisher': 'arXiv',
 'title': 'Constraints on optical emission from the isolated neutron star candidate RX J0720.4-3125',
 'URL': 'https://arxiv.org/abs/astro-ph/9802301',
 'copyright': 'Assumed arXiv.org perpetual, non-exclusive license to distribute this article for submissions made before January 2004',
 'version': '1'}

By contrast, datahugger.get() raises a ValueError for the same resource:

datahugger.get(resource, "data")
ValueError: Data protocol for astro-ph/9802301 not found.

Records without metadata

The metadata is retrieved via the DOI handle system, so some URLs have no metadata because the corresponding DOI is not known. We are working on improving this, but it is a challenging problem. The advice is to always use datahugger with a DOI when one is available. Example of the error raised when no metadata is available:

import datahugger

# get the metadata
resource = datahugger.parse_resource_identifier("https://zenodo.org/record/6614829")
resource.metadata.cls()
AttributeError: 'str' object has no attribute 'metadata'

This error is easy to catch in software integrations:

import datahugger

# get the metadata
resource = datahugger.parse_resource_identifier("https://zenodo.org/record/6614829")

try:
    resource.metadata.cls()
except AttributeError:
    print(f"No metadata available for record {resource}")
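
When many identifiers are processed at once, checking for the attribute up front can be an alternative to try/except. The sketch below assumes, as in the error above, that an identifier without a known DOI comes back as a plain string with no metadata attribute. split_by_metadata and the stand-in classes are hypothetical and only serve to make the example runnable offline.

```python
def split_by_metadata(resources):
    """Separate parsed resources into those with and without metadata."""
    described, missing = [], []
    for resource in resources:
        metadata = getattr(resource, "metadata", None)
        if metadata is None:
            missing.append(resource)
        else:
            described.append((resource, metadata.cls()))
    return described, missing


# Stand-ins for illustration only (not datahugger classes).
class _StubMetadata:
    def cls(self):
        return {"title": "stub"}


class _StubResource:
    metadata = _StubMetadata()


resources = [_StubResource(), "https://zenodo.org/record/6614829"]
described, missing = split_by_metadata(resources)
print(len(described), missing)
```

Unlike a broad except AttributeError, getattr with a default does not hide an AttributeError raised inside cls() itself.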

@J535D165 changed the title from "Add extensible support for handle systems and URIs" to "Add extensible support for handle systems and metadata" Sep 14, 2023
@J535D165 added the "enhancement" (New feature or request) label Sep 14, 2023
@J535D165 marked this pull request as ready for review September 14, 2023 13:42
@J535D165 merged commit 3ad235f into main Sep 14, 2023
@J535D165 deleted the refactor-handlers branch September 14, 2023 13:46