Skip to content

Commit

Permalink
Merge pull request #3 from themerekat/symbiota-converge
Browse files Browse the repository at this point in the history
Converge old readmes and fix occurrences file
  • Loading branch information
themerekat authored Mar 13, 2023
2 parents 1b97a43 + b4d086d commit ba97beb
Show file tree
Hide file tree
Showing 3 changed files with 303 additions and 11 deletions.
306 changes: 299 additions & 7 deletions symbiota/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,12 +2,12 @@

## List of associated files

- `symbiota_neon_seinet.sql`
- `symb_combined_occurrences.csv`
- `materialSample.csv`
- `identifications_dbg.csv`
- `measurementOrFact_dbg.csv`
- `multimedia_dbg.csv`
- `materialSample.csv`
- `symbiota_neon_seinet.sql` (final combined dump)

For the purposes of this exercise, records were downloaded from the [NEON Biorepository Symbiota portal](https://biorepo.neonscience.org/) and the Southern Rocky Mountain Herbaria portal](https://www.soroherbaria.org/portal/) (SoRo Portal) in DwCA format. We combined the occurrences files from both of these DwC-As, then also used `materialSample.csv` from NEON, then `identifications_dbg.csv`, `measurementOrFact_dbg.csv`, and `multimedia.dbg.csv` from the SoRo portal.

Expand Down Expand Up @@ -55,8 +55,8 @@ Imported `symb_combined_occurrences.csv` into the table `symbiota_occurrences` c
CREATE TABLE symbiota_occurrences (
id int PRIMARY KEY,
institutionCode text,
collectionName text,
collectionCode text,
collectionName text,
ownerInstitutionCode text,
collectionID text,
basisOfRecord text,
Expand Down Expand Up @@ -96,6 +96,7 @@ Imported `symb_combined_occurrences.csv` into the table `symbiota_occurrences` c
verbatimEventDate text,
occurrenceRemarks text,
habitat text,
verbatimAttributes text,
fieldNumber text,
eventID text,
informationWithheld text,
Expand Down Expand Up @@ -141,14 +142,16 @@ Imported `symb_combined_occurrences.csv` into the table `symbiota_occurrences` c
language text,
recordEnteredBy text,
modified text,
rights text,
rightsHolder text,
accessRights text,
recordID text,
"references" text
);
```

### NEON example
~Laura Rocha Prado~

The following steps were taken to import and map the NEON specimen and its associated material sample record.

Created collection in table `agent`:

```postgresql
Expand Down Expand Up @@ -334,4 +337,293 @@ INSERT INTO taxon_identification (taxon_id, identification_id, taxon_order)
VALUES ('29174bda-9e08-4ff8-a878-e64991ea1121', 'd511a9a7-4867-4083-8d10-fe223242b6d8', 1);
```

Finally, a database dump generated the file `symbiota_neon.sql`.

### SoRo example
~Katie Pearson~

The following steps were taken to incorporate the examples from the Consortium of Southern Rocky Mountain Herbaria specimens, their identifications, their multimedia, and their measurementOrFact files. For these files, I took a more generalized approach that could be used with future Darwin Core Archives (no hard-coded data transfer).

Imported `identifications_dbg.csv` into the table `symbiota_identifications`. For this and the remaining tables, I had to create a new "id" field containing an artificial primary key, because coreid may be duplicated (it refers back to the occurrences file).

```postgresql
CREATE TABLE symbiota_identifications (
id int PRIMARY KEY,
coreid text,
identifiedby text,
dateIdentified text,
identificationQualifier text,
scientificName text,
tidInterpreted int,
scientificNameAuthorship text,
genus text,
specificEpithet text,
taxonRank text,
infraspecificEpithet text,
identificationReferences text,
identificationRemarks text,
recordID text,
modified text
);
```

Imported `measurementOrFact_dbg.csv` into the table `symbiota_measurementorfact`.

```postgresql
CREATE TABLE symbiota_measurementorfact (
id int PRIMARY KEY,
coreid text,
measurementType text,
measurementTypeID text,
measurementValue text,
measurementValueID text,
measurementUnit text,
measurementDeterminedDate text,
measurementDeterminedBy text,
measurementRemarks text
);
```

Imported `multimedia_dbg.csv` into the table `symbiota_multimedia`.

```postgresql
CREATE TABLE symbiota_multimedia (
id int PRIMARY KEY,
coreid text,
identifier text,
accessURI text,
thumbnailAccessURI text,
goodQualityAccessURI text,
rights text,
cwner text,
creator text,
UsageTerms text,
WebStatement text,
caption text,
comments text,
providerManagedID text,
MetadataDate text,
format text,
associatedSpecimenReference text,
type text,
subtype text,
metadataLanguage text
);
```

Then we loaded all the data into these tables from the CSV files attached to this repo.

Next, came mapping the data from our tables into the data model's tables.

Make collections in tables `agent` and `collection`:
**PROBLEM:** There is no way of knowing, from the metadata provided by a Symbiota DwC-A, that a collection is an herbarium, museum, university, etc. This would have to be populated from another source (e.g., GRSciColl record).

```postgresql
INSERT INTO agent (agent_id, agent_type, preferred_agent_name)
SELECT DISTINCT collectionID, 'COLLECTION', institutionCode || ':' || collectionCode
FROM symbiota_occurrences;
INSERT INTO collection (collection_id, collection_type, collection_code, institution_code, grscicoll_id)
SELECT DISTINCT ON (collectionID) collectionID, 'HERBARIUM', collectionCode, institutionCode, uuid_generate_v4()
FROM symbiota_occurrences;
```

Make an entity and corresponding material entity record for every specimen in the occurrences table.
I chose to keep track of the cross-links between identifiers using the identifier table.

```postgresql
INSERT INTO identifier (identifier_target_id, identifier_target_type, identifier_type, identifier_value)
SELECT uuid_generate_v4(), 'MATERIAL_ENTITY','local',id
FROM symbiota_occurrences;
INSERT INTO entity (entity_id, entity_type, dataset_id)
SELECT identifier_target_id, 'MATERIAL_ENTITY','symbiota_dbg'
FROM identifier;
INSERT INTO material_entity (material_entity_id, material_entity_type, preparations, disposition, institution_code, collection_code, collection_id, owner_institution_code, catalog_number, record_number, recorded_by, associated_references, associated_sequences, other_catalog_numbers)
SELECT i.identifier_target_id, 'ORGANISM', s.preparations, s.disposition, s.institutionCode, s.collectionCode, s.collectionID, s.ownerInstitutionCode, s.catalogNumber, s.recordNumber, s.recordedBy, s.references, s.associatedSequences, 'SEINet:' || s.otherCatalogNumbers
FROM symbiota_occurrences s
JOIN identifier i
ON s.id = i.identifier_value::integer WHERE identifier_target_type = 'MATERIAL_ENTITY'
```

Make and organism for every entity.
**PROBLEM:** There is no way of knowing how many organisms are represented by a single entity. Herbarium sheets and fossil specimens, particularly, may represent several organisms but one "specimen" (or material entity). Therefore, we cannot provide an conclusive organism scope.

```postgresql
INSERT INTO organism (organism_id, organism_scope)
SELECT m.material_entity_id
FROM material_entity m
JOIN entity e
ON m.material_entity_id = e.entity_id
WHERE e.dataset_id = 'symbiota_dbg'
```

Make an assertion for every trait measurement.
**Adjustment:** I made up the value 'measurement' to go into the _assertion_type_ field.

```postresql
INSERT INTO assertion (assertion_id, assertion_target_id, assertion_target_type, assertion_type, assertion_made_date, assertion_value, assertion_unit, assertion_by_agent_name, assertion_remarks)
SELECT uuid_generate_v4(), identifier.identifier_target_id, 'MATERIAL_ENTITY', 'measurement', measurementDeterminedDate, measurementValue, measurementUnit, measurementDeterminedBy, measurementType || '; ' || measurementTypeID || '; ' || measurementValueID
FROM symbiota_measurementorfact
JOIN identifier
ON symbiota_measurementorfact.coreid = identifier.identifier_value
```

Make a digital entity for every image in the multimedia file.
```postgresql
INSERT INTO entity (entity_id, entity_type, dataset_id)
SELECT providerManagedID, 'DIGITAL_ENTITY', 'symbiota_dbg'
FROM symbiota_multimedia
INSERT INTO digital_entity (digital_entity_id, digital_entity_type, access_uri, format, rights, rights_uri, rights_holder, source_uri, creator)
SELECT providerManagedID, 'STILL_IMAGE', accessURI, format, usageTerms, rights, 'owner', associatedSpecimenReference, creator
FROM symbiota_multimedia
INSERT INTO identifier (identifier_target_id, identifier_target_type, identifier_type, identifier_value)
SELECT providerManagedID, 'DIGITAL_ENTITY', 'local', coreid
FROM symbiota_multimedia
```

Link images to their material entities.

```postgresql
INSERT INTO entity_relationship (entity_relationship_id, subject_entity_id, entity_relationship_type, object_entity_id)
SELECT uuid_generate_v4(), i2.identifier_target_id, 'shown in', i1.identifier_target_id
FROM identifier i1
LEFT JOIN identifier i2
ON i1.identifier_value = i2.identifier_value
WHERE i1.identifier_target_type = 'DIGITAL_ENTITY' AND i2.identifier_target_type = 'MATERIAL_ENTITY'
```

Add locations and georeferences. Symbiota portals do not yet support multiple georeferences for a single location, nor are locations indexed, so there is a 1-1-1 relationship between specimens, locations, and georeferences.

```postgresql
INSERT INTO location (location_id, continent, water_body, island_group, island, country, state_province, county, municipality, locality, minimum_elevation_in_meters, maximum_elevation_in_meters, minimum_depth_in_meters, maximum_depth_in_meters, location_remarks)
SELECT id, continent, waterBody, islandGroup, island, country, stateProvince, county, municipality, locality, minimumElevationInMeters::numeric,maximumElevationInMeters::numeric, minimumDepthInMeters::numeric, maximumDepthInMeters::numeric,locationRemarks
FROM symbiota_occurrences
INSERT INTO identifier (identifier_target_id, identifier_target_type, identifier_type, identifier_value)
SELECT uuid_generate_v4(), 'LOCATION', 'local', id
FROM symbiota_occurrences
UPDATE location
SET location_id = identifier.identifier_target_id
JOIN identifier
WHERE location_id = identifier.identifier_value AND identifier.identifier_target_type = 'LOCATION'
INSERT INTO georeference (georeference_id, location_id, decimal_latitude, decimal_longitude, geodetic_datum, coordinate_uncertainty_in_meters, georeferenced_by, georeference_protocol, georeference_sources, georeference_remarks)
SELECT uuid_generate_v4(), i.identifier_target_id, s.decimalLatitude::numeric, s.decimalLongitude::numeric, COALESCE(s.geodeticDatum,'unknown'), s.coordinateUncertaintyInMeters::numeric, s.georeferencedBy, s.georeferenceProtocol, s.georeferenceSources, s.georeferenceRemarks
FROM symbiota_occurrences s
JOIN identifier i
ON s.id = i.identifier_value::integer
WHERE s.decimalLatitude IS NOT NULL AND i.identifier_target_type = 'LOCATION'
UPDATE location
SET accepted_georeference_id = georeference_id
FROM georeference g
WHERE location.location_id = g.location_id
AND accepted_georeference_id IS NULL
```

Insert events and occurrences. Symbiota uses the occurrences-based model, so the only provided GUID is the occurrenceID.
**PROBLEM:** Our establishmentMeans field does not have a controlled vocabulary, so it had to be excluded from this import.

```postgresql
INSERT INTO event (event_id, dataset_id, location_id, event_type, field_number, event_date, year, month, day, verbatim_event_date, verbatim_locality, verbatim_elevation, verbatim_depth, verbatim_coordinates, habitat)
SELECT o.occurrenceID, 'symbiota_dbg', i.identifier_target_id, 'OCCURRENCE', o.fieldNumber, o.eventDate, o.year::smallint, o.month::smallint, o.day::smallint, o.verbatimEventDate, o.locality, o.verbatimElevation, o.verbatimDepth, o.verbatimCoordinates, o.habitat
FROM symbiota_occurrences o
JOIN identifier i
ON o.id = i.identifier_value::integer
WHERE i.identifier_target_type = 'LOCATION'
INSERT INTO identifier (identifier_target_id, identifier_target_type, identifier_type, identifier_value)
SELECT occurrenceID, 'OCCURRENCE', 'local', id
FROM symbiota_occurrences
INSERT INTO occurrence (occurrence_id, organism_id, organism_quantity, sex, life_stage, reproductive_condition, occurrence_status, occurrence_remarks, information_withheld, recorded_by, associated_occurrences, associated_taxa)
SELECT o.occurrenceID, i.identifier_target_id, o.individualCount, o.sex, o.lifeStage, o.reproductiveCondition, 'PRESENT', o.occurrenceRemarks, o.informationWithheld, o.recordedBy, o.associatedOccurrences, o.associatedTaxa
FROM symbiota_occurrences o
JOIN identifier i
ON o.id = i.identifier_value::integer
WHERE i.identifier_target_type = 'MATERIAL_ENTITY'
INSERT INTO occurrence_evidence (occurrence_id, entity_id)
SELECT occurrence_id, organism_id
FROM occurrence
```

Add identifications. This was a little tricky because, currently, the occurrences file contains taxonomic information for the most recent determination, while the identifications table contains previous identifications _as well as current determinations, only for records that have more than one determination_. Therefore, we had to import identifications from the occurrences table, then de-duplicate the identifications table, then import the remaining identifications from the identifications table. Similarly, we had to add the taxa twice: once from occurrences, and once from identifications.
**PROBLEM:** We did not address how to indicate which is the accepted identification.

```postgresql
INSERT INTO identification (identification_id, organism_id, identification_type, taxon_formula, verbatim_identification, type_status, identified_by, date_identified, identification_references, identification_remarks)
SELECT uuid_generate_v4(), i.identifier_target_id, 'VISUAL', 'A', scientificName || ' ' || scientificNameAuthorship, typeStatus, identifiedBy, dateIdentified, identificationReferences, identificationRemarks
FROM symbiota_occurrences o
JOIN identifier i
ON o.id = i.identifier_value::numeric
WHERE i.identifier_target_type = 'MATERIAL_ENTITY'
ALTER TABLE symbiota_occurrences
ADD COLUMN o_temp_match text
ALTER TABLE symbiota_identifications
ADD COLUMN i_temp_match text
UPDATE symbiota_occurrences
SET o_temp_match = scientificName || ' ' || scientificNameAuthorship || ' ' || identifiedBy
UPDATE symbiota_identifications
SET i_temp_match = scientificName || ' ' || scientificNameAuthorship || ' ' || identifiedBy
DELETE FROM symbiota_identifications i
WHERE i.i_temp_match IN (SELECT o.temp_match FROM symbiota_occurrences)
INSERT INTO identification (identification_id, organism_id, identification_type, taxon_formula, verbatim_identification, identified_by, date_identified, identification_references, identification_remarks)
SELECT uuid_generate_v4(), i.identifier_target_id, 'VISUAL', 'A', d.scientificName || ' ' || d.scientificNameAuthorship, d.identifiedBy, d.dateIdentified, d.identificationReferences, d.identificationRemarks
FROM symbiota_identifications d
JOIN identifier i
ON d.coreid = i.identifier_value::numeric
WHERE i.identifier_target_type = 'MATERIAL_ENTITY'
```

Add taxa (see note above about why we're adding them twice).

```postgresql
INSERT INTO taxon (taxon_id, scientific_name, scientific_name_authorship, taxon_rank, kingdom, phylum, class, 'order', family, genus)
SELECT uuid_generate_v4(), scientificName, scientificNameAuthorship, taxonRank, kingdom, phylum, class, 'order', family, genus
FROM symbiota_occurrences
Add all taxa that were found in the identifications table
INSERT INTO taxon (taxon_id, scientific_name, scientific_name_authorship, taxon_rank, genus)
SELECT uuid_generate_v4(), scientificName, scientificNameAuthorship, taxonRank, genus
FROM symbiota_identifications
```

Link taxa to their identifications. This is a bit of a kludgey way of doing this, because we are matching on scientific name rather than coreid.

```postgresql
INSERT INTO taxon_identification (taxon_id, identification_id, taxon_order)
SELECT t.taxon_id, i.identification_id, '1'
FROM taxon t
JOIN identification i
ON (t.scientific_name || ' ' || t.scientific_name_authorship) = i.verbatim_identification
```

##Problems

Here we list particular data ingestion problems or assumptions. These are the same as the ones listed in context above, but are collated here for reference.

* There is no way of knowing, from the metadata provided by a Symbiota DwC-A, that a collection is an herbarium, museum, university, etc. This would have to be populated from another source (e.g., GRSciColl record). There is also no way of knowing the GRSciColl identifier for the collection from our DwC-A.
* There is no way of knowing how many organisms are represented by a single entity. Herbarium sheets and fossil specimens, particularly, may represent several organisms but one "specimen" (or material entity). Therefore, we cannot provide an conclusive organism scope.
* Our establishmentMeans field does not have a controlled vocabulary, so it had to be excluded from this import.
* We did not address how to indicate which identification is the accepted or most current identification.
Loading

0 comments on commit ba97beb

Please sign in to comment.