Double forward slashes in some Gemini records for files migrated from 7.x #1014

Closed
mjordan opened this issue Jan 28, 2019 · 17 comments

Comments

@mjordan
Contributor

mjordan commented Jan 28, 2019

Looking at the Gemini database records for some files created during a migration from 7.x, we can see that some of the "fedora_uri" fields are missing path information. For example, below we see this in the entries for the OBJ and MODS files:

mysql> select * from Gemini where drupal_uri like "%islandora_2_MODS%"\G
*************************** 1. row ***************************
fedora_hash: 046b22eb67596f29043a70615ccb66bb5a8ec3c4101b5b2758d7777d95dfa1a62bf93f72fed0af6e57a97e53341caa6a4912fa8a5fc78e12901124e3b6966529
drupal_hash: 65653822f98e4de55f581588720c4c8ff67323aa93d07f32f74d63df7c81239708f81741cb9fd0dc2fc2e09152f0d298789f3c47929a16e2000827f39bffd4d8
       uuid: 83ea0267-0bc4-4dc7-8d50-901072102dc1
 drupal_uri: http://default/_flysystem/fedora/islandora_2_MODS.xml
 fedora_uri: http://localhost:8080/fcrepo/rest//islandora_2_MODS.xml
dateCreated: 2019-01-27 16:44:25
dateUpdated: 2019-01-27 16:44:25
1 row in set (0.00 sec)

mysql> select * from Gemini where drupal_uri like "%islandora_2_OBJ%"\G
*************************** 1. row ***************************
fedora_hash: 727b1a61c060b141bd141f4fe14394af63c2308c069464e1c49d20afc967331c10a7ea556f0ff995d64d30d21c6448320345f6aca86a0a4b054ce7c1aba7d21d
drupal_hash: a792c49d97ab7ef278a5d6823f1f2c957759912a3d046b4c6c33caa5c6edce29fe9019cea38be30a166af7d873244aa615f14f3ffd3eb9168ecda0eb450b7a61
       uuid: 818e420c-5651-4b51-9ce1-d223655c04aa
 drupal_uri: http://default/_flysystem/fedora/islandora_2_OBJ.pdf
 fedora_uri: http://localhost:8080/fcrepo/rest//islandora_2_OBJ.pdf
dateCreated: 2019-01-27 16:44:26
dateUpdated: 2019-01-27 16:44:26
1 row in set (0.00 sec)

mysql> select * from Gemini where drupal_uri like "%islandora_2_AUDIT%"\G
*************************** 1. row ***************************
fedora_hash: 9ba8ce910343be98e4e362e041a7f873dd990294662edcd76c141e33975b25d0db92a3cfd30b5b3d1545e819da722299b1436c145c7ff5f74e4bb22ee671a4e0
drupal_hash: 2aeed204721528081653e257e27788237ee39457cc5479ab8e7b3d0a9cd5dc70f997e2d21df283d4319f7fe4cda2b09466a11488075c9ba381d194e1f13216e6
       uuid: 62722130-2c10-48be-b045-c1367ef77b48
 drupal_uri: http://default/_flysystem/fedora/masters/islandora_2_AUDIT.xml
 fedora_uri: http://localhost:8080/fcrepo/rest/masters/islandora_2_AUDIT.xml
dateCreated: 2019-01-27 16:39:54
dateUpdated: 2019-01-27 16:39:54
1 row in set (0.00 sec)

However, the entry for the AUDIT file appears to be complete. Queries like select * from Gemini where drupal_uri like "%MODS%"\G show this pattern exists for all MODS files, etc. Can anyone else replicate this?

Files that are created by manually ingesting repository objects do not show this behavior (that is, they have complete paths in Gemini).

@whikloj
Member

whikloj commented Jan 28, 2019

I don't think they are missing path information. The AUDIT one has the path .../fedora/masters/..., where masters is a sub-path and fedora (IIRC) is the flysystem identifier in Drupal.

drupal_uri: http://default/_flysystem/fedora/masters/islandora_2_AUDIT.xml
fedora_uri: http://localhost:8080/fcrepo/rest/masters/islandora_2_AUDIT.xml

So we add the sub-path to the repository base in Fedora.

So for

drupal_uri: http://default/_flysystem/fedora/islandora_2_MODS.xml
fedora_uri: http://localhost:8080/fcrepo/rest//islandora_2_MODS.xml

it is .../fedora/ with no sub-path. This MODS file should be at the root of the Fedora repository. You could check that out and see if it's there.

We probably have a case where, with a sub-path, there is no trailing slash, and we didn't include logic smart enough to test for a double //.

@mjordan
Contributor Author

mjordan commented Jan 28, 2019

@whikloj thanks for the clarification; my report should have been more specific ("Double forward slashes in some Gemini records..."). Would this extra forward slash come from one of the migrate YAML config files?

@whikloj
Member

whikloj commented Jan 28, 2019

@mjordan because the extra forward slash only appears in the Fedora URI, I would guess this is something in Gemini that needs to be addressed.

@mjordan changed the title from "Missing path information in some Gemini records for files migrated from 7.x" to "Double forward slashes in some Gemini records for files migrated from 7.x" on Jan 29, 2019
@mjordan
Contributor Author

mjordan commented Jan 29, 2019

I'm hacking around in Gemini/src/UrlMinter/UrlMinter.php, but it seems this is not the right place: when I change $this->base_url = rtrim($trimmed, '/') . '/'; to $this->base_url = 'http://example.com/';, I'm still getting records like this in Gemini for files created during the migration:

fedora_hash: 970a07e7067f7c906caf78d0d555333c352b2fc93f6c0c69567751b535b8cc925a9b9354c395e3a1c4a558d3312b66ac2e8c3b34a74ac174b6d534d0bd1e707f
drupal_hash: 61f2e0db8ef88861030bbb778cf4564d4609a347f4b8c5de1a28c9a0b98635817dba281b6e0f9c272e3ab9dbe479b6a035b63bae1e378c57be5c5cfd1e2c8483
       uuid: fe4e7be3-cfa7-42dd-adec-8d5da2764319
 drupal_uri: http://default/_flysystem/fedora/testing_11_DC.xml
 fedora_uri: http://localhost:8080/fcrepo/rest//testing_11_DC.xml
dateCreated: 2019-01-29 09:02:46
dateUpdated: 2019-01-29 09:02:46

Even when I change the fedora_base_url value in Gemini/cfg/config.yaml to http://example.com:8080/fcrepo/rest and rerun my migration, the records don't change:

fedora_hash: 897d2fd61bc55f526f3b11e8c735607f6af9a26c8bb4a76614b599ad68a05ea058a0646b8c97b28a7f130a670c8adf7d6662f6eb013c4e44173f1f1e509b7f49
drupal_hash: 300ba5c757ebf4709acb51ace74124604f55484b9ec8446a52be3eb75d44b2139da9377a3d97edfb59b0d094a512774068284d3f617b19cdd16bb2a3c64e6a79
       uuid: ff595498-add2-4d47-bba3-46ee47e1655f
 drupal_uri: http://default/_flysystem/fedora/testing_19_TN.jpg
 fedora_uri: http://localhost:8080/fcrepo/rest//testing_19_TN.jpg
dateCreated: 2019-01-29 09:10:32
dateUpdated: 2019-01-29 09:10:32

Where should I be looking?

@whikloj
Member

whikloj commented Jan 29, 2019

@mjordan I wonder if it is here
https://github.com/Islandora-CLAW/Crayfish/blob/master/Gemini/src/UrlMinter/UrlMinter.php#L36
what if you make it something like

$path = rtrim(implode("/", $segments), "/") . "/$context";

@dannylamb
Contributor

Hrm... that default in the drupal url looks suspicious to me, too. I'll try to see if I can find where that's set.

@mjordan
Contributor Author

mjordan commented Jan 31, 2019

Dunno what's going on. When I added an unexpected character to /var/www/html/Crayfish/Gemini/src/UrlMinter/UrlMinter.php to induce a PHP syntax error, rerunning the migration resulted in no entries being added to Gemini. Which makes sense. But when I reran the migration with the following code in /var/www/html/Crayfish/Gemini/src/UrlMinter/UrlMinter.php to see if I can produce random URLs:

        // return $this->base_url . $path;
        return $this->base_url . rand(0, 1000000000);

I am still getting entries in Gemini that have fedora_uris that don't end in random numbers:

fedora_hash: 19d4e40bdd4eb9285683f9343ae2caa5e13d763d2c2c2439d8e13fae39587133daabbc1ecc5f96ff0a798a674a91aefee047d736164b70e57bd03ad63f94b39c
drupal_hash: 4453f1a1c7604bd4f1041f4c6b276760b6f7a2a731111b9cf76293cd1a3730b0d4e97980faed742ec484dbc63758025cb0e4df553d7ac6078fb51d08ce59859d
       uuid: fd6b537a-d5b7-4804-bd1d-78494ae471e7
 drupal_uri: http://default/_flysystem/fedora/testing_16_DC.xml
 fedora_uri: http://localhost:8080/fcrepo/rest//testing_16_DC.xml
dateCreated: 2019-01-30 20:35:30
dateUpdated: 2019-01-30 20:35:30

Which doesn't make sense.

@whikloj
Member

whikloj commented Jan 31, 2019

@mjordan ok so I think (having not tested) I have the issue.

First $base_url is altered by

$this->base_url = rtrim($trimmed, '/') . '/';

So the base_url always has a trailing / (e.g. http://default/), so

$path = implode("/", $segments) . "/$context";
return $this->base_url . $path;

If $segments is an empty array your $path would just be /$context (say /jared) and your returned $url would be

return "http://default/" . "/jared";

and you have a double slash.
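
The concatenation described above can be reproduced in isolation. This is only a minimal sketch, not the actual UrlMinter code: the function names `mintNaive` and `mintGuarded` and the example base URL are hypothetical, and the guard is one possible approach rather than the project's fix.

```php
<?php
// Hypothetical helper mirroring the concatenation quoted above.
function mintNaive(string $base_url, array $segments, string $context): string
{
    $base_url = rtrim($base_url, '/') . '/';        // base always ends in "/"
    $path = implode('/', $segments) . "/$context";  // empty $segments gives "/$context"
    return $base_url . $path;
}

// One possible guard (an assumption): drop leading slashes from the path
// before appending it to a base that already ends in "/".
function mintGuarded(string $base_url, array $segments, string $context): string
{
    $base_url = rtrim($base_url, '/') . '/';
    $path = ltrim(implode('/', $segments) . "/$context", '/');
    return $base_url . $path;
}

echo mintNaive('http://localhost:8080/fcrepo/rest', [], 'jared'), PHP_EOL;
// http://localhost:8080/fcrepo/rest//jared
echo mintGuarded('http://localhost:8080/fcrepo/rest', [], 'jared'), PHP_EOL;
// http://localhost:8080/fcrepo/rest/jared
echo mintGuarded('http://localhost:8080/fcrepo/rest', ['ab', 'cd'], 'jared'), PHP_EOL;
// http://localhost:8080/fcrepo/rest/ab/cd/jared
```

With a non-empty `$segments` both versions return the same URL; they differ only when `$segments` is empty.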

@mjordan
Contributor Author

mjordan commented Jan 31, 2019

@whikloj yeah, I had code to check for an empty $segments but it wasn't having any effect on the fedora_uri values ending up in the database. So then I tried several ways of seeing what the values of $context and the resulting $segments were, like setting up Monolog logging in the UrlMinter.php class and dumping out the values. I also tried appending the value of $context (a string) to the returned URI value. After all that, and last night's attempt to mint random URIs with return $this->base_url . rand(0, 1000000000);, I have come to the conclusion that nothing I do in that class file has any effect on the values that end up in the database.

@dannylamb
Contributor

OK, I see it now. It's because we're talking about files, whose paths are essentially mirrored in Fedora. When a file is created, its fedora uri is converted from the Drupal file uri and put into a message here: https://github.com/Islandora-CLAW/islandora/blob/8.x-1.x/src/Plugin/Action/EmitFileEvent.php#L107

That message is put onto the queue using Context and Alpaca picks it up and indexes in Gemini here: https://github.com/Islandora-CLAW/Alpaca/blob/master/islandora-indexing-fcrepo/src/main/java/ca/islandora/alpaca/indexing/fcrepo/FcrepoIndexer.java#L189-L210

It's inconsistent with Milliner because it bypasses Gemini altogether, but it is consistent with Drupal and respects the tokens you give it for a destination path when uploading a file, etc... It'd be nice to find an elegant way to take the best of both worlds and just have one approach, but that's quite a bit larger in scope than just fixing the extra /.

Anyway, pretty sure it's the EmitFileEvent action that's the culprit here.

@mjordan
Contributor Author

mjordan commented Feb 1, 2019

@dannylamb thanks, I'll take a look at fixing the extra / over the weekend.

@mjordan
Contributor Author

mjordan commented Feb 2, 2019

I am pleased to report that simply removing the double // in EmitFileEvent.php does the trick:

    $data = parent::generateData($entity);
    if (isset($flysystem_config[$scheme]) && $flysystem_config[$scheme]['driver'] == 'fedora') {
      $fedora_uri = str_replace("$scheme://", $flysystem_config[$scheme]['config']['root'], $uri);
      $data['fedora_uri'] = str_replace('//', '/', $fedora_uri);
    }
    return $data;

Rerunning the migration with this code in place produces the expected entries in Gemini:

*************************** 206. row ***************************
fedora_hash: 9a34b18e2b24c0aab1229df00d3902f67134e8de6d4b758953d8ed23af4e571814a04cc2d5ac8f549b2d799343c565d95554569734a50845debb1b4c2ac55076
drupal_hash: e2ac2a978a4557943394f6750ec369d751e02cd4466e5cfde01b91916a82ce4c74dc197bd9627acd4eee454d6191ce592c3c50fe680f5d6eb88c69a73549a5f6
       uuid: fca03af7-fcd6-4789-9ca8-8f01506b6da3
 drupal_uri: http://default/_flysystem/fedora/testing_8_OBJ.jpg
 fedora_uri: http:/localhost:8080/fcrepo/rest/testing_8_OBJ.jpg
dateCreated: 2019-02-02 11:42:05
dateUpdated: 2019-02-02 11:42:05
*************************** 207. row ***************************
fedora_hash: bfb598a59f04eff722d909f8d9cc40327d547617f4d28d2114fe5547e8e7152952449be1c24a7aebdfdbe93feb89dbae3173858d419e843971c0e7b22742f870
drupal_hash: 79e917633033737caa70fdb6c586f381afe6284d254f9c4f9029be5669df088c865de47d2167bef5cec054a3a7268a761e5843708b048b7fcb0887311a170e1d
       uuid: fec9b973-1602-4617-9969-115502680371
 drupal_uri: http://default/_flysystem/fedora/islandora_2_PREVIEW.jpg
 fedora_uri: http:/localhost:8080/fcrepo/rest/islandora_2_PREVIEW.jpg
dateCreated: 2019-02-02 11:42:44
dateUpdated: 2019-02-02 11:42:44
*************************** 208. row ***************************
fedora_hash: 38aa4d28ab5c2bd61f7bec00aaffdd9452c9fc4840ebb1b87b8ffaa31f04f13a314397df6fd04fded18458050591fa4419be2f5c260ee4908fcf5dd453c030a1
drupal_hash: 66bf4bb86aa802a03edca59e376db4198bc33db5df544fbf22d4d60367acb50b647166d2ae19af17e68a3aa85587e7b7d739f34d7f20eed4292ccbda1b00eb54
       uuid: ff7c5bbe-cdba-4c59-8e1f-3458cbd9c2c5
 drupal_uri: http://default/_flysystem/fedora/testing_20_MODS.xml
 fedora_uri: http:/localhost:8080/fcrepo/rest/testing_20_MODS.xml
dateCreated: 2019-02-02 11:42:16
dateUpdated: 2019-02-02 11:42:16
208 rows in set (0.00 sec)

If you're OK with that simple fix, I'll open a PR, but I'll wait until we are OK with the suspicious default in those entries. I haven't spun up a clean VM to test this fix, but I'll do so to see if the default appears. It is possible that it is the result of some of my early hacking around to figure out what was going on; a clean VM should not have those default entries.

@whikloj
Member

whikloj commented Feb 3, 2019

@mjordan nice...except

fedora_uri: http:/localhost:8080/fcrepo/rest/testing_8_OBJ.jpg

you have http:/localhost; you probably need the double slash there.
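
The stray http:/ appears because str_replace('//', '/', ...) also collapses the // that follows the URI scheme. A minimal sketch of one alternative, assuming the goal is only to collapse repeated slashes in the path; the function name `collapsePathSlashes` is hypothetical and this is not code from EmitFileEvent.php:

```php
<?php
// Hypothetical helper: collapse runs of slashes, but never start a match
// directly after ":", so the "//" in "http://" is left alone.
function collapsePathSlashes(string $url): string
{
    // preg_replace() returns null on a regex error; fall back to the input then.
    return preg_replace('#(?<!:)/{2,}#', '/', $url) ?? $url;
}

$url = 'http://localhost:8080/fcrepo/rest//testing_8_OBJ.jpg';

echo str_replace('//', '/', $url), PHP_EOL;
// http:/localhost:8080/fcrepo/rest/testing_8_OBJ.jpg
echo collapsePathSlashes($url), PHP_EOL;
// http://localhost:8080/fcrepo/rest/testing_8_OBJ.jpg
```

A URL that has no doubled slashes in its path passes through unchanged.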

@mjordan
Contributor Author

mjordan commented Feb 3, 2019

Well, that was embarrassing, thanks for seeing that @whikloj.

Think I got it this time with

    if (isset($flysystem_config[$scheme]) && $flysystem_config[$scheme]['driver'] == 'fedora') {
      // $uri for files may contain 'fedora:///' so we need to replace the three / with two.
      if (strpos($uri, '///') !== FALSE) {
        $uri = str_replace('///', '//', $uri);
      }

      $data['fedora_uri'] = str_replace("$scheme://", $flysystem_config[$scheme]['config']['root'], $uri);
    }

fedora_hash: 32690f7b1d695d5216c197adc0c35de5118874c5d20f4a7b6c35bafa25a3ec56ff378c2111f4b8a2db840483d4203e82324724e868a60a67991334f295c0f8c3
drupal_hash: 855024be88d14307d2b61370b97542fedab39269b06fd3c47e17efd00e126b3f520df93e1e4c814301f52353ac3cb70151673e8c4fec39132208cc4a0517affc
       uuid: fe122140-7ffc-4108-87b3-63abb749319e
 drupal_uri: http://default/_flysystem/fedora/testing_12_MEDIUM_SIZE.jpg
 fedora_uri: http://localhost:8080/fcrepo/rest/testing_12_MEDIUM_SIZE.jpg
dateCreated: 2019-02-03 12:11:19
dateUpdated: 2019-02-03 12:11:19

fedora_hash: 3813a2a574c4bcd2786b701aaf0226409dc0548903037b8744c69d9f187968e4a02c31b7c9d79dc807223688dcd87290269457ce7dfd5fd0f2c8fac3cc5409f5
drupal_hash: 5b170c7978ef207af700373cead0a2d797104a1d97e190540a6d15f8d672d0b2e40d544312ae447448524c8c193f2fed338e0ee94b7dc036faf43aa219c79297
       uuid: e70ce26c-e5ae-401b-a242-99e3461bb3a2
 drupal_uri: http://default/_flysystem/fedora/masters/testing_12_AUDIT.xml
 fedora_uri: http://localhost:8080/fcrepo/rest/masters/testing_12_AUDIT.xml
dateCreated: 2019-02-03 12:10:22
dateUpdated: 2019-02-03 12:10:22

I'll rebuild a fresh VM and test this to see if the default reappears in the drupal_uris.

@mjordan
Contributor Author

mjordan commented Feb 4, 2019

On a clean box, I am getting default in my drupal_uri values (so it looks like it is not the result of my hacking around with configuration as I suspected earlier):

fedora_hash: 76bd0c5a16fefdf74c2e448ceef2d6b09dc7effe0edabc7a2dbc1f2edfa4b0f6ee9fc9bc4dd4789caff81d34604e0f9789af03a7231ab8c71003cb22323c3f81
drupal_hash: c73d31c244dcd121b3d7ed6ee764c6c7dbaf69e38337a9093d0705046fb6f987409bdfb1000a1030e01459c24edcf2efbb6f0c57675ca64d052ddca5d7e0d459
       uuid: fc80d9d3-ad7e-45f5-b63b-ac99356bdb04
 drupal_uri: http://default/_flysystem/fedora/testing_2_OBJ.jpg
 fedora_uri: http://localhost:8080/fcrepo/rest/testing_2_OBJ.jpg
dateCreated: 2019-02-04 09:25:19
dateUpdated: 2019-02-04 09:25:19

But the good news is, the following code addresses the double // issue:

    $data = parent::generateData($entity);
    if (isset($flysystem_config[$scheme]) && $flysystem_config[$scheme]['driver'] == 'fedora') {
      // $uri for files may contain 'fedora:///' so we need to replace the three / with two.
      if (strpos($uri, 'fedora:///') !== FALSE) {
        $uri = str_replace('fedora:///', 'fedora://', $uri);
      }
      $data['fedora_uri'] = str_replace("$scheme://", $flysystem_config[$scheme]['config']['root'], $uri);
    }
    return $data;
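
The conversion above can be exercised outside Drupal. The sketch below is a standalone approximation, not the code that went into the PR: the function name `toFedoraUri`, the example `$root` value with its trailing slash, and the use of `$scheme` in place of the hard-coded `fedora` are all assumptions.

```php
<?php
// Hypothetical standalone version of the URI conversion.
function toFedoraUri(string $uri, string $scheme, string $root): string
{
    // A file with no sub-path may arrive as "fedora:///name";
    // reduce the three slashes to two before swapping in the root.
    $uri = str_replace("$scheme:///", "$scheme://", $uri);
    return str_replace("$scheme://", $root, $uri);
}

$root = 'http://localhost:8080/fcrepo/rest/'; // assumed flysystem root value

echo toFedoraUri('fedora:///testing_2_OBJ.jpg', 'fedora', $root), PHP_EOL;
// http://localhost:8080/fcrepo/rest/testing_2_OBJ.jpg
echo toFedoraUri('fedora://masters/testing_12_AUDIT.xml', 'fedora', $root), PHP_EOL;
// http://localhost:8080/fcrepo/rest/masters/testing_12_AUDIT.xml
```

Matching on the scheme plus three slashes, rather than on any '///', keeps the replacement from touching other parts of the URI.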

Should I open a PR to include that or do you want me to wait/help figure out where that default is coming from?

@dannylamb
Contributor

@mjordan Just the // is great 👍 We can handle the default separately.

@mjordan
Contributor Author

mjordan commented Mar 11, 2019

Closing since this PR has been merged.
