feat(es-setup): add logic in elasticsearch setup to compare-and-update index if already exists #2312

Merged 7 commits on Apr 3, 2021

shakti-garg
Contributor

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable)

resolves #2310

@shakti-garg
Contributor Author

One auxiliary change I made in this PR is to change the value of the "max_ngram_diff" attribute in settings.json to a string, since that is how ES internally expects and persists it.
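
To illustrate why the value's type matters for a text-based comparison, here is a minimal sketch. It assumes the setup script compares pretty-printed JSON files with diff, as the logs later in this thread suggest; the helper name settings_differ and the file contents are made up for the example.

```sh
#!/bin/sh
# Hypothetical helper (not from the PR): print "changed" when two files
# differ and "no changes" when they are identical.
settings_differ() {
    if diff -u "$1" "$2" >/dev/null 2>&1; then
        echo "no changes"
    else
        echo "changed"
    fi
}

workdir=$(mktemp -d)

# What ES hands back: the value is persisted as a string.
printf '{\n  "max_ngram_diff": "19"\n}\n' > "$workdir/existing"
# The same setting declared as a number and as a string.
printf '{\n  "max_ngram_diff": 19\n}\n' > "$workdir/declared_number"
printf '{\n  "max_ngram_diff": "19"\n}\n' > "$workdir/declared_string"

settings_differ "$workdir/existing" "$workdir/declared_number"   # prints "changed"
settings_differ "$workdir/existing" "$workdir/declared_string"   # prints "no changes"

rm -rf "$workdir"
```

With the number form, every run would report a difference even though nothing meaningful changed, which is the kind of false positive the string form avoids.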

Contributor

@dexter-mh-lee dexter-mh-lee left a comment

This is amazing!! Awesome work!!! Just minor comments below.

Has this script been locally tested as a docker image? Can you share the results?

In the meantime, I will be testing locally as well.

Review threads (4, resolved): docker/elasticsearch-setup/create-indices.sh
@shakti-garg
Contributor Author

This is amazing!! Awesome work!!! Just minor comments below.

Has this script been locally tested as a docker image? Can you share the results?

In the meantime, I will be testing locally as well.

@dexter-mh-lee I have tested it locally as a docker image, but unfortunately I didn't capture screenshots for all the scenarios. I can redo those scenarios, but I have a better idea that would also save regression effort in the future. What are your thoughts on writing BATS tests for this script and then integrating them into the git CI workflow?

@dexter-mh-lee
Contributor

BATS test would be great. This is the only complicated script we have.

I ran dev.sh and it is failing.
Error starting command: /create-indices.sh - fork/exec /create-indices.sh: no such file or directory

I think it is failing to run bash scripts. Changing to #!/bin/sh gives me a script syntax error, so the file is definitely there.

@shakti-garg
Contributor Author

I ran dev.sh and it is failing.
Error starting command: /create-indices.sh - fork/exec /create-indices.sh: no such file or directory

I think it is failing to run bash scripts. Changing to #!/bin/sh gives me a script syntax error, so the file is definitely there.
@dexter-mh-lee You were correct. It was caused by changing the shell executable to bash when the elasticsearch-setup image doesn't have bash (my testing missed it because I am using a different base image in our internal fork... phew!).
I have reverted the script to plain sh and fixed the script syntax accordingly.
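
A quick way to catch this class of problem is to check that the interpreter named on a script's shebang line actually exists in the target image. The following is a minimal sketch; the helper names shebang_interpreter and check_interpreter are hypothetical and not part of the PR.

```sh
#!/bin/sh
# Hypothetical helpers (not from the PR) for checking a script's shebang.

# Print the interpreter path from the first line of a script, if any.
shebang_interpreter() {
    head -n 1 "$1" | sed -n 's/^#![[:space:]]*\([^[:space:]]*\).*/\1/p'
}

# Report whether that interpreter exists and is executable on this system.
check_interpreter() {
    interp=$(shebang_interpreter "$1")
    if [ -n "$interp" ] && [ -x "$interp" ]; then
        echo "ok: $interp"
    else
        echo "missing: $interp"
    fi
}
```

Run inside an image without bash, check_interpreter on a script that starts with #!/bin/bash would print "missing: /bin/bash". That also explains the confusing "no such file or directory" error: the kernel is reporting the missing interpreter, not a missing script.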

@shakti-garg
Contributor Author

Has this script been locally tested as a docker image? Can you share the results?
I have retested the script on my local machine. The results are below:

Scenario: ES doesn't have any index, running setup for the first time

➜  datahub git:(master) docker logs a98e941327f1
2021/04/01 13:12:01 Waiting for: http://elasticsearch:9200
2021/04/01 13:12:01 Problem with request: Get http://elasticsearch:9200: dial tcp 172.18.0.2:9200: getsockopt: connection refused. Sleeping 1s
2021/04/01 13:12:14 Problem with request: Get http://elasticsearch:9200: dial tcp 172.18.0.2:9200: getsockopt: connection refused. Sleeping 1s
2021/04/01 13:12:15 Received 200 from http://elasticsearch:9200
sh: true: unknown operand

creating index chartdocument
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  3976  100    72  100  3904     35   1898  0:00:02  0:00:02 --:--:--  1934
{"acknowledged":true,"shards_acknowledged":true,"index":"chartdocument"}
creating index corpuserinfodocument
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2378  100    79  100  2299    261   7612 --:--:-- --:--:-- --:--:--  7979
{"acknowledged":true,"shards_acknowledged":true,"index":"corpuserinfodocument"}
creating index dashboarddocument
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  3879  100    76  100  3803    340  17053 --:--:-- --:--:-- --:--:-- 17472
{"acknowledged":true,"shards_acknowledged":true,"index":"dashboarddocument"}
creating index datajobdocument
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  3978  100    74  100  3904    253  13369 --:--:-- --:--:-- --:--:-- 13576
{"acknowledged":true,"shards_acknowledged":true,"index":"datajobdocument"}
creating index dataflowdocument
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  3979  100    75  100  3904    449  23377 --:--:-- --:--:-- --:--:-- 23826
{"acknowledged":true,"shards_acknowledged":true,"index":"dataflowdocument"}
creating index dataprocessdocument
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  5108  100    78  100  5030    426  27486 --:--:-- --:--:-- --:--:-- 27912
{"acknowledged":true,"shards_acknowledged":true,"index":"dataprocessdocument"}
creating index datasetdocument
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  6785  100    74  100  6711    354  32110 --:--:-- --:--:-- --:--:-- 32464
{"acknowledged":true,"shards_acknowledged":true,"index":"datasetdocument"}
creating index mlmodeldocument
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  5868  100    74  100  5794    408  32011 --:--:-- --:--:-- --:--:-- 32419
{"acknowledged":true,"shards_acknowledged":true,"index":"mlmodeldocument"}
creating index tagdocument
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1811  100    70  100  1741    406  10122 --:--:-- --:--:-- --:--:-- 10468
2021/04/01 13:12:19 Command finished successfully.
2021/04/01 13:12:19 Command finished successfully.
{"acknowledged":true,"shards_acknowledged":true,"index":"tagdocument"}%

Scenario: ES has indexes, running setup for the second time

2021/04/01 12:39:02 Waiting for: http://elasticsearch:9200
2021/04/01 12:39:02 Problem with request: Get http://elasticsearch:9200: dial tcp 172.18.0.2:9200: getsockopt: connection refused. Sleeping 1s
2021/04/01 12:39:17 Received 200 from http://elasticsearch:9200
sh: true: unknown operand

comparing with existing version of index chartdocument
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1480  100  1480    0     0  98666      0 --:--:-- --:--:-- --:--:-- 98666
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   874  100   874    0     0  62428      0 --:--:-- --:--:-- --:--:-- 62428

no changes to index chartdocument mappings and settings

comparing with existing version of index corpuserinfodocument
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   908  100   908    0     0   147k      0 --:--:-- --:--:-- --:--:--  147k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   626  100   626    0     0   101k      0 --:--:-- --:--:-- --:--:--  101k

no changes to index corpuserinfodocument mappings and settings

comparing with existing version of index dashboarddocument
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1488  100  1488    0     0   111k      0 --:--:-- --:--:-- --:--:--  111k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   821  100   821    0     0  48294      0 --:--:-- --:--:-- --:--:-- 48294

no changes to index dashboarddocument mappings and settings

comparing with existing version of index datajobdocument
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1484  100  1484    0     0   120k      0 --:--:-- --:--:-- --:--:--  120k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   876  100   876    0     0   122k      0 --:--:-- --:--:-- --:--:--  122k

no changes to index datajobdocument mappings and settings

comparing with existing version of index dataflowdocument
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1486  100  1486    0     0   131k      0 --:--:-- --:--:-- --:--:--  131k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   877  100   877    0     0  46157      0 --:--:-- --:--:-- --:--:-- 46157

no changes to index dataflowdocument mappings and settings

comparing with existing version of index dataprocessdocument
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2142  100  2142    0     0   149k      0 --:--:-- --:--:-- --:--:--  149k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   861  100   861    0     0  71750      0 --:--:-- --:--:-- --:--:-- 71750

no changes to index dataprocessdocument mappings and settings

comparing with existing version of index datasetdocument
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2271  100  2271    0     0   201k      0 --:--:-- --:--:-- --:--:--  201k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1623  100  1623    0     0   226k      0 --:--:-- --:--:-- --:--:--  226k

no changes to index datasetdocument mappings and settings

comparing with existing version of index mlmodeldocument
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1912  100  1912    0     0   266k      0 --:--:-- --:--:-- --:--:--  266k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1486  100  1486    0     0   290k      0 --:--:-- --:--:-- --:--:--  362k

no changes to index mlmodeldocument mappings and settings

comparing with existing version of index tagdocument
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   890  100   890    0     0  30689      0 --:--:-- --:--:-- --:--:-- 30689
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   288  100   288    0     0  22153      0 --:--:-- --:--:-- --:--:-- 24000
--- /tmp/existing_sorted
+++ /tmp/data_sorted
@@ -70,7 +70,8 @@
             ],
             "type": "custom"
           }
-        }
+        },
+        "tokenizer": {}
       },
       "max_ngram_diff": "19"
     }

updating index tagdocument
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1822  100    81  100  1741    349   7504 --:--:-- --:--:-- --:--:--  7853
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
{"acknowledged":true,"shards_acknowledged":true,"index":"tagdocument_1617280758"}{
100   406  100   330  100    76   2244    517 --:--:-- --:--:-- --:--:--  2761
  "took" : 96,
  "timed_out" : false,
  "total" : 0,
  "updated" : 0,
  "created" : 0,
  "deleted" : 0,
  "batches" : 0,
  "version_conflicts" : 0,
  "noops" : 0,
  "retries" : {
    "bulk" : 0,
    "search" : 0
  },
  "throttled_millis" : 0,
  "requests_per_second" : -1.0,
  "throttled_until_millis" : 0,
  "failures" : [ ]
}
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    71  100    71    0     0   4437      0 --:--:-- --:--:-- --:--:--  4437
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    71  100    71    0     0  10142      0 --:--:-- --:--:-- --:--:-- 10142

Post-reindex document reconcialiation completed. doc_source_index_count: 0; doc_target_index_count: 0
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    57  100    57    0     0   8142      0 --:--:-- --:--:-- --:--:--  8142
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    21  100    21    0     0    265      0 --:--:-- --:--:-- --:--:--   265
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   146  100    21  100   125    287   1712 --:--:-- --:--:-- --:--:--  2000
{"acknowledged":true}{"acknowledged":true}
Reindexing to tagdocument_1617280758 succeded
2021/04/01 12:39:18 Command finished successfully.
2021/04/01 12:39:19 Command finished successfully.
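
The log above shows the updated index being recreated under a name with a numeric suffix (tagdocument_1617280758) and then populated by a reindex call. A minimal sketch of those two pieces follows, assuming the suffix is the Unix timestamp from date +%s (1617280758 corresponds to 2021-04-01 12:39:18 UTC, which matches the log timestamps) and that the standard _reindex request body is used. The function names are illustrative and not taken from create-indices.sh.

```sh
#!/bin/sh
# Illustrative helpers; names are assumptions, not from create-indices.sh.

# Build a versioned index name such as tagdocument_1617280758.
# The second argument defaults to the current Unix time.
versioned_index_name() {
    echo "${1}_${2:-$(date +%s)}"
}

# Build the JSON body for the Elasticsearch _reindex API.
reindex_body() {
    printf '{"source":{"index":"%s"},"dest":{"index":"%s"}}\n' "$1" "$2"
}

new_index=$(versioned_index_name tagdocument 1617280758)
echo "$new_index"                       # prints tagdocument_1617280758
reindex_body tagdocument "$new_index"
# prints {"source":{"index":"tagdocument"},"dest":{"index":"tagdocument_1617280758"}}
```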

@dexter-mh-lee
Contributor

Seems like there's a false positive above for index tagdocument
I am getting the same as well in my local test.

@dexter-mh-lee
Contributor

dexter-mh-lee commented Apr 1, 2021

Once I ran it after ingesting demo data (no change to mapping/setting),
I got the following

elasticsearch-setup       | --- /tmp/existing_sorted
elasticsearch-setup       | +++ /tmp/data_sorted
elasticsearch-setup       | @@ -7,15 +7,6 @@
elasticsearch-setup       |        "active": {
elasticsearch-setup       |          "type": "boolean"
elasticsearch-setup       |        },
elasticsearch-setup       | -      "emails": {
elasticsearch-setup       | -        "fields": {
elasticsearch-setup       | -          "keyword": {
elasticsearch-setup       | -            "ignore_above": 256,
elasticsearch-setup       | -            "type": "keyword"
elasticsearch-setup       | -          }
elasticsearch-setup       | -        },
elasticsearch-setup       | -        "type": "text"
elasticsearch-setup       | -      },
elasticsearch-setup       |        "fullName": {
elasticsearch-setup       |          "fields": {
elasticsearch-setup       |            "ngram": {

^ seems like an issue where emails wasn't set correctly in mappings.yml and was dynamically created as the document was indexed.

elasticsearch-setup       | Post-reindex document reconcialiation failed. doc_source_index_count: 12; doc_target_index_count: 0
elasticsearch-setup       |
elasticsearch-setup       | Reindexing to corpuserinfodocument_1617296035 failed

^ I see that the new indices were created and reindexed correctly, but for some reason, it's failing. Can it be that we are not waiting enough?
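
For context, the reconciliation message compares the document counts of the source and target indices. Below is a minimal sketch of such a check, assuming the counts come from the Elasticsearch _count API, whose response has the shape {"count":N,"_shards":{...}}. The function names and the sed-based extraction are illustrative only; the actual script may do this differently.

```sh
#!/bin/sh
# Illustrative sketch; function names are assumptions, not from the PR.

# Extract the integer "count" field from a compact _count response.
extract_count() {
    printf '%s\n' "$1" | sed -n 's/.*"count"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Compare source and target counts; succeed only when they match.
reconcile_counts() {
    if [ "$1" -eq "$2" ]; then
        echo "reconciliation completed. source: $1; target: $2"
    else
        echo "reconciliation failed. source: $1; target: $2"
        return 1
    fi
}

src=$(extract_count '{"count":12,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0}}')
dst=$(extract_count '{"count":0,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0}}')
reconcile_counts "$src" "$dst" || echo "target index is not caught up yet"
```

A single check like this right after the reindex call can fail simply because the target has not caught up, so it usually needs to be repeated a few times before being treated as a real failure.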

@shakti-garg
Contributor Author

Seems like there's a false positive above for index tagdocument
I am getting the same as well in my local test.

Yes, it was due to an empty tokenizer declaration in the tagdocument settings. I have removed it from settings.yaml.

Three other such cases are below. As you pointed out, they arise because data is dynamically written for fields not declared in mappings.yaml. Should we sync these up as well?

  1. corpuserinfodocument
@@ -7,15 +7,6 @@
       "active": {
         "type": "boolean"
       },
-      "emails": {
-        "fields": {
-          "keyword": {
-            "ignore_above": 256,
-            "type": "keyword"
-          }
-        },
-        "type": "text"
-      },
       "fullName": {
         "fields": {
           "ngram": {
  2. dataprocessdocument
@@ -22,12 +22,6 @@
       "name": {
         "type": "keyword"
       },
-      "numInputDatasets": {
-        "type": "long"
-      },
-      "numOutputDatasets": {
-        "type": "long"
-      },
       "num_inputs": {
         "type": "long"
       },

  3. mlmodeldocument

@@ -15,15 +15,6 @@
         },
         "type": "text"
       },
-      "description": {
-        "fields": {
-          "keyword": {
-            "ignore_above": 256,
-            "type": "keyword"
-          }
-        },
-        "type": "text"
-      },
       "evaluationDatasets": {
         "fields": {
           "urn_components": {

@shakti-garg
Contributor Author

elasticsearch-setup       | Post-reindex document reconcialiation failed. doc_source_index_count: 12; doc_target_index_count: 0
elasticsearch-setup       |
elasticsearch-setup       | Reindexing to corpuserinfodocument_1617296035 failed

^ I see that the new indices were created and reindexed correctly, but for some reason, it's failing. Can it be that we are not waiting enough?

The reindexing itself is not failing here. The issue was that brace expansion is not sh-compatible, so I have replaced it with the seq command.
Because of that incompatibility, the loop was not iterating: it returned after the first check, while reindexing takes some time to complete.
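
To illustrate the incompatibility: brace expansion such as {1..5} is a bash feature, and shells such as dash or BusyBox ash leave it as a literal string, so a for loop over it runs exactly once. Below is a minimal sketch of a seq-based retry loop; the helper name retry and the RETRY_DELAY variable are hypothetical and not taken from create-indices.sh.

```sh
#!/bin/sh
# Hypothetical helper: run a command up to N times until it succeeds.
# RETRY_DELAY (seconds between attempts) is an assumed knob, default 1.
retry() {
    attempts=$1
    shift
    # seq is used instead of {1..N}, which plain sh does not expand.
    for attempt in $(seq 1 "$attempts"); do
        if "$@"; then
            return 0
        fi
        sleep "${RETRY_DELAY:-1}"
    done
    return 1
}
```

For example, retry 5 counts_match would call a (hypothetical) counts_match check up to five times, giving the reindex time to finish before the reconciliation is declared failed.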

@dexter-mh-lee
Contributor

As you pointed out, they arise because data is dynamically written for fields not declared in mappings.yaml. Should we sync these up as well?

This is very tricky to detect. Ideally, all document fields should be declared in mappings.yaml. I think it's okay for now.

Contributor

@dexter-mh-lee dexter-mh-lee left a comment

LGTM. Thanks for making the changes!

Contributor

@shirshanka shirshanka left a comment

LGTM!
Thanks @shakti-garg and @dexter-mh-lee

Linked issue: feat(es-setup): compare-update-reindex if index already exists