
Need faster deploys #160

Open
skalee opened this issue Feb 25, 2021 · 14 comments

@skalee
Contributor
skalee commented Feb 25, 2021

Deploying the IEV site took over an hour, most of which (50 minutes) was spent sending the produced files to S3. We need to speed it up.

Currently we deploy with our custom Rake task defined here: https://github.com/geolexica/geolexica-server/blob/master/lib/tasks/deploy.rake. Under the hood it uses aws s3 sync, an official AWS CLI tool.
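
For context, a minimal sketch of what such a task roughly does; this is not the actual file from the linked repo, and the bucket and site-directory variables are placeholders:

```ruby
# Hypothetical sketch of a deploy task that shells out to the AWS CLI
# (the real task lives in lib/tasks/deploy.rake in geolexica-server).
namespace :deploy do
  desc "Upload the generated site to S3 with aws s3 sync"
  task :s3 do
    site_dir = ENV.fetch("SITE_DIR", "_site")   # placeholder
    bucket   = ENV.fetch("DEPLOY_BUCKET")       # placeholder, e.g. "example-com"

    # --delete removes remote files that no longer exist locally.
    sh "aws", "s3", "sync", site_dir, "s3://#{bucket}", "--delete"
  end
end
```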

Some ideas how to deal with that can be found in glossarist/iev-demo-site#66.

@skalee
Contributor Author

skalee commented Feb 27, 2021

@ronaldtse I have two questions:

  • If I end up creating a brand-new tool (which is possible, because these slow uploads are likely caused by poor parallelism), does it matter whether it's a Node or a Ruby tool?
  • During the upload, the site may be inconsistent (some pages old, some new). Is that a problem? If so, there are two options (see the sketch after this list for the first one):
    1. We may upload the site to a temporary bucket and then copy it to the proper one. Copying files between buckets in the same region should be much faster than uploading, especially with the S3P tool you found.
    2. Alternatively, we can display some maintenance page.
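
A rough sketch of the bucket-to-bucket copy from option 1, assuming the aws-sdk-s3 gem; the bucket names and region here are placeholders, not real configuration:

```ruby
# Hypothetical sketch of option 1: server-side copy of every object from a
# temporary staging bucket into the live bucket. Names below are placeholders.
require "aws-sdk-s3"

s3      = Aws::S3::Client.new(region: "us-east-1") # assumed region
staging = "geolexica-staging"                      # placeholder
live    = "geolexica-live"                         # placeholder

s3.list_objects_v2(bucket: staging).each do |page|
  page.contents.each do |object|
    # copy_object is a server-side operation; the data never leaves AWS,
    # so it should be much faster than re-uploading from the CI runner.
    s3.copy_object(
      bucket:      live,
      key:         object.key,
      copy_source: "#{staging}/#{object.key}"
    )
  end
end
```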

@ronaldtse
Member

If I end up creating a brand-new tool (which is possible, because these slow uploads are likely caused by poor parallelism), does it matter whether it's a Node or a Ruby tool?

No, as long as you can maintain it.

During the upload, the site may be inconsistent (some pages old, some new). Is that a problem? If so, there are two options:

  1. We may upload the site to a temporary bucket and then copy it to the proper one. Copying files between buckets in the same region should be much faster than uploading, especially with the S3P tool you found.

Great idea! GitHub now also supports environments, so deploys can be queued: if one job is running, the other jobs wait. In this case, we can use S3 Transfer Acceleration for the temporary bucket (as long as its name does not contain dots).

  2. Alternatively, we can display some maintenance page.

This is probably necessary in either case.

The third option is to use AWS DynamoDB or MongoDB Atlas, which will be necessary for high-frequency update workloads.

@ronaldtse
Member

https://github.com/cobbzilla/s3s3mirror seems to work for mirroring.

@ronaldtse
Member

I just found out that we could enable Transfer Acceleration if we rename the buckets to remove the dots. It's now possible to use an arbitrarily named S3 bucket as an origin for CloudFront, so we can use "example-com" instead of "example.com" as the bucket name. Let me see what we can do.
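
For reference, a sketch of what enabling acceleration could look like with the aws-sdk-s3 gem, assuming a dot-free bucket name; the bucket name and region below are placeholders:

```ruby
# Hypothetical sketch: enable Transfer Acceleration on a dot-free bucket.
require "aws-sdk-s3"

s3 = Aws::S3::Client.new(region: "us-east-1") # assumed region

s3.put_bucket_accelerate_configuration(
  bucket: "example-com", # placeholder; accelerated bucket names must not contain dots
  accelerate_configuration: { status: "Enabled" }
)

# Uploads would then go through the accelerate endpoint,
# <bucket>.s3-accelerate.amazonaws.com, instead of the regular one.
```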

@skalee
Contributor Author

skalee commented Feb 27, 2021

The third option is to use AWS DynamoDB or MongoDB Atlas, which will be necessary for high-frequency update workloads.

Is this actually expected? I thought glossaries would not be updated very frequently.

@skalee
Contributor Author

skalee commented Feb 27, 2021

I just found out that we could enable Transfer Acceleration if we rename the buckets to remove the dots. It's now possible to use an arbitrarily named S3 bucket as an origin for CloudFront, so we can use "example-com" instead of "example.com" as the bucket name. Let me see what we can do.

AWS docs say:

You might want to use Transfer Acceleration on a bucket for various reasons:

  • Your customers upload to a centralized bucket from all over the world.
  • You transfer gigabytes to terabytes of data on a regular basis across continents.
  • You can't use all of your available bandwidth over the internet when uploading to Amazon S3.

Doesn't sound like our case.

@ronaldtse
Member

Frequency: it's also about burst frequency, e.g. if people make subsequent changes quickly.

I found a way to make Transfer Acceleration work with CloudFront, but it requires a separate Lambda@Edge function to return index.html in order to mimic S3 website functionality.

In that case we may not need two buckets, but let's see.

@skalee
Contributor Author

skalee commented Feb 27, 2021

Frequency: it's also about burst frequency, e.g. if people make subsequent changes quickly.

Wow, that sounds like a very different thing from the deploys we have now. If burst updates can happen, then slow uploads aren't our only problem: building the full site from scratch will be too slow as well. Note that IEV has around 20k concepts. We need some kind of incremental site builds in GHA to handle burst updates, or throttling, or debouncing.

@skalee
Contributor Author

skalee commented Feb 27, 2021

Also, we need to prevent race conditions between deploys.

@skalee
Contributor Author

skalee commented Feb 27, 2021

I'm not sure what exactly Paneron will be responsible for when it comes to site generation, so this may be a silly idea: we could use Paneron to generate the concept pages, and then use Jekyll to bind them into a site. Jekyll supports incremental site generation, so if only a few files are modified, the build should finish quite fast. Then we need to upload these modified files without touching the others; maybe aws s3 sync will do much better in that case (a rough sketch follows below).

Obviously that won't speed up full site rebuilds, which we need too.
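
A sketch of what that flow could look like in a Rake task; the flags are standard Jekyll and AWS CLI options, while the destination directory and the bucket environment variable are placeholders:

```ruby
# Hypothetical sketch: incremental Jekyll build, then sync only the output.
# --incremental makes Jekyll regenerate only documents whose sources changed
# since the previous build (the previous _site and .jekyll-metadata must be kept).
desc "Incrementally rebuild the site and push it to S3"
task :incremental_deploy do
  sh "bundle", "exec", "jekyll", "build", "--incremental", "--destination", "_site"
  sh "aws", "s3", "sync", "_site", "s3://#{ENV.fetch('DEPLOY_BUCKET')}" # placeholder env var
end
```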

@skalee
Contributor Author

skalee commented Feb 27, 2021

My new idea involves persisting the generated site across builds. This would be a separate Git repo (maybe hosted on GitHub, maybe existing only in the GHA cache; it doesn't really matter), because I don't trust file timestamps as much as commit dates. A file modification timestamp can be updated for any reason, whereas a Git commit date means an actual change to the file contents.

In steps (all done in GHA):

  1. Obtain generated site (Git repo) from previous builds.
  2. Rebuild site (incrementally or not).
  3. Commit all the differences.
  4. List all files in generated site along with their last commit timestamp.
  5. List all files in S3 bucket along with their last modification timestamp.
  6. Send only those files which have changed since the last deploy.

This approach should greatly reduce deploy time compared to aws s3 sync. The latter compares MD5 hashes in order to tell which files have changed. Whilst this is a great idea in the general case, it surely takes some time, even though files stored in S3 have these hashes already computed (unless the given bucket is encrypted). Alternatively, s3 sync can look at file sizes, which is much faster but not as reliable.
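
A rough sketch of steps 4-6, assuming the aws-sdk-s3 gem; the bucket name, region, and _site path are placeholders, and a real task would need proper error handling:

```ruby
# Hypothetical sketch of steps 4-6: compare the git commit time of every
# generated file against the object's LastModified in S3 and upload only
# the files that are newer (or missing remotely). Names are placeholders.
require "aws-sdk-s3"
require "shellwords"
require "time"

site_dir = "_site"            # placeholder: checkout of the generated-site repo
bucket   = "geolexica-live"   # placeholder
s3       = Aws::S3::Client.new(region: "us-east-1") # assumed region

# Step 5: last-modified time of every object currently in the bucket.
remote_mtimes = {}
s3.list_objects_v2(bucket: bucket).each do |page|
  page.contents.each { |o| remote_mtimes[o.key] = o.last_modified }
end

# Steps 4 and 6: walk the generated-site repo and upload a file when its
# last commit is newer than the corresponding object (or the object is absent).
Dir.chdir(site_dir) do
  Dir.glob("**/*").select { |p| File.file?(p) }.each do |path|
    commit_time = Time.parse(`git log -1 --format=%cI -- #{Shellwords.escape(path)}`.strip)
    next if remote_mtimes[path] && remote_mtimes[path] >= commit_time

    File.open(path, "rb") do |file|
      s3.put_object(bucket: bucket, key: path, body: file)
    end
  end
end
```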

@ronaldtse
Member

@skalee I think a more comprehensive approach is needed for S3 bucket sync; syncing unchanged items is clearly not desired. A possible mechanism is to maintain a hash index at the root (with hash keys of all files), which is updated by some cron/Lambda function, so that when we upload something we can work out which files need (or don't need) updating.
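
One possible sketch of that idea, assuming the aws-sdk-s3 gem and a JSON manifest stored under a well-known key; the key name, bucket, region, and hashing scheme here are made up for illustration, and the deploy job itself updates the manifest rather than a separate cron/Lambda function:

```ruby
# Hypothetical sketch of a hash index kept in the bucket: compare local
# SHA-256 digests against the stored manifest and upload only the differences.
require "aws-sdk-s3"
require "digest"
require "json"

bucket       = "geolexica-live"        # placeholder
manifest_key = ".deploy-manifest.json" # made-up key name
site_dir     = "_site"                 # placeholder
s3           = Aws::S3::Client.new(region: "us-east-1") # assumed region

# Previous manifest (empty on the first deploy).
previous =
  begin
    JSON.parse(s3.get_object(bucket: bucket, key: manifest_key).body.read)
  rescue Aws::S3::Errors::NoSuchKey
    {}
  end

# Current manifest: path => SHA-256 of the freshly generated file.
current = Dir.chdir(site_dir) do
  Dir.glob("**/*").select { |p| File.file?(p) }
     .map { |p| [p, Digest::SHA256.file(p).hexdigest] }.to_h
end

# Upload only files whose hash changed, then store the new manifest.
current.each do |path, digest|
  next if previous[path] == digest

  File.open(File.join(site_dir, path), "rb") do |file|
    s3.put_object(bucket: bucket, key: path, body: file)
  end
end
s3.put_object(bucket: bucket, key: manifest_key, body: JSON.generate(current))
```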

@skalee
Contributor Author

skalee commented Mar 1, 2021

FYI, I've just triggered a re-deploy on iev-demo-site and it's slow again, despite the fact that nothing was changed and most files are identical.

@ronaldtse ronaldtse moved this to 🆕 New in Geolexica Jul 24, 2022
@ronaldtse ronaldtse moved this from 🆕 New to 📋 Backlog in Geolexica Jul 24, 2022