[Gatsby Develop] Building a performant preview server (>10k nodes with dependent pages) #16616

Closed
georgiee opened this issue Aug 14, 2019 · 8 comments
Labels
stale? Issue that may be closed soon due to the original author not responding any more. type: question or discussion Issue discussing or asking a question about Gatsby

Comments

@georgiee
Contributor

georgiee commented Aug 14, 2019

The initial description of this issue is already dated; follow the progress in the replies.


Summary

Here comes the essence of this question/issue:

In Gatsby's develop server with the refresh webhook endpoint enabled, we create all pages during the bootstrap phase to reflect the current state of our external store/CMS. How can we add, update, or remove single pages without relying on the createPages lifecycle, which iterates over all pages first and later runs all page queries even though almost nothing changed? Even an empty webhook/refresh call can cause a rebuild time of many minutes.

Could createPagesStatefully (to create all pages once during bootstrap) together with onCreateNode (to call createPage for any subsequent node update, for example) be a viable approach?

What follows are some explanations around the topic, some details, and an example project to showcase the problem. Everything we learned and researched comes directly from the excellent docs and source files. 90% of the experiment is based on the awesome work of @DSchau who created almost everything in Gatsby's e2e suite 👌

This is how the experiment/demo project behaves (terminal recording not preserved):

You can see that all pages are recreated whenever something is received by the webhook. Here comes the full story:

Relevant information

When talking about large-scale environments, we mean Gatsby installations with a node count beyond 50,000 to 100,000 and the same number of derived pages. Imagine 50,000 news articles served by some headless CMS. Building with Gatsby in such an environment works for most people, as build times are acceptable and the upcoming incremental builds should help in cases where a single new node should generate only one additional page. It gets interesting in Gatsby's develop mode with the page-hot-reloader and the webhook with payload functionality enabled (ENABLE_GATSBY_REFRESH_ENDPOINT).

In such an environment, how do we handle node updates efficiently in Gatsby's develop server after the bootstrap? The bootstrap phase itself takes some time to build and reflect the current state of all data, which is fine. But how do we prevent Gatsby from (re-)creating all pages and running all page queries when the webhook payload adds, updates, or deletes only a single node?

The Gatsby Preview product seems to help some people with that challenge, but it is unfortunately not an option when the client's infrastructure is located in a closed network. Hence our current challenge is to serve a custom preview based on Gatsby's develop server.

To be prepared for the technical challenges, we dug through many package sources including the core of Gatsby itself and we read all of the excellent documentation about the Gatsby Internals. Kudos for that awesome summary! Things are still not 100% clear to us but the mental image is already starting to build up.

We already have a working prototype that naively pings the __refresh endpoint with no payload, with a handful of nodes and generated pages being processed. This was a really nice experience, but when we scaled things up it went south. We tried to build 5,000 pages and it already took many minutes (~20 min to rebuild all pages after a single node update). There are no images involved; it's the page processing. I'll spare you the details of that installation and created an isolated experiment instead.

Example Project/Experiment

The experiment is based on @DSchau 's work on webhook/fake-source in Gatsby's e2e suite

Here is our example project:
https://github.com/satellytes/gatsby-large-scale-preview-experiment

Run it and trigger some different webhook calls from a second session. Check the README for all of our thoughts from when we created the experiment. It overlaps somewhat with this issue description but might help clarify things.

INITIAL_NODES_TO_CREATE=1000 yarn develop

yarn webhook:full-sync
yarn webhook:new-item
yarn webhook:webhook-empty

When running the example, we create a set of initial nodes (INITIAL_NODES_TO_CREATE) by calling our new method api.hugeInitialSync only once in sourceNodes. The existing api.sync method is modified to accept a parameter updateAllNodes: true/false, which causes all nodes to be touched by incrementing a field named updated.

When the refresh endpoint is hit we can now decide among those scenarios:

  1. add new items (through the webhook, already present in the e2e project)
  2. touch all nodes and create a new node from inside (triggered by a new flag touchAll in the webhook payload)
  3. do nothing
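The three scenarios can also be triggered from a small Node script instead of the yarn scripts. This is a minimal sketch: it assumes the develop server listens on Gatsby's default port 8000, and the payload shapes (`items`, `touchAll: true`) are illustrative assumptions, not necessarily the exact format the experiment's source plugin expects.

```javascript
const http = require("http")

// Build the request options for Gatsby's refresh endpoint.
// Host and port are assumptions (defaults of `gatsby develop`).
function buildRefreshRequest(payload, { host = "localhost", port = 8000 } = {}) {
  const body = payload === undefined ? "" : JSON.stringify(payload)
  return {
    options: {
      host,
      port,
      path: "/__refresh",
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "Content-Length": Buffer.byteLength(body),
      },
    },
    body,
  }
}

// Send the request; resolves with the HTTP status code.
function postRefresh(payload, target) {
  const { options, body } = buildRefreshRequest(payload, target)
  return new Promise((resolve, reject) => {
    const req = http.request(options, res => {
      res.resume() // drain the response so the socket is released
      res.on("end", () => resolve(res.statusCode))
    })
    req.on("error", reject)
    req.end(body)
  })
}

// Hypothetical payloads for the three scenarios above
const scenarios = {
  newItem: { items: [{ id: "new-1", title: "A new item" }] },
  touchAll: { touchAll: true },
  empty: undefined,
}
```

With the develop server running, `postRefresh(scenarios.empty).then(console.log)` would send the "do nothing" call.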

The problem

Every time you post to the webhook, every single page is recreated, because we tell Gatsby to do so in createPages, which is called by the API runner if any page is dirty. It doesn't matter whether the payload is empty or filled.

  const { data } = await graphql(`
    {
      allFakeData {
        nodes {
          title
          fields {
            slug
          }
        }
      }
    }
  `)

  // Runs for every node on every refresh, no matter what actually changed
  data.allFakeData.nodes.forEach((node, index) => {
    createPage({
      // ...
    })
  })

See our file gatsby-node.js for full sources.

The createPages lifecycle is the idiomatic approach. It works at build time, and for most people (including us with a few pages) it also works at develop time with the hot reload functionality.

As said, with the preview mode activated we trigger an update (with or without a payload), which creates every page; in addition, all page queries are run again (this happens later in the lifecycle and also costs quite some time). As a result, any small update blocks the development server for minutes, depending on your machine and node count. We are unsure how to prevent Gatsby from doing so in the experiment with the fake API source.

Goals:

  • Create all pages of the current state of your API data during bootstrap
  • During the remaining lifetime of the server, wait for data changes and update the affected nodes accordingly
  • Whenever a node is added/updated: create that page
  • Whenever a node is deleted: delete the page
  • Prevent Gatsby from running all page queries for that node type as I know that only a single node instance changed.

Here are some approaches:

  • Remove createPages and call createPage in onCreateNode instead, as it's available there through the actions object (formerly boundActionCreators).
  • Do we need to leverage the createPagesStatefully lifecycle hook? We tried that and indeed the pages are not re-created upon refresh as intended; however, all the page queries are re-evaluated nevertheless.
  • Do we have to manage page-node dependencies manually with createPageDependency to prevent an updated node from triggering an update for all nodes of the same type?
  • Should we access the internal emitter and use some low-level data? Maybe the store?
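To make the createPagesStatefully approach more tangible, here is a minimal sketch of a gatsby-node.js that creates the baseline pages once during bootstrap. It reuses the allFakeData query from above; the template path and the context shape are assumptions for illustration, and, as noted in the list, this alone does not stop the page queries from being re-evaluated.

```javascript
// gatsby-node.js (sketch)
const path = require("path")

// Assumption: the page template lives at src/templates/fake-data.js
const TEMPLATE = path.resolve("src/templates/fake-data.js")

// Map one source node to the arguments of createPage.
function toPage(node) {
  return {
    path: node.fields.slug,
    component: TEMPLATE,
    context: { slug: node.fields.slug },
  }
}

// Called during bootstrap; pages created here are flagged as stateful,
// so the page hot reloader does not delete them on later refreshes.
async function createPagesStatefully({ graphql, actions }) {
  const { data } = await graphql(`
    {
      allFakeData {
        nodes {
          fields {
            slug
          }
        }
      }
    }
  `)
  data.allFakeData.nodes.forEach(node => actions.createPage(toPage(node)))
}

exports.createPagesStatefully = createPagesStatefully
```

Keeping `toPage` separate means the same mapping could be reused if single pages are created elsewhere later on.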

It would be awesome if we could get a little discussion running around this topic — as this might be of interest for other people working with many pages + gatsby develop server/preview.

Source Insights

We have checked many parts of Gatsby's sources. Here are some interesting files we dug through:

  • page-hot-reloader.js
    You can see that nodes added/deleted set the pagesDirty flag which causes all pages to be created once the api runner has settled.

  • develop.js
    That's where the refresh/preview mode (ENABLE_GATSBY_REFRESH_ENDPOINT) is activated. We can see that sourceNodes is triggered there.

  • utils/source-nodes.js
    We can see how the api runner is activated. That's when we technically understood why the webhook causes the lifecycle createPages to be called.

  • gatsby-source-graphql/src/gatsby-node.js
    We found createPageDependency in the wild (inside a plugin/source) only in the gatsby-source-graphql plugin.

  • What's happening in the internal redux store can be seen in packages/gatsby/src/redux. It did not help much. We looked up what happens around page dependencies.

I'm sorry for the length of this post. I wanted to provide as much information as possible. I also joined the Discord channel, but I think the topic is worth discussing in this issue.

Thanks for reading and I appreciate any input on this topic.

@georgiee georgiee added the type: question or discussion Issue discussing or asking a question about Gatsby label Aug 14, 2019
@georgiee georgiee changed the title [Gatsby Develop] Building a performant preview server (> 10k nodes with dependent pages) [Gatsby Develop] Building a performant preview server (>10k nodes with dependent pages) Aug 14, 2019
@georgiee
Contributor Author

georgiee commented Aug 20, 2019

We made some progress.

  • createPagesStatefully is the way to go, at least for the initial bootstrap to generate most of the pages. They are indeed stateful because we are usually not going to change them.

  • When a node is updated those pages still update — and only the pages connected with the nodes. I don't know what was wrong in my example.

  • For a new node, we can't create its page in onCreateNode, because such pages are the default (dynamic, non-stateful) kind.

  • The hot reloading mechanism deletes every such page that wasn't touched in the current run, see:

    Array.from(store.getState().pages.values()).forEach(page => {
      if (
        !page.isCreatedByStatefulCreatePages &&
        page.updatedAt < timestamp &&
        page.path !== `/404.html`
      ) {
        deleteComponentsDependencies([page.path])
        deletePage(page)
      }
    })

  • So we are basically not supposed to use createPage outside the createPages lifecycle as those pages are deleted in the next life cycle round
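To make that consequence concrete, the deletion rule from the snippet above can be restated as a standalone predicate. This is a simplified sketch that mirrors the three conditions; the sample page objects and timestamps are made up, and nothing here calls into Gatsby.

```javascript
// Mirrors the condition used in the page hot reloader snippet above.
function isDeletedOnRefresh(page, timestamp) {
  return (
    !page.isCreatedByStatefulCreatePages &&
    page.updatedAt < timestamp &&
    page.path !== "/404.html"
  )
}

// Hypothetical pages; timestamps are arbitrary numbers.
const refreshStartedAt = 200
const pages = [
  { path: "/stateful/", isCreatedByStatefulCreatePages: true, updatedAt: 100 },
  { path: "/from-on-create-node/", isCreatedByStatefulCreatePages: false, updatedAt: 100 },
  { path: "/recreated-in-create-pages/", isCreatedByStatefulCreatePages: false, updatedAt: 250 },
  { path: "/404.html", isCreatedByStatefulCreatePages: false, updatedAt: 100 },
]

const deleted = pages
  .filter(page => isDeletedOnRefresh(page, refreshStartedAt))
  .map(page => page.path)

console.log(deleted) // only "/from-on-create-node/" is listed
```

A non-stateful page survives only if it is created again during the current createPages run, which is why a page created once in onCreateNode disappears on the next refresh.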

Idea: mark every new node as new so we can query only those nodes (instead of all nodes that already have a page). That way we have a smaller set of pages to query and rebuild.

@georgiee
Contributor Author

georgiee commented Aug 21, 2019

Well, let's make this issue useful for other souls searching for a preview solution. I will add useful links to articles, but mostly to source files in Gatsby, in this post:

I will try to keep this list updated.

@georgiee
Contributor Author

Can't believe this, the initial example setup is wrong:

activity.setStatus(
   `Creating ${index + 1} of ${totalPages} total pages`
);

The activity timer drastically slows down the example and gives the illusion that createPages is running slowly. This doesn't mean that we don't have real performance problems in our build, but our whole isolation of the problem and the debugging were based on a false premise.

You can mimic this behaviour by dropping this in your createPages:

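// `reporter` is provided by Gatsby in the createPages arguments
const activity = reporter.activityTimer(`create pages`)
activity.start()

// Each setStatus call is expensive; 1000 synchronous calls add up
for (let i = 0; i < 1000; i++) {
  activity.setStatus(
    `[DUMMY] Creating ${i + 1} of ${1000} total pages`
  )
}
activity.end()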

This takes 7 seconds on my machine just to run the for loop. I found the activity timer because it's used by the page queries info spinner. The main difference: GraphQL queries are reported asynchronously, while I'm using a synchronous for loop. It might be worth raising this as an issue for the reporter/activity functionality.
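One way to keep the progress output without paying for every call is to forward only every n-th status update. The helper below is a hypothetical sketch (`throttleStatus` is not part of Gatsby), and the fake activity object stands in for what reporter.activityTimer returns so that the snippet runs on its own.

```javascript
// Forward setStatus only every `every` items and on the final item.
// `activity` is whatever reporter.activityTimer(...) returned.
function throttleStatus(activity, total, every = 100) {
  return index => {
    const done = index + 1
    if (done % every === 0 || done === total) {
      activity.setStatus(`Creating ${done} of ${total} total pages`)
    }
  }
}

// Stand-in for reporter.activityTimer(...) so the sketch runs outside Gatsby
const fakeActivity = {
  calls: 0,
  setStatus() {
    this.calls += 1
  },
}

const totalPages = 1000
const report = throttleStatus(fakeActivity, totalPages, 100)
for (let i = 0; i < totalPages; i++) {
  report(i)
}

console.log(fakeActivity.calls) // 10 status updates instead of 1000
```

Inside createPages, `report(index)` would replace the direct activity.setStatus call in the forEach loop.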

@gatsbot

gatsbot bot commented Sep 12, 2019

Hiya!

This issue has gone quiet. Spooky quiet. 👻

We get a lot of issues, so we currently close issues after 30 days of inactivity. It’s been at least 20 days since the last update here.

If we missed this issue or if you want to keep it open, please reply here. You can also add the label "not stale" to keep this issue open!

As a friendly reminder: the best way to see this issue, or any other, fixed is to open a Pull Request. Check out gatsby.dev/contribute for more information about opening PRs, triaging issues, and contributing!

Thanks for being a part of the Gatsby community! 💪💜

@gatsbot gatsbot bot added the stale? Issue that may be closed soon due to the original author not responding any more. label Sep 12, 2019
@georgiee
Contributor Author

Let's close this until we have a more specific problem to talk about.

@nadinagray

@georgiee would love your insights if you've got a functioning solution -- paying for the Gatsby Preview currently. Encountering issues re: support responsiveness and evaluating building our own solution.

@sidharthachatterjee
Contributor

@nadinagerlach Apologies for the issues with support responsiveness. I've gone ahead and responded to all your tickets and taken care of the issue as well! 🙂

@georgiee
Contributor Author

Hello @nadinagerlach,
we are currently focused on getting the actual page implementations done and have postponed the work on the preview server. We did run some agile spikes to explore possibilities.

Some things we considered:

  • Continuously check other Gatsby source plugins (especially the ones that support Gatsby Preview) to gain some technical insights.
  • Continue to explore the stateful pages approach. The idea in a nutshell: create all pages statefully (a kind of baseline). Whenever a CMS change comes in, derive a dynamic page to benefit from the Gatsby update magic. That way we have a huge bucket of static pages that are not changing and a bucket of pages that changed in the recent past. The latter bucket needs to be emptied from time to time to keep performance up, for example with a restart during the night.
  • Our data comes from an external GraphQL server and we include it through schema stitching. This also means our data never lands in the internal redux store that is used by Gatsby to build the data graph for the internal GraphQL endpoint. Any created page is still sitting in that store but we are unsure how this relates/behaves in the overall preview (develop) and build workflow. We are collecting some knowledge on this too.

The last time I personally worked on our preview server problem was summer 2019. A lot has happened since then, and maybe more resources on building a preview server have appeared? There is an excellent documentation section about all the internals of Gatsby called Gatsby Internals. Reading that together with the Gatsby source code helped a lot, but more guidance on building your own preview server would be very welcome, as it's such a crucial part of a Gatsby installation beyond a certain size.

I hope you have a better experience and I would be happy to hear about your preview experiences 🙏
