Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proxito 404: check for files in the DB instead of storage #10512

Closed
stsewd opened this issue Jul 5, 2023 · 5 comments · Fixed by #10617
Closed

Proxito 404: check for files in the DB instead of storage #10512

stsewd opened this issue Jul 5, 2023 · 5 comments · Fixed by #10617
Assignees
Labels
Accepted Accepted issue on our roadmap Improvement Minor improvement to code

Comments

@stsewd
Copy link
Member

stsewd commented Jul 5, 2023

What's the problem this feature will solve?

Our 404 handler is slooooow, this is mainly because we hit our storage to check if a file exists,
in the worst case, we will do 6 calls to storage.

Describe the solution you'd like

Instead of relying on calls to storage to check if a given file exists, we can use our DB for that. How's that? Well, as part of our build process, we create an HTMLFile object for each HTML file found in the user docs.

HTMLFile.objects.create(
project=version.project,
version=version,
path=relpath,
name=filename,
rank=page_rank,
commit=commit,
build=build,
ignore=ignore,
)

So we already keep track of every HTML file. This table is BIG, so we may want to first try to trim it a little, I don't think we need to keep track of every HTML file, we just need to track index.html/README.html and 404.html files.

What about search?
We rely heavily on the HTMLFile model for search, but I think we can do just fine without it. When indexing, we can just walk the storage as we do when creating the HTML files

storage_path = version.project.get_storage_path(
type_='html', version_slug=version.slug, include_file=False
)
for root, __, filenames in build_media_storage.walk(storage_path):

And we can get the search ignore/ranking patterns from the config of the build object attached to the version.

def config(self):
"""
Proxy to the configuration of the build.
:returns: The configuration used in the last successful build.
:rtype: dict
"""
last_build = (
self.builds(manager=INTERNAL).filter(
state=BUILD_STATE_FINISHED,
success=True,
).order_by('-date')
.only('_config')
.first()
)
if last_build:
return last_build.config
return None

If we want, we can also skip that work and just hit that table as is right now.

Additional context

A little related to #10061, where I talk about not hitting our application twice for 404s that aren't from the storage.

@stsewd stsewd added the Needed: design decision A core team decision is required label Jul 5, 2023
@humitos
Copy link
Member

humitos commented Jul 6, 2023

So we already keep track of every HTML file. This table is BIG, so we may want to first try to trim it a little, I don't think we need to keep track of every HTML file, we just need to track index.html/README.html and 404.html files.

I understand that we don't need files like tutorial/builds/customize-your-builds.html because if the user hits that URL and it doesn't exist, we don't have to do anything, right? I mean, we only need to do the 404 handler when the URL ends with a trailing slash (e.g. tutorial/builds/customize-your-builds/)

If I'm correct here, I think it's a great idea what you are proposing.

By the way, we are getting rid of README.html feature in the near future I guess: #9993 (also related #1800)

We rely heavily on the HTMLFile model for search, but I think we can do just fine without it. When indexing, we can just walk the storage as we do when creating the HTML files

And we can get the search ignore/ranking patterns from the config of the build object attached to the version.

This also looks like a good idea to me. It simplifies the code and make everything more standardized, in my opinion.

@stsewd
Copy link
Member Author

stsewd commented Jul 6, 2023

I understand that we don't need files like tutorial/builds/customize-your-builds.html because if the user hits that URL and it doesn't exist, we don't have to do anything, right?

yeah, we don't need to know if those files exist, it we are in the 404 handler, it means it wasn't found in storage.

@ericholscher
Copy link
Member

We talked about this in a call, and we're 👍 on moving forward. Big wins:

  • Reduce N queries to storage to 1 query to DB
  • Better response times on 404s for users
  • Reduce size of DB tables on search (ImportedFile is our biggest table currently)

@stsewd
Copy link
Member Author

stsewd commented Jul 26, 2023

So, now one question, what do we want to do first?

  • Replace storage calls with one DB call
  • Reduce the table size

@ericholscher
Copy link
Member

ericholscher commented Jul 26, 2023

@stsewd I think the DB call work seems useful to start with, assuming the size of the table doesn't make it slow :)

@github-project-automation github-project-automation bot moved this to Planned in 📍Roadmap Jul 27, 2023
@humitos humitos added Improvement Minor improvement to code Accepted Accepted issue on our roadmap and removed Needed: design decision A core team decision is required labels Jul 27, 2023
stsewd added a commit that referenced this issue Aug 9, 2023
We have all the information we need in the DB already.

Closes #10512
stsewd added a commit that referenced this issue Aug 9, 2023
We have all the information we need in the DB already.

Closes #10512
@stsewd stsewd moved this from Planned to In progress in 📍Roadmap Aug 23, 2023
@github-project-automation github-project-automation bot moved this from In progress to Done in 📍Roadmap Aug 23, 2023
stsewd added a commit that referenced this issue Aug 23, 2023
We have all the information we need in the DB already.

Closes #10512
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Accepted Accepted issue on our roadmap Improvement Minor improvement to code
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

4 participants
@ericholscher @humitos @stsewd and others