Proxito 404: check for files in the DB instead of storage #10512

stsewd · 2023-07-05T21:43:56Z

What's the problem this feature will solve?

Our 404 handler is slooooow, this is mainly because we hit our storage to check if a file exists,
in the worst case, we will do 6 calls to storage.

Describe the solution you'd like

Instead of relying on calls to storage to check if a given file exists, we can use our DB for that. How's that? Well, as part of our build process, we create an HTMLFile object for each HTML file found in the user docs.

readthedocs.org/readthedocs/projects/tasks/search.py

Lines 138 to 147 in f7d390c

    
           HTMLFile.objects.create( 
        
               project=version.project, 
        
               version=version, 
        
               path=relpath, 
        
               name=filename, 
        
               rank=page_rank, 
        
               commit=commit, 
        
               build=build, 
        
               ignore=ignore, 
        
           )

So we already keep track of every HTML file. This table is BIG, so we may want to first try to trim it a little, I don't think we need to keep track of every HTML file, we just need to track index.html/README.html and 404.html files.

What about search?
We rely heavily on the HTMLFile model for search, but I think we can do just fine without it. When indexing, we can just walk the storage as we do when creating the HTML files

readthedocs.org/readthedocs/projects/tasks/search.py

Lines 107 to 110 in f7d390c

    
           storage_path = version.project.get_storage_path( 
        
               type_='html', version_slug=version.slug, include_file=False 
        
           ) 
        
           for root, __, filenames in build_media_storage.walk(storage_path):

And we can get the search ignore/ranking patterns from the config of the build object attached to the version.

readthedocs.org/readthedocs/builds/models.py

Lines 310 to 327 in 84f889a

    
               def config(self): 
        
                   """ 
        
                   Proxy to the configuration of the build. 
        
                   :returns: The configuration used in the last successful build. 
        
                   :rtype: dict 
        
                   """ 
        
                   last_build = ( 
        
                       self.builds(manager=INTERNAL).filter( 
        
                           state=BUILD_STATE_FINISHED, 
        
                           success=True, 
        
                       ).order_by('-date') 
        
                       .only('_config') 
        
                       .first() 
        
                   ) 
        
                   if last_build: 
        
                       return last_build.config 
        
                   return None

If we want, we can also skip that work and just hit that table as is right now.

Additional context

A little related to #10061, where I talk about not hitting our application twice for 404s that aren't from the storage.

humitos · 2023-07-06T10:28:29Z

So we already keep track of every HTML file. This table is BIG, so we may want to first try to trim it a little, I don't think we need to keep track of every HTML file, we just need to track index.html/README.html and 404.html files.

I understand that we don't need files like tutorial/builds/customize-your-builds.html because if the user hits that URL and it doesn't exist, we don't have to do anything, right? I mean, we only need to do the 404 handler when the URL ends with a trailing slash (e.g. tutorial/builds/customize-your-builds/)

If I'm correct here, I think it's a great idea what you are proposing.

By the way, we are getting rid of README.html feature in the near future I guess: #9993 (also related #1800)

We rely heavily on the HTMLFile model for search, but I think we can do just fine without it. When indexing, we can just walk the storage as we do when creating the HTML files

And we can get the search ignore/ranking patterns from the config of the build object attached to the version.

This also looks like a good idea to me. It simplifies the code and make everything more standardized, in my opinion.

stsewd · 2023-07-06T15:34:02Z

I understand that we don't need files like tutorial/builds/customize-your-builds.html because if the user hits that URL and it doesn't exist, we don't have to do anything, right?

yeah, we don't need to know if those files exist, it we are in the 404 handler, it means it wasn't found in storage.

ericholscher · 2023-07-26T16:15:45Z

We talked about this in a call, and we're 👍 on moving forward. Big wins:

Reduce N queries to storage to 1 query to DB
Better response times on 404s for users
Reduce size of DB tables on search (ImportedFile is our biggest table currently)

stsewd · 2023-07-26T17:18:23Z

So, now one question, what do we want to do first?

Replace storage calls with one DB call
Reduce the table size

ericholscher · 2023-07-26T20:34:10Z

@stsewd I think the DB call work seems useful to start with, assuming the size of the table doesn't make it slow :)

We have all the information we need in the DB already. Closes #10512

stsewd added the Needed: design decision A core team decision is required label Jul 5, 2023

humitos added this to 📍Roadmap Jul 27, 2023

github-project-automation bot moved this to Planned in 📍Roadmap Jul 27, 2023

humitos added Improvement Minor improvement to code Accepted Accepted issue on our roadmap and removed Needed: design decision A core team decision is required labels Jul 27, 2023

humitos assigned stsewd Jul 27, 2023

stsewd added a commit that referenced this issue Aug 9, 2023

Proxito: Don't hit storage for 404s

c4e0d06

We have all the information we need in the DB already. Closes #10512

stsewd mentioned this issue Aug 9, 2023

Proxito: Don't hit storage for 404s #10617

Merged

stsewd added a commit that referenced this issue Aug 9, 2023

Proxito: Don't hit storage for 404s

9a1bd6c

We have all the information we need in the DB already. Closes #10512

stsewd mentioned this issue Aug 10, 2023

Search: stop relying on the DB when indexing #10623

Closed

humitos mentioned this issue Aug 22, 2023

ImportedFile: use BigAutoField for primary key #9669

Open

stsewd mentioned this issue Aug 22, 2023

Proxito: query the DB instead of storage when serving robots.txt #10659

Open

stsewd moved this from Planned to In progress in 📍Roadmap Aug 23, 2023

stsewd closed this as completed in #10617 Aug 23, 2023

github-project-automation bot moved this from In progress to Done in 📍Roadmap Aug 23, 2023

stsewd added a commit that referenced this issue Aug 23, 2023

Proxito: Don't hit storage for 404s (#10617)

dcfc862

We have all the information we need in the DB already. Closes #10512

stsewd mentioned this issue Aug 30, 2023

Build: track imported files for external versions #10690

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proxito 404: check for files in the DB instead of storage #10512

Proxito 404: check for files in the DB instead of storage #10512

stsewd commented Jul 5, 2023

humitos commented Jul 6, 2023

stsewd commented Jul 6, 2023

ericholscher commented Jul 26, 2023

stsewd commented Jul 26, 2023

ericholscher commented Jul 26, 2023 •

edited

Loading

Proxito 404: check for files in the DB instead of storage #10512

Proxito 404: check for files in the DB instead of storage #10512

Comments

stsewd commented Jul 5, 2023

What's the problem this feature will solve?

Describe the solution you'd like

Additional context

humitos commented Jul 6, 2023

stsewd commented Jul 6, 2023

ericholscher commented Jul 26, 2023

stsewd commented Jul 26, 2023

ericholscher commented Jul 26, 2023 • edited Loading

ericholscher commented Jul 26, 2023 •

edited

Loading