Skip to content
This repository has been archived by the owner on Feb 8, 2018. It is now read-only.

Rebuild npm syncing #4438

Merged
merged 11 commits into from
May 5, 2017
Merged

Rebuild npm syncing #4438

merged 11 commits into from
May 5, 2017

Conversation

chadwhitacre
Copy link
Contributor

@chadwhitacre chadwhitacre commented May 3, 2017

Picks up from #4148 and #4427 (comment). Part of #4427.

We managed to load up a snapshot of npm back on #4148, but only barely enough to show stubby pages. Now that #4305 is pretty much ready to go, it's time to get some more robust npm syncing in place. Turns out the old API we were depending on is going away, so we need to rebuild this subsystem around npm's CouchDB change stream.

Specs

Todo

  • clear out old code
  • write new code
  • what about deletes?
  • remove ijson dep, add couchdb
  • moar tests for consume!
  • log distance behind registry?
  • fix on travis
  • check behavior when we catch up

@chadwhitacre chadwhitacre mentioned this pull request May 3, 2017
15 tasks
@chadwhitacre
Copy link
Contributor Author

I'm pretty sure the initial load that took two days included README fetching, which we ripped out in #4211. I believe now we can load up from just the metadata, which should go much quicker.

@chadwhitacre
Copy link
Contributor Author

Curveball!

"Deprecating the /-/all registry endpoint"

@chadwhitacre
Copy link
Contributor Author

chadwhitacre commented May 3, 2017

If you are using the endpoint as a way to get a list of all packages, we encourage you to write a registry follower that watches the changes stream at replicate.npmjs.com for public packages. We provide sample code and libraries to support you.

@chadwhitacre
Copy link
Contributor Author

Pretty sure we want to use https://github.com/djc/couchdb-python.

@chadwhitacre
Copy link
Contributor Author

@chadwhitacre
Copy link
Contributor Author

@chadwhitacre
Copy link
Contributor Author

chadwhitacre commented May 3, 2017

Yeah, this is gonna be it. :-)

[gratipay] $ ./stream-npm-registry.py 
--Return--
> /Users/whit537/personal/gratipay/gratipay.com/stream-npm-registry.py(10)<module>()->None
-> import pdb; pdb.set_trace()
(Pdb) changes
{u'last_seq': 46739, u'results': [{u'changes': [{u'rev': u'1-4136ab2028eaa41eeb63e22b028172a0'}], u'id': u'_design/scratch', u'seq': 2}, {u'changes': [{u'rev': u'1-4136ab2028eaa41eeb63e22b028172a0'}], u'id': u'_design/app', u'seq': 3}, {u'deleted': True, u'changes': [{u'rev': u'2-997ac5f43938c18e61b537c648a819ea'}], u'id': u'netlify-yo-styleguide', u'seq': 45291}, {u'deleted': True, u'changes': [{u'rev': u'2-2d3c93e9c1e6311165d0ff4db2252175'}], u'id': u'nwc-18next', u'seq': 45312}, {u'deleted': True, u'changes': [{u'rev': u'2-cfd3dab84525c7a3cc0b2870e34cf2a8'}], u'id': u'bemwork', u'seq': 45655}, {u'deleted': True, u'changes': [{u'rev': u'2-ba019773cc7349a5b5e15be0e99c9b45'}], u'id': u'eslint-config-testharness', u'seq': 45721}, {u'deleted': True, u'changes': [{u'rev': u'2-025e1b8750f106d7aa800a562be9ace9'}], u'id': u'babel-preset-backpack-react-app', u'seq': 45778}, {u'deleted': True, u'changes': [{u'rev': u'2-36d03f8f1f0dad72b3d18dbd7653f663'}], u'id': u'node-websockets', u'seq': 45899}, {u'deleted': True, u'changes': [{u'rev': u'4-98077090c8d8cafcebb26fb7368c4c8b'}], u'id': u'phuoctt2015', u'seq': 46707}, {u'deleted': True, u'changes': [{u'rev': u'2-a32acab1e307d8359a15294693206675'}], u'id': u'rose-common', u'seq': 46739}]}
(Pdb) pp changes
{u'last_seq': 46739,
 u'results': [{u'changes': [{u'rev': u'1-4136ab2028eaa41eeb63e22b028172a0'}],
               u'id': u'_design/scratch',
               u'seq': 2},
              {u'changes': [{u'rev': u'1-4136ab2028eaa41eeb63e22b028172a0'}],
               u'id': u'_design/app',
               u'seq': 3},
              {u'changes': [{u'rev': u'2-997ac5f43938c18e61b537c648a819ea'}],
               u'deleted': True,
               u'id': u'netlify-yo-styleguide',
               u'seq': 45291},
              {u'changes': [{u'rev': u'2-2d3c93e9c1e6311165d0ff4db2252175'}],
               u'deleted': True,
               u'id': u'nwc-18next',
               u'seq': 45312},
              {u'changes': [{u'rev': u'2-cfd3dab84525c7a3cc0b2870e34cf2a8'}],
               u'deleted': True,
               u'id': u'bemwork',
               u'seq': 45655},
              {u'changes': [{u'rev': u'2-ba019773cc7349a5b5e15be0e99c9b45'}],
               u'deleted': True,
               u'id': u'eslint-config-testharness',
               u'seq': 45721},
              {u'changes': [{u'rev': u'2-025e1b8750f106d7aa800a562be9ace9'}],
               u'deleted': True,
               u'id': u'babel-preset-backpack-react-app',
               u'seq': 45778},
              {u'changes': [{u'rev': u'2-36d03f8f1f0dad72b3d18dbd7653f663'}],
               u'deleted': True,
               u'id': u'node-websockets',
               u'seq': 45899},
              {u'changes': [{u'rev': u'4-98077090c8d8cafcebb26fb7368c4c8b'}],
               u'deleted': True,
               u'id': u'phuoctt2015',
               u'seq': 46707},
              {u'changes': [{u'rev': u'2-a32acab1e307d8359a15294693206675'}],
               u'deleted': True,
               u'id': u'rose-common',
               u'seq': 46739}]}
(Pdb)
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import absolute_import, division, print_function, unicode_literals

from couchdb import Database


npm = Database('https://skimdb.npmjs.com/registry')
changes = npm.changes(limit=10)
import pdb; pdb.set_trace()

screen shot 2017-05-03 at 6 41 29 am

@chadwhitacre
Copy link
Contributor Author

@chadwhitacre
Copy link
Contributor Author

Alright, I think to start with we should go for a naive approach where we have a single process/thread that consumes the registry stream and inserts/updates in our database all at once. Decoupling fetch and update will be more complicated and should be done because we're getting too far behind otherwise. We could even probably build a quick dashboard to show far behind we are—or log over to Librato.

@chadwhitacre chadwhitacre mentioned this pull request May 3, 2017
@chadwhitacre
Copy link
Contributor Author

Shelving:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import absolute_import, division, print_function, unicode_literals

from couchdb import Database
from gratipay import wireup


def go(db):
    npm = Database('https://skimdb.npmjs.com/registry')
    changes = npm.changes(feed='continuous', include_docs=True)
    for change in changes:
        doc = change['doc']
        if 'name' not in doc:
            continue  # not a package, probably a design doc*
        name = doc['name']
        description = doc.get('description', '')
        emails = [e for e in [m.get('email') for m in doc.get('maintainers', [])] if e]

        try:
            db.run( "update packages set description=%s, emails=%s where package_manager='npm' and name=%s"
                  , (description, emails, name)
                   )
        except:
            db.run('insert into packages () values ()')





# * https://github.com/npm/registry/blob/aef8a275/docs/follower.md#clean-up
if __name__ == '__main__':
    env = wireup.env()
    db = wireup.db(env)
    go(db)

@chadwhitacre
Copy link
Contributor Author

This is basically a rewrite of this subsystem.

@chadwhitacre
Copy link
Contributor Author

This will actually be much better though. No more ijson dependency and also much closer to real-time. No more batch mode.

@chadwhitacre chadwhitacre changed the title Sync npm Rebuild npm syncing May 3, 2017
@chadwhitacre
Copy link
Contributor Author

Eep! Time to upgrade to Postgres 9.6 locally. 😊

psycopg2.ProgrammingError: syntax error at or near "ON"
LINE 6:     ON CONFLICT (package_manager, name) UPDATE

@chadwhitacre
Copy link
Contributor Author

Yesssss!!!

screen shot 2017-05-03 at 2 57 50 pm

@chadwhitacre
Copy link
Contributor Author

Okay! I'm going to run this locally and see how long it takes and how it behaves.

@chadwhitacre
Copy link
Contributor Author

Added some logging.

screen shot 2017-05-03 at 4 59 04 pm

@chadwhitacre
Copy link
Contributor Author

Started a long run ...

@chadwhitacre
Copy link
Contributor Author

pid-77299 thread-140735204086528 (MainThread) Picking up with npm sync at -1.

@chadwhitacre
Copy link
Contributor Author

I'm at about 80,000 after about 10(?) minutes.

@chadwhitacre
Copy link
Contributor Author

So maybe an hour for the whole thing?

@chadwhitacre
Copy link
Contributor Author

That's way better than two days, anyway. ☺️

@chadwhitacre
Copy link
Contributor Author

deleted documents include the "deleted": true attribute

http://docs.couchdb.org/en/2.0.0/api/database/changes.html#polling

@chadwhitacre
Copy link
Contributor Author

Out of time for now.

pid-77350 thread-140735204086528 (MainThread) KeyboardInterrupt
pid-77350 thread-140735204086528 (MainThread) Encountered an error, will pick up with 517201 in 60 seconds (Ctrl-C to exit) ...
^C
real    17m47.575s
user    1m16.416s
sys     0m25.121s
[gratipay] $
Every 1.0s: echo 'select count(*) from packages' | psql gratipay                 Wed May  3 17:27:49 2017

Null display is "¤".
Line style is unicode.
Border style is 2.
┌────────┐
│ count  │
├────────┤
│ 100404 │
└────────┘
(1 row)

@chadwhitacre
Copy link
Contributor Author

Deletes should remove locally. Need to take care to unlink teams from packages when deleting packages. It's okay for the team to stick around, I think? It'll have a 404 homepage on npmjs.

connection.commit()


def delete(cursor, processed):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd be clearer if we renamed processed to processed_doc

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 631dfc9.

for change in change_stream(last_seq):
if change.get('deleted'):
# Hack to work around conflation of design docs and packages in updates
op, doc = delete, {'name': change['id']}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a bit confusing. delete takes a dictionary, although it only needs one string as the argument. Also, we don't need to pass the fake doc ({'name': change['id']}) through process_doc.

At the cost of a line or two more, I think this can be simplified.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something along the lines of:

screen shot 2017-05-05 at 5 37 48 pm

Raw version:

Before:

with db.get_connection() as connection:
        for change in change_stream(last_seq):
            if change.get('deleted'):
                # Hack to work around conflation of design docs and packages in updates
                op, doc = delete, {'name': change['id']}
            else:
                op, doc = upsert, change['doc']
            processed = process_doc(doc)
            if not processed:
                continue
            cursor = connection.cursor()
            op(cursor, processed)
            cursor.run('UPDATE worker_coordination SET npm_last_seq=%(seq)s', change)
            connection.commit()


def delete(cursor, processed):
    cursor.run("DELETE FROM packages WHERE package_manager='npm' AND name=%(name)s", processed)


def upsert(cursor, processed):
    cursor.run('''
    INSERT INTO packages
                (package_manager, name, description, emails)
         VALUES ('npm', %(name)s, %(description)s, %(emails)s)
    ON CONFLICT (package_manager, name) DO UPDATE
            SET description=%(description)s, emails=%(emails)s
    ''', processed)

After:

with db.get_connection() as connection:
        for change in change_stream(last_seq):
            cursor = connection.cursor()
            if change.get('deleted'):
                # Hack to work around conflation of design docs and packages in updates
                delete(cursor, change['id'])
            else:
                upsert(cursor, process_doc(doc))
            cursor.run('UPDATE worker_coordination SET npm_last_seq=%(seq)s', change)
            connection.commit()


def delete(cursor, package_name):
    cursor.run("DELETE FROM packages WHERE package_manager='npm' AND name=%s", package_name)


def upsert(cursor, processed_doc):
    cursor.run('''
    INSERT INTO packages
                (package_manager, name, description, emails)
         VALUES ('npm', %(name)s, %(description)s, %(emails)s)
    ON CONFLICT (package_manager, name) DO UPDATE
            SET description=%(description)s, emails=%(emails)s
    ''', processed_doc)

Only downside I see here is that we're doing a little bit more work (calling process_doc, checking the deleted key) inside the transaction

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That doesn't account for skipping docs with no name key. How about 631dfc9?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes 631dfc9 looks good

with sentry.teller(env):
consume_change_stream(production_change_stream, db)
try:
last_seq = get_last_seq(db)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, if we're calling get_last_seq here anyway - might make sense to simplify the function definition of consume_change_stream to accept the stream directly, and not a function that has to be called with seq to return the stream?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in af41409.

return self

def __exit__(self, exc_type, exc_value, traceback):
self.tell_sentry(exc_type, {})
Copy link
Contributor

@rohitpaulk rohitpaulk May 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does exc_type have all the details that we need to send to sentry? Shouldn't we pass traceback and exc_value? (I'm not sure what they are, but traceback sure sounds important)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sentry accesses Python's global exception state directly during captureException (via sys.exc_info, presumably), so we don't have to pass it through these function calls.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting

import traceback

from aspen import log
from gratipay import wireup
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

utils importing wireup? 😛 That seems hacky. No neater way?

Copy link
Contributor Author

@chadwhitacre chadwhitacre May 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No neater way?

Not on this PR. 😞

I agree it's hacky. Eventually I would see rewiring Sentry along the lines of what was started in #4345 and other PRs listed under "new email subflooring" on #4427.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(This PR is already 1.5x our 400-net-lines rule of thumb.)

@chadwhitacre
Copy link
Contributor Author

/me looking into failures ...

@rohitpaulk
Copy link
Contributor

I'm good once travis is :)

@chadwhitacre
Copy link
Contributor Author

Travis is good! :-D

@rohitpaulk
Copy link
Contributor

==

@chadwhitacre
Copy link
Contributor Author

chadwhitacre commented May 5, 2017

Some discussion of error reporting and retry architecture in slack.

@rohitpaulk
Copy link
Contributor

Merging and deploying...

@rohitpaulk rohitpaulk merged commit df3efcd into master May 5, 2017
@rohitpaulk rohitpaulk deleted the sync-npm branch May 5, 2017 13:26
@chadwhitacre
Copy link
Contributor Author

When you're done you could try adding an instrument to http://inside.gratipay.com/appendices/health for npm sync lag. Librato is in 1Password so it would be a good test for that as well. :)

@rohitpaulk
Copy link
Contributor

Okay, we've got an error..

screen shot 2017-05-05 at 7 02 28 pm

@rohitpaulk
Copy link
Contributor

rohitpaulk commented May 5, 2017

Hmm, I had a sync_npm folder lying around.. wonder where that came from

@rohitpaulk
Copy link
Contributor

I forgot to add the env var 😞 Gratipay was down for around 3 minutes, back up now

@rohitpaulk
Copy link
Contributor

Now to figure out how to run the syncer. Add it to the heroku procfile?

@rohitpaulk
Copy link
Contributor

I'm going to try to run as a one-off dyno first

@chadwhitacre
Copy link
Contributor Author

Some deploy log in slack.

@chadwhitacre
Copy link
Contributor Author

Hmm, I had a sync_npm folder lying around.. wonder where that came from

Maybe pyc files kept Git from removing it after the switch from sync_npm/ to sync_npm.py?

chadwhitacre added a commit that referenced this pull request May 17, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants