Rebuild npm syncing #4438

chadwhitacre · 2017-05-03T10:06:50Z

Picks up from #4148 and #4427 (comment). Part of #4427.

We managed to load up a snapshot of npm back on #4148, but only barely enough to show stubby pages. Now that #4305 is pretty much ready to go, it's time to get some more robust npm syncing in place. Turns out the old API we were depending on is going away, so we need to rebuild this subsystem around npm's CouchDB change stream.

Specs

don't store author email address, only maintainer (Integrate initial package claiming PRs #4305 (comment))
deduplicate emails (shouldn't be a problem w/ only maintainers?)

Todo

chadwhitacre · 2017-05-03T10:10:37Z

I'm pretty sure the initial load that took two days included README fetching, which we ripped out in #4211. I believe now we can load up from just the metadata, which should go much quicker.

chadwhitacre · 2017-05-03T10:24:32Z

Curveball!

"Deprecating the /-/all registry endpoint"

chadwhitacre · 2017-05-03T10:25:51Z

If you are using the endpoint as a way to get a list of all packages, we encourage you to write a registry follower that watches the changes stream at replicate.npmjs.com for public packages. We provide sample code and libraries to support you.

chadwhitacre · 2017-05-03T10:34:00Z

Pretty sure we want to use https://github.com/djc/couchdb-python.

chadwhitacre · 2017-05-03T10:36:41Z

Probably this API?

https://pythonhosted.org/CouchDB/client.html#couchdb.client.Database.changes

chadwhitacre · 2017-05-03T10:39:57Z

Underlying API: http://docs.couchdb.org/en/2.0.0/api/database/changes.html.

chadwhitacre · 2017-05-03T10:42:02Z

Yeah, this is gonna be it. :-)

[gratipay] $ ./stream-npm-registry.py 
--Return--
> /Users/whit537/personal/gratipay/gratipay.com/stream-npm-registry.py(10)<module>()->None
-> import pdb; pdb.set_trace()
(Pdb) changes
{u'last_seq': 46739, u'results': [{u'changes': [{u'rev': u'1-4136ab2028eaa41eeb63e22b028172a0'}], u'id': u'_design/scratch', u'seq': 2}, {u'changes': [{u'rev': u'1-4136ab2028eaa41eeb63e22b028172a0'}], u'id': u'_design/app', u'seq': 3}, {u'deleted': True, u'changes': [{u'rev': u'2-997ac5f43938c18e61b537c648a819ea'}], u'id': u'netlify-yo-styleguide', u'seq': 45291}, {u'deleted': True, u'changes': [{u'rev': u'2-2d3c93e9c1e6311165d0ff4db2252175'}], u'id': u'nwc-18next', u'seq': 45312}, {u'deleted': True, u'changes': [{u'rev': u'2-cfd3dab84525c7a3cc0b2870e34cf2a8'}], u'id': u'bemwork', u'seq': 45655}, {u'deleted': True, u'changes': [{u'rev': u'2-ba019773cc7349a5b5e15be0e99c9b45'}], u'id': u'eslint-config-testharness', u'seq': 45721}, {u'deleted': True, u'changes': [{u'rev': u'2-025e1b8750f106d7aa800a562be9ace9'}], u'id': u'babel-preset-backpack-react-app', u'seq': 45778}, {u'deleted': True, u'changes': [{u'rev': u'2-36d03f8f1f0dad72b3d18dbd7653f663'}], u'id': u'node-websockets', u'seq': 45899}, {u'deleted': True, u'changes': [{u'rev': u'4-98077090c8d8cafcebb26fb7368c4c8b'}], u'id': u'phuoctt2015', u'seq': 46707}, {u'deleted': True, u'changes': [{u'rev': u'2-a32acab1e307d8359a15294693206675'}], u'id': u'rose-common', u'seq': 46739}]}
(Pdb) pp changes
{u'last_seq': 46739,
 u'results': [{u'changes': [{u'rev': u'1-4136ab2028eaa41eeb63e22b028172a0'}],
               u'id': u'_design/scratch',
               u'seq': 2},
              {u'changes': [{u'rev': u'1-4136ab2028eaa41eeb63e22b028172a0'}],
               u'id': u'_design/app',
               u'seq': 3},
              {u'changes': [{u'rev': u'2-997ac5f43938c18e61b537c648a819ea'}],
               u'deleted': True,
               u'id': u'netlify-yo-styleguide',
               u'seq': 45291},
              {u'changes': [{u'rev': u'2-2d3c93e9c1e6311165d0ff4db2252175'}],
               u'deleted': True,
               u'id': u'nwc-18next',
               u'seq': 45312},
              {u'changes': [{u'rev': u'2-cfd3dab84525c7a3cc0b2870e34cf2a8'}],
               u'deleted': True,
               u'id': u'bemwork',
               u'seq': 45655},
              {u'changes': [{u'rev': u'2-ba019773cc7349a5b5e15be0e99c9b45'}],
               u'deleted': True,
               u'id': u'eslint-config-testharness',
               u'seq': 45721},
              {u'changes': [{u'rev': u'2-025e1b8750f106d7aa800a562be9ace9'}],
               u'deleted': True,
               u'id': u'babel-preset-backpack-react-app',
               u'seq': 45778},
              {u'changes': [{u'rev': u'2-36d03f8f1f0dad72b3d18dbd7653f663'}],
               u'deleted': True,
               u'id': u'node-websockets',
               u'seq': 45899},
              {u'changes': [{u'rev': u'4-98077090c8d8cafcebb26fb7368c4c8b'}],
               u'deleted': True,
               u'id': u'phuoctt2015',
               u'seq': 46707},
              {u'changes': [{u'rev': u'2-a32acab1e307d8359a15294693206675'}],
               u'deleted': True,
               u'id': u'rose-common',
               u'seq': 46739}]}
(Pdb)

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import absolute_import, division, print_function, unicode_literals

from couchdb import Database


npm = Database('https://skimdb.npmjs.com/registry')
changes = npm.changes(limit=10)
import pdb; pdb.set_trace()

chadwhitacre · 2017-05-03T12:19:57Z

https://blog.andyet.com/2015/04/06/postgres-pubsub-with-json/

chadwhitacre · 2017-05-03T12:32:52Z

Alright, I think to start with we should go for a naive approach where we have a single process/thread that consumes the registry stream and inserts/updates in our database all at once. Decoupling fetch and update will be more complicated and should only be done if we find we're getting too far behind otherwise. We could even probably build a quick dashboard to show how far behind we are, or log over to Librato.
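A minimal sketch of that naive single-loop approach, for illustration only. It assumes an in-memory SQLite database as a stand-in for the real Postgres `packages` table, a plain list in place of the registry stream, and emails stored as a comma-joined string; `consume`, `sync_state`, and the sample changes are hypothetical names, not anything from this PR.

```python
import sqlite3


def consume(changes, connection):
    """Apply each change to the local database, one transaction per change."""
    for change in changes:
        doc = change.get('doc') or {}
        if 'name' in doc:  # docs without a name are not packages; skip the write
            emails = [m['email'] for m in doc.get('maintainers', []) if m.get('email')]
            connection.execute(
                'INSERT OR REPLACE INTO packages '
                '(package_manager, name, description, emails) '
                "VALUES ('npm', ?, ?, ?)",
                (doc['name'], doc.get('description', ''), ','.join(emails)),
            )
        # Record progress in the same transaction as the write.
        connection.execute('UPDATE sync_state SET last_seq = ?', (change['seq'],))
        connection.commit()


connection = sqlite3.connect(':memory:')
connection.execute(
    'CREATE TABLE packages (package_manager TEXT, name TEXT, description TEXT, '
    'emails TEXT, PRIMARY KEY (package_manager, name))'
)
connection.execute('CREATE TABLE sync_state (last_seq INTEGER)')
connection.execute('INSERT INTO sync_state VALUES (-1)')
connection.commit()

# Hypothetical sample data shaped like changes fetched with include_docs=True.
sample_changes = [
    {'seq': 2, 'id': '_design/app', 'doc': {'_id': '_design/app'}},
    {'seq': 5, 'id': 'foo', 'doc': {'name': 'foo', 'description': 'first',
                                    'maintainers': [{'email': 'alice@example.com'}]}},
    {'seq': 9, 'id': 'foo', 'doc': {'name': 'foo', 'description': 'second',
                                    'maintainers': [{'email': 'alice@example.com'}]}},
]
consume(sample_changes, connection)
rows = connection.execute('SELECT name, description, emails FROM packages').fetchall()
last_seq = connection.execute('SELECT last_seq FROM sync_state').fetchone()[0]
```

Committing the write and the sequence checkpoint together means a restart can resume from the stored sequence without replaying or skipping a change.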

chadwhitacre · 2017-05-03T16:59:08Z

Shelving:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import absolute_import, division, print_function, unicode_literals

from couchdb import Database
from gratipay import wireup


def go(db):
    npm = Database('https://skimdb.npmjs.com/registry')
    changes = npm.changes(feed='continuous', include_docs=True)
    for change in changes:
        doc = change['doc']
        if 'name' not in doc:
            continue  # not a package, probably a design doc*
        name = doc['name']
        description = doc.get('description', '')
        emails = [e for e in [m.get('email') for m in doc.get('maintainers', [])] if e]

        try:
            db.run( "update packages set description=%s, emails=%s where package_manager='npm' and name=%s"
                  , (description, emails, name)
                   )
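        except Exception:
            # Placeholder: the column list and values were not filled in yet.
            # Note that UPDATE does not raise when no row matches, so this
            # branch would not run for packages we haven't seen before.
            db.run('insert into packages () values ()')

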
# * https://github.com/npm/registry/blob/aef8a275/docs/follower.md#clean-up
if __name__ == '__main__':
    env = wireup.env()
    db = wireup.db(env)
    go(db)

chadwhitacre · 2017-05-03T17:09:04Z

This is basically a rewrite of this subsystem.

chadwhitacre · 2017-05-03T17:14:38Z

This will actually be much better though. No more ijson dependency and also much closer to real-time. No more batch mode.

chadwhitacre · 2017-05-03T18:14:42Z

Eep! Time to upgrade to Postgres 9.6 locally. 😊

psycopg2.ProgrammingError: syntax error at or near "ON"
LINE 6:     ON CONFLICT (package_manager, name) UPDATE

chadwhitacre · 2017-05-03T18:58:09Z

Yesssss!!!

chadwhitacre · 2017-05-03T20:17:45Z

Okay! I'm going to run this locally and see how long it takes and how it behaves.

chadwhitacre · 2017-05-03T20:59:29Z

Added some logging.

chadwhitacre · 2017-05-03T21:07:48Z

Started a long run ...

chadwhitacre · 2017-05-03T21:07:58Z

pid-77299 thread-140735204086528 (MainThread) Picking up with npm sync at -1.

chadwhitacre · 2017-05-03T21:23:38Z

I'm at about 80,000 after about 10(?) minutes.

chadwhitacre · 2017-05-03T21:24:03Z

So maybe an hour for the whole thing?

chadwhitacre · 2017-05-03T21:24:14Z

That's way better than two days, anyway. ☺️

chadwhitacre · 2017-05-03T21:26:45Z

deleted documents include the "deleted": true attribute

http://docs.couchdb.org/en/2.0.0/api/database/changes.html#polling
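A small sketch of how the `deleted` flag could drive dispatch, using a few of the entries from the pdb output earlier in this thread. The `classify` helper is hypothetical, and recognising design docs by their `_design/` id prefix is an assumption made for this example.

```python
def classify(change):
    """Return ('delete', name), ('upsert', doc), or ('skip', None) for one change."""
    if change['id'].startswith('_design/'):
        return ('skip', None)             # design doc, not a package
    if change.get('deleted'):
        return ('delete', change['id'])   # deletions are identified by the change id
    return ('upsert', change.get('doc'))


results = [
    {'changes': [{'rev': '1-4136ab2028eaa41eeb63e22b028172a0'}],
     'id': '_design/app', 'seq': 3},
    {'changes': [{'rev': '2-997ac5f43938c18e61b537c648a819ea'}],
     'deleted': True, 'id': 'netlify-yo-styleguide', 'seq': 45291},
    {'changes': [{'rev': '2-2d3c93e9c1e6311165d0ff4db2252175'}],
     'deleted': True, 'id': 'nwc-18next', 'seq': 45312},
]
actions = [classify(change) for change in results]
```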

chadwhitacre · 2017-05-03T21:27:58Z

Out of time for now.

pid-77350 thread-140735204086528 (MainThread) KeyboardInterrupt
pid-77350 thread-140735204086528 (MainThread) Encountered an error, will pick up with 517201 in 60 seconds (Ctrl-C to exit) ...
^C
real    17m47.575s
user    1m16.416s
sys     0m25.121s
[gratipay] $

Every 1.0s: echo 'select count(*) from packages' | psql gratipay                 Wed May  3 17:27:49 2017

Null display is "¤".
Line style is unicode.
Border style is 2.
┌────────┐
│ count  │
├────────┤
│ 100404 │
└────────┘
(1 row)
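The "will pick up with 517201 in 60 seconds" line suggests a retry loop that resumes from the last recorded sequence. A rough sketch of that pattern follows; `run_with_retry`, its parameters, and the fake consumer are hypothetical and not taken from the actual script.

```python
import time


def run_with_retry(get_last_seq, consume, max_attempts=3, delay=60, sleep=time.sleep):
    """Call consume(last_seq) until it returns, resuming from the stored sequence."""
    for attempt in range(1, max_attempts + 1):
        last_seq = get_last_seq()
        try:
            consume(last_seq)
            return attempt
        except Exception:
            if attempt == max_attempts:
                raise
            print('Encountered an error, will pick up with %s in %s seconds ...'
                  % (get_last_seq(), delay))
            sleep(delay)


# Demo with stand-ins: the first attempt fails after some progress was recorded.
state = {'last_seq': -1}
starts = []


def fake_consume(last_seq):
    starts.append(last_seq)
    if last_seq < 0:
        state['last_seq'] = 517201  # progress checkpointed before the failure
        raise IOError('connection dropped')


attempts = run_with_retry(lambda: state['last_seq'], fake_consume,
                          sleep=lambda seconds: None)
```

In the log above a KeyboardInterrupt produced the retry message; this sketch catches only `Exception`, so Ctrl-C would exit immediately instead.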

chadwhitacre · 2017-05-04T14:05:44Z

Deletes should remove locally. Need to take care to unlink teams from packages when deleting packages. It's okay for the team to stick around, I think? It'll have a 404 homepage on npmjs.

rohitpaulk · 2017-05-05T11:59:40Z

gratipay/sync_npm.py

+            connection.commit()
+
+
+def delete(cursor, processed):


I think it'd be clearer if we renamed processed to processed_doc

Done in 631dfc9.

rohitpaulk · 2017-05-05T12:01:55Z

gratipay/sync_npm.py

+        for change in change_stream(last_seq):
+            if change.get('deleted'):
+                # Hack to work around conflation of design docs and packages in updates
+                op, doc = delete, {'name': change['id']}


I think this is a bit confusing. delete takes a dictionary, although it only needs one string as the argument. Also, we don't need to pass the fake doc ({'name': change['id']}) through process_doc.

At the cost of a line or two more, I think this can be simplified.

Something along the lines of:

Raw version:

Before:

with db.get_connection() as connection:
    for change in change_stream(last_seq):
        if change.get('deleted'):
            # Hack to work around conflation of design docs and packages in updates
            op, doc = delete, {'name': change['id']}
        else:
            op, doc = upsert, change['doc']
        processed = process_doc(doc)
        if not processed:
            continue
        cursor = connection.cursor()
        op(cursor, processed)
        cursor.run('UPDATE worker_coordination SET npm_last_seq=%(seq)s', change)
        connection.commit()


def delete(cursor, processed):
    cursor.run("DELETE FROM packages WHERE package_manager='npm' AND name=%(name)s", processed)


def upsert(cursor, processed):
    cursor.run('''
        INSERT INTO packages
                    (package_manager, name, description, emails)
             VALUES ('npm', %(name)s, %(description)s, %(emails)s)
        ON CONFLICT (package_manager, name) DO UPDATE
                SET description=%(description)s, emails=%(emails)s
    ''', processed)

After:

with db.get_connection() as connection:
    for change in change_stream(last_seq):
        cursor = connection.cursor()
        if change.get('deleted'):
            # Hack to work around conflation of design docs and packages in updates
            delete(cursor, change['id'])
        else:
            upsert(cursor, process_doc(change['doc']))
        cursor.run('UPDATE worker_coordination SET npm_last_seq=%(seq)s', change)
        connection.commit()


def delete(cursor, package_name):
    cursor.run("DELETE FROM packages WHERE package_manager='npm' AND name=%s", (package_name,))


def upsert(cursor, processed_doc):
    cursor.run('''
        INSERT INTO packages
                    (package_manager, name, description, emails)
             VALUES ('npm', %(name)s, %(description)s, %(emails)s)
        ON CONFLICT (package_manager, name) DO UPDATE
                SET description=%(description)s, emails=%(emails)s
    ''', processed_doc)

The only downside I see here is that we're doing a little more work (calling process_doc, checking the deleted key) inside the transaction.

That doesn't account for skipping docs with no name key. How about 631dfc9?

Ah, yes 631dfc9 looks good

rohitpaulk · 2017-05-05T12:17:58Z

gratipay/cli/sync_npm.py

+        with sentry.teller(env):
+            consume_change_stream(production_change_stream, db)
+        try:
+            last_seq = get_last_seq(db)


Hmm, if we're calling get_last_seq here anyway - might make sense to simplify the function definition of consume_change_stream to accept the stream directly, and not a function that has to be called with seq to return the stream?

Done in af41409.

rohitpaulk · 2017-05-05T12:19:23Z

gratipay/utils/sentry.py

+        return self
+
+    def __exit__(self, exc_type, exc_value, traceback):
+        self.tell_sentry(exc_type, {})


Does exc_type have all the details that we need to send to sentry? Shouldn't we pass traceback and exc_value? (I'm not sure what they are, but traceback sure sounds important)

Sentry accesses Python's global exception state directly during captureException (via sys.exc_info, presumably), so we don't have to pass it through these function calls.

Interesting
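To illustrate the point about global exception state: inside `__exit__`, `sys.exc_info()` already returns the exception being handled, so a reporter can pick it up without it being passed along. The `Teller` class below is a hypothetical stand-in for illustration, not the code in gratipay/utils/sentry.py, and it swallows the exception only to keep the demo self-contained.

```python
import sys


class Teller(object):
    """Context manager that reports the active exception without receiving it."""

    def __init__(self):
        self.reported = []

    def report(self):
        # Read the interpreter's current exception state, as a reporting
        # client might do internally.
        exc_type, exc_value, _tb = sys.exc_info()
        if exc_type is not None:
            self.reported.append((exc_type.__name__, str(exc_value)))

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.report()   # note: no exception details are passed along
        return True     # swallow the exception, for this demo only


teller = Teller()
with teller:
    raise ValueError('boom')

quiet = Teller()
with quiet:
    pass  # no exception, so nothing is reported
```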

rohitpaulk · 2017-05-05T12:21:10Z

gratipay/utils/sentry.py

+import traceback
+
+from aspen import log
+from gratipay import wireup


utils importing wireup? 😛 That seems hacky. No neater way?

No neater way?

Not on this PR. 😞

I agree it's hacky. Eventually I could see rewiring Sentry along the lines of what was started in #4345 and other PRs listed under "new email subflooring" on #4427.

(This PR is already 1.5x our 400-net-lines rule of thumb.)

chadwhitacre · 2017-05-05T12:44:36Z

/me looking into failures ...

rohitpaulk · 2017-05-05T12:51:05Z

I'm good once travis is :)

chadwhitacre · 2017-05-05T12:53:06Z

Travis is good! :-D

rohitpaulk · 2017-05-05T12:54:47Z

==

chadwhitacre · 2017-05-05T12:54:51Z

Some discussion of error reporting and retry architecture in slack.

rohitpaulk · 2017-05-05T13:26:10Z

Merging and deploying...

chadwhitacre · 2017-05-05T13:28:29Z

When you're done you could try adding an instrument to http://inside.gratipay.com/appendices/health for npm sync lag. Librato is in 1Password so it would be a good test for that as well. :)

rohitpaulk · 2017-05-05T13:33:26Z

Okay, we've got an error..

rohitpaulk · 2017-05-05T13:36:23Z

Hmm, I had a sync_npm folder lying around.. wonder where that came from

rohitpaulk · 2017-05-05T14:05:13Z

I forgot to add the env var 😞 Gratipay was down for around 3 minutes, back up now

rohitpaulk · 2017-05-05T14:31:23Z

Now to figure out how to run the syncer. Add it to the heroku procfile?

rohitpaulk · 2017-05-05T14:31:38Z

I'm going to try to run as a one-off dyno first

chadwhitacre · 2017-05-05T18:03:38Z

Some deploy log in slack.

chadwhitacre · 2017-05-08T15:07:43Z

Hmm, I had a sync_npm folder lying around.. wonder where that came from

Maybe pyc files kept Git from removing it after the switch from sync_npm/ to sync_npm.py?

chadwhitacre mentioned this pull request May 3, 2017

✈️ Give to package.json #4427

Closed

15 tasks

chadwhitacre mentioned this pull request May 3, 2017

Upgrade Postgres #4440

Closed

chadwhitacre changed the title ~~Sync npm~~ Rebuild npm syncing May 3, 2017

chadwhitacre force-pushed the sync-npm branch from bc8fab8 to f7a148d Compare May 3, 2017 19:39

chadwhitacre force-pushed the sync-npm branch from eaa71f4 to 35f6e3f Compare May 3, 2017 20:41

chadwhitacre force-pushed the sync-npm branch from 35f6e3f to 0d068a7 Compare May 3, 2017 21:06

rohitpaulk reviewed May 5, 2017

chadwhitacre added 2 commits May 5, 2017 08:21

Respond to review

631dfc9

Simplify definition of consume_change_stream

af41409

Pyflakes nit

a997d38

chadwhitacre mentioned this pull request May 5, 2017

Try to install pg 9.6.2 manually #4444

Merged

rohitpaulk merged commit df3efcd into master May 5, 2017

rohitpaulk deleted the sync-npm branch May 5, 2017 13:26

rohitpaulk mentioned this pull request May 5, 2017

Add a check in deploy.sh for environment variables #4446

Closed

This was referenced May 5, 2017

Harmonize package deletion and claiming #4448

Merged

Fix otb sync_npm test failures #4321

Closed

chadwhitacre added a commit that referenced this pull request May 17, 2017

Prune old npm sync code

12c26db

Obsolete with #4438.

chadwhitacre mentioned this pull request May 17, 2017

Prune old npm sync code #4475

Merged
