Add `mcclient archive` command to generate archive tarball #185
Conversation
So this is kind of what I was thinking with my "just a tarball of stuff" approach. My memory of the tar format is kinda rusty, but it looks like the "file" entry headers are 512 bytes, which is a bunch but not a catastrophic amount, especially with compound statements. I am tempted to say let's use this as is and leave batching as a possible enhancement.
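For reference, a minimal sketch of how entries get written with the tar-stream module the PR uses (the `writeToTarball` helper presumably wraps something like this; each entry adds one 512-byte header block to the stream):

```js
// Sketch only: tar-stream's pack API, gzipped to stdout.
const tar = require('tar-stream')
const zlib = require('zlib')

const pack = tar.pack()
pack.pipe(zlib.createGzip()).pipe(process.stdout)

// every entry costs a 512-byte tar header in addition to its content
pack.entry({ name: 'stmt/<statementId>' }, Buffer.from('{"id": "..."}'))
pack.finalize()
```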
```js
queryStream.on('data', obj => {
  let stmt
  try {
    stmt = Statement.fromProtobuf(obj)
```
I would be inclined to write the protobufs directly -- there's definitely a human readability cost but it is, after all, a serialization format.
Yeah, I think I agree
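A hedged sketch of what writing the protobufs directly could look like; `Statement.toProtobuf` is hypothetical here, and the real client may expose the encoded bytes some other way:

```js
queryStream.on('data', obj => {
  const stmt = Statement.fromProtobuf(obj)
  // hypothetical encoder -- the client might instead hand us the raw
  // wire bytes, avoiding a decode/re-encode round trip entirely
  const bytes = Statement.toProtobuf(stmt)
  writeToTarball(tarball, `stmt/${stmt.id}`, bytes)
})
```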
src/client/cli/commands/archive.js (outdated)

```js
if (dataResult == null || typeof dataResult !== 'object' || dataResult.data == null) return

const bytes = Buffer.from(dataResult.data, 'base64')
writeToTarball(tarball, `data/${key}`, bytes)
```
though maybe if we're batching anyway might as well write the batch as a single "file"?
I think it's preferable to have objects in their own individual files, as it facilitates quickly checking (or seeking) an object for existence in an archive.

I think it might be a mistake to have each statement in its own file. I think the better approach is to batch statements together in json, in one (or more) ndjson files.
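A rough sketch of that batching approach (the constants, counter, and flush helper are illustrative; the patch's `writeStatementBatch` excerpt further down follows this shape):

```js
// Sketch: accumulate stringified statements and flush every N of them
// as a single ndjson entry in the tarball.
const STMT_BATCH_SIZE = 1000 // illustrative batch size
let stmtBatch = []
let stmtBatchNumber = 0

function addStatement (stmt) {
  stmtBatch.push(JSON.stringify(stmt))
  if (stmtBatch.length >= STMT_BATCH_SIZE) flushBatch()
}

function flushBatch () {
  if (stmtBatch.length === 0) return
  const content = Buffer.from(stmtBatch.join('\n'), 'utf-8')
  writeToTarball(tarball, `stmt/${stmtBatchNumber}.ndjson`, content)
  stmtBatchNumber += 1
  stmtBatch = []
}
```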
src/client/cli/commands/archive.js (outdated)

```js
writeToTarball(tarball, name, content)

for (const id of stmt.objectIds) {
  objectIds.push(id)
```
we need to deduplicate objects (and dependencies!) for the entire archive.
Yeah, true. Using a Set instead of an Array for the refs should do the trick, although if we're willing to accept some small amount of duplication we could request the objects in batches while the statement stream is still coming in instead of waiting until the end to fully deduplicate them. We'd probably end up with duplicate schema objects though.
Also, again if we're willing to increase the archive size somewhat, the dead simplest thing to ingest at load time would be to use collections of ndjson batches for both statements and objects, where each line of the object ndjson pairs the object key with its base64-encoded data. With gzipping, the extra overhead from the json might not be too bad, although we'd take a hit from base64, of course.
OK, that's fair, though I wouldn't call this "its own file" -- it's a big file with large-ish record delimiters, not separate files. The advantage is potentially higher recoverability/seeking due to presence of headers but since we're treating this as all-or-nothing I can live with big ol' ndjson
Might as well just store a base64 object per line then? Don't really need the extra markup
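A minimal sketch of the Set-based deduplication suggested above (the dependency accessor is illustrative; the real `Statement` type may name it differently):

```js
// Collect object ids and dependencies into one Set for the whole
// archive, so each object is fetched and written at most once.
const objectIds: Set<string> = new Set()

queryStream.on('data', obj => {
  const stmt = Statement.fromProtobuf(obj)
  for (const id of stmt.objectIds) objectIds.add(id)
  for (const dep of stmt.dependencies || []) objectIds.add(dep) // illustrative accessor
})
```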
src/client/cli/commands/archive.js (outdated)

```js
@@ -15,6 +16,10 @@ const TAR_ENTRY_OPTS = {
  gname: 'staff'
}

function leftpad (str, length, char = '0') {
```
LOL
😛
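For reference, a minimal body matching that signature (a sketch, not necessarily the patch's exact code):

```js
// Pad str on the left with char until it reaches the given length.
function leftpad (str, length, char = '0') {
  while (str.length < length) {
    str = char + str
  }
  return str
}
```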
```js
function writeStatementBatch (force: boolean = false) {
  if (force || stmtBatch.length >= STMT_BATCH_SIZE) {
    const content = Buffer.from(stmtBatch.join('\n'), 'utf-8')
    const filename = `stmt/${leftpad(stmtBatchNumber.toString(), 8)}.ndjson`
```
just `.json` should work too for the extension? it's common for `.json` files to actually contain ndjson.
we use `.ndjson` elsewhere; it's descriptive and reasonably commonly used
src/client/cli/commands/archive.js (outdated)

```js
}

function writeDataObjectsToTarball (client: RestClient, tarball: Object, objectIds: Set<string>): Promise<*> {
  if (objectIds.size < 1) return Promise.resolve()
```
that's so javascript :) in a compiled language you would write the comparison as `== 0`, as it compiles directly to `test` and `jz`/`jnz` ops
and then you get runaway loops in concurrency scenarios with multiple workers 😉 (or with bad math)
I always try to be as paranoid as possible with js, and assume that some demented dependency could have monkey-patched `.size` to return a negative number 😛
right, javascript, the land of crazy!
src/client/cli/commands/archive.js (outdated)

```js
.then(stream => new Promise((resolve, reject) => {
  stream.on('data', dataResult => {
    const key = objectIds.shift()
    if (dataResult == null || typeof dataResult !== 'object' || dataResult.data == null) return
```
so if an object is missing from the datastore (or some other error condition occurs), we just silently stop adding objects and still produce an archive?
I think we should at least notify the user that there was an error, and possibly keep going and add the remaining objects.
yeah, that's a good point... perhaps we could be strict by default and abort on errors, with a flag to just warn about them?
yeah, let's error out by default and allow the user to specify a lax mode that keeps going.
with warnings :)
+1 for erroring out
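A sketch of the agreed behaviour, strict by default with an opt-in lax mode (the flag wiring is illustrative; `allowErrors`, `printlnErr`, `key`, and `reject` all appear in the surrounding diffs):

```js
// Strict by default: reject the archive promise on a bad result.
// With a lax-mode flag (e.g. --allow-errors), warn and keep going.
if (dataResult == null || typeof dataResult !== 'object' || dataResult.data == null) {
  const msg = (dataResult && dataResult.error) ? dataResult.error : 'Unknown error'
  if (allowErrors) {
    printlnErr(`Error fetching object for ${key}: ${msg}`) // warn, skip this object
    return
  }
  return reject(new Error(`Error fetching object for ${key}: ${msg}`))
}
```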
```js
@@ -83,6 +83,12 @@ module.exports = {
  envelope: [ SIMPLE_STMT_1.publisher ],
  envelopeEmpty: [ ENVELOPE_STMT.publisher ]
},
expectedDeps: {
  simple: [ new Set(['dep1', 'dep2']), new Set(['dep1', 'dep3']) ],
  compound: [ new Set() ],
```
we should test that it retrieves dependencies correctly in `compound`, rather than `envelopes`; we don't have any of the latter yet.
yeah, good call. I'll update the test
src/client/cli/commands/archive.js (outdated)

```js
const msg = (dataResult && dataResult.error) ? dataResult.error : 'Unknown error'
if (allowErrors) {
  printlnErr(`Error fetching object for ${key}: ${msg}`)
  return
```
so the promise stays unresolved here; is this desired behaviour?
It will still resolve when the stream ends (in the `stream.on('end', ...)` handler below). That will probably be immediately after the error, since concat will close the stream after the error.
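The surrounding pattern, sketched for clarity: the promise settles on stream end or error, so an early `return` in the data handler only skips the current object without leaving the promise hanging:

```js
// resolve on 'end', reject on 'error'; a bare return in the data
// handler just moves on to the next object
new Promise((resolve, reject) => {
  stream.on('data', dataResult => { /* write the object, or skip it */ })
  stream.on('error', reject)
  stream.on('end', () => resolve())
})
```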
I figured that it would be easier to just make the tarball in JS using the tar-stream module vs writing some shell or python scripts, since we already have the code to extract and request the associated objects.

This adds an `mcclient archive <queryString>` command that will write a gzipped tarball to stdout (or you can give a `--output|-o` flag). In the tarball will be a `stmt/<statementId>` entry for each statement, and a `data/<objectId>` entry for each data object. The statements are stringified JSON objects, but we could easily do protobufs instead.

Happy to tweak the archive format in the morning (multiple statements per entry is probably a good idea).
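On the ingest side, a hedged sketch of reading such an archive back with tar-stream's extract API (gunzip, untar, parse each `stmt/` entry line by line; `ingestStatement` and the filename are illustrative):

```js
const fs = require('fs')
const zlib = require('zlib')
const tar = require('tar-stream')

const extract = tar.extract()
extract.on('entry', (header, stream, next) => {
  const chunks = []
  stream.on('data', c => chunks.push(c))
  stream.on('end', () => {
    if (header.name.startsWith('stmt/')) {
      // each line of a stmt/ entry is one JSON statement
      for (const line of Buffer.concat(chunks).toString('utf-8').split('\n')) {
        if (line.length > 0) ingestStatement(JSON.parse(line)) // illustrative helper
      }
    }
    next()
  })
})

fs.createReadStream('archive.tar.gz').pipe(zlib.createGunzip()).pipe(extract)
```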