Add mcclient archive command to generate archive tarball #185

Merged: 7 commits, Feb 23, 2017

Conversation

yusefnapora (Contributor)

I figured it would be easier to just build the tarball in JS using the tar-stream module rather than writing shell or Python scripts, since we already have the code to extract and request the associated objects.

This adds an mcclient archive <queryString> command that writes a gzipped tarball to stdout (or to a file given with the --output|-o flag). The tarball contains a stmt/<statementId> entry for each statement and a data/<objectId> entry for each data object. The statements are stringified JSON objects, but we could easily use protobufs instead.

Happy to tweak the archive format in the morning (multiple statements per entry is probably a good idea).
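For reference, here's a minimal sketch of the archive layout described above, using tar-stream plus Node's built-in zlib and fs. The placeholder data and the archive.tgz filename are illustrative; this is not the actual mcclient code.

// Sketch of the stmt/ + data/ tarball layout, not the actual mcclient implementation.
const tar = require('tar-stream')
const zlib = require('zlib')
const fs = require('fs')

const pack = tar.pack()
pack.pipe(zlib.createGzip()).pipe(fs.createWriteStream('archive.tgz'))

// Placeholder data standing in for the query results.
const statements = [{ id: 'QmExampleStatementId', body: '...' }]
const objects = { QmExampleObjectId: Buffer.from('object bytes') }

// One stmt/<statementId> entry per statement (stringified JSON)...
for (const stmt of statements) {
  pack.entry({ name: `stmt/${stmt.id}` }, Buffer.from(JSON.stringify(stmt)))
}

// ...and one data/<objectId> entry per data object (raw bytes).
for (const id of Object.keys(objects)) {
  pack.entry({ name: `data/${id}` }, objects[id])
}

pack.finalize()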

@parkan (Contributor) left a comment

So this is kind of what I was thinking with my "just a tarball of stuff" approach. My memory of the tar format is kinda rusty but it looks like the "file" entry headers are 500 bytes, which is a bunch but not a catastrophic amount, especially with compound statements. I am tempted to say let's use this as is and leave batching as a possible enhancement.

queryStream.on('data', obj => {
  let stmt
  try {
    stmt = Statement.fromProtobuf(obj)
Contributor

I would be inclined to write the protobufs directly -- there's definitely a human readability cost but it is, after all, a serialization format.

Contributor Author (yusefnapora)

Yeah, I think I agree
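If the archive did switch to raw protobufs, the statement entries would just be the encoded bytes instead of JSON strings. A rough sketch, reusing the identifiers from the diff above and assuming a toProtobuf() encoder exists as the counterpart of the fromProtobuf() decoder (hypothetical name):

// Sketch only: toProtobuf() is an assumed counterpart of the fromProtobuf() decoder above.
queryStream.on('data', obj => {
  const stmt = Statement.fromProtobuf(obj)
  const bytes = stmt.toProtobuf()   // encoded protobuf bytes instead of JSON.stringify(stmt)
  writeToTarball(tarball, `stmt/${stmt.id}`, bytes)
})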

if (dataResult == null || typeof dataResult !== 'object' || dataResult.data == null) return

const bytes = Buffer.from(dataResult.data, 'base64')
writeToTarball(tarball, `data/${key}`, bytes)
Contributor

Though maybe if we're batching anyway, we might as well write the batch as a single "file"?

Contributor

I think it's preferable to have objects in their own individual files, as that makes it easy to quickly check for (or seek to) an object in an archive.

@vyzo (Contributor) commented Feb 22, 2017

I think it might be a mistake to have each statement in its own file. First, it costs space (those 500 bytes are 2x overhead), and it also makes the load process slow, since statements need to be read from the file one by one, which doesn't take advantage of the streaming nature of the import API.

I think the better approach is to batch statements together as JSON, in one (or more) ndjson files under stmt/.
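A sketch of that batching, reusing the writeToTarball helper from the diff; the batch size and filename scheme here are illustrative:

// Sketch: accumulate stringified statements and flush them as one ndjson entry per batch.
const STMT_BATCH_SIZE = 1000   // illustrative
let batch = []
let batchNumber = 0

function addStatement (stmt) {
  batch.push(JSON.stringify(stmt))
  if (batch.length >= STMT_BATCH_SIZE) flushBatch()
}

function flushBatch () {
  if (batch.length === 0) return
  const content = Buffer.from(batch.join('\n'), 'utf-8')
  writeToTarball(tarball, `stmt/${batchNumber}.ndjson`, content)
  batchNumber += 1
  batch = []
}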

writeToTarball(tarball, name, content)

for (const id of stmt.objectIds) {
  objectIds.push(id)
@vyzo (Contributor), Feb 22, 2017

we need to deduplicate objects (and dependencies!) for the entire archive.

Contributor Author (yusefnapora)

Yeah, true. Using a Set instead of an Array for the refs should do the trick, although if we're willing to accept some small amount of duplication, we could request the objects in batches while the statement stream is still coming in, instead of waiting until the end to fully deduplicate them. We'd probably end up with duplicate schema objects, though.
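A sketch of the Set-based deduplication, collecting refs across the whole statement stream before fetching objects (identifiers reused from the diff; not the PR's final code):

// Sketch: a Set drops duplicate object ids across statements automatically.
const objectIds = new Set()

queryStream.on('data', obj => {
  const stmt = Statement.fromProtobuf(obj)
  for (const id of stmt.objectIds) {
    objectIds.add(id)
  }
})

queryStream.on('end', () => {
  // fetch and write the deduplicated objects once the statement stream is done
  writeDataObjectsToTarball(client, tarball, objectIds)
})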

@yusefnapora (Contributor Author)

Also, again if we're willing to increase the archive size somewhat, the dead simplest thing to ingest at load time would be to use collections of ndjson batches for both statements and objects, where the object ndjson is of the form {data: "base64-encoded-object"}. Then we could just feed the ndjson directly into the concat API with curl or mcclient or whatever.

With gzipping, the extra overhead from the JSON might not be too bad, although we'd take a hit from base64, of course.
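For illustration, the object ndjson being described would just be one JSON object per line with the raw bytes base64-encoded, e.g.:

// Sketch: one ndjson line per object, raw bytes base64-encoded.
const objects = [Buffer.from('hello world'), Buffer.from('some other object')]
const ndjson = objects
  .map(bytes => JSON.stringify({ data: bytes.toString('base64') }))
  .join('\n')
// {"data":"aGVsbG8gd29ybGQ="}
// {"data":"c29tZSBvdGhlciBvYmplY3Q="}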

@parkan (Contributor) commented Feb 22, 2017

@vyzo:

> Firstly it costs space (those 500 bytes are 2x overhead) and it also makes the load process slow, as statements need to be read from file one by one (and doesn't take advantage of the streaming nature of the import api)

OK, that's fair, though I wouldn't call this "its own file" -- it's one big file with large-ish record delimiters, not separate files. The advantage is potentially higher recoverability/seeking due to the presence of headers, but since we're treating this as all-or-nothing I can live with big ol' ndjson.

@yusefnapora:

> Also, again if we're willing to increase the archive size somewhat, the dead simplest thing to ingest at load time would be to use collections of ndjson batches for both statements and objects, where the object ndjson is of the form {data: "base64-encoded-object"}

Might as well just store a base64 object per line then? We don't really need the extra markup.

@@ -15,6 +16,10 @@ const TAR_ENTRY_OPTS = {
  gname: 'staff'
}

function leftpad (str, length, char = '0') {
Contributor

LOL

Contributor Author (yusefnapora)

😛
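For context, a plausible body for the helper above; the PR's actual implementation may differ, and on newer Node versions String.prototype.padStart does the same job:

// Plausible implementation of the leftpad helper shown above (not necessarily the PR's exact code).
function leftpad (str, length, char = '0') {
  while (str.length < length) {
    str = char + str
  }
  return str
}

// leftpad('42', 8) === '00000042'; with ES2017+, '42'.padStart(8, '0') is equivalent.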

function writeStatementBatch (force: boolean = false) {
  if (force || stmtBatch.length >= STMT_BATCH_SIZE) {
    const content = Buffer.from(stmtBatch.join('\n'), 'utf-8')
    const filename = `stmt/${leftpad(stmtBatchNumber.toString(), 8)}.ndjson`
Contributor

just .json should work too for the extension? it's common for .json files to actually contain ndjson.

Contributor

we use .ndjson elsewhere, it's descriptive and reasonably commonly used

}

function writeDataObjectsToTarball (client: RestClient, tarball: Object, objectIds: Set<string>): Promise<*> {
  if (objectIds.size < 1) return Promise.resolve()
@vyzo (Contributor), Feb 22, 2017

That's so JavaScript :)
In a compiled language you would write the comparison as == 0, since it compiles directly to test and jz/jnz ops.

@parkan (Contributor), Feb 22, 2017

and then you get runaway loops in concurrency scenarios with multiple workers 😉 (or with bad math)

Contributor Author (yusefnapora)

I always try to be as paranoid as possible with js, and assume that some demented dependency could have monkey patched .size to return a negative number 😛

Contributor

right, javascript, the land of crazy!

.then(stream => new Promise((resolve, reject) => {
  stream.on('data', dataResult => {
    const key = objectIds.shift()
    if (dataResult == null || typeof dataResult !== 'object' || dataResult.data == null) return
Contributor

So if an object is missing from the datastore (or there's some other error condition), we just silently stop adding objects and still produce an archive?
I think we should at least notify the user that there was an error, and possibly keep going and add the remaining objects.

Contributor Author (yusefnapora)

yeah, that's a good point... perhaps we could be strict by default and abort on errors, with a flag to just warn about them?

Contributor

yeah, let's error out by default and allow the user to specify a lax mode that keeps going.

Contributor

with warnings :)

Contributor

+1 for erroring out
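A sketch of the strict-by-default behaviour agreed on here, with a lax mode that warns and keeps going. It composes the handler fragments shown in the diffs; the allowErrors flag and printlnErr helper appear in the later hunks, while the reject wiring is illustrative:

// Sketch: strict mode rejects on a missing/errored object; lax mode warns and continues.
stream.on('data', dataResult => {
  const key = objectIds.shift()
  if (dataResult == null || typeof dataResult !== 'object' || dataResult.data == null) {
    const msg = (dataResult && dataResult.error) ? dataResult.error : 'Unknown error'
    if (allowErrors) {
      printlnErr(`Error fetching object for ${key}: ${msg}`)   // warn and keep going
      return
    }
    return reject(new Error(`Error fetching object for ${key}: ${msg}`))  // strict: abort the archive
  }
  const bytes = Buffer.from(dataResult.data, 'base64')
  writeToTarball(tarball, `data/${key}`, bytes)
})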

@@ -83,6 +83,12 @@ module.exports = {
    envelope: [ SIMPLE_STMT_1.publisher ],
    envelopeEmpty: [ ENVELOPE_STMT.publisher ]
  },
  expectedDeps: {
    simple: [ new Set(['dep1', 'dep2']), new Set(['dep1', 'dep3']) ],
    compound: [ new Set() ],
Contributor

we should test that it retrieves dependencies correctly in compound, rather than envelopes; we don't have any of the latter yet.

Contributor Author (yusefnapora)

yeah, good call. I'll update the test

const msg = (dataResult && dataResult.error) ? dataResult.error : 'Unknown error'
if (allowErrors) {
  printlnErr(`Error fetching object for ${key}: ${msg}`)
  return
Contributor

So the promise stays unresolved here; is this the desired behaviour?

Contributor Author (yusefnapora)

It will still resolve when the stream ends (in the stream.on('end', ...) handler below). That will probably happen immediately after the error, since concat will close the stream after the error.
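For clarity, a sketch of the promise wrapper being described: 'data' handles each object, 'error' rejects, and 'end' resolves even if earlier chunks were skipped (shape only, not the exact PR code):

// Sketch: the enclosing promise resolves on 'end' and rejects on 'error'.
function writeObjects (stream) {
  return new Promise((resolve, reject) => {
    stream.on('data', dataResult => {
      // ...handle each object as above; an early `return` only skips this chunk
    })
    stream.on('error', reject)
    stream.on('end', resolve)   // concat closes the stream after an error, so this fires soon after
  })
}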
