jupyter upload file truncation #80
In notebook upload, the file is cut into chunks and each chunk is uploaded separately:

```python
@web.authenticated
@gen.coroutine
def put(self, path=''):
    """Saves the file in the location specified by name and path.

    PUT is very similar to POST, but the requester specifies the name,
    whereas with POST, the server picks the name.

    PUT /api/contents/path/Name.ipynb
      Save notebook at ``path/Name.ipynb``. Notebook structure is specified
      in `content` key of JSON request body. If content is not specified,
      create a new empty notebook.
    """
    model = self.get_json_body()
    if model:
        if model.get('copy_from'):
            raise web.HTTPError(400, "Cannot copy with PUT, only POST")
        exists = yield maybe_future(self.contents_manager.file_exists(path))
        if exists:
            yield maybe_future(self._save(model, path))
        else:
            yield maybe_future(self._upload(model, path))
    else:
        yield maybe_future(self._new_untitled(path))
```

However, in s3contents, the file is opened with `'wb'` each time:

```python
def write(self, path, content, format):
    path_ = self.path(self.unprefix(path))
    self.log.debug("S3contents.S3FS: Writing file: `%s`", path_)
    if format not in {'text', 'base64'}:
        raise HTTPError(
            400,
            "Must specify format of file contents as 'text' or 'base64'",
        )
    try:
        if format == 'text':
            content_ = content.encode('utf8')
        else:
            b64_bytes = content.encode('ascii')
            content_ = base64.b64decode(b64_bytes)
    except Exception as e:
        raise HTTPError(
            400, u'Encoding error saving %s: %s' % (path_, e)
        )
    # mode='wb' truncates on every call, so each chunk overwrites the previous one
    with self.fs.open(path_, mode='wb') as f:
        f.write(content_)
```

Trying to fix this.
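To see why this truncates uploads, here is a minimal standalone sketch (illustrative only; the file paths are arbitrary) contrasting a single open with the reopen-per-chunk pattern:

```python
# Minimal repro sketch: reopening with mode='wb' for every chunk truncates
# the file each time, so only the final chunk survives on disk.
import os

chunks = [b"a" * 1024, b"b" * 1024, b"c" * 512]

# Correct pattern: open once, write every chunk.
with open("/tmp/demo_ok.bin", "wb") as f:
    for chunk in chunks:
        f.write(chunk)

# Buggy pattern (what happens here): one fresh 'wb' open per chunk.
for chunk in chunks:
    with open("/tmp/demo_truncated.bin", "wb") as f:
        f.write(chunk)

assert os.path.getsize("/tmp/demo_ok.bin") == 2560
assert os.path.getsize("/tmp/demo_truncated.bin") == 512  # only the last chunk
```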
We've encountered the same issue. This could be a good reference for a fix:
Is there a fix in progress for this issue? Or a potential workaround? Thanks!
Looking at the LargeFileManager, the file is provided in chunks when `model['chunk']` is present; the last segment has a chunk value of -1. One potential solution could be to start a transaction in s3fs when chunk == 1 and end the transaction when chunk == -1. Another potential way could be to save the file locally using the LargeFileManager and then upload it to S3. Here is a crude way to test that idea out (a sketch only, assuming s3fs's `put(local_path, remote_path)` method and an example bucket name):
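```python
# Crude sketch: let LargeFileManager assemble the chunks on local disk,
# then push the finished file to S3 with s3fs. Bucket name is an example.
import s3fs
from notebook.services.contents.largefilemanager import LargeFileManager


class LocalThenS3Manager(LargeFileManager):
    """Save chunked uploads to local disk first, then copy the result to S3."""

    def save(self, model, path=""):
        result = super().save(model, path)  # LargeFileManager handles the chunks
        chunk = model.get("chunk")
        if chunk is None or chunk == -1:  # plain save, or the final chunk
            local_path = self._get_os_path(path)
            fs = s3fs.S3FileSystem()
            fs.put(local_path, "example-bucket/" + path.lstrip("/"))
        return result
```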
There is also a `put` method in s3fs that would upload the local file to S3, as in the sketch above. I am sure that a clean fix would require changes to GenericFileManager and S3FS in the package.
I have a working "in-memory" solution using `contextvars`. The drawbacks: (1) `contextvars` requires Python 3.7 or newer, and (2) there are some concerns about how the accumulated chunks are cleaned up and how failures are handled. (My first attempt used a different approach before I settled on `contextvars`.)
Hi @pvanliefland, I'd happily review a PR from you that fixes this problem. Regarding your two points: (1) making this compatible with py37 or newer does not pose a problem for me, and (2) can you elaborate a little? Is the concern that the upload would fail only after the user has waited a potentially long time for a file to upload?
Hey @ericdill, for 2), to give you a better idea, here is a simplified version of my code. If it makes sense to integrate it here, I'll work on a proper PR. Let me know!

```python
import base64
import contextvars

from notebook.services.contents.manager import ContentsManager
from traitlets import HasTraits

# Used as an in-memory "registry" for uploads.
# TODO: periodic cleanup - but when? As is, could cause memory issues
content_chunks = contextvars.ContextVar("jupyterlab_content_chunks", default={})


def store_content_chunk(path: str, content: str):
    """Store a base64 chunk in the registry as bytes"""
    current_value = content_chunks.get()
    if path not in current_value:
        current_value[path] = []
    current_value[path].append(base64.b64decode(content.encode("ascii"), validate=True))


def assemble_chunks(path: str) -> str:
    """Assemble the chunk bytes into a single base64 string"""
    current_value = content_chunks.get()
    if path not in current_value:
        raise ValueError(f"No chunk for path {path}")
    return base64.b64encode(b"".join(current_value[path])).decode("ascii")


def delete_chunks(path):
    """Should be called once the upload is complete to free the memory"""
    current_value = content_chunks.get()
    del current_value[path]


class GenericContentsManager(ContentsManager, HasTraits):
    def save(self, model, path=""):
        """Code inspired from notebook.services.contents.largefilemanager"""
        chunk = model.get("chunk", None)  # this is how the client sends it
        if chunk is not None:
            # some checks / try&except / logs skipped for readability
            if chunk == 1:
                self.run_pre_save_hook(model=model, path=path)
            # Store the chunk in our "in-memory" registry
            store_content_chunk(path, model["content"])
            if chunk == -1:
                # Last chunk: combine the chunks in the registry to compose
                # the full file content
                model["content"] = assemble_chunks(path)
                delete_chunks(path)
                self._save_file(model, path)
            return self.get(path, content=False)
        else:
            return super().save(model, path)
```
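For reference, this relies on the chunk numbering sent by the Jupyter frontend: successive PUT requests to `/api/contents/<path>` carry `chunk` = 1, 2, …, and the final request uses `chunk` = -1. A hypothetical client-side driver (the server URL, token handling, and chunk size here are placeholders) that exercises the protocol:

```python
# Hypothetical driver for the chunked upload protocol: chunk numbers count
# up from 1 and the final request is marked with -1.
import base64
import requests


def upload_in_chunks(base_url, token, path, data, chunk_size=1024 * 1024):
    """Upload `data` (bytes) to /api/contents/<path> in base64 chunks."""
    url = f"{base_url}/api/contents/{path}"
    headers = {"Authorization": f"token {token}"}
    pieces = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    for i, piece in enumerate(pieces):
        number = -1 if i == len(pieces) - 1 else i + 1  # last chunk is -1
        body = {
            "type": "file",
            "format": "base64",
            "chunk": number,
            "content": base64.b64encode(piece).decode("ascii"),
        }
        requests.put(url, json=body, headers=headers).raise_for_status()
```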
Hey @pvanliefland, this looks good to me. Certainly better than the current situation where large file uploads just totally fail. I think it makes sense to discuss the remaining concerns in the PR.
@ericdill ok, I'll start working on a PR. For cleanup of the in-memory `content_chunks` registry, and regarding the standard `LargeFileManager` behavior, at this point I'm thinking of two possible approaches. I'll probably try both approaches.
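One possible shape for that cleanup, as a sketch (it assumes `store_content_chunk` is extended so each registry entry also records when its last chunk arrived):

```python
# Sketch of stale-chunk pruning (assumes each registry entry is a
# (chunks, last_seen) pair rather than a bare chunk list).
import time

MAX_CHUNK_AGE_SECONDS = 3600  # assumed threshold for an abandoned upload


def prune_stale_chunks():
    """Drop registry entries whose most recent chunk is too old."""
    current_value = content_chunks.get()
    now = time.time()
    stale_paths = [
        path
        for path, (chunks, last_seen) in current_value.items()
        if now - last_seen > MAX_CHUNK_AGE_SECONDS
    ]
    for path in stale_paths:  # collect first, then delete: don't mutate while iterating
        del current_value[path]
```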
Sounds great, thanks for tackling this!
Hi @pvanliefland, how are things going? Want to open a PR with these changes so far? I'd be happy to help test against my jupyterhub instance.
Hey @ericdill, thanks for the ping - I'll schedule some time for this tomorrow.
Added chunk utils module and adapted save() in generic manager
Disable gracefully for Python < 3.7
Disable gracefully for Python < 3.7
Reorder imports
Reorder imports
Reorder imports
Added tests
Removed compat with notebook 4.*
Adapted test for Python <= 3.7
Fixed tests
Fixed tests
Fixed tests
Added basic pruning of stale chunked uploads
Fixed tests
Fixed tests
Prune only if chunked support
Fixed iteration on stale paths
Fixed 3.6 BC
Prioritize existing exceptions before "python >= 3.7" error
Fixed tests
Fixed tests
Fixed tests
Fixed tests
Changed notebook version in requirements-package.txt
Dropped compat for Python 3.6
Adapted README to suggest using LargeFileManager
Closed by #99
When uploading a file via the Jupyter web UI, the file seems to be truncated.
The original file size is 29 MB. However, although the upload finishes with no error, the file we get is only 5.3 MB.