
jupyter upload file truncation #80

Closed
Seanspt opened this issue Oct 15, 2019 · 13 comments

Seanspt commented Oct 15, 2019

When uploading a file via the Jupyter web UI, the file seems to be truncated.
The original file is 29 MB, but even though the upload finishes with no error, the file we end up with is only 5.3 MB.

Seanspt commented Oct 16, 2019

In the notebook upload handler, a large file is split into chunks and each chunk is uploaded in a separate request.

    @web.authenticated
    @gen.coroutine
    def put(self, path=''):
        """Saves the file in the location specified by name and path.

        PUT is very similar to POST, but the requester specifies the name,
        whereas with POST, the server picks the name.

        PUT /api/contents/path/Name.ipynb
          Save notebook at ``path/Name.ipynb``. Notebook structure is specified
          in `content` key of JSON request body. If content is not specified,
          create a new empty notebook.
        """
        model = self.get_json_body()
        if model:
            if model.get('copy_from'):
                raise web.HTTPError(400, "Cannot copy with PUT, only POST")
            exists = yield maybe_future(self.contents_manager.file_exists(path))
            if exists:
                yield maybe_future(self._save(model, path))
            else:
                yield maybe_future(self._upload(model, path))
        else:
            yield maybe_future(self._new_untitled(path))

However, in s3contents the file is opened with 'wb' on every write, so each chunk overwrites the previous one and only the last chunk is kept.

    def write(self, path, content, format):
        path_ = self.path(self.unprefix(path))
        self.log.debug("S3contents.S3FS: Writing file: `%s`", path_)
        if format not in {'text', 'base64'}:
            raise HTTPError(
                400,
                "Must specify format of file contents as 'text' or 'base64'",
            )
        try:
            if format == 'text':
                content_ = content.encode('utf8')
            else:
                b64_bytes = content.encode('ascii')
                content_ = base64.b64decode(b64_bytes)
        except Exception as e:
            raise HTTPError(
                400, u'Encoding error saving %s: %s' % (path_, e)
            )
        with self.fs.open(path_, mode='wb') as f:
            f.write(content_)

Trying to fix this.

@zac-yang
Contributor

We've encountered the same issue. This could be a good reference for a fix:
https://github.com/jupyter/notebook/blob/master/notebook/services/contents/largefilemanager.py
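
For context, the linked LargeFileManager avoids truncation roughly like this (a paraphrased sketch, not the verbatim upstream code): the first chunk creates the file and every later chunk is appended, so earlier data is never overwritten.

import base64

# Paraphrased sketch of the append-based approach used by LargeFileManager
# (illustrative only): chunk 1 creates/truncates the file, later chunks
# (including the final chunk == -1) are appended.
def save_chunk(os_path, b64_content, chunk):
    data = base64.b64decode(b64_content.encode("ascii"))
    mode = "wb" if chunk == 1 else "ab"  # append instead of overwrite after chunk 1
    with open(os_path, mode) as f:
        f.write(data)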

@rhlarora84

Is there a fix in progress for this issue, or a potential workaround? Thanks

rhlarora84 commented Jun 10, 2020

Looking at the LargeFileManager, the file content arrives in chunks when model['chunk'] is present; the last chunk is sent with a value of -1.

One potential solution could be to start a transaction in s3fs when the chunk is 1 and end the transaction when the chunk is -1.

Another approach could be to save the file locally using the LargeFileManager and then upload it to S3. Here is a crude way to test it out:

from notebook.services.contents.largefilemanager import LargeFileManager


def _save_file(self, model, path):
    file_contents = model["content"]
    file_format = model.get("format")
    chunk = model.get("chunk", None)
    large_file_manager = LargeFileManager()
    if chunk is not None:
        # Let LargeFileManager accumulate the chunks in a local file
        large_file_manager.save(model, path)
        if chunk == -1:
            # Last chunk: read the assembled file back and push it to S3
            updated_model = large_file_manager.get(path)
            self.fs.write(path, updated_model["content"], updated_model.get("format"))
            large_file_manager.delete_file(path)
    else:
        self.fs.write(path, file_contents, file_format)


# Monkey-patch the manager so this version is used instead of the original
GenericContentsManager._save_file = _save_file

There is also a put method in s3fs that would upload the local file to S3. I am sure that a clean fix would require changes to GenericContentsManager and S3FS in the package.

def _save_file(self, model, path):
    file_contents = model["content"]
    file_format = model.get("format")
    chunk = model.get("chunk", None)
    large_file_manager = LargeFileManager()
    if chunk is not None:
        large_file_manager.save(model, path)
        if chunk == -1:
            # Last chunk: upload the locally assembled file to S3 via s3fs's put()
            os_path = large_file_manager._get_os_path(path)
            destination_path = self.fs.path(self.fs.unprefix(path))
            self.fs.fs.put(os_path, destination_path)
            large_file_manager.delete_file(path)
    else:
        self.fs.write(path, file_contents, file_format)

@pvanliefland
Contributor

I have a working "in-memory" solution using ContextVar. @danielfrg @ericdill would you be interested in a PR?

The drawbacks:

  • The contextvars module is only available from Python 3.7 onwards
  • It's in-memory (but we cannot be sure that the user has access to a filesystem)

My first attempt was to use merge in s3fs, but it does not work: S3 does not accept multipart uploads with parts smaller than 5 MB (and JupyterLab is configured to send 1 MB chunks).

@ericdill
Collaborator

Hi @pvanliefland, I'd happily review a PR from you that fixes this problem. Regarding your two points, (1) Making this compatible with py37 or newer does not pose a problem for me and (2) can you elaborate a little? Is the concern that the upload would fail only after the user waited for a potentially long time to upload a file?

@pvanliefland
Contributor

Hey @ericdill, for (2), to give you a better idea, here is a simplified version of my code. If it makes sense to integrate it here, I'll work on a proper PR.

Let me know!

import base64
import contextvars

from notebook.services.contents.manager import ContentsManager
from traitlets import HasTraits

# Used as an in-memory "registry" for uploads.
# TODO: periodic cleanup - but when? As is, could cause memory issues
content_chunks = contextvars.ContextVar("jupyterlab_content_chunks", default={})


def store_content_chunk(path: str, content: str):
    """Store a base64 chunk in the registry as bytes"""

    current_value = content_chunks.get()

    if path not in current_value:
        current_value[path] = []

    current_value[path].append(base64.b64decode(content.encode("ascii"), validate=True))


def assemble_chunks(path: str) -> str:
    """Assemble the chunk bytes into a single base64 string"""

    current_value = content_chunks.get()

    if path not in current_value:
        raise ValueError(f"No chunk for path {path}")

    return base64.b64encode(b"".join(current_value[path])).decode("ascii")


def delete_chunks(path):
    """Should be called once the upload is complete to free the memory"""

    current_value = content_chunks.get()
    del current_value[path]


class GenericContentsManager(ContentsManager, HasTraits):
    def save(self, model, path=""):
        """Code inspired from notebook.services.contents.largefilemanager"""

        chunk = model.get("chunk", None)  # this is how the client sends it
        if chunk is not None:
            # some checks / try&except / logs skipped for readability

            if chunk == 1:
                self.run_pre_save_hook(model=model, path=path)
            # Store the chunk in our "in-memory" registry
            store_content_chunk(path, model["content"])

            if chunk == -1:
                # Last chunk: we want to combine the chunks in the registry to compose the full file content
                model["content"] = assemble_chunks(path)
                delete_chunks(path)
                self._save_file(model, path)

            return self.get(path, content=False)
        else:
            return super().save(model, path)
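
For reference, here is a quick self-contained illustration of the helper functions above (not part of the original comment, purely to show the intended flow):

import base64

# Simulate three chunks arriving from the client for the same path,
# then assemble the full content and clean up the registry.
for part in (b"hello ", b"chunked ", b"world"):
    store_content_chunk("demo/big.bin", base64.b64encode(part).decode("ascii"))

full_b64 = assemble_chunks("demo/big.bin")
delete_chunks("demo/big.bin")
assert base64.b64decode(full_b64) == b"hello chunked world"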

@ericdill
Collaborator

Hey @pvanliefland, this looks good to me. Certainly better than the current situation where large file uploads just totally fail. I think it makes sense to discuss the remaining concerns in the PR.

  1. "periodic cleanup - but when ?": You raise a good point about memory leakage. Do any of the other content managers solve this problem?

@pvanliefland
Contributor

@ericdill ok, I'll start working on a PR.

For cleanup of the ContextVar, I couldn't really find a good example elsewhere: I think that none of the custom ContentsManager implementations handle chunked uploads properly.

As for the standard LargeFileManager, it doesn't run into the same potential issue: it creates a new file for chunk 1, and simply opens this file in append mode whenever a new chunk is posted.

At this point I'm considering a couple of approaches; I'll probably try both.
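
Purely as an illustration of what cleaning up the in-memory registry could look like (the names and timeout below are hypothetical, not what was eventually merged):

import time

# Hypothetical sketch: track a last-touched timestamp per path alongside the
# content_chunks registry, and prune entries that have gone stale.
chunk_timestamps = {}       # path -> time.time() of the most recent chunk
STALE_AFTER_SECONDS = 3600  # arbitrary threshold for this sketch

def touch_chunks(path):
    chunk_timestamps[path] = time.time()

def prune_stale_chunks():
    now = time.time()
    for path in list(chunk_timestamps):  # copy keys so we can delete while iterating
        if now - chunk_timestamps[path] > STALE_AFTER_SECONDS:
            delete_chunks(path)          # helper from the snippet above
            del chunk_timestamps[path]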

@ericdill
Collaborator

Sounds great, thanks for tackling this!

@ericdill
Collaborator

Hi @pvanliefland, how are things going? Want to open a PR with these changes so far? I'd be happy to help test against my JupyterHub instance.

@pvanliefland
Contributor

Hey @ericdill, thanks for the ping - I'll schedule some time for this tomorrow.

pvanliefland added commits to pvanliefland/s3contents that referenced this issue between Aug 26 and Sep 3, 2020, including:

  • Added chunk utils module and adapted save() in generic manager
  • Disable gracefully for Python < 3.7
  • Removed compat with notebook 4.*
  • Adapted test for Python <= 3.7
  • Added basic pruning of stale chunked uploads
  • Prune only if chunked support
  • Fixed iteration on stale paths
  • Prioritize existing exceptions before "python >= 3.7" error
  • Changed notebook version in requirements-package.txt
  • Dropped compat for Python 3.6
  • Adapted README to suggest using LargeFileManager
ericdill commented Sep 4, 2020

Closed by #99

ericdill closed this as completed Sep 4, 2020