
jupyter upload file truncation #80

Closed
Seanspt opened this issue Oct 15, 2019 · 13 comments

Seanspt commented Oct 15, 2019

When uploading a file via the Jupyter web UI, the file seems to be truncated.
The original file is 29 MB, but even though the upload finishes with no error, the file we end up with is only 5.3 MB.

Seanspt commented Oct 16, 2019

In the notebook upload handler, a large file is split into chunks and each chunk is uploaded in a separate request.

    @web.authenticated
    @gen.coroutine
    def put(self, path=''):
        """Saves the file in the location specified by name and path.

        PUT is very similar to POST, but the requester specifies the name,
        whereas with POST, the server picks the name.

        PUT /api/contents/path/Name.ipynb
          Save notebook at ``path/Name.ipynb``. Notebook structure is specified
          in `content` key of JSON request body. If content is not specified,
          create a new empty notebook.
        """
        model = self.get_json_body()
        if model:
            if model.get('copy_from'):
                raise web.HTTPError(400, "Cannot copy with PUT, only POST")
            exists = yield maybe_future(self.contents_manager.file_exists(path))
            if exists:
                yield maybe_future(self._save(model, path))
            else:
                yield maybe_future(self._upload(model, path))
        else:
            yield maybe_future(self._new_untitled(path))

However, in s3contents the file is opened with 'wb' on every write, so each chunk overwrites the previous one and only the last chunk is kept.

    def write(self, path, content, format):
        path_ = self.path(self.unprefix(path))
        self.log.debug("S3contents.S3FS: Writing file: `%s`", path_)
        if format not in {'text', 'base64'}:
            raise HTTPError(
                400,
                "Must specify format of file contents as 'text' or 'base64'",
            )
        try:
            if format == 'text':
                content_ = content.encode('utf8')
            else:
                b64_bytes = content.encode('ascii')
                content_ = base64.b64decode(b64_bytes)
        except Exception as e:
            raise HTTPError(
                400, u'Encoding error saving %s: %s' % (path_, e)
            )
        with self.fs.open(path_, mode='wb') as f:
            f.write(content_)

Trying to fix this.

@zac-yang
Contributor

We've encountered the same issue. This could be a good reference for a fix:
https://github.com/jupyter/notebook/blob/master/notebook/services/contents/largefilemanager.py
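
For context, the linked LargeFileManager avoids truncation roughly like this (a paraphrased sketch, not the verbatim upstream code): the first chunk creates the file and every later chunk is appended, so earlier data is never overwritten.

import base64

# Paraphrased sketch of the append-based approach used by LargeFileManager
# (illustrative only): chunk 1 creates/truncates the file, later chunks
# (including the final chunk == -1) are appended.
def save_chunk(os_path, b64_content, chunk):
    data = base64.b64decode(b64_content.encode("ascii"))
    mode = "wb" if chunk == 1 else "ab"  # append instead of overwrite after chunk 1
    with open(os_path, mode) as f:
        f.write(data)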

@rhlarora84

Is there a fix in progress for this issue, or a potential workaround? Thanks

rhlarora84 commented Jun 10, 2020

Looking at the LargeFileManager, the file content arrives in chunks when model['chunk'] is present; the last chunk is sent with a value of -1.

One potential solution could be to start a transaction in s3fs when the chunk is 1 and end the transaction when the chunk is -1.

Another approach could be to save the file locally using the LargeFileManager and then upload it to S3. Here is a crude way to test it out:

from notebook.services.contents.largefilemanager import LargeFileManager


def _save_file(self, model, path):
    file_contents = model["content"]
    file_format = model.get("format")
    chunk = model.get("chunk", None)
    large_file_manager = LargeFileManager()
    if chunk is not None:
        # Let LargeFileManager accumulate the chunks in a local file
        large_file_manager.save(model, path)
        if chunk == -1:
            # Last chunk: read the assembled file back and push it to S3
            updated_model = large_file_manager.get(path)
            self.fs.write(path, updated_model["content"], updated_model.get("format"))
            large_file_manager.delete_file(path)
    else:
        self.fs.write(path, file_contents, file_format)


# Monkey-patch the manager so this version is used instead of the original
GenericContentsManager._save_file = _save_file

There is also a put method in s3fs that would upload the local file to S3. I am sure that a clean fix would require changes to GenericContentsManager and S3FS in the package.

def _save_file(self, model, path):
    file_contents = model["content"]
    file_format = model.get("format")
    chunk = model.get("chunk", None)
    large_file_manager = LargeFileManager()
    if chunk is not None:
        large_file_manager.save(model, path)
        if chunk == -1:
            # Last chunk: upload the locally assembled file to S3 via s3fs's put()
            os_path = large_file_manager._get_os_path(path)
            destination_path = self.fs.path(self.fs.unprefix(path))
            self.fs.fs.put(os_path, destination_path)
            large_file_manager.delete_file(path)
    else:
        self.fs.write(path, file_contents, file_format)

@pvanliefland
Contributor

I have a working "in-memory" solution using ContextVar. @danielfrg @ericdill would you be interested in a PR?

The drawbacks:

  • The contextvars module is only available from Python 3.7 onwards
  • It's in-memory (but we cannot be sure that the user has access to a filesystem)

My first attempt was to use merge in s3fs, but it does not work: S3 does not accept multipart uploads with parts smaller than 5 MB (and JupyterLab is configured to send 1 MB chunks).

@ericdill
Collaborator

Hi @pvanliefland, I'd happily review a PR from you that fixes this problem. Regarding your two points, (1) Making this compatible with py37 or newer does not pose a problem for me and (2) can you elaborate a little? Is the concern that the upload would fail only after the user waited for a potentially long time to upload a file?

@pvanliefland
Contributor

Hey @ericdill, for (2), to give you a better idea, here is a simplified version of my code. If it makes sense to integrate it here, I'll work on a proper PR.

Let me know!

import base64
import contextvars

from notebook.services.contents.manager import ContentsManager
from traitlets import HasTraits

# Used as an in-memory "registry" for uploads.
# TODO: periodic cleanup - but when? As is, could cause memory issues
content_chunks = contextvars.ContextVar("jupyterlab_content_chunks", default={})


def store_content_chunk(path: str, content: str):
    """Store a base64 chunk in the registry as bytes"""

    current_value = content_chunks.get()

    if path not in current_value:
        current_value[path] = []

    current_value[path].append(base64.b64decode(content.encode("ascii"), validate=True))


def assemble_chunks(path: str) -> str:
    """Assemble the chunk bytes into a single base64 string"""

    current_value = content_chunks.get()

    if path not in current_value:
        raise ValueError(f"No chunk for path {path}")

    return base64.b64encode(b"".join(current_value[path])).decode("ascii")


def delete_chunks(path):
    """Should be called once the upload is complete to free the memory"""

    current_value = content_chunks.get()
    del current_value[path]


class GenericContentsManager(ContentsManager, HasTraits):
    def save(self, model, path=""):
        """Code inspired from notebook.services.contents.largefilemanager"""

        chunk = model.get("chunk", None)  # this is how the client sends it
        if chunk is not None:
            # some checks / try&except / logs skipped for readability

            if chunk == 1:
                self.run_pre_save_hook(model=model, path=path)
            # Store the chunk in our "in-memory" registry
            store_content_chunk(path, model["content"])

            if chunk == -1:
                # Last chunk: we want to combine the chunks in the registry to compose the full file content
                model["content"] = assemble_chunks(path)
                delete_chunks(path)
                self._save_file(model, path)

            return self.get(path, content=False)
        else:
            return super().save(model, path)
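
For reference, here is a quick self-contained illustration of the helper functions above (not part of the original comment, purely to show the intended flow):

import base64

# Simulate three chunks arriving from the client for the same path,
# then assemble the full content and clean up the registry.
for part in (b"hello ", b"chunked ", b"world"):
    store_content_chunk("demo/big.bin", base64.b64encode(part).decode("ascii"))

full_b64 = assemble_chunks("demo/big.bin")
delete_chunks("demo/big.bin")
assert base64.b64decode(full_b64) == b"hello chunked world"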

@ericdill
Collaborator

Hey @pvanliefland, this looks good to me. Certainly better than the current situation where large file uploads just totally fail. I think it makes sense to discuss the remaining concerns in the PR.

  1. "periodic cleanup - but when ?": You raise a good point about memory leakage. Do any of the other content managers solve this problem?

@pvanliefland
Contributor

@ericdill ok, I'll start working on a PR.

For cleanup of the ContextVar, I couldn't really find a good example elsewhere: I think that none of the custom ContentsManager implementations handle chunked uploads properly.

As for the standard LargeFileManager, it doesn't run into the same potential issue: it creates a new file for chunk 1, and simply opens this file in append mode whenever a new chunk is posted.

At this point I'm considering a couple of approaches; I'll probably try both.
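
Purely as an illustration of what cleaning up the in-memory registry could look like (the names and timeout below are hypothetical, not what was eventually merged):

import time

# Hypothetical sketch: track a last-touched timestamp per path alongside the
# content_chunks registry, and prune entries that have gone stale.
chunk_timestamps = {}       # path -> time.time() of the most recent chunk
STALE_AFTER_SECONDS = 3600  # arbitrary threshold for this sketch

def touch_chunks(path):
    chunk_timestamps[path] = time.time()

def prune_stale_chunks():
    now = time.time()
    for path in list(chunk_timestamps):  # copy keys so we can delete while iterating
        if now - chunk_timestamps[path] > STALE_AFTER_SECONDS:
            delete_chunks(path)          # helper from the snippet above
            del chunk_timestamps[path]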

@ericdill
Collaborator

Sounds great, thanks for tackling this!

@ericdill
Collaborator

Hi @pvanliefland, how are things going? Want to open a PR with these changes so far? I'd be happy to help test against my JupyterHub instance.

@pvanliefland
Contributor

Hey @ericdill, thanks for the ping - I'll schedule some time for this tomorrow.

pvanliefland added commits to pvanliefland/s3contents that referenced this issue between Aug 26 and Sep 3, 2020, including:

  • Added chunk utils module and adapted save() in generic manager
  • Disable gracefully for Python < 3.7
  • Removed compat with notebook 4.*
  • Adapted test for Python <= 3.7
  • Added basic pruning of stale chunked uploads
  • Prune only if chunked support
  • Fixed iteration on stale paths
  • Prioritize existing exceptions before "python >= 3.7" error
  • Changed notebook version in requirements-package.txt
  • Dropped compat for Python 3.6
  • Adapted README to suggest using LargeFileManager
ericdill commented Sep 4, 2020

Closed by #99

ericdill closed this as completed Sep 4, 2020