This repository has been archived by the owner on Jun 25, 2023. It is now read-only.

ddnovikov submission #7

Open · wants to merge 13 commits into main
Conversation

ddnovikov

Hi!

Here's my solution. I'm not 100% sure I enabled the GPU correctly in the Helm values, but hopefully it's ok. Please advise on what to change if it doesn't work -- I don't have experience with Kubernetes and Helm.

I have some ideas of what could be improved and experimented with here to boost performance. But from my experience: a) from this point most optimizations will be very time-consuming and give very marginal gains; b) even if there are good ideas for optimization (e.g. spending time to convert these models to ONNX?), they should be weighed against the real-world usage context to be worth it. So my final solution is what seems reasonable to do given the requirements and the ~15-20 hours I was ready to spend on the challenge.

But also, for the sake of experience, I'd be happy to hear any easy ideas for serious optimization if you have some! Thanks!

@darknessest
Collaborator

Hello @ddnovikov, thank you for the great submission.
Could you please move the model downloading to runtime and save the models to the /models path (see the mounted volume)?
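A minimal sketch of what that could look like, assuming `huggingface_hub` is available in the container; the `MODELS_DIR` default and the repo-id flattening scheme are illustrative, not taken from the submission:

```python
import os
from pathlib import Path

MODELS_DIR = Path(os.environ.get("MODELS_DIR", "/models"))  # the mounted volume

def local_model_path(repo_id: str) -> Path:
    # One subdirectory per repo; "/" is not valid in a single directory name,
    # so flatten e.g. "org/model" to "org__model" (naming scheme is illustrative).
    return MODELS_DIR / repo_id.replace("/", "__")

def ensure_model(repo_id: str) -> Path:
    # Download at application startup instead of baking weights into the image,
    # so container restarts reuse the persistent /models volume.
    target = local_model_path(repo_id)
    if not target.exists():
        from huggingface_hub import snapshot_download  # assumed available
        snapshot_download(repo_id, local_dir=target)
    return target
```

`ensure_model(...)` would then be called once per model at startup, before the inference pipelines are built.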

We deployed your project, and unfortunately the models didn't load properly on the GPU; here are the logs:

logs
INFO:     Started server process [1]
INFO:     Waiting for application startup.
Task exception was never retrieved
future: <Task finished name='Task-3' coro=<model_inference_task() done, defined at /code/./app.py:19> exception=RuntimeError('Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx')>
Traceback (most recent call last):
  File "/code/./app.py", line 20, in model_inference_task
    text_classification_pipeline = pipeline('text-classification', model=model_path, device=0)
  File "/usr/local/lib/python3.10/site-packages/transformers/pipelines/__init__.py", line 979, in pipeline
    return pipeline_class(model=model, framework=framework, task=task, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/transformers/pipelines/text_classification.py", line 83, in __init__
    super().__init__(**kwargs)
  File "/usr/local/lib/python3.10/site-packages/transformers/pipelines/base.py", line 773, in __init__
    self.model.to(device)
  File "/usr/local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 1896, in to
    return super().to(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1145, in to
    return self._apply(convert)
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 820, in _apply
    param_applied = fn(param)
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1143, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
  File "/usr/local/lib/python3.10/site-packages/torch/cuda/__init__.py", line 247, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
[the same "Task exception was never retrieved" traceback repeats for Task-4 through Task-7, each ending in the same "Found no NVIDIA driver" RuntimeError]
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)

I know it's time-consuming to deal with this type of issue; if you want to proceed with the challenge, we recommend that you:

  • use pytorch/pytorch:*-cuda*-cudnn*-runtime images, as we found them to be hassle-free when working with AWS GPU instances
  • specify the GPU as device="cuda:0" rather than device=0 (see line)
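As a tiny illustration of the second point (the helper name is mine, not from the submission), the device argument can be normalized to the explicit string form that `transformers` pipelines also accept:

```python
def device_arg(gpu_index=None):
    # transformers pipelines accept either an int (0 = first GPU) or a string
    # like "cuda:0"; the reviewers suggest the unambiguous string form.
    return "cpu" if gpu_index is None else f"cuda:{gpu_index}"

# pipeline("text-classification", model=model_path, device=device_arg(0))
print(device_arg(0))  # cuda:0
```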

@ddnovikov
Author

ddnovikov commented May 10, 2023

Oh, I see, thank you!

Actually, I've got a very fun idea to play around with, so that I can at least proudly put it on my GitHub page later 😁. It may be more of a mess in terms of GPU stuff (considering the issue above), but anything can be fixed given time and effort.

For the purposes of the challenge I'm not going to disclose the idea now, but judging by previous challenges I guess I have 2-3 more weeks to implement it, right?

@darknessest
Collaborator

For the purposes of the challenge I am not going to disclose the idea now, but judging by previous challenges I guess I have 2-3 weeks more to implement it, right?

There's no deadline yet; I think it's safe to say you have 2-3 weeks.

@ddnovikov ddnovikov closed this May 11, 2023
@ddnovikov ddnovikov reopened this May 14, 2023
@rsolovev
Collaborator

Hey @ddnovikov, the issues with GPU usage were resolved -- thank you for the beautiful solution. Here are our test results on a Grafana dashboard.

If you would like to work on your Python solution further, you can continue optimizing/improving it and re-request our review once done. Any contribution during the challenge period will be taken into account when choosing a winner. Many thanks!

P.S. I'll come back with the dashboard for the Rust solution a bit later today.

@ddnovikov
Author

Hi @rsolovev! I decided to experiment with some improvements in my spare time. Could you please run the tests? Thanks!

@rsolovev (Collaborator) left a comment

@ddnovikov sure, here are the results for the latest commit -- grafana

@ddnovikov
Author

@rsolovev, thanks! I made some more improvements, could you please check them again?

@rsolovev (Collaborator) left a comment

@ddnovikov -- new peak throughput record! -- grafana

@ddnovikov
Author

@rsolovev Yay! I should also mention that on newer Ampere-architecture GPUs (such as the NVIDIA A16 I used) this code processes twice as many iterations -- I reached a total of 15,800. The funny thing is that the server I rented (6 vCPU, 64 GB RAM, NVIDIA A16) costs 32% less ($0.512/hr vs $0.752/hr) than a g4dn.2xlarge while giving 2.1x better performance 🙃

@ddnovikov
Author

@rsolovev Also, could you please tell me what CUDA version is installed on the machines you're using?

@rsolovev
Collaborator

@ddnovikov

Wed May 31 10:56:13 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+

@ddnovikov
Author

@rsolovev, hi, here's another attempt at ONNX. I have no idea if it runs, because it didn't run on my machine -- I suspect a driver issue that I can't resolve with the hardware I have. I just hope it runs out of the box on your g4dn 🤣

@rsolovev (Collaborator) left a comment

@ddnovikov ONNX seems to launch successfully, but there are errors on request:

INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Task exception was never retrieved
future: <Task finished name='Task-3' coro=<model_inference_task() done, defined at /code/./app.py:21> exception=IndexError('list index out of range')>
Traceback (most recent call last):
  File "/code/./app.py", line 47, in model_inference_task
    logits = model(**encoded_input).logits
  File "/opt/conda/lib/python3.10/site-packages/optimum/modeling_base.py", line 85, in __call__
    return self.forward(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/optimum/onnxruntime/modeling_ort.py", line 1234, in forward
    io_binding, output_shapes, output_buffers = self.prepare_io_binding(
  File "/opt/conda/lib/python3.10/site-packages/optimum/onnxruntime/modeling_ort.py", line 807, in prepare_io_binding
    return self._prepare_io_binding(self.model, ordered_input_names=ordered_input_names, *model_inputs)
  File "/opt/conda/lib/python3.10/site-packages/optimum/onnxruntime/modeling_ort.py", line 752, in _prepare_io_binding
    name = ordered_input_names[idx]
IndexError: list index out of range
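Two hedged guesses about this log: the "Asking to truncate to max_length" warning goes away once an explicit `max_length` is passed to the tokenizer, and the `ordered_input_names` IndexError can sometimes be sidestepped by constructing the ORT model with `use_io_binding=False` (an `optimum` `from_pretrained` option). A small helper for the first part; the fallback of 512 and the sentinel check are assumptions, not from the submission:

```python
def tokenizer_kwargs(model_max_length=None, fallback=512):
    # Tokenizers with no configured limit report a huge sentinel value (~1e30);
    # treat anything implausibly large as "not set" and use the fallback cap,
    # so truncation actually happens and the warning disappears.
    limit = model_max_length if model_max_length and model_max_length < 10**6 else fallback
    return {"truncation": True, "max_length": limit, "padding": True}

# encoded_input = tokenizer(texts, return_tensors="pt",
#                           **tokenizer_kwargs(tokenizer.model_max_length))
```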

electriclizard added a commit that referenced this pull request Jun 7, 2023