ddnovikov submission #7
base: main
Conversation
…rks under k6's load but obviously optimizations are required.
… to cost lots of time but are going to be marginal. Timeout/batch_size settings may be tied to the HW I used for the tests.
…k space for submission.
Hello @ddnovikov, thank you for the great submission. We deployed your project, but unfortunately the models didn't load properly on a GPU; here are the logs: logs
I know it's time-consuming to deal with this type of issue. If you want to proceed with the challenge, we'd recommend you:
Oh, I see, thank you! Actually, I have a very fun idea I'd like to play around with, if only to proudly put it on my GitHub page later 😁. I think it may be even more of a mess in terms of GPU stuff (considering the issue above), but anything can be fixed given time and effort. For the purposes of the challenge I'm not going to disclose the idea now, but judging by previous challenges I guess I have 2-3 more weeks to implement it, right?
There's no deadline yet; I think it's safe to say that you have 2-3 weeks.
Hey @ddnovikov, the issues with GPU usage were resolved, thank you for the beautiful solution. Here are our test results on a Grafana dashboard. If you would like to work on your Python solution further, you can continue optimizing/improving it and re-request our review once done. Any contribution during the challenge period will be taken into account when choosing a winner. Many thanks! P.S. I'll come back with the dashboard for the Rust solution a bit later today.
Hi @rsolovev! I decided to experiment with some improvements in my spare time. Could you please run the tests? Thanks!
@ddnovikov sure, here are the results for the latest commit -- grafana
@rsolovev, thanks! I made some more improvements; could you please check them again?
@ddnovikov -- new peak throughput record! -- grafana
@rsolovev Yay! I should also say that on newer Ampere-architecture GPUs (such as the NVIDIA A16 I used) this code processes twice as many iterations -- I reached a total of 15800. The funny thing is that the server I rented (6 vCPU, 64 GB RAM, NVIDIA A16) costs 32% less ($0.512/hr vs $0.752/hr) than a g4dn.2xlarge while giving 2.1x better performance 🙃
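For what it's worth, the rough arithmetic behind that comparison is sketched below. Note that the g4dn.2xlarge iteration count is not a measured number from the dashboard; it is back-derived from the quoted 2.1x ratio.

```python
# Back-of-the-envelope price/performance check, using the figures quoted above.
# The g4dn.2xlarge iteration count is an assumption derived from the 2.1x claim.
a16_price, g4dn_price = 0.512, 0.752    # $/hr
a16_iters = 15_800                      # measured on the A16 server
g4dn_iters = a16_iters / 2.1            # ~7_500, inferred, not measured

print(f"A16 is {1 - a16_price / g4dn_price:.0%} cheaper per hour")        # ~32%
print(f"cost per 1k iterations: A16 ${1000 * a16_price / a16_iters:.3f}, "
      f"g4dn ${1000 * g4dn_price / g4dn_iters:.3f}")                      # roughly 3x cheaper per iteration
```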
@rsolovev Also, could you please tell me which CUDA version is installed on the machines you're using?
Hi @rsolovev, I have another attempt at ONNX. I have no idea if it runs, because it didn't run on my machine; I suspect a driver issue that I can't resolve with the hardware I have. I just hope it runs out of the box on your g4dn 🤣
@ddnovikov ONNX seems to launch successfully, but there are errors on request:
INFO: Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Task exception was never retrieved
future: <Task finished name='Task-3' coro=<model_inference_task() done, defined at /code/./app.py:21> exception=IndexError('list index out of range')>
Traceback (most recent call last):
File "/code/./app.py", line 47, in model_inference_task
logits = model(**encoded_input).logits
File "/opt/conda/lib/python3.10/site-packages/optimum/modeling_base.py", line 85, in __call__
return self.forward(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/optimum/onnxruntime/modeling_ort.py", line 1234, in forward
io_binding, output_shapes, output_buffers = self.prepare_io_binding(
File "/opt/conda/lib/python3.10/site-packages/optimum/onnxruntime/modeling_ort.py", line 807, in prepare_io_binding
return self._prepare_io_binding(self.model, ordered_input_names=ordered_input_names, *model_inputs)
File "/opt/conda/lib/python3.10/site-packages/optimum/onnxruntime/modeling_ort.py", line 752, in _prepare_io_binding
name = ordered_input_names[idx]
IndexError: list index out of range
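For context, this IndexError inside optimum's IO binding typically appears when the tokenizer emits more inputs (e.g. token_type_ids) than the exported ONNX graph declares. A minimal sketch of one possible workaround follows; the model, tokenizer and batch_texts names are assumed from the submission's app.py, which is not shown in this thread.

```python
# Possible workaround sketch (assumes `model` is an optimum ORTModel instance and
# `tokenizer` / `batch_texts` come from app.py). As the traceback shows, `model.model`
# is the underlying onnxruntime.InferenceSession.
expected_inputs = {node.name for node in model.model.get_inputs()}

encoded_input = tokenizer(
    batch_texts,
    padding=True,
    truncation=True,
    max_length=512,      # also silences the "no maximum length" truncation warning
    return_tensors="pt",
)

# Drop tokenizer outputs that the ONNX graph does not declare (e.g. token_type_ids),
# so the IO binding sees exactly the inputs it expects.
encoded_input = {k: v for k, v in encoded_input.items() if k in expected_inputs}

logits = model(**encoded_input).logits
```

Whether that is the actual cause here would need to be checked against the exported graph's declared inputs.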
Hi!
Here's my solution. I'm not 100% sure I enabled the GPU in the Helm values correctly, but hopefully it's OK. Please advise on the changes needed if it doesn't work -- I don't have experience with Kubernetes and Helm.
I have some ideas about what could still be improved or experimented with here to boost performance. But in my experience: a) from this point most optimizations will be very time-consuming and give very marginal gains; b) even where there are good optimization ideas (e.g. spending the time to convert these models to ONNX?), they need to be weighed against the real-world usage context to be worth it. So my final solution is what seems reasonable given the requirements and the ~15-20 hours I was ready to spend on the challenge.
But also, for the sake of experience, I'll be happy to hear about any easy ideas for serious optimization if you have some! Thanks!
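For reference, the batching scheme hinted at by the commit messages and the traceback (a model_inference_task coroutine draining an asyncio queue, flushing on either a batch size or a timeout) would look roughly like the sketch below. The identifiers BATCH_SIZE, BATCH_TIMEOUT, request_queue and run_model are illustrative assumptions, not the exact code from app.py.

```python
import asyncio

# Illustrative micro-batching loop: flush when BATCH_SIZE items are queued or when
# BATCH_TIMEOUT seconds pass, whichever comes first. run_model() is a hypothetical
# stand-in for one batched forward pass of the real model.
BATCH_SIZE = 32
BATCH_TIMEOUT = 0.01

request_queue: asyncio.Queue = asyncio.Queue()

async def model_inference_task():
    while True:
        batch = [await request_queue.get()]                  # wait for the first request
        deadline = asyncio.get_running_loop().time() + BATCH_TIMEOUT
        while len(batch) < BATCH_SIZE:
            remaining = deadline - asyncio.get_running_loop().time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(request_queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        texts, futures = zip(*batch)                         # each item is a (text, Future) pair
        results = run_model(list(texts))                     # one batched forward pass
        for fut, result in zip(futures, results):
            fut.set_result(result)                           # unblock the waiting handlers
```

Each request handler would then put a (text, future) pair onto the queue and await the future; the batch size/timeout trade-off is exactly what makes those two settings hardware-dependent.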