-
@ggerganov, let's iterate on the solution in llama.cpp, then copy/paste it to whisper.cpp & ggml when it is "good enough". TBH, I've been wanting to contribute to llama.cpp... but haven't had the matching tech skills until now.
-
If you can secure one, Oracle's free Ampere instances would be a good candidate for continuous testing on ARM. They have 4 Ampere cores, 24 GB of RAM, and enough SSD storage for a few models.
-
Also, one issue with cloud VM benchmarking is that performance is inconsistent. If you do twenty runs of the same benchmark, the performance spread could be huge. Then you might spin up the exact same type of instance the next day, do twenty more runs of the benchmark, and find the performance very consistent... but somehow slower than the previous sporadic run. I'm not sure exactly why this is the case, but I'm sure it has to do with other tenants stressing the CPU and its subsystems, variable silicon binning, and such. So if performance metrics are important, consider a small bare-metal instance. The Oracle VMs mentioned above suffer from this, but since they are free and available 24/7, you can probably get around it with a large number of test runs.
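A minimal sketch of how one might quantify the run-to-run spread described above, assuming you already have wall-clock times (in seconds) from repeated runs of the same benchmark on one VM session. The median is less sensitive to a single noisy-neighbour outlier than the mean, and the coefficient of variation (stdev / mean) gives a unit-free measure of spread that can be compared across sessions. The function `summarize_runs` and the timing numbers are hypothetical illustrations, not measurements.

```python
import statistics


def summarize_runs(times_s: list[float]) -> dict[str, float]:
    """Summarize repeated benchmark timings (seconds) from one VM session."""
    if len(times_s) < 2:
        raise ValueError("need at least two runs to estimate spread")
    mean = statistics.mean(times_s)
    stdev = statistics.stdev(times_s)  # sample standard deviation
    return {
        "median": statistics.median(times_s),  # robust "typical" run time
        "mean": mean,
        "stdev": stdev,
        "cv": stdev / mean,  # coefficient of variation (unit-free spread)
        "min": min(times_s),
        "max": max(times_s),
    }


# Hypothetical timings: a noisy session vs. a consistent-but-slower one
day1 = [41.2, 38.9, 52.7, 40.1, 47.5]
day2 = [44.0, 44.3, 43.8, 44.1, 44.2]
for label, runs in (("day 1", day1), ("day 2", day2)):
    s = summarize_runs(runs)
    print(f"{label}: median={s['median']:.1f}s cv={s['cv']:.1%}")
```

Reporting both the median and the spread per session makes the "consistent but slower" case visible: a low CV alone does not mean the session is representative.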
-
I've got cloud accounts with Google, Amazon & Microsoft Azure... a lot of free capacity with Azure & Google. Curiously, I've had problems signing up with Oracle (in Australia). Tried twice unsuccessfully; not sure if it's my email address, my credit card, or something else. AWS also has its own silicon: Graviton (ARM) CPUs, plus the Inferentia & Trainium ML accelerators. IMO, with the NVIDIA shortages, Amazon's custom silicon will be getting a lot more popular. Hugging Face apparently uses them for most (all?) training. @ggerganov mentioned Azure, so I'll do a POC there; I assume he has some freebie on offer. I just signed up for USD $4,500/year with Azure for a non-profit I'm helping out. AWS has bare-metal Apple M2s; later on, I'd like to do performance testing there, given that Apple Silicon is a "first-class citizen" for llama.cpp. They have a 24-hour minimum usage purchase, so there are costs involved unless we get a ggml.ai AWS freebie deal.
-
Be aware that you will get a different VM every time you create a new one. To keep the perf results stable, you may have to keep one VM running.
-
OK, so I've done an initial review: checked out the "perplexity page", reviewed the GitHub Actions that build the releases, and test run
Some questions; please confirm/correct my assumptions:
Q1: Test Variables
Q2: Test Results Required
Q3: How will we present/display the results?
Q4: Where/how to record test results?
Q5: Test Hardware & OS
-
@ianscrivener, replying here as this is about performance and not perplexity per se:
If you're looking for performance (2), your dependent variable is execution time. (a), (c), and (d) should be kept constant and you should vary (b). (1) is irrelevant then, as for CI/CD it shouldn't change (at least say to

My point was that you may want to choose (c) and your llama config params judiciously, so that you get timings for all of 1 = sample, 2 = prompt, and 3 = eval time, where 1 + 2 + 3 = total compute time. Ignore the load time, as that's a function of the model size and the disk I/O of the VM the model is running on. Perplexity looks to only test load and prompt time. @ggerganov can confirm whether prompt time alone is sufficient for a performance test.

Furthermore, you want to ensure the runs are long enough that the potentially high variability between successive virtualized instances is somewhat averaged out. I'd use just 2 (or maybe 4) vCPUs (i.e.

Then, to deal with the inevitable variability of running on VMs when looking for non-functional performance regressions, I usually calculate an exponentially smoothed time-series model of previous CI/CD total performance-test run times for each (b). You can then compare the latest build for each (b) against this average and fail the build on performance if it is, say, 0.25 s.d. outside the predicted performance for that (b). I check both upper and lower bounds, as I've seen performance regressions where performance tests take 0 seconds to run 🤦
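A minimal Python sketch of the regression check described above, under assumptions the comment does not spell out: the "predicted performance" is taken to be an exponentially weighted moving average of past total run times for one value of (b), the s.d. is an exponentially weighted standard deviation of those same run times (standard incremental update), and `alpha = 0.3` and `min_history = 5` are arbitrary choices. `ew_mean_sd`, `check_build`, `Verdict`, and the sample history are hypothetical, not part of llama.cpp or any CI tool.

```python
import math
from dataclasses import dataclass


@dataclass
class Verdict:
    forecast_s: float  # exponentially smoothed "expected" run time
    sd_s: float        # exponentially weighted standard deviation
    latest_s: float    # run time of the build under test
    ok: bool           # True if latest_s lies inside the two-sided band


def ew_mean_sd(history_s: list[float], alpha: float = 0.3) -> tuple[float, float]:
    """Exponentially weighted mean and standard deviation of past run times.

    Newer builds get more weight; alpha in (0, 1] controls how quickly
    old results are forgotten.
    """
    if not history_s:
        raise ValueError("history must not be empty")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    mean, var = history_s[0], 0.0
    for x in history_s[1:]:
        diff = x - mean
        incr = alpha * diff
        mean += incr                               # smoothed level = forecast
        var = (1.0 - alpha) * (var + diff * incr)  # incremental EW variance
    return mean, math.sqrt(var)


def check_build(history_s: list[float], latest_s: float, k: float = 0.25,
                alpha: float = 0.3, min_history: int = 5) -> Verdict:
    """Fail if the latest run time is more than k s.d. from the forecast, on either side."""
    if len(history_s) < min_history:
        raise ValueError(f"need at least {min_history} past runs, got {len(history_s)}")
    forecast, sd = ew_mean_sd(history_s, alpha)
    ok = abs(latest_s - forecast) <= k * sd  # two-sided: too slow OR suspiciously fast
    return Verdict(forecast, sd, latest_s, ok)


# Hypothetical past total compute times (seconds) for one value of (b)
history = [40.2, 41.0, 39.8, 40.5, 40.9, 40.3]
for latest in (40.4, 55.0, 0.0):
    v = check_build(history, latest)
    status = "pass" if v.ok else "FAIL"
    print(f"latest={v.latest_s:5.1f}s forecast={v.forecast_s:.2f}s sd={v.sd_s:.2f}s {status}")
```

Because the check is two-sided, a run that suddenly takes 0 seconds fails just like one that gets much slower. With k = 0.25 the band is very tight (and a perfectly constant history gives sd = 0, so any change fails); in practice k and alpha would likely need tuning against the noise of the VMs actually used.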
-
Some back-of-the-envelope guesstimates & assumptions
-
@ggerganov, I just dropped you an email regarding CI 😀
-
Hi guys, I see this issue relates to Azure and CI. FYI, I set up a build/test/deploy workflow on my free Azure account, originally so I could build/download quantized models directly on Azure and save time/storage/memory on my laptop. It could also serve as a basis for CI. https://github.com/tpaviot/llama.cpp-azure is a separate project for now; I just made it public a couple of hours ago (it used to be a private repo). It could become a dev branch as well.
-
@tpaviot I checked out your code... looks like a good start. Here's the bash script I use to set up and run a llama.cpp perplexity test in a Docker container with CUDA support: https://gist.github.com/ianscrivener/71bde7a2bfc92e8d217900229d78df51
-
In response to @ggerganov's call for perplexity and latency testing for llama.cpp, I've coded llama.cpp perplexity scorecard... a helper project to run and gather
-
If we assume we had enough MS Azure credits, how difficult would it be to make a llama.cpp/whisper.cpp/ggml CI in the cloud that runs on every commit and does some performance and perplexity benchmarks across a variety of hardware and models?

I don't have any experience in this, so I'd like to hear some opinions and maybe also see if there are people who would be interested in implementing this type of CI.