Running on big multicore machines (servers) #6730

diiigle · 2023-03-29T07:20:27Z

Disclaimer: I tried my best to find an answer in issues + web before posting here, but wasn't successful.

Summary: I experience slow batch processing when using more cores.

I'm running a batch conversion of about 240k CR2 raw images, in batches of 5-9 images at a time. The profile I'm running is fairly minimal, and it outputs to 16-bit TIFF (rawtherapee-cli -t -b16) for subsequent HDR stacking.

The hardware I'm running this on consists of multiple powerful Linux servers that are interconnected via (SSD-cached) NFS.

| Name | Cores | Threads | RAM |
|---|---|---|---|
| Server1 | 2 x 8 | 2 x 16 | 112GB |
| Server2 | 2 x 16 | 2 x 32 | 126GB |

Testing with a limited number of threads, I get the following results. I can restrict the process to the cores of a single socket by reading /sys/devices/system/node/node{0,1}/cpulist and passing that list to docker --cpuset-cpus.

| Server | Number of threads | Time per 5 images |
|---|---|---|
| Server1 | 32 | ~45s IIRC |
| Server1 | 8 | 46s |
| Server2 | 64 | 300s |
| Server2 | 8 (`OMP_NUM_THREADS=8`) | 50s |
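
The single-socket restriction described above could be scripted roughly as follows. This is a minimal sketch, assuming the usual sysfs cpulist format of comma-separated ids and ranges (for example `0-7,16-23`, which is an illustrative value and not taken from either server). The helper names `parse_cpulist`, `read_node_cpulist` and `cpuset_arg` are hypothetical.

```python
from pathlib import Path


def parse_cpulist(text: str) -> list[int]:
    """Expand a cpulist string such as '0-7,16-23' into individual CPU ids."""
    cpus: list[int] = []
    for part in text.strip().split(","):
        if not part:
            continue  # tolerate empty input or a trailing comma
        if "-" in part:
            start, end = part.split("-")
            cpus.extend(range(int(start), int(end) + 1))  # ranges are inclusive
        else:
            cpus.append(int(part))
    return cpus


def read_node_cpulist(node: int) -> str:
    """Read the raw cpulist of one NUMA node (Linux sysfs only)."""
    path = Path(f"/sys/devices/system/node/node{node}/cpulist")
    return path.read_text().strip()


def cpuset_arg(cpulist: str) -> str:
    """Build the docker option that confines a container to the given CPUs."""
    return f"--cpuset-cpus={cpulist}"
```

On a Linux host, `cpuset_arg(read_node_cpulist(0))` would give the option for the first socket, and `len(parse_cpulist(...))` tells you how many hardware threads that socket exposes.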

I/O considerations

The raw files are read from NFS, and the output is written to a RAM disk at /run/user/<userid>/. Even if I run it twice on the same images (so that the input is cached in RAM by the Linux kernel), the timings don't change, which suggests that I/O is not the bottleneck.

OpenMP

Glancing through the code, I see that parallelization is implemented using OpenMP. In my own programming, I have had rather bad experiences with OpenMP performance (on GCC/Linux), to the point where running single-threaded (OMP_NUM_THREADS=1) was sometimes even faster.

I hope you find this evaluation somewhat useful. I would like to see the developers comment on this from their experience, and possibly have the documentation extended to cover multicore machines.

Tags: multicore, many threads, many core, HPC, parallelization, OpenMP, slow

noirsabb · 2023-05-05T11:29:54Z

Hi

I experienced some RT performance issues when working with .CR2 and .CR3 files and had to tune the OMP_NUM_THREADS and OMP_THREAD_LIMIT environment variables on a high-core-count, workstation-class machine. By contrast, I noticed adequate performance (for me) on a Latte Panda D with four cores and four threads, with OpenMP compiled into RT, running under Linux (Kubuntu).

I am unclear which values of OMP_NUM_THREADS you used for each server, other than in the last row of the table above. For me, OMP_NUM_THREADS between 4 and 8 (the number of threads to allocate per parallel region) and an OMP_THREAD_LIMIT of 32 work well for .CR2 and .CR3 files; 100MB .RAF files are a work in progress.
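
Both variables can be set per invocation instead of globally. A minimal sketch, assuming the batch is driven from Python: `omp_env` is a hypothetical helper, and its defaults simply mirror the values reported in this reply (8 threads, limit of 32). They are not a recommendation from the RT developers.

```python
import os


def omp_env(num_threads: int = 8, thread_limit: int = 32, base=None) -> dict:
    """Return a copy of the environment with the two OpenMP variables set.

    `base` defaults to the current process environment; pass a dict to
    start from something else.
    """
    if num_threads < 1 or thread_limit < 1:
        raise ValueError("thread counts must be positive")
    env = dict(os.environ if base is None else base)
    env["OMP_NUM_THREADS"] = str(num_threads)    # threads per parallel region
    env["OMP_THREAD_LIMIT"] = str(thread_limit)  # cap on total OpenMP threads
    return env
```

The result can be handed to `subprocess.run(cmd, env=omp_env(8, 32), check=True)`, where `cmd` is whatever rawtherapee-cli argument list is already in use.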

I suppose you could try profiling RT under your installation to see what might be causing the issues identified.

NS

1 reply

diiigle May 9, 2023
Author

Thanks for sharing! You seem to confirm my findings. I think the application could be a bit smarter about its thread pool and limit it to 4-8 threads by default if the implemented algorithms are known not to parallelize well beyond that.

In the rows where OpenMP is not mentioned, I used docker --cpuset-cpus to limit the cores visible to the application from the kernel. That has the additional benefit of confining the threads to that set of cores, so the kernel cannot migrate them to the other socket at will (it can still move them within the set). Should be good for the caches 😉
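
The default suggested above could look roughly like this. It is a sketch of the heuristic only, not of how RT actually sizes its thread pool: `default_thread_count` is a hypothetical name, and the cap of 8 is an assumption taken from the 4-8 range discussed in this thread. An explicit OMP_NUM_THREADS is left to win, and the CPU count comes from the affinity mask so that a docker --cpuset-cpus restriction is respected.

```python
import os


def default_thread_count(cap: int = 8, env=None, available=None) -> int:
    """Pick a thread count: explicit OMP_NUM_THREADS, else min(usable CPUs, cap)."""
    env = os.environ if env is None else env
    explicit = env.get("OMP_NUM_THREADS", "")
    # Only a plain positive integer is honoured here; list forms such as
    # "4,2" (nested parallelism) fall through to the heuristic.
    if explicit.isdigit() and int(explicit) > 0:
        return int(explicit)
    if available is None:
        if hasattr(os, "sched_getaffinity"):
            # Linux: counts only the CPUs this process is allowed to run on.
            available = len(os.sched_getaffinity(0))
        else:
            available = os.cpu_count() or 1
    return max(1, min(available, cap))
```

With 64 usable hardware threads and no OMP_NUM_THREADS set, this picks 8, which matches the fastest Server2 configuration in the table above.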
