scx_rusty: Add mempolicy checks to rusty #364

Merged: 3 commits from the rusty-mempolicy branch into sched-ext:main on Jul 16, 2024

Conversation

hodgesds (Contributor):

This change makes scx_rusty mempolicy aware. When a process uses set_mempolicy it can change its NUMA memory preferences, which can cause performance issues when its tasks are scheduled on remote NUMA nodes. This change modifies task_pick_domain to use a new helper that returns the preferred node id.
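
For reference, a minimal sketch (not the PR's actual code) of how a preferred node can be derived from a task's mempolicy in BPF. The struct mempolicy fields (mode, nodes) are the kernel's; task_preferred_node and NO_NODE_FOUND are illustrative names, and a fuller version would also check mempolicy->mode (e.g. MPOL_PREFERRED vs. MPOL_BIND):

#include "vmlinux.h"
#include <bpf/bpf_core_read.h>

#define NO_NODE_FOUND ((u32)-1)

/* Return the lowest-numbered node in p's mempolicy node mask, if any. */
static u32 task_preferred_node(struct task_struct *p)
{
	struct mempolicy *mempolicy;
	unsigned long nodes;
	u32 node;

	mempolicy = BPF_CORE_READ(p, mempolicy);
	if (!mempolicy)
		return NO_NODE_FOUND;

	/* The first word of the nodemask covers nodes 0-63, plenty here. */
	nodes = BPF_CORE_READ(mempolicy, nodes.bits[0]);

	for (node = 0; node < 64; node++) {
		if (nodes & (1LLU << node))
			return node;
	}

	return NO_NODE_FOUND;
}

A helper along these lines is what task_pick_domain consults; later in the review this gets generalized into a mask of preferred domains.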

With the --mempolicy-affinity flag set:

$ stress-ng  -M --mbind 1 --malloc 5 -t 10 --bigheap 10 --numa 5
stress-ng: info:  [873775] setting to a 10 secs run per stressor
stress-ng: info:  [873775] dispatching hogs: 5 malloc, 10 bigheap, 5 numa
stress-ng: info:  [873796] numa: system has 2 of a maximum 8 memory NUMA nodes
stress-ng: metrc: [873775] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
stress-ng: metrc: [873775]                           (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
stress-ng: metrc: [873775] malloc          6079478     10.06     23.71     26.42    604560.22      121266.06        99.71        122880
stress-ng: metrc: [873775] bigheap         1466461     11.89     10.37    108.17    123327.88       12371.04        99.69       9393280
stress-ng: metrc: [873775] numa                 35     10.06      1.02      0.50         3.48          22.97         3.03          5120
stress-ng: metrc: [873775] miscellaneous metrics:
stress-ng: metrc: [873775] bigheap           412301.05 realloc calls per sec (geometric mean of 10 instances)
stress-ng: info:  [873775] skipped: 0
stress-ng: info:  [873775] passed: 20: malloc (5) bigheap (10) numa (5)
stress-ng: info:  [873775] failed: 0
stress-ng: info:  [873775] metrics untrustworthy: 0
stress-ng: info:  [873775] successful run completed in 11.91 secs

scx_rusty default:

$ stress-ng  -M --mbind 1 --malloc 5 -t 10 --bigheap 10 --numa 5
stress-ng: info:  [875135] setting to a 10 secs run per stressor
stress-ng: info:  [875135] dispatching hogs: 5 malloc, 10 bigheap, 5 numa
stress-ng: info:  [875155] numa: system has 2 of a maximum 8 memory NUMA nodes
stress-ng: metrc: [875135] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
stress-ng: metrc: [875135]                           (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
stress-ng: metrc: [875135] malloc          6272998     10.06     23.23     26.93    623537.39      125044.23        99.73        122240
stress-ng: metrc: [875135] bigheap          986926     11.51      7.07    107.74     85723.73        8595.96        99.73       6389760
stress-ng: metrc: [875135] numa                 25     10.05      0.54      0.38         2.49          27.30         1.82          5120
stress-ng: metrc: [875135] miscellaneous metrics:
stress-ng: metrc: [875135] bigheap           398465.21 realloc calls per sec (geometric mean of 10 instances)
stress-ng: info:  [875135] skipped: 0
stress-ng: info:  [875135] passed: 20: malloc (5) bigheap (10) numa (5)
stress-ng: info:  [875135] failed: 0
stress-ng: info:  [875135] metrics untrustworthy: 0
stress-ng: info:  [875135] successful run completed in 11.52 secs

CFS:

$ stress-ng  -M --mbind 1 --malloc 5 -t 10 --bigheap 10 --numa 5
stress-ng: info:  [882100] setting to a 10 secs run per stressor
stress-ng: info:  [882100] dispatching hogs: 5 malloc, 10 bigheap, 5 numa
stress-ng: info:  [882125] numa: system has 2 of a maximum 8 memory NUMA nodes
stress-ng: metrc: [882100] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
stress-ng: metrc: [882100]                           (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
stress-ng: metrc: [882100] malloc          6259502     10.07     23.07     27.01    621890.00      124990.82        99.51        122876
stress-ng: metrc: [882100] bigheap          874114     11.50      5.59    108.43     76008.26        7666.52        99.14       5612800
stress-ng: metrc: [882100] numa                395     10.04      3.45      4.72        39.35          48.38        16.27          5120
stress-ng: metrc: [882100] miscellaneous metrics:
stress-ng: metrc: [882100] bigheap           373862.39 realloc calls per sec (geometric mean of 10 instances)
stress-ng: info:  [882100] skipped: 0
stress-ng: info:  [882100] passed: 20: malloc (5) bigheap (10) numa (5)
stress-ng: info:  [882100] failed: 0
stress-ng: info:  [882100] metrics untrustworthy: 0
stress-ng: info:  [882100] successful run completed in 11.50 secs

The bigheap benchmark sees a moderate improvement, while most everything else is flat or worse. So this flag may make sense if something is using mbind with lots of allocations.

Byte-Lab (Contributor) left a comment:

Definitely think this will be useful, thanks for working on it. Putting it back into your queue for now per our discussion.

scheds/rust/scx_rusty/src/bpf/main.bpf.c (outdated review thread, resolved)
static u32 task_pick_mempolicy_domain(
	struct task_struct *p, struct task_ctx *taskc, u32 rr_token)
{
	u32 ret = NO_DOM_FOUND;
Byte-Lab (Contributor):

nit: Just to match kernel coding style, could you please add a newline between variable declarations and the rest of the function, and move all variable declarations to the top of whatever scope they're declared in?
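
For illustration only (pick_first_dom_on_node is a hypothetical function, not part of this PR), the requested layout is: declarations grouped at the top of the scope, then a blank line, then the rest of the body:

static u32 pick_first_dom_on_node(u32 node_id)
{
	u32 ret = NO_DOM_FOUND;
	u32 dom_id;

	bpf_for(dom_id, 0, nr_doms) {
		if (dom_node_id(dom_id) != node_id)
			continue;
		ret = dom_id;
		break;
	}

	return ret;
}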

scheds/rust/scx_rusty/src/bpf/main.bpf.c (outdated review thread, resolved)
Comment on lines 595 to 596
	if (ret != NO_DOM_FOUND && i % rr_token == 0)
		return ret;
Byte-Lab (Contributor):

Spoke offline -- we're matching NUMA node with domain ID here, but they might not match up if there are multiple CCXs per node. We can instead have something like a mask of preferred domains (that's informed by p->mempolicy->nodes), and then query that when looking for a domain when we do round robin in the caller, and in the load balancer.
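
A simplified sketch of what that suggestion amounts to (illustrative only; set_preferred_dom_mask is a hypothetical name, and the PR's actual helper, task_set_preferred_mempolicy_dom_mask, appears further down): translate the mempolicy node mask into a mask of preferred domains via dom_node_id(), so the lookup stays correct when one NUMA node spans several domains:

static void set_preferred_dom_mask(struct task_struct *p,
				   struct task_ctx *taskc)
{
	struct mempolicy *mempolicy;
	unsigned long nodes;
	u32 dom_id;

	taskc->preferred_dom_mask = 0;

	mempolicy = BPF_CORE_READ(p, mempolicy);
	if (!mempolicy)
		return;

	/* First word of the nodemask covers nodes 0-63. */
	nodes = BPF_CORE_READ(mempolicy, nodes.bits[0]);
	if (!nodes)
		return;

	/* Mark every domain whose NUMA node is in the mempolicy node mask. */
	bpf_for(dom_id, 0, nr_doms) {
		if (nodes & (1LLU << dom_node_id(dom_id)))
			taskc->preferred_dom_mask |= 1LLU << dom_id;
	}
}

The round-robin picker and the load balancer can then test (1LLU << dom) & taskc->preferred_dom_mask when choosing a domain.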

hodgesds force-pushed the rusty-mempolicy branch 3 times, most recently from 607c4e4 to 3a9e468 on July 12, 2024 20:43
hodgesds (Contributor, Author) commented on Jul 12, 2024:

Some more testing results. It looks like a slight dip in malloc performance with this disabled (I think that makes sense?), but a nice bump in numa and bigheap performance.

scx_rusty --mempolicy-affinity:

$ for i in {1..15}; do stress-ng  -M --mbind 0 --malloc 5 -t 10 --bigheap 10 --numa 5 | grep 'malloc\|bigheap\|numa' | grep -v '(' | grep -v 'numa:\|hogs' | tr -s ' ' ' ' | cut -d ' ' -f 1,4,10 | sed 's/$/ ops\/sec/'; done
stress-ng: malloc 2414959.00 ops/sec
stress-ng: bigheap 9851.43 ops/sec
stress-ng: numa 14.79 ops/sec
stress-ng: malloc 2330303.78 ops/sec
stress-ng: bigheap 9834.70 ops/sec
stress-ng: numa 15.48 ops/sec
stress-ng: malloc 2277192.83 ops/sec
stress-ng: bigheap 9907.67 ops/sec
stress-ng: numa 16.95 ops/sec
stress-ng: malloc 2398472.53 ops/sec
stress-ng: bigheap 8475.79 ops/sec
stress-ng: numa 12.64 ops/sec
stress-ng: malloc 2390881.30 ops/sec
stress-ng: bigheap 8587.66 ops/sec
stress-ng: numa 11.66 ops/sec
stress-ng: malloc 2209118.09 ops/sec
stress-ng: bigheap 9939.01 ops/sec
stress-ng: numa 15.38 ops/sec
stress-ng: malloc 2385835.82 ops/sec
stress-ng: bigheap 9875.28 ops/sec
stress-ng: numa 15.29 ops/sec
stress-ng: malloc 2396635.40 ops/sec
stress-ng: bigheap 9830.77 ops/sec
stress-ng: numa 14.72 ops/sec
stress-ng: malloc 2409422.04 ops/sec
stress-ng: bigheap 8881.07 ops/sec
stress-ng: numa 13.00 ops/sec
stress-ng: malloc 2440340.00 ops/sec
stress-ng: bigheap 9830.51 ops/sec
stress-ng: numa 15.17 ops/sec
stress-ng: malloc 2422172.50 ops/sec
stress-ng: bigheap 9879.53 ops/sec
stress-ng: numa 15.18 ops/sec
stress-ng: malloc 2408656.68 ops/sec
stress-ng: bigheap 9899.86 ops/sec
stress-ng: numa 15.17 ops/sec
stress-ng: malloc 2304170.39 ops/sec
stress-ng: bigheap 9895.38 ops/sec
stress-ng: numa 15.21 ops/sec
stress-ng: malloc 2339274.84 ops/sec
stress-ng: bigheap 9827.77 ops/sec
stress-ng: numa 15.88 ops/sec
stress-ng: malloc 2305752.18 ops/sec
stress-ng: bigheap 9849.53 ops/sec
stress-ng: numa 14.86 ops/sec

scx_rusty:

$ for i in {1..15}; do stress-ng  -M --mbind 0 --malloc 5 -t 10 --bigheap 10 --numa 5 | grep 'malloc\|bigheap\|numa' | grep -v '(' | grep -v 'numa:\|hogs' | tr -s ' ' ' ' | cut -d ' ' -f 1,4,10 | sed 's/$/ ops\/sec/'; done
stress-ng: malloc 2484963.65 ops/sec
stress-ng: bigheap 5204.87 ops/sec
stress-ng: numa 8.65 ops/sec
stress-ng: malloc 2525933.34 ops/sec
stress-ng: bigheap 5183.96 ops/sec
stress-ng: numa 8.78 ops/sec
stress-ng: malloc 2404081.76 ops/sec
stress-ng: bigheap 7157.05 ops/sec
stress-ng: numa 12.74 ops/sec
stress-ng: malloc 2554882.83 ops/sec
stress-ng: bigheap 5200.13 ops/sec
stress-ng: numa 8.71 ops/sec
stress-ng: malloc 2482935.71 ops/sec
stress-ng: bigheap 5268.02 ops/sec
stress-ng: numa 8.47 ops/sec
stress-ng: malloc 2464358.89 ops/sec
stress-ng: bigheap 5207.38 ops/sec
stress-ng: numa 8.51 ops/sec
stress-ng: malloc 2076552.91 ops/sec
stress-ng: bigheap 5232.29 ops/sec
stress-ng: numa 8.62 ops/sec
stress-ng: malloc 2469416.03 ops/sec
stress-ng: bigheap 5253.22 ops/sec
stress-ng: numa 8.47 ops/sec
stress-ng: malloc 2615101.87 ops/sec
stress-ng: bigheap 5243.97 ops/sec
stress-ng: numa 9.02 ops/sec
stress-ng: malloc 2571799.75 ops/sec
stress-ng: bigheap 5211.78 ops/sec
stress-ng: numa 8.62 ops/sec
stress-ng: malloc 2499824.30 ops/sec
stress-ng: bigheap 5253.45 ops/sec
stress-ng: numa 8.88 ops/sec
stress-ng: malloc 2502773.43 ops/sec
stress-ng: bigheap 5269.24 ops/sec
stress-ng: numa 8.49 ops/sec
stress-ng: malloc 2464731.06 ops/sec
stress-ng: bigheap 5266.82 ops/sec
stress-ng: numa 8.50 ops/sec
stress-ng: malloc 2426839.69 ops/sec
stress-ng: bigheap 5180.43 ops/sec
stress-ng: numa 8.77 ops/sec
stress-ng: malloc 2476085.88 ops/sec
stress-ng: bigheap 5282.69 ops/sec
stress-ng: numa 8.44 ops/sec

CFS:

$ for i in {1..15}; do stress-ng  -M --mbind 0 --malloc 5 -t 10 --bigheap 10 --numa 5 | grep 'malloc\|bigheap\|numa' | grep -v '(' | grep -v 'numa:\|hogs' | tr -s ' ' ' ' | cut -d ' ' -f 1,4,10 | sed 's/$/ ops\/sec/'; done
stress-ng: malloc 2368975.17 ops/sec
stress-ng: bigheap 3493.42 ops/sec
stress-ng: numa 6.17 ops/sec
stress-ng: malloc 2398223.87 ops/sec
stress-ng: bigheap 3391.80 ops/sec
stress-ng: numa 6.38 ops/sec
stress-ng: malloc 2447627.55 ops/sec
stress-ng: bigheap 3378.21 ops/sec
stress-ng: numa 6.39 ops/sec
stress-ng: malloc 2379718.94 ops/sec
stress-ng: bigheap 3370.09 ops/sec
stress-ng: numa 6.29 ops/sec
stress-ng: malloc 2447586.32 ops/sec
stress-ng: bigheap 3460.58 ops/sec
stress-ng: numa 6.02 ops/sec
stress-ng: malloc 2366751.28 ops/sec
stress-ng: bigheap 3374.49 ops/sec
stress-ng: numa 6.87 ops/sec
stress-ng: malloc 2394597.31 ops/sec
stress-ng: bigheap 3391.88 ops/sec
stress-ng: numa 6.20 ops/sec
stress-ng: malloc 2431500.42 ops/sec
stress-ng: bigheap 3489.44 ops/sec
stress-ng: numa 5.87 ops/sec
stress-ng: malloc 2380103.88 ops/sec
stress-ng: bigheap 3335.02 ops/sec
stress-ng: numa 6.64 ops/sec
stress-ng: malloc 2428433.77 ops/sec
stress-ng: bigheap 3438.69 ops/sec
stress-ng: numa 6.09 ops/sec
stress-ng: malloc 2401221.88 ops/sec
stress-ng: bigheap 3399.20 ops/sec
stress-ng: numa 6.25 ops/sec
stress-ng: malloc 2432262.07 ops/sec
stress-ng: bigheap 3413.91 ops/sec
stress-ng: numa 6.23 ops/sec
stress-ng: malloc 2236934.75 ops/sec
stress-ng: bigheap 3388.53 ops/sec
stress-ng: numa 6.21 ops/sec
stress-ng: malloc 2319418.76 ops/sec
stress-ng: bigheap 3433.82 ops/sec
stress-ng: numa 6.31 ops/sec
stress-ng: malloc 2344342.66 ops/sec
stress-ng: bigheap 3459.78 ops/sec
stress-ng: numa 6.21 ops/sec

Tested on a multi-NUMA machine with:

$ lscpu 
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          46 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   80
  On-line CPU(s) list:    0-79
Vendor ID:                GenuineIntel
  Model name:             Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz
    CPU family:           6
    Model:                85
    Thread(s) per core:   2
    Core(s) per socket:   20
    Socket(s):            2
    Stepping:             4
    Frequency boost:      enabled
    CPU(s) scaling MHz:   100%
    CPU max MHz:          2001.0000
    CPU min MHz:          1000.0000
    BogoMIPS:             4000.00
    Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni
                           pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 intel_ppin ssbd mba i
                          brs ibpb stibp tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsave
                          s cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req vnmi pku ospke md_clear flush_l1d
Virtualization features:  
  Virtualization:         VT-x
Caches (sum of all):      
  L1d:                    1.3 MiB (40 instances)
  L1i:                    1.3 MiB (40 instances)
  L2:                     40 MiB (40 instances)
  L3:                     55 MiB (2 instances)
NUMA:                     
  NUMA node(s):           2
  NUMA node0 CPU(s):      0-19,40-59
  NUMA node1 CPU(s):      20-39,60-79
Vulnerabilities:          
  Gather data sampling:   Vulnerable: No microcode
  Itlb multihit:          KVM: Mitigation: VMX disabled
  L1tf:                   Mitigation; PTE Inversion; VMX vulnerable
  Mds:                    Vulnerable; SMT vulnerable
  Meltdown:               Vulnerable
  Mmio stale data:        Vulnerable
  Reg file data sampling: Not affected
  Retbleed:               Vulnerable
  Spec rstack overflow:   Not affected
  Spec store bypass:      Vulnerable
  Spectre v1:             Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
  Spectre v2:             Vulnerable; IBPB: disabled; STIBP: disabled; PBRSB-eIBRS: Not affected; BHI: Not affected
  Srbds:                  Not affected
  Tsx async abort:        Vulnerable

Sorry for the formatting issues!

hodgesds requested a review from Byte-Lab on July 12, 2024 20:48
@@ -753,8 +755,18 @@ impl<'a, 'b> LoadBalancer<'a, 'b> {
         skip_kworkers: bool,
     ) -> Option<&'d TaskInfo>
     where
-        I: IntoIterator<Item = &'d TaskInfo>,
+        I: IntoIterator<Item = &'d TaskInfo> + Clone,
hodgesds (Contributor, Author):

I think this is ok because it's only cloning the iterator, but I may be wrong.

Byte-Lab (Contributor):

Yeah, honestly not sure.

hodgesds force-pushed the rusty-mempolicy branch 2 times, most recently from 9d8484d to 94dee33 on July 12, 2024 22:23
hodgesds (Contributor, Author):

Pushed up a fix: previously only the first preferred node map was being used. Similar results afterwards:

$ for i in {1..15}; do stress-ng  -M --mbind 1 --malloc 5 -t 10 --bigheap 10 --numa 5 | grep 'malloc\|bigheap\|numa' | grep -v '(' | grep -v 'numa:\|hogs' | tr -s ' ' ' ' | cut -d ' ' -f 1,4,10 | sed 's/$/ ops\/sec/'; done
stress-ng: malloc 2477314.61 ops/sec
stress-ng: bigheap 9856.44 ops/sec
stress-ng: numa 14.02 ops/sec
stress-ng: malloc 2471408.91 ops/sec
stress-ng: bigheap 9118.78 ops/sec
stress-ng: numa 12.71 ops/sec
stress-ng: malloc 2442749.43 ops/sec
stress-ng: bigheap 9894.94 ops/sec
stress-ng: numa 14.44 ops/sec
stress-ng: malloc 2301150.62 ops/sec
stress-ng: bigheap 9794.42 ops/sec
stress-ng: numa 14.69 ops/sec
stress-ng: malloc 2327373.87 ops/sec
stress-ng: bigheap 10012.91 ops/sec
stress-ng: numa 14.83 ops/sec
stress-ng: malloc 2421000.60 ops/sec
stress-ng: bigheap 9877.79 ops/sec
stress-ng: numa 14.17 ops/sec
stress-ng: malloc 2283503.85 ops/sec
stress-ng: bigheap 9930.82 ops/sec
stress-ng: numa 14.38 ops/sec
stress-ng: malloc 2342542.07 ops/sec
stress-ng: bigheap 10027.94 ops/sec
stress-ng: numa 14.53 ops/sec
stress-ng: malloc 2421437.89 ops/sec
stress-ng: bigheap 9931.36 ops/sec
stress-ng: numa 14.74 ops/sec
stress-ng: malloc 2410791.19 ops/sec
stress-ng: bigheap 10018.79 ops/sec
stress-ng: numa 13.97 ops/sec
stress-ng: malloc 2186324.62 ops/sec
stress-ng: bigheap 9919.59 ops/sec
stress-ng: numa 14.70 ops/sec
stress-ng: malloc 2405431.37 ops/sec
stress-ng: bigheap 9925.42 ops/sec
stress-ng: numa 14.12 ops/sec
stress-ng: malloc 2437171.07 ops/sec
stress-ng: bigheap 9658.56 ops/sec
stress-ng: numa 13.78 ops/sec
stress-ng: malloc 2364183.72 ops/sec
stress-ng: bigheap 9934.91 ops/sec
stress-ng: numa 14.84 ops/sec
stress-ng: malloc 2442849.59 ops/sec
stress-ng: bigheap 9934.88 ops/sec
stress-ng: numa 13.60 ops/sec

	u32 dom_id = 0;

	bpf_for(dom_id, 0, nr_doms) {
		if (!(dom_node_id(dom_id) == node_id))
Byte-Lab (Contributor):

nit: can we do if (dom_node_id(dom_id) != node_id)

 * Sets the preferred domain mask according to the mempolicy. See man(2)
 * set_mempolicy for more details on mempolicy.
 */
static int task_set_preferred_mempolicy_dom_mask(struct task_struct *p,
Byte-Lab (Contributor):

can we change the return value to bool? or maybe even simpler would be to make it void, and just check if preferred_dom_mask is nonzero in the caller?

	if (taskc->preferred_dom_mask <= 0)
		continue;
Byte-Lab (Contributor):

Hmm, let's maybe just do == 0? We really just care about whether there are any bits set, right?

hodgesds (Contributor, Author):

Yeah, I was just being lazy with cleaning up those function signatures. Good call!

	if (taskc->preferred_dom_mask <= 0)
		continue;

	if ((1LLU << dom) & taskc->preferred_dom_mask != 0)
Byte-Lab (Contributor):

can you add parens around the bitwise comparison? != binds tighter than &, so as written this ANDs the shifted bit with the result of taskc->preferred_dom_mask != 0.


	if (cpu < 0 || cpu >= MAX_CPUS)
		return NO_DOM_FOUND;

	taskc->dom_mask = 0;
	taskc->preferred_dom_mask = 0;
Byte-Lab (Contributor):

can we bring this into task_set_preferred_mempolicy_dom_mask?

	if (has_preferred_dom < 0)
		continue;

	if (((1LLU << dom) & taskc->preferred_dom_mask))
Byte-Lab (Contributor) commented on Jul 15, 2024:

Should we also check && preferred_dom == NO_DOM_FOUND?

Edit: Changed from != NO_DOM_FOUND to == NO_DOM_FOUND

Byte-Lab (Contributor):

Hmm @hodgesds, looks like we still need this check?

hodgesds (Contributor, Author):

I double-checked this; since preferred_dom is initialized to NO_DOM_FOUND, it would always fail that check, I think.

hodgesds (Contributor, Author):

reading comprehension > me.... yeah that's good!

    {
        // First try to find a task in the preferred domain mask.
        if let Some(task) = tasks_by_load.clone().into_iter().take_while(
Byte-Lab (Contributor):

This is only useful if the mempolicy stuff has been used in the BPF prog, right? Should we maybe first check that preferred_dom_mask is nonzero so we can avoid the clone?

The other thing is that we may want to do this in two passes. If we do it this way, we may end up:

  1. Successfully finding a task that matches the preferred_dom_mask, but;
  2. Does not improve load imbalance to the point where it's chosen to migrate, and therefore;
  3. Failing to find a task that does sufficiently address load imbalance, but didn't match on preferred_dom_mask.

So in other words, rather than first trying to find a task in the pull_dom that matches the preferred mask, and then skipping looking at the rest of the tasks if we do find one, we should do the whole load imbalance calculation logic again in try_to_find_task() after (2) above, but ignoring the preferred_dom_mask. Does that make sense?

hodgesds (Contributor, Author):

I think that makes sense. How does this sound:

If a filter lambda is passed to try_find_move_task, it becomes easy to check the return value in the loop where tasks are transferred, and if that is empty, remove the preferred_dom_mask filter and retry the move as you mentioned.

    fn try_find_move_task(
        &mut self,
        (push_dom, to_push): (&mut Domain, f64),
        (pull_dom, to_pull): (&mut Domain, f64),
+       task_filter: impl Fn(&TaskInfo) -> bool,
        to_xfer: f64,
    ) -> Result<Option<f64>> {

I guess the current approach kind of cheats by breaking the load balancing.

This change makes scx_rusty mempolicy aware. When a process uses
set_mempolicy it can change NUMA memory preferences and cause
performance issues when tasks are scheduled on remote NUMA nodes. This
change modifies task_pick_domain to use the new helper method that
returns the preferred node id.

Signed-off-by: Daniel Hodges <[email protected]>
Comment on lines 1531 to 1532
	if (taskc->preferred_dom_mask == 0
	    && preferred_dom != NO_DOM_FOUND)
		continue;
Byte-Lab (Contributor):

Should this be ||?

This change refactors some of the helper methods for getting the
preferred node for tasks using mempolicy. The load balancing logic in
try_find_move_task is updated to allow for a filter, which is used to
filter for tasks with a preferred mempolicy.

Signed-off-by: Daniel Hodges <[email protected]>
@@ -879,9 +888,19 @@ impl<'a, 'b> LoadBalancer<'a, 'b> {
                 pull_node.domains.insert(pull_dom);
                 break;
             }
-            let transferred = self.try_find_move_task((&mut push_dom, push_imbal),
+            let mut transferred = self.try_find_move_task((&mut push_dom, push_imbal),
Byte-Lab (Contributor):

It might be a good idea to avoid doing this work altogether if preferred_dom_mask is zero, as that will probably be the common case (right?). But this is fine for now, we can address this in a follow-on change if we choose. Thanks!

Byte-Lab merged commit e4be1ec into sched-ext:main on Jul 16, 2024.
1 check passed
hodgesds deleted the rusty-mempolicy branch on July 17, 2024 20:54.