Test the performance of permutation instances with different parameters #296

Open
wants to merge 3 commits into
base: develop

CongMa13

permutation_tuning_3 tests all rank-3 permutation instances. It takes command-line parameters: the lengths of the rank-3 tensor and the output dimension order.

For example,

root@x1000c7s5b1n0:/home/congma13/work/forks/build/hiptensor_da972cadcf9b_ck_223a2abe62f# ./bin/permutation_tuning_3 F32 124 1580 64  2 0 1
There are 70 instances.
DeviceElementwiseImpl<3, 256, 64, 64, 4, 4, 0, 1, 1, 1>:        Perf: 0.155879 ms, 0.16088 TFlops, 643.519 GB/s
DeviceElementwiseImpl<3, 256, 64, 64, 4, 4, 0, 1, 2, 2>:        Perf: 0.122778 ms, 0.204253 TFlops, 817.012 GB/s
DeviceElementwiseImpl<3, 256, 64, 64, 4, 4, 0, 1, 4, 4>:        Perf: 0.103642 ms, 0.241966 TFlops, 967.862 GB/s
DeviceElementwiseImpl<3, 256, 64, 64, 4, 4, 1, 0, 1, 1>:        Perf: 0.168602 ms, 0.148739 TFlops, 594.957 GB/s
DeviceElementwiseImpl<3, 256, 64, 64, 4, 4, 1, 0, 2, 2>:        Perf: 0.119111 ms, 0.210542 TFlops, 842.166 GB/s
DeviceElementwiseImpl<3, 256, 64, 64, 4, 4, 1, 0, 4, 4>:        Perf: 0.107594 ms, 0.233078 TFlops, 932.312 GB/s
DeviceElementwiseImpl<3, 256, 64, 64, 16, 16, 0, 1, 1, 1>:      Perf: 0.555214 ms, 0.0451677 TFlops, 180.671 GB/s
DeviceElementwiseImpl<3, 256, 64, 64, 16, 16, 0, 1, 2, 2>:      Perf: 0.282154 ms, 0.0888796 TFlops, 355.518 GB/s
DeviceElementwiseImpl<3, 256, 64, 64, 16, 16, 0, 1, 4, 4>:      Perf: 0.259383 ms, 0.0966823 TFlops, 386.729 GB/s
DeviceElementwiseImpl<3, 256, 64, 64, 16, 16, 0, 1, 8, 8> does not support this input tensor:
DeviceElementwiseImpl<3, 256, 64, 64, 16, 16, 0, 1, 16, 16> does not support this input tensor:
DeviceElementwiseImpl<3, 256, 64, 64, 16, 16, 1, 0, 1, 1>:      Perf: 0.54868 ms, 0.0457056 TFlops, 182.823 GB/s
DeviceElementwiseImpl<3, 256, 64, 64, 16, 16, 1, 0, 2, 2>:      Perf: 0.286791 ms, 0.0874426 TFlops, 349.771 GB/s
DeviceElementwiseImpl<3, 256, 64, 64, 16, 16, 1, 0, 4, 4>:      Perf: 0.258455 ms, 0.0970295 TFlops, 388.118 GB/s
DeviceElementwiseImpl<3, 256, 64, 64, 16, 16, 1, 0, 8, 8> does not support this input tensor:
DeviceElementwiseImpl<3, 256, 64, 64, 16, 16, 1, 0, 16, 16> does not support this input tensor:
DeviceElementwiseImpl<3, 256, 128, 128, 8, 8, 0, 1, 1, 1>:      Perf: 0.223434 ms, 0.112238 TFlops, 448.951 GB/s
DeviceElementwiseImpl<3, 256, 128, 128, 8, 8, 0, 1, 2, 2>:      Perf: 0.163738 ms, 0.153158 TFlops, 612.631 GB/s
DeviceElementwiseImpl<3, 256, 128, 128, 8, 8, 0, 1, 4, 4>:      Perf: 0.140621 ms, 0.178336 TFlops, 713.343 GB/s
DeviceElementwiseImpl<3, 256, 128, 128, 8, 8, 0, 1, 8, 8> does not support this input tensor:
DeviceElementwiseImpl<3, 256, 128, 128, 8, 8, 1, 0, 1, 1>:      Perf: 0.224637 ms, 0.111637 TFlops, 446.547 GB/s
DeviceElementwiseImpl<3, 256, 128, 128, 8, 8, 1, 0, 2, 2>:      Perf: 0.174285 ms, 0.143889 TFlops, 575.557 GB/s
DeviceElementwiseImpl<3, 256, 128, 128, 8, 8, 1, 0, 4, 4>:      Perf: 0.153869 ms, 0.162981 TFlops, 651.924 GB/s
DeviceElementwiseImpl<3, 256, 128, 128, 8, 8, 1, 0, 8, 8> does not support this input tensor:

@@ -2,7 +2,7 @@
*
* MIT License
*
- * Copyright (C) 2023-2024 Advanced Micro Devices, Inc. All rights reserved.
+ * Copyright (C) 2023-2025 Advanced Micro Devices, Inc. All rights reserved.

Do we need to change the sample?

}
}

int main(int argc, char* argv[])

Would it make sense here to have rank determined from argc?
RANK = (argc - 2) >> 1; // as long as (argc - 2) % 2 == 0

Then we can have 1 executable and combine all the cases?
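
One way the suggestion above could look, as a rough sketch rather than the PR's actual code. It assumes the argument layout from the example in the description (`<exe> <dtype> <lengths...> <output dims...>`), so after dropping the executable name and the data type the remaining count must be even and half of it is the rank. `rankFromArgc`, `runTuning`, and `dispatchByRank` are hypothetical names, `runTuning` is only a stand-in for the real per-rank tuning code, and the set of ranks in the `switch` is illustrative. Because the rank is a template parameter in the instances, a runtime rank still needs an explicit dispatch:

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical helper: derive the rank from argc, assuming the layout
//   <exe> <dtype> <len_0 .. len_{R-1}> <dim_0 .. dim_{R-1}>
// Returns -1 when the argument count does not fit that layout.
int rankFromArgc(int argc)
{
    int const extra = argc - 2; // drop the executable name and the data type
    if(extra <= 0 || extra % 2 != 0)
    {
        return -1;
    }
    return extra >> 1;
}

// Hypothetical stand-in for the per-rank tuning entry point.
// Here it only checks the argument sizes and echoes the rank.
template <int Rank>
int runTuning(std::vector<int> const& lengths, std::vector<int> const& dims)
{
    bool const ok = lengths.size() == static_cast<std::size_t>(Rank)
                    && dims.size() == static_cast<std::size_t>(Rank);
    return ok ? Rank : -1;
}

// Map the runtime rank onto a compile-time template argument.
int dispatchByRank(int argc, char* argv[])
{
    int const rank = rankFromArgc(argc);
    if(rank < 0)
    {
        return -1;
    }

    std::vector<int> lengths;
    std::vector<int> dims;
    for(int i = 0; i < rank; ++i)
    {
        lengths.push_back(std::atoi(argv[2 + i]));
        dims.push_back(std::atoi(argv[2 + rank + i]));
    }

    switch(rank)
    {
    case 2: return runTuning<2>(lengths, dims);
    case 3: return runTuning<3>(lengths, dims);
    case 4: return runTuning<4>(lengths, dims);
    default: return -1; // rank not covered by this sketch
    }
}
```

With the command line from the description (`F32 124 1580 64 2 0 1`), `argc` is 8, so `rankFromArgc` returns 3 and the call is routed to `runTuning<3>`. The trade-off is that a single executable has to instantiate every supported rank, which increases build time and binary size compared with one executable per rank.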
