Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Skip match_empty after failed to find a insert slot in group #607

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

ldm0
Copy link
Contributor

@ldm0 ldm0 commented Feb 26, 2025

Running benchmark on my local macbook pro(M1 Pro):

Before:

Before:

running 41 tests
test clone_from_large               ... bench:       7,630.54 ns/iter (+/- 105.82)
test clone_from_small               ... bench:          69.27 ns/iter (+/- 4.44)
test clone_large                    ... bench:       7,650.92 ns/iter (+/- 138.79)
test clone_small                    ... bench:         110.63 ns/iter (+/- 4.04)
test grow_insert_foldhash_highbits  ... bench:      21,677.08 ns/iter (+/- 1,188.93)
test grow_insert_foldhash_random    ... bench:      24,624.66 ns/iter (+/- 673.04)
test grow_insert_foldhash_serial    ... bench:      22,303.47 ns/iter (+/- 1,107.11)
test grow_insert_std_highbits       ... bench:      37,461.35 ns/iter (+/- 1,110.84)
test grow_insert_std_random         ... bench:      37,744.47 ns/iter (+/- 1,511.93)
test grow_insert_std_serial         ... bench:      37,440.62 ns/iter (+/- 618.89)
test insert_erase_foldhash_highbits ... bench:      15,511.28 ns/iter (+/- 367.55)
test insert_erase_foldhash_random   ... bench:      17,365.08 ns/iter (+/- 1,125.38)
test insert_erase_foldhash_serial   ... bench:      15,461.06 ns/iter (+/- 738.03)
test insert_erase_std_highbits      ... bench:      31,643.75 ns/iter (+/- 1,053.08)
test insert_erase_std_random        ... bench:      33,238.85 ns/iter (+/- 1,722.66)
test insert_erase_std_serial        ... bench:      31,921.90 ns/iter (+/- 1,340.85)
test insert_foldhash_highbits       ... bench:      14,680.70 ns/iter (+/- 545.13)
test insert_foldhash_random         ... bench:      14,706.28 ns/iter (+/- 290.97)
test insert_foldhash_serial         ... bench:      14,810.04 ns/iter (+/- 362.32)
test insert_std_highbits            ... bench:      19,833.44 ns/iter (+/- 786.51)
test insert_std_random              ... bench:      19,728.32 ns/iter (+/- 277.17)
test insert_std_serial              ... bench:      19,674.22 ns/iter (+/- 636.33)
test iter_foldhash_highbits         ... bench:         902.91 ns/iter (+/- 66.87)
test iter_foldhash_random           ... bench:         784.51 ns/iter (+/- 29.84)
test iter_foldhash_serial           ... bench:         777.75 ns/iter (+/- 49.07)
test iter_std_highbits              ... bench:         762.85 ns/iter (+/- 14.56)
test iter_std_random                ... bench:         774.44 ns/iter (+/- 24.61)
test iter_std_serial                ... bench:         762.94 ns/iter (+/- 12.79)
test lookup_fail_foldhash_highbits  ... bench:       1,899.07 ns/iter (+/- 67.99)
test lookup_fail_foldhash_random    ... bench:       2,249.00 ns/iter (+/- 129.37)
test lookup_fail_foldhash_serial    ... bench:       2,602.64 ns/iter (+/- 56.42)
test lookup_fail_std_highbits       ... bench:       8,734.81 ns/iter (+/- 273.01)
test lookup_fail_std_random         ... bench:       8,594.13 ns/iter (+/- 507.16)
test lookup_fail_std_serial         ... bench:       8,625.80 ns/iter (+/- 145.82)
test lookup_foldhash_highbits       ... bench:       2,098.07 ns/iter (+/- 49.03)
test lookup_foldhash_random         ... bench:       2,455.68 ns/iter (+/- 87.79)
test lookup_foldhash_serial         ... bench:       2,089.30 ns/iter (+/- 111.22)
test lookup_std_highbits            ... bench:       8,784.86 ns/iter (+/- 293.86)
test lookup_std_random              ... bench:       8,740.02 ns/iter (+/- 119.12)
test lookup_std_serial              ... bench:       8,604.05 ns/iter (+/- 472.75)
test rehash_in_place                ... bench:     209,355.86 ns/iter (+/- 6,211.48)

test result: ok. 0 passed; 0 failed; 0 ignored; 41 measured; 0 filtered out; finished in 55.76s

     Running benches/insert_unique_unchecked.rs (target/release/deps/insert_unique_unchecked-70b56273fe907210)

running 2 tests
test insert                  ... bench:       9,978.20 ns/iter (+/- 218.31)
test insert_unique_unchecked ... bench:       3,453.39 ns/iter (+/- 53.06)

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured; 0 filtered out; finished in 2.10s

     Running benches/set_ops.rs (target/release/deps/set_ops-41e9dbeede3fedd5)

running 10 tests
test set_ops_bit_and                ... bench:      10,482.35 ns/iter (+/- 387.65)
test set_ops_bit_and_assign         ... bench:       6,106.08 ns/iter (+/- 247.88)
test set_ops_bit_or                 ... bench:      86,940.28 ns/iter (+/- 2,392.71)
test set_ops_bit_or_assign          ... bench:      53,571.67 ns/iter (+/- 1,050.82)
test set_ops_bit_xor                ... bench:      88,900.81 ns/iter (+/- 3,737.91)
test set_ops_bit_xor_assign         ... bench:      55,736.53 ns/iter (+/- 1,031.37)
test set_ops_sub_assign_large_small ... bench:      55,513.89 ns/iter (+/- 1,965.35)
test set_ops_sub_assign_small_large ... bench:       6,100.57 ns/iter (+/- 255.82)
test set_ops_sub_large_small        ... bench:      87,790.68 ns/iter (+/- 2,284.42)
test set_ops_sub_small_large        ... bench:         647.65 ns/iter (+/- 5.28)

test result: ok. 0 passed; 0 failed; 0 ignored; 10 measured; 0 filtered out; finished in 19.65s
After:
running 41 tests
test clone_from_large               ... bench:       7,708.37 ns/iter (+/- 317.98)
test clone_from_small               ... bench:          69.86 ns/iter (+/- 1.66)
test clone_large                    ... bench:       7,662.61 ns/iter (+/- 153.91)
test clone_small                    ... bench:         110.04 ns/iter (+/- 3.01)
test grow_insert_foldhash_highbits  ... bench:      20,067.66 ns/iter (+/- 813.59)
test grow_insert_foldhash_random    ... bench:      24,074.16 ns/iter (+/- 1,134.52)
test grow_insert_foldhash_serial    ... bench:      21,096.95 ns/iter (+/- 366.22)
test grow_insert_std_highbits       ... bench:      36,934.55 ns/iter (+/- 894.28)
test grow_insert_std_random         ... bench:      36,898.65 ns/iter (+/- 1,328.19)
test grow_insert_std_serial         ... bench:      36,811.75 ns/iter (+/- 1,086.38)
test insert_erase_foldhash_highbits ... bench:      15,791.87 ns/iter (+/- 473.97)
test insert_erase_foldhash_random   ... bench:      16,973.07 ns/iter (+/- 934.27)
test insert_erase_foldhash_serial   ... bench:      17,217.81 ns/iter (+/- 1,168.29)
test insert_erase_std_highbits      ... bench:      33,779.54 ns/iter (+/- 1,145.91)
test insert_erase_std_random        ... bench:      34,539.58 ns/iter (+/- 743.19)
test insert_erase_std_serial        ... bench:      33,910.30 ns/iter (+/- 5,716.08)
test insert_foldhash_highbits       ... bench:      14,719.39 ns/iter (+/- 467.13)
test insert_foldhash_random         ... bench:      15,091.88 ns/iter (+/- 12,584.39)
test insert_foldhash_serial         ... bench:      14,592.95 ns/iter (+/- 652.87)
test insert_std_highbits            ... bench:      19,681.62 ns/iter (+/- 728.00)
test insert_std_random              ... bench:      19,588.85 ns/iter (+/- 618.96)
test insert_std_serial              ... bench:      19,611.61 ns/iter (+/- 792.86)
test iter_foldhash_highbits         ... bench:         828.32 ns/iter (+/- 27.00)
test iter_foldhash_random           ... bench:         768.69 ns/iter (+/- 55.53)
test iter_foldhash_serial           ... bench:         766.07 ns/iter (+/- 9.61)
test iter_std_highbits              ... bench:         765.31 ns/iter (+/- 18.42)
test iter_std_random                ... bench:         765.99 ns/iter (+/- 45.18)
test iter_std_serial                ... bench:         763.18 ns/iter (+/- 15.77)
test lookup_fail_foldhash_highbits  ... bench:       1,936.33 ns/iter (+/- 47.30)
test lookup_fail_foldhash_random    ... bench:       2,425.87 ns/iter (+/- 99.09)
test lookup_fail_foldhash_serial    ... bench:       2,127.12 ns/iter (+/- 67.47)
test lookup_fail_std_highbits       ... bench:       8,581.58 ns/iter (+/- 279.54)
test lookup_fail_std_random         ... bench:       8,333.45 ns/iter (+/- 114.58)
test lookup_fail_std_serial         ... bench:       8,097.86 ns/iter (+/- 185.79)
test lookup_foldhash_highbits       ... bench:       2,172.82 ns/iter (+/- 43.31)
test lookup_foldhash_random         ... bench:       2,370.67 ns/iter (+/- 43.23)
test lookup_foldhash_serial         ... bench:       2,066.38 ns/iter (+/- 113.00)
test lookup_std_highbits            ... bench:       8,777.33 ns/iter (+/- 266.92)
test lookup_std_random              ... bench:       8,688.57 ns/iter (+/- 277.06)
test lookup_std_serial              ... bench:       8,676.47 ns/iter (+/- 196.41)
test rehash_in_place                ... bench:     191,902.86 ns/iter (+/- 3,842.01)

test result: ok. 0 passed; 0 failed; 0 ignored; 41 measured; 0 filtered out; finished in 50.84s

     Running benches/insert_unique_unchecked.rs (target/release/deps/insert_unique_unchecked-70b56273fe907210)

running 2 tests
test insert                  ... bench:       4,811.63 ns/iter (+/- 36.77)
test insert_unique_unchecked ... bench:       3,464.63 ns/iter (+/- 109.34)

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured; 0 filtered out; finished in 1.11s

     Running benches/set_ops.rs (target/release/deps/set_ops-41e9dbeede3fedd5)

running 10 tests
test set_ops_bit_and                ... bench:       8,704.29 ns/iter (+/- 204.07)
test set_ops_bit_and_assign         ... bench:       6,075.31 ns/iter (+/- 137.78)
test set_ops_bit_or                 ... bench:      68,336.41 ns/iter (+/- 2,494.32)
test set_ops_bit_or_assign          ... bench:      53,319.49 ns/iter (+/- 1,172.32)
test set_ops_bit_xor                ... bench:      73,222.35 ns/iter (+/- 3,848.50)
test set_ops_bit_xor_assign         ... bench:      55,755.47 ns/iter (+/- 2,032.17)
test set_ops_sub_assign_large_small ... bench:      55,242.62 ns/iter (+/- 1,864.93)
test set_ops_sub_assign_small_large ... bench:       6,130.91 ns/iter (+/- 130.80)
test set_ops_sub_large_small        ... bench:      73,938.60 ns/iter (+/- 2,211.95)
test set_ops_sub_small_large        ... bench:         626.02 ns/iter (+/- 19.25)

Highlights:

- test insert                  ... bench:       9,978.20 ns/iter (+/- 218.31)
+ test insert                  ... bench:       4,811.63 ns/iter (+/- 36.77)
- test set_ops_bit_and                ... bench:      10,482.35 ns/iter (+/- 387.65)
+ test set_ops_bit_and                ... bench:       8,704.29 ns/iter (+/- 204.07)
- test set_ops_sub_large_small        ... bench:      87,790.68 ns/iter (+/- 2,284.42)
+ test set_ops_sub_large_small        ... bench:      73,938.60 ns/iter (+/- 2,211.95)

Previously, each group would be matched twice(first by find_insert_slot_in_group, which finds empty or deleted slot in the group, and then again by match_empty, which finds empty slot in the group) until a deleted slot was found in the table. The second match is clearly redundant.

By removing the redundant check, we significantly improve the insertion speed and query speed in a newly created HashMap/HashSet.

@ldm0
Copy link
Contributor Author

ldm0 commented Feb 27, 2025

Hmm, I tested it on my x64 machine, improvements is not that significant though. There might be some subtle difference between sse2 and neon.

@ldm0
Copy link
Contributor Author

ldm0 commented Mar 7, 2025

ping @Amanieu

@Amanieu
Copy link
Member

Amanieu commented Mar 7, 2025

I'm getting varying results on different systems:

Cortex-A72:

old: test insert ... bench:      83,321.21 ns/iter (+/- 579.96)
new: test insert ... bench:      84,078.87 ns/iter (+/- 430.20)

Cortex-X4:

old: test insert ... bench:      21,691.12 ns/iter (+/- 1,072.98)
new: test insert ... bench:      16,136.07 ns/iter (+/- 261.89)

Ryzen 3950X:

old: test insert ... bench:       8,964.59 ns/iter (+/- 176.85)
new: test insert ... bench:       8,726.09 ns/iter (+/- 390.04)

It's not clear that this is 100% a win, though it seems to help on newer ARM CPUs. What systems did you try this on?

@ldm0
Copy link
Contributor Author

ldm0 commented Mar 8, 2025

I'm getting varying results on different systems:

Cortex-A72:

old: test insert ... bench:      83,321.21 ns/iter (+/- 579.96)
new: test insert ... bench:      84,078.87 ns/iter (+/- 430.20)

Cortex-X4:

old: test insert ... bench:      21,691.12 ns/iter (+/- 1,072.98)
new: test insert ... bench:      16,136.07 ns/iter (+/- 261.89)

Ryzen 3950X:

old: test insert ... bench:       8,964.59 ns/iter (+/- 176.85)
new: test insert ... bench:       8,726.09 ns/iter (+/- 390.04)

It's not clear that this is 100% a win, though it seems to help on newer ARM CPUs. What systems did you try this on?

My benchmarks:

M1&macOS:

- test insert                  ... bench:       9,957.01 ns/iter (+/- 449.58)
+ test insert                  ... bench:       4,783.33 ns/iter (+/- 109.99)

M1Pro&macOS:

- test insert                  ... bench:       9,978.20 ns/iter (+/- 218.31)
+ test insert                  ... bench:       4,811.63 ns/iter (+/- 36.77)

M4&macOS:

- test insert                  ... bench:       3,774.82 ns/iter (+/- 153.22)
+ test insert                  ... bench:       3,208.76 ns/iter (+/- 147.87)

i9-13900HX&Windows

- test insert                  ... bench:       4,470.61 ns/iter (+/- 570.73)
+ test insert                  ... bench:       4,458.62 ns/iter (+/- 633.96)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants