
allow unaligned reads with _mm_loadl_epi64 #584

Merged
merged 2 commits into from
Oct 27, 2018

Conversation

brian-armstrong
Contributor

@brian-armstrong brian-armstrong commented Oct 27, 2018

As discussed in #582, it is "safe" to read unaligned memory locations with _mm_loadl_epi64. This code was suggested by scottmcm on the Rust Discord - I'm not familiar enough with this part of Rust to offer much commentary on it.

To note, I have tried this with my code that uses this intrinsic, and I am seeing the behavior I'd expect.
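As a minimal sketch of the behavior this PR enables (assuming x86_64; the buffer and names here are illustrative, not from the PR), a caller can hand _mm_loadl_epi64 a pointer with no 16-byte alignment, matching the movq instruction, which has no alignment requirement:

```rust
// Sketch (assumes x86_64): after this fix, _mm_loadl_epi64 may be given a
// pointer that is not 16-byte aligned, just like the underlying movq.
#[cfg(target_arch = "x86_64")]
fn main() {
    use std::arch::x86_64::*;

    // Force a known alignment so that offset 1 is definitely misaligned.
    #[repr(align(16))]
    struct Aligned([u8; 24]);
    let buf = Aligned([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
                       12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]);

    // `buf.0[1]` is 1 byte past a 16-byte boundary: misaligned for __m128i.
    let p = buf.0[1..].as_ptr() as *const __m128i;
    let lo = unsafe { _mm_cvtsi128_si64(_mm_loadl_epi64(p)) };

    // Little-endian: bytes 1..=8 assemble into 0x0807060504030201.
    assert_eq!(lo as u64, 0x0807060504030201);
    println!("low 64 bits: {:#018x}", lo);
}
#[cfg(not(target_arch = "x86_64"))]
fn main() {}
```

Before this patch, the `(*mem_addr)` dereference in the intrinsic's body required 16-byte alignment, so the call above was undefined behavior and could segfault.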

```diff
@@ -1145,7 +1145,7 @@ pub unsafe fn _mm_setzero_si128() -> __m128i {
 )]
 #[stable(feature = "simd_x86", since = "1.27.0")]
 pub unsafe fn _mm_loadl_epi64(mem_addr: *const __m128i) -> __m128i {
-    _mm_set_epi64x(0, simd_extract((*mem_addr).as_i64x2(), 0))
+    _mm_set_epi64x(0, simd_extract(ptr::read_unaligned::<__m128i>(mem_addr).as_i64x2(), 0))
```
Contributor

The type annotation here (::<__m128i>) shouldn't be necessary.

Maybe one can just use _mm_loadu_si128 here?

Contributor Author

Depends - in my case, I need to do a shuffle afterwards anyway, so it comes out to be about the same. If you just wanted to load the lower 64 bits though, then adding in an unnecessary shuffle afterwards would likely cost some performance. At any rate, the fact that the instruction, when seemingly used correctly, yields a segfault is fairly surprising.

Contributor

@gnzlbg gnzlbg Oct 27, 2018

The ptr::read_unaligned reads 128 bits, and so does _mm_loadu_si128 - or what am I missing? With a shuffle for the first 64 bits afterwards, LLVM should optimize this to a 64-bit load in both cases.
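To make that equivalence concrete (a sketch, assuming x86_64; the function is illustrative, not from the PR): both forms perform a full 16-byte unaligned read, so they necessarily agree on the low 64 bits:

```rust
// Sketch (assumes x86_64): ptr::read_unaligned::<__m128i> and
// _mm_loadu_si128 both read all 16 bytes from a possibly-unaligned address.
#[cfg(target_arch = "x86_64")]
fn main() {
    use std::arch::x86_64::*;
    use std::ptr;

    let mut bytes = [0u8; 17];
    for (i, b) in bytes.iter_mut().enumerate() {
        *b = i as u8;
    }
    // Offset by 1 so the pointer is not guaranteed 16-byte aligned.
    let p = bytes[1..].as_ptr() as *const __m128i;

    unsafe {
        let a: __m128i = ptr::read_unaligned(p); // 16-byte unaligned read
        let b = _mm_loadu_si128(p);              // also a 16-byte unaligned read
        // The low 64 bits (what _mm_loadl_epi64 keeps) are identical.
        assert_eq!(_mm_cvtsi128_si64(a), _mm_cvtsi128_si64(b));
    }
    println!("low halves match");
}
#[cfg(not(target_arch = "x86_64"))]
fn main() {}
```

Note that both forms read 16 bytes even though the intrinsic only promises to use 8, which is the crux of the later discussion in this thread.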

Contributor Author

Oh, yeah, you're right. I actually don't think I know how to write this PR, now that I think about it.

Contributor

You are doing great! I think you can replace the ptr::read_unaligned(mem_addr) with just _mm_loadu_si128(mem_addr) and that should just work.

Contributor

@gnzlbg gnzlbg left a comment

Thanks for the PR. There is only one nitpick, otherwise this LGTM.

@brian-armstrong
Contributor Author

So I cleaned up my example and got the right flags so that godbolt shows what I think is the right instructions. I'm not sure how to translate this into Rust though.

https://gcc.godbolt.org/z/EZv0qv

Notably,

_mm_loadl_epi64(...);
_mm_storel_epi64(...);

becomes

        mov     rax, qword ptr [rdi + 24]
        mov     qword ptr [rdi + 40], rax

It's weird that it takes a qword ptr, because I think it is really only loading a dword here, though I guess I have nothing to base that on.
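A hedged Rust translation of that load/store pair (a sketch, assuming x86_64; the buffer contents are illustrative, not from the godbolt link): the _mm_loadl_epi64 / _mm_storel_epi64 pair copies exactly 8 bytes between possibly-unaligned locations, which LLVM can lower to a single 64-bit mov as in the assembly above:

```rust
// Sketch (assumes x86_64): translating the C snippet's two intrinsics,
// which together copy the low 8 bytes from one location to another.
#[cfg(target_arch = "x86_64")]
fn main() {
    use std::arch::x86_64::*;

    let src: [u8; 16] = *b"abcdefgh________";
    let mut dst = [0u8; 16];

    unsafe {
        let v = _mm_loadl_epi64(src.as_ptr() as *const __m128i);
        _mm_storel_epi64(dst.as_mut_ptr() as *mut __m128i, v);
    }
    assert_eq!(&dst[..8], b"abcdefgh");
    assert_eq!(&dst[8..], &[0u8; 8]); // storel writes only the low 8 bytes
    println!("copied: {:?}", std::str::from_utf8(&dst[..8]).unwrap());
}
#[cfg(not(target_arch = "x86_64"))]
fn main() {}
```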

@gnzlbg
Contributor

gnzlbg commented Oct 27, 2018

It's weird that it takes a qword ptr, because I think it is really only loading a dword here, though I guess I have nothing to base that on.

So the Intrinsics Guide says these should lower to a movq, but these intrinsics are so simple that the code they lower to will depend heavily on what the code at the call site around them is doing, because LLVM can easily reason about them.

When we merge this, you can try them in your app, and if you see unexpected code generation we can revise their implementation.

@gnzlbg gnzlbg merged commit 635a995 into rust-lang:master Oct 27, 2018
@gnzlbg
Contributor

gnzlbg commented Oct 27, 2018

Thanks!

@brian-armstrong
Contributor Author

So I'm looking at the trace in my actual program, and I believe the C version which uses _mm_loadl_epi64 is emitting movq while Rust (with this patch) is emitting vmovdqu. It's unclear if this is because LLVM is unable to optimize the Rust version for some reason or if the C version has just correctly determined that only a quadword is needed, not a double quadword.

@brian-armstrong
Contributor Author

I wonder if the right code here isn't just

    _mm_set_epi64x(0, ptr::read_unaligned(mem_addr as *const i64))

@gnzlbg
Contributor

gnzlbg commented Oct 27, 2018

That could work too. Could you modify your stdsimd copy locally, use it in your project, and report whether that works better?

@brian-armstrong
Contributor Author

Yes, it does emit movq. It's hard to say whether that's generally faster, but it does seem closer to the original spirit of the intrinsic.
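For reference, the implementation shape settled on here can be sketched as a standalone function (assumptions: x86_64, and the free function `loadl_epi64_unaligned` is a hypothetical stand-in for the intrinsic's body, not the stdsimd source). Reading exactly 8 bytes with ptr::read_unaligned::<i64> matches what movq promises: no alignment requirement and no access beyond the low 64 bits:

```rust
// Sketch (assumes x86_64): a hypothetical standalone version of the body
// discussed above, reading exactly 8 bytes rather than 16.
#[cfg(target_arch = "x86_64")]
fn main() {
    use std::arch::x86_64::*;
    use std::ptr;

    unsafe fn loadl_epi64_unaligned(mem_addr: *const __m128i) -> __m128i {
        _mm_set_epi64x(0, ptr::read_unaligned(mem_addr as *const i64))
    }

    // Only 9 bytes are valid here; a 16-byte read from offset 1 would
    // run past the end of the buffer, but an 8-byte read is in bounds.
    let bytes: [u8; 9] = [1, 2, 3, 4, 5, 6, 7, 8, 9];
    let p = bytes[1..].as_ptr() as *const __m128i;
    let lo = unsafe { _mm_cvtsi128_si64(loadl_epi64_unaligned(p)) };

    // Little-endian: bytes 2..=9 assemble into 0x0908070605040302.
    assert_eq!(lo as u64, 0x0908070605040302);
    println!("{:#x}", lo);
}
#[cfg(not(target_arch = "x86_64"))]
fn main() {}
```

This is also why the 8-byte read is preferable to _mm_loadu_si128 or ptr::read_unaligned::<__m128i> here: those read 16 bytes, which the caller of _mm_loadl_epi64 never promised to make valid.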
