c++ - What Does This Loop Optimization Profile Mean?
I'm just getting started using VTune. As an educational example, I'm trying my hand at micro-optimization in debug mode. Here's a toy example from my codebase. The code appears in a C++ non-const method, and ".data_length" is an int field of the object (offset 32 bytes), typically a large number:
for (int i=0;i<data_length;++i) { /*...*/ }
VTune helpfully showed me the assembly (from MSVC 2013) for the loop. Note the performance numbers, in seconds (I removed the timings that don't register). I added the annotations:
0x140433084  mov dword ptr [rsp+0x588], 0x0      |       | ;"i=0"
0x14043308f  jmp 0x1404330a1 <block 77>          |       | ;jump to the compare, then the loop body
                                                 |       |
0x140433091  block 76:                           |       | ;"++i"
0x140433091  mov eax, dword ptr [rsp+0x588]      | 0.451 |
0x140433098  inc eax                             | 0.002 |
0x14043309a  mov dword ptr [rsp+0x588], eax      |       |
                                                 |       |
0x1404330a1  block 77:                           |       | ;if (!(i<data_length)) goto next section
0x1404330a1  mov rax, qword ptr [rsp+0x6f0]      | 0.407 |
0x1404330a9  mov eax, dword ptr [rax+0x20]       |       | ; move "data_length" into "eax".
0x1404330ac  cmp dword ptr [rsp+0x588], eax      | 1.195 | ; "i<data_length;"
0x1404330b3  jnl 0x140433106 <block 80>          |       |
0x1404330b5  block 78:                           |       |
. . .                                            |       | ;loop body. There's a jmp in here
                                                 |       | ; to block 76.
                                                 |       |
0x140433106  block 80:                           |       | ;code following the loop
What this tells me is, first, that loading i for the increment is incurring some caching failure (why isn't it in a register, geez?). Second, the test logic is pretty sluggish, especially loading ".data_length" each time.
I figured, why not load it once, and use a decrement:
for (int i=data_length-1;i>=0;--i) { /*...*/ }
The assembly and timings look like:
0x140433084  mov rax, qword ptr [rsp+0x6f0]      |       | ;same code, but happens only once!
0x14043308c  mov eax, dword ptr [rax+0x20]       |       |
0x14043308f  dec eax                             |       | ;"data_length-1"
0x140433091  mov dword ptr [rsp+0x588], eax      |       | ;"i=data_length-1;"
0x140433098  jmp 0x1404330aa <block 77>          |       | ;jump to the compare, then the loop body
                                                 |       |
0x14043309a  block 76:                           |       | ;"--i"
0x14043309a  mov eax, dword ptr [rsp+0x588]      | 0.357 |
0x1404330a1  dec eax                             | 0.002 |
0x1404330a3  mov dword ptr [rsp+0x588], eax      |       |
                                                 |       |
0x1404330aa  block 77:                           |       | ;if (i<0) goto next section
0x1404330aa  cmp dword ptr [rsp+0x588], 0x0      | 0.401 | ; "i>=0;"
0x1404330b2  jl 0x140433105 <block 80>           | 2.806 |
0x1404330b4  block 78:                           |       |
. . .                                            |       | ;loop body. Same as above, I think.
                                                 |       |
0x140433105  block 80:                           |       | ;code following the loop
Look at that jl! 3 seconds for a jump? I thought maybe the target location wasn't in the instruction cache, but you can see it's quite close (right after the loop body, as you'd expect). More importantly, the first method should have the same problem anyway, yet the first version's jnl didn't even register.
My guess is that the timing is getting eaten in the loop body, although it's weird that this happens in one case and not the other. I'll have to do more work looking into that.
After I wrote this, and looking at it again, I think this may be a boring branch prediction issue. CPUs like to take backward branches in loops, but in this case the branch to block 80 shouldn't be taken most of the time.
I'm still learning this, but assuming I annotated it correctly, I have a few questions:
- Am I right in thinking that i should be in a register, and would become one in optimized mode?
- What is happening with the jl in the second version? Is it indeed a branch prediction failure? Why doesn't it show up on the next instruction?
Edit: The CPU being tested on is an Intel 990X (Gulftown, 2011).
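For anyone who wants to reproduce the comparison, here is a minimal sketch of the loop shapes discussed above, plus a forward loop that copies the field into a local once. Only data_length comes from the question; Buffer, values, the method names and seconds_for are hypothetical names made up for illustration. Timing each loop as a whole is one way to cross-check per-instruction numbers from a profiler.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the object in the question. Only data_length
// appears in the original post; everything else here is an assumption.
struct Buffer {
    std::vector<int> values;
    int data_length;

    explicit Buffer(int n)
        : values(static_cast<std::size_t>(n), 1), data_length(n) {}

    // Original shape: the condition re-reads this->data_length every iteration.
    long long sum_forward() {
        long long total = 0;
        for (int i = 0; i < data_length; ++i)
            total += values[static_cast<std::size_t>(i)];
        return total;
    }

    // Forward loop with the field copied into a local once, before the loop.
    long long sum_hoisted() {
        const int n = data_length;
        long long total = 0;
        for (int i = 0; i < n; ++i)
            total += values[static_cast<std::size_t>(i)];
        return total;
    }

    // Reverse loop: data_length is read once and the compare is against 0.
    long long sum_reverse() {
        long long total = 0;
        for (int i = data_length - 1; i >= 0; --i)
            total += values[static_cast<std::size_t>(i)];
        return total;
    }
};

// Wall-clock seconds for one call of a Buffer member function, timed as a whole.
template <typename Method>
double seconds_for(Buffer& b, Method method, long long& result) {
    const auto start = std::chrono::steady_clock::now();
    result = (b.*method)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

A call such as seconds_for(buf, &Buffer::sum_reverse, result) returns the elapsed time and stores the sum in result. All three methods compute the same sum, so any timing difference comes from the loop shape. In an unoptimized build the local n still lives on the stack, so the hoisted variant probably only saves the pointer load and the dependent field load; that is worth confirming in the disassembly.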
i should be in a register (e.g. if the compiler optimised); but it's possible that the code in the loop body (not shown) uses all the registers, and that it's better for performance to avoid putting i in a register (especially if the loop body contains an inner loop or a function call or something).
I have no idea which CPU it is (VIA, AMD, Intel; Atom, Xeon; how old), but branch prediction in modern CPUs should handle that branch well. Maybe expect a branch mis-prediction when the loop terminates, and if the number of iterations is low (e.g. data_length-1 is small) that single mis-prediction might be significant.
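To make the last point concrete, here is a rough sketch that counts mis-predictions of the loop-exit branch (the jl / jnl in the listings) using a textbook 2-bit saturating counter. This is a toy model and an assumption on my part, not the actual predictor in Gulftown or any other real CPU; TwoBitPredictor and count_exit_mispredictions are made-up names.

```cpp
#include <cassert>

// Toy 2-bit saturating counter: states 0..3, predicts "taken" when state >= 2.
// A textbook model only, NOT the predictor of any particular CPU.
struct TwoBitPredictor {
    int state = 0;  // start at "strongly not taken" (an arbitrary choice)

    // Returns true if the prediction matched the outcome, then trains the counter.
    bool predict_and_train(bool taken) {
        const bool predicted_taken = state >= 2;
        if (taken) {
            if (state < 3) ++state;
        } else {
            if (state > 0) --state;
        }
        return predicted_taken == taken;
    }
};

// Counts mis-predictions of the loop-exit branch when a loop of `iterations`
// iterations is executed `runs` times in a row. The exit branch is not taken
// while the loop keeps going and is taken once, when the loop terminates.
int count_exit_mispredictions(int iterations, int runs) {
    TwoBitPredictor predictor;
    int mispredictions = 0;
    for (int r = 0; r < runs; ++r) {
        for (int i = 0; i < iterations; ++i) {
            if (!predictor.predict_and_train(false)) ++mispredictions;  // stay in loop
        }
        if (!predictor.predict_and_train(true)) ++mispredictions;       // leave loop
    }
    return mispredictions;
}
```

In this model, count_exit_mispredictions(1000, 1) gives 1 miss out of 1001 executions of the branch, which is negligible. count_exit_mispredictions(1, 4) gives 4 misses out of 8, which is the "low iteration count" case where the single miss per loop starts to matter.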