C++ - Multithreaded inline assembly


I'm trying to create a large number of SHA-256 hashes on a T4 machine. The T4 has a 'sha256' instruction that allows me to calculate a hash in one opcode. I created an inline assembly template to call the sha256 opcode:

In my C++ code:

    extern "C" {
        void processchunk(const char* buf, uint32_t* state);
    }

pchunk.il:

    .inline processchunk,8
    .volatile
      /* copy state */
      ldd [%o1],%f0 /* load 8 bytes */
      ldd [%o1 + 8],%f2 /* load 8 bytes */
      ldd [%o1 +16],%f4 /* load 8 bytes */
      ldd [%o1 +24],%f6 /* load 8 bytes */

      /* copy data */
      ldd [%o0],%f8 /* load 8 bytes */
      ldd [%o0+8],%f10 /* load 8 bytes */
      ldd [%o0+16],%f12 /* load 8 bytes */
      ldd [%o0+24],%f14 /* load 8 bytes */
      ldd [%o0+32],%f16 /* load 8 bytes */
      ldd [%o0+40],%f18 /* load 8 bytes */
      ldd [%o0+48],%f20 /* load 8 bytes */
      ldd [%o0+56],%f22 /* load 8 bytes */

      sha256
      nop

      std %f0, [%o1]
      std %f2, [%o1+8]
      std %f4, [%o1+16]
      std %f6, [%o1+24]
    .end
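For context, the two arguments could be prepared roughly as follows. This is only a minimal sketch, assuming processchunk follows the usual SHA-256 compression contract (one 64-byte block in buf, eight 32-bit words in state). The helper names initial_state and pad_single_block are hypothetical and do not appear in the original code.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Standard SHA-256 initial hash values (FIPS 180-4).
std::array<std::uint32_t, 8> initial_state() {
    return {0x6a09e667u, 0xbb67ae85u, 0x3c6ef372u, 0xa54ff53au,
            0x510e527fu, 0x9b05688cu, 0x1f83d9abu, 0x5be0cd19u};
}

// Hypothetical helper: pads a short message (at most 55 bytes) into a single
// 64-byte block: message bytes, one 0x80 byte, zero fill, then the message
// length in bits as a 64-bit big-endian integer.
std::array<unsigned char, 64> pad_single_block(const std::string& msg) {
    assert(msg.size() <= 55 && "message must fit in one block");
    std::array<unsigned char, 64> block{};  // zero-filled
    std::memcpy(block.data(), msg.data(), msg.size());
    block[msg.size()] = 0x80;
    const std::uint64_t bits = static_cast<std::uint64_t>(msg.size()) * 8;
    for (int i = 0; i < 8; ++i) {
        block[63 - i] = static_cast<unsigned char>(bits >> (8 * i));
    }
    return block;
}
```

A caller would then pass the block's data pointer as buf and the state array's data pointer as state, keeping both local to the calling thread.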

Things work great in a single-threaded environment, but it's not fast enough. I used OpenMP to parallelize the application so that I can call processchunk simultaneously. The multithreaded version of the application works OK with a few threads, but when I increase the number of threads (16, for example) I begin to get bogus results. The inputs to the processchunk function are both stack variables local to each thread. I've confirmed that the inputs are generated correctly no matter the number of threads. If I put processchunk into a critical section, I get correct results but the performance degrades (a single thread performs better). I'm stumped on what the problem might be. Is it possible for Solaris threads to step on the floating-point registers of another thread?

Any ideas on how I can debug this?

Regards

Update:

I changed the code to use quad-sized (16-byte) loads and stores:

    .inline processchunk,8
    .volatile
      /* copy state */
      ldq [%o1],    %f0
      ldq [%o1 +16],%f4

      /* copy data */
      ldq [%o0],   %f8
      ldq [%o0+16],%f12
      ldq [%o0+32],%f16
      ldq [%o0+48],%f20

      lzd %o0,%o0
      nop

      stq %f0, [%o1]
      stq %f4, [%o1+16]
    .end

At first glance the issue seems to have gone away. Performance degrades after 32 threads, so that's the number I'm sticking with (for the moment at least), and with the current code I seem to be getting correct results. I may have only masked the issue, so I'm going to run further tests.

Update 2:

I found some time to go back to this and was able to get decent results from the T4 (10s of millions of hashes in a minute).

The changes I made were:

  1. Used assembly instead of inline assembly
  2. As the functions were leaf functions, I didn't touch the register window

I packed everything in a library and made the code available here

I'm not a SPARC architecture expert (I might be wrong), but here's my guess:

Your inline assembly code loads the stack variables into a set of specific floating-point registers so that it can call the SHA assembly operation.

How does that work with two threads? Both calls to processchunk will try to copy different input values into the same CPU registers.

The way I see it, CPU registers in asm code are like "global" variables in a high-level programming language.

How many cores does your system have? Maybe things are fine as long as you have one thread per core/set of hardware registers. That would imply that the behavior of the code depends on the way the threads are scheduled on the different cores of your system.

Do you know how the system behaves when it schedules threads from the same process on one CPU core? What I mean is: does the system store the registers of the unscheduled thread, as it would in a context switch?

A test I would run is to spawn a number of threads equal to the number N of CPU cores and then run the same test with N+1 threads (my assumption here is that there is one floating-point register set per CPU core).
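The N versus N+1 experiment could be sketched as a small harness that compares every multithreaded result against a single-threaded reference. This is only a sketch under stated assumptions: stub_processchunk is a hypothetical stand-in (not SHA-256) so that the harness runs on any machine, and run_jobs, first_mismatch and logical_cpu_count are names invented here. On the T4, the real processchunk would be passed in place of the stub. Note that std::thread::hardware_concurrency reports logical processors, which may differ from the physical core count, and that the pragma only takes effect when the compiler's OpenMP option is enabled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

using ChunkFn = void (*)(const char* buf, std::uint32_t* state);

// Hypothetical stand-in for processchunk: deterministic mixing, NOT SHA-256.
void stub_processchunk(const char* buf, std::uint32_t* state) {
    for (int i = 0; i < 64; ++i) {
        state[i % 8] = state[i % 8] * 31u + static_cast<unsigned char>(buf[i]);
    }
}

// Logical processor count, falling back to 1 if the runtime cannot tell.
int logical_cpu_count() {
    const unsigned n = std::thread::hardware_concurrency();
    return n == 0 ? 1 : static_cast<int>(n);
}

// Runs `jobs` independent hashes with thread-local inputs and returns
// eight state words per job.
std::vector<std::uint32_t> run_jobs(ChunkFn fn, int jobs, int threads) {
    assert(threads >= 1);
    std::vector<std::uint32_t> out(static_cast<std::size_t>(jobs) * 8);
    #pragma omp parallel for num_threads(threads)
    for (int j = 0; j < jobs; ++j) {
        // Conservative alignment for 8- and 16-byte loads (an assumption).
        alignas(16) char buf[64];
        alignas(16) std::uint32_t state[8];
        for (int i = 0; i < 64; ++i) buf[i] = static_cast<char>(j + i);
        for (int i = 0; i < 8; ++i) state[i] = static_cast<std::uint32_t>(i);
        fn(buf, state);
        for (int i = 0; i < 8; ++i) {
            out[static_cast<std::size_t>(j) * 8 + i] = state[i];
        }
    }
    return out;
}

// Returns the index of the first job whose multithreaded result differs
// from the single-threaded reference, or -1 if all results agree.
int first_mismatch(ChunkFn fn, int jobs, int threads) {
    const std::vector<std::uint32_t> ref = run_jobs(fn, jobs, 1);
    const std::vector<std::uint32_t> par = run_jobs(fn, jobs, threads);
    for (int j = 0; j < jobs; ++j) {
        for (int i = 0; i < 8; ++i) {
            const std::size_t k = static_cast<std::size_t>(j) * 8 + i;
            if (ref[k] != par[k]) return j;
        }
    }
    return -1;
}
```

Calling first_mismatch(fn, jobs, logical_cpu_count()) and then again with one more thread gives the two data points the test asks for; a result other than -1 in only the second run would be consistent with the guess above, though it would not prove it.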

