Gemini-conduit design notes, from code review 8/2010

make clear in the readme the effects of the number of bounce buffers

see vapi conduit for example

use gprofile or oprofile to measure startup of a large job
(magic argument for testcollperf?  correctness and performance iterations both > 0)
replace bootstrap exchange with a faster one
replace pmi_barrier with a faster one

PHH DONE don't need segment addr and size info in peersegmentdata because gasnet already has that

PHH DONE the smsg peer data is almost all constant and doesn't need to be exchanged or kept around in the runtime array of structs

in gc_init_messaging, loop around

PHH DONE: multiplier sizing completion queue creates endpoints for the local supernode, which are not used when pshm is in use.  There is a gasneti_pshm_in_supernode to test if we need an endpoint for them
PHH DONE: but first must fix shutdown coordination not to use SmsgSend w/i the supernode
PHH DONE: and implement non-GNI path for loopback AMs when PSHM is disabled

PHH DONE can save memory for smsg buffering and cq sizing as well (and endpoints)

round up gc_post_descriptor to multiple of cache line  (only in PAR builds?)
  probably a good idea everywhere since gni libraries and kernel write fields in it.
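
Sketch of the rounding, assuming a 64-byte cache line.  example_pd_body is a made-up stand-in, not the real gc_post_descriptor layout, and round_up is a local helper rather than a GASNet macro:

```c
#include <assert.h>
#include <stddef.h>

#define EXAMPLE_CACHE_LINE 64  /* assumed cache line size in bytes */

/* Round n up to the next multiple of align (align must be a power of two). */
static size_t round_up(size_t n, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (n + align - 1) & ~(align - 1);
}

/* Hypothetical descriptor contents, for illustration only. */
struct example_pd_body {
    void  *local_addr;
    void  *remote_addr;
    size_t length;
    int    flags;
};

/* The pad member forces sizeof() up to a whole number of cache lines,
 * so adjacent descriptors in an array never share a line. */
typedef union {
    struct example_pd_body body;
    char pad[(sizeof(struct example_pd_body) + EXAMPLE_CACHE_LINE - 1)
             / EXAMPLE_CACHE_LINE * EXAMPLE_CACHE_LINE];
} example_padded_pd_t;
```

The array base must also be cache-line aligned for the padding to help.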

PHH DONE fix name of gc.h to be longer

LCS DONE remove am packet sequence number (only debug field or delete)
or maybe a timestamp field


PHH DONE eliminate from and to fields, could pack packet type together with numargs


PHH DONE (but worth checking again) remove unused and constant fields from all structs

LCS DONE insert padding in front of am medium payload if necessary to ensure 64-bit alignment, just so the memcpy at destination will be faster  (use gasneti_align? call it in the call to gc_send in AMMediumRequest)
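
Sketch of the padding arithmetic, assuming 32-bit handler args following a fixed-size header.  The function name and the header layout are made up for illustration; plain arithmetic is used here instead of any GASNet alignment helper:

```c
#include <stddef.h>
#include <stdint.h>

#define EXAMPLE_PAYLOAD_ALIGN 8  /* 64-bit alignment for the payload */

/* Return the offset at which the medium payload should start:
 * fixed header plus args, rounded up to an 8-byte boundary. */
static size_t example_medium_header_length(size_t fixed_header_bytes, int numargs)
{
    size_t raw = fixed_header_bytes + (size_t)numargs * sizeof(uint32_t);
    return (raw + EXAMPLE_PAYLOAD_ALIGN - 1) & ~(size_t)(EXAMPLE_PAYLOAD_ALIGN - 1);
}
```

With a 4-byte fixed header, an odd number of args needs no padding and an even number needs 4 bytes of it.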



PHH DONE-or-N/A if you want to piggyback credit return on amreply, AND use the smsg buffer to pass directly to the handler, then be sure to save at least one credit per currently executing handler

DONE-or-N/A in medium dispatch, there are two memcpy's, one for data and one for header;
you could use just one, if the data was 64-bit aligned.  Now there are no memcpys


could memcpy the varargs arg list in amrequest, then reverse the args in the call to the handler if necessary (too twisty)

LCS NOT DONE replace gc_atomic_dec_if_positive with gasneti_semaphore_trydown
Paul suggests semaphore is bigger, so it is fine the way it is
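
For reference, a sketch of what a decrement-if-positive looks like with C11 atomics.  This is not the conduit's gc_atomic_dec_if_positive; the name and the return convention (1 on success, 0 if the counter was already zero) are assumptions:

```c
#include <stdatomic.h>

/* Decrement *ctr only if it is currently > 0.
 * Returns 1 if a unit was taken, 0 if none was available. */
static int example_dec_if_positive(atomic_int *ctr)
{
    int old = atomic_load(ctr);
    while (old > 0) {
        /* On failure, old is refreshed with the current value and we retry. */
        if (atomic_compare_exchange_weak(ctr, &old, old - 1))
            return 1;
    }
    return 0;
}
```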

PHH DONE is a poll needed in the amrequest routines?  (yes! see other conduits)
put it up front
but NOT in the reply fns


DONE typeof is not portable, so change gc_post_descriptor_t to put the gni_post_descriptor first, and then cast to a gc_post_descriptor
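
Sketch of the first-member cast.  example_gni_pd_t stands in for the real gni_post_descriptor type and the other fields are invented; the point is only that C allows converting a pointer to the first member back to a pointer to the enclosing struct:

```c
#include <stddef.h>

/* Stand-in for the GNI descriptor type (fields are hypothetical). */
typedef struct {
    void  *local_addr;
    size_t length;
} example_gni_pd_t;

/* Conduit-level descriptor: the GNI descriptor MUST be the first member. */
typedef struct {
    example_gni_pd_t pd;
    int   flags;        /* hypothetical conduit-private field */
    void *completion;   /* hypothetical conduit-private field */
} example_post_descriptor_t;

/* Recover the enclosing descriptor from the pointer the library hands back.
 * Valid only because pd sits at offset 0. */
static example_post_descriptor_t *example_from_gni(example_gni_pd_t *p)
{
    return (example_post_descriptor_t *)p;
}
```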


PHH  DONE to piggyback credit return, use pointer to struct as token, then the handler can set a flag if it sends a reply
struct tokenstruct {
  gasnet_node_t source;
  int reply_sent;
};
then token is pointer to this struct
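
Sketch of how the flag could drive the credit decision.  All example_ names are made up, example_node_t stands in for gasnet_node_t, and the actual send of the reply or credit message is left out:

```c
typedef unsigned int example_node_t;  /* stand-in for gasnet_node_t */

struct example_token {
    example_node_t source;
    int reply_sent;
};

typedef void (*example_handler_fn)(void *token);

/* What a reply call would do to the token; a real one would also
 * send the reply, carrying the returned credit with it. */
static void example_reply(void *token)
{
    ((struct example_token *)token)->reply_sent = 1;
}

static void example_handler_with_reply(void *token) { example_reply(token); }
static void example_handler_silent(void *token)     { (void)token; }

/* Run the handler; return 1 if the caller must still send an
 * explicit credit return because no reply carried it. */
static int example_dispatch(example_node_t src, example_handler_fn handler)
{
    struct example_token tok = { src, 0 };
    handler(&tok);
    return !tok.reply_sent;
}
```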

PHH DONE gc_debug is not used

PHH DONE gasnete_put_nb always blocks, but it doesn't have to if the data is shorter than the bounce-register cutover (if source is outside segment)
PHH DONE if source is inside segment, then if source length is less than 128, then could copy to the immediate bounce buffer and return before the put is complete

when there are multiple 1 MB transfers for GET, make the first one shorter, to end on an aligned boundary so that <all> the transfers are not misaligned

first could be MAX - misalignment, or perhaps very small, try both ways

see testalign
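
Sketch of both ways of sizing the first chunk, assuming a 1 MB maximum and a 64-byte target alignment (the alignment value is a guess; all names are made up).  Transfers that fit in one chunk are not split:

```c
#include <stddef.h>
#include <stdint.h>

#define EXAMPLE_MAX_CHUNK ((size_t)1 << 20)  /* 1 MB per transfer */
#define EXAMPLE_ALIGN     ((size_t)64)       /* assumed boundary; must divide MAX_CHUNK */

/* Length of the first chunk.  small_first != 0 picks the "very small"
 * variant, otherwise MAX - misalignment.  Either way the first chunk
 * ends on an aligned address, so later chunks start aligned. */
static size_t example_first_chunk(uintptr_t addr, size_t total, int small_first)
{
    size_t mis = (size_t)(addr % EXAMPLE_ALIGN);
    size_t first;
    if (total <= EXAMPLE_MAX_CHUNK) return total;
    if (mis == 0) first = EXAMPLE_MAX_CHUNK;
    else first = small_first ? EXAMPLE_ALIGN - mis : EXAMPLE_MAX_CHUNK - mis;
    return first;
}

/* Count the chunks a transfer is split into. */
static size_t example_count_chunks(uintptr_t addr, size_t total, int small_first)
{
    size_t n = 0;
    while (total > 0) {
        size_t len = (n == 0) ? example_first_chunk(addr, total, small_first)
                   : (total < EXAMPLE_MAX_CHUNK ? total : EXAMPLE_MAX_CHUNK);
        addr  += len;
        total -= len;
        n++;
    }
    return n;
}
```

An aligned 1 MB plus 17 byte transfer comes out as one full chunk and one 17-byte tail.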

PHH DONE if add quota for registrations, could fire off all the 1 MB blocks and synchronize all of them at once,
for nbi, just increment the counters

PHH DONE for put_nbi_bulk, create a recursive nbi access region and return the handle for it as the handle for the put_nb_bulk

PHH DONE for put_nb, for small puts, copy the data to the immediate bounce buffer just so you can return faster

PHH DONE for medium size out-of-segment puts, return as soon as data copied into bounce buffer
but before the rdma is finished

PHH N/A bounce for medium in-segment puts, just block
- used bounce buffer instead just as for medium sized out-of-seg

PHH DONE put_nbi needs the same treatment for medium size as now done in put_nb

be careful to get 1M plus 17 bytes correct



consider having a 512 byte zero block somewhere in mapped segment to use as source for short memset

(or memset the immediate buffer)

DONE PHH: cray requires 4 byte alignment for get, could break up into a small transfer to reach alignment

DONE PHH OR together the source, dest, and length, then & 3 only once
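
Sketch of the single-test alignment check (function name is made up): any low bit set in any of the three values survives the OR, so one mask covers all of them.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if dest, src and nbytes are all multiples of 4. */
static int example_get_is_aligned(const void *dest, const void *src, size_t nbytes)
{
    return (((uintptr_t)dest | (uintptr_t)src | (uintptr_t)nbytes) & 3) == 0;
}
```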

DONE PHH: short misaligned messages should get an aligned block to the immediate buffer then
copy on completion
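
Sketch of the window arithmetic for that approach, assuming the 4-byte alignment rule above.  Names are made up, and the bounce buffer is just an array here; the actual GET into it is not shown:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uintptr_t aligned_src;  /* source address rounded down to 4 bytes */
    size_t    aligned_len;  /* length to fetch, a multiple of 4 */
    size_t    offset;       /* where the wanted bytes start in the fetched block */
} example_window_t;

/* Compute the enclosing 4-byte-aligned block for [src, src + nbytes). */
static example_window_t example_aligned_window(uintptr_t src, size_t nbytes)
{
    example_window_t w;
    w.aligned_src = src & ~(uintptr_t)3;
    w.offset      = (size_t)(src - w.aligned_src);
    w.aligned_len = (w.offset + nbytes + 3) & ~(size_t)3;
    return w;
}

/* On completion: copy only the requested bytes out of the bounce buffer. */
static void example_copy_out(void *dest, const unsigned char *bounce,
                             const example_window_t *w, size_t nbytes)
{
    memcpy(dest, bounce + w->offset, nbytes);
}
```

The fetched block can be up to 6 bytes longer than the request, so the immediate buffer size limit has to allow for that.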

if we fire off multiple chunks then make sure that the registrations don't overlap
and break on page boundaries
to prevent duplication

nb. try to understand use of huge_tlb

read -h output for the tests
testlarge and testsmall have -in to have source be in-segment


DONE gc_rdma_get doesn't use immediate bounce buffer!

DONE gc_rdma_get doesn't retry memregister the way gc_rdma_put does


DONE Separate concept of size of immediate bounce buffer from the concept of when to switch from FMA to RDMA  (fma->rdma at 1 K for MPI 4K for libpgas)

INLINE the small fry

consider changing queue for smsg work queue into a circular buffer, because it cannot be more than gc_ranks

enqueues and dequeues do not need to share a lock, but they need an atomic counter for space available.
PHH: but still need to atomically check whether it's already queued before enqueue
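
Sketch of the circular buffer with an atomic on-queue flag per peer, using C11 atomics.  All names and the fixed EXAMPLE_RANKS are made up.  The per-side locks are omitted: as written this is only safe with one enqueuer and one dequeuer at a time, so a real version would take a producer lock around the tail update and a consumer lock around the head update.

```c
#include <stdatomic.h>
#include <stddef.h>

#define EXAMPLE_RANKS 8  /* stand-in for gc_ranks */

typedef struct {
    atomic_flag on_queue;  /* set while the peer sits in the ring */
    int rank;
} example_peer_t;

typedef struct {
    example_peer_t *slot[EXAMPLE_RANKS];
    unsigned head, tail;
    atomic_uint count;     /* entries currently in the ring */
} example_ring_t;

static void example_ring_init(example_ring_t *q)
{
    q->head = q->tail = 0;
    atomic_init(&q->count, 0);
}

static void example_peer_init(example_peer_t *p, int rank)
{
    atomic_flag_clear(&p->on_queue);
    p->rank = rank;
}

/* Returns 1 if queued, 0 if the peer was already on the queue.
 * Each peer is queued at most once, so the ring cannot overflow. */
static int example_enqueue(example_ring_t *q, example_peer_t *p)
{
    if (atomic_flag_test_and_set(&p->on_queue)) return 0;
    q->slot[q->tail] = p;
    q->tail = (q->tail + 1) % EXAMPLE_RANKS;
    atomic_fetch_add(&q->count, 1);
    return 1;
}

/* Returns the next peer, or NULL if the ring is empty. */
static example_peer_t *example_dequeue(example_ring_t *q)
{
    example_peer_t *p;
    if (atomic_load(&q->count) == 0) return NULL;
    p = q->slot[q->head];
    q->head = (q->head + 1) % EXAMPLE_RANKS;
    atomic_fetch_sub(&q->count, 1);
    atomic_flag_clear(&p->on_queue);
    return p;
}
```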

PHH DONE gc_queue push not used

PHH DONE on-queue indication could be a flag, rather than a pointer to which queue you are on, since there is only the smsg_work_queue

PHH GONE (qi no longer in gpd) note, queue item and gni_post cannot both be first in the gc_post_descriptor


If I call handlers from the smsg buffer must assure that a handler cannot internally call poll_smsng_q

PHH DONE (but should do again) look for unused fields (initialized only)

PHH GONE close smsg_fd 

PHH GONE remember why /dev/zero instead of anon

either the sync synchronize in handle_am_X isn't needed, or it doesn't work anyway

Alternate atomic thing is a gasnet_semaphore_t 
implementation is efficient.

PHH DONE get rid of loop around printf failed to coordinate shutdown

what does PMI_Finalize really mean? Is there a timer that is started?  Otherwise, could move it above the flush_streams and trace_finish


Items added post-review:

PHH DONE: + Document and enforce a minimum value of GASNET_NETWORKDEPTH

PHH DONE: + Does shutdown msg need an am_credit to avoid GNI_SmsgSend() failing w/ NOT_DONE?

PHH DONE: + Requests could carry "banked" credits (if we can find header space)

+ Why do we still see NOT_DONE from SmsgSend (well before reaching shutdown)?

+ If am_credits[] were in aux-seg, then could use AMO for gasnetc_send_am_nop().

+ Since pd's *are* in aux-seg, might use FMA for other internal purpose(s).
