Registration and Scaling ======================== Problem Statement ----------------- Copper startup at scale has to address two linked issues: - children attempting parent RPCs before the parent provider was callable - too much repeated address-book processing during startup Runtime Design -------------- Copper keeps the following production behaviors: - a prepared job-local address book before the full Copper launch - runtime support for ``-prefiltered_address_book`` - cached parent-readiness checking after the first successful readiness probe - compact retained timing logs instead of the deeper debug-phase startup logs Readiness Semantics ------------------- The readiness evaluation work established a few definitions that are important for reading Copper startup logs correctly. ``provider registration completed after `` The local rank finished its own provider setup and considers itself ready. This is a local statement only. It does not guarantee that the parent or any other remote rank is already callable. ``parent readiness confirmed`` A child rank successfully probed its parent and confirmed that the parent is ready to accept forwarded RPCs. ``parent not ready yet`` The child reached the parent address, but the parent was not yet ready to serve the requested RPC path. ``thallium exception while probing parent readiness ... HG_NOENTRY`` The readiness probe itself reached the remote address, but the expected RPC or provider entry was not yet available there. ``timed out waiting for parent readiness`` The child exhausted the overall readiness-wait budget before the parent became callable. Address-Book Preparation ------------------------ ``facility`` One rank filters the provided facility address book down to the current job allocation. ``discover`` The helper ``list_cxi_hsn_thallium`` is run across the allocation. Copper preserves the raw output and derives the final hostname-to-endpoint mapping from the column selected by ``net_type``. Retained Timing Signals ----------------------- The production-visible startup timing signals are: - ``provider registration completed after `` - ``first successful parent rpc_... completed after since provider startup`` These lines are intentionally compact so that large-scale runs keep useful startup observability without restoring heavy phase-by-phase debug logging. To retain these two lines in current builds, use at least ``-l 4``. They are emitted through ``LOG(INFO)``, so ``-l 0`` through ``-l 3`` will not show them. Observed Scaling Behavior ------------------------- The retained evaluation data shows two different views of startup behavior: - an earlier debug-phase evaluation that was useful for isolating the readiness race and provider-startup skew - a later stable startup path that uses prepared address-book generation and compact retained timing signals The earlier debug-phase evaluation showed three distinct regimes. .. list-table:: :header-rows: 1 * - Scale - Import result - Average provider registration time - Main readiness outcome * - ``2-64`` nodes - success - ``145 ms`` to ``2.15 s`` - readiness handshake is sufficient and clean * - ``128`` nodes - success - about ``10.25 s`` - handshake helps, but ``HG_NOENTRY`` still appears during readiness probing * - ``256`` nodes - failed in the original evaluation - about ``44.17 s`` - startup skew dominates; readiness timeouts and many probe failures appear The most important quantitative finding was the provider-registration growth: - ``145.36 ms`` at ``2`` nodes - ``154.31 ms`` at ``16`` nodes - ``2.15 s`` at ``64`` nodes - ``10.25 s`` at ``128`` nodes - ``44.17 s`` at ``256`` nodes This is why the readiness probe mattered but was not, by itself, the complete solution. By 128 and 256 nodes, children were often correct to wait; the parent really was not ready yet. Current Stable High-Scale Reference ----------------------------------- The most recent retained high-scale registration measurements should be read as the current stable reference, not as part of the earlier debug-heavy ``iter1``/``iter2`` diagnosis. .. list-table:: :header-rows: 1 * - Scale and run - Registration log coverage - Provider registration min / avg / max - Main meaning * - ``256`` nodes latest retained high-scale reference - ``256 / 256`` ranks - ``149686 / 41917765.86 / 43527063 us`` - startup is still expensive at this scale, but the run completed and the provider timing is fully captured * - ``512`` nodes current stable path - ``512 / 512`` ranks - ``140204 / 148274.60 / 171363 us`` - prepared address-book startup removed the earlier high-scale registration blow-up and brought provider registration back down to about ``0.15 s`` These later results are the preferred reference when discussing current production expectations at high scale. Iter1 Versus Iter2 at High Scale -------------------------------- This comparison should be read as a debugging-phase study, not as the final production baseline. These runs were collected while the registration/readiness logic was still under active investigation, with much heavier logging and with code that did not yet match the later stable startup path. The later ``iter2`` runs changed the probe policy from a fixed ``10 ms`` loop with a ``5 s`` budget to an exponential-backoff policy with longer patience. .. list-table:: :header-rows: 1 * - Scale and run - Import result - Import elapsed - Average provider registration - Readiness-probe exceptions - Main conclusion * - ``128`` nodes ``iter1`` - success - ``1:07.83`` - ``10.25 s`` - ``1399`` - handshake helped, but the original probe loop still saw many readiness failures * - ``128`` nodes ``iter2`` - success - ``0:58.03`` - ``10.44 s`` - ``0`` in the aggregate pass - better probing removed the visible readiness-race symptoms * - ``256`` nodes ``iter1`` - failed - ``0:54.23`` - ``44.17 s`` - ``16405`` - provider-registration skew was too large for the original loop * - ``256`` nodes ``iter2`` - success - ``1:45.52`` - ``45.72 s`` - ``843`` - better probing improved correctness, but did not make provider startup faster * - ``512`` nodes ``iter1`` - timed import completed, but cleanup was unstable - ``4:43.08`` - one sampled rank: ``206.28 s`` - still visible - provider-registration skew remained the dominant large-scale bottleneck The table remains valuable because it shows why the readiness handshake and the later startup redesign were needed. It should not, however, be used as the best summary of the current stable code path. Step-by-Step Runtime Sequence ----------------------------- The readiness change altered the runtime sequence from: ``child immediately sends parent RPC`` to: ``child probes parent readiness, waits if needed, and only then sends the real parent RPC`` Operationally, the sequence is: #. each rank starts its Mercury, Margo, and Thallium engine state #. each rank performs local provider setup #. the rank logs ``provider registration completed after `` when its own provider is ready #. a child rank later discovers that it needs parent service #. the child sends a readiness probe first #. if the parent is not ready, the child waits and retries #. after ``parent readiness confirmed``, the child sends the real forwarded RPC What the Code Measurement Means ------------------------------- The code-path notes established a second important distinction: - provider-registration timing measures local provider startup only - it does not measure network-wide readiness This explains why a rank can log ``provider registration completed`` and a different child can still see ``HG_NOENTRY`` while probing that parent. Local provider readiness and remote reachability are related, but they are not the same event. For practical startup measurement at scale: - ``provider registration completed after `` is the local registration completion time - ``first successful parent rpc_... completed after since provider startup`` is the compact service-readiness time for the parent-connected distributed path Operational Interpretation -------------------------- The maintained conclusion from the registration-readiness work is: - the readiness handshake is correct and useful - better probe timing is also useful - the long-term scaling bottleneck is still provider-registration skew In other words, the dominant large-scale problem is no longer just a narrow "child forwarded too early once" race. The larger issue is that provider startup becomes too slow and too uneven across ranks. Operational Use --------------- - Use ``facility`` mode when the site-maintained address book is reliable. - Use ``discover`` mode when endpoint discovery should be derived from the live allocation. - Preserve ``logs/copper_address_book_full_output.txt`` when ``discover`` mode is used; it is the primary provenance artifact for discovered endpoint data. - Treat ``provider registration completed after `` and the first successful parent-RPC timing as the two key retained startup indicators at production verbosity.