Registration and Scaling
Problem Statement
Copper startup at scale has to address two linked issues:
children attempting parent RPCs before the parent provider is callable
too much repeated address-book processing during startup
Runtime Design
Copper keeps the following production behaviors:
a prepared job-local address book before the full Copper launch
runtime support for -prefiltered_address_book
cached parent-readiness checking after the first successful readiness probe
compact retained timing logs instead of the deeper debug-phase startup logs
Readiness Semantics
The readiness evaluation work established a few definitions that are important for reading Copper startup logs correctly.
provider registration completed after <us>: The local rank finished its own provider setup and considers itself ready. This is a local statement only. It does not guarantee that the parent or any other remote rank is already callable.
parent readiness confirmed: A child rank successfully probed its parent and confirmed that the parent is ready to accept forwarded RPCs.
parent not ready yet: The child reached the parent address, but the parent was not yet ready to serve the requested RPC path.
thallium exception while probing parent readiness ... HG_NOENTRY: The readiness probe itself reached the remote address, but the expected RPC or provider entry was not yet available there.
timed out waiting for parent readiness: The child exhausted the overall readiness-wait budget before the parent became callable.
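When scanning startup logs, the definitions above can be applied mechanically. The following is a minimal sketch, not part of Copper itself: the `ReadinessEvent` enum and the `classify` helper are hypothetical names, and the matching assumes only that each log line contains the phrase exactly as listed above.

```python
from enum import Enum
from typing import Optional


class ReadinessEvent(Enum):
    """Hypothetical labels for the log phrases defined above."""
    LOCAL_REGISTERED = "local provider registered (local statement only)"
    PARENT_READY = "parent confirmed ready for forwarded RPCs"
    PARENT_NOT_READY = "parent reachable but not ready yet"
    PROBE_NOENTRY = "probe reached address, RPC/provider entry missing"
    READINESS_TIMEOUT = "readiness-wait budget exhausted"


# Phrase fragments copied from the definitions; none is a substring of another.
_MARKERS = [
    ("timed out waiting for parent readiness", ReadinessEvent.READINESS_TIMEOUT),
    ("parent readiness confirmed", ReadinessEvent.PARENT_READY),
    ("parent not ready yet", ReadinessEvent.PARENT_NOT_READY),
    ("provider registration completed after", ReadinessEvent.LOCAL_REGISTERED),
]


def classify(line: str) -> Optional[ReadinessEvent]:
    """Map one log line to a readiness event, or None if it is unrelated."""
    text = line.lower()
    # The HG_NOENTRY case has two fragments separated by variable text.
    if ("thallium exception while probing parent readiness" in text
            and "hg_noentry" in text):
        return ReadinessEvent.PROBE_NOENTRY
    for marker, event in _MARKERS:
        if marker in text:
            return event
    return None
```

A `LOCAL_REGISTERED` event on a parent followed by a `PROBE_NOENTRY` event on one of its children is consistent with these definitions: the first is a local statement, the second is a remote observation.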
Address-Book Preparation
facility: One rank filters the provided facility address book down to the current job allocation.
discover: The helper list_cxi_hsn_thallium is run across the allocation. Copper preserves the raw output and derives the final hostname-to-endpoint mapping from the column selected by net_type.
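The facility-mode filtering step can be illustrated with a short sketch. The on-disk format of the facility address book is not described here, so the example assumes a hypothetical layout of one whitespace-separated `hostname endpoint` pair per line; the function name and the sample hostnames and endpoints are also made up for illustration.

```python
from typing import Dict, Iterable


def filter_address_book(lines: Iterable[str],
                        job_hosts: Iterable[str]) -> Dict[str, str]:
    """Keep only address-book entries whose hostname is in the job allocation.

    Assumed (hypothetical) input format: "<hostname> <endpoint>" per line,
    with blank lines and lines starting with '#' ignored.
    """
    wanted = set(job_hosts)
    job_local: Dict[str, str] = {}
    for raw in lines:
        parts = raw.split()
        if len(parts) < 2 or parts[0].startswith("#"):
            continue
        host, endpoint = parts[0], parts[1]
        if host in wanted:
            job_local[host] = endpoint
    return job_local


if __name__ == "__main__":
    # Made-up facility entries and job hosts, for illustration only.
    facility = [
        "# hostname endpoint",
        "node0001 endpoint-a",
        "node0002 endpoint-b",
        "node0003 endpoint-c",
    ]
    print(filter_address_book(facility, ["node0001", "node0003"]))
```

The point of doing this once, on one rank, is that every other rank then reads a job-sized address book instead of repeating the facility-sized scan.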
Retained Timing Signals
The production-visible startup timing signals are:
provider registration completed after <us>
first successful parent rpc_... completed after <us> since provider startup
These lines are intentionally compact so that large-scale runs keep useful startup observability without restoring heavy phase-by-phase debug logging.
To retain these two lines in current builds, use at least -l 4. They are
emitted through LOG(INFO), so -l 0 through -l 3 will not show
them.
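For post-run analysis, the two retained lines can be pulled out of a log with a pair of regular expressions. This is a sketch under assumptions: it treats `<us>` as an integer microsecond count that directly follows the word "after", and `parse_timing` and the `provider_registration` key are hypothetical names, not Copper identifiers.

```python
import re
from typing import Optional, Tuple

# Assumes <us> is rendered as a plain integer right after "after".
_REGISTRATION = re.compile(r"provider registration completed after\s+(\d+)")
_FIRST_RPC = re.compile(
    r"first successful parent (rpc_\w+) completed after\s+(\d+)"
    r"\D*since provider startup"
)


def parse_timing(line: str) -> Optional[Tuple[str, int]]:
    """Return (signal name, microseconds) for a retained timing line.

    Lines that are not one of the two retained signals yield None.
    """
    match = _FIRST_RPC.search(line)
    if match:
        return match.group(1), int(match.group(2))
    match = _REGISTRATION.search(line)
    if match:
        return "provider_registration", int(match.group(1))
    return None
```

If a log collected below -l 4 is passed through this helper, every line returns None, which is a quick way to notice that the verbosity was too low.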
Observed Scaling Behavior
The retained evaluation data shows two different views of startup behavior:
an earlier debug-phase evaluation that was useful for isolating the readiness race and provider-startup skew
a later stable startup path that uses prepared address-book generation and compact retained timing signals
The earlier debug-phase evaluation showed three distinct regimes.
| Scale | Import result | Average provider registration time | Main readiness outcome |
|---|---|---|---|
| | success | | readiness handshake is sufficient and clean |
| | success | about | handshake helps, but |
| | failed in the original evaluation | about | startup skew dominates; readiness timeouts and many probe failures appear |
The most important quantitative finding was the provider-registration growth:
145.36 ms at 2 nodes
154.31 ms at 16 nodes
2.15 s at 64 nodes
10.25 s at 128 nodes
44.17 s at 256 nodes
This is why the readiness probe mattered but was not, by itself, the complete solution. By 128 and 256 nodes, children were often correct to wait; the parent really was not ready yet.
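The growth between consecutive scales can be checked with simple arithmetic on the five figures above. The sketch below only restates those numbers in milliseconds and divides them; the function name is illustrative.

```python
# Provider-registration times from the list above, converted to milliseconds.
measurements_ms = {
    2: 145.36,
    16: 154.31,
    64: 2150.0,
    128: 10250.0,
    256: 44170.0,
}


def growth_steps(data):
    """Return (from_nodes, to_nodes, node_factor, time_factor) per step."""
    nodes = sorted(data)
    steps = []
    for prev, cur in zip(nodes, nodes[1:]):
        steps.append((prev, cur, cur / prev, data[cur] / data[prev]))
    return steps


if __name__ == "__main__":
    for prev, cur, node_factor, time_factor in growth_steps(measurements_ms):
        print(f"{prev:>3} -> {cur:>3} nodes: "
              f"nodes x{node_factor:.1f}, registration time x{time_factor:.2f}")
```

From 2 to 16 nodes the registration time is nearly flat. From 16 to 64 nodes the node count grows 4x while the time grows roughly 14x, and from 128 to 256 nodes the node count doubles while the time grows roughly 4.3x, so the growth is clearly faster than linear at the high end.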
Current Stable High-Scale Reference
The most recent retained high-scale registration measurements should be read as
the current stable reference, not as part of the earlier debug-heavy
iter1/iter2 diagnosis.
| Scale and run | Registration log coverage | Provider registration min / avg / max | Main meaning |
|---|---|---|---|
| | | | startup is still expensive at this scale, but the run completed and the provider timing is fully captured |
| | | | prepared address-book startup removed the earlier high-scale registration blow-up and brought provider registration back down to about |
These later results are the preferred reference when discussing current production expectations at high scale.
Iter1 Versus Iter2 at High Scale
This comparison should be read as a debugging-phase study, not as the final production baseline. These runs were collected while the registration/readiness logic was still under active investigation, with much heavier logging and with code that did not yet match the later stable startup path.
The later iter2 runs changed the probe policy from a fixed 10 ms loop
with a 5 s budget to an exponential-backoff policy with longer patience.
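The difference between the two probe policies can be sketched as delay schedules. The fixed policy uses the 10 ms interval and 5 s budget stated above. The backoff parameters (initial delay, growth factor, cap, and total budget) are not given here, so the values in the example are assumptions chosen only to show the shape of the schedule.

```python
from typing import Iterator


def fixed_delays(interval_s: float = 0.010,
                 budget_s: float = 5.0) -> Iterator[float]:
    """Fixed-interval schedule: the same wait between probes until the budget ends."""
    attempts = int(round(budget_s / interval_s))
    for _ in range(attempts):
        yield interval_s


def backoff_delays(initial_s: float, factor: float,
                   cap_s: float, budget_s: float) -> Iterator[float]:
    """Exponential-backoff schedule: waits grow by `factor` up to `cap_s`.

    The final wait is shortened so that the waits sum to `budget_s`.
    """
    remaining = budget_s
    delay = initial_s
    while remaining > 0:
        step = min(delay, cap_s)
        if step >= remaining:
            yield remaining
            return
        yield step
        remaining -= step
        delay *= factor


if __name__ == "__main__":
    fixed = list(fixed_delays())
    # Assumed backoff parameters, for illustration only.
    backoff = list(backoff_delays(initial_s=0.010, factor=2.0,
                                  cap_s=1.0, budget_s=60.0))
    print(len(fixed), "probes over", round(sum(fixed), 3), "s (fixed)")
    print(len(backoff), "probes over", round(sum(backoff), 3), "s (backoff)")
```

With a slow parent, the fixed loop spends its whole budget on hundreds of closely spaced probes, whereas a backoff schedule covers a much longer wait with far fewer probes. Neither schedule makes the parent ready sooner.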
| Scale and run | Import result | Import elapsed | Average provider registration | Readiness-probe exceptions | Main conclusion |
|---|---|---|---|---|---|
| | success | | | | handshake helped, but the original probe loop still saw many readiness failures |
| | success | | | | better probing removed the visible readiness-race symptoms |
| | failed | | | | provider-registration skew was too large for the original loop |
| | success | | | | better probing improved correctness, but did not make provider startup faster |
| | timed import completed, but cleanup was unstable | | one sampled rank: | still visible | provider-registration skew remained the dominant large-scale bottleneck |
The table remains valuable because it shows why the readiness handshake and the later startup redesign were needed. It should not, however, be used as the best summary of the current stable code path.
Step-by-Step Runtime Sequence
The readiness change altered the runtime sequence from:
child immediately sends parent RPC
to:
child probes parent readiness, waits if needed, and only then sends the real parent RPC
Operationally, the sequence is:
each rank starts its Mercury, Margo, and Thallium engine state
each rank performs local provider setup
the rank logs provider registration completed after <us> when its own provider is ready
a child rank later discovers that it needs parent service
the child sends a readiness probe first
if the parent is not ready, the child waits and retries
after parent readiness confirmed, the child sends the real forwarded RPC
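The child-side part of this sequence, including the cached readiness check mentioned under Runtime Design, can be sketched as follows. This is not Copper's implementation (which sits on Mercury, Margo, and Thallium); `ParentLink`, `probe`, and `rpc` are hypothetical stand-ins, and the probe is modeled as a plain callable that returns True once the parent is ready.

```python
import time
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


class ParentLink:
    """Probe the parent before the first forwarded RPC, then cache the result."""

    def __init__(self, probe: Callable[[], bool], delays: Iterable[float],
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self._probe = probe            # hypothetical readiness probe
        self._delays = list(delays)    # waits between probe attempts
        self._sleep = sleep
        self._ready = False            # set after the first successful probe

    def wait_until_ready(self) -> None:
        if self._ready:
            return                     # cached: no further probing
        for delay in self._delays:
            if self._probe():
                self._ready = True
                return
            self._sleep(delay)         # parent not ready yet: wait and retry
        if self._probe():              # one last probe after the final wait
            self._ready = True
            return
        raise TimeoutError("timed out waiting for parent readiness")

    def call(self, rpc: Callable[[], T]) -> T:
        """Send the real forwarded RPC only after readiness is confirmed."""
        self.wait_until_ready()
        return rpc()
```

Caching the first successful probe means that only the first forwarded RPC pays the readiness-wait cost; later calls go straight to the parent.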
What the Code Measurement Means
The code-path notes established a second important distinction:
provider-registration timing measures local provider startup only
it does not measure network-wide readiness
This explains why a rank can log provider registration completed and a
different child can still see HG_NOENTRY while probing that parent. Local
provider readiness and remote reachability are related, but they are not the
same event.
For practical startup measurement at scale:
provider registration completed after <us> is the local registration completion time
first successful parent rpc_... completed after <us> since provider startup is the compact service-readiness time for the parent-connected distributed path
Operational Interpretation
The maintained conclusion from the registration-readiness work is:
the readiness handshake is correct and useful
better probe timing is also useful
the long-term scaling bottleneck is still provider-registration skew
In other words, the dominant large-scale problem is no longer just a narrow “child forwarded too early once” race. The larger issue is that provider startup becomes too slow and too uneven across ranks.
Operational Use
Use facility mode when the site-maintained address book is reliable.
Use discover mode when endpoint discovery should be derived from the live allocation.
Preserve logs/copper_address_book_full_output.txt when discover mode is used; it is the primary provenance artifact for discovered endpoint data.
Treat provider registration completed after <us> and the first successful parent-RPC timing as the two key retained startup indicators at production verbosity.