Registration and Scaling

Problem Statement

Copper startup at scale has to address two linked issues:

  • children attempting parent RPCs before the parent provider is callable

  • too much repeated address-book processing during startup

Runtime Design

Copper keeps the following production behaviors:

  • a prepared job-local address book before the full Copper launch

  • runtime support for -prefiltered_address_book

  • cached parent-readiness checking after the first successful readiness probe

  • compact retained timing logs instead of the deeper debug-phase startup logs
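
The cached readiness check can be illustrated with a minimal Python sketch. It is not Copper's implementation: the class name ParentReadiness, the probe callable, and the probe_calls counter are hypothetical and exist only to show that no further probes are issued once the first one succeeds.

```python
from typing import Callable


class ParentReadiness:
    """Caches the first successful parent-readiness probe (hypothetical helper)."""

    def __init__(self, probe: Callable[[], bool]) -> None:
        self._probe = probe    # stand-in for the real readiness probe
        self._ready = False    # becomes True once and then stays True
        self.probe_calls = 0   # how many times the probe actually ran

    def is_ready(self) -> bool:
        if self._ready:
            return True        # cached result: no additional probe traffic
        self.probe_calls += 1
        self._ready = bool(self._probe())
        return self._ready


# Simulated parent that reports ready on the third probe.
answers = iter([False, False, True])
readiness = ParentReadiness(lambda: next(answers))
results = [readiness.is_ready() for _ in range(6)]
```

In this simulation the probe runs three times, and the remaining three calls are answered from the cached flag.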

Readiness Semantics

The readiness evaluation work established a few definitions that are important for reading Copper startup logs correctly.

provider registration completed after <us>

The local rank finished its own provider setup and considers itself ready. This is a local statement only. It does not guarantee that the parent or any other remote rank is already callable.

parent readiness confirmed

A child rank successfully probed its parent and confirmed that the parent is ready to accept forwarded RPCs.

parent not ready yet

The child reached the parent address, but the parent was not yet ready to serve the requested RPC path.

thallium exception while probing parent readiness ... HG_NOENTRY

The readiness probe itself reached the remote address, but the expected RPC or provider entry was not yet available there.

timed out waiting for parent readiness

The child exhausted the overall readiness-wait budget before the parent became callable.
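
When scanning large logs, it can help to map each message to one of the states defined above. The following Python sketch does that with regular expressions built from the message fragments listed in this section. The label names and the sample lines are assumptions, and the real log lines may carry prefixes or fields that are not shown here.

```python
import re
from typing import Optional

# Message fragments come from the definitions above; label names are hypothetical.
PATTERNS = [
    ("local_registered", re.compile(r"provider registration completed after")),
    ("parent_ready", re.compile(r"parent readiness confirmed")),
    ("parent_not_ready", re.compile(r"parent not ready yet")),
    ("probe_noentry",
     re.compile(r"thallium exception while probing parent readiness.*HG_NOENTRY")),
    ("readiness_timeout", re.compile(r"timed out waiting for parent readiness")),
]


def classify(line: str) -> Optional[str]:
    """Return the readiness label for a log line, or None if nothing matches."""
    for label, pattern in PATTERNS:
        if pattern.search(line):
            return label
    return None


# Synthetic sample lines, not copied from a real run.
sample = [
    "provider registration completed after 1234 us",
    "parent not ready yet",
    "thallium exception while probing parent readiness: HG_NOENTRY",
    "unrelated message",
]
labels = [classify(line) for line in sample]
```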

Address-Book Preparation

facility

One rank filters the provided facility address book down to the current job allocation.

discover

The helper list_cxi_hsn_thallium is run across the allocation. Copper preserves the raw output and derives the final hostname-to-endpoint mapping from the column selected by net_type.
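
The facility-mode filtering step can be sketched as follows. The file format is an assumption made for illustration only (one "hostname endpoint" pair per line, with "#" starting a comment); the function name filter_address_book and the host and endpoint values are placeholders, not Copper's actual format or data.

```python
from typing import Dict, Iterable


def filter_address_book(lines: Iterable[str], job_hosts: Iterable[str]) -> Dict[str, str]:
    """Keep only address-book entries whose hostname is in the job allocation.

    Assumed line format (hypothetical): "<hostname> <endpoint>".
    """
    wanted = set(job_hosts)
    book: Dict[str, str] = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2 or fields[0].startswith("#"):
            continue  # skip blank, malformed, and comment lines
        host, endpoint = fields[0], fields[1]
        if host in wanted:
            book[host] = endpoint
    return book


# Placeholder data: a four-entry facility book and a two-node allocation.
facility_book = [
    "# hostname endpoint",
    "node-a endpoint-a",
    "node-b endpoint-b",
    "node-c endpoint-c",
    "node-d endpoint-d",
]
job_book = filter_address_book(facility_book, ["node-b", "node-d"])
```

Filtering once on a single rank means every other rank starts from a book that already contains only the hosts in the current job.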

Retained Timing Signals

The production-visible startup timing signals are:

  • provider registration completed after <us>

  • first successful parent rpc_... completed after <us> since provider startup

These lines are intentionally compact so that large-scale runs keep useful startup observability without restoring heavy phase-by-phase debug logging.

To retain these two lines in current builds, use at least -l 4. They are emitted through LOG(INFO), so -l 0 through -l 3 will not show them.
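
As an illustration of how the registration line can be post-processed, the Python sketch below extracts the microsecond value from each matching line and reports min / avg / max. It assumes the number appears directly after the phrase "provider registration completed after", which is a guess about the exact line layout. The sample values are synthetic.

```python
import re
from typing import Iterable, List, Tuple

# Assumed layout: the microsecond count directly follows the fixed phrase.
REGISTRATION = re.compile(r"provider registration completed after (\d+(?:\.\d+)?)")


def registration_times_us(lines: Iterable[str]) -> List[float]:
    """Collect registration times (microseconds) from log lines."""
    times = []
    for line in lines:
        match = REGISTRATION.search(line)
        if match:
            times.append(float(match.group(1)))
    return times


def summarize(times: List[float]) -> Tuple[float, float, float]:
    """Return (min, avg, max); raises ValueError on an empty list."""
    if not times:
        raise ValueError("no registration lines found")
    return min(times), sum(times) / len(times), max(times)


# Synthetic log excerpt, one line per rank.
log_lines = [
    "provider registration completed after 100 us",
    "some other message",
    "provider registration completed after 300 us",
    "provider registration completed after 200 us",
]
summary = summarize(registration_times_us(log_lines))
```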

Observed Scaling Behavior

The retained evaluation data shows two different views of startup behavior:

  • an earlier debug-phase evaluation that was useful for isolating the readiness race and provider-startup skew

  • a later stable startup path that uses prepared address-book generation and compact retained timing signals

The earlier debug-phase evaluation showed three distinct regimes.

Scale        Import result                       Average provider registration time   Main readiness outcome
-----        -------------                       ----------------------------------   ----------------------
2-64 nodes   success                             145 ms to 2.15 s                     readiness handshake is sufficient and clean
128 nodes    success                             about 10.25 s                        handshake helps, but HG_NOENTRY still appears during readiness probing
256 nodes    failed in the original evaluation   about 44.17 s                        startup skew dominates; readiness timeouts and many probe failures appear

The most important quantitative finding was the provider-registration growth:

  • 145.36 ms at 2 nodes

  • 154.31 ms at 16 nodes

  • 2.15 s at 64 nodes

  • 10.25 s at 128 nodes

  • 44.17 s at 256 nodes

This is why the readiness probe mattered but was not, by itself, the complete solution. By 128 and 256 nodes, children were often correct to wait; the parent really was not ready yet.
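
The growth can be made explicit by computing the ratio between successive measurements. The short Python calculation below uses only the averages listed above; the variable names are illustrative.

```python
# Average provider registration time in seconds, keyed by node count
# (values taken from the list above).
avg_registration_s = {
    2: 0.14536,
    16: 0.15431,
    64: 2.15,
    128: 10.25,
    256: 44.17,
}

scales = sorted(avg_registration_s)
# Growth factor between each pair of adjacent measured scales.
growth = {
    (small, large): avg_registration_s[large] / avg_registration_s[small]
    for small, large in zip(scales, scales[1:])
}

for (small, large), factor in growth.items():
    print(f"{small:>3} -> {large:>3} nodes: x{factor:.2f}")
```

Registration time is nearly flat from 2 to 16 nodes, then grows by roughly 4.8x from 64 to 128 nodes and roughly 4.3x from 128 to 256 nodes, so each doubling of the node count at high scale more than quadruples the average registration time.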

Current Stable High-Scale Reference

The most recent retained high-scale registration measurements should be read as the current stable reference, not as part of the earlier debug-heavy iter1/iter2 diagnosis.

Scale and run                                    Registration log coverage   Provider registration min / avg / max   Main meaning
-------------                                    -------------------------   -------------------------------------   ------------
256 nodes latest retained high-scale reference   256 / 256 ranks             149686 / 41917765.86 / 43527063 us      startup is still expensive at this scale, but the run completed and the provider timing is fully captured
512 nodes current stable path                    512 / 512 ranks             140204 / 148274.60 / 171363 us          prepared address-book startup removed the earlier high-scale registration blow-up and brought provider registration back down to about 0.15 s

These later results are the preferred reference when discussing current production expectations at high scale.

Iter1 Versus Iter2 at High Scale

This comparison should be read as a debugging-phase study, not as the final production baseline. These runs were collected while the registration/readiness logic was still under active investigation, with much heavier logging and with code that did not yet match the later stable startup path.

The later iter2 runs changed the probe policy from a fixed 10 ms loop with a 5 s budget to an exponential-backoff policy with longer patience.
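
The difference between the two probe policies can be sketched in Python by generating the delay schedule each one produces. The fixed policy uses the 10 ms interval and 5 s budget stated above. The backoff parameters (10 ms initial delay, doubling, 1 s cap, 60 s budget) are assumptions chosen for illustration, since the exact iter2 values are not given here.

```python
from typing import List


def fixed_schedule_ms(interval_ms: int = 10, budget_ms: int = 5000) -> List[int]:
    """Delays for a fixed-interval probe loop within a total wait budget."""
    return [interval_ms] * (budget_ms // interval_ms)


def backoff_schedule_ms(initial_ms: int, factor: int, cap_ms: int,
                        budget_ms: int) -> List[int]:
    """Delays for an exponential-backoff probe loop with a per-delay cap."""
    delays: List[int] = []
    elapsed, delay = 0, initial_ms
    while elapsed + delay <= budget_ms:
        delays.append(delay)
        elapsed += delay
        delay = min(delay * factor, cap_ms)
    return delays


fixed = fixed_schedule_ms()
# Hypothetical backoff parameters; not the values used by Copper.
backoff = backoff_schedule_ms(initial_ms=10, factor=2, cap_ms=1000, budget_ms=60000)
```

Under these assumed parameters the fixed loop issues up to 500 probes in 5 s, while the backoff loop issues 65 probes spread over almost 60 s. Fewer probes against a parent that is still registering, combined with a longer total wait, is consistent with the drop in readiness-probe exceptions reported for iter2.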

Scale and run     Import result                                      Import elapsed   Average provider registration   Readiness-probe exceptions   Main conclusion
-------------     -------------                                      --------------   -----------------------------   --------------------------   ---------------
128 nodes iter1   success                                            1:07.83          10.25 s                         1399                         handshake helped, but the original probe loop still saw many readiness failures
128 nodes iter2   success                                            0:58.03          10.44 s                         0 in the aggregate pass      better probing removed the visible readiness-race symptoms
256 nodes iter1   failed                                             0:54.23          44.17 s                         16405                        provider-registration skew was too large for the original loop
256 nodes iter2   success                                            1:45.52          45.72 s                         843                          better probing improved correctness, but did not make provider startup faster
512 nodes iter1   timed import completed, but cleanup was unstable   4:43.08          one sampled rank: 206.28 s      still visible                provider-registration skew remained the dominant large-scale bottleneck

The table remains valuable because it shows why the readiness handshake and the later startup redesign were needed. It should not, however, be used as the best summary of the current stable code path.

Step-by-Step Runtime Sequence

The readiness change altered the runtime sequence from:

child immediately sends parent RPC

to:

child probes parent readiness, waits if needed, and only then sends the real parent RPC

Operationally, the sequence is:

  1. each rank starts its Mercury, Margo, and Thallium engine state

  2. each rank performs local provider setup

  3. the rank logs provider registration completed after <us> when its own provider is ready

  4. a child rank later discovers that it needs parent service

  5. the child sends a readiness probe first

  6. if the parent is not ready, the child waits and retries

  7. after parent readiness confirmed, the child sends the real forwarded RPC
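
Steps 5 through 7 can be sketched as a small Python function. This is a simplified model, not Copper's code: the names forward_when_ready, probe, and send_rpc are hypothetical, and the sleep function is injectable so that the example runs without real waiting.

```python
import time
from typing import Callable, List, Sequence


def forward_when_ready(probe: Callable[[], bool],
                       send_rpc: Callable[[], str],
                       delays_s: Sequence[float],
                       sleep: Callable[[float], None] = time.sleep) -> str:
    """Probe the parent, wait between failed probes, then send the real RPC."""
    for delay in delays_s:
        if probe():              # step 5: readiness probe first
            return send_rpc()    # step 7: forward only after readiness
        sleep(delay)             # step 6: parent not ready, wait and retry
    raise TimeoutError("timed out waiting for parent readiness")


# Simulated parent that becomes ready on the third probe.
answers = iter([False, False, True])
slept: List[float] = []
result = forward_when_ready(
    probe=lambda: next(answers),
    send_rpc=lambda: "forwarded",
    delays_s=[0.01] * 5,
    sleep=slept.append,          # record delays instead of sleeping
)
```

If every probe in the schedule fails, the function raises TimeoutError, which corresponds to the "timed out waiting for parent readiness" outcome described under Readiness Semantics.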

What the Code Measurement Means

The code-path notes established a second important distinction:

  • provider-registration timing measures local provider startup only

  • it does not measure network-wide readiness

This explains why a rank can log provider registration completed and a different child can still see HG_NOENTRY while probing that parent. Local provider readiness and remote reachability are related, but they are not the same event.

For practical startup measurement at scale:

  • provider registration completed after <us> is the local registration completion time

  • first successful parent rpc_... completed after <us> since provider startup is the compact service-readiness time for the parent-connected distributed path

Operational Interpretation

The maintained conclusion from the registration-readiness work is:

  • the readiness handshake is correct and useful

  • better probe timing is also useful

  • the long-term scaling bottleneck is still provider-registration skew

In other words, the dominant large-scale problem is no longer just a narrow “child forwarded too early once” race. The larger issue is that provider startup becomes too slow and too uneven across ranks.

Operational Use

  • Use facility mode when the site-maintained address book is reliable.

  • Use discover mode when endpoint discovery should be derived from the live allocation.

  • Preserve logs/copper_address_book_full_output.txt when discover mode is used; it is the primary provenance artifact for discovered endpoint data.

  • Treat provider registration completed after <us> and the first successful parent-RPC timing as the two key retained startup indicators at production verbosity.