SCoUT: Scalable Communication via Utility-Guided Temporal Grouping in Multi-Agent Reinforcement Learning

Abstract

Communication can improve coordination in partially observed multi-agent reinforcement learning (MARL), but learning when and with whom to communicate requires choosing among many possible sender-recipient pairs, and the effect of any single message on future reward is hard to isolate. We introduce SCoUT (Scalable Communication via Utility-guided Temporal grouping), which addresses both challenges through temporal and agent abstraction within a standard MARL pipeline. During training, SCoUT resamples soft agent groups every K environment steps (macro-steps) via Gumbel-Softmax; these latent clusters induce an affinity matrix that serves as a differentiable prior over message recipients. Using the same assignments, a group-aware critic predicts a value for each group and maps these values to per-agent baselines through the soft assignments, reducing critic complexity and variance. Each agent is trained with a three-headed policy: environment action, send decision, and recipient selection. To obtain precise learning signals for communication, we derive counterfactual communication advantages by analytically removing each sender's contribution from the recipient's aggregated messages, enabling precise credit assignment for both send and recipient-selection decisions. At execution time, all centralized training components are discarded and only the per-agent policy runs, preserving decentralized execution. Experiments on large-scale benchmarks show that SCoUT learns targeted communication and remains effective as the population grows, while prior methods degrade. Ablations confirm that temporal grouping and counterfactual communication credit are both critical for scalability.
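To make the counterfactual computation concrete: when a recipient aggregates incoming messages by averaging, a sender's contribution can be removed analytically instead of re-running the aggregation without it. The sketch below is a minimal numpy illustration under that mean-aggregation assumption; the function names and the toy linear value function are hypothetical stand-ins, not SCoUT's actual implementation.

```python
import numpy as np

def remove_sender(messages, j):
    """Counterfactual aggregate: mean of messages with sender j removed.

    messages: (n, d) array of messages one recipient received this step.
    Uses the identity mean_{-j} = (n * mean - m_j) / (n - 1), so sender
    j's contribution is removed analytically, without re-aggregating.
    """
    n = messages.shape[0]
    full_sum = messages.sum(axis=0)
    return (full_sum - messages[j]) / (n - 1)

def comm_advantage(value_fn, messages, j):
    """Counterfactual communication advantage for sender j:
    value with j's message included minus value with it removed."""
    with_j = messages.mean(axis=0)
    without_j = remove_sender(messages, j)
    return value_fn(with_j) - value_fn(without_j)

# Toy linear "critic" over the aggregated message (hypothetical).
w = np.array([1.0, -0.5])
value_fn = lambda m: float(m @ w)

msgs = np.array([[2.0, 0.0],
                 [0.0, 2.0],
                 [1.0, 1.0]])
adv = comm_advantage(value_fn, msgs, 0)  # credit for sender 0's message
```

A positive `adv` means the sender's message raised the recipient's predicted value, giving a direct gradient signal for both the send decision and recipient selection.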

Methodology

SCoUT learns both whom to talk to (via a soft grouping policy over agent descriptors) and what to send (via a shared descriptor and message pool), trained with centralized PPO using group-aware value baselines and counterfactual advantages for communication actions.
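The two centralized-training pieces named above can be sketched as follows: soft group assignment via Gumbel-Softmax, the induced agent-to-agent affinity used as a recipient prior, and the mapping of per-group critic values back to per-agent baselines. This is an illustrative numpy version under simplified assumptions (in the method, the logits are learned end-to-end and groups are resampled once per K-step macro-step); all variable names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(logits, tau=1.0):
    """Soft one-hot sample from categorical logits (Gumbel-Softmax)."""
    u = rng.uniform(size=logits.shape)
    gumbel = -np.log(-np.log(u + 1e-20) + 1e-20)
    y = (logits + gumbel) / tau
    y -= y.max(axis=-1, keepdims=True)          # numerical stability
    expy = np.exp(y)
    return expy / expy.sum(axis=-1, keepdims=True)

n_agents, n_groups = 6, 2
logits = rng.normal(size=(n_agents, n_groups))  # learned in practice

# Sampled once per macro-step and held fixed within it.
Z = gumbel_softmax(logits, tau=0.5)             # (n_agents, n_groups); rows sum to 1

# Agent-to-agent affinity: a differentiable prior over message recipients.
affinity = Z @ Z.T                              # (n_agents, n_agents), symmetric

# Group-aware critic outputs one value per group; the same soft
# assignments map them back to per-agent baselines.
group_values = np.array([1.0, -1.0])            # stand-in critic outputs
baselines = Z @ group_values                    # (n_agents,)
```

Because the critic predicts `n_groups` values instead of `n_agents` values, its output dimension stays fixed as the population grows, which is the source of the variance and complexity reduction claimed above.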

SCoUT architecture

Results

SCoUT maintains high win rates and elimination rates across Battle scales and strong catch rates across Pursuit scales; ablations show that both temporal grouping and counterfactual communication credit are critical.

Battle training curves

Battle: training curves across 20v20, 64v64, 81v81, 100v100.

Pursuit scaling summary

Pursuit: catch rate and completion rate across scales.

MAgent Battle

Win rate, elimination rate, milestone reach rates R50/R75, and time-to-milestone TT50/TT75 (steps; lower is more decisive). All metrics over 20 evaluation seeds.

Win rate (%):

| Method | 20v20 | 64v64 | 81v81 | 100v100 |
| --- | --- | --- | --- | --- |
| SCoUT | 100.0±0.0 | 100.0±0.0 | 100.0±0.0 | 100.0±0.0 |
| ExpoComm | 0.0±0.0 | 95.0±4.9 | 96.0±4.2 | 96.4±2.7 |
| ExpoComm-Static | 0.0±0.0 | 0.0±0.0 | 60.0±11.0 | 65.0±10.7 |
| IDQN | 0.0±0.0 | 0.0±0.0 | 0.0±0.0 | 0.0±0.0 |

Elimination rate (%):

| Method | 20v20 | 64v64 | 81v81 | 100v100 |
| --- | --- | --- | --- | --- |
| SCoUT | 95±6 | 98±4 | 99±3 | 99±3 |
| ExpoComm | 11±16 | 94±12 | 72±16 | 75±21 |
| ExpoComm-Static | 3±8 | 1±2 | 57±30 | 36±10 |
| IDQN | 0 | 0 | 7±2 | 0 |

Milestone reach rate R50/R75 (%):

| Method | 20v20 | 64v64 | 81v81 | 100v100 |
| --- | --- | --- | --- | --- |
| SCoUT | 100/100 | 100/100 | 100/100 | 100/100 |
| ExpoComm | 0/0 | 95/95 | 90/55 | 80/60 |
| ExpoComm-Static | 0/0 | 0/0 | 60/60 | 10/0 |
| IDQN | 0/0 | 0/0 | 0/0 | 0/0 |

TT50/TT75 (steps; horizon=200):

| Method | 20v20 | 64v64 | 81v81 | 100v100 |
| --- | --- | --- | --- | --- |
| SCoUT | 24±1/30±1 | 24±2/29±1 | 27±3/32±2 | 31±4/39±2 |
| ExpoComm | N/A | 59±20/81±24 | 100±44/182±14 | 101±21/160±23 |
| ExpoComm-Static | N/A | N/A | 85±7/153±20 | N/A |
| IDQN | N/A | N/A | N/A | N/A |

Pursuit (PettingZoo SISL)

Catch%, Done%, and milestone times TT50/TT75 (20 evaluation seeds; horizon=500). For milestone rows, each entry is TTk (Rk), where Rk is the % of episodes reaching the k% capture milestone.

Catch rate (Catch%, %):

| Method | 20P-8E | 40P-16E | 60P-24E | 80P-32E | 100P-40E |
| --- | --- | --- | --- | --- | --- |
| SCoUT | 94±10 | 93±8 | 89±5 | 90±6 | 88±5 |
| w/o counterfactual | 72±21 | 18±20 | 28±27 | 29±13 | 58±26 |
| w/o grouping | 57±18 | 10±9 | 14±6 | 37±10 | 41±10 |
| ExpoComm (Peer-n=7) | 86±14 | 58±14 | 62±12 | 72±13 | 69±10 |

Completion rate (Done%, %):

| Method | 20P-8E | 40P-16E | 60P-24E | 80P-32E | 100P-40E |
| --- | --- | --- | --- | --- | --- |
| SCoUT | 70 | 50 | 0 | 5 | 0 |
| w/o counterfactual | 20 | 0 | 0 | 0 | 0 |
| w/o grouping | 0 | 0 | 0 | 0 | 0 |
| ExpoComm (Peer-n=7) | 40 | 0 | 0 | 0 | 0 |

TT50 (steps; horizon=500), with R50% in parentheses:

| Method | 20P-8E | 40P-16E | 60P-24E | 80P-32E | 100P-40E |
| --- | --- | --- | --- | --- | --- |
| SCoUT | 74±78 (100) | 143±52 (100) | 72±19 (100) | 98±38 (100) | 70±18 (100) |
| w/o counterfactual | 173±149 (90) | N/A (15) | N/A (20) | N/A (5) | 173±86 (55) |
| w/o grouping | 212±184 (80) | N/A (0) | N/A (0) | N/A (15) | N/A (20) |
| ExpoComm (Peer-n=7) | 125±73 (100) | 191±114 (80) | 307±80 (85) | 184±93 (100) | 137±66 (95) |

TT75 (steps; horizon=500), with R75% in parentheses:

| Method | 20P-8E | 40P-16E | 60P-24E | 80P-32E | 100P-40E |
| --- | --- | --- | --- | --- | --- |
| SCoUT | 138±82 (95) | 227±65 (95) | 139±64 (100) | 212±79 (100) | 149±41 (100) |
| w/o counterfactual | 214±136 (65) | N/A (0) | N/A (15) | N/A (0) | N/A (35) |
| w/o grouping | N/A (30) | N/A (0) | N/A (0) | N/A (0) | N/A (0) |
| ExpoComm (Peer-n=7) | 234±114 (80) | N/A (15) | N/A (20) | N/A (40) | 425±59 (50) |

SCoUT in action

SCoUT policies across environment scales.

MAgent Battle

Battle rollouts at 20v20, 64v64, 81v81, and 100v100.

PettingZoo Pursuit

Pursuit rollouts at 20p-8e, 40p-16e, 60p-24e, 80p-32e, and 100p-40e.

BibTeX

To be added upon publication.