SCoUT: Scalable Communication via Utility-Guided Temporal Grouping in Multi-Agent Reinforcement Learning

Abstract

Communication can improve coordination in partially observed multi-agent reinforcement learning (MARL), but learning when and with whom to communicate requires choosing among many possible sender-recipient pairs, and the effect of any single message on future reward is hard to isolate. We introduce SCoUT (Scalable Communication via Utility-guided Temporal grouping), which addresses both challenges through temporal and agent abstraction within a standard MARL pipeline. During training, SCoUT resamples soft agent groups every K environment steps (macro-steps) via Gumbel-Softmax; these latent clusters induce an affinity matrix that serves as a differentiable prior over message recipients. Using the same assignments, a group-aware critic predicts a value for each group and maps these values to per-agent baselines through the soft assignments, reducing critic complexity and variance. Each agent is trained with a three-headed policy: environment action, send decision, and recipient selection. To obtain precise learning signals for communication, we derive counterfactual communication advantages by analytically removing each sender's contribution from the recipient's aggregated messages, enabling precise credit assignment for both send and recipient-selection decisions. At execution time, all centralized training components are discarded and only the per-agent policy runs, preserving decentralized execution. Experiments on large-scale benchmarks show that SCoUT learns targeted communication and remains effective as the population grows, while prior methods degrade. Ablations confirm that temporal grouping and counterfactual communication credit are both critical for scalability.
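To make the counterfactual computation concrete: when a recipient aggregates incoming messages by averaging, a sender's contribution can be removed analytically instead of re-running the aggregation without it. The sketch below is a minimal numpy illustration under that mean-aggregation assumption; the function names and the toy linear value function are hypothetical stand-ins, not SCoUT's actual implementation.

```python
import numpy as np

def remove_sender(messages, j):
    """Counterfactual aggregate: mean of messages with sender j removed.

    messages: (n, d) array of messages one recipient received this step.
    Uses the identity mean_{-j} = (n * mean - m_j) / (n - 1), so sender
    j's contribution is removed analytically, without re-aggregating.
    """
    n = messages.shape[0]
    full_sum = messages.sum(axis=0)
    return (full_sum - messages[j]) / (n - 1)

def comm_advantage(value_fn, messages, j):
    """Counterfactual communication advantage for sender j:
    value with j's message included minus value with it removed."""
    with_j = messages.mean(axis=0)
    without_j = remove_sender(messages, j)
    return value_fn(with_j) - value_fn(without_j)

# Toy linear "critic" over the aggregated message (hypothetical).
w = np.array([1.0, -0.5])
value_fn = lambda m: float(m @ w)

msgs = np.array([[2.0, 0.0],
                 [0.0, 2.0],
                 [1.0, 1.0]])
adv = comm_advantage(value_fn, msgs, 0)  # credit for sender 0's message
```

A positive `adv` means the sender's message raised the recipient's predicted value, giving a direct gradient signal for both the send decision and recipient selection.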

Methodology

SCoUT learns both whom to talk to (via a soft grouping policy over agent descriptors) and what to send (via a shared descriptor and message pool), trained with centralized PPO using group-aware value baselines and counterfactual advantages for communication actions.
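The two centralized-training pieces named above can be sketched as follows: soft group assignment via Gumbel-Softmax, the induced agent-to-agent affinity used as a recipient prior, and the mapping of per-group critic values back to per-agent baselines. This is an illustrative numpy version under simplified assumptions (in the method, the logits are learned end-to-end and groups are resampled once per K-step macro-step); all variable names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(logits, tau=1.0):
    """Soft one-hot sample from categorical logits (Gumbel-Softmax)."""
    u = rng.uniform(size=logits.shape)
    gumbel = -np.log(-np.log(u + 1e-20) + 1e-20)
    y = (logits + gumbel) / tau
    y -= y.max(axis=-1, keepdims=True)          # numerical stability
    expy = np.exp(y)
    return expy / expy.sum(axis=-1, keepdims=True)

n_agents, n_groups = 6, 2
logits = rng.normal(size=(n_agents, n_groups))  # learned in practice

# Sampled once per macro-step and held fixed within it.
Z = gumbel_softmax(logits, tau=0.5)             # (n_agents, n_groups); rows sum to 1

# Agent-to-agent affinity: a differentiable prior over message recipients.
affinity = Z @ Z.T                              # (n_agents, n_agents), symmetric

# Group-aware critic outputs one value per group; the same soft
# assignments map them back to per-agent baselines.
group_values = np.array([1.0, -1.0])            # stand-in critic outputs
baselines = Z @ group_values                    # (n_agents,)
```

Because the critic predicts `n_groups` values instead of `n_agents` values, its output dimension stays fixed as the population grows, which is the source of the variance and complexity reduction claimed above.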

SCoUT architecture

Results

SCoUT maintains high win rates and elimination rates across Battle scales and strong catch rates across Pursuit scales; ablations show that both temporal grouping and counterfactual communication credit are critical.

Battle training curves

Battle: training curves across 20v20, 64v64, 81v81, 100v100.

Pursuit scaling summary

Pursuit: catch rate and completion rate across scales.

MAgent Battle

Win rate, elimination rate, milestone reach rates R50/R75, and time-to-milestone TT50/TT75 (steps; lower is more decisive). All metrics over 20 evaluation seeds.

Win rate (%):

| Method | 20v20 | 64v64 | 81v81 | 100v100 |
| --- | --- | --- | --- | --- |
| SCoUT | 100.0±0.0 | 100.0±0.0 | 100.0±0.0 | 100.0±0.0 |
| ExpoComm | 0.0±0.0 | 95.0±4.9 | 96.0±4.2 | 96.4±2.7 |
| ExpoComm-Static | 0.0±0.0 | 0.0±0.0 | 60.0±11.0 | 65.0±10.7 |
| IDQN | 0.0±0.0 | 0.0±0.0 | 0.0±0.0 | 0.0±0.0 |

Elimination rate (%):

| Method | 20v20 | 64v64 | 81v81 | 100v100 |
| --- | --- | --- | --- | --- |
| SCoUT | 95±6 | 98±4 | 99±3 | 99±3 |
| ExpoComm | 11±16 | 94±12 | 72±16 | 75±21 |
| ExpoComm-Static | 3±8 | 1±2 | 57±30 | 36±10 |
| IDQN | 0 | 0 | 7±2 | 0 |

Milestone reach rate R50/R75 (%):

| Method | 20v20 | 64v64 | 81v81 | 100v100 |
| --- | --- | --- | --- | --- |
| SCoUT | 100/100 | 100/100 | 100/100 | 100/100 |
| ExpoComm | 0/0 | 95/95 | 90/55 | 80/60 |
| ExpoComm-Static | 0/0 | 0/0 | 60/60 | 10/0 |
| IDQN | 0/0 | 0/0 | 0/0 | 0/0 |

TT50/TT75 (steps; horizon=200):

| Method | 20v20 | 64v64 | 81v81 | 100v100 |
| --- | --- | --- | --- | --- |
| SCoUT | 24±1/30±1 | 24±2/29±1 | 27±3/32±2 | 31±4/39±2 |
| ExpoComm | N/A | 59±20/81±24 | 100±44/182±14 | 101±21/160±23 |
| ExpoComm-Static | N/A | N/A | 85±7/153±20 | N/A |
| IDQN | N/A | N/A | N/A | N/A |

Pursuit (PettingZoo SISL)

Catch%, Done%, and milestone times TT50/TT75 (20 evaluation seeds; horizon=500). For milestone rows, each entry is TTk (Rk), where Rk is the % of episodes reaching the k% capture milestone.

Catch rate (Catch%, %):

| Method | 20P-8E | 40P-16E | 60P-24E | 80P-32E | 100P-40E |
| --- | --- | --- | --- | --- | --- |
| SCoUT | 94±10 | 93±8 | 89±5 | 90±6 | 88±5 |
| w/o counterfactual | 72±21 | 18±20 | 28±27 | 29±13 | 58±26 |
| w/o grouping | 57±18 | 10±9 | 14±6 | 37±10 | 41±10 |
| ExpoComm (Peer-n=7) | 86±14 | 58±14 | 62±12 | 72±13 | 69±10 |

Completion rate (Done%, %):

| Method | 20P-8E | 40P-16E | 60P-24E | 80P-32E | 100P-40E |
| --- | --- | --- | --- | --- | --- |
| SCoUT | 70 | 50 | 0 | 5 | 0 |
| w/o counterfactual | 20 | 0 | 0 | 0 | 0 |
| w/o grouping | 0 | 0 | 0 | 0 | 0 |
| ExpoComm (Peer-n=7) | 40 | 0 | 0 | 0 | 0 |

TT50 (steps; horizon=500), with R50% in parentheses:

| Method | 20P-8E | 40P-16E | 60P-24E | 80P-32E | 100P-40E |
| --- | --- | --- | --- | --- | --- |
| SCoUT | 74±78 (100) | 143±52 (100) | 72±19 (100) | 98±38 (100) | 70±18 (100) |
| w/o counterfactual | 173±149 (90) | N/A (15) | N/A (20) | N/A (5) | 173±86 (55) |
| w/o grouping | 212±184 (80) | N/A (0) | N/A (0) | N/A (15) | N/A (20) |
| ExpoComm (Peer-n=7) | 125±73 (100) | 191±114 (80) | 307±80 (85) | 184±93 (100) | 137±66 (95) |

TT75 (steps; horizon=500), with R75% in parentheses:

| Method | 20P-8E | 40P-16E | 60P-24E | 80P-32E | 100P-40E |
| --- | --- | --- | --- | --- | --- |
| SCoUT | 138±82 (95) | 227±65 (95) | 139±64 (100) | 212±79 (100) | 149±41 (100) |
| w/o counterfactual | 214±136 (65) | N/A (0) | N/A (15) | N/A (0) | N/A (35) |
| w/o grouping | N/A (30) | N/A (0) | N/A (0) | N/A (0) | N/A (0) |
| ExpoComm (Peer-n=7) | 234±114 (80) | N/A (15) | N/A (20) | N/A (40) | 425±59 (50) |

SCoUT in action

SCoUT policies across environment scales.

MAgent Battle

Battle rollouts at 20v20, 64v64, 81v81, and 100v100.

PettingZoo Pursuit

Pursuit rollouts at 20p-8e, 40p-16e, 60p-24e, 80p-32e, and 100p-40e.

BibTeX

To be added upon publication.