the correspondence between the detected targets in the radar
images and the communication users. The association step
is fundamental in establishing the appropriate beamforming
direction for each user based on the radar information. By
successfully addressing both TD and T2U association, radar
sensing can effectively contribute to efficient beam manage-
ment in wireless communication systems.
TD is a central and challenging problem in computer
vision, playing a fundamental role in several applications. A
remarkable breakthrough in TD performance has emerged with
the proposal of region-based convolutional neural network (R-
CNN) architecture, which extracts hierarchical features from
CNNs and utilizes region proposals for TD [13]. R-CNN
is a two-stage detector, which (i) proposes a set of object
bounding boxes, and (ii) predicts for each proposed bounding
box the presence of an object and its class. By contrast,
one-stage detectors, e.g., Single Shot Detector (SSD) [14]
and YOLO [15], complete the TD task in a single step.
Fundamental milestones in two-stage deep learning-based TD
are represented by the introduction of spatial pyramid pooling
networks (SPPNet) [16], Fast R-CNN [17], Faster R-CNN
[18], and Feature Pyramid Networks (FPNs) [19].
The You Only Look Once (YOLO) architecture has been
proposed in [15] as the first single-stage object detection
model. Renowned for its efficiency, the YOLO model infers
bounding boxes in parallel for separate image regions. Nevertheless,
it suffers from lower localization accuracy than
two-stage detectors. Several YOLO versions have been
developed to overcome this limitation, leading to the recently
proposed YOLOv8 [20], which represents the state-of-the-art
model for both efficiency and accuracy. We refer the interested
reader to [21] for an exhaustive survey on TD models.
YOLO-based multi-target detection algorithms have been
successfully applied to radar sensing in the automotive field.
A method for simultaneous detection and classification of
radar targets in automotive scenarios based on YOLOv3 is
proposed in [22]. In [23], the authors design a lightweight
CNN leveraging dense connections, residual connections, and
group convolution and compose it with the YOLO architecture
to build a lightweight model for the detection of marine ship
objects in radar images, showing good mean average precision
(mAP) over two experimentally acquired datasets. A multi-source
YOLO-based TD method is presented in [24], where the authors fuse
the information derived from a mmWave radar and a camera
to improve the detection performance.
The T2U association is a key operation in ISAC systems,
but research in this area remains limited. In realistic scenarios,
T2U association poses significant challenges due to: i) the
presence of multiple users and targets, potentially resulting in
a non-empty set difference; ii) differing resolutions in sensing
and communication information—for example, sensing pro-
vides angle, range, and Doppler, while communication yields
beam and received power data; iii) practical considerations,
where vehicles, treated as extended targets, undergo geometric
distortions known as foreshortening and overlay [25] when
illuminated by a sensing apparatus positioned at the height of a
base station (BS). These distortions introduce biases in target
position estimation, leading to mismatches between commu-
nication and sensing information. The most notable works
addressing this topic are [25]–[27]. In [26], the focus is on
sensing-aided vehicular communications, where the challenge
lies in associating vehicle ID with detected targets using
radar and GPS data. The authors utilize the Kullback-Leibler
divergence as a similarity metric to solve the constrained data
association problem. Nevertheless, the accuracy and reliability
of the proposed approach can be affected by environmental
factors and by the need for an active low-rate uplink channel
for GPS data exchange. On the other hand, [27] proposes a
different approach that overcomes the aforementioned limitations.
The authors utilize ISAC-only data for multi-vehicle tracking
and perform ID association based on the Kullback-Leibler
divergence between estimated and predicted vehicle states
(range, velocity, and angle). In a more recent work [25],
the authors introduce reconfigurable intelligent surfaces (RIS)
mounted on vehicles’ roofs and strategically configured to
reflect the sensing signal towards the radar with a known
pattern, enabling accurate T2U association and improving TD.
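As a rough illustration of divergence-based ID association, the sketch below matches estimated target states to predicted ones by minimizing the total Kullback-Leibler divergence. It is not the procedure of [26] or [27]: the Gaussian state model with diagonal covariance, the function names `kl_diag_gauss` and `associate`, and the exhaustive search over permutations are all assumptions made for this example.

```python
import math
from itertools import permutations


def kl_diag_gauss(mu_p, var_p, mu_q, var_q):
    """KL(p || q) between two Gaussians with diagonal covariance (assumed model)."""
    total = 0.0
    for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q):
        total += 0.5 * (math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0)
    return total


def associate(estimated, predicted):
    """Find the one-to-one matching with minimum total KL divergence.

    Each state is a (mean, variance) pair of equal-length tuples, e.g.
    (range, velocity, angle). Returns a tuple where entry i is the index of
    the predicted state matched to estimated state i.
    Brute force over permutations: a hypothetical stand-in for a proper
    assignment solver, suitable only for a few targets.
    """
    n = len(estimated)
    cost = [[kl_diag_gauss(e[0], e[1], p[0], p[1]) for p in predicted]
            for e in estimated]
    return min(permutations(range(n)),
               key=lambda perm: sum(cost[i][perm[i]] for i in range(n)))


# Illustrative values only: (range [m], velocity [m/s], angle [rad]).
var = (4.0, 1.0, 0.01)
predicted = [((50.0, 10.0, 0.20), var), ((80.0, -5.0, -0.30), var)]
estimated = [((79.0, -4.5, -0.28), var), ((51.0, 9.5, 0.22), var)]
assignment = associate(estimated, predicted)
```

Exhaustive search is only practical for a handful of targets; a larger scenario would presumably call for a dedicated assignment solver.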
The abovementioned works require additional hardware
[25] or side information [26]. The work in [27], instead,
operates on the assumption of successful TD, which is a
particularly challenging task for extended targets in highly
dynamic scenarios. Unlike the current state of the
art, this paper proposes a unified framework for multi-user
TD and T2U association in ISAC systems
to enhance beam management. Indeed, when the association
between an observed radar target and the communication equipment
is known at some time steps, beam prediction for a VE can
be naturally conditioned on the radar target.
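To make the idea of conditioning beam prediction on a radar target concrete, the following sketch maps a radar angle estimate to the index of the strongest beam in a codebook. It assumes a line-of-sight link, a uniform linear array with half-wavelength spacing, and a DFT codebook; none of these choices, nor the names `steering`, `dft_codebook`, and `best_beam`, come from the paper, whose model learns the mapping from radar images instead.

```python
import cmath
import math


def steering(n_ant, theta):
    """Array response of a half-wavelength-spaced ULA (assumed geometry).

    theta is the angle from broadside, in radians.
    """
    return [cmath.exp(1j * math.pi * k * math.sin(theta)) for k in range(n_ant)]


def dft_codebook(n_ant):
    """n_ant unit-norm beams with spatial frequencies uniformly spaced in [-1, 1)."""
    beams = []
    for b in range(n_ant):
        u = -1.0 + 2.0 * b / n_ant  # spatial frequency sin(theta) of beam b
        beams.append([cmath.exp(1j * math.pi * k * u) / math.sqrt(n_ant)
                      for k in range(n_ant)])
    return beams


def best_beam(n_ant, theta):
    """Index of the codebook beam with the highest gain towards angle theta."""
    a = steering(n_ant, theta)
    gains = [abs(sum(w.conjugate() * x for w, x in zip(beam, a))) ** 2
             for beam in dft_codebook(n_ant)]
    return max(range(n_ant), key=gains.__getitem__)
```

Under these assumptions, `best_beam(8, 0.0)` returns 4, the broadside beam. Two targets at 0 and 0.2 rad share a beam with 4 antennas but map to different beams with 16, since narrower beams separate closely spaced angles.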
The main contributions of this paper are the following:
• Considering a hybrid MIMO communication system and
a mmWave MIMO radar, we employ the latest version
of the YOLO architecture (YOLOv8 [20]) to jointly
achieve real-time radar multi-target detection and analog
beam prediction at the BS for each detected radar target.
We model the beam prediction problem as a multi-class
classification task over a fixed-size codebook of beam-
forming vectors, and we integrate the beam classification
within the YOLO prediction heads. The resulting model is
trained and evaluated over realistic simulated radar range-
angle images and accurate ray-tracing wireless channel
data. We show that YOLO achieves considerable detec-
tion and classification performance on both radar target
detection and beam prediction tasks.
• Leveraging the beam prediction obtained from YOLOv8,
and the beamforming vectors selected for each VE at the
BS, we tackle the radar target-to-user (T2U) association
problem in the beamspace, showing that the probability
of correct association significantly increases with the
antenna array size at the BS—which highlights the related
increase in the separability of the VEs in the beamspace
required for effective association.
• We design a framework for the simulation of radar images
and wireless communication channels at the communi-
cation infrastructure. Dynamic vehicular simulations are
achieved by integrating the Simulation of Urban Mobility
(SUMO) [28] vehicular traffic simulator, and the EM