Wireless Dataset Similarity: Measuring Distances in Supervised and Unsupervised Machine Learning
By: João Morais, Sadjad Alikhani, Akshay Malhotra and more
Potential Business Impact:
Helps wireless devices learn from different data.
This paper introduces a task- and model-aware framework for measuring similarity between wireless datasets, enabling applications such as dataset selection and augmentation, simulation-to-real (sim2real) comparison, task-specific synthetic data generation, and decisions about model training and adaptation to new deployments. We evaluate candidate dataset distance metrics by how well they predict cross-dataset transferability: if two datasets have a small distance, a model trained on one should perform well on the other. We first apply the framework to an unsupervised task, channel state information (CSI) compression, using autoencoders. Using metrics based on UMAP embeddings, combined with Wasserstein and Euclidean distances, we achieve Pearson correlations exceeding 0.85 between dataset distances and train-on-one/test-on-another task performance. We also apply the framework to a supervised task, downlink beam prediction, using convolutional neural networks. For this task, we derive a label-aware distance by integrating supervised UMAP and penalties for dataset imbalance. Across both tasks, the resulting distances outperform traditional baselines and consistently exhibit stronger correlations with model transferability, supporting task-relevant comparisons between wireless datasets.
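The evaluation idea in the abstract (score a distance metric by how well it correlates with train-on-one/test-on-another performance) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the authors' pipeline: the datasets are synthetic Gaussian clouds, a sliced Wasserstein distance on raw features stands in for the paper's UMAP-based distances, and a "predict the training mean" rule stands in for the autoencoder or CNN. The function names are hypothetical.

```python
import numpy as np

def wasserstein_1d(a, b):
    """Wasserstein-1 distance between two equal-size 1-D samples."""
    return float(np.mean(np.abs(np.sort(a) - np.sort(b))))

def sliced_wasserstein(x, y, n_proj=64, seed=0):
    """Average 1-D Wasserstein distance over random unit projections.

    Assumption: used here in place of the paper's UMAP + Wasserstein metric.
    """
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(n_proj, x.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    return float(np.mean([wasserstein_1d(x @ d, y @ d) for d in dirs]))

def transfer_error(train, test):
    """Toy stand-in for a model: predict the training mean, report test MSE."""
    return float(np.mean(np.sum((test - train.mean(axis=0)) ** 2, axis=1)))

def distance_transfer_correlation(datasets):
    """Pearson correlation between pairwise distance and cross-dataset error."""
    dists, errs = [], []
    for i, a in enumerate(datasets):
        for j, b in enumerate(datasets):
            if i != j:  # only cross-dataset pairs are informative
                dists.append(sliced_wasserstein(a, b))
                errs.append(transfer_error(a, b))
    return float(np.corrcoef(dists, errs)[0, 1])

# Synthetic "datasets": the same Gaussian cloud shifted by increasing offsets.
rng = np.random.default_rng(0)
shift = np.array([1.0, 0.0, 0.0, 0.0])
datasets = [rng.normal(size=(500, 4)) + k * shift for k in range(4)]

r = distance_transfer_correlation(datasets)
print(f"Pearson r = {r:.3f}")
```

Because the toy score is an error rather than an accuracy, a useful distance should correlate positively with it; with an accuracy-style score the sign would flip. Comparing several candidate distances then amounts to swapping out `sliced_wasserstein` and checking which one yields the strongest correlation.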
Similar Papers
Measuring Time-Series Dataset Similarity using Wasserstein Distance
Machine Learning (CS)
Finds similar patterns in data over time.
Wasserstein distance based semi-supervised manifold learning and application to GNSS multi-path detection
Machine Learning (CS)
Teaches computers to find bad signals with few examples.
Statistical Inference for Manifold Similarity and Alignability across Noisy High-Dimensional Datasets
Statistics Theory
Compares complex data by looking at its hidden shapes.