Generating Skyline Datasets for Data Science Models
Abstract
Preparing high-quality datasets required by various data-driven AI and machine learning models has become a cornerstone task in data-driven analysis. Conventional data discovery methods typically integrate datasets towards a single pre-defined quality measure, which may introduce bias in downstream tasks. This paper introduces MODis, a framework that discovers datasets by optimizing multiple user-defined model performance measures. Given a set of data sources and a model, MODis selects and integrates data sources into a skyline dataset, over which the model is expected to achieve the desired performance on all of the performance measures. We formulate MODis as a multi-goal finite state transducer, and derive three feasible algorithms to generate skyline datasets. Our first algorithm adopts a "reduce-from-universal" strategy that starts with a universal schema and iteratively prunes unpromising data. Our second algorithm further reduces the cost with a bi-directional strategy that interleaves data augmentation and reduction. We also introduce a diversification algorithm to mitigate the bias in skyline datasets. We experimentally verify the efficiency and effectiveness of our skyline data discovery algorithms, and showcase their applications in optimizing data science pipelines.
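To make the notion of a skyline concrete, here is a minimal sketch of skyline (Pareto-optimal) selection over candidate datasets. It is not taken from the paper: the candidate names, the two measures, and the convention that higher scores are better are all assumptions made for illustration. A candidate stays in the skyline if no other candidate is at least as good on every measure and strictly better on at least one.

```python
from typing import Dict, List, Tuple

Scores = Tuple[float, ...]


def dominates(a: Scores, b: Scores) -> bool:
    """True if `a` is at least as good as `b` everywhere and strictly better somewhere.

    Assumes every measure is oriented so that higher is better.
    """
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def skyline(scores: Dict[str, Scores]) -> List[str]:
    """Return the names of candidates that no other candidate dominates."""
    return sorted(
        name
        for name, s in scores.items()
        if not any(dominates(t, s) for other, t in scores.items() if other != name)
    )


# Hypothetical candidates scored on two measures, e.g. (accuracy, training efficiency).
candidates = {
    "D1": (0.90, 0.60),
    "D2": (0.85, 0.80),
    "D3": (0.80, 0.70),  # dominated by D2 on both measures
    "D4": (0.70, 0.95),
}

result = skyline(candidates)
print(result)
```

In this toy input, D3 drops out because D2 beats it on both measures, while D1, D2, and D4 each represent a different trade-off and remain in the skyline.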
Community
This paper introduces MODis, a multi-objective data discovery framework that augments data from external repositories to enhance multiple user-defined model performance metrics for a downstream model. This improves the effectiveness of data-driven AI and machine-learning models.
The authors propose three algorithms: ApxMODis, which employs a reduce-from-universal approach; BiMODis, which utilizes bi-directional strategies with correlation-based pruning; and DivMODis, which refines results to mitigate bias in the Pareto set.
The framework is validated on both tabular and graph data, demonstrating its applicability across diverse model types, including Gradient Boosting, Random Forest, and GNN models.
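The reduce-from-universal idea mentioned above can be illustrated with a toy sketch. This is not ApxMODis itself: the attribute names, the `evaluate` function, and the bounded breadth-first reduction are hypothetical stand-ins, and a real system would train and test the downstream model instead of summing fixed weights. The sketch starts from a universal attribute set, repeatedly removes one attribute at a time, and keeps the non-dominated states.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

Scores = Tuple[float, ...]
State = FrozenSet[str]


def dominates(a: Scores, b: Scores) -> bool:
    """Higher is better on every measure (an assumption of this sketch)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def reduce_from_universal(
    universal: State,
    evaluate: Callable[[State], Scores],
    max_steps: int,
) -> List[State]:
    """Explore reductions of `universal` and return the non-dominated states."""
    seen = {universal}
    frontier = [universal]
    for _ in range(max_steps):
        next_frontier = []
        for state in frontier:
            for attr in state:
                child = state - {attr}  # drop one attribute
                if child and child not in seen:
                    seen.add(child)
                    next_frontier.append(child)
        frontier = next_frontier

    scored: Dict[State, Scores] = {s: evaluate(s) for s in seen}
    return [
        s
        for s, v in scored.items()
        if not any(dominates(w, v) for t, w in scored.items() if t != s)
    ]


# Hypothetical per-attribute usefulness; "noise" hurts the model.
weights = {"a": 0.5, "b": 0.3, "noise": -0.1}


def evaluate(state: State) -> Scores:
    """Toy measures: (model quality, negated dataset width)."""
    return (sum(weights[x] for x in state), -float(len(state)))


result = reduce_from_universal(frozenset(weights), evaluate, max_steps=2)
print(sorted(sorted(s) for s in result))
```

With these toy weights, the universal set is dominated by the set without "noise", and the surviving states are {a, b} (best quality) and {a} (smallest dataset with the best quality at that size).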
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Partitioning Strategies for Parallel Computation of Flexible Skylines (2025)
- Quality over Quantity: Boosting Data Efficiency Through Ensembled Multimodal Data Curation (2025)
- A Novel Diffusion Model for Pairwise Geoscience Data Generation with Unbalanced Training Dataset (2025)
- Advancing ALS Applications with Large-Scale Pre-training: Dataset Development and Downstream Assessment (2025)
- MultiFloodSynth: Multi-Annotated Flood Synthetic Dataset Generation (2025)
- Few-shot LLM Synthetic Data with Distribution Matching (2025)
- Synthetic Datasets for Machine Learning on Spatio-Temporal Graphs using PDEs (2025)