MM-Food-100K: Experiment Iteration and the Deep Dive into Data Value
In the field of AI, building a high-quality dataset is just as crucial as training a powerful model. We understand this deeply. We recently published a paper titled "MM-Food-100K: A 100,000-Sample Multimodal Food Intelligence Dataset with Verifiable Provenance", aiming to introduce our work to the community: a large-scale multimodal food intelligence dataset, MM-Food-100K, and the innovative data protocol behind it. Our paper is available on arXiv at: https://arxiv.org/abs/2508.10429.
In the paper, we conducted a preliminary experiment showing that fine-tuning Large Vision-Language Models (LVLMs) on MM-Food-100K significantly improves their performance on food intelligence tasks. Although newer models such as GPT-5 have been released, the most advanced SFT (Supervised Fine-Tuning) services currently available are still based on GPT-4o. For this reason, we chose GPT-4o as our benchmark, alongside Qwen-VL-MAX, to ensure our experiments were rigorous and representative of the current state of the art.
However, a quick preliminary study is just the beginning. To more deeply understand the impact of data scale on model performance, we embarked on a more detailed experimental iteration.
Experimental Design and Hyperparameter Configuration
To explore the relationship between data volume and model performance, we conducted an iterative experiment. By using different data subsets (100, 1,000, 10,000, and 50,000 samples), we were able to plot the performance curve and reveal the "data scaling law."
To ensure the reproducibility of our experiments, we meticulously recorded all hyperparameters used during the fine-tuning process. All experiments maintained the same parameter configurations to ensure that performance differences were solely attributable to changes in training data size.
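To make the subset construction concrete, here is a minimal sketch in Python. The file name, record layout, and the choice of nested subsets (each larger subset containing the smaller ones) are illustrative assumptions rather than a description of our exact pipeline.

```python
import json
import random

# Draw nested training subsets (100 ⊂ 1,000 ⊂ 10,000 ⊂ 50,000) from a shuffled
# copy of the full dataset, so each larger subset contains the smaller ones.
# The JSONL file name and record layout are assumptions for illustration.
SUBSET_SIZES = [100, 1_000, 10_000, 50_000]

with open("mm_food_100k.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

random.seed(42)        # fixed seed so the subsets are reproducible
random.shuffle(records)

for size in SUBSET_SIZES:
    with open(f"train_subset_{size}.jsonl", "w", encoding="utf-8") as out:
        for record in records[:size]:
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
```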
Qwen-VL-MAX Hyperparameter Settings
The Qwen-VL-MAX model fine-tuning hyperparameters we used are as follows:
- Epochs: 3
- Learning Rate: 3e-4
- Batch Size: 16
- Sequence Length: 8192
- Validation Steps: 50
- LoRA Rank: 8
- LoRA Alpha: 32
- LoRA Dropout: 0.1
- Weight Decay: 0.01
- Learning Rate Warmup Ratio: 0.05
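For readers who want to approximate this setup with open tooling, the sketch below maps the values above onto a Hugging Face PEFT LoraConfig and transformers TrainingArguments. Our actual runs used a managed fine-tuning service for Qwen-VL-MAX, so the library choice here is an assumption for illustration only.

```python
from peft import LoraConfig
from transformers import TrainingArguments

# LoRA adapter settings mirroring the list above.
lora_config = LoraConfig(
    r=8,               # LoRA rank
    lora_alpha=32,     # LoRA alpha (scaling factor)
    lora_dropout=0.1,  # LoRA dropout
    task_type="CAUSAL_LM",
)

# Optimizer and schedule settings mirroring the list above. The 8192-token
# sequence length is normally passed to the trainer or tokenizer (e.g. TRL's
# SFTConfig max_seq_length) rather than to TrainingArguments.
training_args = TrainingArguments(
    output_dir="qwen_vl_max_food_sft",  # placeholder output path
    num_train_epochs=3,
    learning_rate=3e-4,
    per_device_train_batch_size=16,
    weight_decay=0.01,
    warmup_ratio=0.05,
    eval_strategy="steps",  # named evaluation_strategy in older transformers versions
    eval_steps=50,          # validate every 50 steps
    logging_steps=50,
)
```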
GPT-4o Hyperparameter Settings
The GPT-4o model fine-tuning hyperparameters we used are as follows:
- Epochs: 3
- Batch size: 16
- LR Multiplier: 2
- Seed: 2
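These values map directly onto OpenAI's fine-tuning API. The snippet below is a minimal sketch of how such a job can be submitted with the official Python SDK; the training file ID and model snapshot name are placeholders, not the values from our runs.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Submit a fine-tuning job with the hyperparameters listed above.
# "file-XXXXXXXX" and the model snapshot are placeholders for illustration.
job = client.fine_tuning.jobs.create(
    model="gpt-4o-2024-08-06",
    training_file="file-XXXXXXXX",
    seed=2,
    hyperparameters={
        "n_epochs": 3,
        "batch_size": 16,
        "learning_rate_multiplier": 2,
    },
)
print(job.id, job.status)
```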
Experimental Results and In-Depth Analysis
Our extended experiments used Qwen-VL-MAX and GPT-4o as base models, fine-tuning them on our different data subsets. We focused on two core tasks: calorie regression and multi-task classification.
Regression Task: Calorie Prediction (Kcal)
We used MAE (Mean Absolute Error), RMSE (Root Mean Squared Error), and R² (coefficient of determination) to measure prediction accuracy.
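For reference, all three metrics can be computed with scikit-learn; the arrays below are illustrative stand-ins for per-image predictions and ground-truth calorie labels, not our actual outputs.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Illustrative stand-ins for ground-truth and predicted calories (kcal).
y_true = np.array([320.0, 540.0, 150.0, 710.0])
y_pred = np.array([300.0, 580.0, 170.0, 650.0])

mae = mean_absolute_error(y_true, y_pred)           # mean absolute error
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # root mean squared error
r2 = r2_score(y_true, y_pred)                       # coefficient of determination

print(f"MAE={mae:.1f} kcal  RMSE={rmse:.1f} kcal  R²={r2:.3f}")
```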
Model | Training Data Size | MAE (kcal) ↓ | RMSE (kcal) ↓ | R² ↑ |
---|---|---|---|---|
Qwen-VL-MAX | 0 (Base) | 126.5 | 185.3 | 0.521 |
Qwen-VL-MAX | 100 | 125.4 | 184.2 | 0.525 |
Qwen-VL-MAX | 1,000 | 123.8 | 181.5 | 0.539 |
Qwen-VL-MAX | 10,000 | 107.5 | 159.1 | 0.612 |
Qwen-VL-MAX | 50,000 | 104.2 | 154.5 | 0.638 |
GPT-4o | 0 (Base) | 98.7 | 148.1 | 0.685 |
GPT-4o | 100 | 98.4 | 147.8 | 0.687 |
GPT-4o | 1,000 | 97.9 | 147.1 | 0.690 |
GPT-4o | 10,000 | 96.2 | 144.9 | 0.702 |
GPT-4o | 50,000 | 95.8 | 144.3 | 0.706 |
Key Findings:
- Limited Gains from Small Samples: Fine-tuning with just 100 or 1,000 samples resulted in very little performance improvement. The MAE for Qwen-VL-MAX only saw a minimal drop, and for GPT-4o the gain was almost negligible. This suggests that a small amount of domain-specific data has a limited impact on powerful base models.
- The Scale Effect is Real: We observed a clear inflection point in performance when the training data size increased to 10,000 samples. Qwen-VL-MAX's MAE dropped from 123.8 kcal with 1,000 samples to 107.5 kcal, a significant reduction of 13.2%. This indicates that high-quality data has immense value, but this value can only be unlocked after reaching a certain scale threshold.
- Continuous Improvement: As the data size continued to increase to 50,000, both models kept improving. Qwen-VL-MAX's overall MAE dropped by 17.6% relative to the base model, showing that a large-scale dataset can consistently boost a model's accuracy on specific tasks and help it close the gap with top-tier models. Both percentages are reproduced in the short calculation after this list.
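For transparency, the relative MAE reductions quoted above can be recomputed directly from the table values; the snippet below is a minimal check, not part of the evaluation pipeline.

```python
# Recompute the relative MAE reductions for Qwen-VL-MAX from the table above.
mae_1k, mae_10k = 123.8, 107.5    # kcal: 1,000-sample vs. 10,000-sample fine-tune
mae_base, mae_50k = 126.5, 104.2  # kcal: base model vs. 50,000-sample fine-tune

step_drop = (mae_1k - mae_10k) / mae_1k * 100         # ≈ 13.2%
overall_drop = (mae_base - mae_50k) / mae_base * 100  # ≈ 17.6%
print(f"1,000 → 10,000 samples: {step_drop:.1f}% lower MAE")
print(f"base → 50,000 samples: {overall_drop:.1f}% lower MAE")
```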
Classification Task: Win Rate Comparison
We also analyzed the models' performance on classification tasks like dish names, ingredients, and cooking methods.
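As a rough illustration of how a pairwise win rate might be tallied (assuming each fine-tuned answer is compared against a reference answer by some judge), here is a minimal sketch; the judge function and comparison target are illustrative assumptions, not our actual evaluation protocol.

```python
from typing import Callable, Sequence

def win_rate(
    candidate_answers: Sequence[str],
    reference_answers: Sequence[str],
    judge: Callable[[str, str], bool],
) -> float:
    """Fraction of samples on which the judge prefers the candidate answer."""
    wins = sum(judge(c, r) for c, r in zip(candidate_answers, reference_answers))
    return wins / len(candidate_answers)

# Toy usage with a deliberately simplistic placeholder judge; in practice the
# judge could be exact matching against gold labels or an LVLM-based comparison.
example = win_rate(
    candidate_answers=["Margherita pizza", "Beef pho"],
    reference_answers=["pizza", "noodle soup"],
    judge=lambda cand, ref: len(cand.split()) >= len(ref.split()),
)
print(f"win rate: {example:.1%}")
```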
Model | Training Data Size | Dish Name (Win Rate) | Ingredients (Win Rate) | Cooking Method (Win Rate) |
---|---|---|---|---|
Qwen-VL-MAX | 100 | 50.5% | 50.2% | 50.3% |
Qwen-VL-MAX | 1,000 | 51.5% | 51.3% | 51.2% |
Qwen-VL-MAX | 10,000 | 55.4% | 57.2% | 56.1% |
Qwen-VL-MAX | 50,000 | 57.9% | 60.2% | 58.7% |
GPT-4o | 100 | 50.2% | 50.1% | 50.3% |
GPT-4o | 1,000 | 50.5% | 50.4% | 50.6% |
GPT-4o | 10,000 | 50.8% | 50.8% | 50.6% |
GPT-4o | 50,000 | 51.1% | 51.4% | 51.2% |
Key Findings:
- The classification results are consistent with our regression findings. Performance gains were minimal with 100 and 1,000 samples but accelerated noticeably once the data size reached the 10,000-sample mark, most visibly for Qwen-VL-MAX. This further supports the "data threshold effect": a model's true potential is only unlocked when it is exposed to a sufficiently large amount of high-quality data.
- These results demonstrate that the true value of MM-Food-100K lies in its scale. It not only provides high-quality annotations but, more importantly, offers a sufficient number of samples to unlock a model's full potential, enabling a significant performance breakthrough on domain-specific tasks.
Conclusion: Dataset Scale Determines True Value
Through this experimental iteration, we not only re-validated the value of the MM-Food-100K dataset but, more importantly, proved that its scale is its most crucial asset. A small amount of data may have a limited impact on a large model, but once the data volume reaches a certain level, the gains become non-linear, allowing the model to achieve a significant performance leap in a specific domain.
Although our public dataset contains 100,000 samples, limitations of the fine-tuning services currently available for these models meant we could only evaluate up to 50,000 samples. This raises a question for us: what would the performance ceiling be if we could fine-tune with 100,000, 500,000, or even more high-quality data points? We are confident that once these technical limitations are overcome, the true potential of large-scale datasets like MM-Food-100K will be fully unleashed.
We hope this post inspires more members of the community to explore this area. We look forward to seeing more research on the impact of large-scale data on model performance as we collectively explore the endless possibilities that data offers for AI development.