stereoplegic's Collections: Dataset generation
Ensemble-Instruct: Generating Instruction-Tuning Data with a
Heterogeneous Mixture of LMs
Paper
• 2310.13961
• Published
• 5
ZeroGen: Efficient Zero-shot Learning via Dataset Generation
Paper
• 2202.07922
• Published
• 1
Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large
Language Models by Extrapolating Errors from Small Models
Paper
• 2310.13671
• Published
• 19
Fabricator: An Open Source Toolkit for Generating Labeled Training Data
with Teacher LLMs
Paper
• 2309.09582
• Published
• 4
Auto-Instruct: Automatic Instruction Generation and Ranking for
Black-Box Language Models
Paper
• 2310.13127
• Published
• 12
TeGit: Generating High-Quality Instruction-Tuning Data with
Text-Grounded Task Design
Paper
• 2309.05447
• Published
• 1
Ada-Instruct: Adapting Instruction Generators for Complex Reasoning
Paper
• 2310.04484
• Published
• 5
Diversity of Thought Improves Reasoning Abilities of Large Language
Models
Paper
• 2310.07088
• Published
• 5
Text Data Augmentation in Low-Resource Settings via Fine-Tuning of Large
Language Models
Paper
• 2310.01119
• Published
• 1
Training Generative Question-Answering on Synthetic Data Obtained from
an Instruct-tuned Model
Paper
• 2310.08072
• Published
• 1
Synthetic Data Generation with Large Language Models for Text
Classification: Potential and Limitations
Paper
• 2310.07849
• Published
• 2
Generative Data Augmentation using LLMs improves Distributional
Robustness in Question Answering
Paper
• 2309.06358
• Published
• 1
RLAIF: Scaling Reinforcement Learning from Human Feedback with AI
Feedback
Paper
• 2309.00267
• Published
• 53
Adapting Large Language Models via Reading Comprehension
Paper
• 2309.09530
• Published
• 82
Self-Alignment with Instruction Backtranslation
Paper
• 2308.06259
• Published
• 43
Unnatural Instructions: Tuning Language Models with (Almost) No Human
Labor
Paper
• 2212.09689
• Published
• 1
Democratizing Reasoning Ability: Tailored Learning from Large Language
Model
Paper
• 2310.13332
• Published
• 16
Teaching Language Models to Self-Improve through Interactive
Demonstrations
Paper
• 2310.13522
• Published
• 12
Self-Convinced Prompting: Few-Shot Question Answering with Repeated
Introspection
Paper
• 2310.05035
• Published
• 1
Tuna: Instruction Tuning using Feedback from Large Language Models
Paper
• 2310.13385
• Published
• 10
Reflection-Tuning: Data Recycling Improves LLM Instruction-Tuning
Paper
• 2310.11716
• Published
• 6
CITING: Large Language Models Create Curriculum for Instruction Tuning
Paper
• 2310.02527
• Published
• 3
AlpaGasus: Training A Better Alpaca with Fewer Data
Paper
• 2307.08701
• Published
• 24
Reverse Chain: A Generic-Rule for LLMs to Master Multi-API Planning
Paper
• 2310.04474
• Published
• 2
UltraFeedback: Boosting Language Models with High-quality Feedback
Paper
• 2310.01377
• Published
• 5
Promptor: A Conversational and Autonomous Prompt Generation Agent for
Intelligent Text Entry Techniques
Paper
• 2310.08101
• Published
• 2
FreshLLMs: Refreshing Large Language Models with Search Engine
Augmentation
Paper
• 2310.03214
• Published
• 20
WizardMath: Empowering Mathematical Reasoning for Large Language Models
via Reinforced Evol-Instruct
Paper
• 2308.09583
• Published
• 7
Retrieval-Generation Synergy Augmented Large Language Models
Paper
• 2310.05149
• Published
• 1
Prompting Large Language Models with Chain-of-Thought for Few-Shot
Knowledge Base Question Generation
Paper
• 2310.08395
• Published
• 1
Prometheus: Inducing Fine-grained Evaluation Capability in Language
Models
Paper
• 2310.08491
• Published
• 57
LMDX: Language Model-based Document Information Extraction and
Localization
Paper
• 2309.10952
• Published
• 67
Quality-Diversity through AI Feedback
Paper
• 2310.13032
• Published
• 1
CommonCanvas: An Open Diffusion Model Trained with Creative-Commons
Images
Paper
• 2310.16825
• Published
• 36
A Picture is Worth a Thousand Words: Principled Recaptioning Improves
Image Generation
Paper
• 2310.16656
• Published
• 53
In-Context Pretraining: Language Modeling Beyond Document Boundaries
Paper
• 2310.10638
• Published
• 30
Large Language Models Are Also Good Prototypical Commonsense Reasoners
Paper
• 2309.13165
• Published
• 1
DialCoT Meets PPO: Decomposing and Exploring Reasoning Paths in Smaller
Language Models
Paper
• 2310.05074
• Published
• 1
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language
Models
Paper
• 2309.12284
• Published
• 19
Commonsense Knowledge Transfer for Pre-trained Language Models
Paper
• 2306.02388
• Published
• 1
Snowman: A Million-scale Chinese Commonsense Knowledge Graph Distilled
from Foundation Model
Paper
• 2306.10241
• Published
• 1
VIGC: Visual Instruction Generation and Correction
Paper
• 2308.12714
• Published
• 1
LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset
Paper
• 2309.11998
• Published
• 26
In-Context Alignment: Chat with Vanilla Language Models Before
Fine-Tuning
Paper
• 2308.04275
• Published
• 1
Large Language Model as a User Simulator
Paper
• 2308.11534
• Published
• 2
Enable Language Models to Implicitly Learn Self-Improvement From Data
Paper
• 2310.00898
• Published
• 24
Textbooks Are All You Need II: phi-1.5 technical report
Paper
• 2309.05463
• Published
• 89
Aligning Large Language Models through Synthetic Feedback
Paper
• 2305.13735
• Published
• 1
Reinforced Self-Training (ReST) for Language Modeling
Paper
• 2308.08998
• Published
• 3
How Far Can Camels Go? Exploring the State of Instruction Tuning on Open
Resources
Paper
• 2306.04751
• Published
• 5
Query2doc: Query Expansion with Large Language Models
Paper
• 2303.07678
• Published
• 2
Query Expansion by Prompting Large Language Models
Paper
• 2305.03653
• Published
• 3
Generative Relevance Feedback with Large Language Models
Paper
• 2304.13157
• Published
• 1
InPars-v2: Large Language Models as Efficient Dataset Generators for
Information Retrieval
Paper
• 2301.01820
• Published
• 1
Exploring the Viability of Synthetic Query Generation for Relevance
Prediction
Paper
• 2305.11944
• Published
• 1
LM-CPPF: Paraphrasing-Guided Data Augmentation for Contrastive
Prompt-Based Few-Shot Fine-Tuning
Paper
• 2305.18169
• Published
• 1
Automated Annotation with Generative AI Requires Validation
Paper
• 2306.00176
• Published
• 1
Augmented Large Language Models with Parametric Knowledge Guiding
Paper
• 2305.04757
• Published
• 2
Pre-training with Large Language Model-based Document Expansion for
Dense Passage Retrieval
Paper
• 2308.08285
• Published
• 1
Learning to Retrieve In-Context Examples for Large Language Models
Paper
• 2307.07164
• Published
• 23
Tuning Language Models as Training Data Generators for
Augmentation-Enhanced Few-Shot Learning
Paper
• 2211.03044
• Published
• 1
Corpus Synthesis for Zero-shot ASR domain Adaptation using Large
Language Models
Paper
• 2309.10707
• Published
• 2
PromptMix: A Class Boundary Augmentation Method for Large Language Model
Distillation
Paper
• 2310.14192
• Published
• 2
The Program Testing Ability of Large Language Models for Code
Paper
• 2310.05727
• Published
• 2
Assessing the potential of AI-assisted pragmatic annotation: The case of
apologies
Paper
• 2305.08339
• Published
• 1
Effectiveness of Data Augmentation for Parameter Efficient Tuning with
Limited Data
Paper
• 2303.02577
• Published
• 1
Rethink the Effectiveness of Text Data Augmentation: An Empirical
Analysis
Paper
• 2306.07664
• Published
• 1
TeacherLM: Teaching to Fish Rather Than Giving the Fish, Language
Modeling Likewise
Paper
• 2310.19019
• Published
• 9
Textbooks Are All You Need
Paper
• 2306.11644
• Published
• 154
Connecting Large Language Models with Evolutionary Algorithms Yields
Powerful Prompt Optimizers
Paper
• 2309.08532
• Published
• 54
SAIL: Search-Augmented Instruction Learning
Paper
• 2305.15225
• Published
• 2
Reproducing Whisper-Style Training Using an Open-Source Toolkit and
Publicly Available Data
Paper
• 2309.13876
• Published
• 1
Leveraging Training Data in Few-Shot Prompting for Numerical Reasoning
Paper
• 2305.18170
• Published
• 2
Constructing Multilingual Code Search Dataset Using Neural Machine
Translation
Paper
• 2306.15604
• Published
• 1
TRACED: Execution-aware Pre-training for Source Code
Paper
• 2306.07487
• Published
• 1
Too Few Bug Reports? Exploring Data Augmentation for Improved
Changeset-based Bug Localization
Paper
• 2305.16430
• Published
• 1
Learning to Reason and Memorize with Self-Notes
Paper
• 2305.00833
• Published
• 5
Generating Efficient Training Data via LLM-based Attribute Manipulation
Paper
• 2307.07099
• Published
• 1
End-to-end Knowledge Retrieval with Multi-modal Queries
Paper
• 2306.00424
• Published
• 1
EchoPrompt: Instructing the Model to Rephrase Queries for Improved
In-context Learning
Paper
• 2309.10687
• Published
• 1
AugGPT: Leveraging ChatGPT for Text Data Augmentation
Paper
• 2302.13007
• Published
• 1
Large Language Models as Annotators: Enhancing Generalization of NLP
Models at Minimal Cost
Paper
• 2306.15766
• Published
• 1
Quick Starting Dialog Systems with Paraphrase Generation
Paper
• 2204.02546
• Published
• 1
NLP From Scratch Without Large-Scale Pretraining: A Simple and Efficient
Framework
Paper
• 2111.04130
• Published
• 1
Scaling Relationship on Learning Mathematical Reasoning with Large
Language Models
Paper
• 2308.01825
• Published
• 23
Harnessing the Power of David against Goliath: Exploring Instruction
Data Generation without Using Closed-Source Models
Paper
• 2308.12711
• Published
• 1
AnnoLLM: Making Large Language Models to Be Better Crowdsourced
Annotators
Paper
• 2303.16854
• Published
• 1
Training Language Models with Language Feedback at Scale
Paper
• 2303.16755
• Published
• 1
Magicoder: Source Code Is All You Need
Paper
• 2312.02120
• Published
• 82
Asking Questions the Human Way: Scalable Question-Answer Generation from
Text Corpus
Paper
• 2002.00748
• Published
• 1
Beyond Human Data: Scaling Self-Training for Problem-Solving with
Language Models
Paper
• 2312.06585
• Published
• 29
WaveCoder: Widespread And Versatile Enhanced Instruction Tuning with
Refined Data Generation
Paper
• 2312.14187
• Published
• 49
Self-Instruct: Aligning Language Model with Self Generated Instructions
Paper
• 2212.10560
• Published
• 9
WizardLM: Empowering Large Language Models to Follow Complex
Instructions
Paper
• 2304.12244
• Published
• 13
WizardCoder: Empowering Code Large Language Models with Evol-Instruct
Paper
• 2306.08568
• Published
• 33
EASYTOOL: Enhancing LLM-based Agents with Concise Tool Instruction
Paper
• 2401.06201
• Published
• 2
AceCoder: Utilizing Existing Code to Enhance Code Generation
Paper
• 2303.17780
• Published
• 1
SPADE: Synthesizing Assertions for Large Language Model Pipelines
Paper
• 2401.03038
• Published
• 2
Mixture of Soft Prompts for Controllable Data Generation
Paper
• 2303.01580
• Published
• 1
Rephrasing the Web: A Recipe for Compute and Data-Efficient Language
Modeling
Paper
• 2401.16380
• Published
• 51
Improving Text Embeddings with Large Language Models
Paper
• 2401.00368
• Published
• 82
CooK: Empowering General-Purpose Language Models with Modular and
Collaborative Knowledge
Paper
• 2305.09955
• Published
• 1
DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM
Workflows
Paper
• 2402.10379
• Published
• 31
Pre-trained Language Models as Re-Annotators
Paper
• 2205.05368
• Published
• 1
A Morphologically-Aware Dictionary-based Data Augmentation Technique for
Machine Translation of Under-Represented Languages
Paper
• 2402.01939
• Published
• 1
Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of
Large Vision-Language Models
Paper
• 2403.00231
• Published
• 2
GECTurk: Grammatical Error Correction and Detection Dataset for Turkish
Paper
• 2309.11346
• Published
LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement
Paper
• 2403.15042
• Published
• 27
MathGenie: Generating Synthetic Data with Question Back-translation for
Enhancing Mathematical Reasoning of LLMs
Paper
• 2402.16352
• Published
• 2
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset
Paper
• 2402.10176
• Published
• 38
CodecLM: Aligning Language Models with Tailored Synthetic Data
Paper
• 2404.05875
• Published
• 18
NExT: Teaching Large Language Models to Reason about Code Execution
Paper
• 2404.14662
• Published
• 4
Learning to Generate Instruction Tuning Datasets for Zero-Shot Task
Adaptation
Paper
• 2402.18334
• Published
• 12
GeMQuAD: Generating Multilingual Question Answering Datasets from Large
Language Models using Few Shot Learning
Paper
• 2404.09163
• Published
Better Synthetic Data by Retrieving and Transforming Existing Datasets
Paper
• 2404.14361
• Published
• 2
DUQGen: Effective Unsupervised Domain Adaptation of Neural Rankers by
Diversifying Synthetic Query Generation
Paper
• 2404.02489
• Published
• 1
Prompting-based Synthetic Data Generation for Few-Shot Question
Answering
Paper
• 2405.09335
• Published
SynthesizRR: Generating Diverse Datasets with Retrieval Augmentation
Paper
• 2405.10040
• Published
DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale
Synthetic Data
Paper
• 2405.14333
• Published
• 44
Grounding Data Science Code Generation with Input-Output Specifications
Paper
• 2402.08073
• Published
AgentTuning: Enabling Generalized Agent Abilities for LLMs
Paper
• 2310.12823
• Published
• 36
SemCoder: Training Code Language Models with Comprehensive Semantics
Paper
• 2406.01006
• Published
• 1
CrossTune: Black-Box Few-Shot Classification with Label Enhancement
Paper
• 2403.12468
• Published
TarGEN: Targeted Data Generation with Large Language Models
Paper
• 2310.17876
• Published
Enhancing Conversational Search: Large Language Model-Aided Informative
Query Rewriting
Paper
• 2310.09716
• Published
Automatically Generating Numerous Context-Driven SFT Data for LLMs
across Diverse Granularity
Paper
• 2405.16579
• Published