---
library_name: transformers
license: apache-2.0
language:
- en
- ko
base_model:
- naver-hyperclovax/HyperCLOVAX-SEED-Think-14B
---
|
|
|
## Model Card for sigridjineth/HyperCLOVAX-SEED-Think-DeepConf-14B

### **Summary**

This model enhances a fork of **HyperCLOVAX-SEED-Think-14B** by integrating ideas from Meta AI × UCSD's **DeepConf**. It performs confidence-based quality estimation and adaptive sampling to improve both accuracy and efficiency.

The core of this method is the **Lowest Group Confidence (LGC)**, a metric that uses a sliding window to identify the "most uncertain segment" of a generation path. This allows for intelligent offline filtering (**Top-p% Filtering**, **Confidence-Weighted Voting**) and online optimization (**Early Abort**), ultimately achieving higher accuracy at a lower computational cost.
|
|
|
-----

### **1. Background and Motivation**

While **Self-Consistency**—generating multiple paths and taking a majority vote—can improve performance on reasoning tasks, its practical application is limited by prohibitive computational costs and the noise introduced by low-quality generation paths.

The **DeepConf** framework addresses this by reading the model's internal **token generation probability distribution (confidence)** to estimate the quality of a path in real time. Simple average confidence can be misleading due to the "pitfall of averages." We instead use the sliding-window LGC metric to quantify the path's weakest link.
|
|
|
-----

### **2. Methods**

#### **2.1 Confidence Metric: Lowest Group Confidence (LGC)**

LGC is calculated by moving a window of size $W$ (e.g., 2048 tokens) across the entire generation path, computing the average confidence within each window, and taking the minimum value as the quality score for the whole trajectory.

* **Intuition**: The quality of a path is limited by its most uncertain or speculative segment.

The formula is:

$$\text{LGC}(\text{trajectory}) = \min_{t} \frac{1}{W}\sum_{i=t}^{t+W-1} \text{conf}(y_i)$$

Here, $\text{conf}(y_i)$ is the generation probability of token $y_i$. Our implementation defaults to the softmax probability of the top-1 token.
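As a reference, here is a minimal sketch of the LGC computation, assuming per-token confidences (top-1 softmax probabilities) have already been collected from the generation loop; the rolling-mean trick is an implementation convenience, not part of the definition.

```python
import numpy as np

def lowest_group_confidence(token_confidences: list[float], window: int = 2048) -> float:
    """Slide a window of size `window` over per-token confidences and
    return the lowest window mean, i.e. the LGC of the trajectory."""
    conf = np.asarray(token_confidences, dtype=np.float64)
    if len(conf) <= window:
        return float(conf.mean())  # shorter than one window: plain average
    # Rolling window means via cumulative sums: mean of conf[t : t + window].
    cumsum = np.concatenate(([0.0], np.cumsum(conf)))
    window_means = (cumsum[window:] - cumsum[:-window]) / window
    return float(window_means.min())
```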
|
|
|
#### **2.2 Offline Methods: Top-p% Filtering & Confidence-Weighted Voting**

* **Top-p% Filtering**: Among $N$ generated paths, only the **top p%** with the highest confidence scores are included in the final vote.

* **Confidence-Weighted Voting**: Each path's vote is weighted by a function of its confidence score (e.g., its LGC score or a monotonic transformation of it). A minimal sketch of both steps follows this list.

* **Literature Example**: For a GPT-family model on AIME 2025, using only the top 10% of 512 samples reportedly improved accuracy from 97.0% to 99.9%. (Note: this is a literature result; this model's own results are detailed in Section 4.)
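The sketch below combines both offline steps, assuming each path has been reduced to an `(extracted_answer, lgc_score)` pair; weighting each vote by the raw LGC score is one illustrative choice of weighting function.

```python
from collections import defaultdict

def filtered_weighted_vote(paths: list[tuple[str, float]], top_p_pct: float = 10.0) -> str:
    """Top-p% Filtering followed by Confidence-Weighted Voting over
    (extracted_answer, lgc_score) pairs."""
    # Keep only the top p% of paths by confidence score.
    ranked = sorted(paths, key=lambda pair: pair[1], reverse=True)
    keep = max(1, int(len(ranked) * top_p_pct / 100))
    # Each surviving path votes with weight equal to its confidence score.
    tally: dict[str, float] = defaultdict(float)
    for answer, score in ranked[:keep]:
        tally[answer] += score
    return max(tally, key=tally.get)
```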
|
|
|
#### **2.3 Online Method: Adaptive Sampling (Early Abort)**

1. **Warm-up**: Fully generate $M$ initial paths (e.g., 16) to establish a dynamic confidence threshold, $\tau$.

2. **Monitoring**: For each new path, if its real-time LGC drops below $\tau$ at any point, generation is immediately aborted and the path discarded, preventing wasted computation on low-quality paths.

* **Reported Gains**: This technique can reduce the number of sampled tokens by ~85% while maintaining or even improving accuracy.
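Below is a minimal sketch of the warm-up-plus-abort loop. The streaming helper `generate_stream(prompt)` yielding `(token, confidence)` pairs is hypothetical, and calibrating $\tau$ as a low percentile of the warm-up LGC scores is one plausible rule, not the only one.

```python
import numpy as np
from collections import deque

def run_path(generate_stream, prompt, window=2048, tau=None):
    """Generate one path; if `tau` is set, abort as soon as the rolling
    window-mean confidence drops below it. Returns (tokens, lgc) or None."""
    tokens, confs, lgc = [], deque(maxlen=window), float("inf")
    for token, conf in generate_stream(prompt):  # hypothetical streaming API
        tokens.append(token)
        confs.append(conf)
        if len(confs) == window:
            group_conf = sum(confs) / window
            lgc = min(lgc, group_conf)
            if tau is not None and group_conf < tau:
                return None  # early abort: weakest segment fell below tau
    if lgc == float("inf") and confs:  # path shorter than one window
        lgc = sum(confs) / len(confs)
    return tokens, lgc

def adaptive_sample(generate_stream, prompt, m_warmup=16, window=2048,
                    percentile=10.0, n_max=64):
    """Warm-up M full paths to calibrate tau, then sample with early abort."""
    kept = [run_path(generate_stream, prompt, window) for _ in range(m_warmup)]
    tau = float(np.percentile([lgc for _, lgc in kept], percentile))
    for _ in range(n_max - m_warmup):
        result = run_path(generate_stream, prompt, window, tau=tau)
        if result is not None:
            kept.append(result)
    return kept, tau
```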
|
|
|
#### **2.4 HyperCLOVAX (Think/Answer) Specialization**

We leverage the model's ChatML structure, which separates the `thinking` (exploration) and `answer` (formal response) stages, by applying a dual-threshold system: $\tau_{\text{think}} < \tau_{\text{answer}}$ (a minimal selection rule is sketched after the bullets).

* **Thinking Stage**: A looser threshold encourages broader exploration of ideas.

* **Answer Stage**: A stricter threshold enforces high confidence, ensuring formal correctness and accuracy in the final output.
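A minimal sketch of the stage-dependent threshold selection, assuming the chat template closes the thinking block with a `</think>` tag (substitute the model's actual marker):

```python
def current_threshold(generated_text: str, tau_think: float, tau_answer: float) -> float:
    """Looser threshold while still inside the thinking block,
    stricter threshold once the formal answer has begun."""
    in_answer_stage = "</think>" in generated_text  # template-specific marker
    return tau_answer if in_answer_stage else tau_think
```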
|
|
|
-----

### **3. Hyperparameters (Recommended Defaults)**

| Name | Description | Default Value (Example) |
| --- | --- | --- |
| `W` | Sliding window length (tokens) | 2048 |
| `p` | Percentage for Top-p% Filtering | 10 |
| `M` | Number of warm-up paths for calibration | 16 |
| $\tau_{\text{think}}$ | Early-abort threshold for the `thinking` stage | Dynamic (based on warm-up) |
| $\tau_{\text{answer}}$ | Early-abort threshold for the `answer` stage | Dynamic (based on warm-up, stricter) |
| `N_max` | Max number of paths to sample (online) | Optional limit (e.g., 64) |
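For convenience, these defaults can be bundled into a single configuration object; the field names below are illustrative, not an official API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DeepConfConfig:
    window: int = 2048                  # W: sliding window length (tokens)
    top_p_pct: float = 10.0             # p: keep top p% of paths by confidence
    m_warmup: int = 16                  # M: fully generated warm-up paths
    tau_think: Optional[float] = None   # set dynamically from warm-up stats
    tau_answer: Optional[float] = None  # set dynamically; stricter than tau_think
    n_max: int = 64                     # optional cap on sampled paths
```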
|
|
|
-----

### **4. Evaluation**

#### **4.1 AIME 2025 (30-question slice) — `deepconf` vs. `original`**

*Scoring: Correct = 1, Incorrect / No Format = 0. "No Format" is treated as not attempted.*

| Metric | `original` | `deepconf` | Notes |
| --- | --- | --- | --- |
| **Total Correct** | 8 | **10** | +2 questions correct |
| **Accuracy (out of 30)** | 26.7% | **33.3%** | +6.7%p improvement |
| Attempts (Format OK) | 8 | 11 | `deepconf` attempted 3 more questions |
| Format Failures | 22 | 19 | `deepconf` shows better format stability |
| **Head-to-Head** | — | — | **2 Wins / 0 Losses / 28 Ties for `deepconf`** |
|
|
|
**Breakdown by Part:**

* **Part I**: Both models solved 6/15 questions (Tie).

* **Part II**: `original` solved 2/15, while `deepconf` solved 4/15. **The performance gain was concentrated in the more difficult second half.**

*Note: The high number of "Format Failures" in this slice indicates that the ability to adhere to strict output formatting was a significant factor in the final score.*
|
|
|
#### **4.2 Efficiency & Speed (10-question sample test)**

| Metric | Improvement with `deepconf` |
| --- | --- |
| **Majority-Vote Accuracy** | +20.0%p |
| **Avg. Generated Tokens** | –29.6% |
| **Avg. Generation Time** | –41.6% |

***Caution: These results are based on a very small sample size (N≈10).*** However, they signal a meaningful improvement across accuracy, speed, and cost.
|
|
|
-----

### **5. Use Cases and Recommended Pipeline**

This model is ideal for **mathematical and logical reasoning tasks**, where it offers significant sample savings and improved reliability compared to standard self-consistency.

**Recommended Pipeline:**

1. **Online**: Use adaptive sampling with a warm-up phase and early abort to filter out low-quality paths efficiently.

2. **Offline**: Apply Top-p% Filtering (with `p=10` as a starting point) to the remaining high-quality paths.

3. **Finalization**: Use Confidence-Weighted Voting on the filtered set and apply a final format-validation step to extract the answer.
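Reusing the hypothetical helpers sketched in Sections 2 and 3, the full pipeline might look like the following; `extract_answer` stands in for whatever format-validating answer parser the deployment uses.

```python
def deepconf_pipeline(generate_stream, extract_answer, prompt: str) -> str:
    """End-to-end sketch: online adaptive sampling, then offline
    filtering, weighted voting, and format validation."""
    cfg = DeepConfConfig()
    # 1. Online: warm-up, then early-abort sampling (Section 2.3).
    paths, _tau = adaptive_sample(generate_stream, prompt,
                                  m_warmup=cfg.m_warmup,
                                  window=cfg.window, n_max=cfg.n_max)
    # 2-3. Offline: drop format-invalid answers, then Top-p% filter
    # and confidence-weighted vote (Section 2.2).
    scored = [(extract_answer(tokens), lgc) for tokens, lgc in paths]
    scored = [(ans, lgc) for ans, lgc in scored if ans is not None]
    return filtered_weighted_vote(scored, top_p_pct=cfg.top_p_pct)
```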
|
|
|
-----

### **6. Limitations & What to Watch Out For**

* **Confidence Miscalibration**: If the model's probability estimates are not well calibrated, the threshold $\tau$ may be unreliable. This can be mitigated by tuning temperature/top-k or by relying on warm-up statistics.

* **Domain Shift**: The optimal hyperparameters ($\tau$, $W$, $p$) may need recalibration when applied to new domains or problem styles.

* **Unintended Early Aborts**: A path might be discarded prematurely if rare tokens or unusual formatting cause a temporary dip in confidence. Consider a minimum generation length or a cooldown period before aborting (a minimal guard is sketched after this list).

* **Reliance on Format Validation**: If the final answer-extraction logic is not robust, "correct but badly formatted" answers may still be missed.
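As a minimal guard against premature aborts, the abort check can be gated on a minimum generation length; the cutoff value here is illustrative and should be tuned per domain.

```python
MIN_TOKENS_BEFORE_ABORT = 256  # illustrative value; tune per domain

def should_abort(step: int, group_conf: float, tau: float) -> bool:
    """Never abort before a minimum length, so a single early dip in
    confidence cannot kill an otherwise promising path."""
    return step >= MIN_TOKENS_BEFORE_ABORT and group_conf < tau
```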
|
|
|
-----

### **7. Responsible Use**

* **Expose Reasoning**: For math and coding tasks, always pair the final answer with the generation's reasoning or verification steps to mitigate hallucinations and minor errors.

* **Resource Allocation**: While early abort reduces overall cost, the warm-up phase introduces overhead. Manage this effectively with batching and queueing in a production environment.

* **Bias and Fairness**: Confidence-based filtering may systematically favor certain response styles. We recommend periodic auditing and sampling to ensure fairness and diversity in outputs.

-----

### **Citation**

* **Original Idea**: Fu, Wang, Tian, Zhao et al., *Deep Think With Confidence* (Meta AI, UCSD)
|
|