---
library_name: transformers
license: apache-2.0
language:
- en
- ko
base_model:
- naver-hyperclovax/HyperCLOVAX-SEED-Think-14B
---
|
|
|
## Model Card for sigridjineth/HyperCLOVAX-SEED-Think-DeepConf-14B

### **Summary**

This model enhances a fork of **HyperCLOVAX-SEED-Think-14B** by integrating ideas from Meta AI × UCSD's **DeepConf**. It performs confidence-based quality estimation and adaptive sampling to improve both accuracy and efficiency.

The core of this method is the **Lowest Group Confidence (LGC)**, a metric that uses a sliding window to identify the "most uncertain segment" of a generation path. This allows for intelligent offline filtering (**Top-p% Filtering**, **Confidence-Weighted Voting**) and online optimization (**Early Abort**), ultimately achieving higher accuracy at a lower computational cost.
|
|
|
-----

### **1. Background and Motivation**

While **Self-Consistency**—generating multiple paths and taking a majority vote—can improve performance on reasoning tasks, its practical application is limited by prohibitive computational costs and the noise introduced by low-quality generation paths.

The **DeepConf** framework addresses this by reading the model's internal **token generation probability distribution (confidence)** to estimate the quality of a path in real time. Simple average confidence can be misleading due to the "pitfall of averages." We instead use the sliding-window LGC metric to quantify the path's weakest link.
|
|
|
-----

### **2. Methods**

#### **2.1 Confidence Metric: Lowest Group Confidence (LGC)**

LGC is calculated by moving a window of size $W$ (e.g., 2048 tokens) across the entire generation path, computing the average confidence within each window, and taking the minimum value as the quality score for the whole trajectory.

* **Intuition**: The quality of a path is limited by its most uncertain or speculative segment.

The formula is:

$$\text{LGC}(\text{trajectory}) = \min_{t} \frac{1}{W}\sum_{i=t}^{t+W-1} \text{conf}(y_i)$$

Here, $\text{conf}(y_i)$ is the generation probability of token $y_i$. Our implementation defaults to the softmax probability of the top-1 token.
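As a reference, here is a minimal sketch of the LGC computation, assuming per-token confidences (top-1 softmax probabilities) have already been collected from the generation loop; the rolling-mean trick is an implementation convenience, not part of the definition.

```python
import numpy as np

def lowest_group_confidence(token_confidences: list[float], window: int = 2048) -> float:
    """Slide a window of size `window` over per-token confidences and
    return the lowest window mean, i.e. the LGC of the trajectory."""
    conf = np.asarray(token_confidences, dtype=np.float64)
    if len(conf) <= window:
        return float(conf.mean())  # shorter than one window: plain average
    # Rolling window means via cumulative sums: mean of conf[t : t + window].
    cumsum = np.concatenate(([0.0], np.cumsum(conf)))
    window_means = (cumsum[window:] - cumsum[:-window]) / window
    return float(window_means.min())
```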
|
|
|
#### **2.2 Offline Methods: Top-p% Filtering & Confidence-Weighted Voting**

* **Top-p% Filtering**: Among $N$ generated paths, only the **top p%** with the highest confidence scores are included in the final vote.

* **Confidence-Weighted Voting**: Each path's vote is weighted by a function of its confidence score (e.g., its LGC score or a monotonic transformation of it). A minimal sketch of both steps follows this list.

* **Literature Example**: For a GPT-family model on AIME 2025, using only the top 10% of 512 samples reportedly improved accuracy from 97.0% to 99.9%. (Note: this is a literature result; this model's own results are detailed in Section 4.)
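The sketch below combines both offline steps, assuming each path has been reduced to an `(extracted_answer, lgc_score)` pair; weighting each vote by the raw LGC score is one illustrative choice of weighting function.

```python
from collections import defaultdict

def filtered_weighted_vote(paths: list[tuple[str, float]], top_p_pct: float = 10.0) -> str:
    """Top-p% Filtering followed by Confidence-Weighted Voting over
    (extracted_answer, lgc_score) pairs."""
    # Keep only the top p% of paths by confidence score.
    ranked = sorted(paths, key=lambda pair: pair[1], reverse=True)
    keep = max(1, int(len(ranked) * top_p_pct / 100))
    # Each surviving path votes with weight equal to its confidence score.
    tally: dict[str, float] = defaultdict(float)
    for answer, score in ranked[:keep]:
        tally[answer] += score
    return max(tally, key=tally.get)
```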
|
|
|
#### **2.3 Online Method: Adaptive Sampling (Early Abort)**

1. **Warm-up**: Fully generate $M$ initial paths (e.g., 16) to establish a dynamic confidence threshold, $\tau$.

2. **Monitoring**: For each new path, if its real-time LGC drops below $\tau$ at any point, generation is immediately aborted and the path discarded, preventing wasted computation on low-quality paths.

* **Reported Gains**: This technique can reduce the number of sampled tokens by ~85% while maintaining or even improving accuracy.
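Below is a minimal sketch of the warm-up-plus-abort loop. The streaming helper `generate_stream(prompt)` yielding `(token, confidence)` pairs is hypothetical, and calibrating $\tau$ as a low percentile of the warm-up LGC scores is one plausible rule, not the only one.

```python
import numpy as np
from collections import deque

def run_path(generate_stream, prompt, window=2048, tau=None):
    """Generate one path; if `tau` is set, abort as soon as the rolling
    window-mean confidence drops below it. Returns (tokens, lgc) or None."""
    tokens, confs, lgc = [], deque(maxlen=window), float("inf")
    for token, conf in generate_stream(prompt):  # hypothetical streaming API
        tokens.append(token)
        confs.append(conf)
        if len(confs) == window:
            group_conf = sum(confs) / window
            lgc = min(lgc, group_conf)
            if tau is not None and group_conf < tau:
                return None  # early abort: weakest segment fell below tau
    if lgc == float("inf") and confs:  # path shorter than one window
        lgc = sum(confs) / len(confs)
    return tokens, lgc

def adaptive_sample(generate_stream, prompt, m_warmup=16, window=2048,
                    percentile=10.0, n_max=64):
    """Warm-up M full paths to calibrate tau, then sample with early abort."""
    kept = [run_path(generate_stream, prompt, window) for _ in range(m_warmup)]
    tau = float(np.percentile([lgc for _, lgc in kept], percentile))
    for _ in range(n_max - m_warmup):
        result = run_path(generate_stream, prompt, window, tau=tau)
        if result is not None:
            kept.append(result)
    return kept, tau
```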
|
|
|
#### **2.4 HyperCLOVAX (Think/Answer) Specialization**

We leverage the model's ChatML structure, which separates the `thinking` (exploration) and `answer` (formal response) stages, by applying a dual-threshold system: $\tau_{\text{think}} < \tau_{\text{answer}}$ (a minimal selection rule is sketched after the bullets).

* **Thinking Stage**: A looser threshold encourages broader exploration of ideas.

* **Answer Stage**: A stricter threshold enforces high confidence, ensuring formal correctness and accuracy in the final output.
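A minimal sketch of the stage-dependent threshold selection, assuming the chat template closes the thinking block with a `</think>` tag (substitute the model's actual marker):

```python
def current_threshold(generated_text: str, tau_think: float, tau_answer: float) -> float:
    """Looser threshold while still inside the thinking block,
    stricter threshold once the formal answer has begun."""
    in_answer_stage = "</think>" in generated_text  # template-specific marker
    return tau_answer if in_answer_stage else tau_think
```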
|
|
|
-----

### **3. Hyperparameters (Recommended Defaults)**

| Name | Description | Default Value (Example) |
| --- | --- | --- |
| `W` | Sliding window length (tokens) | 2048 |
| `p` | Percentage for Top-p% Filtering | 10 |
| `M` | Number of warm-up paths for calibration | 16 |
| $\tau_{\text{think}}$ | Early-abort threshold for the `thinking` stage | Dynamic (based on warm-up) |
| $\tau_{\text{answer}}$ | Early-abort threshold for the `answer` stage | Dynamic (based on warm-up, stricter) |
| `N_max` | Max number of paths to sample (online) | Optional limit (e.g., 64) |
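For convenience, these defaults can be bundled into a single configuration object; the field names below are illustrative, not an official API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DeepConfConfig:
    window: int = 2048                  # W: sliding window length (tokens)
    top_p_pct: float = 10.0             # p: keep top p% of paths by confidence
    m_warmup: int = 16                  # M: fully generated warm-up paths
    tau_think: Optional[float] = None   # set dynamically from warm-up stats
    tau_answer: Optional[float] = None  # set dynamically; stricter than tau_think
    n_max: int = 64                     # optional cap on sampled paths
```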
|
|
|
-----

### **4. Evaluation**

#### **4.1 AIME 2025 (30-question slice) — `deepconf` vs. `original`**

*Scoring: Correct = 1, Incorrect / No Format = 0. "No Format" is treated as not attempted.*

| Metric | `original` | `deepconf` | Notes |
| --- | --- | --- | --- |
| **Total Correct** | 8 | **10** | +2 questions correct |
| **Accuracy (out of 30)** | 26.7% | **33.3%** | +6.7%p improvement |
| Attempts (Format OK) | 8 | 11 | `deepconf` attempted 3 more questions |
| Format Failures | 22 | 19 | `deepconf` shows better format stability |
| **Head-to-Head** | — | — | **2 Wins / 0 Losses / 28 Ties for `deepconf`** |
|
|
|
**Breakdown by Part:**

* **Part I**: Both models solved 6/15 questions (Tie).

* **Part II**: `original` solved 2/15, while `deepconf` solved 4/15. **The performance gain was concentrated in the more difficult second half.**

*Note: The high number of "Format Failures" in this slice indicates that the ability to adhere to strict output formatting was a significant factor in the final score.*
|
|
|
#### **4.2 Efficiency & Speed (10-question sample test)**

| Metric | Improvement with `deepconf` |
| --- | --- |
| **Majority-Vote Accuracy** | +20.0%p |
| **Avg. Generated Tokens** | –29.6% |
| **Avg. Generation Time** | –41.6% |

***Caution: These results are based on a very small sample size (N≈10).*** However, they signal a meaningful improvement across accuracy, speed, and cost.
|
|
|
-----

### **5. Use Cases and Recommended Pipeline**

This model is ideal for **mathematical and logical reasoning tasks**, where it offers significant sample savings and improved reliability compared to standard self-consistency.

**Recommended Pipeline:**

1. **Online**: Use adaptive sampling with a warm-up phase and early abort to filter out low-quality paths efficiently.

2. **Offline**: Apply Top-p% Filtering (with `p=10` as a starting point) to the remaining high-quality paths.

3. **Finalization**: Use Confidence-Weighted Voting on the filtered set and apply a final format-validation step to extract the answer.
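Reusing the hypothetical helpers sketched in Sections 2 and 3, the full pipeline might look like the following; `extract_answer` stands in for whatever format-validating answer parser the deployment uses.

```python
def deepconf_pipeline(generate_stream, extract_answer, prompt: str) -> str:
    """End-to-end sketch: online adaptive sampling, then offline
    filtering, weighted voting, and format validation."""
    cfg = DeepConfConfig()
    # 1. Online: warm-up, then early-abort sampling (Section 2.3).
    paths, _tau = adaptive_sample(generate_stream, prompt,
                                  m_warmup=cfg.m_warmup,
                                  window=cfg.window, n_max=cfg.n_max)
    # 2-3. Offline: drop format-invalid answers, then Top-p% filter
    # and confidence-weighted vote (Section 2.2).
    scored = [(extract_answer(tokens), lgc) for tokens, lgc in paths]
    scored = [(ans, lgc) for ans, lgc in scored if ans is not None]
    return filtered_weighted_vote(scored, top_p_pct=cfg.top_p_pct)
```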
|
|
|
-----

### **6. Limitations & What to Watch Out For**

* **Confidence Miscalibration**: If the model's probability estimates are not well calibrated, the threshold $\tau$ may be unreliable. This can be mitigated by tuning temperature/top-k or by relying on warm-up statistics.

* **Domain Shift**: The optimal hyperparameters ($\tau$, $W$, $p$) may need recalibration when applied to new domains or problem styles.

* **Unintended Early Aborts**: A path might be discarded prematurely if rare tokens or unusual formatting cause a temporary dip in confidence. Consider a minimum generation length or a cooldown period before aborting (a minimal guard is sketched after this list).

* **Reliance on Format Validation**: If the final answer-extraction logic is not robust, "correct but badly formatted" answers may still be missed.
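As a minimal guard against premature aborts, the abort check can be gated on a minimum generation length; the cutoff value here is illustrative and should be tuned per domain.

```python
MIN_TOKENS_BEFORE_ABORT = 256  # illustrative value; tune per domain

def should_abort(step: int, group_conf: float, tau: float) -> bool:
    """Never abort before a minimum length, so a single early dip in
    confidence cannot kill an otherwise promising path."""
    return step >= MIN_TOKENS_BEFORE_ABORT and group_conf < tau
```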
|
|
|
-----

### **7. Responsible Use**

* **Expose Reasoning**: For math and coding tasks, always pair the final answer with the generation's reasoning or verification steps to mitigate hallucinations and minor errors.

* **Resource Allocation**: While early abort reduces overall cost, the warm-up phase introduces overhead. Manage this effectively with batching and queueing in a production environment.

* **Bias and Fairness**: Confidence-based filtering may systematically favor certain response styles. We recommend periodic auditing and sampling to ensure fairness and diversity in outputs.

-----

### **Citation**

* **Original Idea**: Fu, Wang, Tian, Zhao et al., *Deep Think With Confidence* (Meta AI, UCSD)
|
|