Update README.md
README.md
CHANGED
@@ -1,3 +1,8 @@
+---
+license: apache-2.0
+datasets:
+- opendatalab/AICC
+---
 # Dripper(MinerU-HTML)
 
 **Dripper(MinerU-HTML)** is an advanced HTML main content extraction tool based on Large Language Models (LLMs). It provides a complete pipeline for extracting primary content from HTML pages using LLM-based classification and state machine-guided generation.
@@ -328,5 +333,4 @@ Contributions are welcome! Please feel free to submit a Pull Request.
 - Built on top of [vLLM](https://github.com/vllm-project/vllm) for efficient LLM inference
 - Uses [Trafilatura](https://github.com/adbar/trafilatura) for fallback extraction
 - Finetuned on [Qwen3](https://github.com/QwenLM/Qwen3)
-- Inspired by various HTML content extraction research
-
+- Inspired by various HTML content extraction research