opendatalab/MinerU-HTML


SFKs committed 12 days ago

Commit b51ea12 · verified · 1 Parent(s): b263efc

Update README.md

Files changed (1)

README.md +6 -2


@@ -1,3 +1,8 @@
+---
+license: apache-2.0
+datasets:
+- opendatalab/AICC
+---
 # Dripper(MinerU-HTML)
 **Dripper(MinerU-HTML)** is an advanced HTML main content extraction tool based on Large Language Models (LLMs). It provides a complete pipeline for extracting primary content from HTML pages using LLM-based classification and state machine-guided generation.
@@ -328,5 +333,4 @@ Contributions are welcome! Please feel free to submit a Pull Request.
 - Built on top of [vLLM](https://github.com/vllm-project/vllm) for efficient LLM inference
 - Uses [Trafilatura](https://github.com/adbar/trafilatura) for fallback extraction
 - Finetuned on [Qwen3](https://github.com/QwenLM/Qwen3)
-- Inspired by various HTML content extraction research
+- Inspired by various HTML content extraction research