# Dripper(MinerU-HTML)

**Dripper(MinerU-HTML)** is an advanced HTML main content extraction tool based on Large Language Models (LLMs). It provides a complete pipeline for extracting primary content from HTML pages using LLM-based classification and state machine-guided generation.

## Features

- 🚀 **LLM-Powered Extraction**: Uses state-of-the-art language models to intelligently identify main content
- 🎯 **State Machine Guidance**: Implements logits processing with state machines for structured JSON output
- 🔄 **Fallback Mechanism**: Automatically falls back to alternative extraction methods on errors
- 📊 **Comprehensive Evaluation**: Built-in evaluation framework with ROUGE and item-level metrics
- 🌐 **REST API Server**: FastAPI-based server for easy integration
- ⚡ **Distributed Processing**: Ray-based parallel processing for large-scale evaluation
- 🔧 **Multiple Extractors**: Supports various baseline extractors for comparison

## Installation

### Prerequisites

- Python >= 3.10
- CUDA-capable GPU (recommended for LLM inference)
- Sufficient memory for model loading

### Install from Source

The installation process automatically handles dependencies. The `setup.py` reads dependencies from `requirements.txt` and optionally from `baselines.txt`.

#### Basic Installation (Core Functionality)

For basic usage of Dripper, install with core dependencies only:

```bash
# Clone the repository
git clone https://github.com/opendatalab/MinerU-HTML
cd MinerU-HTML

# Install the package with core dependencies only
# Dependencies from requirements.txt are automatically installed
pip install .
```

#### Installation with Baseline Extractors (for Evaluation)

If you need to run baseline evaluations and comparisons, install with the `baselines` extra:

```bash
# Install with baseline extractor dependencies
pip install -e .[baselines]
```

This will install additional libraries required for baseline extractors:

- `readabilipy`, `readability_lxml` - Readability-based extractors
- `resiliparse` - Resilient HTML parsing
- `justext` - JusText extractor
- `gne` - General News Extractor
- `goose3` - Goose3 article extractor
- `boilerpy3` - Boilerplate removal
- `crawl4ai` - AI-powered web content extraction

**Note**: The baseline extractors are only needed for running comparative evaluations. For basic usage of Dripper, the core installation is sufficient.

## Quick Start

### 1. Download the model

Visit the model page at [MinerU-HTML](https://huggingface.co/opendatalab/MinerU-HTML), or download the model with the following command:

```bash
huggingface-cli download opendatalab/MinerU-HTML
```

### 2. Using the Python API

```python
from dripper.api import Dripper

# Initialize Dripper with model configuration
dripper = Dripper(
    config={
        'model_path': '/path/to/your/model',
        'tp': 1,  # Tensor parallel size
        'state_machine': None,  # or 'v1' or 'v2'
        'use_fall_back': True,
        'raise_errors': False,
    }
)

# Extract main content from HTML
html_content = "<html>...</html>"
result = dripper.process(html_content)

# Access results
main_html = result[0].main_html
```

### 3. Using the REST API Server

```bash
# Start the server
python -m dripper.server \
    --model_path /path/to/your/model \
    --state_machine v2 \
    --port 7986

# Or use environment variables
export DRIPPER_MODEL_PATH=/path/to/your/model
export DRIPPER_STATE_MACHINE=v2
export DRIPPER_PORT=7986
python -m dripper.server
```

Then make requests to the API:

```bash
# Extract main content
curl -X POST "http://localhost:7986/extract" \
    -H "Content-Type: application/json" \
    -d '{"html": "<html>...</html>", "url": "https://example.com"}'

# Health check
curl http://localhost:7986/health
```

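The same request can be made programmatically. Below is a minimal Python client sketch using `requests`; the payload mirrors the curl example above, but since the response schema is not documented here, the sketch simply prints the JSON rather than assuming field names:

```python
import requests

# Minimal client sketch for the /extract endpoint. The payload mirrors
# the curl example above; the response schema is not assumed here.
resp = requests.post(
    "http://localhost:7986/extract",
    json={"html": "<html>...</html>", "url": "https://example.com"},
    timeout=60,
)
resp.raise_for_status()
print(resp.json())
```
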
## Configuration

### Dripper Configuration Options

| Parameter       | Type | Default      | Description                                       |
| --------------- | ---- | ------------ | ------------------------------------------------- |
| `model_path`    | str  | **Required** | Path to the LLM model directory                   |
| `tp`            | int  | 1            | Tensor parallel size for model inference          |
| `state_machine` | str  | None         | State machine version: `'v1'`, `'v2'`, or `None`  |
| `use_fall_back` | bool | True         | Enable fallback to trafilatura on errors          |
| `raise_errors`  | bool | False        | Raise exceptions on errors (vs returning None)    |
| `debug`         | bool | False        | Enable debug logging                              |
| `early_load`    | bool | False        | Load model during initialization                  |

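For reference, a fully spelled-out configuration using every option from the table might look like the following sketch (the model path is a placeholder, and a non-`None` `state_machine` also requires `VLLM_USE_V1=0`; see Environment Variables below):

```python
from dripper.api import Dripper

# Sketch: every option from the table above, with non-required fields
# left at their defaults except state_machine (model path is a placeholder).
dripper = Dripper(
    config={
        'model_path': '/path/to/your/model',  # required
        'tp': 1,                  # tensor parallel size
        'state_machine': 'v2',    # 'v1', 'v2', or None
        'use_fall_back': True,    # fall back to trafilatura on errors
        'raise_errors': False,    # return None instead of raising
        'debug': False,           # debug logging
        'early_load': False,      # set True to load the model at init time
    }
)
```
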
### Environment Variables

- `DRIPPER_MODEL_PATH`: Path to the LLM model
- `DRIPPER_STATE_MACHINE`: State machine version (`v1`, `v2`, or empty)
- `DRIPPER_PORT`: Server port number (default: 7986)
- `VLLM_USE_V1`: Must be set to `'0'` when using a state machine

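For example, the server invocation from the Quick Start can be expressed entirely through these variables, here with a state machine enabled (and therefore `VLLM_USE_V1=0`):

```bash
# State machine enabled, so vLLM's V1 engine must be disabled
export VLLM_USE_V1=0
export DRIPPER_MODEL_PATH=/path/to/your/model
export DRIPPER_STATE_MACHINE=v2
export DRIPPER_PORT=7986
python -m dripper.server
```
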
## Usage Examples

### Batch Processing

```python
from dripper.api import Dripper

dripper = Dripper(config={'model_path': '/path/to/model'})

# Process multiple HTML strings
html_list = ["<html>...</html>", "<html>...</html>"]
results = dripper.process(html_list)

for result in results:
    print(result.main_html)
```

### Evaluation

#### Baseline Evaluation

```bash
python app/eval_baseline.py \
    --bench /path/to/benchmark.jsonl \
    --task_dir /path/to/output \
    --extractor_name dripper-md \
    --default_config gpu \
    --model_path /path/to/model
```

#### Two-Step Evaluation

```bash
# If running inference without a state machine, set VLLM_USE_V1=1
export VLLM_USE_V1=1
# If using a state machine, set VLLM_USE_V1=0
# export VLLM_USE_V1=0

RESULT_PATH=/path/to/output
MODEL_NAME=MinerU-HTML
MODEL_PATH=/path/to/model
BENCH_DATA=/path/to/benchmark.jsonl

# Step 1: Prepare for evaluation
python app/eval_with_answer.py \
    --bench $BENCH_DATA \
    --task_dir $RESULT_PATH/$MODEL_NAME \
    --step 1 --cpus 128 --force_update

# Step 2: Run inference
python app/run_inference.py \
    --task_dir $RESULT_PATH/$MODEL_NAME \
    --model_path $MODEL_PATH \
    --output_path $RESULT_PATH/$MODEL_NAME/res.jsonl \
    --no_logits

# Step 3: Process results
python app/process_res.py \
    --response $RESULT_PATH/$MODEL_NAME/res.jsonl \
    --answer $RESULT_PATH/$MODEL_NAME/ans.jsonl \
    --error $RESULT_PATH/$MODEL_NAME/err.jsonl

# Step 4: Evaluate with answers
python app/eval_with_answer.py \
    --bench $BENCH_DATA \
    --task_dir $RESULT_PATH/$MODEL_NAME \
    --answer $RESULT_PATH/$MODEL_NAME/ans.jsonl \
    --step 2 --cpus 128 --force_update
```

## Project Structure

```
Dripper/
├── dripper/                  # Main package
│   ├── api.py                # Dripper API class
│   ├── server.py             # FastAPI server
│   ├── base.py               # Core data structures
│   ├── exceptions.py         # Custom exceptions
│   ├── inference/            # LLM inference modules
│   │   ├── inference.py      # Generation functions
│   │   ├── prompt.py         # Prompt generation
│   │   ├── logits.py         # Response parsing
│   │   └── logtis_processor/ # State machine logits processors
│   ├── process/              # HTML processing
│   │   ├── simplify_html.py
│   │   ├── map_to_main.py
│   │   └── html_utils.py
│   ├── eval/                 # Evaluation modules
│   │   ├── metric.py         # ROUGE and item-level metrics
│   │   ├── eval.py           # Evaluation functions
│   │   ├── process.py        # Processing utilities
│   │   └── benckmark.py      # Benchmark data structures
│   └── eval_baselines/       # Baseline extractors
│       ├── base.py           # Evaluation framework
│       └── baselines/        # Extractor implementations
├── app/                      # Application scripts
│   ├── eval_baseline.py      # Baseline evaluation script
│   ├── eval_with_answer.py   # Two-step evaluation
│   ├── run_inference.py      # Inference script
│   └── process_res.py        # Result processing
├── requirements.txt          # Core Python dependencies (auto-installed)
├── baselines.txt             # Optional dependencies for baseline extractors
├── LICENCE                   # Apache License 2.0
├── NOTICE                    # Copyright and attribution notices
└── setup.py                  # Package setup (handles dependency installation)
```

## Supported Extractors

Dripper supports various baseline extractors for comparison:

- **Dripper** (`dripper-md`, `dripper-html`): The main LLM-based extractor
- **Trafilatura**: Fast and accurate content extraction
- **Readability**: Mozilla's readability algorithm
- **BoilerPy3**: Python port of Boilerpipe
- **NewsPlease**: News article extractor
- **Goose3**: Article extractor
- **GNE**: General News Extractor
- **Crawl4ai**: AI-powered web content extraction
- And more...

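Most of these baselines are ordinary Python libraries. For instance, Trafilatura, which Dripper also uses for fallback extraction, can be run standalone; the snippet below is an independent illustration, not Dripper's internal fallback code:

```python
import trafilatura

# Standalone Trafilatura extraction for comparison; this is not
# Dripper's internal fallback implementation.
html_content = "<html>...</html>"
text = trafilatura.extract(html_content, url="https://example.com")
print(text)  # extracted main text, or None if extraction fails
```
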
## Evaluation Metrics

- **ROUGE Scores**: ROUGE-N precision, recall, and F1 scores
- **Item-Level Metrics**: Per-tag-type (main/other) precision, recall, F1, and accuracy
- **HTML Output**: Extracted main HTML for visual inspection

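To make the ROUGE-N definitions concrete, here is a self-contained ROUGE-1 computation over whitespace tokens. It illustrates the precision/recall/F1 definitions only; it is not the implementation in `dripper/eval/metric.py`:

```python
from collections import Counter

def rouge_1(reference: str, candidate: str) -> dict:
    """Unigram-overlap ROUGE-1 on whitespace tokens (illustrative only)."""
    ref_counts = Counter(reference.split())
    cand_counts = Counter(candidate.split())
    overlap = sum((ref_counts & cand_counts).values())  # clipped unigram matches
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return {'precision': precision, 'recall': recall, 'f1': f1}

print(rouge_1("the main article text", "the extracted article text"))
```
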
## Development

### Running Tests

```bash
# Add test commands here when available
```

### Code Style

The project uses pre-commit hooks for code quality. Install them:

```bash
pre-commit install
```

## Troubleshooting

### Common Issues

1. **VLLM_USE_V1 Error**: When using a state machine, ensure `VLLM_USE_V1=0` is set:

   ```bash
   export VLLM_USE_V1=0
   ```

2. **Model Loading Errors**: Verify the model path and ensure sufficient GPU memory.

3. **Import Errors**: Ensure the package is properly installed:

   ```bash
   # Reinstall the package (this will automatically install dependencies from requirements.txt)
   pip install -e .

   # If you need baseline extractors for evaluation:
   pip install -e .[baselines]
   ```

## License

This project is licensed under the Apache License, Version 2.0. See the [LICENCE](LICENCE) file for details.

### Copyright Notice

This project contains code and model weights derived from Qwen3. Original Qwen3 Copyright 2024 Alibaba Cloud, licensed under Apache License 2.0. Modifications and additional training Copyright 2025 OpenDatalab Shanghai AILab, licensed under Apache License 2.0.

For more information, please see the [NOTICE](NOTICE) file.

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## Acknowledgments

- Built on top of [vLLM](https://github.com/vllm-project/vllm) for efficient LLM inference
- Uses [Trafilatura](https://github.com/adbar/trafilatura) for fallback extraction
- Fine-tuned from [Qwen3](https://github.com/QwenLM/Qwen3)
- Inspired by prior research on HTML content extraction