---
license: apache-2.0
datasets:
- opendatalab/AICC
language:
- en
- zh
pipeline_tag: text-generation
tags:
- commoncrawl
- html-extraction
- content-extraction
- information-extraction
- qwen
base_model:
- Qwen/Qwen3-0.6B
---
# Dripper (MinerU-HTML)

**Dripper (MinerU-HTML)** is an HTML main-content extraction tool built on large language models (LLMs). It provides a complete pipeline for extracting the primary content of HTML pages using LLM-based classification and state-machine-guided generation.

## Features

- 🚀 **LLM-Powered Extraction**: Uses state-of-the-art language models to intelligently identify main content
- 🎯 **State Machine Guidance**: Implements logits processing with state machines for structured JSON output (see the sketch after this list)
- 🔄 **Fallback Mechanism**: Automatically falls back to alternative extraction methods on errors
- 📊 **Comprehensive Evaluation**: Built-in evaluation framework with ROUGE and item-level metrics
- 🌐 **REST API Server**: FastAPI-based server for easy integration
- ⚡ **Distributed Processing**: Ray-based parallel processing for large-scale evaluation
- 🔧 **Multiple Extractors**: Supports various baseline extractors for comparison
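
The idea behind the state-machine guidance is that at each decoding step only tokens consistent with the expected JSON structure are allowed. A toy, self-contained sketch of the mechanism (illustrative only; the real processors live under `dripper/inference/logtis_processor/` and plug into vLLM, and the token ids below are made up):

```python
import math

class SkeletonLogitsProcessor:
    """Mask every token the state machine does not allow at the current step."""

    def __init__(self, allowed_ids_per_step):
        self.allowed_ids_per_step = allowed_ids_per_step

    def __call__(self, generated_ids, logits):
        step = len(generated_ids)
        if step < len(self.allowed_ids_per_step):
            allowed = self.allowed_ids_per_step[step]
            # Disallowed tokens get -inf so softmax assigns them zero probability
            logits = [x if i in allowed else -math.inf for i, x in enumerate(logits)]
        return logits

# Pretend token 0 = '{', token 1 = '}', token 2 = '"main"':
proc = SkeletonLogitsProcessor([{0}, {2}, {1}])
print(proc([], [0.1, 0.2, 0.3]))  # only '{' (token 0) survives the mask
```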

## GitHub 🔧 [MinerU-HTML](https://github.com/opendatalab/MinerU-HTML)

## Installation

### Prerequisites

- Python >= 3.10
- CUDA-capable GPU (recommended for LLM inference)
- Sufficient memory for model loading

### Install from Source

The installation process automatically handles dependencies. The `setup.py` reads dependencies from `requirements.txt` and optionally from `baselines.txt`.

#### Basic Installation (Core Functionality)

For basic usage of Dripper, install with core dependencies only:

```bash
# Clone the repository
git clone https://github.com/opendatalab/MinerU-HTML
cd MinerU-HTML

# Install the package with core dependencies only
# Dependencies from requirements.txt are automatically installed
pip install .
```

#### Installation with Baseline Extractors (for Evaluation)

If you need to run baseline evaluations and comparisons, install with the `baselines` extra:

```bash
# Install with baseline extractor dependencies
pip install -e .[baselines]
```

This will install additional libraries required for baseline extractors:

- `readabilipy`, `readability_lxml` - Readability-based extractors
- `resiliparse` - Resilient HTML parsing
- `justext` - JustText extractor
- `gne` - General News Extractor
- `goose3` - Goose3 article extractor
- `boilerpy3` - Boilerplate removal
- `crawl4ai` - AI-powered web content extraction

**Note**: The baseline extractors are only needed for running comparative evaluations. For basic usage of Dripper, the core installation is sufficient.

## Quick Start

### 1. Download the model

Visit the model page at [MinerU-HTML](https://huggingface.co/opendatalab/MinerU-HTML), or download the weights directly with:

```bash
huggingface-cli download opendatalab/MinerU-HTML
```
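
To place the weights in a specific local directory, `huggingface-cli` also accepts `--local-dir`:

```bash
huggingface-cli download opendatalab/MinerU-HTML --local-dir ./MinerU-HTML
```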

### 2. Using the Python API

```python
from dripper.api import Dripper

# Initialize Dripper with model configuration
dripper = Dripper(
    config={
        'model_path': '/path/to/your/model',
        'tp': 1,  # Tensor parallel size
        'state_machine': None,  # or 'v1' or 'v2'
        'use_fall_back': True,
        'raise_errors': False,
    }
)

# Extract main content from HTML
html_content = "<html>...</html>"
result = dripper.process(html_content)

# Access results
main_html = result[0].main_html
```

### 3. Using the REST API Server

```bash
# Start the server
python -m dripper.server \
    --model_path /path/to/your/model \
    --state_machine v2 \
    --port 7986

# Or use environment variables
export DRIPPER_MODEL_PATH=/path/to/your/model
export DRIPPER_STATE_MACHINE=v2
export DRIPPER_PORT=7986
python -m dripper.server
```

Then make requests to the API:

```bash
# Extract main content
curl -X POST "http://localhost:7986/extract" \
  -H "Content-Type: application/json" \
  -d '{"html": "<html>...</html>", "url": "https://example.com"}'

# Health check
curl http://localhost:7986/health
```
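
For programmatic access, a thin Python client can wrap the endpoint. A minimal sketch using `requests` (the response field names are assumptions; inspect the server's actual JSON for the exact schema):

```python
import requests

DRIPPER_URL = "http://localhost:7986"

def extract_main_content(html: str, url: str | None = None) -> dict:
    """POST HTML to the Dripper server and return the parsed JSON response."""
    payload = {"html": html}
    if url is not None:
        payload["url"] = url
    resp = requests.post(f"{DRIPPER_URL}/extract", json=payload, timeout=300)
    resp.raise_for_status()
    return resp.json()

result = extract_main_content("<html>...</html>", url="https://example.com")
print(result)
```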

## Configuration

### Dripper Configuration Options

| Parameter       | Type | Default      | Description                                      |
| --------------- | ---- | ------------ | ------------------------------------------------ |
| `model_path`    | str  | **Required** | Path to the LLM model directory                  |
| `tp`            | int  | 1            | Tensor parallel size for model inference         |
| `state_machine` | str  | None         | State machine version: `'v1'`, `'v2'`, or `None` |
| `use_fall_back` | bool | True         | Enable fallback to trafilatura on errors         |
| `raise_errors`  | bool | False        | Raise exceptions on errors (vs returning None)   |
| `debug`         | bool | False        | Enable debug logging                             |
| `early_load`    | bool | False        | Load model during initialization                 |
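
Putting these options together, a fuller initialization might look like the sketch below (paths are placeholders):

```python
from dripper.api import Dripper

dripper = Dripper(
    config={
        'model_path': '/path/to/your/model',  # required
        'tp': 1,                  # tensor parallel size (single GPU)
        'state_machine': 'v2',    # 'v1', 'v2', or None
        'use_fall_back': True,    # fall back to trafilatura on errors
        'raise_errors': False,    # return None instead of raising
        'debug': False,           # debug logging off
        'early_load': True,       # load the model at construction time
    }
)
```

Remember that `VLLM_USE_V1=0` must be set whenever `state_machine` is not `None` (see Environment Variables below).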

### Environment Variables

- `DRIPPER_MODEL_PATH`: Path to the LLM model
- `DRIPPER_STATE_MACHINE`: State machine version (`v1`, `v2`, or empty)
- `DRIPPER_PORT`: Server port number (default: 7986)
- `VLLM_USE_V1`: Must be set to `'0'` when using a state machine

## Usage Examples

### Batch Processing

```python
from dripper.api import Dripper

dripper = Dripper(config={'model_path': '/path/to/model'})

# Process multiple HTML strings
html_list = ["<html>...</html>", "<html>...</html>"]
results = dripper.process(html_list)

for result in results:
    print(result.main_html)
```
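
Since `raise_errors` defaults to `False`, a document that fails extraction may come back as `None` rather than raising (per the configuration table above), so large batches are worth filtering defensively. A small sketch under that assumption:

```python
results = dripper.process(html_list)
extracted = [r.main_html for r in results if r is not None]
failures = sum(1 for r in results if r is None)
print(f"extracted {len(extracted)} documents, {failures} failures")
```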

### Evaluation

#### Baseline Evaluation

```bash
python app/eval_baseline.py \
    --bench /path/to/benchmark.jsonl \
    --task_dir /path/to/output \
    --extractor_name dripper-md \
    --default_config gpu \
    --model_path /path/to/model
```

#### Two-Step Evaluation

The `eval_with_answer.py` driver runs in two steps (prepare, then score), with model inference and result processing in between:

```bash
# If running inference without a state machine, set VLLM_USE_V1=1
export VLLM_USE_V1=1
# If using a state machine, set VLLM_USE_V1=0 instead
# export VLLM_USE_V1=0

RESULT_PATH=/path/to/output
MODEL_NAME=MinerU-HTML
MODEL_PATH=/path/to/model
BENCH_DATA=/path/to/benchmark.jsonl

# Step 1: Prepare for evaluation
python app/eval_with_answer.py \
    --bench $BENCH_DATA \
    --task_dir $RESULT_PATH/$MODEL_NAME \
    --step 1 --cpus 128 --force_update

# Step 2: Run inference
python app/run_inference.py \
    --task_dir $RESULT_PATH/$MODEL_NAME \
    --model_path $MODEL_PATH \
    --output_path $RESULT_PATH/$MODEL_NAME/res.jsonl \
    --no_logits

# Step 3: Process results
python app/process_res.py \
    --response $RESULT_PATH/$MODEL_NAME/res.jsonl \
    --answer $RESULT_PATH/$MODEL_NAME/ans.jsonl \
    --error $RESULT_PATH/$MODEL_NAME/err.jsonl

# Step 4: Evaluate with answers
python app/eval_with_answer.py \
    --bench $BENCH_DATA \
    --task_dir $RESULT_PATH/$MODEL_NAME \
    --answer $RESULT_PATH/$MODEL_NAME/ans.jsonl \
    --step 2 --cpus 128 --force_update
```

## Project Structure

```
Dripper/
├── dripper/                 # Main package
│   ├── api.py              # Dripper API class
│   ├── server.py           # FastAPI server
│   ├── base.py             # Core data structures
│   ├── exceptions.py        # Custom exceptions
│   ├── inference/          # LLM inference modules
│   │   ├── inference.py    # Generation functions
│   │   ├── prompt.py       # Prompt generation
│   │   ├── logits.py       # Response parsing
│   │   └── logtis_processor/  # State machine logits processors
│   ├── process/            # HTML processing
│   │   ├── simplify_html.py
│   │   ├── map_to_main.py
│   │   └── html_utils.py
│   ├── eval/               # Evaluation modules
│   │   ├── metric.py       # ROUGE and item-level metrics
│   │   ├── eval.py         # Evaluation functions
│   │   ├── process.py      # Processing utilities
│   │   └── benckmark.py    # Benchmark data structures
│   └── eval_baselines/     # Baseline extractors
│       ├── base.py         # Evaluation framework
│       └── baselines/       # Extractor implementations
├── app/                    # Application scripts
│   ├── eval_baseline.py    # Baseline evaluation script
│   ├── eval_with_answer.py # Two-step evaluation
│   ├── run_inference.py    # Inference script
│   └── process_res.py     # Result processing
├── requirements.txt        # Core Python dependencies (auto-installed)
├── baselines.txt          # Optional dependencies for baseline extractors
├── LICENCE                # Apache License 2.0
├── NOTICE                 # Copyright and attribution notices
└── setup.py               # Package setup (handles dependency installation)
```

## Supported Extractors

Dripper supports various baseline extractors for comparison:

- **Dripper** (`dripper-md`, `dripper-html`): The main LLM-based extractor
- **Trafilatura**: Fast and accurate content extraction
- **Readability**: Mozilla's readability algorithm
- **BoilerPy3**: Python port of Boilerpipe
- **NewsPlease**: News article extractor
- **Goose3**: Article extractor
- **GNE**: General News Extractor
- **Crawl4ai**: AI-powered web content extraction
- And more...

## Evaluation Metrics

- **ROUGE Scores**: ROUGE-N precision, recall, and F1 scores
- **Item-Level Metrics**: Per-tag-type (main/other) precision, recall, F1, and accuracy
- **HTML Output**: Extracted main HTML for visual inspection
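
For reference, ROUGE-N measures n-gram overlap between the extracted text and the ground truth. A self-contained sketch of the computation (illustrative only; `dripper/eval/metric.py` is the authoritative implementation):

```python
from collections import Counter

def rouge_n(candidate: str, reference: str, n: int = 1) -> dict:
    """ROUGE-N precision, recall, and F1 over whitespace-tokenized n-grams."""
    def ngrams(text: str) -> Counter:
        tokens = text.split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return {"precision": precision, "recall": recall, "f1": f1}

print(rouge_n("the main article text", "the full main article text"))
```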

## Development

### Running Tests

```bash
# Add test commands here when available
```

### Code Style

The project uses pre-commit hooks for code quality. Install them:

```bash
pre-commit install
```

## Troubleshooting

### Common Issues

1. **VLLM_USE_V1 Error**: When using state machine, ensure `VLLM_USE_V1=0` is set:

   ```bash
   export VLLM_USE_V1=0
   ```

2. **Model Loading Errors**: Verify model path and ensure sufficient GPU memory

3. **Import Errors**: Ensure the package is properly installed:

   ```bash
   # Reinstall the package (this will automatically install dependencies from requirements.txt)
   pip install -e .

   # If you need baseline extractors for evaluation:
   pip install -e .[baselines]
   ```

## License

This project is licensed under the Apache License, Version 2.0. See the [LICENCE](LICENCE) file for details.

### Copyright Notice

This project contains code and model weights derived from Qwen3. Original Qwen3 Copyright 2024 Alibaba Cloud, licensed under Apache License 2.0. Modifications and additional training Copyright 2025 OpenDatalab Shanghai AILab, licensed under Apache License 2.0.

For more information, please see the [NOTICE](NOTICE) file.


## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## Acknowledgments

- Built on top of [vLLM](https://github.com/vllm-project/vllm) for efficient LLM inference
- Uses [Trafilatura](https://github.com/adbar/trafilatura) for fallback extraction
- Finetuned on [Qwen3](https://github.com/QwenLM/Qwen3)
- Inspired by various HTML content extraction research