---
language:
- ja
license: apache-2.0
base_model: sbintuitions/modernbert-ja-310m
tags:
- text-classification
- multi-label-classification
- japanese
- location-detection
- emergency-call
datasets: []
metrics:
- f1
- precision
- recall
pipeline_tag: text-classification
---

# ModernBERT Japanese Location-Type Classifier (Improved Version)

This model is a fine-tuned version of `sbintuitions/modernbert-ja-310m` for multi-label classification of location types in Japanese emergency-call text.

## Model Overview

- Base model: sbintuitions/modernbert-ja-310m
- Task: multi-label text classification (location-type detection)
- Language: Japanese
- Labels: apartment, outdoor, highway, station, commercial_facility

## Training Settings

### Improvements
1. Introduced a class-weighted loss function (to emphasize minority classes)
2. Epochs: 20
3. Batch size: 8
4. Learning rate: 2e-5
5. Warmup ratio: 0.15
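
The card does not say how the class weights enter the loss. One common reading for multi-label classification is a per-label binary cross-entropy in which each label's term is scaled by its class weight. The pure-Python sketch below illustrates only that idea, using the class weights listed in this card; the helper name `weighted_bce` and the choice to scale both positive and negative terms are assumptions, not the author's training code. In PyTorch, a similar effect is often obtained through the `weight` or `pos_weight` arguments of `torch.nn.BCEWithLogitsLoss`.

```python
import math

LABELS = ["apartment", "outdoor", "highway", "station", "commercial_facility"]
# Weights as listed in this model card.
CLASS_WEIGHTS = [1.0, 1.72, 18.15, 9.08, 2.03]


def weighted_bce(logits, targets, weights):
    """Mean binary cross-entropy over labels, each term scaled by its class weight.

    Hypothetical helper for illustration; not taken from the training script.
    """
    total = 0.0
    for z, y, w in zip(logits, targets, weights):
        # Numerically stable BCE with logits: max(z, 0) - z*y + log(1 + exp(-|z|))
        loss = max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
        total += w * loss
    return total / len(logits)


# The same mistake (logit 0.0 where the target is 1) costs more on "highway"
# than on "apartment" because of the larger weight.
miss_apartment = weighted_bce([0.0, -5, -5, -5, -5], [1, 0, 0, 0, 0], CLASS_WEIGHTS)
miss_highway = weighted_bce([-5, -5, 0.0, -5, -5], [0, 0, 1, 0, 0], CLASS_WEIGHTS)
print(miss_apartment < miss_highway)
```

With this kind of weighting, errors on rare classes such as `highway` contribute more to the gradient, which is the stated goal of emphasizing minority classes.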

### Class Weights
```python
CLASS_WEIGHTS = [
    1.0,    # apartment (236 samples)
    1.72,   # outdoor (137 samples)
    18.15,  # highway (13 samples)
    9.08,   # station (26 samples)
    2.03,   # commercial_facility (116 samples)
]
```
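
The listed weights match the ratio of the largest class count to each class count, rounded to two decimals. That rule is inferred from the numbers above rather than stated by the author, so the snippet below is a sketch under that assumption (the names `COUNTS` and `class_weights` are illustrative).

```python
# Per-class sample counts as given in the comments above.
COUNTS = {
    "apartment": 236,
    "outdoor": 137,
    "highway": 13,
    "station": 26,
    "commercial_facility": 116,
}


def class_weights(counts):
    """Weight each class by max_count / class_count, rounded to 2 decimals."""
    largest = max(counts.values())
    return {label: round(largest / n, 2) for label, n in counts.items()}


weights = class_weights(COUNTS)
print(weights)
```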

## Performance

Evaluation results on the test data:

| Class | Precision | Recall | F1-Score |
|-------|-----------|--------|----------|
| apartment | 0.88 | 0.95 | 0.91 |
| outdoor | 0.88 | 0.83 | 0.86 |
| highway | 0.67 | 1.00 | 0.80 |
| station | 1.00 | 0.75 | 0.86 |
| commercial_facility | 0.83 | 0.71 | 0.76 |

Overall scores:
- Micro Avg: Precision 0.86, Recall 0.84, F1 0.85
- Macro Avg: Precision 0.85, Recall 0.85, F1 0.84
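
A macro average is the unweighted mean of the per-class scores, so the Macro Avg row can be checked directly against the table. The sketch below does that using the rounded table values; the names `PER_CLASS` and `macro_average` are illustrative.

```python
# (precision, recall, f1) per class, copied from the table above.
PER_CLASS = {
    "apartment": (0.88, 0.95, 0.91),
    "outdoor": (0.88, 0.83, 0.86),
    "highway": (0.67, 1.00, 0.80),
    "station": (1.00, 0.75, 0.86),
    "commercial_facility": (0.83, 0.71, 0.76),
}


def macro_average(rows):
    """Unweighted mean of each metric column, rounded to 2 decimals."""
    n = len(rows)
    column_sums = [sum(column) for column in zip(*rows.values())]
    return tuple(round(total / n, 2) for total in column_sums)


macro_p, macro_r, macro_f1 = macro_average(PER_CLASS)
print(macro_p, macro_r, macro_f1)
```

The Micro Avg row cannot be reproduced this way, because it pools true positives, false positives, and false negatives across classes, and those counts are not given in the card.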

## Usage

### Installation

```bash
pip install transformers torch
```

### Basic Example

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the model and tokenizer
model_name = "ttt421/modernbert-ja-location-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Inference
text = "マンションの3階から火が出ています"  # "Fire is coming from the 3rd floor of the apartment building"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)

with torch.no_grad():
    outputs = model(**inputs)
    # Multi-label: apply a sigmoid to each logit independently
    probs = torch.sigmoid(outputs.logits)[0]

# Display the results
labels = ["apartment", "outdoor", "highway", "station", "commercial_facility"]
threshold = 0.5

print("Detected location types:")
for label, prob in zip(labels, probs):
    if prob > threshold:
        print(f" {label}: {prob:.3f}")
```

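### Batch Processing

```python
# Reuses tokenizer, model, labels, and threshold from the basic example above.
texts = [
    "高速道路で事故が発生しました",  # "An accident has occurred on the highway"
    "駅のホームで人が倒れています",  # "Someone has collapsed on the station platform"
    "ショッピングモールで迷子になりました",  # "Someone got lost in the shopping mall"
]

# padding=True pads the shorter texts so the batch forms a single tensor
inputs = tokenizer(texts, return_tensors="pt", truncation=True, max_length=1024, padding=True)

with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.sigmoid(outputs.logits)

for i, text in enumerate(texts):
    print(f"\nText: {text}")
    print("Location types:")
    for label, prob in zip(labels, probs[i]):
        if prob > threshold:
            print(f" {label}: {prob:.3f}")
```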

## Limitations

- `highway` has only 4 test samples, so its scores are unstable
- Recall for `commercial_facility` is 0.71, which leaves room for improvement
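
One possible way to trade precision for recall on a weaker class is to use a per-label decision threshold instead of a single global 0.5. The sketch below is an illustration under assumptions, not something this model card prescribes: the helper `select_labels` and the threshold value are hypothetical, and real thresholds would need tuning on held-out data.

```python
LABELS = ["apartment", "outdoor", "highway", "station", "commercial_facility"]


def select_labels(probs, thresholds, default=0.5):
    """Return (label, prob) pairs whose probability exceeds that label's threshold.

    Labels missing from `thresholds` fall back to `default`.
    """
    return [
        (label, p)
        for label, p in zip(LABELS, probs)
        if p > thresholds.get(label, default)
    ]


# Hypothetical setting: a lower bar for commercial_facility to raise its recall.
THRESHOLDS = {"commercial_facility": 0.35}

# Example probabilities, e.g. from torch.sigmoid(outputs.logits)[0].tolist()
probs = [0.10, 0.05, 0.02, 0.03, 0.42]
selected = select_labels(probs, THRESHOLDS)
print(selected)
```

Lowering a threshold typically raises recall at the cost of precision, so any such change should be checked against both metrics.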

## License

Apache 2.0

## Citation

Base model: [sbintuitions/modernbert-ja-310m](https://huggingface.co/sbintuitions/modernbert-ja-310m)