---
language:
- en
library_name: transformers
license: cc-by-4.0
tags:
- kl3m
- kl3m-004
- correction
- legal
- financial
- enterprise
- slm
date: '2024-02-20T00:00:00.000Z'
pipeline_tag: text-generation
widget:
  - text: "Tne Uni+ed 5tates is nct responsib|e for the<|sep|>"
inference:
  parameters:
    temperature: 0.3
    do_sample: true
---

# kl3m-004-correction-001 Model

kl3m-004-correction-001 is a small, ~500M parameter language model designed to correct common typing, spelling,
OCR, and formatting issues in English text, especially in the financial and legal domains.

Notably, this model was trained with the [alea-institute/kl3m-004-char-8k-cased](https://huggingface.co/alea-institute/kl3m-004-char-8k-cased) tokenizer,
a BPE tokenizer trained with a maximum token length of 3 characters.
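
To see what this looks like in practice, you can load the tokenizer directly and inspect the short pieces it produces; the split shown in the comment below is illustrative, not guaranteed:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-004-char-8k-cased")

# every token is at most 3 characters long; the exact split is illustrative
print(tokenizer.tokenize("responsib|e"))
```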

This model was originally trained for 3 days on a single RTX 3090; a larger ~3B parameter MoE is pending release.



## Getting Started

Simply prompt the model with the original text followed by the `<|sep|>` token, and generate until the stop token (`<|end|>`) is produced.  You can use `pipeline` to handle this for you.
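
If you want to see the protocol without `pipeline`, the following is a minimal sketch using `generate` directly, assuming `<|end|>` is registered as a token in the tokenizer vocabulary:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-004-correction-001")
model = AutoModelForCausalLM.from_pretrained("alea-institute/kl3m-004-correction-001")

# original text, followed by the <|sep|> separator token
prompt = "Tne Uni+ed 5tates is nct responsib|e for 5uch pr0duction<|sep|>"
inputs = tokenizer(prompt, return_tensors="pt")

# stop when the model emits <|end|>
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    eos_token_id=tokenizer.convert_tokens_to_ids("<|end|>"),
)

# decode only the tokens generated after the prompt
corrected = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)
print(corrected)
```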


### Deterministic

In many situations, deterministic correction (greedy decoding, i.e., taking the most probable token at each step) is sufficient.

```python
from transformers import pipeline

# do_sample defaults to False, so decoding is deterministic (greedy)
p = pipeline('text-generation', 'alea-institute/kl3m-004-correction-001', device='cpu')

text = "Tne Uni+ed 5tates is nct responsib|e for 5uch pr0duction"

# append <|sep|> to mark the end of the input; return only the correction
correction = p(text + "<|sep|>", max_new_tokens=512, return_full_text=False)[0]['generated_text']

# Output: The United States is not responsible for such production
```

### Sampled with Frequency Weighting

In other situations, it can be useful to generate multiple corrections with a sampler and evaluate the distribution.  For example:
* using a string- or token-based distance metric to score or rank corrections (see the ranking sketch after the example below)
* showing multiple suggestions to a user in frequency-weighted order


```python
from transformers import pipeline
from collections import Counter

p = pipeline('text-generation', 'alea-institute/kl3m-004-correction-001', device='cuda')

text = "Tne Uni+ed 5tates is nct responsib|e for 5uch pr0duction"

# count identical corrections across the sampled generations
corrections = Counter(
  [
    g['generated_text']
    for g in p(
      text + "<|sep|>",
      max_new_tokens=512,
      return_full_text=False,
      temperature=0.5,
      # top_p, top_k, custom sampler, etc.
      do_sample=True,
      num_return_sequences=10
    )
  ]
).most_common(3)

# Output: [('The United States is not responsible for such production', 7), ('the United States is not responsible for such production', 3)]
```
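
To apply the distance-metric idea above, one option is to re-rank the sampled corrections by edit distance to the original input, breaking ties by sample frequency. This is a hypothetical sketch using a plain Levenshtein implementation, not part of the model's API; it reuses `text` and `corrections` from the example above:

```python
def levenshtein(a: str, b: str) -> int:
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

# rank candidates: fewest edits from the input first, then most frequent
ranked = sorted(corrections, key=lambda pair: (levenshtein(text, pair[0]), -pair[1]))
```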


## Source

[https://github.com/alea-institute/kl3m-model-research](https://github.com/alea-institute/kl3m-model-research)


## Training Data
This model was trained on a dataset generated with the KL3M data collection and the [alea-data-generator](https://github.com/alea-institute/alea-data-generator) library, which
can create realistic synthetic samples using traditional (non-generative) techniques.

The source code to retrieve and process this dataset is available here:
[https://github.com/alea-institute/kl3m-data](https://github.com/alea-institute/kl3m-data)

Some pre-tokenized subsets of the KL3M data collection are available on Hugging Face:
[https://huggingface.co/datasets?sort=most_rows&search=kl3m-data](https://huggingface.co/datasets?sort=most_rows&search=kl3m-data)

Complete, raw data is currently available upon request via S3 under a Requester Pays model.  We are actively working on a
zero-cost distribution model and will adopt one as soon as we can obtain additional support.
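
For reference, Requester Pays access via boto3 looks roughly like the sketch below; the bucket and key names are placeholders, not actual distribution paths, and transfer costs are billed to the requester:

```python
import boto3

s3 = boto3.client("s3")

# placeholders only; request access before attempting to download
response = s3.get_object(
    Bucket="<kl3m-data-bucket>",
    Key="<path/to/object>",
    RequestPayer="requester",  # transfer costs billed to you
)
data = response["Body"].read()
```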

## Model Details

### Summary
- **Architecture**: LlamaForCausalLM
- **Parameters**: 478.2M
- **Context Window**: 512 tokens (no RoPE)
- **Language(s)**: Primarily English
- **Tokenizer**: kl3m-004-char-8k-cased BPE tokenizer (8K vocabulary; tokens of 1-3 characters each)
- **Developed by**: [ALEA Institute](https://aleainstitute.ai)
- **License**: [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/)
- **Hardware Requirements**: Runs in real time in fp32 on CPUs or consumer NVIDIA/AMD GPUs

## Key Features

- **Clean Training Data**: Built on what was originally referred to as the Kelvin Legal DataPack, ensuring all training data is ethically sourced and legally permissible.
- **Low Toxicity**: [Empirically lower toxicity and bias](https://github.com/alea-institute/kl3m-toxicity)
- **Enterprise Focus**: Specifically designed for legal, regulatory, and financial workflows.
- **Efficient Deployment**: Optimized for real-time inference on consumer hardware.

## Use Cases

- Correcting common typing or spelling errors
- Correcting common OCR errors
- Correcting common formatting errors 

## License

Model weights are released under the CC-BY 4.0 License.

## Contact

The KL3M model family is now maintained by the [ALEA Institute](https://aleainstitute.ai). For technical support, collaboration opportunities, or general inquiries:
 
- GitHub: https://github.com/alea-institute/kl3m-model-research
- Email: [email protected]
- Website: https://aleainstitute.ai


## Citation

Tokenizer, dataset, and model publications are pending.


![https://aleainstitute.ai](https://aleainstitute.ai/images/alea-logo-ascii-1x1.png)