|
--- |
|
pipeline_tag: text-generation |
|
inference: true |
|
widget: |
|
- text: "public class HelloWorld {\n public static void main(String[] args) {" |
|
example_title: Hello world |
|
group: Java |
|
license: bigcode-openrail-m |
|
datasets: |
|
- bigcode/starcoderdata |
|
metrics: |
|
- code_eval |
|
library_name: transformers |
|
language: |
|
- code |
|
tags: |
|
- NarrowTransformer |
|
model-index: |
|
- name: NT-Java-1.1B |
|
results: |
|
- task: |
|
type: text-generation |
|
dataset: |
|
type: nuprl/MultiPL-E |
|
name: MultiPL-HumanEval (Java) |
|
metrics: |
|
- name: pass@1 |
|
type: pass@1 |
|
value: 18.3 |
|
verified: false |
|
extra_gated_prompt: >- |
|
## Model License Agreement |
|
|
|
Please read the BigCode [OpenRAIL-M |
|
license](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement) |
|
agreement before accepting it. |
|
|
|
extra_gated_fields: |
|
I accept the above license agreement, and will use the Model complying with the set of use restrictions and sharing requirements: checkbox |
|
duplicated_from: bigcode-data/starcoderbase-1b |
|
--- |
|
|
|
# Model Summary |
|
|
|
The Narrow Transformer (NT) model **NT-Java-1.1B** is an open-source, specialized code model built by extending the pre-training of StarCoderBase-1B and designed for Java coding tasks. The model is a decoder-only transformer with Multi-Query Attention and a context length of 8192 tokens. It was trained on the Java subset of the StarCoderData dataset, which comprises roughly 22B tokens.
|
|
|
- **Repository:** [bigcode/Megatron-LM](https://github.com/bigcode-project/Megatron-LM) |
|
- **Paper:** |
|
- **Language(s):** Java |
|
|
|
<br> |
|
|
|
# Intended Uses |
|
|
|
Large code models require specialized hardware such as GPUs for inference, which highlights the need for small code models that can run on developer desktops. As a small language model (SLM), NT-Java-1.1B can be deployed on consumer-grade PCs and outperforms comparably sized open-source code models on Java programming tasks. Feel free to explore this powerful language model for your Java projects!
|
|
|
Quantized versions of NT-Java-1.1B, published as [NT-Java-1.1B-GGUF](https://huggingface.co/infosys/NT-Java-1.1B-GGUF), perform comparably to open 1B models on the MultiPL-E Java code benchmark and can be used with multiple frameworks, including CTranslate2 and GPT4All, making them versatile across deployment scenarios.
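
As an example, here is a minimal sketch of running a GGUF build locally with the GPT4All Python bindings. The `.gguf` filename below is illustrative only; pick an actual file from the NT-Java-1.1B-GGUF repository and download it first.

```python
# pip install gpt4all
from gpt4all import GPT4All

# NOTE: illustrative filename; substitute a real .gguf file downloaded from
# the infosys/NT-Java-1.1B-GGUF repository into ./models beforehand.
model = GPT4All(
    model_name="NT-Java-1.1B.Q4_K_M.gguf",
    model_path="./models",
    allow_download=False,
)

prompt = "public class HelloWorld {\n    public static void main(String[] args) {"
print(model.generate(prompt, max_tokens=128))
```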
|
|
|
**Feel free to share your generations in the Community tab!** |
|
|
|
**Primary Use Cases**
|
|
|
The model is tailored for commercial use in Java programming tasks. It is particularly suited for: |
|
|
|
1. Use in memory- and compute-constrained environments.
|
2. Application in latency-sensitive scenarios. |
|
3. Code generation and completion tasks in Java. |
|
4. FIM (code infilling) tasks specific to Java. |
|
|
|
# How to Use |
|
|
|
## Sample inference code |
|
|
|
### Generation |
|
```python
|
# pip install -q transformers |
|
from transformers import AutoModelForCausalLM, AutoTokenizer |
|
|
|
checkpoint = "infosys/NT-Java-1.1B" |
|
device = "cuda" # for GPU usage or "cpu" for CPU usage |
|
|
|
tokenizer = AutoTokenizer.from_pretrained(checkpoint) |
|
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device) |
|
|
|
inputs = tokenizer.encode("public class HelloWorld {\n public static void main(String[] args) {", return_tensors="pt").to(device) |
|
outputs = model.generate(inputs) |
|
print(tokenizer.decode(outputs[0])) |
|
``` |
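
Note that `generate()` falls back to the library's default generation length (typically 20 tokens) when no budget is given; for longer completions, pass `max_new_tokens` explicitly, for example:

```python
# request a longer completion than the library default
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0]))
```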
|
### Fill-in-the-middle |
|
Fill-in-the-middle uses special tokens to identify the prefix/middle/suffix part of the input and output: |
|
|
|
```python
|
input_text = "<fim_prefix>public class PalindromeChecker {\n public static boolean isPalindrome(String str) {\n <fim_suffix>return true;\n }\n<fim_middle>" |
|
inputs = tokenizer.encode(input_text, return_tensors="pt").to(device) |
|
outputs = model.generate(inputs) |
|
print(tokenizer.decode(outputs[0])) |
|
``` |
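
The decoded output echoes the prefix, suffix, and sentinel tokens along with the generated code. A minimal sketch for recovering just the infilled middle, assuming the StarCoderBase sentinels (`<fim_middle>`, `<|endoftext|>`):

```python
# everything after <fim_middle> is the model's infilled code;
# strip the end-of-text marker if the model emitted one
completion = tokenizer.decode(outputs[0])
middle = completion.split("<fim_middle>")[-1].replace("<|endoftext|>", "")
print(middle)
```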
|
|
|
### Quantized Versions through `bitsandbytes` |
|
* _Using 8-bit precision (int8)_ |
|
|
|
```python
|
# pip install bitsandbytes accelerate |
|
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig |
|
|
|
# to use 4bit use `load_in_4bit=True` instead |
|
quantization_config = BitsAndBytesConfig(load_in_8bit=True) |
|
|
|
checkpoint = "infosys/NT-Java-1.1B" |
|
tokenizer = AutoTokenizer.from_pretrained(checkpoint) |
|
model = AutoModelForCausalLM.from_pretrained(checkpoint, quantization_config=quantization_config) |
|
|
|
inputs = tokenizer.encode("public class HelloWorld {\n public static void main(String[] args) {", return_tensors="pt").to("cuda") |
|
outputs = model.generate(inputs) |
|
print(tokenizer.decode(outputs[0])) |
|
``` |
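
* _Using 4-bit precision (int4)_

A possible 4-bit configuration, assuming the NF4 options exposed by recent `bitsandbytes`/`transformers` releases; the quantization type and compute dtype below are illustrative choices, not settings prescribed by this model card.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# NF4 4-bit weights with bfloat16 compute; adjust to your hardware
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

checkpoint = "infosys/NT-Java-1.1B"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, quantization_config=quantization_config)

inputs = tokenizer.encode("public class HelloWorld {\n    public static void main(String[] args) {", return_tensors="pt").to("cuda")
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
```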
|
|
|
# Attribution & Other Requirements |
|
|
|
The pretraining dataset for the model was curated to include only data with permissive licenses. Despite this, the model is capable of generating source code verbatim from the dataset. The licenses of such code may necessitate attribution and adherence to other specific conditions. To facilitate compliance, we provide a [search index](https://huggingface.co/spaces/bigcode/search) that enables users to trace the origins of generated code within the pretraining data, allowing for proper attribution and adherence to licensing requirements. |
|
|
|
# Limitations |
|
|
|
The NT-Java-1.1B model has been trained on publicly available datasets and is offered without any safety guarantees. As with all language models, its outputs are inherently unpredictable, and the generated code may not perform as expected. Additionally, the code may be inefficient or contain bugs and security vulnerabilities. Consequently, it is imperative for users and developers to undertake extensive safety testing and to implement robust filtering mechanisms tailored to their specific needs. |
|
|
|
# Training |
|
|
|
## Model |
|
|
|
- **Architecture:** GPT-2 model with Multi-Query Attention and Fill-in-the-Middle objective. |
|
- **Pretraining steps:** 100K |
|
- **Context length:** 8K tokens |
|
- **Pretraining tokens:** 22 billion |
|
- **Precision:** bfloat16 |
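
The architecture settings above can be cross-checked against the published configuration; a small sketch, assuming the `gpt_bigcode` (GPTBigCode) architecture inherited from StarCoderBase:

```python
from transformers import AutoConfig

# attribute names follow GPTBigCodeConfig
config = AutoConfig.from_pretrained("infosys/NT-Java-1.1B")
print(config.model_type)   # expected: gpt_bigcode
print(config.n_positions)  # context length (8192)
print(config.multi_query)  # True -> Multi-Query Attention
```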
|
|
|
## Hardware |
|
|
|
- **GPUs:** 6 NVIDIA A100 80GB |
|
- **Training time:** 10 days |
|
|
|
## Software |
|
|
|
- **Orchestration:** [Megatron-LM](https://github.com/bigcode-project/Megatron-LM) |
|
- **Neural networks:** [PyTorch](https://github.com/pytorch/pytorch) |
|
|
|
# License |
|
The model is licensed under the BigCode OpenRAIL-M v1 license agreement. You can find the full agreement [here](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement). |
|
# Citation |
|
``` |
|
@article{rathinasamy2024narrow,

title={Narrow Transformer: StarCoder-Based Java-LM for Desktop},
|
author={Kamalkumar Rathinasamy and Balaji A J and Rajab Ali Mondal and Ankush Kumar and Harshini K and Gagan Gayari and Sreenivasa Raghavan Karumboor Seshadri}, |
|
year={2024}, |
|
eprint={2407.03941},
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CL} |
|
} |
|
``` |