	---
	datasets:
	- bigcode/the-stack-v2
	base_model:
	- microsoft/codebert-base
	---
	# grammarBERT

	`grammarBERT` is a fine-tuned version of CodeBERT (`microsoft/codebert-base`), trained with a Masked Language Modeling (MLM) objective on derivation sequences for Python 3.8. Because these sequences are derived from Python’s Abstract Syntax Tree (AST) structures, `grammarBERT` pairs CodeBERT’s handling of natural language and code tokens with a focus on grammatical structure. It is intended for grammar-based programming tasks that need syntactic understanding, such as parsing and context-aware code generation or transformation.
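
To give a rough intuition for what a sequence derived from an AST can look like, the sketch below linearizes a Python AST into its node type names in pre-order, using only the standard `ast` module. The helper `node_type_sequence` is hypothetical and deliberately simplified: it is not the repository's `ast2seq`, whose actual output format may differ (for example by encoding grammar rules and terminal values).

```python
import ast
from typing import List


def node_type_sequence(source: str) -> List[str]:
    """Return the AST node type names of `source` in pre-order (depth-first)."""
    sequence: List[str] = []

    def visit(node: ast.AST) -> None:
        # Record the node type first, then descend into its children in field order.
        sequence.append(type(node).__name__)
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(ast.parse(source))
    return sequence


print(node_type_sequence("x = 1"))
# On Python 3.8+: ['Module', 'Assign', 'Name', 'Store', 'Constant']
```

Even this simplified view shows why a model trained on such sequences sees grammatical structure rather than surface tokens: the variable name and the literal value do not appear, only the constructs that contain them.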

	## Model Overview

	- Base Model: CodeBERT (`microsoft/codebert-base`)
	- Task: Masked Language Modeling on derivation sequences
	- Supported Language: Python 3.8
	- Applications: Parsing, code transformation, syntactic analysis, grammar-based programming

	## Model Usage

	To use the `grammarBERT` model with Python 3.8-specific derivation sequences, load the model and tokenizer as shown below:

	```python
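	from transformers import RobertaForMaskedLM, RobertaTokenizer

	# NOTE: `ast2seq` is not part of `transformers`. Make it importable first from the
	# grammarBERT repository: https://github.com/NathanaelBeau/grammarBERT/asdl/

	# Load the pre-trained grammarBERT model and tokenizer
	model = RobertaForMaskedLM.from_pretrained("Nbeau/grammarBERT")
	tokenizer = RobertaTokenizer.from_pretrained("Nbeau/grammarBERT")
	model.eval()  # disable dropout for inference

	# Use a syntactically complete snippet: a bare function header such as
	# "def enumerate_items(items):" cannot be parsed into an AST.
	code_snippet = "def enumerate_items(items):\n    return list(enumerate(items))"
	# Convert the code to a derivation sequence (requires the `ast2seq` function)
	derivation_sequence = ast2seq(code_snippet)
	input_ids = tokenizer.encode(derivation_sequence, return_tensors="pt")

	# Forward pass; `outputs.logits` holds per-position scores over the vocabulary.
	# Use them for masked token prediction, or fine-tune the model further.
	outputs = model(input_ids)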
	```
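
For masked token prediction, the scores at a masked position still have to be turned into ranked candidates. The sketch below shows one way to do that in plain Python; `top_k_predictions` and the toy vocabulary are hypothetical names used only for illustration, and the input is assumed to be a single row of logits (one score per vocabulary entry).

```python
import math
from typing import Callable, List, Sequence, Tuple


def top_k_predictions(
    logits_row: Sequence[float],
    id_to_token: Callable[[int], str],
    k: int = 5,
) -> List[Tuple[str, float]]:
    """Return the k most likely tokens with their softmax probabilities."""
    # Subtract the maximum before exponentiating for numerical stability.
    max_logit = max(logits_row)
    exps = [math.exp(x - max_logit) for x in logits_row]
    total = sum(exps)
    # Rank vocabulary ids by score, highest first, and keep the top k.
    ranked = sorted(range(len(logits_row)), key=lambda i: logits_row[i], reverse=True)[:k]
    return [(id_to_token(i), exps[i] / total) for i in ranked]


# Toy example with a four-entry vocabulary (illustrative values only).
toy_vocab = ["Module", "Assign", "Name", "Constant"]
toy_logits = [0.1, 2.0, 1.0, -1.0]
for token, prob in top_k_predictions(toy_logits, toy_vocab.__getitem__, k=2):
    print(f"{token}: {prob:.3f}")
```

With the real model, the row would come from `outputs.logits[0, mask_index].tolist()`, where `mask_index` is a position whose input id equals `tokenizer.mask_token_id`, and `id_to_token` could be `tokenizer.convert_ids_to_tokens`. This assumes the input sequence actually contains a mask token, which the loading example above does not insert.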

	### Training and Fine-Tuning

	To train your own `grammarBERT` on a custom dataset or adapt it for different Python versions, follow the setup instructions in the [grammarBERT GitHub repository](https://github.com/NathanaelBeau/grammarBERT). The repository provides detailed guidance for:

	- Preparing Python Abstract Syntax Tree (AST) sequences.
	- Configuring tokenization for derivation sequences.
	- Running training scripts for Masked Language Modeling (MLM) fine-tuning.

	This setup supports targeted fine-tuning on derivation sequences for the specific grammar you want to model.
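
As a rough illustration of the MLM step, the sketch below applies the standard BERT-style masking recipe to a token sequence: each token is selected with probability 0.15, and a selected token is replaced by the mask token 80% of the time, by a random vocabulary token 10% of the time, and left unchanged 10% of the time. The function `mask_for_mlm` and the toy tokens are hypothetical, and the repository's training scripts may use different settings. In practice, `DataCollatorForLanguageModeling` from `transformers` performs this kind of masking on token ids.

```python
import random
from typing import List, Optional, Sequence, Tuple


def mask_for_mlm(
    tokens: Sequence[str],
    vocab: Sequence[str],
    mask_token: str = "<mask>",
    mask_prob: float = 0.15,
    seed: Optional[int] = None,
) -> Tuple[List[str], List[Optional[str]]]:
    """Return (masked_tokens, labels); labels are None where no prediction is required."""
    rng = random.Random(seed)
    masked: List[str] = []
    labels: List[Optional[str]] = []
    for token in tokens:
        if rng.random() < mask_prob:
            labels.append(token)  # the model must recover the original token here
            roll = rng.random()
            if roll < 0.8:
                masked.append(mask_token)         # 80%: replace with the mask token
            elif roll < 0.9:
                masked.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                masked.append(token)              # 10%: keep the original token
        else:
            labels.append(None)
            masked.append(token)
    return masked, labels


# Toy sequence of AST node type names (illustrative only).
toy_tokens = ["Module", "Assign", "Name", "Store", "Constant"]
masked_tokens, labels = mask_for_mlm(toy_tokens, vocab=toy_tokens, seed=0)
print(masked_tokens)
print(labels)
```

Passing a fixed `seed` makes the masking reproducible, which helps when comparing runs. Leaving it as `None` gives a different mask on each call, so the same sequence can be masked differently every time it is seen during training.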