neuralbioinfo/prokbert-mini

@@ -20,20 +20,15 @@ tokenization_parameters = {
     'kmer': 6,
     'shift': 1
 }
 # Initialize the tokenizer and model
 tokenizer = ProkBERTTokenizer(tokenization_params=tokenization_parameters, operation_space='sequence')
 model = MegatronBertForMaskedLM.from_pretrained("neuralbioinfo/prokbert-mini-k6s2")
 # Example DNA sequence
 sequence = 'ATGTCCGCGGGACCT'
 # Tokenize the sequence
 inputs = tokenizer(sequence, return_tensors="pt")
 # Ensure that inputs have a batch dimension
 inputs = {key: value.unsqueeze(0) for key, value in inputs.items()}
 # Generate outputs from the model
 outputs = model(**inputs)
 ```
@@ -91,48 +86,6 @@ After segmentation, sequences are encoded into a vector format. The LCA method a
 4. **Create a Padded/Truncated Array**: Generate a uniform array structure, padding or truncating as necessary.
 5. **Save the Array to HDF**: Store the processed data in an HDF (Hierarchical Data Format) file for efficient retrieval and use in training models.
-```python
-import pkg_resources
-from os.path import join
-from prokbert.sequtils import *
-# Directory for pretraining FASTA files
-pretraining_fasta_files_dir = pkg_resources.resource_filename('prokbert','data/pretraining')
-# Define segmentation and tokenization parameters
-segmentation_params = {
-    'max_length': 256,  # Split the sequence into segments of length L
-    'min_length': 6,
-    'type': 'random'
-}
-tokenization_parameters = {
-    'kmer': 6,
-    'shift': 1,
-    'max_segment_length': 2003,
-    'token_limit': 2000
-}
-# Setup configuration
-defconfig = SeqConfig()
-segmentation_params = defconfig.get_and_set_segmentation_parameters(segmentation_params)
-tokenization_params = defconfig.get_and_set_tokenization_parameters(tokenization_parameters)
-# Load and segment sequences
-input_fasta_files = [join(pretraining_fasta_files_dir, file) for file in get_non_empty_files(pretraining_fasta_files_dir)]
-sequences = load_contigs(input_fasta_files, IsAddHeader=True, adding_reverse_complement=True, AsDataFrame=True, to_uppercase=True, is_add_sequence_id=True)
-segment_db = segment_sequences(sequences, segmentation_params, AsDataFrame=True)
-# Tokenization
-tokenized = batch_tokenize_segments_with_ids(segment_db, tokenization_params)
-expected_max_token = max(len(arr) for arrays in tokenized.values() for arr in arrays)
-X, torchdb = get_rectangular_array_from_tokenized_dataset(tokenized, tokenization_params['shift'], expected_max_token)
-# Save to HDF file
-hdf_file = '/tmp/pretraining.h5'
-save_to_hdf(X, hdf_file, database=torchdb, compression=True)
-```
 ### Installation of ProkBERT (if needed)
@@ -177,8 +130,10 @@ Please report any issues with the model or its outputs to the Neural Bioinformat
 - **Feedback and inquiries:** [[email protected]](mailto:[email protected])
 ## Reference
-```
 If you use ProkBERT-mini in your research, please cite the following paper:
+```
 @ARTICLE{10.3389/fmicb.2023.1331233,
     AUTHOR={Ligeti, Balázs and Szepesi-Nagy, István and Bodnár, Babett and Ligeti-Nagy, Noémi and Juhász, János},
     TITLE={ProkBERT family: genomic language models for microbiome applications},