In [1]:
import random
from collections import defaultdict

class MarkovChain:
    def __init__(self):
        self.transitions = defaultdict(list)

    def train(self, text: str):
        for i in range(len(text) - 1):
            curr_char = text[i]
            next_char = text[i+1]
            self.transitions[curr_char].append(next_char)

    def generate(self, start_char: str, length: int = 100):
        result = [start_char]
        for _ in range(length - 1):
            curr_char = result[-1]
            if curr_char not in self.transitions:
                break
            next_char = random.choice(self.transitions[curr_char])
            result.append(next_char)
        return "".join(result)


# Example corpus
corpus = "to be or not to be that is the question"
mc = MarkovChain()
mc.train(corpus)

# Generate text
print(mc.generate("t", 200))

the to be be to otothato ono be t be istot isthathationo t nor be to no be que o besthe tis or ior be ques otistio be be tior que que be nono no be bes ono io que o t que be no tothes tor que is tothe


- It “looks like English” at the letter level: every character comes from the corpus, and its common bigrams (to, be, th) show up a lot.

- But it’s nonsense at the word level: it keeps looping (be be, que que, nono no) because it only remembers the last letter, not words or phrases.

- This is the pure Markov signature: local statistical correctness, global nonsense.
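
To make "only remembers the last letter" concrete, the successor lists can be turned into empirical probabilities. This is a small sketch added for illustration, not part of the original notebook; the helper name `successor_probs` is hypothetical.

```python
from collections import Counter, defaultdict

def successor_probs(text: str) -> dict:
    """Map each character to {next_char: empirical probability}."""
    counts = defaultdict(Counter)
    # Count every adjacent (current, next) character pair in the text.
    for curr_char, next_char in zip(text, text[1:]):
        counts[curr_char][next_char] += 1
    return {
        char: {nxt: c / sum(ctr.values()) for nxt, c in ctr.items()}
        for char, ctr in counts.items()
    }

corpus = "to be or not to be that is the question"
probs = successor_probs(corpus)
print(probs["t"])  # several options may follow "t"
print(probs["q"])  # "q" is always followed by "u" in this corpus
```

Sampling from `random.choice` over the raw successor list, as `MarkovChain.generate` does, is equivalent to sampling from these probabilities: nothing earlier than the current character influences the draw.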

In [8]:
import random
from collections import defaultdict

class NGramMarkov:
    def __init__(self, n=5):
        self.n = n
        self.transitions = defaultdict(list)

    def train(self, text: str):
        for i in range(len(text) - self.n):
            gram = text[i:i+self.n]   # current n-gram
            next_char = text[i+self.n]
            self.transitions[gram].append(next_char)

    def generate(self, start: str, length: int = 200):
        result = list(start)
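        # Note: the loop count assumes len(start) == n; a longer seed yields
        # up to len(start) + length - n characters in total.
        for _ in range(length - self.n):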
            gram = "".join(result[-self.n:])
            if gram not in self.transitions:
                break
            next_char = random.choice(self.transitions[gram])
            result.append(next_char)
        return "".join(result)


# Example
corpus = """I loved you: and perhaps this flame
Has not entirely extinguished in my soul;
But let it no longer trouble you;
I do not want to cause you sorrow.
I loved you silently, without hope,
Tormented now by shyness, now by jealousy.
I loved you so sincerely, so tenderly,
As God grant you may be loved by another."""
mc2 = NGramMarkov(n=5)
mc2.train(corpus)
print(mc2.generate("to be", 200))

to be
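
The output is just the seed because the 5-gram "to be" never occurs in the poem, so `generate` hits its `break` on the first step. One way to check a seed before generating is sketched below; the helper `seed_is_known` is hypothetical and not part of the original notebook, and a single line of the poem stands in for the full corpus.

```python
def seed_is_known(text: str, seed: str, n: int) -> bool:
    """True if the last n characters of `seed` occur in `text` with a successor."""
    gram = seed[-n:]
    # Mirror the training loop: only n-grams followed by at least one
    # character end up as keys in the transition table.
    return any(text[i:i + n] == gram for i in range(len(text) - n))

poem_line = "I loved you: and perhaps this flame"
hamlet = "to be or not to be that is the question"

print(seed_is_known(poem_line, "to be", 5))        # False
print(seed_is_known(hamlet, "to be", 5))           # True
print(seed_is_known(poem_line, "I loved you", 5))  # True
```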


In [9]:
# Example
corpus = "to be or not to be that is the question"
mc2 = NGramMarkov(n=5)
mc2.train(corpus)
print(mc2.generate("to be", 200))

to be that is the question


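- This is the n=5 effect: the model doesn’t invent long-term structure; it replays chunks of the corpus. In this 39-character corpus the only 5-gram with more than one possible successor is “o be ” (followed by “o” or “t”), so every other step is forced.

- We’re watching the transition from gibberish to coherence, which depends on both corpus size and n.

- Useful material for a Hugging Face model card: it illustrates why longer context matters so much for language models.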
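
The dependence on n can be quantified by counting how many distinct next characters each context allows. The sketch below is an added illustration under that framing; `mean_branching` is a hypothetical helper, not something from the original notebook.

```python
from collections import defaultdict

def mean_branching(text: str, n: int) -> float:
    """Average number of distinct next characters per n-character context."""
    successors = defaultdict(set)
    for i in range(len(text) - n):
        successors[text[i:i + n]].add(text[i + n])
    return sum(len(s) for s in successors.values()) / len(successors)

corpus = "to be or not to be that is the question"
for n in range(1, 6):
    print(n, round(mean_branching(corpus, n), 3))
```

A value close to 1 means almost every step is forced, so the chain can only replay the corpus; larger values mean more freedom and, at small n, more gibberish.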

In [12]:
pushkin_corpus = """
I loved you: and perhaps this flame
Has not entirely extinguished in my soul;
But let it no longer trouble you;
I do not want to cause you sorrow.
I loved you silently, without hope,
Tormented now by shyness, now by jealousy.
I loved you so sincerely, so tenderly,
As God grant you may be loved by another.
"""

# Train & generate
mc_pushkin = NGramMarkov(n=5)
mc_pushkin.train(pushkin_corpus)
print(mc_pushkin.generate("I loved you", 300))

I loved you silently, without hope,
Tormented now by jealousy.
I loved you: and perhaps this flame
Has not entirely extinguished in my soul;
But let it no longer trouble you;
I do not entirely extinguished in my soul;
But let it no longer trouble you;
I do not entirely extinguished in my soul;
But let it 


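- n=5 really shows off: we’re getting whole lines of the poem, plus stitched ones like “Tormented now by jealousy.” (which skips “shyness, now by”) and “I do not entirely extinguished in my soul;”.

- But notice the loop: “I do not entirely extinguished in my soul; / But let it no longer trouble you;” repeats until the length limit. Classic Markov artifact: it can’t “remember” that it already said that.

- It’s poetic-looking but shallow: wherever a 5-character context has several continuations, the model picks one at random, so it either falls into repetition or, when a context has no continuation, stops abruptly.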
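
The loop in the output above comes from contexts shared by two different lines of the poem. As a rough illustration (the helper name `junctions` is hypothetical, and only three lines of the poem are used), the following lists the 5-grams with more than one distinct successor:

```python
from collections import defaultdict

def junctions(text: str, n: int) -> dict:
    """Return the n-grams that have more than one distinct successor."""
    successors = defaultdict(set)
    for i in range(len(text) - n):
        successors[text[i:i + n]].add(text[i + n])
    return {gram: nxt for gram, nxt in successors.items() if len(nxt) > 1}

poem = (
    "Has not entirely extinguished in my soul;\n"
    "But let it no longer trouble you;\n"
    "I do not want to cause you sorrow.\n"
)
for gram, nxt in junctions(poem, 5).items():
    print(repr(gram), sorted(nxt))
```

The context `" not "` is followed by `e` in "Has not entirely" and by `w` in "I do not want". Each time the chain reaches "I do not " it may jump back to "entirely extinguished...", which is how the two-line loop forms.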

In [5]:
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2-medium")

prompt = """
I loved you: and perhaps this flame
Has not entirely extinguished in my soul;
But let it no longer trouble you;
I do not want to cause you sorrow.
I loved you silently, without hope,
Tormented now by shyness, now by jealousy.
I loved you so sincerely, so tenderly,
As God grant you may be loved by another.
"""
# max_length includes the prompt tokens, so for a prompt this long it leaves
# little or no room for new text; max_new_tokens caps only the continuation.
outputs = generator(prompt, max_new_tokens=80, num_return_sequences=1,
                    do_sample=True, top_p=0.9, temperature=0.8)

print("Prompt:", prompt)
print("GPT2-medium Output:", outputs[0]['generated_text'])


Device set to use cpu
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Prompt: 
I loved you: and perhaps this flame
Has not entirely extinguished in my soul;
But let it no longer trouble you;
I do not want to cause you sorrow.
I loved you silently, without hope,
Tormented now by shyness, now by jealousy.
I loved you so sincerely, so tenderly,
As God grant you may be loved by another.

GPT2-medium Output: 
I loved you: and perhaps this flame
Has not entirely extinguished in my soul;
But let it no longer trouble you;
I do not want to cause you sorrow.
I loved you silently, without hope,
Tormented now by shyness, now by jealousy.
I loved you so sincerely, so tenderly,
As God grant you may be loved by another.

I love you with all my heart, without reserve.
I am in love with you now, and have never been.
I am in love with you now, and will never be.
I am in love with you now, and will never be.
I love you with all my heart, without reserve,
With all my heart, without reserve,
I love you with all my heart, without reserve.
I love you with all my heart, without

- Markov chains -> capture local structure, produce fragments and loops quickly.

- Higher n = more coherence, but still shallow and repetitive.

- GPT-2 medium -> continues the poem with new, on-theme lines, but degenerates into repetition within a few lines.

- Key takeaway: longer context plus self-attention is why LLMs leapfrog n-gram models, but sampling strategy still matters: even GPT-2 loops here with top_p=0.9 and temperature=0.8.
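
To show what the `top_p` and `temperature` settings in the GPT-2 call control, here is a simplified pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering. It is an illustration only, not the Transformers implementation; the function names and the toy logits are made up for this example.

```python
import math
import random
from typing import Dict, Optional

def nucleus(logits: Dict[str, float], top_p: float, temperature: float) -> Dict[str, float]:
    """Return the renormalized top-p ("nucleus") distribution."""
    # Temperature-scaled softmax (subtract the max for numerical stability).
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    exp = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(exp.values())
    probs = {tok: e / total for tok, e in exp.items()}

    # Keep the most likely tokens until their cumulative mass reaches top_p.
    kept, mass = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        mass += p
        if mass >= top_p:
            break
    return {tok: p / mass for tok, p in kept.items()}

def sample(logits: Dict[str, float], top_p: float = 0.9, temperature: float = 0.8,
           rng: Optional[random.Random] = None) -> str:
    """Draw one token from the nucleus distribution."""
    rng = rng or random.Random()
    dist = nucleus(logits, top_p, temperature)
    tokens, weights = zip(*dist.items())
    return rng.choices(tokens, weights=weights, k=1)[0]

# Made-up logits for illustration only; not taken from GPT-2.
toy_logits = {"heart": 2.0, "soul": 1.5, "reserve": 0.0, "xylophone": -3.0}
print(nucleus(toy_logits, top_p=0.9, temperature=0.8))
print(sample(toy_logits, rng=random.Random(0)))
```

Lower temperature and smaller `top_p` concentrate probability on the few most likely tokens, which makes text safer but also makes loops like the one in the GPT-2 output more likely.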