AdityaAdaki committed on
Commit
99856c0
·
1 Parent(s): b7729a0
.gitignore ADDED
@@ -0,0 +1 @@
1
+ .env
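Ignoring `.env` suggests the API keys are meant to live in that file rather than in the scripts. A minimal sketch of loading such a file with only the standard library follows; `load_env_file` is a hypothetical helper that is not part of this commit, and the simple `KEY=VALUE` line format is an assumption.

```python
import os


def load_env_file(path=".env"):
    """Hypothetical helper: copy KEY=VALUE pairs from a .env file into os.environ.

    Variables already present in the environment are left untouched.
    Returns the pairs that were read from the file.
    """
    if not os.path.exists(path):
        return {}
    loaded = {}
    with open(path, encoding="utf-8") as fh:
        for raw in fh:
            line = raw.strip()
            # Skip blank lines, comments, and lines without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            key = key.strip()
            value = value.strip().strip('"').strip("'")
            os.environ.setdefault(key, value)
            loaded[key] = value
    return loaded
```

A script could call `load_env_file()` once at startup and then read `os.environ.get("GOOGLE_API_KEY")` or `os.environ.get("GROQ_API_KEY")`, so that no key is ever written into a tracked file.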
agri_synthetic_data_generated_groq.jsonl ADDED
File without changes
agri_synthetic_data_generated_lmstudio.jsonl ADDED
File without changes
agri_synthetic_data_generated_ollama.jsonl ADDED
File without changes
agri_synthetic_data_generated_openrouter.jsonl ADDED
@@ -0,0 +1,2 @@
1
+ {"text": "<bos><start_of_turn>user\nAnalyze the provided description of plant symptoms and provide a detailed technical diagnosis and recommendations.\nRice - Blast: The image shows multiple leaves of a rice plant during the tillering stage. Many leaves exhibit diamond-shaped lesions, approximately 1-2 cm long, with grayish-white centers and distinct dark brown borders. Some lesions are merging, causing larger blighted areas. No insects are visible.<end_of_turn>\n<start_of_turn>model\nThe described symptoms are characteristic of Rice Blast disease, caused by the fungus *Magnaporthe oryzae*. The diamond-shaped lesions with grayish-white centers and dark brown borders are typical of this disease. The presence of these lesions, especially during the tillering stage, can significantly impact the plant's ability to produce healthy tillers and subsequently affect grain yield.\n\nDetailed Technical Diagnosis:\n1. **Disease Identification**: The symptoms described are classic for Rice Blast, a major fungal disease affecting rice crops worldwide.\n2. **Causal Pathogen**: The disease is caused by *Magnaporthe oryzae*, a fungus that thrives in humid environments with moderate temperatures.\n3. **Impact**: The disease can cause significant yield loss by damaging leaves and potentially infecting neck nodes, leading to neck blast, which directly affects grain filling.\n\nRecommendations:\n1. **Cultural Practices**: Implement cultural practices that reduce disease pressure, such as using resistant varieties, adjusting planting dates, and ensuring balanced nitrogen fertilization.\n2. **Chemical Control**: Apply fungicides that are effective against *Magnaporthe oryzae*. Systemic fungicides, such as those from the triazole group, can be effective when applied at the onset of disease symptoms or as a preventive measure during high disease pressure periods.\n3. **Monitoring**: Regularly monitor the field for early detection of disease symptoms to enable timely intervention.\n4. **Sanitation**: Remove infected crop residues to reduce the source of inoculum for subsequent crops.\n\nBy adopting these strategies, the impact of Rice Blast can be minimized, and yield losses can be reduced. Regular monitoring and a combination of cultural and chemical controls are key to managing this disease effectively.<end_of_turn>"}
2
+ {"text": "<bos><start_of_turn>user\nBriefly identify the problem shown in the description and give the main point.\nRice - Blast: The image shows multiple leaves of a rice plant during the tillering stage. Many leaves exhibit diamond-shaped lesions, approximately 1-2 cm long, with grayish-white centers and distinct dark brown borders. Some lesions are merging, causing larger blighted areas. No insects are visible.<end_of_turn>\n<start_of_turn>model\nThe problem shown is Rice Blast disease, characterized by diamond-shaped lesions on the leaves with grayish-white centers and dark brown borders. The main point is that the disease is causing significant damage to the rice leaves during the tillering stage.<end_of_turn>"}
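Each record above stores one complete Gemma-style exchange in a single `text` field. As a sketch of how such records could be checked before fine-tuning, the snippet below parses one JSONL line and pulls out the user and model turns. `parse_record` and `TURN_PATTERN` are hypothetical names introduced for illustration, and the pattern assumes exactly one user turn followed by one model turn, as in the two records shown.

```python
import json
import re

# Assumed layout: <bos>, one user turn, one model turn, nothing else.
TURN_PATTERN = re.compile(
    r"^<bos><start_of_turn>user\n(?P<user>.*?)<end_of_turn>\n"
    r"<start_of_turn>model\n(?P<model>.*?)<end_of_turn>$",
    re.DOTALL,
)


def parse_record(jsonl_line):
    """Return {'user': ..., 'model': ...} for a well-formed record, else None."""
    record = json.loads(jsonl_line)
    match = TURN_PATTERN.match(record.get("text", ""))
    if match is None:
        return None
    return {"user": match.group("user"), "model": match.group("model")}
```

Running this over every line of a generated `.jsonl` file is one way to catch entries where the separator or a turn marker was dropped by the generating model.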
data.py ADDED
@@ -0,0 +1,327 @@
1
+ import google.generativeai as genai
2
+ import os
3
+ import json
4
+ import time
5
+ import re
6
+ import math
7
+ import sys # Currently unused; the file-existence check below uses os.path.exists
8
+
9
+ # --- Configuration ---
10
+ API_KEY = os.environ.get("GOOGLE_API_KEY")  # read the key from the environment instead of hardcoding it
+ if not API_KEY:
+     raise ValueError("Please set the GOOGLE_API_KEY environment variable.")
13
+
14
+ MODEL_NAME = "gemini-1.5-flash-latest" # Or "gemini-1.5-pro-latest"
15
+
16
+ # Your list of detailed input descriptions (Keep the full list here)
17
+ input_descriptions = [
18
+ # === PASTE THE 20 DESCRIPTIONS FROM THE PREVIOUS ANSWER HERE ===
19
+ """Rice - Blast: The image shows multiple leaves of a rice plant during the tillering stage. Many leaves exhibit diamond-shaped lesions, approximately 1-2 cm long, with grayish-white centers and distinct dark brown borders. Some lesions are merging, causing larger blighted areas. No insects are visible.
20
+ Cotton - Bollworm: Close-up view of a developing cotton boll. Several small (<1cm), pinkish-white larvae are seen boring into the boll surface near the bracts. Small entry holes with some dark, granular frass (excrement) are visible. The surrounding leaves appear relatively healthy.
21
+ Maize - Nitrogen Deficiency: Image displays the lower leaves of a young maize plant. The leaves show distinct yellowing starting at the leaf tip and progressing down the midrib in a characteristic V-shape pattern. The leaf margins remain relatively green initially. Upper leaves appear greener.
22
+ Tomato - Early Blight: Several lower leaves of a mature tomato plant are shown. They exhibit circular to irregular dark brown lesions, 0.5-1.5 cm in diameter. Many lesions display characteristic concentric rings, giving a 'target board' appearance. Significant yellowing (chlorosis) surrounds the lesions, and some affected leaves are starting to wither.
23
+ Wheat - Stripe Rust: Picture shows the upper leaves of a wheat plant nearing the flag leaf stage. Numerous small, bright yellow-orange pustules are arranged in distinct, long stripes running parallel to the leaf veins. When touched, a yellowish powder (spores) rubs off.
24
+ Chili - Leaf Curl Virus: The image focuses on the growing tip of a chili plant. The young leaves are severely deformed, showing upward curling, puckering, and twisting. The leaves appear thickened and brittle, and veins are sometimes noticeably thicker. Plant growth appears stunted overall. Tiny whitefly insects may be subtly present but are not the primary focus.
25
+ Groundnut - Tikka Disease (Leaf Spot): Several leaves of a groundnut plant are displayed. They are covered with numerous circular, dark brown to black spots, about 2-8 mm wide. Each spot is surrounded by a distinct yellow halo. Some older spots on lower leaves might lack the clear halo.
26
+ Soybean - Healthy: Image shows a section of a healthy, vigorously growing soybean plant canopy during the vegetative stage. The leaves are uniformly deep green, fully expanded, and show no signs of spotting, yellowing, mosaic patterns, insect feeding, or wilting. Stems appear sturdy.
27
+ Mango - Anthracnose on Fruit: A near-ripe mango fruit is shown. It displays several sunken, irregular-shaped black spots on the peel. Some spots appear to be coalescing into larger dark lesions. In humid conditions, pinkish spore masses might be visible in the center of older spots (though not clearly visible here).
28
+ Sugarcane - Early Shoot Borer Damage: Focus is on the base of young sugarcane tillers (shoots). The central whorl of leaves is dried up, creating a characteristic "dead heart" symptom, which can be easily pulled out. Small bore holes may be visible near the base of the affected shoot.
29
+ Potato - Late Blight: Image shows potato leaves with large, irregular, water-soaked lesions, often starting at the leaf margins or tips. The lesions rapidly turn dark brown to black. A fuzzy white fungal growth (sporangiophores) might be visible on the underside of the leaves near the edge of the lesions, especially in moist conditions.
30
+ Brinjal (Eggplant) - Aphid Infestation: Close-up shows the underside of several young brinjal leaves and stems. They are heavily infested with clusters of small, pear-shaped insects, mostly greenish or blackish in color. Some shiny, sticky honeydew secretions might be visible on the leaf surfaces below the infestation.
31
+ Okra (Lady's Finger) - Yellow Vein Mosaic Virus: The image shows several leaves of an okra plant. The veins of the leaves are prominently yellow, and the yellow network extends into the surrounding green leaf tissue, creating a distinct mosaic pattern. The leaves might be slightly reduced in size.
32
+ Cabbage - Diamondback Moth Larvae: Several outer leaves of a cabbage head are shown. Small (around 1 cm), greenish larvae with a slightly tapered body are visible, actively feeding on the leaf tissue. They have created irregular holes or "window panes" (where only one layer of the leaf epidermis is left). Some silken threads might be present.
33
+ Pigeon Pea (Tur/Arhar) - Pod Borer Damage: Image shows several developing pigeon pea pods. Some pods have distinct circular bore holes on the surface. One pod is broken open slightly, revealing a visible greenish or brownish larva feeding inside on the developing seeds. Frass might be present near the holes.
34
+ Mustard - White Rust: Focus is on mustard leaves and potentially flower stalks. White, blister-like pustules, somewhat raised and initially smooth, are scattered on the underside of the leaves. Corresponding yellow spots may be visible on the upper leaf surface. Affected flower parts can become swollen and distorted.
35
+ Apple - Powdery Mildew: Young leaves and shoot tips of an apple sapling are shown. They are covered with a characteristic white to grayish powdery fungal growth. Affected leaves may appear distorted, curled, or stunted.
36
+ Grapes - Downy Mildew: Image displays the upper surface of grape leaves showing distinct, angular, yellowish-green spots ("oil spots"). On the underside of the leaf, corresponding to these spots, a white, downy fungal growth is visible, especially under humid conditions.
37
+ Papaya - Mealybug Infestation: Close-up on the stem and leaf petioles of a papaya plant, especially near the growing point. Clusters of white, cottony masses are visible, covering small, oval, soft-bodied insects (mealybugs). Ants may also be present, attracted to the honeydew secreted by the mealybugs.
38
+ Banana - Sigatoka Leaf Spot: Several mature banana leaves are shown. They exhibit numerous small, elliptical spots parallel to the leaf veins. Initially yellowish-green, the spots enlarge, become dark brown or black, and often develop a grayish center surrounded by a dark border and a yellow halo. Severe spotting leads to leaf necrosis.
39
+ """,
40
+ "The image shows multiple leaves of a rice plant during the tillering stage. Many leaves exhibit diamond-shaped lesions, approximately 1-2 cm long, with grayish-white centers and distinct dark brown borders. Some lesions are merging, causing larger blighted areas. No insects are visible.",
41
+ "Close-up view of a developing cotton boll. Several small (<1cm), pinkish-white larvae are seen boring into the boll surface near the bracts. Small entry holes with some dark, granular frass (excrement) are visible. The surrounding leaves appear relatively healthy.",
42
+ "Image displays the lower leaves of a young maize plant. The leaves show distinct yellowing starting at the leaf tip and progressing down the midrib in a characteristic V-shape pattern. The leaf margins remain relatively green initially. Upper leaves appear greener.",
43
+ "Several lower leaves of a mature tomato plant are shown. They exhibit circular to irregular dark brown lesions, 0.5-1.5 cm in diameter. Many lesions display characteristic concentric rings, giving a 'target board' appearance. Significant yellowing (chlorosis) surrounds the lesions, and some affected leaves are starting to wither.",
44
+ "Picture shows the upper leaves of a wheat plant nearing the flag leaf stage. Numerous small, bright yellow-orange pustules are arranged in distinct, long stripes running parallel to the leaf veins. When touched, a yellowish powder (spores) rubs off.",
45
+ "The image focuses on the growing tip of a chili plant. The young leaves are severely deformed, showing upward curling, puckering, and twisting. The leaves appear thickened and brittle, and veins are sometimes noticeably thicker. Plant growth appears stunted overall. Tiny whitefly insects may be subtly present but are not the primary focus.",
46
+ "Several leaves of a groundnut plant are displayed. They are covered with numerous circular, dark brown to black spots, about 2-8 mm wide. Each spot is surrounded by a distinct yellow halo. Some older spots on lower leaves might lack the clear halo.",
47
+ "Image shows a section of a healthy, vigorously growing soybean plant canopy during the vegetative stage. The leaves are uniformly deep green, fully expanded, and show no signs of spotting, yellowing, mosaic patterns, insect feeding, or wilting. Stems appear sturdy.",
48
+ "A near-ripe mango fruit is shown. It displays several sunken, irregular-shaped black spots on the peel. Some spots appear to be coalescing into larger dark lesions. In humid conditions, pinkish spore masses might be visible in the center of older spots (though not clearly visible here).",
49
+ "Focus is on the base of young sugarcane tillers (shoots). The central whorl of leaves is dried up, creating a characteristic \"dead heart\" symptom, which can be easily pulled out. Small bore holes may be visible near the base of the affected shoot.",
50
+ "Image shows potato leaves with large, irregular, water-soaked lesions, often starting at the leaf margins or tips. The lesions rapidly turn dark brown to black. A fuzzy white fungal growth (sporangiophores) might be visible on the underside of the leaves near the edge of the lesions, especially in moist conditions.",
51
+ "Close-up shows the underside of several young brinjal leaves and stems. They are heavily infested with clusters of small, pear-shaped insects, mostly greenish or blackish in color. Some shiny, sticky honeydew secretions might be visible on the leaf surfaces below the infestation.",
52
+ "The image shows several leaves of an okra plant. The veins of the leaves are prominently yellow, and the yellow network extends into the surrounding green leaf tissue, creating a distinct mosaic pattern. The leaves might be slightly reduced in size.",
53
+ "Several outer leaves of a cabbage head are shown. Small (around 1 cm), greenish larvae with a slightly tapered body are visible, actively feeding on the leaf tissue. They have created irregular holes or \"window panes\" (where only one layer of the leaf epidermis is left). Some silken threads might be present.",
54
+ "Image shows several developing pigeon pea pods. Some pods have distinct circular bore holes on the surface. One pod is broken open slightly, revealing a visible greenish or brownish larva feeding inside on the developing seeds. Frass might be present near the holes.",
55
+ "Focus is on mustard leaves and potentially flower stalks. White, blister-like pustules, somewhat raised and initially smooth, are scattered on the underside of the leaves. Corresponding yellow spots may be visible on the upper leaf surface. Affected flower parts can become swollen and distorted.",
56
+ "Young leaves and shoot tips of an apple sapling are shown. They are covered with a characteristic white to grayish powdery fungal growth. Affected leaves may appear distorted, curled, or stunted.",
57
+ "Image displays the upper surface of grape leaves showing distinct, angular, yellowish-green spots (\"oil spots\"). On the underside of the leaf, corresponding to these spots, a white, downy fungal growth is visible, especially under humid conditions.",
58
+ "Close-up on the stem and leaf petioles of a papaya plant, especially near the growing point. Clusters of white, cottony masses are visible, covering small, oval, soft-bodied insects (mealybugs). Ants may also be present, attracted to the honeydew secreted by the mealybugs.",
59
+ "Several mature banana leaves are shown. They exhibit numerous small, elliptical spots parallel to the leaf veins. Initially yellowish-green, the spots enlarge, become dark brown or black, and often develop a grayish center surrounded by a dark border and a yellow halo. Severe spotting leads to leaf necrosis."
60
+ # === END OF DESCRIPTION LIST ===
61
+ ]
62
+
63
+ OUTPUT_FILENAME = "agri_synthetic_data_generated.jsonl"
64
+ DESCRIPTIONS_PER_BATCH = 5
65
+ # Consider slightly higher temperature for variation across runs, but check quality
66
+ GENERATION_TEMPERATURE = 0.5
67
+ # Adjust sleep time based on API rate limits (seconds)
68
+ SLEEP_TIME_PER_API_CALL = 20
69
+
70
+ # --- Prompt Templates ---
70
+ # One template per response style: template_detailed, template_short, template_easy
72
+ template_detailed = """
73
+ You are an AI assistant generating synthetic fine-tuning data for an agricultural LLM (Gemma-3). You act as a **senior agricultural scientist**. You will be given {num_descriptions} distinct text descriptions simulating field observations. For **EACH** description, you must generate a corresponding **detailed and technically accurate** diagnosis and management strategy, formatted **exactly** as specified.
74
+
75
+ **INPUT DESCRIPTIONS:**
76
+ {input_descriptions_formatted}
77
+
78
+ **TASK:**
79
+ For **EACH** of the {num_descriptions} Input Descriptions provided above, generate a **single output string** strictly following the Gemma-3 format. The response part within each string should contain your detailed, technical diagnosis and recommendations based *only* on the corresponding Input Description.
80
+
81
+ **OUTPUT FORMAT (Generate one block containing all {num_descriptions} strings consecutively, separated by '---'):**
82
+ {output_format_structure}
83
+
84
+ **Formatting & Constraints:**
85
+ * Generate **one single continuous block of text** containing all {num_descriptions} formatted `OUTPUT_STRING` sections.
86
+ * Use `\\n---\\n` as a separator between each formatted string output.
87
+ * Each of the {num_descriptions} output strings MUST individually adhere strictly to the `<bos>...<end_of_turn>` structure.
88
+ * Ensure the correct description is pasted inside the user turn for each corresponding output string.
89
+ * Base diagnosis/recommendations ONLY on the corresponding Input Description.
90
+ * Responses should be thorough and use appropriate technical terms.
91
+ * Do NOT add any extra text outside this structure.
92
+ --- Produce only the final formatted block based on the Input Descriptions. ---
93
+ """
94
+ template_short = """
95
+ You are an AI assistant generating synthetic fine-tuning data for an agricultural LLM (Gemma-3). You act as a **quick diagnostic assistant**. You will be given {num_descriptions} distinct text descriptions simulating field observations. For **EACH** description, you must generate a corresponding **very brief identification** and key point, formatted **exactly** as specified.
96
+
97
+ **INPUT DESCRIPTIONS:**
98
+ {input_descriptions_formatted}
99
+
100
+ **TASK:**
101
+ For **EACH** of the {num_descriptions} Input Descriptions provided above, generate a **single output string** strictly following the Gemma-3 format. The response part within each string should contain your extremely concise identification and key point/action (1-3 sentences max) based *only* on the corresponding Input Description.
102
+
103
+ **OUTPUT FORMAT (Generate one block containing all {num_descriptions} strings consecutively, separated by '---'):**
104
+ {output_format_structure}
105
+
106
+ **Formatting & Constraints:**
107
+ * Generate **one single continuous block of text** containing all {num_descriptions} formatted `OUTPUT_STRING` sections.
108
+ * Use `\\n---\\n` as a separator between each formatted string output.
109
+ * Each of the {num_descriptions} output strings MUST individually adhere strictly to the `<bos>...<end_of_turn>` structure.
110
+ * Ensure the correct description is pasted inside the user turn for each corresponding output string.
111
+ * Base identification ONLY on the corresponding Input Description.
112
+ * Responses should be extremely concise.
113
+ * Do NOT add any extra text outside this structure.
114
+ --- Produce only the final formatted block based on the Input Descriptions. ---
115
+ """
116
+ template_easy = """
117
+ You are an AI assistant generating synthetic fine-tuning data for an agricultural LLM (Gemma-3). You act as a **friendly local agricultural advisor**. You will be given {num_descriptions} distinct text descriptions simulating field observations. For **EACH** description, you must generate a corresponding explanation and recommendation in **simple, easy-to-understand language**, formatted **exactly** as specified.
118
+
119
+ **INPUT DESCRIPTIONS:**
120
+ {input_descriptions_formatted}
121
+
122
+ **TASK:**
123
+ For **EACH** of the {num_descriptions} Input Descriptions provided above, generate a **single output string** strictly following the Gemma-3 format. The response part within each string should explain the likely problem and suggest first steps in **simple, non-technical language** suitable for an Indian farmer, based *only* on the corresponding Input Description.
124
+
125
+ **OUTPUT FORMAT (Generate one block containing all {num_descriptions} strings consecutively, separated by '---'):**
126
+ {output_format_structure}
127
+
128
+ **Formatting & Constraints:**
129
+ * Generate **one single continuous block of text** containing all {num_descriptions} formatted `OUTPUT_STRING` sections.
130
+ * Use `\\n---\\n` as a separator between each formatted string output.
131
+ * Each of the {num_descriptions} output strings MUST individually adhere strictly to the `<bos>...<end_of_turn>` structure.
132
+ * Ensure the correct description is pasted inside the user turn for each corresponding output string.
133
+ * Base explanation ONLY on the corresponding Input Description.
134
+ * Use simple phrasing, avoid jargon.
135
+ * Do NOT add any extra text outside this structure.
136
+ --- Produce only the final formatted block based on the Input Descriptions. ---
137
+ """
138
+ templates = { "detailed": template_detailed, "short": template_short, "easy": template_easy }
139
+ user_prompts_for_template = {
140
+ "detailed": "Analyze the provided description of plant symptoms and provide a detailed technical diagnosis and recommendations.",
141
+ "short": "Briefly identify the problem shown in the description and give the main point.",
142
+ "easy": "Explain what might be wrong based on this description, using simple words, and what I should do first."
143
+ }
144
+
145
+ # --- Prompt Formatting ---
146
+ # No separate format_prompt_for_batch helper is defined in this file; the prompt is built inline in the main loop below.
147
+
148
+ # --- Load Existing Data ---
149
+ existing_texts = set()
+ if os.path.exists(OUTPUT_FILENAME):
+     print(f"Loading existing entries from {OUTPUT_FILENAME} to avoid duplicates...")
+     try:
+         with open(OUTPUT_FILENAME, 'r', encoding='utf-8') as infile:
+             for line in infile:
+                 try:
+                     data = json.loads(line)
+                     if 'text' in data:
+                         existing_texts.add(data['text'])
+                 except (json.JSONDecodeError, KeyError) as e:
+                     print(f" Skipping invalid line: {line.strip()} - Error: {e}")
+         print(f"Loaded {len(existing_texts)} unique existing entries.")
+     except Exception as e:
+         print(f"Error loading existing file: {e}. Starting fresh.")
+ else:
+     print(f"Output file {OUTPUT_FILENAME} not found. Starting fresh.")
166
+
167
+ # --- Main Generation Logic ---
168
+ genai.configure(api_key=API_KEY)
169
+ model = genai.GenerativeModel(MODEL_NAME)
170
+
171
+ generation_config = genai.GenerationConfig(
172
+ temperature=GENERATION_TEMPERATURE,
173
+ # Add other parameters if needed
174
+ )
175
+ safety_settings = [ # Adjust as needed
176
+ {"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_MEDIUM_AND_ABOVE"},
177
+ {"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_MEDIUM_AND_ABOVE"},
178
+ {"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_MEDIUM_AND_ABOVE"},
179
+ {"category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_MEDIUM_AND_ABOVE"},
180
+ ]
181
+
182
+ total_entries_generated_this_run = 0
183
+ num_duplicates_skipped_this_run = 0
184
+ num_batches = math.ceil(len(input_descriptions) / DESCRIPTIONS_PER_BATCH)
185
+
186
+ print(f"\nStarting data generation for {len(input_descriptions)} descriptions...")
187
+ print(f"Will process in {num_batches} batches of up to {DESCRIPTIONS_PER_BATCH}.")
188
+ print(f"Output will be appended to: {OUTPUT_FILENAME}")
189
+ print(f"Using model: {MODEL_NAME}, Temperature: {GENERATION_TEMPERATURE}")
190
+ print(f"Pausing {SLEEP_TIME_PER_API_CALL} seconds between API calls.")
191
+
192
+ try:
+     # Open file in append mode
+     with open(OUTPUT_FILENAME, 'a', encoding='utf-8') as f:
+         for i in range(num_batches):
+             start_index = i * DESCRIPTIONS_PER_BATCH
+             end_index = start_index + DESCRIPTIONS_PER_BATCH
+             batch_descriptions = input_descriptions[start_index:end_index]
+             actual_batch_size = len(batch_descriptions)
+
+             print(f"\n--- Processing Batch {i+1}/{num_batches} ({actual_batch_size} descriptions) ---")
+
+             if not batch_descriptions:
+                 print("Skipping empty batch.")
+                 continue
+
+             for template_key in templates.keys():
+                 print(f" Generating '{template_key}' style responses...")
+
+                 # --- Format the prompt ---
+                 formatted_prompt = ""
+                 input_desc_block = ""
+                 output_struct_block = ""
+                 user_prompt = user_prompts_for_template[template_key]
+                 for idx, desc in enumerate(batch_descriptions, 1):
+                     input_desc_block += f"INPUT_DESCRIPTION_{idx}:\n{desc}\n\n"
+                     # str.format() inserts argument values verbatim, so braces in the
+                     # description must not be doubled; only backticks are escaped.
+                     pasted_desc_safe = desc.replace('`', r'\`')
+                     output_struct_block += f"OUTPUT_STRING_{idx}:\n"
+                     output_struct_block += f"<bos><start_of_turn>user\n{user_prompt}\n{pasted_desc_safe}<end_of_turn>\n<start_of_turn>model\n[MODEL'S RESPONSE FOR DESCRIPTION {idx} HERE]<end_of_turn>\n"
+                     if idx < actual_batch_size:
+                         output_struct_block += "---\n"
+                 formatted_prompt = templates[template_key].format(
+                     num_descriptions=actual_batch_size,
+                     input_descriptions_formatted=input_desc_block.strip(),
+                     output_format_structure=output_struct_block.strip()
+                 )
227
+
228
+                 # --- Call the API (with retries) ---
+                 api_call_successful = False
+                 retry_count = 0
+                 max_retries = 2
+                 response_text = None # Initialize response_text
+                 while not api_call_successful and retry_count <= max_retries:
+                     try:
+                         print(f" Sending request to Gemini (Attempt {retry_count + 1})...")
+                         response = model.generate_content(
+                             formatted_prompt,
+                             generation_config=generation_config,
+                             safety_settings=safety_settings
+                         )
+                         # Check for empty response or potential errors in the response object itself
+                         if response.parts:
+                             response_text = response.text
+                         else:
+                             # Handle cases where the response might be blocked or empty
+                             print(" WARNING: Received response with no processable parts. Checking prompt feedback.")
+                             if hasattr(response, 'prompt_feedback') and response.prompt_feedback:
+                                 print(f" Prompt Feedback: {response.prompt_feedback}")
+                             response_text = None # Treat as failure for processing
+                             # Decide if this is a retryable error or skip
+                             if response.prompt_feedback and response.prompt_feedback.block_reason:
+                                 print(" Response blocked, skipping.")
+                                 break # Exit retry loop for blocked content
+                             else:
+                                 print(" Empty response, retrying...")
+
+                         # If response_text is valid, mark successful
+                         if response_text:
+                             api_call_successful = True
+                         elif not api_call_successful and retry_count < max_retries: # Only retry if not blocked and retries left
+                             raise ValueError("Empty or non-text response part received") # Trigger retry
+                         elif not api_call_successful: # Max retries reached or blocked
+                             break # Exit retry loop
+
+                     except Exception as e:
+                         retry_count += 1
+                         print(f" ERROR during API call (Attempt {retry_count}): {e}")
+                         if retry_count > max_retries:
+                             print(" Max retries reached. Skipping this batch/template.")
+                             response_text = None
+                             break
+                         print(f" Retrying in {SLEEP_TIME_PER_API_CALL} seconds...")
+                         time.sleep(SLEEP_TIME_PER_API_CALL)
276
+
277
+                 # --- Process and Save Response (with duplicate check) ---
+                 if api_call_successful and response_text:
+                     split_parts = re.split(r'\n---\n', response_text.strip())
+                     generated_count_in_batch = 0
+                     skipped_count_in_batch = 0
+
+                     if len(split_parts) == actual_batch_size:
+                         for entry_text in split_parts:
+                             cleaned_text = entry_text.strip()
+                             if cleaned_text.startswith("OUTPUT_STRING_"):
+                                 cleaned_text = cleaned_text.split('\n', 1)[-1].strip()
+
+                             if cleaned_text.startswith("<bos>") and cleaned_text.endswith("<end_of_turn>"):
+                                 # *** Check for duplicates ***
+                                 if cleaned_text not in existing_texts:
+                                     data_entry = {"text": cleaned_text}
+                                     json.dump(data_entry, f, ensure_ascii=False)
+                                     f.write('\n')
+                                     existing_texts.add(cleaned_text) # Add to set *after* writing
+                                     total_entries_generated_this_run += 1
+                                     generated_count_in_batch += 1
+                                 else:
+                                     num_duplicates_skipped_this_run += 1
+                                     skipped_count_in_batch += 1
+                                     # print(f" Skipped duplicate entry.") # Optional: uncomment for verbose logging
+                             else:
+                                 print(f" WARNING: Invalid format detected in generated entry part:\n{cleaned_text[:100]}...")
+                         print(f" Processed batch for '{template_key}': Saved {generated_count_in_batch} new entries, Skipped {skipped_count_in_batch} duplicates.")
+                     else:
+                         print(f" WARNING: Expected {actual_batch_size} parts but got {len(split_parts)} for '{template_key}'. Skipping save for this malformed batch response.")
+                         print(f" Raw response snippet:\n{response_text[:500]}...")
+
+                 elif api_call_successful and not response_text:
+                     print(" WARNING: API call successful but received empty/blocked response, no entries saved.")
+                 # else: (Handled by retry loop failure message)
+                 # pass
+
+                 # --- Rate Limiting Pause ---
+                 print(f" Pausing for {SLEEP_TIME_PER_API_CALL} seconds...")
+                 time.sleep(SLEEP_TIME_PER_API_CALL)
317
+
318
+ except Exception as e:
+     print(f"\nAn unexpected error occurred outside the main loop: {e}")
+     import traceback
+     traceback.print_exc() # Print full traceback for debugging
+
+ finally:
+     print(f"\n--- Run Finished ---")
+     print(f"Entries generated in THIS run: {total_entries_generated_this_run}")
+     print(f"Duplicates skipped in THIS run: {num_duplicates_skipped_this_run}")
+     print(f"Total unique entries now in {OUTPUT_FILENAME}: {len(existing_texts)}") # Reflects total after run completion
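The save step in `data.py` splits the model output on the exact string `\n---\n`, so a separator with stray spaces makes the part count mismatch and the whole batch is discarded. The sketch below shows one possible, more tolerant variant of that split. `split_batch_response`, `SEPARATOR`, and `LABEL` are hypothetical names that do not appear in the commit, and allowing spaces or tabs around `---` is an assumption about how the model may deviate.

```python
import re

# Accept optional spaces or tabs around the '---' separator line.
SEPARATOR = re.compile(r"\n[ \t]*---[ \t]*\n")
# Optional 'OUTPUT_STRING_<n>:' label echoed back by the model.
LABEL = re.compile(r"^OUTPUT_STRING_\d+:\s*")


def split_batch_response(response_text, expected_count):
    """Split a batched response into entries in the <bos>...<end_of_turn> format.

    Returns None when the number of parts differs from expected_count,
    mirroring the script's choice to skip malformed batches. Parts that
    fail the format check are dropped from the returned list.
    """
    parts = SEPARATOR.split(response_text.strip())
    if len(parts) != expected_count:
        return None
    entries = []
    for part in parts:
        cleaned = LABEL.sub("", part.strip(), count=1).strip()
        if cleaned.startswith("<bos>") and cleaned.endswith("<end_of_turn>"):
            entries.append(cleaned)
    return entries
```

Keeping the count check strict while loosening only the whitespace keeps the original safety property: a response with a missing or extra entry is still rejected as a whole.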
dataset.json ADDED
File without changes
generate_synthetic_data_groq.py ADDED
@@ -0,0 +1,253 @@
1
+ from groq import Groq # Import Groq library
2
+ import os
3
+ import json
4
+ import time
5
+ import re
6
+ import math
7
+ import sys
8
+
9
+ # --- Configuration ---
10
+ # !!! GET YOUR API KEY FROM console.groq.com !!!
11
+ API_KEY = os.environ.get("GROQ_API_KEY") # Read the key from the environment; never hardcode secrets
12
+ if not API_KEY:
13
+ raise ValueError("Please set the GROQ_API_KEY environment variable.")
14
+
15
+ # Choose a Groq model (Check console.groq.com for available models)
16
+ MODEL_NAME = "meta-llama/llama-4-scout-17b-16e-instruct" # Llama 4 Scout instruct model
17
+ # Other options: "mixtral-8x7b-32768", "gemma-7b-it"
18
+
19
+ # Your list of detailed input descriptions (Paste the 20 descriptions here)
20
+ input_descriptions = [
21
+ # === PASTE THE 20 DESCRIPTIONS FROM THE PREVIOUS ANSWER HERE ===
22
+ """Rice - Blast: The image shows multiple leaves of a rice plant during the tillering stage. Many leaves exhibit diamond-shaped lesions, approximately 1-2 cm long, with grayish-white centers and distinct dark brown borders. Some lesions are merging, causing larger blighted areas. No insects are visible.
23
+ Cotton - Bollworm: Close-up view of a developing cotton boll. Several small (<1cm), pinkish-white larvae are seen boring into the boll surface near the bracts. Small entry holes with some dark, granular frass (excrement) are visible. The surrounding leaves appear relatively healthy.
24
+ Maize - Nitrogen Deficiency: Image displays the lower leaves of a young maize plant. The leaves show distinct yellowing starting at the leaf tip and progressing down the midrib in a characteristic V-shape pattern. The leaf margins remain relatively green initially. Upper leaves appear greener.
25
+ Tomato - Early Blight: Several lower leaves of a mature tomato plant are shown. They exhibit circular to irregular dark brown lesions, 0.5-1.5 cm in diameter. Many lesions display characteristic concentric rings, giving a 'target board' appearance. Significant yellowing (chlorosis) surrounds the lesions, and some affected leaves are starting to wither.
26
+ Wheat - Stripe Rust: Picture shows the upper leaves of a wheat plant nearing the flag leaf stage. Numerous small, bright yellow-orange pustules are arranged in distinct, long stripes running parallel to the leaf veins. When touched, a yellowish powder (spores) rubs off.
27
+ Chili - Leaf Curl Virus: The image focuses on the growing tip of a chili plant. The young leaves are severely deformed, showing upward curling, puckering, and twisting. The leaves appear thickened and brittle, and veins are sometimes noticeably thicker. Plant growth appears stunted overall. Tiny whitefly insects may be subtly present but are not the primary focus.
28
+ Groundnut - Tikka Disease (Leaf Spot): Several leaves of a groundnut plant are displayed. They are covered with numerous circular, dark brown to black spots, about 2-8 mm wide. Each spot is surrounded by a distinct yellow halo. Some older spots on lower leaves might lack the clear halo.
29
+ Soybean - Healthy: Image shows a section of a healthy, vigorously growing soybean plant canopy during the vegetative stage. The leaves are uniformly deep green, fully expanded, and show no signs of spotting, yellowing, mosaic patterns, insect feeding, or wilting. Stems appear sturdy.
30
+ Mango - Anthracnose on Fruit: A near-ripe mango fruit is shown. It displays several sunken, irregular-shaped black spots on the peel. Some spots appear to be coalescing into larger dark lesions. In humid conditions, pinkish spore masses might be visible in the center of older spots (though not clearly visible here).
31
+ Sugarcane - Early Shoot Borer Damage: Focus is on the base of young sugarcane tillers (shoots). The central whorl of leaves is dried up, creating a characteristic "dead heart" symptom, which can be easily pulled out. Small bore holes may be visible near the base of the affected shoot.
32
+ Potato - Late Blight: Image shows potato leaves with large, irregular, water-soaked lesions, often starting at the leaf margins or tips. The lesions rapidly turn dark brown to black. A fuzzy white fungal growth (sporangiophores) might be visible on the underside of the leaves near the edge of the lesions, especially in moist conditions.
33
+ Brinjal (Eggplant) - Aphid Infestation: Close-up shows the underside of several young brinjal leaves and stems. They are heavily infested with clusters of small, pear-shaped insects, mostly greenish or blackish in color. Some shiny, sticky honeydew secretions might be visible on the leaf surfaces below the infestation.
34
+ Okra (Lady's Finger) - Yellow Vein Mosaic Virus: The image shows several leaves of an okra plant. The veins of the leaves are prominently yellow, and the yellow network extends into the surrounding green leaf tissue, creating a distinct mosaic pattern. The leaves might be slightly reduced in size.
35
+ Cabbage - Diamondback Moth Larvae: Several outer leaves of a cabbage head are shown. Small (around 1 cm), greenish larvae with a slightly tapered body are visible, actively feeding on the leaf tissue. They have created irregular holes or "window panes" (where only one layer of the leaf epidermis is left). Some silken threads might be present.
36
+ Pigeon Pea (Tur/Arhar) - Pod Borer Damage: Image shows several developing pigeon pea pods. Some pods have distinct circular bore holes on the surface. One pod is broken open slightly, revealing a visible greenish or brownish larva feeding inside on the developing seeds. Frass might be present near the holes.
37
+ Mustard - White Rust: Focus is on mustard leaves and potentially flower stalks. White, blister-like pustules, somewhat raised and initially smooth, are scattered on the underside of the leaves. Corresponding yellow spots may be visible on the upper leaf surface. Affected flower parts can become swollen and distorted.
38
+ Apple - Powdery Mildew: Young leaves and shoot tips of an apple sapling are shown. They are covered with a characteristic white to grayish powdery fungal growth. Affected leaves may appear distorted, curled, or stunted.
39
+ Grapes - Downy Mildew: Image displays the upper surface of grape leaves showing distinct, angular, yellowish-green spots ("oil spots"). On the underside of the leaf, corresponding to these spots, a white, downy fungal growth is visible, especially under humid conditions.
40
+ Papaya - Mealybug Infestation: Close-up on the stem and leaf petioles of a papaya plant, especially near the growing point. Clusters of white, cottony masses are visible, covering small, oval, soft-bodied insects (mealybugs). Ants may also be present, attracted to the honeydew secreted by the mealybugs.
41
+ Banana - Sigatoka Leaf Spot: Several mature banana leaves are shown. They exhibit numerous small, elliptical spots parallel to the leaf veins. Initially yellowish-green, the spots enlarge, become dark brown or black, and often develop a grayish center surrounded by a dark border and a yellow halo. Severe spotting leads to leaf necrosis.
42
+ """,
43
+ "The image shows multiple leaves of a rice plant during the tillering stage. Many leaves exhibit diamond-shaped lesions, approximately 1-2 cm long, with grayish-white centers and distinct dark brown borders. Some lesions are merging, causing larger blighted areas. No insects are visible.",
44
+ "Close-up view of a developing cotton boll. Several small (<1cm), pinkish-white larvae are seen boring into the boll surface near the bracts. Small entry holes with some dark, granular frass (excrement) are visible. The surrounding leaves appear relatively healthy.",
45
+ "Image displays the lower leaves of a young maize plant. The leaves show distinct yellowing starting at the leaf tip and progressing down the midrib in a characteristic V-shape pattern. The leaf margins remain relatively green initially. Upper leaves appear greener.",
46
+ "Several lower leaves of a mature tomato plant are shown. They exhibit circular to irregular dark brown lesions, 0.5-1.5 cm in diameter. Many lesions display characteristic concentric rings, giving a 'target board' appearance. Significant yellowing (chlorosis) surrounds the lesions, and some affected leaves are starting to wither.",
47
+ "Picture shows the upper leaves of a wheat plant nearing the flag leaf stage. Numerous small, bright yellow-orange pustules are arranged in distinct, long stripes running parallel to the leaf veins. When touched, a yellowish powder (spores) rubs off.",
48
+ "The image focuses on the growing tip of a chili plant. The young leaves are severely deformed, showing upward curling, puckering, and twisting. The leaves appear thickened and brittle, and veins are sometimes noticeably thicker. Plant growth appears stunted overall. Tiny whitefly insects may be subtly present but are not the primary focus.",
49
+ "Several leaves of a groundnut plant are displayed. They are covered with numerous circular, dark brown to black spots, about 2-8 mm wide. Each spot is surrounded by a distinct yellow halo. Some older spots on lower leaves might lack the clear halo.",
50
+ "Image shows a section of a healthy, vigorously growing soybean plant canopy during the vegetative stage. The leaves are uniformly deep green, fully expanded, and show no signs of spotting, yellowing, mosaic patterns, insect feeding, or wilting. Stems appear sturdy.",
51
+ "A near-ripe mango fruit is shown. It displays several sunken, irregular-shaped black spots on the peel. Some spots appear to be coalescing into larger dark lesions. In humid conditions, pinkish spore masses might be visible in the center of older spots (though not clearly visible here).",
52
+ "Focus is on the base of young sugarcane tillers (shoots). The central whorl of leaves is dried up, creating a characteristic \"dead heart\" symptom, which can be easily pulled out. Small bore holes may be visible near the base of the affected shoot.",
53
+ "Image shows potato leaves with large, irregular, water-soaked lesions, often starting at the leaf margins or tips. The lesions rapidly turn dark brown to black. A fuzzy white fungal growth (sporangiophores) might be visible on the underside of the leaves near the edge of the lesions, especially in moist conditions.",
54
+ "Close-up shows the underside of several young brinjal leaves and stems. They are heavily infested with clusters of small, pear-shaped insects, mostly greenish or blackish in color. Some shiny, sticky honeydew secretions might be visible on the leaf surfaces below the infestation.",
55
+ "The image shows several leaves of an okra plant. The veins of the leaves are prominently yellow, and the yellow network extends into the surrounding green leaf tissue, creating a distinct mosaic pattern. The leaves might be slightly reduced in size.",
56
+ "Several outer leaves of a cabbage head are shown. Small (around 1 cm), greenish larvae with a slightly tapered body are visible, actively feeding on the leaf tissue. They have created irregular holes or \"window panes\" (where only one layer of the leaf epidermis is left). Some silken threads might be present.",
57
+ "Image shows several developing pigeon pea pods. Some pods have distinct circular bore holes on the surface. One pod is broken open slightly, revealing a visible greenish or brownish larva feeding inside on the developing seeds. Frass might be present near the holes.",
58
+ "Focus is on mustard leaves and potentially flower stalks. White, blister-like pustules, somewhat raised and initially smooth, are scattered on the underside of the leaves. Corresponding yellow spots may be visible on the upper leaf surface. Affected flower parts can become swollen and distorted.",
59
+ "Young leaves and shoot tips of an apple sapling are shown. They are covered with a characteristic white to grayish powdery fungal growth. Affected leaves may appear distorted, curled, or stunted.",
60
+ "Image displays the upper surface of grape leaves showing distinct, angular, yellowish-green spots (\"oil spots\"). On the underside of the leaf, corresponding to these spots, a white, downy fungal growth is visible, especially under humid conditions.",
61
+ "Close-up on the stem and leaf petioles of a papaya plant, especially near the growing point. Clusters of white, cottony masses are visible, covering small, oval, soft-bodied insects (mealybugs). Ants may also be present, attracted to the honeydew secreted by the mealybugs.",
62
+ "Several mature banana leaves are shown. They exhibit numerous small, elliptical spots parallel to the leaf veins. Initially yellowish-green, the spots enlarge, become dark brown or black, and often develop a grayish center surrounded by a dark border and a yellow halo. Severe spotting leads to leaf necrosis."
63
+ # === END OF DESCRIPTION LIST ===
64
+ ]
65
+
66
+ OUTPUT_FILENAME = "agri_synthetic_data_generated_groq.jsonl" # Changed filename
67
+ DESCRIPTIONS_PER_BATCH = 5
68
+ # Consider slightly higher temperature for variation across runs, but check quality
69
+ GENERATION_TEMPERATURE = 0.7 # Groq defaults often work well
70
+ # Adjust sleep time to stay within your account's Groq rate limits (see console.groq.com)
71
+ SLEEP_TIME_PER_API_CALL = 5 # Groq is fast, but API limits still apply. Start low and increase if needed.
72
+
73
+ # --- Prompt Templates (Remain the same conceptually) ---
74
+ # template_detailed, template_short, template_easy...
75
+ # (These template strings remain identical to the previous script)
76
+ template_detailed = """... (same as before) ..."""
77
+ template_short = """... (same as before) ..."""
78
+ template_easy = """... (same as before) ..."""
79
+ templates = { "detailed": template_detailed, "short": template_short, "easy": template_easy }
80
+ user_prompts_for_template = {
81
+ "detailed": "Analyze the provided description of plant symptoms and provide a detailed technical diagnosis and recommendations.",
82
+ "short": "Briefly identify the problem shown in the description and give the main point.",
83
+ "easy": "Explain what might be wrong based on this description, using simple words, and what I should do first."
84
+ }
85
+
86
+
87
+ # --- Load Existing Data (Remains the same) ---
88
+ existing_texts = set()
89
+ if os.path.exists(OUTPUT_FILENAME):
90
+ print(f"Loading existing entries from {OUTPUT_FILENAME} to avoid duplicates...")
91
+ try:
92
+ with open(OUTPUT_FILENAME, 'r', encoding='utf-8') as infile:
93
+ for line in infile:
94
+ try:
95
+ data = json.loads(line)
96
+ if 'text' in data:
97
+ existing_texts.add(data['text'])
98
+ except (json.JSONDecodeError, KeyError) as e:
99
+ print(f" Skipping invalid line: {line.strip()} - Error: {e}")
100
+ print(f"Loaded {len(existing_texts)} unique existing entries.")
101
+ except Exception as e:
102
+ print(f"Error loading existing file: {e}. Starting fresh.")
103
+ else:
104
+ print(f"Output file {OUTPUT_FILENAME} not found. Starting fresh.")
105
+
106
+ # --- Main Generation Logic ---
107
+ client = Groq(api_key=API_KEY) # Instantiate Groq client
108
+
109
+ total_entries_generated_this_run = 0
110
+ num_duplicates_skipped_this_run = 0
111
+ num_batches = math.ceil(len(input_descriptions) / DESCRIPTIONS_PER_BATCH)
112
+
113
+ print(f"\nStarting data generation for {len(input_descriptions)} descriptions...")
114
+ print(f"Will process in {num_batches} batches of up to {DESCRIPTIONS_PER_BATCH}.")
115
+ print(f"Output will be appended to: {OUTPUT_FILENAME}")
116
+ print(f"Using Groq model: {MODEL_NAME}, Temperature: {GENERATION_TEMPERATURE}")
117
+ print(f"Pausing {SLEEP_TIME_PER_API_CALL} seconds between API calls.")
118
+
119
+ try:
120
+ with open(OUTPUT_FILENAME, 'a', encoding='utf-8') as f:
121
+ for i in range(num_batches):
122
+ start_index = i * DESCRIPTIONS_PER_BATCH
123
+ end_index = start_index + DESCRIPTIONS_PER_BATCH
124
+ batch_descriptions = input_descriptions[start_index:end_index]
125
+ actual_batch_size = len(batch_descriptions)
126
+
127
+ print(f"\n--- Processing Batch {i+1}/{num_batches} ({actual_batch_size} descriptions) ---")
128
+
129
+ if not batch_descriptions:
130
+ print("Skipping empty batch.")
131
+ continue
132
+
133
+ for template_key in templates.keys():
134
+ print(f" Generating '{template_key}' style responses...")
135
+
136
+ # --- Format the prompt ---
137
+ # (Formatting logic remains the same as previous script)
138
+ formatted_prompt = ""
139
+ input_desc_block = ""
140
+ output_struct_block = ""
141
+ user_prompt = user_prompts_for_template[template_key]
142
+ for idx, desc in enumerate(batch_descriptions, 1):
143
+ input_desc_block += f"INPUT_DESCRIPTION_{idx}:\n{desc}\n\n"
144
+ pasted_desc_safe = desc.replace('`', r'\`') # No brace escaping: str.format() does not re-parse substituted values
145
+ output_struct_block += f"OUTPUT_STRING_{idx}:\n"
146
+ output_struct_block += f"<bos><start_of_turn>user\n{user_prompt}\n{pasted_desc_safe}<end_of_turn>\n<start_of_turn>model\n[MODEL'S RESPONSE FOR DESCRIPTION {idx} HERE]<end_of_turn>\n"
147
+ if idx < actual_batch_size:
148
+ output_struct_block += "---\n"
149
+ formatted_prompt = templates[template_key].format(
150
+ num_descriptions=actual_batch_size,
151
+ input_descriptions_formatted=input_desc_block.strip(),
152
+ output_format_structure=output_struct_block.strip()
153
+ )
154
+
155
+ # --- Structure messages for Groq API ---
156
+ messages = [
157
+ # Optional system prompt (can help set context)
158
+ # {
159
+ # "role": "system",
160
+ # "content": "You are an AI assistant generating synthetic fine-tuning data..."
161
+ # },
162
+ {
163
+ "role": "user",
164
+ "content": formatted_prompt # The entire detailed instruction set goes here
165
+ }
166
+ ]
167
+
168
+ # --- Call the Groq API (with retries) ---
169
+ api_call_successful = False
170
+ retry_count = 0
171
+ max_retries = 2
172
+ response_text = None
173
+ while not api_call_successful and retry_count <= max_retries:
174
+ try:
175
+ print(f" Sending request to Groq (Attempt {retry_count + 1})...")
176
+ chat_completion = client.chat.completions.create(
177
+ messages=messages,
178
+ model=MODEL_NAME,
179
+ temperature=GENERATION_TEMPERATURE,
180
+ # max_tokens=4096, # Optional: Specify max output tokens if needed
181
+ # top_p=..., # Optional
182
+ )
183
+ # Extract response content
184
+ if chat_completion.choices and chat_completion.choices[0].message:
185
+ response_text = chat_completion.choices[0].message.content
186
+ api_call_successful = True
187
+ else:
188
+ # Handle unexpected empty or malformed response from Groq
189
+ print(" WARNING: Received unexpected response structure from Groq.")
190
+ print(f" Raw completion object: {chat_completion}")
191
+ response_text = None
192
+ raise ValueError("Invalid Groq response structure") # Trigger retry
193
+
194
+ except Exception as e:
195
+ retry_count += 1
196
+ print(f" ERROR during Groq API call (Attempt {retry_count}): {e}")
197
+ if retry_count > max_retries:
198
+ print(" Max retries reached. Skipping this batch/template.")
199
+ response_text = None
200
+ break
201
+ print(f" Retrying in {SLEEP_TIME_PER_API_CALL} seconds...")
202
+ time.sleep(SLEEP_TIME_PER_API_CALL)
203
+
204
+
205
+ # --- Process and Save Response (with duplicate check) ---
206
+ # (This logic remains identical to the previous script)
207
+ if api_call_successful and response_text:
208
+ split_parts = re.split(r'\n---\n', response_text.strip())
209
+ generated_count_in_batch = 0
210
+ skipped_count_in_batch = 0
211
+
212
+ if len(split_parts) == actual_batch_size:
213
+ for entry_text in split_parts:
214
+ cleaned_text = entry_text.strip()
215
+ if cleaned_text.startswith("OUTPUT_STRING_"):
216
+ cleaned_text = cleaned_text.split('\n', 1)[-1].strip()
217
+
218
+ if cleaned_text.startswith("<bos>") and cleaned_text.endswith("<end_of_turn>"):
219
+ if cleaned_text not in existing_texts:
220
+ data_entry = {"text": cleaned_text}
221
+ json.dump(data_entry, f, ensure_ascii=False)
222
+ f.write('\n')
223
+ existing_texts.add(cleaned_text)
224
+ total_entries_generated_this_run += 1
225
+ generated_count_in_batch += 1
226
+ else:
227
+ num_duplicates_skipped_this_run += 1
228
+ skipped_count_in_batch += 1
229
+ else:
230
+ print(f" WARNING: Invalid format detected in generated entry part:\n{cleaned_text[:100]}...")
231
+ print(f" Processed batch for '{template_key}': Saved {generated_count_in_batch} new entries, Skipped {skipped_count_in_batch} duplicates.")
232
+ else:
233
+ print(f" WARNING: Expected {actual_batch_size} parts but got {len(split_parts)} for '{template_key}'. Skipping save for this malformed batch response.")
234
+ print(f" Raw response snippet:\n{response_text[:500]}...")
235
+
236
+ elif api_call_successful and not response_text:
237
+ print(" WARNING: API call successful but received empty response, no entries saved.")
238
+ # else: No need for else, handled by retry logic
239
+
240
+ # --- Rate Limiting Pause ---
241
+ print(f" Pausing for {SLEEP_TIME_PER_API_CALL} seconds...")
242
+ time.sleep(SLEEP_TIME_PER_API_CALL)
243
+
244
+ except Exception as e:
245
+ print(f"\nAn unexpected error occurred outside the main loop: {e}")
246
+ import traceback
247
+ traceback.print_exc()
248
+
249
+ finally:
250
+ print(f"\n--- Run Finished ---")
251
+ print(f"Entries generated in THIS run: {total_entries_generated_this_run}")
252
+ print(f"Duplicates skipped in THIS run: {num_duplicates_skipped_this_run}")
253
+ print(f"Total unique entries now in {OUTPUT_FILENAME}: {len(existing_texts)}")
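The split-and-validate step in the script above can be sketched as a standalone function, which makes it possible to check the parsing without calling the API. The name `parse_batch_response` and its return shape are assumptions for illustration; the splitting and tag checks mirror the logic in the script.

```python
import re


def parse_batch_response(response_text, expected_count):
    """Split a batched model response on '---' separator lines.

    Hypothetical helper. Returns None when the number of parts does not
    match expected_count (the script skips such batches). Otherwise returns
    (entries, invalid_count), where entries are the well-formed strings.
    """
    parts = re.split(r'\n---\n', response_text.strip())
    if len(parts) != expected_count:
        return None

    entries = []
    invalid_count = 0
    for part in parts:
        cleaned = part.strip()
        # Drop a leading "OUTPUT_STRING_n:" label line if the model echoed it.
        if cleaned.startswith("OUTPUT_STRING_"):
            cleaned = cleaned.split('\n', 1)[-1].strip()
        # Keep only entries wrapped in the expected chat-template tags.
        if cleaned.startswith("<bos>") and cleaned.endswith("<end_of_turn>"):
            entries.append(cleaned)
        else:
            invalid_count += 1
    return entries, invalid_count
```

Returning `None` for a count mismatch keeps the "skip the whole malformed batch" decision with the caller, as in the original loop.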
generate_synthetic_data_lmstudio.py ADDED
@@ -0,0 +1,275 @@
1
+ import requests
2
+ import os
3
+ import json
4
+ import time
5
+ import re
6
+ import sys
7
+ import math
8
+ from dotenv import load_dotenv
9
+
10
+ load_dotenv()
11
+
12
+ # --- Configuration ---
13
+ # Set LM Studio API URL - default is local server
14
+ LMSTUDIO_API_URL = os.environ.get("LMSTUDIO_API_URL", "http://localhost:1234/v1/chat/completions")
15
+
16
+ # Choose a model loaded in your LM Studio instance
17
+ MODEL_NAME = "default" # LM Studio uses loaded model by default
18
+
19
+ # Your list of detailed input descriptions
20
+ input_descriptions = [
21
+ "Rice - Blast: The image shows multiple leaves of a rice plant during the tillering stage. Many leaves exhibit diamond-shaped lesions, approximately 1-2 cm long, with grayish-white centers and distinct dark brown borders. Some lesions are merging, causing larger blighted areas. No insects are visible.",
22
+ "Cotton - Bollworm: Close-up view of a developing cotton boll. Several small (<1cm), pinkish-white larvae are seen boring into the boll surface near the bracts. Small entry holes with some dark, granular frass (excrement) are visible. The surrounding leaves appear relatively healthy.",
23
+ "Maize - Nitrogen Deficiency: Image displays the lower leaves of a young maize plant. The leaves show distinct yellowing starting at the leaf tip and progressing down the midrib in a characteristic V-shape pattern. The leaf margins remain relatively green initially. Upper leaves appear greener.",
24
+ "Tomato - Early Blight: Several lower leaves of a mature tomato plant are shown. They exhibit circular to irregular dark brown lesions, 0.5-1.5 cm in diameter. Many lesions display characteristic concentric rings, giving a 'target board' appearance. Significant yellowing (chlorosis) surrounds the lesions, and some affected leaves are starting to wither.",
25
+ "Wheat - Stripe Rust: Picture shows the upper leaves of a wheat plant nearing the flag leaf stage. Numerous small, bright yellow-orange pustules are arranged in distinct, long stripes running parallel to the leaf veins. When touched, a yellowish powder (spores) rubs off.",
26
+ "Chili - Leaf Curl Virus: The image focuses on the growing tip of a chili plant. The young leaves are severely deformed, showing upward curling, puckering, and twisting. The leaves appear thickened and brittle, and veins are sometimes noticeably thicker. Plant growth appears stunted overall. Tiny whitefly insects may be subtly present but are not the primary focus.",
27
+ "Groundnut - Tikka Disease (Leaf Spot): Several leaves of a groundnut plant are displayed. They are covered with numerous circular, dark brown to black spots, about 2-8 mm wide. Each spot is surrounded by a distinct yellow halo. Some older spots on lower leaves might lack the clear halo.",
28
+ "Soybean - Healthy: Image shows a section of a healthy, vigorously growing soybean plant canopy during the vegetative stage. The leaves are uniformly deep green, fully expanded, and show no signs of spotting, yellowing, mosaic patterns, insect feeding, or wilting. Stems appear sturdy.",
29
+ "Mango - Anthracnose on Fruit: A near-ripe mango fruit is shown. It displays several sunken, irregular-shaped black spots on the peel. Some spots appear to be coalescing into larger dark lesions. In humid conditions, pinkish spore masses might be visible in the center of older spots (though not clearly visible here).",
30
+ "Sugarcane - Early Shoot Borer Damage: Focus is on the base of young sugarcane tillers (shoots). The central whorl of leaves is dried up, creating a characteristic \"dead heart\" symptom, which can be easily pulled out. Small bore holes may be visible near the base of the affected shoot.",
31
+ "Potato - Late Blight: Image shows potato leaves with large, irregular, water-soaked lesions, often starting at the leaf margins or tips. The lesions rapidly turn dark brown to black. A fuzzy white fungal growth (sporangiophores) might be visible on the underside of the leaves near the edge of the lesions, especially in moist conditions.",
32
+ "Brinjal (Eggplant) - Aphid Infestation: Close-up shows the underside of several young brinjal leaves and stems. They are heavily infested with clusters of small, pear-shaped insects, mostly greenish or blackish in color. Some shiny, sticky honeydew secretions might be visible on the leaf surfaces below the infestation.",
33
+ "Okra (Lady's Finger) - Yellow Vein Mosaic Virus: The image shows several leaves of an okra plant. The veins of the leaves are prominently yellow, and the yellow network extends into the surrounding green leaf tissue, creating a distinct mosaic pattern. The leaves might be slightly reduced in size.",
34
+ "Cabbage - Diamondback Moth Larvae: Several outer leaves of a cabbage head are shown. Small (around 1 cm), greenish larvae with a slightly tapered body are visible, actively feeding on the leaf tissue. They have created irregular holes or \"window panes\" (where only one layer of the leaf epidermis is left). Some silken threads might be present.",
35
+ "Pigeon Pea (Tur/Arhar) - Pod Borer Damage: Image shows several developing pigeon pea pods. Some pods have distinct circular bore holes on the surface. One pod is broken open slightly, revealing a visible greenish or brownish larva feeding inside on the developing seeds. Frass might be present near the holes.",
36
+ "Mustard - White Rust: Focus is on mustard leaves and potentially flower stalks. White, blister-like pustules, somewhat raised and initially smooth, are scattered on the underside of the leaves. Corresponding yellow spots may be visible on the upper leaf surface. Affected flower parts can become swollen and distorted.",
37
+ "Apple - Powdery Mildew: Young leaves and shoot tips of an apple sapling are shown. They are covered with a characteristic white to grayish powdery fungal growth. Affected leaves may appear distorted, curled, or stunted.",
38
+ "Grapes - Downy Mildew: Image displays the upper surface of grape leaves showing distinct, angular, yellowish-green spots (\"oil spots\"). On the underside of the leaf, corresponding to these spots, a white, downy fungal growth is visible, especially under humid conditions.",
39
+ "Papaya - Mealybug Infestation: Close-up on the stem and leaf petioles of a papaya plant, especially near the growing point. Clusters of white, cottony masses are visible, covering small, oval, soft-bodied insects (mealybugs). Ants may also be present, attracted to the honeydew secreted by the mealybugs.",
40
+ "Banana - Sigatoka Leaf Spot: Several mature banana leaves are shown. They exhibit numerous small, elliptical spots parallel to the leaf veins. Initially yellowish-green, the spots enlarge, become dark brown or black, and often develop a grayish center surrounded by a dark border and a yellow halo. Severe spotting leads to leaf necrosis."
41
+ ]
42
+
43
+ OUTPUT_FILENAME = "agri_synthetic_data_generated_lmstudio.jsonl"
44
+ GENERATION_TEMPERATURE = 0.7
45
+ SLEEP_TIME_PER_API_CALL = 1 # LM Studio is local, so we can typically be a bit faster
46
+
47
+ # --- Prompt Template ---
48
+ single_prompt_template = """
49
+ You are an AI assistant generating fine-tuning data for an agricultural diagnosis chatbot.
50
+ Your task is to generate a response in a specific format based on the provided plant symptom description and the user's implied question style.
51
+
52
+ Follow these instructions STRICTLY:
53
+
54
+ 1. Analyze the User Question Style: Understand the level of detail requested by the user prompt ({user_question_style}).
55
+ 2. Analyze the Input Description: Carefully read the provided plant symptom description.
56
+ 3. Generate the Response: Create a helpful and accurate response that directly addresses the user's question style, based *only* on the information in the input description.
57
+ 4. Format the Output EXACTLY as specified below and make sure to include ALL the tags, especially the final <end_of_turn> tag:
58
+
59
+ <bos><start_of_turn>user
60
+ {user_question}
61
+ {input_description}<end_of_turn>
62
+ <start_of_turn>model
63
+ [YOUR GENERATED RESPONSE HERE BASED ON THE DESCRIPTION AND QUESTION STYLE]<end_of_turn>
64
+
65
+ IMPORTANT: Make sure your response includes all required tags, especially ending with <end_of_turn>
66
+
67
+ --- Input Data ---
68
+
69
+ User Question Style: {user_question_style}
70
+ User Question: {user_question}
71
+ Input Description:
72
+ {input_description}
73
+
74
+ --- Generate the Output Below ---
75
+ """
76
+
77
+ # Define the different user question styles
78
+ user_prompts_for_style = {
79
+ "detailed": "Analyze the provided description of plant symptoms and provide a detailed technical diagnosis and recommendations.",
80
+ "short": "Briefly identify the problem shown in the description and give the main point.",
81
+ "easy": "Explain what might be wrong based on this description, using simple words, and what I should do first."
82
+ }
83
+
84
+ # --- Load Existing Data ---
85
+ existing_texts = set()
86
+ if os.path.exists(OUTPUT_FILENAME):
87
+ print(f"Loading existing entries from {OUTPUT_FILENAME} to avoid duplicates...")
88
+ try:
89
+ with open(OUTPUT_FILENAME, 'r', encoding='utf-8') as infile:
90
+ for line in infile:
91
+ try:
92
+ data = json.loads(line)
93
+ if 'text' in data:
94
+ existing_texts.add(data['text'])
95
+ except (json.JSONDecodeError, KeyError) as e:
96
+ print(f" Skipping invalid line: {line.strip()} - Error: {e}")
97
+ print(f"Loaded {len(existing_texts)} unique existing entries.")
98
+ except Exception as e:
99
+ print(f"Error loading existing file: {e}. Starting fresh.")
100
+ else:
101
+ print(f"Output file {OUTPUT_FILENAME} not found. Starting fresh.")
102
+
103
+ # --- Main Generation Logic ---
104
+ total_entries_generated_this_run = 0
105
+ num_duplicates_skipped_this_run = 0
106
+ total_api_calls = len(input_descriptions) * len(user_prompts_for_style)
107
+ current_api_call = 0
108
+
109
+ print(f"\nStarting data generation for {len(input_descriptions)} descriptions...")
110
+ print(f"Will make {total_api_calls} individual API calls via LM Studio.")
111
+ print(f"Output will be appended to: {OUTPUT_FILENAME}")
112
+ print(f"Using LM Studio model: {MODEL_NAME}, Temperature: {GENERATION_TEMPERATURE}")
113
+ print(f"Pausing {SLEEP_TIME_PER_API_CALL} seconds between API calls.")
114
+
115
+ try:
116
+ with open(OUTPUT_FILENAME, 'a', encoding='utf-8') as f:
117
+ # Iterate through each description
118
+ for i, description in enumerate(input_descriptions):
119
+ print(f"\n--- Processing Description {i+1}/{len(input_descriptions)} ---")
120
+ print(f"Input: {description[:100]}...") # Print snippet
121
+
122
+ # Iterate through each prompt style for the current description
123
+ for style_key, user_question in user_prompts_for_style.items():
124
+ current_api_call += 1
125
+ print(f" Generating '{style_key}' style response ({current_api_call}/{total_api_calls})...")
126
+
127
+ # Format the prompt
128
+ pasted_desc_safe = description.replace('{', '{{').replace('}', '}}')
129
+ formatted_prompt = single_prompt_template.format(
130
+ user_question_style=style_key,
131
+ user_question=user_question,
132
+ input_description=pasted_desc_safe
133
+ )
134
+
135
+ # Prepare payload for LM Studio API
136
+ payload = {
137
+ "messages": [
138
+ {
139
+ "role": "user",
140
+ "content": formatted_prompt
141
+ }
142
+ ],
143
+ "temperature": GENERATION_TEMPERATURE,
144
+ "max_tokens": 2048,
145
+ "stream": False
146
+ }
147
+
148
+ # Call the LM Studio API (with retries)
149
+ api_call_successful = False
150
+ retry_count = 0
151
+ max_retries = 3
152
+ response_text = None
153
+
154
+ while not api_call_successful and retry_count <= max_retries:
155
+ try:
156
+ print(f" Sending request to LM Studio (Attempt {retry_count + 1})...")
157
+ start_time = time.time()
158
+
159
+ # Make the POST request to LM Studio
160
+ response = requests.post(
161
+ LMSTUDIO_API_URL,
162
+ json=payload,
163
+ headers={"Content-Type": "application/json"}
164
+ )
165
+
166
+ end_time = time.time()
167
+ print(f" LM Studio API call took {end_time - start_time:.2f} seconds.")
168
+
169
+ # Check for HTTP errors
170
+ response.raise_for_status()
171
+
172
+ # Parse the JSON response
173
+ response_data = response.json()
174
+
175
+ # Extract response content (LM Studio format)
176
+ if response_data.get('choices') and len(response_data['choices']) > 0:
177
+ response_text = response_data['choices'][0]['message']['content']
178
+ print(f" DEBUG - Last 50 chars of response: {response_text[-50:] if response_text else 'None'}")
179
+
180
+ # Check if response looks like expected format
181
+ if response_text and "<bos>" in response_text:
182
+ api_call_successful = True
183
+ else:
184
+ print(" WARNING: Received response doesn't contain '<bos>'. Might be malformed.")
185
+ print(f" Raw Response Snippet: {response_text[:200]}...")
186
+ # Extract what looks like the format we want
187
+ if "<start_of_turn>" in response_text:
188
+ # Try to extract just the formatted part
189
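+ pattern = r"<start_of_turn>user.*?<end_of_turn>\s*<start_of_turn>model.*?<end_of_turn>"
189
+ match = re.search(pattern, response_text, re.DOTALL)
190
+ if match:
191
+ response_text = "<bos>" + match.group(0)  # this branch runs only when <bos> is absent, so re-add it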
193
+ api_call_successful = True
194
+ print(" Found partial match in response, extracted usable content.")
195
+ else:
196
+ raise ValueError("Response format incorrect")
197
+ else:
198
+ raise ValueError("Response format incorrect")
199
+ else:
200
+ print(" WARNING: Received unexpected response structure from LM Studio.")
201
+ print(f" Raw response data: {response_data}")
202
+ response_text = None
203
+ raise ValueError("Invalid LM Studio response structure")
204
+
205
+ except requests.exceptions.RequestException as e:
206
+ retry_count += 1
207
+ print(f" ERROR during LM Studio API call (Attempt {retry_count}): {e}")
208
+
209
+ wait_time = SLEEP_TIME_PER_API_CALL * (2 ** (retry_count - 1))
210
+
211
+ if retry_count > max_retries:
212
+ print(f" Max retries reached for Description {i+1}, Style '{style_key}'. Skipping.")
213
+ response_text = None
214
+ break
215
+ print(f" Retrying in {wait_time:.2f} seconds...")
216
+ time.sleep(wait_time)
217
+ except Exception as e:
218
+ retry_count += 1
219
+ print(f" ERROR processing request/response (Attempt {retry_count}): {e}")
220
+ wait_time = SLEEP_TIME_PER_API_CALL * (2 ** (retry_count - 1))
221
+ if retry_count > max_retries:
222
+ print(f" Max retries reached for Description {i+1}, Style '{style_key}'. Skipping.")
223
+ response_text = None
224
+ break
225
+ print(f" Retrying in {wait_time:.2f} seconds...")
226
+ time.sleep(wait_time)
227
+
228
+ # Process and Save Response
229
+ if api_call_successful and response_text:
230
+ cleaned_text = response_text.strip()
231
+ print(f" DEBUG - Response starts with: {cleaned_text[:50]}...")
232
+ print(f" DEBUG - Response ends with: ...{cleaned_text[-50:]}")
233
+
234
+ # Ensure response has both starting and ending tags
235
+ if "<bos>" in cleaned_text:
236
+ # Check for end tag and add if missing
237
+ if "<end_of_turn>" not in cleaned_text:
238
+ print(" WARNING: Response missing <end_of_turn> tag. Adding it for consistency.")
239
+ cleaned_text = cleaned_text + "<end_of_turn>"
240
+
241
+ # Check for duplicates
242
+ if cleaned_text not in existing_texts:
243
+ data_entry = {"text": cleaned_text}
244
+ json.dump(data_entry, f, ensure_ascii=False)
245
+ f.write('\n')
246
+ existing_texts.add(cleaned_text)
247
+ total_entries_generated_this_run += 1
248
+ print(f" Saved 1 new entry for '{style_key}'.")
249
+ else:
250
+ num_duplicates_skipped_this_run += 1
251
+ print(f" Skipped 1 duplicate entry for '{style_key}'.")
252
+ else:
253
+ print(f" WARNING: Invalid format detected in final generated entry for '{style_key}'.")
254
+ print(f" Failed validation. Raw text: {cleaned_text[:150]}...")
255
+
256
+ elif not api_call_successful:
257
+ print(f" Skipping save for '{style_key}' due to API call failures after retries.")
258
+
259
+ # Rate Limiting Pause (after each API call)
260
+ if current_api_call < total_api_calls: # Don't sleep after the very last call
261
+ time.sleep(SLEEP_TIME_PER_API_CALL)
262
+
263
+ except KeyboardInterrupt:
264
+ print("\n--- KeyboardInterrupt detected. Stopping script early. ---")
265
+ except Exception as e:
266
+ print(f"\nAn unexpected error occurred outside the main loop: {e}")
267
+ import traceback
268
+ traceback.print_exc()
269
+
270
+ finally:
271
+ print(f"\n--- Run Finished ---")
272
+ print(f"Total API calls attempted: {current_api_call}")
273
+ print(f"Entries generated in THIS run: {total_entries_generated_this_run}")
274
+ print(f"Duplicates skipped in THIS run: {num_duplicates_skipped_this_run}")
275
+ print(f"Total unique entries now in {OUTPUT_FILENAME}: {len(existing_texts)}")
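The format check in this script can be factored into a small helper that accepts a raw completion and returns either one complete user + model exchange or nothing. The following is a minimal sketch, assuming the two-turn Gemma-style layout requested in the prompt template; `extract_entry` and `_ENTRY_RE` are hypothetical names and are not part of this commit.

```python
import re

# Hypothetical helper (not part of the commit): pull one complete
# user + model exchange out of a raw completion.
_ENTRY_RE = re.compile(
    r"<start_of_turn>user\n.*?<end_of_turn>\s*"
    r"<start_of_turn>model\n.*?<end_of_turn>",
    re.DOTALL,
)


def extract_entry(raw):
    """Return '<bos>' plus the first complete two-turn exchange, or None."""
    if not raw:
        return None
    match = _ENTRY_RE.search(raw)
    if match is None:
        return None
    # Normalise the prefix: any chatter before the user turn is dropped
    # and a single <bos> is placed at the start.
    return "<bos>" + match.group(0)
```

Returning `None` for an incomplete exchange lets the caller treat it like any other failed attempt and retry, rather than saving an entry that contains only the user turn.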
generate_synthetic_data_ollama.py ADDED
@@ -0,0 +1,272 @@
1
+ import requests
2
+ import os
3
+ import json
4
+ import time
5
+ import re
6
+ import sys
7
+ import math
8
+ from dotenv import load_dotenv
9
+
10
+ load_dotenv()
11
+
12
+ # --- Configuration ---
13
+ # Set Ollama API URL - default is local server
14
+ OLLAMA_API_URL = os.environ.get("OLLAMA_API_URL", "http://192.168.38.87:1234/v1/completions")
15
+
16
+ # Choose a model available in your Ollama instance
17
+ # Examples: "llama3", "mistral", "gemma:7b", etc.
18
+ MODEL_NAME = "lmstudio-community/gemma-3-4b-it-GGUF"
19
+
20
+ # Your list of detailed input descriptions
21
+ input_descriptions = [
22
+ "Rice - Blast: The image shows multiple leaves of a rice plant during the tillering stage. Many leaves exhibit diamond-shaped lesions, approximately 1-2 cm long, with grayish-white centers and distinct dark brown borders. Some lesions are merging, causing larger blighted areas. No insects are visible.",
23
+ "Cotton - Bollworm: Close-up view of a developing cotton boll. Several small (<1cm), pinkish-white larvae are seen boring into the boll surface near the bracts. Small entry holes with some dark, granular frass (excrement) are visible. The surrounding leaves appear relatively healthy.",
24
+ "Maize - Nitrogen Deficiency: Image displays the lower leaves of a young maize plant. The leaves show distinct yellowing starting at the leaf tip and progressing down the midrib in a characteristic V-shape pattern. The leaf margins remain relatively green initially. Upper leaves appear greener.",
25
+ "Tomato - Early Blight: Several lower leaves of a mature tomato plant are shown. They exhibit circular to irregular dark brown lesions, 0.5-1.5 cm in diameter. Many lesions display characteristic concentric rings, giving a 'target board' appearance. Significant yellowing (chlorosis) surrounds the lesions, and some affected leaves are starting to wither.",
26
+ "Wheat - Stripe Rust: Picture shows the upper leaves of a wheat plant nearing the flag leaf stage. Numerous small, bright yellow-orange pustules are arranged in distinct, long stripes running parallel to the leaf veins. When touched, a yellowish powder (spores) rubs off.",
27
+ "Chili - Leaf Curl Virus: The image focuses on the growing tip of a chili plant. The young leaves are severely deformed, showing upward curling, puckering, and twisting. The leaves appear thickened and brittle, and veins are sometimes noticeably thicker. Plant growth appears stunted overall. Tiny whitefly insects may be subtly present but are not the primary focus.",
28
+ "Groundnut - Tikka Disease (Leaf Spot): Several leaves of a groundnut plant are displayed. They are covered with numerous circular, dark brown to black spots, about 2-8 mm wide. Each spot is surrounded by a distinct yellow halo. Some older spots on lower leaves might lack the clear halo.",
29
+ "Soybean - Healthy: Image shows a section of a healthy, vigorously growing soybean plant canopy during the vegetative stage. The leaves are uniformly deep green, fully expanded, and show no signs of spotting, yellowing, mosaic patterns, insect feeding, or wilting. Stems appear sturdy.",
30
+ "Mango - Anthracnose on Fruit: A near-ripe mango fruit is shown. It displays several sunken, irregular-shaped black spots on the peel. Some spots appear to be coalescing into larger dark lesions. In humid conditions, pinkish spore masses might be visible in the center of older spots (though not clearly visible here).",
31
+ "Sugarcane - Early Shoot Borer Damage: Focus is on the base of young sugarcane tillers (shoots). The central whorl of leaves is dried up, creating a characteristic \"dead heart\" symptom, which can be easily pulled out. Small bore holes may be visible near the base of the affected shoot.",
32
+ "Potato - Late Blight: Image shows potato leaves with large, irregular, water-soaked lesions, often starting at the leaf margins or tips. The lesions rapidly turn dark brown to black. A fuzzy white fungal growth (sporangiophores) might be visible on the underside of the leaves near the edge of the lesions, especially in moist conditions.",
33
+ "Brinjal (Eggplant) - Aphid Infestation: Close-up shows the underside of several young brinjal leaves and stems. They are heavily infested with clusters of small, pear-shaped insects, mostly greenish or blackish in color. Some shiny, sticky honeydew secretions might be visible on the leaf surfaces below the infestation.",
34
+ "Okra (Lady's Finger) - Yellow Vein Mosaic Virus: The image shows several leaves of an okra plant. The veins of the leaves are prominently yellow, and the yellow network extends into the surrounding green leaf tissue, creating a distinct mosaic pattern. The leaves might be slightly reduced in size.",
35
+ "Cabbage - Diamondback Moth Larvae: Several outer leaves of a cabbage head are shown. Small (around 1 cm), greenish larvae with a slightly tapered body are visible, actively feeding on the leaf tissue. They have created irregular holes or \"window panes\" (where only one layer of the leaf epidermis is left). Some silken threads might be present.",
36
+ "Pigeon Pea (Tur/Arhar) - Pod Borer Damage: Image shows several developing pigeon pea pods. Some pods have distinct circular bore holes on the surface. One pod is broken open slightly, revealing a visible greenish or brownish larva feeding inside on the developing seeds. Frass might be present near the holes.",
37
+ "Mustard - White Rust: Focus is on mustard leaves and potentially flower stalks. White, blister-like pustules, somewhat raised and initially smooth, are scattered on the underside of the leaves. Corresponding yellow spots may be visible on the upper leaf surface. Affected flower parts can become swollen and distorted.",
38
+ "Apple - Powdery Mildew: Young leaves and shoot tips of an apple sapling are shown. They are covered with a characteristic white to grayish powdery fungal growth. Affected leaves may appear distorted, curled, or stunted.",
39
+ "Grapes - Downy Mildew: Image displays the upper surface of grape leaves showing distinct, angular, yellowish-green spots (\"oil spots\"). On the underside of the leaf, corresponding to these spots, a white, downy fungal growth is visible, especially under humid conditions.",
40
+ "Papaya - Mealybug Infestation: Close-up on the stem and leaf petioles of a papaya plant, especially near the growing point. Clusters of white, cottony masses are visible, covering small, oval, soft-bodied insects (mealybugs). Ants may also be present, attracted to the honeydew secreted by the mealybugs.",
41
+ "Banana - Sigatoka Leaf Spot: Several mature banana leaves are shown. They exhibit numerous small, elliptical spots parallel to the leaf veins. Initially yellowish-green, the spots enlarge, become dark brown or black, and often develop a grayish center surrounded by a dark border and a yellow halo. Severe spotting leads to leaf necrosis."
42
+ ]
43
+
44
+ OUTPUT_FILENAME = "agri_synthetic_data_generated_ollama.jsonl"
45
+ GENERATION_TEMPERATURE = 0.7
46
+ SLEEP_TIME_PER_API_CALL = 1 # Ollama is local, so we can typically be a bit faster
47
+
48
+ # --- Prompt Template ---
49
+ single_prompt_template = """
50
+ You are an AI assistant generating fine-tuning data for an agricultural diagnosis chatbot.
51
+ Your task is to generate a response in a specific format based on the provided plant symptom description and the user's implied question style.
52
+
53
+ Follow these instructions STRICTLY:
54
+
55
+ 1. Analyze the User Question Style: Understand the level of detail requested by the user prompt ({user_question_style}).
56
+ 2. Analyze the Input Description: Carefully read the provided plant symptom description.
57
+ 3. Generate the Response: Create a helpful and accurate response that directly addresses the user's question style, based *only* on the information in the input description.
58
+ 4. Format the Output EXACTLY as specified below and make sure to include ALL the tags, especially the final <end_of_turn> tag:
59
+
60
+ <bos><start_of_turn>user
61
+ {user_question}
62
+ {input_description}<end_of_turn>
63
+ <start_of_turn>model
64
+ [YOUR GENERATED RESPONSE HERE BASED ON THE DESCRIPTION AND QUESTION STYLE]<end_of_turn>
65
+
66
+ IMPORTANT: Make sure your response includes all required tags, especially ending with <end_of_turn>
67
+
68
+ --- Input Data ---
69
+
70
+ User Question Style: {user_question_style}
71
+ User Question: {user_question}
72
+ Input Description:
73
+ {input_description}
74
+
75
+ --- Generate the Output Below ---
76
+ """
77
+
78
+ # Define the different user question styles
79
+ user_prompts_for_style = {
80
+ "detailed": "Analyze the provided description of plant symptoms and provide a detailed technical diagnosis and recommendations.",
81
+ "short": "Briefly identify the problem shown in the description and give the main point.",
82
+ "easy": "Explain what might be wrong based on this description, using simple words, and what I should do first."
83
+ }
84
+
85
+ # --- Load Existing Data ---
86
+ existing_texts = set()
87
+ if os.path.exists(OUTPUT_FILENAME):
88
+ print(f"Loading existing entries from {OUTPUT_FILENAME} to avoid duplicates...")
89
+ try:
90
+ with open(OUTPUT_FILENAME, 'r', encoding='utf-8') as infile:
91
+ for line in infile:
92
+ try:
93
+ data = json.loads(line)
94
+ if 'text' in data:
95
+ existing_texts.add(data['text'])
96
+ except (json.JSONDecodeError, KeyError) as e:
97
+ print(f" Skipping invalid line: {line.strip()} - Error: {e}")
98
+ print(f"Loaded {len(existing_texts)} unique existing entries.")
99
+ except Exception as e:
100
+ print(f"Error loading existing file: {e}. Starting fresh.")
101
+ else:
102
+ print(f"Output file {OUTPUT_FILENAME} not found. Starting fresh.")
103
+
104
+ # --- Main Generation Logic ---
105
+ total_entries_generated_this_run = 0
106
+ num_duplicates_skipped_this_run = 0
107
+ total_api_calls = len(input_descriptions) * len(user_prompts_for_style)
108
+ current_api_call = 0
109
+
110
+ print(f"\nStarting data generation for {len(input_descriptions)} descriptions...")
111
+ print(f"Will make {total_api_calls} individual API calls via Ollama.")
112
+ print(f"Output will be appended to: {OUTPUT_FILENAME}")
113
+ print(f"Using Ollama model: {MODEL_NAME}, Temperature: {GENERATION_TEMPERATURE}")
114
+ print(f"Pausing {SLEEP_TIME_PER_API_CALL} seconds between API calls.")
115
+
116
+ try:
117
+ with open(OUTPUT_FILENAME, 'a', encoding='utf-8') as f:
118
+ # Iterate through each description
119
+ for i, description in enumerate(input_descriptions):
120
+ print(f"\n--- Processing Description {i+1}/{len(input_descriptions)} ---")
121
+ print(f"Input: {description[:100]}...") # Print snippet
122
+
123
+ # Iterate through each prompt style for the current description
124
+ for style_key, user_question in user_prompts_for_style.items():
125
+ current_api_call += 1
126
+ print(f" Generating '{style_key}' style response ({current_api_call}/{total_api_calls})...")
127
+
128
+ # Format the prompt
129
+ pasted_desc_safe = description.replace('{', '{{').replace('}', '}}')
130
+ formatted_prompt = single_prompt_template.format(
131
+ user_question_style=style_key,
132
+ user_question=user_question,
133
+ input_description=pasted_desc_safe
134
+ )
135
+
136
+ # Prepare payload for Ollama API
137
+ payload = {
138
+ "model": MODEL_NAME,
139
+ "prompt": formatted_prompt,
140
+ "temperature": GENERATION_TEMPERATURE,
141
+ "num_predict": 2048, # Max tokens to generate
142
+ "stop": ["\n\n"] # Optional stop sequence; when honored, it ends generation at the first blank line
143
+ }
144
+
145
+ # Call the Ollama API (with retries)
146
+ api_call_successful = False
147
+ retry_count = 0
148
+ max_retries = 3
149
+ response_text = None
150
+
151
+ while not api_call_successful and retry_count <= max_retries:
152
+ try:
153
+ print(f" Sending request to Ollama (Attempt {retry_count + 1})...")
154
+ start_time = time.time()
155
+
156
+ # Make the POST request to Ollama
157
+ response = requests.post(
158
+ OLLAMA_API_URL,
159
+ json=payload,
160
+ headers={"Content-Type": "application/json"}
161
+ )
162
+
163
+ end_time = time.time()
164
+ print(f" Ollama API call took {end_time - start_time:.2f} seconds.")
165
+
166
+ # Check for HTTP errors
167
+ response.raise_for_status()
168
+
169
+ # Parse the JSON response
170
+ response_data = response.json()
171
+
172
+ # Extract response content (Ollama's format differs from OpenRouter)
173
+ if response_data.get('response'):
174
+ response_text = response_data['response']
175
+ print(f" DEBUG - Last 50 chars of response: {response_text[-50:] if response_text else 'None'}")
176
+
177
+ # Check if response looks like expected format
178
+ if response_text and "<bos>" in response_text:
179
+ api_call_successful = True
180
+ else:
181
+ print(" WARNING: Received response doesn't contain '<bos>'. Might be malformed.")
182
+ print(f" Raw Response Snippet: {response_text[:200]}...")
183
+ # Extract what looks like the format we want
184
+ if "<start_of_turn>" in response_text:
185
+ # Try to extract just the formatted part
186
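+ pattern = r"<start_of_turn>user.*?<end_of_turn>\s*<start_of_turn>model.*?<end_of_turn>"
187
+ match = re.search(pattern, response_text, re.DOTALL)
188
+ if match:
189
+ response_text = "<bos>" + match.group(0)  # this branch runs only when <bos> is absent, so re-add it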
190
+ api_call_successful = True
191
+ print(" Found partial match in response, extracted usable content.")
192
+ else:
193
+ raise ValueError("Response format incorrect")
194
+ else:
195
+ raise ValueError("Response format incorrect")
196
+ else:
197
+ print(" WARNING: Received unexpected response structure from Ollama.")
198
+ print(f" Raw response data: {response_data}")
199
+ response_text = None
200
+ raise ValueError("Invalid Ollama response structure")
201
+
202
+ except requests.exceptions.RequestException as e:
203
+ retry_count += 1
204
+ print(f" ERROR during Ollama API call (Attempt {retry_count}): {e}")
205
+
206
+ wait_time = SLEEP_TIME_PER_API_CALL * (2 ** (retry_count - 1))
207
+
208
+ if retry_count > max_retries:
209
+ print(f" Max retries reached for Description {i+1}, Style '{style_key}'. Skipping.")
210
+ response_text = None
211
+ break
212
+ print(f" Retrying in {wait_time:.2f} seconds...")
213
+ time.sleep(wait_time)
214
+ except Exception as e:
215
+ retry_count += 1
216
+ print(f" ERROR processing request/response (Attempt {retry_count}): {e}")
217
+ wait_time = SLEEP_TIME_PER_API_CALL * (2 ** (retry_count - 1))
218
+ if retry_count > max_retries:
219
+ print(f" Max retries reached for Description {i+1}, Style '{style_key}'. Skipping.")
220
+ response_text = None
221
+ break
222
+ print(f" Retrying in {wait_time:.2f} seconds...")
223
+ time.sleep(wait_time)
224
+
225
+ # Process and Save Response
226
+ if api_call_successful and response_text:
227
+ cleaned_text = response_text.strip()
228
+ print(f" DEBUG - Response starts with: {cleaned_text[:50]}...")
229
+ print(f" DEBUG - Response ends with: ...{cleaned_text[-50:]}")
230
+
231
+ # Ensure response has both starting and ending tags
232
+ if "<bos>" in cleaned_text:
233
+ # Check for end tag and add if missing
234
+ if "<end_of_turn>" not in cleaned_text:
235
+ print(" WARNING: Response missing <end_of_turn> tag. Adding it for consistency.")
236
+ cleaned_text = cleaned_text + "<end_of_turn>"
237
+
238
+ # Check for duplicates
239
+ if cleaned_text not in existing_texts:
240
+ data_entry = {"text": cleaned_text}
241
+ json.dump(data_entry, f, ensure_ascii=False)
242
+ f.write('\n')
243
+ existing_texts.add(cleaned_text)
244
+ total_entries_generated_this_run += 1
245
+ print(f" Saved 1 new entry for '{style_key}'.")
246
+ else:
247
+ num_duplicates_skipped_this_run += 1
248
+ print(f" Skipped 1 duplicate entry for '{style_key}'.")
249
+ else:
250
+ print(f" WARNING: Invalid format detected in final generated entry for '{style_key}'.")
251
+ print(f" Failed validation. Raw text: {cleaned_text[:150]}...")
252
+
253
+ elif not api_call_successful:
254
+ print(f" Skipping save for '{style_key}' due to API call failures after retries.")
255
+
256
+ # Rate Limiting Pause (after each API call)
257
+ if current_api_call < total_api_calls: # Don't sleep after the very last call
258
+ time.sleep(SLEEP_TIME_PER_API_CALL)
259
+
260
+ except KeyboardInterrupt:
261
+ print("\n--- KeyboardInterrupt detected. Stopping script early. ---")
262
+ except Exception as e:
263
+ print(f"\nAn unexpected error occurred outside the main loop: {e}")
264
+ import traceback
265
+ traceback.print_exc()
266
+
267
+ finally:
268
+ print(f"\n--- Run Finished ---")
269
+ print(f"Total API calls attempted: {current_api_call}")
270
+ print(f"Entries generated in THIS run: {total_entries_generated_this_run}")
271
+ print(f"Duplicates skipped in THIS run: {num_duplicates_skipped_this_run}")
272
+ print(f"Total unique entries now in {OUTPUT_FILENAME}: {len(existing_texts)}")
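The payload in this script puts `temperature` and `num_predict` at the top level of the request and reads the reply from a `response` key. Ollama's native `/api/generate` endpoint takes sampling parameters inside an `options` object and streams by default, so a request body for that endpoint might look like the sketch below. This is an illustration under that assumption, not the author's configuration: `build_generate_payload`, the localhost URL, and the `gemma3:4b` model tag are placeholders.

```python
import json

# Assumed default address of a local Ollama server; adjust as needed.
OLLAMA_GENERATE_URL = "http://localhost:11434/api/generate"


def build_generate_payload(model, prompt, temperature=0.7, max_tokens=2048):
    """Build a non-streaming request body for Ollama's /api/generate."""
    return {
        "model": model,
        "prompt": prompt,
        # Ask for a single JSON object instead of a stream of chunks,
        # so the reply can be parsed with response.json().
        "stream": False,
        # Sampling parameters go under "options" on the native endpoint.
        "options": {
            "temperature": temperature,
            "num_predict": max_tokens,
        },
    }


# Placeholder model tag and prompt, for illustration only.
payload = build_generate_payload("gemma3:4b", "Describe rice blast symptoms.")
print(json.dumps(payload, indent=2))
```

With `stream` set to `False`, the generated text is returned in the `response` field of a single JSON object, which matches the key this script already reads.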
generate_synthetic_data_openrouter.py ADDED
@@ -0,0 +1,302 @@
1
+ import requests # Use requests for HTTP calls
2
+ import os
3
+ import json
4
+ import time
5
+ import re
6
+ import sys
7
+ import math
8
+ from dotenv import load_dotenv
9
+
10
+ load_dotenv()
11
+ # --- Configuration ---
12
+ # !!! GET YOUR API KEY FROM openrouter.ai !!!
13
+ OPENROUTER_API_KEY = os.environ.get("OPENROUTER_API_KEY") # Use environment variable
14
+ # Or paste directly: OPENROUTER_API_KEY = "sk-or-v1-..."
15
+ if not OPENROUTER_API_KEY:
16
+ raise ValueError("Please set the OPENROUTER_API_KEY environment variable.")
17
+
18
+ # Set your site URL and app name (optional, but recommended by OpenRouter)
19
+ # Replace with your actual info or leave as is
20
+ SITE_URL = "http://localhost" # Or your app's URL
21
+ APP_NAME = "SyntheticDataGen"
22
+
23
+ # Choose a model available on OpenRouter (Check openrouter.ai/models)
24
+ # Use the OpenRouter format: provider/model-name
25
+ MODEL_NAME = "meta-llama/llama-4-maverick:free"
26
+ # Other examples: "openai/gpt-4o", "anthropic/claude-3-haiku", "google/gemini-pro"
27
+
28
+ # Your list of detailed input descriptions (Paste the 20 descriptions here)
29
+ input_descriptions = [
30
+ # === PASTE THE 20 DESCRIPTIONS FROM THE PREVIOUS ANSWER HERE ===
31
+
32
+ "Rice - Blast: The image shows multiple leaves of a rice plant during the tillering stage. Many leaves exhibit diamond-shaped lesions, approximately 1-2 cm long, with grayish-white centers and distinct dark brown borders. Some lesions are merging, causing larger blighted areas. No insects are visible.",
33
+ "Cotton - Bollworm: Close-up view of a developing cotton boll. Several small (<1cm), pinkish-white larvae are seen boring into the boll surface near the bracts. Small entry holes with some dark, granular frass (excrement) are visible. The surrounding leaves appear relatively healthy.",
34
+ "Maize - Nitrogen Deficiency: Image displays the lower leaves of a young maize plant. The leaves show distinct yellowing starting at the leaf tip and progressing down the midrib in a characteristic V-shape pattern. The leaf margins remain relatively green initially. Upper leaves appear greener.",
35
+ "Tomato - Early Blight: Several lower leaves of a mature tomato plant are shown. They exhibit circular to irregular dark brown lesions, 0.5-1.5 cm in diameter. Many lesions display characteristic concentric rings, giving a 'target board' appearance. Significant yellowing (chlorosis) surrounds the lesions, and some affected leaves are starting to wither.",
36
+ "Wheat - Stripe Rust: Picture shows the upper leaves of a wheat plant nearing the flag leaf stage. Numerous small, bright yellow-orange pustules are arranged in distinct, long stripes running parallel to the leaf veins. When touched, a yellowish powder (spores) rubs off.",
37
+ "Chili - Leaf Curl Virus: The image focuses on the growing tip of a chili plant. The young leaves are severely deformed, showing upward curling, puckering, and twisting. The leaves appear thickened and brittle, and veins are sometimes noticeably thicker. Plant growth appears stunted overall. Tiny whitefly insects may be subtly present but are not the primary focus.",
38
+ "Groundnut - Tikka Disease (Leaf Spot): Several leaves of a groundnut plant are displayed. They are covered with numerous circular, dark brown to black spots, about 2-8 mm wide. Each spot is surrounded by a distinct yellow halo. Some older spots on lower leaves might lack the clear halo.",
39
+ "Soybean - Healthy: Image shows a section of a healthy, vigorously growing soybean plant canopy during the vegetative stage. The leaves are uniformly deep green, fully expanded, and show no signs of spotting, yellowing, mosaic patterns, insect feeding, or wilting. Stems appear sturdy.",
40
+ "Mango - Anthracnose on Fruit: A near-ripe mango fruit is shown. It displays several sunken, irregular-shaped black spots on the peel. Some spots appear to be coalescing into larger dark lesions. In humid conditions, pinkish spore masses might be visible in the center of older spots (though not clearly visible here).",
41
+ "Sugarcane - Early Shoot Borer Damage: Focus is on the base of young sugarcane tillers (shoots). The central whorl of leaves is dried up, creating a characteristic \"dead heart\" symptom, which can be easily pulled out. Small bore holes may be visible near the base of the affected shoot.",
42
+ "Potato - Late Blight: Image shows potato leaves with large, irregular, water-soaked lesions, often starting at the leaf margins or tips. The lesions rapidly turn dark brown to black. A fuzzy white fungal growth (sporangiophores) might be visible on the underside of the leaves near the edge of the lesions, especially in moist conditions.",
43
+ "Brinjal (Eggplant) - Aphid Infestation: Close-up shows the underside of several young brinjal leaves and stems. They are heavily infested with clusters of small, pear-shaped insects, mostly greenish or blackish in color. Some shiny, sticky honeydew secretions might be visible on the leaf surfaces below the infestation.",
44
+ "Okra (Lady's Finger) - Yellow Vein Mosaic Virus: The image shows several leaves of an okra plant. The veins of the leaves are prominently yellow, and the yellow network extends into the surrounding green leaf tissue, creating a distinct mosaic pattern. The leaves might be slightly reduced in size.",
45
+ "Cabbage - Diamondback Moth Larvae: Several outer leaves of a cabbage head are shown. Small (around 1 cm), greenish larvae with a slightly tapered body are visible, actively feeding on the leaf tissue. They have created irregular holes or \"window panes\" (where only one layer of the leaf epidermis is left). Some silken threads might be present.",
46
+ "Pigeon Pea (Tur/Arhar) - Pod Borer Damage: Image shows several developing pigeon pea pods. Some pods have distinct circular bore holes on the surface. One pod is broken open slightly, revealing a visible greenish or brownish larva feeding inside on the developing seeds. Frass might be present near the holes.",
47
+ "Mustard - White Rust: Focus is on mustard leaves and potentially flower stalks. White, blister-like pustules, somewhat raised and initially smooth, are scattered on the underside of the leaves. Corresponding yellow spots may be visible on the upper leaf surface. Affected flower parts can become swollen and distorted.",
48
+ "Apple - Powdery Mildew: Young leaves and shoot tips of an apple sapling are shown. They are covered with a characteristic white to grayish powdery fungal growth. Affected leaves may appear distorted, curled, or stunted.",
49
+ "Grapes - Downy Mildew: Image displays the upper surface of grape leaves showing distinct, angular, yellowish-green spots (\"oil spots\"). On the underside of the leaf, corresponding to these spots, a white, downy fungal growth is visible, especially under humid conditions.",
50
+ "Papaya - Mealybug Infestation: Close-up on the stem and leaf petioles of a papaya plant, especially near the growing point. Clusters of white, cottony masses are visible, covering small, oval, soft-bodied insects (mealybugs). Ants may also be present, attracted to the honeydew secreted by the mealybugs.",
51
+ "Banana - Sigatoka Leaf Spot: Several mature banana leaves are shown. They exhibit numerous small, elliptical spots parallel to the leaf veins. Initially yellowish-green, the spots enlarge, become dark brown or black, and often develop a grayish center surrounded by a dark border and a yellow halo. Severe spotting leads to leaf necrosis."
52
+ # === END OF DESCRIPTION LIST ===
53
+ ]
54
+
55
+ OUTPUT_FILENAME = "agri_synthetic_data_generated_openrouter.jsonl" # JSONL output file for the OpenRouter run
56
+ # Consider slightly higher temperature for variation across runs, but check quality
57
+ GENERATION_TEMPERATURE = 0.7 # Sampling temperature sent in the request payload
58
+ # Adjust sleep time based on OpenRouter rate limits (vary per model and user tier)
59
+ SLEEP_TIME_PER_API_CALL = 2 # Start with 2, adjust if you hit rate limits
60
+
61
+ # OpenRouter API Endpoint
62
+ OPENROUTER_API_URL = "https://openrouter.ai/api/v1/chat/completions"
63
+
64
+ # --- Modified Prompt Template with More Explicit Instructions ---
65
+ single_prompt_template = """
66
+ You are an AI assistant generating fine-tuning data for an agricultural diagnosis chatbot.
67
+ Your task is to generate a response in a specific format based on the provided plant symptom description and the user's implied question style.
68
+
69
+ Follow these instructions STRICTLY:
70
+
71
+ 1. Analyze the User Question Style: Understand the level of detail requested by the user prompt ({user_question_style}).
72
+ 2. Analyze the Input Description: Carefully read the provided plant symptom description.
73
+ 3. Generate the Response: Create a helpful and accurate response that directly addresses the user's question style, based *only* on the information in the input description.
74
+ 4. Format the Output EXACTLY as specified below and make sure to include ALL the tags, especially the final <end_of_turn> tag:
75
+
76
+ <bos><start_of_turn>user
77
+ {user_question}
78
+ {input_description}<end_of_turn>
79
+ <start_of_turn>model
80
+ [YOUR GENERATED RESPONSE HERE BASED ON THE DESCRIPTION AND QUESTION STYLE]<end_of_turn>
81
+
82
+ IMPORTANT: Make sure your response includes all required tags, especially ending with <end_of_turn>
83
+
84
+ --- Input Data ---
85
+
86
+ User Question Style: {user_question_style}
87
+ User Question: {user_question}
88
+ Input Description:
89
+ {input_description}
90
+
91
+ --- Generate the Output Below ---
92
+ """
93
+
94
+ # Define the different user question styles (Remains the same)
95
+ user_prompts_for_style = {
96
+ "detailed": "Analyze the provided description of plant symptoms and provide a detailed technical diagnosis and recommendations.",
97
+ "short": "Briefly identify the problem shown in the description and give the main point.",
98
+ "easy": "Explain what might be wrong based on this description, using simple words, and what I should do first."
99
+ }
100
+
101
+
102
+ # --- Load Existing Data (Remains the same) ---
103
+ existing_texts = set()
104
+ if os.path.exists(OUTPUT_FILENAME):
105
+ print(f"Loading existing entries from {OUTPUT_FILENAME} to avoid duplicates...")
106
+ try:
107
+ with open(OUTPUT_FILENAME, 'r', encoding='utf-8') as infile:
108
+ for line in infile:
109
+ try:
110
+ data = json.loads(line)
111
+ if 'text' in data:
112
+ existing_texts.add(data['text'])
113
+ except (json.JSONDecodeError, KeyError) as e:
114
+ print(f" Skipping invalid line: {line.strip()} - Error: {e}")
115
+ print(f"Loaded {len(existing_texts)} unique existing entries.")
116
+ except Exception as e:
117
+ print(f"Error loading existing file: {e}. Starting fresh.")
118
+ else:
119
+ print(f"Output file {OUTPUT_FILENAME} not found. Starting fresh.")
120
+
121
+ # --- Main Generation Logic (Revised for OpenRouter with debug info) ---
122
+ total_entries_generated_this_run = 0
123
+ num_duplicates_skipped_this_run = 0
124
+ total_api_calls = len(input_descriptions) * len(user_prompts_for_style)
125
+ current_api_call = 0
126
+
127
+ print(f"\nStarting data generation for {len(input_descriptions)} descriptions...")
128
+ print(f"Will make {total_api_calls} individual API calls via OpenRouter.")
129
+ print(f"Output will be appended to: {OUTPUT_FILENAME}")
130
+ print(f"Using OpenRouter model: {MODEL_NAME}, Temperature: {GENERATION_TEMPERATURE}")
131
+ print(f"Pausing {SLEEP_TIME_PER_API_CALL} seconds between API calls.")
132
+
133
+ # Prepare headers for OpenRouter API calls
134
+ headers = {
135
+ "Authorization": f"Bearer {OPENROUTER_API_KEY}",
136
+ "HTTP-Referer": SITE_URL, # Optional, but recommended
137
+ "X-Title": APP_NAME, # Optional, but recommended
138
+ "Content-Type": "application/json"
139
+ }
140
+
141
+ try:
142
+ with open(OUTPUT_FILENAME, 'a', encoding='utf-8') as f:
143
+ # Iterate through each description
144
+ for i, description in enumerate(input_descriptions):
145
+ print(f"\n--- Processing Description {i+1}/{len(input_descriptions)} ---")
146
+ print(f"Input: {description[:100]}...") # Print snippet
147
+
148
+ # Iterate through each prompt style for the current description
149
+ for style_key, user_question in user_prompts_for_style.items():
150
+ current_api_call += 1
151
+ print(f" Generating '{style_key}' style response ({current_api_call}/{total_api_calls})...")
152
+
153
+ # --- Format the single prompt ---
154
+ pasted_desc_safe = description # str.format() does not re-parse substituted values, so braces need no escaping
155
+ formatted_prompt = single_prompt_template.format(
156
+ user_question_style=style_key,
157
+ user_question=user_question,
158
+ input_description=pasted_desc_safe
159
+ )
160
+
161
+ # --- Structure messages for the API (Remains the same) ---
162
+ messages = [
163
+ {
164
+ "role": "user",
165
+ "content": formatted_prompt
166
+ }
167
+ ]
168
+
169
+ # --- Prepare payload for OpenRouter ---
170
+ payload = {
171
+ "model": MODEL_NAME,
172
+ "messages": messages,
173
+ "temperature": GENERATION_TEMPERATURE,
174
+ "max_tokens": 2048, # Set a higher limit to avoid truncation
175
+ # "top_p": 0.9, # Optional: Set top_p if needed
176
+ # "stream": False # We want the full response, not streaming
177
+ }
178
+
179
+ # --- Call the OpenRouter API (with retries) ---
180
+ api_call_successful = False
181
+ retry_count = 0
182
+ max_retries = 3
183
+ response_text = None
184
+ while not api_call_successful and retry_count <= max_retries:
185
+ try:
186
+ print(f" Sending request to OpenRouter (Attempt {retry_count + 1})...")
187
+ start_time = time.time()
188
+
189
+ # Make the POST request
190
+ response = requests.post(
191
+ OPENROUTER_API_URL,
192
+ headers=headers,
193
+ json=payload, # Send data as JSON
+ timeout=60 # Assumed 60 s cap so a stalled connection cannot hang the run
194
+ )
195
+
196
+ end_time = time.time()
197
+ print(f" OpenRouter API call took {end_time - start_time:.2f} seconds.")
198
+
199
+ # Check for HTTP errors (4xx or 5xx)
200
+ response.raise_for_status()
201
+
202
+ # Parse the JSON response
203
+ response_data = response.json()
204
+
205
+ # Extract response content
206
+ if (response_data.get('choices') and
207
+ isinstance(response_data['choices'], list) and
208
+ len(response_data['choices']) > 0 and
209
+ response_data['choices'][0].get('message') and
210
+ response_data['choices'][0]['message'].get('content')):
211
+
212
+ response_text = response_data['choices'][0]['message']['content']
213
+ print(f" DEBUG - Last 50 chars of response: {response_text[-50:] if response_text else 'None'}")
214
+
215
+ # Basic check if the response looks like the start of our format
216
+ if response_text and response_text.strip().startswith("<bos>"):
217
+ api_call_successful = True
218
+ else:
219
+ print(" WARNING: Received response doesn't start with '<bos>'. Might be malformed.")
220
+ print(f" Raw Response Snippet: {response_text[:200]}...")
221
+ raise ValueError("Response format incorrect")
222
+ else:
223
+ # Handle unexpected response structure
224
+ print(" WARNING: Received unexpected response structure from OpenRouter.")
225
+ print(f" Raw response data: {response_data}")
226
+ response_text = None
227
+ raise ValueError("Invalid OpenRouter response structure") # Trigger retry
228
+
229
+ except requests.exceptions.RequestException as e:
230
+ retry_count += 1
231
+ print(f" ERROR during OpenRouter API call (Attempt {retry_count}): {e}")
232
+ # Check for specific status codes if needed (e.g., 429 Too Many Requests)
233
+ if e.response is not None and e.response.status_code == 429: # A 4xx/5xx Response is falsy, so compare against None
234
+ print(" Rate limit likely exceeded.")
235
+ # Increase wait time significantly for rate limits
236
+ wait_time = SLEEP_TIME_PER_API_CALL * (2 ** retry_count) * 5
237
+ else:
238
+ wait_time = SLEEP_TIME_PER_API_CALL * (2 ** (retry_count - 1))
239
+
240
+ if retry_count > max_retries:
241
+ print(f" Max retries reached for Description {i+1}, Style '{style_key}'. Skipping.")
242
+ response_text = None
243
+ break # Break the retry loop for this specific call
244
+ print(f" Retrying in {wait_time:.2f} seconds...")
245
+ time.sleep(wait_time)
246
+ except Exception as e: # Catch other potential errors (like ValueError from format check)
247
+ retry_count += 1
248
+ print(f" ERROR processing request/response (Attempt {retry_count}): {e}")
249
+ wait_time = SLEEP_TIME_PER_API_CALL * (2 ** (retry_count - 1))
250
+ if retry_count > max_retries:
251
+ print(f" Max retries reached for Description {i+1}, Style '{style_key}'. Skipping.")
252
+ response_text = None
253
+ break # Break the retry loop
254
+ print(f" Retrying in {wait_time:.2f} seconds...")
255
+ time.sleep(wait_time)
256
+
257
+ # --- Process and Save Response (Modified with more flexible validation) ---
258
+ if api_call_successful and response_text:
259
+ cleaned_text = response_text.strip()
260
+ print(f" DEBUG - Response starts with: {cleaned_text[:50]}...")
261
+ print(f" DEBUG - Response ends with: ...{cleaned_text[-50:]}")
262
+
263
+ # More flexible check to troubleshoot
264
+ if cleaned_text.startswith("<bos>"):
265
+ if not cleaned_text.endswith("<end_of_turn>"): # The user turn always contains the tag, so check the end of the text
266
+ print(" WARNING: Response does not end with <end_of_turn>. Appending it for consistency.")
267
+ cleaned_text = cleaned_text + "<end_of_turn>"
268
+
269
+ if cleaned_text not in existing_texts:
270
+ data_entry = {"text": cleaned_text}
271
+ json.dump(data_entry, f, ensure_ascii=False)
272
+ f.write('\n')
273
+ existing_texts.add(cleaned_text)
274
+ total_entries_generated_this_run += 1
275
+ print(f" Saved 1 new entry for '{style_key}'.")
276
+ else:
277
+ num_duplicates_skipped_this_run += 1
278
+ print(f" Skipped 1 duplicate entry for '{style_key}'.")
279
+ else:
280
+ print(f" WARNING: Invalid format detected in final generated entry for '{style_key}'.")
281
+ print(f" Failed validation. Raw text: {cleaned_text[:150]}...")
282
+
283
+ elif not api_call_successful:
284
+ print(f" Skipping save for '{style_key}' due to API call failures after retries.")
285
+
286
+ # --- Rate Limiting Pause (after each API call) ---
287
+ if current_api_call < total_api_calls: # Don't sleep after the very last call
288
+ time.sleep(SLEEP_TIME_PER_API_CALL)
289
+
290
+ except KeyboardInterrupt:
291
+ print("\n--- KeyboardInterrupt detected. Stopping script early. ---")
292
+ except Exception as e:
293
+ print(f"\nAn unexpected error occurred outside the main loop: {e}")
294
+ import traceback
295
+ traceback.print_exc()
296
+
297
+ finally:
298
+ print("\n--- Run Finished ---")
299
+ print(f"Total API calls attempted: {current_api_call}")
300
+ print(f"Entries generated in THIS run: {total_entries_generated_this_run}")
301
+ print(f"Duplicates skipped in THIS run: {num_duplicates_skipped_this_run}")
302
+ print(f"Total unique entries now in {OUTPUT_FILENAME}: {len(existing_texts)}")
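The retry loop in this script computes its wait time inline in two separate `except` branches. As a minimal sketch only, the same schedule can be written as one small function. The name `backoff_seconds` and its parameters are hypothetical and do not appear in the commit; the formulas mirror the ones in the script (base delay doubled per failed attempt, with an extra factor for HTTP 429).

```python
from typing import Optional


def backoff_seconds(retry_count: int, base_sleep: float = 2.0,
                    status_code: Optional[int] = None) -> float:
    """Return the delay before the next attempt (hypothetical helper).

    retry_count is the number of failed attempts so far (1 after the first failure).
    base_sleep plays the role of SLEEP_TIME_PER_API_CALL in the script.
    """
    if retry_count < 1:
        raise ValueError("retry_count must be at least 1")
    if status_code == 429:
        # Rate limited: wait considerably longer, using the script's 429 formula.
        return base_sleep * (2 ** retry_count) * 5
    # Any other failure: plain exponential backoff starting at base_sleep.
    return base_sleep * (2 ** (retry_count - 1))


if __name__ == "__main__":
    for attempt in range(1, 4):
        normal = backoff_seconds(attempt)
        limited = backoff_seconds(attempt, status_code=429)
        print(f"attempt {attempt}: {normal:.1f}s normally, {limited:.1f}s after a 429")
```

Keeping the schedule in one place would let both `except` branches share it and makes the delays easy to check without calling the API.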