---
license: apache-2.0
library_name: transformers
tags:
- code
- jupyter
- agent
- data-science
- qwen
- thinking
base_model: Qwen/Qwen3-4B-Thinking-2507
datasets:
- jupyter-agent/jupyter-agent-dataset
language:
- en
- code
pipeline_tag: text-generation
---

# Jupyter Agent Qwen3-4B Thinking

![image/png](https://cdn-uploads.huggingface.co/production/uploads/650ed7adf141bc34f91a12ae/ZyF9foqe5SLECwkq0dOpT.png)

**Jupyter Agent Qwen3-4B Thinking** is a fine-tuned version of [Qwen3-4B-Thinking-2507](https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507) optimized for **agentic data science tasks** in Jupyter notebook environments. The model executes Python code, analyzes datasets, and provides step-by-step reasoning with intermediate computations to solve realistic data analysis problems.

- **Model type:** Causal Language Model (Thinking)
- **Language(s):** English, Python
- **License:** Apache 2.0
- **Finetuned from:** [Qwen/Qwen3-4B-Thinking-2507](https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507)

## Key Features

- **Jupyter-native agent** that lives inside notebook environments
- **Code execution** with pandas, numpy, matplotlib, and other data science libraries
- **Step-by-step reasoning** with intermediate computations and thinking traces
- **Dataset-grounded analysis** trained on real Kaggle notebook workflows
- **Tool calling** for structured code execution and final answer generation

## Performance

On the [DABStep benchmark](https://huggingface.co/spaces/adyen/DABstep) for data science tasks:

| Model | Easy Tasks | Hard Tasks |
|-------|------------|------------|
| Qwen3-4B-Thinking-2507 (Base) | 44.0% | 2.1% |
| **Jupyter Agent Qwen3-4B Thinking** | **70.8%** | **3.4%** |

**State-of-the-art performance** among small models on realistic data analysis tasks.

## Model Sources

- **Repository:** [jupyter-agent](https://github.com/huggingface/jupyter-agent)
- **Dataset:** [jupyter-agent-dataset](https://huggingface.co/datasets/jupyter-agent/jupyter-agent-dataset)
- **Blog post:** [Jupyter Agents: training LLMs to reason with notebooks](https://huggingface.co/blog/jupyter-agent-2)
- **Demo:** [Jupyter Agent 2](https://huggingface.co/spaces/lvwerra/jupyter-agent-2)

## Usage

### Basic Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "jupyter-agent/jupyter-agent-qwen3-4b-thinking"

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# Prepare input
prompt = "Analyze this sales dataset and find the top 3 performing products by revenue."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate response
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=16384
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
```

### Decoding Thinking and Content

For thinking models, you can extract both the reasoning and the final response:

```python
try:
    # Find the end of the thinking section (the </think> token, id 151668)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("Thinking:", thinking_content)
print("Response:", content)
```

### Agentic Usage with Tool Calling

The model works best with proper scaffolding for tool calling:

```python
tools = [
    {
        "type": "function",
        "function": {
            "name": "execute_code",
            "description": "Execute Python code in a Jupyter environment",
            "parameters": {
                "type": "object",
                "properties": {
                    "code": {
                        "type": "string",
                        "description": "Python code to execute"
                    }
                },
                "required": ["code"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "final_answer",
            "description": "Provide the final answer to the question",
            "parameters": {
                "type": "object",
                "properties": {
                    "answer": {
                        "type": "string",
                        "description": "The final answer"
                    }
                },
                "required": ["answer"]
            }
        }
    }
]

# Include tools in the conversation
messages = [
    {
        "role": "system",
        "content": "You are a data science assistant. Use the available tools to analyze data and provide insights."
    },
    {"role": "user", "content": prompt}
]
```
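The definitions above only declare the tools; to get an agent you still need a driver loop that generates a response, routes `execute_code` calls to a kernel, and feeds the results back. The snippet below is a minimal sketch of such a loop, reusing `model`, `tokenizer`, `messages`, and `tools` from the earlier snippets. It is illustrative only: `run_python` is a hypothetical placeholder for a real sandboxed executor, and it assumes the model emits Hermes-style `<tool_call>` JSON blocks, as Qwen chat templates typically do.

```python
import json
import re

def run_python(code: str) -> str:
    """Hypothetical executor: replace with a real sandboxed kernel (e.g. E2B)."""
    raise NotImplementedError

# Assumption: tool calls are rendered as Hermes-style blocks, i.e.
# <tool_call>{"name": ..., "arguments": {...}}</tool_call>
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

for _ in range(10):  # cap the number of agent steps
    text = tokenizer.apply_chat_template(
        messages, tools=tools, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer([text], return_tensors="pt").to(model.device)
    generated = model.generate(**inputs, max_new_tokens=16384)
    reply = tokenizer.decode(
        generated[0][inputs.input_ids.shape[1]:], skip_special_tokens=True
    )

    match = TOOL_CALL_RE.search(reply)
    if match is None:
        break  # no tool call: treat the reply as the final message
    call = json.loads(match.group(1))
    if call["name"] == "final_answer":
        print("Final answer:", call["arguments"]["answer"])
        break
    # Feed the execution result back so the model can keep reasoning.
    # In practice you may want to strip the thinking trace from `reply`
    # before appending it, to keep the context short.
    result = run_python(call["arguments"]["code"])
    messages.append({"role": "assistant", "content": reply})
    messages.append({"role": "tool", "content": result})
```

As the recommendations below stress, the executor should run in an isolated sandbox rather than in your host interpreter.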
## Training Details

### Training Data

The model was fine-tuned on the [Jupyter Agent Dataset](https://huggingface.co/datasets/jupyter-agent/jupyter-agent-dataset), which contains:

- **51,389 synthetic notebooks** (~0.2B tokens out of ~1B total)
- **Dataset-grounded QA pairs** from real Kaggle notebooks
- **Executable reasoning traces** with intermediate computations
- **High-quality educational content** filtered and scored by LLMs

### Training Procedure

- **Base Model:** Qwen3-4B-Thinking-2507
- **Training Method:** Full-parameter fine-tuning (not PEFT)
- **Optimizer:** AdamW with cosine learning rate scheduling
- **Learning Rate:** 5e-6
- **Epochs:** 5 (optimal based on an ablation study)
- **Context Length:** 32,768 tokens
- **Batch Size:** Distributed across multiple GPUs
- **Loss:** Assistant-only loss (`assistant_only_loss=True`)
- **Regularization:** NEFTune noise (α=7) for full-parameter training

### Training Infrastructure

- **Framework:** [TRL](https://github.com/huggingface/trl) with [Transformers](https://github.com/huggingface/transformers)
- **Distributed Training:** DeepSpeed ZeRO-2 across multiple nodes
- **Hardware:** Multi-GPU setup with SLURM orchestration
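For orientation, the hyperparameters above map onto TRL roughly as in the sketch below. This is a minimal illustration assuming a recent TRL version, not the released training recipe: the dataset split, batch sizes, and `bf16` flag are placeholder assumptions, multi-node DeepSpeed launching is omitted, and older TRL releases name the length argument `max_seq_length` instead of `max_length`.

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Placeholder split name; check the dataset card for the actual configs/splits.
dataset = load_dataset("jupyter-agent/jupyter-agent-dataset", split="train")

config = SFTConfig(
    output_dir="jupyter-agent-qwen3-4b-thinking",
    learning_rate=5e-6,
    lr_scheduler_type="cosine",
    num_train_epochs=5,
    max_length=32768,               # context length used for fine-tuning
    assistant_only_loss=True,       # compute loss on assistant turns only
    neftune_noise_alpha=7,          # NEFTune regularization
    per_device_train_batch_size=1,  # placeholder; actual batch size not published
    gradient_accumulation_steps=8,  # placeholder
    bf16=True,                      # assumption
)

trainer = SFTTrainer(
    model="Qwen/Qwen3-4B-Thinking-2507",
    args=config,
    train_dataset=dataset,
)
trainer.train()
```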
## Evaluation

### Benchmark: DABStep

The model was evaluated on [DABStep](https://huggingface.co/spaces/adyen/DABstep), a benchmark for data science agents with realistic tasks involving:

- **Dataset analysis** with pandas and numpy
- **Visualization** with matplotlib/seaborn
- **Statistical analysis** and business insights
- **Multi-step reasoning** with intermediate computations

The model achieves a **26.8 percentage-point improvement** over the base model and an **11.1-point improvement** over scaffolding alone.

*(Figure: DABStep easy-task scores)*

The hard-task score also increases, even though the dataset focuses on easier questions.

*(Figure: DABStep hard-task scores)*

## Limitations and Bias

### Technical Limitations

- **Context window:** Limited to 32K tokens; may struggle with very large notebooks
- **Tool calling format:** Requires specific scaffolding for optimal performance
- **Dataset domains:** Primarily trained on Kaggle-style data science tasks
- **Code execution:** Requires proper sandboxing for safe execution

### Potential Biases

- **Domain bias:** Trained primarily on Kaggle notebooks; may not generalize to all data science workflows
- **Language bias:** Optimized for English and Python, with limited multilingual support
- **Task bias:** Focused on structured data analysis; may underperform on unstructured data tasks

### Recommendations

- Use in **sandboxed environments** like [E2B](https://e2b.dev/) for safe code execution
- **Validate outputs** before using them in production systems
- **Review generated code** for security and correctness
- Consider **domain adaptation** for specialized use cases

## Ethical Considerations

- **Code Safety:** Always execute generated code in secure, isolated environments
- **Data Privacy:** Be cautious when analyzing sensitive datasets
- **Verification:** Validate all analytical conclusions and insights
- **Attribution:** Acknowledge model assistance in data analysis workflows

## Citation

```bibtex
@misc{jupyteragentqwen3thinking,
  title={Jupyter Agent Qwen3-4B Thinking},
  author={Baptiste Colle and Hanna Yukhymenko and Leandro von Werra},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/jupyter-agent/jupyter-agent-qwen3-4b-thinking}
}
```

## Related Work

- **Dataset:** [jupyter-agent-dataset](https://huggingface.co/datasets/jupyter-agent/jupyter-agent-dataset)
- **Non-thinking version:** [jupyter-agent-qwen3-4b-instruct](https://huggingface.co/jupyter-agent/jupyter-agent-qwen3-4b-instruct)
- **Base model:** [Qwen3-4B-Thinking-2507](https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507)
- **Benchmark:** [DABStep](https://huggingface.co/spaces/adyen/DABstep)

*For more details, see our [blog post](https://huggingface.co/blog/jupyter-agent-2) and [GitHub repository](https://github.com/huggingface/jupyter-agent).*