Travis Muhlestein (TravisMuhlestein)
AI & ML interests: all AI & ML interests
Recent Activity
posted an update 2 days ago
From AI demos to production systems: what breaks when agents become autonomous?
A recurring lesson from production AI deployments is that most failures are system failures, not model failures.
As organizations move beyond pilots, challenges increasingly shift toward:
• Agent identity and permissioning (a minimal sketch follows this list)
• Trust boundaries between agents and human operators
• Governance and auditability for autonomous actions
• Security treated as a first-class architectural constraint
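To make the identity, permissioning, and auditability bullets concrete, here is a minimal, illustrative sketch of a deny-by-default permission check with an audit trail on every agent action. The agent, scope, and action names are hypothetical and not tied to any specific framework:

```python
from dataclasses import dataclass, field

# Hypothetical identity and audit objects; a real system would back these
# with verifiable credentials and a central policy store.
@dataclass(frozen=True)
class AgentIdentity:
    agent_id: str
    scopes: frozenset[str]  # e.g. {"crm:read", "email:send"}

@dataclass
class AuditLog:
    entries: list[dict] = field(default_factory=list)

    def record(self, agent: AgentIdentity, action: str, allowed: bool) -> None:
        self.entries.append({"agent": agent.agent_id, "action": action, "allowed": allowed})

def authorize(agent: AgentIdentity, action: str, log: AuditLog) -> bool:
    """Deny-by-default check, with every decision written to the audit trail."""
    allowed = action in agent.scopes
    log.record(agent, action, allowed)
    return allowed

# Usage: this agent may read CRM data but may not send email on its own.
log = AuditLog()
support_bot = AgentIdentity("support-bot-01", frozenset({"crm:read"}))
assert authorize(support_bot, "crm:read", log)
assert not authorize(support_bot, "email:send", log)  # escalate to a human instead
```

The point is less the code than the shape: identity is explicit, authorization is deny-by-default, and every decision lands in an auditable log.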
This recent Fortune article highlights how enterprises are navigating that transition, including work with AWS’s AI Innovation Lab.
Open question for the community:
What architectural patterns or tooling are proving effective for managing identity, permissions, and safety in autonomous or semi-autonomous agent systems in production?
Context: https://fortune.com/2025/12/19/amazon-aws-innovation-lab-aiq/
posted an update about 1 month ago
Calibrating LLM-as-a-Judge: Why Evaluation Needs to Evolve
As AI systems become more agentic and interconnected, evaluation is turning into one of the most important layers of the stack. At GoDaddy, we’ve been studying how LLMs behave when used as evaluators—not generators—and what it takes to trust their judgments.
A few highlights from our latest engineering write-up:
🔹 Raw LLM scores drift and disagree, even on identical inputs
🔹 Calibration curves help stabilize model scoring behavior
🔹 Multi-model consensus reduces single-model bias and variance (a toy sketch of calibration + consensus follows this list)
🔹 These techniques support safer agent-to-agent decision making and strengthen our broader trust infrastructure (ANS, agentic systems, etc.)
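As a toy illustration of the calibration and consensus bullets above (not the code from the write-up), calibration can be as simple as fitting a monotone map from each judge's raw scores onto a shared human-label scale, then averaging the calibrated scores across judges. The judge names and data here are made up:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Toy data: raw 0-10 scores from two hypothetical judge models on the same
# items, plus human quality labels in [0, 1] used to fit the calibrators.
raw_scores = {
    "judge_a": np.array([2.0, 4.5, 6.0, 7.5, 9.0, 9.5]),
    "judge_b": np.array([1.0, 3.0, 5.5, 8.0, 8.5, 9.8]),
}
human_labels = np.array([0.1, 0.3, 0.5, 0.7, 0.85, 0.95])

# 1) Per-judge calibration curve: monotone map from raw score to the human-label scale.
calibrators = {
    name: IsotonicRegression(out_of_bounds="clip").fit(scores, human_labels)
    for name, scores in raw_scores.items()
}

def consensus_score(new_scores: dict[str, float]) -> float:
    """2) Multi-model consensus: average the *calibrated* scores, not the raw ones."""
    calibrated = [calibrators[name].predict([s])[0] for name, s in new_scores.items()]
    return float(np.mean(calibrated))

# Usage: the judges disagree on the raw scale (7.0 vs 8.2) but land close after calibration.
print(consensus_score({"judge_a": 7.0, "judge_b": 8.2}))
```

Isotonic regression is just one way to build a calibration curve; the key property is the monotone mapping onto a shared scale before any cross-model aggregation.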
If you're building agents, autonomous systems, or any pipeline that relies on “AI judging AI,” calibration isn’t optional — it's foundational.
👉 Full write-up: Calibrating Scores of LLM-as-a-Judge
https://www.godaddy.com/resources/news/calibrating-scores-of-llm-as-a-judge
Would love feedback from the HF community:
How are you calibrating or benchmarking model evaluators in your own workflows?
posted an update about 1 month ago
🚀 GoDaddy ANS API Now Live — Bringing Verifiable Identity to the Agent Ecosystem
We just launched the Agent Name Service (ANS) API publicly, along with the new ANS Standards site, extending GoDaddy's decades of internet-scale trust into the emerging world of autonomous agents. ANS provides cryptographically verifiable identity, human-readable names, and policy metadata for agents — designed to work across frameworks like A2A, MCP, and future agent protocols.
What’s new:
🔹 ANS API is open to all developers — generate a GoDaddy API key and start testing registration, discovery, and lifecycle ops (an illustrative registration sketch follows this list).
🔹 ANS Standards Site is live — includes the latest spec, architecture, and implementation guidance.
🔹 Protocol-agnostic adapter layer — supports interoperability without vendor lock-in.
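To give a feel for the developer flow, here is a hedged sketch of what an agent registration call could look like. The endpoint path and payload fields are placeholders assumed for illustration, not the published ANS schema; check the Standards site and API docs for the actual spec and auth details:

```python
import os
import requests

# Illustrative only: the endpoint path and payload fields below are placeholders,
# not the actual ANS API schema; see https://www.agentnameregistry.org/ for the spec.
# GoDaddy's public APIs typically authenticate with an "sso-key key:secret" header;
# confirm the exact scheme for ANS in the developer docs.
API_KEY = os.environ["GODADDY_API_KEY"]
API_SECRET = os.environ["GODADDY_API_SECRET"]

resp = requests.post(
    "https://api.godaddy.com/v1/ans/agents",          # hypothetical endpoint
    headers={"Authorization": f"sso-key {API_KEY}:{API_SECRET}"},
    json={
        "name": "support-bot.example",                # human-readable agent name
        "publicKey": "<base64-encoded signing key>",  # basis for verification
        "protocols": ["a2a", "mcp"],                  # declared protocol support
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```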
Why it matters:
As autonomous agents continue to proliferate, we need neutral, verifiable identity to prevent spoofing, trust rot, and fragmented ecosystems. ANS brings DNS-like discovery and PKI-based validation to the agent economy.
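For readers newer to the PKI side, here is the general idea of key-based validation in a dozen lines, using the Python cryptography package. This illustrates the concept only; it is not the ANS verification flow or message format:

```python
# General idea of PKI-based validation (not the ANS spec): an agent signs a
# challenge with its private key; anyone holding the registered public key can
# verify the signature and reject impersonators.
from cryptography.hazmat.primitives.asymmetric import ed25519
from cryptography.exceptions import InvalidSignature

# Registration time: the agent generates a keypair and registers the public half.
agent_key = ed25519.Ed25519PrivateKey.generate()
registered_public_key = agent_key.public_key()

# Validation time: a counterparty challenges the agent to prove key possession.
challenge = b"prove-you-are:support-bot.example:nonce-42"
signature = agent_key.sign(challenge)

try:
    registered_public_key.verify(signature, challenge)  # raises if forged
    print("identity verified")
except InvalidSignature:
    print("spoofing attempt rejected")
```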
🔗 Links
Standards & docs: https://www.agentnameregistry.org/
API keys: https://developer.godaddy.com/keys
Repo: https://github.com/godaddy/ans-registry
Press release: https://aboutus.godaddy.net/newsroom/news-releases/press-release-details/2025/GoDaddy-advances-trusted-AI-agent-identity-with-ANS-API-and-Standards-site/default.aspx
Would love to hear thoughts from the community:
What should a universal agent identity layer guarantee — and what should it avoid?