AI Platform Excellence: Calendly's New Product Catalyst
How treating AI as infrastructure enables Calendly to ship AI products at high velocity.
Introduction: A Pivotal Year
2026 represents a pivotal moment for Calendly. The company is executing on a product vision grounded in customer value, driven by a fundamental shift in how software gets built. Since August 2025, Calendly's AI platform team has established and scaled an agentic platform that already serves multiple products, with more in the pipeline. This infrastructure doesn't just enable new capabilities—it fundamentally changes how quickly the company can iterate on customer value.
This post isn't about evangelizing AI technology (though the work described wouldn't be possible without it!). Instead, this post explores the tangible benefits Calendly's engineering organization has realized from an AI-first engineering strategy, and how the team kept the transformation practical rather than theoretical. The key insight that made this possible: at product-oriented companies, AI is an engineering and platform direction, not a research direction.
Practical AI vs. Research AI
Before diving deeper, it's important to distinguish between two very different types of work that often get conflated under the "AI" label.
Research AI is exploration. The output is insight: new methods, better models, academic papers, benchmarks, prototypes, and proof that something is theoretically possible. The constraints are intentionally loose because the goal is discovering what's worth building. A prototype may enter a beta phase, but it isn’t designed for scale, because in this mode, success is simply demonstrating that something can work.
Practical AI is product engineering. The output is a capability: something that can be owned, shipped incrementally, measured continuously, debugged at 2am during an incident, rolled back safely when needed, and extended by the next engineer who encounters it. The constraints are unavoidable: latency requirements, cost controls, security and privacy guarantees, uptime commitments, incident response protocols, and the reality that users will interact with the system in unexpected ways. In this mode, success is confidence that the team can make changes to, and maintain, the system every week without fear.
Knowing which mode a team is in is straightforward. If they can't answer "what happens when the model is down?", or if they're measuring success by "we got it to work," they're doing research. If they're measuring success by "we can evolve this safely and rapidly," they're doing practical engineering.
This post focuses squarely on the practical mode: how Calendly's team treated AI as infrastructure and built a system where multiple products can ship without reinventing reliability each time.
The Observation: AI as Platform Engineering
In many organizations, "AI-first" gets associated with endless prototyping, perpetual research backlogs, and high-risk strategies. Calendly's approach has been very different. By treating AI as platform engineering—centralized, cross-cutting capabilities designed for extension and reuse—the team has shipped multiple production AI surfaces while simultaneously improving the metrics developers care about: pull request throughput, change confidence, release safety, local iteration speed, and incremental delivery capability.
Both outcomes—production AI surfaces and improved developer experience metrics—are measurable and developer-focused, which is precisely what differentiates AI-first engineering at Calendly. The team tracks developer experience with explicit metrics, regular retrospectives, and triaged improvement areas, much the same way product health gets monitored.
AI-First Engineering at Calendly
AI-first engineering at Calendly centers on three principles:
AI work becomes scalable when it becomes infrastructure.
Emphasize building things that are repeatable, testable, observable, and reusable.
If a capability can't be reused across products, it's probably not a platform primitive yet—it's a prototype.
These principles led the team to establish a shared AI platform with horizontal capabilities. Multiple products can now build on this foundation without reinventing safety protocols, reliability patterns, or iteration frameworks.
The simplified architecture follows a consistent pattern. Product surfaces send requests with intent and context to an AI framework gateway. That gateway coordinates validation, context building, tool registry access, model provisioning, and telemetry. Each of these components has clear boundaries and responsibilities.
The exact model, separation of concerns, and connection topology matter less than designing for maintainability. Once there are well-defined boundaries—for example, distinct stages for context building, tool calling, model invocation, validation, and telemetry—teams can ship faster because everyone knows what they're building. Concerns are separated at each component, and the number of potential failure modes is reduced.
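The staged pattern above can be sketched in a few lines. This is a minimal illustration, not Calendly's actual implementation: the class names, stage functions, and the `handle` entry point are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class AIRequest:
    intent: str
    context: dict[str, Any] = field(default_factory=dict)

@dataclass
class AIResponse:
    output: str = ""
    trace: list[str] = field(default_factory=list)

class Gateway:
    """Coordinates each stage behind one entry point, so product
    surfaces never talk to models or tools directly."""

    def __init__(self, stages: list[Callable[[AIRequest, AIResponse], None]]):
        self.stages = stages

    def handle(self, request: AIRequest) -> AIResponse:
        response = AIResponse()
        for stage in self.stages:
            stage(request, response)               # each stage owns one concern
            response.trace.append(stage.__name__)  # telemetry per stage
        return response

def build_context(req: AIRequest, res: AIResponse) -> None:
    req.context.setdefault("history", [])

def invoke_model(req: AIRequest, res: AIResponse) -> None:
    res.output = f"handled:{req.intent}"           # stand-in for a model call

def validate_output(req: AIRequest, res: AIResponse) -> None:
    if not res.output:
        res.output = "I can't do that yet."        # safe fallback

gateway = Gateway([build_context, invoke_model, validate_output])
result = gateway.handle(AIRequest(intent="summarize"))
```

The point of the sketch is the boundary, not the bodies: because every request passes through the same ordered stages, each stage can be tested, observed, and replaced independently.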
Observability: The Foundation of Confidence
A critical prerequisite for scaling AI products has little to do with AI itself: observability. It's not a nice-to-have; it's foundational to building dependable, maintainable systems. The faster a team can diagnose why a system produced a given outcome, the faster they can ship—they trust their ability to pinpoint and resolve issues quickly.
Telemetry that Answers Questions
In traditional software systems, bugs can often be reproduced locally. In AI systems, fixing bugs requires answering questions:
What context did the model have?
What tools did it call, and what responses did it receive?
What policy path did the request take?
What did validation reject?
Did a latency regression correlate with a new tool call pattern?
If these questions can't be answered, the problem isn't an "AI problem"—it's an operability problem. So Calendly's team designs traces to answer the questions that matter to engineers:
Did failure occur in inputs (intent/context), tools, the model, or outputs?
Was it a hard failure or a soft failure?
What components were running, at what times, and for how long?
Trace waterfalls for agentic systems executing plan-act loops precisely capture the behavior of the system—what ran in sequence, what ran in parallel, and what waited on what. In an agentic workflow, that concurrency is often where complexity hides, and it’s where incident diagnosis falls apart without observability. This level of detail enables debugging AI systems with the same rigor that traditional debuggers provide for conventional software.
Questions to Answer During Incidents
When something breaks in an AI system, the worst-case scenario isn't "the model was wrong." It's "we have no idea why this broke, and no safe way to change it." In reality, the team needs to answer several questions within minutes:
What changed? Code, prompt configuration, tool schema, model provider, or routing flag?
Who is impacted? Tenant, surface, request type, locale, or permissions?
Where did the failure originate? Context build, tool call, inference, validation, or formatting?
Which metrics are impacted? Quality, latency, cost, security or safety?
What's the safest mitigation? Disable a tool, reduce context, route to a different provider, adjust validators, or fall back gracefully?
Recent Case Study
A recent example illustrates the value of comprehensive tracing. The team discovered that runs failing input validation were still executing tool calls. Because traces captured the validator and model running in parallel, engineers quickly identified the root cause: execution state wasn't tracked tightly enough, so tool calls proceeded even when inputs failed validation. As a mitigation, the team blocked execution on validation failure while implementing better state management.
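The mitigation amounts to making validation an explicit execution state that gates tool calls. A hedged sketch of that idea, with invented names:

```python
from enum import Enum, auto

class RunState(Enum):
    PENDING = auto()
    VALIDATED = auto()
    REJECTED = auto()

class Run:
    def __init__(self) -> None:
        self.state = RunState.PENDING
        self.tool_calls: list[str] = []

    def validate_inputs(self, inputs: dict) -> None:
        ok = bool(inputs.get("intent"))   # illustrative validation rule
        self.state = RunState.VALIDATED if ok else RunState.REJECTED

    def call_tool(self, name: str) -> None:
        # Blocking execution: refuse tool calls unless validation passed,
        # instead of racing the validator and the model in parallel.
        if self.state is not RunState.VALIDATED:
            raise RuntimeError("tool call blocked: inputs not validated")
        self.tool_calls.append(name)

run = Run()
run.validate_inputs({})                   # missing intent: rejected
blocked = False
try:
    run.call_tool("calendar.lookup")
except RuntimeError:
    blocked = True
```

The trade-off is latency (validation now sits on the critical path), which is why the team treated blocking as a fallback while building richer state management.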
Good traces reveal tool call counts per request, tool call duration distributions, retry and cancellation patterns, and processing state over time. This level of detail transforms incident response efforts from guesswork into systematic problem-solving.
Shipping Changes Safely at High Velocity
This is where the practical versus research distinction becomes most apparent. If a team can't ship changes safely, they can't scale. Several habits have proven valuable for Calendly's AI platform team:
Treat prompts and configuration like code. They're versioned, code-reviewed, benchmarked, and diffable. If a change affects behavior, it deserves the same scrutiny as code changes.
Ship incrementally. Use canaries, staged rollouts, and feature flags per surface or tenant. No big-bang deployments.
Establish rollback criteria before shipping. Before deploying, decide which metrics force a rollback: latency thresholds, validation failure rates, user friction indicators, etc.
Design for fallback. A safe "I can't do that yet" response is better than a confident wrong answer. Fallbacks should be measurable, not silent.
Use strict typing and static analysis. This is invaluable for enforcing type validation and contract checks, especially at component boundaries.
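Two of the habits above—measurable fallbacks and rollback criteria decided before shipping—can be sketched together. The threshold, metric names, and failure condition below are assumptions for the example, not real production values.

```python
from dataclasses import dataclass

@dataclass
class RolloutMetrics:
    requests: int = 0
    fallbacks: int = 0

    @property
    def fallback_rate(self) -> float:
        return self.fallbacks / self.requests if self.requests else 0.0

MAX_FALLBACK_RATE = 0.05   # agreed before deploy: above this, roll back

def answer(query: str, metrics: RolloutMetrics) -> str:
    metrics.requests += 1
    try:
        if "unsupported" in query:        # stand-in for a model/tool error
            raise ValueError("capability missing")
        return f"answer:{query}"
    except ValueError:
        metrics.fallbacks += 1            # counted, so dashboards see it
        return "I can't do that yet."     # safe, honest fallback

def should_roll_back(metrics: RolloutMetrics) -> bool:
    return metrics.fallback_rate > MAX_FALLBACK_RATE

m = RolloutMetrics()
answer("schedule a meeting", m)
answer("unsupported request", m)
```

Because the fallback increments a counter rather than failing silently, the rollback decision is mechanical: compare the rate against the threshold agreed before the deploy.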
Evaluations: Testing for Behavioral Compliance
Evaluations—"evals"—are straightforward benchmarks that test for behavioral compliance under simple metrics. The best evals can be tested deterministically, but for subjective measures like quality or completeness, using an LLM as a judge is acceptable. Calendly's team keeps scoring as simple as possible, with a preference for binary judgments, and uses alignment sets to ensure each eval judge is calibrated to human raters.
These aren't research benchmarks, and the goal isn't achieving a perfect score. Evals serve as regression tests for behavioral changes. They protect against unintended consequences.
The platform employs offline evals on curated sets and online evals on sampled production traffic to test for quality regressions and compare behavior across versions. The goal is answering "what broke?" without waiting for customers to report issues. Online evals track metrics such as the rate of outputs flagged for refusal, providing early warning signals of degraded behavior.
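A deterministic, binary eval suite can be very small. The cases, checks, and the stand-in model below are invented for illustration; the shape—pass/fail checks whose aggregate score is tracked across versions—is what the text describes.

```python
from typing import Callable

# Each case pairs an input with a binary pass/fail check.
EvalCase = tuple[str, Callable[[str], bool]]

CASES: list[EvalCase] = [
    ("What's on my calendar?", lambda out: "calendar" in out.lower()),
    ("Delete all my meetings", lambda out: "can't" in out.lower()),  # expect refusal
]

def fake_model(prompt: str) -> str:
    # Stand-in for the real system under test.
    if "delete" in prompt.lower():
        return "I can't do that yet."
    return f"Here is your calendar: {prompt}"

def run_evals(model: Callable[[str], str]) -> float:
    passed = sum(1 for prompt, check in CASES if check(model(prompt)))
    return passed / len(CASES)   # tracked across versions, not perfected

score = run_evals(fake_model)
```

Run against every candidate change, a score drop flags a behavioral regression—for example, the refusal case above would catch a version that stops declining destructive requests—without waiting for customer reports.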
Closing the Loop
Dogfooding as a Competitive Advantage
One of the most underappreciated advantages of building AI business tools inside a product company is dogfooding. When taken seriously, it delivers compound value. Dogfooding does more than surface bugs faster—it helps accelerate product development using the product itself.
Calendly’s AI Notetaker, for instance, has become integral to how our engineering teams work. Here are some notable use cases:
Get context quickly on complex work without scheduling additional meetings.
Track and complete action items, even for meetings team members couldn't attend.
Surface insights over time to identify growth opportunities and recurring challenges.
Accelerate onboarding and knowledge transfer by making team context queryable.
Prepare for smooth handoffs when team members join or depart by aggregating the team's meeting history.
The team has begun enabling AI systems to reflect on their limitations and suggest their own capability improvements. This transforms real system friction into product insight, shortening iteration cycles and improving signal quality.
Agents can now indicate when they lack the right tool for a task, proactively identifying opportunities to expand their use cases. Data validation layers surface “desire paths,” converting hallucinations into structured requests for schema updates and making the system progressively more AI-friendly. Meanwhile, Notetaker attends every team meeting—capturing context, tracking progress in real time, and ensuring critical insights are never lost.
Integration with Company Values
AI-first will manifest differently for each business, but the working model described here isn't accidental. It is an intentional approach that reflects Calendly's company values:
Focus on Impact: Choose work that compounds across products instead of chasing novelty.
One Team: Build paved paths so other teams can ship without dependencies.
Customer Experience Obsessed: Treat customer experience measures like quality, latency, and hallucinations as product defects and invest accordingly.
Find A Way: Ship incrementally under constraints, learn fast, and continuously improve the system.
Conclusion
This is the core lesson from Calendly’s AI platform journey: at product companies, scale comes from treating AI as platform engineering. This means investing in centralized infrastructure, building functionality that compounds, and shipping incremental changes with high confidence.
It’s how a small team of machine learning engineers can build multiple AI products simultaneously—without sacrificing the developer experience metrics that often erode during major platform transitions.
This success has very little to do with “using AI”. It came from leaning into who Calendly already is—a product-led company—and building a foundation where scale is a natural byproduct of the architecture itself.
Next Steps
The team is launching and scaling several product features over the coming months while exploring new directions. They're continuing to add capabilities to LLM-enabled offerings and actively working to expand the ecosystem and integrations to meet customers where they work. Planned work includes further system performance improvements and an initiative to automate prompt engineering, reducing friction around model updates. As of this writing, the team is hiring.
Acknowledgments
Engineering Enablement deserves recognition for assembling an excellent monorepo template and core infrastructure. Product Security and QA have provided continuous feedback and testing. Product, Design, and Analytics teams own customer research and strategic direction. Leadership creates the conditions for success by bringing these teams together and providing autonomy for self-governance. DX provides amazing tools for continuously improving how work gets done. Calendly's customers provide incredible feedback and purpose. And the open source community makes all of this work possible.