For a while, it looked like data science might lose some of its prominence.
Generative AI made it easier for teams to create products with powerful models they did not train themselves. Through platforms such as Amazon Bedrock, product teams could prototype AI assistants, copilots, and agent workflows without starting from a blank modeling problem. The work of creating AI started to look more like connecting models to tools and data.
But building an AI system is not the same as understanding it.
AI systems are useful, but variable. Their behavior depends on more than the prompt or the model.
James Blair, senior product manager at Posit, says the challenge lies in the number of variables involved. “AI systems are inherently non-deterministic. So many different pieces contribute to the eventual output, from model selection to system prompts to tool availability. Even agent turns with the same initial prompt can follow distinctly different journeys.”
In this environment, data science becomes more important than ever. Organizations need statistical rigor to measure AI behavior, validate changes, and decide whether systems are ready for production.
- The rise of non-deterministic and agent-based systems
- Why statistical rigor becomes essential
- How teams approach testing and measurement in practice
- Why data scientists are well-positioned to lead
- The hidden gap between experimentation and production
- Why testing and measurement require the right infrastructure
- Building repeatable workflows for real-world systems
- Why consistency and reproducibility matter
- From analysis to reliable systems
- What leaders should prioritize next
The rise of non-deterministic and agent-based systems
Traditional software is built around predictability. Given the same input under the same conditions, the system should return the same output.
Agent-based AI systems work differently.
An agent may receive a task, use a tool, pull in data, revise its plan, and generate a response. Each step creates another place for behavior to vary, and a small change early in the workflow can alter the final output.
Evaluating an agent means examining the decisions behind the response, from source selection and policy handling to escalation and unsupported claims.
The challenge moves from “Can we build this?” to “Do we understand what this system is doing?”
Why statistical rigor becomes essential
When AI outputs vary, teams need a way to separate acceptable variation from real failure. In customer service, for example, different wording may be fine as long as the agent reaches the right outcome without missing required steps or relying on the wrong information.
Making that distinction requires more than spot checks. Teams need representative evaluation sets, clear success criteria, and a structured way to account for uncertainty. They need enough evidence to tell whether a change reflects real improvement or whether a failure pattern deserves attention.
Blair describes this as an experiment loop: change something in the agent, observe the outcome, and repeat. The hard part is observing those outcomes consistently and meaningfully.
Data scientists are valuable in that process because they know how to design tests, interpret noisy results, and avoid overreacting to anecdotal wins or losses. Without that rigor, organizations can mistake motion for progress: a prompt update improves a few examples, while an aggregate dashboard metric masks the failure mode that carries the most business risk.
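One way to tell a real improvement from noise is a standard two-proportion z-test on pass rates before and after a change. The sketch below is illustrative rather than a prescribed method: the function name and the pass counts are hypothetical, and it assumes the two runs can be treated as independent samples. If both versions are scored on the same cases, a paired test would be the better fit.

```python
import math


def two_proportion_z_test(pass_a, n_a, pass_b, n_b):
    """Compare pass rates of version A and version B.

    Returns (difference in pass rate, two-sided p-value).
    """
    rate_a, rate_b = pass_a / n_a, pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        # Both versions passed (or failed) every case: no evidence of a difference.
        return rate_b - rate_a, 1.0
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return rate_b - rate_a, p_value


# Hypothetical counts: the same 10-point lift on a small and a large test set.
small_diff, small_p = two_proportion_z_test(8, 10, 9, 10)
large_diff, large_p = two_proportion_z_test(160, 200, 180, 200)
```

With these made-up numbers, the lift from 80% to 90% yields a p-value well above 0.05 on 10 cases and below 0.05 on 200 cases. The observed change is identical, but only the larger evaluation set supports calling it an improvement.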
How teams approach testing and measurement in practice
A polished final answer can hide an unreliable execution path. Many teams begin with broad measures of quality or hallucination risk, but those metrics rarely explain what to fix. A system can sound helpful while still choosing the wrong tool or relying on a source that should not govern the response.
Blair cautions that teams often focus too heavily on the final output, even though that output is only the last step in a longer chain. Logging and measuring intermediate steps, including tool calls and decisions, gives teams a better view of what contributed to the result.
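A minimal sketch of what logging intermediate steps could look like, using only the Python standard library. The `Trace` and `TraceStep` classes, the step kinds, and the tool names are hypothetical illustrations, not part of any Posit or AWS API.

```python
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass
class TraceStep:
    kind: str    # for example "tool_call", "decision", or "final_answer"
    name: str
    detail: dict
    timestamp: float = field(default_factory=time.time)


@dataclass
class Trace:
    task_id: str
    steps: list = field(default_factory=list)

    def log(self, kind, name, **detail):
        """Record one intermediate step of the agent run."""
        self.steps.append(TraceStep(kind, name, detail))

    def tools_used(self):
        """Return the tool names in the order they were called."""
        return [step.name for step in self.steps if step.kind == "tool_call"]

    def to_json(self):
        """Serialize the whole trace so it can be stored and reviewed later."""
        return json.dumps(asdict(self))


trace = Trace(task_id="ticket-001")
trace.log("tool_call", "lookup_order", order_id="A1")
trace.log("decision", "choose_policy", policy="standard_refund")
trace.log("tool_call", "issue_refund", amount=25)
trace.log("final_answer", "reply", text="Your refund has been issued.")
```

Storing each run in this shape lets a reviewer ask which tools were called and in what order, without relying on the final answer alone.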
Simon P. Couch, senior software engineer at Posit, adds that teams first need to “see” what went wrong. That requires tooling that lets them drill into evaluation traces, along with “the will to actually sit down and read a bunch of traces.”
To see this kind of evaluation workflow in practice, watch the on-demand webinar, From Wrangling to Insight: Human-in-the-Loop AI for Analytics, where Blair and Shun Mao, senior partner solutions architect at AWS, demonstrate how data scientists can use Positron and Amazon Bedrock to interrogate AI outputs at each step of the pipeline.
Review work turns vague concerns into measurable criteria. Reading traces and reviewing outputs may reveal that an agent handles routine requests well but breaks down when a policy exception appears. From there, teams can create failure categories, test cases, and scoring criteria tied to the actual job the agent is supposed to do.
Application-specific metrics are more useful than broad quality scores because they connect system behavior to the work the system is supposed to perform.
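As an illustration, failure categories and scoring criteria can be written as small checks over a test case and the agent's recorded behavior. Everything in this sketch is an assumption made for the example: the field names, the three categories, and the customer service data are hypothetical.

```python
from collections import Counter


def check_required_steps(case, result):
    missing = set(case["required_steps"]) - set(result["steps"])
    return "missing_required_step" if missing else None


def check_sources(case, result):
    unapproved = set(result["sources"]) - set(case["allowed_sources"])
    return "unapproved_source" if unapproved else None


def check_escalation(case, result):
    if case["must_escalate"] and not result["escalated"]:
        return "missed_escalation"
    return None


CHECKS = (check_required_steps, check_sources, check_escalation)


def score_case(case, result):
    """Return the failure categories triggered by one agent run."""
    labels = [check(case, result) for check in CHECKS]
    return [label for label in labels if label is not None]


def summarize(pairs):
    """Aggregate the pass rate and per-category failure counts."""
    counts = Counter()
    passed = 0
    for case, result in pairs:
        failures = score_case(case, result)
        counts.update(failures)
        if not failures:
            passed += 1
    return {"pass_rate": passed / len(pairs), "failure_counts": dict(counts)}


routine_case = {"required_steps": ["verify_identity"],
                "allowed_sources": ["refund_policy"], "must_escalate": False}
routine_result = {"steps": ["verify_identity", "issue_refund"],
                  "sources": ["refund_policy"], "escalated": False}
exception_case = {"required_steps": ["verify_identity"],
                  "allowed_sources": ["refund_policy"], "must_escalate": True}
exception_result = {"steps": ["issue_refund"],
                    "sources": ["old_faq"], "escalated": False}

report = summarize([(routine_case, routine_result),
                    (exception_case, exception_result)])
```

Counting failures by category, rather than reporting a single score, shows which behavior to fix first.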
Why data scientists are well-positioned to lead
Data scientists have spent years working with many of the challenges that make AI evaluation difficult, from incomplete evidence and imperfect labels to metrics that can mislead.
They know that samples can be biased, that averages can hide important failures, and that not every result is meaningful.
Those skills become increasingly useful as AI systems generate large volumes of traces, outputs, scores, and edge cases. Blair notes that data scientists are especially valuable here because they know how to design representative evaluation sets, choose metrics that reflect system goals, and separate meaningful signal from noise.
Their role can span the full evaluation cycle, from identifying failure patterns and designing test sets to comparing prompts and models through controlled experiments. They also help create metrics tied to business outcomes and decide which changes are safe to ship.
As more organizations build on foundation models, data scientists are moving into a different kind of AI work. They may not be training the base model, but they are helping determine whether a system is ready for real use. How well does it work? For whom? Under what conditions? What changed after the last update?
The hidden gap between experimentation and production
AI prototypes can move quickly because the conditions are forgiving. Teams usually work with a small test set, a clear sense of intended behavior, and fewer production surprises.
A prototype can prove an idea is worth pursuing. It does not show how the system will behave once real users and changing production conditions enter the picture.
Production asks whether the system still works when conditions change, and whether each update improves performance or introduces regressions.
Blair says teams often prototype without creating evaluations alongside the system. When they try to scale, “there is no reliable way to know if changes made to the system are improvements or regressions.”
Avoiding that pattern requires process, not only tooling. Evaluation has to be part of the build cycle from the beginning, with a consistent workflow for defining and reporting results. Without that structure, early analysis stays trapped in one-off experiments.
Why testing and measurement require the right infrastructure
A serious evaluation program can outgrow manual workflows quickly. As test sets grow and system changes pile up, teams need a reliable way to store traces and rerun evaluations after meaningful updates.
At that point, evaluation becomes an infrastructure problem as much as an analysis problem. Teams need enough compute for repeated tests, plus a secure and stable environment that can carry workflows across teams.
On AWS, teams can use services such as Amazon SageMaker for scalable compute and Amazon Bedrock for access to foundation models, with AWS-native security, networking, and monitoring controls supporting enterprise deployment. That infrastructure makes it practical to rerun evaluations across enough scenarios to understand how a system behaves without forcing teams to re-architect their workflows as requirements change.
On the data science side of the same workflow, Posit gives teams a code-first platform where data scientists can programmatically create, manage, and analyze system evaluations in familiar R and Python environments.
Together, Posit and AWS connect the evaluation work data scientists do in code with the infrastructure needed to execute it consistently at scale.
As prompts, models, and surrounding data change, AI systems can drift. Regular evaluations bring the discipline of continuous integration and continuous delivery (CI/CD) to AI systems, catching regressions before they reach users.
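A regression check of this kind can be as simple as comparing the latest evaluation scores against a stored baseline. The sketch below assumes higher scores are better. The metric names, scores, and tolerance are hypothetical, and a real pipeline would choose the tolerance based on how much run-to-run variation the metric shows.

```python
def find_regressions(baseline, current, tolerance=0.02):
    """Return metrics that dropped by more than `tolerance` or went missing.

    Each entry maps a metric name to (baseline score, current score).
    """
    regressions = {}
    for metric, base_score in baseline.items():
        new_score = current.get(metric)
        if new_score is None:
            # A metric that disappeared is treated as a regression.
            regressions[metric] = (base_score, None)
        elif base_score - new_score > tolerance:
            regressions[metric] = (base_score, new_score)
    return regressions


baseline_scores = {"resolution_rate": 0.90, "policy_compliance": 0.97}
current_scores = {"resolution_rate": 0.91, "policy_compliance": 0.90}
regressions = find_regressions(baseline_scores, current_scores)
```

In an automated pipeline, a non-empty result could fail the build so that the change is reviewed before release.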
Building repeatable workflows for real-world systems
Teams need a consistent way to define metrics and compare results across experiments. A one-off notebook or spreadsheet may work during exploration, but it is not enough for production systems that change over time.
Code-based workflows help teams move from manual review to structured evaluation, with tests and reporting that can be reused and inspected. Data scientists also need a place to do that work in R and Python. Posit serves as that development layer, giving teams a workspace where evaluation logic and analysis can live in code instead of scattered one-off processes.
The workflow often starts with close inspection. A data scientist reviews traces, finds a failure pattern, and turns that finding into a test that can be rerun after meaningful changes.
Why consistency and reproducibility matter
Reproducibility is a tricky word in AI because these systems are often non-deterministic. An agent may not take the same path twice, even with the same starting prompt.
For non-deterministic systems, reproducibility means a consistent evaluation process that produces trustworthy evidence, even when individual outputs vary.
Reproducibility depends on keeping the evaluation setup consistent, from the data and model settings to the environment where tests run. Teams also need durable artifacts, including Quarto documents, dashboards, and software outputs, so they can show what was tested and what the results were.
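One lightweight way to keep the setup consistent is to record a fingerprint of the evaluation configuration with every result. This is a sketch under assumptions: the configuration fields and model identifier are hypothetical, and the only environment detail captured here is the Python version.

```python
import hashlib
import json
import platform


def evaluation_fingerprint(config, dataset_rows):
    """Hash the evaluation settings, test data, and Python version together."""
    dataset_digest = hashlib.sha256(
        json.dumps(dataset_rows, sort_keys=True).encode("utf-8")
    ).hexdigest()
    payload = {
        "config": config,
        "dataset_sha256": dataset_digest,
        "python_version": platform.python_version(),
    }
    # Sorting keys makes the hash independent of dictionary ordering.
    canonical = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


rows = [{"id": 1, "input": "Where is my order?"}]
config_a = {"model_id": "example-model-v1", "temperature": 0.2}
config_b = {"temperature": 0.2, "model_id": "example-model-v1"}  # same settings
config_c = {"model_id": "example-model-v1", "temperature": 0.7}  # changed setting

fingerprint_a = evaluation_fingerprint(config_a, rows)
fingerprint_b = evaluation_fingerprint(config_b, rows)
fingerprint_c = evaluation_fingerprint(config_c, rows)
```

Two results with the same fingerprint were produced under the same recorded conditions. A different fingerprint signals that something in the setup changed before scores are compared.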
Data science gives teams a practical way to handle variability. Instead of relying on a single run, teams can use repeated evaluations to estimate average performance and quantify uncertainty. With that evidence, teams can compare results across runs and see whether a change reflects real improvement, expected variation, or a regression.
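As a sketch of that idea, the function below estimates average performance across repeated evaluation runs and uses a percentile bootstrap to attach an interval to it. The function name, the scores, and the choice of bootstrap are illustrative assumptions. Other interval methods would serve the same purpose.

```python
import random
import statistics


def bootstrap_mean_ci(scores, n_resamples=2000, confidence=0.95, seed=0):
    """Return (mean, lower bound, upper bound) from a percentile bootstrap."""
    rng = random.Random(seed)  # fixed seed so the interval itself is repeatable
    resampled_means = sorted(
        statistics.fmean(rng.choices(scores, k=len(scores)))
        for _ in range(n_resamples)
    )
    lower_index = int((1 - confidence) / 2 * n_resamples)
    upper_index = int((1 + confidence) / 2 * n_resamples) - 1
    return (
        statistics.fmean(scores),
        resampled_means[lower_index],
        resampled_means[upper_index],
    )


# Hypothetical pass rates from eight runs of the same evaluation set.
run_scores = [0.82, 0.79, 0.85, 0.80, 0.77, 0.84, 0.81, 0.78]
mean_score, lower, upper = bootstrap_mean_ci(run_scores)
```

If the intervals for two versions of a system overlap heavily, the difference between them may be expected variation rather than a real improvement or regression.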
While tools cannot eliminate non-determinism, the right workflow makes the evaluation process inspectable and trustworthy. Posit gives data scientists a code-based environment for rerunning tests, preserving artifacts, and reviewing how results were produced, while AWS provides the cloud infrastructure to run those workflows at scale.
From analysis to reliable systems
Reports matter when they help teams find problems earlier, test changes safely, and improve systems over time.
The process starts with a deliberate decision to measure before broad release. According to Blair, organizations with stronger testing practices put trusted evaluations in place before agents are used widely. In many cases, defining the evaluation set also clarifies the agent’s purpose, its boundaries, and the conditions that should count as failure.
Continuous measurement helps teams catch problems earlier and test changes more safely. A reliable AI system does not have to produce identical answers every time, but it does need to stay within expected bounds, complete the right work, and fail in ways teams can detect and address.
Data science provides the methods for doing that work.
What leaders should prioritize next
Leaders should start by asking whether evaluations are being created alongside AI systems or added after launch. They should also know whether teams can examine the traces and decisions behind final outputs.
Data science skills need to be part of the evaluation process. AI adoption depends on teams that can design sound experiments, build representative test sets, and quantify uncertainty around meaningful metrics.
Finally, leaders should look for consistency across tools, environments, and workflows. If every team evaluates AI differently, the organization will struggle to compare results or scale what works.
The next phase of AI will belong to organizations that can prove their systems work in the real world, even as both the systems and the world around those systems evolve at an ever-increasing pace.
See how Posit and AWS work together to give data science teams the infrastructure and tools to evaluate AI systems at scale. Explore the Posit + AWS solution


