A research team at NVIDIA has developed an artificial intelligence system that approaches data analysis with the methodical reasoning of a human data scientist. The project, called the KGMON Data Explorer, has taken first place on the demanding DABStep benchmark, outperforming previous models from Google and Ant Group while completing tasks thirty times faster than a standard baseline.
The system addresses a persistent gap in AI research: most language models are trained on text, leaving them poorly equipped to handle structured, numerical data in spreadsheets and databases. This new agent, built on NVIDIA's NeMo Agent Toolkit, doesn't just answer questions—it plans and executes multi-step analytical workflows, generating and running its own code to explore datasets.
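The article does not publish the agent's internals, but the plan-and-execute pattern it describes can be sketched in a few lines. In this illustrative example, the hard-coded "plan" strings stand in for code a language model would generate, and the shared namespace plays the role of a notebook session; none of the names below come from the NeMo Agent Toolkit.

```python
# A minimal, illustrative sketch of a plan-and-execute analysis workflow.
# The generated "plan" steps stand in for code a language model would write.

def run_workflow(rows, steps):
    """Run generated code steps in one shared namespace, notebook-style."""
    namespace = {"rows": rows, "results": {}}
    for code in steps:
        exec(code, namespace)  # each step can read results of earlier steps
    return namespace["results"]

rows = [
    {"merchant": "A", "fee": 1.0},
    {"merchant": "A", "fee": 2.0},
    {"merchant": "B", "fee": 4.0},
]
plan = [
    "results['total_fee'] = sum(r['fee'] for r in rows)",
    "results['by_merchant'] = {m: sum(r['fee'] for r in rows if r['merchant'] == m)"
    " for m in {r['merchant'] for r in rows}}",
]
print(run_workflow(rows, plan))
```

Because every step executes in the same namespace, later steps can build on earlier intermediate results, which is what distinguishes a multi-step workflow from answering one query in isolation.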
"The breakthrough was structuring the process like a human expert would," explained a member of the Kaggle Grandmasters research team behind the project. "Instead of solving each new problem from scratch, the system spends an initial learning phase studying sample tasks to build a toolkit of reusable functions. Once that library is established, a lighter, faster model can use those tools to solve new problems almost instantly."
On the DABStep benchmark, which focuses on complex financial data queries, this approach proved decisive. While simpler tasks saw comparable performance across models, the NVIDIA agent excelled on difficult, multi-step problems, achieving a score of nearly 90% where a powerful Claude Opus baseline scored 67%. Critically, it produced this result in 20 seconds per task, compared to 10 minutes for the baseline.
The architecture employs different agent designs for different jobs. For open-ended data exploration, it uses a ReAct agent paired with a Jupyter notebook tool, creating an interactive analysis loop. For structured question-answering, a tool-calling agent orchestrates a suite of specialized utilities. An offline review phase, using more powerful models, audits the system's work and feeds improvements back into the process without slowing down live inference.
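The ReAct-plus-notebook pairing can be sketched as a thought-action-observation loop. Here a scripted trace stands in for the language model, and the "notebook tool" is just persistent `exec`; the loop shape and tool names are illustrative assumptions, not the toolkit's real interface.

```python
# Minimal ReAct-style loop with a code-execution "notebook" tool.
# A scripted agent trace replaces the language model for this sketch.

def notebook_tool(code, state):
    """Execute code in a persistent namespace and return an observation."""
    exec(code, state)
    return state.get("observation")

def react_loop(agent_steps, max_turns=5):
    state = {}  # persists across turns, like a live notebook kernel
    for thought, action in agent_steps[:max_turns]:
        if action is None:                         # agent decides it is done
            return thought
        observation = notebook_tool(action, state)  # Act, then Observe
    return None

# Scripted trace: explore the data first, then compute, then answer.
steps = [
    ("I should inspect the fees first.",
     "fees = [1.0, 2.0, 4.0]; observation = f'{len(fees)} rows'"),
    ("Now compute the total.",
     "observation = sum(fees)"),
    ("The total fee is 7.0.", None),
]
print(react_loop(steps))
```

The persistent state is the design point: because each action runs in the same kernel-like namespace, the agent can explore incrementally instead of regenerating a full analysis on every turn.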
This separation of learning, execution, and review mirrors professional practice. The agent first builds robust, generalized code, then applies it efficiently, and finally learns from its own outputs. The result is an AI that doesn't just calculate—it reasons, plans, and refactors its own approach, setting a new standard for automated data analysis.
Source: Hugging Face Blog