Artificial intelligence is reshaping every industry, and unlocking its full potential requires infrastructure that is robust, scalable, secure and observable. As organizations expand their AI initiatives, managing complex workloads and ensuring consistent performance becomes mission critical.
This is where Cisco AI PODs, the fundamental building blocks of Cisco Secure AI Factory with NVIDIA, combined with the deep visibility of Splunk Observability Cloud, deliver a powerful solution for building and running modern AI environments.
Cisco AI POD: The Foundation for AI Innovation
Cisco AI PODs are modular, flexible, and scalable AI infrastructure stacks designed to accelerate the return on AI projects. They enable organizations to quickly deploy production-grade AI environments, but keeping those environments performing optimally requires a complete view of their performance and health.
How can you detect issues as early as possible, resolve them effectively, and focus on achieving business results instead of spending time solving urgent production issues? This is where observability becomes essential.
Splunk Observability: Your Eyes and Ears Inside AI PODs
Splunk Observability Cloud provides end-to-end visibility into every layer of Cisco AI PODs, from physical infrastructure to Kubernetes to the AI applications layer.
It’s not just about data collection. Splunk turns metrics, traces, and logs into actionable insights, helping teams detect, troubleshoot, and resolve issues in seconds.
We’re excited to introduce a new Splunk dashboard purpose-built for observability across the entire AI POD stack.
What the new Splunk dashboard brings to Cisco AI PODs
- Unified monitoring of Kubernetes clusters – Get a single view of all Kubernetes clusters, including Red Hat OpenShift running on AI PODs.
- In-depth host-level insights – Monitor the performance of each Cisco UCS server, including CPU, memory, disk, and network utilization.
- AI POD Infrastructure Dashboard – Track critical metrics such as GPU utilization, GPU memory utilization, and network performance and power, integrating data from Cisco Intersight and Cisco Nexus.
- Advantage of Streaming Analytics – Leverage Splunk’s real-time streaming analytics to achieve faster detection and near-instant “time to glass.”
Although Cisco AI PODs provide a modular, scalable infrastructure for enterprise AI, each AI POD can also be monitored individually, giving teams detailed insight into the performance metrics and workloads of a specific deployment. The screenshots below from the Splunk dashboard for AI PODs illustrate these monitoring capabilities.

By aggregating the number of input and output tokens processed by the large language model (LLM) running on an AI POD, Splunk can calculate an approximate cost for token usage over time:
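The cost estimate described above can be sketched as simple arithmetic over aggregated token counts. The per-token prices and token volumes below are illustrative assumptions, not actual Splunk Observability data or real LLM pricing:

```python
# Hypothetical sketch: approximating LLM cost from aggregated token counts.
# Prices and token counts are illustrative assumptions only.

def estimate_token_cost(input_tokens: int, output_tokens: int,
                        price_in_per_1k: float = 0.0005,
                        price_out_per_1k: float = 0.0015) -> float:
    """Approximate cost for a window of LLM traffic, priced per 1,000 tokens."""
    return (input_tokens / 1000) * price_in_per_1k \
         + (output_tokens / 1000) * price_out_per_1k

# Example: 2M input tokens and 500k output tokens in the window
cost = estimate_token_cost(2_000_000, 500_000)
print(f"approximate cost: ${cost:.2f}")  # -> approximate cost: $1.75
```

In the dashboard, the same idea is applied continuously: token counters are summed over a time window and multiplied by the configured rates to chart cost over time.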
Splunk also pulls metrics from Cisco Intersight to provide visibility into active alarms for the monitored AI POD, as well as key UCS metrics such as host power, temperature, and fan speed:
The Nexus Dashboard provides an overview of the interfaces configured on each Nexus switch, transmission errors and outages, and data transferred between storage and compute nodes:
A real-world scenario: diagnosing LLM latency
Imagine an application running on a Cisco AI POD that uses an LLM to answer user queries. Suddenly, application response times spike. Here’s how Splunk Observability Cloud helps resolve the problem in minutes:
- Alert triggered – Splunk detects high response times and triggers an alert.
- Trace analysis – The service map highlights that most latencies occur in /v1/chat/completions calls to the LLM.
- Infrastructure view – The AI POD dashboard shows that only one of the four available GPUs is active and fully utilized.
- Actionable insights – You reconfigure the workload to use all GPUs, instantly restoring performance.
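The first step above, alerting on elevated response times, can be sketched as a percentile threshold check. In Splunk Observability Cloud this logic would live in a real-time streaming detector; the function names, sample values, and the 2-second threshold below are illustrative assumptions:

```python
# Minimal sketch of a latency alert: fire when the p95 response time for a
# window of requests exceeds a threshold. Threshold and samples are
# hypothetical; a production detector would evaluate streaming metrics.
from statistics import quantiles

def p95(latencies_ms: list[float]) -> float:
    # quantiles(n=20) returns 19 cut points; index 18 is the 95th percentile
    return quantiles(latencies_ms, n=20)[18]

def should_alert(latencies_ms: list[float], threshold_ms: float = 2000.0) -> bool:
    return p95(latencies_ms) > threshold_ms

# Mixed window: mostly fast calls, with a cluster of multi-second outliers
samples = [350, 420, 390, 2600, 2800, 410, 3000, 380, 2900, 400] * 5
print(should_alert(samples))  # -> True
```

A detector like this only tells you *that* latency is high; the trace analysis and infrastructure views in the steps above are what narrow the cause down to the underutilized GPUs.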
The NVIDIA Connection: Powering Intelligent Workloads
Splunk Observability also monitors key components of NVIDIA AI Enterprise, including the NVIDIA NIM Operator and NVIDIA NIM microservices for LLM inference, ensuring optimal performance of the NVIDIA software stack.
FedRAMP and Government Readiness: Splunk Observability’s Path to FedRAMP Moderate
Splunk remains a trusted partner in government digital transformation, enabling agencies to deliver secure, resilient, and intelligent services through cloud and customer-managed solutions. Building on the success of Splunk Cloud Platform, which is authorized at FedRAMP High and DoD Impact Level 5 and listed on the StateRAMP (dba GovRAMP) Authorized Product List, Splunk continues to invest in expanding its FedRAMP program to meet the changing needs of the public sector. As previously announced, Splunk Observability Cloud has received an “In Progress” designation and is awaiting full authorization to operate at the Moderate level from the FedRAMP Program Management Office. Splunk remains committed to supporting the security and mission success of all our government customers.
Observability: a cornerstone of Cisco Secure AI Factory with NVIDIA
In Cisco Secure AI Factory with NVIDIA, observability is not optional: it is fundamental.
By providing deep, real-time insights into infrastructure and applications, Splunk Observability Cloud improves:
- Operational efficiency
- Resource optimization
- Reliability and availability
- Security posture
This comprehensive visibility is essential for building, operating, and securing complex AI pipelines at scale.
Conclusion
Cisco AI PODs provide the robust, scalable infrastructure required for today’s demanding AI workloads. When combined with Splunk Observability Cloud, organizations gain unparalleled visibility and control, enabling rapid troubleshooting, optimal performance, and faster innovation.
Splunk Observability is a central pillar of Cisco Secure AI Factory with NVIDIA, enabling businesses to build and run AI with confidence, speed and security.