AI Infrastructure & Operations Engineer

Cisco Systems, Inc.
$212,300.00 to $275,800.00
life insurance, vision insurance, parental leave, paid holidays, sick time, 401(k)
United States, California, San Jose
170 W Tasman Dr
Feb 07, 2026
The application window is expected to close on: 03/06/2026. Job posting may be removed earlier if the position is filled or if a sufficient number of applications are received.

Meet the Team

Join Cisco's AI Platform Team to shape the future of enterprise AI infrastructure. We're building next-generation systems that power intelligent, secure, and scalable AI solutions across Cisco's ecosystem, from advancing Small Language Model capabilities and building Agentic frameworks to ensuring trustworthy AI deployment at scale. You'll join as a core member of our team, with the opportunity to influence our culture, technical direction, and operational excellence. In this role, you'll work with other infrastructure and operations engineers to own the infrastructure that powers AI model training and deployment across multiple GPU clusters (AWS, GCP, and Cisco IT), build operational tooling, and design systems that handle hundreds to thousands of GPUs. If you're passionate about production-grade AI infrastructure and enterprise-scale reliability, this is where you'll make it happen.

Your Impact

As an AI Infrastructure & Operations Engineer, you will work with the infrastructure team to own the reliability, scalability, and operational excellence of our multi-cluster GPU infrastructure that powers enterprise AI model development.

Build and operate multi-cluster GPU infrastructure: Stand up, configure, and operate GPU clusters across AWS, GCP, and Cisco IT, scaling from tens to hundreds (eventually thousands) of GPUs with high throughput and cost-efficiency.
Own platform reliability and operations: Establish SLOs, implement monitoring/alerting, build operational runbooks, and drive incident response to ensure enterprise-grade uptime.
Optimize costs and utilization: Implement cost optimization strategies (Spot instances, fractional GPU configs, autoscaling) and scheduling policies to maximize cluster utilization and ROI.
Build infrastructure automation: Develop automation for cluster provisioning, deployment pipelines, and lifecycle management using Infrastructure-as-Code (Terraform) and CI/CD best practices.
Enable distributed AI workloads: Configure and optimize networking for multi-node training (RDMA, EFA, NCCL), implement storage abstractions for large datasets, and ensure high-bandwidth GPU communication.
Ensure security and compliance: Implement multi-tenant GPU isolation, namespace-level security policies, and access control mechanisms for enterprise workloads.
Support model training workflows: Partner with ML engineers and researchers to ensure infrastructure supports their needs, from custom runtimes to storage performance and network bandwidth.

Minimum Qualifications

BS/MS in Computer Science, Engineering, or related technical field with 5+ years of experience in infrastructure engineering, DevOps, SRE, or platform engineering, or equivalent practical experience.
Strong experience with Kubernetes orchestration, container technologies (Docker), and cloud-native infrastructure patterns.
Hands-on experience managing production infrastructure on at least one major cloud provider (AWS, GCP, or Azure).
Proficiency with Infrastructure-as-Code tools such as Terraform, OpenTofu, or CloudFormation.
Experience with observability tools: Prometheus, Grafana, or similar for metrics, logging, and alerting.
Strong scripting and automation skills in Python, Bash, or Go.
Understanding of networking fundamentals: VPCs, load balancers, DNS, firewalls, and cross-cloud connectivity.
Experience with CI/CD pipelines and automation (GitHub Actions, GitLab CI, Jenkins, or ArgoCD).
Ability to troubleshoot complex distributed systems issues.

Preferred Qualifications

Experience with GPU infrastructure and AI/ML workloads: Ray clusters, Kubeflow, MLflow, or similar platforms.
Hands-on experience with GPU orchestration: configuring NVIDIA GPUs (A100, H100), managing GPU drivers and CUDA runtimes.
Knowledge of distributed training networking: RDMA, InfiniBand, EFA, NCCL.
Experience with multi-cloud infrastructure: cross-cloud networking, unified storage abstractions, disaster recovery.
Familiarity with cost optimization strategies: Spot instances, Reserved Instances, Savings Plans, and FinOps practices.
Experience building SRE practices: SLOs/SLIs, on-call rotations, incident management, and operational runbooks.
Track record of scaling infrastructure from prototype to production.

Why Cisco?

At Cisco, we're revolutionizing how data and infrastructure connect and protect organizations in the AI era - and beyond. We've been innovating fearlessly for 40 years to create solutions that power how humans and technology work together across the physical and digital worlds. These solutions provide customers with unparalleled security, visibility, and insights across the entire digital footprint. Fueled by the depth and breadth of our technology, we experiment and create meaningful solutions. Add to that our worldwide network of doers and experts, and you'll see that the opportunities to grow and build are limitless. We work as a team, collaborating with empathy to make really big things happen on a global scale. Because our solutions are everywhere, our impact is everywhere. We are Cisco, and our power starts with you.

Message to applicants applying to work in the U.S. and/or Canada:

The starting salary range posted for this position is $212,300.00 to $275,800.00 and reflects the projected salary range for new hires in this position in U.S. and/or Canada locations, not including incentive compensation, equity, or benefits. Individual pay is determined by the candidate's hiring location, market conditions, job-related skillset, experience, qualifications, education, certifications, and/or training. The full salary range for certain locations is listed below.
For locations not listed below, the recruiter can share more details about compensation for the role in your location during the hiring process.

U.S. employees are offered benefits, subject to Cisco's plan eligibility rules, which include medical, dental and vision insurance, a 401(k) plan with a Cisco matching contribution, paid parental leave, short and long-term disability coverage, and basic life insurance. Please see the Cisco careers site to discover more benefits and perks. Employees may be eligible to receive grants of Cisco restricted stock units, which vest following continued employment with Cisco for defined periods of time.

U.S. employees are eligible for paid time away as described below, subject to Cisco's policies:

10 paid holidays per full calendar year, plus 1 floating holiday for non-exempt employees
1 paid day off for employee's birthday, paid year-end holiday shutdown, and 4 paid days off for personal wellness determined by Cisco
Non-exempt employees* receive 16 days of paid vacation time per full calendar year, accrued at rate of 4.92 hours per pay period for full-time employees
Exempt employees participate in Cisco's flexible vacation time off program, which has no defined limit on how much vacation time eligible employees may use (subject to availability and some business limitations)
80 hours of sick time off provided on hire date and each January 1st thereafter, and up to 80 hours of unused sick time carried forward from one calendar year to the next
Additional paid time away may be requested to deal with critical or emergency issues for family members
Optional 10 paid days per full calendar year to volunteer

For non-sales roles, employees are also eligible to earn annual bonuses subject to Cisco's policies. Employees on sales plans earn performance-based incentive pay on top of their base salary, which is split between quota and non-quota components, subject to the applicable Cisco plan.
For quota-based incentive pay, Cisco typically pays as follows:

0.75% of incentive target for each 1% of revenue attainment up to 50% of quota;
1.5% of incentive target for each 1% of attainment between 50% and 75%;
1% of incentive target for each 1% of attainment between 75% and 100%; and
Once performance exceeds 100% attainment, incentive rates are at or above 1% for each 1% of attainment with no cap on incentive compensation.

For non-quota-based sales performance elements such as strategic sales objectives, Cisco may pay 0% up to 125% of target. Cisco sales plans do not have a minimum threshold of performance for sales incentive compensation to be paid.

The applicable full salary ranges for this position, by specific state, are listed below:

New York City Metro Area: $212,300.00 - $317,100.00
Non-Metro New York state & Washington state: $193,800.00 - $282,100.00

* For quota-based sales roles on Cisco's sales plan, the ranges provided in this posting include base pay and sales target incentive compensation combined.
** Employees in Illinois, whether exempt or non-exempt, will participate in a unique time off program to meet local requirements.