Software Engineer II

Microsoft
United States, Washington, Redmond
Oct 28, 2025
Overview

Be at the forefront of Microsoft's AI revolution. The CoreAI organization at Microsoft builds the end-to-end Azure AI stack that powers Microsoft's AI innovation and differentiation. We operate the global Azure AI infrastructure that runs some of the largest AI workloads on the planet. We don't just value different perspectives; we seek them out and bring them together to better serve our customers.

Within CoreAI, the Azure SRE Agent Platform team designs, builds, and operates production AI agents that keep Azure's app platforms healthy, fast, and secure. This team thrives in a very agile environment: short cycles, thin slices, feature flags, progressive delivery, and constant learning. We pair SRE fundamentals (SLOs, automation, incident response) with agentic systems (planning/execution loops, tool orchestration, evaluators, safety guardrails). If you like turning fuzzy problem statements into code that ships this week, you'll fit right in. We are seeking a Software Engineer II to help advance these capabilities in a fast, iterative environment.

Microsoft's mission is to empower every person and every organization on the planet to achieve more. As employees, we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.

Responsibilities

Design & implementation: Contribute to the architecture and delivery of SRE agents and platform services. Author design docs, threat models, and rollout plans for scoped features, and build those features.

Applied AI for reliability: Build LLM-powered detection, triage, mitigation, and post-incident learning loops; integrate evaluation frameworks and safety guardrails.

SRE fundamentals at scale: Define SLIs/SLOs and error budgets; connect them to alerting, release gates, and agent action limits to reduce MTTR and change-fail rate.

Progressive delivery: Implement feature flags, canaries, and staged rollouts; run shadow and A/B experiments with baked-in evaluations as part of safe rollouts.

Runbooks-as-code: Convert on-call procedures and "pager" into policy and automated mitigations; maintain clear playbooks and tooling.

Operations ownership: Participate in on-call, mitigate live incidents, and drive post-incident reviews with iterative hardening.

Optimize, debug, and establish best practices for performance, cost, and latency across agents and platform components.

Conduct code and design reviews to ensure adherence to standards, and resolve issues proactively using telemetry and diagnostics.

Stay current on AI, SRE, and Kubernetes advancements and relevant regulations, while fostering collaboration across teams to meet customer and partner needs.