Senior Reliability Engineer

RS21

Albuquerque, NM

Full-time

$145,000 - $175,000 a year

Position Summary

The Senior Reliability Engineer supports RS21's space systems programs by owning the reliability, deployment, and operational continuity of the cloud and hybrid infrastructure behind real-time satellite data processing, telemetry pipelines, and ML-driven anomaly detection systems. This role is the engineering backbone of RS21's operational space platforms, ensuring that systems deployed into defense and commercial satellite environments are stable, observable, and recoverable under real-world operational conditions.

At the Senior level, this person leads the SRE and DevOps practice for assigned space programs independently. They design monitoring and alerting architecture, own the deployment pipeline from code commit to operational environment, define SLOs and error budgets, and partner with software and data engineering teams to ensure systems are built to operate reliably from day one. They understand the constraints of classified and ops-floor environments and apply those constraints practically in every architectural and operational decision.

This role works closely with software engineers, data engineers, ML practitioners, and government stakeholders across RS21's DoD space systems portfolio, including deployments supporting AFRL, Space Force, and satellite operations floor environments. It requires someone who can hold both the engineering rigor of a senior SRE and the operational pragmatism required to deploy into highly regulated, mission-critical settings.

Key Responsibilities

Reliability Engineering & SRE Practice

  • Define and maintain SLOs, SLAs, and error budgets for RS21's space systems platforms, in collaboration with engineering and government stakeholders.
  • Lead incident response for operational platform failures, including triage, root cause analysis, blameless post-mortems, and follow-through on corrective actions.
  • Architect and implement monitoring, alerting, and observability solutions using CloudWatch, CloudTrail, and custom telemetry pipelines that reflect the operational realities of satellite systems.
  • Continuously improve system reliability through load testing, failure injection, chaos engineering practices, and proactive capacity planning.
  • Ensure operational requirements, including latency, throughput, and sustainment, are reflected in platform architecture and delivery plans from the earliest design stages.

Cloud Architecture & Deployment, Space Systems

  • Design, implement, and maintain cloud and hybrid deployment architectures for RS21's space systems platforms, including real-time ML inference pipelines, telemetry ingestion systems, and anomaly detection services.
  • Own the deployment pipeline for space systems software across AWS GovCloud and on-premises or edge-adjacent environments connected to satellite operations floors.
  • Architect containerized workloads using Docker and Kubernetes, including Helm chart development, cluster management, and workload scheduling for latency-sensitive satellite data processing.
  • Contribute to and enforce infrastructure-as-code practices using Terraform or CDK, ensuring all infrastructure is versioned, auditable, and reproducible.
  • Support classified and operationally sensitive deployments, applying zero-trust architecture principles and STIG compliance requirements throughout.

Security, Compliance & Accreditation

  • Lead security architecture reviews for cloud and hybrid infrastructure supporting DoD space programs, applying zero-trust principles and hardening against STIG and FedRAMP requirements.
  • Support ATO processes, RMF documentation, and accreditation activities in collaboration with security, legal, and government partners.
  • Implement IAM policies, cross-account access controls, and audit logging architectures using AWS IAM, CloudTrail, and Macie.
  • Ensure all deployment environments maintain a continuous compliance posture, and flag deviations before they affect accreditation status.

CI/CD & DevOps Practice

  • Design, implement, and maintain CI/CD pipelines for space systems software using GitHub Actions, GitLab CI, or Azure DevOps, including automated testing, security scanning, and deployment gate controls.
  • Establish and enforce branching strategies, deployment promotion gates, and rollback procedures appropriate to operationally sensitive space environments.
  • Partner with software and data engineering teams to embed reliability and security practices into the development lifecycle rather than treating them as post-deployment concerns.
  • Lead the adoption of DataOps and MLOps pipeline standards for RS21's ML-based anomaly detection and predictive maintenance systems deployed in satellite contexts.

Telemetry & Data Pipeline Operations

  • Own the operational reliability of real-time data pipelines ingesting satellite telemetry, including Kinesis, MSK/Kafka, Lambda, and custom streaming architectures.
  • Monitor and optimize pipeline performance, latency, and throughput to meet the real-time processing requirements of satellite operations floor environments.
  • Collaborate with data engineers and ML practitioners to ensure that model inference infrastructure is reliably provisioned, monitored, and recoverable.
  • Develop runbooks and operational playbooks for telemetry pipeline failure scenarios, ensuring operations floor teams can respond without engineering escalation.

Stakeholder Engagement & Operational Translation

  • Translate operational needs and constraints from satellite operations floor teams into clear infrastructure and deployment requirements for engineering teams.
  • Partner with government stakeholders to communicate system health, deployment status, and risk posture clearly and in terms relevant to their operational context.
  • Contribute to program status reporting on infrastructure reliability, deployment readiness, and SLO performance.

Mentorship & Engineering Standards

  • Mentor junior and core engineers on SRE practices, cloud security, observability design, and operational discipline.
  • Contribute to internal engineering standards for deployment, monitoring, and reliability across RS21's broader cloud practice.
  • Lead architecture and design reviews for reliability-sensitive components across assigned programs.

Qualifications

Required

  • Bachelor's degree or equivalent experience in computer science, systems engineering, or a related technical field.
  • 5+ years of experience in site reliability engineering, DevOps, or cloud infrastructure roles, with at least 2 years supporting operationally sensitive or regulated environments.
  • Deep experience with AWS services relevant to reliability, security, and operations: CloudWatch, CloudTrail, IAM, Lambda, ECS, EKS, Kinesis, MSK, and related services.
  • Strong proficiency with Docker and Kubernetes, including Helm chart development and cluster management.
  • Experience designing and maintaining CI/CD pipelines using GitHub Actions, GitLab CI, or Azure DevOps.
  • Solid understanding of infrastructure-as-code using Terraform, CDK, or equivalent.
  • Demonstrated experience with SLO definition, error budget management, and blameless post-mortem culture.
  • Familiarity with zero-trust architecture, STIG compliance, and FedRAMP requirements in cloud deployments.
  • Active security clearance or ability to obtain one. Top Secret preferred.

Preferred

  • AWS certifications: DevOps Engineer Professional, Solutions Architect Associate or Professional, Security Specialty, or Advanced Networking Specialty.
  • Experience supporting ATO processes, RMF documentation, or deployment into classified operational environments.
  • Background in DoD, Space Force, AFRL, or satellite operations environments.
  • Experience with real-time telemetry ingestion and streaming pipeline operations supporting ML inference.
  • Familiarity with MLOps practices and the operational requirements of deployed ML anomaly detection systems.
  • CompTIA Security+ or CISSP certification, particularly for DoD 8570 compliance contexts.

This listing was sourced from the company’s public careers page. If you'd like it removed or updated, please email contact@trueroles.com.