Platform Engineer

Weekly Rate: $8,500/week

Overview

The Platform Engineer builds and maintains the infrastructure platform that enables development teams to deploy and operate their services efficiently. This role focuses on creating self-service infrastructure, automation tooling, and reliability engineering to support the HYROX system across global events. Rather than operating as a separate "DevOps team," the Platform Engineer is embedded with development teams so that they can own deployment and operation of their services themselves.

This position combines infrastructure expertise with software engineering practices, ensuring that platform capabilities evolve alongside application requirements while maintaining reliability and security standards essential for global competition deployment.

Key Responsibilities

Platform Architecture and Self-Service Infrastructure - Build and maintain self-service infrastructure platforms that enable development teams to deploy and operate services independently. Design platform APIs and interfaces that abstract complexity while providing flexibility and control.

Infrastructure as Code and Automation - Implement Infrastructure as Code using Terraform, Ansible, and similar tools to ensure reproducible, version-controlled infrastructure. Create comprehensive automation that reduces manual operations and enables rapid, reliable deployments.

Container Orchestration and Microservices - Design and manage Kubernetes-based container platforms that support the microservices architecture. Implement service mesh technologies for advanced networking, security, and observability capabilities.

Developer Experience and Tooling - Create developer tooling and automation that accelerates development cycles and improves productivity. Build CI/CD pipelines that enable safe, frequent deployments with automated testing and rollback capabilities.

Observability and Monitoring Platform - Implement comprehensive observability platforms including metrics, logging, and distributed tracing. Ensure teams have visibility into system behavior, performance, and health through intuitive dashboards and alerting.

Site Reliability Engineering - Apply SRE principles to ensure system reliability through error budgets, SLOs, and automated incident response. Design systems for failure with automatic recovery mechanisms and graceful degradation.

Security and Compliance Automation - Embed security into the platform through automated scanning, policy enforcement, and compliance checking. Implement zero-trust networking and secrets management solutions.

Performance and Cost Optimization - Continuously optimize infrastructure performance while managing costs through FinOps practices. Implement auto-scaling, resource optimization, and cost allocation systems.

Disaster Recovery and Business Continuity - Design and implement disaster recovery strategies including backup, replication, and failover procedures. Ensure the platform can withstand and recover from various failure scenarios.

Capacity Planning and Scaling - Plan for growth through data-driven capacity planning and predictive scaling. Ensure the platform can handle peak loads during major HYROX events without degradation.

Required Skills

Platform Engineering Excellence - The candidate has strong experience building robust infrastructure platforms that enable development teams to deploy and operate services efficiently. They have designed self-service systems that reduce operational overhead while maintaining the reliability standards needed for global competition environments. Their experience includes scaling infrastructure to support rapid development cycles without accumulating technical debt that could impact long-term maintainability.

Container Orchestration and Infrastructure - The candidate has comprehensive knowledge of Kubernetes and Docker, with a solid understanding of service mesh concepts, operators, and cluster management practices. They have managed complex microservices architectures that remain resilient when individual components fail. Their Infrastructure as Code experience with Terraform and Ansible enables consistent, reproducible deployments across multiple environments and geographic regions.

DevOps and Cloud Platform Proficiency - The candidate combines practical CI/CD pipeline development experience with cloud platform knowledge (AWS, GCP, or Azure) that supports continuous deployment practices. They understand cloud-native architecture principles and can use managed services effectively while maintaining deployment flexibility. Their monitoring and observability experience provides the system visibility needed to maintain performance under varying operational conditions.

Security and Reliability Practices - The candidate has a solid understanding of security best practices, including zero-trust networking principles and secrets management approaches that protect sensitive athlete data. They have implemented Site Reliability Engineering concepts that help maintain high system availability even when individual components fail unexpectedly. This expertise helps ensure that platform reliability meets the demanding requirements of live competitive events, where system availability is critical.

Phase Allocation

The Platform Engineer has no allocation during Alpha and engages at half capacity (50%) during the Beta phase to establish infrastructure foundations and developer tooling. Full-time involvement during the Gamma and Delta phases ensures robust platform development and reliability engineering implementation. The role continues at 75% allocation during Full Release to support production operations and continuous platform improvement.

Phase          Weekly Rate    Allocation   Duration
Alpha          -              0%           -
Beta           $4,250/week    50%          12 weeks
Gamma          $8,500/week    100%         8 weeks
Delta          $8,500/week    100%         10 weeks
Full Release   $6,375/week    75%          12 weeks
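
The allocation figures above can be cross-checked with a short calculation. The sketch below is illustrative only: it assumes each phase's weekly rate is the $8,500 full rate scaled by the allocation percentage, and that the rate is billed for every week of the listed duration. The names FULL_WEEKLY_RATE, PHASES, and phase_costs are hypothetical and not part of the proposal.

```python
# Figures copied from the phase allocation table: (allocation %, duration in weeks).
# Assumption: weekly rate = full rate * allocation, billed for every listed week.
FULL_WEEKLY_RATE = 8500  # USD per week at 100% allocation

PHASES = {
    "Beta": (50, 12),
    "Gamma": (100, 8),
    "Delta": (100, 10),
    "Full Release": (75, 12),
}


def phase_costs(phases, full_rate=FULL_WEEKLY_RATE):
    """Return {phase: (weekly_rate, phase_total)} in whole dollars."""
    result = {}
    for name, (allocation_pct, weeks) in phases.items():
        weekly_rate = full_rate * allocation_pct // 100
        result[name] = (weekly_rate, weekly_rate * weeks)
    return result


costs = phase_costs(PHASES)
total = sum(phase_total for _, phase_total in costs.values())

for name, (weekly_rate, phase_total) in costs.items():
    print(f"{name}: ${weekly_rate:,}/week -> ${phase_total:,}")
print(f"Total: ${total:,}")
```

Under those assumptions the derived weekly rates match the table ($4,250, $8,500, $8,500, and $6,375), and the phase totals follow directly from rate times duration.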

Deliverables

Self-Service Platform Infrastructure. Comprehensive platform capabilities enabling development teams to provision resources, deploy applications, and manage services independently through automated workflows. This infrastructure reduces operational overhead while maintaining security and compliance standards through policy-as-code implementations.

Infrastructure as Code Templates. Reusable Terraform modules and Kubernetes manifests that standardize infrastructure deployment across environments and regions. These templates ensure consistency, repeatability, and version control for all infrastructure components while enabling rapid provisioning of new environments.

Developer Tooling and Automation. Integrated development environment configurations, CI/CD pipelines, and automated testing frameworks that accelerate development velocity. These tools provide seamless workflows from code commit through production deployment while maintaining quality gates and security scanning throughout the pipeline.

Observability Platform Implementation. Comprehensive monitoring, logging, and tracing infrastructure providing full visibility into system behavior and performance. This platform enables proactive issue detection, rapid troubleshooting, and data-driven optimization decisions through centralized metrics collection and intelligent alerting.

Reliability Engineering Framework. Site Reliability Engineering practices including error budgets, service level objectives, and chaos engineering implementations. This framework ensures system reliability through systematic approaches to incident management, capacity planning, and continuous improvement based on production learnings.

Platform API Documentation. Complete documentation of platform services, APIs, and self-service capabilities enabling teams to effectively utilize platform features. This documentation includes integration guides, best practices, and troubleshooting procedures that reduce support burden and accelerate onboarding.

Cost Optimization Reports. Regular analysis of infrastructure spending with recommendations for cost reduction without compromising performance or reliability. These reports identify optimization opportunities through resource right-sizing, reserved capacity planning, and architectural improvements that reduce operational expenses.

Incident Response Procedures. Structured runbooks and escalation procedures for handling production incidents with minimal service disruption. These procedures define clear roles, communication protocols, and recovery strategies that ensure rapid resolution while maintaining stakeholder visibility throughout the incident lifecycle.

Success Criteria

Rapid Deployment Capability. Achievement of sub-5-minute deployment times from code commit to production availability through optimized CI/CD pipelines. This includes automated testing, security scanning, and progressive rollout strategies that maintain system stability while enabling rapid feature delivery.

Ultra-High Availability. Platform maintains 99.99% availability across all critical services, translating to roughly 52.6 minutes of allowable downtime per year. This reliability is achieved through redundant architectures, automated failover mechanisms, and proactive capacity management that prevents service degradation.

Incident Response Excellence. All production incidents are resolved within one hour of detection, with automated alerting ensuring rapid problem identification. Post-incident reviews drive continuous improvement in platform resilience and response procedures, reducing both incident frequency and resolution time.

Security Compliance Achievement. Zero security breaches throughout the project lifecycle with successful completion of all security audits and penetration tests. Platform security is maintained through automated vulnerability scanning, secrets management, and continuous compliance monitoring across all environments.

Global Deployment Success. Platform successfully supports deployments across multiple geographic regions with consistent performance and reliability. Multi-region architectures ensure low latency for global users while maintaining data sovereignty compliance and disaster recovery capabilities.

Comprehensive Test Automation. Greater than 80% automated test coverage across all platform components including unit, integration, and end-to-end tests. This automation ensures consistent quality validation while enabling rapid development cycles and confident production deployments.
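
For the Ultra-High Availability criterion, the relationship between an availability target and the yearly downtime it allows is simple arithmetic. The sketch below is a minimal illustration, assuming a 365-day year and treating availability as the fraction of minutes in which critical services are up. The function names and the 20-minute figure in the usage line are hypothetical and not taken from the proposal.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_budget_minutes(availability, period_minutes=MINUTES_PER_YEAR):
    """Minutes of downtime allowed in the period for a given availability target."""
    return (1.0 - availability) * period_minutes


def budget_remaining_minutes(availability, downtime_so_far_minutes,
                             period_minutes=MINUTES_PER_YEAR):
    """Unspent downtime budget; a negative value means the target was missed."""
    return downtime_budget_minutes(availability, period_minutes) - downtime_so_far_minutes


budget = downtime_budget_minutes(0.9999)
print(f"Yearly budget at 99.99%: {budget:.2f} minutes")

# Hypothetical example: 20 minutes of downtime already recorded this year.
remaining = budget_remaining_minutes(0.9999, 20)
print(f"Remaining after 20 minutes of downtime: {remaining:.2f} minutes")
```

The same calculation applies to any service level objective: change the availability argument or pass a shorter period, such as the minutes in a single event weekend, to see how much downtime the target leaves.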