Member of Technical Staff – DevOps, Infrastructure Engineering

October 11

FirstPrinciples Holding Company

B2B • Enterprise • Finance

FirstPrinciples Holding Company focuses on building and scaling a portfolio of successful commercial businesses. The company leverages strategic insight and operational expertise to maximize value and growth within its portfolio companies. FirstPrinciples aims to provide sustainable solutions that drive long-term success for its partners and stakeholders.

51 - 200 employees

📋 Description

• Architect, automate, and scale the infrastructure for large-scale model training and research workflows.
• Design and run large-scale pre-training experiments for both dense and MoE architectures.
• Architect hybrid infrastructure solutions that span cloud and on-premises HPC environments.
• Automate configuration management and drift detection using tools like Ansible, Salt, or Chef.
• Build systems that reduce operational toil and establish guardrails for researchers.
• Build and own comprehensive CI/CD pipelines for training workflows, evaluation jobs, internal tools, and services with rollback capabilities.
• Develop tooling for developer workflows including reproducible builds, ephemeral environments, secrets management, and cluster resource allocation.
• Create self-service infrastructure patterns that empower researchers and engineers.
• Design infrastructure that accelerates experimentation while maintaining reliability and reproducibility.
• Manage and extend HPC environments including GPU clusters, InfiniBand networks, job schedulers (Slurm/Kubernetes hybrid), and container orchestration.
• Operate containerized and scheduled workloads efficiently across Docker, Kubernetes, and Slurm environments.
• Optimize cluster scheduling and resource allocation for high-performance GPU workloads.
• Debug GPU driver quirks, Slurm job issues, and InfiniBand networking hiccups.
• Implement comprehensive monitoring, logging, and alerting across all infrastructure layers.
• Establish SLOs/SLIs for infrastructure reliability and create observability dashboards for long-horizon training runs.
• Build observability stacks for system health and job-level performance.
• Proactively detect and resolve infrastructure issues before they impact research workflows.
• Implement and manage secrets management and identity security solutions.
• Champion security best practices, IAM policies, and compliance standards.
• Document best practices, create runbooks, and evangelize DevOps culture across the organization.
• Mentor teammates on infrastructure patterns, automation techniques, and operational excellence.

🎯 Requirements

• Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
• 6-10+ years in DevOps, Infrastructure, or SRE roles with proven hands-on systems engineering experience (not just certification-based).
• Deep Unix/Linux administration expertise including kernel tuning, networking, storage, and process control.
• Advanced Infrastructure-as-Code experience with Terraform, Pulumi, or CloudFormation.
• Expertise building CI/CD systems and reproducible build pipelines (GitHub Actions, GitLab CI, Jenkins, etc.).
• Hands-on experience with AWS (EC2, S3, IAM, VPC, etc.) and cloud infrastructure management.
• Cluster orchestration and job scheduling experience with Kubernetes and Slurm.
• Strong monitoring and observability stack experience (Prometheus, Grafana, ELK/EFK, OpenTelemetry).
• Demonstrated success scaling infrastructure for high-performance or GPU workloads.
• Track record of managing GPU-accelerated clusters or HPC infrastructure.
• Experience automating workflows to reduce toil and scaling deployments safely.
• Strong programming skills in at least one of Python, Go, or Rust, plus Bash fluency.
• Ability to work cross-functionally. Strong communicator who can simplify complex topics for diverse audiences.
• Entrepreneurial and mission-driven, comfortable in a fast-growing, startup-style environment, and motivated by the ambition of tackling one of the greatest scientific challenges in history.
• Demonstrated passion for physics and for making scientific knowledge accessible and impactful.

🏖️ Benefits

• Join us at FirstPrinciples and be a part of a transformative journey where science drives progress and unlocks the potential of humanity.
