The AI community building the future.
machine learning • natural language processing • deep learning
51 - 200
March 19
• Design, develop, deploy, and maintain reliable and scalable infrastructure that enables efficient training workloads.
• Manage large compute clusters for AI training and development.
• Create tooling and infrastructure that abstract compute and storage in ML workflows.
• Measure and optimize system performance.
• Monitor and troubleshoot infrastructure issues, ensuring high availability and performance of AI workloads.
• Recommend improvements to enhance system efficiency and performance.
• Work closely with AI software engineering teams to ensure infrastructure can handle all system requirements.
• 7+ years of experience in a DevOps or Infrastructure Engineer role, building machine learning infrastructure and working with large GPU clusters.
• Knowledge of cloud providers such as AWS and GCP, infrastructure-as-code frameworks, and observability tools.
• Familiarity with the Python scientific stack and PyTorch.
• Experience with data structures, data modeling, and database management, as well as object and file storage systems.
• Strong communication, collaboration, and documentation skills.
• Experience with Linux, Git, containers, networking, and command-line tools.
• Strong programming skills in Python, Golang, and/or Rust.
• Flexible working hours and remote options
• Health, dental, and vision benefits for employees and dependents
• 12 weeks of parental leave (20 for birthing mothers) and unlimited paid time off
• Reimbursement for relevant conferences, training, and education
• Company equity as part of compensation package