Site Reliability Engineer (SRE) - LLM and Machine Learning

December 20, 2023

techruiter.

we are your answer to building teams that have the power to transform your company. we are techruiter.

Tech Recruitment • Product Recruitment • Science Recruitment • Consulting • Talent Acquisition

11 - 50 employees

Description

• Collaborate with engineering and research teams to design, implement, and automate infrastructure for LLM and Machine Learning workloads, ensuring scalability and reliability.
• Manage deployment pipelines, configuration management, and orchestration tools to streamline the deployment of models and services.
• Implement and maintain robust monitoring, alerting, and logging systems to proactively identify and resolve issues. Ensure optimal system performance.
• Lead incident response efforts, investigate root causes of outages, and implement preventive measures to reduce the likelihood of recurrence.
• Perform capacity planning and scaling to accommodate growing workloads and ensure resource efficiency.
• Collaborate with security teams to implement security best practices, vulnerability assessments, and compliance requirements for LLM and Machine Learning systems.
• Continuously evaluate and improve system reliability, performance, and efficiency through automation and optimization.
• Maintain comprehensive documentation for infrastructure configurations, procedures, and incident reports.

Requirements

• Bachelor's or Master's degree in Computer Science, Information Technology, or a related field.
• Proven experience as a Site Reliability Engineer or a related role with a focus on LLM and Machine Learning infrastructure.
• Strong proficiency in cloud platforms (e.g., AWS, Azure, GCP) and containerization technologies (e.g., Docker, Kubernetes).
• Experience with configuration management tools (e.g., Ansible, Terraform) and CI/CD pipelines.
• Knowledge of monitoring and observability tools (e.g., Prometheus, Grafana, ELK Stack).
• Scripting and automation skills (e.g., Python, Bash).
• Excellent problem-solving and troubleshooting skills.
• Strong communication and collaboration skills.

Benefits

• Excellent salary and benefits package
• Opportunity to work with cutting-edge technology
• Collaborative and innovative work environment
