HPC Support Engineer


October 9


Lambda

Artificial Intelligence • SaaS • Hardware

Lambda is a company that provides cloud-based solutions and hardware for AI development. They offer on-demand GPU clusters for multi-node training and fine-tuning, as well as inference endpoints and APIs. Their products include the Lambda GPU Cloud, which features NVIDIA's latest generation of infrastructure for enterprise AI, and customizable GPU workstations and desktops designed for AI and deep learning. Lambda also offers a one-line installation and managed upgrade path for machine learning tools like PyTorch, TensorFlow, and NVIDIA CUDA. By focusing on enabling AI developers, Lambda provides both public and private cloud services with access to powerful NVIDIA Tensor Core GPUs.

51 - 200 employees


💰 $39.7M Venture Round on 2022-11

📋 Description

• Engage directly with customers to deeply understand their challenges, ensuring a personalized and effective support experience.
• Dive into complex software and hardware issues, providing timely and efficient solutions.
• Craft comprehensive documentation of solutions and contribute to enhancing support procedures, ensuring continuous improvement in service quality.
• Identify common customer pain points and collaborate closely with engineering teams to develop innovative solutions, constantly improving the overall customer experience.
• Collaborate in the development of new and existing products, contributing your expertise to shape the future of deep learning cloud and HPC infrastructure.
• Take escalations from your peers while looking for opportunities to train and educate them in the process.
• Work cross-functionally on project work, focusing on creating and improving support tooling.
• Participate in a rotating on-call schedule, during which you'll be responsible for major incidents and major customer alerts and issues.

🎯 Requirements

• 7+ years in cloud support operations or systems engineering.
• Strong experience with public cloud platforms (AWS, Azure, GCP) or GPU cloud providers.
• Very strong understanding of and experience with Linux (Ubuntu) system administration.
• Proven experience in HPC environments, including Linux cluster administration, with a strong preference for Kubernetes and/or Slurm for cluster orchestration.
• Proficiency with monitoring/logging tools (Prometheus, Grafana, Datadog).
• Strong skills in log analysis, debugging kernel-level issues, and performance profiling.
• Experience with CUDA, NCCL, NVLink, MIG, and GPUDirect RDMA.
• Experience with high-throughput networking technologies (IB/RoCE).
• Experience with virtualization and container technologies (Docker, Kubernetes).
• Knowledge of distributed AI/ML or HPC workloads.
• Knowledge of TCP/IP, VPN, and firewalls in cloud environments.
• Ability to work independently and mentor junior support engineers.

🏖️ Benefits

• Health, dental, and vision coverage for you and your dependents
• Wellness and Commuter stipends for select roles
• 401k Plan with 2% company match (USA employees)
• Flexible Paid Time Off Plan that we all actually use


