Data Center Reliability Engineer

🕒 June 2

Phaidra

51 - 200 employees

🤖 Artificial Intelligence

⚡ Energy

☁️ SaaS

Phaidra provides artificial intelligence controls that optimize mission-critical facilities such as data centers and industrial plants. Its closed-loop AI control service improves plant stability, energy efficiency, and sustainability by reducing downtime, increasing productivity, and lowering CO2 emissions. Unlike traditional control systems, Phaidra's AI-driven controls continuously learn and improve over time without requiring new hardware. The system provides real-time optimization and integrates with existing control systems, enhancing safety and operational stability while providing full transparency of performance data. Phaidra uses deep reinforcement learning techniques to tackle some of the world's toughest challenges, with results that include significant energy savings in Google's data centers.

📋 Description

• Use existing data ingestion and delivery platforms to "teach" models to understand the physical world, filling a critical expertise gap in the data center industry.
• Use telemetry tools to analyze sensor data across mechanical (chillers, pumps) and electrical (UPS, switchgear, power feeds) systems to identify "failure signatures" for an LLM-driven monitoring tool.
• Act as a primary user of the platforms, identifying gaps in current mechanisms and collaborating with Engineering to influence future features and data quality.
• Translate raw telemetry into "SME-level" logic and directions that the LLM tool uses to guide data center operators in real time.
• Cultivate deep domain expertise in all facets of data center infrastructure.
• Move from shadowing peers to directly supporting customers, using the platform to provide clear, data-backed direction on complex problems.
• Oversee pilot projects to test how the AI-driven SME tool interprets real-world stressors, ensuring the output is operationally realistic, accurate, and actionable.
• Remain agile and proactive in a fast-moving team environment.

🎯 Requirements

• 2–3 years of relevant professional experience.
• Bachelor's degree in Mechanical Engineering, Electrical Engineering, Control Theory, or a related field that provides a foundation in physical systems and thermodynamics.
• A deep, innate interest in using data to diagnose how and why systems fail. You are a "tinkerer" who prefers solving real-world problems over theoretical research.
• Strong Python skills and experience with data manipulation libraries (Pandas/NumPy) to perform custom analysis outside of standard tooling.
• Ability to explain complex diagnostic findings clearly and persuasively to both technical peers and non-domain stakeholders.
• A proven ability to approach a problem without preconceived notions and work out solutions either independently or through team collaboration.
• Demonstrated commitment to Transparency, Collaboration, and Ownership, especially in environments where reliability and learning from failure are paramount.

🏖️ Benefits

• Fast-paced, team-oriented environment where your work directly shapes the company's direction.
• We are a 100% remote company.
• Competitive compensation and meaningful equity.
• Outsized responsibilities and professional development.
• Training is foundational: functional, customer immersion, and development training.
• Medical, dental, and vision insurance (exact benefits vary by region).
• Unlimited paid time off, with a required minimum of 20 days per year.
• Paid parental leave (exact benefits vary by region).
• Flexible stipends to support your workspace, well-being, and continued professional development.
• Company MacBook.
