Site Reliability Engineering (SRE), pioneered by Google, is an alternative to the traditional split between Technical Operations and Product Development teams. Our SRE mission is to protect and improve the software and systems of Direct Supply, with an emphasis on availability, latency, performance, and scalability.
Much of our support and software development focuses on optimizing existing systems and reducing work through automation. We have full access and authority to fix, extend, and scale the systems, code, and processes to keep our internal and SaaS products working reliably. The ideal candidate is a curious detective and engineer who uses their engineering and communication skills to find root causes, propose solutions, and collaborate with others to implement improvements.
As a Staff Software Engineer on the SRE team, you will have the opportunity to increase the quality and reliability of software across the business while using your strengths in coding, analysis, and system design. You’ll also help spread SRE patterns and practices into engineering and operations through collaboration and mentoring. Never satisfied with the status quo, you’ll continue to proactively find and implement improvements with your partners throughout the business.
What You’ll Do and Impact:
- Design and implement software and processes to improve Direct Supply’s software system availability, scalability, latency, and efficiency.
- Act as a liaison and mentor to other engineers to influence and grow SRE patterns and practices: monitoring and alerting (SLOs and observability), scalability, graceful failure handling and automatic recovery, uneventful releases, and other aspects of resilient and reliable software engineering.
- Collaborate on the creation of new designs, architectures, standards, guidelines and approaches for software development and production operations that increase reliability and maintainability.
- Remove operational toil and alert noise through analysis, automation, and software development.
- Analyze software performance and carry out capacity planning and demand forecasting.
- Handle on-call duties as scheduled in the rotation with other members of the SRE team. On-call engineers have a backup, as well as flexible hours and locations to ensure work-life balance.
- Lead the response to system alerts when on-call and take necessary actions to keep systems healthy. During larger incidents, communicate with appropriate stakeholders and engage other engineers to ensure speedy resolution. Facilitate blameless post-mortems.
- Provide operations support, including executing software releases and production data updates, and use system expertise and data to answer user questions about system function.
What You’ll Need:
- Bachelor’s degree in Computer Science, Computer Engineering, or Software Engineering, or commensurate experience
- Demonstrated ability to quickly learn and apply new software technologies and to mentor others on them
- 4-10+ years of software engineering or systems engineering experience
Additional Preferred Skills:
- Experience designing, analyzing, or troubleshooting complex systems, using data from application logs, diagnostic tools, and industry-standard monitoring tools such as New Relic/Datadog/Sumo Logic
- Systematic problem-solving approach
- Experience applying SRE patterns and principles to improve customer outcomes
- Knowledge of networking, cloud computing (AWS), MS SQL, PostgreSQL, or software debugging