Site Reliability Engineer (San Francisco or Geneva) @DiemAssociation
DevOps/Site Reliability Engineer
We need an experienced DevOps/Site Reliability Engineer to join our team to help us build a global payment system on the Diem Blockchain. As a DevOps/Site Reliability Engineer, you’ll work to deeply understand our current state and roadmap. You’ll use your extensive experience in Site Reliability Engineering to help optimize the Diem network for growth and adoption with members and end consumers. You will monitor, manage, and maintain the Diem network; work with our members to ensure efficient deployment, operations, performance, and SLAs of Validator nodes; coordinate responses to critical incidents; and coordinate planned new releases.
- Manage and maintain network health, including:
  - Monitor top-level network statistics, including finality time, transaction costs, and mempool backlog
  - Coordinate response to critical incidents, including incidents that impact all Validators, possible hard forks, and ongoing exploits
  - Ensure each Validator is meeting performance and availability SLAs
- Coordinate planned new releases, including defining release frequency, coordinating pre-release testing, and coordinating release timing and expectations with Validators
- Manage Validator deployment and operations, including ensuring up-to-date documentation for Validator hardware requirements, Validator deployment across all environments, Validator key management and common Validator operations
- Build services and infrastructure that monitor the health, uptime, and reliability of our blockchain infrastructure.
- Work closely with other engineers and team members. Our team is highly collaborative and we strive to create shared understanding of systems.
- Contribute to architecture discussions. We need a strong technical leader who can help drive and own whole systems.
- Incorporate monitoring, alerting, and observability into systems so that we can debug, diagnose, and fix problems.
- Be part of an on-call rotation for production events or outages.
- 10+ years of Site Reliability Engineering / DevOps experience
- Fluency with Linux and at least one programming language
- Expertise with at least one of AWS, GCP, or Azure
- Excellent written and verbal communication skills, especially in a matrixed organization
- Self-motivated and able to thrive in rapidly evolving and entrepreneurial environments
- Detail-oriented, with a strong ability to analyze strategically at a high level
- Resourceful and creative thinker with strong problem-solving skills
- Desire to build something that will positively impact billions of people
- Experience with Kubernetes and Terraform
- Proficiency with Rust
- Knowledge of blockchain technology and experience deploying blockchain infrastructure
- Cryptocurrency knowledge
To apply, please visit https://www.diem.com/en-us/careers/
Diem is an equal opportunity employer. All applicants will be considered for employment without attention to race, color, religion, sex, sexual orientation, gender identity, national origin, or veteran or disability status.