Aviatrix Systems, Inc. seeks Principal Engineer - Site Reliability Engineer in Santa Clara, CA.
Job Duties:
Ensure uptime for crucial services and systems based on business-required Service Level Objectives (SLOs). Minimize service disruptions through proactive monitoring, capacity planning, and fault-tolerant design. Design and architect complex, scalable, and reliable systems. Develop and implement automation tools and frameworks that automate routine tasks, reduce human error, and streamline operational processes to increase efficiency. Define, build, deploy, maintain, and extend our observability and monitoring tools to enhance system reliability and availability. Maintain an effective on-call rotation to ensure 24/7 coverage. Follow incident response procedures to swiftly address and mitigate service disruptions. Help define and monitor Service Level Indicators (SLIs) and SLOs to set clear expectations for system performance. Work closely with product engineering to ensure service-level objectives and reliability targets are met. Respond to escalations by troubleshooting complex system and application incidents, performing root cause analysis, and implementing necessary corrective actions. Stay up to date with the latest industry trends and emerging technologies. Iterate on best practices to increase the quality and velocity of development and deliverables. Must be available to work on projects at various, unanticipated sites throughout the United States. 100% telecommuting permitted.
Minimum Requirements:
Bachelor's degree in Computer Science, Engineering, or related field (or foreign equivalent) followed by 8 years of progressive, post-baccalaureate experience in the job offered or in a software development/system reliability engineering-related occupation.
Alternative Requirements:
Master's degree in Computer Science, Engineering, or related field (or foreign equivalent) followed by 5 years of experience in the job offered or in a software development/system reliability engineering-related occupation.
Special Requirements:
Deploying and maintaining highly available, fault-tolerant systems at scale on cloud platforms including AWS, Azure, and GCP using cloud-native technologies; designing and developing automation tools and frameworks using the Golang and Python programming languages; managing and troubleshooting Kubernetes application lifecycles, developing custom operators, and optimizing infrastructure using applicable toolchains; implementing Infrastructure-as-Code (IaC) solutions using Terraform and Terragrunt for infrastructure provisioning, configuration management, and scaling in cloud environments; building and maintaining observability systems using monitoring solutions including Prometheus, Grafana, and VictoriaMetrics, and logging solutions such as Elasticsearch, Logstash, and Kibana; performing incident response procedures, root cause analysis, and troubleshooting of complex system and application issues in Linux production environments with 24/7 on-call rotation responsibilities.
Position is eligible for 100% remote work. Salary $264,514.00 - $274,514.00. Standard company benefits.
To apply: Visit https://www.jobpostingtoday.com/application/32293/apply
Please note that JOBS.NOW is an independent website and does not post these listings on behalf of any employers, nor do we receive any compensation for these listings. All listings are sourced via media or internet channels required by the PERM process.