Senior Site Reliability Engineer


We are looking for a Senior Site Reliability Engineer. In this role, you will be responsible for:

Run the production environment to provide the highest levels of uptime, performance, and reliability

Identify toil in the day-to-day operations and automate whatever can be automated

Work with development teams to make sure the applications are production-ready, scalable, reliable, and observable from day zero

Identify and drive opportunities to improve automation for code deployment, management, and visibility of application services

Establish end-to-end monitoring and alerting on all critical components within the platform

Participate in the on-call rotation, supporting the platform and production applications

Manage the end-to-end availability and performance of critical services, and build automation to support them

Perform root cause analysis on issues, and participate in blameless post-mortems so we can learn from incidents and prevent their recurrence through automation

Independently troubleshoot complex systems and environments including applications, microservices, DNS, and networking components

Create load test scenarios and streamline their execution so performance regressions can be caught pre-production

Enable developers and product teams to move rapidly with features without sacrificing reliability, availability, and overall performance of our systems

Participate in architecture reviews and work cross-functionally with Engineering teams on operational readiness and tactical day-to-day scenarios

Work with engineering teams to better address needs and enable more effective and efficient developer throughput

Identify performance bottlenecks and triage with Engineering teams to design and implement a secure and performant solution

Guide development teams towards security, reliability, and availability best practices during the SDLC

Daily and Monthly Responsibilities

Gather and analyze metrics from both operating systems and applications to assist in performance tuning and fault finding

Partner with development teams to improve services through rigorous testing and release procedures

Participate in system design consulting, platform management, and capacity planning

Create sustainable systems and services through automation and uplifts

Balance feature development speed and reliability with well-defined service-level objectives (SLOs) and service-level indicators (SLIs) to honor SLAs

If you’re looking for a real challenge in terms of mission criticality, multi-region deployments, diversity of managed services, and the chance to work with cutting-edge technologies like Kubernetes, Kafka, Serverless, ArgoCD, and more, then this might be the position for you!


Qualifications / Experience / Technical Skills

Experience administering Kubernetes-based microservices, ingress controllers, web servers (nginx), and databases (Postgres, MySQL, MongoDB; desirable: Redis, ClickHouse)

Strong experience with AWS technologies such as EKS, ELB, RDS, S3/EBS/Glacier, and VPC

Experience architecting highly scalable, fault tolerant, secure, and available systems within the AWS ecosystem

Strong troubleshooting experience in the realm of networking fundamentals, web applications, and DNS

Hands-on experience developing automation to streamline development processes

Experience working with modern CI/CD tools such as CircleCI, ArgoCD, CodeShip, GitHub Actions, or similar solutions

Experience with Infrastructure as Code tools (e.g. Terraform, CloudFormation)

Ability to program (structured and OO) in one or more high-level languages, such as Python, Java, C/C++, Ruby, or JavaScript

Soft Skills / Personal Characteristics

BS or MS from a top-notch CS program (or equivalent experience)

5+ years of professional experience in hands-on engineering roles (DevOps/SRE)

3+ years operating high-traffic production environments in public clouds: AWS, GCP, or Azure

Python programming experience in production environments

Experience with modern cloud environments: containerization, infrastructure-as-code, DevOps, CI/CD pipelines, and general automation

Hands-on experience with network security, database systems, and related tools

Proficiency in spoken and written English

Preferred Experience

Operating Kubernetes clusters in a compliance-regulated environment

Experience performing stress testing, failure analysis, and load testing of applications

Experience with cloud and infrastructure security regulations and compliance programs: SOC 2, ISO 27001, HIPAA, GDPR, CCPA

Experience with MLOps: Spark, TensorFlow, GPUs

Job Type
Full Time
74 days ago
