
Site Reliability Engineer (Remote)

Remote   |   Full Time

Experience: Minimum 4 years

Roles and responsibilities

  • Deploy and maintain applications.
  • Design, build, manage and operate the infrastructure and configuration of SaaS applications with a focus on automation and infrastructure as code.
  • Evaluate performance trends and expected changes in demand and capacity, and establish appropriate scalability plans.
  • Ensure that SLAs are met when executing operational tasks.
  • Be on-call to respond to infrastructure failures.
  • Debug infrastructure-related production issues across services and multiple levels of the stack.
  • Set up automation to prevent similar incidents from recurring.
  • Configure "smart" monitoring that gives early warning before failure points are reached.
  • Maintain a change log for every action and help build a knowledge base of failures and solutions.
  • Report regularly on performance benchmarks for production systems, and plan to scale up or down when needed.
  • Respond to infrastructure alerts.
  • Configure monitoring/alerts where needed.
  • Fine-tune monitoring thresholds and reduce false alerts.
  • Maintain an audit log of changes.
  • Support a customer-facing, internet-facing application that needs to be up 24x7.
  • Monitor database clusters.

What the candidate should know:

  • Strong knowledge of AWS technologies and a willingness to self-teach as they change.
  • Systematic problem-solving approach, combined with a strong sense of ownership and drive.
  • Experience in the design, creation, and provisioning of infrastructure.
  • Experience working within an Agile/Scrum SDLC.
  • Experience with Continuous Delivery and Deployment Automation (our environment: Ansible, GitLab, Git/GitHub, Artifactory, Terraform).
  • Solid experience using configuration management frameworks (e.g. Chef, Puppet).
  • Experience building and managing virtualized systems (containers/Docker).
  • Ability to develop comprehensive monitoring solutions that give full visibility into the different platform components, using tools and services such as Kubernetes, Grafana, ELK, Datadog, New Relic, and similar.
  • Working knowledge of web and network protocols and standards (HTTP, TLS, DNS, etc.).
  • An understanding of capacity planning and how to set appropriate limits to optimize cost and performance.
  • Ability to identify system scaling, backoff, or other throughput challenges to help prevent incidents or resolve them quickly.
  • Ability to identify and troubleshoot availability and performance issues at multiple layers of a deployment: hardware, operating environment, network, and application.
  • Experience working to metrics such as SLIs, SLOs, and SLAs.
  • Familiarity with product behavior, edge cases, failure modes, negative boundary behaviors, load mishaps, etc., to stop issues before they reach production.
  • Firm grasp of at least one modern programming language, beyond advanced scripting (Shell, Perl, Python).
  • Experience writing automation tools and an eagerness to automate all the things.
  • An understanding of capacity planning and how to set appropriate limits to optimize resources.
  • Working knowledge of information security issues.
  • AWS Certified Solutions Architect (Associate or Professional) certification preferred.
