Job Summary
We are seeking a senior Infrastructure Support Engineer to maintain technical excellence and operational efficiency in cloud environments.
The ideal candidate will help clients transition to agile, value-focused practices, emphasizing shared responsibility and continuous improvement. They will monitor infrastructure performance, respond promptly to incidents, and maintain resources aligned with modern standards and sustainable practices.
Job Responsibilities
* Monitor the operations of shipped products and services using "Eyes on glass/Follow the sun" engagement models.
* Track product/service operations against key performance indicators defined by the business and take necessary actions in response to deviations.
* Collaborate with the Site Reliability Engineering (SRE) team and client stakeholders to define and document incident response scenarios and create runbooks.
* Reduce manual effort in day-to-day operations and improve team efficiency by automating tasks with current technology stacks.
* Participate in Level 2 and Level 3 support tasks, troubleshooting and resolving incidents. Set up war rooms for incident response, collaborating with tech leads, SRE leads, and development teams.
* Prepare incident root cause analysis (RCA) and postmortem reports that explain the analysis and outline preventive measures.
* Implement service/product reliability improvements by writing infrastructure/observability configuration code in collaboration with SRE engineers.
Technical Skills
* Hands-on experience with CI/CD tools such as Jenkins, CircleCI, or GitLab for executing deployments.
* Knowledge of Infrastructure as Code (IaC) tech stacks such as Terraform, Ansible, ARM, or CloudFormation for provisioning and managing infrastructure.
* Working experience with observability tools for logging, monitoring, tracing, and alerting, e.g., Datadog/Prometheus/Grafana, ELK/EFK/Splunk.
* Experience across a range of AWS products.
* Hands-on experience with container ecosystem tech stacks such as Docker, Kubernetes, and OpenShift.
* Understanding of system performance tuning and scaling, highly available systems, and disaster recovery concepts.
* Experience operating Linux distributions such as RHEL or Debian-based systems, and familiarity with common Linux operations and commands.
* Support experience with backend storage solutions such as SQL and NoSQL databases and caching solutions, as well as networking configuration and security.
* Familiarity with fundamental API concepts such as requests, responses, headers, authentication, and JSON payloads.
Professional Skills
* Strong communication and articulation skills, with proficiency in English.
* Collaborative people skills and the ability to work closely with cross-functional teams.
* Ability to work under pressure during production incidents.
* Strong analytical, deductive, and reasoning skills, with the ability to identify patterns in data and draw conclusions.
* Drive and ownership to deliver work when called upon.
* Willingness to be part of a 24x7 support team that works on a rotation and as-needed basis.