Atlassian Cloud Storage Engineering (ACSE) is composed of infrastructure teams tasked with developing and maintaining the persistent data stores used by Atlassian's product and platform teams. The Managed Search team has set ambitious goals, including enhancing Search operational efficiencies company-wide, improving cluster reliability, cutting down the Total Cost of Ownership (Search TCO) across Atlassian, and enhancing Trust (Security & Compliance).

In pursuit of these goals, the Managed Search team is developing a self-hosted search platform for use within Atlassian, aiming to achieve the ACSE vision to improve the clock speed of Atlassian by providing reliable, secure and cost-effective storage solutions.

This position is for a Principal Engineer on ACSE Kratos (the Managed Search team, ~15 engineers), reporting to the Senior Engineering Manager. This role will require deep, hands-on operational work to run high-quality search infrastructure; outstanding collaboration skills to work effectively within a distributed team and engage with a broad range of internal customers; solid industry knowledge and technical curiosity to assess when best to build; and great design and hands-on development skills to build automation and peripheral tooling.

As a Principal Engineer, you will contribute to the architectural and technical direction of the Managed Search team, help set the standard for engineering practices and provide mentoring to more junior team members. You will also work with teams across Atlassian to provide guidance around search solutions, identify cross-cutting areas where the platform can be enhanced and design platform capabilities.

Here, you'll collaborate with and provide guidance to experienced and inquisitive engineers to build the infrastructure that enables thousands of Atlassians to deploy and operate search applications in the cloud.

Responsibilities & Activities:

- Design, implementation and operation of new and existing Managed Search components. For example:
  - Operating search clusters at high load.
  - Managing high numbers of clusters for reliability, such as ensuring reliable version upgrades and effective cluster configuration management.
  - Building tooling and automation to facilitate the provisioning and operation of increasing numbers of clusters.
  - Managing cluster capacity to ensure optimal performance and resource allocation within a system.
  - Understanding encryption at rest, including KMS / data key management and BYOK.
- Engagement with product teams (for example Search Platform / JSIS / CPUS) to:
  - Support and guide them as they onboard the service to the self-hosted search platform.
  - Adapt the platform to cater for their use cases without compromising other customers.
  - Tune and harden the clusters based on a deep understanding of their data and query patterns.
  - Contribute to the definition of appropriate SLAs that are suitable for customers and realistic for the Managed Search team.
- Contributions towards technical leadership within the team:
  - Determining and understanding priorities based on the broader view of Managed Search within Atlassian.
  - Driving & documenting key technical decisions.
  - Identifying opportunities & mitigating risks based on deep knowledge of the Managed Search systems, as well as broad knowledge of adjacent systems and underlying infra.

Key Results Areas:

- Quality: The Managed Search team will be key to both customer-facing functionality and internal business-critical workflows, so the platform's reliability and quality are essential metrics.
- Scale: The Managed Search team must be able to scale out clusters as customers' workloads increase and add clusters as the number of customers increases.
- Adoption: The Managed Search team is responsible for building a platform that is desirable to its consumers, engaging with customers to build trust in its product, and shipping in a sufficiently timely and incremental manner to enable dev teams to build on their components.

Technical Requirements:

- Deep Elasticsearch / OpenSearch skills, including operating and tuning large clusters, implementing backup and recovery mechanisms, predicting and preventing cluster issues via monitoring, diagnosing and fixing unhealthy clusters, and implementing preventative solutions to avoid repeat failures.
- Experience with a range of AWS services, their advantages and limitations, and understanding when to use specific services.
- Experience building operationally mature systems with appropriate logging, monitoring, SLAs, alerting, and runbooks.
- A high standard for quality software engineering (CI / CD, testing).
- Experience progressively and safely rolling out changes to complex live systems.
- Experience with Java / Kotlin.
- Experience with Docker, Kubernetes.
- Knowledge of Golang.
- Experience with Micros or a PaaS platform.

Less Technical Requirements:

- Must be used to owning large deliverables and complex problems.
- Must be a strong team player.
- Experience working with remote teams.
- Experience engaging with and building trust amongst internal customers.
- Experience with incident management processes.
- Experience participating in 24 / 7 on-call rosters (and willingness to do so on this team).
- Non-hero attitude: rather than celebrating a heroic effort to resolve an incident, you prefer engineering practices that avoid incidents in the first place.