Site Reliability Engineer



Software Engineering
Taipei City, Taiwan
Posted on Friday, May 19, 2023
【Job Description】

We are looking for a Site Reliability Engineer (SRE) to keep our cloud-based commerce platform up, running, and healthy.

As an SRE for iKala Commerce, you will be responsible for everything from our cloud infrastructure and operating systems to developing tools for code deployment and service monitoring. You will also review our code and system design and partner with developers to build our applications.

The SRE is an integral member of our product development team. You will be part of the team that makes crucial decisions about how to manage and scale complex, high-performance distributed systems. You will also bring your own perspective to our backend systems and continually find better ways to manage the underlying infrastructure. Our ideal candidate can develop applications on their own, but is even more eager to accelerate the whole team by building systems that improve performance and operational efficiency.

Ultimately, you should be involved in all stages of software development to define and improve our SLOs, SLAs & SLIs.

Our current tech stack includes:
GCP, Kubernetes, Helm, Terraform, Stackdriver, Grafana, Prometheus, Elastic.

【Responsibilities】

  1. Designing & implementing infrastructure for collecting metrics, crunching data and improving service monitoring to detect problems before they're visible to our customers.
  2. Building systems to automate our server lifecycle, from configuration management, CI/CD to server bootstrap and decommission.
  3. Troubleshooting, performing root cause analysis, and resolving production issues from the application and network layers all the way down to the system level.
  4. Participating in solution design and advising other developers when building new features so that they're scalable, maintainable, and performant.
  5. Improving the observability of our applications through monitoring, alerting, logging, tracing and profiling, and building such observability features into a common platform.
  6. Practicing sustainable incident response and blameless postmortems.
  7. Proactively identifying and reducing issues through design, testing, and implementation of software-based solutions.

【Requirements】

  1. BS/MS degree in Computer Science or Engineering, or equivalent practical experience.
  2. 3+ years of experience with UNIX/Linux systems.
  3. 1+ years of experience in software development, and familiarity with shell scripting or at least one programming language.
  4. 3+ years of experience operating and building software in cloud environments including GCP or AWS.
  5. Experience in system / relational database administration.
  6. Experience with configuration management software such as Terraform, Ansible, Puppet, or Chef.
  7. 1+ years of production experience with Docker & Kubernetes.
