About the company:
Avenue Code is an e-commerce consulting company headquartered in San Francisco, with three additional offices in Brazil (Porto Alegre, Belo Horizonte, and São Paulo). We are 100% privately funded, profitable, and have been on a solid growth trajectory for several years. We care deeply about our clients, our partners, and our consultants. We prefer the word “partner” to “vendor”, and our investment in professional relationships reflects that philosophy. We take pride in our technical acumen, our collaborative approach to problem-solving, and the warmth and professionalism of our consultants.
About the opportunity:
We are looking for a Site Reliability Engineer who can take both the team and site stability to the next level. This candidate will help ensure site stability through proactive and reactive analysis of site performance, and will contribute the necessary tools, data, and recommendations. This role is expected to be a partner in our journey, so we are looking for someone who can identify opportunities for improvement in tools, process, execution, and delivery.
Responsibilities:
- Proactively address potential stability and performance challenges that could impact the site;
- Serve as the primary point of contact responsible for the overall health, performance, and capacity of the services and infrastructure serving the website and mobile apps;
- Own and run post-SEV1 incident reviews and follow-up fixes/changes;
- Solve availability and performance problems and build software-based solutions to prevent recurrence;
- Design and develop automation frameworks and test suites to enforce SRE techniques and tools;
- Analyze and identify performance bottlenecks and make recommendations;
- Implement metrics, monitoring, incident response and capacity planning processes;
- Collaborate closely with Developers/Solution Architects to ensure a designed solution responds to non-functional requirements such as availability, performance and maintainability;
- Collaborate closely with Performance Engineering to review and understand Performance tests and results;
- Collaborate closely with Monitoring and Alerting teams to identify useful metrics within the application and ensure monitoring coverage;
- Build self-service tools for the SRE team and other operations groups that automate manual processes and toil;
- Build and maintain runbooks to troubleshoot applications, infrastructure and services that are in use;
- Develop tools to effectively monitor custom applications in a large-scale e-commerce environment.
Requirements:
- Experience as a Site Reliability Engineer or equivalent role;
- Experience operating highly automated, mission-critical 24/7 production systems;
- Experience with UI debugging tools and web performance measurement;
- Good knowledge of client-side rendering;
- Experience troubleshooting issues with any tools available (heap dumps, thread dumps, monitoring, database queries, etc.);
- Experience with automation tools (Ansible, Terraform, Chef, Puppet);
- Understanding of full-stack as well as microservice-based e-commerce applications;
- Understanding of monitoring frameworks (New Relic, Splunk, ELK, Stackdriver, PagerDuty);
- Familiarity with platform components such as hypervisors/VMs, AWS/GCP, GKE, load balancers, cloud products, container systems, compute, storage;
- BS in Computer Science, Engineering, or related discipline;
- Strong organizational, analytical and critical thinking skills;
- Strong written and verbal communication skills.