Company Description

Redzone is the #1 Connected Workforce Solution for manufacturers big and small. We work to improve efficiency in plants, provide coaching for best practices, and enable front-line workers to improve the quality of their work and their work life by giving them the tools, processes, and collaboration capabilities to keep their manufacturing lines running smoothly and efficiently.

At Redzone we focus on the customer experience, listening to the customer, and providing solutions that create great outcomes. We are a combination of great leadership, years of manufacturing experience, and an incredible technology team that all work together to create great products.

This role is fully remote, but candidates must be based in Mexico with full work authorization already in effect. No visa sponsorship is available.

Job Description

We are expanding our Site Reliability Engineering (SRE) team and seeking a highly skilled and passionate Senior SRE to join us. As a member of our growing SRE function, you will play a critical role in ensuring the reliability, scalability, and performance of the mission-critical services that power our customer experience. This is an exciting opportunity to shape our SRE practices, drive automation, and significantly impact our products' operational excellence.

What You'll Do:

Drive Operational Excellence: Design, implement, and maintain highly available, scalable, and resilient systems that deliver exceptional customer experience.
Datadog Expert: Be one of the go-to experts for Datadog. You will be responsible for defining, implementing, and enforcing best practices for monitoring, alerting, logging, tracing, and synthetic testing across our entire AWS environment. This includes deep hands-on configuration, dashboarding, troubleshooting, and optimization within Datadog.
Software Development for Reliability: Develop robust, well-tested, and maintainable software and tooling to automate operational tasks, create self-service capabilities for engineering teams, and enhance system reliability. This will involve building applications, not just scripts.
Toil Reduction Champion: Identify and eliminate toil through automation, process improvements, and systematic problem-solving. Work proactively to shift our operational focus from reactive firefighting to proactive engineering.
Incident Management & Post-Mortems: Contribute to and evolve our incident response framework, participating in on-call rotations (using OpsGenie). Lead blameless post-mortems, extracting actionable insights and driving systemic improvements to prevent recurrence.
Reliability Metrics & Goals: Collaborate with engineering teams to define, implement, and track Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets. Use these metrics to drive continuous improvement and make data-driven decisions about reliability investments.
Infrastructure as Code: Leverage and contribute to our infrastructure as code (IaC) efforts, moving towards a fully automated environment using Terraform and GitHub Actions.
System Design & Architecture: Provide SRE expertise in system design reviews, influencing architectural decisions to build reliability, observability, and scalability into our services from the ground up.
Knowledge Sharing & Mentorship: Document processes, build runbooks, and share your expertise with both the SRE team and broader engineering organization. Help foster an SRE culture of shared ownership and continuous learning.

Qualifications

What You'll Bring:

5+ years of direct Site Reliability Engineering (SRE) experience or equivalent experience in a production engineering role focused on system reliability.
Deep expertise and hands-on experience with Datadog. Proven ability to implement, manage, and optimize Datadog for comprehensive monitoring (APM, infrastructure, logs, synthetics, RUM), alerting, and troubleshooting in complex cloud environments.
Strong software development proficiency in Python (required). Demonstrated ability to build applications, tools, and automation frameworks beyond simple scripting.
Experience with Golang (desired).
Solid understanding of cloud-native architectures and best practices, specifically within AWS (EKS, Load Balancers, Aurora RDS Serverless Postgres, S3, Secrets Manager, MSK, Bedrock, SageMaker, Route53).
Experience with containerization and orchestration technologies, particularly Kubernetes (EKS).
Familiarity with CI/CD pipelines and tools (Jenkins, GitHub Actions).
A strong understanding of distributed systems concepts, networking, and security principles.
Experience with incident management processes and tools.
Excellent problem-solving skills, with a methodical and data-driven approach to troubleshooting complex systems.
Strong communication and collaboration skills, with the ability to work effectively with diverse engineering teams.
A proactive mindset, with a passion for automation, continuous improvement, and blameless culture.

Bonus Points (Nice to Have):

Experience defining and working with SLOs, SLIs, and Error Budgets.
Familiarity with other observability tools or concepts beyond Datadog.
Experience with feature flagging platforms like LaunchDarkly.

Additional Information

Why Join Us?

Be a key member of a growing SRE team and help shape our operational future.
Work on challenging problems at the intersection of software engineering, operations, and customer experience.
Opportunity to significantly reduce toil and drive impactful automation.
Collaborate with talented engineers in a supportive and learning-oriented environment.
Your health and well-being are important to us. We provide programs that help you strike a healthy work-life balance.
Opportunity to join a growing business, launching into its next phase of expansion and transformation.
Collaborative culture of smart and hard-working people who support one another to get the job done.
An atmosphere of growth and opportunity, where idea-sharing is always prioritized over level or hierarchy.
Compensation packages based on experience and desired skill set.

About QAD and QAD Redzone:

QAD Inc. is a leading provider of adaptive, cloud-based enterprise software and services for global manufacturing companies. Global manufacturers face ever-increasing disruption caused by technology-driven innovation and changing consumer preferences. To survive and thrive, manufacturers must be able to innovate and change business models at unprecedented speed. QAD calls these companies Adaptive Manufacturing Enterprises.

QAD Redzone helps to enable QAD’s vision for the Adaptive Enterprise. Labor productivity improvements directly impact efficiency. Productive and empowered employees increase the effective capacity of your plant, accelerate time to productivity for new employees, and reduce the amount and impact of employee attrition. This gives manufacturers the agility to increase production beyond what was previously possible without having to invest in production equipment or new plants. Empowered employees with a growth mindset take extreme ownership of challenges that impact their production goals, creating resilience in the face of disruption.

We are an Equal Opportunity Employer and do not discriminate against any employee or applicant for employment because of race, color, sex, age, national origin, religion, sexual orientation, gender identity, status as a veteran, disability, or any other federal, state, or local protected class.
