Job Description
Title: Senior Site Reliability Engineer (SRE)
Location: Remote
About January
At January, we’re transforming the lives of borrowers by bringing humanity to consumer finance. Our data-driven products empower financial institutions to streamline collections and help borrowers regain financial stability and control over their lives. We’re not just expanding access to credit — we’re restoring dignity and paving the way for millions to achieve financial freedom.
About the Role
As a Senior Site Reliability Engineer (SRE), you will establish SRE practices from the ground up, ensuring reliability, scalability, and performance as January scales from thousands to millions of borrowers. You’ll architect resilient infrastructure, design modern observability solutions, and build sustainable on-call processes that evolve with our rapid growth.
Your work will directly address scaling challenges including database optimization, async workflow infrastructure, and data pipeline reliability — enabling the engineering team to ship confidently and efficiently.
Key Responsibilities
- Lead incident response and develop sustainable on-call practices, including runbooks, blameless postmortems, and continuous improvement to reduce MTTR.
- Build and maintain self-service observability tools (Datadog, Prometheus, ELK) for proactive monitoring and troubleshooting.
- Create and maintain Infrastructure as Code (IaC) using Terraform or CloudFormation for consistent, secure AWS environments.
- Partner with development teams to architect resilient, scalable infrastructure for critical components like databases, networking, async workflows, and data pipelines.
- Design and implement robust CI/CD pipelines (GitHub Actions) with advanced deployment strategies (blue/green, canary).
- Drive best practices in reliability and performance early in the design phase to future-proof January’s systems.
Required Skills & Experience
- Proven experience leading incident response and postmortem processes for high-availability production systems.
- Deep expertise in designing highly available architectures (EC2, Fargate, auto-scaling, health checks, graceful degradation).
- Strong experience with AWS cloud infrastructure and IaC tools (Terraform, CloudFormation).
- Hands-on experience with CI/CD automation using GitHub Actions or equivalent tools.
- Proficiency in observability and monitoring stacks (Datadog, Prometheus, ELK).
- Solid scripting/programming skills in Python (for automation, tooling, and debugging).
- Excellent communication and documentation skills, with the ability to collaborate across engineering and platform teams.
Requirements
Tools & Technologies
- Cloud: AWS
- IaC: Terraform, CloudFormation
- CI/CD: GitHub Actions
- Monitoring: Datadog, Prometheus, ELK
- Languages: Python
- Infrastructure: EC2, Fargate
Additional Details
- Remote role (NYC-based preferred for hybrid collaboration).
- Opportunity to build and own the entire SRE practice for a growing FinTech startup.
- Fast-paced, innovative environment working on AI-forward consumer finance products.
Job Tags
Contract work, Remote work