Overview
What You’ll Build
Multi-cloud infrastructure using Terraform and Terragrunt, managing Kubernetes (EKS) clusters, networking, databases, and security at scale.
CI/CD pipelines and deployment automation for a microservices architecture with 14+ Kubernetes workloads deployed via Helm.
Observability and reliability platform using Datadog (APM, logs, metrics, Network Performance Monitoring), ensuring production SLAs for enterprise customers.
Authorization mechanisms for multiple interfaces (API, MCP, SDKs).
Security-first infrastructure including secrets management (External Secrets Operator + AWS Secrets Manager), WAF policies, IAM/Pod Identity, and network segmentation.
Requirements
Strong problem-solving mindset with the ability to tackle complex infrastructure challenges.
Autonomous player who can take ownership and drive solutions independently.
5+ years overall in DevOps / Infrastructure / SRE roles, including startup experience.
Strong Kubernetes experience: EKS/GKE cluster management, Helm charts, Gateway API, scaling strategies.
Solid Terraform/Terragrunt experience: module design, state management, multi-environment configurations.
AWS experience: EKS, VPC, ALB, RDS, ElastiCache, S3, ECR, IAM, Secrets Manager, WAF, Bedrock.
GCP experience (an advantage): GKE, VPC, Cloud Armor, Cloud DNS, Cloud Load Balancing.
CI/CD pipeline design and maintenance (GitHub Actions).
Monitoring and observability: Datadog or equivalent (Prometheus, Grafana).
Networking fundamentals: VPC design, security groups, DNS, TLS/cert management.
Production system experience: incident response, capacity planning, disaster recovery.
Super important: a get-shit-done attitude, curiosity, and a proactive mindset.
Nice to Have
Experience supporting AI/ML workloads (LLM inference, GPU scheduling, model serving).
MongoDB operations (Kubernetes Operator, backup/restore).
WebSocket infrastructure (Soketi/Pusher).
Cost optimization across multi-cloud environments.
On-prem / hybrid deployment experience.
Tech Stack
Terraform, Terragrunt, AWS, GCP, Kubernetes (EKS/GKE), Helm, Docker, GitHub Actions, Datadog, PostgreSQL, MongoDB, Redis, Traefik, cert-manager, External Secrets Operator, Python
Dev Stack
Cursor, Claude Code, Warp
Tell them you heard about the position from Nefesh B’Nefesh. Please do not repost this position.