Your primary responsibility is hands-on infrastructure automation and operations for an Agentic AI platform and its supporting services.
Cloud Infrastructure & Automation
· Design, implement, and maintain cloud infrastructure in Azure using Terraform with reusable modules and environment-based configurations
· Build and manage Azure resources including AKS, VMs, VNets, Load Balancers, Key Vault, API Management, Azure Container Registry, storage, and database-related services
· Implement scalable and secure networking topologies, including hub-spoke architecture, private endpoints, firewalls, routing, and WAF
· Support infrastructure readiness for a future multi-region setup and disaster recovery scenarios
· Define and improve backup, restore, and recovery processes for critical infrastructure and databases
CI/CD, GitOps & Platform Enablement
· Design and maintain automated CI/CD pipelines in Azure DevOps using Pipeline as Code principles
· Implement multi-stage YAML pipelines with approvals, environments, variables, secrets, and deployment strategies
· Enable automated container build and deployment workflows using Docker, ACR, AKS, and Helm
· Develop reusable pipeline templates for consistent delivery practices across teams
· Support a gradual transition toward GitOps-based deployment workflows, preferably using ArgoCD
· Maintain deployment configuration in Git where appropriate and help improve traceability of infrastructure and application changes
Kubernetes & Runtime Operations
· Operate and scale AKS clusters, including node pool management, network policies, autoscaling, and cluster security
· Deploy microservices and supporting components using Helm
· Support runtime reliability, troubleshooting, resource optimization, and incident investigation
· Improve operational readiness of services running in Kubernetes environments
Databases, Backups & Reliability
· Support PostgreSQL-based application infrastructure, including access, connectivity, backups, restore validation, and operational reliability
· Understand how backend services use Prisma, including schema changes, migrations, and database interaction patterns
· Help improve backup strategy, recovery procedures, and documentation for critical services
· Contribute to disaster recovery planning, including RTO/RPO considerations and future multi-region readiness
· Support systems using Kafka or similar event-driven components where applicable
Security, Compliance & Observability
· Implement industry best practices for cloud and DevOps security, including least privilege, identity federation, secret governance, and artifact signing
· Apply security guardrails across Azure and delivery pipelines, including automated scanning and policy-as-code using Azure Policy
· Set up and maintain monitoring, logging, dashboards, and alerting using Azure Monitor, Log Analytics, Application Insights, Grafana, Prometheus, Alloy, Loki, and OpenTelemetry
· Improve visibility into system health, application performance, infrastructure usage, and deployment stability
Collaboration & Internal Tooling
· Collaborate closely with software engineering and AI teams to enable fast and reliable development workflows
· Participate in architectural discussions around infrastructure scalability, reliability, disaster recovery, and operational readiness
· Contribute to documentation for infrastructure architecture, Terraform modules, pipelines, backup/restore processes, GitOps workflows, and operational runbooks
Nice to have
· Experience with AWS/GCP
· Experience integrating LLM-powered systems or high-throughput AI/ML pipelines
· Experience with GitHub Actions or other CI/CD platforms
· Experience with event-driven systems such as Azure Event Grid, Azure Service Bus, and Kafka
· Automation using Python, TypeScript, or Bash
· Strong experience with GitOps, preferably ArgoCD
· Experience with PostgreSQL administration, backup automation, restore testing, and performance troubleshooting
· Familiarity with Prisma ORM from an infrastructure/DevOps perspective
· Experience designing backup and disaster recovery strategies
· Experience with multi-region cloud architecture and high-availability systems
· Experience with cloud cost optimization and FinOps practices
· Familiarity with on-call operations, incident response, and SRE practices