Senior DevOps Engineer job at Vertex Technologies in Cairo, Egypt
Job Description and Requirements:
Job Description
Responsibilities:
- Designing and deploying scalable, multi-tenant cloud (AWS or Azure) and hybrid/on-premises architectures tailored to diverse client needs, including specialized infrastructure for AI and machine learning workloads.
- Understanding business requirements, evaluating architectural trade-offs, and translating them into cost-effective, production-ready technical solutions.
- Developing declarative scripts and modules for automating infrastructure provisioning, configuration management, and environment replication.
- Designing, implementing, and optimizing GitOps-driven CI/CD pipelines to achieve automated, self-healing software delivery cycles for both application code and AI assets (models, prompts, and evaluation datasets).
- Building and maintaining comprehensive observability (monitoring, logging, and tracing) systems to ensure proactive anomaly detection, including tracking LLM performance metrics (latency, token usage, and drift).
- Ensuring system security and protection by integrating security guardrails (SAST/DAST, container scanning, prompt injection defense, and data anonymization) directly into the delivery pipeline (DevSecOps).
- Designing and implementing robust disaster recovery (DR), failover procedures, and high-availability strategies across multi-region setups.
- Automating the deployment of system updates and patches, as well as zero-downtime releases of microservices and AI model endpoints.
- Adhering to corporate security, data privacy (GDPR/HIPAA/SOC 2), and industry-standard regulatory rules and compliance practices, with specific guardrails for AI data handling and model usage.
- Providing technical leadership, architectural guidance, and mentorship to developers, data scientists, system engineers, and cross-functional client teams.
- Supporting team infrastructure, unblocking development workflows, and rapidly resolving complex configuration, network, and automation issues across multi-cloud environments.
- Ensuring high availability, scalability, elasticity, and maximum resilience against infrastructure and service component failures.
- Staying up-to-date on the latest cloud-native technologies, CNCF ecosystem projects, FinOps (specifically managing unpredictable cloud AI/GPU spend), and industry best practices to drive continuous innovation.
What We Offer:
- Long-term career stability with a competitive salary paid in USD.
- Conditions for steady career development.
- Development supported by dedicated mentors and a variety of programs focused on expertise and innovation.
- Private medical insurance provided after successful completion of the probationary period.
- A well-equipped and cozy office that supports comfort and productivity across all project stages.
- Welcoming atmosphere and a friendly corporate culture.
Requirements
What we expect from you:
- Practical administration experience with Linux/UNIX and Windows systems (mandatory, at least 3 years in a senior or lead capacity).
- Strong understanding of modern web architectures, microservices, distributed systems, and networking protocols.
- Practical experience in DevOps/Platform Engineering roles involving end-to-end infrastructure development and client-facing delivery (mandatory, at least 3 years).
- Practical database administration and optimization experience with relational, non-relational (NoSQL), and vector databases (e.g., Pinecone, Milvus, Qdrant, or pgvector) used in AI applications.
- Deep understanding and production experience with Infrastructure as Code (IaC) principles, focusing on modularity, reusability, and state management.
- Automation experience with enterprise configuration management tools like Ansible, or modern alternatives/code-driven IaC (e.g., Pulumi).
- Experience in designing, deploying, and managing environments in AWS or Azure using advanced, automated GitOps/IaC workflows (Terraform, OpenTofu, CloudFormation, or Bicep/ARM).
- Practical skills in automating code compilation, artifact management, and continuous deployment using GitHub Actions, GitLab CI, Jenkins, or cloud-native tooling (ArgoCD, Flux).
- Experience implementing automated code testing and shift-left compliance checks in the CI process, extending to continuous evaluation pipelines for LLM-backed applications (using frameworks like Ragas or Langfuse).
- Proficiency in containerization and cloud-native orchestration using Docker, Kubernetes (EKS/AKS), Helm, and ingress management.
- Experience deploying, scaling, and managing service meshes, microservices releases, and containerized AI model deployment frameworks (e.g., vLLM, Triton Inference Server, Hugging Face TGI).
- Experience with enterprise artifact repository managers (JFrog Artifactory, Nexus, or cloud-native container registries).
- Advanced scripting and programming skills in Python (essential for AI ecosystems), Bash, Go, or PowerShell for building custom automation tools.
- Expert knowledge of Git, including advanced branching strategies (GitFlow, Trunk-Based Development), repository management, and managing version control for application code, configuration, and prompt templates.
- Proven experience implementing DevSecOps, secrets management (HashiCorp Vault, AWS Secrets Manager), and identity and access management (IAM).
- Experience implementing cloud financial management (FinOps), with a strong focus on tracking and optimizing high-cost AI infrastructure and API token spending.
- Great communication and consultancy skills, with the ability to articulate technical concepts clearly to both technical teams and non-technical client stakeholders.
Will be a plus:
- Deep knowledge of Linux/Windows OS internals, low-level troubleshooting, kernel tuning, and advanced performance diagnostics.
- Deep knowledge of Enterprise Networking (VPC peering, SD-WAN, VPNs) and Cloud Security Architecture (Zero Trust models, WAF, DDoS mitigation).
- Experience in the end-to-end design, business justification, documentation, and implementation of complex, large-scale enterprise architectural solutions.
- Hands-on experience building or operating Retrieval-Augmented Generation (RAG) pipelines and managing LLM-backed agent orchestration frameworks (e.g., LangChain, AutoGen).
- Active professional-level certifications (e.g., AWS Certified Solutions Architect Professional, Azure Solutions Architect Expert, CKA/CKAD, or cloud AI/Machine Learning specializations).