DevOps Engineer | Azure | CI/CD | Big Data Platforms

MANIGANDAN B

Automation, Reliability, and Production Stability at Enterprise Scale

4+ years of experience designing, automating, and supporting enterprise deployment and data platforms using Azure DevOps, Linux, Hadoop, and cloud-based tooling. I specialize in CI/CD pipeline engineering, production reliability, and operational automation in environments where uptime, release stability, and SLA performance directly affect the business.

50%

Faster Releases

99.8%

Pipeline Success

35%

Less Alert Noise

8 FTE hrs/wk

Automation Savings

About

I am a DevOps Engineer with 4+ years of experience in Big Data and enterprise platform environments. Currently based in Utrecht, I support a Netherlands mortgage platform where I own CI/CD workflows, production operations, and reliability improvements across Linux, Hadoop, and Azure DevOps ecosystems. My focus is automation, stability, and reducing manual operational effort in high-availability environments.

Who I Am

I'm a DevOps engineer who believes that reliability is a business enabler. I've spent 4+ years helping teams move faster without burning out on-call engineers. I do this by automating toil, building observability that actually catches problems early, and creating deployment workflows that teams trust.

What I enjoy solving: Reducing the friction between "developers want to ship" and "operators need stability." Whether the problem is pipeline bottlenecks, alert noise, or environment consistency, I design systems that scale and don't need constant manual care.

Current role: Platform and DevOps Engineer at Infosys (Utrecht, Netherlands), supporting critical mortgage platform operations across 40+ Linux servers, Azure DevOps pipelines, and enterprise monitoring stacks.

Looking for: Platform Engineer or Senior DevOps roles where I can own reliability end-to-end, mentor teams on SRE practices, and drive infrastructure standardization in fast-moving environments.

Experience

Platform and DevOps Engineer | Infosys Limited | Jan 2022 - Present
Utrecht / Amersfoort, Netherlands (Current - 6 months onsite), Chennai, India (Earlier tenure)

Netherlands Mortgage Program (OPK): Owned production platform reliability for a financial-grade mortgage processing system serving enterprise customers. Managed SLA compliance (99.9%+), incident response, and platform standardization.
Linux Platform Operations: Managed RHEL/CentOS production infrastructure across 40+ servers, including patching strategies, security hardening, performance tuning, and lifecycle management.
CI/CD Pipeline Architecture: Engineered multi-stage Azure DevOps and Jenkins pipelines across Dev, Test, and Production environments. Standardized deployment workflows with compliance gates, automated rollback, and health checks, reducing release cycle time by about 50% (4-6 hrs → 2.5 hrs).
Infrastructure Automation: Built 12+ Ansible playbooks for configuration standardization, credential rotation, drift prevention, and environment consistency across 8 environments, eliminating 90% of manual configuration errors.
Operational Toil Reduction: Developed Python and Bash automation for log analysis, health checks, alert correlation, and incident triage. Saved 8+ FTE hours weekly and reduced alert noise by 35%.
Observability Platform: Architected unified monitoring across Grafana, Splunk, Dynatrace, and Azure Monitor with intelligent alert correlation. Improved MTTD by 25% and MTTR by 18%.
Incident Leadership: Led incident response, root cause analysis, and postmortem facilitation for 50+ production incidents. Maintained 99.9% SLA uptime and established escalation playbooks.
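
To illustrate the drift-prevention idea mentioned above, here is a minimal sketch of comparing a desired configuration against what is found on a server. It is not the Ansible playbook logic itself. The function name `find_drift` and the sample keys are hypothetical, chosen only for illustration.

```python
from typing import Any, Dict, Tuple


def find_drift(desired: Dict[str, Any], actual: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return {key: (desired_value, actual_value)} for every key that differs.

    A key missing on one side is reported with None for that side, so this
    sketch cannot tell "missing" apart from an explicit None value.
    """
    drift = {}
    for key in sorted(set(desired) | set(actual)):
        want, have = desired.get(key), actual.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical settings, not taken from any real environment.
desired_state = {"ntp_server": "ntp.internal", "max_open_files": 65536}
actual_state = {"ntp_server": "ntp.internal", "max_open_files": 1024, "debug": True}
print(find_drift(desired_state, actual_state))
```

In practice a report like this would feed a remediation step or a pipeline gate rather than a print statement.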

Current Project

I currently support the OPK project in a financial-grade modernization and reliability program. My role spans three delivery streams with both hands-on execution and production ownership.

CI/CD Specialist Team: Designed and implemented end-to-end pipelines to deploy data changes to on-prem servers through Ansible with controlled, auditable release flow.
Re-engineering Team: Support development teams for production changes, lead on-call shift coordination, and drive critical SLA adherence during priority incidents.
Platform Engineering Team: Manage Linux platform operations across Ataccama and Cloudera ecosystems (including Hadoop), covering node provisioning, patching, and upgrade activities.
Build Python and Bash automations for health checks, operational diagnostics, incident response acceleration, and better handover documentation quality.
Improve observability with Grafana, Splunk, Dynatrace, and Azure Monitor to reduce MTTD and MTTR across critical workloads.
Build and maintain Terraform modules and execute plan/apply workflows for consistent Azure infrastructure provisioning.
Built and trained AI-assisted operational workflows to speed up project delivery, improve decision quality, and reduce manual effort across repetitive engineering tasks.

Featured Projects

Case studies focused on problem, solution, architecture, and measurable results.

⭐ FLAGSHIP PROJECT

CI/CD and Configuration Standardization

Enterprise-Grade Pipeline Standardization & SLA Architecture

Problem: Inconsistent deployments across teams, 4–6 hour manual releases, 40% of incidents from config drift, zero compliance automation.

My Role as Platform and DevOps Engineer: Owned the redesign of deployment workflows, release controls, and environment consistency across 8 enterprise environments.

Solution: Architected a standardized Azure DevOps and Jenkins delivery platform with Ansible-based configuration automation, release gates, rollback safety, and integrated health validation.

Tools: Azure DevOps, Jenkins, Ansible (12+ playbooks), Terraform, Splunk, SonarQube, HashiCorp Vault

Impact: Faster releases, stronger compliance, fewer drift-related incidents, and higher deployment confidence across production teams.
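
The release gate and rollback safety described above can be pictured with a short sketch. This is an illustration under stated assumptions, not the pipeline code used on the project: the names `run_release_gate` and `decide_action` and the sample checks are hypothetical, and real checks would probe services rather than return fixed values.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class GateResult:
    passed: bool
    failed_checks: List[str] = field(default_factory=list)


def run_release_gate(checks: Dict[str, Callable[[], bool]]) -> GateResult:
    """Run every named health check; the gate passes only if all succeed."""
    failed = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            # A check that raises counts as a failed check, not a crashed gate.
            ok = False
        if not ok:
            failed.append(name)
    return GateResult(passed=not failed, failed_checks=failed)


def decide_action(result: GateResult) -> str:
    """Map the gate outcome to a pipeline action."""
    return "promote" if result.passed else "rollback"


# Hypothetical checks standing in for real post-deployment probes.
sample_checks = {
    "service_responds": lambda: True,
    "smoke_test": lambda: False,
}
outcome = run_release_gate(sample_checks)
print(decide_action(outcome), outcome.failed_checks)
```

Treating an exception as a failed check keeps the decision conservative: anything the gate cannot verify leads to rollback instead of promotion.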

50% Faster Releases

99.8% Pipeline Success

-70% Config Drift

$150K Annual Savings

Azure DevOps | Ansible IaC | Compliance | Production

Observability and Incident Response

Unified Observability Platform

Problem: Fragmented monitoring across 4 stacks (Grafana, Splunk, Dynatrace, Azure Monitor), 15+ min incident detection, manual cross-platform correlation, 30% false positives.

My Role as Platform and DevOps Engineer: Led observability consolidation for critical Linux workloads and designed the alerting model used by on-call teams.

Solution: Built a unified monitoring layer with shared dashboards, centralized logs, intelligent alert correlation, and runbook-linked incident context.

Tools: Grafana, Splunk, Dynatrace, Azure Monitor, Fluentd, Prometheus, AlertManager, Python

Impact: Faster detection, lower noise, better on-call efficiency, and more consistent incident handling for SLA-sensitive systems.
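
For readers less familiar with the detection and recovery metrics used here, the sketch below shows one way to compute MTTD and MTTR from incident timestamps. It assumes MTTD is measured from incident start to detection and MTTR from detection to resolution; other teams define these differently. The field names and the sample records are invented for illustration and are not project data.

```python
from datetime import datetime
from statistics import mean
from typing import Dict, List, Tuple

TIME_FORMAT = "%Y-%m-%d %H:%M"


def minutes_between(start: str, end: str) -> float:
    """Elapsed minutes between two timestamps in TIME_FORMAT."""
    delta = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
    return delta.total_seconds() / 60


def mttd_mttr(incidents: List[Dict[str, str]]) -> Tuple[float, float]:
    """Return (MTTD, MTTR) in minutes for a non-empty list of incidents."""
    detect = [minutes_between(i["started"], i["detected"]) for i in incidents]
    recover = [minutes_between(i["detected"], i["resolved"]) for i in incidents]
    return mean(detect), mean(recover)


def percent_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent (negative means lower)."""
    return (after - before) / before * 100


# Invented sample incidents.
sample = [
    {"started": "2024-01-01 10:00", "detected": "2024-01-01 10:10", "resolved": "2024-01-01 11:10"},
    {"started": "2024-01-02 10:00", "detected": "2024-01-02 10:20", "resolved": "2024-01-02 10:50"},
]
print(mttd_mttr(sample))
```

Comparing the two means before and after a monitoring change with `percent_change` gives figures of the kind reported for this project.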

-25% Detection Time

-18% Recovery Time

99.9% SLA Uptime

-35% Alert Noise

Grafana | Observability | Incident Response | SRE

Automation and Toil Reduction

Runbook Automation & Alert Correlation

Problem: 200+ daily alerts (70% false positives), manual runbooks causing 2+ hour response, inconsistent incident recovery, 8 FTE hours/day wasted in triage.

My Role as Platform and DevOps Engineer: Drove the automation design for repetitive operational tasks and standardized first-response workflows for support teams.

Solution: Built a Python correlation engine, Bash runbooks, PagerDuty integration, and controlled auto-remediation to reduce manual triage effort.

Tools: Python, Bash, PagerDuty, Splunk, Git, Ansible, AlertManager

Impact: Lower alert fatigue, faster recovery, improved first-time resolution, and measurable reduction in operational toil.
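
The core idea behind alert correlation can be shown in a few lines. The sketch below is a simplified stand-in, not the engine built for the project: it assumes that repeated alerts for the same host and check arriving within a fixed window belong to one incident. The `Alert` and `Incident` classes, the `correlate` function, and the 300-second default are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Alert:
    timestamp: float  # seconds since an arbitrary epoch
    host: str
    check: str


@dataclass
class Incident:
    host: str
    check: str
    first_seen: float
    last_seen: float
    count: int = 1


def correlate(alerts: List[Alert], window_seconds: float = 300.0) -> List[Incident]:
    """Collapse repeats of the same (host, check) into one incident when each
    repeat arrives within window_seconds of the previous alert for that pair."""
    latest: Dict[Tuple[str, str], Incident] = {}
    incidents: List[Incident] = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.host, alert.check)
        current = latest.get(key)
        if current is not None and alert.timestamp - current.last_seen <= window_seconds:
            current.last_seen = alert.timestamp
            current.count += 1
        else:
            current = Incident(alert.host, alert.check, alert.timestamp, alert.timestamp)
            latest[key] = current
            incidents.append(current)
    return incidents


# Invented alerts: four raw alerts collapse into three incidents.
raw = [
    Alert(0, "node01", "cpu"),
    Alert(60, "node01", "cpu"),
    Alert(100, "node02", "disk"),
    Alert(1000, "node01", "cpu"),
]
for incident in correlate(raw):
    print(incident)
```

Only one notification per incident would then be sent to the on-call tool, which is where the reduction in alert volume comes from.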

-35% Alert Volume

-67% Response Time

+45% First-Time Resolution

8 FTE hrs/week saved

Python | Automation | Incident Mgmt | Toil Reduction

DevOps and Platform Skills

Hands-on tooling used to improve delivery speed, platform stability, and operational efficiency.

Cloud and DevOps Delivery

Azure DevOps, Jenkins, Git, and YAML pipelines used to standardize releases, enforce gates, and improve deployment reliability across multi-environment delivery flows.

Big Data Platforms

Hadoop, Hive, and Cloudera operations supporting data platform stability, job execution reliability, cluster health, and enterprise production support.

Automation and Scripting

Python, Bash, Ansible, and AWX used to reduce manual intervention, automate runbooks, standardize configuration, and cut repeat operational effort.

Monitoring and Reliability

Grafana, Splunk, Dynatrace, Azure Monitor, and Prometheus applied to detect issues earlier, reduce alert noise, and support RCA and SLA adherence.

Infrastructure and Configuration

Linux administration, Terraform, ARM templates, and configuration management used to keep environments consistent and reduce drift-related incidents.

Incident Management and Support

Production support, ITIL workflows, escalation handling, postmortems, and stakeholder coordination in systems with 99.9% uptime expectations.

Security and Compliance

Release validation, secrets handling, audit logging, and compliance checks integrated into delivery pipelines for safer production changes.

Certifications

Microsoft Certified: Azure Data Engineer Associate (DP-203) - Sep 2023
In Progress: RHCSA and Azure Administrator (AZ-104)

Education

MBA, Business Administration (Operations and Technology Management), 2022-2024
B.Sc, Mathematics, 2018-2021

Languages

English: Professional proficiency
Tamil: Native
Dutch: Beginner (actively learning)

Activities and Interests

Hobbies: Trekking, swimming, cycling, and dog training.
National Cadet Corps (NCC): Active participation for nearly 3 years (2018-2021), with 12 camps attended including Army and trekking camps.
Developed strong discipline, team coordination, endurance, and leadership through structured training and camp activities.

AI Lab

I use AI in practical DevOps and platform workflows to reduce response time, improve communication quality, and increase release confidence.

AIOps | LLM-Assisted RCA | Prompt Engineering | Copilot for Delivery | Automation Agents

Incident AI

Incident Summary Copilot

Converts noisy incident timelines into concise stakeholder-ready summaries with clear impact, current status, and next actions.

Reliability AI

RCA Draft Assistant

Speeds up first-draft RCA creation by organizing failure sequence, probable causes, evidence points, and prevention actions.

Change AI

Change Risk Brief Generator

Produces pre-deployment risk briefs from change details so teams can plan rollback, validation, and communication before production rollout.

Delivery AI

AI Workflow Trainer for Project Execution

Built and trained AI assistants using project runbooks and delivery patterns to generate faster implementation guidance, improve consistency, and keep execution timelines on track.

Prompt Studio

A real use case showing how I frame AI prompts in operations.

Sample Prompt

Summarize this incident timeline in 5 bullets with impact, current status, suspected cause, mitigation already applied, and next action for stakeholders.

Expected Outcome

A clean and business-ready incident update in less than 1 minute.
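
As a rough illustration of how the sample prompt could be assembled from raw timeline entries before it is sent to an assistant, here is a minimal sketch. The function name `build_incident_prompt` and the timeline entries are hypothetical, and no specific AI service or API is assumed; the sketch only builds the text.

```python
from typing import List

SUMMARY_INSTRUCTION = (
    "Summarize this incident timeline in 5 bullets with impact, current status, "
    "suspected cause, mitigation already applied, and next action for stakeholders."
)


def build_incident_prompt(timeline_entries: List[str]) -> str:
    """Combine the fixed instruction with a numbered, cleaned-up timeline."""
    cleaned = [entry.strip() for entry in timeline_entries if entry.strip()]
    if not cleaned:
        raise ValueError("timeline is empty")
    numbered = "\n".join(f"{i}. {entry}" for i, entry in enumerate(cleaned, start=1))
    return f"{SUMMARY_INSTRUCTION}\n\nTimeline:\n{numbered}"


# Invented timeline entries.
print(build_incident_prompt([
    "09:02 disk usage alert fired on node01",
    "09:10 ingestion job failed",
    "09:25 old logs purged, job restarted",
]))
```

Keeping the instruction fixed and varying only the timeline makes the summaries comparable from one incident to the next.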

Contact

Open to Platform and DevOps opportunities across the Netherlands and Europe. Currently in Utrecht. Visa sponsorship required.

Location

Utrecht, Netherlands

Phone (Netherlands)

+31 617886316

Phone (India)

+91 8056353767