At OLX India, our microservices architecture isn’t just a collection of services; it’s a high-traffic ecosystem that demands precision. As we scaled, we hit a critical wall with our deployment strategy. We needed to implement reliable, automated Canary Deployments (using Argo Rollouts) to gradually shift traffic to newer versions of our services.

However, we faced a major architectural hurdle: Multi-Ingress Complexity.

Depending on the use case, a request might enter our cluster through Nginx-Ingress, an Ambassador API Gateway, or an AWS ALB Ingress. Because Argo Rollouts relies on ingress-specific implementations for traffic splitting, configuring progressive delivery across this fragmented networking landscape was becoming an operational nightmare. We needed a “central brain” for traffic management—a single implementation logic that could cleanly intercept and route both North-South (public) and East-West (internal) traffic.

This led us to evaluate integrating a Service Mesh.

The Evaluation: Linkerd vs. Istio Sidecar

Our core requirements were strict: we needed deep traffic control, robust observability, and seamless integration with Argo Rollouts. We evaluated two primary candidates: Linkerd and Istio.

While Linkerd (built on a purpose-built Rust micro-proxy) shone in simplicity and raw performance, it relied on the basic Service Mesh Interface (SMI), which lacked the flexibility we needed for complex deployment strategies. Istio, powered by the battle-tested C++ Envoy proxy, offered superior technical capabilities:

  • Advanced Traffic Management: Beyond basic traffic splits, Envoy allows for header-based routing, traffic mirroring, fault injection, and dynamic retries out of the box.
  • Native Ingress Controller: Istio comes with its own Gateway component, dramatically simplifying external traffic management and edge routing compared to Linkerd’s reliance on third-party ingress controllers.
  • Seamless Canary Integration: Flawless, native integration with Argo Rollouts without needing custom SMI translation layers.

Istio was the clear winner functionally. But then we looked at the cost.

The Problem: The Exorbitant “Sidecar Tax”

Historically, Istio relies on the Sidecar Model. To bring a service into the mesh, an Envoy proxy container is injected into every single pod. It intercepts all network traffic, allowing for application-aware features.

But this model scales poorly in terms of resource consumption. According to 90 days of production data from New Relic, a single sidecar proxy with 2 worker threads consumes about 0.20 vCPU and 60 MB of memory.

When we ran the numbers for our infrastructure (approx. 190 nodes and 5,300 pods at peak), the financial reality hit hard:

  • Subtracting our daemonsets, we had roughly 3,600 application pods in production.
  • Reserving just 100m vCPU per pod (half of the roughly 0.20 vCPU we observed in use) meant setting aside about 360 vCPU in production alone, purely to run the mesh.
  • The Estimated Cost: Running sidecars across all environments (PRD, STG, QA) would cost us approximately $6,000 per month.
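
As a rough sanity check on these bullets, the reservation arithmetic can be sketched in a few lines of Python. The sketch uses only the figures quoted above, and the names are illustrative. The dollar figure is deliberately not derived, because the per-vCPU price and the STG/QA pod counts are not stated here.

```python
# Back-of-the-envelope sidecar overhead for production, from the figures above.
APP_PODS_PRD = 3_600            # application pods in production
RESERVED_MILLICORES = 100       # 100m vCPU reserved per sidecar
OBSERVED_MILLICORES = 200       # ~0.20 vCPU observed per sidecar
OBSERVED_MEMORY_MB = 60         # ~60 MB observed per sidecar


def sidecar_overhead(pods: int) -> dict:
    """Aggregate sidecar overhead for `pods` pods, one proxy per pod."""
    return {
        "reserved_vcpu": pods * RESERVED_MILLICORES / 1000,
        "observed_vcpu": pods * OBSERVED_MILLICORES / 1000,
        "observed_memory_gb": pods * OBSERVED_MEMORY_MB / 1000,
    }


overhead = sidecar_overhead(APP_PODS_PRD)
for name, value in overhead.items():
    print(f"{name}: {value:g}")
```

Under these assumptions the mesh would reserve about 360 vCPU in production before staging and QA are counted, and about twice that if the proxies were sized to their observed usage.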

We were at a crossroads. We needed the power of Istio, but the “Sidecar Tax” was a non-starter.

The Solution: Deep Dive into Istio Ambient Mesh

This is where Istio Ambient Mesh changed the game. Ambient is a “sidecar-less” architecture. Instead of forcing a massive proxy into every pod, Ambient splits the mesh’s responsibilities into two lightweight, distinct layers: Layer 4 (Secure Transport) and Layer 7 (The Routing Brain).

Layer 4: The Ztunnel (Zero Trust Tunnel)

Instead of a proxy per pod, the Ztunnel runs as a highly optimized daemonset—one proxy per node.

  • Responsibility: It handles the heavy lifting at the L3/L4 level, strictly managing mTLS (encryption), strong SPIFFE identity-based authentication, and TCP telemetry.
  • Efficiency: Because it explicitly avoids parsing HTTP headers or executing complex L7 routing logic, it’s incredibly lightweight. A single Ztunnel consumes only about 0.06 vCPU and 12 MB of memory.
  • The Mechanism (HBONE): Ambient introduces HBONE (HTTP-Based Overlay Network Environment). Traffic leaving the source pod is intercepted by the iptables rules programmed by the istio-cni node agent and routed to the Ztunnel. The Ztunnel encapsulates this traffic inside an HTTP/2 CONNECT request (HBONE) and encrypts it via mTLS. The traffic arrives at the destination node’s Ztunnel, where it is decrypted and delivered inside the destination pod’s network namespace. This preserves the original source Pod IP, ensuring full compatibility with our existing Kubernetes Network Policies.

Layer 7: The Waypoint Proxy

When we need “smart” L7 capabilities—like the weighted HTTP traffic splitting required for our Canary deployments—we deploy an Envoy-based Waypoint Proxy.

  • On-Demand Power: Waypoints are strictly deployed only for workloads that actively require L7 manipulation (like Canary routing, HTTP circuit breaking, or advanced authorization policies). If a service only needs mTLS, it completely bypasses the Waypoint overhead and relies solely on the Ztunnel.
  • Native Kubernetes Gateway API Integration: Unlike the old VirtualService attachments, Waypoints are natively provisioned and managed using the modern Kubernetes Gateway API (Gateway and HTTPRoute CRDs), aligning our mesh configuration with upstream Kubernetes networking standards.
  • Decoupled Scaling: Because Waypoints run completely outside the application pod’s lifecycle, a single Waypoint deployment can serve a specific service account or an entire namespace. We can scale them independently using standard HPAs based on request throughput, ensuring our L7 mesh infrastructure grows only where it’s actively utilized.
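
As an illustration of the Gateway API shape described above (not the exact manifests used in production), a waypoint and a weighted route could be built as Python dictionaries and printed as JSON, which kubectl accepts. The field values (the `istio-waypoint` gateway class, the HBONE listener on port 15008, the `istio.io/waypoint-for` label) follow our reading of the upstream Istio ambient documentation and should be checked against the Istio version in use. The namespace and service names are hypothetical.

```python
import json


def waypoint_gateway(namespace: str, name: str = "waypoint") -> dict:
    """Gateway resource asking Istio to provision a waypoint for services."""
    return {
        "apiVersion": "gateway.networking.k8s.io/v1",
        "kind": "Gateway",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "labels": {"istio.io/waypoint-for": "service"},
        },
        "spec": {
            "gatewayClassName": "istio-waypoint",
            "listeners": [{"name": "mesh", "port": 15008, "protocol": "HBONE"}],
        },
    }


def canary_route(namespace: str, service: str, stable: str, canary: str,
                 canary_weight: int, port: int = 80) -> dict:
    """HTTPRoute attached to a Service that splits traffic by weight."""
    if not 0 <= canary_weight <= 100:
        raise ValueError("canary_weight must be between 0 and 100")
    return {
        "apiVersion": "gateway.networking.k8s.io/v1",
        "kind": "HTTPRoute",
        "metadata": {"name": f"{service}-canary", "namespace": namespace},
        "spec": {
            # An empty group plus kind Service attaches the route to mesh traffic.
            "parentRefs": [{"group": "", "kind": "Service",
                            "name": service, "port": port}],
            "rules": [{"backendRefs": [
                {"name": stable, "port": port, "weight": 100 - canary_weight},
                {"name": canary, "port": port, "weight": canary_weight},
            ]}],
        },
    }


# Hypothetical names, for illustration only.
gateway = waypoint_gateway("shop")
route = canary_route("shop", "checkout", "checkout-stable", "checkout-canary", 1)
print(json.dumps([gateway, route], indent=2))
```

Workloads still have to be pointed at the waypoint (Istio uses the `istio.io/use-waypoint` label for this), which is what keeps services that only need mTLS off the L7 path.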

The Simplified Request Flow

To understand the elegance of this design, here is the step-by-step lifecycle of a request in our new Ambient Mesh:

  1. Source: A request originates from a pod (e.g., an Ingress controller) and leaves its network namespace.
  2. Intercept: The iptables rules programmed by the node’s istio-cni agent intercept the outbound traffic and redirect it to the source node’s Ztunnel.
  3. Tunnel (L4): The source Ztunnel encapsulates the request in an HBONE tunnel, encrypting it with mTLS.
  4. Waypoint Routing (L7): If the destination service requires Canary routing, the traffic is sent from the source Ztunnel to the destination’s Waypoint Proxy. The Waypoint evaluates the HTTP rules (e.g., 99% to Stable, 1% to Canary).
  5. Delivery: The traffic is routed to the destination node’s Ztunnel, decrypted, and finally delivered directly into the destination pod.
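
The five steps above can be modelled with a small Python sketch. This is a toy model of the routing decision, not Istio code: the `Destination` class, the hop labels, and the `checkout` service are hypothetical, and the weighted pick only imitates what a Waypoint does with a 99/1 rule.

```python
import random
from dataclasses import dataclass


@dataclass
class Destination:
    """Hypothetical description of a destination service in the mesh."""
    name: str
    has_waypoint: bool = False
    canary_percent: int = 0   # share of requests the Waypoint sends to Canary


def request_path(dest: Destination) -> list:
    """Return the ordered hops a request takes, following steps 1 to 5."""
    path = [
        "source pod",
        "source ztunnel (HBONE + mTLS)",
    ]
    if dest.has_waypoint:
        path.append(f"waypoint for {dest.name} (evaluates HTTP rules)")
    path += ["destination ztunnel (decrypts)", "destination pod"]
    return path


def choose_version(dest: Destination, rng: random.Random) -> str:
    """Weighted pick between Stable and Canary, as a Waypoint would apply it."""
    if not dest.has_waypoint:
        return "stable"  # assumption: without L7 rules there is no split
    weights = [100 - dest.canary_percent, dest.canary_percent]
    return rng.choices(["stable", "canary"], weights=weights)[0]


checkout = Destination("checkout", has_waypoint=True, canary_percent=1)
print(" -> ".join(request_path(checkout)))

rng = random.Random(42)  # seeded so repeated runs give the same split
sample = [choose_version(checkout, rng) for _ in range(100_000)]
print("canary share:", sample.count("canary") / len(sample))
```

A service without a Waypoint simply skips the third hop, which is why L4-only workloads pay no L7 cost in this design.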

The Financial Impact: Slashing the Bill by 90%

By moving to this two-tiered architecture, we transformed our service mesh from a massive infrastructure burden into a lean, high-performance asset.

Our Ambient Cost Breakdown:

  • Ztunnel: Across our ~290 nodes (PRD, STG, QA), the shared Ztunnels consume at most 58 vCPU, or about 0.2 vCPU per node.
  • Waypoint: We only deploy Waypoints for services undergoing Canary deployments. Based on load tests (500m vCPU per 50k RPM), serving our targeted 850k RPM requires roughly 15 vCPU (including a 60% headroom buffer).
  • Total Resource Usage: 73 vCPU.
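
The breakdown can be reproduced with a short Python sketch that uses only the numbers in these bullets. The variable names are illustrative, and the 15 vCPU Waypoint budget is taken as stated rather than derived.

```python
# Ambient mesh vCPU budget, from the figures quoted above.
NODES_ALL_ENVS = 290             # PRD + STG + QA nodes, one Ztunnel each
ZTUNNEL_MAX_VCPU = 58            # stated ceiling for all Ztunnels combined
MILLICORES_PER_50K_RPM = 500     # load-test result for a Waypoint
TARGET_RPM = 850_000             # traffic that must pass through Waypoints
HEADROOM = 0.60                  # 60% buffer on top of the measured need
WAYPOINT_BUDGET_VCPU = 15        # the rounded figure used in the total

ztunnel_vcpu_per_node = ZTUNNEL_MAX_VCPU / NODES_ALL_ENVS
waypoint_base_vcpu = TARGET_RPM / 50_000 * MILLICORES_PER_50K_RPM / 1000
waypoint_with_headroom = waypoint_base_vcpu * (1 + HEADROOM)
total_vcpu = ZTUNNEL_MAX_VCPU + WAYPOINT_BUDGET_VCPU

print(f"Ztunnel ceiling per node: {ztunnel_vcpu_per_node:.2f} vCPU")
print(f"Waypoint need at target load: {waypoint_base_vcpu:.1f} vCPU")
print(f"Waypoint need with headroom: {waypoint_with_headroom:.1f} vCPU")
print(f"Total budget: {total_vcpu} vCPU")
```

The raw Waypoint figure with headroom comes out at about 13.6 vCPU, so the 15 vCPU in the bullet leaves a little extra margin.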

Monthly Cost Comparison:

Model          | Infrastructure Footprint            | Estimated Monthly Cost
Istio Sidecar  | ~3,600 Proxies (1 per pod)          | $6,000
Istio Ambient  | ~290 Ztunnels + Targeted Waypoints  | $660

By adopting Ambient mode, we achieved the exact same functional goals while reducing our projected mesh infrastructure costs by nearly 90%.
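
The "nearly 90%" figure follows directly from the two monthly estimates, as this small Python check shows. The annual figure is an extrapolation that assumes both monthly estimates stay flat, which is not something stated above.

```python
SIDECAR_MONTHLY_USD = 6_000   # estimated sidecar cost across PRD, STG, QA
AMBIENT_MONTHLY_USD = 660     # estimated ambient cost across PRD, STG, QA

monthly_saving = SIDECAR_MONTHLY_USD - AMBIENT_MONTHLY_USD
reduction_pct = 100 * monthly_saving / SIDECAR_MONTHLY_USD
annual_saving = 12 * monthly_saving  # assumption: costs stay flat all year

print(f"Monthly saving: ${monthly_saving:,}")
print(f"Reduction: {reduction_pct:.0f}%")
print(f"Annualised saving: ${annual_saving:,}")
```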

The “Bonus” Wins: Why We Stayed

We originally evaluated Istio to solve our Canary deployment problem, but the Ambient architecture unlocked massive operational and reliability benefits across our entire cluster, independent of Canary usage:

1. Zero-Downtime Onboarding & Zero-Trust by Default

In the Sidecar model, adding an application to the mesh requires restarting the deployment so that the proxy can be injected into each new pod. With Ambient, onboarding is as simple as adding a label (istio.io/dataplane-mode=ambient) to the namespace. The istio-cni agent silently updates iptables rules to redirect traffic to the node’s Ztunnel. Instantly, and without a single pod restart, our service-to-service traffic gained automatic mTLS encryption and strong identity-based authentication.
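
For illustration, the label change can be expressed as a JSON merge patch or as the equivalent kubectl command. The Python sketch below only builds those two artifacts and does not talk to a cluster. The `payments` namespace is hypothetical. If the official Kubernetes Python client is installed, the patch body could be passed to `CoreV1Api.patch_namespace(name, body)`, but that call is not exercised here.

```python
import json

AMBIENT_LABEL_KEY = "istio.io/dataplane-mode"
AMBIENT_LABEL_VALUE = "ambient"


def ambient_label_patch(enable: bool = True) -> dict:
    """Merge-patch body that adds or removes the ambient label."""
    # In a JSON merge patch, a null value removes the key.
    value = AMBIENT_LABEL_VALUE if enable else None
    return {"metadata": {"labels": {AMBIENT_LABEL_KEY: value}}}


def kubectl_command(namespace: str, enable: bool = True) -> str:
    """Equivalent kubectl invocation; a trailing '-' removes a label."""
    if enable:
        return (f"kubectl label namespace {namespace} "
                f"{AMBIENT_LABEL_KEY}={AMBIENT_LABEL_VALUE}")
    return f"kubectl label namespace {namespace} {AMBIENT_LABEL_KEY}-"


# Hypothetical namespace name, for illustration only.
print(json.dumps(ambient_label_patch()))
print(kubectl_command("payments"))
print(kubectl_command("payments", enable=False))
```

Because the change is a namespace label rather than a pod template change, nothing in the workload spec is touched, which is why no rollout is triggered.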

2. Superior Load Balancing and Service Stability

Standard Kubernetes load balancing relies on kube-proxy at the L4 TCP level. Because kube-proxy picks a backend once per connection rather than once per request, long-lived keep-alive or HTTP/2 connections stay pinned to a single pod, which frequently results in “skewed” or imbalanced traffic distribution across pods. By utilizing the Envoy-based Waypoint Proxy for our high-traffic services, we gained sophisticated L7 load balancing that distributes individual requests. We observed that Istio’s load balancing performed on par with AWS ALBs, significantly increasing service stability and preventing hot-spotting on individual pods.
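
The difference between balancing connections and balancing requests can be shown with a small deterministic Python simulation. It is a simplified sketch: round robin stands in for the real selection algorithms of both kube-proxy and Envoy, and the pod names and per-connection request counts are made up.

```python
from itertools import cycle


def per_connection(conn_requests: list, pods: list) -> dict:
    """L4-style: each connection is pinned to one pod for its lifetime."""
    load = {pod: 0 for pod in pods}
    next_pod = cycle(pods)
    for requests in conn_requests:
        load[next(next_pod)] += requests
    return load


def per_request(conn_requests: list, pods: list) -> dict:
    """L7-style: every request is balanced on its own."""
    load = {pod: 0 for pod in pods}
    next_pod = cycle(pods)
    for requests in conn_requests:
        for _ in range(requests):
            load[next(next_pod)] += 1
    return load


pods = ["pod-a", "pod-b", "pod-c"]
# Hypothetical workload: one chatty keep-alive client and three quiet ones.
connections = [9_000, 100, 100, 100]

print("per connection:", per_connection(connections, pods))
print("per request:   ", per_request(connections, pods))
```

With these made-up inputs, the pod that receives the chatty connection carries almost all of the load under per-connection balancing, while per-request balancing spreads the same 9,300 requests evenly.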

3. Crystal-Clear Network Observability

Prior to Istio, our network was largely a black box. Now, even without deploying Waypoints, the node-level Ztunnels provide comprehensive out-of-the-box TCP telemetry (L4 metrics like bytes sent/received and connection durations). For services utilizing Waypoints, we gained a complete suite of L7 metrics (HTTP status codes, request paths) and full access logs. We can monitor success rates, P50/P90/P99 latencies, and request volumes with granular detail directly at the network level, entirely independent of our application-level APM instrumentation.

The Trade-Offs: The Hidden Network Cost

While Ambient Mesh dramatically reduced our compute footprint, it is not without its trade-offs compared to the Sidecar approach. The primary drawback is the extra network hop.

Because the Waypoint Proxy does not run inside the pod like a sidecar, routing traffic through it adds an additional hop over the network. Crucially, in a multi-AZ Kubernetes cluster, this increases the probability that traffic will cross Availability Zone boundaries. During our initial onboarding of high-traffic namespaces, we observed a direct increase in our AWS Cross-AZ Data Transfer costs (an increase of about $70 for an extra 6TB of cross-AZ transfer).

However, because the traffic is now fully controlled by the mesh, this “hidden cost” is solvable with Istio’s built-in Topology Aware Routing, which lets us keep traffic within the same AZ.
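
To see why the extra hop matters and why zone-aware routing removes the penalty, consider a toy probability model in Python. It assumes that each hop lands in a uniformly random zone and that a zone-aware router always finds a healthy endpoint in the local zone. The number of zones is an assumption too, since it is not stated here.

```python
from fractions import Fraction


def expected_crossings(pod_hops: int, zones: int,
                       zone_aware: bool = False) -> Fraction:
    """Expected number of AZ boundary crossings per request (toy model)."""
    if zones < 1 or pod_hops < 0:
        raise ValueError("zones must be >= 1 and pod_hops >= 0")
    if zone_aware:
        # Assumption: a local endpoint always exists, so no hop leaves the zone.
        return Fraction(0)
    # Each hop leaves the current zone with probability (zones - 1) / zones.
    return pod_hops * Fraction(zones - 1, zones)


ZONES = 3  # assumption: the number of AZs is not given in this post

direct = expected_crossings(pod_hops=1, zones=ZONES)
via_waypoint = expected_crossings(pod_hops=2, zones=ZONES)
pinned = expected_crossings(pod_hops=2, zones=ZONES, zone_aware=True)

print(f"direct pod-to-pod: {float(direct):.2f} crossings per request")
print(f"via waypoint:      {float(via_waypoint):.2f} crossings per request")
print(f"via waypoint, zone-aware: {float(pinned):.2f} crossings per request")
```

In this model the Waypoint hop doubles the expected number of billable crossings, and keeping every hop in the local zone brings it to zero. Real clusters sit somewhere in between, depending on how endpoints are spread across zones.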


Conclusion

The transition to Istio Ambient Mesh has fundamentally changed how we approach deployments and network security at OLX India. We managed to secure our East-West traffic with mTLS, gain unprecedented network observability, and avoid a massive $6,000/month “Sidecar Tax”—all without requiring a single pod restart.

We successfully established a highly efficient, secure, and observable networking foundation. But the ultimate test remained: Could this new architecture actually solve our multi-ingress Canary routing problem?

In Part 2, we will deep-dive into how we untangled the ingress chaos, migrated our AWS ALB traffic, and successfully executed precise Canary deployments using Istio Ambient Mesh and Argo Rollouts.
