Sundaram Ananthanarayanan


Intro

I am a Staff Software Engineer with a decade of experience at large tech companies, where I have built distributed systems that solve problems at a range of scales.

Over time, my focus has shifted from honing technical skills to understanding the following key questions:

This evolution reflects my transition from solely providing technical expertise to ensuring that we develop solutions that address genuine user needs.


Work Experience

Netflix

December 2019 - Present

Staff Software Engineer, Data Platform

Broadly, I have been focusing on making it easy for engineers at Netflix to publish, transform, and read real-time data. The engines and abstractions I have been involved with power a variety of real-time use cases at Netflix, from its well-known recommendations to shadow canaries that improve the resilience of its services. Below is a selection of projects I have been lucky to be part of at Netflix.

Mantis

Mantis is a stream processing engine combined with a data transport system, akin to Kafka and Flink. At its core, Mantis is engineered to handle large volumes of operational data with minimal latency. It achieves this by leveraging two essential properties of operational data:

Mantis capitalizes on these properties by providing a transport system that performs:

In my opinion, Mantis is uniquely suited for addressing challenges in the operational domain, making it an excellent fit for large software companies like Netflix. Originally developed in-house, Mantis is now also being utilized by other companies, including Stripe.

Migrating Mantis to Kubernetes

When Mantis was initially built, Apache Mesos was the prevalent DC/OS. Consequently, Mantis was designed around Mesos. However, with the widespread adoption of Kubernetes, Mesos has lost its relevance in the DC/OS landscape and has been deprecated. Even at Netflix, we have migrated away from Mesos to a specialized version of Kubernetes known internally as Titus.

To adapt to this shift, I spearheaded a project to migrate Mantis from Mesos to Kubernetes. This migration posed several unique challenges:

After a couple of years of hard work and leading a team of senior engineers at Netflix, we succeeded in fully transitioning Mantis from Mesos to Kubernetes. This was one of the most challenging projects I’ve been part of, offering deep insights into building, maintaining, and migrating large distributed systems. In the end, we even leveraged machine learning to significantly reduce costs associated with the project.

Flink

Flink is a highly popular open-source stream processing engine. Unlike Mantis, which prioritizes latency over consistency, Flink focuses on analytical use cases where consistency and correctness are paramount. At Netflix, the footprint of Flink is as substantial as that of Mantis and continues to grow rapidly. I have been actively involved with various initiatives within the Flink platform at Netflix.

One of the major pain points for teams building Flink pipelines is the need to maintain separate batch jobs that pull data from a data warehouse during outages. This approach has several limitations:

To address these issues, I developed a new system that allows users to backfill their Flink pipelines using a streaming Iceberg source. This source mimics the properties of Kafka, providing the same ordering guarantees without the high storage costs associated with Kafka. This project has been highly successful at Netflix, with hundreds of Flink jobs adopting this solution to backfill their pipelines in the event of an outage. Additionally, I helped open-source the project, and it is now being used by other organizations that utilize both Flink and Apache Iceberg.

Uber

May 2016 - December 2019

Senior Software Engineer II, Developer Infrastructure

Submit Queue

At Uber, we managed a monorepo that thousands of engineers were committing changes to concurrently. Monorepos offer significant advantages, such as improved code sharing, consistent tooling, and streamlined platform updates. However, a major downside is that the master branch frequently breaks due to the high volume of commits. This was a significant issue at Uber, where the master branch was broken 50% of the time.

To resolve this problem, I led a team to develop a system called Submit Queue. The primary goal of Submit Queue is to ensure that the master branch remains stable. It achieves this by running all necessary checks on changes before merging them into the master branch. The simplest approach would be to queue the changes and run checks sequentially, but this would be inefficient and not scalable.

We identified two critical requirements for the system:

  1. Scalability: The system needed to handle thousands of commits per day.
  2. Low Latency: Engineers at Uber are highly sensitive to the time it takes for their changes to be merged.

To meet these challenges, we developed two innovative techniques:

  1. Probabilistic Speculation: This technique involves speculating on the outcome of checks before they are run, allowing us to perform checks in parallel and reduce overall latency. We utilized machine learning models to predict the outcomes.
  2. Conflict Analysis: By analyzing the build graph, we could detect conflicts between changes. This enabled us to run checks for changes affecting independent parts of the system in parallel.
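The two techniques can be illustrated with a minimal sketch. This is not the production Submit Queue implementation: the `Change` type, the target names, and the success probabilities are hypothetical, and treating each earlier change's outcome as independent is a simplifying assumption. In the system described above, a machine learning model supplies the predictions and the build graph supplies the affected targets.

```python
from __future__ import annotations

from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Change:
    """A pending change. All fields are illustrative assumptions."""
    name: str
    targets: frozenset  # build targets the change affects
    p_success: float    # predicted probability that its checks pass


def conflicts(a: Change, b: Change) -> bool:
    """Conflict analysis: two changes conflict if they share a build target."""
    return bool(a.targets & b.targets)


def candidate_builds(queue: list[Change], index: int) -> dict[tuple, float]:
    """Speculation: enumerate the builds that might be needed for queue[index].

    Each key lists the earlier conflicting changes assumed to have merged,
    followed by the change itself. Each value is the probability that this
    exact build is the one needed, assuming independent outcomes.
    """
    target = queue[index]
    # Only earlier changes that conflict with the target affect its build.
    earlier = [c for c in queue[:index] if conflicts(c, target)]
    builds = {}
    for outcome in product([True, False], repeat=len(earlier)):
        prob = 1.0
        base = []
        for change, merged in zip(earlier, outcome):
            prob *= change.p_success if merged else 1.0 - change.p_success
            if merged:
                base.append(change.name)
        builds[tuple(base) + (target.name,)] = prob
    return builds


# Hypothetical queue: A and C touch a common target, B is independent.
queue = [
    Change("A", frozenset({"//rides"}), 0.9),
    Change("B", frozenset({"//eats"}), 0.8),
    Change("C", frozenset({"//rides", "//maps"}), 0.7),
]

for i, change in enumerate(queue):
    print(change.name, candidate_builds(queue, i))
```

In this sketch, B needs only one build because it conflicts with nothing ahead of it, so its checks can run in parallel with A's. C has two candidate builds, one on top of A and one without it, and a scheduler could spend its build capacity on the more likely one first.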

The implementation of Submit Queue was a tremendous success, reducing the master branch breakage rate from 50% to 0%.

You can read more about this system in the paper we published at EuroSys ‘19.

Baidu Silicon Valley AI Research Lab

Jan 2016 - May 2016

Software Engineer, Speech Inference

At Baidu, I contributed to the Deep Speech 2 project, which aimed to develop a highly accurate speech recognition system capable of transcribing both English and Mandarin speech using Deep Learning. My primary responsibility was to design and implement the inference APIs and backend infrastructure, enabling developers to seamlessly integrate the speech recognition system into their applications.

Twitter

June 2014 - Jan 2016

Software Engineer, Growth Infrastructure & Engineering Effectiveness

Twitter was my first job after graduating from Stanford. I initially joined the Growth Infrastructure team, where I worked on various projects, including the development of a system to store and retrieve users’ address book contacts. This system played a crucial role in suggesting people for users to follow on Twitter based on their existing contacts.

It was incredibly fascinating to see Scala in action at Twitter. I gained significant insights into functional programming and deepened my understanding of concepts such as monoids, monads, and functors.


Side Projects

BigBeans

BigBeans is a project I started with a few friends to help software engineers and data engineers learn ML by practicing on well-known datasets. We offer a set of curated problems designed to teach ML concepts and their practical applications.

Unlike platforms such as Kaggle, BigBeans emphasizes learning rather than competition. Each problem on BigBeans has a specific threshold of correctness that must be met for the solution to be accepted. These thresholds are tailored to the techniques being taught in each problem.

For example, we feature three versions of the MNIST dataset on BigBeans, each with a different threshold of correctness and a unique technique required to solve it. This approach ensures that users who know a particular technique, such as Support Vector Machines (SVM), can practice problems specifically designed for that technique and receive validation that they are applying it correctly.
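As a rough illustration of threshold-based acceptance, here is a minimal sketch. The function names, the use of plain accuracy as the correctness measure, and the threshold values are all assumptions made for the example, not the actual BigBeans grader.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal in length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def is_accepted(predictions, labels, threshold):
    """A submission is accepted only if it meets the problem's threshold."""
    return accuracy(predictions, labels) >= threshold


# Hypothetical submission: three of four predictions are correct.
labels = [1, 2, 3, 4]
predictions = [1, 2, 3, 0]

print(is_accepted(predictions, labels, threshold=0.70))
print(is_accepted(predictions, labels, threshold=0.90))
```

The same submission can pass one version of a problem and fail another, which is how a higher threshold can require a stronger technique.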


Education

Stanford University

September 2012 - June 2014

Master of Science in Electrical Engineering - Distributed Systems

College of Engineering, Guindy

Graduated in June 2012

Bachelor's in Information Technology


Publications

I have been fortunate to publish some of the work I conducted during my time at Stanford and at various companies where I have been employed. For a complete list of my publications, please refer to my Google Scholar profile.

  1. Keeping Master Green at Scale
    EuroSys ‘19, Dresden, Germany

  2. Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
    ICML ‘16, New York, NY, USA

  3. Reliable Computing with Ultra-Reduced Instruction Set Coprocessors
    DAC ‘12, San Francisco, CA, USA

  4. Low cost permanent fault detection using ultra-reduced instruction set co-processors
    DATE ‘13, Grenoble, France


Talks

  1. Mantis: Stream Processing for Operational Data
    Data Engineering Things, Slides

  2. Backfill Streaming Data Pipelines in Kappa Architecture
    DATA+AI Summit ‘22, Video, Slides

  3. Backfill Flink Pipelines with Apache Iceberg
    Flink Forward ‘21, Video, Slides

  4. Keeping Master Green at Scale
    EuroSys ‘19, Video, Slides