Mux began using Flink in mid-2016 while evaluating stream-processing platforms to drive our anomaly-detection system. Nearly one year later, we stand by our decision to use Flink and are excited to be part of this growing community.
I was honored to be invited to speak at Flink Forward SF about how Mux uses Flink. The conference organizers, Data Artisans, have done an amazing job of producing & publishing all of the talks with very little delay. You can watch the full talk below:
This was the first Flink Forward conference held outside Berlin, Germany. There was a strong presence from attendees and sponsors alike. Sponsors included Google, Alibaba, and Dell-EMC.
The conference was hosted at Hotel Kabuki in the Japantown neighborhood of San Francisco. I’ve been to plenty of conferences in the Financial District of SF, so I was excited to see this one held in Japantown. The meeting rooms were sized appropriately for the conference. Unfortunately, the WiFi and cellular coverage were weak (the conference area was below street level); hopefully Hotel Kabuki can improve this aspect of their otherwise-great facilities.
Two of my favorite sessions were from Monal Daxini of Netflix, and Stephan Ewen of Data Artisans:
Keynote: Stream Processing with Flink at Netflix - Monal Daxini
Netflix uses Flink to power their Keystone pipeline project, which is responsible for processing all of their video-view measurements. I was eagerly anticipating this talk since the Keystone system is very similar to Mux's analytics service, both in terms of the underlying data (video-views) and the system architecture.
Netflix’s Keystone service is composed of:
- Messaging services (Apache Kafka)
- Management Web UI for managing & scaling services
- Stream Processing as a Service (SPaaS) deployments
The Management system enables non-technical internal users at Netflix to create their own stream-processing pipelines by defining the following:
- input source consisting of an environment (e.g. production, test), geographic region (e.g. US-East), and peak data rate.
- output where the data should be directed (S3, Hive), along with the encoding (JSON, XML, etc.) and the subset of input fields to include; each input can have up to 10 outputs
- filters written as XPath expressions (e.g.
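As a rough illustration of what such a pipeline definition captures, here is a minimal Python sketch. None of this is Netflix's actual schema: the class names, field names, and the unit of the peak data rate are all hypothetical, and attaching filters to the pipeline as a whole (rather than to each output) is an assumption. Only the limit of 10 outputs per input comes from the talk.

```python
from dataclasses import dataclass, field
from typing import List

MAX_OUTPUTS_PER_INPUT = 10  # limit mentioned in the talk


@dataclass
class Source:
    environment: str          # e.g. "production" or "test"
    region: str               # e.g. "US-East"
    peak_events_per_sec: int  # peak data rate; the unit is an assumption


@dataclass
class Output:
    sink: str          # e.g. "S3" or "Hive"
    encoding: str      # e.g. "JSON" or "XML"
    fields: List[str]  # subset of input fields to include


@dataclass
class Pipeline:
    source: Source
    outputs: List[Output]
    # XPath expressions, stored as opaque strings (hypothetical placement)
    filters: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Enforce the per-input output limit described above.
        if len(self.outputs) > MAX_OUTPUTS_PER_INPUT:
            raise ValueError(
                f"at most {MAX_OUTPUTS_PER_INPUT} outputs per input, "
                f"got {len(self.outputs)}"
            )


# Usage: one production source in US-East writing two fields to S3 as JSON.
pipeline = Pipeline(
    source=Source("production", "US-East", peak_events_per_sec=50_000),
    outputs=[Output("S3", "JSON", ["view_id", "timestamp"])],
)
```

Keeping the definition declarative like this is what would let a management UI create and validate pipelines without the user writing any stream-processing code.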
Keystone SPaaS pipelines are implemented as distinct Flink applications. It’s possible to select a pipeline from the Keystone Management service and see the details about the associated Flink cluster and job (only one job per cluster). A web dashboard and alerts are created automatically for each pipeline; this enables detection of poorly performing pipelines, which can be cancelled or restarted.
A single ZooKeeper cluster is used to manage all Flink clusters & jobs in the Keystone system. Each cluster includes two Flink Job Managers to ensure high availability. All Flink Job and Task Managers run in Docker containers.
Experiences Running Flink at Very Large Scale - Stephan Ewen
There were lots of great tips from Stephan Ewen, the CTO of Data Artisans.
Set the minimum time between checkpoints instead of relying on a checkpoint interval alone. This gives computation time to catch up before the next checkpoint starts. As the checkpointed application state grows in size, so does the time it takes to write checkpoints to storage. If the checkpoint interval and end-to-end checkpoint duration are too close, then very little computation will be performed before the next checkpoint.
For example, a checkpoint interval of 60 seconds with an end-to-end checkpoint duration of 30 seconds leaves only 30 seconds of computation before the next checkpoint begins. Specifying a minimum time between checkpoints of 60 seconds instead ensures that a full 60 seconds elapse after one checkpoint completes before the next is initiated.
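To make the arithmetic concrete, here is a minimal Python sketch of a simplified timing model (it is not Flink code). It assumes every checkpoint takes the same fixed time, and that the next checkpoint starts at the later of "one interval after the previous start" and "the minimum pause after the previous end". The function names `checkpoint_schedule` and `idle_gaps` are made up for this illustration. In Flink's Java API, the corresponding settings are `StreamExecutionEnvironment.enableCheckpointing` and `CheckpointConfig.setMinPauseBetweenCheckpoints`.

```python
from typing import List, Tuple


def checkpoint_schedule(
    interval_s: float,
    duration_s: float,
    min_pause_s: float = 0.0,
    count: int = 3,
) -> List[Tuple[float, float]]:
    """Return (start, end) times for `count` checkpoints under the toy model."""
    schedule = []
    start = interval_s  # assume the first checkpoint fires one interval in
    for _ in range(count):
        end = start + duration_s
        schedule.append((start, end))
        # Next start honors both the interval and the minimum pause.
        start = max(start + interval_s, end + min_pause_s)
    return schedule


def idle_gaps(schedule: List[Tuple[float, float]]) -> List[float]:
    """Time between the end of one checkpoint and the start of the next."""
    return [
        next_start - prev_end
        for (_, prev_end), (next_start, _) in zip(schedule, schedule[1:])
    ]


# 60 s interval, 30 s checkpoints, no minimum pause: 30 s between checkpoints.
print(idle_gaps(checkpoint_schedule(60, 30)))                  # [30, 30]
# Same job with a 60 s minimum pause: a full 60 s between checkpoints.
print(idle_gaps(checkpoint_schedule(60, 30, min_pause_s=60)))  # [60, 60]
```

The minimum pause matters most as state grows: in this model, if `duration_s` creeps up toward `interval_s`, the gap without a pause shrinks toward zero, while the gap with a pause never drops below `min_pause_s`.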
Use asynchronous checkpointing whenever possible; checkpointing is synchronous by default. Asynchronous checkpointing adds complexity to your Flink operators, but can yield significant performance gains, especially when paired with the RocksDB backend.
Flink will be adding support for automatic cleanup of orphaned state. Currently Flink jobs that terminate abruptly don’t always have their state cleaned up. This could lead to files being abandoned on an HDFS filesystem, requiring an out-of-band process to identify and reap the orphaned files. It’s good to see that Flink will be addressing this.
What I Learned
I was surprised to see a large number of Flink users running only a single application per Flink cluster. Netflix has what’s probably the largest deployment running in such a configuration, with thousands of relatively small clusters. This results in a simplified deployment, and scales horizontally very easily.
Many of the conference attendees were evaluating Flink but not using it in production yet; however, some of those using it in production were major players like Netflix, Uber, and Alibaba. Flink has won over many large organizations, but has a lot of room to grow with small- and medium-sized businesses.
Docker deployments of Flink are common, with each group creating their own customized Flink Docker image. It would be great to have an official Docker image from the Flink team.
Lastly, using Apache Kafka with Flink is extremely popular. Mux currently uses AWS Kinesis instead of Kafka, but will likely introduce Kafka to our deployment in the next year or so. It’s reassuring to see that Kafka is well supported and in widespread use.
Flink Forward SF boosted my confidence in the Flink platform, and was a great chance to meet many of the project committers in person and see where they’re taking the platform. It was a fantastic experience being part of the speaker lineup, and I look forward to attending next year!