The Flume Kafka Pipeline

More often than not, I come across designs which make use of the Flume-Kafka pipeline for data ingestion. What surprises me is the indiscriminate manner in which it is used as a generic, one-size-fits-all solution for ingesting any kind of data. In this post I try to put forward some thoughts in this regard.

At a very high level, the Flume-Kafka pipeline looks like this:

Data Source -> Flume -> Kafka -> Applications

The key advantages (the best of both worlds) of this setup are:
1. Flume offers a rich ecosystem of Sources and Sinks and hence can be used to push almost any kind of data from different sources
2. Kafka offers a reliable scalable enterprise grade messaging system which can connect a large number of applications together

Coming to the data we want to ingest, there are essentially two kinds:
1. Static data sources such as files (logs, CDRs and so on) - Data at rest, Non Real Time
2. Dynamic data sources such as streams (messages, events, triggers and so on) - Data in motion, Real Time

Flume is more oriented towards the Hadoop ecosystem and is an excellent choice for pushing data to HDFS or HBase. In fact, so much so that a large part of the effort in maintaining Flume goes into keeping up with Hadoop features. Flume is thus more suited to static data, and even though it supports ingesting messages, it works best when the final target is a static data store. That is its strength.

Kafka, on the other hand, is a general-purpose publish-subscribe messaging system which helps exchange information or messages between a large number of applications. These applications can be either producers or consumers. It therefore appears more suitable for message ingestion, and thus for connecting to streaming data sources. However, Kafka's ecosystem of out-of-the-box producers and consumers is not nearly as rich as Flume's, and hence most of the time we need to write our own code to connect the source to the Kafka cluster.
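
For example, a minimal custom producer written against the standard Kafka clients API could look like the sketch below (the broker address and the topic name "events" are just placeholders):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class SimpleProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            // Send one record to a placeholder topic; partitioning and retries are handled by the producer
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("events", "key-1", "a message from a custom producer"));
            }
        }
    }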

So how does a Flume-Kafka pipeline come into the picture?

Simply put, such a pipeline is mainly useful when we need a static source to spew dynamic data: for example, reading lines from a file and pushing each line as an individual message through the Kafka cluster so that one or more applications can read and process it. It essentially increases the ease of integration.
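
On the consuming side, each such application is just an ordinary Kafka consumer. A minimal sketch, assuming the Flume Kafka sink writes to a topic called "file-lines" (the topic name, group id and broker address are placeholders):

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class LineConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
            props.put("group.id", "line-processors");         // placeholder consumer group
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("file-lines")); // placeholder topic
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.println("line: " + record.value()); // each file line arrives as one message
                    }
                }
            }
        }
    }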

The only other case where I can think of employing this pipeline is when one wants a quick hack to connect a streaming data source to Kafka for lack of a ready-to-use producer (assuming Flume already has a ready-made source for it, such as HTTP, Avro, Thrift and so on).

Barring these cases, for the classic cases of ingesting files as files or messages as messages, this pipeline is not really required. In fact, ingesting streams through a Flume-Kafka pipeline can open up several anti-patterns:

1. Flume is less reliable than Kafka (the latter uses replication to ensure that a single broker failure does not cause data loss), so putting a more reliable cluster after a less reliable one in the data pipeline reduces the overall reliability of the data

2. Flume is in general found to have lower throughput than Kafka (compare roughly 200 MB/s against 350 MB/s per node on a 12C/24T, 96 GB RAM, 6-disk, 10 GbE setup). This reduces the overall throughput of the system. In fact, we can say quite intuitively that as data progresses through a pipeline, it may move into lower-throughput or lower-performing modules, but not the other way around

3. In order to increase the throughput of Flume, we need to batch the output on the sink side (and sometimes on the source side too). While batching increases throughput, it also increases latency and, moreover, compromises the ordering or timestamps of the events
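
Incidentally, the same reliability and batching trade-offs appear as explicit knobs when tuning a Kafka producer directly; a small sketch of the relevant settings (the broker address and the values shown are only illustrative):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;

    public class TunedProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("acks", "all");         // wait for all in-sync replicas: better durability, higher latency
            props.put("batch.size", "65536"); // larger batches improve throughput
            props.put("linger.ms", "50");     // wait up to 50 ms to fill a batch, which adds latency
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // send records here; batching and retries are handled internally by the producer
            }
        }
    }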

All of this indicates that Flume can be located downstream of Kafka, but not the other way around. In fact, that makes sense, because Flume can very well be considered one of the many applications which source their data from Kafka.

In the following, I present some of the common pipelines that may be useful:
1. File Source -> Flume -> Kafka -> Message Processor Applications

2. File Source -> Flume -> HDFS/HBase

3. Streaming Source (client) -> Kafka Producer (e.g. Socket Server) -> Kafka -> Message Processor Applications (a sketch of such a producer follows this list)

4. Streaming Source (client) -> Kafka Producer (e.g. Socket Server) -> Kafka -> Flume -> HDFS/HBase

5. Streaming Source (client) -> Flume (HTTP, Avro, Thrift, ..other ready-made message sources) -> Kafka -> ...
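
The "Kafka Producer (e.g. Socket Server)" piece in pipelines 3 and 4 is typically a small custom component. A minimal sketch, assuming clients send newline-delimited text to a socket (the port, topic name and broker address are placeholders):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.ServerSocket;
    import java.net.Socket;
    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class SocketToKafka {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
                 ServerSocket server = new ServerSocket(9999)) { // placeholder port
                while (true) {
                    try (Socket client = server.accept();
                         BufferedReader in = new BufferedReader(new InputStreamReader(client.getInputStream()))) {
                        String line;
                        while ((line = in.readLine()) != null) {
                            // forward each line from the streaming client as one Kafka message
                            producer.send(new ProducerRecord<>("stream-events", line)); // placeholder topic
                        }
                    }
                }
            }
        }
    }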

In the end, I would just like to say that as long as we are clear about the data source and the kind of processing it needs, we can figure out what kind of pipeline best fits the situation.





