Flume With Many Kafka Sources Can Leave .tmp Files In HDFS

asked Aug 19, 2017 in Hadoop by admin (4,410 points)

Flume does not close its .tmp files in HDFS upon shutdown during a rolling restart. As a result, some of the data/events remain in .tmp files and are inaccessible. This can amount to data loss, since the .tmp files must eventually be removed.

Applies To

If a Flume agent has a large number of Kafka sources and HDFS sinks, Flume may not have enough time to close all of the sources and sinks while shutting down. By default, when Cloudera Manager tells a process to stop or restart, supervisord gives it 30 seconds to shut down; when that 30-second timeout expires, the process is killed.

To confirm whether a Flume agent is shutting down properly, look for 'INFO org.apache.flume.node.PollingPropertiesFileConfigurationProvider: Configuration provider stopping' in the log during Flume shutdown. If this line does not appear, Flume is being killed before it can shut down cleanly.
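A quick way to perform this check is to grep the agent's log after a restart. This is a minimal sketch; the log path below is an assumption (agents managed by Cloudera Manager typically log under /var/log/flume-ng/), so substitute your agent's actual log file:

```shell
# Check whether the Flume agent logged a clean shutdown.
# LOG path is an example; point it at your agent's actual log file.
LOG="${1:-/var/log/flume-ng/flume.log}"

if grep -q "Configuration provider stopping" "$LOG" 2>/dev/null; then
  echo "Flume shut down cleanly"
else
  echo "Flume was likely killed before it finished shutting down"
fi
```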

Here are example log messages from a shutdown in which the *configuration provider* was not allowed to stop properly (note the absence of the 'Configuration provider stopping' line):

2017-02-23 19:30:51,309 INFO com.demo.MyDeserializer: filtered event {host=aaa.me.demo.com, topic=sv_ad_request_log, version=9, key=9f212fe4-2ce8-4d62-bd1f-725cb33fd34c, timestamp=1487878248886}
2017-02-23 19:30:51,708 INFO org.apache.flume.lifecycle.LifecycleSupervisor: Stopping lifecycle supervisor 24


Until there is a fix in CDH, the only workaround is to manually kill the Flume agent from the command line:

1. Identify the Flume Agent to shutdown/restart:
    a) Log into Cloudera Manager
    b) Click on the Flume Service
    c) Click on the Instances tab
    d) Select the Flume Agent
    e) Click on the Processes tab
    f) Record the PID value of the Flume Agent

2. Terminate the flume agent by logging into the host and running the following command:
$ kill <flume-pid>

Killing Flume this way gives the agent time to shut down gracefully, rather than being killed when supervisord's timeout expires. NOTE: Use 'kill' (SIGTERM) and *not* 'kill -9' (SIGKILL), which would again prevent a clean shutdown.
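The two steps above can be sketched as a small script. This is an illustrative sketch, not an official tool: the 120-second wait is an assumption (tune it to how long your agent needs to close its sources and sinks), and the PID is whatever you recorded from the Processes tab:

```shell
# Sketch: send SIGTERM to a Flume agent and wait for it to exit.
# Pass the PID recorded from the Cloudera Manager Processes tab.
FLUME_PID="${1:-}"
if [ -z "$FLUME_PID" ]; then
  echo "usage: $0 <flume-pid>"
  exit 0
fi

kill "$FLUME_PID" 2>/dev/null   # SIGTERM: allows a clean shutdown

# Wait up to 120s (an assumed upper bound) for the process to exit.
for i in $(seq 1 120); do
  if ! kill -0 "$FLUME_PID" 2>/dev/null; then
    echo "Flume agent exited"
    break
  fi
  sleep 1
done
```

Once the agent has exited on its own, restarting it from Cloudera Manager picks up a cleanly shut-down process rather than one killed at the supervisord timeout.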
