Unable to Import Data as Parquet into Encrypted HDFS Zone | Sqoop Parquet Import

asked Aug 19, 2017 in Hadoop by admin (4,410 points)
Summary
Importing data with Sqoop in Parquet format into a Hive table located in an encrypted HDFS zone fails, because the Parquet files are staged under the unencrypted /tmp directory and cannot be moved into the encryption zone.
Symptoms
When importing data with Sqoop into a Hive table whose location is in an encrypted HDFS zone, the Sqoop command fails with the following errors:

Command:
sqoop import --connect <postgres_url> --username <username> --password <password> \
--table sourceTable --split-by id --hive-import --hive-database staging \
--hive-table hiveTable --as-parquetfile
Errors:
2017-05-24 13:38:51,539 INFO [Thread-84] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: 
Setting job diagnostics to Job commit failed: org.kitesdk.data.DatasetIOException: Could not move contents of hdfs://nameservice1/tmp/staging/.
temp/job_1495453174050_1035/mr/job_1495453174050_1035 to 
hdfs://nameservice1/user/hive/warehouse/staging.db/hiveTable
        at org.kitesdk.data.spi.filesystem.FileSystemUtil.stageMove(FileSystemUtil.java:117)
        at org.kitesdk.data.spi.filesystem.FileSystemDataset.merge(FileSystemDataset.java:406)
        at org.kitesdk.data.spi.filesystem.FileSystemDataset.merge(FileSystemDataset.java:62)
        at org.kitesdk.data.mapreduce.DatasetKeyOutputFormat$MergeOutputCommitter.commitJob(DatasetKeyOutputFormat.java:387)
        at org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler$EventProcessor.handleJobCommit(CommitterEventHandler.java:274)
        at org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler$EventProcessor.run(CommitterEventHandler.java:237)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.IOException): 
/tmp/staging/.temp/job_1495453174050_1035/mr/job_1495453174050_1035/964f7b5e-2f55-421d-bfb6-7613cc4bf26e.parquet 
can't be moved into an encryption zone.
        at org.apache.hadoop.hdfs.server.namenode.EncryptionZoneManager.checkMoveValidity(EncryptionZoneManager.java:284)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedRenameTo(FSDirectory.java:564)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.renameTo(FSDirectory.java:478)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renameToInternal(FSNamesystem.java:3929)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renameToInt(FSNamesystem.java:3891)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renameTo(FSNamesystem.java:3856)
Applies To
  • Sqoop
  • Parquet
  • HDFS Encryption Zone
Cause

This is caused by a known Sqoop bug: SQOOP-2943. Sqoop currently uses the Kite SDK to generate the Parquet files, and the Kite SDK stages them in a temporary directory under /tmp. Because /tmp is not encrypted while the Hive warehouse directory is, the final step that moves the Parquet files from /tmp into the warehouse fails: HDFS does not allow files to be renamed across an encryption zone boundary.

The import fails only with the Parquet format; imports in text file format currently work as expected.
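
To confirm the zone boundary on your cluster, the HDFS crypto CLI can show which paths are encrypted (the paths below mirror this example and should be adjusted; listZones requires HDFS administrator privileges):

hdfs crypto -listZones
hdfs crypto -getFileEncryptionInfo -path /user/hive/warehouse/staging.db/hiveTable
hdfs crypto -getFileEncryptionInfo -path /tmp

A plain rename across that boundary, e.g. hdfs dfs -mv /tmp/<file>.parquet /user/hive/warehouse/staging.db/hiveTable/, is rejected with the same "can't be moved into an encryption zone" error.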

Instructions
As SQOOP-2943 has not been fixed yet and there is no direct workaround, we suggest the following two methods for importing the data into a Hive Parquet table inside the encrypted warehouse (sketches of both follow this list):
  1. Import the data in text file format into a temporary Hive table inside the (encrypted) Hive warehouse, then use a Hive query to copy the data into the destination Parquet table
  2. Import the data as Parquet files into an unencrypted temporary directory outside the Hive warehouse, then use Hive to copy the data into the destination Parquet table inside the (encrypted) Hive warehouse
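
A minimal sketch of method 1, assuming the same connection parameters as in the failing command and an illustrative temporary table name hiveTable_txt; the destination Parquet table staging.hiveTable is assumed to exist already (otherwise a CREATE TABLE ... STORED AS PARQUET AS SELECT statement can create it in the same step):

sqoop import --connect <postgres_url> --username <username> --password <password> \
--table sourceTable --split-by id --hive-import --hive-database staging \
--hive-table hiveTable_txt --as-textfile

hive -e "INSERT OVERWRITE TABLE staging.hiveTable SELECT * FROM staging.hiveTable_txt;
         DROP TABLE staging.hiveTable_txt;"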
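
A minimal sketch of method 2, assuming an unencrypted staging directory /tmp/sqoop_parquet/hiveTable and an illustrative external table staging.hiveTable_ext; the column list (id INT here) is a placeholder and must match the source table schema. The Hive INSERT writes its output under the destination table's directory, so no Parquet file has to be renamed across the encryption zone boundary:

sqoop import --connect <postgres_url> --username <username> --password <password> \
--table sourceTable --split-by id --as-parquetfile \
--target-dir /tmp/sqoop_parquet/hiveTable

hive -e "CREATE EXTERNAL TABLE staging.hiveTable_ext (id INT) STORED AS PARQUET
         LOCATION '/tmp/sqoop_parquet/hiveTable';
         INSERT OVERWRITE TABLE staging.hiveTable SELECT * FROM staging.hiveTable_ext;
         DROP TABLE staging.hiveTable_ext;"

hdfs dfs -rm -r /tmp/sqoop_parquet/hiveTable   # DROP TABLE on an external table does not delete its data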

