Renner, T., Müller, J., Thamsen, L. and Kao, O. (2017) Addressing Hadoop's Small File Problem with an Appendable Archive File Format. In: Computing Frontiers Conference (CF '17) - Workshop on Big Data Analytics (BigDAW '17), Siena, Italy, 15-17 May 2017, pp. 367-372. ISBN 9781450344876 (doi: 10.1145/3075564.3078888)
Abstract
Hadoop has been used widely for data analytic tasks in various domains. At the same time, data volume is expected to grow even further in the coming years. Hadoop recently introduced the concept of Archival Storage, an automated tiered storage technique for increasing storage capacity for long-term storage. However, the scalability of the Hadoop Distributed File System (HDFS) is limited by the total number of files that can be stored, and the number of files is likely to increase quickly when it is used for archival purposes. This paper presents an approach for improving the scalability of HDFS when it is used as archival storage. We present a tool that extends Hadoop Archive to an appendable file format. New files are appended efficiently to one of the existing archive data files without rewriting the whole archive. To this end, a first-fit algorithm is used to fill up the often underutilized fixed-size data blocks of the archive data files. Index files are updated using a red-black tree, which provides guaranteed fast lookup and insert performance. We show that the tool performs well for different archive sizes and numbers of files to add. By distributing new files efficiently, we also reduce the number of data blocks needed for archiving and thus reduce the memory footprint on the NameNode.
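The first-fit placement mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `first_fit_append`, the representation of each archive data file by the free bytes left in its last block, and the rule of opening a new data file when nothing fits are all assumptions made for illustration.

```python
def first_fit_append(part_free, new_files, block_size):
    """Assign each new file to the first data file whose last block has room.

    part_free:  free bytes in the last block of each existing data file
                (hypothetical representation, not taken from the paper).
    new_files:  list of (name, size_in_bytes) tuples to append.
    block_size: fixed block size in bytes.

    Returns (placement, free), where placement maps each file name to the
    index of the data file it was assigned to and free is the updated
    list of free bytes per data file.
    """
    free = list(part_free)  # work on a copy so the input is not mutated
    placement = {}
    for name, size in new_files:
        for i, avail in enumerate(free):
            if size <= avail:
                # First data file with enough room wins.
                free[i] -= size
                placement[name] = i
                break
        else:
            # Assumption: if no existing data file fits, open a new one.
            # A file may span several blocks; only its tail block has slack.
            tail = size % block_size
            free.append(0 if tail == 0 else block_size - tail)
            placement[name] = len(free) - 1
    return placement, free


# Toy example with a block size of 100 bytes and two existing data files.
placement, free = first_fit_append(
    part_free=[10, 40],
    new_files=[("a", 30), ("b", 10), ("c", 50), ("d", 20)],
    block_size=100,
)
print(placement)
print(free)
```

In this toy run, files `a`, `b`, and `d` reuse slack in existing blocks and only `c` opens a new data file, which reflects the abstract's point that filling partially used blocks reduces the number of blocks the NameNode has to track.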
Item Type: | Conference Proceedings |
---|---|
Additional Information: | Funding: This work has been supported through grants by the German Science Foundation (DFG) as FOR 1306 Stratosphere and by the German Ministry for Education and Research (BMBF) as Berlin Big Data Center BBDC (funding mark 01IS14013A). |
Status: | Published |
Refereed: | Yes |
Glasgow Author(s) Enlighten ID: | Thamsen, Dr Lauritz |
Authors: | Renner, T., Müller, J., Thamsen, L., and Kao, O. |
College/School: | College of Science and Engineering > School of Computing Science |
Publisher: | ACM |
ISBN: | 9781450344876 |