HDFS, the distributed file system that ships with Hadoop, has a command to download a snapshot of the current namenode metadata. We can load this image with Spark, or ingest it into Hive, to analyze the data and understand how HDFS is being used.
HDFS file system metadata is stored in a file called the FsImage. This snapshot contains:
The entire file system namespace;
The map of blocks to files, and the replication level of each file;
Properties such as quotas, ACLs, etc.
Solving my problem involved the following steps:
Run the command to download the image and generate an XML;
Implement a Spark job to process the data and save it to a Hive table;
Analyze some data using Hive SQL and plot the data with GnuPlot.
1. Generating an HDFS FsImage
The FsImage can be exported in CSV, XML, or a distributed format. In my case I had to evaluate the blocks and ACLs, and since these are array-type fields they do not work in CSV format. You can see more details here:
To generate the image, first check where it is stored on the namenode:
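Assuming a standard Hadoop installation, the metadata directory can be found by querying the namenode configuration; the path below is only an example:

```shell
# Ask the namenode where it stores its metadata (fsimage and edit logs)
hdfs getconf -confKey dfs.namenode.name.dir

# List the current fsimage files in that directory
# (example path; use the value returned by the command above)
ls -lh /hadoop/hdfs/namenode/current/ | grep fsimage
```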
Now let’s download the image to /tmp; in my case, the file being analyzed was 35 GB in size:
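The stock HDFS admin tooling can fetch the latest image directly from the active namenode:

```shell
# Download the most recent fsimage from the namenode into /tmp
hdfs dfsadmin -fetchImage /tmp
```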
It is now necessary to convert it to a readable format, in this case XML:
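The Offline Image Viewer (`hdfs oiv`) does the conversion; the input file name below is a placeholder for the file fetched in the previous step:

```shell
# Convert the binary fsimage into XML with the Offline Image Viewer
# (fsimage_0000000000000000000 is a placeholder for the fetched file name)
hdfs oiv -p XML -i /tmp/fsimage_0000000000000000000 -o /tmp/fsimage.xml
```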
1.1 Loading the file into Spark and saving it to a Hive table
I used the Databricks library for XML; it makes loading easy because it transforms the data directly into a DataFrame. You can see all
the details here: https://github.com/databricks/spark-xml.
The structure of my Hive table:
In my scenario, because there are other clusters to analyze, the table is partitioned by the ingestion day (in ISO format)
and the cluster name.
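A sketch of such a table; the database, table, and column names here are illustrative, with the columns loosely following the inode fields in the FsImage XML:

```sql
CREATE TABLE IF NOT EXISTS analyze.fsimage (
  id                 BIGINT,
  type               STRING,   -- FILE, DIRECTORY or SYMLINK
  name               STRING,
  replication        INT,
  mtime              BIGINT,   -- modification time, epoch milliseconds
  atime              BIGINT,   -- access time, epoch milliseconds
  preferredblocksize BIGINT,
  permission         STRING,
  blocks             ARRAY<STRUCT<id:BIGINT, genstamp:BIGINT, numbytes:BIGINT>>,
  acls               ARRAY<STRING>
)
PARTITIONED BY (ingestion_date STRING, cluster STRING)
STORED AS ORC;
```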
With the spark-xml library it is very easy to parse the file, then read, transform, and save the data. A simple example of loading the XML:
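A minimal PySpark sketch, assuming the job is submitted with the spark-xml package on the classpath (e.g. `--packages com.databricks:spark-xml_2.12:0.14.0`) and that the table and cluster names match your environment:

```python
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("fsimage-analysis")
         .enableHiveSupport()
         .getOrCreate())

# Each <inode> element of the FsImage XML becomes one row of the DataFrame
df = (spark.read.format("xml")
      .option("rowTag", "inode")
      .load("/tmp/fsimage.xml"))

# Add the partition columns: ingestion day (ISO format) and cluster name
df = (df.withColumn("ingestion_date",
                    F.date_format(F.current_date(), "yyyy-MM-dd"))
        .withColumn("cluster", F.lit("my-cluster")))  # hypothetical name

# Append into the Hive table (assumes analyze.fsimage already exists and
# that the DataFrame column order matches the table definition)
df.write.mode("append").insertInto("analyze.fsimage")
```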
1.2 Analyzing information and plotting with GnuPlot
These analyses used SQL and GnuPlot to visualize the data, but other tools could also be used, such as:
https://github.com/paypal/NNAnalytics
https://github.com/vegas-viz/Vegas
Now, with the data from our batch job, we can do some analysis. Generating a histogram of the most commonly used
replication factors in the cluster:
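A query along these lines produces the histogram; the table name and partition values are illustrative:

```sql
SELECT replication, COUNT(*) AS total
FROM analyze.fsimage
WHERE type = 'FILE'
  AND ingestion_date = '2019-01-01'   -- example partition values
  AND cluster = 'my-cluster'
GROUP BY replication
ORDER BY replication;
```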
There are several types of charts you can create with GnuPlot; see GnuPlot Demos for more examples. Copy the histogram output into a file, for example replication.dat:
Now copy the code below and run:
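A GnuPlot script along these lines renders the histogram (the file names are those used above):

```gnuplot
set terminal png size 800,600
set output 'replication.png'
set style data histograms
set style fill solid border -1
set xlabel "Replication factor"
set ylabel "Number of files"
# replication.dat: first column is the replication factor, second the count
plot 'replication.dat' using 2:xtic(1) title 'files'
```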
The generated data will look like this:
In this case, most data uses a replication factor of 3. We can do another analysis: checking the files that were modified over
the course of a week. Format the histogram output into the weekly-changes.dat file:
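Since mtime in the FsImage is in epoch milliseconds, a sketch of such a query (names are illustrative) is:

```sql
-- mtime is in milliseconds since the epoch, so divide by 1000
SELECT to_date(from_unixtime(CAST(mtime / 1000 AS BIGINT))) AS day,
       COUNT(*) AS modified_files
FROM analyze.fsimage
WHERE type = 'FILE'
  AND ingestion_date = '2019-01-01'   -- example partition values
  AND cluster = 'my-cluster'
  AND to_date(from_unixtime(CAST(mtime / 1000 AS BIGINT)))
      >= date_sub(current_date, 7)
GROUP BY to_date(from_unixtime(CAST(mtime / 1000 AS BIGINT)))
ORDER BY day;
```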
Using GnuPlot:
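A time-series plot of the file can be produced with a script like this:

```gnuplot
set terminal png size 800,600
set output 'weekly-changes.png'
set xdata time
set timefmt "%Y-%m-%d"
set format x "%d/%m"
set xlabel "Day"
set ylabel "Modified files"
# weekly-changes.dat: first column is the day, second the file count
plot 'weekly-changes.dat' using 1:2 with linespoints title 'modified files'
```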
The generated data will look like this:
I will leave some other queries that may be useful:
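For example (again with illustrative table and column names):

```sql
-- Number of files per top-level directory
SELECT regexp_extract(name, '^/([^/]+)', 1) AS top_dir,
       COUNT(*) AS files
FROM analyze.fsimage
WHERE type = 'FILE'
GROUP BY regexp_extract(name, '^/([^/]+)', 1)
ORDER BY files DESC;

-- "Cold" files: not accessed in the last 90 days
SELECT COUNT(*) AS cold_files
FROM analyze.fsimage
WHERE type = 'FILE'
  AND to_date(from_unixtime(CAST(atime / 1000 AS BIGINT)))
      < date_sub(current_date, 90);
```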
1.4 References
Documents that helped in the publication of this article: