BigData: Developing a Graph in Spark and Scala
Overview
This article was also published on the site: https://dzone.com/articles/bigdata-developing-a-graph-in-spark-and-scala.
Recently at work I needed to build a solution that uses a graph to create a hierarchy of relationships between the employees of a company. It is a simple problem, but I had never solved it in a Big Data environment or with Apache Spark GraphX.
In this case we were using Cloudera’s cluster environment, which is a commercial distribution, but the solution works the same way on the open-source versions.
If you want to study Big Data with Cloudera’s tools, you can download a virtual machine with a cluster already configured and ready to use.
1. The problem
We needed to ingest the employee data from an Oracle relational database into the cluster, do some data processing, and save the result in Hive. However, the data did not contain the relationships needed to build a hierarchy between the employee levels, so we used a table with auxiliary data and built a top-down graph to create those relationships.
Detail: relational databases such as Oracle offer the CONNECT BY clause for hierarchical queries.
Below is a hypothetical design of the employee hierarchy, which has 9 levels:
With the Pregel API in GraphX, a hierarchy can be walked in two directions, top-down or bottom-up; in this case the modeling is top-down:
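To make the top-down idea concrete, here is a minimal sketch of how the `pregel` operator in GraphX can push information from the root towards the leaves. It assumes Spark with the GraphX module on the classpath and a local master. The object name `TopDownSketch`, the four-person sample, and the `(name, level, path)` vertex layout are illustrative assumptions, not the code from the project.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, EdgeDirection, EdgeTriplet, Graph, VertexId}
import org.apache.spark.sql.SparkSession

// Hypothetical sketch: names and data layout are assumptions for illustration.
object TopDownSketch {
  // Vertex attribute: (name, level, path). Level 0 means "not reached yet".
  type Attr = (String, Int, String)
  // Message sent down an edge: (level of the manager, path of the manager).
  type Msg = (Int, String)

  // Vertex program: keep the attribute on the initial message (level 0),
  // otherwise derive level and path from the manager's message.
  def vprog(id: VertexId, attr: Attr, msg: Msg): Attr =
    if (msg._1 == 0) attr
    else (attr._1, msg._1 + 1, msg._2 + "/" + attr._1)

  // Only a reached manager (level > 0) notifies a not-yet-reached employee.
  def sendMsg(t: EdgeTriplet[Attr, String]): Iterator[(VertexId, Msg)] =
    if (t.srcAttr._2 > 0 && t.dstAttr._2 == 0)
      Iterator((t.dstId, (t.srcAttr._2, t.srcAttr._3)))
    else Iterator.empty

  // Each employee has a single manager here, so any message will do.
  def mergeMsg(a: Msg, b: Msg): Msg = a

  def run(sc: SparkContext): Map[VertexId, Attr] = {
    val vertices = sc.parallelize(Seq[(VertexId, Attr)](
      (1L, ("Tim Cox", 1, "/Tim Cox")), // the root starts at level 1
      (2L, ("Robert Watson", 0, "")),
      (5L, ("João Bastos", 0, "")),
      (7L, ("Ana Becker", 0, ""))
    ))
    // Edges point from the manager to the employee, hence "top-down".
    val edges = sc.parallelize(Seq(
      Edge(1L, 2L, "manages"),
      Edge(2L, 5L, "manages"),
      Edge(5L, 7L, "manages")
    ))
    val initialMsg: Msg = (0, "")
    Graph(vertices, edges)
      .pregel(initialMsg, Int.MaxValue, EdgeDirection.Out)(vprog, sendMsg, mergeMsg)
      .vertices.collect().toMap
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("top-down-sketch").master("local[*]").getOrCreate()
    run(spark.sparkContext).toSeq.sortBy(_._1).foreach(println)
    spark.stop()
  }
}
```

`EdgeDirection.Out` restricts each round to the outgoing edges of vertices that just received a message, so the computation moves one level down per superstep and stops when no vertex has anything left to send. A bottom-up variant would presumably reverse the edge direction and start from the leaves.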
1.1 Start
After several transformations of the data, we arrive at a DataFrame similar to this one:
Id | Role | Name | Connect By |
---|---|---|---|
1 | PRESIDENT | Tim Cox | null |
2 | VP | Robert Watson | 1 |
3 | VP | Matthew Powers | 1 |
4 | VP | Alex Pess | 1 |
5 | DIRECTOR | João Bastos | 2 |
6 | DIRECTOR | Martin Jay | 3 |
7 | SUPERVISOR | Ana Becker | 5 |
8 | SUPERVISOR | Marcos Silverio | 6 |
9 | MANAGER | Carlos Klaus | 8 |
10 | ANALYST | Jacob Oliver | 9 |
11 | ANALYST | Charlie Noah | 9 |
12 | ANALYST | Claudio Stwart | 9 |
13 | SENIOR DEVELOPER | Jack Connor | 10 |
14 | SENIOR DEVELOPER | Daniel Mason | 10 |
15 | JUNIOR DEVELOPER | George Reece | 13 |
This is only sample data for our test, following the hierarchy in the diagram above. The row whose Connect By value is null is the root of the relationship; in this case it is the President or CEO of the company.
1.2 Analyzing the result
Creating a DataFrame with the above data for tests:
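A minimal sketch of such a DataFrame, assuming Spark SQL on the classpath and a local master. The object name `EmployeeData` is hypothetical, and the column names `id`, `role`, `name`, and `id_connect_by` are taken from the field list further down, with the first column of the table used as `id`.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper: builds the sample table as a DataFrame.
object EmployeeData {
  def build(spark: SparkSession): DataFrame = {
    import spark.implicits._
    // Option[Int] becomes a nullable column; None is the null of the root row.
    Seq[(Int, String, String, Option[Int])](
      (1, "PRESIDENT", "Tim Cox", None),
      (2, "VP", "Robert Watson", Some(1)),
      (3, "VP", "Matthew Powers", Some(1)),
      (4, "VP", "Alex Pess", Some(1)),
      (5, "DIRECTOR", "João Bastos", Some(2)),
      (6, "DIRECTOR", "Martin Jay", Some(3)),
      (7, "SUPERVISOR", "Ana Becker", Some(5)),
      (8, "SUPERVISOR", "Marcos Silverio", Some(6)),
      (9, "MANAGER", "Carlos Klaus", Some(8)),
      (10, "ANALYST", "Jacob Oliver", Some(9)),
      (11, "ANALYST", "Charlie Noah", Some(9)),
      (12, "ANALYST", "Claudio Stwart", Some(9)),
      (13, "SENIOR DEVELOPER", "Jack Connor", Some(10)),
      (14, "SENIOR DEVELOPER", "Daniel Mason", Some(10)),
      (15, "JUNIOR DEVELOPER", "George Reece", Some(13))
    ).toDF("id", "role", "name", "id_connect_by")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("employee-hierarchy").master("local[*]").getOrCreate()
    build(spark).show(false) // print all rows without truncating values
    spark.stop()
  }
}
```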
After running that DataFrame through the graph code, you get the following result:
Let’s understand the results:
Field | Description |
---|---|
id | The employee’s identifier (it could be the enrollment number); this number must be unique |
name | The employee’s name |
role | The employee’s position or job in the company |
id_connect_by | The relationship key: the id of the person this employee is connected to |
level | The employee’s level number in the company; each position must have a level |
root | The top of the hierarchy, the ancestor of every other employee |
path | The path traveled until reaching the highest position (top dog) |
iscyclic | Whether the employee’s relationships form a cycle. In graph theory, a cycle graph Cn has as many edges as vertices, and every vertex has degree 2 |
isleaf | Determines whether a vertex is a leaf |
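
To illustrate what these fields mean, the following sketch derives them for a small subset of the sample data using plain Scala collections. It is not the GraphX implementation: it simply walks up the `id_connect_by` links from each employee. The names `HierarchyFields`, `Employee`, `Derived`, and `describe`, and the `/`-separated path format, are assumptions made for this example.

```scala
// Hypothetical, Spark-free illustration of the derived fields.
object HierarchyFields {
  final case class Employee(id: Int, name: String, role: String, idConnectBy: Option[Int])
  final case class Derived(level: Int, root: String, path: String,
                           isCyclic: Boolean, isLeaf: Boolean)

  // Subset of the sample table: one branch from the president down.
  val sample: Seq[Employee] = Seq(
    Employee(1, "Tim Cox", "PRESIDENT", None),
    Employee(2, "Robert Watson", "VP", Some(1)),
    Employee(5, "João Bastos", "DIRECTOR", Some(2)),
    Employee(7, "Ana Becker", "SUPERVISOR", Some(5))
  )

  def describe(employees: Seq[Employee]): Map[Int, Derived] = {
    val byId = employees.map(e => e.id -> e).toMap
    // Every id that someone points to has at least one person below it.
    val hasReports = employees.flatMap(_.idConnectBy).toSet

    // Returns the chain ordered from the topmost ancestor down to `start`,
    // plus a flag that is true when the walk ran into an id it already saw.
    def chain(start: Employee): (List[Employee], Boolean) = {
      @annotation.tailrec
      def loop(current: Employee, seen: List[Employee]): (List[Employee], Boolean) =
        current.idConnectBy.flatMap(byId.get) match {
          case Some(parent) if seen.exists(_.id == parent.id) => (seen, true)
          case Some(parent) => loop(parent, parent :: seen)
          case None => (seen, false)
        }
      loop(start, List(start))
    }

    employees.map { e =>
      val (ancestors, cyclic) = chain(e)
      e.id -> Derived(
        level = ancestors.size,                          // the root is level 1
        root = ancestors.head.name,                      // topmost ancestor found
        path = ancestors.map(_.name).mkString("/", "/", ""),
        isCyclic = cyclic,
        isLeaf = !hasReports.contains(e.id)              // nobody points to this id
      )
    }.toMap
  }
}
```

For example, `HierarchyFields.describe(HierarchyFields.sample)(7)` describes Ana Becker: she sits four levels down, her root is Tim Cox, and she is a leaf because no other row references her id.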
1.3 Code
To do more tests, please download the code in the repository:
The project’s README file has tips on how to run and test it.
1.4 References
Graph theory is something we first see in discrete math in high school and then study in more depth later, in my case in the data structures courses of a computer science degree. Below are some references to review and study:
- MIT Computer Science Course Lecture on Graph Theory
- https://spark.apache.org/graphx/
- https://dzone.com/articles/processing-hierarchical-data-using-spark-graphx-pr
- Spark GraphX in Action
Thanks!