Recommendation System Using Spark ML, Akka and Cassandra
Overview
This article was also published on Java Code Geeks and DZone.
Building a recommendation system with Spark is a straightforward task: Spark's machine learning library already does most of the hard work for us.
In this article I will show you how to build a scalable Big Data application using the following technologies:
- Scala Language
- Spark with Machine Learning
- Akka with Actors
- Cassandra
A recommendation system is an information-filtering mechanism that attempts to predict the rating a user would give a particular product. Several algorithms exist for building recommendation systems.
Apache Spark ML implements alternating least squares (ALS) for collaborative filtering, a very popular algorithm for making recommendations.
The ALS recommender is a matrix factorization algorithm that uses Alternating Least Squares with Weighted-Lambda-Regularization (ALS-WR). It factors the user-to-item matrix A into a user-to-feature matrix U and an item-to-feature matrix M, and runs the ALS algorithm in a parallel fashion. ALS uncovers the latent factors that explain the observed user-to-item ratings and searches for factor weights that minimize the least-squares error between predicted and actual ratings.
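In symbols, the factorization and the ALS-WR objective can be sketched as follows (the notation is mine, following the standard ALS-WR formulation, not taken from the original article):

```latex
% A is the (sparse) user-to-item ratings matrix; the columns of U and M
% are the per-user and per-item feature vectors u_u and m_i.
A \approx U^{\top} M
% ALS-WR minimizes the regularized squared error over the set K of known
% ratings, where n_u and n_i count each user's and each item's ratings:
\min_{U, M} \sum_{(u,i) \in K} \left( a_{ui} - \mathbf{u}_u^{\top} \mathbf{m}_i \right)^2
  + \lambda \left( \sum_{u} n_u \lVert \mathbf{u}_u \rVert^2
                 + \sum_{i} n_i \lVert \mathbf{m}_i \rVert^2 \right)
```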
Not all users rate all products (movies), so many entries of the ratings matrix are unknown. With collaborative filtering, the idea is to approximate the ratings matrix by factorizing it as the product of two smaller matrices: one that describes properties of each user and one that describes properties of each movie.
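To make the idea concrete, here is a toy, pure-Scala sketch (all factor values below are made up for illustration, not produced by Spark) of how a factorized model predicts a rating as the dot product of a user-feature vector and a movie-feature vector:

```scala
// Toy illustration of matrix factorization: a rating is approximated by
// the dot product of a user-factor vector and a movie-factor vector.
object FactorizationSketch {
  // Hypothetical 2-dimensional latent factors for 2 users and 3 movies.
  val userFactors: Map[Int, Vector[Double]] = Map(
    1 -> Vector(1.2, 0.8),
    2 -> Vector(0.3, 1.5)
  )
  val movieFactors: Map[Int, Vector[Double]] = Map(
    101 -> Vector(1.0, 0.5),
    102 -> Vector(0.2, 1.1),
    103 -> Vector(0.9, 0.9)
  )

  // Predicted rating = dot product of the two feature vectors.
  def predict(user: Int, movie: Int): Double = {
    val u = userFactors(user)
    val m = movieFactors(movie)
    u.zip(m).map { case (a, b) => a * b }.sum
  }

  def main(args: Array[String]): Unit = {
    // Recommend the movie with the highest predicted rating for user 1.
    val best = movieFactors.keys.maxBy(m => predict(1, m))
    println(s"best movie for user 1: $best")
  }
}
```

Spark ML's ALS learns such factor vectors from the known ratings; this sketch only shows how predictions fall out of them once learned.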
1. Project Architecture
The architecture combines the technologies listed above: an Akka-based service exposes the end-points, trains a Spark ML model, and reads and writes its data in Cassandra.
2. Data set
The data sets with the movie information and user ratings were taken from the MovieLens site. The data was then customized and loaded into Apache Cassandra, which runs inside a Docker container.
The keyspace is called movies. The data in Cassandra is modeled in three tables, described in the next sections.
3. The code
The code is available at: https://github.com/edersoncorbari/movie-rec
4. Organization and End-Points
Collections:
Collection | Comments
---|---
movies.uitem | Contains the available movies; the dataset used has 1682 entries.
movies.udata | Contains the movies rated by each user; the dataset used has 100000 entries.
movies.uresult | Where the data calculated by the model is saved; empty by default.
The end-points:
Method | End-Point | Comments
---|---|---
POST | /movie-model-train | Trains the model.
GET | /movie-get-recommendation/{ID} | Lists the movies recommended for a user.
5. Hands-on: Docker and Configuring Cassandra
Run the commands below to bring up and configure Cassandra:
In the project directory (movie-rec) there are datasets already prepared to be loaded into Cassandra.
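As a sketch, the setup typically looks like this (the container name, image tag, and file paths below are assumptions; check the repository for the actual scripts):

```shell
# Start Cassandra in Docker, exposing the default CQL port.
docker run --name cassandra-movie-rec -p 9042:9042 -d cassandra:3.0

# Copy the prepared datasets into the container and load them with cqlsh.
docker cp datasets/ cassandra-movie-rec:/datasets
docker exec -i cassandra-movie-rec cqlsh -f /datasets/create-schema.cql
```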
6. Hands-on: Running and Testing
Enter the project root folder and run the commands below; if this is the first run, SBT will download the necessary dependencies.
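For an SBT project this is typically just (a sketch; the repository may define a different main task):

```shell
# Compile and start the application; first run fetches dependencies.
sbt run
```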
Now, in another terminal, run the command to train the model:
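For example, using curl (host and port are assumptions; the end-point comes from the table in section 4):

```shell
# Trigger model training via the POST end-point.
curl -X POST http://localhost:8080/movie-model-train
```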
This will start the model training. You can then run the command to see the recommendation results. Example:
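For instance, asking for recommendations for user 1 (host and port are assumptions):

```shell
# Fetch the recommended movies for a given user ID.
curl http://localhost:8080/movie-get-recommendation/1
```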
The response lists the movies recommended for that user. That's the icing on the cake! Remember that the configuration is set to return 10 movie recommendations per user.
You can also check the result in the uresult collection:
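For example (the container name is an assumption from the setup sketch above):

```shell
# Query the recommendations computed by the model directly in Cassandra.
docker exec -i cassandra-movie-rec cqlsh -e "SELECT * FROM movies.uresult LIMIT 10;"
```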
7. Model Predictions
The model and application training settings are in src/main/resources/application.conf.
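A hypothetical sketch of what an ALS configuration in such a file might contain; the key names and values below are illustrative assumptions, not the project's actual settings:

```hocon
model {
  rank = 10             # number of latent factors
  iterations = 10       # ALS iterations
  lambda = 0.01         # regularization parameter
  recommendations = 10  # movies recommended per user
}
```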
These settings control the predictions and depend on how much data, and what kind of data, we have. For more detailed project information, please see the GitHub repository linked above.
8. References
Several books were consulted in developing this demonstration project, along with the Spark ML documentation:
- https://spark.apache.org/docs/2.2.0/ml-collaborative-filtering.html
- https://spark.apache.org/docs/latest/ml-guide.html
Thanks!