Recently, we are working on structuring data on the Brazilian Electoral History. I work at a Brazilian non-profit political school, called RenovaBR. We use data from the Brazilian Electoral History to develop analyzes and understand patterns between types of candidates.
In Brazil there is a public organization responsible for the elections, called TSE, that entity makes the data available as transparency to the population in a Data Repository. This data is available in CSV and TXT files, but without many standards. So we had the need to create a structure for this data and make ETL transformations, so that we can make analyzes from it.
We currently process data on brazilian elections for the following years: 2010, 2012, 2014, 2016, 2018. The full import of every year totals 26.151.069 million lines in the database.
So an architecture was created using the following technologies:
The initial idea is to show the architecture and some queries, and in the next tutorial show some analysis using Elasticsearch and Kibana. With the data in Elasticsearch we created an API for querying and analyzing the data.
Architecture used in the project:
We decided to put the data first on MySQL, due to the ease of people on the team already using SQL language, then with Logstash we sent the data to Elasticsearch.
We automate the entire process to create the tables in the database and make the information available for analysis. To see the next-step to process the data check the page on the project that is publicly on GitHub:
On a machine with 16 GB / 4 core, it takes around 10 hours to process all data.
Also known as Hare Quote, it is a method by which the seats in the elections are distributed by the proportional system of votes in conjunction with the party quotient and the distribution of leftovers.
Elections in Brazil use the Brazilian proportional system to legislative seats. The program below calculates the electoral quotient for the 2016 year Brazilian Municipal Elections.
Check the complete code example here:
The result of the file, computing the electoral coefficient for all brazilian cities in the 2016 elections:
There is a data dictionary containing the description of the tables and fields:
Checking the total number of votes of the Governors of the state of Santa Catarina in the city of Florianópolis in the first shift of the 2018 elections.
Checking the total number of votes of the Governors of the State of São Paulo in the first shift of the 2018 elections.
|250000612133||MAJOR COSTA E SILVA||Governador||747.462|
|250000617766||PROF. CLAUDIO FERNANDO||Governador||28.666|
Which top 10 city has the most votes for a distinguished candidate example the candidate (250000612596) to governor of the state of São Paulo, for the second shift.
|SÃO JOSÉ DOS CAMPOS||232775|
|SÃO BERNARDO DO CAMPO||196202|
The main objective of this article is to help people who want to study data from Brazil’s electoral system. It is possible to create several queries to explore the data for analysis. In the next article we show charts and queries using Elasticsearch and Kibana.
More details can be found in RenovaBR’s public repository: