Using Machine Learning to Analyze the 2nd Round of the 2022 Brazilian Presidential Election
Author: Yuri Henrique Galvao - Student # 3151850
This is the Final Project for the Artificial Intelligence Diploma program of The University of Winnipeg - Professional, Applied and Continuing Education (PACE). The idea of this project is to extract the available data generated by the Electronic Voting Machines (EVMs) used in the 2nd Round of the 2022 Brazilian Presidential Election, clean it, analyze it, and then use clustering models to find data patterns - especially hidden or non-intuitive patterns - and anomalies.
For these clustering and anomaly-detection tasks, I will use the following three unsupervised clustering algorithms, which will be presented and compared: K-Means, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), and Self-Organizing Maps (SOMs).
This notebook is divided into 6 main sections. At the end of the notebook, you will also find the references for the content I used while preparing it. To improve readability, some code cells are hidden (please unhide them if you want to see the code).
SECTIONS:¶
- Introduction
- Data Ingestion
- Exploratory Data Analysis
- Clustering
4.1 Preprocessing the Data
4.2 K-Means
4.3 DBSCAN
4.4 SOMs
- Conclusions
- References
1. Introduction¶
1.1. Context¶
In October 2022, Brazil had the fiercest presidential election of the past three decades. During the first round of the federal elections, more than 10 candidates ran for President of the Federative Republic of Brazil, but all Brazilians knew that only two of them had a real chance of reaching the second round: Luiz Inácio “Lula” da Silva and Jair Messias Bolsonaro.
As everyone expected, the presidential election went to a second round between Lula and Bolsonaro, and Brazil had a very controversial election in which Lula won with 50.9% of the valid votes against Bolsonaro's 49.1%.
The election was controversial because, among other issues, Brazil uses a 100% electronic voting system, which is based on digital electronic voting machines (EVMs) and on a digital vote-processing system that is considered a black box.
1.2. Project Idea¶
In an attempt to make the system more transparent, the public body responsible for the elections, the Superior Electoral Court (Tribunal Superior Eleitoral – TSE), published all the data gathered by the EVMs on its website. More precisely, it is possible to get the logs, the digital vote record (Registro Digital de Voto – RDV), and the voting machine bulletin (Boletim de Urna – BU) of each machine used in the elections.
In this project I will extract raw data from the log files and the BUs, transform it, load it into Pandas DataFrames, clean it, analyze it, and then use clustering models to find hidden or non-intuitive patterns and anomalies. Any such patterns would be a valuable finding and should tell us a lot about the dataset.
For that, this notebook focuses on three algorithms - K-Means, DBSCAN, and SOMs - implemented mainly with two Python libraries: Scikit-Learn and MiniSom.
2. Data Ingestion¶
2.1. Importing the necessary libraries for the ETL process¶
import pandas as pd
import numpy as np
import glob, py7zr, os
2.2. Extracting the Data¶
2.2.1. Web Scraping¶
To scrape TSE's website and get the zip files for each EVM, I had to develop a Selenium-based web scraper. This web scraper is able to download the files for all EVMs of a single Federative Unit (also called a "scope"). Since Brazil has 26 states, 1 Federal District, and 1 scope for the EVMs placed in other countries, there are 28 scopes / federative units in total. Therefore, you can run 28 instances of this web scraper in parallel to speed up the downloading of the zip files.
!git clone https://github.com/ygalvao/2022brascraper.git
Cloning into '2022brascraper'... remote: Enumerating objects: 31, done. remote: Counting objects: 100% (31/31), done. remote: Compressing objects: 100% (23/23), done. remote: Total 31 (delta 14), reused 24 (delta 7), pack-reused 0 Receiving objects: 100% (31/31), 8.20 MiB | 9.32 MiB/s, done. Resolving deltas: 100% (14/14), done.
!python 2022brascraper/web_scraper_en.py
Please, enter the name of the state for this instance of the web scraper (e.g. Rio de Janeiro): ^C Traceback (most recent call last): File "/home/yhgalvao/OneDrive/Documentos/AI/Notebooks/Brazil_2022_Elections/formato-arquivos-bu-rdv-ass-digital/2022brascraper/web_scraper_en.py", line 19, in <module> scope = input('Please, enter the name of the state for this instance of the web scraper (e.g. Rio de Janeiro): ').strip() KeyboardInterrupt
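Since the scraper prompts for the state name on stdin (as seen above), several instances can be launched in parallel from a small driver script. The sketch below is a hypothetical helper, assuming each instance only needs its scope name piped in; the SCOPES list is abbreviated here and should contain all 28 scopes.
import subprocess

SCOPES = ['Acre', 'Alagoas', 'Amapá', 'Rio de Janeiro']  # ...all 28 federative units / scopes

procs = []
for scope in SCOPES:
    p = subprocess.Popen(
        ['python', '2022brascraper/web_scraper_en.py'],
        stdin=subprocess.PIPE, text=True
    )
    p.stdin.write(scope + '\n')  # answer the scraper's input() prompt
    p.stdin.close()
    procs.append(p)

for p in procs:
    p.wait()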
2.2.2. Unzipping¶
The Bash script below automates unzipping the .bu and .logjez files, extracting them into the "extracted" subfolder.
!./extract_bu_and_log.sh
Archive: ./BU_e_RDV/DadosDeUrna_o00407-0001900010003_1673243086911.zip extracting: /home/yhgalvao/OneDrive/Documentos/AI/Notebooks/Brazil_2022_Elections/formato-arquivos-bu-rdv-ass-digital/BU_e_RDV/extracted/o00407-0001900010003.bu extracting: /home/yhgalvao/OneDrive/Documentos/AI/Notebooks/Brazil_2022_Elections/formato-arquivos-bu-rdv-ass-digital/BU_e_RDV/extracted/o00407-0001900010003.logjez Archive: ./BU_e_RDV/DadosDeUrna_o00407-0001900010003_1673269612030.zip extracting: /home/yhgalvao/OneDrive/Documentos/AI/Notebooks/Brazil_2022_Elections/formato-arquivos-bu-rdv-ass-digital/BU_e_RDV/extracted/o00407-0001900010003.bu extracting: /home/yhgalvao/OneDrive/Documentos/AI/Notebooks/Brazil_2022_Elections/formato-arquivos-bu-rdv-ass-digital/BU_e_RDV/extracted/o00407-0001900010003.logjez Archive: ./BU_e_RDV/DadosDeUrna_o00407-0001900010004_1673243094855.zip extracting: /home/yhgalvao/OneDrive/Documentos/AI/Notebooks/Brazil_2022_Elections/formato-arquivos-bu-rdv-ass-digital/BU_e_RDV/extracted/o00407-0001900010004.bu extracting: /home/yhgalvao/OneDrive/Documentos/AI/Notebooks/Brazil_2022_Elections/formato-arquivos-bu-rdv-ass-digital/BU_e_RDV/extracted/o00407-0001900010004.logjez (...) Archive: ./BU_e_RDV/DadosDeUrna_o00407-9005000080063_1673201365039.zip extracting: /home/yhgalvao/OneDrive/Documentos/AI/Notebooks/Brazil_2022_Elections/formato-arquivos-bu-rdv-ass-digital/BU_e_RDV/extracted/o00407-9005000080063.bu extracting: /home/yhgalvao/OneDrive/Documentos/AI/Notebooks/Brazil_2022_Elections/formato-arquivos-bu-rdv-ass-digital/BU_e_RDV/extracted/o00407-9005000080063.logjez
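For reference, the same extraction can be done in pure Python; a minimal sketch, assuming the downloaded zips live in ./BU_e_RDV/ and follow the DadosDeUrna_*.zip naming seen above:
import glob, os, zipfile

os.makedirs('./BU_e_RDV/extracted', exist_ok=True)
for zip_path in glob.glob('./BU_e_RDV/DadosDeUrna_*.zip'):
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            # Keep only the bulletin (.bu) and compressed log (.logjez) files
            if name.endswith(('.bu', '.logjez')):
                zf.extract(name, './BU_e_RDV/extracted')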
2.3. Transforming the Data¶
The BU files (.bu) are in ASN.1 format, which is neither human-readable nor natively parsable by Python. Therefore, the Bash script below automates dumping the data from the ASN.1 files into readable flat text files (.txt).
Moreover, the "bu_etl" function transforms the data from a dumped BU text file into a Pandas DataFrame and returns it.
!./dump_bu.sh
2023-01-09 21:52:00,620 - INFO - Converte ./BU_e_RDV/extracted/o00407-0001900010003.bu com as especificações ['bu.asn1'] 2023-01-09 21:52:00,868 - INFO - Converte ./BU_e_RDV/extracted/o00407-0001900010004.bu com as especificações ['bu.asn1'] 2023-01-09 21:52:01,115 - INFO - Converte ./BU_e_RDV/extracted/o00407-0001900010005.bu com as especificações ['bu.asn1'] (...) 2023-01-09 23:31:29,765 - INFO - Converte ./BU_e_RDV/extracted/o00407-1583000350047.bu com as especificações ['bu.asn1']
def bu_etl(file_path: str) -> pd.DataFrame:
    """
    Transforms the data from a dumped BU text file into a Pandas DataFrame.
    Arg.: (str) the path to the BU text file
    Returns: a one-row Pandas DataFrame.
    """
    # Read the .txt file
    with open(file_path) as f:
        data = f.read()
    # Split the data into lines
    lines = data.split('\n')
    # Initialize the containers: a dict for the row data, a list for pending
    # candidate codes, and a counter for blank + null votes
    data_dict = {}
    codes = []
    brancos_e_nulos = 0
    # The variables we want to keep from the dump
    wanted_variables = (
        'fase',
        'local',
        'municipio',
        'zona',
        'secao',
        'qtdEleitoresCompBiometrico',
        'idEleicao',
        'qtdEleitoresAptos',
        'qtdComparecimento',
        'qtd_votos_13',
        'qtd_votos_22',
        'brancos_nulos',
        'versaoVotacao'
    )
    # Iterate through the lines of the file
    for i, line in enumerate(lines):
        try:
            # Split the line by ' = ' and extract the left and right parts
            left, right = line.split(' = ')
        except ValueError:
            # Not a 'key = value' line; skip it
            continue
        # Split the left part by '.', extract the last element (the column name), and strip surrounding white space
        column_name = left.split('.')[-1].strip()
        value = right.strip("'")
        # A 'codigo' line carries the candidate code (13 or 22) of the 'quantidadeVotos' line that follows it
        if column_name == 'codigo':
            codes.append(value)
        # The line-number windows below correspond to the fixed positions of the candidate
        # vote blocks and the blank/null vote blocks in the dump layout
        if column_name == 'quantidadeVotos' and len(codes) == 1 and (34 < i < 40):
            column_name = f'qtd_votos_{codes[0]}'
            codes.pop()
        if column_name == 'quantidadeVotos' and len(codes) == 1 and (40 < i < 46):
            column_name = f'qtd_votos_{codes[0]}'
            codes.pop()
        if column_name == 'quantidadeVotos' and len(codes) == 0 and i >= 46:
            column_name = 'brancos_nulos'
            brancos_e_nulos += int(value)
            value = brancos_e_nulos
        if column_name in wanted_variables:
            data_dict[column_name] = [value]
    # Create the one-row DataFrame
    df = pd.DataFrame(data_dict)
    return df
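A quick usage example (the path below is hypothetical; any dumped BU .txt file works):
sample = bu_etl('./BU_e_RDV/extracted/o00407-0001900010003.txt')
print(sample.T)  # transpose the single row for easier reading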
2.4. Loading the Data¶
Below are the procedures (mainly loops) that finally load the data into DataFrames usable by us, by Scikit-Learn, and by MiniSom.
# Importing the data from the BUs into a Pandas DataFrame
files_df2 = glob.glob("./BU_e_RDV/extracted/*.txt")
df_bu_list = []

## This loop will run the ETL on each dumped BU file, delete the file afterwards,
## and checkpoint the accumulated DataFrame to a CSV file every 50 files
for i, file_path in enumerate(files_df2):
    bu = bu_etl(file_path)
    df_bu_list.append(bu)
    os.remove(file_path)
    if i % 50 == 0:
        df_bu = pd.concat(df_bu_list, ignore_index=True)
        df_bu.to_csv('df_bu.csv') # Saves the DF to a CSV file, so we don't need to run all the ETL process again in the future
        df_bu_list = [df_bu]

if len(df_bu_list) > 1:
    df_bu = pd.concat(df_bu_list, ignore_index=True)
    df_bu.to_csv('df_bu.csv')
# Importing the data from the log files into a Pandas DataFrame
files_df1 = glob.glob("./BU_e_RDV/extracted/*.logjez") # This is where I stored all the 7Zip logfiles from the "extract_bu_and_log.sh" script
df_logs_list = []

## This loop will extract the logs (in flat file format, .dat) and import their data into a Pandas DataFrame.
## At the end of each iteration, it deletes the freshly extracted .dat file (around 700 kB)
## and the original 7Zip file (the .logjez file), in order to save disk space
for i, file_path in enumerate(files_df1):
    logjez_file = py7zr.SevenZipFile(file_path, mode="r")
    logjez_file.extractall(path="./BU_e_RDV/extracted/")
    logjez_file.close()
    log = pd.read_csv(
        './BU_e_RDV/extracted/logd.dat',
        encoding='ISO 8859-1',
        header=None,
        names=['date_time', 'event_type', 'id_evm', 'system', 'description', 'authenticator'],
        sep=None,
        engine='python'
    )
    df_logs_list.append(log)
    os.remove('./BU_e_RDV/extracted/logd.dat')
    os.remove(file_path)
    if i % 50 == 0:
        df_logs = pd.concat(df_logs_list, ignore_index=True)
        df_logs.to_csv('df_logs.csv') # Saves the DF to a CSV file, so we don't need to run all the ETL process again in the future
        df_logs_list = [df_logs]
    if i > 20000:
        break

if len(df_logs_list) > 1:
    df_logs = pd.concat(df_logs_list, ignore_index=True)
    df_logs.to_csv('df_logs.csv')
--------------------------------------------------------------------------- KeyboardInterrupt
# In case this notebook crashes (or if you already have the CSV files),
# you can simply read the CSV files created in the previous two cells
# by running this cell.
# Running this cell when it is not necessary is safe and harmless.
df_logs = pd.read_csv('df_logs.csv', index_col=0)
df_bu = pd.read_csv('df_bu.csv', index_col=0)
df_logs.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 2257041 entries, 0 to 2257040 Data columns (total 5 columns): # Column Dtype --- ------ ----- 0 date_time object 1 event_type object 2 id_evm int64 3 system object 4 description object dtypes: int64(1), object(4) memory usage: 103.3+ MB
df_bu.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 34644 entries, 0 to 34643 Data columns (total 10 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 local 34644 non-null int64 1 municipio 34644 non-null int64 2 zona 34644 non-null int64 3 secao 34644 non-null int64 4 qtdEleitoresCompBiometrico 34644 non-null int64 5 qtdEleitoresAptos 34644 non-null int64 6 qtdComparecimento 34644 non-null int64 7 qtd_votos_13 34644 non-null int64 8 qtd_votos_22 34644 non-null int64 9 brancos_nulos 34644 non-null int64 dtypes: int64(10) memory usage: 2.9 MB
3. Exploratory Data Analysis¶
3.1. Overall Analysis of the Logs DataFrame¶
from IPython.display import display as show
show(df_logs.head(10))
show(df_logs.tail(10))
date_time | event_type | id_evm | system | description | authenticator | |
---|---|---|---|---|---|---|
0 | 23/09/2022 09:46:41 | INFO | 67305985 | LOGD | Início das operações do logd | 5A7DA76DD475845C |
1 | 23/09/2022 09:46:41 | INFO | 67305985 | LOGD | Urna ligada em 23/09/2022 às 09:45:21 | 624DDD7828CC05C0 |
2 | 23/09/2022 09:46:41 | INFO | 67305985 | SCUE | Iniciando aplicação - Oficial - 1º turno | 81D6CC6DB5FDDCCC |
3 | 23/09/2022 09:46:41 | INFO | 67305985 | SCUE | Versão da aplicação: 8.26.0.0 - Onça-pintada | 958EFE65A6117BF4 |
4 | 23/09/2022 09:46:43 | INFO | 67305985 | SCUE | Urna operando com rede elétrica | 13130D8758EFB5EB |
5 | 23/09/2022 09:46:43 | INFO | 67305985 | SCUE | Bateria interna com carga plena | EE968B0C96E9D91B |
6 | 23/09/2022 09:46:49 | INFO | 67305985 | SCUE | Tamanho da mídia interna: 488.7 MB | 1E829998F4272762 |
7 | 23/09/2022 09:46:52 | INFO | 67305985 | SCUE | Tamanho da memória: 489.4 MB | 114FCB81EB5E33C1 |
8 | 23/09/2022 09:46:53 | INFO | 67305985 | SCUE | Verificação de assinatura de aplicação por eta... | E0F945D741906B3C |
9 | 23/09/2022 09:46:53 | INFO | 67305985 | SCUE | Verificação de assinatura de aplicação por eta... | 1ACBC4446E919E65 |
date_time | event_type | id_evm | system | description | authenticator | |
---|---|---|---|---|---|---|
44639068 | 30/10/2022 17:02:50 | INFO | 67305985 | VOTA | Gerando arquivo de resultado [.wsqbio] + [Térm... | 02B9AA39FEFCFDD6 |
44639069 | 30/10/2022 17:02:50 | INFO | 67305985 | VOTA | Gerando arquivo de resultado [.wsqman] + [Início] | EBE25D58F9D6EEA5 |
44639070 | 30/10/2022 17:02:50 | INFO | 67305985 | VOTA | Gerando arquivo de resultado [.wsqman] + [Térm... | DEB08BE94048457D |
44639071 | 30/10/2022 17:02:50 | INFO | 67305985 | VOTA | Gerando arquivo de resultado [.wsqmes] + [Início] | C25D1B574367931F |
44639072 | 30/10/2022 17:02:51 | INFO | 67305985 | VOTA | Gerando arquivo de resultado [.wsqmes] + [Térm... | 8BD3F01DA05ADD28 |
44639073 | 30/10/2022 17:02:51 | INFO | 67305985 | VOTA | Gerando arquivo de resultado [.ver] + [Início] | 42C25A3BA8669F70 |
44639074 | 30/10/2022 17:02:51 | INFO | 67305985 | VOTA | Gerando arquivo de resultado [.ver] + [Término] | 1010DDA0D08CAF07 |
44639075 | 30/10/2022 17:02:51 | INFO | 67305985 | VOTA | Gerando arquivo de resultado [.chvtp] + [Início] | F53CEA7E4F73FAD2 |
44639076 | 30/10/2022 17:02:51 | INFO | 67305985 | VOTA | Gerando arquivo de resultado [.chvtp] + [Término] | BE5C958E224DFDC6 |
44639077 | 30/10/2022 17:02:51 | INFO | 67305985 | VOTA | Gerando arquivo de resultado [.logjez] + [Início] | 557504AA3E902073 |
df_logs.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 18639078 entries, 0 to 18639077 Data columns (total 6 columns): # Column Dtype --- ------ ----- 0 date_time object 1 event_type object 2 id_evm int64 3 system object 4 description object 5 authenticator object dtypes: int64(1), object(5) memory usage: 853.2+ MB
We can see that each row in the df_logs DataFrame is one event from one of the systems running inside an EVM.
For feasibility reasons, this DataFrame holds only a fraction (around 2%) of the data from all the EVMs' logs.
3.1.1. Checking duplicated rows¶
df_logs.duplicated().sum()
0
3.1.2. Checking null values¶
df_logs.isna().sum()
date_time 0 event_type 0 id_evm 0 system 0 description 0 authenticator 0 dtype: int64
3.1.3. Checking zero values¶
(df_logs == 0).any(axis=1).sum()
0
# Checking zero values on the columns
(df_logs == 0).any(axis=0)
date_time False event_type False id_evm False system False description False authenticator False dtype: bool
3.1.4. Checking the possible values for each variable and their frequencies¶
for column in df_logs.columns:
    print(column + ''''s different possible values quantity:''', df_logs[column].nunique(), '\n')
date_time's different possible values quantity: 556999 event_type's different possible values quantity: 4 id_evm's different possible values quantity: 761 system's different possible values quantity: 10 description's different possible values quantity: 133351 authenticator's different possible values quantity: 18639078
3.2. Overall Analysis of the BUs DataFrame¶
from IPython.display import display as show
show(df_bu.head(10))
show(df_bu.tail(10))
fase | local | municipio | zona | secao | qtdEleitoresCompBiometrico | idEleicao | qtdEleitoresAptos | qtdComparecimento | qtd_votos_13 | qtd_votos_22 | brancos_nulos | versaoVotacao | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | oficial | 1406 | 18376 | 51 | 92 | 199 | 545 | 243 | 208 | 112 | 91 | 5.0 | 8.26.0.0 - Onça-pintada |
1 | oficial | 1090 | 23086 | 34 | 110 | 247 | 547 | 295 | 258 | 200 | 43 | 15.0 | 8.26.0.0 - Onça-pintada |
2 | oficial | 1112 | 2372 | 30 | 12 | 190 | 547 | 291 | 205 | 157 | 40 | 8.0 | 8.26.0.0 - Onça-pintada |
3 | oficial | 1414 | 8052 | 35 | 71 | 185 | 545 | 281 | 220 | 191 | 25 | 4.0 | 8.26.0.0 - Onça-pintada |
4 | oficial | 1031 | 17655 | 12 | 58 | 204 | 545 | 274 | 217 | 147 | 61 | 9.0 | 8.26.0.0 - Onça-pintada |
5 | oficial | 1325 | 4014 | 7 | 188 | 238 | 545 | 310 | 268 | 218 | 44 | 6.0 | 8.26.0.0 - Onça-pintada |
6 | oficial | 1066 | 477 | 21 | 310 | 166 | 547 | 300 | 185 | 42 | 142 | 1.0 | 8.26.0.0 - Onça-pintada |
7 | oficial | 1074 | 9539 | 49 | 23 | 158 | 545 | 266 | 183 | 116 | 64 | 3.0 | 8.26.0.0 - Onça-pintada |
8 | oficial | 1023 | 22314 | 27 | 18 | 217 | 547 | 284 | 225 | 157 | 42 | 26.0 | 8.26.0.0 - Onça-pintada |
9 | oficial | 1708 | 15695 | 81 | 44 | 276 | 545 | 347 | 281 | 131 | 134 | 16.0 | 8.26.0.0 - Onça-pintada |
fase | local | municipio | zona | secao | qtdEleitoresCompBiometrico | idEleicao | qtdEleitoresAptos | qtdComparecimento | qtd_votos_13 | qtd_votos_22 | brancos_nulos | versaoVotacao | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
61068 | oficial | 1058 | 54070 | 280 | 120 | 138 | 545 | 377 | 301 | 124 | 169 | 8.0 | 8.26.0.0 - Onça-pintada |
61069 | oficial | 1635 | 58335 | 126 | 367 | 327 | 545 | 393 | 330 | 128 | 193 | 9.0 | 8.26.0.0 - Onça-pintada |
61070 | oficial | 1392 | 39136 | 33 | 212 | 298 | 547 | 414 | 355 | 250 | 95 | 10.0 | 8.26.0.0 - Onça-pintada |
61071 | oficial | 1120 | 54070 | 280 | 318 | 215 | 545 | 371 | 262 | 134 | 118 | 10.0 | 8.26.0.0 - Onça-pintada |
61072 | oficial | 1210 | 58190 | 76 | 57 | 115 | 545 | 246 | 176 | 50 | 117 | 9.0 | 8.26.0.0 - Onça-pintada |
61073 | oficial | 1040 | 58300 | 255 | 192 | 223 | 545 | 367 | 260 | 128 | 129 | 3.0 | 8.26.0.0 - Onça-pintada |
61074 | oficial | 1171 | 58092 | 42 | 18 | 191 | 545 | 374 | 290 | 117 | 161 | 12.0 | 8.26.0.0 - Onça-pintada |
61075 | oficial | 1090 | 39675 | 69 | 111 | 221 | 547 | 301 | 227 | 183 | 32 | 12.0 | 8.26.0.0 - Onça-pintada |
61076 | oficial | 1120 | 54259 | 311 | 191 | 247 | 545 | 370 | 291 | 119 | 165 | 7.0 | 8.26.0.0 - Onça-pintada |
61077 | oficial | 1066 | 73334 | 34 | 92 | 201 | 545 | 272 | 202 | 138 | 58 | 6.0 | 8.26.0.0 - Onça-pintada |
df_bu.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 61078 entries, 0 to 61077 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 fase 61078 non-null object 1 local 61078 non-null object 2 municipio 61078 non-null object 3 zona 61078 non-null object 4 secao 61078 non-null object 5 qtdEleitoresCompBiometrico 60764 non-null object 6 idEleicao 61078 non-null object 7 qtdEleitoresAptos 61078 non-null object 8 qtdComparecimento 61078 non-null object 9 qtd_votos_13 60793 non-null object 10 qtd_votos_22 59551 non-null object 11 brancos_nulos 59380 non-null float64 12 versaoVotacao 61078 non-null object dtypes: float64(1), object(12) memory usage: 6.1+ MB
We can see that each row in the df_bu DataFrame corresponds to one EVM.
For feasibility reasons, this DataFrame holds only a fraction (around 13%) of the data from all the EVMs' BUs.
3.2.1. Checking duplicated rows¶
df_bu.duplicated().sum()
0
3.2.2. Checking null values¶
df_bu.isna().sum()
fase 0 local 0 municipio 0 zona 0 secao 0 qtdEleitoresCompBiometrico 314 idEleicao 0 qtdEleitoresAptos 0 qtdComparecimento 0 qtd_votos_13 285 qtd_votos_22 1527 brancos_nulos 1698 versaoVotacao 0 dtype: int64
3.2.3. Checking zero values¶
(df_bu == 0).any(axis=1).sum()
0
# Checking zero values on the columns
(df_bu == 0).any(axis=0)
fase False local False municipio False zona False secao False qtdEleitoresCompBiometrico False idEleicao False qtdEleitoresAptos False qtdComparecimento False qtd_votos_13 False qtd_votos_22 False brancos_nulos False versaoVotacao False dtype: bool
3.2.4. Checking the possible values for each variable and their frequencies¶
for column in df_bu.columns:
    print(column + ''''s different possible values quantity:''', df_bu[column].nunique(), '\n')
fase's different possible values quantity: 1 local's different possible values quantity: 265 municipio's different possible values quantity: 789 zona's different possible values quantity: 195 secao's different possible values quantity: 1054 qtdEleitoresCompBiometrico's different possible values quantity: 421 idEleicao's different possible values quantity: 2 qtdEleitoresAptos's different possible values quantity: 523 qtdComparecimento's different possible values quantity: 426 qtd_votos_13's different possible values quantity: 333 qtd_votos_22's different possible values quantity: 291 brancos_nulos's different possible values quantity: 50 versaoVotacao's different possible values quantity: 1
df_bu.idEleicao.value_counts()
545 34644 547 26434 Name: idEleicao, dtype: int64
3.3. Data Cleaning - Logs DF¶
3.3.1. Deleting unnecessary features¶
df_logs.drop(columns='authenticator', inplace=True)
df_logs.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 18639078 entries, 0 to 18639077 Data columns (total 5 columns): # Column Dtype --- ------ ----- 0 date_time object 1 event_type object 2 id_evm int64 3 system object 4 description object dtypes: int64(1), object(4) memory usage: 711.0+ MB
3.3.2. Correcting data types¶
df_logs['date_time'] = pd.to_datetime(df_logs.date_time)
for column in ['event_type', 'id_evm', 'system']:
    df_logs[column] = df_logs[column].astype('category')
df_logs.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 18639078 entries, 0 to 18639077 Data columns (total 5 columns): # Column Dtype --- ------ ----- 0 date_time datetime64[ns] 1 event_type category 2 id_evm category 3 system category 4 description object dtypes: category(3), datetime64[ns](1), object(1) memory usage: 355.5+ MB
3.3.3. Deleting personal data¶
This step is for ethical reasons: it deletes some personal data of the poll workers / clerks that appears in the log files, namely the Brazilian individual taxpayer number (Cadastro de Pessoa Física – CPF), which is analogous to a Social Insurance Number.
df_logs[df_logs.description.str.contains('mesário', case=False)]
date_time | event_type | id_evm | system | description | |
---|---|---|---|---|---|
438 | 2022-02-10 07:14:21 | INFO | 67305985 | VOTA | Operador indagado se ocorrerá registro de mesá... |
439 | 2022-02-10 07:16:12 | INFO | 67305985 | VOTA | Operador confirmou para registrar mesários |
440 | 2022-02-10 07:16:13 | INFO | 67305985 | VOTA | Registrando mesários antes da votação |
441 | 2022-02-10 07:16:29 | INFO | 67305985 | VOTA | Pedido de leitura da biometria do mesário 0182... |
442 | 2022-02-10 07:16:37 | INFO | 67305985 | VOTA | Realizada a conferência da biometria do mesário |
... | ... | ... | ... | ... | ... |
18639044 | 2022-10-30 17:01:33 | INFO | 67305985 | VOTA | Mesário 048058530906 não é eleitor da seção |
18639045 | 2022-10-30 17:01:34 | INFO | 67305985 | VOTA | Mesário 048058530906 registrado |
18639046 | 2022-10-30 17:01:38 | INFO | 67305985 | VOTA | Operador encerrou ciclo de registro de mesários |
18639055 | 2022-10-30 17:02:00 | INFO | 67305985 | VOTA | Mesário indagado sobre qualidade do Boletim de... |
18639056 | 2022-10-30 17:02:37 | INFO | 67305985 | VOTA | Mesário indicou qualidade OK para Boletim de Urna |
1743820 rows × 5 columns
def delete_cpf(description: str) -> str:
    """Deletes only the CPF (any purely numeric token) from a log description."""
    # Keep every token that is not a pure number (CPFs appear in the logs as numeric tokens)
    tokens = [token for token in description.split(' ') if not token.isdigit()]
    return ' '.join(tokens)
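A quick check on a description taken from the logs above:
print(delete_cpf('Mesário 048058530906 registrado'))  # -> 'Mesário registrado'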
df_logs['description'] = df_logs.description.apply(delete_cpf)
df_logs[df_logs.description.str.contains('mesário', case=False)]
date_time | event_type | id_evm | system | description | event_type_enc | date_time_enc | system_enc | description_enc | |
---|---|---|---|---|---|---|---|---|---|
138 | 2022-07-10 08:43:20 | INFO | 67305985 | VOTA | Mesário indagado se deseja imprimir cópia do BU | 3 | 52665 | 8 | 24839 |
821 | 2022-10-30 07:01:22 | INFO | 67305985 | VOTA | Operador indagado se ocorrerá registro de mesá... | 3 | 127038 | 8 | 25281 |
822 | 2022-10-30 07:11:26 | INFO | 67305985 | VOTA | Operador confirmou para registrar mesários | 3 | 127640 | 8 | 25277 |
823 | 2022-10-30 07:11:26 | INFO | 67305985 | VOTA | Registrando mesários antes da votação | 3 | 127640 | 8 | 26971 |
824 | 2022-10-30 07:11:55 | INFO | 67305985 | VOTA | Pedido de leitura da biometria do mesário | 3 | 127669 | 8 | 25287 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2256962 | 2022-02-10 16:57:10 | INFO | 2074298 | VOTA | Título digitado pelo mesário | 3 | 42355 | 8 | 27572 |
2256985 | 2022-02-10 16:58:55 | INFO | 2074298 | VOTA | Título digitado pelo mesário | 3 | 42460 | 8 | 27572 |
2257001 | 2022-02-10 17:00:05 | INFO | 2074298 | VOTA | Título digitado pelo mesário | 3 | 42530 | 8 | 27572 |
2257018 | 2022-02-10 17:01:04 | INFO | 2074298 | VOTA | Título digitado pelo mesário | 3 | 42589 | 8 | 27572 |
2257033 | 2022-02-10 17:02:04 | INFO | 2074298 | VOTA | Título digitado pelo mesário | 3 | 42649 | 8 | 27572 |
212030 rows × 9 columns
3.4. Data Cleaning - BUs DF¶
3.4.1. Deleting unnecessary features¶
df_bu.drop(columns='fase', inplace=True)
df_bu = df_bu[df_bu.idEleicao=='545'].copy() # The id for the 2nd Round of the 2022 Presidential Election is 545
df_bu.drop(columns='idEleicao', inplace=True)
df_bu.drop(columns='versaoVotacao', inplace=True)
df_bu.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 34644 entries, 0 to 61077 Data columns (total 10 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 local 34644 non-null object 1 municipio 34644 non-null object 2 zona 34644 non-null object 3 secao 34644 non-null object 4 qtdEleitoresCompBiometrico 34644 non-null object 5 qtdEleitoresAptos 34644 non-null object 6 qtdComparecimento 34644 non-null object 7 qtd_votos_13 34580 non-null object 8 qtd_votos_22 33843 non-null object 9 brancos_nulos 33535 non-null float64 dtypes: float64(1), object(9) memory usage: 2.9+ MB
3.4.2. Replacing null values¶
After analyzing the null values in this DF, I concluded that they actually represent zero (0) values.
df_bu.fillna(0, inplace=True)
3.4.3. Correcting data types¶
for column in df_bu.columns.to_list():
    if column[:3] == 'qtd':
        df_bu[column] = df_bu[column].astype('int')
    else:
        df_bu[column] = df_bu[column].astype('category')

df_bu['brancos_nulos'] = df_bu['brancos_nulos'].astype('int')
df_bu.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 34644 entries, 0 to 34643 Data columns (total 10 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 local 34644 non-null category 1 municipio 34644 non-null category 2 zona 34644 non-null category 3 secao 34644 non-null category 4 qtdEleitoresCompBiometrico 34644 non-null int64 5 qtdEleitoresAptos 34644 non-null int64 6 qtdComparecimento 34644 non-null int64 7 qtd_votos_13 34644 non-null int64 8 qtd_votos_22 34644 non-null int64 9 brancos_nulos 34644 non-null int64 dtypes: category(4), int64(6) memory usage: 2.2 MB
3.5. Descriptive Statistics¶
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
import seaborn as sns
from scipy import stats # needed by the Freedman-Diaconis helper below
# Defining a function to classify columns into categorical, quantitative, and date variables
def check_variables(df: pd.DataFrame) -> tuple:
    """
    Separates the categorical, quantitative, and date/time variables, returning one list for each.
    """
    cols = df.columns
    date_cols = df.select_dtypes(include='datetime').columns
    quantitative_cols = df.select_dtypes(include='number').columns
    categorical_cols = list(set(cols) - set(quantitative_cols) - set(date_cols))
    quantitative_cols = set(quantitative_cols) - set(date_cols)
    return categorical_cols, list(quantitative_cols), list(date_cols)
# Defining a function to examine categorical variables
def examine_categorical(categ_var: pd.Series, top: int = 10, others: bool = True) -> None:
    '''
    This function gets a Pandas Series (a categorical column of a DataFrame) and:
    - Gets the top 10 (or another chosen quantity of) values
    - Compiles all the other values into "others" (or not, if chosen otherwise)
    - Prints a frequency distribution table
    - Plots a pie chart
    - Plots a bar chart
    '''
    vc = categ_var.value_counts()
    vc2 = vc.sort_values(ascending=False)[:top]
    new_row = pd.Series(data={'others': vc.sort_values(ascending=False)[top:].sum()})
    vc3 = pd.concat([vc2, new_row])
    if others == True:
        vc = vc3
        msg = f'''Please, note that, for many reasons, only the top {top} values were considered to these calculations and visualizations.
All the other values were compiled into the "others" name.'''
    else:
        vc = vc2
        msg = f'''Please, note that, for many reasons, only the top {top} values were considered to these calculations and visualizations.'''
    # Frequency distribution
    print(f'''Frequency distribution table for different values of "{categ_var.name}" variable: \n\n{vc}\n''')
    print(msg)
    # Pie chart
    vc.plot(
        kind='pie',
        ylabel=categ_var.name,
        autopct='%.2f%%',
        figsize=(10,10))
    plt.show()
    plt.close()
    # Bar chart
    bar = vc.plot(
        kind='bar',
        figsize=(10,8),
        align='center')
    bar.bar_label(bar.containers[0])
    plt.show()
    plt.close()
    # Separator line
    print('_' * 120 + '\n' * 3)
# Defining a function to examine numerical variables
def examine_quant(
    variable: pd.Series,
    optmize_n_bins: bool = False,
    no_outliers: bool = False,
    n_bins: int = 0
) -> None:
    '''
    Gets a Pandas Series and:
    - Prints measures of central tendency
    - Prints measures of spread
    - Takes the outliers out using the 1.5 IQR criterion, if "no_outliers" == True
    - Tries to calculate the optimal number of bins for the histogram, if "optmize_n_bins" == True
    - Plots a histogram
    - Plots a box-plot
    '''
    var_desc = variable.describe()
    IQR = var_desc['75%'] - var_desc['25%']
    print(f'''### Measures for variable '{variable.name}':
## Measures of center:
Mode: {variable.mode()[0]}
Mean: {var_desc['mean']}
Median: {var_desc['50%']}
## Measures of spread:
Min: {var_desc['min']}
Max: {var_desc['max']}
Range: {var_desc['max'] - var_desc['min']}
1st Quartile (Q25): {var_desc['25%']}
3rd Quartile (Q75): {var_desc['75%']}
IQR: {IQR}
Standard deviation: {var_desc['std']}\n''')
    if no_outliers == True:
        # Keep only the points within [Q25 - 1.5*IQR, Q75 + 1.5*IQR]
        variable = variable[(variable <= (var_desc['75%'] + 1.5 * IQR)) & (variable >= (var_desc['25%'] - 1.5 * IQR))]

    def freedman_diaconis(variable: pd.Series) -> int:
        """
        Uses the Freedman-Diaconis rule to compute the optimal histogram bin width - it tries to return the optimal number of bins.
        """
        data = np.asarray(variable.values, dtype=np.float_)
        IQR = stats.iqr(data, rng=(25, 75), scale=1.0, nan_policy='propagate')
        N = data.size
        bw = (2 * IQR) / np.power(N, 1/3)
        datmin, datmax = data.min(), data.max()
        datrng = datmax - datmin
        result = int(((datrng / bw) + 1) / 5)
        return result

    # Histogram
    if optmize_n_bins:
        try:
            n_bins_ = freedman_diaconis(variable)
        except Exception as e:
            print(e)
        else:
            variable.hist(bins=n_bins_)
            plt.show()
            plt.close()
    elif n_bins:
        variable.hist(bins=n_bins)
        plt.show()
        plt.close()
    else:
        variable.hist()
        plt.show()
        plt.close()
    # Boxplot
    plt.boxplot(x=variable, labels=[variable.name])
    plt.ylabel(variable.name)
    plt.show()
    plt.close()
    # Separator line
    print('_' * 120 + '\n' * 3)
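For reference, the Freedman-Diaconis rule used by the helper above sets the bin width h from the IQR and the sample size n as h = 2 * IQR(x) / n^(1/3), which gives approximately (max(x) - min(x)) / h bins; the helper then adds 1 and divides the bin count by 5, a pragmatic tweak to keep the histograms readable.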
3.5.1. Visualizing the Data - Logs DF¶
mpl.rcParams['font.family'] = ['serif']
# Defining better settings for the visualizations using Seaborn
sns.set(rc={'figure.figsize':(8,6)}, style='whitegrid')
# Classifying variables by their type
cat_cols_logs, quan_cols_logs, date_cols_logs = check_variables(df_logs)
print('Categorical variables:')
show(cat_cols_logs)
print('\nQuantitative variables:')
show(quan_cols_logs)
print('\nFull date/time variables:')
show(date_cols_logs)
Categorical variables:
['event_type', 'id_evm', 'system', 'description']
Quantitative variables:
[]
Full date/time variables:
['date_time']
Visualizing and Examining Categorical Variables¶
# Examining categorical variables
for variable in cat_cols_logs:
    examine_categorical(df_logs[variable], top=8, others=True)
Frequency distribution table for different values of "id_evm" variable: 67305985 11761980 2070822 12840 2054533 12742 2053621 12712 2072283 12679 2036805 12648 2040850 12629 2086021 12607 others 6788241 dtype: int64 Please, note that, for many reasons, only the top 8 values were considered to these calculations and visualizations. All the other values were compiled into the "others" name.
________________________________________________________________________________________________________________________ Frequency distribution table for different values of "system" variable: VOTA 16883946 GAP 1140956 ATUE 347426 SCUE 173027 LOGD 78854 INITJE 13284 VPP 1038 RED 357 others 190 dtype: int64 Please, note that, for many reasons, only the top 8 values were considered to these calculations and visualizations. All the other values were compiled into the "others" name.
________________________________________________________________________________________________________________________ Frequency distribution table for different values of "description" variable: Aguardando digitação do título 1157222 Título digitado pelo mesário 1126201 Tecla indevida pressionada 1053952 Eleitor foi habilitado 1035473 O voto do eleitor foi computado 1035029 Voto confirmado para [Presidente] 1034747 Solicita digital. Tentativa [1] de [4] 928686 Capturada a digital. Tentativa [1] de [4] 915983 others 10351785 dtype: int64 Please, note that, for many reasons, only the top 8 values were considered to these calculations and visualizations. All the other values were compiled into the "others" name.
________________________________________________________________________________________________________________________ Frequency distribution table for different values of "event_type" variable: INFO 18460996 ALERTA 173218 ERRO 4752 EXTERNO 112 others 0 dtype: int64 Please, note that, for many reasons, only the top 8 values were considered to these calculations and visualizations. All the other values were compiled into the "others" name.
________________________________________________________________________________________________________________________
There are no quantitative variables in df_logs.
# Examining quantitative variables with outliers
for variable in quan_cols_logs:
    try:
        examine_quant(df_logs[variable], optmize_n_bins=False, no_outliers=False, n_bins=False)
    except Exception as e:
        print(e)
# Examining quantitative variables without outliers
for variable in quan_cols_logs:
    try:
        examine_quant(df_logs[variable], optmize_n_bins=False, no_outliers=True, n_bins=False)
    except Exception as e:
        print(e)
3.5.2. Visualizing the Data - BUs DF¶
# Classifying variables by their type
cat_cols_bu, quan_cols_bu, date_cols_bu = check_variables(df_bu)
print('Categorical variables:')
show(cat_cols_bu)
print('\nQuantitative variables:')
show(quan_cols_bu)
print('\nFull date/time variables:')
show(date_cols_bu)
Categorical variables:
['municipio', 'zona', 'local', 'secao']
Quantitative variables:
['brancos_nulos', 'qtdEleitoresAptos', 'qtd_votos_13', 'qtdComparecimento', 'qtdEleitoresCompBiometrico', 'qtd_votos_22']
Full date/time variables:
[]
Visualizing and Examining Categorical Variables¶
# Examining categorical variables
for variable in cat_cols_bu:
    examine_categorical(df_bu[variable], top=12, others=True)
Frequency distribution table for different values of "municipio" variable: 58335 1816 12190 1678 54038 1453 17612 1414 58190 1090 6050 952 58041 902 3018 741 59013 694 59250 662 73440 609 17590 540 others 22093 dtype: int64 Please, note that, for many reasons, only the top 12 values were considered to these calculations and visualizations. All the other values were compiled into the "others" name.
________________________________________________________________________________________________________________________ Frequency distribution table for different values of "zona" variable: 1 1805 2 1548 5 974 7 941 4 773 29 759 98 652 19 607 50 599 8 568 3 540 97 539 others 24339 dtype: int64 Please, note that, for many reasons, only the top 12 values were considered to these calculations and visualizations. All the other values were compiled into the "others" name.
________________________________________________________________________________________________________________________ Frequency distribution table for different values of "secao" variable: 25 166 27 165 26 162 30 160 28 159 43 159 41 158 33 158 23 156 44 156 22 155 24 155 others 32735 dtype: int64 Please, note that, for many reasons, only the top 12 values were considered to these calculations and visualizations. All the other values were compiled into the "others" name.
________________________________________________________________________________________________________________________ Frequency distribution table for different values of "local" variable: 1015 1974 1023 1534 1031 1442 1040 1163 1058 1142 1066 988 1082 961 1074 826 1120 750 1112 743 1090 741 1104 660 others 21720 dtype: int64 Please, note that, for many reasons, only the top 12 values were considered to these calculations and visualizations. All the other values were compiled into the "others" name.
________________________________________________________________________________________________________________________
Visualizing and Examining Quantitative Variables¶
# Examining quantitative variables with outliers
for variable in quan_cols_bu:
    try:
        examine_quant(df_bu[variable], optmize_n_bins=False, no_outliers=False, n_bins=False)
    except Exception as e:
        print(e)
### Measures for variable 'qtd_votos_22': ## Measures of center: Mode: 0 Mean: 109.83163029673248 Median: 110.0 ## Measures of spread: Min: 0.0 Max: 357.0 Range: 357.0 1st Quartile (Q25): 64.0 3rd Quartile (Q75): 155.0 IQR: 91.0 Standard deviation: 57.0083866624702
________________________________________________________________________________________________________________________ ### Measures for variable 'qtdEleitoresAptos': ## Measures of center: Mode: 393 Mean: 323.0540930608475 Median: 330.0 ## Measures of spread: Min: 22.0 Max: 552.0 Range: 530.0 1st Quartile (Q25): 282.0 3rd Quartile (Q75): 372.0 IQR: 90.0 Standard deviation: 69.95317971623534
________________________________________________________________________________________________________________________ ### Measures for variable 'brancos_nulos': ## Measures of center: Mode: 8 Mean: 8.493418773813648 Median: 8.0 ## Measures of spread: Min: 0.0 Max: 36.0 Range: 36.0 1st Quartile (Q25): 5.0 3rd Quartile (Q75): 11.0 IQR: 6.0 Standard deviation: 4.593713047585358
________________________________________________________________________________________________________________________ ### Measures for variable 'qtdComparecimento': ## Measures of center: Mode: 268 Mean: 255.61413231728437 Median: 261.0 ## Measures of spread: Min: 13.0 Max: 471.0 Range: 458.0 1st Quartile (Q25): 217.0 3rd Quartile (Q75): 297.0 IQR: 80.0 Standard deviation: 60.862333071411435
________________________________________________________________________________________________________________________ ### Measures for variable 'qtdEleitoresCompBiometrico': ## Measures of center: Mode: 258 Mean: 221.14672093291767 Median: 225.0 ## Measures of spread: Min: 3.0 Max: 425.0 Range: 422.0 1st Quartile (Q25): 179.0 3rd Quartile (Q75): 265.0 IQR: 86.0 Standard deviation: 64.89242566941138
________________________________________________________________________________________________________________________ ### Measures for variable 'qtd_votos_13': ## Measures of center: Mode: 111 Mean: 133.9214871261979 Median: 128.0 ## Measures of spread: Min: 0.0 Max: 346.0 Range: 346.0 1st Quartile (Q25): 98.0 3rd Quartile (Q75): 170.0 IQR: 72.0 Standard deviation: 52.067254609204234
________________________________________________________________________________________________________________________
# Examining quantitative variables without outliers
for variable in quan_cols_bu:
    try:
        examine_quant(df_bu[variable], optmize_n_bins=False, no_outliers=True, n_bins=False)
    except Exception as e:
        print(e)
### Measures for variable 'qtd_votos_22': ## Measures of center: Mode: 0 Mean: 109.83163029673248 Median: 110.0 ## Measures of spread: Min: 0.0 Max: 357.0 Range: 357.0 1st Quartile (Q25): 64.0 3rd Quartile (Q75): 155.0 IQR: 91.0 Standard deviation: 57.0083866624702
________________________________________________________________________________________________________________________ ### Measures for variable 'qtdEleitoresAptos': ## Measures of center: Mode: 393 Mean: 323.0540930608475 Median: 330.0 ## Measures of spread: Min: 22.0 Max: 552.0 Range: 530.0 1st Quartile (Q25): 282.0 3rd Quartile (Q75): 372.0 IQR: 90.0 Standard deviation: 69.95317971623534
________________________________________________________________________________________________________________________ ### Measures for variable 'brancos_nulos': ## Measures of center: Mode: 8 Mean: 8.493418773813648 Median: 8.0 ## Measures of spread: Min: 0.0 Max: 36.0 Range: 36.0 1st Quartile (Q25): 5.0 3rd Quartile (Q75): 11.0 IQR: 6.0 Standard deviation: 4.593713047585358
________________________________________________________________________________________________________________________ ### Measures for variable 'qtdComparecimento': ## Measures of center: Mode: 268 Mean: 255.61413231728437 Median: 261.0 ## Measures of spread: Min: 13.0 Max: 471.0 Range: 458.0 1st Quartile (Q25): 217.0 3rd Quartile (Q75): 297.0 IQR: 80.0 Standard deviation: 60.862333071411435
________________________________________________________________________________________________________________________ ### Measures for variable 'qtdEleitoresCompBiometrico': ## Measures of center: Mode: 258 Mean: 221.14672093291767 Median: 225.0 ## Measures of spread: Min: 3.0 Max: 425.0 Range: 422.0 1st Quartile (Q25): 179.0 3rd Quartile (Q75): 265.0 IQR: 86.0 Standard deviation: 64.89242566941138
________________________________________________________________________________________________________________________ ### Measures for variable 'qtd_votos_13': ## Measures of center: Mode: 111 Mean: 133.9214871261979 Median: 128.0 ## Measures of spread: Min: 0.0 Max: 346.0 Range: 346.0 1st Quartile (Q25): 98.0 3rd Quartile (Q75): 170.0 IQR: 72.0 Standard deviation: 52.067254609204234
________________________________________________________________________________________________________________________
4. Clustering¶
4.1. Preprocessing the Data¶
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
4.1.1. Logs DF¶
Label Encoding¶
labelencoder = LabelEncoder()
df_logs
date_time | event_type | id_evm | system | description | |
---|---|---|---|---|---|
0 | 2022-02-10 17:15:10 | INFO | 67305985 | VOTA | Imprimindo relatório [BU] via nº [2] |
1 | 2022-02-10 17:15:47 | INFO | 67305985 | VOTA | Imprimindo relatório [BU] via nº [3] |
2 | 2022-02-10 17:16:25 | INFO | 67305985 | VOTA | Imprimindo relatório [BU] via nº [4] |
3 | 2022-02-10 17:17:02 | INFO | 67305985 | VOTA | Imprimindo relatório [BU] via nº [5] |
4 | 2022-02-10 17:17:11 | INFO | 67305985 | VOTA | Imprimindo relatório [BUJ] via nº [1] |
... | ... | ... | ... | ... | ... |
2257036 | 2022-02-10 17:02:15 | INFO | 2074298 | VOTA | Dedo reconhecido e o score para habilitá-lo. P... |
2257037 | 2022-02-10 17:02:17 | INFO | 2074298 | VOTA | Tipo de habilitação do eleitor [biométrica] |
2257038 | 2022-02-10 17:02:17 | INFO | 2074298 | VOTA | Eleitor foi habilitado |
2257039 | 2022-02-10 17:02:29 | INFO | 2074298 | VOTA | Tecla indevida pressionada |
2257040 | 2022-02-10 17:02:35 | INFO | 2074298 | VOTA | Voto confirmado para [Deputado Federal] |
2257041 rows × 5 columns
for column in cat_cols_logs:
    df_logs[column+'_enc'] = labelencoder.fit_transform(df_logs[column])
df_logs
date_time | event_type | id_evm | system | description | event_type_enc | date_time_enc | system_enc | description_enc | |
---|---|---|---|---|---|---|---|---|---|
0 | 2022-02-10 17:15:10 | INFO | 67305985 | VOTA | Imprimindo relatório [BU] via nº [2] | 3 | 43435 | 8 | 24567 |
1 | 2022-02-10 17:15:47 | INFO | 67305985 | VOTA | Imprimindo relatório [BU] via nº [3] | 3 | 43472 | 8 | 24568 |
2 | 2022-02-10 17:16:25 | INFO | 67305985 | VOTA | Imprimindo relatório [BU] via nº [4] | 3 | 43510 | 8 | 24569 |
3 | 2022-02-10 17:17:02 | INFO | 67305985 | VOTA | Imprimindo relatório [BU] via nº [5] | 3 | 43547 | 8 | 24570 |
4 | 2022-02-10 17:17:11 | INFO | 67305985 | VOTA | Imprimindo relatório [BUJ] via nº [1] | 3 | 43556 | 8 | 24564 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2257036 | 2022-02-10 17:02:15 | INFO | 2074298 | VOTA | Dedo reconhecido e o score para habilitá-lo. P... | 3 | 42660 | 8 | 3331 |
2257037 | 2022-02-10 17:02:17 | INFO | 2074298 | VOTA | Tipo de habilitação do eleitor [biométrica] | 3 | 42662 | 8 | 27563 |
2257038 | 2022-02-10 17:02:17 | INFO | 2074298 | VOTA | Eleitor foi habilitado | 3 | 42662 | 8 | 22381 |
2257039 | 2022-02-10 17:02:29 | INFO | 2074298 | VOTA | Tecla indevida pressionada | 3 | 42674 | 8 | 27533 |
2257040 | 2022-02-10 17:02:35 | INFO | 2074298 | VOTA | Voto confirmado para [Deputado Federal] | 3 | 42680 | 8 | 30011 |
2257041 rows × 9 columns
df_logs.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 2257041 entries, 0 to 2257040 Data columns (total 10 columns): # Column Dtype --- ------ ----- 0 date_time datetime64[ns] 1 event_type category 2 id_evm category 3 system category 4 description object 5 event_type_enc int64 6 date_time_enc int64 7 system_enc int64 8 description_enc int64 9 id_evm_enc int64 dtypes: category(3), datetime64[ns](1), int64(5), object(1) memory usage: 144.2+ MB
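One caveat worth keeping in mind: LabelEncoder assigns arbitrary integer codes (in alphabetical order of the classes), so distances between codes carry no semantic meaning, even though K-Means will treat them as magnitudes. A tiny illustration with three of the system values seen earlier:
le = LabelEncoder()
le.fit(['VOTA', 'GAP', 'ATUE'])
# Classes are sorted alphabetically before coding: ATUE -> 0, GAP -> 1, VOTA -> 2
print(dict(zip(le.classes_, le.transform(le.classes_))))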
X_logs = df_logs[['event_type_enc', 'description_enc', 'system_enc', 'id_evm_enc']]
X_logs
event_type_enc | description_enc | system_enc | id_evm_enc | |
---|---|---|---|---|
0 | 3 | 24567 | 8 | 87 |
1 | 3 | 24568 | 8 | 87 |
2 | 3 | 24569 | 8 | 87 |
3 | 3 | 24570 | 8 | 87 |
4 | 3 | 24564 | 8 | 87 |
... | ... | ... | ... | ... |
2257036 | 3 | 3331 | 8 | 48 |
2257037 | 3 | 27563 | 8 | 48 |
2257038 | 3 | 22381 | 8 | 48 |
2257039 | 3 | 27533 | 8 | 48 |
2257040 | 3 | 30011 | 8 | 48 |
2257041 rows × 4 columns
Scaling the data¶
sc = MinMaxScaler(feature_range = (0,1))
X_logs_scaled = sc.fit_transform(X_logs)
X_logs_scaled
array([[1. , 0.81835443, 0.88888889, 1. ], [1. , 0.81838774, 0.88888889, 1. ], [1. , 0.81842105, 0.88888889, 1. ], ..., [1. , 0.74553631, 0.88888889, 0.55172414], [1. , 0.91715523, 0.88888889, 0.55172414], [1. , 0.9997002 , 0.88888889, 0.55172414]])
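As a sanity check, MinMaxScaler maps each feature x to (x - min) / (max - min); recomputing one column by hand should match the scaled matrix (description_enc is the second column of X_logs):
col = X_logs['description_enc']
manual = (col - col.min()) / (col.max() - col.min())
assert np.allclose(manual.values, X_logs_scaled[:, 1])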
X_logs.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 2257041 entries, 0 to 2257040 Data columns (total 4 columns): # Column Dtype --- ------ ----- 0 event_type_enc int64 1 description_enc int64 2 system_enc int64 3 id_evm_enc int64 dtypes: int64(4) memory usage: 86.1 MB
4.1.2. BUs DF¶
Translating feature names to English¶
df_bu.rename(columns={
'municipio' : 'municipality',
'zona' : 'zone',
'secao' : 'section',
'qtdEleitoresCompBiometrico' : 'qty_voters_with_biometrics',
'qtdEleitoresAptos' : 'qty_voters_able_to_vote',
'qtdComparecimento' : 'qty_attendance',
'qtd_votos_13' : 'qty_votes_on_13',
'qtd_votos_22' : 'qty_votes_on_22',
'brancos_nulos' : 'qty_blank_and_null_votes',
}, inplace=True)
# Classifying variables by their type again
cat_cols_bu, quan_cols_bu, date_cols_bu = check_variables(df_bu)
print('Categorical variables:')
show(cat_cols_bu)
print('\nQuantitative variables:')
show(quan_cols_bu)
print('\nFull date/time variables:')
show(date_cols_bu)
Categorical variables:
['section', 'zone', 'local', 'municipality']
Quantitative variables:
['qty_votes_on_13', 'qty_attendance', 'qty_voters_with_biometrics', 'qty_votes_on_22', 'qty_voters_able_to_vote', 'qty_blank_and_null_votes']
Full date/time variables:
[]
Label Encoding¶
labelencoder = LabelEncoder()
for column in cat_cols_bu:
    df_bu[column+'_enc'] = labelencoder.fit_transform(df_bu[column])
df_bu
local | municipality | zone | section | qty_voters_with_biometrics | qty_voters_able_to_vote | qty_attendance | qty_votes_on_13 | qty_votes_on_22 | qty_blank_and_null_votes | section_enc | zone_enc | local_enc | municipality_enc | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1406 | 18376 | 51 | 92 | 199 | 243 | 208 | 112 | 91 | 5 | 91 | 50 | 39 | 265 |
1 | 1414 | 8052 | 35 | 71 | 185 | 281 | 220 | 191 | 25 | 4 | 70 | 34 | 40 | 64 |
2 | 1031 | 17655 | 12 | 58 | 204 | 274 | 217 | 147 | 61 | 9 | 57 | 11 | 2 | 229 |
3 | 1325 | 4014 | 7 | 188 | 238 | 310 | 268 | 218 | 44 | 6 | 187 | 6 | 31 | 27 |
4 | 1074 | 9539 | 49 | 23 | 158 | 266 | 183 | 116 | 64 | 3 | 22 | 48 | 6 | 94 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
34639 | 1210 | 58190 | 76 | 57 | 115 | 246 | 176 | 50 | 117 | 9 | 56 | 74 | 20 | 359 |
34640 | 1040 | 58300 | 255 | 192 | 223 | 367 | 260 | 128 | 129 | 3 | 191 | 134 | 3 | 366 |
34641 | 1171 | 58092 | 42 | 18 | 191 | 374 | 290 | 117 | 161 | 12 | 17 | 41 | 16 | 351 |
34642 | 1120 | 54259 | 311 | 191 | 247 | 370 | 291 | 119 | 165 | 7 | 190 | 147 | 11 | 335 |
34643 | 1066 | 73334 | 34 | 92 | 201 | 272 | 202 | 138 | 58 | 6 | 91 | 33 | 5 | 415 |
34644 rows × 14 columns
df_bu.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 34644 entries, 0 to 34643 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 local 34644 non-null category 1 municipality 34644 non-null category 2 zone 34644 non-null category 3 section 34644 non-null category 4 qty_voters_with_biometrics 34644 non-null int64 5 qty_voters_able_to_vote 34644 non-null int64 6 qty_attendance 34644 non-null int64 7 qty_votes_on_13 34644 non-null int64 8 qty_votes_on_22 34644 non-null int64 9 qty_blank_and_null_votes 34644 non-null int64 10 section_enc 34644 non-null int64 11 zone_enc 34644 non-null int64 12 local_enc 34644 non-null int64 13 municipality_enc 34644 non-null int64 dtypes: category(4), int64(10) memory usage: 3.2 MB
X_bu = df_bu[quan_cols_bu + [column+'_enc' for column in cat_cols_bu]]
X_bu
qty_votes_on_13 | qty_attendance | qty_voters_with_biometrics | qty_votes_on_22 | qty_voters_able_to_vote | qty_blank_and_null_votes | section_enc | zone_enc | local_enc | municipality_enc | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 112 | 208 | 199 | 91 | 243 | 5 | 91 | 50 | 39 | 265 |
1 | 191 | 220 | 185 | 25 | 281 | 4 | 70 | 34 | 40 | 64 |
2 | 147 | 217 | 204 | 61 | 274 | 9 | 57 | 11 | 2 | 229 |
3 | 218 | 268 | 238 | 44 | 310 | 6 | 187 | 6 | 31 | 27 |
4 | 116 | 183 | 158 | 64 | 266 | 3 | 22 | 48 | 6 | 94 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
34639 | 50 | 176 | 115 | 117 | 246 | 9 | 56 | 74 | 20 | 359 |
34640 | 128 | 260 | 223 | 129 | 367 | 3 | 191 | 134 | 3 | 366 |
34641 | 117 | 290 | 191 | 161 | 374 | 12 | 17 | 41 | 16 | 351 |
34642 | 119 | 291 | 247 | 165 | 370 | 7 | 190 | 147 | 11 | 335 |
34643 | 138 | 202 | 201 | 58 | 272 | 6 | 91 | 33 | 5 | 415 |
34644 rows × 10 columns
X_bu.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 34644 entries, 0 to 34643 Data columns (total 10 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 qty_votes_on_13 34644 non-null int64 1 qty_attendance 34644 non-null int64 2 qty_voters_with_biometrics 34644 non-null int64 3 qty_votes_on_22 34644 non-null int64 4 qty_voters_able_to_vote 34644 non-null int64 5 qty_blank_and_null_votes 34644 non-null int64 6 section_enc 34644 non-null int64 7 zone_enc 34644 non-null int64 8 local_enc 34644 non-null int64 9 municipality_enc 34644 non-null int64 dtypes: int64(10) memory usage: 2.9 MB
X_bu_quant = df_bu[quan_cols_bu].copy() # .copy() so that adding a column later doesn't trigger a SettingWithCopyWarning
X_bu_quant
qty_votes_on_13 | qty_attendance | qty_voters_with_biometrics | qty_votes_on_22 | qty_voters_able_to_vote | qty_blank_and_null_votes | |
---|---|---|---|---|---|---|
0 | 112 | 208 | 199 | 91 | 243 | 5 |
1 | 191 | 220 | 185 | 25 | 281 | 4 |
2 | 147 | 217 | 204 | 61 | 274 | 9 |
3 | 218 | 268 | 238 | 44 | 310 | 6 |
4 | 116 | 183 | 158 | 64 | 266 | 3 |
... | ... | ... | ... | ... | ... | ... |
34639 | 50 | 176 | 115 | 117 | 246 | 9 |
34640 | 128 | 260 | 223 | 129 | 367 | 3 |
34641 | 117 | 290 | 191 | 161 | 374 | 12 |
34642 | 119 | 291 | 247 | 165 | 370 | 7 |
34643 | 138 | 202 | 201 | 58 | 272 | 6 |
34644 rows × 6 columns
Feature Engineering¶
# Let's create a new feature: attendance_rate
df_bu['attendance_rate'] = df_bu.qty_attendance / df_bu.qty_voters_able_to_vote

# Let's scale it, so it ranges from 0 to 100 instead of 0 to 1
df_bu['attendance_rate'] = df_bu['attendance_rate'] * 100
X_bu_quant['attendance_rate'] = df_bu['attendance_rate']
X_bu_quant
qty_votes_on_13 | qty_attendance | qty_voters_with_biometrics | qty_votes_on_22 | qty_voters_able_to_vote | qty_blank_and_null_votes | attendance_rate | |
---|---|---|---|---|---|---|---|
0 | 112 | 208 | 199 | 91 | 243 | 5 | 85.596708 |
1 | 191 | 220 | 185 | 25 | 281 | 4 | 78.291815 |
2 | 147 | 217 | 204 | 61 | 274 | 9 | 79.197080 |
3 | 218 | 268 | 238 | 44 | 310 | 6 | 86.451613 |
4 | 116 | 183 | 158 | 64 | 266 | 3 | 68.796992 |
... | ... | ... | ... | ... | ... | ... | ... |
34639 | 50 | 176 | 115 | 117 | 246 | 9 | 71.544715 |
34640 | 128 | 260 | 223 | 129 | 367 | 3 | 70.844687 |
34641 | 117 | 290 | 191 | 161 | 374 | 12 | 77.540107 |
34642 | 119 | 291 | 247 | 165 | 370 | 7 | 78.648649 |
34643 | 138 | 202 | 201 | 58 | 272 | 6 | 74.264706 |
34644 rows × 7 columns
X_bu_quant.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 34644 entries, 0 to 34643 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 qty_votes_on_13 34644 non-null int64 1 qty_attendance 34644 non-null int64 2 qty_voters_with_biometrics 34644 non-null int64 3 qty_votes_on_22 34644 non-null int64 4 qty_voters_able_to_vote 34644 non-null int64 5 qty_blank_and_null_votes 34644 non-null int64 6 attendance_rate 34644 non-null float64 dtypes: float64(1), int64(6) memory usage: 2.1 MB
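Since attendance_rate will feed the clustering, it is worth flagging any section where attendance exceeds the number of eligible voters - such rows would themselves be anomalies worth inspecting. A quick check:
suspicious = df_bu[df_bu['attendance_rate'] > 100]
print(f'Sections with attendance above 100%: {len(suspicious)}')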
4.2. K-Means¶
from yellowbrick.cluster import KElbowVisualizer
from sklearn.cluster import KMeans
from mpl_toolkits.mplot3d import Axes3D
4.2.1. Logs DF¶
model = KMeans(random_state=1, n_init=10)
visualizer = KElbowVisualizer(model, k=(2,10))
visualizer.fit(X_logs_scaled)
visualizer.show()
plt.show()
Choosing optimal K¶
In order to find an appropriate number of clusters, the elbow method was used: the inertia is calculated for every number of clusters between 2 and 10, and the rule is to choose the number of clusters at which the graph shows a kink, or "elbow".
The graph above shows the distortion score decreasing as the number of clusters increases, but no clear "elbow" is visible. The underlying algorithm suggests 4 clusters; a choice of 4 or 5 clusters seems fair.
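The curve produced by KElbowVisualizer can also be reproduced by hand, fitting one model per k and recording its inertia (closely related to the distortion score plotted above). A sketch:
inertias = []
ks = range(2, 10)
for k in ks:
    km = KMeans(n_clusters=k, random_state=1, n_init=10).fit(X_logs_scaled)
    inertias.append(km.inertia_)

plt.plot(list(ks), inertias, marker='o')
plt.xlabel('number of clusters (k)')
plt.ylabel('inertia')
plt.show()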
Building the model and getting the clusters¶
KM_5_clusters_logs = KMeans(n_clusters=5, init='k-means++', random_state=123, n_init=10).fit(X_logs_scaled) # initialise and fit K-Means model
KM_5_clustered_logs = X_logs.copy()
KM_5_clustered_logs.loc[:,'Cluster'] = KM_5_clusters_logs.labels_ # append labels to points
KM_5_clustered_logs
event_type_enc | description_enc | system_enc | id_evm_enc | Cluster | |
---|---|---|---|---|---|
0 | 3 | 24567 | 8 | 87 | 0 |
1 | 3 | 24568 | 8 | 87 | 0 |
2 | 3 | 24569 | 8 | 87 | 0 |
3 | 3 | 24570 | 8 | 87 | 0 |
4 | 3 | 24564 | 8 | 87 | 0 |
... | ... | ... | ... | ... | ... |
2257036 | 3 | 3331 | 8 | 48 | 4 |
2257037 | 3 | 27563 | 8 | 48 | 2 |
2257038 | 3 | 22381 | 8 | 48 | 2 |
2257039 | 3 | 27533 | 8 | 48 | 2 |
2257040 | 3 | 30011 | 8 | 48 | 2 |
2257041 rows × 5 columns
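Beyond the elbow heuristic, the silhouette score gives an independent read on cluster quality. Computing it on the full 2.2M-row matrix is expensive, so this sketch samples 10,000 rows:
from sklearn.metrics import silhouette_score

rng = np.random.RandomState(1)
idx = rng.choice(len(X_logs_scaled), size=10_000, replace=False)
score = silhouette_score(X_logs_scaled[idx], KM_5_clusters_logs.labels_[idx])
print(f'Silhouette score on a 10k sample: {score:.3f}')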
Visualizing the clusters¶
fig1, (axes) = plt.subplots(1,2,figsize=(12,5))
scat_1 = sns.scatterplot(
KM_5_clustered_logs,
x='event_type_enc',
y='system_enc',
hue='Cluster',
ax=axes[0],
palette='Set1',
legend='full'
)
sns.scatterplot(
KM_5_clustered_logs,
x='id_evm_enc',
y='event_type_enc',
hue='Cluster',
palette='Set1',
ax=axes[1],
legend='full')
# note: the centers were fit on X_logs_scaled, so they live in the scaled feature space;
# inverse-transform them with the fitted scaler for an exact overlay on these unscaled encodings
axes[0].scatter(KM_5_clusters_logs.cluster_centers_[:,0],KM_5_clusters_logs.cluster_centers_[:,2], marker='s', s=40, c="blue")  # columns 0 (event_type_enc) and 2 (system_enc), matching the plotted axes
axes[1].scatter(KM_5_clusters_logs.cluster_centers_[:,3],KM_5_clusters_logs.cluster_centers_[:,0], marker='s', s=40, c="blue")  # columns 3 (id_evm_enc) and 0 (event_type_enc)
plt.show()
fig1, (axes) = plt.subplots(1,2,figsize=(12,5))
scat_1 = sns.scatterplot(
KM_5_clustered_logs,
x='description_enc',
y='id_evm_enc',
hue='Cluster',
ax=axes[0],
palette='Set1',
legend='full'
)
sns.scatterplot(
KM_5_clustered_logs,
x='description_enc',
y='event_type_enc',
hue='Cluster',
palette='Set1',
ax=axes[1],
legend='full')
axes[0].scatter(KM_5_clusters_logs.cluster_centers_[:,1],KM_5_clusters_logs.cluster_centers_[:,3], marker='s', s=40, c="blue")  # columns 1 (description_enc) and 3 (id_evm_enc)
axes[1].scatter(KM_5_clusters_logs.cluster_centers_[:,1],KM_5_clusters_logs.cluster_centers_[:,0], marker='s', s=40, c="blue")  # columns 1 (description_enc) and 0 (event_type_enc)
plt.show()
Checking the size of the clusters¶
KM5_clust_sizes_logs = KM_5_clustered_logs.groupby('Cluster').size().to_frame()  # use the clustered logs DataFrame defined above
KM5_clust_sizes_logs.columns = ["KM_size"]
KM5_clust_sizes_logs
KM_size | |
---|---|
Cluster | |
0 | 540974 |
1 | 1127412 |
2 | 61606 |
3 | 527049 |
4.1.2. BUs DF - with labeled categorical features¶
model = KMeans(random_state=1, n_init=10)
visualizer = KElbowVisualizer(model, k=(2,10))
visualizer.fit(X_bu)
visualizer.show()
plt.show()
Choosing optimal K¶
As in the previous section, the elbow method was applied, computing the inertia for every number of clusters between 2 and 10.
Again, no clear elbow is visible in the graph above. The underlying algorithm suggests 4 clusters, so a choice of 4 or 5 clusters seems fair.
Building the model and getting the clusters¶
KM_5_clusters = KMeans(n_clusters=5, init='k-means++', random_state=123, n_init=10).fit(X_bu) # initialise and fit K-Means model
KM5_clustered = X_bu.copy()
KM5_clustered.loc[:,'Cluster'] = KM_5_clusters.labels_ # append labels to points
KM5_clustered
qty_votes_on_13 | qty_attendance | qty_voters_with_biometrics | qty_votes_on_22 | qty_voters_able_to_vote | qty_blank_and_null_votes | section_enc | zone_enc | local_enc | municipality_enc | Cluster | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 112 | 208 | 199 | 91 | 243 | 5 | 91 | 50 | 39 | 265 | 1 |
1 | 191 | 220 | 185 | 25 | 281 | 4 | 70 | 34 | 40 | 64 | 4 |
2 | 147 | 217 | 204 | 61 | 274 | 9 | 57 | 11 | 2 | 229 | 4 |
3 | 218 | 268 | 238 | 44 | 310 | 6 | 187 | 6 | 31 | 27 | 0 |
4 | 116 | 183 | 158 | 64 | 266 | 3 | 22 | 48 | 6 | 94 | 4 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
34639 | 50 | 176 | 115 | 117 | 246 | 9 | 56 | 74 | 20 | 359 | 1 |
34640 | 128 | 260 | 223 | 129 | 367 | 3 | 191 | 134 | 3 | 366 | 1 |
34641 | 117 | 290 | 191 | 161 | 374 | 12 | 17 | 41 | 16 | 351 | 1 |
34642 | 119 | 291 | 247 | 165 | 370 | 7 | 190 | 147 | 11 | 335 | 1 |
34643 | 138 | 202 | 201 | 58 | 272 | 6 | 91 | 33 | 5 | 415 | 1 |
34644 rows × 11 columns
Visualizing the clusters¶
fig1, (axes) = plt.subplots(1,2,figsize=(12,5))
scat_1 = sns.scatterplot(
KM5_clustered,
x='municipality_enc',
y='qty_votes_on_22',
hue='Cluster',
ax=axes[0],
palette='Set1',
legend='full'
)
sns.scatterplot(
KM5_clustered,
x='municipality_enc',
y='qty_votes_on_13',
hue='Cluster',
palette='Set1',
ax=axes[1],
legend='full')
axes[0].scatter(KM_5_clusters.cluster_centers_[:,9],KM_5_clusters.cluster_centers_[:,3], marker='s', s=40, c="blue")  # columns 9 (municipality_enc) and 3 (qty_votes_on_22), matching the plotted axes
axes[1].scatter(KM_5_clusters.cluster_centers_[:,9],KM_5_clusters.cluster_centers_[:,0], marker='s', s=40, c="blue")  # columns 9 (municipality_enc) and 0 (qty_votes_on_13)
plt.show()
fig1, (axes) = plt.subplots(1,2,figsize=(12,5))
scat_1 = sns.scatterplot(
KM5_clustered,
x='qty_voters_with_biometrics',
y='qty_votes_on_22',
hue='Cluster',
ax=axes[0],
palette='Set1',
legend='full'
)
sns.scatterplot(
KM5_clustered,
x='qty_voters_with_biometrics',
y='qty_votes_on_13',
hue='Cluster',
palette='Set1',
ax=axes[1],
legend='full')
axes[0].scatter(KM_5_clusters.cluster_centers_[:,2],KM_5_clusters.cluster_centers_[:,3], marker='s', s=40, c="blue")  # columns 2 (qty_voters_with_biometrics) and 3 (qty_votes_on_22)
axes[1].scatter(KM_5_clusters.cluster_centers_[:,2],KM_5_clusters.cluster_centers_[:,0], marker='s', s=40, c="blue")  # columns 2 (qty_voters_with_biometrics) and 0 (qty_votes_on_13)
plt.show()
fig1, (axes) = plt.subplots(1,2,figsize=(12,5))
scat_1 = sns.scatterplot(
KM5_clustered,
x='qty_voters_with_biometrics',
y='qty_blank_and_null_votes',
hue='Cluster',
ax=axes[0],
palette='Set1',
legend='full'
)
sns.scatterplot(
KM5_clustered,
x='qty_blank_and_null_votes',
y='qty_votes_on_22',
hue='Cluster',
palette='Set1',
ax=axes[1],
legend='full')
axes[0].scatter(KM_5_clusters.cluster_centers_[:,2],KM_5_clusters.cluster_centers_[:,5], marker='s', s=40, c="blue")  # columns 2 (qty_voters_with_biometrics) and 5 (qty_blank_and_null_votes)
axes[1].scatter(KM_5_clusters.cluster_centers_[:,5],KM_5_clusters.cluster_centers_[:,3], marker='s', s=40, c="blue")  # columns 5 (qty_blank_and_null_votes) and 3 (qty_votes_on_22)
plt.show()
Checking the size of the clusters¶
KM5_clust_sizes = KM5_clustered.groupby('Cluster').size().to_frame()
KM5_clust_sizes.columns = ["KM_size"]
KM5_clust_sizes
KM_size | |
---|---|
Cluster | |
0 | 8840 |
1 | 11049 |
2 | 3456 |
3 | 4427 |
4 | 6872 |
4.1.3. BUs DF - without categorical features¶
model = KMeans(random_state=1, n_init=10)
visualizer = KElbowVisualizer(model, k=(2,10))
visualizer.fit(X_bu_quant)
visualizer.show()
plt.show()
Choosing optimal K¶
As before, the elbow method was applied, computing the inertia for every number of clusters between 2 and 10.
Once more there is no clear elbow in the graph above; the underlying algorithm suggests 4 clusters, so a choice of 4 or 5 clusters seems fair.
Building the model and getting the clusters¶
KM_5_clusters_only_quant = KMeans(n_clusters=5, init='k-means++', random_state=123, n_init=10).fit(X_bu_quant) # initialise and fit K-Means model
KM5_clustered_only_quant = X_bu_quant.copy()
KM5_clustered_only_quant.loc[:,'Cluster'] = KM_5_clusters_only_quant.labels_ # append labels to points
KM5_clustered_only_quant
qty_votes_on_13 | qty_attendance | qty_voters_with_biometrics | qty_votes_on_22 | qty_voters_able_to_vote | qty_blank_and_null_votes | attendance_rate | Cluster | |
---|---|---|---|---|---|---|---|---|
0 | 112 | 208 | 199 | 91 | 243 | 5 | 85.596708 | 2 |
1 | 191 | 220 | 185 | 25 | 281 | 4 | 78.291815 | 4 |
2 | 147 | 217 | 204 | 61 | 274 | 9 | 79.197080 | 4 |
3 | 218 | 268 | 238 | 44 | 310 | 6 | 86.451613 | 4 |
4 | 116 | 183 | 158 | 64 | 266 | 3 | 68.796992 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
34639 | 50 | 176 | 115 | 117 | 246 | 9 | 71.544715 | 1 |
34640 | 128 | 260 | 223 | 129 | 367 | 3 | 70.844687 | 3 |
34641 | 117 | 290 | 191 | 161 | 374 | 12 | 77.540107 | 3 |
34642 | 119 | 291 | 247 | 165 | 370 | 7 | 78.648649 | 3 |
34643 | 138 | 202 | 201 | 58 | 272 | 6 | 74.264706 | 4 |
34644 rows × 8 columns
Visualizing the clusters¶
fig1, (axes) = plt.subplots(1,2,figsize=(12,5))
scat_1 = sns.scatterplot(
KM5_clustered_only_quant,
x='qty_votes_on_13',
y='qty_votes_on_22',
hue='Cluster',
ax=axes[0],
palette='Set1',
legend='full'
)
sns.scatterplot(
KM5_clustered_only_quant,
x='qty_blank_and_null_votes',
y='qty_votes_on_13',
hue='Cluster',
palette='Set1',
ax=axes[1],
legend='full')
#axes[0].scatter(KM_5_clusters_only_quant.cluster_centers_[:,1],KM_5_clusters_only_quant.cluster_centers_[:,2], marker='s', s=40, c="blue")
#axes[1].scatter(KM_5_clusters_only_quant.cluster_centers_[:,0],KM_5_clusters_only_quant.cluster_centers_[:,2], marker='s', s=40, c="blue")
plt.show()
fig1, (axes) = plt.subplots(1,2,figsize=(12,5))
scat_1 = sns.scatterplot(
KM5_clustered_only_quant,
x='qty_voters_with_biometrics',
y='qty_votes_on_22',
hue='Cluster',
ax=axes[0],
palette='Set1',
legend='full'
)
sns.scatterplot(
KM5_clustered_only_quant,
x='qty_voters_with_biometrics',
y='qty_votes_on_13',
hue='Cluster',
palette='Set1',
ax=axes[1],
legend='full')
axes[0].scatter(KM_5_clusters_only_quant.cluster_centers_[:,2],KM_5_clusters_only_quant.cluster_centers_[:,3], marker='s', s=40, c="blue")  # columns 2 (qty_voters_with_biometrics) and 3 (qty_votes_on_22)
axes[1].scatter(KM_5_clusters_only_quant.cluster_centers_[:,2],KM_5_clusters_only_quant.cluster_centers_[:,0], marker='s', s=40, c="blue")  # columns 2 (qty_voters_with_biometrics) and 0 (qty_votes_on_13)
plt.show()
fig1, (axes) = plt.subplots(1,2,figsize=(12,5))
scat_1 = sns.scatterplot(
KM5_clustered_only_quant,
x='qty_voters_with_biometrics',
y='qty_blank_and_null_votes',
hue='Cluster',
ax=axes[0],
palette='Set1',
legend='full'
)
sns.scatterplot(
KM5_clustered_only_quant,
x='qty_voters_with_biometrics',
y='qty_attendance',
hue='Cluster',
palette='Set1',
ax=axes[1],
legend='full')
#axes[0].scatter(KM_5_clusters_only_quant.cluster_centers_[:,1],KM_5_clusters_only_quant.cluster_centers_[:,2], marker='s', s=40, c="blue")
axes[1].scatter(KM_5_clusters_only_quant.cluster_centers_[:,2],KM_5_clusters_only_quant.cluster_centers_[:,1], marker='s', s=40, c="blue")  # columns 2 (qty_voters_with_biometrics) and 1 (qty_attendance)
plt.show()
fig1, (axes) = plt.subplots(1,2,figsize=(12,5))
sns.scatterplot(
KM5_clustered_only_quant,
x='qty_voters_able_to_vote',
y='qty_blank_and_null_votes',
hue='Cluster',
ax=axes[0],
palette='Set1',
legend='full'
)
sns.scatterplot(
KM5_clustered_only_quant,
x='qty_voters_able_to_vote',
y='qty_attendance',
hue='Cluster',
palette='Set1',
ax=axes[1],
legend='full')
#axes[0].scatter(KM_5_clusters_only_quant.cluster_centers_[:,1],KM_5_clusters_only_quant.cluster_centers_[:,2], marker='s', s=40, c="blue")
#axes[1].scatter(KM_5_clusters_only_quant.cluster_centers_[:,0],KM_5_clusters_only_quant.cluster_centers_[:,2], marker='s', s=40, c="blue")
plt.show()
fig1, (axes) = plt.subplots(1,2,figsize=(12,5))
sns.scatterplot(
KM5_clustered_only_quant,
x='attendance_rate',
y='qty_votes_on_13',
hue='Cluster',
ax=axes[0],
palette='Set1',
legend='full'
)
sns.scatterplot(
KM5_clustered_only_quant,
x='attendance_rate',
y='qty_votes_on_22',
hue='Cluster',
palette='Set1',
ax=axes[1],
legend='full')
#axes[0].scatter(KM_5_clusters_only_quant.cluster_centers_[:,1],KM_5_clusters_only_quant.cluster_centers_[:,2], marker='s', s=40, c="blue")
#axes[1].scatter(KM_5_clusters_only_quant.cluster_centers_[:,0],KM_5_clusters_only_quant.cluster_centers_[:,2], marker='s', s=40, c="blue")
plt.show()
fig1, (axes) = plt.subplots(1,2,figsize=(12,5))
sns.scatterplot(
KM5_clustered_only_quant,
x='qty_voters_with_biometrics',
y='attendance_rate',
hue='Cluster',
ax=axes[0],
palette='Set1',
legend='full'
)
sns.scatterplot(
KM5_clustered_only_quant,
x='qty_voters_able_to_vote',
y='attendance_rate',
hue='Cluster',
palette='Set1',
ax=axes[1],
legend='full')
#axes[0].scatter(KM_5_clusters_only_quant.cluster_centers_[:,1],KM_5_clusters_only_quant.cluster_centers_[:,2], marker='s', s=40, c="blue")
#axes[1].scatter(KM_5_clusters_only_quant.cluster_centers_[:,0],KM_5_clusters_only_quant.cluster_centers_[:,2], marker='s', s=40, c="blue")
plt.show()
Checking the size of the clusters¶
KM5_clust_sizes = KM5_clustered_only_quant.groupby('Cluster').size().to_frame()
KM5_clust_sizes.columns = ["KM_size"]
KM5_clust_sizes
KM_size | |
---|---|
Cluster | |
0 | 4373 |
1 | 5509 |
2 | 6849 |
3 | 9293 |
4 | 8620 |
4.2. DBSCAN¶
In DBSCAN there are two major hyperparameters:
- eps
- min_samples
It is difficult to say a priori which values will work best. Therefore, I will first evaluate a grid of combinations of the two.
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score
from itertools import product
4.2.1. Logs DF¶
After many attempts, I realized that DBSCAN will not work with the Logs DF, since it always crashes Jupyter Lab, regardless of whether it runs on my own computer, on Google Colab, or on a powerful Google Cloud Platform Workbench machine.
This is probably because the dataset is huge (more than 2,000,000 rows, even after the earlier reduction) and contains only categorical variables (even though they are label-encoded). With this many rows, DBSCAN's pairwise neighborhood queries can exhaust the available memory.
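A pragmatic workaround, which I did not pursue here, would be to fit DBSCAN on a random subsample of the encoded logs; a minimal sketch, in which the sample size and the eps / min_samples values are arbitrary placeholders:
import numpy as np
from sklearn.cluster import DBSCAN

# fit on a random subsample to keep the pairwise neighborhood queries in memory
rng = np.random.default_rng(1)
idx = rng.choice(len(X_logs_scaled), size=200_000, replace=False)
X_sample = np.asarray(X_logs_scaled)[idx]
labels_sample = DBSCAN(eps=0.5, min_samples=10).fit_predict(X_sample)  # placeholder parameters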
4.2.2. BUs DF¶
Choosing optimal parameters¶
eps_values = np.arange(6,10.75,0.75) # eps values to be investigated
min_samples = np.arange(6,12) # min_samples values to be investigated
DBSCAN_params = list(product(eps_values, min_samples))
# Because DBSCAN derives the number of clusters itself from those two parameters, let's check how many clusters each combination generates.
no_of_clusters = []
sil_score = []
for p in DBSCAN_params:
    DBS_clustering = DBSCAN(eps=p[0], min_samples=p[1]).fit(X_bu_quant)
    no_of_clusters.append(len(np.unique(DBS_clustering.labels_)))
    sil_score.append(silhouette_score(X_bu_quant, DBS_clustering.labels_))
tmp = pd.DataFrame.from_records(DBSCAN_params, columns =['Eps', 'Min_samples'])
tmp['No_of_clusters'] = no_of_clusters
pivot_1 = pd.pivot_table(tmp, values='No_of_clusters', index='Min_samples', columns='Eps')
fig, ax = plt.subplots(figsize=(12,6))
sns.heatmap(pivot_1, annot=True,annot_kws={"size": 16}, cmap="YlGnBu", ax=ax)
ax.set_title('Number of clusters')
plt.show()
The heatmap immediately above shows that, with the given parameters, the number of clusters varies from 2 to 330, and most combinations give more than 20 clusters. Accordingly, we can safely choose values located in the bottom-left or the bottom-right corner of the heatmap.
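The silhouette scores collected in the same loop were not plotted above; a minimal sketch of the analogous heatmap, reusing the tmp DataFrame and the sil_score list from the cells above:
tmp['Sil_score'] = sil_score
pivot_2 = pd.pivot_table(tmp, values='Sil_score', index='Min_samples', columns='Eps')
fig, ax = plt.subplots(figsize=(12,6))
sns.heatmap(pivot_2, annot=True, annot_kws={"size": 10}, fmt=".2f", cmap="YlGnBu", ax=ax)
ax.set_title('Silhouette score')
plt.show()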
Setup # 1¶
Building the model and getting the clusters¶
DBS_clustering = DBSCAN(eps=16, min_samples=14).fit(X_bu_quant)
DBSCAN_clustered = X_bu_quant.copy()
DBSCAN_clustered.loc[:,'Cluster'] = DBS_clustering.labels_ # append labels to points
Checking the size of the clusters¶
DBSCAN_clust_sizes = DBSCAN_clustered.groupby('Cluster').size().to_frame()
DBSCAN_clust_sizes.columns = ["DBSCAN_size"]
DBSCAN_clust_sizes
DBSCAN_size | |
---|---|
Cluster | |
-1 | 6579 |
0 | 27851 |
1 | 30 |
2 | 7 |
3 | 13 |
4 | 64 |
5 | 78 |
6 | 12 |
7 | 10 |
DBSCAN created 8 clusters plus an outlier cluster (-1). Cluster sizes vary significantly, and 6,579 points were labeled as outliers.
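Since the outliers are the most interesting rows for anomaly detection, they can be pulled out by filtering on the -1 label; a minimal sketch using the DBSCAN_clustered DataFrame above:
# rows DBSCAN flagged as noise (label -1) are the candidate anomalies
outliers = DBSCAN_clustered[DBSCAN_clustered['Cluster'] == -1]
outliers.describe()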
Setup # 2¶
Building the model and getting the clusters¶
DBS_clustering = DBSCAN(eps=6, min_samples=10).fit(X_bu_quant)
DBSCAN_clustered = X_bu_quant.copy()
DBSCAN_clustered.loc[:,'Cluster'] = DBS_clustering.labels_ # append labels to points
Checking the size of the clusters¶
DBSCAN_clust_sizes = DBSCAN_clustered.groupby('Cluster').size().to_frame()
DBSCAN_clust_sizes.columns = ["DBSCAN_size"]
DBSCAN_clust_sizes
DBSCAN_size | |
---|---|
Cluster | |
-1 | 34613 |
0 | 11 |
1 | 10 |
2 | 10 |
DBSCAN created 3 clusters plus an outlier cluster (-1). The clusters are almost the same size, but they are tiny: with these parameters, nearly every point (34,613) is labeled an outlier, which indicates that eps=6 is too small for this data.
4.3. SOM¶
A Self-Organizing Map (SOM) is an unsupervised ANN that updates its weights through competitive learning, usually described in three phases: competition, cooperation, and adaptation.
Each neuron of the output layer holds a weight vector of the same dimension n as the input data. During competition, the distance between each output neuron's weight vector and the input vector is computed, and the neuron with the lowest distance wins; it is called the best matching unit (BMU).
During adaptation, the BMU's weight vector is pulled toward the input, and, through cooperation, so are the weight vectors of its neighbors on the map: the farther a neighbor is from the BMU, the smaller its update. The sketch below illustrates one such training step.
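The following minimal NumPy sketch makes the three phases concrete for a rectangular map; the function name and the exponential decay schedules for the learning rate and neighborhood radius are illustrative assumptions, not MiniSom's exact internals:
import numpy as np

def som_update(weights, x, t, max_iter, lr0=0.5, sigma0=1.5):
    """One SOM training step on a (rows, cols, n_features) weight grid."""
    rows, cols, _ = weights.shape
    # competition: find the best matching unit (BMU) for input x
    dists = np.linalg.norm(weights - x, axis=2)
    bmu = np.unravel_index(np.argmin(dists), (rows, cols))
    # decay the learning rate and neighborhood radius over time
    lr = lr0 * np.exp(-t / max_iter)
    sigma = sigma0 * np.exp(-t / max_iter)
    # cooperation: Gaussian neighborhood centered on the BMU
    ii, jj = np.meshgrid(np.arange(rows), np.arange(cols), indexing='ij')
    h = np.exp(-((ii - bmu[0]) ** 2 + (jj - bmu[1]) ** 2) / (2 * sigma ** 2))
    # adaptation: pull every weight vector toward x, scaled by the neighborhood
    weights += lr * h[:, :, None] * (x - weights)
    return weights
Calling this repeatedly on shuffled input rows, with t increasing, is essentially what som.train does below.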
from minisom import MiniSom
4.3.1. BUs DF¶
X_bu_quant.values.shape # Let's just check the number of columns in the dataset
(34644, 7)
Setup # 1¶
Building the model¶
neurons_a = 15
neurons_b = 15
som = MiniSom(neurons_a, neurons_b, X_bu_quant.values.shape[1], random_seed=0, learning_rate=.1, sigma=1.5)
som.pca_weights_init(X_bu_quant.values)
som.train(X_bu_quant.values, 10000, verbose=True)
[ 10000 / 10000 ] 100% - 0:00:00 left quantization error: 35.115005121036326
Visualizing the Activation Map¶
A common way to inspect a trained Self-Organizing Map is to plot its neurons as a 2D grid in which each cell corresponds to one neuron. The code below plots the activation response (also called the hit map): the color of each cell counts how many input vectors have that neuron as their best matching unit.
Neighboring cells with similarly high counts tend to receive similar input vectors, so they mark regions of the map, and hence clusters in the input space, where the data concentrates; large pale regions indicate neurons that rarely win.
Another common visualization is the U-Matrix, which instead colors each neuron by the average distance between its weight vector and those of its neighbors, so that groups of similar neighboring neurons stand out. A sketch of it follows the plot below.
plt.figure(figsize=(7, 7))
frequencies = som.activation_response(X_bu_quant.values)
plt.pcolor(frequencies.T, cmap='Blues')
plt.colorbar()
plt.show()
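For comparison, the U-Matrix just described can be drawn from MiniSom's distance_map(); a minimal sketch reusing the som object trained in Setup # 1:
plt.figure(figsize=(7, 7))
# distance_map() returns, for each neuron, the normalized mean distance
# between its weight vector and those of its neighbors (the U-Matrix)
plt.pcolor(som.distance_map().T, cmap='bone_r')
plt.colorbar()
plt.show()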
Setup # 2¶
Building the model¶
neurons_a = 15
neurons_b = 15
som = MiniSom(neurons_a, neurons_b, X_bu_quant.values.shape[1], random_seed=0, learning_rate=.2, sigma=12)
som.pca_weights_init(X_bu_quant.values)
som.train(X_bu_quant.values, 10000, verbose=True)
[ 10000 / 10000 ] 100% - 0:00:00 left quantization error: 54.98352278887444
Visualizing the Activation Map¶
plt.figure(figsize=(7, 7))
frequencies = som.activation_response(X_bu_quant.values)
plt.pcolor(frequencies.T, cmap='Blues')
plt.colorbar()
plt.show()
Setup # 3¶
Building the model¶
neurons_a = 15
neurons_b = 15
som = MiniSom(neurons_a, neurons_b, X_bu_quant.values.shape[1], random_seed=0, learning_rate=.5, sigma=1)
som.pca_weights_init(X_bu_quant.values)
som.train(X_bu_quant.values, 10000, verbose=True)
[ 10000 / 10000 ] 100% - 0:00:00 left quantization error: 30.96662533699219
Visualizing the Activation Map¶
plt.figure(figsize=(7, 7))
frequencies = som.activation_response(X_bu_quant.values)
plt.pcolor(frequencies.T, cmap='Blues')
plt.colorbar()
plt.show()
Setup # 4¶
Building the model¶
neurons_a = 15
neurons_b = 15
som = MiniSom(neurons_a, neurons_b, X_bu_quant.values.shape[1], random_seed=0, learning_rate=.75, sigma=1)
som.pca_weights_init(X_bu_quant.values)
som.train(X_bu_quant.values, 10000, verbose=True)
[ 10000 / 10000 ] 100% - 0:00:00 left quantization error: 29.3336298497645
Visualizing the Activation Map¶
plt.figure(figsize=(7, 7))
frequencies = som.activation_response(X_bu_quant.values)
plt.pcolor(frequencies.T, cmap='Blues')
plt.colorbar()
plt.show()
Setup # 5¶
Building the model¶
neurons_a = 15
neurons_b = 15
som = MiniSom(neurons_a, neurons_b, X_bu_quant.values.shape[1], random_seed=0, learning_rate=.25, sigma=1)
som.pca_weights_init(X_bu_quant.values)
som.train(X_bu_quant.values, 50000, verbose=True)
[ 50000 / 50000 ] 100% - 0:00:00 left quantization error: 24.236883025608197
Visualizing the Activation Map¶
plt.figure(figsize=(7, 7))
frequencies = som.activation_response(X_bu_quant.values)
plt.pcolor(frequencies.T, cmap='Blues')
plt.colorbar()
plt.show()
Setup # 6¶
Building the model¶
neurons_a = 15
neurons_b = 15
som = MiniSom(neurons_a, neurons_b, X_bu_quant.values.shape[1], random_seed=0, learning_rate=.25, sigma=1)
som.pca_weights_init(X_bu_quant.values)
som.train(X_bu_quant.values, 150000, verbose=True)
[ 150000 / 150000 ] 100% - 0:00:00 left quantization error: 22.86847569466429
Visualizing the Activation Map¶
plt.figure(figsize=(7, 7))
frequencies = som.activation_response(X_bu_quant.values)
plt.pcolor(frequencies.T, cmap='Blues')
plt.colorbar()
plt.show()
Setup # 7¶
Building the model¶
neurons_a = 25
neurons_b = 25
som = MiniSom(neurons_a, neurons_b, X_bu_quant.values.shape[1], random_seed=0, learning_rate=.25, sigma=1)
som.pca_weights_init(X_bu_quant.values)
som.train(X_bu_quant.values, 200000, verbose=True)
[ 200000 / 200000 ] 100% - 0:00:00 left quantization error: 18.63131306217731
Visualizing the Activation Map¶
plt.figure(figsize=(7, 7))
frequencies = som.activation_response(X_bu_quant.values)
plt.pcolor(frequencies.T, cmap='Blues')
plt.colorbar()
plt.show()
Setup # 8¶
Building the model¶
neurons_a = 40
neurons_b = 40
som = MiniSom(neurons_a, neurons_b, X_bu_quant.values.shape[1], random_seed=0, learning_rate=.25, sigma=1)
som.pca_weights_init(X_bu_quant.values)
som.train(X_bu_quant.values, 500000, verbose=True)
[ 500000 / 500000 ] 100% - 0:00:00 left quantization error: 16.029377127043546
Visualizing the Activation Map¶
plt.figure(figsize=(7, 7))
frequencies = som.activation_response(X_bu_quant.values)
plt.pcolor(frequencies.T, cmap='Blues')
plt.colorbar()
plt.show()
Setup # 9¶
Building the model¶
neurons_a = 35
neurons_b = 35
som = MiniSom(neurons_a, neurons_b, X_bu_quant.values.shape[1], random_seed=0, learning_rate=.35, sigma=1)
som.pca_weights_init(X_bu_quant.values)
som.train(X_bu_quant.values, 500000, verbose=True)
[ 500000 / 500000 ] 100% - 0:00:00 left quantization error: 15.207089114831074
Visualizing the Activation Map¶
plt.figure(figsize=(7, 7))
frequencies = som.activation_response(X_bu_quant.values)
plt.pcolor(frequencies.T, cmap='Blues')
plt.colorbar()
plt.show()
Setup # 10¶
Building the model¶
neurons_a = 10
neurons_b = 10
som = MiniSom(neurons_a, neurons_b, X_bu_quant.values.shape[1], random_seed=0, learning_rate=.35, sigma=1)
som.pca_weights_init(X_bu_quant.values)
som.train(X_bu_quant.values, 50000, verbose=True)
[ 50000 / 50000 ] 100% - 0:00:00 left quantization error: 29.85964189856346
Visualizing the Activation Map¶
plt.figure(figsize=(7, 7))
frequencies = som.activation_response(X_bu_quant.values)
plt.pcolor(frequencies.T, cmap='Blues')
plt.colorbar()
plt.show()
Generating Clusters¶
Of course, we can also generate clusters from SOMs: each winner neuron (best matching unit) defines a cluster, and every data point is assigned to the cluster of the neuron that wins for it. Note that the most recently trained SOM has 10 × 10 = 100 neurons, so up to 100 clusters can appear.
# each winner neuron represents a cluster
winner_coordinates = np.array([som.winner(x) for x in X_bu_quant.values]).T
# with np.ravel_multi_index we convert the two-dimensional
# coordinates to a one-dimensional index
cluster_index = np.ravel_multi_index(winner_coordinates, (neurons_a, neurons_b))
# plotting the clusters using the first 2 dimensions of the data
for c in np.unique(cluster_index):
    plt.scatter(X_bu_quant.values[cluster_index == c, 0],
                X_bu_quant.values[cluster_index == c, 1], label='cluster='+str(c), alpha=.7)
# plotting centroids
#for centroid in som.get_weights():
# plt.scatter(centroid[:, 0], centroid[:, 1], marker='x',
# s=80, linewidths=35, color='k', label='centroid')
plt.legend()
plt.show()
Setup # 11¶
Building the model¶
neurons_a = 3
neurons_b = 5
som = MiniSom(neurons_a, neurons_b, X_bu_quant.values.shape[1], random_seed=0, learning_rate=.35, sigma=.5)
som.pca_weights_init(X_bu_quant.values)
som.train(X_bu_quant.values, 50000, verbose=True)
[ 50000 / 50000 ] 100% - 0:00:00 left quantization error: 54.21844073175913
Visualizing the Activation Map¶
plt.figure(figsize=(7, 7))
frequencies = som.activation_response(X_bu_quant.values)
plt.pcolor(frequencies.T, cmap='Blues')
plt.colorbar()
plt.show()
Generating Clusters¶
# each winner neuron represents a cluster
winner_coordinates = np.array([som.winner(x) for x in X_bu_quant.values]).T
# with np.ravel_multi_index we convert the two-dimensional
# coordinates to a one-dimensional index
cluster_index = np.ravel_multi_index(winner_coordinates, (neurons_a, neurons_b))
# plotting the clusters using the first 2 dimensions of the data
for c in np.unique(cluster_index):
    plt.scatter(X_bu_quant.values[cluster_index == c, 0],
                X_bu_quant.values[cluster_index == c, 1], label='cluster='+str(c), alpha=.7)
# plotting centroids
#for centroid in som.get_weights():
# plt.scatter(centroid[:, 0], centroid[:, 1], marker='x',
# s=80, linewidths=35, color='k', label='centroid')
plt.legend()
plt.show()
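Mirroring the size checks done for K-Means and DBSCAN, the SOM cluster sizes can be read off cluster_index; a minimal sketch:
# count how many BUs fall into each SOM cluster (one cluster per winner neuron)
som_clust_sizes = pd.Series(cluster_index).value_counts().sort_index().to_frame("SOM_size")
som_clust_sizes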