Using Machine Learning to Analyze the 2nd Round of the 2022 Brazilian Presidential Election
Author: Yuri Henrique Galvao - Student # 3151850
This is the Final Project for the Artificial Intelligence Diploma program of The University of Winnipeg - Professional, Applied and Continuing Education (PACE). The idea of this project is to extract the available data generated by the Electronic Voting Machines (EVMs) that were used in the 2nd Round of the 2022 Brazilian Presidential Election, clean it, analyze it, and then use clustering models to find data patterns - especially hidden or non-intuitive patterns - and anomalies.
For these clustering and anomaly-detection tasks, I will use the following three unsupervised clustering algorithms, which will be presented and compared: K-Means, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), and Self-Organizing Maps (SOMs).
This notebook is divided into 6 main sections. At the end of the notebook, you will also find the references for the content that I used during the preparation of this notebook. In order to increase readability, some code is hidden (so, please unhide it if you want).
SECTIONS:¶
- Introduction
- Data Ingestion
- Exploratory Data Analysis
- Clustering
4.1 K-Means
4.2 DBSCAN
4.3 SOM
- Conclusions
- References
1. Introduction¶
1.1. Context¶
In October 2022, Brazil had the fiercest presidential election of the past three decades. During the first round of federal elections, more than 10 candidates ran for President of the Federative Republic of Brazil, but all Brazilians knew that only two candidates had a real chance of reaching the second round: Luiz Inácio “Lula” da Silva and Jair Messias Bolsonaro.
As everyone expected, the presidential election went to the second round with Lula and Bolsonaro, and Brazil had a very controversial election in which Lula won with 50.9% of the valid votes against the 49.1% received by Bolsonaro.
That election was controversial because, among other issues, Brazil uses a 100% electronic voting system, which is based on digital electronic voting machines (EVMs) and on a digital vote-processing system that is considered a black box.
1.2. Project Idea¶
In an attempt to make the system more transparent, the public organization responsible for the elections, the Superior Electoral Court (Tribunal Superior Eleitoral – TSE), published all the data gathered by the EVMs on its website. More precisely, it is possible to get the logs, the digital vote registry (Registro Digital de Voto – RDV), and the voting machine bulletin (Boletim de Urna – BU) of each machine used in the elections.
In this project I will extract raw data from the log files and the BUs, transform it, load it into Pandas DataFrames, clean it, analyze it, and then use clustering models to find hidden or non-intuitive patterns and anomalies. Any such patterns would be a valuable finding and should tell a lot about the dataset.
For that, this notebook focuses on three algorithms: K-Means, DBSCAN and SOMs. They are implemented using mainly two Python libraries: Scikit-Learn and MiniSom.
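As a small preview of the clustering step, here is a minimal sketch of how two of these algorithms can be applied with Scikit-Learn. The data below is random stand-in data, not the election dataset; MiniSom's SOM follows a similar fit-then-inspect pattern.

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.preprocessing import StandardScaler

# Random stand-in for the EVM feature matrix (100 rows, 2 features)
rng = np.random.default_rng(42)
X = StandardScaler().fit_transform(rng.normal(size=(100, 2)))

# K-Means assigns every point to one of k centroids
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# DBSCAN groups dense regions and labels sparse points as noise (-1),
# which is what makes it useful for anomaly detection
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
```

Note that K-Means always assigns every point to a cluster, while DBSCAN can leave points unassigned (label -1); that difference matters when looking for anomalous EVMs.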
2. Data Ingestion¶
2.1. Importing the necessary libraries for the ETL process¶
import pandas as pd
import numpy as np
import glob, py7zr, os
2.2. Extracting the Data¶
2.2.1. Web Scraping¶
To scrape TSE's website and get the zip files of each EVM, I had to develop a Selenium-based web scraper. This web scraper is able to download the files for all EVMs of a given Federative Unit (also called a "scope"). Since Brazil has 26 states, 1 Federal District, and 1 scope for the EVMs placed in other countries, there are 28 scopes / federative units in total. Therefore, you can run 28 instances of this web scraper to speed up the process of downloading the zip files.
!git clone https://github.com/ygalvao/2022brascraper.git
Cloning into '2022brascraper'... remote: Enumerating objects: 31, done. remote: Counting objects: 100% (31/31), done. remote: Compressing objects: 100% (23/23), done. remote: Total 31 (delta 14), reused 24 (delta 7), pack-reused 0 Receiving objects: 100% (31/31), 8.20 MiB | 9.32 MiB/s, done. Resolving deltas: 100% (14/14), done.
!python 2022brascraper/web_scraper_en.py
Please, enter the name of the state for this instance of the web scraper (e.g. Rio de Janeiro): ^C Traceback (most recent call last): File "/home/yhgalvao/OneDrive/Documentos/AI/Notebooks/Brazil_2022_Elections/formato-arquivos-bu-rdv-ass-digital/2022brascraper/web_scraper_en.py", line 19, in <module> scope = input('Please, enter the name of the state for this instance of the web scraper (e.g. Rio de Janeiro): ').strip() KeyboardInterrupt
2.2.2. Unzipping¶
The Bash script below automates the process of unzipping the .bu and .logjez files. It extracts them into the "extracted" subfolder.
!./extract_bu_and_log.sh
Archive: ./BU_e_RDV/DadosDeUrna_o00407-0001900010003_1673243086911.zip extracting: /home/yhgalvao/OneDrive/Documentos/AI/Notebooks/Brazil_2022_Elections/formato-arquivos-bu-rdv-ass-digital/BU_e_RDV/extracted/o00407-0001900010003.bu extracting: /home/yhgalvao/OneDrive/Documentos/AI/Notebooks/Brazil_2022_Elections/formato-arquivos-bu-rdv-ass-digital/BU_e_RDV/extracted/o00407-0001900010003.logjez Archive: ./BU_e_RDV/DadosDeUrna_o00407-0001900010003_1673269612030.zip extracting: /home/yhgalvao/OneDrive/Documentos/AI/Notebooks/Brazil_2022_Elections/formato-arquivos-bu-rdv-ass-digital/BU_e_RDV/extracted/o00407-0001900010003.bu extracting: /home/yhgalvao/OneDrive/Documentos/AI/Notebooks/Brazil_2022_Elections/formato-arquivos-bu-rdv-ass-digital/BU_e_RDV/extracted/o00407-0001900010003.logjez Archive: ./BU_e_RDV/DadosDeUrna_o00407-0001900010004_1673243094855.zip extracting: /home/yhgalvao/OneDrive/Documentos/AI/Notebooks/Brazil_2022_Elections/formato-arquivos-bu-rdv-ass-digital/BU_e_RDV/extracted/o00407-0001900010004.bu extracting: /home/yhgalvao/OneDrive/Documentos/AI/Notebooks/Brazil_2022_Elections/formato-arquivos-bu-rdv-ass-digital/BU_e_RDV/extracted/o00407-0001900010004.logjez (...) Archive: ./BU_e_RDV/DadosDeUrna_o00407-9005000080063_1673201365039.zip extracting: /home/yhgalvao/OneDrive/Documentos/AI/Notebooks/Brazil_2022_Elections/formato-arquivos-bu-rdv-ass-digital/BU_e_RDV/extracted/o00407-9005000080063.bu extracting: /home/yhgalvao/OneDrive/Documentos/AI/Notebooks/Brazil_2022_Elections/formato-arquivos-bu-rdv-ass-digital/BU_e_RDV/extracted/o00407-9005000080063.logjez
2.3. Transforming the Data¶
The BU files (.bu) are in ASN.1 format, which is readable neither by humans nor by Python (at least not natively). Therefore, the Bash script below automates the process of dumping the data from the ASN.1 files into readable flat files (.txt).
The "bu_etl" function then transforms the data from a BU text file, stores it into a Pandas DataFrame, and returns it.
!./dump_bu.sh
2023-01-09 21:52:00,620 - INFO - Converte ./BU_e_RDV/extracted/o00407-0001900010003.bu com as especificações ['bu.asn1'] 2023-01-09 21:52:00,868 - INFO - Converte ./BU_e_RDV/extracted/o00407-0001900010004.bu com as especificações ['bu.asn1'] 2023-01-09 21:52:01,115 - INFO - Converte ./BU_e_RDV/extracted/o00407-0001900010005.bu com as especificações ['bu.asn1'] (...) 2023-01-09 23:31:29,765 - INFO - Converte ./BU_e_RDV/extracted/o00407-1583000350047.bu com as especificações ['bu.asn1']
def bu_etl(file_path: str) -> pd.DataFrame:
    """
    Transforms the data from a BU text file and stores it into a Pandas DataFrame.
    Arg.: (str) the path for the BU text file
    Returns: a Pandas DataFrame.
    """
    # Read the .txt file
    with open(file_path) as f:
        data = f.read()

    # Split the data into lines
    lines = data.split('\n')

    # Initialize the structures that will hold the row data
    data_dict = {}
    codes = []
    brancos_e_nulos = 0

    # The variables we want to extract from the BU file
    wanted_variables = (
        'fase',
        'local',
        'municipio',
        'zona',
        'secao',
        'qtdEleitoresCompBiometrico',
        'idEleicao',
        'qtdEleitoresAptos',
        'qtdComparecimento',
        'qtd_votos_13',
        'qtd_votos_22',
        'brancos_nulos',
        'versaoVotacao'
    )

    # Iterate through the lines of the file
    for i, line in enumerate(lines):
        try:
            # Split the line by ' = ' and extract the left and right parts
            left, right = line.split(' = ')
        except ValueError:
            continue

        # Split the left part by '.', extract the last element (the column name),
        # and delete any surrounding white space
        column_name = left.split('.')[-1].strip()
        value = right.strip("'")

        # Candidate codes precede their vote counts in the file
        if column_name == 'codigo':
            codes.append(value)

        # Vote counts for the two candidates, identified by their position in the file
        if column_name == 'quantidadeVotos' and len(codes) == 1 and (i > 34 and i < 40):
            column_name = f'qtd_votos_{codes[0]}'
            codes.pop()
        if column_name == 'quantidadeVotos' and len(codes) == 1 and (i > 40 and i < 46):
            column_name = f'qtd_votos_{codes[0]}'
            codes.pop()

        # The remaining vote counts are blank and null votes
        if column_name == 'quantidadeVotos' and len(codes) == 0 and (i >= 46):
            column_name = 'brancos_nulos'
            brancos_e_nulos += int(value)
            value = brancos_e_nulos

        # Store the desired data in the dictionary
        if column_name in wanted_variables:
            data_dict[column_name] = [value]

    # Create the DataFrame from the collected values
    df = pd.DataFrame(data_dict)
    return df
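To illustrate the core parsing step of bu_etl, here is one hypothetical dumped line and the split logic applied to it. The field path and value below are made up to show the format, not taken from a real BU file.

```python
# A hypothetical line from a dumped BU text file
line = "urna.resultadoVotacao.qtdComparecimento = '305'"

left, right = line.split(' = ')
column_name = left.split('.')[-1].strip()  # the last dotted component becomes the column name
value = right.strip("'")                   # string values arrive wrapped in single quotes

print(column_name, value)  # → qtdComparecimento 305
```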
2.4. Loading the Data¶
Below are the procedures (mainly loops) to, finally, load the data into DataFrames that can be used by us, by Scikit-Learn, and by MiniSom.
# Importing the data from the BUs into a Pandas DataFrame
files_df2 = glob.glob("./BU_e_RDV/extracted/*.txt")
df_bu_list = []

## This loop will import the data from each BU text file and delete the file afterwards;
## every 50 files, it checkpoints the accumulated DataFrame to a CSV file
for i, file_path in enumerate(files_df2):
    bu = bu_etl(file_path)
    df_bu_list.append(bu)
    os.remove(file_path)

    if i % 50 == 0:
        df_bu = pd.concat(df_bu_list, ignore_index=True)
        df_bu.to_csv('df_bu.csv') # Saves the DF to a CSV file, so we don't need to run all the ETL process again in the future
        df_bu_list = [df_bu]

if len(df_bu_list) > 1:
    df_bu = pd.concat(df_bu_list, ignore_index=True)
    df_bu.to_csv('df_bu.csv') # Saves the DF to a CSV file, so we don't need to run all the ETL process again in the future
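The incremental-concatenation pattern used above (checkpoint every 50 files, then keep the merged DataFrame as the single element of the list) can be isolated and verified on toy data. This is a sketch of the idea with a hypothetical helper name, not code from the notebook:

```python
import pandas as pd

def concat_with_checkpoints(frames, every=50):
    """Concatenate DataFrames incrementally, merging the accumulated list every `every` items."""
    acc = []
    result = pd.DataFrame()
    for i, frame in enumerate(frames):
        acc.append(frame)
        if i % every == 0:
            result = pd.concat(acc, ignore_index=True)
            acc = [result]          # keep only the merged frame in memory
    if len(acc) > 1:
        result = pd.concat(acc, ignore_index=True)
    return result

toy = [pd.DataFrame({'x': [i]}) for i in range(120)]
merged = concat_with_checkpoints(toy, every=50)
print(len(merged))  # → 120
```

The benefit of this pattern is that a crash mid-run loses at most the last `every` files of work, since the CSV checkpoint (omitted in this sketch) can be reloaded.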
# Importing the data from the log files into a Pandas DataFrame
files_df1 = glob.glob("./BU_e_RDV/extracted/*.logjez") # This is where I stored all the 7-Zip log files from the "extract_bu_and_log.sh" script
df_logs_list = []

## This loop will extract the logs (in flat file format, .dat) and import their data into a Pandas DataFrame.
## At the end of each iteration, it deletes the recently extracted .dat file (which has around 700 kB)
## and the original 7-Zip file (the .logjez file), in order to save disk space
for i, file_path in enumerate(files_df1):
    logjez_file = py7zr.SevenZipFile(file_path, mode="r")
    logjez_file.extractall(path="./BU_e_RDV/extracted/")
    logjez_file.close()

    log = pd.read_csv(
        './BU_e_RDV/extracted/logd.dat',
        encoding='ISO-8859-1',
        header=None,
        names=['date_time', 'event_type', 'id_evm', 'system', 'description', 'authenticator'],
        sep=None,
        engine='python'
    )
    df_logs_list.append(log)
    os.remove('./BU_e_RDV/extracted/logd.dat')
    os.remove(file_path)

    if i % 50 == 0:
        df_logs = pd.concat(df_logs_list, ignore_index=True)
        df_logs.to_csv('df_logs.csv') # Saves the DF to a CSV file, so we don't need to run all the ETL process again in the future
        df_logs_list = [df_logs]

    if i > 20000:
        break

if len(df_logs_list) > 1:
    df_logs = pd.concat(df_logs_list, ignore_index=True)
    df_logs.to_csv('df_logs.csv') # Saves the DF to a CSV file, so we don't need to run all the ETL process again in the future
--------------------------------------------------------------------------- KeyboardInterrupt
# In case this notebook crashes (or if you already have the CSV files),
# you can just read the CSV files that were created in the previous two cells.
# For that, just run this cell.
# Running this cell when it is not necessary is safe and harmless.
df_logs = pd.read_csv('df_logs.csv', index_col=0)
df_bu = pd.read_csv('df_bu.csv', index_col=0)
df_logs.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 2257041 entries, 0 to 2257040 Data columns (total 5 columns): # Column Dtype --- ------ ----- 0 date_time object 1 event_type object 2 id_evm int64 3 system object 4 description object dtypes: int64(1), object(4) memory usage: 103.3+ MB
df_bu.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 34644 entries, 0 to 34643 Data columns (total 10 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 local 34644 non-null int64 1 municipio 34644 non-null int64 2 zona 34644 non-null int64 3 secao 34644 non-null int64 4 qtdEleitoresCompBiometrico 34644 non-null int64 5 qtdEleitoresAptos 34644 non-null int64 6 qtdComparecimento 34644 non-null int64 7 qtd_votos_13 34644 non-null int64 8 qtd_votos_22 34644 non-null int64 9 brancos_nulos 34644 non-null int64 dtypes: int64(10) memory usage: 2.9 MB
3. Exploratory Data Analysis¶
3.1. Overall Analysis of the Logs DataFrame¶
from IPython.display import display as show
show(df_logs.head(10))
show(df_logs.tail(10))
date_time | event_type | id_evm | system | description | authenticator | |
---|---|---|---|---|---|---|
0 | 23/09/2022 09:46:41 | INFO | 67305985 | LOGD | Início das operações do logd | 5A7DA76DD475845C |
1 | 23/09/2022 09:46:41 | INFO | 67305985 | LOGD | Urna ligada em 23/09/2022 às 09:45:21 | 624DDD7828CC05C0 |
2 | 23/09/2022 09:46:41 | INFO | 67305985 | SCUE | Iniciando aplicação - Oficial - 1º turno | 81D6CC6DB5FDDCCC |
3 | 23/09/2022 09:46:41 | INFO | 67305985 | SCUE | Versão da aplicação: 8.26.0.0 - Onça-pintada | 958EFE65A6117BF4 |
4 | 23/09/2022 09:46:43 | INFO | 67305985 | SCUE | Urna operando com rede elétrica | 13130D8758EFB5EB |
5 | 23/09/2022 09:46:43 | INFO | 67305985 | SCUE | Bateria interna com carga plena | EE968B0C96E9D91B |
6 | 23/09/2022 09:46:49 | INFO | 67305985 | SCUE | Tamanho da mídia interna: 488.7 MB | 1E829998F4272762 |
7 | 23/09/2022 09:46:52 | INFO | 67305985 | SCUE | Tamanho da memória: 489.4 MB | 114FCB81EB5E33C1 |
8 | 23/09/2022 09:46:53 | INFO | 67305985 | SCUE | Verificação de assinatura de aplicação por eta... | E0F945D741906B3C |
9 | 23/09/2022 09:46:53 | INFO | 67305985 | SCUE | Verificação de assinatura de aplicação por eta... | 1ACBC4446E919E65 |
date_time | event_type | id_evm | system | description | authenticator | |
---|---|---|---|---|---|---|
44639068 | 30/10/2022 17:02:50 | INFO | 67305985 | VOTA | Gerando arquivo de resultado [.wsqbio] + [Térm... | 02B9AA39FEFCFDD6 |
44639069 | 30/10/2022 17:02:50 | INFO | 67305985 | VOTA | Gerando arquivo de resultado [.wsqman] + [Início] | EBE25D58F9D6EEA5 |
44639070 | 30/10/2022 17:02:50 | INFO | 67305985 | VOTA | Gerando arquivo de resultado [.wsqman] + [Térm... | DEB08BE94048457D |
44639071 | 30/10/2022 17:02:50 | INFO | 67305985 | VOTA | Gerando arquivo de resultado [.wsqmes] + [Início] | C25D1B574367931F |
44639072 | 30/10/2022 17:02:51 | INFO | 67305985 | VOTA | Gerando arquivo de resultado [.wsqmes] + [Térm... | 8BD3F01DA05ADD28 |
44639073 | 30/10/2022 17:02:51 | INFO | 67305985 | VOTA | Gerando arquivo de resultado [.ver] + [Início] | 42C25A3BA8669F70 |
44639074 | 30/10/2022 17:02:51 | INFO | 67305985 | VOTA | Gerando arquivo de resultado [.ver] + [Término] | 1010DDA0D08CAF07 |
44639075 | 30/10/2022 17:02:51 | INFO | 67305985 | VOTA | Gerando arquivo de resultado [.chvtp] + [Início] | F53CEA7E4F73FAD2 |
44639076 | 30/10/2022 17:02:51 | INFO | 67305985 | VOTA | Gerando arquivo de resultado [.chvtp] + [Término] | BE5C958E224DFDC6 |
44639077 | 30/10/2022 17:02:51 | INFO | 67305985 | VOTA | Gerando arquivo de resultado [.logjez] + [Início] | 557504AA3E902073 |
df_logs.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 18639078 entries, 0 to 18639077 Data columns (total 6 columns): # Column Dtype --- ------ ----- 0 date_time object 1 event_type object 2 id_evm int64 3 system object 4 description object 5 authenticator object dtypes: int64(1), object(5) memory usage: 853.2+ MB
We can see that each row in the df_logs DataFrame is one event from one of the systems running inside an EVM.
For feasibility reasons, this DataFrame contains only a fraction (around 2%) of the data from all the EVMs' logs.
3.1.1. Checking duplicated rows¶
df_logs.duplicated().sum()
0
3.1.2. Checking null values¶
df_logs.isna().sum()
date_time 0 event_type 0 id_evm 0 system 0 description 0 authenticator 0 dtype: int64
3.1.3. Checking zero values¶
(df_logs == 0).any(axis=1).sum()
0
# Checking zero values on the columns
(df_logs == 0).any(axis=0)
date_time False event_type False id_evm False system False description False authenticator False dtype: bool
3.1.4. Checking the possible values for each variable and their frequencies¶
for column in df_logs.columns:
    print(f'''{column}'s different possible values quantity:''', df_logs[column].nunique(), '\n')
date_time's different possible values quantity: 556999 event_type's different possible values quantity: 4 id_evm's different possible values quantity: 761 system's different possible values quantity: 10 description's different possible values quantity: 133351 authenticator's different possible values quantity: 18639078
3.2. Overall Analysis of the BUs DataFrame¶
from IPython.display import display as show
show(df_bu.head(10))
show(df_bu.tail(10))
fase | local | municipio | zona | secao | qtdEleitoresCompBiometrico | idEleicao | qtdEleitoresAptos | qtdComparecimento | qtd_votos_13 | qtd_votos_22 | brancos_nulos | versaoVotacao | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | oficial | 1406 | 18376 | 51 | 92 | 199 | 545 | 243 | 208 | 112 | 91 | 5.0 | 8.26.0.0 - Onça-pintada |
1 | oficial | 1090 | 23086 | 34 | 110 | 247 | 547 | 295 | 258 | 200 | 43 | 15.0 | 8.26.0.0 - Onça-pintada |
2 | oficial | 1112 | 2372 | 30 | 12 | 190 | 547 | 291 | 205 | 157 | 40 | 8.0 | 8.26.0.0 - Onça-pintada |
3 | oficial | 1414 | 8052 | 35 | 71 | 185 | 545 | 281 | 220 | 191 | 25 | 4.0 | 8.26.0.0 - Onça-pintada |
4 | oficial | 1031 | 17655 | 12 | 58 | 204 | 545 | 274 | 217 | 147 | 61 | 9.0 | 8.26.0.0 - Onça-pintada |
5 | oficial | 1325 | 4014 | 7 | 188 | 238 | 545 | 310 | 268 | 218 | 44 | 6.0 | 8.26.0.0 - Onça-pintada |
6 | oficial | 1066 | 477 | 21 | 310 | 166 | 547 | 300 | 185 | 42 | 142 | 1.0 | 8.26.0.0 - Onça-pintada |
7 | oficial | 1074 | 9539 | 49 | 23 | 158 | 545 | 266 | 183 | 116 | 64 | 3.0 | 8.26.0.0 - Onça-pintada |
8 | oficial | 1023 | 22314 | 27 | 18 | 217 | 547 | 284 | 225 | 157 | 42 | 26.0 | 8.26.0.0 - Onça-pintada |
9 | oficial | 1708 | 15695 | 81 | 44 | 276 | 545 | 347 | 281 | 131 | 134 | 16.0 | 8.26.0.0 - Onça-pintada |
fase | local | municipio | zona | secao | qtdEleitoresCompBiometrico | idEleicao | qtdEleitoresAptos | qtdComparecimento | qtd_votos_13 | qtd_votos_22 | brancos_nulos | versaoVotacao | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
61068 | oficial | 1058 | 54070 | 280 | 120 | 138 | 545 | 377 | 301 | 124 | 169 | 8.0 | 8.26.0.0 - Onça-pintada |
61069 | oficial | 1635 | 58335 | 126 | 367 | 327 | 545 | 393 | 330 | 128 | 193 | 9.0 | 8.26.0.0 - Onça-pintada |
61070 | oficial | 1392 | 39136 | 33 | 212 | 298 | 547 | 414 | 355 | 250 | 95 | 10.0 | 8.26.0.0 - Onça-pintada |
61071 | oficial | 1120 | 54070 | 280 | 318 | 215 | 545 | 371 | 262 | 134 | 118 | 10.0 | 8.26.0.0 - Onça-pintada |
61072 | oficial | 1210 | 58190 | 76 | 57 | 115 | 545 | 246 | 176 | 50 | 117 | 9.0 | 8.26.0.0 - Onça-pintada |
61073 | oficial | 1040 | 58300 | 255 | 192 | 223 | 545 | 367 | 260 | 128 | 129 | 3.0 | 8.26.0.0 - Onça-pintada |
61074 | oficial | 1171 | 58092 | 42 | 18 | 191 | 545 | 374 | 290 | 117 | 161 | 12.0 | 8.26.0.0 - Onça-pintada |
61075 | oficial | 1090 | 39675 | 69 | 111 | 221 | 547 | 301 | 227 | 183 | 32 | 12.0 | 8.26.0.0 - Onça-pintada |
61076 | oficial | 1120 | 54259 | 311 | 191 | 247 | 545 | 370 | 291 | 119 | 165 | 7.0 | 8.26.0.0 - Onça-pintada |
61077 | oficial | 1066 | 73334 | 34 | 92 | 201 | 545 | 272 | 202 | 138 | 58 | 6.0 | 8.26.0.0 - Onça-pintada |
df_bu.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 61078 entries, 0 to 61077 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 fase 61078 non-null object 1 local 61078 non-null object 2 municipio 61078 non-null object 3 zona 61078 non-null object 4 secao 61078 non-null object 5 qtdEleitoresCompBiometrico 60764 non-null object 6 idEleicao 61078 non-null object 7 qtdEleitoresAptos 61078 non-null object 8 qtdComparecimento 61078 non-null object 9 qtd_votos_13 60793 non-null object 10 qtd_votos_22 59551 non-null object 11 brancos_nulos 59380 non-null float64 12 versaoVotacao 61078 non-null object dtypes: float64(1), object(12) memory usage: 6.1+ MB
We can see that each row in the df_bu DataFrame is one EVM.
For feasibility reasons, this DataFrame contains only a fraction (around 13%) of the data from all the EVMs' BUs.
3.2.1. Checking duplicated rows¶
df_bu.duplicated().sum()
0
3.2.2. Checking null values¶
df_bu.isna().sum()
fase 0 local 0 municipio 0 zona 0 secao 0 qtdEleitoresCompBiometrico 314 idEleicao 0 qtdEleitoresAptos 0 qtdComparecimento 0 qtd_votos_13 285 qtd_votos_22 1527 brancos_nulos 1698 versaoVotacao 0 dtype: int64
3.2.3. Checking zero values¶
(df_bu == 0).any(axis=1).sum()
0
# Checking zero values on the columns
(df_bu == 0).any(axis=0)
fase False local False municipio False zona False secao False qtdEleitoresCompBiometrico False idEleicao False qtdEleitoresAptos False qtdComparecimento False qtd_votos_13 False qtd_votos_22 False brancos_nulos False versaoVotacao False dtype: bool
3.2.4. Checking the possible values for each variable and their frequencies¶
for column in df_bu.columns:
    print(f'''{column}'s different possible values quantity:''', df_bu[column].nunique(), '\n')
fase's different possible values quantity: 1 local's different possible values quantity: 265 municipio's different possible values quantity: 789 zona's different possible values quantity: 195 secao's different possible values quantity: 1054 qtdEleitoresCompBiometrico's different possible values quantity: 421 idEleicao's different possible values quantity: 2 qtdEleitoresAptos's different possible values quantity: 523 qtdComparecimento's different possible values quantity: 426 qtd_votos_13's different possible values quantity: 333 qtd_votos_22's different possible values quantity: 291 brancos_nulos's different possible values quantity: 50 versaoVotacao's different possible values quantity: 1
df_bu.idEleicao.value_counts()
545 34644 547 26434 Name: idEleicao, dtype: int64
3.3. Data Cleaning - Logs DF¶
3.3.1. Deleting unnecessary features¶
df_logs.drop(columns='authenticator', inplace=True)
df_logs.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 18639078 entries, 0 to 18639077 Data columns (total 5 columns): # Column Dtype --- ------ ----- 0 date_time object 1 event_type object 2 id_evm int64 3 system object 4 description object dtypes: int64(1), object(4) memory usage: 711.0+ MB
3.3.2. Correcting data types¶
df_logs['date_time'] = pd.to_datetime(df_logs.date_time, dayfirst=True) # dayfirst=True: the log timestamps are in DD/MM/YYYY format

for column in ['event_type', 'id_evm', 'system']:
    df_logs[column] = df_logs[column].astype('category')
df_logs.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 18639078 entries, 0 to 18639077 Data columns (total 5 columns): # Column Dtype --- ------ ----- 0 date_time datetime64[ns] 1 event_type category 2 id_evm category 3 system category 4 description object dtypes: category(3), datetime64[ns](1), object(1) memory usage: 355.5+ MB
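The memory drop shown above comes mostly from pandas' category dtype, which stores each distinct string once and keeps only small integer codes per row. A quick illustration with made-up system labels:

```python
import pandas as pd

# 300,000 rows but only 3 distinct values - roughly the shape of the "system" column
s = pd.Series(['VOTA', 'GAP', 'SCUE'] * 100_000)
s_cat = s.astype('category')

# The category version needs a fraction of the memory of the object version
print(s.memory_usage(deep=True) / s_cat.memory_usage(deep=True))
```

The ratio depends on the string lengths and the number of distinct values, but for low-cardinality columns like event_type and system the savings are large.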
3.3.3. Deleting personal data¶
This step is taken for ethical reasons: it deletes some personal data of the poll workers / clerks that appears in the log files, such as Brazilian individual taxpayer numbers (Cadastro de Pessoa Física - CPF), which are similar to social insurance numbers.
df_logs[df_logs.description.str.contains('mesário', case=False)]
date_time | event_type | id_evm | system | description | |
---|---|---|---|---|---|
438 | 2022-02-10 07:14:21 | INFO | 67305985 | VOTA | Operador indagado se ocorrerá registro de mesá... |
439 | 2022-02-10 07:16:12 | INFO | 67305985 | VOTA | Operador confirmou para registrar mesários |
440 | 2022-02-10 07:16:13 | INFO | 67305985 | VOTA | Registrando mesários antes da votação |
441 | 2022-02-10 07:16:29 | INFO | 67305985 | VOTA | Pedido de leitura da biometria do mesário 0182... |
442 | 2022-02-10 07:16:37 | INFO | 67305985 | VOTA | Realizada a conferência da biometria do mesário |
... | ... | ... | ... | ... | ... |
18639044 | 2022-10-30 17:01:33 | INFO | 67305985 | VOTA | Mesário 048058530906 não é eleitor da seção |
18639045 | 2022-10-30 17:01:34 | INFO | 67305985 | VOTA | Mesário 048058530906 registrado |
18639046 | 2022-10-30 17:01:38 | INFO | 67305985 | VOTA | Operador encerrou ciclo de registro de mesários |
18639055 | 2022-10-30 17:02:00 | INFO | 67305985 | VOTA | Mesário indagado sobre qualidade do Boletim de... |
18639056 | 2022-10-30 17:02:37 | INFO | 67305985 | VOTA | Mesário indicou qualidade OK para Boletim de Urna |
1743820 rows × 5 columns
def delete_cpf(description: str) -> str:
    """Deletes only the CPF (any purely numeric token) from a log description."""
    # Build a new token list instead of removing items from the list while
    # iterating over it, which would silently skip elements
    kept = []
    for token in description.split(' '):
        try:
            int(token)
        except ValueError:
            kept.append(token)
    return ' '.join(kept)
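As a quick sanity check of the token-dropping idea, here is a self-contained mirror of delete_cpf applied to a made-up description (the helper name and the 12-digit number are invented for illustration):

```python
def strip_numeric_tokens(description: str) -> str:
    """Mirrors delete_cpf: keeps only the tokens that are not purely numeric."""
    kept = []
    for token in description.split(' '):
        try:
            int(token)          # numeric tokens (like a CPF) are dropped
        except ValueError:
            kept.append(token)  # everything else is kept
    return ' '.join(kept)

print(strip_numeric_tokens("Mesário 048058530906 registrado"))  # → Mesário registrado
```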
df_logs['description'] = df_logs.description.apply(delete_cpf)
df_logs[df_logs.description.str.contains('mesário', case=False)]
date_time | event_type | id_evm | system | description | event_type_enc | date_time_enc | system_enc | description_enc | |
---|---|---|---|---|---|---|---|---|---|
138 | 2022-07-10 08:43:20 | INFO | 67305985 | VOTA | Mesário indagado se deseja imprimir cópia do BU | 3 | 52665 | 8 | 24839 |
821 | 2022-10-30 07:01:22 | INFO | 67305985 | VOTA | Operador indagado se ocorrerá registro de mesá... | 3 | 127038 | 8 | 25281 |
822 | 2022-10-30 07:11:26 | INFO | 67305985 | VOTA | Operador confirmou para registrar mesários | 3 | 127640 | 8 | 25277 |
823 | 2022-10-30 07:11:26 | INFO | 67305985 | VOTA | Registrando mesários antes da votação | 3 | 127640 | 8 | 26971 |
824 | 2022-10-30 07:11:55 | INFO | 67305985 | VOTA | Pedido de leitura da biometria do mesário | 3 | 127669 | 8 | 25287 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2256962 | 2022-02-10 16:57:10 | INFO | 2074298 | VOTA | Título digitado pelo mesário | 3 | 42355 | 8 | 27572 |
2256985 | 2022-02-10 16:58:55 | INFO | 2074298 | VOTA | Título digitado pelo mesário | 3 | 42460 | 8 | 27572 |
2257001 | 2022-02-10 17:00:05 | INFO | 2074298 | VOTA | Título digitado pelo mesário | 3 | 42530 | 8 | 27572 |
2257018 | 2022-02-10 17:01:04 | INFO | 2074298 | VOTA | Título digitado pelo mesário | 3 | 42589 | 8 | 27572 |
2257033 | 2022-02-10 17:02:04 | INFO | 2074298 | VOTA | Título digitado pelo mesário | 3 | 42649 | 8 | 27572 |
212030 rows × 9 columns
3.4. Data Cleaning - BUs DF¶
3.4.1. Deleting unnecessary features¶
df_bu.drop(columns='fase', inplace=True)
df_bu = df_bu[df_bu.idEleicao=='545'].copy() # The id of the 2nd Round of the 2022 Presidential Election is 545
df_bu.drop(columns='idEleicao', inplace=True)
df_bu.drop(columns='versaoVotacao', inplace=True)
df_bu.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 34644 entries, 0 to 61077 Data columns (total 10 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 local 34644 non-null object 1 municipio 34644 non-null object 2 zona 34644 non-null object 3 secao 34644 non-null object 4 qtdEleitoresCompBiometrico 34644 non-null object 5 qtdEleitoresAptos 34644 non-null object 6 qtdComparecimento 34644 non-null object 7 qtd_votos_13 34580 non-null object 8 qtd_votos_22 33843 non-null object 9 brancos_nulos 33535 non-null float64 dtypes: float64(1), object(9) memory usage: 2.9+ MB
3.4.2. Replacing null values¶
After analyzing the null values in this DF, I concluded that they actually represent zero (0) values.
df_bu.fillna(0, inplace=True)
3.4.3. Correcting data types¶
for column in df_bu.columns.to_list():
    if column[:3] == 'qtd':
        df_bu[column] = df_bu[column].astype('int')
    else:
        df_bu[column] = df_bu[column].astype('category')

df_bu['brancos_nulos'] = df_bu['brancos_nulos'].astype('int')
df_bu.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 34644 entries, 0 to 34643 Data columns (total 10 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 local 34644 non-null category 1 municipio 34644 non-null category 2 zona 34644 non-null category 3 secao 34644 non-null category 4 qtdEleitoresCompBiometrico 34644 non-null int64 5 qtdEleitoresAptos 34644 non-null int64 6 qtdComparecimento 34644 non-null int64 7 qtd_votos_13 34644 non-null int64 8 qtd_votos_22 34644 non-null int64 9 brancos_nulos 34644 non-null int64 dtypes: category(4), int64(6) memory usage: 2.2 MB
3.5. Descriptive Statistics¶
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
import seaborn as sns
from scipy import stats # needed by the Freedman-Diaconis calculation below
# Defining a function to classify columns into categorical and quantitative variables
def check_variables(df: pd.DataFrame) -> tuple:
    """
    Separates the categorical variables from the quantitative and date variables,
    and stores them in their respective lists.
    """
    cols = df.columns
    date_cols = df.select_dtypes(include='datetime').columns
    quantitative_cols = df.select_dtypes(include='number').columns
    categorical_cols = list(set(cols) - set(quantitative_cols) - set(date_cols))
    quantitative_cols = set(quantitative_cols) - set(date_cols)
    return categorical_cols, list(quantitative_cols), list(date_cols)
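The select_dtypes-based split can be checked on a toy frame. This self-contained sketch repeats the same calls as check_variables on invented columns:

```python
import pandas as pd

toy = pd.DataFrame({
    'name': ['a', 'b'],                                    # categorical (object)
    'count': [1, 2],                                       # quantitative (number)
    'when': pd.to_datetime(['2022-10-30', '2022-10-31']),  # date/time
})

date_cols = list(toy.select_dtypes(include='datetime').columns)
quant_cols = list(toy.select_dtypes(include='number').columns)
cat_cols = [c for c in toy.columns if c not in date_cols + quant_cols]

print(cat_cols, quant_cols, date_cols)  # → ['name'] ['count'] ['when']
```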
# Defining a function to examine categorical variables
def examine_categorical(categ_var: pd.Series, top: int = 10, others: bool = True) -> None:
    '''
    This function gets a Pandas Series (a categorical column of the Pandas DataFrame) and:
    - Gets the top 10 (or another chosen quantity of) values
    - Compiles all the other values into "others" (or not, if chosen otherwise)
    - Prints a frequency distribution table
    - Plots a pie chart
    - Plots a bar chart
    '''
    vc = categ_var.value_counts()
    vc2 = vc.sort_values(ascending=False)[:top]
    new_row = pd.Series(data={'others': vc.sort_values(ascending=False)[top:].sum()})
    vc3 = pd.concat([vc2, new_row])

    if others:
        vc = vc3
        msg = f'''Please, note that, for many reasons, only the top {top} values were considered to these calculations and visualizations.
All the other values were compiled into the "others" name.'''
    else:
        vc = vc2
        msg = f'''Please, note that, for many reasons, only the top {top} values were considered to these calculations and visualizations.'''

    # Frequency distribution
    print(f'''Frequency distribution table for different values of "{categ_var.name}" variable: \n\n{vc}\n''')
    print(msg)

    # Pie chart
    vc.plot(
        kind='pie',
        ylabel=categ_var.name,
        autopct='%.2f%%',
        figsize=(10, 10))
    plt.show()
    plt.close()

    # Bar chart
    bar = vc.plot(
        kind='bar',
        figsize=(10, 8),
        align='center')
    bar.bar_label(bar.containers[0])
    plt.show()
    plt.close()

    print('_' * 120 + '\n' * 3)
# Defining a function to examine numerical variables
def examine_quant(
variable:pd.Series,
optmize_n_bins:bool=False,
no_outliers:bool=False,
n_bins:bool=False
)->None:
'''
Gets a Pandas DataSeries and:
- Prints measures of central tendancy
- Prints measures of spread
- Take the outliers out using the 1.5 IQR criterion, if "no_outliers" == True
- Try to calculate the optimal number of bins for the histogram, if "optmize_n_bins" == True
- Plots a histogram
- Plots box-plot
'''
var_desc = variable.describe()
IQR = var_desc['75%'] - var_desc['25%']
print(f'''### Measures for variable '{variable.name}':
## Measures of center:
Mode: {variable.mode()[0]}
Mean: {var_desc['mean']}
Median: {var_desc['50%']}
## Measures of spread:
Min: {var_desc['min']}
Max: {var_desc['max']}
Range: {var_desc['max'] - var_desc['min']}
1st Quartile (Q25): {var_desc['25%']}
3rd Quartile (Q75): {var_desc['75%']}
IQR: {IQR}
Standard deviation: {var_desc['std']}\n''')
if no_outliers == True:
variable = variable[(variable <= (var_desc['75%'] + 1.5 * IQR)) & (variable >= (var_desc['25%'] - 1.5 * IQR))]
    def freedman_diaconis(variable: pd.Series) -> int:
        """
        Uses the Freedman-Diaconis rule to compute the optimal histogram bin width and derives a number of bins from it.
        """
        data = np.asarray(variable.values, dtype=np.float64)
        IQR = stats.iqr(data, rng=(25, 75), scale=1.0, nan_policy='propagate')
        N = data.size
        bw = (2 * IQR) / np.power(N, 1 / 3)  # Freedman-Diaconis bin width
        datmin, datmax = data.min(), data.max()
        datrng = datmax - datmin
        result = int(((datrng / bw) + 1) / 5)
        return result
    # Histogram
if optmize_n_bins:
try:
n_bins_ = freedman_diaconis(variable)
except Exception as e:
print(e)
else:
variable.hist(bins=n_bins_)
plt.show()
plt.close()
elif n_bins:
variable.hist(bins=n_bins)
plt.show()
plt.close()
else:
variable.hist()
plt.show()
plt.close()
    # Box plot
plt.boxplot(x=variable, labels=[variable.name])
plt.ylabel(variable.name)
plt.show()
plt.close()
    # Separator line
    print('_' * 120 + '\n' * 3)
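The 1.5 × IQR outlier filter and the Freedman-Diaconis bin width used inside `examine_quant` can be illustrated on a small hypothetical sample (the values below were chosen for the example and are not drawn from the election data):

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one obvious outlier.
s = pd.Series([1, 2, 3, 4, 5, 100])

# 1.5 * IQR criterion, as applied when no_outliers=True.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
kept = s[(s >= q1 - 1.5 * iqr) & (s <= q3 + 1.5 * iqr)]
print(kept.tolist())  # the outlier 100 falls outside [q1 - 1.5*IQR, q3 + 1.5*IQR]

# Freedman-Diaconis bin width for the filtered sample: bw = 2 * IQR / N^(1/3)
data = kept.to_numpy(dtype=np.float64)
bw = 2 * (np.percentile(data, 75) - np.percentile(data, 25)) / np.power(data.size, 1 / 3)
print(round(bw, 3))
```

Dividing the data range by `bw` then gives the bin count; the notebook's `freedman_diaconis` additionally scales that count down by a constant factor to avoid over-binning very large samples.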
3.5.1. Visualizing the Data - Logs DF¶
mpl.rcParams['font.family'] = ['serif']
# Defining better settings for the visualizations using Seaborn
sns.set(rc={'figure.figsize':(8,6)}, style='whitegrid')
# Classifying variables by their type
cat_cols_logs, quan_cols_logs, date_cols_logs = check_variables(df_logs)
print('Categorical variables:')
show(cat_cols_logs)
print('\nQuantitative variables:')
show(quan_cols_logs)
print('\nFull date/time variables:')
show(date_cols_logs)
Categorical variables:
['event_type', 'id_evm', 'system', 'description']
Quantitative variables:
[]
Full date/time variables:
['date_time']
Visualizing and Examining Categorical Variables¶
# Examining categorical variables
for variable in cat_cols_logs:
examine_categorical(df_logs[variable], top=8, others=True)
Frequency distribution table for different values of "id_evm" variable:

67305985    11761980
2070822        12840
2054533        12742
2053621        12712
2072283        12679
2036805        12648
2040850        12629
2086021        12607
others       6788241
dtype: int64

Please note that only the top 8 values were considered in these calculations and visualizations. All the other values were grouped into the "others" category.
________________________________________________________________________________________________________________________

Frequency distribution table for different values of "system" variable:

VOTA      16883946
GAP        1140956
ATUE        347426
SCUE        173027
LOGD         78854
INITJE       13284
VPP          1038
RED           357
others        190
dtype: int64

Please note that only the top 8 values were considered in these calculations and visualizations. All the other values were grouped into the "others" category.
________________________________________________________________________________________________________________________

Frequency distribution table for different values of "description" variable:

Aguardando digitação do título                1157222
Título digitado pelo mesário                  1126201
Tecla indevida pressionada                    1053952
Eleitor foi habilitado                        1035473
O voto do eleitor foi computado               1035029
Voto confirmado para [Presidente]             1034747
Solicita digital. Tentativa [1] de [4]         928686
Capturada a digital. Tentativa [1] de [4]      915983
others                                       10351785
dtype: int64

Please note that only the top 8 values were considered in these calculations and visualizations. All the other values were grouped into the "others" category.
________________________________________________________________________________________________________________________

Frequency distribution table for different values of "event_type" variable:

INFO       18460996
ALERTA       173218
ERRO           4752
EXTERNO         112
others            0
dtype: int64

Please note that only the top 8 values were considered in these calculations and visualizations. All the other values were grouped into the "others" category.
________________________________________________________________________________________________________________________
There are no quantitative variables in df_logs.
# Examining quantitative variables with outliers
for variable in quan_cols_logs:
try:
examine_quant(df_logs[variable], optmize_n_bins=False, no_outliers=False, n_bins=False)
except Exception as e:
print(e)
# Examining quantitative variables without outliers
for variable in quan_cols_logs:
try:
examine_quant(df_logs[variable], optmize_n_bins=False, no_outliers=True, n_bins=False)
except Exception as e:
print(e)
3.5.2. Visualizing the Data - BUs DF¶
# Classifying variables by their type
cat_cols_bu, quan_cols_bu, date_cols_bu = check_variables(df_bu)
print('Categorical variables:')
show(cat_cols_bu)
print('\nQuantitative variables:')
show(quan_cols_bu)
print('\nFull date/time variables:')
show(date_cols_bu)
Categorical variables:
['municipio', 'zona', 'local', 'secao']
Quantitative variables:
['brancos_nulos', 'qtdEleitoresAptos', 'qtd_votos_13', 'qtdComparecimento', 'qtdEleitoresCompBiometrico', 'qtd_votos_22']
Full date/time variables:
[]
Visualizing and Examining Categorical Variables¶
# Examining categorical variables
for variable in cat_cols_bu:
examine_categorical(df_bu[variable], top=12, others=True)
Frequency distribution table for different values of "municipio" variable:

58335     1816
12190     1678
54038     1453
17612     1414
58190     1090
6050       952
58041      902
3018       741
59013      694
59250      662
73440      609
17590      540
others   22093
dtype: int64

Please note that only the top 12 values were considered in these calculations and visualizations. All the other values were grouped into the "others" category.