Pokemon Classification Based on Characteristics

Data Set:

The Pokemon dataset contains information on a total of 801 Pokemon.

It includes:

names: The English name of the Pokemon
percentage_male: The percentage of the species that are male. Blank if the Pokemon is genderless.
height_m: Height of the Pokemon in metres
weight_kg: The Weight of the Pokemon in kilograms
hp: The Base HP of the Pokemon
attack: The Base Attack of the Pokemon
defense: The Base Defense of the Pokemon
speed: The Base Speed of the Pokemon
base_egg_steps: The number of steps required to hatch an egg of the Pokemon
base_happiness: Base Happiness of the Pokemon
capture_rate: Capture Rate of the Pokemon
experience_growth: The Experience Growth of the Pokemon
sp_attack: The Base Special Attack of the Pokemon
sp_defense: The Base Special Defense of the Pokemon
generation: The numbered generation in which the Pokemon was first introduced
is_legendary: Denotes whether the Pokemon is legendary (two classes: legendary and non-legendary Pokemon)

Data are downloaded from: https://www.kaggle.com/rounakbanik/pokemon/data

Packages needed:

library('DT')
library('missMDA')
library('NbClust')
library("FactoMineR")
library('factoextra')
library('fossil')
library("corrplot")
library('plotly')
library('kohonen')
library('mclust')

D=read.csv('Pokemon.csv',header = T,row.names = 1)
datatable(D, rownames = 1, filter="top", options = list(pageLength = 5, scrollX=T) )

summary(D)

##         names     percentage_male     height_m        weight_kg     
##  Abomasnow :  1   Min.   :  0.00   Min.   : 0.100   Min.   :  0.10  
##  Abra      :  1   1st Qu.: 50.00   1st Qu.: 0.600   1st Qu.:  9.00  
##  Absol     :  1   Median : 50.00   Median : 1.000   Median : 27.30  
##  Accelgor  :  1   Mean   : 55.16   Mean   : 1.164   Mean   : 61.38  
##  Aegislash :  1   3rd Qu.: 50.00   3rd Qu.: 1.500   3rd Qu.: 64.80  
##  Aerodactyl:  1   Max.   :100.00   Max.   :14.500   Max.   :999.90  
##  (Other)   :795   NA's   :98       NA's   :20       NA's   :20      
##        hp             attack          defense           speed       
##  Min.   :  1.00   Min.   :  5.00   Min.   :  5.00   Min.   :  5.00  
##  1st Qu.: 50.00   1st Qu.: 55.00   1st Qu.: 50.00   1st Qu.: 45.00  
##  Median : 65.00   Median : 75.00   Median : 70.00   Median : 65.00  
##  Mean   : 68.96   Mean   : 77.86   Mean   : 73.01   Mean   : 66.33  
##  3rd Qu.: 80.00   3rd Qu.:100.00   3rd Qu.: 90.00   3rd Qu.: 85.00  
##  Max.   :255.00   Max.   :185.00   Max.   :230.00   Max.   :180.00  
##                                                                     
##  base_egg_steps  base_happiness    capture_rate    experience_growth
##  Min.   : 1280   Min.   :  0.00   Min.   :  3.00   Min.   : 600000  
##  1st Qu.: 5120   1st Qu.: 70.00   1st Qu.: 45.00   1st Qu.:1000000  
##  Median : 5120   Median : 70.00   Median : 60.00   Median :1000000  
##  Mean   : 7191   Mean   : 65.36   Mean   : 98.76   Mean   :1054996  
##  3rd Qu.: 6400   3rd Qu.: 70.00   3rd Qu.:170.00   3rd Qu.:1059860  
##  Max.   :30720   Max.   :140.00   Max.   :255.00   Max.   :1640000  
##                                   NA's   :1                         
##    sp_attack        sp_defense       generation    is_legendary    
##  Min.   : 10.00   Min.   : 20.00   Min.   :1.00   Min.   :0.00000  
##  1st Qu.: 45.00   1st Qu.: 50.00   1st Qu.:2.00   1st Qu.:0.00000  
##  Median : 65.00   Median : 66.00   Median :4.00   Median :0.00000  
##  Mean   : 71.31   Mean   : 70.91   Mean   :3.69   Mean   :0.08739  
##  3rd Qu.: 91.00   3rd Qu.: 90.00   3rd Qu.:5.00   3rd Qu.:0.00000  
##  Max.   :194.00   Max.   :230.00   Max.   :7.00   Max.   :1.00000  
##

PCA on Characteristics of Pokemon:

The variables from height_m to sp_defense are the quantitative variables for the PCA.

The supplementary qualitative variables are generation and is_legendary.

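# Impute missing values in the 12 quantitative columns with regularized iterative PCA
impData=imputePCA(D[,3:14], ncp = 2, scale = TRUE, method = "Regularized")
# Append the two supplementary variables (generation, is_legendary)
ImputData=data.frame(cbind(impData[["completeObs"]],D[,15:16]))
# PCA on the 12 active variables; columns 13 and 14 are supplementary qualitative variables
res.pca = PCA(ImputData[,1:14], graph = FALSE,quali.sup = c(13,14))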
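To give an intuition for what iterative PCA imputation does, here is a rough base R sketch of the plain EM-style variant: fill the gaps with column means, fit a low-rank PCA reconstruction, overwrite only the missing cells with the reconstructed values, and repeat until the filled values stop changing. The helper name `impute_pca_em` and the use of `iris` as stand-in data are assumptions made for illustration. This is not missMDA's implementation, and the regularized method additionally shrinks the singular values.

```r
# Hypothetical helper: plain EM-style PCA imputation (illustration only).
impute_pca_em <- function(X, ncp = 2, max_iter = 500, tol = 1e-6) {
  X <- as.matrix(X)
  miss <- is.na(X)
  # Start by filling each missing cell with its column mean.
  col_means <- colMeans(X, na.rm = TRUE)
  filled <- X
  filled[miss] <- col_means[col(X)[miss]]
  for (iter in seq_len(max_iter)) {
    # Standardize, then keep the first ncp singular vectors.
    mu <- colMeans(filled)
    sdv <- apply(filled, 2, sd)
    Z <- sweep(sweep(filled, 2, mu), 2, sdv, "/")
    s <- svd(Z, nu = ncp, nv = ncp)
    Z_hat <- s$u %*% diag(s$d[seq_len(ncp)], ncp) %*% t(s$v)
    # Map the low-rank reconstruction back to the original units.
    X_hat <- sweep(sweep(Z_hat, 2, sdv, "*"), 2, mu, "+")
    # Only the missing cells are updated; observed values are never touched.
    new_filled <- X
    new_filled[miss] <- X_hat[miss]
    change <- sum((new_filled - filled)^2)
    filled <- new_filled
    if (change < tol) break
  }
  filled
}

# Usage on stand-in data (not the Pokemon table): blank out 30 cells, then impute.
set.seed(1)
X_na <- as.matrix(iris[, 1:4])
X_na[sample(length(X_na), 30)] <- NA
completed <- impute_pca_em(X_na, ncp = 2)
anyNA(completed)
```

Observed cells are left untouched, so the imputed table agrees with the original wherever data were available.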

head(res.pca$eig,4)

##        eigenvalue percentage of variance cumulative percentage of variance
## comp 1  4.6406574              38.672145                          38.67214
## comp 2  1.3770799              11.475666                          50.14781
## comp 3  1.2355183              10.295986                          60.44380
## comp 4  0.8768248               7.306873                          67.75067

fviz_eig(res.pca, addlabels = TRUE, ylim = c(0, 50))

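In our case, we study the first 2 dimensions, which together explain 50.15% of the variance. Keeping three dimensions would be the better choice under the eigenvalue > 1 rule: the first three components all have eigenvalues above 1 and reach a cumulative variance of 60.44%.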
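The percentages in the eigenvalue table can be checked by hand. The sketch below copies the four eigenvalues printed above and assumes 12 standardized active variables, so that the eigenvalues of the full PCA sum to 12. The variable names are illustrative.

```r
# Eigenvalues copied from the res.pca$eig output (first four components only).
eig <- c(comp1 = 4.6406574, comp2 = 1.3770799, comp3 = 1.2355183, comp4 = 0.8768248)
p <- 12                      # number of active variables (assumed standardized)

pct <- 100 * eig / p         # percentage of variance per component
cum_pct <- cumsum(pct)       # cumulative percentage of variance
round(cum_pct, 2)            # 38.67 50.15 60.44 67.75

n_keep <- sum(eig > 1)       # eigenvalue > 1 rule
n_keep                       # 3
```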

Visualizing the first 2 dimensions:

corrplot(res.pca$var$cos2, is.corr=FALSE)

fviz_pca_var(res.pca, col.var = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),repel = T
             )

From the plots above (corrplot and variables plot), the first dimension mainly represents height_m, weight_kg, hp, attack, capture_rate (negatively), sp_attack and sp_defense, while the second dimension mainly represents base_happiness and experience_growth (negatively).

The Pokemon will be projected on the factor map after clustering.

Optimal number of clusters:

NbClust will be used to determine the optimal number of clusters for kmeans and for hierarchical clustering. The code is below:

X=scale(ImputData[,1:12]) # Scaled characteristics of each Pokemon

res_nbclust<-NbClust(X,min.nc = 2, max.nc = 20, index="silhouette",method = "kmeans")
res_nbclust$All.index

##      2      3      4      5      6      7      8      9     10     11 
## 0.2238 0.2342 0.2334 0.2011 0.1504 0.1480 0.1147 0.1238 0.1206 0.1339 
##     12     13     14     15     16     17     18     19     20 
## 0.1273 0.1380 0.1335 0.1258 0.1206 0.1282 0.1303 0.1269 0.1289

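In the following, 3 is taken as the optimal number of clusters, since it has the highest silhouette index (0.2342). Note that 4 clusters score almost the same (0.2334), and that all the values are low, which suggests the groups are not sharply separated.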
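As a sketch of what the silhouette criterion measures, the average silhouette width can also be computed directly with the `cluster` package (a recommended package that ships with standard R installations). This uses `iris` as stand-in data and `nstart = 25` as an arbitrary choice, so the numbers will not match the NbClust output above.

```r
library(cluster)

set.seed(42)                        # arbitrary seed, for repeatability only
X_demo <- scale(iris[, 1:4])        # stand-in data, not the Pokemon table
d_demo <- dist(X_demo)              # Euclidean distances between observations

# Average silhouette width for k = 2..6 clusters.
avg_sil <- sapply(2:6, function(k) {
  km_k <- kmeans(X_demo, centers = k, nstart = 25)
  sil <- silhouette(km_k$cluster, d_demo)
  mean(sil[, "sil_width"])
})
names(avg_sil) <- 2:6
round(avg_sil, 4)

# The k with the highest average width is the candidate number of clusters.
best_k <- as.integer(names(avg_sil)[which.max(avg_sil)])
best_k
```

Each observation's silhouette width compares its mean distance to its own cluster with its mean distance to the nearest other cluster, so values near 0 indicate overlapping groups.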

Clustering Using kmeans:

km=kmeans(X,3)

fviz_cluster(list(data = X, cluster = km$cluster), geom = "point", stand = FALSE )+
  scale_colour_manual(values = c("#ffa64d","#00b33c", "#ff3333"))+
  scale_fill_manual(values = c("#ffa64d","#00b33c", "#ff3333"))
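Because kmeans() starts from randomly chosen centers, the group numbering (and sometimes the memberships) can differ from one run to the next unless a seed is fixed. A minimal sketch of a repeatable call, using `iris` as stand-in data and an arbitrary seed and `nstart` value:

```r
X_demo <- scale(iris[, 1:4])        # stand-in data, not the Pokemon table

# Same seed and several random starts: the two fits are identical.
set.seed(7); km_a <- kmeans(X_demo, centers = 3, nstart = 25)
set.seed(7); km_b <- kmeans(X_demo, centers = 3, nstart = 25)

identical(km_a$cluster, km_b$cluster)   # TRUE
table(km_a$cluster)                     # cluster sizes
```

The group descriptions that follow refer to the numbering obtained in the author's run.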

Group 1 is characterised mainly by high scores on every characteristic of dimension 1 (height, weight, hp, attack, sp_attack and sp_defense). Pokemon in Group 1 appear to be the most powerful, but they do not have high values of base_happiness.

Group 2, by contrast, is characterised by lower values on dimension 1 but higher values on dimension 2, with higher base happiness than Group 1 (see the radar plot).

Group 3 has the lowest values on the two dimensions but a higher capture rate than the other groups.

Centers.km=data.frame(km$centers) #Contains the centroids of each class
datatable(Centers.km, filter="top", options = list(pageLength = 3, scrollX=T) )

plot_ly(  type = 'scatterpolar',fill = 'toself',mode='markers') %>%
  add_trace(r = as.numeric(Centers.km[1,]),theta = colnames(Centers.km),name = 'Group 1')%>%
  add_trace(r = as.numeric(Centers.km[2,]),theta = colnames(Centers.km),name = 'Group 2')%>%
  add_trace(r = as.numeric(Centers.km[3,]),theta = colnames(Centers.km),name = 'Group 3')%>%
  layout(polar = list(radialaxis = list(visible = T,range = c(-3,3))))

To conclude, Group 1 is the most powerful group of Pokemon, Group 2 is less powerful than Group 1, and Group 3 is the least powerful group but has higher base_happiness and capture_rate.

Clustering Using Hierarchical Clustering:

d=dist(X,method = 'euclidean') 
hc=hclust(d,method = 'ward.D') 
classesHC=cutree(hc,k=3) # Cluster labels from the hierarchical clustering

# function to find centroid in cluster i
clust.centroid = function(i, dat, classes) {
  ind = ( classes == i)
  colMeans(dat[ind,])
}
Centers.hc=sapply(unique(classesHC), clust.centroid, X, classesHC)
Centers.hc=data.frame(t(Centers.hc))

table(km$cluster,classesHC)

##    classesHC
##       1   2   3
##   1  18   0  49
##   2 419   0   2
##   3 129 184   0

Group 1 of kmeans mostly falls in Group 3 of the hierarchical clustering (49 Pokemon, with 18 in Group 1). Group 2 falls almost entirely in Group 1 of the hierarchical clustering (419 Pokemon, with only 2 in Group 3). Group 3 is split between Group 1 (129 Pokemon) and Group 2 (184 Pokemon) of the hierarchical clustering.

rand.index(km$cluster,classesHC)

## [1] 0.7207584

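A Rand index of 0.72 means that about 72% of Pokemon pairs are treated the same way by both methods (grouped together in both, or separated in both), which indicates moderate agreement between the kmeans and hierarchical clustering results.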
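The Rand index can be recomputed from the cross-table alone, which makes the pair-counting definition concrete. The sketch below copies the counts from the kmeans versus hierarchical table above. The helper name `rand_from_table` is an illustrative choice, not a function from the fossil package.

```r
# Hypothetical helper: Rand index from a contingency table of two partitions.
rand_from_table <- function(tab) {
  n_pairs   <- choose(sum(tab), 2)             # all pairs of observations
  same_both <- sum(choose(tab, 2))             # pairs together in both partitions
  same_rows <- sum(choose(rowSums(tab), 2))    # pairs together in the first partition
  same_cols <- sum(choose(colSums(tab), 2))    # pairs together in the second partition
  # Agreements = pairs together in both + pairs apart in both.
  (n_pairs + 2 * same_both - same_rows - same_cols) / n_pairs
}

# Counts copied from table(km$cluster, classesHC).
km_vs_hc <- matrix(c( 18,   0, 49,
                     419,   0,  2,
                     129, 184,  0), nrow = 3, byrow = TRUE)
ri <- rand_from_table(km_vs_hc)
ri   # about 0.7208, matching rand.index()
```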

plot_ly(  type = 'scatterpolar',fill = 'toself',mode='markers') %>%
  add_trace(r = as.numeric(Centers.hc[1,]),theta = colnames(Centers.hc),name = 'Group 1')%>%
  add_trace(r = as.numeric(Centers.hc[2,]),theta = colnames(Centers.hc),name = 'Group 2')%>%
  add_trace(r = as.numeric(Centers.hc[3,]),theta = colnames(Centers.hc),name = 'Group 3')%>%
  layout(polar = list(radialaxis = list(visible = T,range = c(-3,3))))

Although some Pokemon switch from one cluster to another under hierarchical clustering, the characteristics of the groups remain mainly the same. Here, Group 2 has the lowest values on every score other than base_happiness and capture_rate.

Clustering Using SOM (Self-Organizing Map):

set.seed(7)

# Fit a SOM on a 3 x 1 hexagonal grid (one unit per cluster)
sommap <- som(X, grid = somgrid(3, 1, "hexagonal"))
plot(sommap)

The first group is characterised by a high capture_rate and base_happiness, while the second group of Pokemon is characterised by higher scores on every other variable.

The third group is characterised by low values in every variable other than base_happiness.

somc=sommap$unit.classif
table(km$cluster,somc)

##    somc
##       1   2   3
##   1   0  67   0
##   2   0   0 421
##   3 310   0   3

rand.index(km$cluster,somc) # Compare kmeans and SOM clustering

## [1] 0.9931554

The groups of Pokemon obtained from the SOM are nearly the same as the groups obtained with kmeans: only 3 Pokemon are assigned differently, up to a relabelling of the groups.

The Rand index in this case is 0.99, which indicates very strong agreement between the SOM and kmeans results.
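As a complementary check, the agreement can also be corrected for chance with the adjusted Rand index, which is 0 on average for random labellings and 1 for identical partitions. The sketch below computes it in base R from the counts in the kmeans versus SOM table. The helper name `ari_from_table` is an illustrative choice, and this check is an addition rather than part of the original analysis.

```r
# Hypothetical helper: adjusted Rand index from a contingency table.
ari_from_table <- function(tab) {
  n_pairs   <- choose(sum(tab), 2)
  same_both <- sum(choose(tab, 2))             # pairs together in both partitions
  same_rows <- sum(choose(rowSums(tab), 2))    # pairs together in the first partition
  same_cols <- sum(choose(colSums(tab), 2))    # pairs together in the second partition
  expected  <- same_rows * same_cols / n_pairs # expected overlap under random labelling
  (same_both - expected) / ((same_rows + same_cols) / 2 - expected)
}

# Counts copied from table(km$cluster, somc).
km_vs_som <- matrix(c(  0, 67,   0,
                        0,  0, 421,
                      310,  0,   3), nrow = 3, byrow = TRUE)
ari <- ari_from_table(km_vs_som)
ari   # about 0.986
```

With the label vectors at hand, `adjustedRandIndex(km$cluster, somc)` from mclust should return the same quantity.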

Clustering Using EM (Expectation Maximization):

EMB <- Mclust(X)

summary(EMB)

## ---------------------------------------------------- 
## Gaussian finite mixture model fitted by EM algorithm 
## ---------------------------------------------------- 
## 
## Mclust VEV (ellipsoidal, equal shape) model with 9 components: 
## 
##  log-likelihood   n  df       BIC       ICL
##       -5005.822 801 730 -14892.32 -14941.72
## 
## Clustering table:
##   1   2   3   4   5   6   7   8   9 
## 103  32 161 133  36 107  42  48 139

summary(EMB$BIC)

## Best BIC values:
##              VEV,9     VEV,6      VEV,7
## BIC      -14892.32 -16237.81 -16337.576
## BIC diff      0.00  -1345.49  -1445.253
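When reading this output, note that mclust defines BIC as 2 * log-likelihood - df * log(n), so larger (less negative) values are better. That is why the 9-component VEV model is ranked first. The reported value can be reproduced from the numbers in the summary, as in this small sketch (the variable names are illustrative):

```r
loglik <- -5005.822   # log-likelihood from summary(EMB)
n      <- 801         # number of Pokemon
df     <- 730         # number of estimated parameters

# mclust convention: BIC = 2 * logLik - df * log(n); larger is better.
bic <- 2 * loglik - df * log(n)
round(bic, 2)         # -14892.32, matching the reported BIC
```

Unlike the kmeans, hierarchical and SOM analyses, which were fixed at 3 groups, Mclust chooses the number of components itself, so its 9 clusters are not directly comparable with the 3-group solutions above.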