The Pokemon dataset contains information on a total of 801 Pokemon.
It includes each Pokemon's name, physical measurements, base stats, generation and legendary status.
The data were downloaded from: https://www.kaggle.com/rounakbanik/pokemon/data
library('DT')
library('missMDA')
library('NbClust')
library("FactoMineR")
library('factoextra')
library('fossil')
library("corrplot")
library('plotly')
library('kohonen')
library('mclust')
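# Load the data; the first CSV column supplies the row names
D=read.csv('Pokemon.csv',header = TRUE,row.names = 1)
# Interactive, filterable view of the raw data
datatable(D, rownames = 1, filter="top", options = list(pageLength = 5, scrollX=TRUE) )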
##         names     percentage_male     height_m         weight_kg
##  Abomasnow :  1   Min.   :  0.00   Min.   : 0.100   Min.   :  0.10
##  Abra      :  1   1st Qu.: 50.00   1st Qu.: 0.600   1st Qu.:  9.00
##  Absol     :  1   Median : 50.00   Median : 1.000   Median : 27.30
##  Accelgor  :  1   Mean   : 55.16   Mean   : 1.164   Mean   : 61.38
##  Aegislash :  1   3rd Qu.: 50.00   3rd Qu.: 1.500   3rd Qu.: 64.80
##  Aerodactyl:  1   Max.   :100.00   Max.   :14.500   Max.   :999.90
##  (Other)   :795   NA's   :98       NA's   :20       NA's   :20
##        hp             attack          defense           speed
##  Min.   :  1.00   Min.   :  5.00   Min.   :  5.00   Min.   :  5.00
##  1st Qu.: 50.00   1st Qu.: 55.00   1st Qu.: 50.00   1st Qu.: 45.00
##  Median : 65.00   Median : 75.00   Median : 70.00   Median : 65.00
##  Mean   : 68.96   Mean   : 77.86   Mean   : 73.01   Mean   : 66.33
##  3rd Qu.: 80.00   3rd Qu.:100.00   3rd Qu.: 90.00   3rd Qu.: 85.00
##  Max.   :255.00   Max.   :185.00   Max.   :230.00   Max.   :180.00
##
##  base_egg_steps  base_happiness    capture_rate    experience_growth
##  Min.   : 1280   Min.   :  0.00   Min.   :  3.00   Min.   : 600000
##  1st Qu.: 5120   1st Qu.: 70.00   1st Qu.: 45.00   1st Qu.:1000000
##  Median : 5120   Median : 70.00   Median : 60.00   Median :1000000
##  Mean   : 7191   Mean   : 65.36   Mean   : 98.76   Mean   :1054996
##  3rd Qu.: 6400   3rd Qu.: 70.00   3rd Qu.:170.00   3rd Qu.:1059860
##  Max.   :30720   Max.   :140.00   Max.   :255.00   Max.   :1640000
##                                   NA's   :1
##    sp_attack        sp_defense       generation    is_legendary
##  Min.   : 10.00   Min.   : 20.00   Min.   :1.00   Min.   :0.00000
##  1st Qu.: 45.00   1st Qu.: 50.00   1st Qu.:2.00   1st Qu.:0.00000
##  Median : 65.00   Median : 66.00   Median :4.00   Median :0.00000
##  Mean   : 71.31   Mean   : 70.91   Mean   :3.69   Mean   :0.08739
##  3rd Qu.: 91.00   3rd Qu.: 90.00   3rd Qu.:5.00   3rd Qu.:0.00000
##  Max.   :194.00   Max.   :230.00   Max.   :7.00   Max.   :1.00000
##
The variables from height_m to sp_defense are the quantitative variables used for the PCA.
The supplementary qualitative variables are generation and is_legendary.
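# Impute missing values with iterative PCA (2 components, scaled variables)
impData=imputePCA(D[,3:14], ncp = 2, scale = TRUE, method = c("Regularized","EM"))
# Combine the completed quantitative variables with generation and is_legendary
ImputData=data.frame(cbind(impData[["completeObs"]],D[,15:16]))
# PCA on the 12 quantitative variables; columns 13 and 14 are supplementary qualitative variables
res.pca = PCA(ImputData[,1:14], graph = FALSE,quali.sup = c(13,14))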
head(res.pca$eig,4)
##        eigenvalue percentage of variance cumulative percentage of variance
## comp 1  4.6406574              38.672145                          38.67214
## comp 2  1.3770799              11.475666                          50.14781
## comp 3  1.2355183              10.295986                          60.44380
## comp 4  0.8768248               7.306873                          67.75067
In our case, we study the first two dimensions (50.15% of cumulative variance), although keeping three would be the better choice: the first three eigenvalues are greater than 1 and together account for 60.44% of the variance.
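As a quick check of the figures above: the PCA is run on 12 standardised variables, so each percentage of variance is the eigenvalue divided by 12. The sketch below re-derives the percentages and counts the components whose eigenvalue exceeds 1. The names `eig`, `p`, `pct`, `cum_pct` and `n_kaiser` are introduced here for illustration only, and the eigenvalues are copied from the output above rather than recomputed from the data.

```r
# Eigenvalues copied from head(res.pca$eig, 4) above
eig <- c(comp1 = 4.6406574, comp2 = 1.3770799,
         comp3 = 1.2355183, comp4 = 0.8768248)

p <- 12                      # number of active (standardised) variables
pct <- 100 * eig / p         # percentage of variance per component
cum_pct <- cumsum(pct)       # cumulative percentage of variance
n_kaiser <- sum(eig > 1)     # number of components with eigenvalue > 1

round(cbind(eigenvalue = eig, pct, cum_pct), 3)
n_kaiser
```

With the real objects in memory, the same quantities can be read directly from `res.pca$eig`.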
fviz_pca_var(res.pca, col.var = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), repel = TRUE)
From the plots above (corrplot and variable plot), the first dimension represents the variables height_m, weight_kg, hp, attack, capture_rate (negatively), sp_attack and sp_defense, while the second dimension mainly represents base_happiness and experience_growth (negatively).
Pokemon will be projected onto the factor map after clustering.
NbClust will be used to determine the optimal number of clusters for k-means and for hierarchical clustering; the code is below:
X=scale(ImputData[,1:12]) # Scaled characteristics of each Pokemon
# Silhouette index for k-means partitions with 2 to 20 clusters
res_nbclust<-NbClust(X,min.nc = 2, max.nc = 20, index="silhouette",method = "kmeans")
res_nbclust$All.index
##      2      3      4      5      6      7      8      9     10     11
## 0.2238 0.2342 0.2334 0.2011 0.1504 0.1480 0.1147 0.1238 0.1206 0.1339
##     12     13     14     15     16     17     18     19     20
## 0.1273 0.1380 0.1335 0.1258 0.1206 0.1282 0.1303 0.1269 0.1289
In the following, 3 is taken as the optimal number of clusters: it has the highest silhouette index (0.2342), only marginally above that of 4 clusters (0.2334).
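The choice of 3 clusters can be read off the silhouette values programmatically. The sketch below copies the indices from the output above into a vector (the names `sil`, `best_k` and `gap_to_4` are illustrative, not part of the original code) and picks the number of clusters with the largest index.

```r
# Silhouette indices copied from res_nbclust$All.index above (k = 2, ..., 20)
sil <- c(0.2238, 0.2342, 0.2334, 0.2011, 0.1504, 0.1480, 0.1147, 0.1238,
         0.1206, 0.1339, 0.1273, 0.1380, 0.1335, 0.1258, 0.1206, 0.1282,
         0.1303, 0.1269, 0.1289)
names(sil) <- 2:20

best_k <- as.integer(names(sil)[which.max(sil)])  # k with the largest index
gap_to_4 <- sil[["3"]] - sil[["4"]]               # margin over the runner-up k = 4

best_k
gap_to_4
```

The margin over 4 clusters is small, so 4 would be a defensible alternative; NbClust itself should report its selection in `res_nbclust$Best.nc`.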
km=kmeans(X,3)
fviz_cluster(list(data = X, cluster = km$cluster), geom = "point", stand = FALSE )+
  scale_colour_manual(values = c("#ffa64d","#00b33c", "#ff3333"))+
  scale_fill_manual(values = c("#ffa64d","#00b33c", "#ff3333"))
Group 1 is characterised mainly by high scores on every characteristic of dimension 1 (height, weight, hp, attack, sp_attack and sp_defense). The Pokemon of Group 1 appear to be the most powerful, but they do not have high base_happiness values.
Group 2, in contrast, is characterised by lower values on dimension 1 but higher values on dimension 2, with higher base happiness than Group 1 (see the radar plot).
Group 3 has the lowest values on both dimensions but a higher capture rate than the other Pokemon.
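The group numbers used above depend on the random initialisation of kmeans(), so a rerun may label the same groups differently. A minimal sketch of how the labelling could be made reproducible is given below; the seed value 42, `nstart = 25` and the use of the built-in iris data as a stand-in for X are all assumptions made for illustration, not settings from the original analysis.

```r
# Stand-in data: numeric columns of the built-in iris data set, scaled
X_demo <- scale(iris[, 1:4])

set.seed(42)                                      # assumed seed, illustration only
km_a <- kmeans(X_demo, centers = 3, nstart = 25)  # 25 random starts, best one kept

set.seed(42)                                      # same seed again
km_b <- kmeans(X_demo, centers = 3, nstart = 25)

identical(km_a$cluster, km_b$cluster)             # same seed gives the same labelling
table(km_a$cluster)                               # group sizes
```

Fixing the seed before `km=kmeans(X,3)` in the same way would keep the Group 1, 2 and 3 descriptions attached to the same clusters across runs.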
Centers.km=data.frame(km$centers) #Contains the centroids of each class
datatable(Centers.km, filter="top", options = list(pageLength = 3, scrollX=T) )
plot_ly( type = 'scatterpolar',fill = 'toself',mode='markers') %>%
add_trace(r = as.numeric(Centers.km[1,]),theta = colnames(Centers.km),name = 'Group 1')%>%
add_trace(r = as.numeric(Centers.km[2,]),theta = colnames(Centers.km),name = 'Group 2')%>%
add_trace(r = as.numeric(Centers.km[3,]),theta = colnames(Centers.km),name = 'Group 3')%>%
layout(polar = list(radialaxis = list(visible = T,range = c(-3,3))))
To conclude, Group 1 is the most powerful group of Pokemon, Group 2 is less powerful than Group 1, and Group 3 is the least powerful group but has higher base_happiness and capture_rate.
d=dist(X,method = 'euclidean') # Euclidean distances between scaled Pokemon
hc=hclust(d,method = 'ward.D')
classesHC=cutree(hc,k=3) # Classes of the hierarchical clustering
# Function returning the centroid of cluster i
clust.centroid = function(i, dat, classes) {
  ind = ( classes == i)
  colMeans(dat[ind,])
}
# One centroid per class, in class order; transpose so that rows are groups
Centers.hc=sapply(sort(unique(classesHC)), clust.centroid, X, classesHC)
Centers.hc=data.frame(t(Centers.hc))
table(km$cluster,classesHC)
##    classesHC
##       1   2   3
##   1  18   0  49
##   2 419   0   2
##   3 129 184   0
Group 1 of the k-means partition falls mainly in Group 3 of the hierarchical clustering (49 Pokemon, against 18 in Group 1), Group 2 falls almost entirely in Group 1 (419 Pokemon, with only 2 in Group 3), and Group 3 is split between Group 1 (129 Pokemon) and Group 2 (184 Pokemon).
## [1] 0.7207584
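A Rand index of 0.72 means that the k-means and hierarchical clustering results agree on about 72% of Pokemon pairs (placed together in both partitions, or apart in both), which indicates a fair similarity between the two.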
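The call that produced the Rand index is not shown; it presumably relies on the fossil package loaded at the top. As a cross-check, the index can be recomputed from the contingency table alone using base R. The helper `rand_from_table` below is a hypothetical name written for this sketch, and the counts are copied from the table above.

```r
# Contingency table copied from table(km$cluster, classesHC) above
tab <- matrix(c( 18,   0, 49,
                419,   0,  2,
                129, 184,  0),
              nrow = 3, byrow = TRUE,
              dimnames = list(kmeans = 1:3, hc = 1:3))

# Rand index from a contingency table: share of pairs of observations
# on which both partitions agree (together in both, or apart in both)
rand_from_table <- function(tab) {
  pairs <- choose(sum(tab), 2)            # all pairs of observations
  both  <- sum(choose(tab, 2))            # pairs together in both partitions
  rows  <- sum(choose(rowSums(tab), 2))   # pairs together in the first partition
  cols  <- sum(choose(colSums(tab), 2))   # pairs together in the second partition
  (pairs + 2 * both - rows - cols) / pairs
}

ri <- rand_from_table(tab)
round(ri, 7)
```

The result agrees with the value printed above, and the computation makes clear that the index does not depend on how the groups are numbered.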
plot_ly( type = 'scatterpolar',fill = 'toself',mode='markers') %>%
add_trace(r = as.numeric(Centers.hc[1,]),theta = colnames(Centers.hc),name = 'Group 1')%>%
add_trace(r = as.numeric(Centers.hc[2,]),theta = colnames(Centers.hc),name = 'Group 2')%>%
add_trace(r = as.numeric(Centers.hc[3,]),theta = colnames(Centers.hc),name = 'Group 3')%>%
layout(polar = list(radialaxis = list(visible = T,range = c(-3,3))))
Although some Pokemon switch from one cluster to another under hierarchical clustering, the characteristics of the groups remain largely the same, except that Group 2 now has the lowest values on every score other than base_happiness and capture_rate.
The first group is characterised by high capture_rate and base_happiness, while the second group of Pokemon is characterised by higher scores on every other variable.
The third group is characterised by low values on every variable other than base_happiness.
##    somc
##       1   2   3
##   1   0  67   0
##   2   0   0 421
##   3 310   0   3
## [1] 0.9931554
The groups of Pokemon obtained from the SOM are nearly the same as those obtained with k-means.
The Rand index in this case is 0.99, which indicates very close agreement between the SOM clustering results and the k-means results.
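Since the two methods number their groups differently, it helps to match each k-means group to the SOM group that receives most of its members. The sketch below does this from the contingency table above; `tab_som`, `match_som` and `agreement` are names introduced here for illustration.

```r
# Contingency table copied from the k-means (rows) vs SOM (columns) output above
tab_som <- matrix(c(  0, 67,   0,
                      0,  0, 421,
                    310,  0,   3),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(kmeans = 1:3, som = 1:3))

# For each k-means group, the SOM group holding most of its members
match_som <- apply(tab_som, 1, which.max)

# Share of Pokemon that fall in the majority SOM group of their k-means group
agreement <- sum(apply(tab_som, 1, max)) / sum(tab_som)

match_som
round(agreement, 4)
```

Only the 3 Pokemon of k-means Group 3 that land in SOM Group 3 fall outside this matching.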
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VEV (ellipsoidal, equal shape) model with 9 components:
##
##  log-likelihood   n  df       BIC       ICL
##       -5005.822 801 730 -14892.32 -14941.72
##
## Clustering table:
##   1   2   3   4   5   6   7   8   9
## 103  32 161 133  36 107  42  48 139
## Best BIC values:
##              VEV,9     VEV,6      VEV,7
## BIC      -14892.32 -16237.81 -16337.576
## BIC diff      0.00  -1345.49  -1445.253
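The BIC reported in the summary can be checked by hand. mclust uses the convention BIC = 2 * log-likelihood - df * log(n), so larger values are better, which is why the 9-component VEV model comes first. The sketch below recomputes it from the printed log-likelihood, sample size and degrees of freedom; the variable names are illustrative, and the inputs are the rounded values shown above.

```r
# Values copied from the Mclust summary above
loglik <- -5005.822   # log-likelihood of the VEV model with 9 components
n      <- 801         # number of Pokemon
df     <- 730         # number of estimated parameters

# mclust convention: BIC = 2 * logLik - df * log(n); larger is better
bic <- 2 * loglik - df * log(n)
round(bic, 2)
```

The recomputed value matches the printed BIC up to the rounding of the log-likelihood.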