B. Evaluation of Cluster Validity Criteria
As discussed in Section III, many cluster validity criteria can be used to
evaluate the performance of clustering algorithms, but the performance of
the validity criteria themselves differs. In this section, we first evaluate
these validity criteria by applying a single feature selection method, DF,
to several datasets on which its behaviour is already well agreed upon in
this research field.
DF is a simple but effective feature selection method. When DF is applied
to text datasets, removing a small minority of terms improves the clustering
performance or at least causes no loss; when more terms are removed, the
clustering performance drops quickly.
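As a concrete illustration of how DF works, the sketch below keeps only terms whose document frequency meets a threshold. This is a minimal Python sketch on a toy corpus, not the paper's implementation:

```python
# Document-frequency (DF) feature selection: keep only terms that occur
# in at least `min_df` documents; rare terms are removed first.
def df_filter(docs, min_df):
    """docs: list of token lists; returns the retained vocabulary."""
    df = {}
    for doc in docs:
        for term in set(doc):          # count each term once per document
            df[term] = df.get(term, 0) + 1
    return {t for t, c in df.items() if c >= min_df}

docs = [["cluster", "text", "mining"],
        ["cluster", "feature", "selection"],
        ["text", "cluster", "validity"]]
vocab = df_filter(docs, min_df=2)      # drops the single-document terms
```

Raising `min_df` removes a growing fraction of the vocabulary, which is how the "percentage of terms removed" axis in the experiments is swept.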
The values of the different validity criteria obtained when applying DF to
the different datasets are shown in Fig. 1.
The results of AA, RS, and FM range from 0.5714 to 0.7201, from 0.7370 to
0.8928, and from 0.1422 to 0.5157, respectively. As can be seen in Fig. 1,
the four RS curves on the four datasets approximately follow the rule
mentioned above, but the curves are very flat, so the trends are not
distinct. All four AA curves follow the rule well except the curve for
TR45. The FM curves on the FBIS and RE1 datasets follow the DF rule very
well, while the curves for TR45 and TR41 fluctuate randomly.
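For reference, RS and FM can be computed from pairwise co-assignment counts. The sketch below follows the standard definitions of the Rand statistic and the Fowlkes-Mallows index; AA is omitted because its exact definition is specific to this paper:

```python
from itertools import combinations
from math import sqrt

# External validity criteria from pair counts over all point pairs:
#   a = same cluster in both truth and prediction
#   b = same in truth only, c = same in prediction only
#   d = different in both.
def rs_fm(truth, pred):
    a = b = c = d = 0
    for i, j in combinations(range(len(truth)), 2):
        same_t = truth[i] == truth[j]
        same_p = pred[i] == pred[j]
        if same_t and same_p:
            a += 1
        elif same_t:
            b += 1
        elif same_p:
            c += 1
        else:
            d += 1
    rs = (a + d) / (a + b + c + d)            # Rand statistic
    fm = a / sqrt((a + b) * (a + c)) if a else 0.0  # Fowlkes-Mallows
    return rs, fm

rs, fm = rs_fm([0, 0, 1, 1], [0, 0, 1, 2])
```

Both criteria reward pairs on which the clustering agrees with the class labels; FM is more sensitive to splitting a true class, which matches its sharper curves in Fig. 1.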
Judging by the result of this first experiment, AA is the best validity
criterion. The result also shows that text clustering performance varies
greatly across datasets: the performance on FBIS and RE1 is much better
than on the others. If we consider only the results on FBIS and RE1, the
AA and FM validity criteria are both good, and FM may be even better.
Therefore, in the experiments below we mainly use the FBIS and RE1
datasets, together with the AA and FM validity criteria.
V. EVALUATION OF FEATURE SELECTION METHODS
The following experiments compare the unsupervised feature selection
methods DF, TC, TVQ, and TV.
We chose K-means as the clustering algorithm. Since K-means is easily
influenced by the choice of initial centroids, we randomly generated 5 sets
of initial centroids for each dataset and averaged the performance over the
5 runs as the final clustering performance.
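The averaging procedure can be sketched as follows. This is a minimal Lloyd's-algorithm implementation on toy 2-D data; the data, number of iterations, and evaluation below are illustrative assumptions, not the paper's setup:

```python
import numpy as np

# Minimal K-means (Lloyd's algorithm) with seeded random initialisation.
def kmeans(X, k, seed, iters=50):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest centroid
        labels = np.argmin(((X[:, None] - centroids) ** 2).sum(-1), axis=1)
        centroids = np.array([X[labels == c].mean(0) for c in range(k)])
    return labels

# Two well-separated toy groups; run with 5 random initialisations
# and average the evaluation score, mirroring the experimental protocol.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
truth = np.array([0, 0, 1, 1])
runs = [kmeans(X, k=2, seed=s) for s in range(5)]
acc = np.mean([max((labels == truth).mean(), (labels != truth).mean())
               for labels in runs])   # label-permutation-safe for k=2
```

Averaging over several seeds reduces the variance introduced by centroid initialisation, so differences between feature selection methods are not masked by lucky or unlucky starts.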
The AA and FM results on FBIS and RE1 are shown in Fig. 2 to Fig. 5.
From these figures we can see, first, that the unsupervised feature
selection methods can improve the clustering performance when a certain
proportion of terms is removed. For all methods in our experiments, at
least 70% of the terms can be removed with no loss in clustering
performance on both datasets. Moreover, for most feature selection
methods, removing certain features improves the clustering performance;
for instance, removing 20% of the terms of FBIS with the TC method yields
a 9.4% improvement in the FM value.
Second, TC is the steadiest method of all: the clustering performance does
not drop noticeably as terms are removed. The results of the TC method are
shown in Fig. 6.
Third, the TV method is slightly worse than TC but much better than DF and
TVQ. The DF method drops quickly when more than 60% of the terms are
removed, and the performance of TVQ is very poor when more than 70% of the
terms are removed from the RE1 dataset. The results of the TV method are
shown in Fig. 7. When no more than 80% of the terms are removed from the
datasets by the TV method, there is no loss in clustering performance.
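For concreteness, the sketch below scores terms with the usual formulations of term contribution (TC) and term variance (TV) from the feature-selection literature; the paper's exact term weighting (e.g. tf-idf details) is assumed rather than reproduced:

```python
import numpy as np

# Term scoring on a term-document weight matrix W
# (rows = documents, columns = terms).

def tc_scores(W):
    # TC(t) = sum over document pairs i != j of W[i, t] * W[j, t]
    col_sum = W.sum(axis=0)
    return col_sum ** 2 - (W ** 2).sum(axis=0)

def tv_scores(W):
    # TV(t) = sum over documents of (W[i, t] - mean_t)^2
    return ((W - W.mean(axis=0)) ** 2).sum(axis=0)

W = np.array([[1.0, 0.0],
              [1.0, 2.0],
              [1.0, 0.0]])
tc = tc_scores(W)   # a constant term still scores high on TC
tv = tv_scores(W)   # a constant term scores zero on TV
```

The toy matrix highlights the difference between the two criteria: TV assigns zero score to a term with identical weight in every document, whereas TC can still rank such a term highly.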
VI. CONCLUSION
Clustering is one of the most important tasks in the data mining process
for discovering groups and identifying interesting distributions and
patterns in the underlying data. To address the high dimensionality and
inherent sparsity of the feature space, feature selection methods are used.
In real applications the class information is unknown, so only unsupervised
feature selection methods can be exploited. In this paper we evaluate
several unsupervised feature selection methods, including DF, TC, TVQ, and
a newly proposed method, TV. TC and TV are better than DF and TVQ. We also
show that the performance of different cluster validity criteria is not the
same, and that the AA and FM criteria are better for evaluating clustering
results.