The three cluster validity criteria used below are the Averaged Accuracy (AA), the Rand Statistic (RS), and the Folkes and Mallows index (FM).
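For reference, these criteria admit a standard pair-counting formulation, which we assume matches the paper's usage. Counting document pairs by whether the clustering and the class labels agree, let $f_{11}$ be the number of pairs in the same cluster and the same class, $f_{00}$ the pairs in different clusters and different classes, and $f_{10}$, $f_{01}$ the mixed pairs. Then

\[ AA = \frac{1}{2}\left(\frac{f_{11}}{f_{11}+f_{10}} + \frac{f_{00}}{f_{01}+f_{00}}\right), \quad RS = \frac{f_{11}+f_{00}}{f_{11}+f_{10}+f_{01}+f_{00}}, \quad FM = \sqrt{\frac{f_{11}}{f_{11}+f_{10}} \cdot \frac{f_{11}}{f_{11}+f_{01}}}. \]

RS and FM are the standard Rand and Fowlkes-Mallows definitions; the AA form shown, an average of pairwise sensitivity and specificity, is one common convention and is an assumption here, not necessarily the paper's exact definition.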

In Section IV and Section V, we will use the DF, TC, TVQ and TV methods to reduce the feature dimensionality of four datasets: FBIS, REI, TR45 and TR41. Cluster validity criteria will then be used to evaluate the effect of these feature selection methods.
A. Datasets
Text classification performance varies greatly across different datasets, so we chose four different text datasets to evaluate the performance of the feature selection methods. The characteristics of the document collections used in our experiments are summarized in Table 1.
Dataset FBIS is from the Foreign Broadcast Information Service data of TREC-5 [16]. Dataset REI is from the Reuters-21578 text categorization test collection, Distribution 1.0 [17]. Datasets TR45 and TR41 are derived from TREC-6 collections. For all datasets, we used a stop-list to remove common words, and the words were stemmed using Porter's suffix-stripping algorithm [18].


As discussed in Section III, many cluster validity criteria can be used to evaluate the performance of clustering algorithms, but the performance of the validity criteria themselves differs. In this section, we first evaluate these validity criteria by applying a single feature selection method, DF, to different datasets whose behavior under DF is already well established in this research field.
DF is a simple but effective feature selection method. When applying DF to text datasets, removing a small fraction of the terms improves the clustering performance or at least causes no loss; when more terms are removed, the clustering performance drops quickly.
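As an illustration, here is a minimal sketch of DF-based selection, assuming a simple tokenized bag-of-words representation (the function and parameter names are ours, not from the paper):

from collections import Counter

def df_select(docs, keep_ratio):
    """Keep the keep_ratio fraction of terms with the highest document
    frequency; docs is a list of documents, each a list of stemmed tokens."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # count each term at most once per document
    ranked = sorted(df, key=df.get, reverse=True)
    kept = set(ranked[:max(1, int(len(ranked) * keep_ratio))])
    return [[t for t in doc if t in kept] for doc in docs]

# e.g. remove half of the vocabulary, keeping the terms seen in the most documents
docs = [["cluster", "feature"], ["feature", "selection"], ["cluster", "text"]]
reduced = df_select(docs, keep_ratio=0.5)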
The values of the different validity criteria when applying DF to the different datasets are shown in Fig. 1.
The AA, RS and FM results range from 0.5714 to 0.7201, from 0.7370 to 0.8928, and from 0.1422 to 0.5157, respectively. As can be seen in Fig. 1, the four RS curves on the four datasets approximately follow the rule mentioned above, but the curves are very flat, so the trends are not distinct. All four AA curves follow the rule well except the TR45 curve. The FM curves on FBIS and REI follow the DF rule very well, while those on TR45 and TR41 fluctuate randomly.
According to the results of this first experiment, AA is the best validity criterion. The results also show that text classification performance varies greatly across datasets: the performance on FBIS and REI is much better than on the others. If we consider only the results on FBIS and REI, the AA and FM validity criteria are both good, and FM may be even better. So in the experiments below, we mainly use the FBIS and REI datasets, together with the AA and FM validity criteria.
V. EVALUATION OF FEATURE SELECTION METHODS
The following experiments compare the unsupervised feature selection methods DF, TC, TVQ and TV.
We chose K-means as the clustering algorithm. Since K-means is easily influenced by the selection of initial centroids, we randomly generated 5 sets of initial centroids for each dataset and averaged the performance of the 5 runs as the final clustering performance.
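A minimal sketch of this averaging protocol, assuming scikit-learn's KMeans (the paper does not name its implementation) and using FM as the validity score:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import fowlkes_mallows_score

def averaged_fm(X, true_labels, k, n_runs=5):
    """Cluster X with K-means n_runs times from different random centroids
    and average the FM validity score over the runs."""
    scores = []
    for seed in range(n_runs):
        km = KMeans(n_clusters=k, init="random", n_init=1, random_state=seed)
        pred = km.fit_predict(X)
        scores.append(fowlkes_mallows_score(true_labels, pred))
    return float(np.mean(scores))

AA would be averaged analogously, computed from the pair counts of each run.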
The AA and FM results on FBIS and REI are shown in Fig.
2 to Fig. 5.
From these figures, we can see first that the unsupervised feature selection methods can improve clustering performance when a certain proportion of terms is removed. For all methods in our experiments, at least 70% of the terms can be removed with no loss in clustering performance on both datasets, and for most feature selection methods, removing certain features improves clustering performance. For instance, removing 20% of the terms of FBIS with the TC method yields a 9.4% improvement in the FM value.
Second, TC is the steadiest method of all: clustering performance does not degrade noticeably as terms are removed. The results of the TC method are shown in Fig. 6.
Third, the TV method is slightly worse than TC but much better than DF and TVQ. The DF method drops quickly when more than 60% of the terms are removed, and the performance of TVQ is very bad when more than 70% of the terms are removed from the REI dataset. The results of the TV method are shown in Fig. 7. When no more than 80% of the terms are removed from the datasets by the TV method, there is no loss in clustering performance.
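For reference, a sketch of the TC and TV term scores as they are commonly formulated in the feature-selection literature (an assumption; the paper's own definitions appear in an earlier section): TC sums the products of a term's tf-idf weights over all document pairs, and TV measures the variance of a term's frequency across documents.

import numpy as np

def term_contribution(W):
    """TC(t) = sum over document pairs i != j of W[i, t] * W[j, t],
    computed per term for a documents x terms tf-idf matrix W."""
    col_sum = W.sum(axis=0)
    return col_sum ** 2 - (W ** 2).sum(axis=0)   # (sum w)^2 - sum w^2 expands the pairwise sum

def term_variance(F):
    """TV(t) = sum over documents i of (F[i, t] - mean_t)^2
    for a documents x terms frequency matrix F."""
    return ((F - F.mean(axis=0)) ** 2).sum(axis=0)

Terms are then ranked by these scores and the lowest-scoring fraction is removed, as with DF above.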