A. Document Frequency (DF)

Document frequency is the number of documents in a dataset in which a term occurs. It is the simplest criterion for term selection and scales easily to large datasets because its computational complexity is linear. The basic assumption of this method is that terms appearing in only a small number of documents are unimportant or will not influence clustering efficiency. It is a simple but effective feature selection method for text categorization [9].

B. Term Contribution (TC)

Because a simple method such as DF assumes that each term has the same importance in different documents, it is easily biased by common terms that have a high document frequency but a uniform distribution over different classes. TC was proposed to deal with this problem [10].

We first introduce TF.IDF (Term Frequency Inverse Document Frequency) [11]. TF.IDF jointly considers the frequency of a term in a document and the document frequency of the term. The idea is that a term appearing in too many documents is too common to be important for clustering, which is why the inverse document frequency is taken into account. That is, if the frequency of a term in a document is high and the term does not appear in many documents, the term is important. A common form of TF.IDF is

$$f(t, D) = TF(t, D) \times \log\frac{N}{DF(t)}$$

where $TF(t, D)$ is the frequency of term $t$ in document $D$, $DF(t)$ is the document frequency of $t$, and $N$ is the total number of documents.

The result of text clustering is highly dependent on document similarity, so the contribution of a term can be viewed as its contribution to the documents' similarity. The similarity between documents $D_i$ and $D_j$ is computed by the dot product:

$$sim(D_i, D_j) = \sum_{t} f(t, D_i) \times f(t, D_j)$$

C. Term Variance Quality (TVQ)

The term variance quality method was introduced by Inderjit Dhillon, Jacob Kogan and Charles Nicholas [12]. It follows the ideas of Salton and McGill [13]. The quality of the term $t_i$ is measured as follows:

$$q(t_i) = \sum_{j=1}^{n} f_{ij}^2 - \frac{1}{n}\left[\sum_{j=1}^{n} f_{ij}\right]^2$$

where $n$ is the number of documents in which $t_i$ occurs at least once, and $f_{ij} \geq 1$, $j = 1, \ldots, n$, is the frequency of $t_i$ in the $j$-th of those documents.
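
The three criteria described above can be illustrated with a minimal Python sketch on a toy corpus. Several things here are assumptions rather than part of the text: the whitespace tokenizer, all function names, and the example documents are hypothetical; the TF.IDF weight is taken as TF times log(N / DF); the term contribution is assumed to be the sum of the products of a term's weights over all ordered pairs of distinct documents; and the term variance quality is assumed to be the sum of squared frequencies minus the squared sum divided by n, computed over the n documents that contain the term.

```python
import math
from collections import Counter


def term_counts(docs):
    """Count term occurrences per document (naive whitespace tokenizer)."""
    return [Counter(doc.lower().split()) for doc in docs]


def document_frequency(counts):
    """DF(t): number of documents in which term t occurs."""
    df = Counter()
    for c in counts:
        df.update(c.keys())  # each document contributes at most 1 per term
    return df


def tfidf(counts, df):
    """Assumed weight f(t, D) = TF(t, D) * log(N / DF(t))."""
    n_docs = len(counts)
    return [
        {t: tf * math.log(n_docs / df[t]) for t, tf in c.items()}
        for c in counts
    ]


def term_contribution(weights):
    """Assumed TC(t): sum of f(t, D_i) * f(t, D_j) over ordered pairs i != j.

    Uses the identity sum_{i != j} a_i * a_j = (sum a)^2 - sum a^2.
    """
    total, squares = Counter(), Counter()
    for w in weights:
        for t, f in w.items():
            total[t] += f
            squares[t] += f * f
    return {t: total[t] ** 2 - squares[t] for t in total}


def term_variance_quality(counts):
    """Assumed q(t) = sum_j f_j^2 - (1/n) * (sum_j f_j)^2,
    over the n documents in which t occurs at least once."""
    freqs = {}
    for c in counts:
        for t, f in c.items():
            freqs.setdefault(t, []).append(f)
    return {
        t: sum(f * f for f in fs) - sum(fs) ** 2 / len(fs)
        for t, fs in freqs.items()
    }


# Hypothetical toy corpus, for illustration only.
docs = [
    "apple banana apple",
    "banana cherry",
    "apple cherry cherry cherry",
]
counts = term_counts(docs)
df = document_frequency(counts)
tc = term_contribution(tfidf(counts, df))
tvq = term_variance_quality(counts)

for term in sorted(df):
    print(term, df[term], round(tc[term], 4), tvq[term])
```

In this toy corpus every term has the same document frequency, so DF alone cannot rank them, while TC and TVQ both separate the terms by how unevenly their frequencies are spread across documents.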
