First, the system splits all of the sentences, which are the titles
of articles in the expert publication database. It splits each
sentence into sub-sentences at punctuation marks.
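The splitting step can be sketched as follows. This is a minimal illustration rather than the system's actual code: the function name `split_title` and the set of punctuation marks treated as boundaries are assumptions, since the text does not list the exact marks.

```python
import re

# Punctuation treated as sub-sentence boundaries. This set is an assumption;
# the text does not say which marks the system uses.
_BOUNDARY = re.compile(r"[.,;:!?()\[\]]+")

def split_title(title):
    """Split one article title into sub-sentences at punctuation marks."""
    parts = (p.strip() for p in _BOUNDARY.split(title))
    # Drop the empty pieces left by leading, trailing, or adjacent punctuation.
    return [p for p in parts if p]

# Hypothetical titles standing in for the expert publication database.
titles = ["Deep learning, in practice: a survey."]
sub_sentences = [s for t in titles for s in split_title(t)]
```

A rule this simple also splits at periods inside abbreviations or decimal numbers, so a real implementation would likely need a more careful boundary rule.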
Second, the system parses clauses from those sub-sentences with
the Stanford NLP toolkit [11]. The parsed clauses are of several
types, but the system keeps only two of them, “Noun+Noun” and
“(Adj | Noun)+Noun,” for two reasons. One is that queries are
domain terms, and domain terms are nouns. The other is that
such terms are generally short noun phrases.
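The clause filter can be illustrated on tokens that are already part-of-speech tagged. The sketch below does not call the Stanford toolkit; it assumes Penn Treebank style tags as input and reads the "+" in the two patterns as "one or more", as is usual for C-value linguistic filters. Both points are assumptions, and `candidate_clauses` is a hypothetical helper name. Under that reading, "(Adj | Noun)+Noun" also covers "Noun+Noun", so a single pattern is enough.

```python
import re

# Penn Treebank tags counted as nouns and adjectives (assumed tagset).
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
ADJ_TAGS = {"JJ", "JJR", "JJS"}

def candidate_clauses(tagged):
    """Return maximal word runs matching (Adj|Noun)+ Noun.

    `tagged` is a list of (word, tag) pairs produced by any POS tagger.
    """
    # Encode each token as one character so a regex can scan the tag sequence.
    codes = "".join(
        "N" if tag in NOUN_TAGS else "A" if tag in ADJ_TAGS else "x"
        for _, tag in tagged
    )
    clauses = []
    for match in re.finditer(r"[AN]+N", codes):
        words = [word for word, _ in tagged[match.start():match.end()]]
        clauses.append(" ".join(words))
    return clauses

# Hypothetical tagged sub-sentence.
tagged = [("efficient", "JJ"), ("query", "NN"), ("expansion", "NN"),
          ("for", "IN"), ("expert", "NN"), ("search", "NN")]
clauses = candidate_clauses(tagged)
```

The sketch returns only the longest match at each position; shorter clauses nested inside a match would have to be enumerated separately.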
Third, the system generates candidate extended queries with the
C-value method. C-value is an Automatic Term Recognition
(ATR) measure. It suits settings in which the input is a large
corpus, the output is the terms of a domain, and the domain is very
specific [9]. The C-value method ranks all clauses based on the
frequency of each clause and the number of times it is nested in
other clauses [10]. The equation is given in (1).
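For reference, the C-value measure is commonly written in the form below. This is the standard formulation restated with this section's symbols, on the assumption that it matches the form used in [10]; here $f(b)$, for a clause $b$ in $T_c$, plays the role of f(nested).

```latex
\begin{equation}
\text{C-value}(c) =
\begin{cases}
\log_2 |c| \cdot f(c), & \text{if } c \text{ is not nested},\\[4pt]
\log_2 |c| \left( f(c) - \dfrac{1}{|T_c|} \displaystyle\sum_{b \in T_c} f(b) \right), & \text{otherwise}.
\end{cases}
\tag{1}
\end{equation}
```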
In equation (1), f(c) is the frequency of clause c,
f(nested) is the frequency of a nested clause, |c| is the length of
clause c, Tc is the set of clauses that contain c, and |Tc| is the
number of clauses in that set.
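The ranking can be sketched with these quantities. The code below follows the commonly cited form of C-value, in which a clause that appears inside longer candidate clauses has the mean frequency of those longer clauses subtracted from its own frequency before weighting by the base-2 logarithm of its length. Treating that as the system's exact equation is an assumption, and the names `contains` and `c_values` are hypothetical.

```python
import math

def contains(longer, shorter):
    """True if `shorter` is a contiguous proper sub-sequence of `longer`."""
    n, m = len(longer), len(shorter)
    return m < n and any(longer[i:i + m] == shorter for i in range(n - m + 1))

def c_values(freq):
    """Score every clause; `freq` maps a clause (tuple of words) to f(c)."""
    scores = {}
    for c, f_c in freq.items():
        # T_c: the longer candidate clauses that contain c.
        t_c = [b for b in freq if contains(b, c)]
        if t_c:
            # Discount c by the mean frequency of the clauses containing it.
            f_c = f_c - sum(freq[b] for b in t_c) / len(t_c)
        # |c| is taken as the number of words in the clause.
        scores[c] = math.log2(len(c)) * f_c
    return scores

# Hypothetical frequencies for two candidate clauses.
freq = {("query", "expansion"): 5, ("automatic", "query", "expansion"): 2}
scores = c_values(freq)
```

Because log2(1) is zero, a one-word clause would always score zero under this form; the two clause patterns used here need at least two words, so that case does not arise.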
Fourth, the system extends the query items according to the
C-value of the candidate clauses. It sets the average C-value as the
threshold and then selects every clause whose C-value is
greater than that average. The selected clauses must all be
distinct. After this phase, the system has a set of extended
queries $Q = \{q_1, \ldots, q_n\}$.
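The selection step amounts to a mean threshold over the scored clauses. A minimal sketch, assuming the scores are held in a dictionary keyed by clause (the name `extend_queries` is hypothetical, and returning the result in descending score order is a convenience that the text does not require):

```python
def extend_queries(scores):
    """Return the distinct clauses whose C-value exceeds the mean C-value."""
    if not scores:
        return []
    threshold = sum(scores.values()) / len(scores)
    # Dictionary keys are unique, so every selected clause is distinct.
    selected = [clause for clause, value in scores.items() if value > threshold]
    # Highest-scoring clauses first (an added convenience, see lead-in).
    return sorted(selected, key=lambda clause: scores[clause], reverse=True)

# Hypothetical C-values for three candidate clauses.
example_scores = {"expert search": 4.0, "query expansion": 5.0, "search engine": 0.0}
extended = extend_queries(example_scores)
```

With these example values the mean is 3.0, so the two clauses scoring 5.0 and 4.0 are kept and the clause scoring 0.0 is dropped.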