Exercise 4.1
If we need T log2
T comparisons (where T is the number of termID–docID pairs) and
two disk seeks for each comparison, how much time would index construction for
Reuters-RCV1 take if we used disk instead of memory for storage and an unoptimized sorting algorithm (i.e., not an external sorting algorithm)? Use the system
parameters in Table 4.1.
Exercise 4.2 [⋆]
How would you create the dictionary in blocked sort-based indexing on the fly to
avoid an extra pass through the data?