The algorithm parses documents into termID–docID pairs and accumulates the
pairs in memory until a block of a fixed size is full (PARSENEXTBLOCK
in Figure 4.2). We choose the block size to fit comfortably into memory to
permit a fast in-memory sort. The block is then inverted and written to disk.
Inversion involves two steps. First, we sort the termID–docID pairs. Next,
we collect all termID–docID pairs with the same termID into a postings list,
where a posting is simply a docID. The result, an inverted index for the block
we have just read, is then written to disk. Applying this to Reuters-RCV1 and
assuming we can fit 10 million termID–docID pairs into memory, we end up
with ten blocks, each an inverted index of one part of the collection.
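
The two inversion steps can be sketched in a few lines of Python. This is only an illustration, not the PARSENEXTBLOCK routine of Figure 4.2: the names `read_blocks` and `invert_block` are hypothetical, pairs are assumed to be `(term_id, doc_id)` integer tuples, and repeated pairs are assumed to collapse into a single posting.

```python
from itertools import groupby
from operator import itemgetter


def read_blocks(pair_stream, block_size):
    """Yield lists of at most block_size (term_id, doc_id) pairs."""
    block = []
    for pair in pair_stream:
        block.append(pair)
        if len(block) == block_size:
            yield block
            block = []
    if block:  # last, possibly partial, block
        yield block


def invert_block(pairs):
    """Turn one block of pairs into a list of (term_id, postings) entries."""
    # Step 1: sort the pairs by termID, then docID.
    # Assumption: duplicate pairs are dropped so each docID appears once.
    sorted_pairs = sorted(set(pairs))
    # Step 2: collect all docIDs that share a termID into one postings list.
    return [
        (term_id, [doc_id for _, doc_id in group])
        for term_id, group in groupby(sorted_pairs, key=itemgetter(0))
    ]


if __name__ == "__main__":
    stream = [(2, 1), (1, 1), (1, 2), (2, 1), (1, 1)]
    for block in read_blocks(stream, block_size=3):
        print(invert_block(block))
```

In a real indexer each inverted block would be written to its own file on disk; here the result is simply returned so the two steps stay visible.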
In the final step, the algorithm simultaneously merges the ten blocks into
one large merged index. An example with two blocks is shown in Figure 4.3,
where we use d_i to denote the i-th document of the collection. To do the
merging, we open all block files simultaneously, and maintain small read buffers
for the ten blocks we are reading and a write buffer for the final merged
index we are writing. In each iteration, we select the lowest termID that has
not been processed yet using a priority queue or a similar data structure. All
postings lists for this termID are read and merged, and the merged list is
written back to disk. Each read buffer is refilled from its file when necessary.
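
The merge loop described above can be sketched with Python's `heapq` module as the priority queue. This is a minimal sketch under stated assumptions: `merge_blocks` is a hypothetical name, each block is an in-memory iterable of `(term_id, postings)` entries sorted by termID (standing in for a buffered block file), and the blocks cover disjoint sets of documents, so merged postings contain no duplicates.

```python
import heapq


def merge_blocks(block_indexes):
    """Merge per-block inverted indexes into one stream of (term_id, postings)."""
    iterators = [iter(block) for block in block_indexes]

    # Seed the priority queue with the first entry of every block.
    # block_no breaks ties, so postings lists are never compared.
    heap = []
    for block_no, it in enumerate(iterators):
        entry = next(it, None)
        if entry is not None:
            heap.append((entry[0], block_no, entry[1]))
    heapq.heapify(heap)

    while heap:
        term_id = heap[0][0]  # lowest termID not yet processed
        lists_for_term = []
        # Pull this term's postings list from every block that has one.
        while heap and heap[0][0] == term_id:
            _, block_no, postings = heapq.heappop(heap)
            lists_for_term.append(postings)
            # "Refill": advance the block we just consumed from.
            entry = next(iterators[block_no], None)
            if entry is not None:
                heapq.heappush(heap, (entry[0], block_no, entry[1]))
        # Merge the sorted postings lists into one sorted list.
        yield term_id, list(heapq.merge(*lists_for_term))


if __name__ == "__main__":
    block_1 = [(1, [1, 2]), (2, [1])]
    block_2 = [(1, [3]), (3, [4])]
    for term_id, postings in merge_blocks([block_1, block_2]):
        print(term_id, postings)
```

Because the function is a generator, a caller can write each merged list to the output file as soon as it is produced, which mirrors the role of the write buffer.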
How expensive is BSBI? Its time complexity is Θ(T log T) because the step
with the highest time complexity is sorting and T is an upper bound for the
number of items we must sort (i.e., the number of termID–docID pairs). But
