Abstract of MS Thesis - Sairam Gurajada

Tuesday 10 April 2012 at 11:01 pm.

With the advent of the search engines, searching the web became very easy andefficient. The search engines provide the user with the relevant results from billionsof web pages within fraction of a second. But, search engines lack the ability toreturn all the relevant results from the web. The main reason is the dynamic natureof the web. Everyday millions of new pages are added to web, and roughly a similarnumber of web pages are modified. The traditional use of static indexing or off-lineindexing might not work for today's dynamic web. To accommodate the changes tothe web, the search engines rebuild the entire index. Though it is a costly approach,it provides a very low query execution times.

Although the use of rebuild solves the problem of keeping the search engines upto date, it is an inefficient process in the current day scenarios. According to Google,currently the web holds a 1 trillion unique web pages. Rebuilding the entire indexagain from scratch is infeasible. Instead, search engines are moving towards updatingthe index incrementally. This approach is often referred to as on-line indexing. Onlineindexing provides a better solution for indexing the in-flowing documents, and for appending the new indexes to the existing index. In addition, on-line indexingallows the use of the index for answering the queries while it is being constructed.

In this thesis, we describe the Query Log based On-line indexing techniques proposedby us. They overcome the drawbacks of the existing on-line indexing approaches byoffering a better trade-off between the query performance and index maintenancecosts. Query log approaches take advantage of the real world scenario, where most ofthe queries that form the query log are repetitive. The proposed approaches achievehigher query performance by having a separate frequent-term index for frequentlyqueried terms, and maintaining them in fewer partitions. On the other hand, wemaintain large infrequent-term indexes, comprising terms which are rarely queried,in large number of partitions to reduce index maintenance costs. This classificationof indexes into frequent-term and infrequent-term index is done through a techniquecalled Horizontal partitioning. By following different index maintenance strategiesfor frequent-term and infrequent-term indexes, query log based approaches achievesignfiicant improvement in query performance with low index maintenance costs comparedto existing approaches like Geometric partitioning, and logarithmic merge.The proposed query log based on-line indexing approaches assume that the querylog taken for classifying the terms remains static over time. In reality, query logchanges from time to time due to the changes in the frequencies of the old queriesposed and new queries being added. The new query log will have different set of frequentand infrequent terms. This dynamic nature of query log may result in performancedegradation of proposed approaches. To overcome this limitation, we proposean on-line index tuning algorithm which reclassifies the terms whose frequencies gotchanged.