A Review Of Journal Article On Search Engine Operations (Part 2)

Continued from the previous post: in his research on web search engine credibility, Lewandowski (2012) discussed the issue of delivering highly credible search results to search engine users. He further emphasized the criteria by which search engines decide whether to include documents in their indices. These criteria include:
·        Text-based matching: matching queries against documents to find those that satisfy the query.
·        Popularity: measured through signals such as clicks and links pointing to the page.
·        Freshness: how new and up to date the document is.
·        Locality: knowing the user's location is paramount in giving useful results.
·        Personalisation: tailoring results to the user's search habits.
He argued that popularity lies at the heart of these systems.
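The criteria above can be pictured as signals combined into a single ranking score. The sketch below is a toy illustration of that idea only; the weights, field names, and combining rule are invented for demonstration and do not reflect any real engine's formula.

```python
# Toy illustration: combine the per-document criteria listed above into
# one relevance score. All weights and signal names are invented.

def score(doc, weights=None):
    """Combine per-document signals (each in [0, 1]) into one ranking score."""
    weights = weights or {
        "text_match": 0.4,       # query/document term overlap
        "popularity": 0.3,       # clicks, inbound links
        "freshness": 0.1,        # recency of the document
        "locality": 0.1,         # match with the user's location
        "personalisation": 0.1,  # fit with the user's search history
    }
    return sum(weights[k] * doc.get(k, 0.0) for k in weights)

docs = [
    {"url": "a.example", "text_match": 0.9, "popularity": 0.2, "freshness": 0.5},
    {"url": "b.example", "text_match": 0.6, "popularity": 0.9, "freshness": 0.8},
]
ranked = sorted(docs, key=score, reverse=True)
# b.example wins here: its popularity outweighs a.example's better text match
```

Note how the weighting mirrors Lewandowski's point: with popularity weighted heavily, a less exact textual match can still rank first.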
Search engines use a variety of ranking algorithms to order the pages in their indices.
Chandra, Suaib & Beg (2015) outlined and briefly discussed Google's search algorithm updates against web spam, some of which include PageRank, Panda, and Penguin, among many others. According to them, PageRank counts the number and quality of links to a page to calculate a rough estimate of a website's global importance. They noted that important websites can be assumed to be more likely to receive a high number of links from other websites, and that Google's search engine was initially based on PageRank together with signals such as page title, anchor text, and links. Chandra, Suaib & Beg (2015) further stated that Google currently uses more than 200 signals to rank web pages as well as to combat web spam. Google also uses its huge amount of usage data (query logs, browser logs, ad-click logs, etc.) to interpret the complex intent behind cryptic queries and to provide relevant results to the end user.
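The core PageRank idea described above, that a page's importance derives from the number and importance of the pages linking to it, can be sketched with the classic power-iteration method. The tiny link graph below is made up for illustration; real implementations operate on billions of pages with sparse-matrix techniques.

```python
# Minimal PageRank sketch: power iteration with the standard damping
# factor of 0.85. links maps each page to the pages it links out to.

def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with uniform rank
    for _ in range(iterations):
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            if not outs:  # dangling page: spread its rank over all pages
                for q in pages:
                    new[q] += damping * rank[p] / n
            else:  # each outlink carries an equal share of p's rank
                for q in outs:
                    new[q] += damping * rank[p] / len(outs)
        rank = new
    return rank

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
# C is linked to by both A and B, so it ends up with the highest rank
```

The damping factor models a "random surfer" who follows links most of the time but occasionally jumps to a random page, which keeps the ranks well defined even for pages with no inbound links.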
In their research, they explained that the Panda update aimed to lower the rank of low-quality websites and increase the ranking of news and social networking sites. Panda is a filter that down-ranks sites with thin content, content farms, doorway pages, affiliate websites, sites with a high ads-to-content ratio, and a number of other quality issues. The Panda update affects the ranking of an entire website rather than individual pages. It includes new signals such as data about sites that users blocked, either directly via the search engine results page or via the Chrome browser.
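A distinctive point here is that Panda acts at the site level rather than the page level. The following is an invented sketch of that idea only; the thresholds, field names, and penalty value are assumptions for illustration, not documented Google behaviour.

```python
# Illustrative site-level quality filter in the spirit of the Panda
# description above: if most of a site's pages look "thin" or ad-heavy,
# the whole site is demoted. All thresholds here are invented.

def site_quality_penalty(pages, min_words=300, max_ad_ratio=0.5):
    """Return a rank multiplier applied to every page of the site."""
    flagged = sum(
        1 for p in pages
        if p["word_count"] < min_words or p["ad_ratio"] > max_ad_ratio
    )
    share = flagged / len(pages)
    # Demote the entire site once most of its pages look low quality.
    return 0.5 if share > 0.5 else 1.0

thin_site = [
    {"word_count": 120, "ad_ratio": 0.7},
    {"word_count": 80, "ad_ratio": 0.6},
    {"word_count": 900, "ad_ratio": 0.1},
]
penalty = site_quality_penalty(thin_site)  # 2 of 3 pages flagged -> 0.5
```

Even the one substantial page on `thin_site` would inherit the demotion, which is exactly the site-wide behaviour the authors attribute to Panda.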
Another important algorithm update is the Penguin update. This is purely a web-spam algorithm update. It adjusts for a number of spam factors, including keyword stuffing, in-links coming from spam pages, and anchor text/link relevance. Penguin also detects over-optimisation of tags and internal links, bad neighbourhoods, bad ownership, and so on.
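Keyword stuffing, the first spam factor listed above, is straightforward to sketch as a keyword-density check. The 5% threshold below is an arbitrary illustration, not a documented value used by any engine.

```python
# Toy keyword-stuffing check: flag a document when a single keyword
# makes up more than a threshold share of its words. The 5% threshold
# is an invented illustration.

def keyword_density(text, keyword):
    words = text.lower().split()
    return words.count(keyword.lower()) / max(len(words), 1)

def looks_stuffed(text, keyword, threshold=0.05):
    return keyword_density(text, keyword) > threshold

spam = "cheap shoes cheap shoes buy cheap shoes now cheap shoes"
normal = "we review running shoes and compare prices across stores"
# "cheap" is 4 of 10 words in the spam text, far above the threshold
```

Real spam detection combines many such signals; a density check alone would be trivial to evade, which is why Penguin also weighs link-based factors like spammy in-links and anchor-text relevance.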

A Review Of Journal Article On Search Engine Operations (Part 1)

This post is a review of journal articles that discuss various subjects on search engine operations.
Lewandowski (2006) discusses Web search engines: mainly the challenges in indexing the World Wide Web, user behavior, and the ranking factors these engines use. He divided the ranking factors into query-dependent and query-independent factors, the latter of which have become increasingly important in recent years. The possibilities of these factors are limited, particularly those based on the widely used link-popularity measures. He concluded his article with an overview of factors that should be considered when determining the quality of Web search engines.

He stated that the first challenge in indexing the World Wide Web is the size of the search engine's database, shown by the number of pages indexed after crawling. He further explained that size alone does not portray the overall quality of the engine: an ideal search engine should know all the pages of the Web, but there is content, such as duplicates or spam pages, that should not be indexed.
Secondly, another challenge, according to him, is keeping search engines' databases up to date. He explained that content on the Web changes very fast, so new or updated pages should be indexed as quickly as possible. Search engines face problems in keeping up to date with the entire Web; because of its enormous size and the different update cycles of individual websites, adequate crawling strategies are needed.
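One common family of crawling strategies for the freshness problem described above is to revisit frequently changing pages sooner than static ones. The scoring heuristic and field names below are invented for illustration and are not from Lewandowski's article.

```python
# Simple recrawl-priority sketch: pages that change often and have not
# been visited recently should be crawled first. Heuristic is invented.
import time

def recrawl_priority(page, now=None):
    """Higher score = crawl sooner."""
    now = now if now is not None else time.time()
    staleness = now - page["last_crawled"]         # seconds since last visit
    change_rate = page["observed_changes_per_day"]  # estimated from history
    return staleness * (1.0 + change_rate)

pages = [
    {"url": "news.example/front", "last_crawled": 0, "observed_changes_per_day": 24},
    {"url": "archive.example/1999", "last_crawled": 0, "observed_changes_per_day": 0.01},
]
queue = sorted(pages, key=lambda p: recrawl_priority(p, now=86400), reverse=True)
# the frequently changing news page is scheduled ahead of the static archive
```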
Third is the problem posed by web content itself. He argued that documents on the Web are written in many different languages and come in many different file types, and that search engines today index not only documents written in HTML but also PDF, Word, and other Office files. Each file format presents certain difficulties for the search engines, and all file formats have to be considered in the overall ranking.
Lewandowski (2006) says that “The Invisible Web is defined as the part of the Web that search engines do not index. This may be due to technical reasons or barriers made by website owners, e.g. password protection or robots exclusions.”
Lastly, according to Lewandowski (2006), spam is another major challenge for search engines, as they have to filter out spam pages to keep their indices clean.
The behavior of search engine users varies considerably. Research by Lewandowski (2006) showed that a large percentage of search engine users are not sophisticated in their use of search engines. Most users do not know advanced search techniques, and a large percentage of those who do seldom use them. He further said that most users do not go beyond the first page of results.
Lewandowski (2006) in his article discussed the ranking factors of search engines. He classified all ranking factors into two major categories: query-dependent factors and query-independent factors. According to him, query-dependent factors are factors that are in one way or another related to the user's search. They include factors such as word-document frequency, search term distance, search term order, position of the search terms within the document, metatags, and emphasis on terms within the document. Query-independent factors are used to determine document quality regardless of the query. According to Lewandowski (2006), such factors include link popularity, directory hierarchy, number of incoming links, click popularity, how up to date the page is, document length, file format, and size of the website.
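The two-category split described above can be sketched as a query-dependent score (here, a simple term-frequency overlap) combined with a query-independent quality prior such as link popularity. The combining rule and all numbers below are illustrative assumptions, not Lewandowski's formula.

```python
# Sketch of the query-dependent / query-independent split: a term-overlap
# score is multiplied by a precomputed quality prior. All values invented.

def query_dependent_score(query, text):
    """Share of document words that match a query term."""
    terms = query.lower().split()
    words = text.lower().split()
    return sum(words.count(t) for t in terms) / max(len(words), 1)

def rank(query, docs):
    # Query-independent link_popularity can be computed offline,
    # before any query arrives; only the overlap is computed per query.
    return sorted(
        docs,
        key=lambda d: query_dependent_score(query, d["text"]) * d["link_popularity"],
        reverse=True,
    )

docs = [
    {"url": "a", "text": "search engine ranking factors explained", "link_popularity": 0.2},
    {"url": "b", "text": "ranking factors used by a search engine", "link_popularity": 0.9},
]
best = rank("ranking factors", docs)[0]
```

The design point this illustrates is why query-independent factors matter operationally: they can be precomputed once per page, whereas query-dependent scores must be evaluated at query time.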
Lastly, Lewandowski (2006) discussed certain critical factors that determine the quality of a search engine. These factors, according to him, include index quality, which is the aggregate of database size, indexing depth, how up to date the index is, low indexing bias, etc., and advanced search features, which is a less commonly used parameter but quite useful in determining the quality of a search engine.
Web search engines apply a variety of ranking signals to achieve user satisfaction, i.e., results pages that provide the best possible results to the user.
