Continued from the previous post. In his research work on web search engine credibility, Lewandowski (2012) discussed the issue of delivering highly credible search results to search engine users. He further emphasized the criteria by which search engines decide whether to include documents in their indices. These criteria include:
· Text-based matching: matching queries against documents to find documents that fulfill the query.
· Popularity: measured through signals such as clicks and links that lead to the page.
· Freshness: how new and up to date the document is.
· Locality: knowing the user's location is paramount to returning useful results.
· Personalisation: tailoring results to the user's search habits.
He argued that popularity lies at the heart of these systems.
Search engines rely on a variety of page-ranking algorithms to index and rank web pages, as the sketch below illustrates.
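To make the first two criteria above concrete, here is a minimal Python sketch of how a text-match score might be blended with a popularity signal. The toy documents, weights, and function names are illustrative assumptions, not taken from Lewandowski (2012).

```python
import math
from collections import Counter

# Hypothetical toy index: doc id -> (text, popularity signal such as in-link count).
# Data and weights are invented for illustration.
DOCS = {
    "a": ("search engines rank web pages by relevance", 120),
    "b": ("fresh news pages are crawled often", 45),
    "c": ("web pages link to other web pages", 300),
}

def text_score(query: str, text: str) -> float:
    """Fraction of query terms that appear in the document (crude text matching)."""
    doc_terms = Counter(text.split())
    query_terms = query.split()
    hits = sum(1 for t in query_terms if doc_terms[t] > 0)
    return hits / len(query_terms)

def rank(query: str):
    """Blend text matching with log-scaled popularity, as the criteria suggest."""
    scored = []
    for doc_id, (text, links) in DOCS.items():
        score = 0.7 * text_score(query, text) + 0.3 * math.log1p(links) / 10
        scored.append((score, doc_id))
    return sorted(scored, reverse=True)

print(rank("web pages rank"))
```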
Chandra, Suaib & Beg (2015) outlined and briefly discussed Google search algorithm updates against web spam, some of which include PageRank, Panda, and Penguin, among many others. According to them, PageRank counts the number and quality of links to a page to calculate a rough estimate of a website's global importance. They noted that important websites can be assumed to be more likely to receive a high number of links from other websites, and that Google's search engine was initially based on PageRank together with signals such as page titles, anchor text, and links. Chandra, Suaib & Beg (2015) further stated that Google's search engine currently uses more than 200 signals to rank web pages and to combat web spam. Google also draws on vast amounts of usage data (query logs, browser logs, ad-click logs, and so on) to interpret the complex intent behind cryptic queries and to provide relevant results to end users.
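As a rough illustration of the PageRank idea described above, here is a minimal power-iteration sketch. The tiny link graph and the damping factor of 0.85 are common illustrative defaults, not values from the paper.

```python
# Each page's score depends on the number and scores of pages linking to it.
LINKS = {            # page -> pages it links out to
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                new_rank[target] += share  # each in-link passes on a share of rank
        rank = new_rank
    return rank

for page, score in sorted(pagerank(LINKS).items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```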
In their research, they explained that the Panda update aimed to lower the ranking of low-quality websites and raise the ranking of news and social networking sites. Panda is a filter that down-ranks sites with thin content, content farms, doorway pages, affiliate websites, sites with a high ads-to-content ratio, and a number of other quality issues. The Panda update affects the ranking of an entire website rather than individual pages. It includes new signals such as data about the sites users blocked, either directly via the search engine results page or via the Chrome browser.
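The following toy sketch illustrates the kind of site-level signals attributed to Panda above, such as thin content and a high ads-to-content ratio. The thresholds and field names are invented for the example; Google's actual signals are not public.

```python
def panda_like_flags(pages):
    """Return quality flags for a whole site, since Panda acts site-wide."""
    total_words = sum(p["words"] for p in pages)
    total_ads = sum(p["ad_blocks"] for p in pages)
    thin_pages = sum(1 for p in pages if p["words"] < 150)
    return {
        # Flag the site if more than half its pages are thin (arbitrary cutoff).
        "thin_content": thin_pages / len(pages) > 0.5,
        # Flag a high ratio of ad blocks to words (arbitrary cutoff).
        "ad_heavy": total_ads / max(total_words, 1) > 0.01,
    }

site = [
    {"words": 90, "ad_blocks": 4},
    {"words": 120, "ad_blocks": 5},
    {"words": 800, "ad_blocks": 2},
]
print(panda_like_flags(site))  # {'thin_content': True, 'ad_heavy': True}
```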
Another important algorithm update is the Penguin update. This one is purely a web spam algorithm update. It adjusts for a number of spam factors, including keyword stuffing, in-links coming from spam pages, and anchor text/link relevance. Penguin detects over-optimization of tags and internal links, bad neighborhoods, bad ownership, and so on.
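As a crude illustration of one of the spam factors just listed, the sketch below approximates keyword stuffing as an unusually high density of a single term. The 20% threshold is an arbitrary illustrative cutoff (real systems also account for stop words and much more).

```python
from collections import Counter

def is_keyword_stuffed(text: str, threshold: float = 0.2) -> bool:
    """Flag a page whose single most frequent term dominates its word count."""
    words = text.lower().split()
    if not words:
        return False
    _, top_count = Counter(words).most_common(1)[0]
    return top_count / len(words) > threshold

spammy = "cheap shoes cheap shoes buy cheap shoes cheap shoes online"
print(is_keyword_stuffed(spammy))  # True: one term dominates the page
```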
A Review Of Journal Article On Search Engine Operations (Part 1)
This post is a review of journal articles that discuss various subjects on search engine operations.
Lewandowski (2006) discusses Web search engines: mainly the challenges in indexing the World Wide Web, user behavior, and the ranking factors used by these engines. He divided these ranking factors into query-dependent and query-independent factors, the latter of which have become more and more important in recent years. The possibilities of these factors are limited, particularly those based on the widely used link-popularity measures. He concluded his article with an overview of factors that should be considered to determine the quality of Web search engines. He stated that the first challenge in indexing pages on the World Wide Web is the size of the search engine's database, reflected in the number of pages indexed after crawling. He further explained that size does not necessarily portray the overall quality of the engine: an ideal search engine should know all the pages of the Web, but there is content, such as duplicates or spam pages, that should not be indexed.
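As a simple illustration of keeping duplicates out of an index, here is a generic content-hashing sketch. This is a common technique sketched under my own assumptions, not the method described by Lewandowski (2006).

```python
import hashlib

seen_hashes = set()

def should_index(page_text: str) -> bool:
    """Hash the whitespace-normalized text and skip exact repeats."""
    digest = hashlib.sha256(" ".join(page_text.split()).encode()).hexdigest()
    if digest in seen_hashes:
        return False  # exact duplicate already indexed
    seen_hashes.add(digest)
    return True

print(should_index("Hello   world"))  # True, first copy
print(should_index("Hello world"))    # False, same text after normalization
```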
Secondly, another challenge, according to him, is keeping search engines' databases up to date. He explained that content on the Web changes very fast, so new or updated pages should be indexed as quickly as possible. Search engines struggle to keep up to date with the entire Web, and because of its enormous size and the differing update cycles of individual websites, adequate crawling strategies are needed.
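One simple form such an adaptive crawling strategy could take is sketched below: pages that change often get shorter revisit intervals, stable pages get longer ones. The intervals and the halving/doubling rule are illustrative choices, not from the article.

```python
def next_interval(current_hours: float, changed: bool) -> float:
    """Adapt the recrawl interval to how often a page actually changes."""
    if changed:
        return max(current_hours / 2, 1)    # page changed: revisit sooner
    return min(current_hours * 2, 24 * 30)  # unchanged: back off, cap at 30 days

interval = 24.0
for changed in [True, True, False, False]:
    interval = next_interval(interval, changed)
    print(f"changed={changed} -> recrawl in {interval} hours")
```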
Third is the problem posed by Web content. He argued that documents on the Web are written in many different languages, that many different file types are used, and that search engines today index not only documents written in HTML but also PDF, Word, and other Office files. Each file format presents certain difficulties for search engines, and all file formats have to be considered in the overall ranking.
Lewandowski (2006) says that “The Invisible Web is defined
as the part of the Web that search engines do not index. This may be
due to technical reasons or barriers made by website owners, e.g.
password protection or robots exclusions.”
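Robots exclusion, one of the barriers mentioned in the quote, can be checked with Python's standard urllib.robotparser. The robots.txt rules and URLs below are made up for the example.

```python
from urllib import robotparser

# Hypothetical robots.txt content forbidding one directory.
rules = """
User-agent: *
Disallow: /private/
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("MyBot", "https://example.com/public/page.html"))   # True
print(rp.can_fetch("MyBot", "https://example.com/private/page.html"))  # False
```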
Lastly, according to
Lewandowski (2006), spam is another major challenge for search engines
as the search engines have to filter these spam pages to keep their
indexing clean.
The behavior of search engine users varies considerably. Research by Lewandowski (2006) showed that a large percentage of search engine users are not sophisticated in their use of search engines. Most users do not know advanced searching techniques, and a great percentage of those who do seldom use them. He further said that most users do not go past the first page of results.
Lewandowski (2006) in his article discussed the ranking factors of search engines. He classified all ranking factors into two major categories: query-dependent factors and query-independent factors. According to him, query-dependent factors are factors that are in one way or another related to the user's search. They include word-document frequency, search-term distance, search-term order, the position of the query terms, metatags, the position of the search terms within the document, and the emphasis on terms within the document. Query-independent factors are used to determine document quality regardless of the query. According to Lewandowski (2006), such factors include link popularity, directory hierarchy, the number of incoming links, click popularity, how up to date the page is, document length, file format, and the size of the website.
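The two factor classes can be pictured schematically: a query-dependent text score computed per search is combined with query-independent quality signals that are fixed per document. The weights and fields below are illustrative assumptions, not from the article.

```python
def query_dependent(query: str, doc_text: str) -> float:
    """Per-query signal: how often the query terms occur in the document."""
    terms = query.lower().split()
    words = doc_text.lower().split()
    return sum(words.count(t) for t in terms) / max(len(words), 1)

def query_independent(doc: dict) -> float:
    """Per-document signal, e.g. link popularity and freshness, fixed per doc."""
    return 0.5 * min(doc["inlinks"] / 100, 1.0) + 0.5 * doc["freshness"]

def final_score(query: str, doc: dict) -> float:
    return 0.6 * query_dependent(query, doc["text"]) + 0.4 * query_independent(doc)

doc = {"text": "web search engines rank web pages", "inlinks": 80, "freshness": 0.9}
print(round(final_score("web pages", doc), 3))
```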
Lastly, Lewandowski (2006) discussed certain critical factors that determine the quality of a search engine. According to him, these include index quality, which is the aggregate of database size, indexing depth, how up to date the index is, low indexing bias, and so on, as well as advanced search features, which are not a commonly used parameter but are quite useful in determining the quality of a search engine.
Web search engines apply a variety of ranking signals to achieve user satisfaction, i.e., results pages that provide the best possible results to the user. Have a nice day.
