Search in a Knowledge Base – Introduction & Lexical Search




The ability to search online for information or products is part of everyday life for anyone with access to the internet. As consumers in that realm, we are used to typing a few words into a search engine and getting useful results. To many, that is search and that is sufficient.

Popular search engines help you quickly refine a search once you begin typing a search term. Type “kidney” and you’ll be offered a list of medical topics to choose from – kidney stones, pain, cancer, etc. The search engine knows what “most” people search for about kidneys. Add a second word and the engine responds to your refinement with its own: Type “kidney shaped” and the offered list includes kidney shaped desks, pools, nuts, etc. Pick one and results are listed.

This is excellent assistance for most searchers most of the time, but the more complex our domain of interest and the more specific our search, the less satisfying the results of such generic search and suggestions become. More sophisticated search methods are necessary.


Terms such as “knowledge,” “knowledge assets” and “knowledge bases” that are used in the next sections were introduced in our previous blog article.


Categories of search


The topic of “search” is broad and complex and is itself the subject of numerous references from articles to textbooks. In this introductory article we will distinguish between two broad types of search available in knowledge bases: lexical and semantic.


Lexical search “looks for literal matches of the query words typed by the user or variants of them, without making an effort to understand what the whole query actually means.[1]” At its simplest, typing text into a search box initiates a lexical search: find assets with these words! This is often sufficient for finding most knowledge assets. Several key variants of lexical search are described below in this article.


Semantic search looks for meaning beyond the text typed by a searcher. Such meaning may be derived from how knowledge assets are encoded and represented within a knowledge base and also by inference based on the application of built-in or user-developed rules. This is a more complex topic that will be the subject of a future blog article.


Lexical Search


The most basic (and obvious) type of search is to seek assets containing simple values: “Find assets that contain the text ‘sodium’ or ‘15,247.’ ”

We quickly realize the need for more specificity. Can the value occur anywhere in an asset, or only in its name, or in selected properties? We need range search: “dates of publication between January 1, 2017 and June 11, 2019.” We would also like to be able to find assets with no values for selected properties.
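To make these three refinements concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the assets are plain dictionaries, and the property names (`name`, `published`, `author`, `summary`) and helper functions are hypothetical rather than taken from any particular knowledge base product.

```python
from datetime import date

# Hypothetical assets: each one is a dict of property name -> value.
assets = [
    {"name": "Sodium intake study", "published": date(2018, 3, 14), "author": "Lee"},
    {"name": "Potassium review", "published": date(2016, 11, 2), "author": None},
    {"name": "Salt and health", "published": date(2019, 6, 11), "summary": "Covers sodium."},
]

def contains(asset, text, fields=None):
    """True if `text` occurs in the chosen properties (all properties when None)."""
    keys = fields if fields is not None else asset.keys()
    return any(text.lower() in str(asset.get(k) or "").lower() for k in keys)

def in_range(asset, field, low, high):
    """True if the property has a value between `low` and `high`, inclusive."""
    value = asset.get(field)
    return value is not None and low <= value <= high

def is_missing(asset, field):
    """True if the property is absent or empty."""
    return asset.get(field) in (None, "")

# Value anywhere in the asset versus only in its name.
anywhere = [a["name"] for a in assets if contains(a, "sodium")]
name_only = [a["name"] for a in assets if contains(a, "sodium", fields=["name"])]

# Range search on the date of publication.
dated = [a["name"] for a in assets
         if in_range(a, "published", date(2017, 1, 1), date(2019, 6, 11))]

# Assets with no value for a selected property.
no_author = [a["name"] for a in assets if is_missing(a, "author")]
```

In this sketch the "anywhere" search finds two assets, while restricting the search to the name finds only one. A real knowledge base would answer such queries from an index rather than by scanning every asset.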


Lexically “something like…”


Beyond such basics, we’d like the ability to search for inexact matches. How close can an asset’s value be to the search term to be considered a match?


Wildcard search is familiar to many:

  • The search term “*scope” (“*” = multiple letter replacement) returns “telescope” and “endoscope” but not “scoped.”

  • “Microscop%” (“%”= single letter replacement) returns “microscope” and “microscopy” but not “microscopist.”
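One way to picture how such wildcards behave is to translate them into a regular expression. The sketch below assumes the two symbols used above (“*” for any run of letters, “%” for exactly one letter) and assumes that a match must cover the whole word; the function names are hypothetical, and real products may use different symbols and rules.

```python
import re

def wildcard_to_regex(pattern):
    """Translate '*' (any run of letters) and '%' (exactly one letter) into a regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(r"\w*")   # zero or more word characters (an assumption)
        elif ch == "%":
            parts.append(r"\w")    # exactly one word character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)

def wildcard_search(pattern, words):
    """Return the words that match the wildcard pattern in full."""
    rx = wildcard_to_regex(pattern)
    return [w for w in words if rx.fullmatch(w)]

words = ["telescope", "endoscope", "scoped",
         "microscope", "microscopy", "microscopist"]

star_hits = wildcard_search("*scope", words)
percent_hits = wildcard_search("Microscop%", words)
```

Here “*scope” matches “telescope,” “endoscope” and “microscope” but not “scoped,” and “Microscop%” matches “microscope” and “microscopy” but not “microscopist,” because “%” stands for one letter only.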

A “Fuzzy search” can be tuned to return “grips,” “grasp” and “grab” when given the search term “grip.” Such fuzziness is defined in terms of “edit distance,” which specifies how many letters must be added, removed or changed in a word to match the search term.
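Edit distance is easy to compute, and seeing it in code shows what “tuning” means in practice. The sketch below uses the classic Levenshtein distance (additions, removals and changes each cost one edit); whether a given search product uses exactly this measure is an assumption, and the function names and word list are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest single-letter additions, removals or changes."""
    prev = list(range(len(b) + 1))          # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # remove a letter from a
                            curr[j - 1] + 1,      # add a letter to a
                            prev[j - 1] + cost))  # change a letter (or keep it)
        prev = curr
    return prev[-1]

def fuzzy_search(term, words, max_distance):
    """Return the words within `max_distance` edits of the search term."""
    return [w for w in words if edit_distance(term, w) <= max_distance]

words = ["grips", "grasp", "grab", "gravel"]

strict = fuzzy_search("grip", words, max_distance=1)
loose = fuzzy_search("grip", words, max_distance=2)
```

With a maximum distance of 1, only “grips” is returned; raising it to 2 also admits “grasp” and “grab,” while “gravel” stays out. The trade-off is that a larger distance also lets in more unrelated words, so the threshold needs to be chosen with care.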


A “Proximity search” finds text in which two words occur within “x” words of each other. Seeking text where the word “not” appears within 5 words of the word “successful,” for example, returns assets with phrases such as “not even close to successful” or “not thought to be successful” – both are phrases that a strict search for “not successful” would miss. Limiting the proximity to 5 words avoids including longer articles that happen to contain both words in unrelated places.
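A proximity check can be sketched by splitting text into words and comparing word positions. The version below makes a few simplifying assumptions: words are lowercase runs of letters, digits and apostrophes, the two words may appear in either order, and “within x words” means their positions differ by at most x. The function names are hypothetical.

```python
import re

def tokenize(text):
    """Split text into lowercase words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())

def near(text, word_a, word_b, max_distance):
    """True if word_a and word_b occur within `max_distance` words of each other."""
    tokens = tokenize(text)
    pos_a = [i for i, t in enumerate(tokens) if t == word_a]
    pos_b = [i for i, t in enumerate(tokens) if t == word_b]
    return any(abs(i - j) <= max_distance for i in pos_a for j in pos_b)

close = near("The trial was not even close to successful.",
             "not", "successful", 5)
far = near("It was not clear at first, but after many long months of "
           "revisions the project was successful.",
           "not", "successful", 5)
```

The first sentence matches because “not” and “successful” are four words apart; the second does not, because the two words are separated by most of the sentence.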


The ability to search across multiple languages is critical to many organizations, as is the ability to easily find text that includes misspelled words. When seeking “hierarchy,” for example, one might also like to be shown assets that include the common misspelling “heirarchy.”
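One simple way to catch misspellings is to expand the search term with similar-looking words that actually occur in the knowledge base. The sketch below uses `difflib.get_close_matches` from the Python standard library, which ranks candidates by a similarity ratio; the vocabulary list and the cutoff of 0.8 are assumptions chosen for illustration, and cross-language search is a separate problem that this does not address.

```python
import difflib

# Hypothetical vocabulary: words that occur somewhere in the knowledge base.
vocabulary = ["heirarchy", "hierarchical", "history", "anarchy"]

# Words similar enough to the search term to be offered as additional matches.
suggestions = difflib.get_close_matches("hierarchy", vocabulary, n=3, cutoff=0.8)
```

Lowering the cutoff returns more candidates at the price of more false matches, much like raising the edit distance in a fuzzy search.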


Next...


In this article we have introduced two types of search, lexical and semantic, and focused on introducing key capabilities of lexical search. A future article will introduce semantic search.

[1] Bast, Hannah; Buchhold, Björn; Haussmann, Elmar (2016). "Semantic search on text and knowledge bases". Foundations and Trends in Information Retrieval. 10 (2–3): 119–271.