Distributional Semantics: The Magic of Keyword Analysis
November 8, 2018
Most Internet users understand that some kind of magic happens behind the scenes when they use Google and other search engines to find websites containing the keywords they seek. After all, cyberspace is enormous. Google estimates that it knows of more than 130 trillion web pages across the World Wide Web, including the "Deep Web". While that may sound like an astrophysicist describing the number of stars in the universe, the indexed web alone contains at least 4.5 billion pages on any given day. So, how do these digital giants do it? Simple: they rely on linguistic models dating back to the years after World War II to calculate, collate, and rank the web pages most likely to contain the words entered in a user's search.
During the middle of the twentieth century, the British linguist J.R. Firth drew attention to the context-dependent nature of semantic meaning through his theory of the context of situation and the collocational meaning of words. Today, Firth may be best known for a single line: "You shall know a word by the company it keeps." That observation paved the way for the distributional theory of language, which holds that words used in the same contexts tend to have similar meanings. Stated as the distributional hypothesis, the expanded model suggests that the more semantically similar two words are, the more distributionally similar they will be, and the more they will tend to occur in similar linguistic contexts.
The idea of a positive correlation between distributional and semantic similarity opened the door to numerous applications that make up the heart and soul of the World Wide Web. Latent semantic indexing (LSI) is an indexing and retrieval method that uses a mathematical technique called singular value decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text. After a term-document occurrence matrix is constructed, words are compared by taking the cosine of the angle between the vectors formed by any two rows of the reduced model. LSI overcomes the most problematic constraints of Boolean keyword queries, synonymy and polysemy, by grouping terms that occur in similar contexts under shared latent concepts and by capturing the predominant sense in which a word is used.
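To make this concrete, here is a minimal sketch of LSI in Python with NumPy. The toy corpus, the raw occurrence counts (real systems typically apply tf-idf weighting first), and the choice of k = 2 latent dimensions are illustrative assumptions, not anything Google or Bing actually runs:

```python
# A minimal sketch of latent semantic indexing with NumPy. The toy
# corpus and the choice of k = 2 latent dimensions are illustrative
# assumptions, not what a production search engine actually uses.
import numpy as np

docs = [
    "children shoes sale",
    "kids shoes discount",
    "astronomy stars telescope",
    "telescope stars universe",
]

# Build the term-document occurrence matrix (rows = terms, columns = documents).
vocab = sorted({w for d in docs for w in d.split()})
A = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        A[vocab.index(w), j] += 1

# Singular value decomposition: A = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k strongest latent "concepts".
k = 2
term_vecs = U[:, :k] * s[:k]   # each row represents one term in concept space

def cosine(u, v):
    """Cosine of the angle between two term vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# "children" and "kids" never share a document, but both co-occur
# with "shoes", so LSI places them close together in concept space.
i, j = vocab.index("children"), vocab.index("kids")
print(f"{cosine(term_vecs[i], term_vecs[j]):.2f}")   # ~1.00 in this toy corpus
```

Even in this tiny corpus, "children" and "kids" come out with a cosine similarity of about 1.0: they never share a document, but they keep the same company.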
Since the algorithms take a strictly mathematical approach to indexing, LSI is inherently independent of the language of a web page, and the text does not need to be in sentence form, so keyword occurrences in lists can be indexed as well. In the end, websites that take full advantage of placing LSI keywords in the right locations gain a competitive edge, because the search engines understand that a website for children's shoes also sells kids' shoes, even if the latter phrase never appears anywhere on that website; the sketch below shows this retrieval in action. Although the math may seem like rocket science, all you need to do is feed the algorithms the words the spiders seek and let Google and Bing work their magic. Users around the globe have instant access and can connect with you online from anywhere, at any time. And nobody has to apply matrix algebra or solve linear equations to get results.
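For the retrieval side, here is a matching sketch under the same toy assumptions: a raw query is folded into the concept space with the classic LSI fold-in projection and compared against the document vectors by cosine similarity. Again, the corpus and k = 2 are illustrative choices, not a real engine's configuration:

```python
# A minimal sketch of LSI retrieval: folding a query into concept
# space and ranking documents by cosine similarity. The toy corpus
# and k = 2 are illustrative assumptions, not production values.
import numpy as np

docs = ["children shoes sale",
        "kids shoes discount",
        "telescope stars universe"]
vocab = sorted({w for d in docs for w in d.split()})

# Term-document occurrence matrix.
A = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        A[vocab.index(w), j] += 1

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_vecs = Vt[:k].T            # rows of V_k: the documents in concept space

def fold_in(query):
    """Project a raw query into the k-dimensional concept space
    using the classic LSI fold-in: q_hat = Sigma_k^-1 @ U_k.T @ q."""
    q = np.zeros(len(vocab))
    for w in query.split():
        if w in vocab:
            q[vocab.index(w)] += 1
    return (q @ U[:, :k]) / s[:k]

q_hat = fold_in("kids shoes")
sims = doc_vecs @ q_hat / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_hat))
for doc, sim in sorted(zip(docs, sims), key=lambda t: -t[1]):
    print(f"{sim:+.2f}  {doc}")
```

Here the query "kids shoes" matches the children's-shoes document with a cosine of 1.0, even though the word "kids" never appears in it. That is the synonymy handling described above, in about thirty lines of linear algebra.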