Way back in 1991, Tim Berners-Lee, then a young English software developer working at CERN in Geneva, Switzerland, came up with an intriguing way of combining a communication protocol for retrieving content (HTTP) with a descriptive language for embedding links to that content into documents (HTML). Shortly thereafter, as more and more people began to create content on these new HTTP servers, some kind of mechanism for finding this content became necessary.
Simple lists of content links worked fine when you were dealing with a few hundred documents over a few dozen nodes, but the need to create a specialized index as the web grew led to the first automation of catalogs, and by extension led to the switch from statically retrieved content to dynamically generated content. In many respects, search was the first true application built on top of the nascent World Wide Web, and it is still one of the most fundamental.
WebCrawler, Yahoo, AltaVista, Google, Bing, and so forth emerged over the course of the next decade as progressively more sophisticated “search engines”. Most were built on the same principle – a particular application known as a spider would retrieve a web page, then read through the page to index specific terms. An index in this particular case is a look-up table, taking particular words or combinations of words as keys that are then associated with a given URL. When a term is indexed, the resulting link is weighted based upon various factors that in turn determine the search ranking of that particular URL.
One useful way of thinking about an index is that it takes the results of very expensive computational operations and stores them so these operations need to be done infrequently. It is the digital equivalent of creating an index for a book, where specific topics or keywords are mentioned on certain pages, so that, rather than having to scan through the entire book, you can just go to one of the page numbers of the book to get to the section that talks about “search” as a topic.
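As a rough illustration of the look-up table described above, here is a minimal sketch of an inverted index. The page URLs and text, the function names, and the use of a raw occurrence count as the weight are all assumptions made for the example; real search engines weigh many more factors.

```python
import re
from collections import defaultdict

# Hypothetical pages standing in for content fetched by a spider.
PAGES = {
    "http://example.org/a": "Search is the first true application of the web",
    "http://example.org/b": "A spider retrieves a web page and indexes its terms",
    "http://example.org/c": "An index for a book lists topics by page number",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(pages):
    """Map each term to {url: occurrence count}; the count is a crude weight."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term][url] = index[term].get(url, 0) + 1
    return index


def search(index, term):
    """Return the URLs containing the term, highest count first, ties by URL."""
    postings = index.get(term.lower(), {})
    return sorted(postings, key=lambda url: (-postings[url], url))


# The expensive pass over every page happens once, up front.
INDEX = build_index(PAGES)

print(search(INDEX, "web"))
print(search(INDEX, "index"))
```

The cost of reading every page is paid once in `build_index`; each later call to `search` is a dictionary look-up. Note that a query for "index" finds only the book page, because "indexes" is stored under a different key.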
There are issues with this particular process, however. The first is syntactical – there are variations of words that are used to specify different modalities of comprehension. For instance, you have inflected verb forms – “perceives”, “perceived”, “perceiving”, and so on – that indicate different forms of the word “perceive” based upon how they are used in a sentence. The process of identifying that these are all variations of the same base is called stemming, and even the most rudimentary search engine does this as a matter of course.
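A deliberately naive sketch of stemming follows. The suffix list and the `naive_stem` name are assumptions for illustration only; production systems use more careful rule sets (the Porter stemmer is a well-known example) or dictionary-based lemmatization.

```python
# Suffixes are checked in order, so "es" is tried before the shorter "s".
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Reduce a word to a crude stem by stripping one common English suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip when at least three characters of stem would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # Drop a trailing "e" so that "perceive" and "perceived" share a stem.
    if word.endswith("e") and len(word) > 3:
        word = word[:-1]
    return word


for form in ("perceive", "perceives", "perceived", "perceiving"):
    print(form, "->", naive_stem(form))
```

All four forms reduce to the same key, "perceiv", so an index keyed on stems rather than raw words would treat them as one term. A stem does not have to be a real word; it only has to be consistent.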
A second issue is that phrases can change the meaning of a given word: Captain America is a superhero, Captain Crunch is a cereal. A linguist would properly say that both are in fact “characters”, and that most languages will omit qualifiers when context is known. Significantly, Captain Crunch the character (who promotes the Captain Crunch cereal) is a fictional man dressed in a dark blue and white uniform with red highlights. But then again, this also describes Captain America (and to make matters even more intriguing, the superhero also had his own cereal at one point).
This ambiguity of semantics and reliance upon context has generally meant that, even if documents have a consistent underlying structure, straight lexical search has an upper limit of relevance. Such relevance can be thought of as the degree to which the found content matches what the searcher was specifically looking for.
This limitation is an important point to consider – straight keyword matching obviously has a higher degree of relevance than a purely random retrieval, but after a certain point, lexical searches must be able to provide a certain degree of contextual metadata. Moreover, search systems need to infer the contextual cloud of sought metadata that the user has in his or her head, usually by analysis of previous search queries made by that individual.
There are five different approaches to improving the relevance of such searches:
- Employ Semantics. Semantics can be thought of as a way to index “concepts” within a narrative structure, as well as a way of embedding non-narrative information into content. These embedded concepts provide ways of linking and tagging common conceptual threads, so that the same concept can link related works together. It also provides a way of linking non-narrative content (what’s typically thought of as data) so that it can be referenceable from within narrative content.
- Machine Learning Classification. Machine learning has become increasingly useful as a way of identifying associations that occur frequently in topics of a specific type, as well as providing the foundation for auto-summarization – building summary content automatically, using existing templates as guides.
- Text Analytics. This involves the use of statistical analysis tools for the building of concordances, for identifying Bayesian assemblages, and for TF-IDF vectorization, among other uses.
- Natural Language Processing. This bridges the semantic and machine learning approaches, using graphs constructed from partially indexed content in order to extract semantics while taking advantage of machine learning to winnow out spurious connections. Typically such NLP systems do require the development of corpora or ontologies, though word embedding and similar machine-learning-based vectorization tools such as Word2Vec illustrate that the dividing line between text analytics and NLP is blurring.
- Markup Utilization. Finally, most contemporary documents contain some kind of underlying XML representation. Most major office software shifted to zipped-XML content in the late 2000s, and a significant amount of content processing systems today take advantage of this to perform structural lexical analysis.
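To make the TF-IDF vectorization mentioned under Text Analytics a little more concrete, here is a minimal sketch using only the Python standard library. The three sample documents and the function names are assumptions for illustration, and the formula is the plain unsmoothed variant (term frequency times the log of inverse document frequency); library implementations often use smoothed or normalized versions.

```python
import math
import re
from collections import Counter

# Hypothetical mini-corpus echoing the examples used in this article.
DOCS = {
    "hero": "captain america is a superhero",
    "cereal": "captain crunch is a cereal",
    "search": "a search engine builds an index",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(docs):
    """Return {doc name: {term: weight}} using unsmoothed TF-IDF."""
    tokenized = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))

    vectors = {}
    for name, tokens in tokenized.items():
        counts = Counter(tokens)
        vectors[name] = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return vectors


VECTORS = tf_idf(DOCS)
for term, weight in sorted(VECTORS["hero"].items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {weight:.3f}")
```

In this toy corpus, "a" appears in every document and so receives a weight of zero, "captain" appears in two of the three and receives a small weight, and "superhero" appears in only one and receives the largest weight in its document. Terms that distinguish a document are thus pushed ahead of terms that are merely common.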
Arguably, much of the focus in the 2010s was on data manipulation (and speech recognition) at the expense of document manipulation. The market is now ripe for a refocusing on documents and semi-conversational structures, such as meeting notes and transcripts, that cross the chasm between formal documents and pure data structures, especially in light of the rise of screen-mediated meetings and conferencing. The exact nature of this renaissance is still somewhat unclear, but it will likely involve unifying the arenas of XML, JSON, RDF (for semantics), and machine-learning-mediated technologies in conjunction with transformational pipelines (a successor to both XSLT 3.0 and OWL 2).
What does this mean in practice? Auto-transcription of speech content, visual identification of video content, and increasingly automated pipelines for both dynamically generated markup and semantification make most forms of media content more self-aware and contextually richer, significantly reducing (or in many cases eliminating outright) the overhead of manual curation. Tomorrow’s search engines will be able to identify not only the content that most closely matches a set of keywords, but also the part of a video or the point in a meeting where a superhero appeared or an agreement was made.
Combine this with event-driven process automation. When data has associated metadata, not just in terms of features or properties but in terms of conceptual context, that data can ascertain how best to present itself without the need for costly dashboards or similar programming exercises. It can also check itself for internal consistency, and can even establish the best mechanisms for building dynamic user interfaces that pull in new data when needed.
In other words, we are moving beyond search. The query becomes primarily the way that you frame the type of information you seek, and the output takes the resulting data and makes it available in the most appropriate form possible. In general, the conceptual difficulties come down to ascertaining the contexts for all concerned, something we are getting better at doing.
Kurt Cagle is the Community Editor of Data Science Central, and has been an information architect, author and programmer for more than thirty years.