Enterprise Search: Connecting File Data and Knowledge, Part II

If you’re like me, you use the autosuggestion box in Google (or your favorite search engine) to quickly confirm or learn new facts even without having to see the search results. Not sure how to spell the name of that Seattle Seahawks running back? Start entering in the first few letters, “mars,” and Google provides the suggestion “marshawn lynch.” The author of A Tale of Two Cities, Charles something or other?

Google suggests, of course, Charles Dickens. But how the dickens does it know this information? As I pointed out last time, the autosuggestions are based heavily, but not exclusively, on similar queries that are currently being entered by other Googlers.

Google is doing more than just looking at the collective keyword stream. It and other search engine players are also analyzing information from the web pages themselves and, as we’ll soon see, from other resources as well.

If I mistakenly enter “marshall lynch,” Google instead returns results for the football-playing Mr. Lynch. Its algorithms know there’s much more content associated with a slight variation on those keywords, and so conclude that I’m likely interested in the marshawn variant: “showing results for marshawn lynch” is how Google gently reminds me of this.
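
To make the idea concrete, here is a minimal sketch of a “showing results for” check. It is not Google’s actual algorithm: the page counts are made up, and the function name, similarity cutoff, and 10x ratio are my own assumptions. It simply prefers a close spelling variant when far more content backs it.

```python
import difflib

# Hypothetical counts of indexed pages that mention each phrase.
PAGE_COUNTS = {
    "marshawn lynch": 120_000,
    "marshall lynch": 300,
}

def did_you_mean(query, counts=PAGE_COUNTS, cutoff=0.8, ratio=10):
    """Return a close variant with far more supporting content, else the query."""
    # Phrases whose spelling is similar enough to the query (including itself).
    candidates = difflib.get_close_matches(query, list(counts), n=5, cutoff=cutoff)
    if not candidates:
        return query
    # Among the similar phrases, pick the one with the most content behind it.
    best = max(candidates, key=lambda phrase: counts[phrase])
    if best != query and counts[best] >= ratio * counts.get(query, 0):
        return best
    return query

print(did_you_mean("marshall lynch"))
```

A real engine would weigh many more signals, but the shape is the same: similar spelling plus overwhelming content evidence triggers the correction.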

So the web pages form a kind of knowledge base that is also used to guide the autosuggestions. Anyone who’s been following Google’s autosuggestions and search results over the last year may have noticed they’ve become even more brilliant.

Take a Semantic Walk With Me

It almost seems as if Google has an understanding of the meaning behind the search keywords. To see Google’s overall intelligence in action, try entering “charles dickens” in the search box.

Google decides that these keywords refer to a human, who is also an author. It displays an information-rich box to the right of the results showing the picture of this human with the books he wrote.

Or better yet, type in “charles dickens age” and Google returns the great author’s age at the time of his death above the search results.

How the heck does Google know all this?

As non-silicon life forms, we all have a knowledge map of the world in our minds. We know that A Tale of Two Cities is a novel, which is a type of book, and books are associated with authors. Authors are people, etc. Google, by the way, has its own digital version of the knowledge map. Computer scientists refer to it as a semantic schema, which provides the skeletal structure for organizing information.
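
For the coders, here is a toy version of such a knowledge map, sketched as subject-predicate-object triples. The facts come from the paragraph above; the predicate names (is_a, written_by) and helper functions are my own illustrative choices, not Google’s schema.

```python
# Each fact is a (subject, predicate, object) triple; all entries are illustrative.
TRIPLES = [
    ("A Tale of Two Cities", "is_a", "novel"),
    ("novel", "is_a", "book"),
    ("A Tale of Two Cities", "written_by", "Charles Dickens"),
    ("Charles Dickens", "is_a", "author"),
    ("author", "is_a", "person"),
]

def objects(subject, predicate, triples=TRIPLES):
    """Return every object linked to the subject by the given predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]

def types_of(entity, triples=TRIPLES):
    """Follow is_a links upward and collect every type the entity belongs to."""
    found = []
    frontier = objects(entity, "is_a", triples)
    while frontier:
        current = frontier.pop(0)
        if current not in found:
            found.append(current)
            frontier.extend(objects(current, "is_a", triples))
    return found

print(types_of("A Tale of Two Cities"))                # ['novel', 'book']
print(objects("A Tale of Two Cities", "written_by"))   # ['Charles Dickens']
print(types_of("Charles Dickens"))                     # ['author', 'person']
```

Walking the links is what lets a search engine conclude that a query about the novel is also, indirectly, a query about a person.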

In 2013, Google released its Hummingbird update, which introduced semantic search ideas into its algorithms (there’s a nice explanation of it here). This major revamp actually built on its existing Knowledge Graph schema initiative.

Less well known about the Hummingbird update is that it’s based partially on Freebase, a knowledge base built by Metaweb, a startup that Google purchased a few years back.

Freebase developed a complex and, I might add, comprehensive schema that organizes lots of knowledge domains: movies, books, geography, etc.  You can think of it as organized metadata. It’s a network-style database — this is not your dad’s relational database — in which every nugget of knowledge or property is linked to another.

Those who want to get a taste of Freebase can go to its website, which Google has decided to keep alive (for now). If you type in “charles dickens” you’ll see an enormous number of properties, organized into large groupings known as types. Yes, it’s a complicated knowledge map, but it’s rich in Dickens information.

Knowledge geeks who are interested in exploring and querying the knowledge base themselves can play around with Freebase’s MQL (Metaweb Query Language). Warning: steep learning curve!

Automatic Enterprise Knowledge

Getting back to autosuggestions, now we can begin to understand what Google is really doing. It’s using the keywords to search its own knowledge graph, which has incorporated the Freebase data, and then using the semantic map to guide the autosuggestions and filter the content it finds.

Here’s a good example of the power of these schemas. If I type in “citizen kane 1” Google immediately knows I’m referring to the movie, and that the trailing number is related to a numeric property of movies, likely the release date. And in fact, that’s exactly the suggestion it provides: citizen kane 1941. Brilliant (and it read my mind).
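
Here is a rough sketch of how a schema could drive that kind of completion. The tiny entity table, the property names, and the suggest_numeric function are all assumptions of mine for illustration; Google’s real pipeline is certainly more involved.

```python
# Hypothetical mini knowledge base: entity name -> typed properties.
ENTITIES = {
    "citizen kane": {"type": "movie", "release_year": 1941},
    "a tale of two cities": {"type": "book", "publication_year": 1859},
}

def suggest_numeric(query, entities=ENTITIES):
    """Complete a trailing number using the numeric properties of a known entity."""
    text = query.lower().strip()
    suggestions = []
    for name, record in entities.items():
        if not text.startswith(name):
            continue
        # Whatever follows the entity name is treated as a partial number.
        fragment = text[len(name):].strip()
        if not fragment.isdigit():
            continue
        for value in record.values():
            if isinstance(value, int) and str(value).startswith(fragment):
                suggestions.append(f"{name} {value}")
    return suggestions

print(suggest_numeric("citizen kane 1"))  # ['citizen kane 1941']
```

The point is that the number is only meaningful once the keywords have been tied to an entity with known numeric properties.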

How can enterprise search perform similar magic and tap into the corporate knowledge found in file data?

Like Google, enterprise search would look at keyword popularity as a starting point for guiding autosuggestions. In my last post I left, as an open question, what would the parallel be to location, which Google uses to adjust its suggestions?

The answer: groups and departments that are maintained in Active Directory would do a fine job. I’d certainly want my autosuggestions tuned by what others in the technical marketing area are entering.

If Cindy and the rest of the team are searching for “product roadmap spreadsheet 2015,” then a cruder search, “roadmap,” could be expanded based on the crowd wisdom contained in the marketing searches.
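
As a minimal sketch of that idea, assume we have a query log tagged with each searcher’s department (in practice pulled from Active Directory; here it is hard-coded). The log contents, the boost value, and the function name are hypothetical.

```python
from collections import Counter

# Hypothetical query log: (department of the searcher, query text).
QUERY_LOG = [
    ("marketing", "product roadmap spreadsheet 2015"),
    ("marketing", "product roadmap spreadsheet 2015"),
    ("marketing", "roadmap presentation"),
    ("engineering", "roadmap release schedule"),
    ("engineering", "roadmap release schedule"),
    ("engineering", "roadmap release schedule"),
]

def suggest_for_department(keyword, department, log=QUERY_LOG, boost=3.0, limit=3):
    """Rank past queries containing the keyword, favoring the user's own department."""
    scores = Counter()
    for dept, query in log:
        if keyword.lower() in query.lower():
            # Queries from colleagues in the same department count extra.
            scores[query] += boost if dept == department else 1.0
    return [query for query, _ in scores.most_common(limit)]

print(suggest_for_department("roadmap", "marketing")[0])
print(suggest_for_department("roadmap", "engineering")[0])
```

The same crude keyword yields a different top suggestion depending on who is asking, which is the enterprise analogue of Google adjusting by location.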

Enterprise Metadata and Autosuggestions

Of course, metadata would have an important role in making the autosuggestions still smarter. How?  It would be helpful to know the file access activities of all employees and organize those with similar patterns, regardless of which department they belong to, into the same virtual group. By the way, Varonis DatAdvantage, using our Metadata Framework, does just that in making its data ownership recommendations.

These groupings could then adjust autosuggestions in enterprise search. For example, I spend a lot of time accessing files under the sales and competitive research folders.  So I’d want my autosuggestions to be weighted more heavily on the keyword popularity of those employees who share my file access patterns.

Makes sense, right? Those of us in this virtual group would have similar content tastes and so would benefit from each other’s keywords.
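
One simple way to form such virtual groups is to compare the sets of folders people touch. The sketch below uses Jaccard similarity with an arbitrary 0.5 threshold; the users, folders, and queries are invented, and this is my own illustration rather than a description of how DatAdvantage works.

```python
from collections import Counter

# Hypothetical access metadata: user -> folders they regularly open.
ACCESS = {
    "me": {"/sales", "/competitive-research"},
    "cindy": {"/sales", "/competitive-research", "/marketing"},
    "bob": {"/engineering", "/builds"},
}

# Hypothetical search history: user -> past queries.
QUERIES = {
    "cindy": ["competitor pricing 2015", "sales deck"],
    "bob": ["build server logs"],
}

def jaccard(a, b):
    """Overlap between two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def virtual_group(user, access=ACCESS, threshold=0.5):
    """Users whose folder access overlaps enough with the given user's."""
    mine = access[user]
    return sorted(
        other for other, folders in access.items()
        if other != user and jaccard(mine, folders) >= threshold
    )

def peer_keywords(user, access=ACCESS, queries=QUERIES):
    """Queries entered by the user's virtual group, most frequent first."""
    counts = Counter()
    for peer in virtual_group(user, access):
        counts.update(queries.get(peer, []))
    return [query for query, _ in counts.most_common()]

print(virtual_group("me"))
print(peer_keywords("me"))
```

Here Cindy lands in my virtual group because we share two of three folders, so her queries feed my suggestions while Bob’s do not.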

Future Searches

A semantic schema of general knowledge would also boost enterprise search autosuggestion just as it does for the general Internet.

Perhaps this scenario can play out soon:  I need to write a post on updates to data security laws and regulations, of which there’s much content spread out across the file system. Wait, what’s the name of that law involving credit cards, “US credit reporting act” or something?

A well-stocked schema informs the search that ‘US act’ refers to US laws, and suggests the “Fair Credit Reporting Act of 1970.”  An enterprise search would then bring up relevant mentions of the law, including references to FCRA, and then produce a Google-style info box on the regulations.

Pretty cool.

Coders who really want to explore this idea in their own work can try Freebase’s own autosuggest widget.

In my next post, I’ll show how semantic information embedded in unstructured file data can be pulled out to make the enterprise search experience fabulously brilliant.

Image credit: Keith Allison
