The search function is a powerful tool, assuming you have concrete keywords or concepts to find in your data.
And that does not even account for the size of the information you might be searching. Business professionals often work with massive amounts of information. In that circumstance, how would you compile all the keywords you want to search for? That only covers what you already know you're looking for. How do you find hidden insights or patterns in massive amounts of unstructured data when you don't yet know what you're looking for? How do you empower your organization to quickly find relevant information, both known and unknown to you, and gain a competitive edge? IBM Content Analytics solutions do this through both intuitive data visualizations and an interactive user interface.
Three steps to analyzing your content
There are three essential steps to content analytics: information extraction from the unstructured data source, indexing of the data, and data visualization for interactive mining.
Let’s talk about the first step. Technologies such as natural language processing (NLP) can greatly improve information extraction. For instance, suppose a traffic accident report begins with the sentence, “This investigation focused on the intersection crash of a 2009 Nissan Versa and a 2002 Chevrolet Impala.” The system needs to determine whether “2009 Nissan Versa” and “2002 Chevrolet Impala” are car names used in this report. If we have a dictionary of all car names, including manufacturer names and model years, such names are easy to extract.
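The dictionary lookup described above can be sketched in a few lines. This is a toy illustration, not how any particular product implements it: the dictionary entries and the lookup function are assumptions made for the example.

```python
# Toy dictionary-based extraction of car names. The dictionary here is an
# illustrative assumption, not a real car-name resource.
CAR_NAMES = {"Nissan Versa", "Chevrolet Impala", "Toyota Corolla"}

def extract_car_names(text):
    """Return every dictionary entry that appears verbatim in the text."""
    return [name for name in CAR_NAMES if name in text]

report = ("This investigation focused on the intersection crash of a "
          "2009 Nissan Versa and a 2002 Chevrolet Impala.")
print(sorted(extract_car_names(report)))
# → ['Chevrolet Impala', 'Nissan Versa']
```

Exact matching like this is fast and precise, which is exactly why its weakness is coverage: any variation or misspelling not in the dictionary is silently missed.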
However, car names come in so many variations and misspellings that a dictionary is bound to miss some instances. That is why there are two approaches to extracting car names from text documents: the rule-based approach and the machine-learning-based approach.
A rule-based approach is what most content analytics solutions have used in the past. These solutions rely on handcrafted rules for information extraction. In this example, we can create a rule such as “year + manufacturer + car brand.” Given a dictionary of manufacturers and their car brands, such car names can be extracted.
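The “year + manufacturer + car brand” rule can be expressed as a regular expression built from a small manufacturer/brand dictionary. The lists below are assumptions for the sketch; a real deployment would use a much larger resource.

```python
import re

# Illustrative rule-based extractor implementing "year + manufacturer + brand".
# The manufacturer/brand dictionary is an assumption for this example.
MANUFACTURERS = {
    "Nissan": ["Versa", "Altima"],
    "Chevrolet": ["Impala", "Malibu"],
}

# Build one alternation per manufacturer so only valid pairings match,
# e.g. "Nissan Versa" but not "Nissan Impala".
pattern = re.compile(
    r"\b(?:19|20)\d{2}\s+(?:" +
    "|".join(f"{m}\\s+(?:{'|'.join(brands)})"
             for m, brands in MANUFACTURERS.items()) +
    r")\b")

text = ("This investigation focused on the intersection crash of a "
        "2009 Nissan Versa and a 2002 Chevrolet Impala.")
matches = [m.group(0) for m in pattern.finditer(text)]
print(matches)
# → ['2009 Nissan Versa', '2002 Chevrolet Impala']
```

Note how rigid the rule is: “a ’09 Versa” or “Nisan Versa, 2009” would not match, which previews the limitations discussed next.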
Rules are concrete and specific, and they help when we know what we’re looking for. But what happens if we can’t define the parameters concretely, or don’t know exactly what we’re looking for? A rule assumes that every record we examine follows the pattern we defined for extracting car names (“year + manufacturer + car brand”). The problem with rules is that they work until they don’t, or can’t. That is especially true for the more complicated information we want to understand.
IBM Watson Explorer supports the rule-based approach described above, but it can also use the more flexible and powerful machine-learning-based approach. With machine learning, the user does not need resources such as dictionaries or rules; instead, the system learns from large amounts of text examples with annotations. So if we have hundreds of documents with car name annotations, machine learning technologies can create a model for information extraction.
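An annotated training example typically pairs a document with labeled character spans. The schema below is invented for illustration; tools such as Watson Knowledge Studio define their own formats, and the `span` helper is hypothetical.

```python
# Hypothetical "ground truth" record: a document paired with labeled
# character spans. The schema and helper are illustrative assumptions.
def span(text, phrase, label):
    """Locate a phrase in the text and return its annotation span."""
    start = text.find(phrase)
    return {"start": start, "end": start + len(phrase), "label": label}

text = ("This investigation focused on the intersection crash of a "
        "2009 Nissan Versa and a 2002 Chevrolet Impala.")
training_example = {
    "text": text,
    "annotations": [
        span(text, "2009 Nissan Versa", "CAR"),
        span(text, "2002 Chevrolet Impala", "CAR"),
    ],
}
```

Given hundreds of records like this, a sequence-labeling model can learn contextual cues (a four-digit year followed by capitalized tokens, surrounding vocabulary, and so on) and generalize to car names that appear in no dictionary.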
IBM Watson Knowledge Studio is a cloud-based offering that can create such machine learning models through an intuitive UI, and it also includes a capable editor for creating example documents, called “ground truth.” These models can then be exported for use as machine-learning annotators in your collections. You can try the free edition of IBM Watson Knowledge Studio here to see how easy it is to create your own models; export options are available in the standard and premium editions.
The cognitive search and content analytics capabilities of IBM Watson Explorer allow your organization to discover hidden insights in both structured and unstructured data. Download the white paper to learn how Watson Explorer empowers your organization to gain swift insights and make better business decisions.