Intelligence from large texts:
transforming unstructured data


Software release

Reuters releases open API for the Calais Web Service
The Calais Web Service enables publishers and other text workers to automatically metatag the people, places, facts and events in their content. The aim is to increase the content's search relevance and accessibility on the Web.

Lexalytics releases Salience 4.0
Lexalytics, Inc. (www.lexalytics.com) has released Salience 4.0. This latest version of the software builds in significant improvements to entity extraction, sentiment scoring and thematic extraction, in addition to several other new features.

Webinars etc

Beyond Buzzword Bingo: Discover the Real Business Value of Search and Intelligence. Archived at Inxight.

On-demand presentations

Applying Text Analytics Solutions for Effective Claims Analysis. Attensity

Google/Inxight webinar. June 20. Slides available from Inxight.

Text Analytics Conference

The 7th Annual Text Analytics Summit was held in Boston in May. Some info on Text Analytics Summit West is available.


What is Text Analytics?

6 AUGUST 2006

Data crunched by companies and government agencies is typically quantitative. These numbers are manipulated within relational databases to yield useful information. However, the intelligence potentially available to organizations is much larger than what is garnered from these traditional sources. Note the phrase “potentially available”. How do we get access to this vast potential resource? The problem is that useful business intelligence is buried within large amounts of text data, such as company documents, emails, customer survey reports, and so on. Text documents are structured for reading by people, but they are unstructured as far as data extraction is concerned. The essence of text analytics is to take very large unstructured text documents and extract useful business intelligence.

Before examining text analytics in more detail, let’s consider a range of ways to extract data from large texts. We can distinguish two broad approaches: queries and transformations.

Queries. One way to extract information from large texts is to formulate a query. Once a query is specified, software routines trawl through the text to produce a response. A response may be as simple as a list of all instances in which the words “IBM” and “UIMA” occur within a certain span of each other, say 10 words or so. Queries and responses may be more complex than this, but what characterizes the approach is that you have to specify the query yourself. To formulate a good query, you have to know what you want to know, and then decide how to structure the query, within the constraints of the query software, to obtain the desired results. You also have to make assumptions about the kind of information contained in the text documents.
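
As an illustration of the simple query described above, here is a minimal sketch in Python. It assumes a plain-text input and whitespace/punctuation tokenization; the function name proximity_query and the sample text are hypothetical and chosen only for this example, not taken from any particular product.

```python
import re

def proximity_query(text, term_a, term_b, window=10):
    """Return the word spans in which term_a and term_b both occur
    inside a span of at most `window` words.

    Matching is exact and case-sensitive, which suits acronyms
    such as "IBM" and "UIMA".
    """
    words = re.findall(r"\w+", text)
    positions_a = [i for i, w in enumerate(words) if w == term_a]
    positions_b = [i for i, w in enumerate(words) if w == term_b]

    hits = []
    for i in positions_a:
        for j in positions_b:
            # A span covering both words has length abs(i - j) + 1.
            if abs(i - j) < window:
                start, end = min(i, j), max(i, j)
                hits.append(" ".join(words[start:end + 1]))
    return hits

# Hypothetical sample text, for illustration only.
sample = (
    "IBM contributed the UIMA framework to open source. Later, many other "
    "unrelated words follow here in this long filler sentence about nothing "
    "before UIMA appears again."
)
print(proximity_query(sample, "IBM", "UIMA", window=10))
```

Note that the caller has to supply both terms and the window size in advance, which is exactly the point made above: a query only finds what you already knew to ask for.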

Transformations. A query can be considered to be a request to reveal specified data patterns hidden within a text. An alternative way to deal with texts is to give a request along the lines of: “transform yourself to reveal interesting data patterns”. A simple example of this notion might be a request for a summary of a document. Following this transformation metaphor, running summarization software can be viewed as asking a document to transform itself into a summary.
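
To make the summary example concrete, here is a minimal sketch of one very simple transformation: a frequency-based extractive summarizer. This is an assumption chosen for illustration, not the method used by any summarization product mentioned on this page. The stopword list, the function name summarize, and the sample document are all hypothetical.

```python
import re
from collections import Counter

# A deliberately tiny, hypothetical stopword list for the example.
STOPWORDS = {
    "the", "a", "an", "and", "of", "to", "in", "is", "was", "from",
    "into", "for", "on", "that", "it", "as", "are", "by", "with", "be",
}

def content_words(text):
    """Lowercase the text and keep only non-stopword tokens."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]

def summarize(text, max_sentences=1):
    """Keep the sentences whose words are most frequent in the whole text."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        # Average word frequency, so long sentences are not favored.
        return sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])  # restore original sentence order
    return " ".join(sentences[i] for i in keep)

# Hypothetical sample document, for illustration only.
document = (
    "Text analytics extracts intelligence from text. "
    "The weather was pleasant. "
    "Analytics software turns text into data."
)
print(summarize(document, max_sentences=1))
```

Unlike the query sketch, nothing here names a topic in advance. The caller says only how long the summary should be, and the word frequencies in the document decide which sentences survive.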

Both queries and transformations are useful and have their place. One interesting aspect of a transformation approach is that few assumptions are made about the content of the data patterns in the texts. If you want a broad picture of the content of texts, then in adopting a transformational approach, you are giving the data patterns a chance to reveal themselves. If, on the other hand, you know you want to find out about IBM and UIMA, then a query is the right way to go. You know what you are looking for and you know which entities are relevant.

Text Analytics Case Studies 

6 AUGUST 2006

Finding the best reviewers for particular grant applications (pdf) Content Analyst

© 2009 Michael Barlow