MULTI-LINGUAL AND CROSS-LINGUAL TEXT ANALYSIS

CAAT provides language analytics on a number of levels. As a fully Unicode®-compliant platform, CAAT is multilingual and supports any language that can be represented in the Unicode encoding system. This means that searches can be performed in any such language on text written in that same language, a feature that is becoming increasingly important in a global business world.

CAAT also includes a Primary Language Identification (PLI) module, which identifies the primary language in which a document is written through a unique application of CAAT’s concept-based categorization capabilities. This greatly speeds workflows where documents need to be sorted before further processing, such as Machine Translation. Threshold settings allow PLI to focus on only the primary language in a document, and even to place documents with heavily mixed language usage into an “unidentified” bucket.
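
As a rough illustration of the thresholding described above, the Python sketch below assumes that some upstream step has already produced a per-language score for a document. It is not CAAT’s implementation, which this page does not describe; the function name, the score dictionary, and the 0.6 default threshold are all hypothetical.

```python
from typing import Dict


def primary_language(scores: Dict[str, float], threshold: float = 0.6) -> str:
    """Pick the dominant language, or "unidentified" if none dominates.

    `scores` maps a language name to a non-negative score (hypothetical
    input; how such scores are produced is outside this sketch).
    `threshold` is the minimum share of the total score that the top
    language must hold (assumed default).
    """
    total = sum(scores.values())
    if total <= 0:
        return "unidentified"
    language, best = max(scores.items(), key=lambda item: item[1])
    # Heavily mixed documents fall below the threshold and are set aside.
    return language if best / total >= threshold else "unidentified"


mostly_english = primary_language({"English": 9.0, "French": 1.0})
evenly_mixed = primary_language({"English": 5.0, "French": 5.0})
```

Here `mostly_english` is "English" and `evenly_mixed` is "unidentified". Raising the threshold sends more mixed-language documents to the “unidentified” bucket; lowering it makes the sorter more permissive.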

CAAT also works across languages, offering a unique and compelling crosslingual language analytics capability. The same mechanism that allows CAAT to train itself to identify concepts across an index of documents is employed to train CAAT to correlate languages with one another. Through the use of parallel corpora (sets of the same documents that have already been translated into different languages), CAAT learns how a given concept is expressed across languages. This training is a simple indexing function. Once trained, CAAT can search for concepts across languages: it can use a concept expressed in English to find similar concepts in French, German, and Spanish, with no translation required. Other CAAT functions, such as Concept-based Categorization and Dynamic Clustering, also operate in this crosslingual mode.
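
The idea of learning a shared concept space from parallel documents can be sketched in a few lines of Python. This page does not describe CAAT’s internals, so the sketch substitutes a well-known stand-in technique: latent semantic indexing over merged translation pairs, using NumPy. The toy corpus, the function names, and the choice of three latent dimensions are assumptions made for illustration only.

```python
import numpy as np

# Toy parallel corpus (illustrative data): each pair is the same sentence
# in English and French.
PARALLEL = [
    ("the cat drinks milk", "le chat boit lait"),
    ("the dog eats meat", "le chien mange viande"),
    ("the cat eats fish", "le chat mange poisson"),
    ("the dog drinks water", "le chien boit eau"),
]

# "Training" is just indexing: merge each translated pair into one document
# so that words from both languages share the same document columns.
merged = [f"{en} {fr}" for en, fr in PARALLEL]
vocab = sorted({word for doc in merged for word in doc.split()})
row = {word: i for i, word in enumerate(vocab)}


def bag(text):
    """Term-count vector for a text; words outside the vocabulary are ignored."""
    vec = np.zeros(len(vocab))
    for word in text.split():
        if word in row:
            vec[row[word]] += 1.0
    return vec


# Term-by-document matrix and its truncated SVD (the shared concept space).
A = np.column_stack([bag(doc) for doc in merged])
U, _, _ = np.linalg.svd(A, full_matrices=False)
K = 3  # number of latent dimensions kept (assumed value)
U_k = U[:, :K]


def embed(text):
    """Project a text in either language onto the K latent dimensions."""
    return U_k.T @ bag(text)


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def search(query, documents):
    """Rank documents by similarity to the query, best match first."""
    q = embed(query)
    scored = [(cosine(q, embed(doc)), doc) for doc in documents]
    return sorted(scored, reverse=True)


# An English query run against French-only documents, with no translation.
french_docs = [fr for _, fr in PARALLEL]
results = search("cat", french_docs)
```

With this toy data, the English query "cat" ranks the two French sentences containing "chat" well above the two about "chien", even though the query word appears in none of the French documents. The link comes entirely from the co-occurrence of "cat" and "chat" in the merged training pairs.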

Content Analyst makes several starter packs of parallel corpora available to help our partners take advantage of this powerful and unique crosslingual language analytics capability:

  • European: based on roughly 2,500 documents from European Union proceedings; includes Danish, German, Greek, English, Spanish, Finnish, French, Italian, Dutch, Portuguese, and Swedish
  • UN: based on roughly 3,000 documents pertaining to UN discussions; includes Arabic, English, French, Russian, Spanish, and Chinese
  • Asian: based on approximately 1,000 news articles; includes English, Japanese, Korean, and Chinese
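
When planning a crosslingual project, it can help to check which starter pack covers a given pair of languages. The Python sketch below simply encodes the language lists above as data; the dictionary layout and the `packs_covering` helper are hypothetical conveniences, not part of any CAAT interface.

```python
# Language coverage of each starter pack, transcribed from the list above.
STARTER_PACKS = {
    "European": {
        "Danish", "German", "Greek", "English", "Spanish", "Finnish",
        "French", "Italian", "Dutch", "Portuguese", "Swedish",
    },
    "UN": {"Arabic", "English", "French", "Russian", "Spanish", "Chinese"},
    "Asian": {"English", "Japanese", "Korean", "Chinese"},
}


def packs_covering(source_language, target_language):
    """Return the names of packs that include both languages (hypothetical helper)."""
    wanted = {source_language, target_language}
    return [name for name, langs in STARTER_PACKS.items() if wanted <= langs]


english_chinese = packs_covering("English", "Chinese")
japanese_german = packs_covering("Japanese", "German")
```

Here `english_chinese` is `["UN", "Asian"]`, while `japanese_german` is an empty list: no single starter pack includes both Japanese and German.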

These parallel corpora are essentially “starter kits” that cover basic language correlation. For CAAT to understand specific technologies and issues in this crosslingual mode, sets of translated documents covering those topics can easily be added to the corpora. CAAT can then be “trained” to understand these specific concepts and topics, and users can search in one language yet find relevant documents written in other languages without prior translation.

 
