Learning relationships between organizations

Business Understanding

Our team had to tackle a case in which large text collections had to be searched efficiently. The case focused on key concepts such as organizations, people, and locations, as well as the relations between them. Machine learning methods can learn from annotated examples how to extract relations expressed in text. These methods, however, need large amounts of expert annotation, which is expensive.

This case explores whether a model can be trained to infer such relations from a distantly labeled dataset that was generated automatically via DBpedia.

Data Understanding

The provided dataset had the following structure:

Company 1 | Company 2 | TextSnippet | Isparent
Centene Corporation | Health Net | Centene closed the deal with Health Net Inc. | True
Health Net | Centene Corporation | Centene closed the deal with Health Net Inc. | False
Aetna Inc. | Health Net | Aetna and Health Net are competitors. | False

The full set contained roughly 89,000 records and was heavily skewed towards negative examples, so we knew we had to account for the class imbalance.
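The post does not say which remedy was used for the skew. One common option is to weight each class inversely to its frequency; the sketch below illustrates that heuristic on hypothetical toy labels and is not necessarily what the team did.

```python
from collections import Counter

def class_weights(labels):
    """Return weights inversely proportional to class frequency.

    Uses the common 'balanced' heuristic: n_samples / (n_classes * count).
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Hypothetical toy labels (True = parent relation), skewed towards negatives.
toy_labels = [False] * 8 + [True] * 2
weights = class_weights(toy_labels)
print(weights)
```

With these toy labels the rare positive class receives a weight four times larger than the negative class, so each positive example contributes more to the loss.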

We also plotted a histogram of snippet lengths to get a feel for the distribution and to adapt our models’ inputs accordingly.
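A minimal sketch of such a length analysis, assuming snippets are split into words on whitespace. The function names, the bin width, and the toy snippets are illustrative assumptions.

```python
from collections import Counter

def length_histogram(snippets, bin_width=25):
    """Count snippets per word-length bin; keys are the lower bin edges."""
    hist = Counter()
    for text in snippets:
        n_words = len(text.split())
        hist[(n_words // bin_width) * bin_width] += 1
    return dict(sorted(hist.items()))

def coverage(snippets, max_len):
    """Fraction of snippets that fit within max_len words without truncation."""
    fits = sum(1 for text in snippets if len(text.split()) <= max_len)
    return fits / len(snippets)

# Toy snippets: two short ones from the sample table and one 30-word filler.
toy_snippets = [
    "Centene closed the deal with Health Net Inc.",
    "Aetna and Health Net are competitors.",
    "word " * 30,
]
hist = length_histogram(toy_snippets)
print(hist, coverage(toy_snippets, 10))
```

A coverage figure like this is one way to judge how much text a candidate input length would truncate.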

Overall, this was one of the tasks with the highest-quality data.

Data Preparation

Looking at the snippets column and browsing through its contents, we realized we had to do a number of things:

  • General clean-up: remove line breaks, punctuation, etc.
  • Substitute all company names with predefined tokens, so the model can focus solely on the task at hand: extracting and classifying the relationship context
  • Preserve text that supports the relationship we had to model, e.g. Microsoft-owned or Microsoft’s
  • Filter stop words, while carefully keeping those that carry information relevant to the task, e.g. its, his, has
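The steps above can be sketched roughly as follows. The placeholder tokens (`ENTITY_1`, `ENTITY_2`), the stop-word list, and the exact-name matching are all assumptions made for illustration. A real pipeline would also need to handle name variants such as "Centene" for "Centene Corporation".

```python
import re

# Hypothetical placeholder tokens and stop-word list; the post does not name them.
ENTITY_1, ENTITY_2 = "ENTITY_1", "ENTITY_2"
STOP_WORDS = {"the", "a", "an", "and", "is", "of", "with"}  # keeps "its", "his", "has"

def clean_snippet(text, company_1, company_2):
    """Replace company mentions with tokens, strip punctuation, drop stop words."""
    text = re.sub(r"\s+", " ", text)  # collapse line breaks and repeated spaces
    # Substitute the longer name first in case one name contains the other.
    pairs = sorted([(company_1, ENTITY_1), (company_2, ENTITY_2)],
                   key=lambda p: -len(p[0]))
    for name, token in pairs:
        text = re.sub(re.escape(name), token, text, flags=re.IGNORECASE)
    # Keep relation-bearing suffixes ("'s", "-owned") as separate words.
    text = re.sub(r"(ENTITY_[12])['’]s\b", r"\1 's", text)
    text = re.sub(r"(ENTITY_[12])-(\w+)", r"\1 -\2", text)
    # Remove remaining punctuation, but keep apostrophes and hyphens.
    text = re.sub(r"[^\w\s'\-]", " ", text)
    tokens = [t for t in text.split() if t.lower() not in STOP_WORDS]
    return " ".join(tokens)

print(clean_snippet("Skype is a Microsoft-owned service,\nand its users like it.",
                    "Microsoft", "Skype"))
```

Note that "its" survives the filtering while "a", "is" and "and" are dropped, which mirrors the idea of keeping stop words that hint at ownership.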

Overall, we inserted or preserved a total of 329,317 tokens in our ‘clean’ dataset, which consists of 2,152,379 words after processing.

Once we had the clean data, we trained 100-dimensional word2vec embeddings on Wikipedia and then fine-tuned them on our clean set, extending the vocabulary with our special tokens and task-specific corpora.
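The sketch below illustrates only the vocabulary-extension step, in plain Python rather than with the actual embedding library. Tokens missing from the pre-trained vectors get small random vectors that later fine-tuning would refine. The token names, the initialisation scale, and the toy "pre-trained" vectors are assumptions.

```python
import random

def extend_embeddings(embeddings, new_tokens, dim=100, scale=0.1, seed=0):
    """Add randomly initialised vectors for tokens missing from `embeddings`.

    `embeddings` maps token -> list of floats. Existing vectors are left
    untouched; new ones are drawn uniformly from [-scale, scale] and are
    meant to be refined during subsequent fine-tuning.
    """
    rng = random.Random(seed)
    extended = dict(embeddings)
    for token in new_tokens:
        if token not in extended:
            extended[token] = [rng.uniform(-scale, scale) for _ in range(dim)]
    return extended

# Hypothetical stand-in for Wikipedia-trained vectors and special tokens.
pretrained = {"company": [0.0] * 100, "owned": [0.1] * 100}
extended = extend_embeddings(pretrained, ["ENTITY_1", "ENTITY_2", "owned"])
print(sorted(extended))
```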


For building our model, we were mostly influenced by this paper: Context-Aware Representations for Knowledge Base Relation Extraction, Sorokin D., Gurevych I.

Link to the paper: http://bit.ly/2nTzdrO

Link to the paper’s implementation: https://github.com/ukplab

The ideas we used from the paper were:

  • Use an LSTM cell as an encoder
  • Enrich the text input to the model with the entities' positions

Based on these ideas, we built a number of models that each consume two inputs: the snippet data and the entities' positions. Although one of these models closely resembles the baseline LSTM from the paper above, our best-performing models use stacked bidirectional LSTM (bLSTM) cells instead and combine them with an attention layer for additional gains, which is especially helpful given the rather long input sequences.
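To make the two inputs concrete, here is a minimal sketch of how a tokenised snippet could be turned into a padded word sequence plus a parallel sequence of position markers. The marker ids, the padding token, and the `ENTITY_1`/`ENTITY_2` placeholders are assumptions. The length of 75 is the sequence length this post settles on.

```python
MAX_LEN = 75   # input sequence length used in this post
PAD = "<pad>"  # hypothetical padding token

# Hypothetical marker ids: 0 = padding, 1 = other word,
# 2 = first entity, 3 = second entity.
MARKERS = {"ENTITY_1": 2, "ENTITY_2": 3}

def build_inputs(tokens, max_len=MAX_LEN):
    """Return (padded tokens, position markers), both of length max_len."""
    tokens = tokens[:max_len]  # truncate overly long snippets
    markers = [MARKERS.get(tok, 1) for tok in tokens]
    pad = max_len - len(tokens)
    return tokens + [PAD] * pad, markers + [0] * pad

words, positions = build_inputs("ENTITY_1 closed deal ENTITY_2".split())
print(words[:5], positions[:5])
```

In a neural model, the word sequence would typically be mapped to word embeddings and the marker sequence to a small, separate embedding, with the two concatenated per time step before the recurrent encoder.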

Our attention implementation is inspired by this paper: Hierarchical Attention Networks for Document Classification, Yang Z., Yang D., Dyer C., He X., Smola A., Hovy E.
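For readers unfamiliar with that paper, its word-level attention scores each hidden state against a learned context vector and returns a weighted sum of the states. The following plain-Python sketch shows the computation on toy values. It is not our production layer, and the weights here are hand-picked rather than learned.

```python
import math

def attention_pool(hidden_states, W, b, context):
    """Attention pooling over a sequence of hidden-state vectors.

    u_t = tanh(W h_t + b); alpha_t = softmax(u_t . context); s = sum alpha_t h_t
    """
    scores = []
    for h in hidden_states:
        u = [math.tanh(sum(w * x for w, x in zip(row, h)) + bias)
             for row, bias in zip(W, b)]
        scores.append(sum(ui * ci for ui, ci in zip(u, context)))
    # Numerically stable softmax over time steps.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    # Weighted sum of the hidden states, one value per dimension.
    pooled = [sum(a * h[i] for a, h in zip(alphas, hidden_states))
              for i in range(len(hidden_states[0]))]
    return pooled, alphas

# Toy 2-dimensional hidden states for three time steps (hypothetical values).
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]  # identity projection
b = [0.0, 0.0]
context = [1.0, 1.0]
pooled, alphas = attention_pool(H, W, b, context)
print(alphas)
```

In this toy case the third time step aligns best with the context vector and therefore receives the largest attention weight.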

Based on our data analysis, we concluded that our input sequences should be 75 words long, as that captures the distribution we observed well. To reduce overfitting, all our models use dropout.

Apart from the input sequence length, the remaining hyperparameters are the result of a hyperparameter search using TPE (Tree-structured Parzen Estimator) via Hyperas.

The diagram below illustrates the convergence of one of our models during hyperparameter tuning:


Here are the results our ensemble achieves:

 Negative Recall: 0.9557852882703778
 Positive Recall: 0.8780861244019139
 Average Recall: 0.9169357063361459
 Accuracy:  0.9329775280898877
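The reported average recall is the mean of the negative and positive recall, which makes it robust to the class skew. A minimal sketch of how these four figures can be computed from binary labels follows. The labels are hypothetical toy values, not our test set.

```python
def recall_metrics(y_true, y_pred):
    """Compute per-class recall, their mean, and accuracy for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    n_pos = sum(1 for t in y_true if t)
    n_neg = len(y_true) - n_pos
    pos_recall = tp / n_pos
    neg_recall = tn / n_neg
    return {
        "negative_recall": neg_recall,
        "positive_recall": pos_recall,
        "average_recall": (pos_recall + neg_recall) / 2,
        "accuracy": (tp + tn) / len(y_true),
    }

# Hypothetical toy labels (True = parent relation).
y_true = [True, True, False, False, False, False]
y_pred = [True, False, False, False, False, True]
metrics = recall_metrics(y_true, y_pred)
print(metrics)
```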


Our ensemble can easily be pruned and distilled into a single, resource-efficient model that’s deployable and servable at scale.

Tech Stack

To solve this case and build the models described above, we’ve extensively used the following frameworks and tools:

  • Keras
  • TensorFlow
  • Pandas
  • Hyperas
  • Gensim
  • PySpark

Twitter Sentiment Analysis

This demo was created by Centroida in response to a specific request from two financial institutions. Both banks were interested in conducting real-time sentiment analysis of textual data at scale.

Two separate engineering challenges have been addressed during the development of the demo:

  1. Building an NLP model that achieves near state-of-the-art results in sentiment analysis
  2. Building a high-performance, scalable data pipeline that uses the model and provides real-time analytics over Twitter data

Based on a standardized dataset in this area (SemEval 2017), we achieve near state-of-the-art results of 67.23% validation accuracy and 65.24% recall. To put this in perspective, Microsoft Azure’s Text Analytics API, evaluated on the same dataset, scores about 54% in accuracy and 57% in recall.

Our pipeline doesn’t fall short either. To simulate load, we loaded 336 million tweets and processed them in batches, inferring each tweet’s sentiment with our model. The current demo achieves an average throughput of about 50,000 tweets per second, a number that can be scaled further given sufficient resources.
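As a rough back-of-the-envelope check, and assuming the average throughput holds over the whole batch, the figures above imply a total processing time of just under two hours:

```python
TWEETS = 336_000_000   # tweets loaded, from the text
THROUGHPUT = 50_000    # average tweets per second, from the text

seconds = TWEETS / THROUGHPUT
minutes = seconds / 60
print(f"{seconds:.0f} s = {minutes:.0f} min")
```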

Even though the demo is currently tailored to Twitter data, the underlying approach can be applied to any textual dataset, e.g. Facebook feeds, news feeds, user reviews, comments, etc. As a result, the approach applies to, and can deliver significant value across, various domains.