Concept Drift: Notes for the Practitioner

In this article, I share notes on handling concept drift for machine learning models. Concept drift occurs in an online supervised learning setting when the relationship between the input data X and output data y changes to the extent that a model mapping X to y can no longer do so with the same efficacy. In online supervised learning, three types of drift can occur: (1) feature drift, i.e. a change in the distribution of X, (2) real concept drift, i.e. a change in the relationship between X and y, or p(y|X), and (3) a change in the prior distribution p(y), e.g. new classes arriving. While both feature and prior distribution changes may be worth monitoring, for purposes that extend beyond understanding changes in the problem space, it is real concept drift that we are chiefly concerned with. Consider the following scenario: Google decides to increase the price of their flagship Android devices by 20%, making them more appealing to certain segments and less appealing to others, who will move to lower-end versions of the brand or switch brands altogether to acquire devices within the original price range. As a result, the distribution of users signing up from specific mobile platforms may change. This would be a feature drift. If, however, this change is not sufficient to cause the model to err in its ability to, for example, predict user retention or quality of experience, because despite the shifts in demographics most users remain within the same device price range and therefore have a similar initial experience, there may very well be no real concept drift. ...
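As a rough illustration of that distinction (a sketch of my own, not code from the article), the snippet below simulates a shift in p(X) that leaves p(y|X) intact versus a change in p(y|X) itself, and shows how only the latter degrades a trained model. The data-generating process, the logistic model, and all parameters are assumptions made for the example.

```python
# Minimal sketch: feature drift (p(X) shifts) vs real concept drift (p(y|X) changes).
# The distributions and model below are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(n, x_mean, flip_relation=False):
    """Binary outcome driven by one feature.

    x_mean controls p(X); flip_relation alters p(y|X), i.e. real concept drift.
    """
    x = rng.normal(loc=x_mean, scale=1.0, size=(n, 1))
    logits = -x.ravel() if flip_relation else x.ravel()
    y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logits))).astype(int)
    return x, y

# Train on the original distribution.
x_train, y_train = make_data(5_000, x_mean=0.0)
model = LogisticRegression().fit(x_train, y_train)

# Feature drift only: p(X) shifts, p(y|X) is unchanged -> accuracy holds up.
x_feat, y_feat = make_data(5_000, x_mean=1.0)
print("accuracy under feature drift:", model.score(x_feat, y_feat))

# Real concept drift: p(y|X) changes -> accuracy degrades.
x_real, y_real = make_data(5_000, x_mean=0.0, flip_relation=True)
print("accuracy under real concept drift:", model.score(x_real, y_real))
```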

October 20, 2018 · guidj

Understanding Data

Rich Metadata on Data. I have been learning many things about dealing with data at a large scale. At first, I kept using the term quality to describe the state of data. However, it quickly became clear that the term had various dimensions to it, and it could not summarise the issues one can observe. I have come to use the expression understanding data instead, because (1) it captures the state I wish to describe and (2) it speaks to the scientific and functional purposes of that state. ...

April 3, 2018 · guidj

Indexing UIMA Annotated Docs, with Solr

In this post, I’m going to walk you through the process of indexing UIMA annotated documents with Solr. In a previous post, Finding Movie Stars, I demonstrated how we can use UIMA to find and tag structured information in unstructured data. In most scenarios, once we have extracted that data, we want to be able to query it. To do this, we can put our data into a data store, be it an RDBMS, document store, graph, or something else. Likewise, it is also very common to need some kind of search capability in our system, so that we or our users can find relevant information. This is what we’re going to do. Having UIMA annotations with information on which directors and actors are mentioned in a review, we want to be able to search for reviews that mention specific actors or directors, or simply search for reviews that mention screenwriters. ...
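For a sense of the end goal (my own sketch, not from the post), once annotations such as actors are stored as fields in a Solr index, retrieval becomes a simple field query over the select endpoint. The core name (`reviews`) and field names (`actor`, `text`) below are placeholder assumptions; the post walks through the actual schema and indexing steps.

```python
# Minimal sketch of querying a Solr index of annotated reviews over HTTP.
# Core and field names are assumptions for illustration.
import requests

SOLR_SELECT = "http://localhost:8983/solr/reviews/select"

params = {
    "q": 'actor:"Morgan Freeman"',  # reviews whose annotations mention this actor
    "fl": "id,text",                # fields to return
    "wt": "json",
    "rows": 10,
}

response = requests.get(SOLR_SELECT, params=params, timeout=10)
response.raise_for_status()
for doc in response.json()["response"]["docs"]:
    print(doc["id"])
```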

January 7, 2016 · guidj

Finding Movie Stars: Named Entity Recognition with UIMA & OpenNLP

In this post, we are going to use the text analysis tools UIMA and OpenNLP to identify film personas, like directors and screenwriters, in a corpus of movie reviews. Warning: working knowledge of Java is necessary for completing this guide. Estimated required time: ~60-90 minutes. Overview of Natural Language Processing. Since the 70s, experts and businesses have realised the potential that lies in gathering, storing, and processing information about their operations to create new value for their organisations and their customers. During the first few decades, though, the focus was on structured data. ...

November 26, 2015 · guidj

Connecting Data: multi-domain graphs

In a post where I talked about modelling data in graphs, I stated that graphs can be used to model different domains and discover connections. In this post, we’ll do an exercise in using multi-domain graphs to find relationships across them. An institution can have data from multiple sources, in different formats, with different structure and information. Some of that data, though, can be related. For example, we can have public data on civil marriages, employment records, and land/property ownership titles. What these data sets would have in common is the identity of individuals in our society, assuming, of course, they come from the same locality or country. In these scenarios, in order to run a cross-domain analysis, we need to find ways to connect the data in a meaningful way, to uncover new information or discover relationships. We could do that to answer questions like “What percentage of married people own property, compared to those who don’t?”, or, more interestingly, “Who recently got married and bought property near area X while changing jobs?”. ...
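To make the idea concrete, here is a small sketch of my own (not from the post) that links toy marriage and property records through shared person identities in a single graph, then answers a cross-domain question from it; the post itself works through the exercise in a graph database. All records and labels are made up for illustration.

```python
# Minimal sketch: connect two toy domains (civil marriages and property titles)
# through shared person identities in one graph, then ask a cross-domain question.
import networkx as nx

g = nx.Graph()

# Domain 1: civil marriages (person <-> person).
for a, b in [("p1", "p2"), ("p3", "p4")]:
    g.add_edge(a, b, relation="MARRIED_TO")

# Domain 2: land/property ownership titles (person <-> property).
for person, lot in [("p1", "lot-17"), ("p5", "lot-42")]:
    g.add_node(lot, kind="property")
    g.add_edge(person, lot, relation="OWNS")

# Cross-domain question: what share of married people own property?
married = {
    n for u, v, d in g.edges(data=True)
    if d["relation"] == "MARRIED_TO" for n in (u, v)
}
owners = {
    n for u, v, d in g.edges(data=True)
    if d["relation"] == "OWNS" for n in (u, v)
    if g.nodes[n].get("kind") != "property"
}
print(f"{len(married & owners) / len(married):.0%} of married people own property")
```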

September 9, 2015 · guidj

12 Steps for a Performant Graph, with Neo4j

In recent posts, I wrote about data stores, specifically about choosing the right one for the right problem domain. I also wrote about modelling data in graphs with Neo4j. In the last post, on modelling graphs, I promised to discuss how we can get good performance out of Neo4j. I will address that in this post by presenting 12 steps you can follow to attain high performance with Neo4j, especially at large data volumes. ...

July 16, 2015 · guidj

HTTP Status Codes Explained: A Daily Life Translation

If you browse the web, I’m willing to bet you’ve encountered an HTTP status code at some point in time. A dreadful 404 when the page is missing; a 301/302 when you’re redirected to another page; or a good old 200 when you actually get to see the page. Well, I decided to translate the meaning of some of the most common HTTP codes into examples that non-techies can relate to. Here we go! ...

July 15, 2015 · guidj

Modelling Graphs, with Neo4j

In an earlier post, I described a non-exhaustive taxonomy of data store types, as well as the types of problem domains each one is best suited for. In this post, I will address some approaches to modelling data in graph data stores, particularly with Neo4j. Graph data stores have been increasingly adopted over the past couple of years in several business domains, ranging from logistics to bio-informatics. Their power lies in their ability to model complex networks and tree structures, with data sets ranging from hundreds to millions of nodes and edges. ...
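As a tiny taste of the modelling style (my own sketch, not code from the post), the snippet below creates two nodes and a relationship and then traverses them, assuming a local Neo4j instance and placeholder credentials.

```python
# Minimal sketch of property-graph modelling against a local Neo4j instance.
# Connection details and credentials are placeholders for illustration.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Model a person directing a film as two nodes joined by a relationship.
    session.run(
        "MERGE (p:Person {name: $name}) "
        "MERGE (f:Film {title: $title}) "
        "MERGE (p)-[:DIRECTED]->(f)",
        name="Ridley Scott", title="The Martian",
    )
    # Traverse the relationship to answer a question about the model.
    result = session.run(
        "MATCH (p:Person)-[:DIRECTED]->(f:Film) RETURN p.name AS name, f.title AS title"
    )
    for record in result:
        print(record["name"], "directed", record["title"])

driver.close()
```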

June 8, 2015 · guidj

Data Store Types, and their Modelling Use Cases

This post will list a non-exhaustive taxonomy of data store types, and outline how they can be used to model different problem domains. Data Modelling. In database design, modelling can be defined as the process of mapping the entities and events of a particular domain into a representational format that can be stored in a database. The goal is to be able to answer relevant questions with the data once it is stored. Accordingly, when we model our domain, we must think of data in the way that it will be processed, as opposed to presented; and depending on what our domain is, some data store types may be more practical in helping us answer certain types of questions than others. Understanding the strengths and weaknesses of each data store type can greatly ease the decision of which is more suitable for the task at hand. ...
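To illustrate "model for how data will be processed" with an example of my own (not from the post), the sketch below represents the same review in two shapes: a self-contained document, which suits fetching a whole review at once, and normalized rows, which suit cross-entity questions such as "which reviews mention a given actor". The records and field names are assumptions for illustration.

```python
# Same domain, two shapes: choose the one that matches the questions to answer.

# Document shape: one self-contained record per review; ideal when the access
# pattern is "fetch the whole review by id".
review_doc = {
    "id": "r1",
    "film": "The Martian",
    "actors": ["Matt Damon", "Jessica Chastain"],
    "text": "A stranded astronaut grows potatoes...",
}

# Relational shape: normalized rows; ideal for cross-entity questions.
reviews = [("r1", "The Martian", "A stranded astronaut grows potatoes...")]
review_actors = [("r1", "Matt Damon"), ("r1", "Jessica Chastain")]

def reviews_mentioning(actor):
    """Return review rows whose annotations mention the given actor."""
    ids = {review_id for review_id, name in review_actors if name == actor}
    return [row for row in reviews if row[0] in ids]

print(reviews_mentioning("Matt Damon"))
```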

June 2, 2015 · guidj