Generating Samples using Synthetic Multivariate Distributions

The problem I address in this post is generating samples from multivariate distributions without having any data. Motivation: generative models are capable of generating new data. Unlike discriminative models, which determine the likelihood of an outcome given a set of input features, $P(Y|X)$, a generative model learns the joint distribution over variables, $P(X,Y)$. In product development, generative models can be used for various use cases, including imputing missing data (e.g. with conditional models), determining the likelihood of an observed sample, or creating random samples of data. The last use case is the focus of this post. ...
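The core idea can be sketched in a few lines of NumPy: specify the parameters of a joint distribution by hand, with no data involved, and draw samples from it. The multivariate normal below, and the variable names, are my own illustrative choices, not necessarily the distribution used in the post.

```python
import numpy as np

# Hypothetical example: define a synthetic joint distribution by hand
# (no observed data) and draw samples from it.
rng = np.random.default_rng(seed=7)

mean = np.array([170.0, 70.0])         # e.g. height (cm), weight (kg)
cov = np.array([[36.0, 18.0],          # chosen covariance encodes the
                [18.0, 25.0]])         # correlation between the variables

samples = rng.multivariate_normal(mean, cov, size=10_000)

# The empirical moments of the samples recover the synthetic parameters.
print(samples.mean(axis=0))   # close to [170, 70]
print(np.cov(samples.T))      # close to cov
```

With enough samples, the empirical mean and covariance converge to the parameters we chose, which is a quick sanity check that the sampler reflects the distribution we specified.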

March 11, 2023 · guidj

A fable of an Aggregator

Many parallel data computing tasks can be solved with one abstract data type (ADT). We will describe how an Aggregator does that by walking through a problem we want to solve with parallelism and uncovering the ideal properties of an ADT that enable us to do so. Relevance of Aggregations: The Desirable ADT. In the world of analytics and machine learning, data processing makes up a significant chunk of the plumbing required to do both. In the world of big data, or medium-sized data for that matter, parallel processing enables efficient use of disparate computing resources. Quite frequently, the data we're referring to is represented by a collection of records. ...
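To make those properties concrete, here is a minimal Python sketch of an Aggregator: a zero element, a step function that folds in one record, and an associative merge so partial results computed on separate partitions can be combined. The names (`zero`, `step`, `merge`) are my own; the post may define the ADT differently.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Any, Callable, Iterable


@dataclass(frozen=True)
class Aggregator:
    """A fold with a zero element and an associative merge."""
    zero: Any
    step: Callable[[Any, Any], Any]    # fold one record into an accumulator
    merge: Callable[[Any, Any], Any]   # combine two partial accumulators

    def run(self, records: Iterable[Any]) -> Any:
        return reduce(self.step, records, self.zero)


# Example: a mean aggregator, whose accumulator is a (count, total) pair.
mean_agg = Aggregator(
    zero=(0, 0.0),
    step=lambda acc, x: (acc[0] + 1, acc[1] + x),
    merge=lambda a, b: (a[0] + b[0], a[1] + b[1]),
)

# Simulate parallelism: aggregate two partitions independently, then merge.
part1 = mean_agg.run([1.0, 2.0, 3.0])
part2 = mean_agg.run([4.0, 5.0])
count, total = mean_agg.merge(part1, part2)
print(total / count)  # 3.0
```

Because `merge` is associative, it does not matter how the records are partitioned or in what grouping the partial results are combined, which is exactly what makes the computation safe to parallelise.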

June 13, 2020 · guidj

COVID19

We are in the middle of a viral outbreak. There are many things that we're learning as we go about this new coronavirus. In the meantime, health organizations from around the world are scrambling to learn as much as they can from the known cases. In Stockholm, where I live, the number of cases has been growing over the past two weeks. During this time, I have found many tools and dashboards that have been created to track the total number of infections. ...

March 15, 2020 · guidj

Understanding Data

Rich Metadata on Data. I have been learning many things about dealing with data at a large scale. At first, I kept using the term quality to describe the state of data. However, it quickly became clear that the term had various dimensions to it, and could not summarise the issues one can observe. I have come to use the expression understanding data instead because (1) it captures the state I wish to describe and (2) it speaks to the scientific and functional purposes of that state. ...

April 3, 2018 · guidj

Effects of severe weather on health and economics in the US

An exploratory analysis of the effects of severe weather on health and economics in the US

February 29, 2016 · guidj

YelpSí: Visualizing Yelpers Daily Activity

Find out what the most popular places yelpers check in to are with YelpSí, a visualization tool that lets you explore Yelpers' past daily check-in activity across different cities in the US and Europe. It was built with Shiny, using the Yelp Dataset Challenge academic dataset.

January 29, 2016 · guidj

Indexing UIMA Annotated Docs, with Solr

In this post, I'm going to walk you through the process of indexing UIMA-annotated documents with Solr. In a previous post, Finding Movie Stars, I demonstrated how we can use UIMA to find and tag structured information in unstructured data. In most scenarios, once we have extracted that data, we want to be able to query it. To do this, we can put our data into a data store, be it an RDBMS, document store, graph database, or other. Likewise, it is very common to need some kind of search capability in our system, so that we or our users can find relevant information. This is what we're going to do: having UIMA annotations with information on which directors and actors are mentioned in a review, we want to be able to search for reviews that mention specific actors or directors, or simply for reviews that mention screenwriters at all. ...

January 7, 2016 · guidj

Finding Movie Stars: Named Entity Recognition with UIMA & OpenNLP

In this post, we are going to use the text analysis tools UIMA and OpenNLP to identify film personas, like directors and screenwriters, in a corpus of movie reviews. Warning: working knowledge of Java is necessary for completing this guide. Estimated time required: ~60-90 minutes. Overview of Natural Language Processing: since the 70s, experts and businesses have recognised the potential in gathering, storing, and processing information about their operations to create new value for their organisations and their customers. In the first few decades, though, the focus was on structured data. ...

November 26, 2015 · guidj

Connecting Data: multi-domain graphs

In a post where I talked about modelling data in graphs, I stated that graphs can be used to model different domains and discover connections between them. In this post, we'll do an exercise in using multi-domain graphs to find relationships across domains. An institution can have data from multiple sources, in different formats, with different structure and information. Some of that data, though, can be related. For example, we can have public data on civil marriages, employment records, and land/property ownership titles. What these data sets have in common is the identity of individuals in our society, assuming, of course, they come from the same locality or country. In these scenarios, in order to run a cross-domain analysis we need to connect the data in a meaningful way, to uncover new information or discover relationships. We could do that to answer questions like "What percentage of married people own property versus those who don't?", or, more interestingly, "Who recently got married and bought property near area X while changing jobs?" ...
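As a toy illustration of the linkage idea, in plain Python rather than a graph database, and with made-up records, two domains can be connected through a shared person identifier to answer the first question above. The record fields and ids here are hypothetical.

```python
# Hypothetical records from two domains, linked by a shared person id.
marriages = [
    {"person_id": 1, "spouse_id": 2, "date": "2015-06-01"},
    {"person_id": 3, "spouse_id": 4, "date": "2015-07-15"},
]
properties = [
    {"person_id": 1, "parcel": "A-17"},
    {"person_id": 5, "parcel": "B-03"},
]

# The person id is the cross-domain link: collect everyone who appears
# in a marriage record, and everyone who appears as a property owner.
married = ({m["person_id"] for m in marriages}
           | {m["spouse_id"] for m in marriages})
owners = {p["person_id"] for p in properties}

# Cross-domain question: what share of married people own property?
share = len(married & owners) / len(married)
print(share)  # 1 of the 4 married people owns property -> 0.25
```

A graph model makes this kind of traversal more natural as the number of domains and relationship types grows, but the principle is the same: a shared identity is the edge that joins otherwise unrelated data sets.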

September 9, 2015 · guidj

12 Steps for a Performant Graph, with Neo4j

In recent posts, I wrote about data stores, specifically about choosing the right one for the right problem domain. I also wrote about modelling data in graphs with Neo4j. In the last post, on modelling graphs, I promised to discuss how we can get good performance out of Neo4j. I will address that in this post by presenting 12 steps you can follow to attain high performance with Neo4j, especially in a large data volume setting. ...

July 16, 2015 · guidj