JsonURI: json serialization and deserialization for logging

May 28, 2016

A while back, while working on the infrastructure of an ecommerce recommendations service provider, we ran into problems handling traffic from our clients in real time. As a simple solution, we decided to send data to logs through AWS S3 by appending HTTP URL parameters to GET requests for a tiny image file, something that should only be done when you’re not dealing with sensitive data. We had one minor issue, though: our JSON objects were complex and nested, with objects inside fields, while the JavaScript and jQuery standard libraries only supported serialization of flat JSON objects. ... Read more
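As a rough illustration of the problem described in this excerpt, here is a minimal Python sketch of one way to flatten a nested object into URL query parameters. It is not JsonURI's actual algorithm: the dotted-key scheme, the function names `flatten` and `to_query_string`, and the pixel URL are all assumptions made for this example.

```python
from urllib.parse import urlencode


def flatten(obj, prefix=""):
    """Flatten nested dicts/lists into a flat dict with dotted keys.

    The dotted-key convention is an assumption of this sketch,
    not necessarily the encoding JsonURI uses.
    Empty nested dicts and lists produce no keys and are dropped.
    """
    flat = {}
    items = obj.items() if isinstance(obj, dict) else enumerate(obj)
    for key, value in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, (dict, list)):
            # Recurse into nested containers, extending the key path.
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def to_query_string(obj):
    """Serialize a nested object as a URL query string."""
    return urlencode(flatten(obj))


event = {"user": {"id": 42, "tags": ["a", "b"]}, "page": "home"}
query = to_query_string(event)

# Hypothetical logging pixel; the host and path are placeholders.
pixel_url = "https://logs.example.com/pixel.gif?" + query
print(pixel_url)
```

A scheme like this only round-trips cleanly if the keys themselves never contain the delimiter, which is one reason a dedicated encoding is worth having.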

YelpSí: Visualizing Yelpers Daily Activity

January 29, 2016

Find out which places Yelpers check in to most with YelpSí, a visualization tool that lets you explore Yelpers’ past daily check-in activity across different cities in the US and Europe. It was built with Shiny, using the Yelp Dataset Challenge academic dataset.

Indexing UIMA Annotated Docs, with Solr

January 7, 2016

In this post, I’m going to walk you through the process of indexing UIMA annotated documents with Solr. In a previous post, Finding Movie Stars, I demonstrated how we can use UIMA to find and tag structured information in unstructured data. In most scenarios, once we have extracted that data, we want to be able to query it. To do this, we can put our data into a data store, be it an RDBMS, document store, graph, or other. ... Read more

Falcon 9

December 31, 2015

Being up late at night, days before the Christmas holidays, crunching through school projects was a long-forgotten memory for me. Being back at university after a period of work, though, made it a less dreaded experience now. Still, I was ready to welcome any distractions. Having been told that this was the night SpaceX would be attempting a miraculous landing, I welcomed it with open arms. After all, watching live rocket launches beat the heck out of figuring out seasonal components for time series at 2 in the morning. ... Read more


December 26, 2015

The art of writing was invented, I suppose, so that we could communicate with the future, i.e. record the past and present, and in the process create history. Software programs, on the other hand, are written for the purpose of defining the future, one that is meant to be interpreted by machines. The tapping of a keypad turns a blank page into a blueprint. It starts with one file, and can quickly grow larger. ... Read more

Finding Movie Stars: Named Entity Recognition with UIMA & OpenNLP

November 26, 2015

In this post, we are going to use the text analysis tools UIMA and OpenNLP to identify film personas, like directors and screenwriters, from a corpus of movie reviews. Warning: Working knowledge of Java is necessary for completing this guide. Estimated required time: ~60-90 minutes. Overview of Natural Language Processing: Since the 70s, experts and businesses have realised the potential that exists in gathering, storing, and processing information about their operations to add and create new value for their organisation and their customers. ... Read more

Connecting Data: multi-domain graphs

September 9, 2015

In a post where I talked about modelling data in graphs, I stated that graphs can be used to model different domains, and discover connections. In this post, we’ll do an exercise in using multi-domain graphs to find relationships across them. An institution can have data from multiple sources, in different formats, with different structure and information. Some of that data, though, can be related. For example, we can have public data on civil marriages, employment records, and land/property ownership titles. ... Read more


September 7, 2015

I travelled recently, and while I waited at a terminal for my connecting flight, I noticed something intriguing about air travel: Airline. Let’s start with definitions. Airline is the language used by airlines to communicate with their customers. As an investigator/scientist at heart, I could not help but take upon myself the selfless duty of documenting this obscure language. And so I paid close attention throughout my trip, made an effort to document and analyse it as much as I could, and I now share with the world my efforts in translating some of the most important terms in Airline into plain English. ... Read more

12 Steps for a Performant Graph, with Neo4j

July 16, 2015

In recent posts, I wrote about data stores, specifically about choosing the right one for the right problem domain. I also wrote about modelling data in graphs, with Neo4j. In the last post, on modelling graphs, I promised to discuss how we can get good performance in Neo4j. I will be addressing that in this post, by presenting 12 steps you can follow to attain high performance when using Neo4j, especially in a large data volume setting. ... Read more