Understanding Data

April 3, 2018 by guidj

Rich Metadata on Data

I have been learning many things about dealing with data at a large scale. At first, I kept using the term quality to describe the state of data. However, it quickly became clear that the term had various dimensions to it, and it could not summarize the issues one can observe. I have come to use the expression understanding data instead because (1) it captures the state I wish to describe and (2) it speaks to the scientific and functional purposes of that state.

Without consolidated information about data, and mechanisms for surfacing aspects such as reliability, changes (variance), and semantic entities in an easily accessible and understandable manner, data consumers are left to rely on inefficient methods to find the data they need (e.g. luck, word of mouth, failing jobs), and producers need to do extensive work to surface their datasets to potentially interested consumers. This results in missed opportunities that can be costly to an organization, as well as company-wide wasted effort for both producers and consumers.

I have come to synthesize these ideas around four dimensions, which I describe next.

Growth

The rate at which your intake of data changes, measured by volume (dataset level) and cardinality (dataset and field level).

Answers questions such as:

  • By what measure has the volume of the logX dataset increased over the past 12 months (knowing you have expanded to new markets)?
  • Will N machines be sufficient to run the same computational job in 4 months?

Insight: capacity planning for infrastructure; understanding user and business growth in relation to infrastructure costs;

Measures: Bytes produced per dataset, changes in cardinality of fields (e.g. user-id, platform, region), number of fields (added and removed), etc.
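
To make this concrete, here is a minimal sketch of what computing such measures could look like, using pandas and entirely hypothetical snapshot data (the dataset name, fields, and figures are made up for illustration):

```python
import pandas as pd

# Hypothetical monthly snapshots of per-dataset metrics; in practice these
# would come from your warehouse's metadata or storage accounting.
snapshots = pd.DataFrame({
    "dataset": ["logX"] * 4,
    "month": pd.to_datetime(["2018-01-01", "2018-02-01", "2018-03-01", "2018-04-01"]),
    "bytes": [100e9, 130e9, 170e9, 230e9],
    "user_id_cardinality": [10_000, 14_000, 19_000, 26_000],
})

# Month-over-month growth of volume and of the user-id field's cardinality.
snapshots = snapshots.sort_values("month")
snapshots["bytes_growth"] = snapshots["bytes"].pct_change()
snapshots["user_id_growth"] = snapshots["user_id_cardinality"].pct_change()

# A naive capacity check: if volume keeps growing at the mean observed rate,
# how much data should we expect in 4 months?
mean_rate = snapshots["bytes_growth"].mean()
projected_bytes = snapshots["bytes"].iloc[-1] * (1 + mean_rate) ** 4
print(snapshots)
print(f"projected bytes in 4 months: {projected_bytes:.3g}")
```

A constant-rate projection like this is only a starting point, of course; the value is in having the time series of volume and cardinality per dataset at all.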

Usage & Dependencies

The consumption pattern of datasets by services, teams and functions (departments/branches), measured against the production rate of that data.

Answers questions such as:

  • What’s the ratio of consumption to production of your inventory summary dataset? Can you produce it less frequently than daily, according to how it’s been used?
  • Which datasets have high utilization and low reliability (high frequency of breaking SLOs)?
  • Which services would be disrupted if you were to do a major replacement of dataset X?

Insight: which datasets are critical and need resources; which ones you can stop producing; what a realistic SLO for a given dataset would be;

Measures: production frequency, scans/day, number of consumers, average lateness, etc.
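
As a sketch, under the assumption that you can collect read (scan) and production events per dataset, the core measures reduce to a few aggregations (the service and dataset names below are hypothetical):

```python
import pandas as pd

# Hypothetical access log (one row per scan) and production log
# (one row per time a dataset is built or refreshed).
reads = pd.DataFrame({
    "dataset": ["inventory_summary"] * 3 + ["logX"] * 6,
    "consumer": ["svc-a", "svc-a", "svc-b",
                 "svc-c", "svc-c", "svc-d", "svc-e", "svc-f", "svc-g"],
})
writes = pd.DataFrame({"dataset": ["inventory_summary"] * 30 + ["logX"] * 2})

usage = pd.DataFrame({
    "reads": reads.groupby("dataset").size(),
    "consumers": reads.groupby("dataset")["consumer"].nunique(),
    "writes": writes.groupby("dataset").size(),
})
usage["consumption_production_ratio"] = usage["reads"] / usage["writes"]

# A low ratio flags a candidate for less frequent production; a high
# consumer count flags a dataset that warrants a stricter SLO.
print(usage.sort_values("consumption_production_ratio"))
```

Joining these aggregates against SLO breach counts would answer the utilization-vs-reliability question directly.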

Internal Structure

The content of the datasets you have, at a high level, for quick insight into their nature.

Answers questions such as:

  • Which fields have suddenly started reporting missing or unknown values?
  • How many unique values are there in the region field and what are they?
  • What are the min and max values of latency?

Insight: which values populate a given field; what’s the variance of latency; anomalies in distributions of data; pre-processing in ML applications

Measures: min, max, mean, variance, distribution (histogram), cardinality, enums
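
These profile statistics are straightforward to compute on a sample of the data; a minimal sketch (with a made-up sample and field names) might look like this:

```python
import pandas as pd

# Hypothetical sample with one numeric and one categorical field.
df = pd.DataFrame({
    "latency_ms": [12.0, 15.5, 11.2, None, 230.0, 14.8],
    "region": ["eu", "us", "eu", "unknown", "apac", "eu"],
})

# Numeric profile: min, max, mean, variance, a coarse histogram,
# and the count of missing values.
latency = df["latency_ms"].dropna()
numeric_profile = {
    "min": latency.min(),
    "max": latency.max(),
    "mean": latency.mean(),
    "variance": latency.var(),
    "histogram": pd.cut(latency, bins=4).value_counts().sort_index(),
    "missing": int(df["latency_ms"].isna().sum()),
}

# Categorical profile: cardinality, the observed enum values, and the
# rate of unknowns (useful for spotting fields that suddenly degrade).
categorical_profile = {
    "cardinality": df["region"].nunique(),
    "values": sorted(df["region"].unique()),
    "unknown_rate": float((df["region"] == "unknown").mean()),
}

print(numeric_profile)
print(categorical_profile)
```

Tracking these profiles over time, rather than computing them once, is what turns them into an anomaly signal.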

Semantics

Last, but not least, we have semantics. It captures information about the entities and events (facts) that your data represent, as well as the relationships between them.

Answers questions such as:

  • Which datasets about your users contain geo-location data?
  • Do you have any marketing impression datasets for your products in a given region?

Insight: discovering relevant datasets about a specific domain; understanding the landscape of information you have, as well as duplication of efforts and missed opportunities.
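
One way to picture this, as a toy sketch: annotate each dataset with the entities it describes and make those annotations queryable (all names and tags below are hypothetical):

```python
# Toy semantic catalog: each dataset is tagged with the entities it covers.
catalog = {
    "user_profiles": {"entities": {"user", "geo-location"}},
    "ad_impressions_eu": {
        "entities": {"user", "product", "marketing-impression"},
        "region": "eu",
    },
    "inventory_summary": {"entities": {"product"}},
}

def find_datasets(required_entities, region=None):
    """Return datasets annotated with all required entities (and region, if given)."""
    return [
        name
        for name, meta in catalog.items()
        if required_entities <= meta["entities"]
        and (region is None or meta.get("region") == region)
    ]

# "Which datasets about your users contain geo-location data?"
print(find_datasets({"user", "geo-location"}))
# "Do you have marketing impression datasets for products in a given region?"
print(find_datasets({"product", "marketing-impression"}, region="eu"))
```

In a real system the annotations would live in a catalog service and be derived, at least partly, from schemas and lineage rather than written by hand.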


These four dimensions summarize my thinking around data understanding. I believe they follow an order of information richness that starts from the point of production and goes all the way up to consumption. Several companies have described their approaches to tackling different subsets of these problems (see Google GOODS, Netflix: Scaling Data Quality presentation), and there are open source tools/frameworks that address different subsets of these issues.

In the end, you need to come up with a solution that works with your organization’s data-warehousing and data practices. While this is a non-trivial undertaking, the benefits can go a long way toward building a data- and insights-driven culture within your organization.