Unriddle Data: Hammering a fuzzy space

Data… Magic or Facts misunderstood?

There’s surely an aura of magic around data projects, at least in people’s minds. According to a study, around 85% of data initiatives fail; yet the market is flooded with big promises in the name of Machine Learning, AI and Big Data Analytics. Is the charm misunderstood? At least the figures say so.

Navigating through unknowns

In my previous article, I spoke about the three stages of a data-intensive project, namely the acquisition, the comprehension and the application. As BAs, do we have a repeatable framework to fall back upon and navigate through this space? In this article, I will try to simulate that path with an oversimplified example.

The chicken or the egg problem…

Problems and opportunities are no longer apparent; one needs to dig for them, deep down in the details. Quizzing the details isn’t fun, and you might be headed down a rabbit hole unless you know what you are looking for.

I don’t need to know everything; I just need to know where to find it, when I need it — Albert Einstein

The real magic lies in nailing the right problem with the right tool.

Purely from my experience, however, the trend is a race to acquire everything and then look for something that indicates a problem. Most data projects end up eyeing a data lake of sorts to begin their data-empowered journey, without nudging hard on the question: what problems do we want to solve with the data?


The common problem we encounter is this loop where businesses struggle to articulate a problem statement in the absence of data. How do I say what is wrong unless I have the data?

What if I don’t have enough to articulate a problem? Do I still have questions that can potentially lead to problem statements? So how do I start quizzing the unknowns?


Breaking the vicious circle

So what are we supposed to do? How do we reach the next level? More often than not, organizations come up with bold and aspirational mission statements when they talk about data. There is an intent to build a capability where information from various facets of the business can be assimilated to paint a meaningful picture. So how do we break that ambitious statement into small executable blocks, when the path between the two is fairly fuzzy?

For the purpose of an example, let us assume that an existing health and wellness chain wants to leverage data. Let’s call them “Healthy Karma” for the sake of storytelling. Healthy Karma has the mission statement “empowering a healthier way of life”. The underlying objectives read as:

  1. Leverage data to provide personalized tips for our customers towards a healthier life
  2. Use data as a hook to influence and attract new prospects
  3. Influence and intercept people to spread brand awareness and cultivate advocacy

In a nutshell, it wants to leverage data to intercept prospects and existing customers to build brand awareness, promote retention and cultivate advocacy via personalisation.

If we simplify the problem further, the organization, like any other, would want more and more people to use these services and stick around, providing brand lift and advocacy in the long run. The ultimate motivation is to maximize revenue and lifetime value in exchange for customer value.

Now that we have understood the key motivation, let us zoom in further; we will now put to use the marketing funnel that we discussed in my previous write-up.

Let us start with the business motivation:

As the CEO of Healthy Karma, I want to maximize revenue

In order to do so, I have a bunch of initiatives in mind:

  • As a CEO I want to increase brand awareness to reach out to more prospects so that I have a wider funnel and a better conversion
  • As a CEO I want to acquire more customers so that I can increase revenue
  • As a CEO I want to reduce the customer churn so that I can sustain my existing revenue streams
  • As a CEO I want to reduce my operational costs
  • As a CEO I want to increase plan upgrades so that I have increased revenue from my existing members

For the sake of the discussion, let us assume that the CEO feels that reducing customer churn is the easiest to attack, based on some ground research. Tying back to the funnel, it belongs to the retention lane.

What do I need to acquire to understand customer churn?

  1. How many customers have churned in the last 1–2 years?
  2. Capture the most granular time grain available (every hour, day, week, month, etc.)

Having asked those fundamental questions, can I add a few dimensions to this information?

  • Total tenure of the customer with the brand
  • How did they join the brand? (Heard about us from a friend, read a newspaper ad, or maybe saw an ad on FB)
  • When did they join? (time of the year)
  • Access to gender, age, address, region and other demographics, and whether they have other active subscribers among friends and family
  • Membership contract details (type of membership, subscription fee)
  • Kind of program joined (weight reduction, thyroid control, diabetes control), etc.
  • Preserving the time grain corresponding to slowly changing dimensions is also key

Certain facts that I might be interested in could be

  • Interactions with the nutritionist, as a time series
  • Vital stats during the course of the program, as a time series
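To make these data points concrete, here is a minimal sketch of what one churn record could look like once the dimensions and facts above are stitched together. The `ChurnRecord` class and its field names are purely illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass, field
from datetime import date

# Hypothetical record combining the dimensions and facts listed above.
@dataclass
class ChurnRecord:
    customer_id: str
    joined_on: date            # when did they join? (time of the year)
    churned_on: date           # churn event, at the finest grain available
    acquisition_channel: str   # friend referral, newspaper ad, FB ad, ...
    age: int
    gender: str
    region: str
    membership_type: str       # membership contract details
    program: str               # weight reduction, thyroid control, ...
    has_active_friends: bool   # friends/family with active subscriptions

    # Facts: time-stamped events kept alongside the dimensions
    nutritionist_visits: list = field(default_factory=list)  # [(date, notes)]
    vitals: list = field(default_factory=list)               # [(date, metric, value)]

    @property
    def tenure_days(self) -> int:
        # total tenure of the customer with the brand
        return (self.churned_on - self.joined_on).days

rec = ChurnRecord("C-1001", date(2021, 1, 15), date(2021, 7, 15),
                  "friend_referral", 24, "F", "North", "gold",
                  "weight reduction", True)
print(rec.tenure_days)  # 181
```

Each such record can later be sliced by any of its dimensions, which is exactly what the comprehension stage will need.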

By now we have identified one of the initiatives, “churn reduction”, that would contribute to the larger motivation. We also have an understanding of the high-level data points we need to dig into this space better. Each of these data points can now lead to one or more datasets. Having said so, let us begin the data acquisition phase.

Let’s capture…

We are about to begin the data acquisition phase. The data we spoke about could be scattered across several systems; how your data engineer chooses to acquire it will depend upon the following questions:

  • How often do you expect to look at this data?
  • What sort of data recency do you expect? (near real time, every few hours, etc.)

Beyond that, engineering the data acquisition process will take into account

  • Data growth
  • Whether you tend to get incremental data or a full dump
  • What is the shape of your data? If you are dealing with facts, you would want to append data to what you have collected previously; for dimensions, maybe a snapshot view works better
  • Regardless of the above, one would ensure there is a way to reconcile between the data acquired and the data shared
  • How to maintain data in its rawest form when received?
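As a sketch of the facts-versus-dimensions point above: fact batches are appended, while a dimension snapshot replaces the current view. Plain Python lists and dicts stand in for real tables here, and names like `fact_store` are made up for illustration.

```python
fact_store = []           # facts: always append, history is immutable
dimension_store = {}      # dimensions: latest snapshot, keyed by id

def ingest_facts(batch):
    """Incremental fact batches are appended to what was collected before."""
    fact_store.extend(batch)

def ingest_dimension_snapshot(snapshot):
    """A dimension dump upserts into the current snapshot view."""
    for row in snapshot:
        dimension_store[row["customer_id"]] = row

ingest_facts([{"customer_id": "C-1", "event": "visit", "ts": "2023-01-02"}])
ingest_facts([{"customer_id": "C-1", "event": "visit", "ts": "2023-01-09"}])
ingest_dimension_snapshot([{"customer_id": "C-1", "membership": "gold"}])
ingest_dimension_snapshot([{"customer_id": "C-1", "membership": "platinum"}])

print(len(fact_store))                       # 2 — both fact rows kept
print(dimension_store["C-1"]["membership"])  # platinum — latest snapshot wins
```

A real pipeline would also keep the raw dumps untouched, per the last bullet, so that any of these derived stores can be rebuilt.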

We have now acquired the data that can potentially help us understand customer churn better. Next, we need to model this data to make it more consumable. Piggybacking on the trending analogy of data being the new oil, raw data could be equated to crude oil: it needs to be distilled before it can be put to use.

Getting ready to comprehend…

Building further on our case study, in order to understand the customer churn better, a CEO may be interested in

  1. How long does a customer stay?
  2. Correlation between personal traits of the customer and the length of association
  3. Effect of external factors like seasons, global events, recession etc.

Based on the initial questions asked, the raw data is now modeled to support some point-in-time analysis.

A point-in-time analysis is a powerful tool to spot patterns and evolution in a subject matter (churn, in our case) as time progresses. As you add dimensions to this plot, e.g. age, gender, demographic, socioeconomic background, occupation, type of subscription, tenure, etc., the plot grows complex; however, it enriches your primary data point, churn, so it can be looked at through various lenses. Using a combination of these lenses over time, an analyst is able to spot patterns that weren’t evident earlier.
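A toy illustration of such a point-in-time view: churn events counted per month, sliced by one extra lens (an age-group bucket). The data and the group boundaries are invented for the sketch.

```python
from collections import defaultdict

# Invented churn events; each carries a time grain (month) and a dimension (age).
churn_events = [
    {"month": "2023-01", "age": 22}, {"month": "2023-01", "age": 41},
    {"month": "2023-02", "age": 24}, {"month": "2023-02", "age": 23},
    {"month": "2023-02", "age": 58},
]

def age_group(age):
    # one illustrative lens; real analysis would define several
    return "20-25" if 20 <= age <= 25 else "other"

# Count churn per (month, age group) — the point-in-time slice
counts = defaultdict(int)
for e in churn_events:
    counts[(e["month"], age_group(e["age"]))] += 1

for key in sorted(counts):
    print(key, counts[key])
```

Adding further lenses (gender, program, tenure band) just extends the grouping key; the principle stays the same.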

Based on the size and complexity of the datasets, a BA would pair up with a Data Scientist to make sense of the data, since the business context and the numbers have to talk to each other.

When the numbers speak, you may be tempted to conclude; however, one must be aware that data does not come clean. There is always noise in the data, like an age of 200. It is important to treat the noise upfront, since it can heavily muddle the picture the data could potentially paint.

The act of cleansing is not a one-time activity. As datasets evolve, tech intervention has to be provisioned in the ecosystem to spot noise and duly clean it before datasets are up for analysis. Hence, a repeatable mechanism needs to be developed to spot outliers or new values so that the data in context can be sanitized.

This is a continuous cycle where business and tech work hand in hand to define noise and cleanse it.
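One way such a repeatable mechanism could look: validation rules declared once and applied to every incoming batch, separating clean rows from noise (like that age of 200). The rules and field names are assumptions for illustration.

```python
# Business and tech jointly maintain these rules; tech applies them per batch.
RULES = {
    "age": lambda v: isinstance(v, int) and 0 < v < 120,
    "gender": lambda v: v in {"M", "F", "other"},
}

def cleanse(batch):
    """Split a batch into clean rows and noisy rows needing review."""
    clean, noisy = [], []
    for row in batch:
        ok = all(check(row.get(field)) for field, check in RULES.items())
        (clean if ok else noisy).append(row)
    return clean, noisy

batch = [
    {"age": 24, "gender": "F"},
    {"age": 200, "gender": "M"},   # noise: implausible age
]
clean, noisy = cleanse(batch)
print(len(clean), len(noisy))  # 1 1
```

As new values or outliers appear, the rules dictionary is the one place that evolves, which is what makes the mechanism repeatable rather than one-off.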

To take my case study to the next level, let us assume that we could spot a pattern. Say we observe that “the age group of 20–25 tends to stay subscribed when they have friends using the same service”. This could be a good starting point to think about the application.

Let’s apply…

“Healthy Karma” now launches a competition among a cohort of subscribers, where it connects subscribers with similar goals, with an expectation that at least 10% of these would extend their subscriptions. It also invites existing subscribers to extend referrals so their friends can join the subscription for free for the period of the competition. By now we have a mechanism to continually look at the data pouring in from the datasets identified during acquisition; as the competition kicks in, the data starts flowing in too. The organization can now zoom in to see whether the competition is moving it closer to the goal or has little impact, giving Healthy Karma levers to pivot and change course.
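To make that feedback loop tangible, a back-of-the-envelope check against the 10% extension expectation could look like this; the cohort numbers are invented.

```python
# Hypothetical figures from the competition's incoming data
cohort_size = 400          # subscribers entered into the competition
extensions = 52            # of those, how many extended their subscription

extension_rate = extensions / cohort_size
target = 0.10              # the "at least 10% extend" expectation

print(f"extension rate: {extension_rate:.1%}")  # 13.0%
print("on track" if extension_rate >= target else "pivot / change course")
```

Tracked continuously rather than once, this simple ratio is the lever that tells Healthy Karma whether to stay the course or pivot.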

Did I end abruptly? Yes, I did, and on purpose, for each stage can be a chapter by itself, and this was an oversimplified example!


  • Data isn’t magic but a powerful tool to uncover hidden opportunities
  • Having said so, it is still very important to identify a high-level problem space that one would choose to break down using data
  • Starting without a high-level motivation can lead you down a rabbit hole
  • Data acquisition is not a goal by itself; it is a step towards making data-driven decisions
  • Tech decisions are clearly driven by business choices
  • It is imperative to ask questions so that raw data can be modeled to fetch answers to them efficiently
  • It is essential to deal with noise in data in a repeatable fashion to derive unbiased inferences
  • Comprehension to application is an exploratory space; an exploration may reveal that all the data acquired has little or no impact, and that is a valid outcome
  • The marketing funnel is a framework that I have found useful whenever I had to apply data to business; there could be more ways to navigate through this space

One might argue that even now we ended up acquiring a lot of data to understand just the churn, so how is this different from building a lake and then pinning a problem? The difference lies in how we choose to proceed. Being a novice cook, my favorite analogy is shopping for veggies: I pick stuff based on what I want to cook, rather than going all out and then wondering what all I can cook.