Understanding the Basics of Data Science (when you are not a data scientist but work with or manage data scientists) — Part II

Sabahat Iqbal
6 min read · Dec 29, 2020

A few side notes about those tasks we talked about in Part I, the steps in the data mining process, and then data-related terms that often overlap with data mining. In Part III, we start getting into the meat of data mining tasks.

The tasks fall into one of two categories — supervised or unsupervised.

A task is supervised when we are trying to find out the value of a target variable. So, to continue with the telecom customer example, let’s say we want to find out whether a customer will respond to a marketing offer or not. The target variable is Will Respond and its possible values are Yes and No. We want to build a model that allows us to predict that value for each customer. In order to do this, we need previously collected customer data that contains both values, Yes and No, for the target variable, Will Respond.
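
A minimal sketch of the supervised case in Python, with hypothetical customer records and a deliberately toy "model" that just learns the majority answer per customer segment (the field names and values are invented for illustration):

```python
from collections import Counter

# Hypothetical historical records: each carries the known target value
# ("will_respond") observed after a past campaign.
historical = [
    {"minutes_used": 320, "on_contract": True,  "will_respond": "Yes"},
    {"minutes_used": 110, "on_contract": False, "will_respond": "No"},
    {"minutes_used": 450, "on_contract": True,  "will_respond": "Yes"},
    {"minutes_used": 90,  "on_contract": False, "will_respond": "No"},
]

def train(records):
    # Count target values per segment (on contract vs. not) ...
    by_group = {}
    for r in records:
        by_group.setdefault(r["on_contract"], Counter())[r["will_respond"]] += 1
    # ... and remember the most common answer seen in each segment.
    return {group: counts.most_common(1)[0][0]
            for group, counts in by_group.items()}

model = train(historical)

# Predict the target for a new customer whose response we do NOT know yet.
new_customer = {"minutes_used": 200, "on_contract": True}
print(model[new_customer["on_contract"]])  # → Yes
```

The key ingredient is that the historical records already contain the target value; a new customer's record does not, and the model fills it in.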

A task is unsupervised when we are not sure how customers might break down into groups, i.e. what similarities exist between customers that could identify them as part of a group. There is no target variable; the data mining task decides which customers make the most sense to group together.
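
A sketch of the unsupervised case: a toy one-dimensional k-means run on hypothetical monthly-spend figures. No labels go in; the grouping comes out.

```python
# Hypothetical monthly-spend figures for eight customers. There is no
# target variable; we want the task itself to find the groups.
spend = [20.0, 22.0, 25.0, 24.0, 90.0, 95.0, 88.0, 93.0]

def kmeans_1d(values, iters=10):
    # Start the two centroids at the min and max of the data (a simple choice).
    centroids = [min(values), max(values)]
    for _ in range(iters):
        # Assign each value to its nearest centroid.
        clusters = [[], []]
        for v in values:
            nearest = 0 if abs(v - centroids[0]) <= abs(v - centroids[1]) else 1
            clusters[nearest].append(v)
        # Move each centroid to the mean of its cluster (keep it if empty).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return clusters

low, high = kmeans_1d(spend)
print(sorted(low), sorted(high))
# → [20.0, 22.0, 24.0, 25.0] [88.0, 90.0, 93.0, 95.0]
```

The algorithm found a "light spender" and a "heavy spender" group on its own; nobody told it those groups existed.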

So how are data mining tasks categorized?

Types Of Tasks

Building a Model vs. Using a Model — two distinct steps of the data mining process

This is a simple but critical point that can get lost on newbies: we mine data in order to extract patterns and build models. Then we use those models on new data to predict what an entity (such as a telecom customer) will do or not do, by how much, or which other groups of customers it most resembles.

Apart from building and then using a model, there are other steps in the data mining process. CRISP-DM (the Cross-Industry Standard Process for Data Mining) is the iterative industry standard. It looks similar to a software development life-cycle, but with SDLC the result is a new system/software design. Data mining is closer to research and development and may involve pilot studies or throwaway prototypes before final deployment.

CRISP-DM
  • Business Understanding: Here, business analysts ask fundamental questions. What business problem are we facing? How can this problem be solved? What information would help us solve it? Once we have the information, how will we test whether our problem has been solved or not?
  • Data Understanding: Do we already have the data to fulfill the information needs identified above? Is it in the format we need? Or do we need to obtain/clean/collate it? If so, how much will that cost and will it be a worthwhile investment? Once we have the data, what data mining task(s) will match the desired business goal — a supervised or unsupervised task? The business analysts and data analysts will have to figure this out together.
  • Data Preparation: Data has to be converted into tabular format, missing values have to be inferred, and extraneous data has to be removed.
  • Modeling: Much more on this later. For now, suffice it to say (as has been said many times before, at this point) that a model captures patterns in the data.
  • Evaluation: Patterns can be found in almost any dataset. Models have to be tested to ensure the pattern will appear in new datasets and is not an anomaly or a coincidence.
  • Sign-off from key stakeholders requires data scientists to provide a comprehensible explanation of their model to non-data scientists. Most sign-offs occur in a test environment that mirrors production as closely as possible. Assessment in the production environment may be impossible or undesirable (due to cost, for example).
  • However, a team could decide to deploy the model on a random selection of customers in its production database, keeping the rest as a control group. In addition to verifying that the model works as expected, this selective deployment can reveal behavior changes caused by the deployment itself, e.g. where the model is supposed to detect fraud, the counter-party could discover a loophole that quickly renders the model useless.
  • Finally, revisiting the questions posed in the Business Understanding step ensures the model is fixing the problem as it was originally identified and meets the expectations of the business analysts in a real-world setting.
  • Deployment: A team can deploy either the model or the data mining techniques into the production environment. In the first case, the model continues to make predictions based on a pattern developed in the test environment. In the latter case, the data mining techniques continuously look for patterns and (test and) build new models. Why deploy the techniques rather than the model? Because real-world data may change faster than a team of data scientists can keep up. Additional fail-safe measures then have to be put in place to ensure the data mining techniques are doing what they are supposed to (as defined in the Business Understanding step).
  • Deploying a model typically requires it to be re-coded for compatibility with the production environment. This step — and all steps from this point forward — will probably be handled by a team of developers working closely with data scientists, at least at first.
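
The Evaluation step above is commonly done with a holdout: build the model on one part of the data and score it on a part it never saw. A minimal sketch, with synthetic labels and a toy threshold "model" (everything here is invented for illustration):

```python
import random

random.seed(0)
# Synthetic labeled examples: the true rule is "x > 50 means Yes",
# with roughly 10% noisy labels mixed in.
data = [(x, "Yes" if x > 50 or random.random() < 0.1 else "No")
        for x in range(100)]
random.shuffle(data)
train_set, test_set = data[:70], data[70:]  # 30 examples the model never sees

def accuracy(threshold, records):
    hits = sum(("Yes" if x > threshold else "No") == y for x, y in records)
    return hits / len(records)

# "Training": pick the threshold that looks best on the training data.
threshold = max((x for x, _ in train_set), key=lambda t: accuracy(t, train_set))

# The honest score is the one on held-out data the model never saw.
print(accuracy(threshold, train_set), accuracy(threshold, test_set))
```

If the held-out score collapses while the training score stays high, the "pattern" was an anomaly or coincidence in the training data.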

At each phase of the CRISP-DM process, new insights into the business problem, the tasks, or the data may require re-evaluation of earlier assumptions or solutions. This is expected and is characteristic of an iterative process.

How Do Other Analytic Techniques Fit Into the Data Mining Context?

  • Statistics: A field of study concerning the collection, organization, analysis, interpretation, and presentation of data. Many of the data mining tasks have roots in Statistics. Patterns uncovered by data mining tasks can be viewed as hypothesis generation — can we find patterns in the data? Hypothesis generation should be followed by hypothesis testing — a key tool from Statistics — which helps determine whether a pattern uncovered by a data mining task is valid or just a coincidence and how confidently that assessment can be made.
  • Regression Analysis: When used in the field of Econometrics, the goal is to explain a specific dataset (i.e. why relationships between data points might exist). Data scientists, however, use this method to extract patterns that will generalize to other data. Typically, the goal is to predict a value for data not already in the analyzed dataset. This sets up a tension between finding a useful pattern in a specific dataset and testing that pattern on new data (i.e. generalizing).
  • Machine Learning: Data Mining is an offshoot of Machine Learning, which is a subset of the field of Artificial Intelligence.
Relationship between AI, ML, DM.
  • Other tools related to data mining: Database Querying — exactly what it sounds like. A business analyst has a question, such as, “Who are all the male customers in NY over the age of 45?”. She could use Structured Query Language (SQL) to get a list of those customers from a database. She is querying the database. How does data mining fit in? A data mining task could have first uncovered the pattern that, “male customers in NY over the age of 45” are the most profitable. Based on that information, a monthly SQL query could run to find these customers to target for better marketing offers.
  • Data Warehousing — data is collected from various databases. For example, if the Sales and Billing database can be integrated with the HR database, there is potential to find the characteristics/patterns related to the most effective salespeople. Data Warehousing can be an expensive investment and is not necessary for data mining but its existence may lead to more integrated use of data mining techniques in departmental functions.
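
The hypothesis-testing idea above can be made concrete with a from-scratch chi-square test of independence on an invented 2x2 table (customer group versus campaign response; all counts are made up):

```python
# Invented contingency table: response counts by customer group.
observed = {("contract", "Yes"): 60, ("contract", "No"): 40,
            ("no_contract", "Yes"): 30, ("no_contract", "No"): 70}

groups = ["contract", "no_contract"]
answers = ["Yes", "No"]
total = sum(observed.values())

# Chi-square statistic: sum over cells of (observed - expected)^2 / expected,
# where "expected" assumes group and response are independent.
chi2 = 0.0
for g in groups:
    row_total = sum(observed[(g, a)] for a in answers)
    for a in answers:
        col_total = sum(observed[(x, a)] for x in groups)
        expected = row_total * col_total / total
        chi2 += (observed[(g, a)] - expected) ** 2 / expected

# Critical value for 1 degree of freedom at the 5% level is 3.841.
print(round(chi2, 2), "significant" if chi2 > 3.841 else "not significant")
# → 18.18 significant
```

Here the statistic far exceeds the 5% critical value for one degree of freedom, so a pattern like "contract customers respond more" is unlikely to be a coincidence in data like this.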
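
The regression contrast above can be shown with a tiny ordinary-least-squares fit on synthetic data generated from a known line: the econometric reading is the interpretation of the slope, while the data mining use is the prediction for data not in the dataset.

```python
import random

random.seed(1)
# Synthetic data: monthly bill is roughly 2.0 dollars per minute plus noise.
data = [(x, 2.0 * x + random.gauss(0, 1)) for x in range(1, 21)]

# Ordinary least squares for y = a + b*x, computed from scratch.
n = len(data)
mean_x = sum(x for x, _ in data) / n
mean_y = sum(y for _, y in data) / n
b = (sum((x - mean_x) * (y - mean_y) for x, y in data)
     / sum((x - mean_x) ** 2 for x, _ in data))
a = mean_y - b * mean_x

# An econometrician might stop at interpreting b (dollars per extra minute);
# a data scientist uses the fit to predict for data outside the dataset.
print(round(a + b * 30, 1))  # close to the true line's value of 60.0
```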
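
And the database-querying example above, run against a throwaway in-memory SQLite table (the schema and rows are invented for illustration):

```python
import sqlite3

# A throwaway in-memory table standing in for the customer database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, sex TEXT, state TEXT, age INTEGER)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?, ?)", [
    ("Alice", "F", "NY", 50),
    ("Bob",   "M", "NY", 47),
    ("Carl",  "M", "NJ", 52),
    ("Dave",  "M", "NY", 30),
])

# The analyst's question, expressed as SQL.
rows = conn.execute(
    "SELECT name FROM customers WHERE sex = 'M' AND state = 'NY' AND age > 45"
).fetchall()
print([name for (name,) in rows])  # → ['Bob']
```

Note the division of labor: data mining discovers that this segment matters; the recurring SQL query merely retrieves its current members.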

This is the second in a series of blogs explaining the basics of data science. More to be added in the days and weeks to come. Primary resource (and book that I highly recommend): Data Science For Business; Provost and Fawcett. Other parts in this series: part I.
