Process of data science - Hypothesis
Aug 7, 2020
6 minute read

Process of Data Science

There are processes that determine success in every field. Surgeons, pilots and scientists undergo many hours of training, spanning years, to practice their craft. Those hours of training are largely spent building skills toward automaticity: the ability to make quick decisions and manage difficult situations efficiently. This idea of automaticity is enshrined in many theories of cognitive science, such as the two-process model of Schneider and Shiffrin (1977) (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.470.2718&rep=rep1&type=pdf). This article does not explore the science of developing automaticity, but lays out practices a data scientist could internalize through practical training.

A data scientist has various duties and responsibilities: ensuring the success of an established experiment through monitoring (baseline evaluation among them), modeling the results of an experiment (through machine learning or other modeling), designing an experiment (through A/B testing or bandit procedures), and ensuring success through data stewardship. While the primary focus of a data scientist can change with each role, checking off a list and following the process diligently should guarantee success when success is achievable.

The process can be broken down into 10 steps, as shown below.

  1. Hypothesis (discussed here)
  2. Measurement variables
  3. Latent or unobservable factors
  4. Experimental design (0 to 1)
    1. Controlling other factors to observe the primary effect.
  5. Collection and analysis of data for pattern discovery
    1. Hypothesis-driven exploration
  6. Modeling of patterns for prediction
    1. Numerical analysis for error reduction
    2. Qualitative modeling
  7. Generalizing or scaling the experiment (1 to n)
  8. Establishing a baseline
  9. Monitoring through controls and baselines
  10. Ethics and governance

Note: The process above outlines a general framework and can involve fewer or more steps depending on the established responsibilities of the role or the scope of work.

An illustration follows, applying the steps listed above to fraud detection at an insurance firm. Let's call this hypothetical firm mysurance.

1. Hypothesis

A hypothesis (https://en.wikipedia.org/wiki/Hypothesis), as used in science, is a conjecture or statement which, if verified, could provide a deeper understanding of an issue or insights that are not readily available. Thus, the goal of a hypothesis is to obtain value, which can be multi-faceted and not immediately apparent.

Problem definition

For example, when trying to detect fraud, a hypothesis can seem apparent: detect fraud. A fraudulent insurance claim could cause monetary losses; it could also affect other insured personnel with similar assets by introducing false signals into actuarial models and triggering a reassessment of risk. Thus, measuring and detecting fraud, with a model or otherwise, is a good value proposition for a data scientist and the firm.

Thinking of the model and its contingencies before a hypothesis is jumping the gun, a premature optimization. In my experience, firms sometimes think about scaling a solution and other commoditizing tasks before ensuring the product works. Solving the zero-to-one problem, as popularized by the book Zero to One (https://www.amazon.com/Zero-One-Notes-Startups-Future/dp/0804139296), is important before scaling the solution. Therefore, a principal question for a data scientist should concern a "qualitative and quantitative definition" of fraud.

Literature survey

A scientist should start with a hypothesis to frame the problem under study. In academic circles a thorough literature survey precedes hypothesis generation, to understand prior approaches and results; industry, due to the competitive nature of business, has no well-documented repository of hypotheses across firms. In such circumstances, identifying issues through discussions with end users and product owners goes a long way.

A sample analysis of fraud

Fraud is multi-faceted and varies with the domain of insurance. A data scientist (DS) can start by asking the product manager/owner (https://en.wikipedia.org/wiki/Product_manager), the person responsible for the success of the product, about the types of fraud, to gain a better understanding of the insurance procedures.

DS: What types of insurance fraud should I focus on to get maximum returns?

The product owner could be the domain expert, or could consult a domain expert, to enumerate the types of fraud. Alignment on such factors is an important aspect of any job to avoid scope and feature creep.

Product Owner:

  • Billing: Overestimated billing

  • Misrepresentation: Claims on nonexistent assets

  • Damage: Intentional damage

DS Follow up: Are the fraud types using a hard or soft classification?

The question of hard versus soft classification/categorization comes up quite often in science. A hard classification or category defines rigid boundaries around set membership, with an element belonging to exactly one set, $onion \in Vegetables$, whereas a soft classification allows multiple memberships, $tomato \in Fruits$ and $tomato \in Vegetables$. Such issues were particularly observable in the transgender bathroom rights issue (https://www.vox.com/2016/5/5/11592908/transgender-bathroom-laws-rights). These questions are important to clarify in order to plan the steps that follow from the choice. For simplicity, the categories here are assumed to be hard; a minimal sketch of the distinction follows.
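To make the distinction concrete, here is a minimal Python sketch contrasting hard and soft labels for claims. The claim identifiers, categories and probabilities are hypothetical, purely for illustration.

```python
# A minimal sketch contrasting hard and soft classification outputs.
# The fraud categories and probabilities below are hypothetical.

from typing import Dict

# Hard classification: each claim maps to exactly one category.
hard_labels: Dict[str, str] = {
    "claim_001": "billing",
    "claim_002": "misrepresentation",
}

# Soft classification: each claim carries a probability for every category,
# acknowledging that one claim can exhibit traits of several fraud types.
soft_labels: Dict[str, Dict[str, float]] = {
    "claim_001": {"billing": 0.7, "misrepresentation": 0.2, "damage": 0.1},
    "claim_002": {"billing": 0.1, "misrepresentation": 0.6, "damage": 0.3},
}

# A hard label can always be recovered from a soft one by taking the most
# probable category (argmax), but the reverse discards information.
hardened = {cid: max(probs, key=probs.get) for cid, probs in soft_labels.items()}
print(hardened)  # {'claim_001': 'billing', 'claim_002': 'misrepresentation'}
```

The asymmetry in the last step, soft labels can be hardened but hard labels cannot be softened, is one reason to settle this question early.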

DS: Does a claims processor follow a process for fraud classification? Can I interview and walk through the process with them?

It is important to walk through the human process of classification to understand which features can be "engineered" from available data and which would require additional human effort. A human making the decision applies judgement; when that judgement is based on inference from the information presented, the corresponding features can be engineered.
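As an illustration, the sketch below engineers a few features a claims processor might infer while reviewing a claim. The field names (claim_amount, insured_value, policy_start, claim_date, prior_claims) are assumptions for illustration, not mysurance's actual schema.

```python
# A hypothetical sketch of "engineering" features that mirror the cues a
# human claims processor might use. All field names are assumed.

from datetime import date

claim = {
    "claim_amount": 12_500.0,
    "insured_value": 15_000.0,
    "policy_start": date(2020, 6, 1),
    "claim_date": date(2020, 7, 15),
    "prior_claims": 3,
}

features = {
    # Ratio of claim to insured value: values near 1.0 may mirror a
    # processor's suspicion of overestimated billing.
    "claim_to_value_ratio": claim["claim_amount"] / claim["insured_value"],
    # Days between policy inception and claim: very short gaps often
    # draw extra scrutiny from a human reviewer.
    "days_since_inception": (claim["claim_date"] - claim["policy_start"]).days,
    # Claim frequency is a judgement cue that is directly computable.
    "prior_claims": claim["prior_claims"],
}
print(features)
```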

Agreement on outcomes and results of hypothesis

Fraud is context dependent, and thus verifying fraudulent activity is a stochastic process at best. A common issue to consider with human classification is inter-rater agreement (reference: https://en.wikipedia.org/wiki/Inter-rater_reliability). A stochastic judgement or classification can vary across people and across contexts. A data scientist could use classifications from existing data to determine a concentration measure (ideally a probabilistic lower and upper bound - http://www.stat.cmu.edu/~larry/=sml/Concentration.pdf) quantifying agreement on the outcome. If agreement anywhere between 10% and 100% is acceptable, a naive approach to identifying fraud, a random coin flip, would suffice. However, if agreement between 90% and 100% is required for usability (and compliance), a more sophisticated distributional analysis has to be undertaken.
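As a rough sketch, agreement between two raters could be checked with Cohen's kappa (via scikit-learn) alongside a Hoeffding-style concentration bound on the raw agreement rate. The labels below are fabricated; in practice they would come from independent claims processors labeling the same claims.

```python
# A minimal sketch of checking inter-rater agreement on fraud labels.
# The two raters' labels are fabricated for illustration.

import math
from sklearn.metrics import cohen_kappa_score

rater_a = ["billing", "damage", "billing", "misrepresentation", "damage", "billing"]
rater_b = ["billing", "damage", "misrepresentation", "misrepresentation", "damage", "billing"]

# Cohen's kappa corrects raw agreement for agreement expected by chance.
kappa = cohen_kappa_score(rater_a, rater_b)

# Raw agreement with a Hoeffding-style bound: with probability 1 - delta,
# the true agreement rate lies within +/- eps of the observed rate.
n = len(rater_a)
agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / n
delta = 0.05
eps = math.sqrt(math.log(2 / delta) / (2 * n))

print(f"kappa={kappa:.2f}, agreement={agreement:.2f} +/- {eps:.2f}")
```

With only six labeled claims the bound is very loose; the same calculation on a few hundred historical claims would say whether the 90% bar is plausibly met.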

Low-hanging fruit

A good hypothesis for success could be “misrepresentation is different from intentional damage”.

A reason to work with a minimal set of hypotheses is to narrow the set of issues to consider when looking at a new problem. However, in any domain there are other factors to consider, such as the availability of data to generate robust predictions, fiscal efficiency (the cost of modeling versus the savings from the model), and the usability of predictions from a legal and regulatory standpoint. A data scientist could help the firm by exploring patterns and developing models for claim services, enabling changes and improvements in fraudulent claim processing.
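As a first, hedged pass at the hypothesis above, one could compare a single measurement variable, say claim amount, across the two fraud types with a nonparametric two-sample test. The samples below are fabricated; real values would come from mysurance's historical, adjudicated claims.

```python
# A minimal sketch of testing "misrepresentation is different from
# intentional damage" on one variable (claim amount). All values fabricated.

from scipy.stats import mannwhitneyu

misrepresentation_amounts = [4200, 5100, 4800, 6000, 5500, 4700]
intentional_damage_amounts = [9800, 12000, 8700, 11000, 10500, 9300]

# The Mann-Whitney U test makes no normality assumption, a reasonable
# default for skewed monetary data.
stat, p_value = mannwhitneyu(
    misrepresentation_amounts, intentional_damage_amounts, alternative="two-sided"
)
print(f"U={stat}, p={p_value:.4f}")  # a small p suggests the two types differ
```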


The hypothesis step is sometimes overlooked or shortened due to financial and/or resource constraints. Under such circumstances, it is prudent to start with a vague hypothesis rather than no hypothesis. A good rule of thumb is to allocate one to two weeks for generating a hypothesis, which sets up the following steps for success.

A sound hypothesis aligns the problem with objectives, providing a "big-picture" view of the steps to come. The hypothesis stage should have an outcome; this can take many forms but should carry an attached qualitative or quantitative value. For example, knowing the types of fraud and implementing accurate measures could improve customer satisfaction by yyy and reduce costs by xxx.

The utility of outcomes to the business should be tangible through the measurement variables, which should ideally be validated by a product owner or business associate to ensure a return on the effort and investment of data science. A rough sketch of such a calculation follows.
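As a back-of-the-envelope illustration of attaching a quantitative value, consider the sketch below. Every number is a stand-in assumption to be replaced by figures validated with the product owner.

```python
# A back-of-the-envelope value calculation for the hypothesis stage.
# Every figure below is an assumption, not mysurance data.

annual_claims = 100_000          # claims processed per year (assumed)
fraud_rate = 0.02                # share of claims that are fraudulent (assumed)
avg_fraud_loss = 8_000.0         # average loss per fraudulent claim (assumed)
model_recall = 0.6               # share of fraud a model catches (assumed)
annual_model_cost = 400_000.0    # staffing plus infrastructure (assumed)

gross_savings = annual_claims * fraud_rate * avg_fraud_loss * model_recall
net_savings = gross_savings - annual_model_cost
print(f"gross={gross_savings:,.0f}, net={net_savings:,.0f}")
# gross=9,600,000, net=9,200,000
```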

In the next posts in this series I will share my views on the other steps of approaching a data science problem.



