How DEEP is your Data Science project?

One of the fun parts of coaching people in the Data Science/Machine Learning field is that we get loads of questions, not just about data science and ML, but also about their jobs, their managers, and sometimes even family issues.

Last weekend (early Saturday morning), I got a call from one of my students/mentees. He started off with:

“… I am frustrated with my Data Science job… I get blamed for every project failure… I don’t want to be a Data Scientist… I wanna quit.”

I was taken aback. If I remember correctly, he was one of the smartest and brightest chaps: a Machine Learning geek and specifically an expert on neural nets.

Well, I slowly and steadily calmed him down and started asking questions (in a ‘5 Whys’ manner). Something serious came out of the discussion. It looks like his management had provided some data and asked him to come up with proposed solutions. No problem statement, no goal statement, no project objectives, no understanding of the business domain, no process flow diagrams, no knowledge of the data source, no data dictionary, nothing about the vision.

They believed that if your title is Data Scientist, you are a magician: you look into the data and are expected to give management “solutions” immediately. I wish this were true. On discussing further with him, I found that, due to time pressure, he had jumped directly into solution mode. No data science framework was used, nor was any methodology followed.

Typically, data scientists are so engrossed in building algorithms that they tend to miss the bigger picture. More importantly, we have noticed that several institutions train people on data science and machine learning algorithms, but very few give importance to teaching a Data Science methodology or framework.

The call with my mentee compelled me to write the piece below on a Data Science framework/methodology. I thought of coining this as D.E.E.P (no, it’s not Deep Learning).

DEEP stands for

  • Define Phase
  • Explore Phase
  • Exploit Phase
  • Productionize Phase

Let’s look into each phase in detail

  • Define Phase:

The Define phase is one of the most critical phases in a data science project. Unfortunately, it is also the phase most neglected by Data Science teams. In Define we do the following:

  • Problem Statement – Before starting the project, it is extremely important to have a short description of the issues that need to be addressed by the Data Science team. It should be presented to them (or created by them) before they try to solve a problem.
  • Objective / Goal – An objective is a high-level statement that provides overall context for what the Data Scientist is trying to achieve; it should align with business goals.
  • Business Process – A high-level process flow that captures the business activities, data capture and, importantly, the customer-interaction “moment of truth”.
  • Data Source – Understanding the data source helps the team identify possible sources of predictive patterns.
  • Data Dictionary – Specifically for structured data, creating a data dictionary is one of the most important parts of the Define phase. It consists of a set of information describing the contents, format, and structure of the data and the relationships between its attributes.

Based on the discussion I had with my mentee (the person who had called in), it looks like corners were cut in the Define phase.
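
To make the data dictionary idea concrete, here is a minimal sketch in plain Python. The `ColumnSpec` class, the `validate_record` helper and the customer columns are all hypothetical names made up for illustration; a real project would take its columns from the actual data source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ColumnSpec:
    """One data dictionary entry: what a column holds and which values are valid."""
    name: str
    dtype: type
    description: str
    allowed: Optional[frozenset] = None  # None means any value of dtype is fine
    nullable: bool = False


# Hypothetical dictionary for a customer table (illustrative columns only).
DATA_DICTIONARY = [
    ColumnSpec("customer_id", int, "Unique customer key"),
    ColumnSpec("segment", str, "Marketing segment",
               allowed=frozenset({"retail", "corporate"})),
    ColumnSpec("monthly_spend", float, "Average monthly spend", nullable=True),
]


def validate_record(record, dictionary=DATA_DICTIONARY):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for spec in dictionary:
        if spec.name not in record:
            problems.append(f"{spec.name}: missing column")
            continue
        value = record[spec.name]
        if value is None:
            if not spec.nullable:
                problems.append(f"{spec.name}: null not allowed")
            continue
        if not isinstance(value, spec.dtype):
            problems.append(f"{spec.name}: expected {spec.dtype.__name__}")
        elif spec.allowed is not None and value not in spec.allowed:
            problems.append(f"{spec.name}: value {value!r} not in allowed set")
    return problems


good = validate_record({"customer_id": 1, "segment": "retail", "monthly_spend": None})
bad = validate_record({"customer_id": "x", "segment": "vip"})
```

Here `good` is an empty list, while `bad` reports three problems: a wrong type, a value outside the allowed set, and a missing column. Even a small check like this forces the team to write down what each attribute means before any modelling starts.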

  • Explore Phase:

In the Explore phase, we carry out most of the dirty work (oops, sorry, I shouldn’t call it that). This is the phase in which most data scientists try to cut corners.

  • Data Cleansing – In one of my previous articles, ‘Data Biryani’, a whole lot was written on data preparation/cleaning. In general, data cleansing is the step where we check the data for completeness and cleanliness.
  • Data Imputation – In the data imputation step, we look for missing values and decide on a strategy: replace them with the mean, the median or the most likely value, or sometimes delete the record (which is not recommended).
  • Label/One-Hot Encoding – In label encoding, each category of a categorical attribute is converted to a numeric code. One-hot encoding transforms a categorical feature into a set of binary indicator columns, one per category, a format that works better with many classification and regression algorithms.
  • Data Transformation – For specific clustering, classification and regression algorithms we need the data to be normalized or scaled (for example, min-max scaling). These are some of the activities associated with data transformation.
  • Statistical Exploration – In statistical exploration we try to understand individual attribute patterns (mean, min, max, range), relationships between attributes (for reasoning), and any outliers or possible errors.
  • Inferential Analytics – Inferential statistics draws valid inferences about a population from an analysis of a representative sample of that population. One of my favorites is the Chi-Square test, which is widely used to test for association between categorical variables.
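
Several of the Explore steps above can be sketched in a few lines of plain Python. This is only an illustration on toy data: the function names are my own, the values are made up, and in practice most teams would reach for a data-frame or machine learning library instead of hand-written helpers.

```python
from statistics import median


def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def label_encode(values):
    """Map each category to an integer code (alphabetical order)."""
    codes = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [codes[v] for v in values]


def one_hot(values):
    """Return one 0/1 indicator column per category, plus the category order."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values], categories


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table given as rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


ages = impute_median([25, None, 35, 45])        # [25, 35, 35, 45]
ages_scaled = min_max_scale(ages)               # [0.0, 0.5, 0.5, 1.0]
colour_codes = label_encode(["red", "blue", "red", "green"])   # [2, 0, 2, 1]
colour_columns, colour_names = one_hot(["red", "blue", "red", "green"])
chi2 = chi_square_statistic([[10, 20], [20, 10]])  # about 6.67
```

The chi-square function returns only the test statistic; turning it into a p-value needs the chi-square distribution with the appropriate degrees of freedom (here 1), which a statistics library provides.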

  • Exploit Phase:

This is the phase in which the Data Scientist plays with the data, builds models and, importantly, believes this is the only value-adding activity he/she should do (which is not right). Some of the key activities of the Exploit phase are:

  • Data Stratification – Stratification is the process of dividing members of the population into homogeneous subgroups before sampling. It is carried out based on the findings of statistical exploration.
  • Feature Engineering – Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. It is fundamental to model building.
  • Model Building & Machine Learning – We build multiple models based on different machine learning algorithms and on different datasets. Importantly, we test the models for accuracy, resource usage, response time, processing time etc. Later we identify the best-performing models by fine-tuning the parameters of the machine learning algorithms (there is a lot of material available on the internet on model building).
  • Model Prediction – Once models are built, we predict outcomes for new data. We keep validating model accuracy to make sure accuracy levels are consistent across different variations in the data.
  • Cross Validation – Cross validation is a very useful technique for assessing the performance of machine learning models. It helps us know how a model would generalize to an independent data set.
  • Visualization – Data visualization is a way of communicating information by encoding it as visual objects (e.g. ROC curve with its AUC, scatter plot, regression plot) contained in graphics.
  • Reporting – At the end of the Exploit phase, reports need to be provided to the project owners on the algorithms used, prediction results and expected benefits. This is one of the critical elements, but unfortunately most data science teams fail to prepare any kind of report.
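
Stratification and cross validation are easier to grasp with a small example. The sketch below uses only the Python standard library and a deliberately trivial "model" (always predict the majority class) so that the mechanics stay visible. The labels, function names and fold count are assumptions for illustration, not a recommended setup.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split indices into train/test so each class keeps roughly its share."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def k_fold_indices(n_samples, k=4, seed=0):
    """Yield (train, validation) index lists for k shuffled folds."""
    rng = random.Random(seed)
    order = list(range(n_samples))
    rng.shuffle(order)
    folds = [order[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


def majority_baseline_accuracy(labels, train_idx, val_idx):
    """Accuracy of always predicting the most common training label."""
    majority = Counter(labels[i] for i in train_idx).most_common(1)[0][0]
    return sum(labels[i] == majority for i in val_idx) / len(val_idx)


# Toy, imbalanced labels: 5 "churn" and 15 "stay" customers (made-up data).
labels = ["churn"] * 5 + ["stay"] * 15
train_idx, test_idx = stratified_split(labels)

# Cross-validate on the training portion only; the test set stays untouched.
train_labels = [labels[i] for i in train_idx]
scores = [
    majority_baseline_accuracy(train_labels, tr, va)
    for tr, va in k_fold_indices(len(train_labels))
]
mean_accuracy = sum(scores) / len(scores)
```

With this data the held-out test set keeps the 1:3 class ratio, and the majority baseline averages 75% accuracy across folds. A figure like that is worth putting in the report: any real model has to beat it before its accuracy means anything.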

  • Productionize Phase:

All looks good: all learnings done, model built, reported to management etc. Time to move the model into the production/live environment with a live data feed. Below are the key activities within the Productionize phase:

  • Data Product Development Plan – As in a typical software engineering project, once the machine learning models are developed and tested, we need to put them into a staging environment where we get a live or test data feed. Often we convert our model into a web service that runs predictions on live data and sends back the results.
  • Testing of Solution against Data Feed – Once a web service is developed, it needs to be tested. This activity performs unit and system testing against the data feed. Once we get the predictions against the data feed, we often validate them with a domain expert.
  • Deployment – On successful testing, we deploy the solution/model into the production environment.
  • Model Maintenance – The final, continuous activity is model maintenance. As we keep getting new sets of data, we validate and maintain the model (attribute reselection, different parameter tuning etc.).
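
As a rough idea of what "converting the model into a web service" can look like, here is a minimal sketch using only Python's standard library. The model is a placeholder with made-up weights and feature names (`tenure_years`, `open_tickets`); a real service would load the trained model and would normally sit behind a proper web framework.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical stand-in for a trained model: fixed weights for two features.
WEIGHTS = {"tenure_years": 0.5, "open_tickets": -0.25}
BIAS = 1.0


def predict(features):
    """Score one record with the placeholder linear model."""
    return BIAS + sum(WEIGHTS[name] * float(features[name]) for name in WEIGHTS)


def handle_request(body):
    """Turn a JSON request body into (HTTP status, JSON response text)."""
    try:
        score = predict(json.loads(body))
    except (ValueError, KeyError, TypeError) as exc:
        # Malformed JSON, a missing feature or a non-numeric value.
        return 400, json.dumps({"error": str(exc)})
    return 200, json.dumps({"score": score})


class PredictionHandler(BaseHTTPRequestHandler):
    """Accepts POST requests with a JSON record and replies with a score."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_request(self.rfile.read(length))
        data = payload.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def make_server(port=8000):
    """Build the server; call .serve_forever() on the result to start it."""
    return HTTPServer(("127.0.0.1", port), PredictionHandler)


# The request logic can be unit tested without starting a server.
status, payload = handle_request(b'{"tenure_years": 2, "open_tickets": 4}')
```

Keeping `handle_request` separate from the HTTP handler is a deliberate choice: the same function can be exercised in unit tests, replayed against a recorded data feed in staging, and reviewed with a domain expert, all before deployment.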

Lastly, as said in one of the above steps, visualization is a great way of describing information; the above methodology can be summarized in a single view (visual) as below.

 

I hope the above methodology/framework gives us discipline while setting up and developing a Data Science project.

Happy DEEPing your Data Science Project!

Author

Safdar Hussain BE,MS(Oxford,UK),M.Tech (IIT),M.Phil,PGMP(IIM),PMP,Lean Expert,Six Sigma Black Belt(ASQ),MBB(GE),CPP,ITIL V3 Expert, PgMP

Email: Hussain.pmp@gmail.com / safdar.oxford@gmail.com