Data science life cycle





Introduction

Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data. These patterns and insights can help a business optimize its operations and improve its decisions.
Data is everywhere. From the smallest photon interactions to galaxy collisions, from mouse movements on a screen to the economic development of countries, we are surrounded by a sea of information. The human mind cannot comprehend this data in all its complexity; since ancient times, people have found it much easier to reduce the dimensionality, to impose a strict order, to arrange the data points neatly on a rectangular grid: to make a data table.
But once the data has been collected into a table, it has been tamed. It may still need some grooming and exercise, but it is no longer scary. Even if it is really Big Data, with the right tools you can approach it, play with it, bend it to your will, and master it.

There are seven major steps in the life cycle of data science:



1. Business understanding

Business understanding is the first stage of the Team Data Science Process (TDSP). The TDSP provides a recommended life cycle that you can use to structure your data science projects; it outlines the major stages that projects typically execute.
There are two main tasks addressed in this stage:

Define objectives:

Work with your customer and other stakeholders to understand and identify the business problems. Formulate questions that define the business goals that the data science techniques can target. 

Identify data sources:

Find the relevant data that helps you answer the questions that define the objectives of the project.

Define objectives

1.     A central objective of this step is to identify the key business variables that the analysis needs to predict. We refer to these variables as the model targets, and we use the metrics associated with them to determine the success of the project. Two examples of such targets are sales forecasts or the probability of an order being fraudulent.

2.     Define the project goals by asking and refining "sharp" questions that are relevant, specific, and unambiguous. Data science is a process that uses names and numbers to answer such questions. You typically use data science or machine learning to answer five types of questions:

o   How much or how many? (regression)
o   Which category? (classification)
o   Which group? (clustering)
o   Is this weird? (anomaly detection)
o   Which option should be taken? (recommendation)
Determine which of these questions you're asking and how answering it achieves your business goals (a short mapping sketch follows this list).


3.     Define the project team by specifying the roles and responsibilities of its members. Develop a high-level milestone plan that you iterate on as you discover more information.

4.    Define the success metrics. For example, you might want to achieve a customer churn prediction accuracy of "x" percent by the end of this three-month project, so that you can offer customer promotions to reduce churn. The metrics must be SMART:
o   Specific
o   Measurable
o   Achievable
o   Relevant
o   Time-bound
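
For illustration, here is a hedged mapping of the five question types above to example scikit-learn estimators. The specific estimators are examples chosen for this sketch, not prescribed by TDSP:

```python
# Example mapping of the five question types to scikit-learn estimators.
# These estimators are illustrative choices only; many alternatives exist.
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest, RandomForestClassifier
from sklearn.linear_model import LinearRegression

question_to_estimator = {
    "How much or how many? (regression)": LinearRegression(),
    "Which category? (classification)": RandomForestClassifier(),
    "Which group? (clustering)": KMeans(n_clusters=3),
    "Is this weird? (anomaly detection)": IsolationForest(),
    # "Which option should be taken?" (recommendation) usually calls for a
    # dedicated recommender library rather than a single estimator.
}
```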

Identify data sources

Identify data sources that contain known examples of answers to your sharp questions. Look for the following data:
  • Data that's relevant to the question. Do you have measures of the target and features that are related to the target?
  • Data that's an accurate measure of your model target and the features of interest.
For example, you might find that the existing systems need to collect and log additional kinds of data to address the problem and achieve the project goals. In this situation, you might want to look for external data sources or update your systems to collect new data.
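
As a quick, hedged sketch (the file and column names are assumptions), pandas can check whether the data actually measures the target and contains related features:

```python
# Does the dataset measure the target, and are any features related to it?
# "orders.csv" and "is_fraud" are hypothetical names for illustration.
import pandas as pd

df = pd.read_csv("orders.csv")
assert "is_fraud" in df.columns  # the model target must actually be measured

# Quick relevance check: correlation of numeric features with the target
# (assumes the target is encoded numerically, e.g. 0/1).
print(df.select_dtypes("number").corr()["is_fraud"].sort_values(ascending=False))
```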




2. Data Mining

This is one of the most important steps in solving a data science problem: once you have framed the problem, you need to gather the data that will let you solve it. One of the best ways to get data is to scrape it from the internet or to collect it from multiple sources.





Data mining libraries

Some of the most important Python libraries used to get or scrape data:
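
As one hedged illustration, a scraping sketch using requests and BeautifulSoup; the URL and CSS selectors are placeholders, and these two libraries are common choices rather than the only ones:

```python
# A minimal web-scraping sketch with requests and BeautifulSoup.
# The URL and the page structure are placeholders for illustration only.
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/products")
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
rows = []
for item in soup.select("div.product"):  # hypothetical page structure
    rows.append({
        "name": item.select_one("h2").get_text(strip=True),
        "price": item.select_one("span.price").get_text(strip=True),
    })
print(rows)
```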

3. Data Cleaning

Data cleansing, or data cleaning, is the process of detecting and correcting (or removing) corrupt or inaccurate records from data. It means identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse records.

Here are just a few types of dirty data:
  • Duplicate data.
  • Incorrect types.
  • Empty rows.
  • Stale data.
  • Abbreviations.
  • Outliers.
  • Typos.
  • Uniqueness.
  • Missing values.
  • Extra spaces. 

Data Cleaning libraries

Some of the main Python libraries involved in the data cleaning process are shown below:
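
As a minimal sketch with pandas (the file and column names are hypothetical), a cleaning pass that handles several of the dirty-data types listed above might look like this:

```python
# Cleaning a hypothetical customers.csv with pandas, covering several of the
# dirty-data types listed above: duplicates, empty rows, extra spaces,
# incorrect types, missing values, and implausible outliers.
import pandas as pd

df = pd.read_csv("customers.csv")                      # assumed input file

df = df.drop_duplicates()                              # duplicate data
df = df.dropna(how="all")                              # empty rows
df["name"] = df["name"].str.strip()                    # extra spaces
df["age"] = pd.to_numeric(df["age"], errors="coerce")  # incorrect types
df["age"] = df["age"].fillna(df["age"].median())       # missing values
df = df[df["age"].between(0, 120)]                     # implausible outliers
```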

4. Data Exploration 

Exploratory Data Analysis (EDA) is the practice of understanding data sets by summarizing their main characteristics, often by plotting them visually. In other words, you are exploring the data in a deeper and clearer way. Through the process of EDA, we can refine the problem statement and the definition of our data set, which is significant.




Some of the most important Python libraries that are used while performing EDA are shown below:
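
For illustration (the dataset and column names are assumed), a first EDA pass with pandas might summarize the data like this:

```python
# A first-pass EDA sketch with pandas: shape, dtypes, summary statistics,
# missing values, and the distribution of one assumed column.
import pandas as pd

df = pd.read_csv("customers.csv")        # assumed dataset

print(df.shape)                          # rows and columns
print(df.dtypes)                         # column types
print(df.describe(include="all"))        # summary statistics
print(df.isna().sum())                   # missing values per column
print(df["age"].value_counts().head())   # distribution of one feature
```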


5. Data Visualization

Data visualization is the presentation of data in a pictorial or graphical format. It enables decision makers to see analytics presented visually, so they can grasp difficult concepts or identify new patterns. With interactive visualization, you can take the concept a step further by using technology to drill down into charts and graphs for more detail, interactively changing what data you see and how it’s processed.


An effective chart is one that:

  • Conveys the right and necessary information without distorting facts.
  • Is simple in design; you don't have to strain to understand it.
  • Uses aesthetics that support the information rather than overshadow it.
  • Is not overloaded with information.

Some of the most important Python libraries used in data visualization are shown below:
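
A minimal matplotlib sketch of such a chart: one message, labeled axes, nothing that competes with the data. The numbers are invented for the demo:

```python
# A simple, uncluttered chart with matplotlib. Values are made up.
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sales = [120, 135, 128, 150, 170, 165]   # illustrative figures only

fig, ax = plt.subplots()
ax.plot(months, sales, marker="o")
ax.set_title("Monthly sales")
ax.set_xlabel("Month")
ax.set_ylabel("Units sold")
plt.show()
```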

6. Feature Engineering

It’s often said that “data is the fuel of machine learning.” This isn’t quite true: data is like the crude oil of machine learning, which means it has to be refined into features (predictor variables) to be useful for training a model. Without relevant features, you can’t train an accurate model, no matter how complex the machine learning algorithm. The process of extracting features from a raw dataset is called feature engineering.

Feature engineering can simply be defined as the process of creating new features from the existing features in a dataset.

The Feature Engineering Process

Feature engineering, the second step in the machine learning pipeline, takes in the label times from the first step [prediction engineering] and a raw dataset that needs to be refined. Feature engineering means building features for each label while filtering the data used for the feature based on the label’s cutoff time to make valid features. These features and labels are then passed to modeling where they will be used for training a machine learning algorithm.
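
A hand-rolled pandas sketch of that idea (the table and column names are hypothetical): only transactions recorded before each label's cutoff time are allowed to feed its features.

```python
# Building features per customer while respecting each label's cutoff time,
# sketched by hand in pandas. Table and column names are hypothetical.
import pandas as pd

transactions = pd.read_csv("transactions.csv", parse_dates=["timestamp"])
labels = pd.read_csv("labels.csv", parse_dates=["cutoff_time"])

features = []
for _, row in labels.iterrows():
    # Valid features may only use data from before the label's cutoff time.
    history = transactions[
        (transactions["customer_id"] == row["customer_id"])
        & (transactions["timestamp"] < row["cutoff_time"])
    ]
    features.append({
        "customer_id": row["customer_id"],
        "n_purchases": len(history),
        "total_spent": history["amount"].sum(),
        "label": row["label"],
    })

feature_matrix = pd.DataFrame(features)   # passed on to the modeling step
```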

One of the most important Python libraries for feature engineering is:

7. Predictive Model 

Predictive modeling, also called predictive analytics, is a mathematical process that seeks to predict future events or outcomes by analyzing patterns that are likely to forecast future results. The goal of predictive modeling is to answer this question: “Based on known past behavior, what is most likely to happen in the future?”
We can consider the predictive model a data science product, and the process itself is like a cycle or loop: once you build the model you reach the final step, but you will stay in that loop for a long time, tuning, updating the hyperparameters, and developing the model through iterations until you reach the best accuracy.
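
That tuning loop can be sketched with scikit-learn's GridSearchCV; the data and the parameter grid below are placeholders for illustration:

```python
# The tune-and-evaluate loop, sketched with scikit-learn's GridSearchCV.
# Data and the parameter grid are placeholders for illustration.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=5,                 # each grid point is scored by cross-validation
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```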


Based on the business logic and the collected data types, we can decide which model will solve the business problem and fit the needs of the clients or researchers. Let's take a look at the top five predictive analytics models:

  1. Classification Model
  2. Clustering Model
  3. Forecast Model
  4. Outliers Model
  5. Time Series Model
And there are some well-known algorithms used in these models:
  1. Random Forest
  2. Generalized Linear Model (GLM) for Two Values
  3. Gradient Boosted Model (GBM)
  4. K-Means
  5. Prophet 
Some of the most important Python libraries used in predictive modeling are shown below:
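
As one hedged end-to-end example, training and evaluating a Random Forest classifier with scikit-learn; synthetic data stands in for a real business dataset:

```python
# Training and evaluating a Random Forest classifier with scikit-learn.
# Synthetic data stands in for a real business dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, predictions))
```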


 

Conclusion

The life cycle of data science is more complex than it first appears and has many hidden layers. Once you go deep inside, you will find many processes and steps that you must take into account to reach the best results with accuracy and efficiency. The life cycle does not end until this complexity is tamed and the final results are presented in their simplest form for everyone to understand.
