These are the notes I took on the Data Science Methodology course on cognitiveclass.ai!
MODULE 1: FROM PROBLEM TO APPROACH
Business Understanding
Spend time seeking clarification to attain business understanding
Establish a clearly defined question
Determine the goal (e.g. if the aim is to reduce costs, is it to improve efficiency or to increase profitability?)
Figure out objectives to support goal
Break down objectives
Structure how to tackle problem
Involve the key business sponsors so they can:
Set overall direction
Remain engaged and provide guidance
Ensure necessary support
Analytic Approach
Selecting the approach depends on the question being asked
Seek clarification
Select approach in the context of business requirements
Pick analytic approach based on type of question
Descriptive
Current status
Show relationships
Diagnostic (statistical analysis)
What happened?
Why is this happening?
Problems that require counts (e.g. a question needing a yes/no answer calls for a classification approach)
Predictive (forecasting)
What if these trends continue?
What will happen next?
Prescriptive
How do we solve it?
Machine learning
Learning without being explicitly programmed
Identifies relationships and trends in data that may not be accessible or identified otherwise
Uses clustering and association approaches (see the sketch below)
Learning about human behaviour
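A minimal sketch of the clustering idea, using scikit-learn's KMeans on synthetic data; the two behaviour features and the cluster count are illustrative assumptions, not from the course:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two synthetic behaviour features, e.g. visits per month and average spend.
X = np.vstack([rng.normal([5, 20], 1.5, (50, 2)),
               rng.normal([20, 80], 3.0, (50, 2))])

# KMeans finds groups without being told any labels (unsupervised learning).
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(model.labels_[:10])        # cluster assigned to each observation
print(model.cluster_centers_)    # learned group centres
```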
Case study – decision tree classification
Predictive model
Decision tree classification
Categorical outcome
Explicit decision path showing conditions leading to high risk
Provides the likelihood of the classified outcome along with the predicted outcome
Easy to understand and apply
Lab – From Problem to Approach
The business understanding stage is important because it helps clarify the goal of the entity asking the question.
Outstanding features of the data science methodology flowchart:
The flowchart is highly iterative
The flowchart never ends.
The analytic approach stage is important because it helps identify what type of patterns will be needed to address the question most effectively.
Decision trees
Built using recursive partitioning to classify data
Use the most predictive feature to partition data
Predictiveness is based on the decrease in entropy (information gain) or in impurity (see the sketch after this list)
Tree stops growing at a node when:
Pure or nearly pure
No remaining variables
Tree has reached (preselected) size limit
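A minimal sketch of recursive partitioning with an entropy criterion, using scikit-learn on its built-in iris data; the depth limit plays the role of the preselected size limit mentioned above:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# criterion="entropy" makes each split maximise information gain;
# max_depth is the preselected size limit at which the tree stops growing.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X, y)

# The explicit decision path: the conditions tested at each node.
print(export_text(tree, feature_names=load_iris().feature_names))
```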
MODULE 2: FROM REQUIREMENTS TO SELECTION
Data Requirements
What data is required, how to source or collect it, how to understand and work with it, and how to prepare it to meet the desired outcome
Define data requirements for decision tree classification:
Identify the necessary data content, formats, and sources for initial data collection
Content, formats, representations suitable for decision tree classifier
Think ahead and consider future stages, as requirements may affect preparation
Data Collection
After initial data collection, the data scientist assesses whether the requirements are met (i.e. whether they have what is needed)
Data requirements are revised and decisions are made about whether to collect more or less data
Descriptive statistics and visualization can be applied to the data set to assess the content, quality, and initial insights about data
Gaps will be identified → plans to fill gaps or make substitutions
Data collection requires knowledge of the source or where to find needed data elements
It is acceptable to defer decisions about unavailable data and to attempt to acquire it at a later stage (e.g. after getting intermediate results from predictive modelling)
DBAs and programmers work together to extract data from various sources and merge it, removing redundant data (see the sketch below)
Move on to data understanding
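A hedged pandas sketch of that extract-and-merge step; the file names and the `customer_id` key are assumptions for illustration:

```python
import pandas as pd

# Hypothetical extracts from two source systems.
billing = pd.read_csv("billing_extract.csv")     # assumed file
clinical = pd.read_csv("clinical_extract.csv")   # assumed file

# Merge on a shared key, then drop redundant (duplicated) records.
merged = billing.merge(clinical, on="customer_id", how="inner")
merged = merged.drop_duplicates()
print(merged.shape)
```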
Case study – from requirements to collection in Python and R
Web scraping of online food recipes: http://yongyeol.com/papers/ahn-flavornet-2011.pdf
Once data collection is complete, descriptive statistics and visualization techniques are used to better understand the data (see the sketch after this list). Data is explored to:
Understand its content
Assess its quality
Discover any interesting preliminary insights
Determine the need for additional data
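One way to do that first-pass exploration with pandas; `recipes.csv` and its contents are assumptions standing in for the scraped recipe data:

```python
import pandas as pd

recipes = pd.read_csv("recipes.csv")    # assumed output of the scraping step

print(recipes.head())                   # understand the content
print(recipes.describe(include="all"))  # descriptive statistics per column
print(recipes.isnull().sum())           # assess quality: missing values
recipes.hist(figsize=(10, 8))           # distributions (requires matplotlib)
```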
MODULE 3: FROM UNDERSTANDING TO PREPARATION
Data Understanding
Encompasses all activities related to constructing the data set
Answers the question: is the collected data representative of the problem to be solved?
Prepare/clean the data
Run statistics against the data columns (variables in this model)
Stats include Hurst, univariates, and statistics on each variable (mean, median, min, max, stdev)
Pairwise correlations used
How closely certain variables are related
If highly correlated, then redundant
Histograms of variables examined to understand the distributions
Determine what sort of data preparation may be needed to make the variable more useful in a model (e.g. consolidation)
Data quality
Missing values
Invalid or misleading values
Use stats, univariates, and histograms (see the sketch below)
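A sketch of those univariate and pairwise checks; `df` stands in for any numeric data set from the collection stage, and the 0.9 redundancy threshold is an assumption:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.default_rng(1).normal(size=(200, 4)),
                  columns=["a", "b", "c", "d"])   # placeholder data

print(df.describe())     # mean, std, min, max (median is the 50% row)
corr = df.corr()         # pairwise correlations between variables
# Flag highly correlated, hence likely redundant, pairs.
redundant = (corr.abs() > 0.9) & (corr.abs() < 1.0)
print(corr.where(redundant).stack())
df.hist(figsize=(8, 6))  # histograms guide preparation decisions
```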
Data Preparation
Unwanted elements are removed
Together with data collection and understanding, data preparation is the most time-consuming phase of a data science project (70-90% of project time)
Automation can reduce this to about 50% of project time
Transforming data in the preparation phase makes the data easier to work with
Must address missing or invalid values, remove duplicates, and ensure proper formatting (see the sketch at the end of this module)
Feature engineering – use domain knowledge of the data to create features that make the machine learning algorithms work
Critical when machine learning tools are being applied to analyze the data
Feature is a characteristic that may help when solving a problem
Text analysis steps for coding data are required to manipulate the data
What is needed within dataset to address the question
Ensure proper groupings
Ensure programming is not overlooking what is hidden within the data
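A minimal sketch of the cleaning and feature-engineering steps for this module; the file and column names are hypothetical:

```python
import pandas as pd

df = pd.read_csv("raw_data.csv")   # assumed input file

# Address missing or invalid values.
df["age"] = df["age"].where(df["age"].between(0, 120))   # invalid -> NaN
df["age"] = df["age"].fillna(df["age"].median())

# Remove duplicates and fix formats.
df = df.drop_duplicates()
df["visit_date"] = pd.to_datetime(df["visit_date"], errors="coerce")

# Feature engineering: a domain-driven derived feature (hypothetical).
df["days_since_visit"] = (pd.Timestamp.today() - df["visit_date"]).dt.days
```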
MODULE 4: FROM MODELLING TO EVALUATION
Modelling
Descriptive or predictive models
Descriptive
If a person did x, they are likely to prefer y
Predictive
Tends to yield yes/no or stop/go type outcomes
Models are based on the analytic approach (statistically driven or ML driven)
Training set is used for predictive modelling
Acts as a gauge to determine if the model needs to be calibrated
Try out different algorithms to ensure that the variables in play are actually required (see the sketch after this list)
1. Understand the question
2. Select the appropriate analytic approach
3. Obtain, understand, and prepare the data
Data supports answering the question
Constant refinement necessary within each step
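A sketch of using a held-out set as that gauge and trying more than one algorithm; the data here are synthetic placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)   # placeholder data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Compare algorithms to judge whether the model needs calibrating.
for model in (DecisionTreeClassifier(random_state=0),
              LogisticRegression(max_iter=1000)):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))
```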
Evaluation
Evaluation is done during model development and before it is deployed
Assess quality
Ensure it meets the initial request → if not, then adjust
1. Diagnostic measures
Ensure the model is working as intended
Adjustments required?
Predictive – use decision tree to evaluate if the answer is aligned with the initial design
Descriptive – testing set with known outcomes can be applied and model is refined as needed
2. Statistical significance testing
Ensure data is properly handled and interpreted within the model
Avoid unnecessary second guessing when the answer is revealed
ROC curve – receiver operating characteristic curve
Diagnostic tool
Determine the optimal classification model
Quantifies how well a binary classification model performs
Shows how well the model separates the yes and no outcomes as the discrimination criterion is varied (see the sketch below)
Optimal model at maximum separation
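A sketch of computing an ROC curve with scikit-learn; `y_true` and `scores` are placeholder held-out labels and model probabilities:

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1, 0, 1, 1, 0]                    # placeholder outcomes
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.7, 0.3]   # placeholder model scores

# Vary the discrimination threshold; trace false- vs true-positive rates.
fpr, tpr, thresholds = roc_curve(y_true, scores)
print(list(zip(thresholds, fpr, tpr)))
print("AUC:", roc_auc_score(y_true, scores))   # closer to 1 = better separation
```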
Confusion matrix
Summary of how well the categories are classified
Sheds light on which outcomes may be confused with a different category (see the sketch below)
Stats:
Type I error: false-positive
Type II error: false-negative
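And the matching confusion-matrix summary, with placeholder labels; rows are actual classes and columns are predictions, following scikit-learn's convention:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # placeholder actual outcomes
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # placeholder predictions

# Layout: [[TN, FP],    FP = Type I error (false positive)
#          [FN, TP]]    FN = Type II error (false negative)
print(confusion_matrix(y_true, y_pred))
```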
MODULE 5: FROM DEPLOYMENT TO FEEDBACK
Deployment
Make the answer relevant by getting stakeholders familiar with the tool produced
Solution owner, marketing, app developers, IT admin
Once the model is evaluated, it is deployed
Option: limited test group in a test environment
Feedback and refinement over time
Feedback
Feedback helps refine the model and assess it for performance and impact
Value of the model is dependent on incorporating feedback and adjusting accordingly
Ultimate test: actual real-time use in the field