Bellabeat Health-Tracker Analysis 👣
Scenario
Stakeholders and products
Stakeholders
- Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer
- Sando Mur: Mathematician and Bellabeat’s cofounder; key member of the Bellabeat executive team
- Bellabeat marketing analytics team: A team of data analysts responsible for collecting, analyzing, and reporting data that helps guide Bellabeat’s marketing strategy. You joined this team six months ago and have been busy learning about Bellabeat’s mission and business goals, as well as how you, as a junior data analyst, can help Bellabeat achieve them.
Products
- Bellabeat app: The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to their line of smart wellness products.
- Leaf: Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects to the Bellabeat app to track activity, sleep, and stress.
- Time: This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress. The Time watch connects to the Bellabeat app to provide you with insights into your daily wellness.
- Spring: This is a water bottle that tracks daily water intake using smart technology to ensure that you are appropriately hydrated throughout the day. The Spring bottle connects to the Bellabeat app to track your hydration levels.
- Bellabeat membership: Bellabeat also offers a subscription-based membership program for users. Membership gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health and beauty, and mindfulness based on their lifestyle and goals.
About the company
Sršen knows that an analysis of Bellabeat’s available consumer data would reveal more opportunities for growth. She has asked the marketing analytics team to focus on a Bellabeat product and analyze smart device usage data in order to gain insight into how people are already using their smart devices. Then, using this information, she would like high-level recommendations for how these trends can inform Bellabeat marketing strategy.
Ask
Sršen asks you to analyze smart device usage data in order to gain insight into how consumers use non-Bellabeat smart devices. She then wants you to select one Bellabeat product to apply these insights to in your presentation. These questions will guide your analysis:
- What are some trends in smart device usage?
- How could these trends apply to Bellabeat customers?
- How could these trends help influence Bellabeat marketing strategy?
Exploratory Data Analysis (EDA)
1. Our data
From the datasets provided, I have selected the ones that give the most insight into important metrics for a healthcare application:
daily_activity
- Provides information about clients’ daily (daytime) activities. The columns in this dataframe are: Id, ActivityDate, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDistance, VeryActiveDistance, ModeratelyActiveDistance, LightActiveDistance, SedentaryActiveDistance, VeryActiveMinutes, FairlyActiveMinutes, LightlyActiveMinutes, SedentaryMinutes, Calories
sleep_day
- Provides clients’ night-time information, which may be crucial for understanding sleeping behavior. The columns in this dataframe are: Id, SleepDay, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
heart_rate
- Provides information on clients’ heart rates as recorded by the trackers. The columns in this dataframe are: Id, Time, Value
weight_log
- Provides information about clients’ weights and body mass index. The columns in this dataframe are: Id, Date, WeightKg, WeightPounds, Fat, BMI, IsManualReport, LogId
Note:
- Our data has the names of clients removed for anonymity, so we will be working with assigned IDs.
- Not all clients made their data available for each of the datasets; we explore this further in the next section.
Understanding some general information on our data collection
How many unique participants are there in each dataframe? It looks like there may be more participants in the daily activity dataset than the sleep dataset.
n_distinct(daily_activity$Id)
## [1] 33
n_distinct(sleep_day$Id)
## [1] 24
n_distinct(heart_rate$Id)
## [1] 14
n_distinct(weight_log$Id)
## [1] 8
How many observations are there in each dataframe?
nrow(daily_activity)
## [1] 940
nrow(sleep_day)
## [1] 413
nrow(heart_rate)
## [1] 2483658
nrow(weight_log)
## [1] 67
Key takeaways:
- Only 8 members provided their weight info
- heart_rate data has over 2 million rows
- Most of our data is numeric, so our visualization and exploration methods reflect that
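The participant and row counts above can also be collected in one pass, which makes it easier to see how coverage differs between datasets and which clients appear in all of them. The following is a minimal sketch in base R rather than the `n_distinct()` calls used above; the three small data frames are hypothetical stand-ins, not the real tracker data.

```r
# Hypothetical toy stand-ins for the real data frames (values are made up)
daily_activity <- data.frame(Id = c(1, 1, 2, 3), TotalSteps = c(1000, 0, 5000, 7000))
sleep_day      <- data.frame(Id = c(1, 2, 2), TotalMinutesAsleep = c(400, 380, 420))
weight_log     <- data.frame(Id = 2, WeightKg = 60)

datasets <- list(
  daily_activity = daily_activity,
  sleep_day      = sleep_day,
  weight_log     = weight_log
)

# One row per dataset: distinct participants and number of observations
coverage <- data.frame(
  dataset      = names(datasets),
  participants = unname(vapply(datasets, function(d) length(unique(d$Id)), integer(1))),
  observations = unname(vapply(datasets, nrow, integer(1)))
)
print(coverage)

# IDs that are present in every dataset
common_ids <- Reduce(intersect, lapply(datasets, function(d) unique(d$Id)))
print(common_ids)
```

With the real data frames in the list, `common_ids` would show which clients can be analyzed across all datasets at once.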
2. Data Visualization
density plots
[Density plots for daily_activity, sleep_day, heart_rate, and weight_log]
Key takeaways:
- Unimodal distributions: TotalMinutesAsleep, TotalTimeInBed
- Multi-modal distributions: SedentaryMinutes, Calories, WeightKg, WeightPounds, BMI
- Skewed distributions: TotalSteps, TotalDistance, TrackerDistance, Value, BMI
- Uniform distribution: Fat
histograms
[Histograms for daily_activity and weight_log]
Key takeaways:
- The most common sedentary minutes are around 600-800, 1050-1300, and 1450
- Sedentary minutes peak at the maximum recorded value of 1450
- The most common calorie values are around 2000 and 3000
- The most common weights were recorded at 55-65 kg and 85-95 kg
- There is an outlier weight around 130 kg
- Most BMI values fall between 24 and 26, with an outlier at 48
box plots
[Box plots for daily_activity and weight_log]
Key takeaways:
- There is a cluster of data points at zero for TotalSteps, TotalDistance, and TrackerDistance, and at the maximum value of SedentaryMinutes. This might imply that no data was recorded, meaning clients might not be wearing their trackers often.
- We have three outliers in the box plots for TotalSteps, TotalDistance, and TrackerDistance. It would be interesting to see whether they come from the same three individuals in all three charts; they would be our top performers.
- We have a single outlier in both WeightKg and BMI.
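One way to follow up on the zero-value cluster is to flag days that look like the tracker was not worn and count them per client. This is a sketch only: the data frame is hypothetical, and the 1440-minute threshold (a full day) is an assumption of this example, not a value taken from the analysis.

```r
# Hypothetical toy data with the same column names as daily_activity
daily_activity <- data.frame(
  Id               = c(1, 1, 1, 2, 2, 3),
  TotalSteps       = c(0, 0, 5200, 8000, 7600, 0),
  SedentaryMinutes = c(1440, 1440, 700, 650, 720, 1440)
)

# Assumption: a day that is entirely sedentary lasts 1440 minutes
full_day <- 1440

# A day with zero steps and a fully sedentary reading is treated as "likely not worn"
daily_activity$likely_not_worn <-
  daily_activity$TotalSteps == 0 & daily_activity$SedentaryMinutes >= full_day

# Number of such days per client
not_worn_by_client <- aggregate(likely_not_worn ~ Id, data = daily_activity, FUN = sum)
print(not_worn_by_client)
```

If most flagged days belong to a handful of clients, that would support the idea that the zero readings reflect unworn trackers rather than genuinely inactive days.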
time-frame bar plots
[Time-frame bar plots for sleep_day and heart_rate]
Inspect if there are patterns in the data with 0 activity
Inspect if there are patterns in the data with max SedentaryMinutes
Key takeaways:
- From our sleep_day dataset we can see generally elevated resting periods on Sundays.
- From our heart_rate dataset we can see heightened fluctuations in heart rate values on Mondays.
- The records with 0 active minutes and those with maximum sedentary minutes seem to come from the same clients, hence the matching patterns of occurrence; however, the patterns do not tell us much more.
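The weekday comparison behind the sleep takeaway can be sketched as a simple aggregation. The example below assumes that SleepDay has already been parsed into a `Date` column and uses hypothetical values; the weekday is derived from `as.POSIXlt()` so that the result does not depend on the system locale.

```r
# Hypothetical toy data; SleepDay is assumed to be parsed as Date already
sleep_day <- data.frame(
  SleepDay = as.Date(c("2016-04-10", "2016-04-11", "2016-04-12",
                       "2016-04-17", "2016-04-18")),
  TotalMinutesAsleep = c(480, 400, 410, 500, 390)
)

# as.POSIXlt()$wday is 0 for Sunday through 6 for Saturday
day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")
sleep_day$Weekday <- factor(
  day_names[as.POSIXlt(sleep_day$SleepDay)$wday + 1],
  levels = day_names
)

# Average minutes asleep per weekday (weekdays without records are dropped)
mean_sleep <- aggregate(TotalMinutesAsleep ~ Weekday, data = sleep_day, FUN = mean)
print(mean_sleep)
```

The same pattern would apply to heart_rate by aggregating Value per weekday, for example with `FUN = sd` to compare how much the readings fluctuate.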
Inspect the outliers
Key takeaways:
- The three outlier points were readings from two clients
- We might need to store their IDs in case we need them for future analysis or for a possible reward incentive
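Tracing outlier points back to clients could look like the sketch below. It uses `boxplot.stats()`, which applies the same 1.5 x IQR whisker rule as a default box plot, on hypothetical data built so that three extreme days come from two clients; the real IDs and values are not reproduced here.

```r
# Hypothetical toy data: 10 ordinary days for each of three clients ...
daily_activity <- data.frame(
  Id = rep(c(1, 2, 3), each = 10),
  TotalSteps = c(seq(4000, 8500, by = 500),
                 seq(4200, 8700, by = 500),
                 seq(4400, 8900, by = 500))
)
# ... plus three extreme days that belong to only two clients
extreme_days <- data.frame(Id = c(1, 1, 2), TotalSteps = c(29000, 31000, 33000))
daily_activity <- rbind(daily_activity, extreme_days)

# Values beyond the box plot whiskers (1.5 x IQR rule)
outlier_values <- boxplot.stats(daily_activity$TotalSteps)$out

# Rows and client IDs behind those values
outlier_rows <- daily_activity[daily_activity$TotalSteps %in% outlier_values, ]
outlier_ids  <- unique(outlier_rows$Id)

print(outlier_rows)
print(table(outlier_rows$Id))
```

Keeping `outlier_ids` in a variable is one simple way to store the IDs for later analysis.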
Cluster Analysis
Let’s perform cluster analysis on our daily_activity dataset, since it has data from the most users.
1. create statistics per client
2. scale our variables
3. perform clustering
- The complete method is more suitable for our data, as it clusters our data points better (a better distribution of the data points)
- Initial exploratory analysis suggested that we have 2-3 groups in our data (ref. the multimodal distributions in the density plots)
4. split our data and visualize
We are going to go with the 2-group clustering (on the left), as having a third cluster with only one client does not make much sense.
5. assign our data with their respective clusters
6. compute general stats for each cluster
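The six steps above can be sketched end to end in base R. This is an illustration under assumptions, not the exact code behind the report: it assumes hierarchical clustering with `hclust()` (which the mention of the "complete" method suggests), uses the per-client mean as the summary statistic, and runs on hypothetical toy values for three of the daily_activity columns.

```r
# Hypothetical toy data: 6 clients with 3 days each
daily_activity <- data.frame(
  Id = rep(1:6, each = 3),
  TotalSteps = c(11000, 12000, 13000, 10000, 11500, 12500, 9500, 10500, 12000,
                 2000, 3000, 2500, 1500, 2500, 3500, 3000, 2000, 4000),
  SedentaryMinutes = c(600, 650, 700, 620, 680, 640, 700, 660, 690,
                       1200, 1250, 1300, 1150, 1280, 1220, 1180, 1260, 1240),
  Calories = c(2900, 3000, 3100, 2800, 2950, 3050, 2700, 2900, 3000,
               1900, 2000, 1950, 1800, 1950, 2050, 2000, 1900, 2100)
)

# 1. Statistics per client (assumption: the mean of each metric)
client_stats <- aggregate(cbind(TotalSteps, SedentaryMinutes, Calories) ~ Id,
                          data = daily_activity, FUN = mean)

# 2. Scale the variables so that no single unit dominates the distances
scaled <- scale(client_stats[, -1])

# 3. Hierarchical clustering with complete linkage
tree <- hclust(dist(scaled), method = "complete")

# 4. Split the tree into two groups (plot(tree) would draw the dendrogram)
client_stats$cluster <- cutree(tree, k = 2)

# 5. Assign each daily record the cluster of its client
daily_activity <- merge(daily_activity, client_stats[, c("Id", "cluster")], by = "Id")

# 6. General statistics per cluster
cluster_summary <- aggregate(cbind(TotalSteps, SedentaryMinutes, Calories) ~ cluster,
                             data = daily_activity, FUN = mean)
print(cluster_summary)
```

Changing `k` to 3 in `cutree()` gives the three-group split for comparison, and other linkage methods can be tried through the `method` argument of `hclust()`.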
We can see contrasting behavior between cluster 1 and cluster 2, for instance:
- Cluster 1 clients are more active overall.
- Cluster 2 clients clocked almost twice as many SedentaryMinutes.
- Cluster 1 generally burnt more calories as a result.
- We have 0 VeryActiveMinutes from cluster 2, which might mean these clients do not have an exercise schedule or take their trackers off during exercise; a further survey might be required.
The main point of clustering the data was to segment our customers so we know which marketing approach suits which customer. We now have all our data labelled as cluster 1 or cluster 2, which allows a more targeted approach.
- We might decide to give the clusters more descriptive labels, such as “active” and “less active”.
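Relabelling the clusters could be as simple as converting the numeric cluster column to a factor. The sketch below uses a hypothetical `client_clusters` table and assumes, in line with the comparison above, that cluster 1 is the more active group.

```r
# Hypothetical table of clients and their numeric cluster assignments
client_clusters <- data.frame(Id = 1:4, cluster = c(1, 2, 1, 2))

# Assumption: cluster 1 is the more active group, cluster 2 the less active one
client_clusters$segment <- factor(
  client_clusters$cluster,
  levels = c(1, 2),
  labels = c("active", "less active")
)

print(table(client_clusters$segment))
```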
Combined Key Findings
Let’s consolidate all our key takeaways:
- Only 8 members provided their weight info
- heart_rate data has over 2 million rows
- TotalMinutesAsleep: unimodal distribution
- TotalTimeInBed: unimodal distribution
- SedentaryMinutes: multi-modal distribution
- Calories: multi-modal distribution
- WeightKg: multi-modal distribution
- WeightPounds: multi-modal distribution
- BMI: multi-modal distribution
- TotalSteps: skewed distribution
- TotalDistance: skewed distribution
- TrackerDistance: skewed distribution
- Value: skewed distribution
- BMI: skewed distribution
- Fat: uniform distribution
- The most common sedentary minutes are around 600-800, 1050-1300, and 1450
- Sedentary minutes peak at the maximum recorded value of 1450
- The most common calorie values are around 2000 and 3000
- The most common weights were recorded at 55-65 kg and 85-95 kg
- There is an outlier weight around 130 kg
- Most BMI values fall between 24 and 26, with an outlier at 48
- There is a cluster of data points at zero for TotalSteps, TotalDistance, and TrackerDistance, and at the maximum value of SedentaryMinutes. This might imply that no data was recorded, meaning clients might not be wearing their trackers often.
- We have three outliers in the box plots for TotalSteps, TotalDistance, and TrackerDistance. It would be interesting to see whether they come from the same three individuals in all three charts; they would be our top performers.
- We have a single outlier in both WeightKg and BMI.
- From our sleep_day dataset we can see generally elevated resting periods on Sundays.
- From our heart_rate dataset we can see heightened fluctuations in heart rate values on Mondays.
- The records with 0 active minutes and those with maximum sedentary minutes seem to come from the same clients, hence the matching patterns of occurrence; however, the patterns do not tell us much more.
- The three outlier points were readings from two clients (not from three, as previously assumed)
- We might need to store their IDs in case we need them for future analysis or for a possible reward incentive
- Now that we have our customers segmented into two definitive clusters (“active” and “less active”), we can derive targeted marketing strategies from the above key points and apply them to the most relevant cluster!