January 23, 2022

Bellabeat Health-Tracker Analysis 👣


Scenario

Stakeholders and products

Stakeholders

  • Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer
  • Sando Mur: Mathematician and Bellabeat’s cofounder; key member of the Bellabeat executive team
  • Bellabeat marketing analytics team: A team of data analysts responsible for collecting, analyzing, and reporting data that helps guide Bellabeat’s marketing strategy. You joined this team six months ago and have been busy learning about Bellabeat’s mission and business goals, as well as how you, as a junior data analyst, can help Bellabeat achieve them.

Products

  • Bellabeat app: The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to their line of smart wellness products.
  • Leaf: Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects to the Bellabeat app to track activity, sleep, and stress.
  • Time: This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress. The Time watch connects to the Bellabeat app to provide you with insights into your daily wellness.
  • Spring: This is a water bottle that tracks daily water intake using smart technology to ensure that you are appropriately hydrated throughout the day. The Spring bottle connects to the Bellabeat app to track your hydration levels.
  • Bellabeat membership: Bellabeat also offers a subscription-based membership program for users. Membership gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health and beauty, and mindfulness based on their lifestyle and goals.

About the company

Sršen knows that an analysis of Bellabeat’s available consumer data would reveal more opportunities for growth. She has asked the marketing analytics team to focus on a Bellabeat product and analyze smart device usage data in order to gain insight into how people are already using their smart devices. Then, using this information, she would like high-level recommendations for how these trends can inform Bellabeat marketing strategy.

Ask

Sršen asks you to analyze smart device usage data in order to gain insight into how consumers use non-Bellabeat smart devices. She then wants you to select one Bellabeat product to apply these insights to in your presentation. These questions will guide your analysis:

  1. What are some trends in smart device usage?
  2. How could these trends apply to Bellabeat customers?
  3. How could these trends help influence Bellabeat marketing strategy?

Exploratory Data Analysis (EDA)


1. Our data

From the datasets provided, I selected the ones most likely to yield insight into the key metrics of a healthcare application:

  1. daily_activity - provides information about clients’ daily (daytime) activity.

    The columns in this dataframe include: Id, ActivityDate, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDistance, VeryActiveDistance, ModeratelyActiveDistance, LightActiveDistance, SedentaryActiveDistance, VeryActiveMinutes, FairlyActiveMinutes, LightlyActiveMinutes, SedentaryMinutes, Calories

  2. sleep_day - provides clients’ night-time information, which may be crucial for understanding sleeping behavior.

    The columns in this dataframe include: Id, SleepDay, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed

  3. heart_rate - provides information on the clients’ heart rates as recorded by the trackers.

    The columns in this dataframe include: Id, Time, Value

  4. weight_log - provides information about clients’ weights and body mass index.

    The columns in this dataframe include: Id, Date, WeightKg, WeightPounds, Fat, BMI, IsManualReport, LogId

Note:

  • Our data has the names of clients removed for anonymity, so we will be working with assigned IDs.
  • Not all clients made their data available for each of the datasets; we will explore this further in the next section.

Understanding some general information on our data collection

How many unique participants are there in each dataframe? It looks like there may be more participants in the daily activity dataset than the sleep dataset.

n_distinct(daily_activity$Id)
## [1] 33
n_distinct(sleep_day$Id)
## [1] 24
n_distinct(heart_rate$Id)
## [1] 14
n_distinct(weight_log$Id)
## [1] 8

How many observations are there in each dataframe?

nrow(daily_activity)
## [1] 940
nrow(sleep_day)
## [1] 413
nrow(heart_rate)
## [1] 2483658
nrow(weight_log)
## [1] 67

Key takeaways:
  • Only 8 participants provided their weight info
  • heart_rate data has over 2 million rows
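
The participant and observation counts above can also be gathered in one pass, along with a check of which participants are missing from a given dataset. This is a minimal base R sketch, assuming the data frames are held in a named list; the small data frames and Ids below are made-up stand-ins, not the real data.

```r
# Toy stand-ins for the real data frames (hypothetical Ids, for illustration only)
datasets <- list(
  daily_activity = data.frame(Id = c(101, 101, 102, 103)),
  sleep_day      = data.frame(Id = c(101, 102, 102)),
  weight_log     = data.frame(Id = c(101))
)

# Unique participants and number of observations per dataset
participants <- sapply(datasets, function(df) length(unique(df$Id)))
observations <- sapply(datasets, nrow)
summary_table <- data.frame(participants, observations)
print(summary_table)

# Participants who logged activity but never logged sleep
no_sleep_ids <- setdiff(unique(datasets$daily_activity$Id),
                        unique(datasets$sleep_day$Id))
print(no_sleep_ids)
```

With the real data frames in the list, the same two sapply() calls would reproduce the n_distinct() and nrow() figures shown earlier.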

Most of our data is numeric, so our visualization and exploration methods will focus on techniques suited to numeric variables.


2. Data Visualization

Density plots

  1. daily_activity

  2. sleep_day

  3. heart_rate

  4. weight_log

Key takeaways:

Unimodal distributions include

  • TotalMinutesAsleep
  • TotalTimeInBed

Multi-modal distributions include

  • SedentaryMinutes
  • Calories
  • WeightKg
  • WeightPounds
  • BMI

Skewed distributions include

  • TotalSteps
  • TotalDistance
  • TrackerDistance
  • Value
  • BMI

Uniform distributions include

  • Fat
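
One rough way to check the modality read off a density plot is to count the local maxima of the estimated curve. The sketch below uses base R's density() on made-up weights; the values are hypothetical, and the number of peaks found depends on the bandwidth, so treat it as a sanity check rather than a formal test.

```r
# Hypothetical weights in kg, made up to mimic two groups plus one heavy reading
weight_kg <- c(55, 58, 60, 62, 65, 85, 88, 90, 92, 95, 130)

# Kernel density estimate with the default Gaussian kernel and bandwidth
dens <- density(weight_kg)

# A local maximum is a point where the slope changes from positive to negative
peak_index <- which(diff(sign(diff(dens$y))) == -2) + 1
peak_locations <- dens$x[peak_index]

# More than one peak suggests a multi-modal distribution
print(round(peak_locations, 1))
```

Calling plot(dens) would draw the curve itself, which is the kind of picture the takeaways above were read from.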

Histograms

  1. daily_activity

  2. weight_log

Key takeaways:
  • The most common sedentary-minute values fall around 600-800, 1050-1300, and 1450
  • Sedentary minutes peak at a maximum of about 1450, close to the 1,440 minutes in a full day
  • The most common calorie values are around 2000 and 3000
  • The most common weights were recorded at 55-65 kg and 85-95 kg
  • There is an outlier weight around 130 kg
  • Most BMI values fall between 24 and 26, with an outlier at 48
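
The "most common" ranges above can be read from the histogram bins directly instead of by eye. This is a small base R sketch using hist() with plot = FALSE; the sedentary-minute values and the 100-minute bin width are assumptions chosen for illustration.

```r
# Hypothetical daily sedentary minutes (illustrative values only)
sedentary_minutes <- c(620, 650, 700, 710, 1100, 1440)

# Bin into 100-minute intervals without drawing the plot
h <- hist(sedentary_minutes, breaks = seq(0, 1500, by = 100), plot = FALSE)

# Lower and upper edge of the fullest bin (bins are right-closed by default)
top_bin <- which.max(h$counts)
most_common_range <- c(h$breaks[top_bin], h$breaks[top_bin + 1])
print(most_common_range)
```

On this toy input the fullest bin is the one from 600 to 700 minutes; on the real data the result will shift with the bin width chosen.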

Box plots

  1. daily_activity

  2. weight_log

Key takeaways:
  • A cluster of data points sits at zero for TotalSteps, TotalDistance, and TrackerDistance, and at the maximum value of SedentaryMinutes. This might imply that no data was recorded on those days, possibly because clients were not wearing their trackers.

  • There are three outliers in the boxplots for TotalSteps, TotalDistance, and TrackerDistance. It would be interesting to see whether they come from the same three individuals in all three charts; they would be our top performers.

  • There is a single outlier in both WeightKg and BMI.
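
Both observations, the pile-up at zero and the outlying points, can be checked numerically. The sketch below is a base R illustration on made-up step counts; boxplot.stats() applies the usual 1.5 × IQR rule that boxplot() uses to draw its outlier points.

```r
# Hypothetical daily step counts: two zero days, typical days, one extreme day
total_steps <- c(0, 0, 4000, 5000, 6000, 7000, 8000, 9000, 30000)

# Days with no steps recorded at all (possibly the tracker was not worn)
zero_days <- sum(total_steps == 0)

# Values lying more than 1.5 * IQR beyond the hinges
outliers <- boxplot.stats(total_steps)$out

print(zero_days)
print(outliers)
```

Note that the zero days are not flagged as outliers here, which is why they need a separate check.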


Time-frame bar plots

  1. sleep_day

  2. heart_rate

Inspect whether there are patterns in the data with 0 activity

Inspect whether there are patterns in the data with max SedentaryMinutes

Key takeaways:
  • From our sleep_day dataset we can see generally elevated resting periods on Sundays.

  • From our heart_rate dataset we can see heightened fluctuations in heart rate values on Mondays.

  • The records with 0 active minutes and those with maximum sedentary minutes seem to come from the same clients, hence the matching patterns of occurrence; however, the patterns do not tell us much more.
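
A day-of-week comparison like the one behind the Sunday observation can be sketched by grouping records on the weekday. The example below is base R on made-up sleep records; it assumes SleepDay has already been parsed to Date (the real column may be stored as text and need converting first), and the minutes are hypothetical.

```r
# Hypothetical sleep records; 2016-04-10 and 2016-04-17 fall on Sundays
sleep <- data.frame(
  SleepDay = as.Date(c("2016-04-10", "2016-04-11", "2016-04-17", "2016-04-18")),
  TotalMinutesAsleep = c(500, 400, 520, 420)
)

# $wday is locale-independent: 0 = Sunday, 1 = Monday, ..., 6 = Saturday
sleep$wday <- as.POSIXlt(sleep$SleepDay)$wday

# Mean minutes asleep per day of the week
mean_by_day <- tapply(sleep$TotalMinutesAsleep, sleep$wday, mean)
print(mean_by_day)
```

Passing mean_by_day to barplot() would give a bar chart per weekday, similar in spirit to the time-frame plots above.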


Inspect the outliers

Key takeaways:
  • The three outlier points were readings from two clients
  • We might need to store these IDs in case we need them for future analysis or for a possible reward incentive
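
Tracing outlier points back to the clients who produced them can be done by filtering the rows that hold the outlying values. This is a base R sketch on made-up records, built so that three outlier readings come from two Ids; none of the values are from the real data.

```r
# Hypothetical activity records: 4 clients, 5 days each (values are made up)
activity <- data.frame(
  Id = rep(1:4, each = 5),
  TotalSteps = c(4000, 4500, 5000, 5500, 6000,    # Id 1
                 5000, 6000, 7000, 8000, 30000,   # Id 2
                 4200, 5200, 6200, 7200, 8200,    # Id 3
                 6500, 7500, 9000, 29000, 31000)  # Id 4
)

# Outlying values under the 1.5 * IQR rule, then the rows and Ids behind them
outlier_values <- boxplot.stats(activity$TotalSteps)$out
outlier_rows <- activity[activity$TotalSteps %in% outlier_values, ]
outlier_ids <- unique(outlier_rows$Id)

print(outlier_rows)
print(outlier_ids)
```

Keeping outlier_ids in a variable is one simple way to have the Ids on hand for later analysis.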

Cluster Analysis


Let’s perform cluster analysis with our daily_activity dataset, since it covers the most users.

1. create statistics per client

2. scale our variables

3. perform clustering

  • the complete method is more suitable for our data, as it clusters our data points better (a better distribution of the data points)
  • initial exploratory analysis suggested that we have 2-3 groups in our data (ref. the multi-modal distributions seen in the density plots)

4. split our data and visualize

We are going to go with the 2-group clustering (on the left), as a third cluster containing only one client does not make much sense.

5. assign our data with their respective clusters

6. compute general stats for each cluster

We can see contrasting behavior between cluster 1 and cluster 2, for instance:

  • cluster 1 clients have a more active behavior overall.
  • cluster 2 clients clocked almost twice the amount of SedentaryMinutes.
  • cluster 1 generally burnt more calories as a result.
  • cluster 2 shows 0 VeryActiveMinutes, which might mean these clients do not have an exercise schedule or take their trackers off during exercise; a further survey might be required.

The main point of clustering the data was to segment our customers so that we know which marketing approach suits which customer. Now that all our data is labelled as cluster 1 or cluster 2, we can take a more targeted approach.

  • We might decide to give the clusters more descriptive labels, such as “active” and “less active”.
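
The six clustering steps above can be sketched end to end in base R. This is an illustration under stated assumptions, not the exact code behind the figures: it reads "complete method" as complete-linkage hierarchical clustering via hclust(), uses only two of the activity columns, and runs on made-up data for six clients.

```r
# Hypothetical daily records for 6 clients, 2 days each (values are made up)
daily <- data.frame(
  Id = rep(1:6, each = 2),
  TotalSteps = c(12000, 12500, 11000, 11500, 13000, 12800,
                 3000, 2800, 2500, 2700, 3500, 3300),
  SedentaryMinutes = c(600, 620, 650, 640, 620, 610,
                       1200, 1220, 1250, 1240, 1150, 1170)
)

# 1. Statistics per client: mean of each variable
per_client <- aggregate(cbind(TotalSteps, SedentaryMinutes) ~ Id,
                        data = daily, FUN = mean)

# 2. Scale the variables so neither dominates the distance
scaled <- scale(per_client[, c("TotalSteps", "SedentaryMinutes")])

# 3. Hierarchical clustering with complete linkage
tree <- hclust(dist(scaled), method = "complete")

# 4-5. Cut the tree into 2 groups and attach the labels to each client
per_client$cluster <- cutree(tree, k = 2)

# 6. General stats for each cluster
cluster_stats <- aggregate(cbind(TotalSteps, SedentaryMinutes) ~ cluster,
                           data = per_client, FUN = mean)
print(cluster_stats)
```

Calling plot(tree) would draw the dendrogram, and changing k in cutree() to 3 gives the three-group split for comparison.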

Combined Key Findings

Let’s consolidate all our key takeaways:

  • Only 8 members provided their weight info
  • heart_rate data has over 2 million rows
  • TotalMinutesAsleep - Unimodal distribution
  • TotalTimeInBed - Unimodal distribution
  • SedentaryMinutes - Multi-modal distribution
  • Calories - Multi-modal distribution
  • WeightKg - Multi-modal distribution
  • WeightPounds - Multi-modal distribution
  • BMI - Multi-modal distribution
  • TotalSteps - Skewed distribution
  • TotalDistance - Skewed distribution
  • TrackerDistance - Skewed distribution
  • Value - Skewed distribution
  • BMI - Skewed distribution
  • Fat - Uniform distribution
  • The most common sedentary-minute values fall around 600-800, 1050-1300, and 1450
  • Sedentary minutes peak at a maximum of about 1450
  • The most common calorie values are around 2000 and 3000
  • The most common weights were recorded at 55-65 kg and 85-95 kg
  • There is an outlier weight around 130 kg
  • Most BMI values fall between 24 and 26, with an outlier at 48
  • A cluster of data points sits at zero for TotalSteps, TotalDistance, and TrackerDistance, and at the maximum value of SedentaryMinutes. This might imply that no data was recorded on those days, possibly because clients were not wearing their trackers.
  • There are three outliers in the boxplots for TotalSteps, TotalDistance, and TrackerDistance. It would be interesting to see whether they come from the same three individuals in all three charts; they would be our top performers.
  • There is a single outlier in both WeightKg and BMI.
  • From our sleep_day dataset we can see generally elevated resting periods on Sundays.
  • From our heart_rate dataset we can see heightened fluctuations in heart rate values on Mondays.
  • The records with 0 active minutes and those with maximum sedentary minutes seem to come from the same clients, hence the matching patterns of occurrence; however, the patterns do not tell us much more.
  • The three outlier points were readings from two clients (not from three clients as previously assumed)
  • We might need to store these IDs in case we need them for future analysis or for a possible reward incentive
  • Now that we have our customers segmented into two definitive clusters (“active” and “less active”), we can derive our targeted marketing strategies from the above key points and apply them to the most relevant cluster!