What influences mental health scores the most?
Which lifestyle factors (screen_time, social_media, sleep_hours, physical_activity, family_time) show the strongest correlation with mental_health_score, and do these relationships vary by age group or gender?
Is there a threshold effect for any variables - for example, does mental health deteriorate significantly after a certain number of screen time hours or below a certain amount of sleep?
How do combinations of factors interact to influence mental health - for instance, does high physical activity offset the negative effects of high screen time, or does adequate family time buffer against social media usage?
Are there distinct behavioral profiles or clusters in the data that correspond to different mental health outcomes, and what characterizes high vs. low mental health score groups?
Does the anxiety variable serve as a mediator between lifestyle factors and mental health scores - meaning do certain behaviors influence anxiety levels, which then impact overall mental health?
This analysis examined lifestyle factors affecting mental health scores among 1,000 adolescents (ages 13-18) to identify the strongest predictors of mental wellbeing. The research employed multiple analytical approaches including correlation analysis, threshold detection, behavioral clustering, and mediation modeling to provide a comprehensive understanding of mental health influences.
Screen time emerged as the most powerful predictor of poor mental health outcomes. The analysis revealed:
- Critical threshold: 5.27 hours per day - beyond this point, mental health scores decline dramatically
- Consistent impact across demographics: Strong negative correlations observed across all gender groups (Male: -0.676, Female: -0.697, Other: -0.617)
- Mediation effect: Screen time influences mental health both directly (-2.532 points) and indirectly through increased anxiety (-0.917 points)
Social media usage showed the second-strongest negative relationship with mental health:
- Critical threshold: 3.75 hours per day
- Largest mediation effect: Social media has the strongest indirect effect through anxiety (-1.243 points), suggesting it significantly increases anxiety levels
- Gender consistency: Similar negative impacts across all demographic groups
Adequate sleep emerged as the most protective factor for mental health:
- Critical threshold: 6.29 hours - below this point, mental health deteriorates significantly
- Protective mediation: Sleep positively influences mental health both directly (2.412 points) and by reducing anxiety (1.223 points)
- Consistent benefit: Positive correlations observed across all age and gender groups
The clustering analysis identified two distinct behavioral patterns with dramatically different mental health outcomes:
The analysis revealed that combinations of factors matter significantly:
1. Sleep × Screen Time: Good sleep can buffer some negative effects of screen time
2. Family Time × Social Media: Quality family time can offset social media's negative impact
3. Physical Activity × Screen Time/Social Media: Exercise provides protective benefits against digital media overuse
Anxiety serves as a critical pathway through which lifestyle factors influence mental health:
- Social media has the strongest anxiety-mediated effect (-1.243 points)
- Sleep provides the strongest anxiety-reducing benefit (+1.223 points)
- All lifestyle factors show significant mediation through anxiety, indicating that managing anxiety is crucial for mental health improvement
The analysis reveals that digital media consumption (screen time + social media) represents the dominant influence on adolescent mental health, while sleep serves as the primary protective factor. The relationship is largely mediated through anxiety, suggesting that interventions targeting both digital media reduction and sleep improvement, combined with anxiety management strategies, would be most effective for improving mental health outcomes in this population.
import pandas as pd
import numpy as np
import statsmodels.api as sm
from scipy import stats
from statsmodels.formula.api import ols
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings('ignore')
# Initialize list to store all Plotly figures
plotly_figs = []
print("=== COMPREHENSIVE MENTAL HEALTH ANALYSIS PIPELINE ===\n")
# ===== 1. DATA PREPROCESSING =====
print("1. DATA PREPROCESSING")
print("=" * 50)
# Create a copy of the original DataFrame
cleaned_df = df.copy()
# Remove the unnecessary index column
if 'Unnamed: 0' in cleaned_df.columns:
cleaned_df = cleaned_df.drop('Unnamed: 0', axis=1)
# Separate column types for targeted processing
numeric_cols = cleaned_df.select_dtypes(include=['int64', 'float64']).columns
categorical_cols = cleaned_df.select_dtypes(include='object').columns
print(f"Numeric columns: {list(numeric_cols)}")
print(f"Categorical columns: {list(categorical_cols)}")
# Handle missing values for numeric columns
for col in numeric_cols:
if cleaned_df[col].isnull().any():
cleaned_df[col] = cleaned_df[col].fillna(cleaned_df[col].median())
print(f"Filled missing values in {col} with median: {cleaned_df[col].median()}")
# Handle missing values for categorical columns
for col in categorical_cols:
if cleaned_df[col].isnull().any():
mode_value = cleaned_df[col].mode()[0] if not cleaned_df[col].mode().empty else 'Unknown'
cleaned_df[col] = cleaned_df[col].fillna(mode_value)
print(f"Filled missing values in {col} with mode: {mode_value}")
# Ensure proper data types
cleaned_df['age'] = cleaned_df['age'].astype('int64')
cleaned_df['anxiety'] = cleaned_df['anxiety'].astype('int64')
cleaned_df['physical_activity'] = cleaned_df['physical_activity'].astype('int64')
float_cols = ['screen_time', 'social_media', 'sleep_hours', 'family_time', 'mental_health_score']
for col in float_cols:
if col in cleaned_df.columns:
cleaned_df[col] = cleaned_df[col].astype('float64')
cleaned_df['gender'] = cleaned_df['gender'].astype('str')
# Validate data ranges
print("\nData validation:")
print(f"Age range: {cleaned_df['age'].min()} - {cleaned_df['age'].max()}")
print(f"Screen time range: {cleaned_df['screen_time'].min():.2f} - {cleaned_df['screen_time'].max():.2f}")
print(f"Sleep hours range: {cleaned_df['sleep_hours'].min():.2f} - {cleaned_df['sleep_hours'].max():.2f}")
print(f"Mental health score range: {cleaned_df['mental_health_score'].min():.2f} - {cleaned_df['mental_health_score'].max():.2f}")
print(f"Unique gender values: {cleaned_df['gender'].unique()}")
print(f"Anxiety values: {sorted(cleaned_df['anxiety'].unique())}")
print(f"\nCleaned dataset shape: {cleaned_df.shape}")
print(f"Missing values per column:\n{cleaned_df.isnull().sum()}")
# ===== 2. STATISTICAL ANALYSIS =====
print("\n\n2. STATISTICAL ANALYSIS")
print("=" * 50)
lifestyle_factors = ['screen_time', 'social_media', 'sleep_hours', 'physical_activity', 'family_time']
# 2.1 CORRELATION ANALYSIS
print("2.1 CORRELATION ANALYSIS BY DEMOGRAPHICS")
print("-" * 40)
correlation_analysis = {}
# Overall correlations
corr_data = cleaned_df[lifestyle_factors + ['mental_health_score']].corr()
correlation_analysis['overall_correlations'] = corr_data['mental_health_score'][lifestyle_factors].to_dict()
print("Overall Correlations with Mental Health Score:")
for factor, corr in correlation_analysis['overall_correlations'].items():
print(f" {factor}: {corr:.3f}")
# Correlations by gender
correlation_analysis['by_gender'] = {}
for gender in cleaned_df['gender'].unique():
if pd.notna(gender):
gender_data = cleaned_df[cleaned_df['gender'] == gender]
gender_corr = gender_data[lifestyle_factors + ['mental_health_score']].corr()
correlation_analysis['by_gender'][gender] = gender_corr['mental_health_score'][lifestyle_factors].to_dict()
print(f"\nCorrelations for {gender}:")
for factor, corr in correlation_analysis['by_gender'][gender].items():
print(f" {factor}: {corr:.3f}")
# Correlations by age groups
cleaned_df['age_group'] = pd.cut(cleaned_df['age'], bins=[0, 16, 18, 25], labels=['Young (16)', 'Teen (17-18)', 'Adult (>18)'])
correlation_analysis['by_age_group'] = {}
for age_group in cleaned_df['age_group'].unique():
if pd.notna(age_group):
age_data = cleaned_df[cleaned_df['age_group'] == age_group]
if len(age_data) > 5:
age_corr = age_data[lifestyle_factors + ['mental_health_score']].corr()
correlation_analysis['by_age_group'][str(age_group)] = age_corr['mental_health_score'][lifestyle_factors].to_dict()
print(f"\nCorrelations for {age_group}:")
for factor, corr in correlation_analysis['by_age_group'][str(age_group)].items():
print(f" {factor}: {corr:.3f}")
# 2.2 THRESHOLD ANALYSIS
print("\n\n2.2 THRESHOLD ANALYSIS")
print("-" * 40)
threshold_analysis = {}
def find_threshold_piecewise(x, y, variable_name):
"""Find threshold using piecewise regression"""
try:
valid_idx = ~(pd.isna(x) | pd.isna(y))
x_clean = x[valid_idx]
y_clean = y[valid_idx]
if len(x_clean) < 10:
return None
x_sorted = np.sort(x_clean)
n_points = len(x_sorted)
best_r2 = -np.inf
best_threshold = None
start_idx = int(0.2 * n_points)
end_idx = int(0.8 * n_points)
for i in range(start_idx, end_idx):
threshold = x_sorted[i]
x1 = np.minimum(x_clean, threshold)
x2 = np.maximum(x_clean - threshold, 0)
X = sm.add_constant(np.column_stack([x1, x2]))
model = sm.OLS(y_clean, X).fit()
if model.rsquared > best_r2:
best_r2 = model.rsquared
best_threshold = threshold
return {'threshold': best_threshold, 'r_squared': best_r2}
except:
return None
threshold_vars = ['screen_time', 'sleep_hours', 'social_media', 'physical_activity']
for var in threshold_vars:
result = find_threshold_piecewise(cleaned_df[var], cleaned_df['mental_health_score'], var)
threshold_analysis[var] = result
if result:
print(f"{var.replace('_', ' ').title()} threshold: {result['threshold']:.2f} (R = {result['r_squared']:.3f})")
else:
print(f"{var.replace('_', ' ').title()}: No significant threshold detected")
# 2.3 INTERACTION ANALYSIS
print("\n\n2.3 INTERACTION ANALYSIS")
print("-" * 40)
interaction_analysis = {}
interactions_to_test = [
('physical_activity', 'screen_time'),
('family_time', 'social_media'),
('sleep_hours', 'screen_time'),
('physical_activity', 'social_media')
]
for var1, var2 in interactions_to_test:
try:
formula = f'mental_health_score ~ {var1} + {var2} + {var1}:{var2} + age + C(gender)'
model = ols(formula, data=cleaned_df).fit()
interaction_term = f'{var1}:{var2}'
if interaction_term in model.params.index:
coeff = model.params[interaction_term]
p_value = model.pvalues[interaction_term]
interaction_analysis[f'{var1}_x_{var2}'] = {
'coefficient': coeff,
'p_value': p_value,
'significant': p_value < 0.05,
'r_squared': model.rsquared
}
significance = "***" if p_value < 0.001 else "**" if p_value < 0.01 else "*" if p_value < 0.05 else ""
print(f"{var1.replace('_', ' ').title()} {var2.replace('_', ' ').title()}: = {coeff:.3f}, p = {p_value:.3f} {significance}")
except Exception as e:
print(f"Error testing {var1} {var2}: {str(e)}")
# 2.4 MEDIATION ANALYSIS
print("\n\n2.4 MEDIATION ANALYSIS (Anxiety as Mediator)")
print("-" * 40)
mediation_analysis = {}
for factor in lifestyle_factors:
try:
med_data = cleaned_df[[factor, 'anxiety', 'mental_health_score', 'age', 'gender']].dropna()
if len(med_data) < 20:
continue
# Path a: factor -> anxiety
formula_a = f'anxiety ~ {factor} + age + C(gender)'
model_a = ols(formula_a, data=med_data).fit()
# Path b: anxiety -> mental_health_score (controlling for factor)
formula_b = f'mental_health_score ~ anxiety + {factor} + age + C(gender)'
model_b = ols(formula_b, data=med_data).fit()
# Path c: factor -> mental_health_score (total effect)
formula_c = f'mental_health_score ~ {factor} + age + C(gender)'
model_c = ols(formula_c, data=med_data).fit()
a_coeff = model_a.params[factor] if factor in model_a.params else 0
b_coeff = model_b.params['anxiety'] if 'anxiety' in model_b.params else 0
c_coeff = model_c.params[factor] if factor in model_c.params else 0
c_prime_coeff = model_b.params[factor] if factor in model_b.params else 0
indirect_effect = a_coeff * b_coeff
mediation_analysis[factor] = {
'path_a_coeff': a_coeff,
'path_a_pvalue': model_a.pvalues[factor] if factor in model_a.pvalues else 1,
'path_b_coeff': b_coeff,
'path_b_pvalue': model_b.pvalues['anxiety'] if 'anxiety' in model_b.pvalues else 1,
'total_effect': c_coeff,
'direct_effect': c_prime_coeff,
'indirect_effect': indirect_effect,
'mediation_present': abs(indirect_effect) > 0.01 and model_a.pvalues.get(factor, 1) < 0.05 and model_b.pvalues.get('anxiety', 1) < 0.05
}
print(f"\n{factor.replace('_', ' ').title()}:")
print(f" Total effect: {c_coeff:.3f}")
print(f" Direct effect: {c_prime_coeff:.3f}")
print(f" Indirect effect (via anxiety): {indirect_effect:.3f}")
print(f" Mediation present: {mediation_analysis[factor]['mediation_present']}")
except Exception as e:
print(f"Error in mediation analysis for {factor}: {str(e)}")
# ===== 3. MACHINE LEARNING ANALYSIS =====
print("\n\n3. MACHINE LEARNING ANALYSIS")
print("=" * 50)
# Extract features for clustering
X_lifestyle = cleaned_df[lifestyle_factors].copy()
# Standardize features for clustering
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_lifestyle)
# Determine optimal number of clusters
inertias = []
silhouette_scores = []
k_range = range(2, 8)
for k in k_range:
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
kmeans.fit(X_scaled)
inertias.append(kmeans.inertia_)
silhouette_scores.append(silhouette_score(X_scaled, kmeans.labels_))
optimal_k = k_range[np.argmax(silhouette_scores)]
print(f"Optimal number of clusters based on silhouette score: {optimal_k}")
# Apply K-means clustering with optimal k
kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
kmeans_labels = kmeans.fit_predict(X_scaled)
# Apply Hierarchical clustering
hierarchical = AgglomerativeClustering(n_clusters=optimal_k, linkage='ward')
hierarchical_labels = hierarchical.fit_predict(X_scaled)
# Add cluster labels to dataframe
df_clustered = cleaned_df.copy()
df_clustered['kmeans_cluster'] = kmeans_labels
df_clustered['hierarchical_cluster'] = hierarchical_labels
# Analyze K-means clusters
print("\n3.1 K-MEANS CLUSTERING ANALYSIS")
print("-" * 40)
kmeans_profiles = {}
for cluster in range(optimal_k):
cluster_data = df_clustered[df_clustered['kmeans_cluster'] == cluster]
cluster_size = len(cluster_data)
profile = {
'size': cluster_size,
'percentage': (cluster_size / len(df_clustered)) * 100,
'lifestyle_means': {},
'mental_health_stats': {},
'anxiety_stats': {}
}
for factor in lifestyle_factors:
profile['lifestyle_means'][factor] = cluster_data[factor].mean()
profile['mental_health_stats'] = {
'mean': cluster_data['mental_health_score'].mean(),
'std': cluster_data['mental_health_score'].std(),
'median': cluster_data['mental_health_score'].median()
}
profile['anxiety_stats'] = {
'anxiety_rate': (cluster_data['anxiety'] == 1).mean() * 100,
'mean_anxiety': cluster_data['anxiety'].mean()
}
kmeans_profiles[f'Cluster_{cluster}'] = profile
print(f"\nCluster {cluster} (n={cluster_size}, {profile['percentage']:.1f}%):")
print(f" Mental Health Score: {profile['mental_health_stats']['mean']:.2f} {profile['mental_health_stats']['std']:.2f}")
print(f" Anxiety Rate: {profile['anxiety_stats']['anxiety_rate']:.1f}%")
print(" Lifestyle Profile:")
for factor in lifestyle_factors:
print(f" {factor}: {profile['lifestyle_means'][factor]:.2f}")
# High vs Low Mental Health Groups Analysis
mental_health_median = cleaned_df['mental_health_score'].median()
high_mh_group = cleaned_df[cleaned_df['mental_health_score'] >= mental_health_median]
low_mh_group = cleaned_df[cleaned_df['mental_health_score'] < mental_health_median]
print(f"\n3.2 HIGH vs LOW MENTAL HEALTH GROUPS")
print("-" * 40)
print(f"Mental Health Score Median: {mental_health_median:.2f}")
print(f"High Mental Health Group: n={len(high_mh_group)} ({len(high_mh_group)/len(cleaned_df)*100:.1f}%)")
print(f"Low Mental Health Group: n={len(low_mh_group)} ({len(low_mh_group)/len(cleaned_df)*100:.1f}%)")
# Analyze characteristics of high vs low mental health groups
high_mh_characteristics = {
'size': len(high_mh_group),
'mental_health_mean': high_mh_group['mental_health_score'].mean(),
'anxiety_rate': (high_mh_group['anxiety'] == 1).mean() * 100,
'lifestyle_means': {},
'age_gender_distribution': {}
}
low_mh_characteristics = {
'size': len(low_mh_group),
'mental_health_mean': low_mh_group['mental_health_score'].mean(),
'anxiety_rate': (low_mh_group['anxiety'] == 1).mean() * 100,
'lifestyle_means': {},
'age_gender_distribution': {}
}
for factor in lifestyle_factors:
high_mh_characteristics['lifestyle_means'][factor] = high_mh_group[factor].mean()
low_mh_characteristics['lifestyle_means'][factor] = low_mh_group[factor].mean()
print("\nHigh Mental Health Group Characteristics:")
print(f" Mean Mental Health Score: {high_mh_characteristics['mental_health_mean']:.2f}")
print(f" Anxiety Rate: {high_mh_characteristics['anxiety_rate']:.1f}%")
print(" Lifestyle Factors:")
for factor in lifestyle_factors:
print(f" {factor}: {high_mh_characteristics['lifestyle_means'][factor]:.2f}")
print("\nLow Mental Health Group Characteristics:")
print(f" Mean Mental Health Score: {low_mh_characteristics['mental_health_mean']:.2f}")
print(f" Anxiety Rate: {low_mh_characteristics['anxiety_rate']:.1f}%")
print(" Lifestyle Factors:")
for factor in lifestyle_factors:
print(f" {factor}: {low_mh_characteristics['lifestyle_means'][factor]:.2f}")
# Statistical tests to compare groups
print("\n3.3 STATISTICAL COMPARISONS")
print("-" * 40)
for factor in lifestyle_factors:
high_values = high_mh_group[factor]
low_values = low_mh_group[factor]
t_stat, p_value = stats.ttest_ind(high_values, low_values)
significance = "***" if p_value < 0.001 else "**" if p_value < 0.01 else "*" if p_value < 0.05 else ""
print(f"{factor}: t-statistic = {t_stat:.3f}, p-value = {p_value:.4f} {significance}")
# Create results dictionaries
clustering_results = {
'optimal_clusters': optimal_k,
'silhouette_scores': dict(zip(k_range, silhouette_scores)),
'kmeans_profiles': kmeans_profiles,
'cluster_comparison': {
'kmeans_silhouette': silhouette_score(X_scaled, kmeans_labels),
'hierarchical_silhouette': silhouette_score(X_scaled, hierarchical_labels)
}
}
profile_characteristics = {
'high_mental_health': high_mh_characteristics,
'low_mental_health': low_mh_characteristics,
'median_threshold': mental_health_median,
'lifestyle_factors_analyzed': lifestyle_factors
}
# ===== 4. COMPREHENSIVE VISUALIZATIONS =====
print("\n\n4. COMPREHENSIVE VISUALIZATIONS")
print("=" * 50)
# Performance optimization - sample if dataset is too large
if len(cleaned_df) > 50000:
viz_df = cleaned_df.sample(5000, random_state=42)
else:
viz_df = cleaned_df.copy()
# 4.1 Overall correlation heatmap
print("Creating correlation analysis visualizations...")
fig_corr_overall = go.Figure(data=go.Heatmap(
z=corr_data.values,
x=corr_data.columns,
y=corr_data.columns,
colorscale='RdBu',
zmid=0,
text=np.round(corr_data.values, 2),
texttemplate='%{text}',
textfont={"size": 10},
hoverongaps=False
))
fig_corr_overall.update_layout(
title='Overall Correlation Matrix: Lifestyle Factors vs Mental Health Score',
xaxis_title='Variables',
yaxis_title='Variables',
width=600,
height=500
)
plotly_figs.append(fig_corr_overall)
# 4.2 Correlation by gender
fig_corr_gender = make_subplots(
rows=1, cols=2,
subplot_titles=('Male', 'Female'),
shared_yaxes=True
)
for i, gender in enumerate(['Male', 'Female']):
gender_data = viz_df[viz_df['gender'] == gender]
if len(gender_data) > 0:
corr_gender = gender_data[lifestyle_factors + ['mental_health_score']].corr()
fig_corr_gender.add_trace(
go.Heatmap(
z=corr_gender.values,
x=corr_gender.columns,
y=corr_gender.columns,
colorscale='RdBu',
zmid=0,
text=np.round(corr_gender.values, 2),
texttemplate='%{text}',
textfont={"size": 8},
showscale=True if i == 1 else False
),
row=1, col=i+1
)
fig_corr_gender.update_layout(
title='Correlation Analysis by Gender',
width=800,
height=400
)
plotly_figs.append(fig_corr_gender)
# 4.3 Threshold effect visualizations
print("Creating threshold effect visualizations...")
fig_threshold = make_subplots(
rows=2, cols=2,
subplot_titles=('Screen Time vs Mental Health', 'Sleep Hours vs Mental Health',
'Social Media vs Mental Health', 'Physical Activity vs Mental Health'),
vertical_spacing=0.1
)
# Screen time analysis
screen_bins = pd.cut(viz_df['screen_time'], bins=10)
screen_grouped = viz_df.groupby(screen_bins)['mental_health_score'].agg(['mean', 'std']).reset_index()
screen_grouped['screen_time_mid'] = screen_grouped['screen_time'].apply(lambda x: x.mid)
fig_threshold.add_trace(
go.Scatter(
x=screen_grouped['screen_time_mid'],
y=screen_grouped['mean'],
error_y=dict(type='data', array=screen_grouped['std']),
mode='lines+markers',
name='Screen Time',
line=dict(color='red')
),
row=1, col=1
)
# Sleep hours analysis
sleep_bins = pd.cut(viz_df['sleep_hours'], bins=10)
sleep_grouped = viz_df.groupby(sleep_bins)['mental_health_score'].agg(['mean', 'std']).reset_index()
sleep_grouped['sleep_hours_mid'] = sleep_grouped['sleep_hours'].apply(lambda x: x.mid)
fig_threshold.add_trace(
go.Scatter(
x=sleep_grouped['sleep_hours_mid'],
y=sleep_grouped['mean'],
error_y=dict(type='data', array=sleep_grouped['std']),
mode='lines+markers',
name='Sleep Hours',
line=dict(color='blue')
),
row=1, col=2
)
# Social media analysis
social_bins = pd.cut(viz_df['social_media'], bins=10)
social_grouped = viz_df.groupby(social_bins)['mental_health_score'].agg(['mean', 'std']).reset_index()
social_grouped['social_media_mid'] = social_grouped['social_media'].apply(lambda x: x.mid)
fig_threshold.add_trace(
go.Scatter(
x=social_grouped['social_media_mid'],
y=social_grouped['mean'],
error_y=dict(type='data', array=social_grouped['std']),
mode='lines+markers',
name='Social Media',
line=dict(color='purple')
),
row=2, col=1
)
# Physical activity analysis
activity_bins = pd.cut(viz_df['physical_activity'], bins=10)
activity_grouped = viz_df.groupby(activity_bins)['mental_health_score'].agg(['mean', 'std']).reset_index()
activity_grouped['physical_activity_mid'] = activity_grouped['physical_activity'].apply(lambda x: x.mid)
fig_threshold.add_trace(
go.Scatter(
x=activity_grouped['physical_activity_mid'],
y=activity_grouped['mean'],
error_y=dict(type='data', array=activity_grouped['std']),
mode='lines+markers',
name='Physical Activity',
line=dict(color='green')
),
row=2, col=2
)
fig_threshold.update_layout(
title='Threshold Effects: Lifestyle Factors vs Mental Health Score',
height=600,
showlegend=False
)
fig_threshold.update_xaxes(title_text="Hours/Units", row=1, col=1)
fig_threshold.update_xaxes(title_text="Hours", row=1, col=2)
fig_threshold.update_xaxes(title_text="Hours", row=2, col=1)
fig_threshold.update_xaxes(title_text="Hours/Units", row=2, col=2)
fig_threshold.update_yaxes(title_text="Mental Health Score")
plotly_figs.append(fig_threshold)
# 4.4 Interaction effect visualizations
print("Creating interaction effect visualizations...")
fig_interaction = px.scatter(
viz_df,
x='screen_time',
y='physical_activity',
color='mental_health_score',
size='family_time',
hover_data=['sleep_hours', 'social_media', 'anxiety'],
color_continuous_scale='RdYlBu_r',
title='Interaction Effects: Screen Time vs Physical Activity
Color: Mental Health Score, Size: Family Time'
)
fig_interaction.update_layout(
xaxis_title='Screen Time (hours)',
yaxis_title='Physical Activity (hours)',
width=700,
height=500
)
plotly_figs.append(fig_interaction)
# Sleep vs Social Media interaction
fig_sleep_social = px.scatter(
viz_df,
x='sleep_hours',
y='social_media',
color='mental_health_score',
facet_col='gender',
color_continuous_scale='RdYlBu_r',
title='Sleep Hours vs Social Media Usage by Gender
Color: Mental Health Score'
)
fig_sleep_social.update_layout(height=400)
plotly_figs.append(fig_sleep_social)
# 4.5 Cluster analysis visualizations
print("Creating cluster analysis visualizations...")
# Add cluster labels to viz_df
viz_df['cluster'] = df_clustered['kmeans_cluster'][:len(viz_df)]
fig_clusters = px.scatter(
viz_df,
x='screen_time',
y='mental_health_score',
color='cluster',
size='physical_activity',
hover_data=['sleep_hours', 'social_media', 'family_time', 'anxiety'],
title='Behavioral Clusters: Screen Time vs Mental Health Score
Size: Physical Activity Level'
)
fig_clusters.update_layout(
xaxis_title='Screen Time (hours)',
yaxis_title='Mental Health Score',
width=700,
height=500
)
plotly_figs.append(fig_clusters)
# Cluster profile radar chart
cluster_profiles = df_clustered.groupby('kmeans_cluster')[lifestyle_factors].mean()
fig_radar = go.Figure()
colors = ['red', 'blue', 'green', 'orange', 'purple']
for i, cluster in enumerate(cluster_profiles.index):
fig_radar.add_trace(go.Scatterpolar(
r=cluster_profiles.loc[cluster].values,
theta=lifestyle_factors,
fill='toself',
name=f'Cluster {cluster}',
line_color=colors[i % len(colors)]
))
fig_radar.update_layout(
polar=dict(
radialaxis=dict(
visible=True,
range=[0, cluster_profiles.values.max()]
)),
showlegend=True,
title="Cluster Profiles: Average Lifestyle Factor Scores"
)
plotly_figs.append(fig_radar)
# 4.6 Mediation analysis visualization
print("Creating mediation analysis visualization...")
fig_mediation = make_subplots(
rows=2, cols=2,
subplot_titles=('Screen Time -> Anxiety -> Mental Health',
'Sleep Hours -> Anxiety -> Mental Health',
'Social Media -> Anxiety -> Mental Health',
'Physical Activity -> Anxiety -> Mental Health')
)
fig_mediation.add_trace(
go.Scatter(
x=viz_df['screen_time'],
y=viz_df['anxiety'],
mode='markers',
name='Screen Time vs Anxiety',
marker=dict(color='red', opacity=0.6)
),
row=1, col=1
)
fig_mediation.add_trace(
go.Scatter(
x=viz_df['sleep_hours'],
y=viz_df['anxiety'],
mode='markers',
name='Sleep vs Anxiety',
marker=dict(color='blue', opacity=0.6)
),
row=1, col=2
)
fig_mediation.add_trace(
go.Scatter(
x=viz_df['social_media'],
y=viz_df['anxiety'],
mode='markers',
name='Social Media vs Anxiety',
marker=dict(color='purple', opacity=0.6)
),
row=2, col=1
)
fig_mediation.add_trace(
go.Scatter(
x=viz_df['physical_activity'],
y=viz_df['anxiety'],
mode='markers',
name='Physical Activity vs Anxiety',
marker=dict(color='green', opacity=0.6)
),
row=2, col=2
)
fig_mediation.update_layout(
title='Mediation Analysis: Lifestyle Factors -> Anxiety -> Mental Health',
height=600,
showlegend=False
)
fig_mediation.update_yaxes(title_text="Anxiety Level")
plotly_figs.append(fig_mediation)
# 4.7 High vs Low Mental Health Comparison
fig_comparison = go.Figure()
high_means = [high_mh_characteristics['lifestyle_means'][factor] for factor in lifestyle_factors]
low_means = [low_mh_characteristics['lifestyle_means'][factor] for factor in lifestyle_factors]
fig_comparison.add_trace(go.Bar(
x=lifestyle_factors,
y=high_means,
name='High Mental Health',
marker_color='lightblue'
))
fig_comparison.add_trace(go.Bar(
x=lifestyle_factors,
y=low_means,
name='Low Mental Health',
marker_color='lightcoral'
))
fig_comparison.update_layout(
title='Lifestyle Factor Comparison: High vs Low Mental Health Groups',
xaxis_title='Lifestyle Factors',
yaxis_title='Average Hours/Units',
barmode='group',
width=700,
height=500
)
plotly_figs.append(fig_comparison)
# Display all visualizations
print("Displaying comprehensive visualizations...")
for i, fig in enumerate(plotly_figs):
print(f"\n=== VISUALIZATION {i+1} ===")
fig.show()
print(f"\nTotal visualizations created: {len(plotly_figs)}")
# ===== 5. SUMMARY RESULTS =====
print("\n\n5. ANALYSIS SUMMARY")
print("=" * 50)
print("Key Findings:")
print("1. Strongest correlations with mental health:")
for factor, corr in sorted(correlation_analysis['overall_correlations'].items(), key=lambda x: abs(x[1]), reverse=True):
print(f" - {factor}: {corr:.3f}")
print(f"\n2. Optimal number of behavioral clusters: {optimal_k}")
print(f" - K-means silhouette score: {clustering_results['cluster_comparison']['kmeans_silhouette']:.3f}")
print(f"\n3. Mental health groups (median threshold: {mental_health_median:.2f}):")
print(f" - High MH group: {len(high_mh_group)} individuals ({len(high_mh_group)/len(cleaned_df)*100:.1f}%)")
print(f" - Low MH group: {len(low_mh_group)} individuals ({len(low_mh_group)/len(cleaned_df)*100:.1f}%)")
print("\n4. Significant interactions found:")
for interaction, data in interaction_analysis.items():
if data['significant']:
print(f" - {interaction}: = {data['coefficient']:.3f}, p = {data['p_value']:.3f}")
print("\n5. Mediation effects (anxiety as mediator):")
for factor, data in mediation_analysis.items():
if data['mediation_present']:
print(f" - {factor}: indirect effect = {data['indirect_effect']:.3f}")
print("\nAnalysis completed successfully!")
print("All results stored in respective variables and visualizations created.")
Digital media consumption is the dominant factor influencing adolescent mental health, with screen time (r = -0.680) and social media usage (r = -0.638) representing the two strongest negative predictors. Sleep emerges as the most powerful protective factor (r = 0.556). The analysis reveals critical thresholds: mental health deteriorates significantly beyond 5.3 hours of daily screen time, 3.8 hours of social media use, or below 6.3 hours of sleep. These factors operate primarily through anxiety as a mediating pathway, meaning excessive digital consumption increases anxiety, which then compounds poor mental health outcomes.
Key Takeaways
Recommended Next Steps