📊 Statistical Analysis for Data Professionals
Statistical thinking is fundamental to data analysis. Master hypothesis testing, distributions, and statistical inference to derive meaningful insights from data.
Descriptive Statistics
Central Tendency
- Mean: Arithmetic average (sensitive to outliers)
- Median: Middle value (robust to outliers)
- Mode: Most frequent value
Dispersion
- Variance: Average squared deviation from the mean
- Standard Deviation: Square root of variance
- Range: Max - Min
- IQR: Interquartile range (Q3 - Q1)
import numpy as np
import scipy.stats as stats
data = [1, 2, 2, 3, 4, 5, 5, 5, 6, 100]
# Summary statistics
mean = np.mean(data)
median = np.median(data)
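std = np.std(data, ddof=1)  # sample standard deviation (NumPy's default ddof=0 gives the population value)
variance = np.var(data, ddof=1)  # sample variance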
# Percentiles
q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
iqr = q3 - q1  # interquartile range
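To make the "robust to outliers" point concrete, the sketch below takes the same sample and compares the mean and median with and without the extreme value 100. The 1.5 × IQR fences follow a common convention (Tukey's rule), which is an assumption of this example rather than something stated above.

```python
import numpy as np

data = np.array([1, 2, 2, 3, 4, 5, 5, 5, 6, 100])

# IQR-based fences (1.5 * IQR is a convention assumed for this sketch)
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Split the sample into flagged outliers and the remaining values
outliers = data[(data < lower_fence) | (data > upper_fence)]
trimmed = data[(data >= lower_fence) & (data <= upper_fence)]

print("mean with / without outlier:  ", data.mean(), trimmed.mean())
print("median with / without outlier:", np.median(data), np.median(trimmed))
print("flagged outliers:", outliers)
```

The mean moves from 13.3 to about 3.67 once the outlier is dropped, while the median only moves from 4.5 to 4.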
Probability Distributions
Normal Distribution
- Bell-shaped, symmetric
- 68-95-99.7 rule
- Arises often in practice, because sums and averages of many independent effects tend toward it (central limit theorem)
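The 68-95-99.7 rule can be checked directly from the normal CDF. This is a minimal sketch using the standard normal; the percentages are the same for any mean and standard deviation.

```python
from scipy.stats import norm

# Probability that a normal variable falls within k standard deviations of its mean
within = {k: norm.cdf(k) - norm.cdf(-k) for k in (1, 2, 3)}

for k, prob in within.items():
    print(f"within {k} sd: {prob:.4f}")
```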
Poisson Distribution
- Count data, rare events
- Events occur independently at a constant average rate
- Example: Email arrivals per hour, bugs per thousand lines of code
Binomial Distribution
- Success/failure outcomes
- Fixed number of trials
- Example: Number of customers who churn out of a fixed group, assuming each churns independently with the same probability
from scipy.stats import norm, poisson, binom
# Normal distribution
prob = norm.cdf(100, loc=100, scale=15) # P(X <= 100)
# Poisson distribution
prob = poisson.pmf(5, mu=3) # P(X = 5 | λ = 3)
# Binomial distribution
prob = binom.pmf(3, n=10, p=0.5) # P(X = 3)
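`pmf` answers "exactly k" questions, but many practical questions are "at most" or "at least". A short sketch using the `cdf` and `sf` (survival function) methods of the same SciPy distributions; the parameter values are chosen only for illustration.

```python
from scipy.stats import poisson, binom

# Poisson count with mean 3: P(X <= 5) and P(X > 5)
p_at_most_5 = poisson.cdf(5, mu=3)
p_more_than_5 = poisson.sf(5, mu=3)   # sf(k) = 1 - cdf(k) = P(X > k)

# At least 8 successes in 10 fair trials: P(X >= 8) = P(X > 7)
p_at_least_8 = binom.sf(7, n=10, p=0.5)

print(f"P(X <= 5) = {p_at_most_5:.4f}")
print(f"P(X > 5)  = {p_more_than_5:.4f}")
print(f"P(X >= 8) = {p_at_least_8:.4f}")
```

Note the off-by-one in the last call: because `sf` is strictly "greater than", "at least 8" is written as `sf(7, ...)`.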
Hypothesis Testing
Null Hypothesis (H₀) vs Alternative Hypothesis (H₁)
- H₀: No effect or no difference
- H₁: There is an effect or difference
P-value
- Probability of observing a result at least as extreme as the one obtained, assuming H₀ is true
- p < 0.05: Typically reject H₀ (statistically significant)
- p >= 0.05: Fail to reject H₀
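from scipy.stats import ttest_ind, chi2_contingency
# Illustrative data (made up for this example)
group1 = [5.1, 4.9, 5.6, 5.8, 6.0, 5.3]
group2 = [6.2, 6.5, 5.9, 6.8, 7.0, 6.4]
contingency_table = [[30, 10], [20, 25]]
# Two-sample t-test (comparing means; assumes equal variances by default)
t_stat, p_value = ttest_ind(group1, group2)
# Chi-square test of independence (categorical data)
chi2, chi2_p_value, dof, expected = chi2_contingency(contingency_table)
# Interpretation at the 5% significance level (shown here for the t-test)
if p_value < 0.05:
    print("Result is statistically significant")
else:
    print("Result is not statistically significant")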
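The p-value definition can be made tangible with a permutation test: if H₀ holds, group labels are interchangeable, so shuffling them shows how often a difference at least as large as the observed one arises by chance. This is a sketch on made-up data, and it swaps in a permutation test for the t-test; the number of shuffles and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data (made up for this sketch)
group1 = np.array([5.1, 4.9, 5.6, 5.8, 6.0, 5.3])
group2 = np.array([6.2, 6.5, 5.9, 6.8, 7.0, 6.4])

observed = abs(group1.mean() - group2.mean())
pooled = np.concatenate([group1, group2])

n_permutations = 10_000
count = 0
for _ in range(n_permutations):
    shuffled = rng.permutation(pooled)
    diff = abs(shuffled[:len(group1)].mean() - shuffled[len(group1):].mean())
    # Small tolerance guards against floating-point ties
    if diff >= observed - 1e-12:
        count += 1

# Share of shuffles at least as extreme as the observed difference
# (the +1 terms count the observed arrangement itself)
p_value = (count + 1) / (n_permutations + 1)
print(f"permutation p-value: {p_value:.4f}")
```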
Correlation and Causation
Correlation Coefficients
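- Pearson: Strength of linear association, from -1 (perfect negative) to +1 (perfect positive)
- Spearman: Rank correlation for monotonic relationships, less sensitive to outliers than Pearson
- Kendall: Rank correlation based on concordant and discordant pairs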
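from scipy.stats import pearsonr, spearmanr
# Illustrative paired data (made up for this example)
x = [1, 2, 3, 4, 5, 6]
y = [2, 4, 5, 4, 7, 9]
# Pearson correlation (linear association)
pearson_corr, pearson_p = pearsonr(x, y)
# Spearman correlation (monotonic association, computed on ranks)
spearman_corr, spearman_p = spearmanr(x, y)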
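The difference between the coefficients shows up clearly on a relationship that is perfectly monotonic but not linear. A small sketch with synthetic data, where `y = exp(x)` is chosen purely for illustration:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

x = np.arange(1, 11)
y = np.exp(x)  # strictly increasing, but far from a straight line

pearson_corr, _ = pearsonr(x, y)
spearman_corr, _ = spearmanr(x, y)
kendall_corr, _ = kendalltau(x, y)

print(f"Pearson:  {pearson_corr:.3f}")
print(f"Spearman: {spearman_corr:.3f}")
print(f"Kendall:  {kendall_corr:.3f}")
```

Both rank-based coefficients equal 1 because the ordering of `x` and `y` agrees perfectly, while Pearson is noticeably lower because the points do not lie on a line.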
Important Note
⚠️ Correlation ≠ Causation. Two variables can be correlated without one causing the other, because of:
- Confounding variables
- Reverse causality
- Coincidence
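Confounding is easy to reproduce in a simulation. In the hypothetical setup below, temperature drives both ice cream sales and sunburn cases, and neither affects the other. All variable names, coefficients, and noise levels are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5_000

# Hypothetical confounder and two variables that depend only on it
temperature = rng.normal(25, 5, n)
ice_cream = 2.0 * temperature + rng.normal(0, 5, n)
sunburn = 1.5 * temperature + rng.normal(0, 5, n)

raw_corr = np.corrcoef(ice_cream, sunburn)[0, 1]

def residuals(y, x):
    """Residuals of y after a least-squares straight-line fit on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Correlation after removing the part of each variable explained by temperature
partial_corr = np.corrcoef(residuals(ice_cream, temperature),
                           residuals(sunburn, temperature))[0, 1]

print(f"raw correlation:     {raw_corr:.3f}")
print(f"partial correlation: {partial_corr:.3f}")
```

The raw correlation is strong, yet it largely disappears once the confounder is controlled for, which is the pattern to look for when a correlation seems too convenient.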
Time Series Analysis
Trend, Seasonality, Residuals
- Decompose time series into components
- Identify patterns
- Forecast future values
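import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose
# Illustrative monthly series (synthetic): linear trend plus a yearly cycle
index = pd.date_range("2020-01-01", periods=48, freq="MS")
timeseries = pd.Series(np.arange(48) + 10 * np.sin(2 * np.pi * np.arange(48) / 12), index=index)
result = seasonal_decompose(timeseries, model='additive', period=12)
trend = result.trend
seasonality = result.seasonal
residual = result.resid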
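To see what an additive decomposition does under the hood, here is a simplified NumPy-only sketch of the classical approach: a centered moving average estimates the trend, and per-position averages of the detrended series estimate the seasonal component. It illustrates the idea on a synthetic series and is not meant to reproduce the statsmodels implementation detail for detail.

```python
import numpy as np

period = 12
t = np.arange(48)
# Synthetic monthly series: linear trend plus a yearly sine cycle (made up for this sketch)
series = 0.5 * t + 10 * np.sin(2 * np.pi * t / period)

# Centered moving average over one full period (half weights at both ends for an even period)
weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
half = period // 2
trend = np.full(series.shape, np.nan)          # edges stay NaN: no full window there
trend[half:-half] = np.convolve(series, weights, mode="valid")

# Seasonal component: average detrended value at each position in the cycle
detrended = series - trend
seasonal_means = np.array([np.nanmean(detrended[i::period]) for i in range(period)])
seasonal_means -= seasonal_means.mean()        # centre so the cycle averages to zero
seasonal = np.tile(seasonal_means, len(series) // period)

# Whatever trend and seasonality do not explain
residual = series - trend - seasonal
```

Because the synthetic series contains no noise, the residual is essentially zero wherever the trend is defined; on real data it holds the irregular part of the series.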
Practice Tips
- Calculate confidence intervals for estimates
- Check assumptions before tests (normality, equal variance)
- Use effect size alongside p-values
- Create visualizations (histograms, Q-Q plots, scatter plots)
- Document assumptions and limitations
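The first three tips can be sketched in a few lines of SciPy. The data is made up, the 95% level is a conventional choice, and Cohen's d with a pooled standard deviation is one common effect size among several.

```python
import numpy as np
from scipy import stats

# Illustrative data (made up for this sketch)
group1 = np.array([5.1, 4.9, 5.6, 5.8, 6.0, 5.3])
group2 = np.array([6.2, 6.5, 5.9, 6.8, 7.0, 6.4])

# 95% confidence interval for the mean of group1, based on the t distribution
mean1 = group1.mean()
sem1 = stats.sem(group1)  # standard error of the mean (uses ddof=1)
ci_low, ci_high = stats.t.interval(0.95, len(group1) - 1, loc=mean1, scale=sem1)

# Assumption checks: normality (Shapiro-Wilk) and equal variances (Levene)
_, shapiro_p1 = stats.shapiro(group1)
_, shapiro_p2 = stats.shapiro(group2)
_, levene_p = stats.levene(group1, group2)

# Effect size: Cohen's d with a pooled standard deviation
n1, n2 = len(group1), len(group2)
pooled_sd = np.sqrt(((n1 - 1) * group1.var(ddof=1) + (n2 - 1) * group2.var(ddof=1))
                    / (n1 + n2 - 2))
cohens_d = (group2.mean() - group1.mean()) / pooled_sd

print(f"95% CI for mean of group1: ({ci_low:.2f}, {ci_high:.2f})")
print(f"Shapiro p-values: {shapiro_p1:.3f}, {shapiro_p2:.3f}; Levene p-value: {levene_p:.3f}")
print(f"Cohen's d: {cohens_d:.2f}")
```

Small p-values from the Shapiro-Wilk or Levene tests suggest the normality or equal-variance assumption is doubtful, in which case a Welch t-test (`ttest_ind(..., equal_var=False)`) or a rank-based test may be a safer choice.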
Last updated: April 12, 2026
Difficulty: Intermediate
Prerequisites: Basic math knowledge