Income Distribution and Healthcare Spending Analysis¶
1. Project Overview¶
This analysis explores the relationship between income levels and healthcare spending using a large synthetic healthcare dataset.
The objective is to understand how healthcare expenditures vary across income groups and identify patterns that may reflect differences in access to care, healthcare utilization, or preventive health behavior.
The dataset contains over 1.2 million patient records, making it suitable for distributed processing using Apache Spark.
1a. Business Questions¶
Healthcare spending varies significantly across populations, but the relationship between income level and healthcare utilization is not always straightforward.
Key questions explored in this analysis:
- How is income distributed across the population?
- Do higher income groups spend more on healthcare?
- At what income level does healthcare spending peak?
- Are there demographic differences in healthcare expenses?
Understanding these patterns can help inform:
- healthcare policy planning
- insurance risk modeling
- population health management
2. Dataset Overview¶
- Total observations: ~1.2M
- Median income: ~\$32K
- Mean income: ~\$48K
Insight:
The dataset shows a strong right-skewed distribution.
2a. Big Data Considerations¶
- Why Spark was used instead of Pandas?
- Example explanation:
- The dataset contains over 1M records. Apache Spark enables distributed processing and scalable analytics compared to single-node Pandas workflows.
2b. Analytical Insight¶
Healthcare spending does not increase linearly with income. While spending rises rapidly between the \$15K – \$70K income brackets, it plateaus and eventually declines among higher-income populations.
This pattern may suggest that middle-income populations experience the highest healthcare utilization, whereas higher-income individuals benefit from preventive care and better health management.
2c. Hypothesis for Further Analysis¶
Potential factors influencing the pattern:
- Insurance coverage differences
- Preventive care access
- Age distribution across income brackets
- Chronic disease prevalence
3. Data Engineering (Spark)¶
Data Processing Pipeline¶
The analysis follows a distributed data processing workflow using Apache Spark.
Pipeline steps:
- Data ingestion from Snowflake dataset
- Schema inspection and data validation
- Distributed aggregation using Spark SQL
- Feature engineering (income brackets)
- Statistical analysis
- Visualization using Pandas + Matplotlib
Spark was used for large-scale data processing, while smaller aggregated datasets were converted to Pandas for visualization.
This hybrid workflow enables scalable processing while maintaining flexible analysis capabilities.
# Import SparkSession
from pyspark.sql import SparkSession
# Create SparkSession
spark = SparkSession.builder \
.master("local[1]") \
.appName("SparkByExamples.com") \
.getOrCreate()
# Create RDD from external Data source
rdd2 = spark.sparkContext.textFile("/path/test.txt")
WARNING: Using incubator modules: jdk.incubator.vector Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 26/03/15 17:03:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
data_path = '/Users/thanhbrown/Home/work_portfolio/healthcare_spark/data/raw/warming_up.csv'
data_frame = (
spark.read
.option("header", True)
.option("inferSchema", True)
.csv(data_path)
)
spark.conf.set('spark.sql.repl.eagerEval.enabled', True)
data_frame
| PATIENT_ID | BIRTHDATE | DEATHDATE | SSN | DRIVERS | PASSPORT | PREFIX | FIRST | MIDDLE | LAST | SUFFIX | MAIDEN | MARITAL | RACE | ETHNICITY | GENDER | BIRTHPLACE | ADDRESS | CITY | STATE | COUNTY | FIPS | ZIP | LAT | LON | HEALTHCARE_EXPENSES | HEALTHCARE_COVERAGE | INCOME | SYNTHEA_CITY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1640106 | 2000-10-04 | NULL | 999-37-5169 | S99927567 | X33172632X | Ms. | Eloisa | María del Carmen | León | NULL | NULL | NULL | white | hispanic | F | Wesley Saint And... | 557 Abbott Lane | Austin | Texas | Hays County | 48453 | 78725 | 30.354726579225414 | -97.73422750216807 | 75583.97 | 294119.98 | 600051 | AUSTIN |
| 1640107 | 2012-02-10 | NULL | 999-44-7325 | NULL | NULL | NULL | Maryam | Danita | Tillman | NULL | NULL | NULL | white | nonhispanic | F | Longview Texas US | 722 Fadel Key Sui... | Austin | Texas | Hays County | 48453 | 78732 | 30.241189796621992 | -97.67758852990734 | 43459.33 | 16700.37 | 48653 | AUSTIN |
| 1640108 | 1963-08-11 | NULL | 999-68-4600 | S99983898 | X57933561X | Mr. | Shad | NULL | Pagac | NULL | NULL | S | white | hispanic | M | San Antonio Texa... | 442 Schmidt Throu... | Austin | Texas | Hays County | 48453 | 78758 | 30.391158191525708 | -97.82603275334445 | 430034.9 | 284638.58 | 8390 | AUSTIN |
| 1640109 | 2003-04-20 | NULL | 999-18-7911 | S99910290 | X22386013X | Mr. | Alex | NULL | Hammes | NULL | NULL | NULL | white | hispanic | M | Big Spring Texas... | 433 Becker Passag... | Austin | Texas | Hays County | 48453 | 78744 | 30.448291322845034 | -97.88521095196566 | 7751.63 | 67354.64 | 15977 | AUSTIN |
| 1640110 | 1951-12-23 | NULL | 999-67-9744 | S99990121 | X11642889X | Mrs. | Cherish | Shakita | Stoltenberg | NULL | Monahan | D | white | nonhispanic | F | Irving Texas US | 697 Wiza Crossing | Austin | Texas | Hays County | 48453 | 78749 | 30.5354363419989 | -97.8646393237985 | 684121.28 | 130648.4 | 99273 | AUSTIN |
| 1640111 | 2014-04-13 | NULL | 999-16-4648 | NULL | NULL | NULL | Darren | Leon | Dietrich | NULL | NULL | NULL | white | nonhispanic | M | Lubbock Texas US | 394 Hayes Hollow ... | Austin | Texas | Hays County | 48453 | 78739 | 30.129846173226507 | -97.57234941245392 | 23805.47 | 0.0 | 94919 | AUSTIN |
| 1640112 | 1980-11-04 | NULL | 999-89-2304 | S99929484 | X84264157X | Mr. | Palmer | Hyman | Sipes | NULL | NULL | D | white | nonhispanic | M | Hewitt Texas US | 827 Krajcik Union... | Austin | Texas | Hays County | 48453 | 78741 | 30.243705989128614 | -97.68952854520568 | 122641.66 | 36055.11 | 954499 | AUSTIN |
| 1640113 | 1982-07-19 | NULL | 999-80-7707 | S99970819 | X64509957X | Mr. | Daryl | Teddy | Ryan | NULL | NULL | D | white | nonhispanic | M | Houston Texas US | 357 Davis Esplanade | Austin | Texas | Hays County | 48491 | 78717 | 30.34413813354928 | -97.93154202232147 | 145434.53 | 1092025.1 | 8978 | AUSTIN |
| 1640114 | 1994-04-26 | NULL | 999-43-5125 | S99957739 | X68419579X | Mrs. | Débora | Francisca | Madera | NULL | Toro | M | white | hispanic | F | Carolina Puerto ... | 503 Keebler Cente... | Austin | Texas | Hays County | 48453 | 78741 | 30.37585635437581 | -97.55014443141285 | 97101.96 | 612425.37 | 13729 | AUSTIN |
| 1640115 | 1963-08-13 | NULL | 999-93-6364 | S99951081 | X66961424X | Mrs. | Jeanne | Chin | Hermiston | NULL | Murphy | M | white | nonhispanic | F | Sugar Land Texas... | 692 Kuhlman Meadow | Austin | Texas | Hays County | 48453 | 78750 | 30.30485682126074 | -97.64607331883383 | 229508.53 | 671061.29 | 18485 | AUSTIN |
| 1640116 | 1981-12-08 | NULL | 999-57-5488 | S99935062 | X9142516X | Mrs. | Alicia | NULL | Casas | NULL | Fernández | W | white | hispanic | F | San Juan Puerto ... | 138 Stokes Burg U... | Austin | Texas | Hays County | 48453 | 78660 | 30.3317735118376 | -97.77549016248436 | 306258.91 | 451376.93 | 2803 | AUSTIN |
| 1640117 | 2005-10-02 | NULL | 999-99-8126 | S99927805 | NULL | Mr. | Luis | Manuel | Olivares | NULL | NULL | NULL | black | hispanic | M | Juarez Chihuahua... | 138 Hammes Ferry ... | Austin | Texas | Hays County | 48453 | 78750 | 30.410658199513644 | -97.79402405988824 | 57614.8 | 197611.58 | 79641 | AUSTIN |
| 1640118 | 1991-04-03 | NULL | 999-10-4433 | S99956308 | X89460230X | Mr. | Dennis | Arnold | Bode | NULL | NULL | M | black | nonhispanic | M | Arlington Texas US | 803 Schuppe Lodge | Austin | Texas | Hays County | 48453 | 78744 | 30.28485521668174 | -97.82801447687807 | 65199.91 | 550518.32 | 996 | AUSTIN |
| 1640119 | 1993-08-10 | NULL | 999-42-6132 | S99975433 | X33504910X | Mr. | Howard | Ignacio | Abbott | NULL | NULL | M | white | hispanic | M | Dallas Texas US | 864 Waelchi Camp ... | Austin | Texas | Hays County | 48453 | 78734 | 30.186529529252535 | -97.75063376599643 | 65535.25 | 15874.07 | 49427 | AUSTIN |
| 1640120 | 2020-03-29 | NULL | 999-22-3187 | NULL | NULL | NULL | Roderick | Derick | Bosco | NULL | NULL | NULL | white | nonhispanic | M | Houston Texas US | 967 Daugherty Ville | Austin | Texas | Hays County | 48453 | 78742 | 30.541822801406493 | -97.67857620480662 | 9712.89 | 0.0 | 59358 | AUSTIN |
| 1640121 | 1971-07-03 | NULL | 999-41-3265 | S99969956 | X38591251X | Mr. | Arlen | Lawerence | Bailey | NULL | NULL | M | white | hispanic | M | Edinburg Texas US | 1066 Boyer Bridge | Austin | Texas | Hays County | 48453 | 78723 | 30.17923179171047 | -97.7836328356483 | 163553.29 | 57088.71 | 36658 | AUSTIN |
| 1640122 | 1983-10-24 | NULL | 999-40-9630 | S99973435 | X74977147X | Mr. | Dylan | Leandro | Hayes | NULL | NULL | S | white | nonhispanic | M | Amarillo Texas US | 662 Bruen Dale | Austin | Texas | Hays County | 48209 | 78610 | 30.266042712591933 | -97.70404670044297 | 632187.87 | 0.0 | 81708 | AUSTIN |
| 1640123 | 1930-05-28 | 2002-04-23 | 999-37-6544 | S99962206 | X194625X | Mr. | César | Bernardo | Matos | NULL | NULL | S | white | hispanic | M | Wesley Saint And... | 406 Weber Burg | Austin | Texas | Hays County | 48453 | 78741 | 30.10035672282173 | -97.5711182110766 | 105484.65 | 1757252.01 | 201309 | AUSTIN |
| 1640124 | 2003-01-11 | NULL | 999-16-5612 | S99946585 | X75431317X | Ms. | Grace | Kecia | Considine | NULL | NULL | NULL | white | nonhispanic | F | Belton Texas US | 871 Kshlerin Spur... | Austin | Texas | Hays County | 48453 | 78701 | 30.299539751127124 | -97.67210058669058 | 88679.25 | 49575.44 | 40877 | AUSTIN |
| 1640125 | 1964-11-05 | NULL | 999-23-1335 | S99955769 | X35985558X | Mrs. | Anna | Ivy | Herman | NULL | Ratke | M | white | nonhispanic | F | Chandler Texas US | 796 Bogisich Well... | Austin | Texas | Hays County | 48453 | 78730 | 30.357610383913563 | -97.74642132691518 | 1335250.57 | 79548.51 | 62100 | AUSTIN |
data_frame.show(2, vertical=True)
-RECORD 0----------------------------------- PATIENT_ID | 1640106 BIRTHDATE | 2000-10-04 DEATHDATE | NULL SSN | 999-37-5169 DRIVERS | S99927567 PASSPORT | X33172632X PREFIX | Ms. FIRST | Eloisa MIDDLE | María del Carmen LAST | León SUFFIX | NULL MAIDEN | NULL MARITAL | NULL RACE | white ETHNICITY | hispanic GENDER | F BIRTHPLACE | Wesley Saint And... ADDRESS | 557 Abbott Lane CITY | Austin STATE | Texas COUNTY | Hays County FIPS | 48453 ZIP | 78725 LAT | 30.354726579225414 LON | -97.73422750216807 HEALTHCARE_EXPENSES | 75583.97 HEALTHCARE_COVERAGE | 294119.98 INCOME | 600051 SYNTHEA_CITY | AUSTIN -RECORD 1----------------------------------- PATIENT_ID | 1640107 BIRTHDATE | 2012-02-10 DEATHDATE | NULL SSN | 999-44-7325 DRIVERS | NULL PASSPORT | NULL PREFIX | NULL FIRST | Maryam MIDDLE | Danita LAST | Tillman SUFFIX | NULL MAIDEN | NULL MARITAL | NULL RACE | white ETHNICITY | nonhispanic GENDER | F BIRTHPLACE | Longview Texas US ADDRESS | 722 Fadel Key Sui... CITY | Austin STATE | Texas COUNTY | Hays County FIPS | 48453 ZIP | 78732 LAT | 30.241189796621992 LON | -97.67758852990734 HEALTHCARE_EXPENSES | 43459.33 HEALTHCARE_COVERAGE | 16700.37 INCOME | 48653 SYNTHEA_CITY | AUSTIN only showing top 2 rows
## General stats
# data_frame.groupBy("GENDER").agg(F.countDistinct("id").alias("unique_id_count"))
# data_frame.select("GENDER").distinct().count()
from pyspark.sql import functions as F
from pyspark.sql.functions import col, count, sum
# from pyspark.sql.window import Window
def count_unique_and_percent(value):
counts = data_frame.groupBy(value).count()
total = data_frame.count()
result = counts.withColumn(
"percent",
col("count") / total * 100
)
result.show()
count_unique_and_percent("RACE")
[Stage 10:======================================> (2 + 1) / 3]
+--------+------+-------------------+ | RACE| count| percent| +--------+------+-------------------+ | asian|108267| 7.615555380485856| | other| 16577| 1.166034540001238| | white|978825| 68.85104413444603| | black|294507| 20.71577090379107| | native| 6960|0.48956991002042693| |hawaiian| 16520| 1.162025131255381| +--------+------+-------------------+
count_unique_and_percent("ETHNICITY")
[Stage 16:======================================> (2 + 1) / 3]
+-----------+-------+------------------+ | ETHNICITY| count| percent| +-----------+-------+------------------+ |nonhispanic|1206908| 84.89451737973181| | hispanic| 214748|15.105482620268193| +-----------+-------+------------------+
count_unique_and_percent("GENDER")
[Stage 22:======================================> (2 + 1) / 3]
+------+------+------------------+ |GENDER| count| percent| +------+------+------------------+ | F|713363|50.178313178434166| | M|708293|49.821686821565834| +------+------+------------------+
4. Income Distribution Analysis¶
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
income= (data_frame.select("INCOME").
toPandas()
)
fig, ax = plt.subplots()
income.hist(ax=ax)
ax.xaxis.set_major_formatter(ticker.FuncFormatter(lambda x, _: f'{x:,.0f}'))
plt.xticks(rotation=45) # Rotate so numbers don't overlap
plt.tight_layout()
plt.show()
Interpretation:
The Y-axis shows frequency (count) — the number of records/rows in the dataset that fall within each income bin. The 1e6 multiplier at the top means the values are in millions, so:
- The tallest bar (~1.17) = approximately 1,170,000 records with income between $0–$100K
- The second bar (~0.19) = approximately 190,000 records with income between $100K–$200K
- All remaining bars are nearly flat = very few records with income above $200K
What the shape tells you
This is a classic right-skewed (positive skew) distribution, which is very typical for income data:
The vast majority of your population earns under \$100k
A smaller but notable group earns \$100k – \$200k
A tiny minority (outliers) earn up to \$1,000,000
data_frame.select("INCOME").summary("count", "min", "25%","mean","50%", "stddev", "75%", "max").show()
[Stage 26:===================> (1 + 1) / 3]
+-------+------------------+ |summary| INCOME| +-------+------------------+ | count| 1421656| | min| 0| | 25%| 17149| | mean| 78416.96058680862| | 50%| 39850| | stddev|138024.94543862357| | 75%| 81096| | max| 998949| +-------+------------------+
Note: there's a huge difference in mean and median (or 50%). This is because mean was bias towards outliers (which present in this case).
income["INCOME"].quantile([0.1, 0.25, 0.5, 0.75, 0.9, 0.99])
0.10 6884.00 0.25 17150.00 0.50 39855.00 0.75 81110.00 0.90 144760.50 0.99 840170.45 Name: INCOME, dtype: float64
Log transformation of x-axis¶
The long right tail with near-zero bars suggests possible outliers or data quality issues — we may want to investigate those high-income records. Since the outliers are valid but distorting the viz, trying log-scaling the x-axis for a clearer picture.
The log-scaling will compress the long right tail so we can actually see the distribution of high earners instead of flat near-zero bars.
Interpretation:
The x-axis is now logarithmic, so the spacing between ticks is no longer equal — notice how 10,000 to 100,000 takes the same horizontal space as 100,000 to 1,000,000. This is intentional and reveals hidden structure.
In the previous plot, everything above \$200K looked like a flat zero line. Now we can clearly see:
- There are still meaningful counts of people earning \$200K – \$500K
- The distribution gradually tapers off rather than falling off a cliff
- The data has a much smoother, more gradual decline across all income levels
fig, ax = plt.subplots()
income.hist(ax=ax, bins=50)
ax.set_xscale('log')
ax.xaxis.set_major_formatter(ticker.FuncFormatter(lambda x, _: f'{x:,.0f}'))
plt.show()
import matplotlib.ticker as ticker
fig, ax = plt.subplots()
income.hist(ax=ax, bins=50)
ax.set_xscale('log')
# Explicitly define the tick positions you want to show
tick_positions = [5000, 15000, 40000, 70000, 100000, 200000, 500000, 1000000]
ax.set_xticks(tick_positions)
# Format them with commas and a $ sign
ax.xaxis.set_major_formatter(ticker.FuncFormatter(lambda x, _: f'${x:,.0f}'))
plt.xticks(rotation=45) # rotate so labels don't overlap
plt.tight_layout() # prevent labels from being cut off
plt.show()
Interpretation:
- From reading the bars, we observe:
| Income Range | Approx. Count | Interpretation | |---------------|---------------|----------------| | ~$5K–$15K | ~410,000 | Largest group — very low earners | | ~$15K–$40K | ~305,000 | Second largest group | | ~$40K–$70K | ~200,000 | Middle-low earners | | ~$70K–$100K | ~150,000 | Approaching middle class | | ~$100K–$200K | ~100,000 | Upper-middle earners | | ~$200K–$500K | ~50,000 | High earners | | ~$500K+ | ~20,000–25,000 | Very high earners / outliers |
Insight:
The income distribution is strongly right-skewed, with the majority of individuals earning under $40K.
⚠️ Data Note:
The $500K+ category contains extreme outliers that may distort mean income statistics.
This shape — a straight declining line on a log x-axis — is actually close to a Pareto distribution, which is the classic mathematical model for income and wealth data in real-world populations. It basically confirms: the fewer people, the higher the income.
Dataset Snapshot¶
- Total records: ~1,235,000
- Median income: ~$32,000
- Mean income: ~ \$48,000
- Highest income bracket: \$500K+
- Lowest income bracket: \$5K
5. Gender Analysis¶
Summary stats a quick lense¶
df = data_frame.select("GENDER", "INCOME").toPandas()
df.groupby("GENDER")["INCOME"].describe()
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| GENDER | ||||||||
| F | 713363.0 | 78167.985528 | 138061.080894 | 0.0 | 17037.0 | 39635.0 | 80755.5 | 998862.0 |
| M | 708293.0 | 78667.717823 | 137988.185153 | 0.0 | 17258.0 | 40087.0 | 81479.0 | 998949.0 |
From summary stats at a quick lense, there's not much difference in earnings between genders. Now we can take a look of viz, then do some statistics to confirm the observation.
Viz distribution¶
import seaborn as sns
fig, ax = plt.subplots()
sns.boxplot(data=df, x="GENDER", y="INCOME", ax=ax)
ax.set_yscale('log')
ax.yaxis.set_major_formatter(ticker.FuncFormatter(lambda x, _: f'${x:,.0f}'))
plt.title("Income by Gender")
plt.tight_layout()
plt.show()
The box plot shows you median, spread, and outliers per gender at a glance. It seems that there's not much difference in median. Also notice lots of outliers for both genders
Statistical test: is the difference significant?¶
from scipy import stats
male = df[df["GENDER"] == "M"]["INCOME"].dropna()
female = df[df["GENDER"] == "F"]["INCOME"].dropna()
t_stat, p_value = stats.mannwhitneyu(male, female, alternative='two-sided')
print(f"p-value: {p_value}")
p-value: 2.3758734348749904e-07
Use Mann-Whitney U (not a t-test) because income is skewed, not normally distributed.
| p-value | Meaning |
|---|---|
< 0.05 |
Statistically significant difference in income between genders |
>= 0.05 |
No statistically significant difference found |
With p-value above, there's no statistically significant difference in income between 2 genders
6. Income vs Healthcare Spending¶
This is now a relationship between two continuous variables, so the approach is different from the gender comparison
Summary stats¶
df = data_frame.select("INCOME", "HEALTHCARE_EXPENSES").toPandas()
# Pearson (linear) or Spearman (non-linear, better for skewed data)
print(df[["INCOME", "HEALTHCARE_EXPENSES"]].corr(method='spearman'))
INCOME HEALTHCARE_EXPENSES INCOME 1.000000 0.462926 HEALTHCARE_EXPENSES 0.462926 1.000000
Viz¶
import seaborn as sns
import numpy as np
import pandas as pd
# Create income brackets
df["INCOME_BUCKET"] = pd.cut(df["INCOME"],
bins=[0, 15000, 40000, 70000, 100000, 200000, 500000, np.inf],
labels=["<$15K", "$15-40K", "$40-70K", "$70-100K", "$100-200K", "$200-500K", "$500K+"]
)
fig, ax = plt.subplots(figsize=(12, 6))
sns.boxplot(data=df, x="INCOME_BUCKET", y="HEALTHCARE_EXPENSES", ax=ax)
ax.set_yscale('log')
ax.yaxis.set_major_formatter(ticker.FuncFormatter(lambda x, _: f'${x:,.0f}'))
plt.xlabel("Income Range")
plt.ylabel("Healthcare Expenses")
plt.title("Healthcare Expenses by Income Bracket")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
7. Income Distribution Summary¶
Key Finding: Healthcare Expenses RISE with Income (up to a point)
| Income Bracket | Median Healthcare Expense | Observation |
|---|---|---|
| <\$15k | \$12,000 | Very limited spending |
| \$15–40k | \$40,000 | Significant jump |
| \$40–70k | \$175,000 | Another large jump |
| \$70–100k | \$175,000 | Similar to previous bracket |
| \$100–200k | \$150,000 | Slight plateau |
| \$200–500k | \$110,000 | Starts declining |
| \$500k+ | \$110,000 | Similar to $200–500K |
Important Patterns to Notice
Insight 1 — Healthcare spending rises sharply with income initially
Lower-income populations (<$40K) show much lower median healthcare expenses, likely reflecting:
- Reduced access to care
- Delayed or deferred treatments
- Affordability constraints
Insight 2 - The big jump from <$15K to \$15–40K
The median nearly triples between these two groups. The lowest earners spend dramatically less on healthcare — likely due to inability to afford care, not because they need it less.
Insight 3 — Peak spending occurs in middle-income groups (\$40K –\$100K)
Healthcare expenses increase dramatically and plateau around \$175K, suggesting:
- Higher healthcare utilization once financial barriers decrease
- Greater access to medical services
Insight 4 - Plateau and slight decline after \$70K
Healthcare expenses don't keep rising indefinitely. After \$70–100K, the median flattens then slightly drops — suggesting:
- high earners may be more efficient or healthier overall.
Insight 5 — Spending declines in higher-income brackets
Income groups above $200K show lower median healthcare expenses, which may reflect:
- Better preventive care
- Healthier lifestyles
- Superior insurance coverage
Insight 6 - Massive outliers in every bracket
Every income group has dots reaching \$1M – \$10M in healthcare expenses. These likely represent:
- Catastrophic illnesses (cancer, transplants)
- Long-term chronic conditions
- These occur across all income levels, not just the wealthy
Insight 7 - The Spread (Box Size) Tells a Story Too
- <$15K — very tight box → low earners consistently spend little
- \$40–100K — widest boxes → middle income has the most variability in healthcare spending
- \$200K+ — boxes tighten again → high earners are more consistent in their spending
Bottom Line
Income and healthcare expenses are positively correlated up to ~ \$70–100K, then level off. The biggest barrier is at the bottom — the <$15K group spends far less, likely reflecting access and affordability constraints rather than better health.
8. Final Takeaways¶
Although low-income populations represent the largest group in the dataset, their healthcare spending remains relatively low.
The majority of healthcare expenditures appear concentrated in middle-income brackets ($40K–$100K), where healthcare utilization peaks.
This finding may reflect increased healthcare access through employer-provided insurance and higher treatment utilization among middle-income individuals.
Jupyter Notebook Environment¶
Tools and libraries used in this project:
- Python 3.12
- Apache Spark (PySpark)
- Pandas
- NumPy
- Matplotlib / Seaborn
- Jupyter Notebook
Infrastructure:
- Local Spark environment
- Dataset sourced from Snowflake Marketplace