Income Distribution and Healthcare Spending Analysis¶

1. Project Overview¶

This analysis explores the relationship between income levels and healthcare spending using a large synthetic healthcare dataset.

The objective is to understand how healthcare expenditures vary across income groups and identify patterns that may reflect differences in access to care, healthcare utilization, or preventive health behavior.

The dataset contains over 1.2 million patient records, making it suitable for distributed processing using Apache Spark.

1a. Business Questions¶

Healthcare spending varies significantly across populations, but the relationship between income level and healthcare utilization is not always straightforward.

Key questions explored in this analysis:

How is income distributed across the population?
Do higher income groups spend more on healthcare?
At what income level does healthcare spending peak?
Are there demographic differences in healthcare expenses?

Understanding these patterns can help inform:

healthcare policy planning
insurance risk modeling
population health management

2. Dataset Overview¶

Total observations: ~1.2M
Median income: ~\$32K
Mean income: ~\$48K

Insight:
The dataset shows a strong right-skewed distribution.

2a. Big Data Considerations¶

Why Spark was used instead of Pandas?
Example explanation:
- The dataset contains over 1M records. Apache Spark enables distributed processing and scalable analytics compared to single-node Pandas workflows.

2b. Analytical Insight¶

Healthcare spending does not increase linearly with income. While spending rises rapidly between the \$15K – \$70K income brackets, it plateaus and eventually declines among higher-income populations.

This pattern may suggest that middle-income populations experience the highest healthcare utilization, whereas higher-income individuals benefit from preventive care and better health management.

2c. Hypothesis for Further Analysis¶

Potential factors influencing the pattern:

Insurance coverage differences
Preventive care access
Age distribution across income brackets
Chronic disease prevalence

3. Data Engineering (Spark)¶

Data Processing Pipeline¶

The analysis follows a distributed data processing workflow using Apache Spark.

Pipeline steps:

Data ingestion from Snowflake dataset
Schema inspection and data validation
Distributed aggregation using Spark SQL
Feature engineering (income brackets)
Statistical analysis
Visualization using Pandas + Matplotlib

Spark was used for large-scale data processing, while smaller aggregated datasets were converted to Pandas for visualization.

This hybrid workflow enables scalable processing while maintaining flexible analysis capabilities.

In [1]:

# Import SparkSession
from pyspark.sql import SparkSession

# Create SparkSession 
spark = SparkSession.builder \
      .master("local[1]") \
      .appName("SparkByExamples.com") \
      .getOrCreate() 

# Create RDD from external Data source
rdd2 = spark.sparkContext.textFile("/path/test.txt")

WARNING: Using incubator modules: jdk.incubator.vector
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
26/03/15 17:03:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

In [2]:

data_path = '/Users/thanhbrown/Home/work_portfolio/healthcare_spark/data/raw/warming_up.csv'
data_frame = (
        spark.read
        .option("header", True)
        .option("inferSchema", True)
        .csv(data_path)
    )

In [4]:

spark.conf.set('spark.sql.repl.eagerEval.enabled', True)

data_frame

Out[4]:

PATIENT_ID	BIRTHDATE	DEATHDATE	SSN	DRIVERS	PASSPORT	PREFIX	FIRST	MIDDLE	LAST	SUFFIX	MAIDEN	MARITAL	RACE	ETHNICITY	GENDER	BIRTHPLACE	ADDRESS	CITY	STATE	COUNTY	FIPS	ZIP	LAT	LON	HEALTHCARE_EXPENSES	HEALTHCARE_COVERAGE	INCOME	SYNTHEA_CITY
1640106	2000-10-04	NULL	999-37-5169	S99927567	X33172632X	Ms.	Eloisa	María del Carmen	León	NULL	NULL	NULL	white	hispanic	F	Wesley Saint And...	557 Abbott Lane	Austin	Texas	Hays County	48453	78725	30.354726579225414	-97.73422750216807	75583.97	294119.98	600051	AUSTIN
1640107	2012-02-10	NULL	999-44-7325	NULL	NULL	NULL	Maryam	Danita	Tillman	NULL	NULL	NULL	white	nonhispanic	F	Longview Texas US	722 Fadel Key Sui...	Austin	Texas	Hays County	48453	78732	30.241189796621992	-97.67758852990734	43459.33	16700.37	48653	AUSTIN
1640108	1963-08-11	NULL	999-68-4600	S99983898	X57933561X	Mr.	Shad	NULL	Pagac	NULL	NULL	S	white	hispanic	M	San Antonio Texa...	442 Schmidt Throu...	Austin	Texas	Hays County	48453	78758	30.391158191525708	-97.82603275334445	430034.9	284638.58	8390	AUSTIN
1640109	2003-04-20	NULL	999-18-7911	S99910290	X22386013X	Mr.	Alex	NULL	Hammes	NULL	NULL	NULL	white	hispanic	M	Big Spring Texas...	433 Becker Passag...	Austin	Texas	Hays County	48453	78744	30.448291322845034	-97.88521095196566	7751.63	67354.64	15977	AUSTIN
1640110	1951-12-23	NULL	999-67-9744	S99990121	X11642889X	Mrs.	Cherish	Shakita	Stoltenberg	NULL	Monahan	D	white	nonhispanic	F	Irving Texas US	697 Wiza Crossing	Austin	Texas	Hays County	48453	78749	30.5354363419989	-97.8646393237985	684121.28	130648.4	99273	AUSTIN
1640111	2014-04-13	NULL	999-16-4648	NULL	NULL	NULL	Darren	Leon	Dietrich	NULL	NULL	NULL	white	nonhispanic	M	Lubbock Texas US	394 Hayes Hollow ...	Austin	Texas	Hays County	48453	78739	30.129846173226507	-97.57234941245392	23805.47	0.0	94919	AUSTIN
1640112	1980-11-04	NULL	999-89-2304	S99929484	X84264157X	Mr.	Palmer	Hyman	Sipes	NULL	NULL	D	white	nonhispanic	M	Hewitt Texas US	827 Krajcik Union...	Austin	Texas	Hays County	48453	78741	30.243705989128614	-97.68952854520568	122641.66	36055.11	954499	AUSTIN
1640113	1982-07-19	NULL	999-80-7707	S99970819	X64509957X	Mr.	Daryl	Teddy	Ryan	NULL	NULL	D	white	nonhispanic	M	Houston Texas US	357 Davis Esplanade	Austin	Texas	Hays County	48491	78717	30.34413813354928	-97.93154202232147	145434.53	1092025.1	8978	AUSTIN
1640114	1994-04-26	NULL	999-43-5125	S99957739	X68419579X	Mrs.	Débora	Francisca	Madera	NULL	Toro	M	white	hispanic	F	Carolina Puerto ...	503 Keebler Cente...	Austin	Texas	Hays County	48453	78741	30.37585635437581	-97.55014443141285	97101.96	612425.37	13729	AUSTIN
1640115	1963-08-13	NULL	999-93-6364	S99951081	X66961424X	Mrs.	Jeanne	Chin	Hermiston	NULL	Murphy	M	white	nonhispanic	F	Sugar Land Texas...	692 Kuhlman Meadow	Austin	Texas	Hays County	48453	78750	30.30485682126074	-97.64607331883383	229508.53	671061.29	18485	AUSTIN
1640116	1981-12-08	NULL	999-57-5488	S99935062	X9142516X	Mrs.	Alicia	NULL	Casas	NULL	Fernández	W	white	hispanic	F	San Juan Puerto ...	138 Stokes Burg U...	Austin	Texas	Hays County	48453	78660	30.3317735118376	-97.77549016248436	306258.91	451376.93	2803	AUSTIN
1640117	2005-10-02	NULL	999-99-8126	S99927805	NULL	Mr.	Luis	Manuel	Olivares	NULL	NULL	NULL	black	hispanic	M	Juarez Chihuahua...	138 Hammes Ferry ...	Austin	Texas	Hays County	48453	78750	30.410658199513644	-97.79402405988824	57614.8	197611.58	79641	AUSTIN
1640118	1991-04-03	NULL	999-10-4433	S99956308	X89460230X	Mr.	Dennis	Arnold	Bode	NULL	NULL	M	black	nonhispanic	M	Arlington Texas US	803 Schuppe Lodge	Austin	Texas	Hays County	48453	78744	30.28485521668174	-97.82801447687807	65199.91	550518.32	996	AUSTIN
1640119	1993-08-10	NULL	999-42-6132	S99975433	X33504910X	Mr.	Howard	Ignacio	Abbott	NULL	NULL	M	white	hispanic	M	Dallas Texas US	864 Waelchi Camp ...	Austin	Texas	Hays County	48453	78734	30.186529529252535	-97.75063376599643	65535.25	15874.07	49427	AUSTIN
1640120	2020-03-29	NULL	999-22-3187	NULL	NULL	NULL	Roderick	Derick	Bosco	NULL	NULL	NULL	white	nonhispanic	M	Houston Texas US	967 Daugherty Ville	Austin	Texas	Hays County	48453	78742	30.541822801406493	-97.67857620480662	9712.89	0.0	59358	AUSTIN
1640121	1971-07-03	NULL	999-41-3265	S99969956	X38591251X	Mr.	Arlen	Lawerence	Bailey	NULL	NULL	M	white	hispanic	M	Edinburg Texas US	1066 Boyer Bridge	Austin	Texas	Hays County	48453	78723	30.17923179171047	-97.7836328356483	163553.29	57088.71	36658	AUSTIN
1640122	1983-10-24	NULL	999-40-9630	S99973435	X74977147X	Mr.	Dylan	Leandro	Hayes	NULL	NULL	S	white	nonhispanic	M	Amarillo Texas US	662 Bruen Dale	Austin	Texas	Hays County	48209	78610	30.266042712591933	-97.70404670044297	632187.87	0.0	81708	AUSTIN
1640123	1930-05-28	2002-04-23	999-37-6544	S99962206	X194625X	Mr.	César	Bernardo	Matos	NULL	NULL	S	white	hispanic	M	Wesley Saint And...	406 Weber Burg	Austin	Texas	Hays County	48453	78741	30.10035672282173	-97.5711182110766	105484.65	1757252.01	201309	AUSTIN
1640124	2003-01-11	NULL	999-16-5612	S99946585	X75431317X	Ms.	Grace	Kecia	Considine	NULL	NULL	NULL	white	nonhispanic	F	Belton Texas US	871 Kshlerin Spur...	Austin	Texas	Hays County	48453	78701	30.299539751127124	-97.67210058669058	88679.25	49575.44	40877	AUSTIN
1640125	1964-11-05	NULL	999-23-1335	S99955769	X35985558X	Mrs.	Anna	Ivy	Herman	NULL	Ratke	M	white	nonhispanic	F	Chandler Texas US	796 Bogisich Well...	Austin	Texas	Hays County	48453	78730	30.357610383913563	-97.74642132691518	1335250.57	79548.51	62100	AUSTIN

only showing top 20 rows

In [6]:

data_frame.show(2, vertical=True)

-RECORD 0-----------------------------------
 PATIENT_ID          | 1640106              
 BIRTHDATE           | 2000-10-04           
 DEATHDATE           | NULL                 
 SSN                 | 999-37-5169          
 DRIVERS             | S99927567            
 PASSPORT            | X33172632X           
 PREFIX              | Ms.                  
 FIRST               | Eloisa               
 MIDDLE              | María del Carmen     
 LAST                | León                 
 SUFFIX              | NULL                 
 MAIDEN              | NULL                 
 MARITAL             | NULL                 
 RACE                | white                
 ETHNICITY           | hispanic             
 GENDER              | F                    
 BIRTHPLACE          | Wesley  Saint And... 
 ADDRESS             | 557 Abbott Lane      
 CITY                | Austin               
 STATE               | Texas                
 COUNTY              | Hays County          
 FIPS                | 48453                
 ZIP                 | 78725                
 LAT                 | 30.354726579225414   
 LON                 | -97.73422750216807   
 HEALTHCARE_EXPENSES | 75583.97             
 HEALTHCARE_COVERAGE | 294119.98            
 INCOME              | 600051               
 SYNTHEA_CITY        | AUSTIN               
-RECORD 1-----------------------------------
 PATIENT_ID          | 1640107              
 BIRTHDATE           | 2012-02-10           
 DEATHDATE           | NULL                 
 SSN                 | 999-44-7325          
 DRIVERS             | NULL                 
 PASSPORT            | NULL                 
 PREFIX              | NULL                 
 FIRST               | Maryam               
 MIDDLE              | Danita               
 LAST                | Tillman              
 SUFFIX              | NULL                 
 MAIDEN              | NULL                 
 MARITAL             | NULL                 
 RACE                | white                
 ETHNICITY           | nonhispanic          
 GENDER              | F                    
 BIRTHPLACE          | Longview  Texas  US  
 ADDRESS             | 722 Fadel Key Sui... 
 CITY                | Austin               
 STATE               | Texas                
 COUNTY              | Hays County          
 FIPS                | 48453                
 ZIP                 | 78732                
 LAT                 | 30.241189796621992   
 LON                 | -97.67758852990734   
 HEALTHCARE_EXPENSES | 43459.33             
 HEALTHCARE_COVERAGE | 16700.37             
 INCOME              | 48653                
 SYNTHEA_CITY        | AUSTIN               
only showing top 2 rows

In [8]:

## General stats

In [9]:

# data_frame.groupBy("GENDER").agg(F.countDistinct("id").alias("unique_id_count"))
# data_frame.select("GENDER").distinct().count()

from pyspark.sql import functions as F
from pyspark.sql.functions import col, count, sum
# from pyspark.sql.window import Window

def count_unique_and_percent(value):
    
    counts = data_frame.groupBy(value).count()
    
    total = data_frame.count()
    
    result = counts.withColumn(
        "percent",
        col("count") / total * 100
    )
    
    result.show()

count_unique_and_percent("RACE")

[Stage 10:======================================>                   (2 + 1) / 3]

+--------+------+-------------------+
|    RACE| count|            percent|
+--------+------+-------------------+
|   asian|108267|  7.615555380485856|
|   other| 16577|  1.166034540001238|
|   white|978825|  68.85104413444603|
|   black|294507|  20.71577090379107|
|  native|  6960|0.48956991002042693|
|hawaiian| 16520|  1.162025131255381|
+--------+------+-------------------+

In [10]:

count_unique_and_percent("ETHNICITY")

[Stage 16:======================================>                   (2 + 1) / 3]

+-----------+-------+------------------+
|  ETHNICITY|  count|           percent|
+-----------+-------+------------------+
|nonhispanic|1206908| 84.89451737973181|
|   hispanic| 214748|15.105482620268193|
+-----------+-------+------------------+

In [11]:

count_unique_and_percent("GENDER")

[Stage 22:======================================>                   (2 + 1) / 3]

+------+------+------------------+
|GENDER| count|           percent|
+------+------+------------------+
|     F|713363|50.178313178434166|
|     M|708293|49.821686821565834|
+------+------+------------------+

4. Income Distribution Analysis¶

In [13]:

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

income= (data_frame.select("INCOME").
                    toPandas()
            )


fig, ax = plt.subplots()
income.hist(ax=ax)
ax.xaxis.set_major_formatter(ticker.FuncFormatter(lambda x, _: f'{x:,.0f}'))
plt.xticks(rotation=45)  # Rotate so numbers don't overlap
plt.tight_layout()
plt.show()

No description has been provided for this image

Interpretation:

The Y-axis shows frequency (count) — the number of records/rows in the dataset that fall within each income bin. The 1e6 multiplier at the top means the values are in millions, so:

The tallest bar (~1.17) = approximately 1,170,000 records with income between $0–$100K
The second bar (~0.19) = approximately 190,000 records with income between $100K–$200K
All remaining bars are nearly flat = very few records with income above $200K

What the shape tells you

This is a classic right-skewed (positive skew) distribution, which is very typical for income data:
The vast majority of your population earns under \$100k
A smaller but notable group earns \$100k – \$200k
A tiny minority (outliers) earn up to \$1,000,000

In [15]:

data_frame.select("INCOME").summary("count",  "min", "25%","mean","50%", "stddev", "75%", "max").show()

[Stage 26:===================>                                      (1 + 1) / 3]

+-------+------------------+
|summary|            INCOME|
+-------+------------------+
|  count|           1421656|
|    min|                 0|
|    25%|             17149|
|   mean| 78416.96058680862|
|    50%|             39850|
| stddev|138024.94543862357|
|    75%|             81096|
|    max|            998949|
+-------+------------------+

Note: there's a huge difference in mean and median (or 50%). This is because mean was bias towards outliers (which present in this case).

In [17]:

income["INCOME"].quantile([0.1, 0.25, 0.5, 0.75, 0.9, 0.99])

Out[17]:

0.10      6884.00
0.25     17150.00
0.50     39855.00
0.75     81110.00
0.90    144760.50
0.99    840170.45
Name: INCOME, dtype: float64

Log transformation of x-axis¶

The long right tail with near-zero bars suggests possible outliers or data quality issues — we may want to investigate those high-income records. Since the outliers are valid but distorting the viz, trying log-scaling the x-axis for a clearer picture.
The log-scaling will compress the long right tail so we can actually see the distribution of high earners instead of flat near-zero bars.

Interpretation:
- The x-axis is now logarithmic, so the spacing between ticks is no longer equal — notice how 10,000 to 100,000 takes the same horizontal space as 100,000 to 1,000,000. This is intentional and reveals hidden structure.
- In the previous plot, everything above \$200K looked like a flat zero line. Now we can clearly see:
  - There are still meaningful counts of people earning \$200K – \$500K
  - The distribution gradually tapers off rather than falling off a cliff
  - The data has a much smoother, more gradual decline across all income levels

In [20]:

fig, ax = plt.subplots()
income.hist(ax=ax, bins=50)
ax.set_xscale('log')
ax.xaxis.set_major_formatter(ticker.FuncFormatter(lambda x, _: f'{x:,.0f}'))
plt.show()

In [21]:

import matplotlib.ticker as ticker

fig, ax = plt.subplots()
income.hist(ax=ax, bins=50)
ax.set_xscale('log')

# Explicitly define the tick positions you want to show
tick_positions = [5000, 15000, 40000, 70000, 100000, 200000, 500000, 1000000]
ax.set_xticks(tick_positions)

# Format them with commas and a $ sign
ax.xaxis.set_major_formatter(ticker.FuncFormatter(lambda x, _: f'${x:,.0f}'))

plt.xticks(rotation=45)  # rotate so labels don't overlap
plt.tight_layout()       # prevent labels from being cut off
plt.show()

Interpretation:
- From reading the bars, we observe:
| Income Range | Approx. Count | Interpretation | |---------------|---------------|----------------| | ~$5K–$15K | ~410,000 | Largest group — very low earners | | ~$15K–$40K | ~305,000 | Second largest group | | ~$40K–$70K | ~200,000 | Middle-low earners | | ~$70K–$100K | ~150,000 | Approaching middle class | | ~$100K–$200K | ~100,000 | Upper-middle earners | | ~$200K–$500K | ~50,000 | High earners | | ~$500K+ | ~20,000–25,000 | Very high earners / outliers |

Insight:
The income distribution is strongly right-skewed, with the majority of individuals earning under $40K.

⚠️ Data Note:
The $500K+ category contains extreme outliers that may distort mean income statistics.

This shape — a straight declining line on a log x-axis — is actually close to a Pareto distribution, which is the classic mathematical model for income and wealth data in real-world populations. It basically confirms: the fewer people, the higher the income.

Dataset Snapshot¶

Total records: ~1,235,000
Median income: ~$32,000
Mean income: ~ \$48,000
Highest income bracket: \$500K+
Lowest income bracket: \$5K

5. Gender Analysis¶

Summary stats a quick lense¶

In [26]:

df = data_frame.select("GENDER", "INCOME").toPandas()
df.groupby("GENDER")["INCOME"].describe()

Out[26]:

	count	mean	std	min	25%	50%	75%	max
GENDER
F	713363.0	78167.985528	138061.080894	0.0	17037.0	39635.0	80755.5	998862.0
M	708293.0	78667.717823	137988.185153	0.0	17258.0	40087.0	81479.0	998949.0

From summary stats at a quick lense, there's not much difference in earnings between genders. Now we can take a look of viz, then do some statistics to confirm the observation.

Viz distribution¶

In [27]:

import seaborn as sns

fig, ax = plt.subplots()
sns.boxplot(data=df, x="GENDER", y="INCOME", ax=ax)
ax.set_yscale('log')
ax.yaxis.set_major_formatter(ticker.FuncFormatter(lambda x, _: f'${x:,.0f}'))
plt.title("Income by Gender")
plt.tight_layout()
plt.show()

The box plot shows you median, spread, and outliers per gender at a glance. It seems that there's not much difference in median. Also notice lots of outliers for both genders

Statistical test: is the difference significant?¶

In [29]:

from scipy import stats

male   = df[df["GENDER"] == "M"]["INCOME"].dropna()
female = df[df["GENDER"] == "F"]["INCOME"].dropna()

t_stat, p_value = stats.mannwhitneyu(male, female, alternative='two-sided')
print(f"p-value: {p_value}")

p-value: 2.3758734348749904e-07

Use Mann-Whitney U (not a t-test) because income is skewed, not normally distributed.

p-value	Meaning
`< 0.05`	Statistically significant difference in income between genders
`>= 0.05`	No statistically significant difference found

With p-value above, there's no statistically significant difference in income between 2 genders

6. Income vs Healthcare Spending¶

This is now a relationship between two continuous variables, so the approach is different from the gender comparison

Summary stats¶

In [30]:

df = data_frame.select("INCOME", "HEALTHCARE_EXPENSES").toPandas()

# Pearson (linear) or Spearman (non-linear, better for skewed data)
print(df[["INCOME", "HEALTHCARE_EXPENSES"]].corr(method='spearman'))

                       INCOME  HEALTHCARE_EXPENSES
INCOME               1.000000             0.462926
HEALTHCARE_EXPENSES  0.462926             1.000000

Viz¶

In [31]:

import seaborn as sns
import numpy as np
import pandas as pd

# Create income brackets
df["INCOME_BUCKET"] = pd.cut(df["INCOME"],
    bins=[0, 15000, 40000, 70000, 100000, 200000, 500000, np.inf],
    labels=["<$15K", "$15-40K", "$40-70K", "$70-100K", "$100-200K", "$200-500K", "$500K+"]
)

fig, ax = plt.subplots(figsize=(12, 6))
sns.boxplot(data=df, x="INCOME_BUCKET", y="HEALTHCARE_EXPENSES", ax=ax)
ax.set_yscale('log')
ax.yaxis.set_major_formatter(ticker.FuncFormatter(lambda x, _: f'${x:,.0f}'))
plt.xlabel("Income Range")
plt.ylabel("Healthcare Expenses")
plt.title("Healthcare Expenses by Income Bracket")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

7. Income Distribution Summary¶

Key Finding: Healthcare Expenses RISE with Income (up to a point)

Income Bracket	Median Healthcare Expense	Observation
<\$15k	\$12,000	Very limited spending
\$15–40k	\$40,000	Significant jump
\$40–70k	\$175,000	Another large jump
\$70–100k	\$175,000	Similar to previous bracket
\$100–200k	\$150,000	Slight plateau
\$200–500k	\$110,000	Starts declining
\$500k+	\$110,000	Similar to $200–500K

Important Patterns to Notice

Insight 1 — Healthcare spending rises sharply with income initially
Lower-income populations (<$40K) show much lower median healthcare expenses, likely reflecting:

Reduced access to care

Delayed or deferred treatments

Affordability constraints

Insight 2 - The big jump from <$15K to \$15–40K

The median nearly triples between these two groups. The lowest earners spend dramatically less on healthcare — likely due to inability to afford care, not because they need it less.

Insight 3 — Peak spending occurs in middle-income groups (\$40K –\$100K)
Healthcare expenses increase dramatically and plateau around \$175K, suggesting:

Higher healthcare utilization once financial barriers decrease

Greater access to medical services

Insight 4 - Plateau and slight decline after \$70K

Healthcare expenses don't keep rising indefinitely. After \$70–100K, the median flattens then slightly drops — suggesting:

high earners may be more efficient or healthier overall.

Insight 5 — Spending declines in higher-income brackets
Income groups above $200K show lower median healthcare expenses, which may reflect:

Better preventive care

Healthier lifestyles

Superior insurance coverage

Insight 6 - Massive outliers in every bracket

Every income group has dots reaching \$1M – \$10M in healthcare expenses. These likely represent:

Catastrophic illnesses (cancer, transplants)

Long-term chronic conditions

These occur across all income levels, not just the wealthy

Insight 7 - The Spread (Box Size) Tells a Story Too

<$15K — very tight box → low earners consistently spend little

\$40–100K — widest boxes → middle income has the most variability in healthcare spending

\$200K+ — boxes tighten again → high earners are more consistent in their spending

Bottom Line

Income and healthcare expenses are positively correlated up to ~ \$70–100K, then level off. The biggest barrier is at the bottom — the <$15K group spends far less, likely reflecting access and affordability constraints rather than better health.

8. Final Takeaways¶

Although low-income populations represent the largest group in the dataset, their healthcare spending remains relatively low.

The majority of healthcare expenditures appear concentrated in middle-income brackets ($40K–$100K), where healthcare utilization peaks.

This finding may reflect increased healthcare access through employer-provided insurance and higher treatment utilization among middle-income individuals.

Jupyter Notebook Environment¶

Tools and libraries used in this project:

Python 3.12
Apache Spark (PySpark)
Pandas
NumPy
Matplotlib / Seaborn
Jupyter Notebook

Infrastructure:

Local Spark environment
Dataset sourced from Snowflake Marketplace