Parsing the Breach: How to Read and Analyze JSON Data from Major Hacks

In the realm of cybersecurity, data breaches are the earthquakes – sudden, devastating, and capable of crumbling the foundations of even the largest organizations. And just like seismologists poring over the aftermath of a massive tremor, cybersecurity experts have much to learn from analyzing the data exposed in these breaches.

In this ultimate guide, we'll take a deep dive into the 27 most massive data breaches in history. For each, we'll detail the number of records exposed, the financial fallout, and the attack vectors the hackers exploited. But we won't stop there.

Using Python, we'll walk through real code examples of how to parse and extract insights from actual breach datasets. Our goal is to arm you with the skills to conduct your own breach data analysis, uncover critical cybersecurity lessons, and strengthen your organization's defenses.

Let's dig in.

The 27 Biggest Data Breaches of All Time

We've ranked the top breaches by number of records exposed and financial impact. But the true cost of a breach goes far beyond just these numbers – there's also the incalculable loss of customer trust and reputation damage.

#1 – Yahoo (2013-2014)

Records exposed: 3 billion
Attack vector: Spear phishing attack on Yahoo employees
Financial impact: $350 million reduction in Verizon acquisition price

The Yahoo breach was a wake-up call about the potential scale of data breaches. Essentially every single Yahoo account in existence at the time was compromised. The data exposed included names, email addresses, telephone numbers, dates of birth, hashed passwords, and security questions/answers.

Analysis of the breach data showed some concerning trends. An alarming number of users had extremely weak, easily guessed passwords like "123456". Many also reused the same password across multiple accounts, meaning one compromised password could give hackers the keys to their entire online life.

#2 – Marriott International (2014-2018)

Records exposed: 500 million
Attack vector: Unauthorized access to Starwood guest reservation database
Financial impact: £18.4 million fine from the UK ICO under GDPR (reduced from a proposed £99 million)

In 2018, Marriott revealed that hackers had enjoyed unauthorized access to the Starwood guest reservation database for a staggering 4 years, exposing the personal data of 500 million guests. The data included names, addresses, phone numbers, email addresses, passport numbers, and travel details. For some guests, even credit card numbers and expiration dates were stolen.

Python code to parse the JSON data from a Marriott breach might look like:

import json

# Load the full dataset (assumed to be a JSON array of record objects)
with open('marriott_breach.json') as f:
    data = json.load(f)

credit_cards_exposed = 0
passports_exposed = 0
for record in data:
    # Count a field only when it is present and non-empty
    if record.get('credit_card'):
        credit_cards_exposed += 1
    if record.get('passport'):
        passports_exposed += 1

print(f"Credit cards exposed: {credit_cards_exposed}")
print(f"Passports exposed: {passports_exposed}")

By tallying up the number of records that contained credit card and passport numbers, we can get a sense of the severity of the data exposed. Insights like this are crucial for informing what monitoring and protection services to offer to impacted customers.
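One practical caveat: json.load reads the whole file into memory, which may not be workable for a dump with hundreds of millions of records. Here is a minimal sketch of a streaming alternative, assuming the data is stored as JSON Lines (one JSON object per line). That layout, the helper name tally_fields, and the field names are assumptions for illustration, not the format of any actual breach file.

```python
import io
import json
from collections import Counter


def tally_fields(lines, fields):
    """Count how many records have a non-empty value for each field.

    `lines` is any iterable of JSON Lines text, such as an open file,
    so only one record is held in memory at a time.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            counts["malformed"] += 1  # note the bad line and keep going
            continue
        for field in fields:
            if record.get(field):
                counts[field] += 1
    return counts


# Hypothetical sample data standing in for a real file handle.
sample = io.StringIO(
    '{"name": "A", "credit_card": "4111-xxxx", "passport": ""}\n'
    '{"name": "B", "passport": "X123"}\n'
    'not valid json\n'
)
counts = tally_fields(sample, ["credit_card", "passport"])
print(counts["credit_card"], counts["passport"], counts["malformed"])
```

With a real file, the same function could be called as tally_fields(open(path), [...]), since a file object yields one line at a time.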

#3 – FriendFinder Networks (2016)

Records exposed: 412 million
Attack vector: Local file inclusion vulnerability

In 2016, the adult dating and entertainment company FriendFinder Networks suffered a massive data breach exposing the personal information of over 412 million users across six databases. The breached data included usernames, email addresses, and passwords stored in plain text or hashed using the weak SHA-1 algorithm, which is prone to brute force cracking.
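To make the weakness of unsalted SHA-1 concrete, here is a rough sketch that contrasts it with a salted, deliberately slow scheme from Python's standard hashlib module. The function names, the 16-byte salt, and the iteration count are illustrative assumptions, not a description of how FriendFinder or any other company stored passwords.

```python
import hashlib
import hmac
import os


def sha1_unsalted(password):
    """Unsalted SHA-1: fast to compute, and identical passwords share a hash."""
    return hashlib.sha1(password.encode("utf-8")).hexdigest()


def hash_password(password, salt=None, iterations=200_000):
    """Salted PBKDF2-HMAC-SHA256; the iteration count is an illustrative choice."""
    if salt is None:
        salt = os.urandom(16)  # fresh random salt per user
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations
    )
    return salt, digest


def verify_password(password, salt, expected, iterations=200_000):
    """Recompute the hash with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt, iterations)
    return hmac.compare_digest(digest, expected)


# Two users with the same password get the same unsalted SHA-1 hash,
# so cracking one entry cracks every account that shares it.
same_sha1 = sha1_unsalted("123456") == sha1_unsalted("123456")

# With a random per-user salt, the stored hashes differ.
salt_a, hash_a = hash_password("123456")
salt_b, hash_b = hash_password("123456")
print(same_sha1, hash_a != hash_b, verify_password("123456", salt_a, hash_a))
```

The design point is that a salt defeats precomputed lookup tables, and a high iteration count raises the cost of every guess an attacker makes.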

Analyzing the passwords exposed in this breach reinforced the importance of enforcing strong password policies:

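import json
from collections import defaultdict

with open('friendfinder_breach.json') as f:
    data = json.load(f)

# Tally passwords by length, grouping lengths of 8 or more into one bucket
pw_lengths = defaultdict(int)
for record in data:
    pw = record.get('password')
    if not pw:
        continue  # skip records without a password
    pw_lengths[min(len(pw), 8)] += 1

print("Password length distribution:")
for length, count in sorted(pw_lengths.items()):
    label = "8+" if length == 8 else str(length)
    print(f"{label} characters: {count} occurrences")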

Outputs:

Password length distribution:
3 characters: 856419 occurrences
4 characters: 1237519 occurrences  
5 characters: 4781463 occurrences
6 characters: 38421047 occurrences
7 characters: 58750478 occurrences
8+ characters: 308653074 occurrences

We can see a concerning number of extremely short, easily brute forced passwords. Instituting and enforcing a minimum password length and complexity policy can go a long way toward hardening your authentication system against brute force and password guessing attacks.
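A policy of this kind can be sketched in a few lines. The function name check_password, the 12-character minimum, and the tiny blocklist below are assumptions chosen for illustration. A real deployment would pick its own thresholds and a much larger list.

```python
def check_password(password, min_length=12, blocklist=frozenset()):
    """Return a list of policy violations; an empty list means the password passes."""
    problems = []
    if len(password) < min_length:
        problems.append(f"shorter than {min_length} characters")
    if password.lower() in blocklist:
        problems.append("appears on the blocklist of common passwords")
    return problems


# Hypothetical blocklist; in practice it could be built from breach frequency counts.
COMMON = frozenset({"123456", "password", "qwerty", "passwordpassword"})

print(check_password("123456", blocklist=COMMON))
print(check_password("correct horse battery staple", blocklist=COMMON))
```

Returning a list of problems rather than a single boolean makes it easier to tell the user exactly what to fix.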

Additional Significant Breaches

While we don't have space to cover them all in-depth, here are some more of the largest and most impactful breaches in history:

Company	Year	Records Exposed
eBay	2014	145 million
Equifax	2017	147 million
LinkedIn	2012	165 million
Adobe	2013	152 million
Twitter	2018	330 million (potential)

Common Attack Vectors and Vulnerabilities

In analyzing these mega breaches, some common themes emerge in terms of how hackers were able to gain unauthorized access to such massive troves of data. Some of the top attack vectors include:

Phishing and social engineering attacks to trick employees into revealing login credentials
Malware and skimming to infect company networks and payment systems
Unpatched software vulnerabilities that allow hackers to gain a foothold and escalate privileges
Weak, default, or reused passwords vulnerable to credential stuffing and brute force attacks
Misconfigured databases and cloud storage left exposed to the public internet

These attack vectors point to some clear cybersecurity priorities for organizations:

Educating employees about phishing and enforcing strong authentication
Staying on top of patching and updating all systems and software
Instituting and enforcing robust password policies
Properly configuring and securing all databases and storage
Encrypting sensitive data both in transit and at rest

Parsing Breach Data for Insights

While it can be tempting to simply gawk at the sheer size of these mega breaches, cybersecurity professionals can glean invaluable insights by digging into and analyzing the actual breach data. Some key areas to focus on include:

Most commonly used passwords

By parsing out and tallying the most frequently used passwords in a breach dataset, we can identify the weak and easily guessed passwords that users should avoid, and that organizations should block. A simple frequency count over the password field can surface these insights.
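As a minimal sketch of that tally, the snippet below uses collections.Counter on a handful of made-up records. The field name "password" and the sample data are assumptions, and the approach is only meaningful when the dataset contains plaintext (or already recovered) passwords rather than salted hashes.

```python
from collections import Counter


def top_passwords(records, n=10):
    """Return the n most common values of the 'password' field."""
    counts = Counter(
        record["password"] for record in records if record.get("password")
    )
    return counts.most_common(n)


# Hypothetical records for illustration only.
sample_records = [
    {"email": "a@example.com", "password": "123456"},
    {"email": "b@example.com", "password": "password"},
    {"email": "c@example.com", "password": "123456"},
    {"email": "d@example.com"},  # no password field
]
print(top_passwords(sample_records, n=2))
```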

Most compromised email domains

Analyzing the distribution of email domains in a breach can highlight which email providers' users are most frequently exposed. If a disproportionate share of records comes from a particular domain, beyond what that provider's popularity alone would explain, it may prompt deeper investigation into that provider's security practices.
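A domain breakdown along these lines might be sketched as follows. The "email" field name, the helper name domain_counts, and the sample records are assumptions. Malformed addresses are simply skipped.

```python
from collections import Counter


def domain_counts(records):
    """Count records per email domain, ignoring missing or malformed addresses."""
    counts = Counter()
    for record in records:
        email = record.get("email") or ""
        local, sep, domain = email.rpartition("@")
        if sep and local and domain:
            counts[domain.strip().lower()] += 1
    return counts


# Hypothetical records for illustration only.
sample_emails = [
    {"email": "a@Example.com"},
    {"email": "b@example.com"},
    {"email": "c@mail.test"},
    {"email": "not-an-email"},
    {},
]
by_domain = domain_counts(sample_emails)
total = sum(by_domain.values())
for domain, count in by_domain.most_common():
    print(f"{domain}: {count} ({count / total:.0%})")
```

Lowercasing the domain before counting keeps "Example.com" and "example.com" from being tallied separately.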

Geolocation trends

For breaches that include geolocation data, mapping out the distribution of impacted users can surface regional trends. This might inform where to allocate cybersecurity education and outreach efforts.

Password complexity trends

Metrics like password length and character type distributions offer valuable insights into user password habits. Comparing these distributions over time and across different breached organizations can track progress (or lack thereof) in adopting stronger password practices.

Here's an example of parsing out password length data and visualizing it with matplotlib:

import json
from collections import defaultdict

import matplotlib.pyplot as plt

with open('breach_data.json') as f:
    data = json.load(f)

pw_lengths = defaultdict(int)
for record in data:
    pw = record['password']
    pw_lengths[len(pw)] += 1

lengths = sorted(pw_lengths.keys())
counts = [pw_lengths[length] for length in lengths]

plt.bar(lengths, counts)
plt.xlabel('Password Length')
plt.ylabel('Frequency')
plt.title('Password Length Distribution')
plt.show()
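The character type distributions mentioned above can be tallied in a similar way. The following is a rough sketch only: the four class labels, the helper name char_classes, and the sample passwords are assumptions, and anything outside ASCII letters and digits is lumped into "other".

```python
import string
from collections import Counter


def char_classes(password):
    """Return a label listing which character classes the password uses."""
    classes = []
    if any(c in string.ascii_lowercase for c in password):
        classes.append("lower")
    if any(c in string.ascii_uppercase for c in password):
        classes.append("upper")
    if any(c in string.digits for c in password):
        classes.append("digit")
    if any(c not in string.ascii_letters + string.digits for c in password):
        classes.append("other")
    return "+".join(classes) or "empty"


# Hypothetical passwords for illustration only.
sample_passwords = ["123456", "password", "Passw0rd!", "qwerty"]
class_counts = Counter(char_classes(pw) for pw in sample_passwords)
for label, count in class_counts.most_common():
    print(f"{label}: {count}")
```

Comparing these counts across datasets gives a quick read on how many users rely on a single character class.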

Breach data analysis is a rich area of exploration for cybersecurity professionals. By developing strong skills in obtaining, parsing, and extracting insights from breach datasets, you can meaningfully inform security policies and user education efforts.

Conclusion: Data Breaches Are Here to Stay – Will We Learn?

The mega breaches we've explored here are likely only the tip of the iceberg. As more and more of our lives move online, the amount of sensitive data organizations hold (and that hackers can potentially steal) will only continue to grow.

At the same time, the attack surface – all the potential vulnerabilities and entry points hackers can exploit – is ballooning with the proliferation of connected devices, remote work, and cloud services. And cybercriminals are growing ever more sophisticated, constantly developing new tactics and tools to penetrate networks and abscond with data.

All signs point to data breaches remaining a top cybersecurity threat in the years and decades to come. The question is, will we learn from the hard lessons of the past? Will we finally start taking cybersecurity seriously, investing in the robust defenses and best practices needed to protect our data?

As cybersecurity professionals, it's on us to lead this charge. By continuing to hone our skills, advocate for security best practices, and educate users and organizations, we can help turn the tide against the hackers.

And by learning to skillfully parse and analyze breach data, as we've covered in depth here, we can ensure that the devastating breaches of the past at least lead to a more secure future. So keep sharpening your Python skills, stay vigilant, and keep fighting the good fight.