Categorical Data Analysis in Python Using Automation

Automated categorical data analysis uses programmatic techniques to examine categorical features efficiently and consistently. Rather than analyzing each column individually, automation produces scalable summaries of categorical attributes with Python libraries, which is essential when working with datasets that contain many non-numeric fields.

Why Automation Matters in Data Analysis

Automating categorical analysis streamlines the process, allowing analysts to focus on insights rather than repetitive tasks.

Saves time and effort: Eliminates manual counting and repetitive operations.
Ensures consistency: Applies the same logic across multiple features without variation.
Scales efficiently: Handles large datasets and numerous categorical columns with ease.
Reduces errors: Minimizes human mistakes that can occur during manual analysis.
Improves readability and maintainability: Clean, reusable code makes workflows easier to understand and update.

Understanding the Dataset Structure

Here, we use the Google Play Store dataset, which contains information about mobile applications. The presence of multiple non-numeric attributes makes it suitable for categorical data analysis.

You can download the dataset from here

Python

import pandas as pd

df = pd.read_csv('googleplaystore.csv')
df.head()

Output:

Data Types and Column Information

The info() method shows the data type and non-null count for each column, making it easier to distinguish categorical and numerical features and identify missing values. This information guides appropriate preprocessing and analysis steps.

Python

df.info()

Output:
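Once the data types are known, the split between categorical and numerical columns can itself be automated. The following is a minimal sketch, assuming a small hypothetical frame named sample that stands in for the Play Store data; only the column names are borrowed from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for the Play Store data (not real rows from the dataset).
sample = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Category": ["GAME", "TOOLS", "GAME"],
    "Rating": [4.1, 3.9, 4.5],
    "Type": ["Free", "Paid", "Free"],
})

# Non-numeric columns are the candidates for categorical analysis.
categorical_candidates = sample.select_dtypes(exclude="number").columns.tolist()

# Numeric columns are handled separately.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()

print(categorical_candidates)
print(numeric_cols)
```

Selecting by exclusion keeps every text column regardless of which string dtype pandas assigns. Identifier-like columns such as App still need to be dropped by hand, since they are non-numeric but not meaningful categories.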

Automating Category Counts in App Data

Counting apps manually is inefficient and error-prone. Python provides several ways to automate this task, from basic loops to optimized Pandas methods.

Counting Apps in Each Category (Manual Automation with Loops)

A simple approach uses loops to count occurrences of each category, which helps in understanding the underlying logic of automation.

df['Category'].unique() extracts all distinct categories.
A nested loop iterates through all rows to count how many times each category appears.
Counts are stored in a dictionary for easy access and display.

Python

categories = {}

for name in df['Category'].unique():
    ct = 0
    for value in df['Category']:
        if value == name:
            ct += 1
    categories[name] = ct

for key, value in categories.items():
    print(f'{key}: {value}')

Output:
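The nested loop above scans the whole column once for every distinct category. The same tally can be built in a single pass by updating a dictionary as each value is read. This is a sketch on a hypothetical sample frame with no missing values, not a replacement for the Pandas method.

```python
import pandas as pd

# Hypothetical sample; the real column comes from googleplaystore.csv.
sample = pd.DataFrame({"Category": ["GAME", "TOOLS", "GAME", "FAMILY", "GAME"]})

counts = {}
for value in sample["Category"]:
    # Start at 0 the first time a category is seen, then add 1.
    counts[value] = counts.get(value, 0) + 1

for key, value in counts.items():
    print(f"{key}: {value}")
```

With k distinct categories and n rows, the nested version performs roughly k × n comparisons, while the single pass touches each row once.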

Optimized Approach Using Pandas value_counts()

Pandas offers a faster, more efficient method for counting category occurrences, which is ideal for large datasets.

df['Category'].value_counts() provides counts of each category in a single line.
Categories are automatically sorted by frequency in descending order.
Optimized for large datasets, eliminating the need for manual loops.

Python

df['Category'].value_counts()

Output:
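Two variations of value_counts() are often useful in practice: limiting the result to the most frequent categories, and including missing values in the tally. The sketch below uses a hypothetical sample column that contains one missing entry.

```python
import pandas as pd

# Hypothetical sample with one missing category.
sample = pd.DataFrame(
    {"Category": ["GAME", "TOOLS", "GAME", None, "FAMILY", "GAME", "TOOLS"]}
)

# Results are sorted by frequency, so head(2) returns the two largest categories.
top_two = sample["Category"].value_counts().head(2)

# By default missing values are left out; dropna=False counts them as well.
with_missing = sample["Category"].value_counts(dropna=False)

print(top_two)
print(with_missing)
```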

Automating Analysis for App Types

Analyzing app types can be automated to quickly understand the distribution of Free vs Paid apps without manually counting each entry.

Counting App Types with Loops

A simple loop-based approach counts each app type and stores the results in a dictionary.

df['Type'].unique() retrieves all distinct app types.
Nested loops iterate through the column to tally counts.

Python

types = {}

for name in df['Type'].unique():
    ct = 0
    for value in df['Type']:
        if value == name:
            ct += 1
    types[name] = ct

print(types)

Output:

{'Free': 10039, 'Paid': 800, nan: 0, '0': 1}
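The nan: 0 entry in this output does not mean the column has no missing values. NaN never compares equal to anything, including itself, so the test value == name is always False for missing entries and the loop cannot count them. A short sketch on a hypothetical sample series shows the effect and one way to count missing values directly.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing app type.
sample = pd.Series(["Free", "Paid", np.nan, "Free"], name="Type")

# NaN is not equal to itself, so an equality-based tally never matches it.
nan_equals_itself = (np.nan == np.nan)
loop_count = sum(1 for value in sample if value == np.nan)

# isna() flags missing entries, and summing the flags counts them.
missing = int(sample.isna().sum())

print(nan_equals_itself, loop_count, missing)
```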

Using Pandas value_counts()

Pandas provides a faster way to get the same tally. Passing normalize=True returns proportions instead of counts, and multiplying by 100 converts them to percentages.

Python

(df['Type'].value_counts(normalize=True) * 100).round(2)

Output:

This method eliminates manual loops and handles large datasets efficiently.
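Counts and percentages are often easier to read side by side. One possible way to do that is to place both results in a single table, shown here on a hypothetical sample series.

```python
import pandas as pd

# Hypothetical sample of app types.
sample = pd.Series(["Free", "Free", "Paid", "Free"], name="Type")

# Both Series share the same index (the type labels), so they align by label.
summary = pd.DataFrame({
    "count": sample.value_counts(),
    "percent": (sample.value_counts(normalize=True) * 100).round(2),
})

print(summary)
```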

Automating Content Rating Analysis

Content Rating indicates the target audience for each app and is an important categorical feature to analyze. Automating this analysis helps quickly understand how apps are distributed across age groups.

Counting Apps by Content Rating

We can use a loop to count the number of apps in each content rating and store the results in a dictionary.

df['Content Rating'].unique() retrieves all distinct content ratings.
A nested loop iterates through the column to tally the count for each rating.

Python

content_rating = {}

for name in df['Content Rating'].unique():
    ct = 0
    for value in df['Content Rating']:
        if value == name:
            ct += 1
    content_rating[name] = ct

print(content_rating)

Output:

{'Everyone': 8714, 'Teen': 1208, 'Everyone 10+': 414, 'Mature 17+': 499, 'Adults only 18+': 3, 'Unrated': 2, nan: 0}

Automated Analysis Using Pandas

For a cleaner and faster approach, Pandas provides the value_counts() method to automate counting and sorting of categories.

Python

# Absolute counts per content rating
print(df['Content Rating'].value_counts())

# Share of each content rating as a percentage
print((df['Content Rating'].value_counts(normalize=True) * 100).round(2))

Output:

This method eliminates the need for loops, handles large datasets efficiently, and instantly provides both counts and proportions of apps in each content rating.

Quick Categorical Summary Using describe()

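Pandas provides a fast way to get summary statistics for categorical data using the describe() method. When a column is stored as text rather than as numbers, describe() reports the count, the number of unique values, the most frequent value, and its frequency. This avoids manual counting and gives an instant overview of the column.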

Python

df['Reviews'].describe()

Output:

count: total number of non-null values in the column.
unique: number of distinct values.
top: the most frequent value in the column.
freq: frequency of the most common value.

This method is especially useful for quick exploratory analysis, giving a snapshot of the data distribution without writing multiple lines of code.
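describe() can also summarize every non-numeric column in one call. The sketch below assumes a hypothetical sample frame and excludes numeric columns so that only the categorical summary rows appear.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mixing categorical and numeric columns.
sample = pd.DataFrame({
    "Category": ["GAME", "TOOLS", "GAME"],
    "Type": ["Free", "Paid", "Free"],
    "Rating": [4.1, 3.9, 4.5],
})

# Excluding numeric dtypes leaves the text columns,
# which are summarized with count, unique, top and freq.
overview = sample.describe(exclude=[np.number])

print(overview)
```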

Automating Analysis for Multiple Categorical Columns

Instead of analyzing one column at a time, we can generalize categorical analysis using a loop over multiple columns. This makes the process reusable and scalable.

categorical_cols stores the list of categorical columns to analyze.
Looping through each column applies the same value_counts() logic automatically.
New categorical features can be added easily without rewriting code.

Python

categorical_cols = ['Category', 'Type', 'Content Rating']

for col in categorical_cols:
    print(f"\nAnalysis for {col}")
    print(df[col].value_counts())

Output:
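The loop above prints its results, which is convenient for inspection but hard to reuse. As one possible extension, the same idea can be wrapped in a function that returns a compact table with one row per column. The function name summarize_categoricals and the sample frame are hypothetical and are not part of the original code.

```python
import pandas as pd

def summarize_categoricals(frame, columns):
    """Return one summary row per column: distinct values, missing values, top value."""
    rows = []
    for col in columns:
        counts = frame[col].value_counts()  # missing values are excluded by default
        rows.append({
            "column": col,
            "n_unique": int(frame[col].nunique()),
            "n_missing": int(frame[col].isna().sum()),
            "top": counts.index[0] if len(counts) else None,
            "top_count": int(counts.iloc[0]) if len(counts) else 0,
        })
    return pd.DataFrame(rows).set_index("column")

# Hypothetical sample with one missing category.
sample = pd.DataFrame({
    "Category": ["GAME", "TOOLS", "GAME", None],
    "Type": ["Free", "Paid", "Free", "Free"],
})

report = summarize_categoricals(sample, ["Category", "Type"])
print(report)
```

Returning a DataFrame instead of printing lets the summary be filtered, sorted, or saved like any other table.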

You can download full code from here
