Automatic categorical data analysis uses programmatic techniques to examine categorical features efficiently and consistently. Rather than analyzing each column individually, automation produces scalable summaries of categorical attributes with Python libraries, which is essential when a dataset contains many non-numeric fields.
Why Automation Matters in Data Analysis
Automating categorical analysis streamlines the process, allowing analysts to focus on insights rather than repetitive tasks.
- Saves time and effort: Eliminates manual counting and repetitive operations.
- Ensures consistency: Applies the same logic across multiple features without variation.
- Scales efficiently: Handles large datasets and numerous categorical columns with ease.
- Reduces errors: Minimizes human mistakes that can occur during manual analysis.
- Improves readability and maintainability: Clean, reusable code makes workflows easier to understand and update.
Understanding the Dataset Structure
Here, we use the Google Play Store dataset, which contains information about mobile applications. The presence of multiple non-numeric attributes makes it suitable for categorical data analysis.
import pandas as pd
df = pd.read_csv('googleplaystore.csv')
df.head()
Output:

Data Types and Column Information
The info() method shows the data type and non-null count for each column, making it easier to distinguish categorical and numerical features and identify missing values. This information guides appropriate preprocessing and analysis steps.
df.info()
Output:

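The split between categorical and numerical columns that info() reveals can also be obtained programmatically. The following is a minimal sketch, assuming a small made-up frame named sample in place of the real dataset; text_cols and numeric_cols are illustrative names, not part of the original walkthrough.

```python
import pandas as pd

# Hypothetical miniature stand-in for the Play Store data (values are made up).
sample = pd.DataFrame({
    "App": ["Alpha", "Beta", "Gamma"],
    "Category": ["GAME", "TOOLS", "GAME"],
    "Rating": [4.1, 3.9, 4.5],
    "Type": ["Free", "Paid", "Free"],
})

# Non-numeric columns are the candidates for categorical analysis.
text_cols = sample.select_dtypes(exclude="number").columns.tolist()
numeric_cols = sample.select_dtypes(include="number").columns.tolist()

print(text_cols)
print(numeric_cols)
```

Not every text column is a useful category: an identifier such as App has nearly one distinct value per row, so it is usually left out of the list by hand.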
Automating Category Counts in App Data
Counting apps by hand is inefficient and error-prone. Python can automate the task, from basic loops to optimized Pandas methods.
Counting Apps in Each Category (Manual Automation with Loops)
A simple approach uses loops to count occurrences of each category, which helps clarify the underlying logic of automation.
- df['Category'].unique() extracts all distinct categories.
- A nested loop iterates through all rows to count how many times each category appears.
- Counts are stored in a dictionary for easy access and display.
categories = {}
for name in df['Category'].unique():
    ct = 0
    for value in df['Category']:
        if value == name:
            ct += 1
    categories[name] = ct
for key, value in categories.items():
    print(f'{key}: {value}')
Output:

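The nested loop rescans the whole column once for every distinct category. The same dictionary can be built in a single pass, as in the sketch below; it assumes a short made-up Series named category standing in for df['Category'].

```python
import pandas as pd

# Hypothetical sample values standing in for df['Category'].
category = pd.Series(["GAME", "TOOLS", "GAME", "FAMILY", "GAME", "TOOLS"])

# One pass over the column: each value increments its own counter.
counts = {}
for value in category:
    counts[value] = counts.get(value, 0) + 1

print(counts)
```

This keeps the loop-based logic visible while avoiding the repeated scans, which matters as the number of distinct categories grows.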
Optimized Approach Using Pandas value_counts()
Pandas offers a faster, more efficient method for counting category occurrences, which makes it well suited to large datasets.
- df['Category'].value_counts() provides counts of each category in a single line.
- Categories are automatically sorted by frequency in descending order.
- Optimized for large datasets, eliminating the need for manual loops.
df['Category'].value_counts()
Output:

Automating Analysis for App Types
Analyzing app types can be automated to quickly understand the distribution of Free vs Paid apps without manually counting each entry.
Counting App Types with Loops
A simple loop-based approach counts each app type and stores the results in a dictionary.
- df['Type'].unique() retrieves all distinct app types.
- Nested loops iterate through the column to tally counts.
types = {}
for name in df['Type'].unique():
    ct = 0
    for value in df['Type']:
        if value == name:
            ct += 1
    types[name] = ct
print(types)
Output:
{'Free': 10039, 'Paid': 800, nan: 0, '0': 1}
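The nan: 0 entry in this output is a side effect of the equality test: NaN never compares equal to itself, so the loop cannot count missing values even when they are present. The sketch below illustrates this on a made-up Series named app_type (a stand-in for df['Type']) and shows two ways to count missing entries instead.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for df['Type'], with one missing entry.
app_type = pd.Series(["Free", "Paid", "Free", np.nan], dtype="object")

# NaN is not equal to itself, so `value == name` can never match it.
print(np.nan == np.nan)  # False

# Count missing entries explicitly...
missing = int(app_type.isna().sum())

# ...or ask value_counts() to keep NaN as its own row.
with_missing = app_type.value_counts(dropna=False)

print(missing)
print(with_missing)
```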
Using Pandas value_counts()
Pandas provides a faster, fully automated way to get counts and percentages.
(df['Type'].value_counts(normalize=True) * 100).round(2)
Output:

This method eliminates manual loops and handles large datasets efficiently.
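Counts and percentages are often easier to read side by side. One possible way to combine them is sketched below, assuming a small made-up Series named app_type in place of df['Type']; the column labels count and percent are illustrative choices.

```python
import pandas as pd

# Hypothetical sample standing in for df['Type'].
app_type = pd.Series(["Free", "Free", "Paid", "Free"], name="Type")

# Both Series are indexed by the category value, so they align row by row.
summary = pd.DataFrame({
    "count": app_type.value_counts(),
    "percent": (app_type.value_counts(normalize=True) * 100).round(2),
})

print(summary)
```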
Automating Content Rating Analysis
Content Rating indicates the target audience for each app and is an important categorical feature to analyze. Automating this analysis helps quickly understand how apps are distributed across age groups.
Counting Apps by Content Rating
We can use a loop to count the number of apps in each content rating and store the results in a dictionary.
- df['Content Rating'].unique() retrieves all distinct content ratings.
- A nested loop iterates through the column to tally the count for each rating.
content_rating = {}
for name in df['Content Rating'].unique():
    ct = 0
    for value in df['Content Rating']:
        if value == name:
            ct += 1
    content_rating[name] = ct
print(content_rating)
Output:
{'Everyone': 8714, 'Teen': 1208, 'Everyone 10+': 414, 'Mature 17+': 499, 'Adults only 18+': 3, 'Unrated': 2, nan: 0}
Automated Analysis Using Pandas
For a cleaner and faster approach, Pandas provides the value_counts() method to automate counting and sorting of categories.
print(df['Content Rating'].value_counts())
print((df['Content Rating'].value_counts(normalize=True) * 100).round(2))
Output:

This method eliminates the need for loops, handles large datasets efficiently, and instantly provides both counts and proportions of apps in each content rating.
Quick Categorical Summary Using describe()
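Pandas provides a fast way to get summary statistics for categorical data with the describe() method. On a column stored as text (object dtype) it reports count, unique, top, and freq; on a numeric column it reports statistics such as the mean and standard deviation instead, so the categorical summary appears for Reviews only if pandas has loaded that column as text. This avoids manual counting and gives an instant overview of the column.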
df['Reviews'].describe()
Output:

- count: total number of non-null values in the column.
- unique: number of distinct values.
- top: the most frequent value in the column.
- freq: frequency of the most common value.
This method is especially useful for quick exploratory analysis, giving a snapshot of the data distribution without writing multiple lines of code.
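The same summary can be produced for every non-numeric column in one call. The sketch below assumes a made-up frame named sample rather than the real dataset, and selects the text columns before describing them.

```python
import pandas as pd

# Hypothetical miniature stand-in for the Play Store data (values are made up).
sample = pd.DataFrame({
    "Category": ["GAME", "TOOLS", "GAME", "FAMILY"],
    "Type": ["Free", "Paid", "Free", "Free"],
    "Rating": [4.1, 3.9, 4.5, 4.0],
})

# Keep only the non-numeric columns, then describe them together.
overview = sample.select_dtypes(exclude="number").describe()

print(overview)
```

Each column of the result holds the count, unique, top, and freq values for one categorical feature, which makes it easy to spot columns with many distinct values or one dominant value.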
Automating Analysis for Multiple Categorical Columns
Instead of analyzing one column at a time, we can generalize categorical analysis using a loop over multiple columns. This makes the process reusable and scalable.
- categorical_cols stores the list of categorical columns to analyze.
- Looping through each column applies the same value_counts() logic automatically.
- New categorical features can be added easily without rewriting code.
categorical_cols = ['Category', 'Type', 'Content Rating']
for col in categorical_cols:
    print(f"\nAnalysis for {col}")
    print(df[col].value_counts())
Output:
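The loop over columns can be wrapped in a reusable helper. The following is one possible sketch, not the article's own implementation: summarize_categoricals is a hypothetical function name, and sample is a made-up frame used only to demonstrate it.

```python
import pandas as pd

def summarize_categoricals(frame, columns=None):
    """Return a dict mapping each column to a table of counts and percentages."""
    if columns is None:
        # Assumption: every non-numeric column is treated as categorical.
        columns = frame.select_dtypes(exclude="number").columns
    summaries = {}
    for col in columns:
        # dropna=False keeps missing values visible as their own row.
        counts = frame[col].value_counts(dropna=False)
        summaries[col] = pd.DataFrame({
            "count": counts,
            "percent": (counts / len(frame) * 100).round(2),
        })
    return summaries

# Hypothetical miniature stand-in for the Play Store data (values are made up).
sample = pd.DataFrame({
    "Category": ["GAME", "TOOLS", "GAME", "FAMILY"],
    "Type": ["Free", "Paid", "Free", None],
    "Rating": [4.1, 3.9, 4.5, 4.0],
})

summaries = summarize_categoricals(sample)
for col, table in summaries.items():
    print(f"\nAnalysis for {col}")
    print(table)
```

Passing an explicit list such as ['Category', 'Type', 'Content Rating'] as columns restricts the analysis to chosen features, while omitting it falls back to automatic detection.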