Competition
Intro
import os
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
Check File Size
- Check the size of each dataset folder in this competition
- train_tfrecords = 4.5GB
- test_tfrecords = 0.5GB
- train (image data) = 6.5GB
- test (image data) = 0.8GB
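import os
def get_folder_size(file_directory):
    """Print the cumulative size of every folder under file_directory, in MB."""
    dir_sizes = {}
    # Walk bottom-up (topdown=False) so subfolder totals exist before the parent is visited.
    for root, dirs, files in os.walk(file_directory, topdown=False):
        # Size of the entries directly inside this folder (files and directory entries).
        size = sum(os.path.getsize(os.path.join(root, name)) for name in files + dirs)
        # Add the totals already computed for each subfolder.
        size += sum(dir_sizes[os.path.join(root, d)] for d in dirs)
        dir_sizes[root] = size
        # Round to whole MB (the stray second argument to format() was dropped).
        print("{} is {} MB".format(root, round(size / 2**20)))
base_dir = '../input/ranzcr-clip-catheter-line-classification'
get_folder_size(base_dir)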
../input/ranzcr-clip-catheter-line-classification/test is 805 MB
../input/ranzcr-clip-catheter-line-classification/test_tfrecords is 555 MB
../input/ranzcr-clip-catheter-line-classification/train_tfrecords is 4563 MB
../input/ranzcr-clip-catheter-line-classification/train is 6592 MB
../input/ranzcr-clip-catheter-line-classification is 12524 MB
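If the sizes are needed as values rather than printed lines, the same bottom-up walk can return a dictionary instead. This is only a minimal sketch: `folder_sizes_mb` is a hypothetical helper name, it counts regular files only (so totals may differ slightly from the printout above, which also adds directory entries), and the demo runs on a throwaway temporary folder rather than the competition data.

```python
import os
import tempfile

def folder_sizes_mb(top):
    """Return {folder_path: cumulative size in MB}, counting regular files only."""
    sizes = {}
    # Bottom-up walk: every subfolder is totalled before its parent.
    for root, dirs, files in os.walk(top, topdown=False):
        total = sum(os.path.getsize(os.path.join(root, name)) for name in files)
        # .get(..., 0) tolerates subfolders that os.walk did not descend into (e.g. symlinks).
        total += sum(sizes.get(os.path.join(root, d), 0) for d in dirs)
        sizes[root] = total
    return {path: size / 2**20 for path, size in sizes.items()}

# Demo on a throwaway folder: a 1 MB file in a subfolder and a 2 MB file at the top.
with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "sub"))
    with open(os.path.join(tmp, "sub", "a.bin"), "wb") as fh:
        fh.write(b"\0" * 2**20)
    with open(os.path.join(tmp, "b.bin"), "wb") as fh:
        fh.write(b"\0" * 2**21)
    demo_sizes = folder_sizes_mb(tmp)
    demo_top = demo_sizes[tmp]
    demo_sub = demo_sizes[os.path.join(tmp, "sub")]
    print(demo_top, demo_sub)
```

On the Kaggle data, `folder_sizes_mb(base_dir)` would be the equivalent call, with the result sorted or filtered as needed.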
Check train file
train = pd.read_csv('../input/ranzcr-clip-catheter-line-classification/train.csv', index_col = 0)
test = pd.read_csv('../input/ranzcr-clip-catheter-line-classification/sample_submission.csv', index_col = 0)
display(train.head())
display(test.head())
| StudyInstanceUID | ETT - Abnormal | ETT - Borderline | ETT - Normal | NGT - Abnormal | NGT - Borderline | NGT - Incompletely Imaged | NGT - Normal | CVC - Abnormal | CVC - Borderline | CVC - Normal | Swan Ganz Catheter Present | PatientID |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1.2.826.0.1.3680043.8.498.26697628953273228189375557799582420561 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ec89415d1 |
| 1.2.826.0.1.3680043.8.498.46302891597398758759818628675365157729 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | bf4c6da3c |
| 1.2.826.0.1.3680043.8.498.23819260719748494858948050424870692577 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 3fc1c97e5 |
| 1.2.826.0.1.3680043.8.498.68286643202323212801283518367144358744 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | c31019814 |
| 1.2.826.0.1.3680043.8.498.10050203009225938259119000528814762175 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 207685cd1 |
| StudyInstanceUID | ETT - Abnormal | ETT - Borderline | ETT - Normal | NGT - Abnormal | NGT - Borderline | NGT - Incompletely Imaged | NGT - Normal | CVC - Abnormal | CVC - Borderline | CVC - Normal | Swan Ganz Catheter Present |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1.2.826.0.1.3680043.8.498.46923145579096002617106567297135160932 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1.2.826.0.1.3680043.8.498.84006870182611080091824109767561564887 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1.2.826.0.1.3680043.8.498.12219033294413119947515494720687541672 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1.2.826.0.1.3680043.8.498.84994474380235968109906845540706092671 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1.2.826.0.1.3680043.8.498.35798987793805669662572108881745201372 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Definitions of Variables
- What's inside the data?
- StudyInstanceUID - unique ID for each image
- ETT - Abnormal - endotracheal tube placement abnormal
- ETT - Borderline - endotracheal tube placement borderline abnormal
- ETT - Normal - endotracheal tube placement normal
- NGT - Abnormal - nasogastric tube placement abnormal
- NGT - Borderline - nasogastric tube placement borderline abnormal
- NGT - Incompletely Imaged - nasogastric tube placement inconclusive due to imaging
- NGT - Normal - nasogastric tube placement normal
- CVC - Abnormal - central venous catheter placement abnormal
- CVC - Borderline - central venous catheter placement borderline abnormal
- CVC - Normal - central venous catheter placement normal
- Swan Ganz Catheter Present - a Swan-Ganz (pulmonary artery) catheter is present in the image
- PatientID - unique ID for each patient in the dataset
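The label columns fall into device groups that can be read straight off the column names. The sketch below groups them by the text before " - ". The helper name `group_labels` is an assumption for illustration, not part of the original notebook.

```python
from collections import defaultdict

target_cols = [
    'ETT - Abnormal', 'ETT - Borderline', 'ETT - Normal',
    'NGT - Abnormal', 'NGT - Borderline', 'NGT - Incompletely Imaged', 'NGT - Normal',
    'CVC - Abnormal', 'CVC - Borderline', 'CVC - Normal',
    'Swan Ganz Catheter Present',
]

def group_labels(columns):
    """Map each device prefix (the text before ' - ') to its label columns."""
    groups = defaultdict(list)
    for col in columns:
        # Columns without ' - ' (the Swan Ganz flag) form their own group.
        prefix = col.split(' - ')[0] if ' - ' in col else col
        groups[prefix].append(col)
    return dict(groups)

label_groups = group_labels(target_cols)
for device, cols in label_groups.items():
    print(device, len(cols))
```

A mapping like this makes it easy to sum or plot the labels one device at a time.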
Data Distribution of Each Variable
- Why do the two totals below differ?
- Patients often need catheters and lines placed at several positions, so a single image can carry more than one positive label.
- For example, look at PatientID bf4c6da3c.
- Its image has one positive label in each of the three groups (ETT, NGT, CVC), and each one is counted separately in the total.
print("Total Rows of Train Data is", len(train))
print("Total Count of Each Variable in Train Data is", train.iloc[:, :-1].sum().sum())
var_cal_tmp = train.iloc[:, :-1].sum()
print(var_cal_tmp)
Total Rows of Train Data is 30083
Total Count of Each Variable in Train Data is 50619
ETT - Abnormal 79
ETT - Borderline 1138
ETT - Normal 7240
NGT - Abnormal 279
NGT - Borderline 529
NGT - Incompletely Imaged 2748
NGT - Normal 4797
CVC - Abnormal 3195
CVC - Borderline 8460
CVC - Normal 21324
Swan Ganz Catheter Present 830
dtype: int64
train.iloc[1].to_frame().T
| | ETT - Abnormal | ETT - Borderline | ETT - Normal | NGT - Abnormal | NGT - Borderline | NGT - Incompletely Imaged | NGT - Normal | CVC - Abnormal | CVC - Borderline | CVC - Normal | Swan Ganz Catheter Present | PatientID |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1.2.826.0.1.3680043.8.498.46302891597398758759818628675365157729 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | bf4c6da3c |
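The gap between the row count and the label total comes from images with more than one positive label, and a per-row sum makes this explicit. The following is a sketch on a toy frame only: `toy` and its UIDs and patient IDs are made up, and only three of the real label columns are used.

```python
import pandas as pd

# Toy stand-in for train: three of the real label columns plus PatientID.
toy = pd.DataFrame(
    {
        'ETT - Normal': [0, 1, 0, 0],
        'NGT - Incompletely Imaged': [0, 1, 0, 0],
        'CVC - Normal': [1, 1, 1, 0],
        'PatientID': ['p1', 'p2', 'p3', 'p3'],
    },
    index=pd.Index(['uid_a', 'uid_b', 'uid_c', 'uid_d'], name='StudyInstanceUID'),
)

label_cols = [c for c in toy.columns if c != 'PatientID']
labels_per_image = toy[label_cols].sum(axis=1)  # positive labels on each image

n_rows = len(toy)
n_positive = int(labels_per_image.sum())
print("rows:", n_rows, "positive labels:", n_positive)
# How many images carry 0, 1, 2, ... positive labels.
print(labels_per_image.value_counts().sort_index())
```

Applied to the real `train` frame, the same `sum(axis=1)` over the label columns would show how many images have zero, one, or several devices labeled.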
Quick Visualization
- Overall, CVC labels outnumber those of the other groups.
fig, ax = plt.subplots(figsize=(10, 6))
sns.barplot(x = var_cal_tmp.values, y = var_cal_tmp.index, ax=ax)
ax.tick_params(axis="x", labelsize=14)
ax.tick_params(axis="y", labelsize=14)
ax.set_xlabel("Number of Images", fontsize=15)
ax.set_title("Distribution of Labels", fontsize=15)
Text(0.5, 1.0, 'Distribution of Labels')

- The number of unique patients is much smaller than the number of rows.
- This means many patients were imaged more than once, and the number of images varies widely from patient to patient.
print("Number of Unique Patients: ", train["PatientID"].unique().shape[0])
print("Number of Total Data: ", len(train["PatientID"]))
Number of Unique Patients: 3255
Number of Total Data: 30083
tmp = train['PatientID'].value_counts()
print(tmp)
fig, ax = plt.subplots(figsize=(24, 6))
# Count how many patients have each number of images.
sns.countplot(x = tmp.values, ax=ax)
ax.tick_params(axis="x", labelsize=10)
ax.tick_params(axis="y", labelsize=14)
ax.set_xlabel("Number of Images per Patient", fontsize=15)
ax.set_ylabel("Number of Patients", fontsize=15)
ax.set_title("Distribution of Images per Patient", fontsize=15)
05029c63a 172
55073fece 167
26da0d5ad 148
8849382d0 130
34242119f 110
...
ad32e88e0 1
7755053cb 1
2d5a5f0d0 1
1951dc11c 1
22e8f333f 1
Name: PatientID, Length: 3255, dtype: int64
Text(0.5, 1.0, 'Distribution of Images per Patient')
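The images-per-patient counts can also be summarized numerically instead of being read off the plot. A minimal sketch with a made-up `PatientID` column follows; the variable names are illustrative, not from the original notebook.

```python
import pandas as pd

# Hypothetical PatientID column: 'a' was imaged 4 times, 'b' twice, 'c' and 'd' once.
patient_ids = pd.Series(list('aaaabbcd'), name='PatientID')

images_per_patient = patient_ids.value_counts()
# Number of patients for each "images per patient" value.
patients_by_count = images_per_patient.value_counts().sort_index()
median_images = images_per_patient.median()
single_image_share = (images_per_patient == 1).mean()

print(patients_by_count)
print("Median images per patient:", median_images)
print("Share of patients with one image:", single_image_share)
```

With the real data, `patient_ids` would simply be `train['PatientID']`.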

- Now, we need to see the distribution of data in each variable.
target_cols = ['ETT - Abnormal', 'ETT - Borderline', 'ETT - Normal', 'NGT - Abnormal',
               'NGT - Borderline', 'NGT - Incompletely Imaged', 'NGT - Normal', 'CVC - Abnormal',
               'CVC - Borderline', 'CVC - Normal', 'Swan Ganz Catheter Present']
fig, ax = plt.subplots(4, 3, figsize=(16, 10))
for i, col in enumerate(target_cols):
    print(i, col)
    # Place the i-th label in the 4x3 grid: row i // 3, column i % 3.
    r, c = divmod(i, 3)
    ax[r, c].hist(train[col].values)
    ax[r, c].set_title(f'target: {col}')
# Only 11 labels exist, so hide the unused 12th panel.
ax[3, 2].axis('off')
fig.tight_layout()
fig.subplots_adjust(top=0.95)
0 ETT - Abnormal
1 ETT - Borderline
2 ETT - Normal
3 NGT - Abnormal
4 NGT - Borderline
5 NGT - Incompletely Imaged
6 NGT - Normal
7 CVC - Abnormal
8 CVC - Borderline
9 CVC - Normal
10 Swan Ganz Catheter Present

- How should the graphs be interpreted?
- The CVC group has the most positive labels of all groups.
- Within each group, Normal is the most frequent label.
- The dataset is heavily imbalanced, and since one image can carry several positive labels, this is a multi-label classification problem.
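The imbalance can be put into numbers as a positive rate per label. In the counts printed earlier, ETT - Abnormal has 79 positives among 30083 rows, roughly 0.26%. The sketch below computes the same kind of summary on a toy 0/1 matrix whose values are made up for illustration.

```python
import pandas as pd

# Toy 0/1 label matrix; the values are made up for illustration.
toy_labels = pd.DataFrame({
    'CVC - Normal':   [1, 1, 1, 0, 1, 1, 0, 1],
    'ETT - Normal':   [0, 1, 0, 0, 1, 0, 0, 0],
    'ETT - Abnormal': [0, 0, 0, 1, 0, 0, 0, 0],
})

positive_rate = toy_labels.mean()  # fraction of images with each label set
# Negatives per positive: larger values mean a rarer label.
neg_per_pos = (1 - positive_rate) / positive_rate

summary = pd.DataFrame({'positive_rate': positive_rate, 'neg_per_pos': neg_per_pos})
print(summary.sort_values('positive_rate'))
```

On the real data, `train[target_cols].mean()` would give the positive rate of each of the 11 labels.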
Background Knowledge
- Since my background is far from medicine, it is difficult to tell from the images what needs to be classified.
- So, I needed some videos to understand the procedures.
- Thanks to the notebook RANZCR CLiP: Visualize and Understand Dataset.
- Please visit that notebook and upvote it.
Endotracheal Tube
- It is labeled ETT in this dataset.
from IPython.display import YouTubeVideo
YouTubeVideo('FtJr7i7ENMY')
Nasogastric Tube
- It is labeled NGT in this dataset.
YouTubeVideo('Abf3Gd6AaZQ')
Central venous catheter
- It is labeled CVC in this dataset.
YouTubeVideo('mTBrCMn86cU')
Swan Ganz Catheter Present
- It is labeled Swan Ganz Catheter Present in this dataset.
YouTubeVideo('YkN30T6ig30')
Check train annotation file
- What's inside the train_annotations file?
- According to its description, ‘These are segmentation annotations for training samples that have them. They are included solely as additional information for competitors.’
- Let's look at the data.
annot = pd.read_csv("../input/ranzcr-clip-catheter-line-classification/train_annotations.csv")
annot.head(10)
| | StudyInstanceUID | label | data |
| --- | --- | --- | --- |
| 0 | 1.2.826.0.1.3680043.8.498.12616281126973421762... | CVC - Normal | [[1487, 1279], [1477, 1168], [1472, 1052], [14... |
| 1 | 1.2.826.0.1.3680043.8.498.12616281126973421762... | CVC - Normal | [[1328, 7], [1347, 101], [1383, 193], [1400, 2... |
| 2 | 1.2.826.0.1.3680043.8.498.72921907356394389969... | CVC - Borderline | [[801, 1207], [812, 1112], [823, 1023], [842, ... |
| 3 | 1.2.826.0.1.3680043.8.498.11697104485452001927... | CVC - Normal | [[1366, 961], [1411, 861], [1453, 751], [1508,... |
| 4 | 1.2.826.0.1.3680043.8.498.87704688663091069148... | NGT - Normal | [[1862, 14], [1845, 293], [1801, 869], [1716, ... |
| 5 | 1.2.826.0.1.3680043.8.498.87704688663091069148... | CVC - Normal | [[906, 604], [1103, 578], [1242, 607], [1459, ... |
| 6 | 1.2.826.0.1.3680043.8.498.87704688663091069148... | ETT - Normal | [[1781, 804], [1801, 666], [1791, 496], [1798,... |
| 7 | 1.2.826.0.1.3680043.8.498.53113362093090654004... | CVC - Normal | [[1152, 938], [1193, 856], [1265, 795], [1362,... |
| 8 | 1.2.826.0.1.3680043.8.498.83331936392921199432... | NGT - Normal | [[1903, 73], [1934, 768], [1917, 1061], [1866,... |
| 9 | 1.2.826.0.1.3680043.8.498.83331936392921199432... | CVC - Normal | [[92, 1857], [163, 1936], [251, 1917], [282, 1... |
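The `data` column is stored as a string, so it has to be parsed before the points can be used. The table also shows that one StudyInstanceUID can appear in several rows, one per annotated line. Here is a minimal sketch on a toy frame; the UIDs and coordinates in `toy_annot` are made up.

```python
import ast
import pandas as pd

# Toy stand-in for train_annotations.csv; UIDs and coordinates are made up.
toy_annot = pd.DataFrame({
    'StudyInstanceUID': ['uid_a', 'uid_a', 'uid_b'],
    'label': ['CVC - Normal', 'NGT - Normal', 'CVC - Borderline'],
    'data': ['[[10, 20], [12, 40], [15, 60]]', '[[50, 5], [52, 30]]', '[[7, 7], [9, 9]]'],
})

# literal_eval safely turns the stored string into a list of [x, y] pairs.
toy_annot['points'] = toy_annot['data'].apply(ast.literal_eval)
toy_annot['n_points'] = toy_annot['points'].apply(len)

# Number of annotated lines per image.
lines_per_image = toy_annot.groupby('StudyInstanceUID').size()
print(toy_annot[['StudyInstanceUID', 'label', 'n_points']])
print(lines_per_image)
```

`ast.literal_eval` is used rather than `eval` because it only accepts Python literals.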
Visualization of X-rays image
- Combining the train images with train_annotations, let's draw a sample image.
from PIL import Image
def train_base_chest_plot(row_ind, base_dir):
    # Look up the annotation row and open the matching X-ray image.
    row = annot.loc[row_ind]
    uid = row['StudyInstanceUID']
    label = row['label']
    train_img = Image.open(os.path.join(base_dir, uid + '.jpg'))
    fig, ax = plt.subplots(figsize=(15, 6))
    # cmap only takes effect if the image is single-channel.
    ax.imshow(train_img, cmap='gray')
    ax.set_title(f"train: {label}")
base_dir = '../input/ranzcr-clip-catheter-line-classification/train/'
train_base_chest_plot(1, base_dir)

- However, what we really need is to draw the tube itself, so the 'data' column has to be added to the plot. Let's do this.
import ast
import numpy as np
def train_base_tube_plot(row_ind, base_dir):
    # Look up the annotation row and open the matching X-ray image.
    row = annot.loc[row_ind]
    uid = row['StudyInstanceUID']
    label = row['label']
    train_img = Image.open(os.path.join(base_dir, uid + '.jpg'))
    # 'data' is a string such as "[[x1, y1], [x2, y2], ...]"; parse it into an (N, 2) array.
    data = np.array(ast.literal_eval(row['data']))
    fig, ax = plt.subplots(figsize=(15, 6))
    # cmap only takes effect if the image is single-channel.
    ax.imshow(train_img, cmap='gray')
    # Trace the annotated tube over the image.
    ax.plot(data[:, 0], data[:, 1], color = 'b', linewidth=2, marker='o')
    ax.set_title(f"train: {label}")
base_dir = '../input/ranzcr-clip-catheter-line-classification/train/'
train_base_tube_plot(1, base_dir)
train_base_tube_plot(2, base_dir)
train_base_tube_plot(25, base_dir)



- It is still difficult to tell what distinguishes normal from abnormal placement in these plots, so I stopped drawing more examples.
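One variation that may make such plots easier to read is to draw every annotated line of an image together, with the label in a legend. This is only a sketch under stated assumptions: `plot_all_tubes` is a hypothetical helper, a blank array stands in for the X-ray, and the annotation rows are made up.

```python
import ast
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

def plot_all_tubes(image, rows, ax):
    """Draw every annotated line in rows on top of image, one colour per row."""
    ax.imshow(image, cmap='gray')
    for _, row in rows.iterrows():
        pts = np.array(ast.literal_eval(row['data']))
        ax.plot(pts[:, 0], pts[:, 1], linewidth=2, marker='o', label=row['label'])
    ax.legend()
    return ax

# Blank 100x100 array in place of an X-ray; the annotation rows are made up.
fake_image = np.zeros((100, 100))
fake_rows = pd.DataFrame({
    'label': ['CVC - Normal', 'NGT - Normal'],
    'data': ['[[10, 20], [30, 40], [50, 45]]', '[[60, 5], [62, 50], [55, 90]]'],
})

fig, ax = plt.subplots(figsize=(6, 6))
plot_all_tubes(fake_image, fake_rows, ax)
ax.set_title('All annotated lines for one image')
```

With the real data, `rows` would be `annot[annot['StudyInstanceUID'] == uid]` and `image` the matching file opened with PIL.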