The main objective of the project is to solve the multi-label text classification problem.
## Requirements

- Python 3.6
- Tensorflow 1.4
- Numpy
- Gensim
The project structure is below:

```
│   ├── text_model.py
│   └── train_model.py
├── data
│   ├── word2vec_100.model.* [Need Download]
│   ├── Test_sample.json
│   ├── Train_sample.json
│   └── Validation_sample.json
```
## Innovation

### Data part

1. Make the data support both **Chinese** and English (you can use `jieba` or `nltk`).
2. Use **your own pre-trained word vectors** (you can use `gensim`).
3. Add embedding visualization based on **tensorboard** (you need to create `metadata.tsv` first).
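The `metadata.tsv` mentioned above is just a plain-text file that the tensorboard embedding projector reads, one vocabulary word per line in the row order of the embedding matrix. A minimal sketch (the vocabulary here is a made-up placeholder; in practice it comes from your corpus):

```python
# Write a minimal metadata.tsv for the tensorboard embedding projector.
# The vocabulary below is a made-up placeholder; in practice it must
# list every word of your embedding matrix, one per line, in row order.
vocab = ["deep", "learning", "text", "classification"]

with open("metadata.tsv", "w", encoding="utf-8") as f:
    for word in vocab:
        f.write(word + "\n")
```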
### Model part

1. Add the correct **L2 loss** calculation operation.
1. Can choose to **train** the model directly or **restore** the model from a checkpoint in `train.py`.
2. Can predict the labels via **threshold** and **top-K** in `train.py` and `test.py`.
3. Can calculate the evaluation metrics --- **AUC** & **AUPRC**.
4. Can create the prediction file, which includes the predicted values and predicted labels of the Testset data, in `test.py`.
5. Add other useful data preprocessing functions in `data_helpers.py`.
6. Use `logging` to record the whole info (including **parameters display**, **model training info**, etc.).
7. Provide the ability to save the best n checkpoints in `checkmate.py`, whereas `tf.train.Saver` can only save the last n checkpoints.
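The **threshold** and **top-K** prediction strategies mentioned above can be sketched as follows (the `scores` array is a made-up example of per-label sigmoid outputs, not real model output):

```python
import numpy as np

# Two label-decision strategies for multi-label prediction:
# keep every label whose score clears a threshold, or keep
# the K highest-scoring labels.

def predict_by_threshold(scores, threshold=0.5):
    return [i for i, s in enumerate(scores) if s >= threshold]

def predict_by_topk(scores, k=2):
    # argsort ascending, reverse for descending, take the first k
    return [int(i) for i in np.argsort(scores)[::-1][:k]]

scores = np.array([0.9, 0.2, 0.7, 0.4])  # made-up sigmoid outputs
print(predict_by_threshold(scores))      # [0, 2]
print(predict_by_topk(scores, k=2))      # [0, 2]
```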
## Data

See the data format in the `/data` folder, which includes the data sample files. For example:

- **"features_content"**: the word segments (after removing the stopwords).
- **"labels_index"**: the label indexes of the data records.
- **"labels_num"**: the number of labels.
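A hypothetical record illustrating the three fields above (the values are made up; see the sample files in `/data` for the real format):

```json
{
  "features_content": ["deep", "learning", "text", "classification"],
  "labels_index": [0, 3],
  "labels_num": 2
}
```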
### Text Segment

1. You can use the `nltk` package if you are going to deal with English text data.
2. You can use the `jieba` package if you are going to deal with Chinese text data.
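A minimal sketch of the two segmentation paths, assuming `nltk` and `jieba` are installed (with naive fallbacks so the snippet still runs without them):

```python
def segment(text, lang="en"):
    """Segment text into a token list; lang is 'en' or 'zh'."""
    if lang == "zh":
        try:
            import jieba
            return jieba.lcut(text)  # Chinese word segmentation
        except ImportError:
            return list(text)  # crude fallback: character-level split
    try:
        from nltk.tokenize import word_tokenize
        return word_tokenize(text)  # English tokenization
    except (ImportError, LookupError):
        return text.split()  # fallback: whitespace split

# Both the nltk path and the fallback give the same result here:
print(segment("multi label text classification"))
# ['multi', 'label', 'text', 'classification']
```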
### Data Format

This repository can be used with other (text classification) datasets in two ways:

1. Modify your datasets into the same format as [the sample](https://github.com/RandolphVI/Multi-Label-Text-Classification/blob/master/data).
2. Modify the data preprocessing code in `data_helpers.py`.

Anyway, it depends on what your data and task are.
### Pre-trained Word Vectors

**You can download the [Word2vec model file](https://drive.google.com/open?id=1XM0-Y8UJcJTKEAwKlweWv-NZWakW5Wmp) (dim=100). Make sure the files are unzipped and placed under the `/data` folder.**

You can pre-train your word vectors (based on your corpus) in many ways:

- Use the `gensim` package to pre-train them.
- Use the `glove` tools to pre-train them.
- Even use a **fasttext** network to pre-train them.
## Usage

See [Usage](https://github.com/RandolphVI/Multi-Label-Text-Classification/blob/master/Usage.md).