
Commit 4551ea0

Update README.md

1 parent f7d8b89 commit 4551ea0

1 file changed: 27 additions & 10 deletions

README.md
@@ -9,7 +9,7 @@ The main objective of the project is to solve the multi-label text classification
 ## Requirements

 - Python 3.6
-- Tensorflow 1.4 +
+- Tensorflow 1.4
 - Numpy
 - Gensim

@@ -24,7 +24,7 @@ The project structure is below:
 │   ├── text_model.py
 │   └── train_model.py
 ├── data
-│   ├── word2vec_100.model [Need Download]
+│   ├── word2vec_100.model.* [Need Download]
 │   ├── Test_sample.json
 │   ├── Train_sample.json
 │   └── Validation_sample.json
@@ -42,9 +42,9 @@ The project structure is below:
 ## Innovation

 ### Data part
-1. Make the data support **Chinese** and English (Which use `jieba` seems easy).
-2. Can use **your own pre-trained word vectors** (Which use `gensim` seems easy).
-3. Add embedding visualization based on the **tensorboard**.
+1. Make the data support **Chinese** and English (can use `jieba` or `nltk`).
+2. Can use **your own pre-trained word vectors** (can use `gensim`).
+3. Add embedding visualization based on **tensorboard** (need to create `metadata.tsv` first).
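Data part item 3 above notes that the TensorBoard embedding visualization needs a `metadata.tsv` first. For illustration only, a minimal sketch of one way to write that file and register it with the projector, assuming TensorFlow 1.x; the function name, `word_list`, `embedding_var`, and `out_dir` are hypothetical, not the repository's own code:

```python
# Sketch only (hypothetical names): write metadata.tsv and a projector config.
import os
import tensorflow as tf
from tensorflow.contrib.tensorboard.plugins import projector  # TF 1.x API

def save_embedding_visualization(word_list, embedding_var, out_dir):
    """Write metadata.tsv (one word per line, same order as the embedding rows)."""
    metadata_path = os.path.join(out_dir, "metadata.tsv")
    with open(metadata_path, "w", encoding="utf-8") as fout:
        for word in word_list:
            fout.write(word + "\n")

    # Point the TensorBoard projector at the embedding variable and the metadata.
    config = projector.ProjectorConfig()
    embedding = config.embeddings.add()
    embedding.tensor_name = embedding_var.name
    embedding.metadata_path = metadata_path
    writer = tf.summary.FileWriter(out_dir)
    projector.visualize_embeddings(writer, config)  # writes projector_config.pbtxt
```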

 ### Model part
 1. Add the correct **L2 loss** calculation operation.
@@ -57,24 +57,35 @@ The project structure is below:
 1. Can choose to **train** the model directly or **restore** the model from the checkpoint in `train.py`.
 2. Can predict the labels via **threshold** and **top-K** in `train.py` and `test.py`.
 3. Can calculate the evaluation metrics --- **AUC** & **AUPRC**.
-4. Add `test.py`, the **model test code**, it can show the predicted values and predicted labels of the data in Testset when creating the final prediction file.
+4. Can create the prediction file in `test.py`, which includes the predicted values and predicted labels of the Testset data.
 5. Add other useful data preprocess functions in `data_helpers.py`.
 6. Use `logging` for helping to record the whole info (including **parameters display**, **model training info**, etc.).
 7. Provide the ability to save the best n checkpoints in `checkmate.py`, whereas the `tf.train.Saver` can only save the last n checkpoints.
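To make the **threshold** / **top-K** prediction and the **AUC** / **AUPRC** metrics above concrete, a rough sketch with hypothetical helper names (not the repository's exact code), assuming per-label sigmoid scores:

```python
# Sketch: threshold / top-K label selection and micro-averaged AUC / AUPRC.
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def predict_by_threshold(scores, threshold=0.5):
    """Select every label whose sigmoid score reaches the threshold."""
    return (scores >= threshold).astype(int)

def predict_by_topk(scores, top_k=3):
    """Select the top-K highest-scoring labels for each sample."""
    preds = np.zeros_like(scores, dtype=int)
    top_indices = np.argsort(-scores, axis=1)[:, :top_k]
    for row, cols in enumerate(top_indices):
        preds[row, cols] = 1
    return preds

scores = np.array([[0.9, 0.2, 0.7, 0.1], [0.3, 0.8, 0.6, 0.4]])  # toy sigmoid outputs
y_true = np.array([[1, 0, 1, 0], [0, 1, 1, 0]])                  # toy binary labels
print(predict_by_threshold(scores, 0.5))
print(predict_by_topk(scores, top_k=2))
print("AUC:", roc_auc_score(y_true, scores, average="micro"))
print("AUPRC:", average_precision_score(y_true, scores, average="micro"))
```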

 ## Data

-See data format in `data` folder which including the data sample files.
+See the data format in the `/data` folder, which includes the data sample files. For example:
+
+```json
+{"testid": "3935745", "features_content": ["pore", "water", "pressure", "metering", "device", "incorporating", "pressure", "meter", "force", "meter", "influenced", "pressure", "meter", "device", "includes", "power", "member", "arranged", "control", "pressure", "exerted", "pressure", "meter", "force", "meter", "applying", "overriding", "force", "pressure", "meter", "stop", "influence", "force", "meter", "removing", "overriding", "force", "pressure", "meter", "influence", "force", "meter", "resumed"], "labels_index": [526, 534, 411], "labels_num": 3}
+```
+
+- **"testid"**: just the id.
+- **"features_content"**: the segmented words (after removing the stopwords).
+- **"labels_index"**: the label indexes of the data record.
+- **"labels_num"**: the number of labels.
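A minimal sketch of reading this JSON-lines format (one record per line); the function name is illustrative and not taken from `data_helpers.py`:

```python
# Sketch: iterate over a sample file where each line is one JSON record.
import json

def load_samples(file_path):
    """Yield (words, label_indices, labels_num) tuples from a JSON-lines file."""
    with open(file_path, "r", encoding="utf-8") as fin:
        for line in fin:
            record = json.loads(line)
            yield record["features_content"], record["labels_index"], record["labels_num"]

# Example:
# for words, labels, num in load_samples("data/Train_sample.json"):
#     assert len(labels) == num
```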

 ### Text Segment

-You can use `jieba` package if you are going to deal with the Chinese text data.
+1. You can use the `nltk` package if you are going to deal with English text data.
+
+2. You can use the `jieba` package if you are going to deal with Chinese text data.
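For illustration, a small sketch of both options, assuming `jieba` is installed and the `nltk` `punkt` tokenizer data has been downloaded:

```python
# Sketch: English tokenization with nltk, Chinese segmentation with jieba.
import jieba
import nltk

english_text = "Multi-label text classification assigns several labels to one document."
english_tokens = nltk.word_tokenize(english_text)  # needs nltk.download('punkt') once

chinese_text = "多标签文本分类将多个标签分配给一个文档。"
chinese_tokens = list(jieba.cut(chinese_text))

print(english_tokens)
print(chinese_tokens)
```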

 ### Data Format

 This repository can be used in other datasets (text classification) in two ways:
-1. Modify your datasets into the same format of [the sample](https://github.com/RandolphVI/Multi-Label-Text-Classification/blob/master/data/data_sample.json).
-2. Modify the data preprocess code in `data_helpers.py`.
+1. Modify your datasets into the same format as [the sample](https://github.com/RandolphVI/Multi-Label-Text-Classification/blob/master/data).
+2. Modify the data preprocessing code in `data_helpers.py`.

 Anyway, it should depend on what your data and task are.

@@ -86,11 +97,17 @@ Anyway, it should depend on what your data and task are.

 ### Pre-trained Word Vectors

+**You can download the [Word2vec model file](https://drive.google.com/open?id=1XM0-Y8UJcJTKEAwKlweWv-NZWakW5Wmp) (dim=100). Make sure the files are unzipped and placed under the `/data` folder.**
+
 You can pre-training your word vectors (based on your corpus) in many ways:
 - Use `gensim` package to pre-train data.
 - Use `glove` tools to pre-train data.
 - Even can use a **fasttext** network to pre-train data.
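As a hedged example of the `gensim` option, a sketch of pre-training and reloading 100-dim vectors; the toy corpus and the output path are placeholders, not the released model:

```python
# Sketch: pre-train word vectors with gensim and reload them.
# Note: gensim 3.x uses `size=`; gensim 4+ renamed it to `vector_size=`.
from gensim.models import Word2Vec

sentences = [["pore", "water", "pressure", "metering", "device"],
             ["pressure", "meter", "force", "meter"]]  # placeholder corpus

model = Word2Vec(sentences, size=100, window=5, min_count=1, workers=4)
model.save("data/word2vec_100.model")

model = Word2Vec.load("data/word2vec_100.model")
print(model.wv["pressure"].shape)  # (100,)
```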

+## Usage
+
+See [Usage](https://github.com/RandolphVI/Multi-Label-Text-Classification/blob/master/Usage.md).
+
 ## Network Structure

 ### FastText
