
Commit 12c960c

Merge pull request HKUDS#56 from sank8-2/dev

chore: added pre-commit-hooks and ruff formatting for commit-hooks

2 parents e2db7b6 + 744dad3

26 files changed (+630, −388 lines)

.gitignore

Lines changed: 2 additions & 1 deletion

````diff
@@ -1,4 +1,5 @@
 __pycache__
 *.egg-info
 dickens/
-book.txt
+book.txt
+lightrag-dev/
````

.pre-commit-config.yaml

Lines changed: 22 additions & 0 deletions

````diff
@@ -0,0 +1,22 @@
+repos:
+  - repo: https://github.com/pre-commit/pre-commit-hooks
+    rev: v5.0.0
+    hooks:
+      - id: trailing-whitespace
+      - id: end-of-file-fixer
+      - id: requirements-txt-fixer
+
+
+  - repo: https://github.com/astral-sh/ruff-pre-commit
+    rev: v0.6.4
+    hooks:
+      - id: ruff-format
+      - id: ruff
+        args: [--fix]
+
+
+  - repo: https://github.com/mgedmin/check-manifest
+    rev: "0.49"
+    hooks:
+      - id: check-manifest
+        stages: [manual]
````
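The new config wires three hook repos into git's pre-commit stage: generic whitespace/end-of-file/requirements fixers, Ruff formatting plus linting with `--fix`, and `check-manifest` gated behind the manual stage. Below is a minimal sketch of how a contributor would exercise these hooks locally, assuming the `pre-commit` package is installed (`pip install pre-commit`); the helper is illustrative, not part of this commit:

```python
# Illustrative helper (not in this commit): drives the pre-commit CLI that
# consumes the .pre-commit-config.yaml added above.
import subprocess


def run_hooks() -> int:
    # Install the git hook script so the checks fire on every `git commit`.
    subprocess.run(["pre-commit", "install"], check=True)
    # Run all configured hooks against the whole tree, not just staged files.
    result = subprocess.run(["pre-commit", "run", "--all-files"])
    # check-manifest is stage-gated and only runs when invoked manually, e.g.:
    # subprocess.run(["pre-commit", "run", "check-manifest", "--hook-stage", "manual"])
    return result.returncode


if __name__ == "__main__":
    raise SystemExit(run_hooks())
```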

README.md

Lines changed: 25 additions & 25 deletions

````diff
@@ -16,16 +16,16 @@
 <a href="https://pypi.org/project/lightrag-hku/"><img src="https://img.shields.io/pypi/v/lightrag-hku.svg"></a>
 <a href="https://pepy.tech/project/lightrag-hku"><img src="https://static.pepy.tech/badge/lightrag-hku/month"></a>
 </p>
-
+
 This repository hosts the code of LightRAG. The structure of this code is based on [nano-graphrag](https://github.com/gusye1234/nano-graphrag).
 ![请添加图片描述](https://i-blog.csdnimg.cn/direct/b2aaf634151b4706892693ffb43d9093.png)
 </div>

-## 🎉 News
+## 🎉 News
 - [x] [2024.10.18]🎯🎯📢📢We’ve added a link to a [LightRAG Introduction Video](https://youtu.be/oageL-1I0GE). Thanks to the author!
 - [x] [2024.10.17]🎯🎯📢📢We have created a [Discord channel](https://discord.gg/mvsfu2Tg)! Welcome to join for sharing and discussions! 🎉🎉
-- [x] [2024.10.16]🎯🎯📢📢LightRAG now supports [Ollama models](https://github.com/HKUDS/LightRAG?tab=readme-ov-file#quick-start)!
-- [x] [2024.10.15]🎯🎯📢📢LightRAG now supports [Hugging Face models](https://github.com/HKUDS/LightRAG?tab=readme-ov-file#quick-start)!
+- [x] [2024.10.16]🎯🎯📢📢LightRAG now supports [Ollama models](https://github.com/HKUDS/LightRAG?tab=readme-ov-file#quick-start)!
+- [x] [2024.10.15]🎯🎯📢📢LightRAG now supports [Hugging Face models](https://github.com/HKUDS/LightRAG?tab=readme-ov-file#quick-start)!

 ## Install

@@ -92,7 +92,7 @@ print(rag.query("What are the top themes in this story?", param=QueryParam(mode=
 <details>
 <summary> Using Open AI-like APIs </summary>

-LightRAG also support Open AI-like chat/embeddings APIs:
+LightRAG also supports Open AI-like chat/embeddings APIs:
 ```python
 async def llm_model_func(
     prompt, system_prompt=None, history_messages=[], **kwargs
@@ -129,7 +129,7 @@ rag = LightRAG(

 <details>
 <summary> Using Hugging Face Models </summary>
-
+
 If you want to use Hugging Face models, you only need to set LightRAG as follows:
 ```python
 from lightrag.llm import hf_model_complete, hf_embedding
@@ -145,7 +145,7 @@ rag = LightRAG(
         embedding_dim=384,
         max_token_size=5000,
         func=lambda texts: hf_embedding(
-            texts,
+            texts,
             tokenizer=AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2"),
             embed_model=AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
         )
@@ -157,7 +157,7 @@ rag = LightRAG(
 <details>
 <summary> Using Ollama Models </summary>
 If you want to use Ollama models, you only need to set LightRAG as follows:
-
+
 ```python
 from lightrag.llm import ollama_model_complete, ollama_embedding

@@ -171,7 +171,7 @@ rag = LightRAG(
         embedding_dim=768,
         max_token_size=8192,
         func=lambda texts: ollama_embedding(
-            texts,
+            texts,
             embed_model="nomic-embed-text"
         )
     ),
@@ -196,14 +196,14 @@ with open("./newText.txt") as f:
 ```
 ## Evaluation
 ### Dataset
-The dataset used in LightRAG can be download from [TommyChien/UltraDomain](https://huggingface.co/datasets/TommyChien/UltraDomain).
+The dataset used in LightRAG can be downloaded from [TommyChien/UltraDomain](https://huggingface.co/datasets/TommyChien/UltraDomain).

 ### Generate Query
-LightRAG uses the following prompt to generate high-level queries, with the corresponding code located in `example/generate_query.py`.
+LightRAG uses the following prompt to generate high-level queries, with the corresponding code in `example/generate_query.py`.

 <details>
 <summary> Prompt </summary>
-
+
 ```python
 Given the following description of a dataset:

@@ -228,18 +228,18 @@ Output the results in the following structure:
 ...
 ```
 </details>
-
+
 ### Batch Eval
 To evaluate the performance of two RAG systems on high-level queries, LightRAG uses the following prompt, with the specific code available in `example/batch_eval.py`.

 <details>
 <summary> Prompt </summary>
-
+
 ```python
 ---Role---
 You are an expert tasked with evaluating two answers to the same question based on three criteria: **Comprehensiveness**, **Diversity**, and **Empowerment**.
 ---Goal---
-You will evaluate two answers to the same question based on three criteria: **Comprehensiveness**, **Diversity**, and **Empowerment**.
+You will evaluate two answers to the same question based on three criteria: **Comprehensiveness**, **Diversity**, and **Empowerment**.

 - **Comprehensiveness**: How much detail does the answer provide to cover all aspects and details of the question?
 - **Diversity**: How varied and rich is the answer in providing different perspectives and insights on the question?
@@ -303,15 +303,15 @@ Output your evaluation in the following JSON format:
 | **Empowerment** | 36.69% | **63.31%** | 45.09% | **54.91%** | 42.81% | **57.19%** | **52.94%** | 47.06% |
 | **Overall** | 43.62% | **56.38%** | 45.98% | **54.02%** | 45.70% | **54.30%** | **51.86%** | 48.14% |

-## Reproduce
+## Reproduce
 All the code can be found in the `./reproduce` directory.

 ### Step-0 Extract Unique Contexts
 First, we need to extract unique contexts in the datasets.

 <details>
 <summary> Code </summary>
-
+
 ```python
 def extract_unique_contexts(input_directory, output_directory):

@@ -370,12 +370,12 @@ For the extracted contexts, we insert them into the LightRAG system.

 <details>
 <summary> Code </summary>
-
+
 ```python
 def insert_text(rag, file_path):
     with open(file_path, mode='r') as f:
         unique_contexts = json.load(f)
-
+
     retries = 0
     max_retries = 3
     while retries < max_retries:
@@ -393,11 +393,11 @@ def insert_text(rag, file_path):

 ### Step-2 Generate Queries

-We extract tokens from both the first half and the second half of each context in the dataset, then combine them as the dataset description to generate queries.
+We extract tokens from the first and the second half of each context in the dataset, then combine them as dataset descriptions to generate queries.

 <details>
 <summary> Code </summary>
-
+
 ```python
 tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

@@ -410,7 +410,7 @@ def get_summary(context, tot_tokens=2000):

     summary_tokens = start_tokens + end_tokens
     summary = tokenizer.convert_tokens_to_string(summary_tokens)
-
+
     return summary
 ```
 </details>
@@ -420,12 +420,12 @@ For the queries generated in Step-2, we will extract them and query LightRAG.

 <details>
 <summary> Code </summary>
-
+
 ```python
 def extract_queries(file_path):
     with open(file_path, 'r') as f:
         data = f.read()
-
+
     data = data.replace('**', '')

     queries = re.findall(r'- Question \d+: (.+)', data)
@@ -479,7 +479,7 @@ def extract_queries(file_path):

 ```python
 @article{guo2024lightrag,
-title={LightRAG: Simple and Fast Retrieval-Augmented Generation},
+title={LightRAG: Simple and Fast Retrieval-Augmented Generation},
 author={Zirui Guo and Lianghao Xia and Yanhua Yu and Tu Ao and Chao Huang},
 year={2024},
 eprint={2410.05779},
````
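Most of these README hunks trim trailing whitespace or fix small grammar slips; taken together, the setup snippets they touch describe one end-to-end flow. The sketch below is pieced together from the fragments visible in the diff, so treat it as an approximation: the `EmbeddingFunc` import path and the `working_dir` and `llm_model_name` values are assumptions, while the `lightrag.llm` imports and the embedding parameters appear verbatim in the hunks above.

```python
# Approximate end-to-end flow reconstructed from the README hunks above.
from lightrag import LightRAG, QueryParam  # assumed top-level exports
from lightrag.llm import ollama_model_complete, ollama_embedding
from lightrag.utils import EmbeddingFunc  # assumed import path

rag = LightRAG(
    working_dir="./dickens",  # assumed scratch directory
    llm_model_func=ollama_model_complete,
    llm_model_name="your_model_name",  # any model already pulled into Ollama
    embedding_func=EmbeddingFunc(
        embedding_dim=768,
        max_token_size=8192,
        func=lambda texts: ollama_embedding(texts, embed_model="nomic-embed-text"),
    ),
)

with open("./book.txt", "r", encoding="utf-8") as f:
    rag.insert(f.read())

print(rag.query("What are the top themes in this story?", param=QueryParam(mode="hybrid")))
```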

examples/batch_eval.py

Lines changed: 17 additions & 21 deletions

````diff
@@ -1,4 +1,3 @@
-import os
 import re
 import json
 import jsonlines
@@ -9,28 +8,28 @@
 def batch_eval(query_file, result1_file, result2_file, output_file_path):
     client = OpenAI()

-    with open(query_file, 'r') as f:
+    with open(query_file, "r") as f:
         data = f.read()

-    queries = re.findall(r'- Question \d+: (.+)', data)
+    queries = re.findall(r"- Question \d+: (.+)", data)

-    with open(result1_file, 'r') as f:
+    with open(result1_file, "r") as f:
         answers1 = json.load(f)
-    answers1 = [i['result'] for i in answers1]
+    answers1 = [i["result"] for i in answers1]

-    with open(result2_file, 'r') as f:
+    with open(result2_file, "r") as f:
         answers2 = json.load(f)
-    answers2 = [i['result'] for i in answers2]
+    answers2 = [i["result"] for i in answers2]

     requests = []
     for i, (query, answer1, answer2) in enumerate(zip(queries, answers1, answers2)):
-        sys_prompt = f"""
+        sys_prompt = """
 ---Role---
 You are an expert tasked with evaluating two answers to the same question based on three criteria: **Comprehensiveness**, **Diversity**, and **Empowerment**.
 """

         prompt = f"""
-You will evaluate two answers to the same question based on three criteria: **Comprehensiveness**, **Diversity**, and **Empowerment**.
+You will evaluate two answers to the same question based on three criteria: **Comprehensiveness**, **Diversity**, and **Empowerment**.

 - **Comprehensiveness**: How much detail does the answer provide to cover all aspects and details of the question?
 - **Diversity**: How varied and rich is the answer in providing different perspectives and insights on the question?
@@ -69,7 +68,6 @@ def batch_eval(query_file, result1_file, result2_file, output_file_path):
 }}
 """

-
         request_data = {
             "custom_id": f"request-{i+1}",
             "method": "POST",
@@ -78,35 +76,33 @@ def batch_eval(query_file, result1_file, result2_file, output_file_path):
                 "model": "gpt-4o-mini",
                 "messages": [
                     {"role": "system", "content": sys_prompt},
-                    {"role": "user", "content": prompt}
+                    {"role": "user", "content": prompt},
                 ],
-            }
+            },
         }
-
+
         requests.append(request_data)

-    with jsonlines.open(output_file_path, mode='w') as writer:
+    with jsonlines.open(output_file_path, mode="w") as writer:
         for request in requests:
             writer.write(request)

     print(f"Batch API requests written to {output_file_path}")

     batch_input_file = client.files.create(
-        file=open(output_file_path, "rb"),
-        purpose="batch"
+        file=open(output_file_path, "rb"), purpose="batch"
     )
     batch_input_file_id = batch_input_file.id

     batch = client.batches.create(
         input_file_id=batch_input_file_id,
         endpoint="/v1/chat/completions",
         completion_window="24h",
-        metadata={
-            "description": "nightly eval job"
-        }
+        metadata={"description": "nightly eval job"},
     )

-    print(f'Batch {batch.id} has been created.')
+    print(f"Batch {batch.id} has been created.")
+

 if __name__ == "__main__":
-    batch_eval()
+    batch_eval()
````
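Note that, as committed, the `__main__` block still calls `batch_eval()` without its four required arguments, so running the script directly raises a TypeError. The function also only submits the job; results arrive asynchronously through the OpenAI Batch API. Below is a companion sketch for collecting them, using only documented Batch API methods (`batches.retrieve`, `files.content`); the polling helper itself is not part of this commit:

```python
# Companion sketch (not in this commit): poll the batch created by
# batch_eval() and fetch its JSONL output once processing finishes.
import time

from openai import OpenAI


def wait_for_batch(batch_id: str, poll_seconds: int = 60) -> str:
    client = OpenAI()
    while True:
        batch = client.batches.retrieve(batch_id)
        if batch.status == "completed":
            # The output file is JSONL: one response object per request line.
            return client.files.content(batch.output_file_id).text
        if batch.status in ("failed", "expired", "cancelled"):
            raise RuntimeError(f"Batch {batch_id} ended with status {batch.status}")
        time.sleep(poll_seconds)
```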

examples/generate_query.py

Lines changed: 4 additions & 5 deletions

````diff
@@ -1,9 +1,8 @@
-import os
-
 from openai import OpenAI

 # os.environ["OPENAI_API_KEY"] = ""

+
 def openai_complete_if_cache(
     model="gpt-4o-mini", prompt=None, system_prompt=None, history_messages=[], **kwargs
 ) -> str:
@@ -47,10 +46,10 @@ def openai_complete_if_cache(
 ...
 """

-result = openai_complete_if_cache(model='gpt-4o-mini', prompt=prompt)
+result = openai_complete_if_cache(model="gpt-4o-mini", prompt=prompt)

-file_path = f"./queries.txt"
+file_path = "./queries.txt"
 with open(file_path, "w") as file:
     file.write(result)

-print(f"Queries written to {file_path}")
+print(f"Queries written to {file_path}")
````

examples/lightrag_azure_openai_demo.py

Lines changed: 1 addition & 1 deletion

````diff
@@ -122,4 +122,4 @@ async def test_funcs():
 print(rag.query(query_text, param=QueryParam(mode="global")))

 print("\nResult (Hybrid):")
-print(rag.query(query_text, param=QueryParam(mode="hybrid")))
+print(rag.query(query_text, param=QueryParam(mode="hybrid")))
````

examples/lightrag_bedrock_demo.py

Lines changed: 4 additions & 9 deletions

````diff
@@ -20,22 +20,17 @@
     llm_model_func=bedrock_complete,
     llm_model_name="Anthropic Claude 3 Haiku // Amazon Bedrock",
     embedding_func=EmbeddingFunc(
-        embedding_dim=1024,
-        max_token_size=8192,
-        func=bedrock_embedding
-    )
+        embedding_dim=1024, max_token_size=8192, func=bedrock_embedding
+    ),
 )

-with open("./book.txt", 'r', encoding='utf-8') as f:
+with open("./book.txt", "r", encoding="utf-8") as f:
     rag.insert(f.read())

 for mode in ["naive", "local", "global", "hybrid"]:
     print("\n+-" + "-" * len(mode) + "-+")
     print(f"| {mode.capitalize()} |")
     print("+-" + "-" * len(mode) + "-+\n")
     print(
-        rag.query(
-            "What are the top themes in this story?",
-            param=QueryParam(mode=mode)
-        )
+        rag.query("What are the top themes in this story?", param=QueryParam(mode=mode))
     )
````
