English | 简体中文 | 日本語 | Deutsch | Español | Русский | Português
Revolutionizing data warehouse development with AI intelligence.
DataNova is an AI-powered data warehouse development platform that combines large language models with specialized tools to provide intelligent data architecture design, query optimization, and automated pipeline generation capabilities for data teams. Our goal is to transform traditional data warehouse development through AI-driven insights and recommendations.
DataNova is now officially available in Volcengine's FaaS Application Center, where you can try its features through the experience link. To meet different deployment needs, DataNova also supports one-click deployment on Volcengine: click the deployment link to complete deployment quickly and start your intelligent data warehouse development journey.
Please visit DataNova's official website for more details.
In this demo, we showcase how to use DataNova:
- Seamless integration with MCP services
- Conducting in-depth data analysis processes and generating comprehensive reports with charts
- Creating podcast audio based on generated analysis reports
- Designing a star schema for e-commerce data warehouse
- Optimizing performance of complex SQL queries
- Building real-time data pipelines to process streaming data
- Implementing data quality monitoring and alerting systems
- Visit our official website to explore more replay examples.
- 🚀 Quick Start
- 🌟 Features
- 🏗️ Architecture
- 🛠️ Development
- 🗣️ Text-to-Speech Integration
- 📚 Examples
- ❓ FAQ
- 📜 License
- 💖 Acknowledgements
- ⭐ Star History
DataNova is developed in Python and comes with a Web UI written in Node.js. To ensure a smooth setup process, we recommend using the following tools:
- uv: Simplifies Python environment and dependency management. uv automatically creates a virtual environment in the root directory and installs all required packages for you, with no need to set up a Python environment manually.
- nvm: Easily manage multiple Node.js runtime versions.
- pnpm: Install and manage dependencies for Node.js projects.
Ensure your system meets the following minimum requirements:
```bash
# Clone the repository
git clone https://github.com/hszhsz/DataNova
cd DataNova

# Install dependencies; uv will create the Python virtual environment and install the required packages
uv sync

# Configure .env with your API keys
# Tavily: https://app.tavily.com/home
# Brave Search: https://brave.com/search/api/
# Volcengine TTS: Add if you have TTS credentials
cp .env.example .env
# See the "Supported Search Engines" and "Text-to-Speech Integration" sections below for all available options

# Configure conf.yaml with your LLM models and API keys
# See 'docs/configuration_guide.md' for more details
cp conf.yaml.example conf.yaml

# Install marp for PPT generation
# https://github.com/marp-team/marp-cli?tab=readme-ov-file#use-package-manager
brew install marp-cli
```

Optionally, install Web UI dependencies via pnpm:

```bash
cd DataNova/web
pnpm install
```

See the Configuration Guide for more details.
Note
Please read the guide carefully before starting the project and update the configuration to match your specific settings and requirements.
The fastest way to run the project is using the console UI.
```bash
# Run the project in a bash-like shell
uv run main.py
```

This project also includes a Web UI that provides a more dynamic and engaging interactive experience.
Note
You need to install the Web UI dependencies first.
```bash
# Run both backend and frontend servers in development mode
# On macOS/Linux
./bootstrap.sh -d

# On Windows
bootstrap.bat -d
```

Open your browser and visit http://localhost:3000 to explore the Web UI.
Explore more details in the web directory.
DataNova supports multiple search engines that can be configured in the .env file through the SEARCH_API variable:
- Tavily (default): Professional search API designed for AI applications
  - Requires setting TAVILY_API_KEY in the .env file
  - Registration: https://app.tavily.com/home
- DuckDuckGo: Privacy-focused search engine
  - No API key required
- Brave Search: Privacy-focused search engine with advanced features
  - Requires setting BRAVE_SEARCH_API_KEY in the .env file
  - Registration: https://brave.com/search/api/
- Arxiv: Scientific paper search for academic research
  - No API key required
  - Designed specifically for scientific and academic papers
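To show how such a switch is typically consumed at runtime, here is a minimal sketch in Python. The registry and the `resolve_search_backend` helper are hypothetical illustrations, not DataNova's actual implementation; the variable names mirror the list above.

```python
# Hypothetical registry mapping SEARCH_API values to the API-key variable
# each backend needs (None means no key is required, per the list above).
REQUIRES_KEY = {
    "tavily": "TAVILY_API_KEY",
    "brave_search": "BRAVE_SEARCH_API_KEY",
    "duckduckgo": None,
    "arxiv": None,
}

def resolve_search_backend(env: dict) -> str:
    """Validate SEARCH_API and its API key; return the chosen backend name."""
    backend = env.get("SEARCH_API", "tavily")  # Tavily is the default
    if backend not in REQUIRES_KEY:
        raise ValueError(f"Unsupported SEARCH_API: {backend!r}")
    key_var = REQUIRES_KEY[backend]
    if key_var and not env.get(key_var):
        raise ValueError(f"{backend} requires {key_var} to be set in .env")
    return backend
```

For example, `resolve_search_backend({"SEARCH_API": "duckduckgo"})` succeeds without any key, while selecting `brave_search` without `BRAVE_SEARCH_API_KEY` fails fast instead of at query time.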
To configure your preferred search engine, set the SEARCH_API variable in the .env file:
```bash
# Choose one: tavily, duckduckgo, brave_search, arxiv
SEARCH_API=tavily
```

DataNova supports retrieval based on private-domain knowledge. You can upload documents to multiple private knowledge bases for use during data analysis. Currently supported private knowledge bases include:
- RAGFlow: Open-source knowledge base engine based on Retrieval-Augmented Generation

  ```bash
  # Refer to .env.example for configuration
  RAG_PROVIDER=ragflow
  RAGFLOW_API_URL="http://localhost:9388"
  RAGFLOW_API_KEY="ragflow-xxx"
  RAGFLOW_RETRIEVAL_SIZE=10
  ```

- VikingDB Knowledge Base: Public cloud knowledge base engine provided by Volcengine

  Note: First obtain account AK/SK from Volcengine

  ```bash
  # Refer to .env.example for configuration
  RAG_PROVIDER=vikingdb_knowledge_base
  VIKINGDB_KNOWLEDGE_BASE_API_URL="api-knowledgebase.mlp.cn-beijing.volces.com"
  VIKINGDB_KNOWLEDGE_BASE_API_AK="volcengine-ak-xxx"
  VIKINGDB_KNOWLEDGE_BASE_API_SK="volcengine-sk-xxx"
  VIKINGDB_KNOWLEDGE_BASE_RETRIEVAL_SIZE=15
  ```
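Settings like the RAGFlow block above are conveniently collected into a small config object at startup. The sketch below is hypothetical (DataNova's actual loader may differ); the variable names and defaults mirror the example configuration.

```python
from dataclasses import dataclass

@dataclass
class RagflowConfig:
    api_url: str
    api_key: str
    retrieval_size: int = 10  # default mirrors the example above

def load_ragflow_config(env: dict) -> RagflowConfig:
    """Build a RAGFlow config from .env-style variables (hypothetical helper)."""
    if env.get("RAG_PROVIDER") != "ragflow":
        raise ValueError("RAG_PROVIDER is not set to 'ragflow'")
    return RagflowConfig(
        api_url=env["RAGFLOW_API_URL"],
        api_key=env["RAGFLOW_API_KEY"],
        retrieval_size=int(env.get("RAGFLOW_RETRIEVAL_SIZE", 10)),
    )
```

Failing loudly on a missing or mismatched `RAG_PROVIDER` keeps a misconfigured retrieval backend from silently degrading analysis quality.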
- 🤖 AI-Powered Data Architecture Design
  - Supports integration of most models through litellm
  - Supports open-source models like Qwen
  - Compatible with the OpenAI API interface
  - Multi-tier LLM system for tasks of varying complexity
- 🔍 Data Exploration and Retrieval
  - Data source search through Tavily, Brave Search, and others
  - Data crawling using Jina
  - Advanced data content extraction
  - Supports retrieval from designated private knowledge bases
- 📃 RAG Integration
- 🔗 Seamless MCP Integration
  - Extends capabilities for data source access, data quality checks, query optimization, and more
  - Promotes integration of diverse data tools and methodologies
- 🧠 Human-in-the-Loop
  - Supports interactive modification of data architecture plans using natural language
  - Supports automatic acceptance of architecture plans
- 📝 Data Report Post-Editing
  - Supports Notion-like block editing
  - Allows AI optimization, including AI-assisted polishing, sentence shortening, and expansion
  - Powered by tiptap
- 🎙️ Podcast and Presentation Generation
  - AI-powered podcast script generation and audio synthesis
  - Automatic creation of simple PowerPoint presentations
  - Customizable templates to meet personalized content needs
DataNova implements a modular multi-agent system architecture designed for automated data warehouse development and data analysis. The system is built on LangGraph, implementing a flexible state-based workflow where components communicate through a well-defined message passing system.
View live demo at datanova.tech
The system employs a streamlined workflow consisting of the following components:
- Coordinator: The entry point that manages the workflow lifecycle
  - Initiates the data analysis process based on user input
  - Delegates tasks to the planner at appropriate times
  - Serves as the main interface between the user and the system
- Planner: The strategic component responsible for task decomposition and planning
  - Analyzes data analysis goals and creates structured execution plans
  - Determines whether there is sufficient context or more data exploration is needed
  - Manages the data analysis process and decides when to generate the final report
- Data Analysis Team: A collection of specialized agents that execute the plan
  - Data Analyst: Conducts data search and information gathering using tools such as data search engines, crawlers, and even MCP services
  - Data Engineer: Handles data processing, analysis, and technical tasks using Python REPL tools
  - Each agent has access to tools optimized for its role and operates within the LangGraph framework
- Reporter: The final-stage processor for data analysis outputs
  - Aggregates findings from the data analysis team
  - Processes and organizes collected information
  - Generates comprehensive data analysis reports
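The coordinator, planner, data analysis team, and reporter stages can be sketched as a plain-Python pipeline. This is a deliberately reduced illustration with stubbed agents; the real system runs these components as LangGraph nodes communicating through message passing.

```python
def coordinator(query: str) -> dict:
    # Entry point: wrap user input as workflow state and hand off downstream
    return {"query": query, "trace": ["coordinator"]}

def planner(state: dict) -> dict:
    # Decompose the goal into an ordered plan (stubbed as fixed steps)
    state["plan"] = ["explore data sources", "process and analyze data"]
    state["trace"].append("planner")
    return state

def analysis_team(state: dict) -> dict:
    # Data Analyst and Data Engineer execute their plan steps (stubbed)
    state["findings"] = [f"result of: {step}" for step in state["plan"]]
    state["trace"].append("analysis_team")
    return state

def reporter(state: dict) -> dict:
    # Aggregate findings into a final report
    state["report"] = "\n".join(state["findings"])
    state["trace"].append("reporter")
    return state

def run_workflow(query: str) -> dict:
    # Linear happy path; the real graph also supports feedback loops
    state = coordinator(query)
    for stage in (planner, analysis_team, reporter):
        state = stage(state)
    return state
```

Passing a single state dict between stages is the same shape as LangGraph's state-based workflow, just without the graph engine, checkpointing, or conditional edges.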
DataNova now includes a text-to-speech (TTS) feature that allows you to convert research reports to audio. This feature uses the Volcengine TTS API to generate high-quality text audio. Characteristics such as speed, volume, and pitch can also be customized.
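Before calling the TTS API, the request body can be assembled and sanity-checked in Python. This is a sketch: the `build_tts_payload` helper and its ratio bounds are hypothetical assumptions, while the field names match the request body shown in this section.

```python
def build_tts_payload(text: str, speed: float = 1.0,
                      volume: float = 1.0, pitch: float = 1.0) -> dict:
    """Assemble the JSON body for a TTS request (hypothetical helper)."""
    if not text.strip():
        raise ValueError("text must be non-empty")
    ratios = {"speed_ratio": speed, "volume_ratio": volume, "pitch_ratio": pitch}
    for name, value in ratios.items():
        # Bounds are an illustrative assumption, not documented API limits
        if not 0.1 <= value <= 3.0:
            raise ValueError(f"{name} out of range: {value}")
    return {"text": text, **ratios}
```

Validating locally avoids a round trip to the server just to learn that a ratio was out of range.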
You can access the TTS feature through the /api/tts endpoint:

```bash
# Example API call using curl
curl --location 'http://localhost:8000/api/tts' \
--header 'Content-Type: application/json' \
--data '{
    "text": "This is a test of the text-to-speech feature.",
    "speed_ratio": 1.0,
    "volume_ratio": 1.0,
    "pitch_ratio": 1.0
}' \
--output speech.mp3
```

Run the test suite:
```bash
# Run all tests
make test

# Run specific test files
pytest tests/integration/test_workflow.py

# Run coverage tests
make coverage
```

```bash
# Run code linting
make lint

# Format code
make format
```

DataNova uses LangGraph as its workflow architecture. You can use LangGraph Studio to debug and visualize workflows in real time.
DataNova includes a langgraph.json configuration file that defines the graph structure and dependencies for LangGraph Studio. This file points to the workflow graph defined in the project and automatically loads environment variables from the .env file.
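For reference, a `langgraph.json` for a project like this typically has the following shape; the graph path and name below are illustrative, not DataNova's actual values.

```json
{
  "dependencies": ["."],
  "graphs": {
    "datanova": "./src/workflow.py:graph"
  },
  "env": ".env"
}
```

The `graphs` entry points LangGraph Studio at a compiled graph object inside the package, and `env` tells the server which file to load environment variables from.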
```bash
# Install the uv package manager if you don't have it
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install dependencies and start the LangGraph server
uvx --refresh --from "langgraph-cli[inmem]" --with-editable . --python 3.12 langgraph dev --allow-blocking
```

Alternatively, with pip:

```bash
# Install dependencies
pip install -e .
pip install -U "langgraph-cli[inmem]"

# Start the LangGraph server
langgraph dev
```

After starting the LangGraph server, you will see several URLs in the terminal:
- API: http://127.0.0.1:2024
- Studio UI: https://smith.langchain.com/studio/?baseUrl=http://127.0.0.1:2024
- API Documentation: http://127.0.0.1:2024/docs
Open the Studio UI link in your browser to access the debugging interface.
In the Studio UI, you can:
- Visualize the workflow graph and see how components connect
- Track execution in real-time to understand how data flows through the system
- Inspect the status of each step in the workflow
- Debug issues by examining the input and output of each component
- Provide feedback during the planning phase to refine the data analysis plan
When you submit a data analysis topic in the Studio UI, you will be able to see the entire workflow execution process, including:
- The planning phase where the data analysis plan is created
- Feedback loops where the plan can be modified
- Data analysis phases for each section
- Final report generation
DataNova supports LangSmith tracing functionality to help you debug and monitor workflows. To enable LangSmith tracing:
1. Ensure you have the following configuration in your .env file (see .env.example):

   ```bash
   LANGSMITH_TRACING=true
   LANGSMITH_ENDPOINT="https://api.smith.langchain.com"
   LANGSMITH_API_KEY="xxx"
   LANGSMITH_PROJECT="xxx"
   ```

2. Start LangSmith tracing locally by running:

   ```bash
   langgraph dev
   ```
This will enable tracing visualization in LangGraph Studio and send your traces to LangSmith for monitoring and analysis.
You can also run this project using Docker.
First, read the configuration guidance above and make sure the .env and conf.yaml files are ready.
Next, build your own Web server Docker image:
```bash
docker build -t datanova-api .
```

Finally, start the Docker container running the Web server:
```bash
# Replace datanova-api-app with your preferred container name
# Start the server and bind to localhost:8000
docker run -d -t -p 127.0.0.1:8000:8000 --env-file .env --name datanova-api-app datanova-api

# Stop the server
docker stop datanova-api-app
```

You can also set up this project using docker compose:
```bash
# Build docker images
docker compose build

# Start the server
docker compose up
```

Warning
If you want to deploy DataNova to a production environment, please add authentication to the website and review the security of the MCP server and the Python REPL.
The following examples showcase DataNova's capabilities:
- E-commerce Data Warehouse Design - Star schema design for an e-commerce analytics data warehouse
  - Discusses fact table and dimension table design, and data modeling best practices
  - View full report
- SQL Query Optimization Strategies - SQL query performance optimization for large datasets
  - Explores indexing strategies, query rewriting, and cost optimization techniques
  - View full report
- Real-time Data Pipeline Construction - Building real-time data pipelines using modern streaming technologies
  - Researches Kafka, Spark Streaming, and real-time data processing architectures
  - View full report
- Data Quality Monitoring Framework - Implementing automated data quality checks and monitoring
  - Explores data quality metrics, anomaly detection, and automated remediation strategies
  - View full report
- What is LLM? - An in-depth exploration of large language models
  - Discusses architecture, training, applications, and ethical considerations
  - View full report
- How to Use Claude for Deep Research? - Best practices and workflows for using Claude in deep research
  - Covers prompt engineering, data analysis, and integration with other tools
  - View full report
- AI Adoption in Healthcare: Influencing Factors - Analysis of factors influencing AI adoption in healthcare
  - Discusses AI technologies, data quality, ethical considerations, economic evaluation, organizational readiness, and digital infrastructure
  - View full report
- Quantum Computing's Impact on Cryptography - Analysis of quantum computing's impact on cryptography
  - Discusses vulnerabilities of classical cryptography, post-quantum cryptography, and quantum-resistant cryptographic solutions
  - View full report
- Cristiano Ronaldo's Performance Highlights - Analysis of Cristiano Ronaldo's performance highlights
  - Discusses his career achievements, international goals, and performances in various competitions
  - View full report
To run these examples or create your own research reports, you can use the following commands:
```bash
# Run with a specific query
uv run main.py "Design a data warehouse architecture for e-commerce analytics"

# Run with custom planning parameters
uv run main.py --max_plan_iterations 3 "How to optimize performance of complex SQL queries?"

# Run in interactive mode with built-in questions
uv run main.py --interactive

# Or run with a basic interactive prompt
uv run main.py

# View all available options
uv run main.py --help
```

DataNova supports an interactive mode with built-in questions in both English and Chinese, tailored for data warehouse development scenarios:
1. Start interactive mode:

   ```bash
   uv run main.py --interactive
   ```

2. Select your preferred language (English or Chinese)

3. Choose from the built-in data warehouse question list, or select the option to ask your own question

4. The system will process your question and generate a comprehensive data analysis report
DataNova includes a human-in-the-loop mechanism that lets you review, edit, and approve data analysis plans before they are executed:
1. Plan Review: When human-in-the-loop is enabled, the system shows you the generated data analysis plan before execution

2. Provide Feedback: You can:
   - Accept the plan by replying [ACCEPTED]
   - Edit the plan by providing feedback (e.g., [EDIT PLAN] Add more steps about data quality checks)
   - The system will incorporate your feedback and generate a revised plan

3. Automatic Acceptance: You can enable automatic acceptance to skip the review process:
   - Via API: Set auto_accepted_plan: true in the request

4. API Integration: When using the API, you can provide feedback through the feedback parameter:

   ```json
   {
     "messages": [{ "role": "user", "content": "Design an e-commerce data warehouse architecture" }],
     "thread_id": "my_thread_id",
     "auto_accepted_plan": false,
     "feedback": "[EDIT PLAN] Include more content about real-time data processing"
   }
   ```
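A request body of that shape can be built in Python before sending. The `build_analysis_request` helper below is hypothetical; the field names and feedback prefixes match the human-in-the-loop options described above.

```python
from typing import Optional

def build_analysis_request(content: str, thread_id: str,
                           feedback: Optional[str] = None,
                           auto_accept: bool = False) -> dict:
    """Build a request body with optional human-in-the-loop feedback."""
    body = {
        "messages": [{"role": "user", "content": content}],
        "thread_id": thread_id,
        "auto_accepted_plan": auto_accept,
    }
    if feedback is not None:
        # Feedback carries an [ACCEPTED] or [EDIT PLAN] prefix, per the options above
        if not (feedback.startswith("[ACCEPTED]") or feedback.startswith("[EDIT PLAN]")):
            raise ValueError("feedback must start with [ACCEPTED] or [EDIT PLAN]")
        body["feedback"] = feedback
    return body
```

Checking the feedback prefix client-side catches a malformed review reply before it reaches the planner.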
DataNova supports multiple command line arguments to customize its behavior:
- query: The data analysis query to process (can be multiple words)
- --interactive: Run in interactive mode with built-in questions
- --max_plan_iterations: Maximum number of planning cycles (default: 1)
- --max_step_num: Maximum number of steps in a data analysis plan (default: 3)
- --debug: Enable verbose debug logging
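For illustration, the flags above map onto a standard argparse parser as follows. This is a sketch mirroring the documented options and defaults, not DataNova's actual CLI code.

```python
import argparse

def make_parser() -> argparse.ArgumentParser:
    # Illustrative parser mirroring the documented command line arguments
    parser = argparse.ArgumentParser(description="DataNova command line (sketch)")
    parser.add_argument("query", nargs="*",
                        help="The data analysis query to process (can be multiple words)")
    parser.add_argument("--interactive", action="store_true",
                        help="Run in interactive mode with built-in questions")
    parser.add_argument("--max_plan_iterations", type=int, default=1,
                        help="Maximum number of planning cycles")
    parser.add_argument("--max_step_num", type=int, default=3,
                        help="Maximum number of steps in a data analysis plan")
    parser.add_argument("--debug", action="store_true",
                        help="Enable verbose debug logging")
    return parser
```

Because `query` uses `nargs="*"`, a multi-word question can be passed unquoted after the flags.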
See FAQ.md for more details.
This project is open source under the MIT License.
DataNova is built upon the outstanding work of the open-source community. We deeply appreciate all the projects and contributors that made DataNova possible. Indeed, we stand on the shoulders of giants.
We would like to express our sincere gratitude to the following projects for their valuable contributions:
- LangChain: Their excellent framework powers our LLM interactions and chains, enabling seamless integration and functionality.
- LangGraph: Their innovative approach to multi-agent orchestration is crucial for implementing DataNova's complex workflows.
These projects demonstrate the transformative power of open-source collaboration, and we are proud to build upon their foundations.
We extend our heartfelt thanks to the core authors of DataNova whose vision, passion, and dedication made this project possible:
Your unwavering commitment and expertise are the driving force behind DataNova's success. We are honored to have you lead this journey.
