English | 简体中文 | 日本語 | Deutsch | Español | Русский | Português
Revolutionizing data warehouse development with AI intelligence.
DataNova is an AI-powered data warehouse development platform that combines large language models with specialized tools to provide intelligent data architecture design, query optimization, and automated pipeline generation capabilities for data teams. Our goal is to transform traditional data warehouse development through AI-driven insights and recommendations.
DataNova is now officially available in Volcengine's FaaS Application Center, where you can try its features through the experience link. To meet different deployment needs, DataNova also supports one-click deployment on Volcengine: click the deployment link to complete deployment quickly and start your intelligent data warehouse development journey.
Please visit DataNova's official website for more details.
In this demo, we showcase how to use DataNova:
- Seamless integration with MCP services
- Conducting in-depth data analysis processes and generating comprehensive reports with charts
- Creating podcast audio based on generated analysis reports
- Designing a star schema for e-commerce data warehouse
- Optimizing performance of complex SQL queries
- Building real-time data pipelines to process streaming data
- Implementing data quality monitoring and alerting systems
- Visit our official website to explore more replay examples.
- 🚀 Quick Start
- 🌟 Features
- 🏗️ Architecture
- 🛠️ Development
- 🗣️ Text-to-Speech Integration
- 📚 Examples
- ❓ FAQ
- 📜 License
- 💖 Acknowledgements
- ⭐ Star History
DataNova is developed in Python and comes with a Web UI written in Node.js. To ensure a smooth setup process, we recommend using the following tools:
- uv: Simplifies Python environment and dependency management. uv automatically creates a virtual environment in the root directory and installs all required packages for you, with no need to set up a Python environment manually.
- nvm: Easily manage multiple Node.js runtime versions.
- pnpm: Install and manage dependencies for Node.js projects.
Ensure your system meets the following minimum requirements:
```bash
# Clone the repository
git clone https://github.com/hszhsz/DataNova
cd DataNova

# Install dependencies; uv will create the Python virtual environment and install the required packages
uv sync

# Configure .env with your API keys
# Tavily: https://app.tavily.com/home
# Brave Search: https://brave.com/search/api/
# Volcengine TTS: Add if you have TTS credentials
cp .env.example .env
# See the "Supported Search Engines" and "Text-to-Speech Integration" sections below for all available options

# Configure conf.yaml with your LLM models and API keys
# See 'docs/configuration_guide.md' for more details
cp conf.yaml.example conf.yaml

# Install marp for PPT generation
# https://github.com/marp-team/marp-cli?tab=readme-ov-file#use-package-manager
brew install marp-cli
```

Optionally, install Web UI dependencies via pnpm:

```bash
cd DataNova/web
pnpm install
```

See the Configuration Guide for more details.
Note
Please read the guide carefully before starting the project and update the configuration to match your specific settings and requirements.
The fastest way to run the project is using the console UI.
```bash
# Run the project in a bash-like shell
uv run main.py
```

This project also includes a Web UI that provides a more dynamic and engaging interactive experience.
Note
You need to install the Web UI dependencies first.
```bash
# Run both backend and frontend servers in development mode
# On macOS/Linux
./bootstrap.sh -d

# On Windows
bootstrap.bat -d
```

Open your browser and visit http://localhost:3000 to explore the Web UI.
Explore more details in the web directory.
DataNova supports multiple search engines that can be configured in the .env file through the SEARCH_API variable:
- Tavily (default): Professional search API designed for AI applications
  - Requires setting TAVILY_API_KEY in the .env file
  - Registration: https://app.tavily.com/home
- DuckDuckGo: Privacy-focused search engine
  - No API key required
- Brave Search: Privacy-focused search engine with advanced features
  - Requires setting BRAVE_SEARCH_API_KEY in the .env file
  - Registration: https://brave.com/search/api/
- Arxiv: Scientific paper search for academic research
  - No API key required
  - Designed specifically for scientific and academic papers
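To show how such a switch is typically consumed at runtime, here is a minimal sketch in Python. The registry and the `resolve_search_backend` helper are hypothetical illustrations, not DataNova's actual implementation; the variable names mirror the list above.

```python
# Hypothetical registry mapping SEARCH_API values to the API-key variable
# each backend needs (None means no key is required, per the list above).
REQUIRES_KEY = {
    "tavily": "TAVILY_API_KEY",
    "brave_search": "BRAVE_SEARCH_API_KEY",
    "duckduckgo": None,
    "arxiv": None,
}

def resolve_search_backend(env: dict) -> str:
    """Validate SEARCH_API and its API key; return the chosen backend name."""
    backend = env.get("SEARCH_API", "tavily")  # Tavily is the default
    if backend not in REQUIRES_KEY:
        raise ValueError(f"Unsupported SEARCH_API: {backend!r}")
    key_var = REQUIRES_KEY[backend]
    if key_var and not env.get(key_var):
        raise ValueError(f"{backend} requires {key_var} to be set in .env")
    return backend
```

For example, `resolve_search_backend({"SEARCH_API": "duckduckgo"})` succeeds without any key, while selecting `brave_search` without `BRAVE_SEARCH_API_KEY` fails fast instead of at query time.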
To configure your preferred search engine, set the SEARCH_API variable in the .env file:
```bash
# Choose one: tavily, duckduckgo, brave_search, arxiv
SEARCH_API=tavily
```

DataNova supports retrieval based on private-domain knowledge. You can upload documents to multiple private knowledge bases for use during data analysis. Currently supported private knowledge bases include:
- RAGFlow: Open-source knowledge base engine based on Retrieval-Augmented Generation

  ```bash
  # Refer to .env.example for configuration
  RAG_PROVIDER=ragflow
  RAGFLOW_API_URL="http://localhost:9388"
  RAGFLOW_API_KEY="ragflow-xxx"
  RAGFLOW_RETRIEVAL_SIZE=10
  ```

- VikingDB Knowledge Base: Public cloud knowledge base engine provided by Volcengine

  Note: First obtain account AK/SK from Volcengine

  ```bash
  # Refer to .env.example for configuration
  RAG_PROVIDER=vikingdb_knowledge_base
  VIKINGDB_KNOWLEDGE_BASE_API_URL="api-knowledgebase.mlp.cn-beijing.volces.com"
  VIKINGDB_KNOWLEDGE_BASE_API_AK="volcengine-ak-xxx"
  VIKINGDB_KNOWLEDGE_BASE_API_SK="volcengine-sk-xxx"
  VIKINGDB_KNOWLEDGE_BASE_RETRIEVAL_SIZE=15
  ```
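Settings like the RAGFlow block above are conveniently collected into a small config object at startup. The sketch below is hypothetical (DataNova's actual loader may differ); the variable names and defaults mirror the example configuration.

```python
from dataclasses import dataclass

@dataclass
class RagflowConfig:
    api_url: str
    api_key: str
    retrieval_size: int = 10  # default mirrors the example above

def load_ragflow_config(env: dict) -> RagflowConfig:
    """Build a RAGFlow config from .env-style variables (hypothetical helper)."""
    if env.get("RAG_PROVIDER") != "ragflow":
        raise ValueError("RAG_PROVIDER is not set to 'ragflow'")
    return RagflowConfig(
        api_url=env["RAGFLOW_API_URL"],
        api_key=env["RAGFLOW_API_KEY"],
        retrieval_size=int(env.get("RAGFLOW_RETRIEVAL_SIZE", 10)),
    )
```

Failing loudly on a missing or mismatched `RAG_PROVIDER` keeps a misconfigured retrieval backend from silently degrading analysis quality.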
- 🤖 AI-Powered Data Architecture Design
  - Supports integration of most models through litellm
  - Supports open-source models like Qwen
  - Compatible with the OpenAI API interface
  - Multi-tier LLM system for tasks of varying complexity
- 🔍 Data Exploration and Retrieval
  - Data source search through Tavily, Brave Search, and others
  - Data crawling using Jina
  - Advanced data content extraction
  - Supports retrieval from designated private knowledge bases
- 📃 RAG Integration
- 🔗 Seamless MCP Integration
  - Extends capabilities for data source access, data quality checks, query optimization, and more
  - Promotes integration of diverse data tools and methodologies
- 🧠 Human-in-the-Loop
  - Supports interactive modification of data architecture plans using natural language
  - Supports automatic acceptance of architecture plans
- 📝 Data Report Post-Editing
  - Supports Notion-like block editing
  - Allows AI optimization, including AI-assisted polishing, sentence shortening, and expansion
  - Powered by tiptap
- 🎙️ Podcast and Presentation Generation
  - AI-powered podcast script generation and audio synthesis
  - Automatic creation of simple PowerPoint presentations
  - Customizable templates to meet personalized content needs
DataNova implements a modular multi-agent system architecture designed for automated data warehouse development and data analysis. The system is built on LangGraph, implementing a flexible state-based workflow where components communicate through a well-defined message passing system.
View live demo at datanova.tech
The system employs a streamlined workflow consisting of the following components:
- Coordinator: The entry point that manages the workflow lifecycle
  - Initiates the data analysis process based on user input
  - Delegates tasks to the planner at appropriate times
  - Serves as the main interface between the user and the system
- Planner: The strategic component responsible for task decomposition and planning
  - Analyzes data analysis goals and creates structured execution plans
  - Determines whether there is sufficient context or more data exploration is needed
  - Manages the data analysis process and decides when to generate the final report
- Data Analysis Team: A collection of specialized agents that execute the plan
  - Data Analyst: Conducts data search and information gathering using tools such as data search engines, crawlers, and even MCP services
  - Data Engineer: Handles data processing, analysis, and technical tasks using Python REPL tools
  - Each agent has access to tools optimized for its role and operates within the LangGraph framework
- Reporter: The final-stage processor for data analysis outputs
  - Aggregates findings from the data analysis team
  - Processes and organizes collected information
  - Generates comprehensive data analysis reports
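The coordinator, planner, data analysis team, and reporter stages can be sketched as a plain-Python pipeline. This is a deliberately reduced illustration with stubbed agents; the real system runs these components as LangGraph nodes communicating through message passing.

```python
def coordinator(query: str) -> dict:
    # Entry point: wrap user input as workflow state and hand off downstream
    return {"query": query, "trace": ["coordinator"]}

def planner(state: dict) -> dict:
    # Decompose the goal into an ordered plan (stubbed as fixed steps)
    state["plan"] = ["explore data sources", "process and analyze data"]
    state["trace"].append("planner")
    return state

def analysis_team(state: dict) -> dict:
    # Data Analyst and Data Engineer execute their plan steps (stubbed)
    state["findings"] = [f"result of: {step}" for step in state["plan"]]
    state["trace"].append("analysis_team")
    return state

def reporter(state: dict) -> dict:
    # Aggregate findings into a final report
    state["report"] = "\n".join(state["findings"])
    state["trace"].append("reporter")
    return state

def run_workflow(query: str) -> dict:
    # Linear happy path; the real graph also supports feedback loops
    state = coordinator(query)
    for stage in (planner, analysis_team, reporter):
        state = stage(state)
    return state
```

Passing a single state dict between stages is the same shape as LangGraph's state-based workflow, just without the graph engine, checkpointing, or conditional edges.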
DataNova now includes a text-to-speech (TTS) feature that allows you to convert research reports to audio. This feature uses the Volcengine TTS API to generate high-quality text audio. Characteristics such as speed, volume, and pitch can also be customized.
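Before calling the TTS API, the request body can be assembled and sanity-checked in Python. This is a sketch: the `build_tts_payload` helper and its ratio bounds are hypothetical assumptions, while the field names match the request body shown in this section.

```python
def build_tts_payload(text: str, speed: float = 1.0,
                      volume: float = 1.0, pitch: float = 1.0) -> dict:
    """Assemble the JSON body for a TTS request (hypothetical helper)."""
    if not text.strip():
        raise ValueError("text must be non-empty")
    ratios = {"speed_ratio": speed, "volume_ratio": volume, "pitch_ratio": pitch}
    for name, value in ratios.items():
        # Bounds are an illustrative assumption, not documented API limits
        if not 0.1 <= value <= 3.0:
            raise ValueError(f"{name} out of range: {value}")
    return {"text": text, **ratios}
```

Validating locally avoids a round trip to the server just to learn that a ratio was out of range.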
You can access the TTS feature through the /api/tts endpoint:

```bash
# Example API call using curl
curl --location 'http://localhost:8000/api/tts' \
--header 'Content-Type: application/json' \
--data '{
    "text": "This is a test of the text-to-speech feature.",
    "speed_ratio": 1.0,
    "volume_ratio": 1.0,
    "pitch_ratio": 1.0
}' \
--output speech.mp3
```

Run the test suite:
```bash
# Run all tests
make test

# Run specific test files
pytest tests/integration/test_workflow.py

# Run coverage tests
make coverage
```

```bash
# Run code linting
make lint

# Format code
make format
```

DataNova uses LangGraph as its workflow architecture. You can use LangGraph Studio to debug and visualize workflows in real time.
DataNova includes a langgraph.json configuration file that defines the graph structure and dependencies for LangGraph Studio. This file points to the workflow graph defined in the project and automatically loads environment variables from the .env file.
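For reference, a `langgraph.json` for a project like this typically has the following shape; the graph path and name below are illustrative, not DataNova's actual values.

```json
{
  "dependencies": ["."],
  "graphs": {
    "datanova": "./src/workflow.py:graph"
  },
  "env": ".env"
}
```

The `graphs` entry points LangGraph Studio at a compiled graph object inside the package, and `env` tells the server which file to load environment variables from.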
```bash
# Install the uv package manager if you don't have it
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install dependencies and start the LangGraph server
uvx --refresh --from "langgraph-cli[inmem]" --with-editable . --python 3.12 langgraph dev --allow-blocking
```

Alternatively, with pip:

```bash
# Install dependencies
pip install -e .
pip install -U "langgraph-cli[inmem]"

# Start the LangGraph server
langgraph dev
```

After starting the LangGraph server, you will see several URLs in the terminal:
- API: http://127.0.0.1:2024
- Studio UI: https://smith.langchain.com/studio/?baseUrl=http://127.0.0.1:2024
- API Documentation: http://127.0.0.1:2024/docs
Open the Studio UI link in your browser to access the debugging interface.
In the Studio UI, you can:
- Visualize the workflow graph and see how components connect
- Track execution in real-time to understand how data flows through the system
- Inspect the status of each step in the workflow
- Debug issues by examining the input and output of each component
- Provide feedback during the planning phase to refine the data analysis plan
When you submit a data analysis topic in the Studio UI, you will be able to see the entire workflow execution process, including:
- The planning phase where the data analysis plan is created
- Feedback loops where the plan can be modified
- Data analysis phases for each section
- Final report generation
DataNova supports LangSmith tracing functionality to help you debug and monitor workflows. To enable LangSmith tracing:
1. Ensure you have the following configuration in your .env file (see .env.example):

   ```bash
   LANGSMITH_TRACING=true
   LANGSMITH_ENDPOINT="https://api.smith.langchain.com"
   LANGSMITH_API_KEY="xxx"
   LANGSMITH_PROJECT="xxx"
   ```

2. Start LangSmith tracing locally by running:

   ```bash
   langgraph dev
   ```
This will enable tracing visualization in LangGraph Studio and send your traces to LangSmith for monitoring and analysis.
You can also run this project using Docker.
First, read the configuration guidance above and make sure the .env and conf.yaml files are ready.
Next, build your own Web server Docker image:
```bash
docker build -t datanova-api .
```

Finally, start the Docker container running the Web server:
```bash
# Replace datanova-api-app with your preferred container name
# Start the server and bind to localhost:8000
docker run -d -t -p 127.0.0.1:8000:8000 --env-file .env --name datanova-api-app datanova-api

# Stop the server
docker stop datanova-api-app
```

You can also set up this project using docker compose:
```bash
# Build docker images
docker compose build

# Start the server
docker compose up
```

Warning
If you want to deploy DataNova to a production environment, please add authentication to the website and review the security of the MCP server and the Python REPL.
The following examples showcase DataNova's capabilities:
- E-commerce Data Warehouse Design - Star schema design for an e-commerce analytics data warehouse
  - Discusses fact table and dimension table design, and data modeling best practices
  - View full report
- SQL Query Optimization Strategies - SQL query performance optimization for large datasets
  - Explores indexing strategies, query rewriting, and cost optimization techniques
  - View full report
- Real-time Data Pipeline Construction - Building real-time data pipelines using modern streaming technologies
  - Researches Kafka, Spark Streaming, and real-time data processing architectures
  - View full report
- Data Quality Monitoring Framework - Implementing automated data quality checks and monitoring
  - Explores data quality metrics, anomaly detection, and automated remediation strategies
  - View full report
- What is LLM? - An in-depth exploration of large language models
  - Discusses architecture, training, applications, and ethical considerations
  - View full report
- How to Use Claude for Deep Research? - Best practices and workflows for using Claude in deep research
  - Covers prompt engineering, data analysis, and integration with other tools
  - View full report
- AI Adoption in Healthcare: Influencing Factors - Analysis of factors influencing AI adoption in healthcare
  - Discusses AI technologies, data quality, ethical considerations, economic evaluation, organizational readiness, and digital infrastructure
  - View full report
- Quantum Computing's Impact on Cryptography - Analysis of quantum computing's impact on cryptography
  - Discusses vulnerabilities of classical cryptography, post-quantum cryptography, and quantum-resistant cryptographic solutions
  - View full report
- Cristiano Ronaldo's Performance Highlights - Analysis of Cristiano Ronaldo's performance highlights
  - Discusses his career achievements, international goals, and performances in various competitions
  - View full report
To run these examples or create your own research reports, you can use the following commands:
```bash
# Run with a specific query
uv run main.py "Design a data warehouse architecture for e-commerce analytics"

# Run with custom planning parameters
uv run main.py --max_plan_iterations 3 "How to optimize performance of complex SQL queries?"

# Run in interactive mode with built-in questions
uv run main.py --interactive

# Or run with a basic interactive prompt
uv run main.py

# View all available options
uv run main.py --help
```

DataNova supports an interactive mode with built-in questions in both English and Chinese, tailored for data warehouse development scenarios:
1. Start interactive mode:

   ```bash
   uv run main.py --interactive
   ```

2. Select your preferred language (English or Chinese)

3. Choose from the built-in data warehouse question list, or select the option to ask your own question

4. The system will process your question and generate a comprehensive data analysis report
DataNova includes a human-in-the-loop mechanism that lets you review, edit, and approve data analysis plans before they are executed:
1. Plan Review: When human-in-the-loop is enabled, the system shows you the generated data analysis plan before execution

2. Provide Feedback: You can:
   - Accept the plan by replying [ACCEPTED]
   - Edit the plan by providing feedback (e.g., [EDIT PLAN] Add more steps about data quality checks)
   - The system will incorporate your feedback and generate a revised plan

3. Automatic Acceptance: You can enable automatic acceptance to skip the review process:
   - Via API: Set auto_accepted_plan: true in the request

4. API Integration: When using the API, you can provide feedback through the feedback parameter:

   ```json
   {
     "messages": [{ "role": "user", "content": "Design an e-commerce data warehouse architecture" }],
     "thread_id": "my_thread_id",
     "auto_accepted_plan": false,
     "feedback": "[EDIT PLAN] Include more content about real-time data processing"
   }
   ```
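A request body of that shape can be built in Python before sending. The `build_analysis_request` helper below is hypothetical; the field names and feedback prefixes match the human-in-the-loop options described above.

```python
from typing import Optional

def build_analysis_request(content: str, thread_id: str,
                           feedback: Optional[str] = None,
                           auto_accept: bool = False) -> dict:
    """Build a request body with optional human-in-the-loop feedback."""
    body = {
        "messages": [{"role": "user", "content": content}],
        "thread_id": thread_id,
        "auto_accepted_plan": auto_accept,
    }
    if feedback is not None:
        # Feedback carries an [ACCEPTED] or [EDIT PLAN] prefix, per the options above
        if not (feedback.startswith("[ACCEPTED]") or feedback.startswith("[EDIT PLAN]")):
            raise ValueError("feedback must start with [ACCEPTED] or [EDIT PLAN]")
        body["feedback"] = feedback
    return body
```

Checking the feedback prefix client-side catches a malformed review reply before it reaches the planner.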
DataNova supports multiple command line arguments to customize its behavior:
- query: The data analysis query to process (can be multiple words)
- --interactive: Run in interactive mode with built-in questions
- --max_plan_iterations: Maximum number of planning cycles (default: 1)
- --max_step_num: Maximum number of steps in a data analysis plan (default: 3)
- --debug: Enable verbose debug logging
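For illustration, the flags above map onto a standard argparse parser as follows. This is a sketch mirroring the documented options and defaults, not DataNova's actual CLI code.

```python
import argparse

def make_parser() -> argparse.ArgumentParser:
    # Illustrative parser mirroring the documented command line arguments
    parser = argparse.ArgumentParser(description="DataNova command line (sketch)")
    parser.add_argument("query", nargs="*",
                        help="The data analysis query to process (can be multiple words)")
    parser.add_argument("--interactive", action="store_true",
                        help="Run in interactive mode with built-in questions")
    parser.add_argument("--max_plan_iterations", type=int, default=1,
                        help="Maximum number of planning cycles")
    parser.add_argument("--max_step_num", type=int, default=3,
                        help="Maximum number of steps in a data analysis plan")
    parser.add_argument("--debug", action="store_true",
                        help="Enable verbose debug logging")
    return parser
```

Because `query` uses `nargs="*"`, a multi-word question can be passed unquoted after the flags.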
See FAQ.md for more details.
This project is open source under the MIT License.
DataNova is built upon the outstanding work of the open-source community. We deeply appreciate all the projects and contributors that made DataNova possible. Indeed, we stand on the shoulders of giants.
We would like to express our sincere gratitude to the following projects for their valuable contributions:
- LangChain: Their excellent framework powers our LLM interactions and chains, enabling seamless integration and functionality.
- LangGraph: Their innovative approach to multi-agent orchestration is crucial for implementing DataNova's complex workflows.
These projects demonstrate the transformative power of open-source collaboration, and we are proud to build upon their foundations.
We extend our heartfelt thanks to the core authors of DataNova whose vision, passion, and dedication made this project possible:
Your unwavering commitment and expertise are the driving force behind DataNova's success. We are honored to have you lead this journey.
