pixmo-docs

This repository contains the generation system for the PixMo-Docs, CoSyn-400K, and CoSyn-point datasets. PixMo-Docs was used to train the Molmo model, and the CoSyn datasets are an expanded version that uses an improved pipeline and covers more types of documents. More details can be found in our paper.

🆕 New Features

This enhanced version includes the following improvements over the original repository:

Modern Package Management: Uses uv for fast, reliable Python dependency management with pyproject.toml
Flexible API Configuration: Supports both official APIs and proxy services (like OpenRouter) via .env configuration
Batch Testing Script: Includes test_pipelines.py for automated testing of multiple pipelines
Environment Variables: All API keys and configuration settings are managed through a .env file for better security
Improved Error Handling: Enhanced multiprocessing patches and better error recovery
Dual Language Documentation: Both English and Chinese README files

Installation

Prerequisites

Python 3.10 or higher
uv package manager

Using uv (Recommended)

After cloning the repo, you can set up the project using uv:

# Install uv if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create virtual environment and install dependencies
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv pip install -r requirements.txt

# Install additional dependencies for specific pipelines
uv pip install playwright && playwright install
# Quote the version specifiers so the shell does not treat "<" as a redirection
uv pip install "mpl_finance<=0.10.1" "mplfinance<=0.12.10b0" "cairosvg<=2.7.1"

Traditional Installation (Alternative)

conda create --name pixmo-doc python=3.10
conda activate pixmo-doc
pip install -r requirements.txt

Environment Configuration

Create a .env file in the project root with your API keys:

# API Mode: "official" or "proxy"
API_MODE=official

# Official API Keys
OPENAI_API_KEY=your-openai-key
ANTHROPIC_API_KEY=your-anthropic-key
HF_TOKEN=your-huggingface-token  # Optional, for uploading datasets

# Proxy Configuration (if API_MODE=proxy)
PROXY_API_KEY=your-proxy-key
PROXY_BASE_URL=https://openrouter.ai/api/v1
OPENAI_MODEL=gpt-4o
ANTHROPIC_MODEL=claude-3-5-sonnet-20241022
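
The two modes above could be resolved in Python roughly as follows. This is a minimal sketch, not the repository's actual loader: `parse_env_text` and `resolve_api_config` are hypothetical helpers, and the idea that proxy mode sends both model families through `PROXY_BASE_URL` with `PROXY_API_KEY` is an assumption inferred from the variable names. The proxy URL in the sample is a placeholder.

```python
def parse_env_text(text):
    """Parse KEY=VALUE lines, skipping blanks and comments (hypothetical helper)."""
    env = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop a trailing inline comment such as "value  # Optional".
        value = value.split(" #", 1)[0].strip()
        env[key.strip()] = value
    return env


def resolve_api_config(env):
    """Return the credentials each client would use for the selected API_MODE."""
    mode = env.get("API_MODE", "official")
    if mode == "proxy":
        # Assumption: one proxy key and base URL serve both model families.
        shared = {"api_key": env["PROXY_API_KEY"], "base_url": env["PROXY_BASE_URL"]}
        return {"mode": mode, "openai": dict(shared), "anthropic": dict(shared)}
    if mode == "official":
        return {
            "mode": mode,
            "openai": {"api_key": env["OPENAI_API_KEY"], "base_url": None},
            "anthropic": {"api_key": env["ANTHROPIC_API_KEY"], "base_url": None},
        }
    raise ValueError(f"API_MODE must be 'official' or 'proxy', got {mode!r}")


sample = """
API_MODE=proxy
PROXY_API_KEY=your-proxy-key
PROXY_BASE_URL=https://proxy.example/v1  # placeholder URL
"""
config = resolve_api_config(parse_env_text(sample))
print(config["mode"], config["openai"]["base_url"])
```

Failing fast on an unknown `API_MODE` or a missing key surfaces configuration mistakes before any pipeline starts generating data.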

System Dependencies

Some pipelines require additional system dependencies:

LaTeX: Install a distribution for your OS from the official LaTeX website, or use a package manager:

# macOS
brew install --cask mactex
# Ubuntu/Debian
sudo apt-get install texlive-full

Mermaid CLI:
```
npm install -g @mermaid-js/mermaid-cli
```

PDF Tools (for LaTeX pipelines):

# macOS
brew install poppler
# Ubuntu/Debian
sudo apt-get install poppler-utils
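
Before running a pipeline, it may help to confirm that these tools are on your PATH. The check below is a small sketch and not part of the repository. The executable names are assumptions: `pdflatex` for a LaTeX distribution, `mmdc` for Mermaid CLI, and `pdftoppm` for Poppler. The pipelines may call different binaries.

```python
import shutil

# Label -> executable name (assumed; adjust to the binaries your pipelines use).
REQUIRED_TOOLS = {
    "LaTeX": "pdflatex",
    "Mermaid CLI": "mmdc",
    "Poppler": "pdftoppm",
}


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the labels of tools whose executable is not found on PATH."""
    return [label for label, exe in tools.items() if which(exe) is None]


missing = find_missing_tools()
if missing:
    print("Missing system dependencies:", ", ".join(missing))
else:
    print("All system dependencies found.")
```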

Quick Start

Basic Usage

Generate synthetic data using the main script:

python main.py -p {PIPELINE} \
               -t {TYPE_OF_DATA} \
               -n {NUMBER_OF_SAMPLES} \
               -m {DATASET_NAME}

Example:

python main.py -p "MatplotlibChartPipeline" -n 5 -m "matplotlib_test" -t "bar chart"

Batch Testing

Test multiple pipelines at once:

python test_pipelines.py

This will test all configured pipelines and save results to the examples/ directory.
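
The internals of test_pipelines.py are not described here. As a rough illustration only, a batch run could be assembled from the documented main.py flags like this. `build_command` and the `BATCH` mapping are hypothetical names, and the sketch prints the commands rather than executing them.

```python
import shlex


def build_command(pipeline, data_type, num_samples, name):
    """Build the main.py invocation for one pipeline as an argument list."""
    return [
        "python", "main.py",
        "-p", pipeline,
        "-t", data_type,
        "-n", str(num_samples),
        "-m", name,
    ]


# Hypothetical batch: pipeline name -> type of data to request.
BATCH = {
    "MatplotlibChartPipeline": "bar chart",
    "PlotlyChartPipeline": "line chart",
}

for pipeline, data_type in BATCH.items():
    cmd = build_command(pipeline, data_type, 1, f"{pipeline.lower()}_test")
    # To execute instead of printing: subprocess.run(cmd, check=True)
    print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell-quoting problems with types that contain spaces, such as "bar chart".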

Advanced Usage

Generate multiple types with different pipelines:

python main.py -p "MatplotlibChartPipeline,PlotlyChartPipeline" \
               -n 10 \
               -t "bar chart,line chart,scatter plot" \
               -m "combined_charts"

Command Line Arguments

-p, --pipelines: Pipeline names (comma-separated)
-t, --types: Visualization types to generate (comma-separated)
-n, --num: Number of samples per pipeline
-l, --llm: LLM model for text generation (default: gpt-4o)
-c, --code_llm: LLM for code generation (default: claude-3-5-sonnet)
-s, --seed: Random seed (default: 42)
-b, --batch_size: LLM batch size (default: 24)
-m, --name: Dataset name for HuggingFace upload
-f, --force: Force regeneration, ignore cache
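
For orientation, the options above map onto an argparse definition along these lines. This is a sketch reconstructed from the list, not the parser in main.py: which flags are required, the integer types, and the splitting of comma-separated values are assumptions.

```python
import argparse


def build_parser():
    """Mirror the documented flags and defaults (sketch, not the real main.py)."""
    parser = argparse.ArgumentParser(description="Sketch of the documented options")
    parser.add_argument("-p", "--pipelines", required=True,
                        help="Pipeline names (comma-separated)")
    parser.add_argument("-t", "--types",
                        help="Visualization types to generate (comma-separated)")
    parser.add_argument("-n", "--num", type=int,
                        help="Number of samples per pipeline")
    parser.add_argument("-l", "--llm", default="gpt-4o",
                        help="LLM model for text generation")
    parser.add_argument("-c", "--code_llm", default="claude-3-5-sonnet",
                        help="LLM for code generation")
    parser.add_argument("-s", "--seed", type=int, default=42, help="Random seed")
    parser.add_argument("-b", "--batch_size", type=int, default=24,
                        help="LLM batch size")
    parser.add_argument("-m", "--name", help="Dataset name for HuggingFace upload")
    parser.add_argument("-f", "--force", action="store_true",
                        help="Force regeneration, ignore cache")
    return parser


# Parse the "Advanced Usage" example and split the comma-separated values.
args = build_parser().parse_args([
    "-p", "MatplotlibChartPipeline,PlotlyChartPipeline",
    "-n", "10",
    "-t", "bar chart,line chart,scatter plot",
    "-m", "combined_charts",
])
pipelines = args.pipelines.split(",")
types = args.types.split(",")
print(pipelines, types, args.seed, args.batch_size)
```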

Pipelines

We support 25 pipelines across 8 categories:

Charts

MatplotlibChartPipeline: Traditional charts using Matplotlib
PlotlyChartPipeline: Interactive charts with Plotly
VegaLiteChartPipeline: Declarative charts with Vega-Lite
LaTeXChartPipeline: Charts using TikZ
HTMLChartPipeline: Simple charts with HTML/CSS

Tables

LaTeXTablePipeline: Complex structured tables
MatplotlibTablePipeline: Tables within figures
PlotlyTablePipeline: Simple interactive tables
HTMLTablePipeline: Web-based tables

Documents

LaTeXDocumentPipeline: Scientific documents and reports
HTMLDocumentPipeline: Web documents with rich styling
DOCXDocumentPipeline: Microsoft Word documents

Diagrams

GraphvizDiagramPipeline: Graph and tree structures
MermaidDiagramPipeline: Flowcharts and sequence diagrams
LaTeXDiagramPipeline: Technical diagrams with TikZ

Circuits

SchemDrawCircuitPipeline: Electrical circuit diagrams
LaTeXCircuitPipeline: Circuits using CircuiTikZ

Specialized Graphics

DALLEImagePipeline: AI-generated images
RdkitChemicalPipeline: Chemical structure diagrams
LaTeXMathPipeline: Mathematical expressions
LilyPondMusicPipeline: Sheet music notation
SVGGraphicPipeline: Vector graphics
AsymptoteGraphicPipeline: Mathematical/technical graphics

Web Screens

HTMLScreenPipeline: Web page screenshots

Pointing

HTMLDocumentPointPipeline: Documents with pointing annotations
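
When scripting runs, the catalog above can be kept as a plain mapping and turned into a value for -p. The `PIPELINES` dictionary and the `pipelines_arg` helper below are illustrative names that do not come from the repository; the pipeline names themselves are copied from the list above.

```python
PIPELINES = {
    "Charts": [
        "MatplotlibChartPipeline", "PlotlyChartPipeline", "VegaLiteChartPipeline",
        "LaTeXChartPipeline", "HTMLChartPipeline",
    ],
    "Tables": [
        "LaTeXTablePipeline", "MatplotlibTablePipeline",
        "PlotlyTablePipeline", "HTMLTablePipeline",
    ],
    "Documents": [
        "LaTeXDocumentPipeline", "HTMLDocumentPipeline", "DOCXDocumentPipeline",
    ],
    "Diagrams": [
        "GraphvizDiagramPipeline", "MermaidDiagramPipeline", "LaTeXDiagramPipeline",
    ],
    "Circuits": ["SchemDrawCircuitPipeline", "LaTeXCircuitPipeline"],
    "Specialized Graphics": [
        "DALLEImagePipeline", "RdkitChemicalPipeline", "LaTeXMathPipeline",
        "LilyPondMusicPipeline", "SVGGraphicPipeline", "AsymptoteGraphicPipeline",
    ],
    "Web Screens": ["HTMLScreenPipeline"],
    "Pointing": ["HTMLDocumentPointPipeline"],
}


def pipelines_arg(*categories):
    """Join every pipeline in the given categories into a -p argument value."""
    names = [name for category in categories for name in PIPELINES[category]]
    return ",".join(names)


total = sum(len(names) for names in PIPELINES.values())
print(f"{total} pipelines in {len(PIPELINES)} categories")
print(pipelines_arg("Circuits", "Pointing"))
```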

Troubleshooting

Common Issues

DataDreamer multiprocessing errors: Already patched in this version
LaTeX missing packages: Install texlive-full or equivalent
Plotly export issues: Ensure kaleido is installed: uv pip install kaleido
API rate limits: Adjust batch_size parameter or use proxy services

Debug Mode

To debug a pipeline, disable the DataDreamer cache and force regeneration of a single sample:

export DATADREAMER_DISABLE_CACHE=1
python main.py -p "PlotlyChartPipeline" -n 1 -t "bar chart" -f

Citation

Please cite the following papers if you use this codebase or our datasets:

@article{yang2025scaling,
  title={Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation},
  author={Yang, Yue and Patel, Ajay and Deitke, Matt and Gupta, Tanmay and Weihs, Luca and Head, Andrew and Yatskar, Mark and Callison-Burch, Chris and Krishna, Ranjay and Kembhavi, Aniruddha and others},
  journal={arXiv preprint arXiv:2502.14846},
  year={2025}
}

@article{deitke2024molmo,
  title={Molmo and pixmo: Open weights and open data for state-of-the-art multimodal models},
  author={Deitke, Matt and Clark, Christopher and Lee, Sangho and Tripathi, Rohun and Yang, Yue and Park, Jae Sung and Salehi, Mohammadreza and Muennighoff, Niklas and Lo, Kyle and Soldaini, Luca and others},
  journal={arXiv preprint arXiv:2409.17146},
  year={2024}
}
