Starred repositories
A feature-rich Python text case conversion library
CLI that queries multiple language models in parallel using prompts from a CSV file
Official implementation of "KBLaM: Knowledge Base augmented Language Model"
Cutting-edge web scraping techniques workshop at NICAR 2025
Vision infrastructure to turn complex documents into RAG/LLM-ready data
Export any Kindle book you own as text, PDF, EPUB, or as a custom, AI-narrated audiobook. 🔥
Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding
Concept Induction: Analyzing Unstructured Text with High-Level Concepts Using LLooM (CHI 2024 paper). LLooM automatically surfaces high-level concepts to analyze unstructured text.
A pytest plugin for running and analyzing LLM evaluation tests.
A text-to-speech (TTS) and speech-to-speech (STS) library built on Apple's MLX framework, providing efficient speech synthesis on Apple Silicon.
Python CLI for interacting with Unix tools, built for people who haven't committed man pages to memory
The repository for the NICAR 2024 class, SELECT * FROM interesting
Tip sheet and activities for a hands-on session about using the command line for the 2025 NICAR conference
Semantic search for your spreadsheets
📝 Python package to calculate readability statistics of a text object: paragraphs, sentences, articles.
A repository for collecting several simple datasets that track the impact of the Trump 47 regime
Codec is a collaborative tool for managing video evidence.
ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.
potato: portable text annotation tool
An LLM plugin to efficiently pose questions to LLMs, cache the answers, and quickly retrieve answers to questions that you've already posed.
A collection of rosters of forms maintained by policing organizations
A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
This repository contains code and explanations for how to use large language models and a variety of other natural language processing techniques to analyze congressional hearings.