LangExtract is an open-source Python library that enables developers to extract structured data from unstructured text using large language models. It provides reliable schema enforcement and precise source grounding while supporting multiple LLM backends including Gemini, OpenAI, and local Ollama models for privacy-sensitive applications.
Key benefits include:
- Precise Source Grounding: Link every extraction to its exact location in source text for verification
- Reliable Structured Outputs: Enforce a consistent schema through few-shot examples, which reduces the risk of malformed or hallucinated output
- LLM Flexibility: Switch seamlessly between Google Gemini, OpenAI GPT, or local Ollama models
- Long Document Handling: Built-in chunking and parallel processing for long documents, such as books or PDFs, that exceed a model's context window
- Interactive Visualization: Generate HTML reports to visually verify extractions and source grounding
- OpenAI-Compatible API Support: Works with DeepSeek, Qwen, and other OpenAI-compatible endpoints
LangExtract suits developers and data scientists who need to process complex documents at scale while keeping control over data privacy and model choice.
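To make the ideas of chunking and source grounding concrete, here is a minimal standard-library sketch. It does not use LangExtract and is not its implementation; the names `Chunk`, `GroundedSpan`, `chunk_text`, and `ground` are hypothetical and exist only for illustration. It assumes fixed-size character windows and verbatim string matching, which is a deliberately simplified stand-in for what a real library would do.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    start: int  # offset of the chunk's first character in the full document
    text: str


@dataclass(frozen=True)
class GroundedSpan:
    start: int  # inclusive character offset in the full document
    end: int    # exclusive character offset in the full document
    text: str


def chunk_text(document: str, max_chars: int) -> list[Chunk]:
    """Split a document into fixed-size windows, recording each start offset."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    return [
        Chunk(start=i, text=document[i:i + max_chars])
        for i in range(0, len(document), max_chars)
    ]


def ground(chunk: Chunk, extraction_text: str) -> GroundedSpan | None:
    """Find an extracted string in a chunk and map it to document offsets."""
    local = chunk.text.find(extraction_text)
    if local == -1:
        return None  # not present verbatim, so it cannot be grounded
    start = chunk.start + local
    return GroundedSpan(start, start + len(extraction_text), extraction_text)


# Illustrative input: pretend a model returned "Romeo" for the second chunk.
document = "Lady Juliet gazed longingly at the stars. Her heart ached for Romeo."
chunks = chunk_text(document, max_chars=42)
span = ground(chunks[1], "Romeo")
```

Because each chunk remembers its own start offset, a match found inside a chunk can be checked against the original document with `document[span.start:span.end]`. Fixed windows can cut an entity in half at a chunk boundary, so a production approach would likely need overlapping or sentence-aware chunks; this sketch ignores that case.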
