
Docling Basics - Progressive Tutorial

This folder contains a series of progressive examples demonstrating Docling’s capabilities for document processing, from simple PDF conversion to advanced hybrid chunking for RAG systems.

📚 What is Docling?

Docling is a document processing library that handles complex document formats that are typically challenging for RAG (Retrieval-Augmented Generation) systems. Without Docling, you would need to implement custom OCR, layout analysis, table extraction, and format-specific parsers yourself. Docling provides all of this out of the box.

Key Advantages:

🎯 Tutorial Progression

1️⃣ Simple PDF Conversion (01_simple_pdf.py)

What it demonstrates:

Key concepts:

Run it:

python 01_simple_pdf.py
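
The script itself is not reproduced in this README. As a rough sketch of what a basic conversion can look like, the snippet below uses Docling's `DocumentConverter` with default settings and exports the result to Markdown. The helper `markdown_output_path`, the `output` directory, and the `main` function are hypothetical names chosen for illustration; the input path assumes the script is run from `docling_basics/`.

```python
from pathlib import Path


def convert_pdf_to_markdown(pdf_path: str) -> str:
    """Convert one PDF to Markdown text with Docling's default pipeline."""
    # Imported lazily so this module still loads where Docling is missing.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert(pdf_path)
    return result.document.export_to_markdown()


def markdown_output_path(pdf_path: str, out_dir: str = "output") -> Path:
    """Hypothetical helper: map 'docs/report.pdf' to 'output/report.md'."""
    return Path(out_dir) / (Path(pdf_path).stem + ".md")


def main() -> None:
    # Example file listed in the documents/ folder one level up.
    source = "../documents/technical-architecture-guide.pdf"
    markdown = convert_pdf_to_markdown(source)
    target = markdown_output_path(source)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(markdown, encoding="utf-8")
    print(f"Wrote {len(markdown)} characters to {target}")
```

Calling `main()` runs the conversion; the first run may take a while because Docling downloads its models.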

What this covers:


2️⃣ Multiple Document Formats (02_multiple_formats.py)

What it demonstrates:

Key concepts:

Run it:

python 02_multiple_formats.py
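
One way to picture the unified API is a single converter instance reused across files of different types. The sketch below is an assumption about how such a batch loop could be written, not the contents of `02_multiple_formats.py`. The extension set mirrors the example files in `documents/` and is not Docling's full list of supported formats; `find_documents` and `convert_folder` are hypothetical helper names.

```python
from pathlib import Path

# Assumption for this sketch: only the extensions that appear among the
# example files. Docling supports more formats than these.
EXAMPLE_EXTENSIONS = {".pdf", ".docx", ".md"}


def find_documents(folder, extensions=EXAMPLE_EXTENSIONS):
    """Return files in `folder` whose extension is in `extensions`, sorted."""
    return sorted(
        path
        for path in Path(folder).iterdir()
        if path.is_file() and path.suffix.lower() in extensions
    )


def convert_folder(folder):
    """Convert every matching file and return {file name: Markdown text}."""
    # Imported lazily so the file-discovery helper works without Docling.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()  # one instance handles all formats
    converted = {}
    for path in find_documents(folder):
        try:
            result = converter.convert(path)
            converted[path.name] = result.document.export_to_markdown()
        except Exception as exc:  # keep going if a single file fails
            print(f"Skipping {path.name}: {exc}")
    return converted
```

Catching the exception per file keeps one unreadable document from aborting the whole batch, which is usually what you want in an ingestion step.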

What this covers:


3️⃣ Audio Transcription (03_audio_transcription.py)

What it demonstrates:

Key concepts:

Prerequisites: FFmpeg must be installed:

Windows (Chocolatey):

choco install ffmpeg

Windows (Conda):

conda install -c conda-forge ffmpeg

macOS:

brew install ffmpeg

Linux:

apt-get install ffmpeg  # Debian/Ubuntu
yum install ffmpeg      # RedHat/CentOS
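
Before running the audio example, it can help to confirm that FFmpeg is actually reachable from Python. The following is a small standard-library sketch; the function names are made up for illustration.

```python
import shutil
import subprocess


def ffmpeg_path():
    """Return the full path of the ffmpeg executable, or None if not on PATH."""
    return shutil.which("ffmpeg")


def ffmpeg_version_line(executable):
    """Return the first line printed by `ffmpeg -version`, or '' on no output."""
    completed = subprocess.run(
        [executable, "-version"],
        capture_output=True,
        text=True,
        check=False,
    )
    lines = completed.stdout.splitlines()
    return lines[0] if lines else ""


def report_ffmpeg():
    """Print whether FFmpeg was found and, if so, which version responds."""
    path = ffmpeg_path()
    if path is None:
        print("FFmpeg not found on PATH; install it before transcribing audio.")
    else:
        print(f"Found FFmpeg at {path}: {ffmpeg_version_line(path)}")
```

Calling `report_ffmpeg()` prints either the detected version line or a reminder to install FFmpeg.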

Run it:

python 03_audio_transcription.py

What this covers:


4️⃣ Hybrid Chunking (04_hybrid_chunking.py)

What it demonstrates:

Key concepts:

Why Hybrid Chunking?

Run it:

python 04_hybrid_chunking.py
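
To make the idea concrete, here is a toy illustration of the two passes that hybrid chunking combines: split a section only when it exceeds the token budget, then merge neighbouring pieces that share a heading and still fit. This is not Docling's `HybridChunker` implementation. It assumes sections arrive as `(heading, text)` pairs and uses a whitespace word count as a stand-in for a real tokenizer.

```python
def count_tokens(text):
    """Stand-in tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def split_oversized(heading, text, max_tokens):
    """Pass 1: cut a section into pieces of at most `max_tokens` words."""
    words = text.split()
    return [
        (heading, " ".join(words[start:start + max_tokens]))
        for start in range(0, len(words), max_tokens)
    ]


def merge_undersized(chunks, max_tokens):
    """Pass 2: merge adjacent chunks with the same heading if they still fit."""
    merged = []
    for heading, text in chunks:
        if merged:
            prev_heading, prev_text = merged[-1]
            combined = prev_text + " " + text
            if prev_heading == heading and count_tokens(combined) <= max_tokens:
                merged[-1] = (heading, combined)
                continue
        merged.append((heading, text))
    return merged


def hybrid_chunk(sections, max_tokens):
    """Run both passes over a list of (heading, text) sections."""
    pieces = []
    for heading, text in sections:
        pieces.extend(split_oversized(heading, text, max_tokens))
    return merge_undersized(pieces, max_tokens)


sections = [
    ("Intro", "a b c"),
    ("Intro", "d e"),
    ("Setup", "f g h i j k l"),
]
chunks = hybrid_chunk(sections, max_tokens=5)
```

With a budget of five tokens, the two short "Intro" sections merge into one chunk, while the seven-word "Setup" section is split in two. A real pipeline would count tokens with the embedding model's tokenizer so that chunk sizes match the model's actual limit.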

What this covers:


🚀 Advanced Features (Optional Enhancements)

Beyond these tutorials, Docling offers additional capabilities for even more robust document processing:

Picture Classification & Description

Add vision-based understanding to your PDFs:

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    granite_picture_description
)
from docling.document_converter import DocumentConverter, PdfFormatOption

# Configure picture description for PDFs
pipeline_options = PdfPipelineOptions()
pipeline_options.do_picture_description = True
pipeline_options.picture_description_options = granite_picture_description

# Apply the options to PDF inputs only; other formats keep their defaults
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

Benefits:

Code Understanding

Enhanced processing for technical documents with code:

from docling.datamodel.pipeline_options import PdfPipelineOptions

pipeline_options = PdfPipelineOptions()
pipeline_options.do_code_enrichment = True  # Enables code syntax understanding

Benefits:

Table Structure Recognition

Advanced table parsing with TableFormer:

from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode

pipeline_options = PdfPipelineOptions()
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE

Benefits:


📖 From Basics to Full RAG Agent

These tutorials demonstrate the building blocks used in the main RAG agent:

Docling Basics (This Folder)

Full RAG Agent (Main Project)

Progression Flow:

  1. Learn → Work through docling_basics/ tutorials
  2. Understand → See how each piece works independently
  3. Apply → Explore the full RAG agent implementation
  4. Customize → Adapt for your specific use case

🛠️ Installation

All examples require Docling and its dependencies:

# Install base Docling
pip install docling

# For ASR and hybrid chunking (examples 3 and 4)
pip install transformers openai-whisper hf-xet

# OR install everything at once
pip install docling transformers openai-whisper hf-xet
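
To confirm the installation took effect in the active environment, a quick check with the standard library's `importlib.metadata` can list which of the packages above are present. This is only a convenience sketch; `installed_versions` is a hypothetical helper name.

```python
from importlib import metadata

# Distribution names as used in the pip commands above.
REQUIRED = ["docling", "transformers", "openai-whisper", "hf-xet"]


def installed_versions(packages=REQUIRED):
    """Map each distribution name to its installed version, or None."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def print_report():
    """Print one line per package showing its version or 'missing'."""
    for name, version in installed_versions().items():
        print(f"{name}: {version if version is not None else 'missing'}")
```

Running `print_report()` inside the same virtual environment you use for the tutorials shows at a glance whether anything still needs installing.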

📂 Expected File Structure

The documents/ folder (one level up) contains example files:

docling-rag-agent/
├── docling_basics/          # This folder - Tutorial scripts
│   ├── 01_simple_pdf.py
│   ├── 02_multiple_formats.py
│   ├── 03_audio_transcription.py
│   ├── 04_hybrid_chunking.py
│   └── README.md
├── documents/               # Source documents (examples provided)
│   ├── technical-architecture-guide.pdf
│   ├── q4-2024-business-review.pdf
│   ├── meeting-notes-2025-01-08.docx
│   ├── company-overview.md
│   ├── Recording1.mp3
│   └── ... (more files)
└── ... (main RAG agent files)

🎓 Learning Path

Recommended Order:

  1. Start Here → 01_simple_pdf.py
    • Get comfortable with basic conversion
    • See Docling’s output format
  2. Expand → 02_multiple_formats.py
    • Learn the unified API for different formats
    • Understand batch processing
  3. Add Audio → 03_audio_transcription.py
    • See how audio fits into document processing
    • Understand the ASR pipeline
  4. Optimize for RAG → 04_hybrid_chunking.py
    • Critical for production RAG systems
    • Learn about token limits and semantic chunking
  5. Explore Full Agent → Main project files
    • See everything integrated
    • Production-ready implementation

💡 Key Takeaways

After completing these tutorials, you’ll understand:

✅ Why Docling?

✅ When to Use Docling?

✅ How Docling Fits RAG?


🔗 Additional Resources


🚀 Next Steps

Ready to build your own RAG system? Check out the main project files:

These tutorials provide the foundation. The main agent shows the complete picture! 🎯