📄 PDF OCR Service

Extract text and HTML from PDF files using multiple OCR methods

Authentication

⚠️ Authentication Required: All OCR endpoints require a Bearer token for authentication. To obtain an authentication token, please contact the service owner. Include the token in the Authorization header of your requests: Authorization: Bearer <your-token-here>
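
As a small illustration of the header format above, the following Python sketch builds the Authorization header from a token. The helper name auth_headers and the environment variable OCR_SERVICE_TOKEN are assumptions made for this example; they are not part of the service.

```python
import os


def auth_headers(token: str) -> dict:
    """Build the Authorization header for a Bearer token."""
    token = token.strip()
    if not token:
        raise ValueError("an authentication token is required")
    return {"Authorization": f"Bearer {token}"}


# Hypothetical: read the token from an environment variable rather than
# hard-coding it; fall back to the placeholder used throughout this page.
headers = auth_headers(os.environ.get("OCR_SERVICE_TOKEN") or "<your-token-here>")
```

The resulting dictionary can be passed as the headers argument of whichever HTTP client you use.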

Overview

The PDF OCR Service is a FastAPI-based service that extracts text and HTML content from PDF files using OCR (Optical Character Recognition). It supports three extraction methods and, when no method is specified, falls back from one to the next until one succeeds.

🚀 Fast Extraction

Uses PyMuPDF for fast text extraction from PDFs with existing text layers

🔍 OCR Support

Tesseract OCR for scanned documents and images

🤖 AI-Powered

OpenRouter AI for advanced vision model OCR as a fallback

📊 Table Extraction

Special focus on preserving table structure and formatting

🖼️ Image Generation

Generated OCR images are saved and URLs are returned in responses

Available Methods

You can choose from three OCR methods:

  • py - PyMuPDF: Fastest method, extracts text directly from PDF's text layer. Works best for PDFs that already contain text (not scanned images).
  • tesseract - Tesseract OCR: Converts PDF pages to images and performs OCR. Good for scanned PDFs and documents without text layers.
  • ai - OpenRouter AI: Uses advanced vision models (Google Gemini) for OCR. Best for complex layouts and difficult-to-read documents. Requires API key.

Note: If no method is specified, the service automatically tries all methods in order (py → tesseract → ai) until one succeeds.
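
The fallback order in the note above can be pictured with a short Python sketch. This is an illustration of the described behavior, not the service's implementation: the function name extract_with_fallback, the extractors mapping, and the rule that an exception or empty text counts as a failure are all assumptions.

```python
from typing import Optional

# Order described in the note above.
FALLBACK_ORDER = ["py", "tesseract", "ai"]


def extract_with_fallback(extractors: dict, method: Optional[str] = None):
    """Run one method if given, otherwise try each in FALLBACK_ORDER.

    `extractors` maps a method name to a zero-argument callable returning text.
    A method "fails" here if it raises or returns empty text (an assumption;
    the service's own success criteria are not documented).
    """
    candidates = [method] if method else FALLBACK_ORDER
    errors = {}
    for name in candidates:
        try:
            text = extractors[name]()
        except Exception as exc:  # record the failure and try the next method
            errors[name] = str(exc)
            continue
        if text and text.strip():
            return name, text
        errors[name] = "empty result"
    raise RuntimeError(f"all methods failed: {errors}")
```

In this sketch, a caller that passes an explicit method gets no fallback, which matches the note's wording that fallback applies only when no method is specified.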

API Endpoints

POST /extract-text

Extracts plain text from a PDF file.

Query Parameters:

  • method (optional): py, tesseract, or ai
  • image_width (optional): Integer between 1 and 10000 - Target image width in pixels. Aspect ratio is maintained. If not specified, images are saved at original size.
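
To avoid a round trip that ends in a 400 response, the two query parameters can be checked on the client before sending. The sketch below assumes only the constraints listed above; the helper name build_query is hypothetical.

```python
from typing import Optional
from urllib.parse import urlencode

VALID_METHODS = {"py", "tesseract", "ai"}


def build_query(method: Optional[str] = None, image_width: Optional[int] = None) -> str:
    """Return a URL query string, validating both optional parameters."""
    params = {}
    if method is not None:
        if method not in VALID_METHODS:
            raise ValueError(f"method must be one of {sorted(VALID_METHODS)}")
        params["method"] = method
    if image_width is not None:
        if not 1 <= image_width <= 10000:
            raise ValueError("image_width must be between 1 and 10000")
        params["image_width"] = image_width
    return urlencode(params)


# Example: the query string used by the second curl request on this page
query = build_query("tesseract", 800)
```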

Request:

# Extract text with PyMuPDF
curl -X POST "http://localhost:8000/extract-text?method=py" \
  -H "Authorization: Bearer <your-token-here>" \
  -F "file=@document.pdf"

# Extract text with image width of 800 pixels
curl -X POST "http://localhost:8000/extract-text?method=tesseract&image_width=800" \
  -H "Authorization: Bearer <your-token-here>" \
  -F "file=@document.pdf"

Response:

{
  "text": "Extracted text content...",
  "method_used": "pymupdf",
  "tables": [
    [
      ["Header1", "Header2"],
      ["Value1", "Value2"]
    ]
  ],
  "image_urls": [
    "/images/550e8400-e29b-41d4-a716-446655440000_page_1.png",
    "/images/550e8400-e29b-41d4-a716-446655440000_page_2.png"
  ]
}

Note: The image_urls field is included only when a method that generates images is used (Tesseract or OpenRouter). The PyMuPDF method does not generate images.
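
Because image_urls may be absent, client code should not index it directly. A minimal Python sketch of defensive handling follows; the function name summarize_result is hypothetical, and treating tables as possibly missing is an extra precaution rather than documented behavior.

```python
def summarize_result(result: dict) -> dict:
    """Collect the main fields of an /extract-text response with safe defaults."""
    return {
        "method_used": result.get("method_used"),
        "characters": len(result.get("text") or ""),
        "table_count": len(result.get("tables") or []),
        # Missing when the method generated no images (for example PyMuPDF).
        "image_urls": result.get("image_urls") or [],
    }


# Example: a response without the image_urls field
summary = summarize_result({"text": "Hello", "method_used": "pymupdf", "tables": []})
```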

POST /extract-html

Extracts content from a PDF file and returns it as HTML.

Query Parameters:

  • method (optional): py, tesseract, or ai
  • image_width (optional): Integer between 1 and 10000 - Target image width in pixels. Aspect ratio is maintained. If not specified, images are saved at original size.

Request:

# Extract HTML with Tesseract
curl -X POST "http://localhost:8000/extract-html?method=tesseract" \
  -H "Authorization: Bearer <your-token-here>" \
  -F "file=@document.pdf" \
  -o output.html

# Extract HTML with image width of 1200 pixels
curl -X POST "http://localhost:8000/extract-html?method=tesseract&image_width=1200" \
  -H "Authorization: Bearer <your-token-here>" \
  -F "file=@document.pdf" \
  -o output.html

Response:

Returns an HTML document with extracted content, including properly formatted tables.

Image URLs: When images are generated (Tesseract or OpenRouter methods), image URLs are included in:

  • X-Image-Urls response header (JSON array)
  • HTML comment at the beginning: <!-- IMAGE_URLS: [...] -->
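
Either location can be read on the client. The Python sketch below prefers the header and falls back to the HTML comment. It assumes the comment holds a JSON array exactly as the header does; the page shows only [...] for its contents, so treat that as an assumption. The function name is hypothetical.

```python
import json
import re
from typing import Mapping

# Matches <!-- IMAGE_URLS: [...] --> and captures the bracketed array.
_COMMENT = re.compile(r"<!--\s*IMAGE_URLS:\s*(\[.*?\])\s*-->", re.DOTALL)


def image_urls_from_html_response(headers: Mapping[str, str], html: str) -> list:
    """Return image URLs from the X-Image-Urls header or the leading comment."""
    # Note: a plain dict lookup is case-sensitive; HTTP client libraries
    # usually expose a case-insensitive headers mapping.
    header_value = headers.get("X-Image-Urls")
    if header_value:
        return json.loads(header_value)
    match = _COMMENT.search(html)
    if match:
        return json.loads(match.group(1))
    return []
```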

Usage Examples

Using Python requests:

import requests

headers = {
    'Authorization': 'Bearer <your-token-here>'
}

# Extract text with PyMuPDF; the with-block closes the file after the upload
with open('document.pdf', 'rb') as pdf:
    response = requests.post(
        'http://localhost:8000/extract-text?method=py',
        headers=headers,
        files={'file': pdf}
    )
response.raise_for_status()  # raise an exception on 4xx/5xx responses
result = response.json()
print(result['text'])

# Extract HTML with Tesseract and image width of 800 pixels
with open('document.pdf', 'rb') as pdf:
    response = requests.post(
        'http://localhost:8000/extract-html?method=tesseract&image_width=800',
        headers=headers,
        files={'file': pdf}
    )
response.raise_for_status()
html_content = response.text

Using JavaScript (fetch):

// Extract text
const formData = new FormData();
formData.append('file', fileInput.files[0]);

const response = await fetch(
    'http://localhost:8000/extract-text?method=ai',
    {
        method: 'POST',
        headers: {
            'Authorization': 'Bearer <your-token-here>'
        },
        body: formData
    }
);
const result = await response.json();
console.log(result.text);

// Access generated images (if available)
if (result.image_urls) {
    result.image_urls.forEach(url => {
        console.log(`Image URL: http://localhost:8000${url}`);
    });
}

Method Selection Guide

  • Use py for PDFs with existing text layers (fastest, no OCR needed)
  • Use tesseract for scanned PDFs or images (good balance of speed and accuracy)
  • Use ai for complex layouts, poor quality scans, or when other methods fail (slowest but most accurate)
  • Omit the method parameter to let the service try each method in order (py → tesseract → ai) until one succeeds
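
The guide above can be written down as a small decision helper. This is only a sketch of the guidance on this page: the function name choose_method and its two inputs are hypothetical, and how you determine whether a PDF has a text layer is left to the caller.

```python
from typing import Optional


def choose_method(has_text_layer: Optional[bool], difficult: bool = False) -> Optional[str]:
    """Map what is known about a PDF to a method value, or None to omit it.

    has_text_layer: True, False, or None when unknown.
    difficult: complex layout or a poor-quality scan.
    """
    if difficult:
        return "ai"
    if has_text_layer is None:
        return None  # omit the parameter and let the service fall back in order
    return "py" if has_text_layer else "tesseract"
```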

Image Generation and Access

When OCR methods that convert PDF pages to images are used (Tesseract or OpenRouter), the service:

  • Saves generated images to a temporary directory
  • Allows you to control image width using the image_width query parameter (1-10000 pixels). Aspect ratio is automatically maintained.
  • Returns image URLs in API responses (image_urls field for JSON, X-Image-Urls header for HTML)
  • Serves images via the /images/{filename} endpoint
  • Automatically clears the temp directory every night at midnight

Image Width Control: Use the image_width parameter to control the width of generated images in pixels. For example, image_width=800 resizes images to 800 pixels wide (height is automatically calculated to maintain aspect ratio). This is useful for reducing storage and bandwidth. Valid range: 1 to 10000 pixels.

Important: Images are temporary and will be deleted during nightly cleanup. Download or save any images you need to keep.
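
To estimate the dimensions a given image_width will produce, the aspect-ratio calculation can be sketched in Python as follows. The function name scaled_size is hypothetical, and rounding the height to the nearest pixel is an assumption; the service may round differently.

```python
def scaled_size(width: int, height: int, image_width: int) -> tuple:
    """Return (new_width, new_height) for a target width, keeping aspect ratio."""
    if not 1 <= image_width <= 10000:
        raise ValueError("image_width must be between 1 and 10000")
    if width <= 0 or height <= 0:
        raise ValueError("original dimensions must be positive")
    scale = image_width / width
    # Assumption: round to the nearest pixel, never below 1.
    return image_width, max(1, round(height * scale))


# Example: a 1000 x 500 pixel page image resized with image_width=800
print(scaled_size(1000, 500, 800))  # (800, 400)
```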

Accessing Generated Images

Images can be accessed directly via their URLs:

# Example: Access a generated image
curl "http://localhost:8000/images/550e8400-e29b-41d4-a716-446655440000_page_1.png" \
  -H "Authorization: Bearer <your-token-here>"
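
Since the URLs in responses are relative paths and the files are removed nightly, a client will usually join each path with the service base URL and save the file. A minimal Python sketch using only the standard library is shown below; both function names are hypothetical, and the base URL is the localhost address used throughout this page.

```python
import urllib.request
from urllib.parse import urljoin


def absolute_image_url(base_url: str, image_path: str) -> str:
    """Join the service base URL with a relative path such as /images/<file>.png.

    Note: a leading slash in image_path replaces any path prefix in base_url.
    """
    return urljoin(base_url, image_path)


def download_image(base_url: str, image_path: str, token: str, dest: str) -> None:
    """Fetch one generated image and write it to dest (performs a network call)."""
    request = urllib.request.Request(
        absolute_image_url(base_url, image_path),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request) as response, open(dest, "wb") as out:
        out.write(response.read())
```

For example, download_image("http://localhost:8000", url, token, "page_1.png") could be called for each entry of image_urls before the nightly cleanup runs.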

Table Extraction

All methods have special handling for tables. Tables are detected and preserved in both text and HTML outputs:

  • Tables are returned as structured data in JSON responses
  • Tables are converted to proper HTML <table> elements in HTML output
  • Table formatting and structure are preserved as much as possible
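
The tables field in the JSON example is a list of tables, each a list of rows, each row a list of cell strings. If you want to render such a table yourself, a Python sketch might look like this. It is not the service's own HTML conversion; the function name is hypothetical, and treating the first row as a header row is an assumption based on the sample response.

```python
from html import escape


def table_to_html(rows: list, header: bool = True) -> str:
    """Render one table (a list of rows of cells) as an HTML <table> element."""
    lines = ["<table>"]
    for index, row in enumerate(rows):
        # Assumption: the first row holds column headers.
        tag = "th" if header and index == 0 else "td"
        cells = "".join(f"<{tag}>{escape(str(cell))}</{tag}>" for cell in row)
        lines.append(f"  <tr>{cells}</tr>")
    lines.append("</table>")
    return "\n".join(lines)


# Example: the table from the sample /extract-text response
markup = table_to_html([["Header1", "Header2"], ["Value1", "Value2"]])
```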

Interactive API Documentation

For interactive API testing and detailed endpoint documentation, visit:

OpenAPI / Swagger UI

Error Handling

The service returns appropriate HTTP status codes:

  • 401 Unauthorized - Missing or invalid authentication token
  • 400 Bad Request - Invalid file type or method parameter
  • 500 Internal Server Error - All OCR methods failed or processing error
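
A client can translate these status codes into readable errors. The Python sketch below covers only the three codes listed above; the exception class OcrServiceError and the function check_status are hypothetical names, and the format of the service's error response body is not assumed.

```python
class OcrServiceError(Exception):
    """Raised when the service answers with a non-success status code."""


# Descriptions taken from the list above.
_MESSAGES = {
    400: "invalid file type or method parameter",
    401: "missing or invalid authentication token",
    500: "all OCR methods failed or a processing error occurred",
}


def check_status(status_code: int) -> None:
    """Return silently on 2xx; raise OcrServiceError otherwise."""
    if 200 <= status_code < 300:
        return
    message = _MESSAGES.get(status_code, "unexpected response")
    raise OcrServiceError(f"{status_code}: {message}")
```

Calling check_status(response.status_code) right after a request keeps the error handling in one place.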