Authentication
Include your token in the Authorization header of your requests: Authorization: Bearer <your-token-here>
Overview
The PDF OCR Service is a FastAPI-based service that extracts text and HTML content from PDF files using multiple OCR (Optical Character Recognition) methods. It supports three extraction methods with automatic fallback, so a document that one method cannot handle can still be processed by another.
🚀 Fast Extraction
Uses PyMuPDF for fast text extraction from PDFs with existing text layers
🔍 OCR Support
Tesseract OCR for scanned documents and images
🤖 AI-Powered
OpenRouter AI for advanced vision model OCR as a fallback
📊 Table Extraction
Special focus on preserving table structure and formatting
🖼️ Image Generation
Generated OCR images are saved and URLs are returned in responses
Available Methods
You can choose from three OCR methods:
- py - PyMuPDF: Fastest method, extracts text directly from PDF's text layer. Works best for PDFs that already contain text (not scanned images).
- tesseract - Tesseract OCR: Converts PDF pages to images and performs OCR. Good for scanned PDFs and documents without text layers.
- ai - OpenRouter AI: Uses advanced vision models (Google Gemini) for OCR. Best for complex layouts and difficult-to-read documents. Requires API key.
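The automatic fallback mentioned in the overview can be pictured as a chain of extractors tried in order. The sketch below is only an illustration of that idea, not the service's actual implementation: the function name `extract_with_fallback`, the extractor signature, the stand-in extractors, and the order in which methods are tried are all assumptions.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical extractor signature: takes PDF bytes, returns extracted text
# (an empty string when nothing could be extracted).
Extractor = Callable[[bytes], str]


def extract_with_fallback(
    pdf_bytes: bytes,
    extractors: Sequence[Tuple[str, Extractor]],
) -> Tuple[str, str]:
    """Try each (name, extractor) pair in order.

    Returns (text, name) for the first extractor that yields non-empty text.
    Raises RuntimeError when every extractor fails or returns nothing.
    """
    errors = []
    for name, extractor in extractors:
        try:
            text = extractor(pdf_bytes)
        except Exception as exc:  # one failing method must not abort the chain
            errors.append(f"{name}: {exc}")
            continue
        if text.strip():
            return text, name
        errors.append(f"{name}: empty result")
    raise RuntimeError("All OCR methods failed: " + "; ".join(errors))


# Stand-in extractors for demonstration only. Real ones would call
# PyMuPDF, Tesseract, and OpenRouter respectively.
def fake_pymupdf(_: bytes) -> str:
    return ""  # simulates a scanned PDF with no text layer


def fake_tesseract(_: bytes) -> str:
    return "Scanned invoice text"


text, method_used = extract_with_fallback(
    b"%PDF-1.4",
    [("pymupdf", fake_pymupdf), ("tesseract", fake_tesseract)],
)
print(method_used, "->", text)
```

Here the first extractor returns nothing, so the second one supplies the result, which mirrors how a scanned PDF without a text layer would fall through to OCR.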
API Endpoints
POST /extract-text
Extracts plain text from a PDF file.
Query Parameters:
- method (optional): py, tesseract, or ai
- image_width (optional): Integer between 1 and 10000 - Target image width in pixels. Aspect ratio is maintained. If not specified, images are saved at original size.
Request:
# Extract text with default settings
curl -X POST "http://localhost:8000/extract-text?method=py" \
-H "Authorization: Bearer <your-token-here>" \
-F "file=@document.pdf"
# Extract text with image width of 800 pixels
curl -X POST "http://localhost:8000/extract-text?method=tesseract&image_width=800" \
-H "Authorization: Bearer <your-token-here>" \
-F "file=@document.pdf"
Response:
{
  "text": "Extracted text content...",
  "method_used": "pymupdf",
  "tables": [
    [
      ["Header1", "Header2"],
      ["Value1", "Value2"]
    ]
  ],
  "image_urls": [
    "/images/550e8400-e29b-41d4-a716-446655440000_page_1.png",
    "/images/550e8400-e29b-41d4-a716-446655440000_page_2.png"
  ]
}
Note: The image_urls field is only included when an OCR method that generates images is used (Tesseract or OpenRouter). The PyMuPDF method does not generate images.
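Because image_urls is optional and holds server-relative paths, client code has to handle its absence and prepend the service address. A minimal sketch, assuming the response has already been parsed into a dict; the helper name `absolute_image_urls` is hypothetical:

```python
from urllib.parse import urljoin


def absolute_image_urls(result: dict, base_url: str) -> list:
    """Return full URLs for the images listed in an /extract-text response.

    Returns an empty list when the response has no image_urls field,
    as is the case for the PyMuPDF method.
    """
    return [urljoin(base_url, path) for path in result.get("image_urls", [])]


# Sample responses shaped like the example above.
ocr_result = {
    "text": "Extracted text content...",
    "method_used": "tesseract",
    "tables": [],
    "image_urls": ["/images/550e8400-e29b-41d4-a716-446655440000_page_1.png"],
}
text_layer_result = {"text": "Extracted text content...", "method_used": "pymupdf", "tables": []}

print(absolute_image_urls(ocr_result, "http://localhost:8000"))
print(absolute_image_urls(text_layer_result, "http://localhost:8000"))
```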
POST /extract-html
Extracts content from a PDF file and returns it as HTML.
Query Parameters:
- method (optional): py, tesseract, or ai
- image_width (optional): Integer between 1 and 10000 - Target image width in pixels. Aspect ratio is maintained. If not specified, images are saved at original size.
Request:
# Extract HTML with default settings
curl -X POST "http://localhost:8000/extract-html?method=tesseract" \
-H "Authorization: Bearer <your-token-here>" \
-F "file=@document.pdf" \
-o output.html
# Extract HTML with image width of 1200 pixels
curl -X POST "http://localhost:8000/extract-html?method=tesseract&image_width=1200" \
-H "Authorization: Bearer <your-token-here>" \
-F "file=@document.pdf" \
-o output.html
Response:
Returns an HTML document with extracted content, including properly formatted tables.
Image URLs: When images are generated (Tesseract or OpenRouter methods), image URLs are included in:
- X-Image-Urls response header (JSON array)
- HTML comment at the beginning:
<!-- IMAGE_URLS: [...] -->
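Either location can be read on the client side. The sketch below prefers the header and falls back to the HTML comment. It assumes the comment contains the same JSON array as the header, which the source does not state explicitly; the helper name is hypothetical.

```python
import json
import re

# Matches <!-- IMAGE_URLS: [...] --> and captures the bracketed array.
IMAGE_URLS_COMMENT = re.compile(r"<!--\s*IMAGE_URLS:\s*(\[.*?\])\s*-->", re.DOTALL)


def image_urls_from_html_response(headers: dict, html: str) -> list:
    """Read image URLs from the X-Image-Urls header, else from the HTML comment."""
    header_value = headers.get("X-Image-Urls")
    if header_value:
        return json.loads(header_value)
    match = IMAGE_URLS_COMMENT.search(html)
    if match:
        # Assumption: the comment body is valid JSON, like the header.
        return json.loads(match.group(1))
    return []


sample_html = '<!-- IMAGE_URLS: ["/images/a_page_1.png", "/images/a_page_2.png"] -->\n<html></html>'
print(image_urls_from_html_response({}, sample_html))
```

A plain dict lookup is case-sensitive, whereas HTTP header names are not. The headers object returned by the requests library already handles this; with other clients the header name may need normalizing first.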
Usage Examples
Using Python requests:
import requests

headers = {
    'Authorization': 'Bearer <your-token-here>'
}

# Extract text with PyMuPDF
# The with-block closes the file once the upload has finished
with open('document.pdf', 'rb') as pdf_file:
    response = requests.post(
        'http://localhost:8000/extract-text?method=py',
        headers=headers,
        files={'file': pdf_file}
    )
response.raise_for_status()  # raise an exception on 400/401/500 responses
result = response.json()
print(result['text'])

# Extract HTML with Tesseract and image width of 800 pixels
with open('document.pdf', 'rb') as pdf_file:
    response = requests.post(
        'http://localhost:8000/extract-html?method=tesseract&image_width=800',
        headers=headers,
        files={'file': pdf_file}
    )
response.raise_for_status()
html_content = response.text
Using JavaScript (fetch):
// Extract text
const formData = new FormData();
formData.append('file', fileInput.files[0]);
const response = await fetch(
  'http://localhost:8000/extract-text?method=ai',
  {
    method: 'POST',
    headers: {
      'Authorization': 'Bearer <your-token-here>'
    },
    body: formData
  }
);
const result = await response.json();
console.log(result.text);
// Access generated images (if available)
if (result.image_urls) {
  result.image_urls.forEach(url => {
    console.log(`Image URL: http://localhost:8000${url}`);
  });
}
Method Selection Guide
- Use py for PDFs with existing text layers (fastest, no OCR needed)
- Use tesseract for scanned PDFs or images (good balance of speed and accuracy)
- Use ai for complex layouts, poor quality scans, or when other methods fail (slowest but most accurate)
- Don't specify a method to let the service automatically choose the best method
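The guide above translates into a small URL builder: pass a method to force it, or leave it out so the service chooses. This is a client-side sketch under stated assumptions; the helper name `build_extract_url` is hypothetical, and the validation simply mirrors the documented values (py, tesseract, ai, and an image_width of 1 to 10000).

```python
from typing import Optional
from urllib.parse import urlencode

VALID_METHODS = {"py", "tesseract", "ai"}


def build_extract_url(
    base_url: str,
    endpoint: str,
    method: Optional[str] = None,
    image_width: Optional[int] = None,
) -> str:
    """Build a request URL, omitting parameters that were not specified."""
    if method is not None and method not in VALID_METHODS:
        raise ValueError(f"method must be one of {sorted(VALID_METHODS)}")
    if image_width is not None and not 1 <= image_width <= 10000:
        raise ValueError("image_width must be between 1 and 10000")

    params = {}
    if method is not None:
        params["method"] = method
    if image_width is not None:
        params["image_width"] = image_width

    url = f"{base_url.rstrip('/')}/{endpoint.lstrip('/')}"
    query = urlencode(params)
    return f"{url}?{query}" if query else url


# No method given: the service picks one automatically.
print(build_extract_url("http://localhost:8000", "/extract-text"))
# Force Tesseract and resize generated images to 800 pixels wide.
print(build_extract_url("http://localhost:8000", "/extract-html", "tesseract", 800))
```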
Image Generation and Access
When OCR methods that convert PDF pages to images are used (Tesseract or OpenRouter), the service:
- Saves generated images to a temporary directory
- Allows you to control image width using the image_width query parameter (1-10000 pixels). Aspect ratio is automatically maintained.
- Returns image URLs in API responses (image_urls field for JSON, X-Image-Urls header for HTML)
- Serves images via the /images/{filename} endpoint
- Automatically clears the temp directory every night at midnight
Use the image_width parameter to control the width of generated images in pixels.
For example, image_width=800 resizes images to 800 pixels wide (height is automatically calculated to maintain aspect ratio).
This is useful for reducing storage and bandwidth. Valid range: 1 to 10000 pixels.
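The height calculation is plain proportional scaling. The sketch below shows the arithmetic only; the function name is hypothetical, and rounding to the nearest pixel is an assumption, since the service's exact rounding is not documented here.

```python
def scaled_size(original_width: int, original_height: int, image_width: int) -> tuple:
    """Return (width, height) after resizing to image_width, keeping aspect ratio."""
    scale = image_width / original_width
    # Assumption: round to the nearest pixel and never go below 1 pixel.
    return image_width, max(1, round(original_height * scale))


# A 1600x1200 page image resized with image_width=800 is halved in both dimensions.
print(scaled_size(1600, 1200, 800))  # (800, 600)
```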
Accessing Generated Images
Images can be accessed directly via their URLs:
# Example: Access a generated image
curl "http://localhost:8000/images/550e8400-e29b-41d4-a716-446655440000_page_1.png" \
-H "Authorization: Bearer <your-token-here>"
Table Extraction
All methods have special handling for tables. Tables are detected and preserved in both text and HTML outputs:
- Tables are returned as structured data in JSON responses
- Tables are converted to proper HTML <table> elements in HTML output
- Table formatting and structure are preserved as much as possible
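In the JSON response shown earlier, each table is a list of rows and each row is a list of cell strings. As a hedged example of consuming that structure, the sketch below converts one table to CSV with the standard library; the helper name `table_to_csv` is hypothetical.

```python
import csv
import io


def table_to_csv(table: list) -> str:
    """Serialize one table (a list of rows, each a list of cells) as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(table)
    return buffer.getvalue()


# The "tables" field from the sample /extract-text response.
tables = [
    [
        ["Header1", "Header2"],
        ["Value1", "Value2"],
    ]
]
for table in tables:
    print(table_to_csv(table))
```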
Interactive API Documentation
For interactive API testing and detailed endpoint documentation, visit:
OpenAPI / Swagger UI
Error Handling
The service returns appropriate HTTP status codes:
- 401 Unauthorized - Missing or invalid authentication token
- 400 Bad Request - Invalid file type or method parameter
- 500 Internal Server Error - All OCR methods failed or processing error
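A client can turn these status codes into actionable errors. The following is a sketch, not part of the service: the exception class `OcrServiceError`, the function `raise_for_ocr_status`, and the hint wording are hypothetical, while the status codes and their meanings come from the list above.

```python
class OcrServiceError(Exception):
    """Raised when the OCR service responds with a non-success status code."""

    def __init__(self, status_code: int, hint: str):
        super().__init__(f"HTTP {status_code}: {hint}")
        self.status_code = status_code
        self.hint = hint


# Hints derived from the documented status codes.
HINTS = {
    401: "Missing or invalid authentication token; check the Authorization header.",
    400: "Invalid file type or method parameter; send a PDF and use py, tesseract, or ai.",
    500: "All OCR methods failed or a processing error occurred.",
}


def raise_for_ocr_status(status_code: int) -> None:
    """Raise OcrServiceError for any status code outside the 2xx range."""
    if 200 <= status_code < 300:
        return
    raise OcrServiceError(status_code, HINTS.get(status_code, "Unexpected status code."))


try:
    raise_for_ocr_status(401)
except OcrServiceError as exc:
    print(exc)
```

With the requests library, this would be called as raise_for_ocr_status(response.status_code) before reading the response body.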