PDFHUB Overview

PDFHUB is an automated financial document parser and analytics system. The application focuses on processing complex Vietnamese PDF documents (such as GSO statistical yearbooks, Customs import-export reports, and corporate financial statements) to automatically extract dense data tables into structured formats (JSON/CSV).

Data extracted and normalized through PDFHUB is designed to integrate with econometrics and causal inference workflows (Python, R, Stata).

1. Challenges with Vietnamese Statistical & Financial PDFs

Financial and statistical publications in Vietnam often present unique technical challenges for traditional OCR systems:

Vietnamese Unicode Issues: Non-standard embedded fonts or decomposed characters (combining vs. precomposed Unicode) can scramble text when it is copied directly from the PDF.
Complex Table Layouts: Tables often contain merged cells, multi-tiered column headers, regional/industry sub-groupings, and footnotes positioned immediately below the grid.
Inconsistent Structures: A single indicator (e.g., GDP by sector) may vary from year to year in its column names, currency units (VND billions ↔ USD millions), or base-year prices.
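
The combining vs. precomposed problem can be illustrated with Python's standard unicodedata module. This is a minimal sketch, not PDFHUB's actual implementation: the helper name normalize_vietnamese is hypothetical, and the choice of NFC as the target form is an assumption.

```python
import unicodedata


def normalize_vietnamese(text: str) -> str:
    """Return text in NFC form, so base letter + combining marks
    collapse into single precomposed characters where they exist.
    (Hypothetical helper; NFC as the target form is an assumption.)"""
    return unicodedata.normalize("NFC", text)


# Build the same phrase in both forms. NFD simulates what a raw copy
# from a PDF with decomposed glyphs might produce.
precomposed = unicodedata.normalize("NFC", "Tổng sản phẩm")
decomposed = unicodedata.normalize("NFD", precomposed)

# The two strings render identically but differ code point by code point,
# so naive string comparison or column-name matching fails.
looks_equal_but_differs = decomposed != precomposed

# After normalization, both forms compare equal.
repaired = normalize_vietnamese(decomposed)
```

Normalizing every extracted header and label to one form before matching avoids treating the same column name as two different columns.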

2. PDFHUB 4-Tier Ingestion Architecture

To address these challenges, PDFHUB is built on a logical 4-tier pipeline:

Source PDF (GSO/GDC) ──► Document Management Tier
                            └──► Parsing Tier (LlamaParse/LiteParse)
                                   └──► Ingestion Tier (LlamaExtract/Pydantic Schema)
                                          └──► Normalization Tier (PostgreSQL/DuckDB)

Document Management Tier: Classifies and tags uploaded PDFs by year, publishing agency (General Statistics Office, General Department of Vietnam Customs, Ministry of Finance), and topic.
Layout-aware Parsing Tier: Uses LlamaParse (cloud) or LiteParse (local) to analyze the document's spatial layout, isolating tables from surrounding text and exporting them as JSON matrices or structured Markdown.
Schema-based Ingestion Tier: Uses pre-defined Pydantic schemas to map table grids row-by-row (PER_TABLE_ROW) into typed database records.
Normalization & Database Tier: Normalizes classification codes (VSIC sector codes, administrative region codes, HS commodity codes), handles missing data symbols (e.g., -, .., x), and merges annual records into time-series panel data.
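
The last two tiers can be sketched roughly as follows. This is an illustration under stated assumptions rather than PDFHUB's real code: a standard-library dataclass stands in for the Pydantic schema so that the example has no third-party dependencies, the names IndicatorRow, parse_value, and rows_to_panel are hypothetical, numeric cells are assumed to be plain decimal strings, and the sample values are invented.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Missing-data symbols named in the text; the empty string is an added assumption.
MISSING_SYMBOLS = {"-", "..", "x", ""}


def parse_value(cell: str) -> Optional[float]:
    """Convert one table cell to a float, or None for a missing-data symbol.
    Assumes numeric cells are plain decimal strings such as "1250.5"."""
    cleaned = cell.strip()
    if cleaned in MISSING_SYMBOLS:
        return None
    return float(cleaned)


@dataclass(frozen=True)
class IndicatorRow:
    """One typed record per table row (stand-in for a Pydantic schema)."""
    year: int
    sector_code: str
    value: Optional[float]


def rows_to_panel(tables: Dict[int, List[Tuple[str, str]]]) -> List[IndicatorRow]:
    """Merge per-year tables into one long-format panel, ordered by year."""
    panel: List[IndicatorRow] = []
    for year in sorted(tables):
        for sector_code, cell in tables[year]:
            panel.append(IndicatorRow(year, sector_code.strip(), parse_value(cell)))
    return panel


# Invented sample data: two annual tables keyed by year,
# each row a (sector code, raw cell text) pair.
tables = {
    2021: [("A", "1250.5"), ("B", "..")],
    2020: [("A", "1180.0"), ("B", "-")],
}
panel = rows_to_panel(tables)
```

Mapping missing symbols to None instead of 0 keeps suppressed or unavailable values distinguishable from true zeros in downstream estimation.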
