Ingestion Mechanics & Schema-Based Parsing
The core of PDFHUB is an intelligent extraction engine that combines LlamaParse/LiteParse Layout-aware Parsing with Schema-based Ingestion. This process converts disorganized PDF table grids into structured, row-level data.
1. Layout-Aware Parsing
Traditional OCR libraries read documents from left-to-right and top-to-bottom, which completely breaks down the column structures of data tables. PDFHUB solves this using LlamaParse:
- Layout Analysis: The parser analyzes the physical page layout to isolate text blocks, tables, charts, and footnotes.
- Table Spatial Reconstruction: Identifies rows, columns, and merged cells (e.g., merged economic sector headers or merged regional row blocks).
- Intermediate Output: Exports table grids as JSON matrices or Markdown tables (e.g., using
|cell separators) to preserve spatial associations.
2. Schema-Based Ingestion
Once the table is recognized as a raw grid, the system uses LlamaExtract alongside pre-defined Pydantic schemas to transform cells into validated records.
Rather than extracting the entire table as a single chunk, the system uses the PER_TABLE_ROW ingestion target. Each row in the PDF table is mapped into a single database record.
Example Pydantic Schema for GDP:
Here is a Pydantic schema definition for a GDP indicator table:
from pydantic import BaseModel, Field
class GdpBySector(BaseModel):
year: int = Field(description="The statistical year of the record")
sector_code: str = Field(description="The economic sector code (e.g., A, B, C...)")
sector_name_vi: str = Field(description="The economic sector name in Vietnamese")
value: float = Field(description="The GDP value of the sector")
unit: str = Field(description="The measurement unit, e.g., billion VND")
When this schema is applied, the AI parses the PDF table row:
Agriculture, forestry and fisheries | 2023 | Billion VND | 1,234,567.8
And converts it into the following JSON object:
{
"year": 2023,
"sector_code": "A",
"sector_name_vi": "Nông, lâm nghiệp và thủy sản",
"value": 1234567.8,
"unit": "Billion VND"
}
3. Table Routing
Because a statistical yearbook contains hundreds of tables spanning multiple topics, the system must determine which schema to apply to each table. PDFHUB implements a Table Router mechanism:
- The system reads the table caption and surrounding text context (e.g., "Table 2.15: Export volume of key commodities by trading partners").
- The LLM Router classifies the caption into its corresponding topic (e.g., International Trade / Exports).
- The system applies the configured schema for that topic (e.g.,
TradeByCommodityschema) to ingest row-level records. - If a table caption is unrecognized, the table is queued under Pending Schema Definition and administrators are notified.