Infratex ranks #1 on table parsing.
We ran the released PulseBench-Tab dataset through the standard Infratex model and scored the outputs with Pulse AI's official T-LAG scorer. Infratex Standard reaches 0.9636 mean T-LAG across 1,820 samples, ahead of Pulse Ultra 2 at 0.9347.
Same scorer, full coverage, direct comparison.
PulseBench-Tab evaluates table extraction with a graph-based score over table structure and cell text. Our reported score uses the unmodified official scorer and 100% sample coverage. External provider values in the opening chart are the Pulse-published leaderboard values.
Infratex leads in 9 of 10 language groups.
The narrow French loss is 0.28 percentage points (0.9681 vs 0.9709). The larger pattern is the higher floor: fewer severe misses (scores below 0.50) in every language group, especially Greek, English, Korean, German, and Spanish.
| Language | Samples | Infratex | Pulse Ultra 2 | Delta | < 0.50 failures (Infratex vs Pulse Ultra 2) | P10 floor |
|---|---|---|---|---|---|---|
| Greek | 35 | 0.9693 | 0.8498 | +0.1195 | 0 vs 8 | 0.8971 |
| English | 517 | 0.9608 | 0.9221 | +0.0387 | 2 vs 24 | 0.8850 |
| Korean | 105 | 0.9646 | 0.9233 | +0.0413 | 0 vs 5 | 0.9046 |
| German | 113 | 0.9692 | 0.9315 | +0.0377 | 0 vs 4 | 0.8972 |
| Spanish | 169 | 0.9699 | 0.9355 | +0.0344 | 1 vs 9 | 0.9162 |
| Arabic | 225 | 0.9625 | 0.9300 | +0.0326 | 2 vs 5 | 0.9016 |
| Russian | 163 | 0.9600 | 0.9364 | +0.0236 | 0 vs 6 | 0.9037 |
| Chinese | 160 | 0.9632 | 0.9510 | +0.0123 | 1 vs 3 | 0.9022 |
| Japanese | 170 | 0.9615 | 0.9534 | +0.0082 | 0 vs 3 | 0.8887 |
| French | 163 | 0.9681 | 0.9709 | -0.0028 | 0 vs 1 | 0.8535 |
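The per-language rows can be cross-checked against the headline numbers. The sketch below is a small consistency check, not part of the official scorer: it recomputes the sample-weighted mean from the rounded per-language means in the table, so agreement is only expected to about four decimal places. The names `ROWS` and `weighted_mean` are illustrative.

```python
# Rows copied from the per-language table:
# (language, samples, Infratex mean T-LAG, Pulse Ultra 2 mean T-LAG)
ROWS = [
    ("Greek", 35, 0.9693, 0.8498),
    ("English", 517, 0.9608, 0.9221),
    ("Korean", 105, 0.9646, 0.9233),
    ("German", 113, 0.9692, 0.9315),
    ("Spanish", 169, 0.9699, 0.9355),
    ("Arabic", 225, 0.9625, 0.9300),
    ("Russian", 163, 0.9600, 0.9364),
    ("Chinese", 160, 0.9632, 0.9510),
    ("Japanese", 170, 0.9615, 0.9534),
    ("French", 163, 0.9681, 0.9709),
]


def weighted_mean(rows, col):
    """Sample-weighted mean of the score stored at index `col`."""
    total = sum(r[1] for r in rows)
    return sum(r[1] * r[col] for r in rows) / total


total_samples = sum(r[1] for r in ROWS)
infratex_mean = weighted_mean(ROWS, 2)
pulse_mean = weighted_mean(ROWS, 3)

# Count language groups where Infratex has the higher mean.
groups_won = sum(1 for r in ROWS if r[2] > r[3])

# French gap expressed in percentage points.
french = next(r for r in ROWS if r[0] == "French")
french_gap_pp = round((french[3] - french[2]) * 100, 2)

print(total_samples, round(infratex_mean, 4), round(pulse_mean, 4))
print(groups_won, french_gap_pp)
```

With the rows above, the weighted means round to 0.9636 and 0.9347 over 1,820 samples, which matches the headline figures.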
The rank comes from consistency, not just peaks.
Pulse Ultra 2 has more perfect scores, but it also has a much heavier failure tail. Infratex gives up some exact 1.0s while keeping far more samples above 0.90.
More mass in the reliable band.
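The distribution claims can be reproduced from any list of per-sample scores. The following is a minimal sketch under stated assumptions: the score lists are made-up toy values, not benchmark data, and the nearest-rank percentile is one possible reading of "P10 floor", since the method behind the published column is not specified here.

```python
import math


def summarize(scores):
    """Summarize a list of per-sample scores in [0, 1]."""
    n = len(scores)
    ordered = sorted(scores)
    # Nearest-rank 10th percentile (an assumption about how P10 is defined).
    rank = max(1, math.ceil(n / 10))
    return {
        "mean": sum(scores) / n,
        "perfect": sum(1 for s in scores if s == 1.0),
        "failures_below_0_50": sum(1 for s in scores if s < 0.50),
        "share_at_or_above_0_90": sum(1 for s in scores if s >= 0.90) / n,
        "p10": ordered[rank - 1],
    }


# Toy data only: a "steady" system and a "peaky" system with a failure tail.
steady = [0.93, 0.95, 0.96, 0.94, 0.97, 0.92, 0.95, 0.96, 0.91, 0.98]
peaky = [1.0, 1.0, 1.0, 0.99, 1.0, 0.35, 1.0, 0.42, 1.0, 0.88]

steady_summary = summarize(steady)
peaky_summary = summarize(peaky)
print(steady_summary)
print(peaky_summary)
```

In this toy case the peaky system has more perfect scores but a lower mean and a much lower P10, which is the shape of the trade-off described above.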
The two systems fail on different samples.
Sample-level analysis shows complementary behavior. Infratex wins the mean because its wins are large recoveries on cases where Pulse collapses, while many Pulse wins are narrow.
Hypothetical upper bound from sample-level complementarity between Infratex and Pulse outputs.
Near-independent sample scores between Infratex and Pulse Ultra 2.
Samples won by more than 0.50 T-LAG points.
Pulse failures where Infratex still scored at least 0.85.
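The four quantities named above can be computed from paired per-sample scores. This is an illustrative sketch, not the analysis code behind the reported figures: the function name `complementarity` and the toy score lists are assumptions, and the "upper bound" is taken to mean the per-sample maximum of the two systems. The 0.50 and 0.85 thresholds follow the captions above.

```python
import math


def complementarity(a, b, big_margin=0.50, fail_below=0.50, rescue_at=0.85):
    """Compare two systems scored on the same samples, in the same order."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("need two equal-length score lists with 2+ samples")
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Oracle upper bound: pick the better of the two outputs per sample.
    oracle_mean = sum(max(x, y) for x, y in zip(a, b)) / n
    # Pearson correlation between the two score vectors.
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    pearson = cov / math.sqrt(var_a * var_b)
    return {
        "mean_a": mean_a,
        "mean_b": mean_b,
        "oracle_mean": oracle_mean,
        "pearson": pearson,
        "big_wins_a": sum(1 for x, y in zip(a, b) if x - y > big_margin),
        "big_wins_b": sum(1 for x, y in zip(a, b) if y - x > big_margin),
        # Samples where b fails but a still scores at or above rescue_at.
        "b_failures_rescued_by_a": sum(
            1 for x, y in zip(a, b) if y < fail_below and x >= rescue_at
        ),
    }


# Toy scores only, not benchmark data.
system_a = [0.95, 0.92, 0.96, 0.90, 0.94, 0.97]
system_b = [1.00, 0.30, 0.98, 1.00, 0.40, 0.99]
report = complementarity(system_a, system_b)
print(report)
```

The oracle mean can never fall below either system's own mean, so the gap between it and the best single system is one way to express how complementary the two sets of failures are.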
PulseBench-Tab is useful, but the dataset is not clean.
We still treat the benchmark as valuable because it stresses real table structure. But our audit found annotation and labeling issues that make small deltas and language-level claims fragile.
Language labels are contaminated
Filename prefixes do not always match document language. Arabic has 81 non-Arabic files, French has 37, Korean has 21, and Greek historical contains no Greek script.
Arabic column order is inconsistent
True Arabic files use both RTL and LTR HTML ordering conventions without a documented rule. That means structurally correct outputs can be penalized for choosing the other convention.
Dot leaders are not standardized
Historical documents sometimes keep typographic dot leaders and sometimes strip them. Either model behavior is punished on the opposite convention.
Some samples are not meaningful table tests
Three source images are blank table grids where the score measures guessed grid dimensions rather than extraction quality.
Ground truth can be structurally degenerate
Several files collapse visual multi-row tables into single cells with line breaks, making high scores impossible for natural table extraction.
T-LAG amplifies small annotation choices
The text similarity kernel raises edit similarity to the seventh power, so punctuation, underscores, and number-format differences can collapse edge weights.
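To see why a seventh power is so sensitive, consider the sketch below. It is not the T-LAG scorer: `difflib.SequenceMatcher.ratio()` stands in for whatever edit similarity the official scorer uses, and the function names and example strings are illustrative. Only the exponent of 7 comes from the description above.

```python
from difflib import SequenceMatcher


def edit_similarity(a, b):
    """Stand-in similarity in [0, 1]; the official scorer may differ."""
    return SequenceMatcher(None, a, b).ratio()


def powered_kernel(a, b, power=7):
    """Raise the similarity to a power, as described for the text kernel."""
    return edit_similarity(a, b) ** power


# The power alone: modest similarity losses become large weight losses.
for s in (0.99, 0.95, 0.90, 0.80):
    print(f"similarity {s:.2f} -> weight {s ** 7:.4f}")

# A number-format difference (thousands separator).
number_sim = edit_similarity("1,234.50", "1234.50")
number_weight = powered_kernel("1,234.50", "1234.50")

# A form field where the ground truth keeps a long underscore fill-in line.
blank_weight = powered_kernel("12 " + "_" * 20, "12")

print(round(number_sim, 4), round(number_weight, 4), blank_weight)
```

A raw similarity of 0.90 already drops to roughly 0.48 after the power is applied, so a single stray comma or a run of underscores can remove most of an edge's weight even when the extracted value is otherwise right.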
Where the ground truth changes the meaning of the score.
These are not complaints that the documents are hard. They are cases where the released label convention, file label, or scoring kernel can penalize a reasonable extraction.
Same visual direction, different HTML convention
One Arabic table is encoded left-to-right even though the visual reading order is right-to-left. A similar file sets dir="rtl" on its table element. A correct model has no stable rule to follow, so either convention can be penalized.
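The effect of the two conventions can be illustrated with a simple position-sensitive comparison. This is a toy stand-in, not the graph-based T-LAG scorer, and the helper names and table contents are made up. It only shows that identical cell content, serialized in mirrored column order, stops lining up position by position.

```python
def cell_match_rate(pred, truth):
    """Fraction of cells that are equal at the same (row, column) position."""
    total = 0
    matched = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            total += 1
            matched += int(p == t)
    return matched / total if total else 0.0


def mirror_columns(table):
    """Reverse the column order of every row (LTR <-> RTL serialization)."""
    return [list(reversed(row)) for row in table]


# Placeholder cell text; a real Arabic table would hold Arabic strings.
visual_order = [
    ["name", "count", "year"],
    ["Ahmad", "12", "2020"],
]

other_convention = mirror_columns(visual_order)

same_rate = cell_match_rate(visual_order, visual_order)
mirrored_rate = cell_match_rate(other_convention, visual_order)
print(same_rate, round(mirrored_rate, 3))
```

In this three-column example only the middle column still aligns after mirroring, so a model that picks the other convention loses two thirds of its positional matches without misreading a single cell.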
Dot leaders are kept in one file and stripped in another
Historical tables sometimes keep dot leaders (runs of dots that link an entry to its value) in the ground truth and sometimes omit them. A model that preserves the visible dots and a model that strips them are each penalized, depending on the sample.
Blank grids become scored table extraction tasks
These samples contain blank table grids with no readable content. The score mostly reflects whether the model guessed the same empty grid dimensions, not whether it extracted table semantics.
A multi-row directory is collapsed into three cells
The image shows a staff directory with many individuals, but the ground truth stores long line-break lists inside a single-row, three-cell table. Natural row extraction can score poorly despite matching the visual structure.
Blank form lines dominate text similarity
The ground truth keeps long underscore fill-in lines inside cells. A model outputting the visible field numbers without underscores can be treated as almost completely different by the T-LAG text kernel.
A country list is encoded by columns, not records
The image is an eight-column country list, while the ground truth stores each full column as one cell with line breaks. A model extracting country rows is structurally incompatible with that annotation choice.