Overall T-LAG leaderboard

| Rank | System | Source | Coverage | Mean T-LAG |
| --- | --- | --- | --- | --- |
| 1 | Infratex Standard | This run | 100.0% | 0.9636 |
| 2 | Pulse Ultra 2 | Pulse published | 100.0% | 0.9347 |
| 3 | LlamaParse Agentic | Pulse published | 94.0% | 0.7977 |
| 4 | Reducto Agentic | Pulse published | 78.8% | 0.7953 |
| 5 | Extend | Pulse published | 91.9% | 0.7626 |
| 6 | Azure Document Intelligence | Pulse published | 92.0% | 0.7614 |
| 7 | Reducto | Pulse published | 80.4% | 0.7175 |
| 8 | AWS Textract | Pulse published | 98.5% | 0.6034 |
| 9 | Unstructured | Pulse published | 100.0% | 0.3603 |

PulseBench-Tab Result

Infratex ranks #1 on table parsing.

We ran the released PulseBench-Tab dataset through the standard Infratex model and scored the outputs with Pulse AI's official T-LAG scorer. Infratex Standard reaches 0.9636 mean T-LAG across 1,820 samples, ahead of Pulse Ultra 2 at 0.9347.

Mean T-LAG: 0.9636 (+2.89 pp over Pulse Ultra 2)
Coverage: 100% (1,820 / 1,820 samples scored)
Languages won: 9 / 10 (French is the only loss, and it is narrow)
Failures < 0.50: 6 (Pulse Ultra 2 has 68)

/ WHAT WE RAN

Same scorer, full coverage, direct comparison.

PulseBench-Tab evaluates table extraction with a graph-based score over table structure and cell text. Our reported score uses the unmodified official scorer and 100% sample coverage. External provider values in the opening chart are the Pulse-published leaderboard values.

Methodology summary: 1,820 sample images, official T-LAG scoring, the standard Infratex model, and stored Pulse Ultra 2 predictions as the direct all-sample baseline. The detailed sections below focus on Infratex versus Pulse Ultra 2 because both are scored over the full local artifact set used in this analysis.

/ BY LANGUAGE

Infratex leads in 9 of 10 language groups.

The French loss is narrow: 0.28 percentage points (0.9681 vs 0.9709). The larger pattern is the higher floor: Infratex has fewer scores below 0.50 in every language, and the gap is widest in Greek, English, Korean, German, and Spanish.

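| Language | Samples | Infratex | Pulse Ultra 2 | Delta | < 0.50 failures (Infratex vs Pulse Ultra 2) | P10 floor |
| --- | --- | --- | --- | --- | --- | --- |
| Greek | 35 | 0.9693 | 0.8498 | +0.1195 | 0 vs 8 | 0.8971 |
| English | 517 | 0.9608 | 0.9221 | +0.0387 | 2 vs 24 | 0.8850 |
| Korean | 105 | 0.9646 | 0.9233 | +0.0413 | 0 vs 5 | 0.9046 |
| German | 113 | 0.9692 | 0.9315 | +0.0377 | 0 vs 4 | 0.8972 |
| Spanish | 169 | 0.9699 | 0.9355 | +0.0344 | 1 vs 9 | 0.9162 |
| Arabic | 225 | 0.9625 | 0.9300 | +0.0326 | 2 vs 5 | 0.9016 |
| Russian | 163 | 0.9600 | 0.9364 | +0.0236 | 0 vs 6 | 0.9037 |
| Chinese | 160 | 0.9632 | 0.9510 | +0.0123 | 1 vs 3 | 0.9022 |
| Japanese | 170 | 0.9615 | 0.9534 | +0.0082 | 0 vs 3 | 0.8887 |
| French | 163 | 0.9681 | 0.9709 | -0.0028 | 0 vs 1 | 0.8535 |
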
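
As a consistency check, the per-language means above, weighted by sample count, should reproduce the overall means. The sketch below uses only the values from the language table; the function and variable names are illustrative, and because the inputs are already rounded to four decimals the result is approximate.

```python
# Per-language rows copied from the language table:
# (language, samples, Infratex mean, Pulse Ultra 2 mean)
ROWS = [
    ("Greek", 35, 0.9693, 0.8498),
    ("English", 517, 0.9608, 0.9221),
    ("Korean", 105, 0.9646, 0.9233),
    ("German", 113, 0.9692, 0.9315),
    ("Spanish", 169, 0.9699, 0.9355),
    ("Arabic", 225, 0.9625, 0.9300),
    ("Russian", 163, 0.9600, 0.9364),
    ("Chinese", 160, 0.9632, 0.9510),
    ("Japanese", 170, 0.9615, 0.9534),
    ("French", 163, 0.9681, 0.9709),
]


def weighted_mean(rows, col):
    """Sample-weighted mean of column `col` (2 = Infratex, 3 = Pulse Ultra 2)."""
    total = sum(r[1] for r in rows)
    return sum(r[1] * r[col] for r in rows) / total


total_samples = sum(r[1] for r in ROWS)
infratex_overall = weighted_mean(ROWS, 2)
pulse_overall = weighted_mean(ROWS, 3)
delta_pp = (infratex_overall - pulse_overall) * 100

print(total_samples)               # 1820
print(round(infratex_overall, 4))  # 0.9636
print(round(pulse_overall, 4))     # 0.9347
print(round(delta_pp, 2))          # 2.89
```

The weighted means match the headline figures, so the language breakdown and the overall score describe the same 1,820 samples.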
/ RELIABILITY

The rank comes from consistency, not just peaks.

Pulse Ultra 2 has more perfect scores, but it also has a much heavier failure tail. Infratex gives up some exact 1.0s while keeping 99 more samples at or above 0.90 (1,619 vs 1,520).

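| Metric | Infratex | Pulse Ultra 2 | Meaning |
| --- | --- | --- | --- |
| Std deviation | 0.0636 | 0.1553 | 2.4x tighter |
| P10 floor | 0.8942 | 0.7829 | +0.111 |
| >= 0.90 | 1,619 | 1,520 | +99 samples |
| >= 0.85 | 1,792 | 1,576 | +216 samples |
| < 0.30 | 2 | 32 | 16x fewer |
| Zero scores | 0 | 3 | No hard zeroes |
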
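
The reliability metrics above can be computed from a flat list of per-sample scores. The following is a minimal sketch, not the analysis code behind this page: `reliability_summary` is a hypothetical helper, the 10th percentile uses a nearest-rank rule, and the standard deviation is the population form. The page does not state which percentile or deviation convention was used, so values from this sketch may differ slightly from the table.

```python
import statistics


def reliability_summary(scores):
    """Summarize per-sample scores in [0, 1] (hypothetical helper)."""
    ordered = sorted(scores)
    n = len(ordered)
    # Nearest-rank 10th percentile: index ceil(0.10 * n) - 1, in integer math.
    p10_index = max(0, (n + 9) // 10 - 1)
    return {
        "mean": statistics.fmean(ordered),
        "std": statistics.pstdev(ordered),  # population std (assumption)
        "p10": ordered[p10_index],
        "ge_0.90": sum(s >= 0.90 for s in ordered),
        "ge_0.85": sum(s >= 0.85 for s in ordered),
        "lt_0.50": sum(s < 0.50 for s in ordered),
        "lt_0.30": sum(s < 0.30 for s in ordered),
        "zeros": sum(s == 0.0 for s in ordered),
    }


# Toy scores for illustration only, not benchmark data.
toy_scores = [1.0, 0.97, 0.95, 0.92, 0.91, 0.88, 0.86, 0.60, 0.25, 0.0]
summary = reliability_summary(toy_scores)
for name, value in summary.items():
    print(name, value)
```

Running the same helper over both systems' score lists would give the two columns of the table side by side.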
Score distribution

More mass in the reliable band.

| Score band | Infratex | Pulse Ultra 2 |
| --- | --- | --- |
| 1.00 | 851 | 1,053 |
| 0.95-1.00 | 452 | 220 |
| 0.90-0.95 | 452 | 238 |
| 0.85-0.90 | 17 | 55 |
| 0.70-0.85 | 20 | 125 |
| <0.70 | 28 | 119 |

/ MATH STATS

The two systems fail on different samples.

Sample-level analysis shows complementary behavior. Infratex wins the mean because its wins are large recoveries on cases where Pulse collapses, while many Pulse wins are narrow.

Ceiling estimate: 0.9863
Hypothetical upper bound from sample-level complementarity between Infratex and Pulse Ultra 2 outputs.

Correlation: r = 0.1045
Per-sample scores of Infratex and Pulse Ultra 2 are only weakly correlated.

Big wins: 60 vs 4 (Infratex vs Pulse Ultra 2)
Samples won by more than 0.50 T-LAG points.

Pulse failure rescue: 61 / 68
Pulse Ultra 2 failures (scores below 0.50) where Infratex still scored at least 0.85.
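
The four statistics above can be derived from two aligned lists of per-sample scores. This is a sketch under stated assumptions: the ceiling is taken to be the mean of the per-sample maximum (an oracle that picks the better system for each sample), which is one plausible reading of "upper bound from complementarity" and is not confirmed by the text; the 0.50 and 0.85 thresholds are the ones quoted above; `complementarity` and `pearson` are hypothetical helper names.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def complementarity(a, b, big=0.50, fail=0.50, rescue=0.85):
    """Compare system `a` against system `b` sample by sample."""
    pairs = list(zip(a, b))
    b_failures = [(x, y) for x, y in pairs if y < fail]
    return {
        # Assumed definition: mean of the better score per sample.
        "ceiling": sum(max(x, y) for x, y in pairs) / len(pairs),
        "correlation": pearson(a, b),
        "big_wins_a": sum(x - y > big for x, y in pairs),
        "big_wins_b": sum(y - x > big for x, y in pairs),
        "b_failures": len(b_failures),
        "b_failures_rescued": sum(x >= rescue for x, _ in b_failures),
    }


# Toy scores for illustration only, not benchmark data.
system_a = [0.95, 0.90, 0.98, 0.92, 0.40]
system_b = [1.00, 0.10, 0.97, 0.30, 1.00]
stats = complementarity(system_a, system_b)
print(stats)
```

In the toy data, system A loses narrowly where B is perfect but recovers both of B's collapses, which is the pattern the statistics above describe.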

/ DATASET AUDIT

PulseBench-Tab is useful, but the dataset is not clean.

We still treat the benchmark as valuable because it stresses real table structure. But our audit found annotation and labeling issues that make small deltas and language-level claims fragile.

176 files

Language labels are contaminated

Filename prefixes do not always match the document language. The Arabic set has 81 non-Arabic files, French has 37, Korean has 21, and the Greek historical set contains no Greek script.

73 / 71 split

Arabic column order is inconsistent

True Arabic files use both RTL and LTR HTML ordering conventions without a documented rule. That means structurally correct outputs can be penalized for choosing the other convention.

461 cells

Dot leaders are not standardized

Historical documents sometimes keep typographic dot leaders and sometimes strip them. Whichever behavior a model chooses, it is penalized on the samples labeled with the opposite convention.

3 blank grids

Some samples are not meaningful table tests

Three source images are blank table grids where the score measures guessed grid dimensions rather than extraction quality.

10+ cases

Ground truth can be structurally degenerate

Several files collapse visual multi-row tables into single cells with line breaks, making high scores impossible for natural table extraction.

power 7

T-LAG amplifies small annotation choices

The text similarity kernel raises edit similarity to the seventh power, so punctuation, underscores, and number-format differences can collapse edge weights.
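
To see why a seventh power is so sensitive, the sketch below raises a normalized edit similarity to the power 7. The exact similarity used inside the T-LAG scorer is not described here, so this assumes a plain Levenshtein distance divided by the longer string's length; `edit_similarity` and `kernel` are illustrative names, not the scorer's API, and the string pairs are made up.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def edit_similarity(a, b):
    """1 - distance / longer length (assumed normalization)."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def kernel(a, b, power=7):
    """Edit similarity raised to the given power."""
    return edit_similarity(a, b) ** power


# Illustrative pairs: number format, punctuation, underscore fill-in line.
pairs = [
    ("1,234.50", "1234.50"),
    ("Total:", "Total"),
    ("3. ________", "3."),
]
for truth, pred in pairs:
    print(repr(truth), repr(pred),
          round(edit_similarity(truth, pred), 4),
          round(kernel(truth, pred), 4))
```

Under these assumptions, a single missing comma in an eight-character number gives a similarity of 0.875, which the seventh power reduces to about 0.39. A trailing colon on a six-character label drops to about 0.28, and a stripped underscore line falls to nearly zero.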

Practical interpretation: the top-line score is reproducible under the released scorer, but the dataset should not be treated as perfectly objective ground truth. Language-level leaderboards are especially affected by cross-contamination, Arabic ordering conventions, dot-leader choices, and structurally degenerate labels.
Examples from the audit

Where the ground truth changes the meaning of the score.

These are not complaints that the documents are hard. They are cases where the released label convention, file label, or scoring kernel can penalize a reasonable extraction.

arabic_0070 / arabic_0056

Same visual direction, different HTML convention

One Arabic table is encoded left-to-right even though the visual reading order is right-to-left. A similar file uses <table dir="rtl">. A correct model has no stable rule to follow, so either convention can be penalized.
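
The effect of the two ordering conventions can be shown with a toy table. The sketch below uses a naive position-by-position cell match as a stand-in for the scorer (it is not T-LAG), and placeholder cell values rather than content from the named files. The same cells in mirrored column order match poorly until one side is reversed.

```python
def reverse_columns(rows):
    """Mirror the column order of every row."""
    return [list(reversed(row)) for row in rows]


def positional_match(pred, truth):
    """Fraction of cells that are equal at the same (row, column) position.

    A deliberately naive stand-in for a structure-aware scorer.
    """
    total = sum(len(row) for row in truth)
    hits = sum(
        p == t
        for pred_row, truth_row in zip(pred, truth)
        for p, t in zip(pred_row, truth_row)
    )
    return hits / total


# Placeholder cells; the two encodings hold identical content.
ltr_encoding = [["A", "B", "C"], ["1", "2", "3"]]
rtl_encoding = reverse_columns(ltr_encoding)

raw = positional_match(ltr_encoding, rtl_encoding)
normalized = positional_match(reverse_columns(ltr_encoding), rtl_encoding)
print(raw, normalized)
```

With three columns only the middle column lines up, so the raw match is one third even though every cell was read correctly. Reversing one side restores a full match.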

english_historical_0050 / 0022

Dot leaders are kept in one file and stripped in another

Historical tables sometimes include dot-leader strings in the ground truth and sometimes omit them. Preserving visible dots and stripping visible dots are both punished, depending on the sample.

arabic_0396 / arabic_0430 / chinese_0262

Blank grids become scored table extraction tasks

These samples contain blank table grids with no readable content. The score mostly reflects whether the model guessed the same empty grid dimensions, not whether it extracted table semantics.

japanese_0177

A multi-row directory is collapsed into three cells

The image shows a staff directory with many individuals, but the ground truth stores long line-break lists inside a single-row, three-cell table. Natural row extraction can score poorly despite matching the visual structure.

korean_0073

Blank form lines dominate text similarity

The ground truth keeps long underscore fill-in lines inside cells. A model outputting the visible field numbers without underscores can be treated as almost completely different by the T-LAG text kernel.

russian_0144

A country list is encoded by columns, not records

The image is an eight-column country list, while the ground truth stores each full column as one cell with line breaks. A model extracting country rows is structurally incompatible with that annotation choice.