Overall T-LAG leaderboard

Rank	System	Source	Coverage	Score
1	Infratex Standard	This run	100.0%	0.9636
2	Pulse Ultra 2	Pulse published	100.0%	0.9347
3	LlamaParse Agentic	Pulse published	94.0%	0.7977
4	Reducto Agentic	Pulse published	78.8%	0.7953
5	Extend	Pulse published	91.9%	0.7626
6	Azure Document Intelligence	Pulse published	92.0%	0.7614
7	Reducto	Pulse published	80.4%	0.7175
8	AWS Textract	Pulse published	98.5%	0.6034
9	Unstructured	Pulse published	100.0%	0.3603

PulseBench-Tab Result

Infratex ranks #1 on table parsing.

We ran the released PulseBench-Tab dataset through the standard Infratex model and scored the outputs with Pulse AI's official T-LAG scorer. Infratex Standard reaches 0.9636 mean T-LAG across 1,820 samples, ahead of Pulse Ultra 2 at 0.9347.


0.9636

Mean T-LAG

+2.89 pp over Pulse Ultra 2

100%

Coverage

1,820 / 1,820 samples scored

9 / 10

Languages Won

French is the only narrow loss

6

Failures < 0.50

Pulse Ultra 2 has 68

/ WHAT WE RAN

Same scorer, full coverage, direct comparison.

PulseBench-Tab evaluates table extraction with a graph-based score over table structure and cell text. Our reported score uses the unmodified official scorer and 100% sample coverage. External provider values in the opening chart are the Pulse-published leaderboard values.

Methodology summary: 1,820 sample images, official T-LAG scoring, the standard Infratex model, and Pulse Ultra 2 stored predictions as the direct all-sample baseline. The detailed sections below focus on Infratex versus Pulse Ultra 2 because both are scored over the full local artifact set used in this analysis.

/ BY LANGUAGE

Infratex leads in 9 of 10 language groups.

The narrow French loss is 0.28 percentage points. The larger pattern is the higher floor: fewer severe misses across every language, especially Greek, English, Korean, German, and Spanish.

Language	Samples	Infratex	Pulse Ultra 2	Delta	< 0.50 failures (Infratex vs Pulse Ultra 2)	P10 floor
Greek	35	0.9693	0.8498	+0.1195	0 vs 8	0.8971
English	517	0.9608	0.9221	+0.0387	2 vs 24	0.8850
Korean	105	0.9646	0.9233	+0.0413	0 vs 5	0.9046
German	113	0.9692	0.9315	+0.0377	0 vs 4	0.8972
Spanish	169	0.9699	0.9355	+0.0344	1 vs 9	0.9162
Arabic	225	0.9625	0.9300	+0.0326	2 vs 5	0.9016
Russian	163	0.9600	0.9364	+0.0236	0 vs 6	0.9037
Chinese	160	0.9632	0.9510	+0.0123	1 vs 3	0.9022
Japanese	170	0.9615	0.9534	+0.0082	0 vs 3	0.8887
French	163	0.9681	0.9709	-0.0028	0 vs 1	0.8535
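
As a consistency check on the table above, the sample-weighted mean of the per-language scores should reproduce the headline numbers. The sketch below uses only values copied from the table; the variable and function names are illustrative, and because the inputs are already rounded to four decimals the result is approximate.

```python
# Rows copied from the by-language table:
# (language, samples, Infratex mean, Pulse Ultra 2 mean,
#  Infratex failures < 0.50, Pulse Ultra 2 failures < 0.50)
ROWS = [
    ("Greek", 35, 0.9693, 0.8498, 0, 8),
    ("English", 517, 0.9608, 0.9221, 2, 24),
    ("Korean", 105, 0.9646, 0.9233, 0, 5),
    ("German", 113, 0.9692, 0.9315, 0, 4),
    ("Spanish", 169, 0.9699, 0.9355, 1, 9),
    ("Arabic", 225, 0.9625, 0.9300, 2, 5),
    ("Russian", 163, 0.9600, 0.9364, 0, 6),
    ("Chinese", 160, 0.9632, 0.9510, 1, 3),
    ("Japanese", 170, 0.9615, 0.9534, 0, 3),
    ("French", 163, 0.9681, 0.9709, 0, 1),
]


def weighted_mean(rows, col):
    """Mean of column `col`, weighted by the sample count in column 1."""
    total = sum(r[1] for r in rows)
    return sum(r[1] * r[col] for r in rows) / total


total_samples = sum(r[1] for r in ROWS)
infratex_mean = weighted_mean(ROWS, 2)
pulse_mean = weighted_mean(ROWS, 3)
infratex_failures = sum(r[4] for r in ROWS)
pulse_failures = sum(r[5] for r in ROWS)

print(total_samples, round(infratex_mean, 4), round(pulse_mean, 4))
# 1820 0.9636 0.9347
print(infratex_failures, pulse_failures)
# 6 68
```

The weighted means match the headline 0.9636 and 0.9347, and the per-language failure counts sum to 6 for Infratex against 68 for Pulse Ultra 2.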

/ RELIABILITY

The rank comes from consistency, not just peaks.

Pulse Ultra 2 has more perfect scores, but it also has a much heavier failure tail. Infratex gives up some exact 1.0s while keeping far more samples above 0.90.

Metric	Infratex	Pulse	Meaning
Std deviation	0.0636	0.1553	2.4x tighter
P10 floor	0.8942	0.7829	+0.111
>= 0.90	1,619	1,520	+99 samples
>= 0.85	1,792	1,576	+216 samples
< 0.30	2	32	16x fewer
Zero scores	0	3	No hard zeroes
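
For readers who want to compute the same kind of reliability metrics from their own per-sample score files, here is a minimal sketch. The function name, the nearest-rank percentile rule, and the use of population standard deviation are assumptions; the source does not state which conventions were used. The scores in the usage lines are made-up toy values, not benchmark data.

```python
import math
import statistics


def reliability_summary(scores):
    """Summarize a list of per-sample scores in [0, 1]."""
    ordered = sorted(scores)
    n = len(ordered)
    # Nearest-rank 10th percentile (an assumed convention).
    p10_index = max(0, math.ceil(0.10 * n) - 1)
    return {
        "std": statistics.pstdev(ordered),  # population std dev (assumed)
        "p10": ordered[p10_index],
        "ge_0.90": sum(s >= 0.90 for s in ordered),
        "ge_0.85": sum(s >= 0.85 for s in ordered),
        "lt_0.30": sum(s < 0.30 for s in ordered),
        "zeros": sum(s == 0.0 for s in ordered),
    }


# Toy scores for illustration only.
toy_scores = [1.0, 0.97, 0.93, 0.91, 0.88, 0.86, 0.95, 0.99, 0.25, 0.0]
summary = reliability_summary(toy_scores)
for name, value in summary.items():
    print(name, value)
```

Running the same function over each system's 1,820 scores would yield the two columns of a comparison like the one above, provided both systems are summarized with the same percentile and deviation conventions.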

Score distribution

More mass in the reliable band.

Score range	Infratex	Pulse Ultra 2
1.00	851	1053
0.95-1.00	452	220
0.90-0.95	452	238
0.85-0.90	17	55
0.70-0.85	20	125
<0.70	28	119

/ MATH STATS

The two systems fail on different samples.

Sample-level analysis shows complementary behavior. Infratex wins the mean because its wins are large recoveries on cases where Pulse collapses, while many Pulse wins are narrow.

Ceiling estimate

0.9863

Hypothetical upper bound from sample-level complementarity between Infratex and Pulse outputs.

Correlation

r = 0.1045

Near-independent sample scores between Infratex and Pulse Ultra 2.

Big wins

60 vs 4

Samples won by more than 0.50 T-LAG points.

Pulse failure rescue

61 / 68

Pulse failures where Infratex still scored at least 0.85.
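
The sample-level statistics above can be sketched as follows. The definitions of big wins (margin above 0.50) and rescues (other system below 0.50, this system at least 0.85) follow the text. The ceiling formula, taken here as the mean of the per-sample maximum, is an assumption about how the upper bound was derived, and the function names and toy scores are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def complementarity(a, b, big_margin=0.50, fail_below=0.50, rescue_at=0.85):
    """Compare per-sample scores of system a against system b."""
    pairs = list(zip(a, b))
    b_failures = [(x, y) for x, y in pairs if y < fail_below]
    return {
        # Assumed ceiling: an oracle that picks the better output per sample.
        "ceiling": sum(max(x, y) for x, y in pairs) / len(pairs),
        "correlation": pearson(a, b),
        "big_wins_a": sum(x - y > big_margin for x, y in pairs),
        "big_wins_b": sum(y - x > big_margin for x, y in pairs),
        "b_failures": len(b_failures),
        "b_failures_rescued": sum(x >= rescue_at for x, _ in b_failures),
    }


# Toy scores for illustration only, not benchmark data.
system_a = [0.95, 0.90, 1.00, 0.92, 0.30]
system_b = [1.00, 0.20, 0.40, 0.95, 0.98]
stats = complementarity(system_a, system_b)
for name, value in stats.items():
    print(name, value)
```

A low correlation combined with a ceiling well above either system's mean is what "the two systems fail on different samples" looks like in these numbers.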

/ DATASET AUDIT

PulseBench-Tab is useful, but the dataset is not clean.

We still treat the benchmark as valuable because it stresses real table structure. But our audit found annotation and labeling issues that make small deltas and language-level claims fragile.

176 files

Language labels are contaminated

Filename prefixes do not always match document language. Arabic has 81 non-Arabic files, French has 37, Korean has 21, and Greek historical contains no Greek script.

73 / 71 split

Arabic column order is inconsistent

True Arabic files use both RTL and LTR HTML ordering conventions without a documented rule. That means structurally correct outputs can be penalized for choosing the other convention.

461 cells

Dot leaders are not standardized

Historical documents sometimes keep typographic dot leaders and sometimes strip them. Whichever behavior a model adopts is penalized on samples labeled with the opposite convention.

3 blank grids

Some samples are not meaningful table tests

Three source images are blank table grids where the score measures guessed grid dimensions rather than extraction quality.

10+ cases

Ground truth can be structurally degenerate

Several files collapse visual multi-row tables into single cells with line breaks, making high scores impossible for natural table extraction.

power 7

T-LAG amplifies small annotation choices

The text similarity kernel raises edit similarity to the seventh power, so punctuation, underscores, and number-format differences can collapse edge weights.
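
To make the amplification concrete, the following sketch raises a normalized edit similarity to the seventh power. The normalization used here (1 minus the Levenshtein distance divided by the longer string's length) is an assumption for illustration; the official scorer may compute its similarity differently, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def edit_similarity(a, b):
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def kernel(a, b, power=7):
    """Edit similarity raised to the given power (7, per the audit)."""
    return edit_similarity(a, b) ** power


examples = [
    ("1,234.50", "1,234.50"),  # identical
    ("1234.50", "1,234.50"),   # one dropped thousands separator
    ("12", "12 ______"),       # field number without fill-in underscores
]
for predicted, truth in examples:
    sim = edit_similarity(predicted, truth)
    print(repr(predicted), repr(truth), round(sim, 4), round(kernel(predicted, truth), 4))
```

Under these assumptions, a single dropped thousands separator in an eight-character number starts at a similarity of 0.875 and falls to about 0.39 after the seventh power, and a cell that omits fill-in underscores falls to nearly zero.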

Practical interpretation: the top-line score is reproducible under the released scorer, but the dataset should not be treated as perfectly objective ground truth. Language-level leaderboards are especially affected by cross-contamination, Arabic ordering conventions, dot-leader choices, and structurally degenerate labels.

Examples from the audit

Where the ground truth changes the meaning of the score.

These are not complaints about hard documents. They are cases where the released label convention, file label, or scoring kernel can penalize a reasonable extraction.

arabic_0070 / arabic_0056

Same visual direction, different HTML convention

One Arabic table is encoded left-to-right even though the visual reading order is right-to-left. A similar file uses table dir=rtl. A correct model has no stable rule to follow, so either convention can be penalized.

english_historical_0050 / 0022

Dot leaders are kept in one file and stripped in another

Historical tables sometimes include dot-leader strings in the ground truth and sometimes omit them. A model that preserves visible dots and a model that strips them are each penalized, depending on the sample.

arabic_0396 / arabic_0430 / chinese_0262

Blank grids become scored table extraction tasks

These samples contain blank table grids with no readable content. The score mostly reflects whether the model guessed the same empty grid dimensions, not whether it extracted table semantics.

japanese_0177

A multi-row directory is collapsed into three cells

The image shows a staff directory with many individuals, but the ground truth stores long line-break lists inside a single-row, three-cell table. Natural row extraction can score poorly despite matching the visual structure.

korean_0073

Blank form lines dominate text similarity

The ground truth keeps long underscore fill-in lines inside cells. A model outputting the visible field numbers without underscores can be treated as almost completely different by the T-LAG text kernel.

russian_0144

A country list is encoded by columns, not records

The image is an eight-column country list, while the ground truth stores each full column as one cell with line breaks. A model extracting country rows is structurally incompatible with that annotation choice.