Major 'inconsistent text' problem when handling empty cells #28

Closed
martinreynaert opened this issue Mar 25, 2024 · 9 comments
@martinreynaert

Hi,

I have 568 fine FoLiA XML files, first obtained by running tei2folia and then tokenized with ucto. I need to add lemmatization and POS tags by way of spacy2folia.

This unexpectedly fails on 164 of them, or almost 29%, unfortunately.

As far as I can see, their stderr messages all report the same issue. Here is the core of the error report for the smallest file:

folia.main.ParseError: FoLiA exception in handling of <div> @ line 61 (in parent <text> @ parent line 60) : [InconsistentText] Text for <Division at 140369425161904 id=Collingwood-A-1935-PastMasters_eng_Collingwood_Philosophical_Texts_2nd_Release_The_Historical_Imagination-V1.text.div.1.div.1 set=https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/tei2folia/divisions.foliaset.ttl class=book>, is inconsistent: EXPECTED (deep text after normalization) *****>
Frontmatter Titlepage {ii} {1} THE HISTORICAL IMAGINATION An Inaugural Lecture DELIVERED BEFORE THE UNIVERSITY OF OXFORD ON 28 OCTOBER 1935 By R. G. COLLINGWOOD WAYNFLETE PROFESSOR OF METAPHYSICAL PHILOSOPHY OXFORD AT THE CLARENDON PRESS 1935 {2} Copyright page OXFORD UNIVERSITY PRESS AMEN HOUSE, E.C. 4 London | Edinburgh | Glasgow New York | Toronto | Melbourne Capetown | Bombay | Calcutta Madras | Shanghai HUMPHREY MILFORD PUBLISHER TO THE UNIVERSITY PRINTED IN GREAT BRITAIN {3}
****> BUT FOUND (strict text after normalization) ****>
Frontmatter Titlepage {ii} {1} THE HISTORICAL IMAGINATION An Inaugural Lecture DELIVERED BEFORE THE UNIVERSITY OF OXFORD ON 28 OCTOBER 1935 By R. G. COLLINGWOOD WAYNFLETE PROFESSOR OF METAPHYSICAL PHILOSOPHY OXFORD AT THE CLARENDON PRESS 1935 {2} Copyright page OXFORD UNIVERSITY PRESS AMEN HOUSE, E.C. 4 London | Edinburgh | Glasgow New York | Toronto | Melbourne Capetown | Bombay | Calcutta Madras | Shanghai | HUMPHREY MILFORD PUBLISHER TO THE UNIVERSITY PRINTED IN GREAT BRITAIN {3}
******* DEVIATION POINT: Shanghai <*HERE*>| HUMPHREY
(also checked against older rules prior to FoLiA v2.4.1)

I could get you the tokenized FoLiA version of this one.

I hope this can expediently be solved!

Thank you, as ever.

@proycon
Owner

proycon commented Mar 25, 2024 via email

@proycon
Owner

proycon commented Mar 25, 2024

It's not a spacy2folia issue as such, as foliavalidator already rejects the input file. However, folialint does see it as valid, so something fishy is going on:

$ foliavalidator Collingwood-A-1935-PastMasters_eng_Collingwood_Philosophical_Texts_2nd_Release_The_Historical_Imagination.Tok.folia.xml 
VALIDATION ERROR on full parse by library (stage 2/3), in Collingwood-A-1935-PastMasters_eng_Collingwood_Philosophical_Texts_2nd_Release_The_Historical_Imagination.Tok.folia.xml
ParseError: FoLiA exception in handling of <div> @ line 61 (in parent <text> @ parent line 60) : [InconsistentText] Text for <Division at 135350875733392 id=Collingwood-A-1935-PastMasters_eng_Collingwood_Philosophical_Texts_2nd_Release_The_Historical_Imagination-V1.text.div.1.div.1 set=https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/tei2folia/divisions.foliaset.ttl class=book>, is inconsistent: EXPECTED (deep text after normalization) *****>
Frontmatter Titlepage {ii} {1} THE HISTORICAL IMAGINATION An Inaugural Lecture DELIVERED BEFORE THE UNIVERSITY OF OXFORD ON 28 OCTOBER 1935 By R. G. COLLINGWOOD WAYNFLETE PROFESSOR OF METAPHYSICAL PHILOSOPHY OXFORD AT THE CLARENDON PRESS 1935 {2} Copyright page OXFORD UNIVERSITY PRESS AMEN HOUSE, E.C. 4 London | Edinburgh | Glasgow New York | Toronto | Melbourne Capetown | Bombay | Calcutta Madras | Shanghai HUMPHREY MILFORD PUBLISHER TO THE UNIVERSITY PRINTED IN GREAT BRITAIN {3}
****> BUT FOUND (strict text after normalization) ****>
Frontmatter Titlepage {ii} {1} THE HISTORICAL IMAGINATION An Inaugural Lecture DELIVERED BEFORE THE UNIVERSITY OF OXFORD ON 28 OCTOBER 1935 By R. G. COLLINGWOOD WAYNFLETE PROFESSOR OF METAPHYSICAL PHILOSOPHY OXFORD AT THE CLARENDON PRESS 1935 {2} Copyright page OXFORD UNIVERSITY PRESS AMEN HOUSE, E.C. 4 London | Edinburgh | Glasgow New York | Toronto | Melbourne Capetown | Bombay | Calcutta Madras | Shanghai | HUMPHREY MILFORD PUBLISHER TO THE UNIVERSITY PRINTED IN GREAT BRITAIN {3}
******* DEVIATION POINT:  Shanghai <*HERE*>| HUMPHREY
(also checked against older rules prior to FoLiA v2.4.1)

I assume the table structure was already present in the TEI, there are also some empty cells and paragraphs in the relevant section:

              <cell xml:id="Collingwood-A-1935-PastMasters_eng_Collingwood_Philosophical_Texts_2nd_Release_The_Historical_Imagination-V1.text.div.1.div.1.div.2.table.1.row.4.cell.2">
                <t> Shanghai </t>
                <s xml:id="Collingwood-A-1935-PastMasters_eng_Collingwood_Philosophical_Texts_2nd_Release_The_Historical_Imagination-V1.text.div.1.div.1.div.2.table.1.row.4.cell.2.s.1">
                  <t>Shanghai</t>
                  <w xml:id="Collingwood-A-1935-PastMasters_eng_Collingwood_Philosophical_Texts_2nd_Release_The_Historical_Imagination-V1.text.div.1.div.1.div.2.table.1.row.4.cell.2.s.1.w.1" class="WORD">
                    <t>Shanghai</t>
                  </w>
                </s>
              </cell>
              <cell xml:id="Collingwood-A-1935-PastMasters_eng_Collingwood_Philosophical_Texts_2nd_Release_The_Historical_Imagination-V1.text.div.1.div.1.div.2.table.1.row.4.cell.3"/>
            </row>
          </table>
          <p xml:id="Collingwood-A-1935-PastMasters_eng_Collingwood_Philosophical_Texts_2nd_Release_The_Historical_Imagination-V1.text.div.1.div.1.div.2.p.4" class="p"/>
          <p xml:id="Collingwood-A-1935-PastMasters_eng_Collingwood_Philosophical_Texts_2nd_Release_The_Historical_Imagination-V1.text.div.1.div.1.div.2.p.5" class="p">
            <t> HUMPHREY MILFORD
 </t>

@proycon
Owner

proycon commented Mar 25, 2024

So who is right regarding the text validation here? The textdelimiter for cell is |. foliapy does not output the last text delimiter for a cell because, if I recall correctly, it is superseded by that of row and table (which would be correct behaviour). However, there is an empty cell here, and that should cause the text delimiter after Shanghai to be output (which is probably why libfolia does so).

If I remove the empty cell, and I remove the trailing | from the original text (which is cheating, I know), both validators accept it.

All things considered, I'd say foliapy is in the wrong here, so I am moving this issue there.
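
The disagreement described above can be illustrated with a small sketch. It is not foliapy's or libfolia's actual code: `row_text` and its `keep_empty_cells` flag are hypothetical names, and joining tokens with single spaces is an assumption standing in for FoLiA's text normalization. It only shows how ignoring an empty trailing cell changes whether the `|` after Shanghai is emitted.

```python
from typing import List

CELL_DELIMITER = "|"  # the textdelimiter for cell, as stated in the comment above


def row_text(cells: List[str], keep_empty_cells: bool) -> str:
    """Join the cells of one row, omitting the delimiter after the last cell.

    keep_empty_cells=False drops empty cells before deciding which cell is
    last (the behaviour that loses the delimiter); keep_empty_cells=True lets
    an empty cell still count as a cell, so the delimiter before it is kept.
    """
    if not keep_empty_cells:
        cells = [cell for cell in cells if cell.strip()]
    tokens: List[str] = []
    for index, cell in enumerate(cells):
        text = cell.strip()
        if text:
            tokens.append(text)
        # The delimiter after the final cell is left to the row/table level.
        if index < len(cells) - 1:
            tokens.append(CELL_DELIMITER)
    return " ".join(tokens)


# Row 4 of the table in the excerpt; the first cell's content is inferred
# from the error message, the third cell is the empty one.
row = ["Madras", " Shanghai ", ""]
without_empty = row_text(row, keep_empty_cells=False)
with_empty = row_text(row, keep_empty_cells=True)
```

Under these assumptions `without_empty` ends right after Shanghai, while `with_empty` carries the extra trailing delimiter that the stored text contains.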

@proycon proycon self-assigned this Mar 25, 2024
@proycon proycon transferred this issue from proycon/spacy2folia Mar 25, 2024
@proycon proycon added the bug Something isn't working label Mar 25, 2024
@proycon
Owner

proycon commented Mar 25, 2024

This relates to proycon/foliatools#41

@proycon proycon changed the title Major 'inconsistent text' problem Major 'inconsistent text' problem when handling empty cells Mar 25, 2024
@proycon proycon added the ready Done but not released yet, pending closure on release label Mar 25, 2024
@proycon proycon closed this as completed Mar 26, 2024
@martinreynaert
Author

Hi proycon,

Thank you for working on this problem!

Alas, the issue is not completely solved. Of the 164 files that failed with the previous release, 44 still fail with this new one.

The error messages are much the same. Would it be possible to look into this further, please?

I will get you two of the still failing files via my usual route. Both validate with my folialint. Foliavalidator does find deviation points in both.

Hope those help!
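
For anyone wanting to reproduce such a tally, here is a minimal sketch that runs a validator over a batch of files and collects the ones it rejects. It assumes `foliavalidator` is on the PATH, accepts a single file argument as in the command shown earlier in this thread, and exits with a non-zero status on a validation error; the last point is an assumption, not something confirmed here. `failing_files` is a hypothetical helper name.

```python
import subprocess
from pathlib import Path
from typing import Iterable, List, Sequence, Union


def failing_files(
    paths: Iterable[Union[str, Path]],
    command: Sequence[str] = ("foliavalidator",),
) -> List[Path]:
    """Return the files for which the validator command exits non-zero.

    Assumption: a non-zero exit status means the file was rejected.
    """
    failed: List[Path] = []
    for path in paths:
        result = subprocess.run(
            [*command, str(path)],
            capture_output=True,  # keep stdout/stderr out of the terminal
            text=True,
        )
        if result.returncode != 0:
            failed.append(Path(path))
    return failed
```

For example, `failing_files(sorted(Path(".").glob("*.folia.xml")))` would list the rejected files in the current directory, and comparing the length of that list before and after an upgrade gives the kind of count reported above.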

@proycon proycon reopened this Mar 27, 2024
@proycon
Owner

proycon commented Mar 27, 2024

In these two files that still fail, we see an empty row with empty cells:

Error (excerpt):

Expected (deep text after normalization):  of bees. 387 |
Found (strict text):  of bees. | | 387 
******* DEVIATION POINT:  of bees. <*HERE*>| | 387 |

Original text:

359 | 288, n. 366, 292-3 | Hermann Müller on sexual differences of bees.
  |   |  
387 | 308 | Sounds produced by moths.

FoLiA XML:

                  <w xml:id="Darwin-A-1877-PastMasters_eng_The_Descent_of_Man_and_Selection_in_Relation_to_Sex_PartI-V1.text.div.1.div.1.div.5.table.3.row.9.cell.3.s.1.w.7" class="WORD" space="no">
                    <t>bees</t>
                  </w>
                  <w xml:id="Darwin-A-1877-PastMasters_eng_The_Descent_of_Man_and_Selection_in_Relation_to_Sex_PartI-V1.text.div.1.div.1.div.5.table.3.row.9.cell.3.s.1.w.8" class="PUNCTUATION">
                    <t>.</t>
                  </w>
                </s>
              </cell>
            </row>
            <row xml:id="Darwin-A-1877-PastMasters_eng_The_Descent_of_Man_and_Selection_in_Relation_to_Sex_PartI-V1.text.div.1.div.1.div.5.table.3.row.10">
              <cell xml:id="Darwin-A-1877-PastMasters_eng_The_Descent_of_Man_and_Selection_in_Relation_to_Sex_PartI-V1.text.div.1.div.1.div.5.table.3.row.10.cell.1"/>
              <cell xml:id="Darwin-A-1877-PastMasters_eng_The_Descent_of_Man_and_Selection_in_Relation_to_Sex_PartI-V1.text.div.1.div.1.div.5.table.3.row.10.cell.2"/>
              <cell xml:id="Darwin-A-1877-PastMasters_eng_The_Descent_of_Man_and_Selection_in_Relation_to_Sex_PartI-V1.text.div.1.div.1.div.5.table.3.row.10.cell.3"/>
            </row>
            <row xml:id="Darwin-A-1877-PastMasters_eng_The_Descent_of_Man_and_Selection_in_Relation_to_Sex_PartI-V1.text.div.1.div.1.div.5.table.3.row.11">
              <cell xml:id="Darwin-A-1877-PastMasters_eng_The_Descent_of_Man_and_Selection_in_Relation_to_Sex_PartI-V1.text.div.1.div.1.div.5.table.3.row.11.cell.1">
                <t> 387 </t>
                <s xml:id="Darwin-A-1877-PastMasters_eng_The_Descent_of_Man_and_Selection_in_Relation_to_Sex_PartI-V1.text.div.1.div.1.div.5.table.3.row.11.cell.1.s.1">
                  <t>387</t>
                  <w xml:id="Darwin-A-1877-PastMasters_eng_The_Descent_of_Man_and_Selection_in_Relation_to_Sex_PartI-V1.text.div.1.div.1.div.5.table.3.row.11.cell.1.s.1.w.1" class="NUMBER">
                    <t>387</t>
                  </w>
                </s>
              </cell>

The reconstructed (deep) text is indeed missing two textdelimiters for cell.
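
A rough way to see where the two missing delimiters come from is to rebuild the table text with and without the empty row. This is a sketch under stated assumptions, not foliapy's implementation: `table_text` is a hypothetical helper, rows are assumed to be separated by plain whitespace after normalization, and the cell contents are copied from the original text quoted above.

```python
from typing import List


def table_text(rows: List[List[str]], keep_empty_cells: bool) -> str:
    """Rebuild normalized table text, using | between the cells of a row."""
    row_texts: List[str] = []
    for cells in rows:
        if not keep_empty_cells:
            # Empty cells vanish entirely, together with their delimiters.
            cells = [cell for cell in cells if cell.strip()]
        tokens: List[str] = []
        for index, cell in enumerate(cells):
            if cell.strip():
                tokens.append(cell.strip())
            if index < len(cells) - 1:
                tokens.append("|")
        if tokens:
            row_texts.append(" ".join(tokens))
    return " ".join(row_texts)


rows = [
    ["359", "288, n. 366, 292-3", "Hermann Müller on sexual differences of bees."],
    ["", "", ""],  # row 10: three empty cells
    ["387", "308", "Sounds produced by moths."],
]

strict_like = table_text(rows, keep_empty_cells=True)
deep_like = table_text(rows, keep_empty_cells=False)
missing = strict_like.count("|") - deep_like.count("|")
```

With three empty cells in the row, two delimiters sit between them, which matches the `| |` at the deviation point and the count of two missing delimiters.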

@proycon
Owner

proycon commented Mar 27, 2024

@martinreynaert I implemented a fix again. Can you try the development version first? (pip install -U git+https://github.com/proycon/foliapy), just to ensure they all pass now prior to release.

@martinreynaert
Author

This has helped, thanks, proycon!

Six of the 44 files now fail on what seem to be right-to-left characters, though. Only foliavalidator sees these problems; tei2folia, ucto, and folialint do not.

Can you please fix this too?
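
To narrow down which characters are involved, a small diagnostic like the following may help. It is only a locating aid built on Python's standard `unicodedata` module; the thread does not say what exactly goes wrong with these characters, and the choice of bidirectional classes in `RTL_CLASSES` is an assumption about what counts as "right-to-left" here. `rtl_characters` is a hypothetical helper name.

```python
import unicodedata
from typing import List, Tuple

# Bidirectional classes treated as right-to-left in this sketch:
# strong R and AL letters/marks, plus the RTL embedding, override and isolate controls.
RTL_CLASSES = {"R", "AL", "RLE", "RLO", "RLI"}


def rtl_characters(text: str) -> List[Tuple[int, str, str]]:
    """Return (offset, character, Unicode name) for each right-to-left character."""
    return [
        (offset, char, unicodedata.name(char, "UNKNOWN"))
        for offset, char in enumerate(text)
        if unicodedata.bidirectional(char) in RTL_CLASSES
    ]
```

Running this over the text just before a reported deviation point would show whether a Hebrew or Arabic letter, or an invisible mark such as U+200F, sits at that position.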

@proycon
Owner

proycon commented Mar 29, 2024

Fixed in v2.5.11

@proycon proycon closed this as completed Mar 29, 2024