Major 'inconsistent text' problem when handling empty cells #28
On Mon Mar 25, 2024 at 3:05 PM CET, martinreynaert wrote:
I could get you the tokenized FoLiA version of this one.
Yes, please, that would help in debugging this issue.
|
It's not a spacy2folia issue as such, as foliavalidator already rejects the input file. However, folialint does see it as valid, so something fishy is going on:
I assume the table structure was already present in the TEI; there are also some empty cells and paragraphs in the relevant section:
<cell xml:id="Collingwood-A-1935-PastMasters_eng_Collingwood_Philosophical_Texts_2nd_Release_The_Historical_Imagination-V1.text.div.1.div.1.div.2.table.1.row.4.cell.2">
<t> Shanghai </t>
<s xml:id="Collingwood-A-1935-PastMasters_eng_Collingwood_Philosophical_Texts_2nd_Release_The_Historical_Imagination-V1.text.div.1.div.1.div.2.table.1.row.4.cell.2.s.1">
<t>Shanghai</t>
<w xml:id="Collingwood-A-1935-PastMasters_eng_Collingwood_Philosophical_Texts_2nd_Release_The_Historical_Imagination-V1.text.div.1.div.1.div.2.table.1.row.4.cell.2.s.1.w.1" class="WORD">
<t>Shanghai</t>
</w>
</s>
</cell>
<cell xml:id="Collingwood-A-1935-PastMasters_eng_Collingwood_Philosophical_Texts_2nd_Release_The_Historical_Imagination-V1.text.div.1.div.1.div.2.table.1.row.4.cell.3"/>
</row>
</table>
<p xml:id="Collingwood-A-1935-PastMasters_eng_Collingwood_Philosophical_Texts_2nd_Release_The_Historical_Imagination-V1.text.div.1.div.1.div.2.p.4" class="p"/>
<p xml:id="Collingwood-A-1935-PastMasters_eng_Collingwood_Philosophical_Texts_2nd_Release_The_Historical_Imagination-V1.text.div.1.div.1.div.2.p.5" class="p">
<t> HUMPHREY MILFORD
</t> |
So who is right regarding the text validation here? The textdelimiter for cell is If I remove the empty cell, and I remove the trailing All things considered I'd say foliapy is in the wrong here, moving this issue there. |
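To make the disagreement concrete, here is a minimal sketch of two ways a validator could rebuild a row's text from its cells. It is not foliapy's actual code: the function names are hypothetical, and the `" | "` delimiter is an assumption inferred from the error output quoted in this thread.

```python
# Assumed cell delimiter, inferred from the " | " seen in the error output.
CELL_DELIMITER = " | "


def join_all(cells):
    """Join every cell text, so an empty cell still yields a delimiter."""
    return CELL_DELIMITER.join(cells)


def join_nonempty(cells):
    """Join only the cells that carry text, so empty cells leave no trace."""
    return CELL_DELIMITER.join(text for text in cells if text)


# A row whose last cell is empty, as in the Collingwood excerpt above.
row = ["Madras", "Shanghai", ""]
print(repr(join_all(row)))       # 'Madras | Shanghai | '
print(repr(join_nonempty(row)))  # 'Madras | Shanghai'
```

If the stored text and the reconstructed text follow different strategies, they diverge right after the last non-empty cell, which is where the deviation point is reported.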
This relates to proycon/foliatools#41 |
Hi proycon, Thank you for working on this problem! Alas, the issue is not completely solved. Of the 164 files that failed with the previous release, 44 still fail with this new one. The error messages are much the same. Would it be possible to look into this further, please? I will get you two of the still-failing files via my usual route. Both validate with my folialint. Foliavalidator does find deviation points in both. Hope those help! |
In these two files that still fail, we see an empty row with empty cells: Error (excerpt):
Original text:
FoLiA XML:
<w xml:id="Darwin-A-1877-PastMasters_eng_The_Descent_of_Man_and_Selection_in_Relation_to_Sex_PartI-V1.text.div.1.div.1.div.5.table.3.row.9.cell.3.s.1.w.7" class="WORD" space="no">
<t>bees</t>
</w>
<w xml:id="Darwin-A-1877-PastMasters_eng_The_Descent_of_Man_and_Selection_in_Relation_to_Sex_PartI-V1.text.div.1.div.1.div.5.table.3.row.9.cell.3.s.1.w.8" class="PUNCTUATION">
<t>.</t>
</w>
</s>
</cell>
</row>
<row xml:id="Darwin-A-1877-PastMasters_eng_The_Descent_of_Man_and_Selection_in_Relation_to_Sex_PartI-V1.text.div.1.div.1.div.5.table.3.row.10">
<cell xml:id="Darwin-A-1877-PastMasters_eng_The_Descent_of_Man_and_Selection_in_Relation_to_Sex_PartI-V1.text.div.1.div.1.div.5.table.3.row.10.cell.1"/>
<cell xml:id="Darwin-A-1877-PastMasters_eng_The_Descent_of_Man_and_Selection_in_Relation_to_Sex_PartI-V1.text.div.1.div.1.div.5.table.3.row.10.cell.2"/>
<cell xml:id="Darwin-A-1877-PastMasters_eng_The_Descent_of_Man_and_Selection_in_Relation_to_Sex_PartI-V1.text.div.1.div.1.div.5.table.3.row.10.cell.3"/>
</row>
<row xml:id="Darwin-A-1877-PastMasters_eng_The_Descent_of_Man_and_Selection_in_Relation_to_Sex_PartI-V1.text.div.1.div.1.div.5.table.3.row.11">
<cell xml:id="Darwin-A-1877-PastMasters_eng_The_Descent_of_Man_and_Selection_in_Relation_to_Sex_PartI-V1.text.div.1.div.1.div.5.table.3.row.11.cell.1">
<t> 387 </t>
<s xml:id="Darwin-A-1877-PastMasters_eng_The_Descent_of_Man_and_Selection_in_Relation_to_Sex_PartI-V1.text.div.1.div.1.div.5.table.3.row.11.cell.1.s.1">
<t>387</t>
<w xml:id="Darwin-A-1877-PastMasters_eng_The_Descent_of_Man_and_Selection_in_Relation_to_Sex_PartI-V1.text.div.1.div.1.div.5.table.3.row.11.cell.1.s.1.w.1" class="NUMBER">
<t>387</t>
</w>
</s>
</cell>
The reconstructed indeed misses two textdelimiters for cell. |
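For anyone wanting to check which of their files contain such empty cells before running spacy2folia, a small sketch using only the Python standard library follows. It is an illustration, not part of any FoLiA tool: the helper name `empty_cells` is hypothetical, and the sample document is heavily simplified (a real FoLiA file has metadata and a text body around the table).

```python
import xml.etree.ElementTree as ET

FOLIA_NS = "http://ilk.uvt.nl/folia"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def empty_cells(root):
    """Return the xml:id of every <cell> without child elements or text."""
    return [
        cell.get(XML_ID)
        for cell in root.iter(f"{{{FOLIA_NS}}}cell")
        if len(cell) == 0 and not (cell.text or "").strip()
    ]


# Simplified stand-in for a FoLiA document (assumption: real files are larger).
sample = (
    '<FoLiA xmlns="http://ilk.uvt.nl/folia">'
    '<table xml:id="t.1"><row xml:id="t.1.row.1">'
    '<cell xml:id="t.1.row.1.cell.1"><t>387</t></cell>'
    '<cell xml:id="t.1.row.1.cell.2"/>'
    "</row></table></FoLiA>"
)

print(empty_cells(ET.fromstring(sample)))
```

For a file on disk, `ET.parse(path).getroot()` can be passed to the same helper.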
@martinreynaert I implemented a fix again. Can you try the development version first? ( |
This has helped, thanks, proycon! Six of the 44 files now fail on what seem to be right-to-left characters, though. Only foliavalidator sees these problems; neither tei2folia, Ucto, nor folialint does. Can you please fix this too? |
Fixed in v2.5.11 |
Hi,
I have 568 fine FoLiA XML files, obtained first by running tei2folia, tokenized next by Ucto. I need to add lemmatization and POS by way of spacy2folia.
Unfortunately, this unexpectedly fails on 164, or almost 29%, of these.
As far as I can see, their stderr messages all report the same issue. I give you the core of the smallest file's error report:
folia.main.ParseError: FoLiA exception in handling of
Frontmatter Titlepage {ii} {1} THE HISTORICAL IMAGINATION An Inaugural Lecture DELIVERED BEFORE THE UNIVERSITY OF OXFORD ON 28 OCTOBER 1935 By R. G. COLLINGWOOD WAYNFLETE PROFESSOR OF METAPHYSICAL PHILOSOPHY OXFORD AT THE CLARENDON PRESS 1935 {2} Copyright page OXFORD UNIVERSITY PRESS AMEN HOUSE, E.C. 4 London | Edinburgh | Glasgow New York | Toronto | Melbourne Capetown | Bombay | Calcutta Madras | Shanghai HUMPHREY MILFORD PUBLISHER TO THE UNIVERSITY PRINTED IN GREAT BRITAIN {3}
****> BUT FOUND (strict text after normalization) ****>
Frontmatter Titlepage {ii} {1} THE HISTORICAL IMAGINATION An Inaugural Lecture DELIVERED BEFORE THE UNIVERSITY OF OXFORD ON 28 OCTOBER 1935 By R. G. COLLINGWOOD WAYNFLETE PROFESSOR OF METAPHYSICAL PHILOSOPHY OXFORD AT THE CLARENDON PRESS 1935 {2} Copyright page OXFORD UNIVERSITY PRESS AMEN HOUSE, E.C. 4 London | Edinburgh | Glasgow New York | Toronto | Melbourne Capetown | Bombay | Calcutta Madras | Shanghai | HUMPHREY MILFORD PUBLISHER TO THE UNIVERSITY PRINTED IN GREAT BRITAIN {3}
******* DEVIATION POINT: Shanghai <HERE>| HUMPHREY
(also checked against older rules prior to FoLiA v2.4.1)
I could get you the tokenized FoLiA version of this one.
I hope this can be solved expediently!
Thank you, as ever.
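The deviation point in a report like the one above can be reproduced by hand when comparing two long text dumps. The following is a minimal sketch, not the validator's own routine: `deviation_point` is a hypothetical helper, and the two strings are shortened excerpts of the texts quoted in the error report.

```python
def deviation_point(expected, found, context=10):
    """Return (index, marked excerpt) of the first difference, or None."""
    limit = min(len(expected), len(found))
    i = 0
    # Walk the common prefix of both strings.
    while i < limit and expected[i] == found[i]:
        i += 1
    if i == limit and len(expected) == len(found):
        return None  # the strings are identical
    excerpt = found[max(0, i - context):i] + "<HERE>" + found[i:i + context]
    return i, excerpt


# Shortened excerpts of the two texts from the error report.
expected = "Madras | Shanghai HUMPHREY MILFORD"
found = "Madras | Shanghai | HUMPHREY MILFORD"

print(deviation_point(expected, found))
```

The marker is placed in the second string, mirroring the `Shanghai <HERE>| HUMPHREY` line in the report.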