Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug with Open XML SDK v3.0 and excels with malformed hyperlinks #1636

Open
ilayna opened this issue Jan 4, 2024 · 13 comments · Fixed by #1644
Open

Bug with Open XML SDK v3.0 and excels with malformed hyperlinks #1636

ilayna opened this issue Jan 4, 2024 · 13 comments · Fixed by #1644

Comments

@ilayna
Copy link

ilayna commented Jan 4, 2024

Describe the bug
Getting the root element of a WorkbookPart of an excel with invalid hyperlinks causes the following exception:

System.ObjectDisposedException: Cannot access a closed stream.
Object name: 'DeflateStream'.
   at System.IO.Compression.DeflateStream.<EnsureNotDisposed>g__ThrowStreamClosedException|42_0()
   at System.IO.Compression.DeflateStream.ReadCore(Span`1 buffer)
   at System.IO.Compression.DeflateStream.Read(Byte[] buffer, Int32 offset, Int32 count)
   at System.IO.Packaging.ZipWrappingStream.Read(Byte[] buffer, Int32 offset, Int32 count)
   at System.Xml.XmlTextReaderImpl.InitStreamInput(Uri baseUri, String baseUriStr, Stream stream, Byte[] bytes, Int32 byteCount, Encoding encoding)
   at System.Xml.XmlTextReaderImpl.FinishInitStream()
   at System.Xml.XmlTextReaderImpl..ctor(Stream stream, Byte[] bytes, Int32 byteCount, XmlReaderSettings settings, Uri baseUri, String baseUriStr, XmlParserContext context, Boolean closeInput)
   at System.Xml.XmlReaderSettings.CreateReader(Stream input, Uri baseUri, String baseUriString, XmlParserContext inputContext)
   at System.Xml.XmlReader.Create(Stream input, XmlReaderSettings settings)
   at DocumentFormat.OpenXml.XmlConvertingReaderFactory.Create(Stream partStream, IOpenXmlNamespaceResolver resolver, XmlReaderSettings settings, Boolean strictRelationshipFound)
   at DocumentFormat.OpenXml.OpenXmlPartRootElement.LoadFromPart(OpenXmlPart openXmlPart, Stream partStream)
   at DocumentFormat.OpenXml.Packaging.OpenXmlPart.LoadDomTree[T]()
   at DocumentFormat.OpenXml.Packaging.WorkbookPart.get_Workbook()
   at DocumentFormat.OpenXml.Packaging.WorkbookPart.get_PartRootElement()
   at DocumentFormat.OpenXml.Packaging.OpenXmlPart.get_RootElement()

Opening the .xlsx file with excel, selecting all columns and clicking remove hyperlinks solves the problem.

To Reproduce

class Program
{
    private static async Task Main(string[] args)
    {
        var excelParser = new ExcelParser(NullLogger<ExcelParser>.Instance);
        var excels = Directory.GetFiles(@"Excels", "*.xlsx");
        await using var stream = File.OpenRead(excels.FirstOrDefault()); // this is the excel (believe me ;))
        await excelParser.ToolsFromExcelTask(stream);
    }
}


  public ValueTask<ICollection<ToolModel>> ToolsFromExcelTask(Stream excelStream)
    {
        using var spreadSheetDocument = SpreadsheetDocument.Open(excelStream, false);
        var workbookPart =  spreadSheetDocument.WorkbookPart;


        /*
         *  Bug lies somewhere in the following line, it throws an ObjectDisposedException
         *  on some excels (not all)
         */
        var sheets = workbookPart.RootElement.GetFirstChild<Sheets>().Elements<Sheet>().ToList();
      
        var sharedStringTable = workbookPart.SharedStringTablePart?.SharedStringTable;
        ........
}

Steps to reproduce the behavior:
Move the excel with the malformed hyperlinks to "Excels".
Run the following code.
See error.
Note: in practice, i tried it on many files, with some it worked others not so much...
with 3-4 files it seems to always happen with the same ones, when I tested with many many files I couldn't really point it to specific files.

Observed behavior
Threw an error.

Expected behavior
Not to.

Desktop (please complete the following information):

  • OS: MacOs 14.2.1
  • Office version 16.80
  • .NET Target: .NET Core 8 (tried with 6 also)
  • DocumentFormat.OpenXml Version: 3.0.0

Additional context
When I ran the same program with OpenXml 2.2 it threw UriMalformed (or something like that) and crashed at
using var spreadSheetDocument = SpreadsheetDocument.Open(excelStream, false);

Opening the .xlsx file with excel, selecting all columns and clicking remove hyperlinks solves the problem.
the hyperlinks are in the format of som_e_thing_@LIK_th is
I can't attach the files I tested with, sorry.

Might be related to Add better support for malformed URIs #1322

@ilayna ilayna changed the title Bug with Open XML SDK v3.0 Bug with Open XML SDK v3.0 and excels with malformed hyperlinks Jan 4, 2024
@tlecomte
Copy link

tlecomte commented Jan 5, 2024

We encountered the same problem and found that it comes from this line:

Calling .Position is only possible with a Stream where CanSeek is true. This is not true for a FileStream, or a stream from a ZipArchive.

A workaround is to copy the stream to a MemoryStream first. But maybe this method should check CanSeek before calling feature.Stream.Position = 0; ?

@ilayna
Copy link
Author

ilayna commented Jan 5, 2024

A workaround is to copy the stream to a MemoryStream first. But maybe this method should check CanSeek before calling feature.Stream.Position = 0; ?

I tried that, read it and then copied it to a memorysteam with .CopyTo, didn't change anything.
anything else worked ?

@twsouthwick
Copy link
Member

I'd rather not have to copy to a memory stream first - let me take a look at the repro. Thanks for bringing it to our attention

@twsouthwick twsouthwick self-assigned this Jan 8, 2024
@twsouthwick twsouthwick added this to the v3.1 milestone Jan 8, 2024
@twsouthwick twsouthwick added the bug label Jan 8, 2024
twsouthwick added a commit that referenced this issue Jan 9, 2024
- We must be able to seek to create a temporary file requried for uri handling
- We should throw the same exception we did in v2.x
- If compat mode is on, this will throw at open time (i.e. it will eagerly load everything)

Fixes #1636
@twsouthwick
Copy link
Member

Ok here's what I'm putting in for a v3.0.1 release:

  • Check if stream is seekable to enable malformed uri handling
  • Throw same exception as v2.x if malformed URI is found and not handled
  • NOTE: we now lazily expand parts so a malformed URI will be thrown when it is encountered instead of on package open; so you'll want to ensure your exception handling isn't expecting it on the SpreadsheetDocument.Open(...) call.

If you want to enable the framework to handle readonly/non-seekable streams, the current set up won't work well for that. I've been working on a builder pattern to set up a pipeline of sorts for what should happen when you open a package (see #1541). Once we get v3.0.1 out, I'll enable it again on ci builds for testing via the ci and we can look at an opt-in mechanism for scenarios like this.

@twsouthwick
Copy link
Member

@ilayna I couldn't repro the exact error your seeing, so the fix ups I'm pushing in may not actually fix this issue. Can you either verify once it's merged in or provide a repro I can try out?

twsouthwick added a commit that referenced this issue Jan 9, 2024
- We must be able to seek to create a temporary file requried for uri
handling
- We should throw the same exception we did in v2.x
- If compat mode is on, this will throw at open time (i.e. it will
eagerly load everything)

Fixes #1636
@twsouthwick
Copy link
Member

@ilayna the fix is merged in and a build will be available within the hour on the CI feed (documented on the README). Can you take a look to see if this fixes your issue?

@ilayna
Copy link
Author

ilayna commented Jan 9, 2024

@ilayna the fix is merged in and a build will be available within the hour on the CI feed (documented on the README). Can you take a look to see if this fixes your issue?

OC, I'll check and get back to you.

@ilayna
Copy link
Author

ilayna commented Jan 9, 2024

@twsouthwick
This didn't fix the issue..
I still get the following when a stream is opened with isEditable: false,

ObjectDisposedException: Cannot access a disposed object. Object name: 'System.IO.Compression.DeflateStream'.
System.IO.Compression.DeflateStream.ReadCore(Span<byte> buffer)
System.Xml.XmlTextReaderImpl.InitStreamInput(Uri baseUri, string baseUriStr, Stream stream, byte[] bytes, int byteCount, Encoding encoding)
System.Xml.XmlTextReaderImpl.FinishInitStream()
System.Xml.XmlTextReaderImpl..ctor(Stream stream, byte[] bytes, int byteCount, XmlReaderSettings settings, Uri baseUri, string baseUriStr, XmlParserContext context, bool closeInput)
System.Xml.XmlReaderSettings.CreateReader(Stream input, Uri baseUri, string baseUriString, XmlParserContext inputContext)
DocumentFormat.OpenXml.OpenXmlPartRootElement.LoadFromPart(OpenXmlPart openXmlPart, Stream partStream)
DocumentFormat.OpenXml.Packaging.OpenXmlPart.LoadDomTree<T>()
DocumentFormat.OpenXml.Packaging.WorkbookPart.get_Workbook()
DocumentFormat.OpenXml.Packaging.WorkbookPart.get_PartRootElement()
DocumentFormat.OpenXml.Packaging.OpenXmlPart.get_RootElement()

On the other hand, when I use

using var spreadSheetDocument = SpreadsheetDocument.Open(excelStream, true);, It works..
opens #1636

@twsouthwick twsouthwick reopened this Jan 9, 2024
@twsouthwick
Copy link
Member

@ilayna I'll need to see a full repro - I have added tests in #1644 that tries to mimic what you're describing (non-seekable, non editable) but I have not seen the issue you're showing.

@ilayna
Copy link
Author

ilayna commented Jan 9, 2024

Okay, I'll try to come up with something,
Anyway the stream is seekable and editable, it is being passed with editable false to spreadsheet open.

@twsouthwick
Copy link
Member

Thanks - I believe there is something else at play. I think I got fixated by #1636 (comment) when it was a different issue. Let me know when you have something - we'll be cutting a 3.0.1 release with the fixes so far, but will be able to do a 3.0.2 if this doesn't require public api changes.

@vogla
Copy link

vogla commented Mar 11, 2024

I've got the same problem after upgrading to version 3.0.1: When there is a malformed URI in the parsed excel file, I get "System.ObjectDisposedException: Cannot access a closed stream."
That happens because the new "solution" to malformed URIs in version 3.0.1 is to blindly try writing into the provided stream, without checking whether the stream is actually writeable. Please change the implementation to ensure that at least a recogniseable exception is thrown in this case - with a message that says something about a malformed URI - in order to make exception handling easier and more reliable. Thanks!

Besides, let me share how I deal with this problem in my app implementation: First I open the provided file-stream in read-only mode and try to parse it, because that is the most memory-efficient way to handle it. In case that an excepation is thrown, due to a malformed URI, The file-stream is copied into a byte-buffer. Then, a writable stream is opened from the byte-buffer and the document is parsed again from the buffer. Due to the stream being writable, the malformed URIs can be corrected in the second attempt. Since malformed URIs don't occur in many documents, this approach reduces memory consumption by applying a byte-buffer only when necessary. I wish there was a better solution, but this at least works.

@twsouthwick
Copy link
Member

This does appear similar to #1802 which I found a root cause for.

There's been some work in this area - can someone check with a CI build of 3.2 or wait for the release of that (later this week)?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants