[Audio] Soundfile/libsndfile requirements too stringent for decoding mp3 files #5659

sanchit-gandhi · 2023-03-22T10:07:33Z

Describe the bug

I'm encountering several issues trying to load mp3 audio files using datasets on a TPU v4.

The PR #5573 updated the audio loading logic to rely solely on the soundfile/libsndfile libraries for loading audio samples, regardless of their file type.

The installation guide suggests that libsndfile is bundled in when soundfile is pip installed:

datasets/docs/source/installation.md

Lines 70 to 71 in e1af108

    
    To decode mp3 files, you need to have at least version 1.1.0 of the `libsndfile` system library. Usually, it's bundled with the python [`soundfile`](https://github.com/bastibe/python-soundfile) package, which is installed as an extra audio dependency for 🤗 Datasets.
    For Linux, the required version of `libsndfile` is bundled with `soundfile` starting from version 0.12.0. You can run the following command to determine which version of `libsndfile` is being used by `soundfile`:

However, just pip installing soundfile==0.12.1 throws an error that libsndfile is missing:

pip install soundfile==0.12.1

Then:

>>> import soundfile
>>> soundfile.__libsndfile_version__

Traceback (most recent call last):

  File "/home/sanchitgandhi/hf/lib/python3.8/site-packages/soundfile.py", line 161, in <module>
    import _soundfile_data  # ImportError if this doesn't exist
ModuleNotFoundError: No module named '_soundfile_data'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/sanchitgandhi/hf/lib/python3.8/site-packages/soundfile.py", line 170, in <module>
    raise OSError('sndfile library not found using ctypes.util.find_library')
OSError: sndfile library not found using ctypes.util.find_library

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/sanchitgandhi/hf/lib/python3.8/site-packages/soundfile.py", line 192, in <module>
    _snd = _ffi.dlopen(_explicit_libname)
OSError: cannot load library 'libsndfile.so': libsndfile.so: cannot open shared object file: No such file or directory
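The traceback above shows three lookups failing in turn: the `_soundfile_data` package, `ctypes.util.find_library`, and finally a direct load of `libsndfile.so`. A minimal diagnostic sketch, using only the standard library, can report which of the first two would succeed on a given machine without importing soundfile itself. The function name `locate_libsndfile` is hypothetical and not part of soundfile or datasets.

```python
import ctypes.util
import importlib.util
from typing import Dict, Optional


def locate_libsndfile() -> Dict[str, Optional[str]]:
    """Report where a libsndfile binary could be picked up from (hypothetical helper)."""
    # 1) A binary shipped with the wheel lives in a helper package named _soundfile_data.
    spec = importlib.util.find_spec("_soundfile_data")
    bundled = None
    if spec is not None:
        locations = spec.submodule_search_locations
        bundled = list(locations)[0] if locations else spec.origin
    # 2) Otherwise a system-wide library has to be discoverable by name.
    system = ctypes.util.find_library("sndfile")
    return {"bundled_dir": bundled, "system_library": system}


if __name__ == "__main__":
    for name, value in locate_libsndfile().items():
        print(f"{name}: {value or 'not found'}")
```

If both entries come back as not found, the import is likely to fail the same way as in the traceback.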

Thus, I've followed the official instructions for installing the soundfile package from https://github.com/bastibe/python-soundfile#installation, which state that libsndfile needs to be installed separately as:

pip install --upgrade soundfile
sudo apt install libsndfile1

We can now import soundfile:

>>> import soundfile
>>> soundfile.__version__
'0.12.1'
>>> soundfile.__libsndfile_version__
'1.0.28'

We see that we have soundfile==0.12.1, which matches the datasets[audio] package constraints:

datasets/setup.py

Lines 144 to 147 in e1af108

    
    AUDIO_REQUIRE = [
        "soundfile>=0.12.1",
        "librosa",
    ]

But we have libsndfile==1.0.28, which is too low for decoding mp3 files:

datasets/src/datasets/config.py

Lines 136 to 138 in e1af108

    
    IS_MP3_SUPPORTED = importlib.util.find_spec("soundfile") is not None and version.parse(
        importlib.import_module("soundfile").__libsndfile_version__
    ) >= version.parse("1.1.0")
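To make the threshold concrete, here is a small standalone sketch of the same comparison. It is not the datasets implementation: it uses a simple tuple comparison instead of `packaging.version`, and the helper names `parse_version` and `supports_mp3` are hypothetical.

```python
import re
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '1.0.28' into (1, 0, 28); any non-numeric suffix is ignored."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"not a version string: {text!r}")
    parts = tuple(int(part) for part in match.group().split("."))
    # Pad to three components so that '1.1' compares equal to '1.1.0'.
    return parts + (0,) * (3 - len(parts))


def supports_mp3(libsndfile_version: str) -> bool:
    """Mirror the >= 1.1.0 threshold quoted from config.py above."""
    return parse_version(libsndfile_version) >= (1, 1, 0)


if __name__ == "__main__":
    print(supports_mp3("1.0.28"))  # the version reported in this issue
```

With soundfile importable, the string to pass in would be `soundfile.__libsndfile_version__`.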

Updating/upgrading the libsndfile doesn't change this:

sudo apt-get update
sudo apt-get upgrade

Is there any other suggestion for how to get a compatible libsndfile version? Currently, the version bundled with Ubuntu apt-get is too low for decoding mp3 files.

Maybe we could add this under setup.py such that we install the correct libsndfile version when we do pip install datasets[audio]? IMO this would help circumvent such version issues.

Steps to reproduce the bug

Environment described above. Loading mp3 files:

from datasets import load_dataset

common_voice_es = load_dataset("common_voice", "es", split="validation", streaming=True)
print(next(iter(common_voice_es)))

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[4], line 2
      1 common_voice_es = load_dataset("common_voice", "es", split="validation", streaming=True)
----> 2 print(next(iter(common_voice_es)))

File ~/datasets/src/datasets/iterable_dataset.py:941, in IterableDataset.__iter__(self)
    937 for key, example in ex_iterable:
    938     if self.features:
    939         # `IterableDataset` automatically fills missing columns with None.
    940         # This is done with `_apply_feature_types_on_example`.
--> 941         yield _apply_feature_types_on_example(
    942             example, self.features, token_per_repo_id=self._token_per_repo_id
    943         )
    944     else:
    945         yield example

File ~/datasets/src/datasets/iterable_dataset.py:700, in _apply_feature_types_on_example(example, features, token_per_repo_id)
    698 encoded_example = features.encode_example(example)
    699 # Decode example for Audio feature, e.g.
--> 700 decoded_example = features.decode_example(encoded_example, token_per_repo_id=token_per_repo_id)
    701 return decoded_example

File ~/datasets/src/datasets/features/features.py:1864, in Features.decode_example(self, example, token_per_repo_id)
   1850 def decode_example(self, example: dict, token_per_repo_id: Optional[Dict[str, Union[str, bool, None]]] = None):
   1851     """Decode example with custom feature decoding.
   1852 
   1853     Args:
   (...)
   1861         `dict[str, Any]`
   1862     """
-> 1864     return {
   1865         column_name: decode_nested_example(feature, value, token_per_repo_id=token_per_repo_id)
   1866         if self._column_requires_decoding[column_name]
   1867         else value
   1868         for column_name, (feature, value) in zip_dict(
   1869             {key: value for key, value in self.items() if key in example}, example
   1870         )
   1871     }

File ~/datasets/src/datasets/features/features.py:1865, in <dictcomp>(.0)
   1850 def decode_example(self, example: dict, token_per_repo_id: Optional[Dict[str, Union[str, bool, None]]] = None):
   1851     """Decode example with custom feature decoding.
   1852 
   1853     Args:
   (...)
   1861         `dict[str, Any]`
   1862     """
   1864     return {
-> 1865         column_name: decode_nested_example(feature, value, token_per_repo_id=token_per_repo_id)
   1866         if self._column_requires_decoding[column_name]
   1867         else value
   1868         for column_name, (feature, value) in zip_dict(
   1869             {key: value for key, value in self.items() if key in example}, example
   1870         )
   1871     }

File ~/datasets/src/datasets/features/features.py:1308, in decode_nested_example(schema, obj, token_per_repo_id)
   1305 elif isinstance(schema, (Audio, Image)):
   1306     # we pass the token to read and decode files from private repositories in streaming mode
   1307     if obj is not None and schema.decode:
-> 1308         return schema.decode_example(obj, token_per_repo_id=token_per_repo_id)
   1309 return obj

File ~/datasets/src/datasets/features/audio.py:167, in Audio.decode_example(self, value, token_per_repo_id)
    162     raise RuntimeError(
    163         "Decoding 'opus' files requires system library 'libsndfile'>=1.0.31, "
    164         'You can try to update `soundfile` python library: `pip install "soundfile>=0.12.1"`. '
    165     )
    166 elif not config.IS_MP3_SUPPORTED and audio_format == "mp3":
--> 167     raise RuntimeError(
    168         "Decoding 'mp3' files requires system library 'libsndfile'>=1.1.0, "
    169         'You can try to update `soundfile` python library: `pip install "soundfile>=0.12.1"`. '
    170     )
    172 if file is None:
    173     token_per_repo_id = token_per_repo_id or {}

RuntimeError: Decoding 'mp3' files requires system library 'libsndfile'>=1.1.0, You can try to update `soundfile` python library: `pip install "soundfile>=0.12.1"`.

Expected behavior

Load mp3 files!

Environment info

datasets version: 2.10.2.dev0
Platform: Linux-5.13.0-1023-gcp-x86_64-with-glibc2.29
Python version: 3.8.10
Huggingface_hub version: 0.13.1
PyArrow version: 11.0.0
Pandas version: 1.5.3
Soundfile version: 0.12.1
Libsndfile version: 1.0.28


sanchit-gandhi · 2023-03-22T10:08:56Z

cc @polinaeterna @lhoestq

polinaeterna · 2023-03-22T13:37:54Z

@sanchit-gandhi can you please also post the logs of pip install soundfile==0.12.1? That would show which wheel is being installed, or whether it's being built from source (I think it's the latter case).
The required libsndfile binary should be bundled with the soundfile wheel, but I assume that might not be the case for some non-standard Linux distributions.
In that case, the solution for using soundfile is to build libsndfile from source:

git clone https://github.com/libsndfile/libsndfile.git
cd libsndfile/
autoreconf -vif
./configure --enable-werror 
make
make install

This requires some build dependencies to be installed; on Debian/Ubuntu, something like:

apt install autoconf autogen automake build-essential libasound2-dev \
  libflac-dev libogg-dev libtool libvorbis-dev libopus-dev libmp3lame-dev \
  libmpg123-dev pkg-config python

but for other Linux distributions it might be different.

Once the binary is compiled, it should be placed where soundfile searches for it (a directory named _soundfile_data). The exact paths depend on where libsndfile (from the previous step) and soundfile were installed; it might be something like this:

cp /usr/local/lib/libsndfile.so /usr/local/lib/python3.7/dist-packages/_soundfile_data/
cp /usr/local/lib/libsndfile.la /usr/local/lib/python3.7/dist-packages/_soundfile_data/

Another solution is to skip soundfile entirely: set decode=False in the Audio feature and pass a custom processing function (for example, one using torchaudio) to .map.
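A rough sketch of that alternative follows. It assumes that, with decode=False, each example's "audio" field is a dict holding "path" and "bytes", and it takes the loader as a parameter so the same function works with torchaudio.load or any other reader returning a (waveform, sampling_rate) pair. The names `make_audio_decoder`, "waveform" and "sampling_rate" are hypothetical choices for this illustration.

```python
import io
from typing import Any, Callable, Dict, Tuple

Loader = Callable[[Any], Tuple[Any, int]]


def make_audio_decoder(load_fn: Loader) -> Callable[[Dict[str, Any]], Dict[str, Any]]:
    """Build a function for .map that decodes audio with a user-supplied loader."""

    def decode(example: Dict[str, Any]) -> Dict[str, Any]:
        audio = example["audio"]
        # Prefer in-memory bytes when present, otherwise fall back to the file path.
        if audio.get("bytes") is not None:
            source = io.BytesIO(audio["bytes"])
        else:
            source = audio["path"]
        waveform, sampling_rate = load_fn(source)
        example["waveform"] = waveform
        example["sampling_rate"] = sampling_rate
        return example

    return decode


# Usage sketch (assumes torchaudio is installed with a backend that reads mp3):
#   import torchaudio
#   from datasets import Audio
#   ds = ds.cast_column("audio", Audio(decode=False))
#   ds = ds.map(make_audio_decoder(torchaudio.load))
```

Injecting the loader keeps the mapping function independent of any single audio library, which is useful when the preferred decoder is unavailable on a given machine.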

lhoestq · 2023-03-22T13:40:59Z

Not sure if it may help, but you could also try updating pip before installing soundfile
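One possible reason upgrading pip helps is that an older pip may not recognize newer wheel platform tags and so may not pick the wheel that bundles the binary; that explanation is an assumption and is not confirmed in this thread. The sketch below reads the WHEEL metadata of an installed package to see whether it came from a platform-specific wheel. The helper names are hypothetical, and the sample tags in the usage comment are illustrative only.

```python
from importlib import metadata
from typing import List, Optional


def wheel_tags(wheel_text: str) -> List[str]:
    """Extract the 'Tag:' entries from the text of a WHEEL metadata file."""
    tags = []
    for line in wheel_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "Tag":
            tags.append(value.strip())
    return tags


def is_platform_wheel(tags: List[str]) -> bool:
    """True if any tag names a concrete platform rather than 'any'."""
    # A tag has the form <python>-<abi>-<platform>; only the last part matters here.
    return any(tag.split("-", 2)[-1] != "any" for tag in tags)


def installed_wheel_text(package: str) -> Optional[str]:
    """Return the WHEEL file of an installed package, or None if unavailable."""
    try:
        return metadata.distribution(package).read_text("WHEEL")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    text = installed_wheel_text("soundfile")
    if text is None:
        print("soundfile is not installed, or was not installed from a wheel")
    else:
        tags = wheel_tags(text)
        print(tags, "platform-specific:", is_platform_wheel(tags))
```

A result of platform-specific: False would suggest the installed distribution is unlikely to carry its own libsndfile binary.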

peregilk · 2023-03-25T16:02:46Z

@lhoestq @sanchit-gandhi. I encountered the same error (also on the TPU v4) when trying to run datasets from source.

Downgrading soundfile with pip install soundfile==0.12.0 seems to fix the issue for me.

lhoestq · 2023-03-27T12:40:09Z

Maybe let's open an issue at https://github.com/bastibe/python-soundfile/issues in case they might know why you get OSError: cannot load library 'libsndfile.so' ?

Rishabh-Choudhry · 2023-04-02T23:46:48Z

@sanchit-gandhi can you please also post the logs of pip install soundfile==0.12.1? To check what wheel is being installed or if it's being built from source (I think it's the latter case). Required libsndfile binary should be bundled with soundfile wheel but I assume it might not be the case for some non-standard Linux distributions. The only solution for using soundfile here is to build libsndfile from source:
git clone https://github.com/libsndfile/libsndfile.git
cd libsndfile/
autoreconf -vif
./configure --enable-werror 
make
make install

This fixed the issue for me. After installing libsndfile as described above, I had to uninstall soundfile and re-install it with this command: pip install "soundfile>=0.12.1"

sanchit-gandhi · 2023-04-07T08:49:33Z

Thank you so much for the comprehensive instructions @polinaeterna! Also confirming that they worked for me 🤗 In my case, I had to run several of these commands under "sudo" for privileges, but otherwise this workaround gave a successful libsndfile install:

Grab source code:

git clone https://github.com/libsndfile/libsndfile.git

Set up a build environment:

sudo apt install autoconf autogen automake build-essential libasound2-dev \
  libflac-dev libogg-dev libtool libvorbis-dev libopus-dev libmp3lame-dev \
  libmpg123-dev pkg-config python

Build and test libsndfile:

autoreconf -vif
./configure --enable-werror
sudo make
sudo make check

Create _soundfile_data submodule (if it does not exist already):

sudo mkdir /usr/local/lib/python3.8/dist-packages/_soundfile_data/

Copy libsndfile files into submodule:

sudo cp /usr/local/lib/libsndfile.* /usr/local/lib/python3.8/dist-packages/_soundfile_data/
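The hard-coded python3.8/dist-packages path above will differ between machines and virtual environments. As a hedged sketch, the target directory can be derived from wherever soundfile.py is actually installed, falling back to the interpreter's default site-packages directory. The function name `soundfile_data_dir` is hypothetical.

```python
import importlib.util
import sysconfig
from pathlib import Path


def soundfile_data_dir() -> Path:
    """Guess the _soundfile_data directory next to the installed soundfile.py."""
    # find_spec locates the module file without importing it, so this works
    # even when importing soundfile would fail for lack of libsndfile.
    spec = importlib.util.find_spec("soundfile")
    if spec is not None and spec.origin:
        base = Path(spec.origin).parent
    else:
        # soundfile is not visible to this interpreter; fall back to the default
        # site-packages location (an assumption, the real location may differ).
        base = Path(sysconfig.get_paths()["purelib"])
    return base / "_soundfile_data"


if __name__ == "__main__":
    print(soundfile_data_dir())
```

Run it with the same interpreter that will import datasets, so the reported directory matches the environment being fixed.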

sanchit-gandhi · 2023-04-07T08:51:22Z

On a different machine, I also tried separately by first upgrading pip, then installing soundfile. This worked too! Thanks @lhoestq 🙌

YuchengWang · 2023-04-28T03:25:39Z

@sanchit-gandhi can you please also post the logs of pip install soundfile==0.12.1? To check what wheel is being installed or if it's being built from source (I think it's the latter case). Required libsndfile binary should be bundled with soundfile wheel but I assume it might not be the case for some non-standard Linux distributions. The only solution for using soundfile here is to build libsndfile from source:
git clone https://github.com/libsndfile/libsndfile.git
cd libsndfile/
autoreconf -vif
./configure --enable-werror 
make
make install
for this, some building libraries should be installed, for Debian/Ubuntu it's like:
apt install autoconf autogen automake build-essential libasound2-dev \
  libflac-dev libogg-dev libtool libvorbis-dev libopus-dev libmp3lame-dev \
  libmpg123-dev pkg-config python
but for other Linux distributions it might be different.

When the binary is compiled, it should be put into location where soundfile would search for it (the directory is named _soundfile_data), it depends on where libsndfile (from the previous step) and soundfile were installed, might be something like this:
cp /usr/local/lib/libsndfile.so /usr/local/lib/python3.7/dist-packages/_soundfile_data/
cp /usr/local/lib/libsndfile.la /usr/local/lib/python3.7/dist-packages/_soundfile_data/
Another solution is to not use soundfile and apply custom processing function with torchaudio while setting decode=False in Audio feature and passing custom function to .map.

Thanks, the solution solved my problem.

Purge the existing libsndfile install and uninstall python-soundfile.
Build libsndfile from source and install it.
Build python-soundfile from source and install it.
Well done.

brthor · 2023-08-25T07:06:52Z

Thank you so much for the comprehensive instructions @polinaeterna! Also confirming that they worked for me 🤗 In my case, I had to run several of these commands under "sudo" for privileges, but otherwise this workaround gave a successful libsndfile install:

Grab source code:
git clone https://github.com/libsndfile/libsndfile.git
Set up a build environment:
sudo apt install autoconf autogen automake build-essential libasound2-dev \
  libflac-dev libogg-dev libtool libvorbis-dev libopus-dev libmp3lame-dev \
  libmpg123-dev pkg-config python
Build and test libsndfile:
autoreconf -vif
./configure --enable-werror
sudo make
sudo make check
Create _soundfile_data submodule (if it does not exist already):
sudo mkdir /usr/local/lib/python3.8/dist-packages/_soundfile_data/
Copy libsndfile files into submodule:
sudo cp /usr/local/lib/libsndfile.* /usr/local/lib/python3.8/dist-packages/_soundfile_data/

I had to run 'make install' or the /usr/local/lib/libsndfile.* files didn't exist.

It's working though!

snoop2head · 2023-11-10T15:42:32Z

I had the same issue but it is working now! Thanks for all of your comments!

naarkhoo · 2024-01-17T13:59:04Z

I had the same issue on SageMaker but not on Colab;
The soundfile versioning was fine.

My approach to solving it was to pin the exact versions of numpy and numba:

! pip install "numpy==1.23.5"
! pip install "numba==0.58.1"

The version numbers are from Colab, where I could run the job successfully.
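To compare two environments such as SageMaker and Colab, it can help to print the installed versions of the relevant packages on each side. The following is a minimal sketch using only the standard library; the function name `installed_versions` and the package list are illustrative choices.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each package name to its installed version, or None if missing."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, ver in installed_versions(["numpy", "numba", "soundfile", "librosa"]).items():
        print(f"{name}: {ver or 'not installed'}")
```

Running this in both environments and diffing the output shows which packages need pinning.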

husichao666 · 2024-07-12T01:35:00Z

Thank you so much for the comprehensive instructions @polinaeterna! Also confirming that they worked for me 🤗 In my case, I had to run several of these commands under "sudo" for privileges, but otherwise this workaround gave a successful libsndfile install:

Grab source code:
git clone https://github.com/libsndfile/libsndfile.git
Set up a build environment:
sudo apt install autoconf autogen automake build-essential libasound2-dev \
  libflac-dev libogg-dev libtool libvorbis-dev libopus-dev libmp3lame-dev \
  libmpg123-dev pkg-config python
Build and test libsndfile:
autoreconf -vif
./configure --enable-werror
sudo make
sudo make check
Create _soundfile_data submodule (if it does not exist already):
sudo mkdir /usr/local/lib/python3.8/dist-packages/_soundfile_data/
Copy libsndfile files into submodule:
sudo cp /usr/local/lib/libsndfile.* /usr/local/lib/python3.8/dist-packages/_soundfile_data/

It works, and don't forget to run "apt remove libsndfile1" after installing it from source code.

sanchit-gandhi closed this as completed Apr 7, 2023
