PDF to Audiobook converter

A Python script that converts PDF files into MP3 files by reading them aloud with Google DeepMind's WaveNet voices, my personal favorite for speech synthesis from written text.

In my personal experience, audio generated by this script is more comfortable to listen to than the weaker narrators of traditional human-read audiobooks.

I wrote this script because I could not find a ready-to-use solution on Google for converting full books to audio.

Disclaimer

You might not have the rights to process or duplicate books or other textual works. I take no responsibility for how this script is used and advise you to carefully read the copyright notice in the book or PDF file you want to convert.

Output

Generated audio will be encoded as MP3. Encoding and quality settings can easily be changed in the code.

Requirements

You will need a service account private key for Google's Cloud Text-to-Speech API. Python 3 and the packages listed in requirements.txt must be installed. ffmpeg must be on your PATH or in the same directory as the script.
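
As an illustration of how the service account key and the MP3 encoding come together, here is a minimal sketch using the google-cloud-texttospeech client library. The function name `synthesize_chunk`, its parameters, and the voice `en-US-Wavenet-D` are assumptions chosen for the example and are not taken from the script itself.

```python
def synthesize_chunk(ssml, key_path, out_path, voice_name="en-US-Wavenet-D"):
    """Send one SSML chunk to Cloud Text-to-Speech and save the MP3 bytes.

    Hypothetical helper: name, parameters and default voice are assumptions.
    """
    # Imported here so this module can be loaded before the package is installed.
    from google.cloud import texttospeech

    # Authenticate with the service account private key (a JSON file).
    client = texttospeech.TextToSpeechClient.from_service_account_json(key_path)

    response = client.synthesize_speech(
        input=texttospeech.SynthesisInput(ssml=ssml),
        voice=texttospeech.VoiceSelectionParams(
            language_code="en-US", name=voice_name
        ),
        # Change the encoding here if another output format is wanted.
        audio_config=texttospeech.AudioConfig(
            audio_encoding=texttospeech.AudioEncoding.MP3
        ),
    )

    with open(out_path, "wb") as f:
        f.write(response.audio_content)
    return out_path
```

Each call counts against the monthly character quota, so it is worth checking the size of the input before running it.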

Constraints

Input files

The system works well with PDF files that are ebooks, research papers or anything else that contains text, but it is not recommended for scanned PDF files, which typically contain page images rather than extractable text.

Text to speech api limits and cost

Google's Cloud Text-to-Speech API will let you read 1,000,000 characters per month with WaveNet voices for free. Exceeding this limit results in relatively high costs. Offline alternatives to this system exist but are not implemented in this prototype.

To help avoid API costs, the script includes a character counter. When experimenting with the code, consider commenting out the line that converts the text to audio so you can see how many characters your PDF file contains without using any of your quota.
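
A character check of this kind could look like the sketch below. The function names and the `already_used` parameter are hypothetical; the limit is the free-tier figure quoted above and should be verified against Google's current pricing.

```python
# Free-tier figure quoted in this README; verify against current pricing.
FREE_WAVENET_CHARS_PER_MONTH = 1_000_000


def count_characters(chunks):
    """Total number of characters across all text chunks."""
    return sum(len(chunk) for chunk in chunks)


def within_free_tier(chunks, already_used=0, limit=FREE_WAVENET_CHARS_PER_MONTH):
    """True if sending these chunks keeps this month's usage within the limit."""
    return already_used + count_characters(chunks) <= limit
```

Running the check before any request is sent makes it possible to stop early instead of discovering the overrun on the bill.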

In addition, the TTS API limits each request to a maximum of 5000 characters. Therefore the text extracted from a PDF file is first joined and then cut into chunks of fewer than 5000 characters. Each cut is made at the last space character found within the next stretch of text shorter than 5000 characters. SSML break commands are never split.
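
The chunking rule described above can be sketched as follows. This is an illustrative reimplementation, not the script's actual code; `split_text` and `MAX_CHARS` are hypothetical names, and the sketch assumes SSML tags contain no nested angle brackets.

```python
MAX_CHARS = 5000  # request limit mentioned above


def split_text(text, max_chars=MAX_CHARS):
    """Split text into chunks shorter than max_chars, cutting at spaces."""
    chunks = []
    while len(text) >= max_chars:
        window = text[: max_chars - 1]
        cut = window.rfind(" ")
        # If the cut falls inside an unclosed tag such as <break time="1s"/>,
        # move it to the last space in front of that tag.
        while cut > 0 and window.rfind("<", 0, cut) > window.rfind(">", 0, cut):
            cut = window.rfind(" ", 0, window.rfind("<", 0, cut))
        if cut <= 0:
            cut = len(window)  # no usable space: fall back to a hard cut
        chunk = text[:cut].strip()
        if chunk:
            chunks.append(chunk)
        text = text[cut:].lstrip()
    if text:
        chunks.append(text)
    return chunks
```

With a small limit the behaviour is easy to inspect: `split_text('one two <break time="1s"/> three four', max_chars=20)` keeps the break tag in a chunk of its own rather than cutting through it.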

This process results in a lot of MP3 chunks that are then joined with ffmpeg, using the metadata of the first file.
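
One way to join MP3 chunks with ffmpeg is its concat demuxer, which reads a text file listing the inputs. Whether the script uses this exact mechanism is an assumption; the helper names below are hypothetical, and file names containing single quotes would need extra escaping.

```python
import subprocess
from pathlib import Path


def concat_list_text(mp3_files):
    """Build the file list that ffmpeg's concat demuxer reads."""
    lines = [f"file '{Path(p).resolve().as_posix()}'" for p in mp3_files]
    return "\n".join(lines) + "\n"


def build_join_command(list_path, out_file):
    """ffmpeg call that copies the audio streams without re-encoding."""
    return [
        "ffmpeg", "-y",
        "-f", "concat", "-safe", "0",
        "-i", str(list_path),
        "-c", "copy",
        str(out_file),
    ]


def join_chunks(mp3_files, out_file, list_path="chunks.txt"):
    """Write the list file and run ffmpeg (ffmpeg must be on the PATH)."""
    Path(list_path).write_text(concat_list_text(mp3_files), encoding="utf-8")
    subprocess.run(build_join_command(list_path, out_file), check=True)
```

Passing the chunk files in playback order matters, since the demuxer concatenates them exactly as listed.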

Getting started

Download ffmpeg and set your path variable accordingly.

pip install -r requirements.txt

Set your paths and input files in the script variables. Keep in mind that these are relative to your project directory unless absolute paths are used. The script can easily be adapted to process all files in the input folder with a for loop and a call to os.listdir(inpath).

infile = "mypdffile.pdf"
inpath = "in/"
outpath = "out/"
joinedpath = "joined/"
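
To process every PDF in the input folder, the loop over os.listdir could be sketched like this. `filter_pdfs`, `convert_all` and the `convert` callable are hypothetical names standing in for whatever function performs the conversion of a single file.

```python
import os


def filter_pdfs(names):
    """Keep only PDF file names, sorted for a stable processing order."""
    return sorted(name for name in names if name.lower().endswith(".pdf"))


def convert_all(inpath, convert):
    """Call convert(infile) for every PDF found in the input folder."""
    for infile in filter_pdfs(os.listdir(inpath)):
        convert(infile)
```

For example, `convert_all("in/", print)` would only print the names of the PDF files in the input folder, which is a cheap way to check the selection before spending any API quota.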

On Windows run
python main.py

On Linux run
python3 main.py

SSML preprocessing

Breaks are added after paragraphs, colons, periods, etc. to make the audio easier to listen to. In addition, several Unicode Latin small ligature symbols (such as ﬁ) are replaced with their plain letters.
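
A preprocessing step of this kind might look like the following sketch. The break durations, the ligature table and the function name `to_ssml` are assumptions made for illustration; the script's actual rules may differ.

```python
import re
from xml.sax.saxutils import escape

# Unicode Latin small ligatures and their plain-letter replacements.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}


def to_ssml(text, sentence_break="300ms", paragraph_break="800ms"):
    """Replace ligatures and insert SSML breaks (durations are assumptions)."""
    for ligature, letters in LIGATURES.items():
        text = text.replace(ligature, letters)

    # Blank lines separate paragraphs.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    marked = []
    for paragraph in paragraphs:
        # Collapse whitespace, then escape &, < and > for use inside SSML.
        paragraph = escape(re.sub(r"\s+", " ", paragraph))
        # Short pause after a period or colon that is followed by more text.
        paragraph = re.sub(
            r"([.:]) ", rf'\1 <break time="{sentence_break}"/> ', paragraph
        )
        marked.append(paragraph)

    body = f' <break time="{paragraph_break}"/> '.join(marked)
    return f"<speak>{body}</speak>"
```

Because the pause is keyed on a period followed by a space, decimals such as 3.5 are left alone, while abbreviations like "e.g." would still receive a break.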

References

https://deepmind.com/blog/article/wavenet-generative-model-raw-audio

https://cloud.google.com/text-to-speech

http://ffmpeg.org/

ToDo

  • Find chapters automatically, cut chunks along chapter boundaries as far as possible, and name audio chunks accordingly.
  • Provide separate audio files for the chapters.
  • Find and cut table of contents, bibliography, etc. automatically or let the user provide offset values.
  • Find and cut page numbers automatically or let the user provide offset values.

Contributing

If you improve the script, I would highly appreciate a pull request with your improvements. Learn how that works here: https://kbroman.org/github_tutorial/pages/fork.html