Youtube Transcript Summarizer using NLP #940

Open
sindhuja184 opened this issue Oct 22, 2024 · 8 comments
Labels
Status: Up for Grabs (Up for grabs issue.) · WoC 4.0 (Winter of Code 4.0 by GDG IIITK)

Comments

@sindhuja184

Deep Learning Simplified Repository (Proposing new issue)

🔴 Project Title : Youtube Transcript Summarizer

🔴 Aim : The aim of the YouTube Transcript Summarizer is to provide concise, meaningful summaries that cut transcript length by 80%, allowing users to quickly grasp the key points of a video.
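
To make the 80% target concrete, here is a minimal sketch of how the reduction could be measured. It assumes length is counted in whitespace-separated words, which the issue does not specify; the function names `reduction_ratio` and `target_summary_words` are hypothetical and only for illustration.

```python
def reduction_ratio(transcript: str, summary: str) -> float:
    """Fraction of words removed; 0.8 means the summary is 80% shorter."""
    original = len(transcript.split())
    if original == 0:
        raise ValueError("transcript is empty")
    return 1.0 - len(summary.split()) / original


def target_summary_words(transcript: str, reduction: float = 0.8) -> int:
    """Word budget for a summary that meets the requested reduction."""
    # Assumption: length is measured in words, not characters or tokens.
    return max(1, round(len(transcript.split()) * (1.0 - reduction)))
```

Under this assumption, a 1,000-word transcript would need a summary of about 200 words to meet the stated aim.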

🔴 Dataset : The dataset would typically consist of transcripts of YouTube videos.

🔴 Approach : The YouTube Transcript Summarizer employs Natural Language Processing (NLP) techniques to provide concise summaries of video transcripts. The process begins with extracting the transcript, followed by preprocessing to clean and tokenize the text. The chosen algorithm then analyzes the content to generate a summary, significantly reducing the original length while retaining essential points. This approach enables users to quickly grasp the core message of a video without sifting through lengthy transcripts. (Transcripts are taken with the help of the YouTube Transcript API.)


📍 Follow the Guidelines to Contribute in the Project :

  • You need to create a separate folder named as the Project Title.
  • Inside that folder, there will be four main components.
    • Images - To store the required images.
    • Dataset - To store the dataset or, information/source about the dataset.
    • Model - To store the machine learning model you've created using the dataset.
    • requirements.txt - This file will contain the required packages/libraries to run the project in other machines.
  • Inside the Model folder, the README.md file must be filled up properly, with proper visualizations and conclusions.

🔴🟡 Points to Note :

  • The issues will be assigned on a first-come, first-served basis, 1 Issue == 1 PR.
  • "Issue Title" and "PR Title" should be the same. Include the issue number along with it.
  • Follow the Contributing Guidelines & Code of Conduct before you start contributing.

To be Mentioned while taking the issue :

  • Full name : Sindhuja Didugu
  • GitHub Profile Link : https://github.com/sindhuja184
  • Email ID :[email protected]
  • Participant ID (if applicable):
  • Approach for this Project : The YouTube Transcript Summarizer employs Natural Language Processing (NLP) techniques to provide concise summaries of video transcripts. The process begins with extracting the transcript, followed by preprocessing to clean and tokenize the text. The chosen algorithm then analyzes the content to generate a summary, significantly reducing the original length while retaining essential points. This approach enables users to quickly grasp the core message of a video without sifting through lengthy transcripts. (Transcripts are taken with the help of the YouTube Transcript API.)
  • What is your participant role? (Mention the Open Source program) GSSOC ext- Participant

Happy Contributing 🚀

All the best. Enjoy your open source journey ahead. 😎

Thank you for creating this issue! We'll look into it as soon as possible. Your contributions are highly appreciated! 😊

@Abhiiesante

Can you please assign this issue to me under 𝗚𝗦𝗦𝗼𝗖 '𝟮𝟰 𝗘𝘅𝘁𝗲𝗻𝗱𝗲𝗱, Hacktoberfest-accepted

@abhisheks008
Owner

Can you please assign this issue to me under 𝗚𝗦𝗦𝗼𝗖 '𝟮𝟰 𝗘𝘅𝘁𝗲𝗻𝗱𝗲𝗱, Hacktoberfest-accepted

As this issue is raised by @sindhuja184, this issue can't be assigned to you.

@abhisheks008
Owner

@sindhuja184 can you please elaborate on the approach you are planning for this problem statement?

@sindhuja184
Author

The aim of the project is to summarize the transcripts of YouTube videos.

  1. First, I would extract the transcript of the YouTube video with the YouTube Transcript API. (This requires the video ID of the YouTube video.)
  2. Then I would split the text into chunks of a fixed number of tokens. (Summarization models have a token limit, so splitting is mandatory here.)
  3. Then I would summarize each chunk using Hugging Face Transformers. (I would like to use the facebook/bart-large-cnn model.)
  4. Finally, I would combine the chunk summaries.

This is the approach I am planning to follow @abhisheks008
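
The four steps above could be sketched roughly as follows. This is only an illustration under stated assumptions, not a finished implementation: the transcript is assumed to be already fetched as a list of segments with a `text` field (the shape the youtube-transcript-api package has historically returned), chunk size is counted in words as a rough stand-in for model tokens, and all function names are hypothetical. The `pipeline("summarization", ...)` call is the commonly documented Transformers usage and should be checked against the installed version.

```python
from typing import Callable, Iterable, List


def transcript_to_text(segments: Iterable[dict]) -> str:
    """Step 1 (after fetching): join segment texts and normalise whitespace."""
    # Assumption: each segment is a dict with a "text" key.
    return " ".join(" ".join(seg["text"].split()) for seg in segments if seg.get("text"))


def chunk_words(text: str, max_words: int = 400) -> List[str]:
    """Step 2: split into chunks of at most max_words words.

    Words are only a proxy for tokens; a tokenizer-based split would be
    more accurate for a real model limit.
    """
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def summarize_transcript(
    segments: Iterable[dict],
    summarize: Callable[[str], str],
    max_words: int = 400,
) -> str:
    """Steps 3 and 4: summarize each chunk, then join the partial summaries."""
    text = transcript_to_text(segments)
    return " ".join(summarize(chunk) for chunk in chunk_words(text, max_words))


def make_bart_summarizer(model_name: str = "facebook/bart-large-cnn") -> Callable[[str], str]:
    """Build a chunk summarizer backed by a Hugging Face pipeline.

    Imported lazily so the helpers above work without transformers installed.
    The length limits below are illustrative values, not tuned settings.
    """
    from transformers import pipeline

    pipe = pipeline("summarization", model=model_name)

    def summarize(chunk: str) -> str:
        result = pipe(chunk, max_length=130, min_length=30, truncation=True)
        return result[0]["summary_text"]

    return summarize
```

Passing the summarizer in as a callable keeps the chunking logic independent of the model, so another model can be swapped in without touching the rest, for example `summarize_transcript(segments, make_bart_summarizer())`.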

@abhisheks008
Owner

Apart from Hugging Face, are there any other algorithms you are comfortable with? The project repository requires at least 3 model implementations for each problem statement.

@abhisheks008 added the Status: Up for Grabs and ieee-igdtuw (IEEE IGDTUW Open Source Week 2024) labels on Nov 10, 2024
@abhisheks008 removed the ieee-igdtuw (IEEE IGDTUW Open Source Week 2024) label on Nov 19, 2024
@abhisheks008 added the WoC 4.0 (Winter of Code 4.0 by GDG IIITK) label on Jan 1, 2025

@Jaegerbawmb

Jaegerbawmb commented Jan 6, 2025

Hello, I'd like to contribute to this.
Full name : Sanyukta Gokhale
GitHub Profile Link : https://github.com/Jaegerbawmb
Email ID : [email protected]
Approach for this Project : I would begin by extracting transcripts from this dataset: https://github.com/chris-lovejoy/youtube-titles-and-transcripts?tab=readme-ov-file. I would then preprocess the text by removing filler words and non-informative content, followed by tokenizing and segmenting it. I will use the TextRank algorithm for extractive summarization, and then move on to more advanced models such as T5 and BART for abstractive summarization. The quality of the summaries will be evaluated using metrics such as ROUGE or BLEU.
Participating in: WoC 4.0
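
For reference, the TextRank step mentioned above could look something like the sketch below. It is a simplified, dependency-free version: sentences are split with a naive regex, similarity is word overlap normalised by log sentence lengths (computed here on sets of distinct words, which is a simplification), and scores come from a fixed number of PageRank-style iterations. The function names and default parameters are assumptions for illustration, not part of the proposal.

```python
import math
import re
from typing import List, Set


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(a: Set[str], b: Set[str]) -> float:
    """Shared words divided by the sum of log sentence lengths."""
    if len(a) < 2 or len(b) < 2:  # log(1) == 0, so skip one-word sentences
        return 0.0
    return len(a & b) / (math.log(len(a)) + math.log(len(b)))


def textrank_summary(
    text: str,
    num_sentences: int = 2,
    damping: float = 0.85,
    iterations: int = 50,
) -> List[str]:
    """Return the top-ranked sentences in their original order."""
    sentences = split_sentences(text)
    n = len(sentences)
    if n <= num_sentences:
        return sentences

    words = [set(re.findall(r"\w+", s.lower())) for s in sentences]
    # Weighted, undirected sentence graph as an adjacency matrix.
    weights = [
        [similarity(words[i], words[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]

    scores = [1.0 / n] * n
    for _ in range(iterations):
        # Each sentence receives score from its neighbours in proportion
        # to the edge weight, as in weighted PageRank.
        scores = [
            (1 - damping) / n
            + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            for i in range(n)
        ]

    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:num_sentences])]
```

Because this is extractive, the output consists of sentences copied from the transcript, which makes it a natural baseline to compare against the abstractive T5 and BART summaries.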

@abhisheks008
Owner

Hi @Jaegerbawmb, it's a pretty good approach tbh. Regarding the proposal, though, I need to check with the core team on how they are handling this. I will get back to you once I have any information.


4 participants