[Bug] Keeps reading the PDF file again & again when I use PDFImageReader, even when the PDF file already exists in the database #2014

Open
mahendra867 opened this issue Feb 5, 2025 · 0 comments
Labels
bug Something isn't working

mahendra867 commented Feb 5, 2025

Description

Briefly describe the issue you’re experiencing or the bug you’ve found.

code

import os
from dotenv import load_dotenv
from agno.agent import Agent
from agno.embedder.azure_openai import AzureOpenAIEmbedder
from agno.knowledge.pdf import PDFKnowledgeBase, PDFImageReader, PDFReader
from agno.vectordb.pgvector import PgVector, SearchType
from agno.models.openai import OpenAIChat
from agno.storage.agent.postgres import PostgresAgentStorage
from sqlalchemy import create_engine, inspect, text
from agno.vectordb.pgvector.index import Ivfflat, HNSW
from agno.embedder.openai import OpenAIEmbedder
#from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_text_splitters import RecursiveCharacterTextSplitter
from agno.document.chunking.recursive import RecursiveChunking

# Load environment variables
load_dotenv()
print("Environment variables loaded.")

# Fetch API keys and endpoint from environment variables
AZURE_OPENAI_API_KEY = os.getenv("AZURE_OPENAI_API_KEY")
AZURE_OPENAI_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

db_url = "postgresql+psycopg://ai:ai@localhost:5532/ai"

# Function to check if the table exists
def check_table_exists(engine, schema, table_name):
    print(f"Checking if table '{table_name}' exists in schema '{schema}'...")
    inspector = inspect(engine)
    exists = inspector.has_table(table_name, schema=schema)
    print(f"Table '{table_name}' exists: {exists}")
    return exists

# Function to check if a PDF file is already in the database
def is_pdf_in_db(engine, schema, table_name, pdf_name):
    pdf_name_no_ext = os.path.splitext(pdf_name)[0]  # Remove the .pdf extension
    print(f"Checking if PDF '{pdf_name_no_ext}' exists in table '{table_name}'...")
    with engine.connect() as connection:
        result = connection.execute(
            text(f"SELECT 1 FROM {schema}.{table_name} WHERE name = :name"),
            {"name": pdf_name_no_ext}
        )
        exists = result.fetchone() is not None
    print(f"PDF '{pdf_name_no_ext}' exists in table '{table_name}': {exists}")
    return exists

# Set up the PDF knowledge base with vector database
print("Setting up PDF knowledge base with vector database...")
pdf_knowledge_base = PDFKnowledgeBase(
    # Raw string, so that "\a" in the Windows path is not treated as an escape sequence
    path=r"D:\Projects\agentic_new_rag\pdfs",
    vector_db=PgVector(
        table_name="updated_rag009",
        schema="ai",
        db_url=db_url,
        search_type=SearchType.hybrid,
        vector_index=HNSW(),
        embedder=OpenAIEmbedder(
            api_key=OPENAI_API_KEY,
            id="text-embedding-ada-002",
            dimensions=1536,
            encoding_format="float",
        ),
    ),
    reader=PDFImageReader(chunk=False),  # Use a default reader
    chunking_strategy=RecursiveChunking(chunk_size=4000, overlap=800),
    documents=3,
)

#chunking_strategy=RecursiveChunking(),

# Define the PgAgentStorage with connection to database
# Create a SQLAlchemy engine
engine = create_engine(db_url)

# Before loading, check if the table exists and create if not
if not check_table_exists(engine, "ai", "updated_rag009"):
    print("Table does not exist. Creating the table...")
    pdf_knowledge_base.load(recreate=True, upsert=True)  # Create the table
else:
    print("Table exists. Skipping table creation...")
    pdf_knowledge_base.load(recreate=False, skip_existing=True)  # Skip existing table

# Check if PDFs are already in the database and process accordingly
# Raw string, so that "\a" in the Windows path is not treated as an escape sequence
pdf_folder = r"D:\Projects\agentic_new_rag\pdfs"
for pdf_file in os.listdir(pdf_folder):
    if pdf_file.endswith(".pdf"):
        if not is_pdf_in_db(engine, "ai", "updated_rag009", pdf_file):
            print(f"Processing {pdf_file}...")
            # Process the PDF file with PDFImageReader
            # (the documents returned by read() are not stored anywhere here)
            pdf_reader = PDFImageReader(chunk=True)
            pdf_reader.read(os.path.join(pdf_folder, pdf_file))
        else:
            print(f"Skipping {pdf_file}, already in database.")

# Initialize the RAG agent
print("Initializing RAG Agent...")
rag_agent = Agent(
    name="Agentic RAG Application",
    agent_id="rag-agent",
    model=OpenAIChat(id="gpt-4o-mini"),
    knowledge=pdf_knowledge_base,
    add_context=True,
    search_knowledge=True,
    read_chat_history=True,
    debug_mode=True,
    #storage=storage,
    description=(
        "You are an intelligent retrieval assistant specialized in utilizing knowledge stored in "
        "a curated set of documents related"
    ),
    instructions=[
        "you have access of the documents and their corresponding file names are: "
    ],
    markdown=True
)

print("RAG Agent initialized.")

# Print the agent's response to a query
rag_agent.print_response("give me full Table 1. Volume 1 of DoDM 5200.01 Cancellation Actions of 520045m", stream=True)

Steps to Reproduce

List the steps needed to encounter this bug or issue.

Agent Configuration (if applicable)

Provide relevant agent configuration.

I am using the Agentic RAG configuration shown in the code above.

Expected Behavior

What did you expect to happen?

Based on the code above, PDFImageReader should skip reading 520045m.pdf because that file is already present in PgVector. Instead, it reads the file again every time I run the code.

When I use PDFReader, reading is skipped once the 520045m PDF is already present in PgVector.

The issue is with PDFImageReader: it does not skip reading PDF files that are already in the database.

Please work on this issue.
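
To make the expected behavior concrete, here is a minimal sketch of the filtering I have in mind: decide which PDFs still need reading before any reader is invoked. `pdfs_needing_read` is a hypothetical helper, not part of Agno, and it assumes the table's `name` column holds the file name without the `.pdf` extension, as in the `is_pdf_in_db` query above.

```python
from pathlib import Path


def pdfs_needing_read(folder, indexed_names):
    """Return the PDF paths in `folder` whose stem is not in `indexed_names`.

    Hypothetical helper: `indexed_names` is assumed to be the set of values
    already stored in the table's `name` column (file name without extension).
    """
    indexed = set(indexed_names)
    return sorted(
        path
        for path in Path(folder).iterdir()
        if path.is_file()
        and path.suffix.lower() == ".pdf"  # also matches ".PDF"
        and path.stem not in indexed
    )
```

Only the paths returned by this helper would then be passed to the expensive reader, so an already indexed file such as 520045m.pdf would never be opened.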

Actual Behavior

What actually happened instead?

The PDF file should be skipped when its name already exists in the table, but PDFImageReader does not skip reading it.

Screenshots or Logs (if applicable)

Include any relevant screenshots or error logs that demonstrate the issue.

Logs for PDFReader

These are the logs for PDFReader. Since 520045m is already present in the table, reading was skipped.

PS D:\Projects\agentic_new_rag> & C:/Users/wel/anaconda3/envs/venv/python.exe "d:/Projects/agentic_new_rag/final_Agentic_rag_UI.py"
Environment variables loaded.
Setting up PDF knowledge base with vector database...
PgAgentStorage initialized.
Checking if table 'updated_rag009' exists in schema 'ai'...
Table 'updated_rag009' exists: True
Table exists. Skipping table creation...
INFO Creating collection
INFO Loading knowledge base
INFO Reading: 520045m
INFO Added 0 documents to knowledge base
INFO Reading: CMMC101
INFO Added 0 documents to knowledge base
Checking if PDF '520045m' exists in table 'updated_rag009'...
PDF '520045m' exists in table 'updated_rag009': True
Skipping 520045m.pdf, already in database.
Checking if PDF 'CMMC101' exists in table 'updated_rag009'...
PDF 'CMMC101' exists in table 'updated_rag009': True
Skipping CMMC101.pdf, already in database.
Initializing RAG Agent...
RAG Agent initialized.

Logs for PDFImageReader, 1st run

Environment variables loaded.
Setting up PDF knowledge base with vector database...
PgAgentStorage initialized.
Checking if table 'updated_rag0990' exists in schema 'ai'...
Table 'updated_rag0990' exists: False
Table does not exist. Creating the table...
INFO Dropping collection
INFO Table 'ai.updated_rag0990' does not exist.
INFO Creating collection
INFO Loading knowledge base
INFO Reading: 520045m
INFO Upserted batch of 68 documents.
INFO Added 68 documents to knowledge base
INFO Reading: CMMC101
INFO Upserted batch of 21 documents.
INFO Added 21 documents to knowledge base
Checking if PDF '520045m' exists in table 'updated_rag0990'...
PDF '520045m' exists in table 'updated_rag0990': True
Skipping 520045m.pdf, already in database.
Checking if PDF 'CMMC101' exists in table 'updated_rag0990'...
PDF 'CMMC101' exists in table 'updated_rag0990': True
Skipping CMMC101.pdf, already in database.
Initializing RAG Agent...
RAG Agent initialized.

Logs for PDFImageReader, 2nd run of the same code with the same PDF files

No documents were added to PgVector this time because the PDF files were already present, but PDFImageReader still spent a long time reading them first. PDFReader, by contrast, neither read the PDF files nor added documents.

Environment variables loaded.
Setting up PDF knowledge base with vector database...
PgAgentStorage initialized.
Checking if table 'updated_rag0990' exists in schema 'ai'...
Table 'updated_rag0990' exists: True
Table exists. Skipping table creation...
INFO Creating collection
INFO Loading knowledge base
INFO Reading: 520045m
INFO Added 0 documents to knowledge base
INFO Reading: CMMC101
INFO Added 0 documents to knowledge base
Checking if PDF '520045m' exists in table 'updated_rag0990'...
PDF '520045m' exists in table 'updated_rag0990': True
Skipping 520045m.pdf, already in database.
Checking if PDF 'CMMC101' exists in table 'updated_rag0990'...
PDF 'CMMC101' exists in table 'updated_rag0990': True
Skipping CMMC101.pdf, already in database.
Initializing RAG Agent...
RAG Agent initialized.
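
In the PDFReader log and in the 2nd-run PDFImageReader log above, `INFO Reading: 520045m` is printed before `Added 0 documents`, which suggests the existence check runs only after the reader has already processed the file. The sketch below illustrates the difference between the two orderings in plain Python. It is not Agno's implementation: `slow_read`, `load_filter_after_read`, and `load_check_before_read` are hypothetical names, and `slow_read` stands in for an expensive reader such as PDFImageReader.

```python
def load_filter_after_read(names, read, exists):
    """Read every file first, then drop documents whose file is already indexed."""
    added = []
    for name in names:
        docs = read(name)  # the expensive step always runs
        added.extend(doc for doc in docs if not exists(name))
    return added


def load_check_before_read(names, read, exists):
    """Check the index first and read only files that are not in it yet."""
    added = []
    for name in names:
        if exists(name):
            continue  # the expensive step is skipped entirely
        added.extend(read(name))
    return added


calls = []  # records which files the stand-in reader was asked to read


def slow_read(name):
    """Hypothetical stand-in for an expensive reader (e.g. OCR over page images)."""
    calls.append(name)
    return [f"{name} page 1"]


names = ["520045m", "CMMC101"]
indexed = {"520045m"}  # pretend 520045m is already in the table

added_late = load_filter_after_read(names, slow_read, indexed.__contains__)
reads_late = list(calls)

calls.clear()
added_early = load_check_before_read(names, slow_read, indexed.__contains__)
reads_early = list(calls)
```

Both orderings add the same documents, but only the second avoids reading 520045m at all. With a fast reader the difference is hard to notice; with a slow one it dominates the run time.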

Environment

  • OS: (e.g. macOS, Windows 11)
  • Browser (if relevant): (e.g. Chrome 108, Firefox 107)
  • Agno Version: (e.g. v1.0.0)
  • External Dependency Versions: (e.g., yfinance 0.2.52)
  • Additional Environment Details: (e.g., Python 3.10)

Possible Solutions (optional)

Suggest any ideas you might have to fix or address the issue.
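
One idea, as a rough sketch only: fetch the set of already indexed names with a single query up front, so the check can happen before any reader is called. The example uses an in-memory SQLite table purely as a stand-in for the PgVector table; `indexed_names` is a hypothetical helper, and the `name` column is assumed from the query in my script above.

```python
import sqlite3


def indexed_names(conn, table):
    """Return the distinct values of the `name` column in `table`.

    The table name is interpolated into the SQL text, so pass only
    trusted identifiers here.
    """
    rows = conn.execute(f'SELECT DISTINCT name FROM "{table}"').fetchall()
    return {row[0] for row in rows}


# In-memory stand-in for the real table (assumption: one row per chunk).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE updated_rag009 (name TEXT, content TEXT)")
conn.executemany(
    "INSERT INTO updated_rag009 VALUES (?, ?)",
    [("520045m", "page 1"), ("520045m", "page 2"), ("CMMC101", "page 1")],
)

names_in_db = indexed_names(conn, "updated_rag009")
```

The resulting set can be compared against the file names in the PDF folder, which replaces one `SELECT` per file with a single query.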

Additional Context

Add any other context or details about the problem here.
Please solve this issue as quickly as possible.

mahendra867 added the bug label Feb 5, 2025
mahendra867 changed the title from "[Bug]Keep reading the the pdf again & again when i use PDFImageReader even when the pdf file already exist in database" to "[Bug]Keep reading the the pdf file again & again when i use PDFImageReader even when the pdf file already exist in database" Feb 5, 2025