Goal: get every ePrint paper into a database, as fast and as cleanly as possible. Later goal: have a nice web UI where you can download PDFs from several sources (eprint, NIST...), do fine-tuning with ollama, or get alerted about new papers.
Technologies:
- scripting: golang
- DBMS: PostgreSQL
- concurrency
Result of retrieving data for every 2024 paper:
Execution time: 4m43.4106983s
Result of retrieving data for every 2024, 2023, and 2022 paper concurrently:
Total execution time = ~10m40s
For this test, I launched three goroutines, one for each year. I think this is a poor concurrency design and I can find better.
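For reference, here is a minimal sketch of what that test looked like, assuming a processYear function that handles one year end to end (the name is mine, not the actual code):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// processYear stands in for the real per-year work:
// retrieve metadata, download PDFs, insert into PostgreSQL.
func processYear(year int) {
	fmt.Println("processing year", year)
}

func main() {
	start := time.Now()
	var wg sync.WaitGroup
	for _, year := range []int{2024, 2023, 2022} {
		wg.Add(1)
		go func(y int) {
			defer wg.Done()
			processYear(y)
		}(year)
	}
	wg.Wait()
	fmt.Println("Total execution time =", time.Since(start))
}
```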
Notes:
- Too many goroutines making simultaneous requests to one server leads to this error: read: connection reset by peer
- Find a better design for my concurrency
Stages:
- Retrieve data about a paper: title, PDF URL, category
- Download the PDF
- Store the raw binary in the database
Sketch of the process:
Start: GetPapersYear -> For each paper: RetrieveDataPaper -> DownloadPaper -> InsertBinary (into the database)
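Here is a minimal sequential skeleton of that flow; the function names match the sketch above, but the signatures and stub bodies are my assumptions:

```go
package main

import "fmt"

// Paper holds the metadata retrieved in stage 1.
type Paper struct {
	Title, PDFURL, Category string
}

// Stubs standing in for the real HTTP and database code.
func GetPapersYear(year int) ([]string, error)   { return []string{"2024/001"}, nil }
func RetrieveDataPaper(id string) (Paper, error) { return Paper{Title: id}, nil }
func DownloadPaper(url string) ([]byte, error)   { return []byte("%PDF"), nil }
func InsertBinary(p Paper, pdf []byte) error     { return nil }

// run processes one year sequentially, one paper at a time.
func run(year int) error {
	ids, err := GetPapersYear(year)
	if err != nil {
		return err
	}
	for _, id := range ids {
		paper, err := RetrieveDataPaper(id) // stage 1: title, PDF URL, category
		if err != nil {
			continue // skip this paper; see the alerts discussion below
		}
		pdf, err := DownloadPaper(paper.PDFURL) // stage 2: raw bytes
		if err != nil {
			continue
		}
		if err := InsertBinary(paper, pdf); err != nil { // stage 3: store in the database
			continue
		}
	}
	return nil
}

func main() { fmt.Println(run(2024)) }
```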
First idea:
- A fixed number N of goroutines for stages 1, 2, and 3
(i.e. I create 100 goroutines for each stage, and when they have finished one task they continue with the next one)
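A minimal sketch of this idea for a single stage, assuming a jobs channel of paper IDs (the names are mine):

```go
package main

import (
	"fmt"
	"sync"
)

// workerPool starts a fixed number of workers that all drain the
// same jobs channel until it is closed.
func workerPool(n int, jobs <-chan string, handle func(string)) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for job := range jobs {
				handle(job) // when a worker finishes one task, it picks up the next
			}
		}()
	}
	wg.Wait()
}

func main() {
	jobs := make(chan string)
	go func() {
		for i := 1; i <= 10; i++ {
			jobs <- fmt.Sprintf("2024/%03d", i)
		}
		close(jobs)
	}()
	workerPool(3, jobs, func(id string) { fmt.Println("processed", id) })
}
```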
Second idea:
- N goroutines for each stage, never exceeding a limit P of goroutines.
(i.e. I create goroutines for my tasks until I reach the limit P, then I wait for some of them to finish before creating new ones)
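The second idea maps naturally onto a buffered channel used as a counting semaphore; a sketch with P as the limit:

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	const P = 3                   // hard limit on simultaneous goroutines
	sem := make(chan struct{}, P) // buffered channel used as a counting semaphore
	var wg sync.WaitGroup

	for i := 1; i <= 10; i++ {
		sem <- struct{}{} // blocks while P goroutines are already running
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			defer func() { <-sem }() // free a slot when done
			fmt.Println("task", n)
		}(i)
	}
	wg.Wait()
}
```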
More ideas:
Basically the same, but using a pipeline with channels. A fourth idea could be to use a custom rate limit with a work-stealing queue.
I shall explore and test those ideas.
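As a first cut at the pipeline idea, here is a sketch where each stage reads from the previous stage's channel, with a time.Ticker throttling the HTTP stage (golang.org/x/time/rate would be the more flexible choice for a real rate limit):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	ids := make(chan string)
	pdfs := make(chan string)

	// Stages 1+2: retrieve metadata and download, at most one request per 100ms.
	limiter := time.NewTicker(100 * time.Millisecond)
	defer limiter.Stop()
	go func() {
		defer close(pdfs)
		for id := range ids {
			<-limiter.C // wait for a slot before hitting the server
			pdfs <- "pdf-bytes-for-" + id
		}
	}()

	// Stage 3: insert into the database.
	done := make(chan struct{})
	go func() {
		defer close(done)
		for pdf := range pdfs {
			fmt.Println("inserted", pdf)
		}
	}()

	for i := 1; i <= 5; i++ {
		ids <- fmt.Sprintf("2024/%03d", i)
	}
	close(ids)
	<-done
}
```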
Statistics are the data I need to correctly download the precise number of papers available on ePrint. For now the code is a bit goofy because I make a request every time the tool is executed to retrieve data such as category names and the number of PDFs from past years. A future improvement will be a proper strategy that optimizes the number of requests and saves time by calling the website intelligently.
In order to anticipate rate limit issues (from hardware or the ePrint server), I decided to make a small analysis of how many requests I will need, how many insertions into my database, etc.
Here is the volume of papers for each year:
"2024":1799, "2023":1971, "2022":1781, "2021":1705, "2020":1620,
"2019":1498, "2018":1249, "2017":1262, "2016":1195, "2015":1255, "2014":1029, "2013":881, "2012":733, "2011":714, "2010":660,
"2009":638, "2008":545, "2007":482, "2006":485, "2005":469, "2004":375, "2003":265, "2002":195, "2001":113, "2000":69,
"1999":24, "1998":26, "1997":15, "1996": 16,
Years between 2014 and 2024 have more than one thousand papers each, so I consider them the years that need more goroutines. The other years, between 1996 and 2013, have fewer than one thousand papers, sometimes only a few dozen, which means they don't need as many goroutines.
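For scale: summing the counts above gives 23,069 papers in total, so a full import means on the order of 23k metadata requests, 23k PDF downloads, and 23k database insertions (fewer requests if metadata can be fetched in batches). One simple heuristic would be to scale the worker count with the year's volume; a sketch (the thresholds are arbitrary starting points, to be tuned against what the server tolerates):

```go
package main

import "fmt"

// workersFor scales the worker count with a year's volume.
// The thresholds are arbitrary starting points, not measured values.
func workersFor(paperCount int) int {
	switch {
	case paperCount > 1000: // 2014-2024
		return 20
	case paperCount > 100:
		return 5
	default: // only a few dozen papers
		return 1
	}
}

func main() {
	fmt.Println(workersFor(1799), workersFor(660), workersFor(16)) // 20 5 1
}
```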
I'm developing an alert system that communicates errors and failures through a channel. For the moment the system is very simple and I still need to make improvements, but I'm working hard on designing a good system that handles every situation.
Currently, my system is a predetermined set of flags, but it is a bit naive. Normally every case is handled, but what if I get an unknown error? For instance, while testing downloads with goroutines, I got a rejection from the ePrint website. Now I know this error exists, but it means I need to anticipate every possible error, plus the case where I don't know what it is.
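One way to keep a safety net for unknown errors is to classify alerts by error value and fall back to a catch-all default; a sketch of such a channel-based design (the Alert type is an assumption, not my actual code):

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
)

// Alert is what workers send on the channel instead of handling errors themselves.
type Alert struct {
	PaperID string
	Err     error
}

func main() {
	alerts := make(chan Alert)
	go func() {
		alerts <- Alert{"2024/001", syscall.ECONNRESET}
		alerts <- Alert{"2024/002", errors.New("something never seen before")}
		close(alerts)
	}()

	for a := range alerts {
		switch {
		case errors.Is(a.Err, syscall.ECONNRESET):
			fmt.Println(a.PaperID, ": server rejected us, should back off")
		default:
			// catch-all: log unknown errors instead of trying to enumerate them all
			fmt.Println(a.PaperID, ": unknown error:", a.Err)
		}
	}
}
```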
A second point is the exit action. With my alert system you can choose whether to continue the program or quit. This is okay for the moment, but when my program runs with hundreds of goroutines, how am I going to manage the exit of all those goroutines?
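The standard Go answer to shutting down many goroutines at once is context cancellation (the context article linked at the bottom covers this in depth); a minimal sketch:

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

func worker(ctx context.Context, id int, wg *sync.WaitGroup) {
	defer wg.Done()
	for {
		select {
		case <-ctx.Done(): // every worker watches the same cancellation signal
			fmt.Println("worker", id, "exiting:", ctx.Err())
			return
		case <-time.After(50 * time.Millisecond):
			// do one unit of work here
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	var wg sync.WaitGroup
	for i := 0; i < 5; i++ { // imagine hundreds
		wg.Add(1)
		go worker(ctx, i, &wg)
	}
	time.Sleep(200 * time.Millisecond)
	cancel() // one call stops them all
	wg.Wait()
}
```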
Finally, the third point is the strategy for continuing the program after a failed attempt to download a PDF. While correcting my code, I saw that I kept running the script on URLs I already knew were incorrect, like a switch case without a break, so the program was working on the same URL for nothing. I need a better way to continue the program and skip immediately when a wrong-URL error occurs. And for the rejected connection I mentioned earlier: how can I pause the program temporarily, keep things frozen where they are, wait a bit, and then continue as if nothing happened?
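For the "freeze, wait a bit, continue" behaviour, the usual pattern is a retry loop with exponential backoff that retries transient errors and bails out immediately on permanent ones; a sketch with simulated errors:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errBadURL = errors.New("wrong url")     // permanent: retrying is useless
var errConnReset = errors.New("conn reset") // transient: wait and retry

// download simulates two rejections followed by a success.
func download(url string, attempt int) error {
	if attempt < 2 {
		return errConnReset
	}
	return nil
}

func downloadWithRetry(url string, maxRetries int) error {
	backoff := 100 * time.Millisecond
	for attempt := 0; ; attempt++ {
		err := download(url, attempt)
		switch {
		case err == nil:
			return nil
		case errors.Is(err, errBadURL):
			return err // skip immediately, this URL will never work
		case attempt >= maxRetries:
			return fmt.Errorf("giving up after %d attempts: %w", attempt+1, err)
		default:
			time.Sleep(backoff) // freeze where we are, wait a bit...
			backoff *= 2        // ...and wait longer each time
		}
	}
}

func main() {
	fmt.Println(downloadWithRetry("https://eprint.iacr.org/2024/001.pdf", 5))
}
```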
My Docker image is still under development.
- https://medium.com/novai-go-programming-101/running-a-golang-application-with-docker-and-docker-compose-2e8d6ab41bde
- https://medium.com/@jamal.kaksouri/the-complete-guide-to-context-in-golang-efficient-concurrency-management-43d722f6eaea
- https://snyk.io/fr/blog/containerizing-go-applications-with-docker/
- https://dev.to/francescoxx/build-a-crud-rest-api-in-go-using-mux-postgres-docker-and-docker-compose-2a75