Help to import e-mails at a Closed Google Group... #15

Closed
marceliogp opened this issue May 20, 2016 · 17 comments

@marceliogp

Hi Friend,

I have a problem: I need to export all of a Google Group's messages to mbox, but this group requires me to log in with my Google account and password. The account name contains the special characters '@' and '.', and my password contains some more special characters.

When I try to use the "crawler.sh" script, it doesn't ask for a username and password, and it returns HTTP Error 403 Forbidden.

Do you know how I can resolve this problem? How can I pass my Google username and password (containing special characters like !@#$%&* and others) to the "crawler.sh" script?

Congratulations on your "crawler.sh" script and your work, and thanks for your help.

Please, if you have a little time, could you look into this and send me an answer to this authentication problem with a closed Google Group?

Thanks for everything,

Marcélio G. Pereira
[email protected]

@icy
Owner

icy commented May 20, 2016

@marceliogp If you are using Firefox, please install an add-on to manage your cookies. Then log in to your Google Group, use the add-on to export the cookies to a file, and specify a wget option so that the script loads that file. The basic steps are described here:

https://github.com/icy/google-group-crawler#private-group

Google authentication is complex and the script can't handle it. Please try with cookies and let me know if you have any problems.
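
For reference, a minimal sketch of the cookie setup, assuming the cookies were exported in the Netscape cookies.txt format that wget reads. The path is a placeholder; `_WGET_OPTIONS` is the variable the script passes to wget, and `--load-cookies` is a standard wget option.

```shell
# Placeholder path: wherever the browser add-on saved the exported cookies.
cookie_file="$HOME/my_cookies.txt"

# Tell wget (through the crawler's option variable) to send those cookies.
export _WGET_OPTIONS="--load-cookies $cookie_file"
```

The script expands `$_WGET_OPTIONS` unquoted, so a cookie path containing spaces would be split into several arguments; a path without spaces is the safer choice.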

Thanks,

@marceliogp
Author

marceliogp commented May 23, 2016

@icy, thank you very much for your help.

I downloaded all topics from my private Google Group (about 7800 messages). Your script makes a directory structure like:

  • ../ group name /mbox
  • ../ group name /msgs
  • ../ group name /threads

I used a new add-on in Thunderbird to import the mbox files, but it didn't work well. Thunderbird showed a lot of folders (with the same names as the files inside the 'mbox' folder) without any messages.

Do I need some script or software to convert this structure to PST format (or another format) that I can import into an offline e-mail client (like Thunderbird, Outlook or another)?

Thunderbird add-on: ImportExportTools (from Mozilla's web site)

Congratulations on your great work on 'google-group-crawler', and thank you very much for your help.

Best regards,

Marcélio G. Pereira
[email protected]

@icy
Owner

icy commented May 23, 2016

@marceliogp Sorry for the confusion. For some reason, I used the mbox name, but the files are in RFC 822 format instead.

If you can write some scripts you may see it's quite trivial to convert the RFC 822 files to mbox files. That is exactly what I did for my groups, but unfortunately I couldn't find the commands in my terminal's history :(

You may take a look at this: http://askubuntu.com/questions/13967/importing-mail-files-of-type-message-rfc822. It seems Thunderbird can import those files. Please give it a try and let me know if you still need some support.

Thanks a lot!

@icy
Owner

icy commented May 23, 2016

As I recall, all I needed to do was add a header line, as below.

Original file

Received: by 10.68.228.227 with SMTP id sl3mr645728pbc.5.1345774109533;
        Thu, 23 Aug 2012 19:08:29 -0700 (PDT)
...
Date: Fri, 24 Aug 2012 09:08:26 +0700
From: "Nguyen Vu Hung (vuhung)" <[email protected]>

Now insert the From and Date fields at the top of the file, and keep all other lines unchanged

From [email protected] Fri, 24 Aug 2012 09:08:26 +0700
Received: by 10.68.228.227 with SMTP id sl3mr645728pbc.5.1345774109533;
        Thu, 23 Aug 2012 19:08:29 -0700 (PDT)
...

Now you have a correct file which can be read by your mbox importer. I should have added some notice / guidelines about this!
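
The same step can be scripted. Below is a rough sketch in bash, not the exact commands used back then: it prepends a `From ` line built from each file's `From:` and `Date:` headers, as in the example above, and concatenates the results. The function names `header_value` and `rfc822_to_mbox` are made up for illustration. Strict mbox readers expect the date on the `From ` line in asctime form (for example `Fri Aug 24 09:08:26 2012`), so an importer may or may not accept the `Date:` value copied as-is.

```shell
#!/usr/bin/env bash

# header_value NAME FILE: print the value of the first NAME header in FILE.
# Only the header block (everything before the first blank line) is searched;
# folded (multi-line) header values are not joined.
header_value() {
  awk -v name="$1" '
    /^$/ { exit }
    tolower(substr($0, 1, length(name) + 1)) == (tolower(name) ":") {
      sub(/^[^:]*:[ \t]*/, "")
      print
      exit
    }' "$2"
}

# rfc822_to_mbox FILE...: write all messages to stdout as one mbox stream.
rfc822_to_mbox() {
  local f from addr date
  for f in "$@"; do
    from="$(header_value From "$f")"
    date="$(header_value Date "$f")"
    # Keep only the address if the header looks like: "Name" <user@host>
    addr="$(printf '%s\n' "$from" | sed 's/.*<\([^>]*\)>.*/\1/')"
    printf 'From %s %s\n' "${addr:-MAILER-DAEMON}" "$date"
    # Escape lines starting with "From " so they are not read as separators.
    sed 's/^From />From /' "$f"
    printf '\n'
  done
}
```

Usage could look like `rfc822_to_mbox path/to/group/mbox/* > group.mbox` (both paths are placeholders), after which the single `group.mbox` file is what gets handed to the importer.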

@icy
Owner

icy commented May 25, 2016

@icy
Owner

icy commented Aug 31, 2016

@marceliogp Were you able to solve your problem? Thanks a lot.

@marceliogp
Author

Hi friend,

I wasn't able to resolve this problem. I used your script to download all
messages, but now I don't know how to search for a subject inside this
structure.

I can't find any tool that converts the structure your script creates into
something e-mail tools like Thunderbird, Outlook or others can use.

If you find any tool that can help me, I'd really appreciate it.

Regards,

Marcelio G. Pereira
Systems Analyst
WebSite:
E-Mails: [email protected]

This message, including its attachments, is confidential and its content
is restricted to the recipient of the message. If you have received this
message in error, please return it to the sender and delete it from your
files. Any unauthorized use, replication or dissemination of this message
or any part of it is expressly prohibited.


@icy
Owner

icy commented Sep 1, 2016

I'm sorry to hear that. I will take a look at your problem. Stay tuned.

@marceliogp
Author

Thanks for all...

Regards,

Marcelio G. Pereira

@icy
Owner

icy commented May 6, 2017

Sorry for my late response. Were you able to resolve your problem? Would you mind taking a look at a similar problem here: #16? Thanks.

@cryptoque

cryptoque commented Dec 1, 2017

Hey @icy, thanks for your script. I tried it and three empty folders were quickly generated (mbox, msg, threads), except that threads contains a file called t.0, which is also empty. The command line returns:

:: Skipping './mygroupname//threads/t.0' (downloaded with 'forum/mygroupname')

Anything I have missed there?

Thanks for your help.

@icy
Owner

icy commented Dec 1, 2017

hi @cryptoque,

Are you working with closed groups? If so, the problem may be due to a wrong cookie. I will run another test with my closed group to see if anything has changed on Google's side.

@cryptoque

@icy thanks for the quick reply! It is an open group:

Anyone from the xx organization can view content.
Anyone can apply to join.
Only members can post.
Anyone from the xx organization can view the list of members.

Initially, without the cookie file, I got a 403 Forbidden. However, after exporting the cookie file as instructed and rerunning, the 403 error message was gone and the following was returned:


__wget_hook () 
{ 
    :
}
__wget__ () 
{ 
    if [[ ! -f "$1" ]]; then
        wget --user-agent="$_USER_AGENT" $_WGET_OPTIONS "$2" -O "$1";
        __wget_hook "$1" "$2";
    fi
}

@icy
Owner

icy commented Dec 1, 2017

Oh I see. For an organization's group I think you may need to set the environment variable _ORG as seen here: https://github.com/icy/google-group-crawler/blob/ee4cfe61ee83270fadef08c87acbe5876d77ff24/README.md#group-on-google-apps . Please try again and see if that helps. You may need to delete all directories generated by the script before trying.

@cryptoque

@icy Yes, that is what I tried initially, with _ORG set. I will try to locate the problem with more tests, hopefully.

@icy
Owner

icy commented Dec 1, 2017

I'm sorry that didn't help. I will test today to see if there is any problem. In the meantime, you may want to run some wget commands manually using your cookie file: crawler.sh generates a bash script, and if you look at the file, you can see the full wget command. Adding the --verbose option to $_WGET_OPTIONS also helps.
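
A sketch of such a manual run, assuming the cookie options are already in `$_WGET_OPTIONS`. `debug_fetch` is a made-up helper name, the fallback user agent is a placeholder, and the URL should be copied from the generated script rather than typed by hand.

```shell
# debug_fetch URL OUTPUT: fetch one page the way __wget__ does, plus --verbose.
debug_fetch() {
  local url="$1" out="$2"
  # $_WGET_OPTIONS is deliberately unquoted so that it splits into separate
  # options, matching how the script's own __wget__ function uses it.
  wget --verbose --user-agent="${_USER_AGENT:-Mozilla/5.0}" \
    $_WGET_OPTIONS "$url" -O "$out"
}
```

With `--verbose`, wget prints the HTTP status and any redirects, which should make it easier to tell a login redirect or a 403 apart from a genuinely empty page.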

@icy icy mentioned this issue Apr 12, 2020
@icy
Owner

icy commented Apr 13, 2020

This topic contains a few different issues. As the script now uses curl, please try it out if you have the same issues with the output. To process the resulting mbox files, please look at, for example, https://github.com/icy/google-group-crawler#what-to-do-with-your-local-archive.

Thanks a lot.

@icy icy closed this as completed Apr 13, 2020