
cnetthinner seg faults #4883

Closed
lwellerastro opened this issue Apr 4, 2022 · 8 comments
Labels
bug Something isn't working Products Issues which are impacting the products group

Comments

@lwellerastro
Contributor

lwellerastro commented Apr 4, 2022

ISIS version(s) affected: 5.0.2, 6.0.0

Description
cnetthinner segfaults and dumps an enormous core (20 G+) after "reading control points" and "adding control points to network".

This is a very large network, so I initially sent the job to the cluster and requested all of mem1, since it has the most memory (365 G). The program segfaults in about 1-1.5 hours and only peaks at about 25 G. I have also run the program directly on astrovm4, which was using the same amount of memory when it segfaulted, so at this point it does not appear to be a memory problem. That said, if and when this program is able to operate properly on this particular network, it must be run on mem1, and I think we will run up against memory issues, because a smaller yet still very large Kaguya TC network of the south pole used 265 G for cnetthinner to run.

How to reproduce
Network available under my users work directory Isis3Tests/CnetThinner/.

cnetthinner cnet=SouthPole_2020_Merged_Lidar2Image_redo12.net onet=SouthPole_2020_Merged_Lidar2Image_redo12_Thin.net  maxpoints=600 minpoints=100

cnetthinner: Reading Control Points...
100% Processed
cnetthinner: Adding Control Points to Network...
100% Processed
Segmentation fault (core dumped)

Additional context
This network was recently made available via #3871 as a semi-successful LROC network, where jigsaw can solve for camera velocities but is unable to solve for camera velocities (or acceleration) together with spacecraft position. I was trying to troubleshoot the network outside of jigsaw and was running various programs when I tried cnetthinner and found it does not work (cnetstats, cnetcheck, and some other network-oriented programs run fine). Cnetthinner fails on both the redo12 and redo13 networks that were recently listed in the jigsaw post.

I am hoping it might be easier to find the problem with this network via cnetthinner, since it gets right to the point and might be easier to debug than jigsaw.

@lwellerastro lwellerastro added bug Something isn't working Products Issues which are impacting the products group labels Apr 4, 2022
@AustinSanders AustinSanders moved this to In Progress in ASC Software Support May 3, 2022
@antonhibl antonhibl assigned antonhibl and unassigned antonhibl Feb 24, 2023
@jrcain-usgs jrcain-usgs moved this to In Progress in FY23 Q2 Support Mar 28, 2023
@AustinSanders AustinSanders removed their assignment Mar 28, 2023
@lwellerastro
Contributor Author

I have a somewhat older version of the failing network, with more images and points and slightly fewer (~13k fewer) measures than the failing network in the original post, on which cnetthinner ran successfully under isis7.1.0 (using ~75 G over 7.5 hours).

The network producing the seg fault should, in theory, run, so I'm wondering if it is corrupt in some way despite other programs running on it (cnetstats, cnetcheck, jigsaw with limited solve-for parameters set). If there is interest in the successful network, let me know.

@chkim-usgs
Contributor

Hi @lwellerastro,

I was able to recreate a segfault with the network referenced in the initial post, but not necessarily the same one, as I only received a segfault without these print statements:

cnetthinner: Reading Control Points...
100% Processed
cnetthinner: Adding Control Points to Network...
100% Processed

This is possibly due to a memory issue, since I ran it on my own machine.

Could I see the successful network you mentioned above?

@lwellerastro
Contributor Author

Hi @chkim-usgs, I created a subdirectory named SuccessfulNetwork/ in the directory mentioned above (I will edit that post to remove some detail). It appears I ran cnetthinner a little differently as far as min/maxpoints are concerned, but I don't think that's what caused the other version of the network to fail. The input and output networks are under SuccessfulNetwork/, as well as the print.prt.

Here's my command:
cnetthinner cnet=SouthPole_2017Merged_SP_and_Lidar2Image1_2023SNs.net onet=SouthPole_2017Merged_SP_and_Lidar2Image1_2023SNs_Thin.net minpoints=350 maxpoints=500

I had to send this to the cluster to get adequate memory to run and you will have to do the same. This successful run used about 75G of memory and ran for over 7 hours.

Please see proc.scr in the directory for how to send it there in a single command. I think you should have access even without a directory on /scratch, which is not needed here. You will need to be on an astro machine such as astrovm4 or astrovm5 in order to use the cluster. Those systems themselves have limited/insufficient memory and are shared resources, so they should not be used directly for this particular program and network.

@lwellerastro
Contributor Author

Based on comments in #5354, I ran cnetedit on the failing network, and cnetthinner now runs successfully on it (74 G of memory, 8.5 hours). It seems the bug has been identified; cleaning up invalid points worked around the issue.

A clean network exists under my user work area Isis3Tests/CnetThinner/CleanNet/SouthPole_2020_Merged_Lidar2Image_redo12_Edit.net

@kledmundson
Contributor

kledmundson commented Nov 30, 2023 via email

@lwellerastro
Contributor Author

lwellerastro commented Dec 4, 2023

@lwellerastro Based on my better understanding of the problem and the modified description above, I am now wondering if your success after removing ignored points and measures is truly related to this or is a fluke. Probably the easiest way to sort it out is to just run a test version (if possible) on your original network after the fix for this is submitted.

I agree that the network should be tested as-is when the fix for #5354 becomes available, just in case the seg fault I encountered is not related to that bug.

@lwellerastro
Contributor Author

I tested the originally posted network under the newly released isis8.0.2, and cnetthinner continues to segfault and dump a core despite the changes via #5354, so my recent success was a fluke.

This is still low priority and perhaps not worth the effort, since there is a workaround (running cnetedit on the network resulted in a successful cnetthinner run), and this particular network has a complicated past while new, improved versions now exist.

I personally think it's OK to close this post and re-open it in the future if a different network runs into a similar issue. I'll leave it open for a bit in case there are any objections to closing.

@lwellerastro
Contributor Author

There is no longer a need to keep this open, given the workaround and having moved away from using this particular network for any products.

@github-project-automation github-project-automation bot moved this from In Progress to Done in ASC Software Support Mar 7, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done in FY23 Q2 Support Mar 7, 2024