Aspect-web crashes if it loses contact with Aspect during initial library sync

Note: this is related to this Aspect issue.

Observed behavior

In the middle of the initial library sync, the “primary” Aspect instance stopped responding (because the laptop switched to hibernate mode). After attempting to sync 4362 additional files and receiving a “[main(8gc4) WRN] Failed to synchronize file metadata for large%20videos/example.MOV: Failed to connect to 192.168.1.1:56944: refused” warning, the server simply quit.

Expected behavior

If the connected instance disappears, I would expect aspect-web to attempt to reconnect at reasonable intervals until the connection can be restored. Only then should it continue to advance through the files queued for synchronizing.
Further, I would expect the server to gracefully handle the scenario where the connected instance disappears, especially as this could easily have been a mobile device leaving the network.
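
To illustrate the kind of reconnect behavior I mean, here is a minimal shell sketch of retrying with capped exponential backoff. It is only an illustration, not how aspect-web works internally: the function name `retry_with_backoff` and the `PEER_HOST`/`PEER_PORT` placeholders are hypothetical, and probing with `nc -z` is just one assumed way to check whether the peer is reachable.

```shell
# Hypothetical helper, not part of aspect-web.
# Usage: retry_with_backoff MAX_ATTEMPTS INITIAL_DELAY MAX_DELAY COMMAND [ARGS...]
# Runs COMMAND until it succeeds, doubling the wait between attempts
# (capped at MAX_DELAY seconds). Returns 0 on success, 1 if every attempt failed.
retry_with_backoff() {
    local max_attempts=$1
    local delay=$2
    local max_delay=$3
    shift 3
    local attempt=1
    while true; do
        if "$@"; then
            return 0    # peer reachable again; the sync queue could resume here
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            return 1    # give up after the last allowed attempt
        fi
        sleep "$delay"
        delay=$(( delay * 2 ))
        if [ "$delay" -gt "$max_delay" ]; then
            delay=$max_delay
        fi
        attempt=$(( attempt + 1 ))
    done
}

# Example (assumed probe command and placeholder variables):
#   retry_with_backoff 10 1 60 nc -z "$PEER_HOST" "$PEER_PORT"
```

The point is that the queued files would only be touched again once the probe succeeds, instead of each file failing individually with a “refused” warning.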

Steps required to reproduce

  1. Create a library with a sufficiently large number of images (my library has 174,164 images at 1.4TB)
  2. Sync library with remote aspect-web (mine running on Debian)
  3. While initially syncing with Aspect, force the Aspect instance to stop responding (pull the network cable, put system to sleep, etc)
  4. Wait…
  5. Aspect-web will eventually crash or otherwise shut down.

Operating system/Hardware used

  1. Aspect RC11 running on MacOS Sequoia 15.0 on M1 Max 16" MacBook Pro
  2. Aspect-web RC11 running on Debian GNU/Linux 11 (bullseye)

Not sure if this is related, but I have gotten thousands of these warnings, “Failed to open JIT cached file for reading”, and many of these, “Image shaders not initialized. Falling back to CPU”, during my initial sync. Eventually the server just stops responding, even when the remote Aspect is still running.

The crash should be fixed with RC-12 (on both the server side and the client side). The continuous “refused” errors should also be gone as soon as the second device has woken up again. However, while it is asleep, errors will still be reported; the best approach to handling this still needs to be determined.

Image shaders not initialized. Falling back to CPU

This indicates that the image data is being processed, which should not happen for the server version. I’ll have to see where this could potentially come from.

Failed to open JIT cached file for reading

Have there been any network disconnects (possibly because of going into standby) in that run? In that case it would be “expected” and should be fixed in RC-12, otherwise I’m not sure why that would happen, except maybe if there are images marked as missing in the source library. I’ll check how that behaves for me.

Re the Failed to open JIT cached file for reading, there were certainly disconnects at some (many) points during this initial sync. I can’t say for sure if there were any during this specific aspect-web execution, though.

RC12 seems to be a bit better, though I am not certain. At this point aspect-web reports ~4000 files remaining, but the process only downloads 10 or so before shutting down. I now have the execution command in a bash while loop to keep things going. I’ve seen a few patterns. In some runs there is no error at all (it starts downloading around “Updating export collections...” and then silently quits some time after “Library organization set up.”). In others I’ve seen a couple of messages. I’ve added sample logs below from the most recent runs. The two errors reported here seem pretty consistent at this point.

Listening for requests on https://0.0.0.0:35243/                                                         
Listening for requests on http://0.0.0.0:8083/                                                           
Acquiring library lock...                                                                                
Loading cache...                                                                                         
Loading revisions...                                                                                     
Loading library...                                                                                       
Loading existing library in file:///home/chris/.local/lib/aspect-web/libraries/Family%20Aspect%20Library 
Preloading cached metadata...                                                                            
Loading missing metadata...                                                                              
Loading weak checksums...                                                                                
Setting up duplicate detector...                                                                         
Computing reduced initial file relations...                                                              
Triggering full initial file relations...                                                                
ENABLE FOR file:///home/chris/.local/lib/aspect-web/libraries/Family%20Aspect%20Library                  
[main(82n7) WRN] Failed to get API for instance: Connecting TLS tunnel returned an error: non-recoverable socket I/O error: 0 (Success)
Updating export collections...                                                                           
[main(yASe) ERR] Failed to fetch changes: Unknown device ID: 6D47D06D.76FA1D02.940B5824.3B186CFE.4FF11CBD.10BB0456.8C60E7B5.3332FDB7
Updating local storage flags...                                                                          
Library organizaton set up.                                                                                                                                      
Listening for requests on https://0.0.0.0:35867/                                                         
Listening for requests on http://0.0.0.0:8083/                                                           
Acquiring library lock...                                                                                
Loading cache...                                                                                         
Loading revisions...                                                                                     
Loading library...                                                                                       
Loading existing library in file:///home/chris/.local/lib/aspect-web/libraries/Family%20Aspect%20Library 
Preloading cached metadata...                                                                            
Loading missing metadata...                                                                              
Loading weak checksums...                                                                                
Setting up duplicate detector...                                                                         
Computing reduced initial file relations...                                                              
Triggering full initial file relations...
ENABLE FOR file:///home/chris/.local/lib/aspect-web/libraries/Family%20Aspect%20Library
Updating export collections...
[main(ehiF) ERR] Failed to fetch changes: Unknown device ID: 6D47D06D.76FA1D02.940B5824.3B186CFE.4FF11CBD.10BB0456.8C60E7B5.3332FDB7
Updating local storage flags...
Library organizaton set up.

Of note, this is the same library and same sync instance I’ve been reporting against for the last week or two. It’s nearing the finish line but running out of gas and just blew the 3rd tire:)

Hope all of these reports are helping you dial in the remaining bugs.

Once I can get this first library sync to work I intend to try again from scratch. Possibly even using 2 instances on the server talking to one another if that is even possible.

Keep up the great work. Looking forward to RC13!

Can you see which exit code the process reports when it gets terminated by running echo $? right afterwards? If that doesn’t result in anything useful, running ./aspect-web --vv would log some additional debug information that might give a clue.
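
In case it helps with reading the result, here is a small sketch that turns the exit status into a readable message. `describe_exit` is a hypothetical helper, not something shipped with aspect-web; it relies on the common shell convention that a status above 128 means the process was terminated by signal (status minus 128).

```shell
# Hypothetical helper: print a readable interpretation of a shell exit status.
# Usage: describe_exit STATUS
describe_exit() {
    local status=$1
    local name
    if [ "$status" -eq 0 ]; then
        echo "exited normally (status 0)"
    elif [ "$status" -gt 128 ]; then
        # kill -l maps an exit status back to a signal name; fall back if it cannot
        name=$(kill -l "$status" 2>/dev/null) || name=unknown
        echo "terminated by signal $(( status - 128 )) ($name)"
    else
        echo "exited with error status $status"
    fi
}

# Example, run directly after the process ends:
#   ./aspect-web; describe_exit $?
```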

If everything else fails, we might have to run this in GDB/LLDB to hopefully get to the place where it fails. In the meantime, I will try to abuse my test library setup a bit during the clone process to try to reproduce this locally.

sorry, should have clarified. it’s exiting normally (exit 0). my loop script exits if it sees any non-zero exit code
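
For reference, the loop is roughly along these lines. This is a simplified sketch rather than the exact script: the function name `run_until_failure` and the cap on the number of runs are added here for illustration.

```shell
# Simplified sketch of a restart loop (names are illustrative).
# Usage: run_until_failure MAX_RUNS COMMAND [ARGS...]
# Re-runs COMMAND after every normal exit (status 0) and stops as soon as
# it exits with a non-zero status, returning that status to the caller.
run_until_failure() {
    local max_runs=$1
    shift
    local run=1
    local status
    while [ "$run" -le "$max_runs" ]; do
        "$@"
        status=$?
        if [ "$status" -ne 0 ]; then
            echo "run $run exited with status $status; stopping" >&2
            return "$status"
        fi
        echo "run $run exited normally (status 0); restarting" >&2
        run=$(( run + 1 ))
    done
    return 0
}

# Example:
#   run_until_failure 1000 ./aspect-web
```

Since the loop keeps going, every run so far must have ended with status 0.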

Just reran with -vv. Logs attached
aspect-sync.1.txt (530.7 KB)