In the middle of the initial library sync, the “primary” Aspect instance stopped responding (because the laptop switched to hibernate mode). After attempting to sync 4362 additional files and logging [main(8gc4) WRN] Failed to synchronize file metadata for large%20videos/example.MOV: Failed to connect to 192.168.1.1:56944: refused, the server simply quit.
Expected behavior
If the connected instance disappears, I would expect aspect-web to attempt to reconnect at reasonable intervals until the connection can be restored. Only then should it continue to advance through the files queued for synchronizing.
Further, I would expect the server to gracefully handle the scenario where the connected instance disappears, especially as this could easily have been a mobile device leaving the network.
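To make the expected behavior concrete, here is a minimal sketch of the kind of retry loop I have in mind. This is not how aspect-web works internally; `wait_until_reachable`, `tcp_probe`, `sync_queue`, and the delay values are all hypothetical names and numbers chosen for illustration. The idea is that a file which fails because the peer is unreachable stays at the head of the queue, and the queue only advances once a probe succeeds again.

```python
import socket
import time
from collections import deque


def tcp_probe(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_until_reachable(host, port, probe=tcp_probe, sleep=time.sleep,
                         base_delay=1.0, max_delay=60.0):
    """Block until probe(host, port) succeeds.

    The wait doubles after every failed attempt, capped at max_delay.
    Returns the number of failed attempts before the peer came back.
    """
    delay = base_delay
    failures = 0
    while not probe(host, port):
        failures += 1
        sleep(delay)
        delay = min(delay * 2, max_delay)
    return failures


def sync_queue(files, sync_one, host, port, **retry_kwargs):
    """Sync files in order, pausing instead of quitting when the peer vanishes.

    sync_one is a hypothetical callable that raises ConnectionError
    (for example ConnectionRefusedError) when the peer is unreachable.
    """
    pending = deque(files)
    done = []
    while pending:
        name = pending[0]
        try:
            sync_one(name)
        except ConnectionError:
            # Keep the file queued and wait for the peer to return.
            wait_until_reachable(host, port, **retry_kwargs)
            continue
        pending.popleft()
        done.append(name)
    return done
```

With something like this, a sleeping laptop or a phone leaving the network would only pause the sync, and the log would show a handful of retry messages rather than one failure per queued file.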
Steps required to reproduce
Create a library with a sufficiently large number of images (my library has 174,164 images at 1.4TB)
Sync library with remote aspect-web (mine running on Debian)
While initially syncing with Aspect, force the Aspect instance to stop responding (pull the network cable, put the system to sleep, etc.)
Wait…
Aspect-web will eventually crash or otherwise shut down.
Not sure if this is related, but I have gotten thousands of these warnings Failed to open JIT cached file for reading and many of these Image shaders not initialized. Falling back to CPU during my initial sync. Eventually the server just stops responding, even when the remote Aspect is still running.
The crash should be fixed with RC-12 (both on the server side and on the client side). The continuous “refused” errors should also be gone as soon as the second device has woken up again. However, while it is asleep, errors will still be reported; the best approach to handling this still needs to be determined.
Image shaders not initialized. Falling back to CPU
This indicates that the image data is being processed, which should not happen for the server version. I’ll have to see where this could potentially come from.
Failed to open JIT cached file for reading
Have there been any network disconnects (possibly because of going into standby) in that run? In that case it would be “expected” and should be fixed in RC-12. Otherwise I’m not sure why that would happen, except maybe if there are images marked as missing in the source library. I’ll check how that behaves for me.
Re the Failed to open JIT cached file for reading, there were certainly disconnects at some (many) points during this initial sync. I can’t say for sure if there were any during this specific aspect-web execution, though.
RC-12 seems to be a bit better, though I am not certain. At this point aspect-web reports ~4000 files remaining, but the process only downloads 10 or so before shutting down. I now have the execution command in a bash while loop to keep things going. I’ve seen a few patterns. In some runs there is no error at all: it downloads after the start of Updating export collections... and then silently quits some time after Library organization set up. In others I’ve seen a couple of messages. I’ve added sample logs below from the most recent runs. The two errors reported here seem pretty consistent at this point.
Listening for requests on https://0.0.0.0:35243/
Listening for requests on http://0.0.0.0:8083/
Acquiring library lock...
Loading cache...
Loading revisions...
Loading library...
Loading existing library in file:///home/chris/.local/lib/aspect-web/libraries/Family%20Aspect%20Library
Preloading cached metadata...
Loading missing metadata...
Loading weak checksums...
Setting up duplicate detector...
Computing reduced initial file relations...
Triggering full initial file relations...
ENABLE FOR file:///home/chris/.local/lib/aspect-web/libraries/Family%20Aspect%20Library
[main(82n7) WRN] Failed to get API for instance: Connecting TLS tunnel returned an error: non-recoverable socket I/O error: 0 (Success)
Updating export collections...
[main(yASe) ERR] Failed to fetch changes: Unknown device ID: 6D47D06D.76FA1D02.940B5824.3B186CFE.4FF11CBD.10BB0456.8C60E7B5.3332FDB7
Updating local storage flags...
Library organizaton set up.
Listening for requests on https://0.0.0.0:35867/
Listening for requests on http://0.0.0.0:8083/
Acquiring library lock...
Loading cache...
Loading revisions...
Loading library...
Loading existing library in file:///home/chris/.local/lib/aspect-web/libraries/Family%20Aspect%20Library
Preloading cached metadata...
Loading missing metadata...
Loading weak checksums...
Setting up duplicate detector...
Computing reduced initial file relations...
Triggering full initial file relations...
ENABLE FOR file:///home/chris/.local/lib/aspect-web/libraries/Family%20Aspect%20Library
Updating export collections...
[main(ehiF) ERR] Failed to fetch changes: Unknown device ID: 6D47D06D.76FA1D02.940B5824.3B186CFE.4FF11CBD.10BB0456.8C60E7B5.3332FDB7
Updating local storage flags...
Library organizaton set up.
Of note, this is the same library and same sync instance I’ve been reporting against for the last week or two. It’s nearing the finish line but running out of gas and just blew the 3rd tire :)
Hope all of these reports are helping you dial in the remaining bugs.
Once I can get this first library sync to work, I intend to try again from scratch, possibly even using 2 instances on the server talking to one another, if that is even possible.
Can you see which exit code the process reports when it gets terminated by running echo $? right afterwards? If that doesn’t result in anything useful, running ./aspect-web --vv would log some additional debug information that might give a clue.
If all else fails, we might have to run this in GDB/LLDB to hopefully get to the place where it fails. In the meantime, I will try to abuse my test library setup a bit during the clone process to try to reproduce this locally.
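Since the command is already wrapped in a restart loop, the exit status is easy to lose between runs. As a rough sketch (the helper names `describe_exit` and `run_and_report` are made up for this post, and nothing here is specific to aspect-web), a small Python wrapper can run the process and say whether it exited with a code or was killed by a signal. On POSIX systems `subprocess` reports a negative return code when the child was terminated by a signal.

```python
import signal
import subprocess


def describe_exit(returncode):
    """Turn a subprocess return code into a readable summary."""
    if returncode < 0:
        # Negative values mean the child was terminated by a signal (POSIX).
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    if returncode == 0:
        return "exited cleanly (code 0)"
    return f"exited with code {returncode}"


def run_and_report(argv):
    """Run argv to completion and return (returncode, summary)."""
    completed = subprocess.run(argv)
    return completed.returncode, describe_exit(completed.returncode)
```

Calling `run_and_report(["./aspect-web"])` inside the restart loop and printing the summary after each run would show whether the silent quits are clean exits, error exits, or crashes such as a SIGSEGV.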