Because the other side may not be listening when the compute finishes, and for privacy reasons you don't want to cache the result of the computation.
The sequence of events is:
1. Phone fires off a request to the backend.
2. Phone waits for response from backend.
The gap between 1 and 2 cannot be long because the phone is burning battery the whole time it's waiting, so there are limits to how long you can reasonably expect the device to wait before it hangs up.
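To make the wait-then-hang-up behaviour concrete, here is a rough sketch that simulates it in a single process. Everything in it is made up for illustration: `backend_compute`, `phone_request`, and the deadline values are hypothetical names and numbers, and a real phone would be doing this over the network rather than with a thread.

```python
import concurrent.futures
import time


def backend_compute(payload: str, work_seconds: float) -> str:
    """Hypothetical backend job; the sleep stands in for the real compute."""
    time.sleep(work_seconds)
    return f"result for {payload}"


def phone_request(payload: str, work_seconds: float, deadline_seconds: float):
    """Fire the request, then wait at most deadline_seconds for the answer.

    Returns the result, or None if the phone gave up first. Nothing is
    stored anywhere, so a result that arrives after the deadline is lost.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(backend_compute, payload, work_seconds)
    try:
        return future.result(timeout=deadline_seconds)
    except concurrent.futures.TimeoutError:
        return None  # the phone hangs up; the backend's answer is discarded
    finally:
        pool.shutdown(wait=False)  # do not block the caller on the worker
```

With a fast backend, `phone_request("q", 0.01, 1.0)` returns the result; with a slow one, `phone_request("q", 0.5, 0.05)` returns `None` because nobody is listening by the time the compute is done.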
In a less privacy-sensitive architecture you could:
1. Phone fires off a request to the backend and gets a token for looking up the response later.
2. Phone checks for a response later with the token.
But that requires the backend to hold onto the response, which for privacy-sensitive applications you don't want!
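A minimal sketch of the token approach shows where the privacy problem comes from. `TokenBackend`, `submit`, and `fetch` are hypothetical names, and the compute is done inline just to keep the example short; the point is only that the response has to sit in backend storage until the phone comes back for it.

```python
import secrets


class TokenBackend:
    """Hypothetical backend that hands out lookup tokens."""

    def __init__(self) -> None:
        # token -> response. This dict is the retained data that a
        # privacy-sensitive design wants to avoid holding at all.
        self.pending: dict[str, str] = {}

    def submit(self, payload: str) -> str:
        """Accept a request and return a token for fetching the result later."""
        token = secrets.token_urlsafe(16)
        # Computed inline for brevity; a real backend would do this async.
        self.pending[token] = f"result for {payload}"
        return token

    def fetch(self, token: str):
        """Return the stored response once, deleting it; None if unknown."""
        return self.pending.pop(token, None)
```

Even with delete-on-read as in `fetch`, the response lives on the backend for however long the phone takes to check in, and forever if it never does.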
Then, if the AIs are positive, the human principals can talk.
Seems quite reasonable!