# End-to-end encryption

End-to-end encryption is rather complicated. Beyond the [bare-bones implementation](https://matrix.org/docs/guides/end-to-end-encryption-implementation-guide) and [advanced e2ee features](https://matrix.org/docs/guides/implementing-more-advanced-e-2-ee-features-such-as-cross-signing), a lot of miscellaneous small tricks have been added to make the e2ee experience smoother. This file lists those tricks, in no particular order.

## Rotation of megolm sessions

Megolm sessions are rotated (cleared) after encrypting 100 messages or after one week, whichever happens first. Additionally, megolm sessions are rotated if a device leaves the room. If a new device joins the room, the megolm session is re-used and sent to that device at a later index.

## Requesting known SSSS secrets

Upon a new login you can either self-verify and cache the SSSS secrets using your recovery passphrase / recovery key to get the cross-signing and megolm backup keys, or you can self-verify via emoji and afterwards request the secrets from other devices.

For the SSSS secrets we want to cache (self-signing key, user-signing key, megolm backup key), we automatically request them from other devices after successful self-verification, if we weren't verified before and don't have them cached. Additionally, if we still don't have the secrets cached, we try to guess whether any of our other verified devices are online, at most once every 15 minutes. This is triggered when we receive `to_device` events from our own user and when messages from our own user that were not sent by this device arrive down `/sync`.

## Starting megolm sessions while typing

To speed up sending messages in e2ee rooms, megolm sessions are created and sent ahead of time, while a user is typing in the room.
While this can in theory result in a megolm session being used to encrypt zero messages (for example, if a device leaves the room between typing and sending), in most cases it improves sending performance.

## Auto-reply to foreign key requests

When sending a megolm session, we record to which device and at which index we sent it. On key requests from other users, we automatically forward the megolm session at the recorded index, as in theory they should have that key anyway. This helps recovery from unable-to-decrypt errors.

## Chunked priority sending of megolm keys

In the background we record the last activity time of all devices. This is determined by when we last received an encrypted `to_device` message from that device. (It could be optimized by also including encrypted room events.)

When creating a megolm session, we sort the device list by last activity and split it into chunks of 20. We wait for the first chunk to be sent, and send the remaining chunks in the background. This way the devices that are active right now get the key right away, and the remaining devices, prioritized by activity, get the key seamlessly in the background. Because we implemented auto-reply to foreign key requests, other devices can already request the key before they have received it, which also ensures high availability in case of a badly sorted list.

## OTK (One-Time Key) upload and failure

Because libolm can only hold up to 100 OTKs at any time, we must not upload 100 OTKs. If we did, another person might claim an OTK and, before they send you a `to_device` message, you would upload a new OTK to fill up the 100 again, forgetting the OTK the other person used. So we try to keep the number of uploaded OTKs at roughly 2/3 of the maximum, that is 66 keys.

Additionally, we must make sure that we do not lose any uploaded OTKs, even if the upload request failed. So we store the olm account, and thus the OTKs, both before and after the upload request.
We only mark the OTKs as uploaded after the request was successful. If the upload fails, the non-uploaded OTKs have already been stored. Thus, on the next upload attempt, we take the non-uploaded OTKs into account when deciding how many to create, and then retry the upload.

```mermaid
graph
  sync(Sync response says more than half of all OTKs have been used) --> generate(Generate new OTKs, so that we have up to 2/3rd of all full)
  generate --> store(Store Olm-Account and OTKs in database)
  store --> upload(Attempt to upload OTKs)
  upload -- Success --> mark(Mark OTKs as uploaded)
  mark --> store2(Store Olm-Account and OTKs in database)
  upload -- Failure --> fail(Don't do anything)
  fail --> sync
```

## Auto-recreate corrupted olm sessions

If we receive an encrypted `to_device` message that we can't decrypt, that means the olm session with the remote device got corrupted. So we create a new olm session and send an encrypted `m.dummy` via `to_device` messaging to signal the new olm session.

## Replay of sent `to_device` messages

As olm is a double ratchet, the ratchets on the receiving and the sending client must stay in sync. So a lost `to_device` event could be fatal to the olm session. Thus, we record all `to_device` messages that failed to send. Before sending the next `to_device` message (and periodically after `/sync`) we flush that queue, to make sure that the `to_device` messages are sent and the olm ratchets stay in sync.

```mermaid
graph
  trigger(Trigger to send a to_device message) --> queue(Attempt to re-send all existing to_device messages from the queue)
  queue -- Failure --> add_queue(Add to_device message to queue)
  queue -- Success --> remove_queue(Remove sent to_device messages from queue)
  remove_queue --> send(Attempt to actually send the to_device message)
  send -- Success --> Done
  send -- Failure --> add_queue
```

Additionally, when sending an encrypted `to_device` event to a device, we remember that content, one message per recipient device.
If we then receive an encrypted `m.dummy`, this usually indicates that the remote device started a new olm session, likely due to corruption. So we re-send the saved content, as it might, for example, contain a megolm key needed to decrypt messages.
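
The replay logic described in this section can be sketched roughly as follows. This is a minimal illustration in Python, not the actual implementation: `ToDeviceMessage`, `ToDeviceQueue`, `on_dummy_received` and the `transport` callable are hypothetical names, the transport is assumed to raise `ConnectionError` on failure, and olm encryption (including re-encrypting with the new session before a re-send) is left out entirely.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ToDeviceMessage:
    event_type: str
    recipient: str  # hypothetical "user_id/device_id" key
    content: dict


@dataclass
class ToDeviceQueue:
    # Stand-in for the real network call; assumed to raise ConnectionError on failure.
    transport: Callable[[ToDeviceMessage], None]
    pending: list = field(default_factory=list)
    last_sent: dict = field(default_factory=dict)

    def flush(self) -> bool:
        """Re-send queued messages in order; stop at the first failure."""
        while self.pending:
            try:
                self.transport(self.pending[0])
            except ConnectionError:
                return False
            self.pending.pop(0)
        return True

    def send(self, message: ToDeviceMessage) -> bool:
        # Remember the latest content per recipient device for a later replay.
        self.last_sent[message.recipient] = message
        # Older messages must go out first so the ratchets stay in order.
        if not self.flush():
            self.pending.append(message)
            return False
        try:
            self.transport(message)
        except ConnectionError:
            self.pending.append(message)
            return False
        return True

    def on_dummy_received(self, sender: str) -> bool:
        """An encrypted m.dummy arrived: re-send what we last sent to that device."""
        saved = self.last_sent.get(sender)
        if saved is None:
            return False
        return self.send(saved)
```

In this sketch `flush()` would also be called periodically after `/sync`. Keeping the queue strictly ordered and stopping at the first failure is a design choice of the sketch, made so that messages are never delivered out of order.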