# End-to-end encryption

End-to-end encryption is rather complicated. Beyond the
[bare-bones implementation](https://matrix.org/docs/guides/end-to-end-encryption-implementation-guide)
and [advanced e2ee features](https://matrix.org/docs/guides/implementing-more-advanced-e-2-ee-features-such-as-cross-signing),
a lot of miscellaneous small tricks have been added to make the e2ee experience smoother. This file
lists those tricks, in no particular order.

## Rotation of megolm sessions
Megolm sessions are rotated (cleared) after encrypting 100 messages or after one week, whichever happens
first. Additionally, a megolm session is rotated if a device leaves the room. If a new device joins
the room, the existing megolm session is re-used and sent to that device at its current (later) index.
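
The rotation rule above can be sketched as a single predicate. This is a minimal illustration, not the SDK's code: `OutboundSession`, its fields, and `needs_rotation` are hypothetical names, and only the thresholds (100 messages, one week) come from the text.

```python
from dataclasses import dataclass

MAX_MESSAGES = 100                   # from the text: rotate after 100 messages
MAX_AGE_SECONDS = 7 * 24 * 60 * 60   # from the text: rotate after one week


@dataclass
class OutboundSession:
    """Hypothetical bookkeeping for one outbound megolm session."""
    created_at: float            # Unix timestamp of session creation
    message_count: int = 0       # messages encrypted with this session so far
    device_left: bool = False    # set when a recipient device leaves the room


def needs_rotation(session: OutboundSession, now: float) -> bool:
    """Return True if the session should be discarded before the next send."""
    return (
        session.message_count >= MAX_MESSAGES
        or now - session.created_at >= MAX_AGE_SECONDS
        or session.device_left
    )
```

A device joining the room deliberately does not appear in the predicate, since in that case the session is shared rather than rotated.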

## Requesting known SSSS secrets
Upon a new login you can either self-verify and cache the SSSS secrets with your recovery passphrase / recovery
key to get the cross-signing and megolm backup keys, or you can self-verify via emoji and request the
secrets from your other devices after successful self-verification.

For SSSS secrets we want to cache (self-signing key, user-signing key, megolm backup key) we automatically
request those secrets from other devices after successful self-verification, if we weren't verified
before and we don't have them cached.

Additionally, if we still don't have the secrets cached, we try to intelligently guess whether any of
our other verified devices are online, at most once every 15 minutes. This is triggered by receiving
`to_device` events from our own user and by getting messages down `/sync` from our own user that weren't
sent by this device.
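
The 15-minute limit could be enforced with a small throttle like the one below. It is a sketch under assumptions: `SecretRequestThrottle` and `should_request` are hypothetical names, and the caller is assumed to invoke it whenever one of the triggers above fires.

```python
from typing import Optional

REQUEST_INTERVAL_SECONDS = 15 * 60  # from the text: at most once per 15 minutes


class SecretRequestThrottle:
    """Hypothetical helper deciding whether to ask other devices for secrets."""

    def __init__(self) -> None:
        self._last_request: Optional[float] = None

    def should_request(self, now: float, secrets_cached: bool) -> bool:
        # Nothing to do once the secrets are cached locally.
        if secrets_cached:
            return False
        # Skip if the previous request was less than 15 minutes ago.
        if (
            self._last_request is not None
            and now - self._last_request < REQUEST_INTERVAL_SECONDS
        ):
            return False
        self._last_request = now
        return True
```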

## Starting megolm sessions while typing
In order to speed up sending of messages in e2ee rooms, megolm sessions are already created and sent
while a user is typing in the room. While this can in theory result in a megolm session being used to
encrypt zero messages (e.g. if a device is removed from the room between typing and sending), in most
cases this will increase sending performance.

## Auto-reply to foreign key requests
When sending a megolm session, we record to which device and at which index we sent it. On
key requests from other users, we automatically forward the megolm session at the recorded index, as in
theory they should have that key anyway. This helps clients recover from "unable to decrypt" errors.
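
One way to picture this bookkeeping is a map from (session, user, device) to the index at which the session was shared. The names `SentSessionLog`, `record`, and `index_for_request` are hypothetical, and the actual forwarding of the key is left out.

```python
from typing import Dict, Optional, Tuple

Key = Tuple[str, str, str]  # (session_id, user_id, device_id)


class SentSessionLog:
    """Hypothetical record of which device received which session at which index."""

    def __init__(self) -> None:
        self._sent: Dict[Key, int] = {}

    def record(self, session_id: str, user_id: str, device_id: str, index: int) -> None:
        # Remember the message index at which the session was shared.
        self._sent[(session_id, user_id, device_id)] = index

    def index_for_request(
        self, session_id: str, user_id: str, device_id: str
    ) -> Optional[int]:
        # Return the recorded index, or None if we never sent this session
        # to the requesting device (in which case no auto-reply happens).
        return self._sent.get((session_id, user_id, device_id))
```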

## Chunked priority sending of megolm keys
In the background we record the last activity time of all devices. This is determined by when we
received the last encrypted `to_device` message from that device. (It could be optimized by also including
encrypted room events.) Now, when creating a megolm session, we sort the device list by last activity and
split it into chunks of 20. We wait for the first chunk to send, and send the remaining chunks in the
background. This way we make sure that the devices active right now get the key right away, and then,
prioritized by activity, the remaining devices get the keys seamlessly in the background.

As we implemented auto-reply to foreign key requests, other devices can already request the key before
they have received it, which also ensures high availability in case of a badly sorted list.
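
The sorting and chunking step might look like this. It is a minimal sketch: `prioritized_chunks` and the `last_active` mapping (device ID to Unix timestamp) are assumptions for illustration, and the network sending is omitted.

```python
from typing import Dict, List

CHUNK_SIZE = 20  # from the text: chunks of 20 devices


def prioritized_chunks(
    last_active: Dict[str, float], chunk_size: int = CHUNK_SIZE
) -> List[List[str]]:
    """Order device IDs by most recent activity and split them into chunks."""
    ordered = sorted(last_active, key=last_active.get, reverse=True)
    return [ordered[i:i + chunk_size] for i in range(0, len(ordered), chunk_size)]
```

A caller would await the send of the first chunk and schedule the rest without waiting for them.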

## OTK (One-Time Key) upload and failure
Because libolm can only hold up to 100 OTKs at any time, we must not upload all 100 OTKs. If we were to
do that, another person might claim an OTK and, before they send you a `to_device` message, you'd
generate and upload a new OTK to fill up the 100 OTKs again, causing libolm to forget the OTK the other
person claimed. So, we try to keep the number of uploaded OTKs at roughly 2/3 of the maximum, i.e. 66 keys.

Additionally, we must make sure that we do not lose any OTKs uploaded, even if the upload request
failed. So we store the olm account, and thus the OTKs, both before and after requesting. We only
mark the OTKs as uploaded after the request was successful.

If the upload fails, the non-uploaded OTKs are already stored. Thus, on the next upload attempt, we
take the non-uploaded OTKs into account when deciding how many to create, and then retry the
upload.

```mermaid
graph
  sync(Sync response says more than half of all OTKs have been used) --> generate(Generate new OTKs, so that we have up to 2/3rd of all full)
  generate --> store(Store Olm-Account and OTKs in database)
  store --> upload(Attempt to upload OTKs)
  upload -- Success --> mark(Mark OTKs as uploaded)
  mark --> store2(Store Olm-Account and OTKs in database)
  upload -- Failure --> fail(Don't do anything)
  fail --> sync
```
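
The "how many to create" calculation can be sketched as follows. The function name and parameters are hypothetical; the only values taken from the text are the limit of 100 and the 2/3 target.

```python
MAX_OTKS = 100                   # from the text: libolm holds at most 100 OTKs
TARGET_OTKS = MAX_OTKS * 2 // 3  # roughly 2/3 of the maximum, i.e. 66


def otks_to_generate(count_on_server: int, unpublished: int) -> int:
    """Number of new OTKs to create before the next upload attempt.

    count_on_server: unclaimed OTKs the server still holds for this device.
    unpublished: OTKs already generated and stored locally whose upload
                 has not succeeded yet (e.g. after a failed request).
    """
    return max(0, TARGET_OTKS - count_on_server - unpublished)
```

Counting the unpublished keys is what keeps a failed upload from generating ever more keys on each retry.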

## Auto-recreate corrupted olm sessions
If we receive an encrypted `to_device` message that we can't decrypt, that means the olm session with
the remote device got corrupted. So, we create a new olm session and send an encrypted `m.dummy` via
`to_device` messaging to signal the new olm session.

## Replay of sent `to_device` messages
As olm is a double ratchet, the ratchet state on the receiving and the sending client must match. So,
a lost `to_device` event could be fatal to the olm session. Thus, we record all `to_device` messages
that failed to send. Before sending the next `to_device` message (and periodically after `/sync`) we
empty that queue, to make sure that the `to_device` messages are sent, and thus the olm ratchets stay
in sync.

```mermaid
graph
  trigger(Trigger to send a to_device message) --> queue(Attempt to re-send all existing to_device messages from the queue)
  queue -- Failure --> add_queue(Add to_device message to queue)
  queue -- Success --> remove_queue(Remove sent to_device messages from queue)
  remove_queue --> send(Attempt to actually send the to_device message)
  send -- Success --> Done
  send -- Failure --> add_queue
```
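
The flow in the diagram can be sketched with an in-memory queue. This is illustrative only: `ToDeviceQueue` is a hypothetical class, the injected `send` callable stands in for the real HTTP request and is assumed to return `True` on success, and a real implementation would persist the queue.

```python
from typing import Callable, List


class ToDeviceQueue:
    """Hypothetical retry queue that preserves the order of to_device messages."""

    def __init__(self, send: Callable[[dict], bool]) -> None:
        self._send = send              # returns True if the message was sent
        self.pending: List[dict] = []  # messages that failed to send, oldest first

    def flush(self) -> bool:
        """Re-send queued messages in order; stop at the first failure."""
        while self.pending:
            if not self._send(self.pending[0]):
                return False
            self.pending.pop(0)
        return True

    def send(self, message: dict) -> bool:
        """Send a message, but only after everything queued before it went out."""
        if not self.flush() or not self._send(message):
            self.pending.append(message)
            return False
        return True
```

Calling `flush()` after each `/sync` covers the periodic retry mentioned above.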

Additionally, when sending an encrypted `to_device` event to a device, we remember that content, one
message per recipient device. Now, if we receive an encrypted `m.dummy`, this usually indicates that
the remote device started a new olm session, likely due to corruption. So, we re-send the saved
content, as it might e.g. contain a megolm key needed to decrypt messages.
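
A minimal sketch of that per-device memory, assuming hypothetical names (`LastSentCache`, `remember`, `content_to_resend`) and leaving the actual re-encryption and sending to the caller:

```python
from typing import Dict, Optional, Tuple


class LastSentCache:
    """Hypothetical store of the last encrypted to_device content per device."""

    def __init__(self) -> None:
        self._last: Dict[Tuple[str, str], dict] = {}

    def remember(self, user_id: str, device_id: str, content: dict) -> None:
        # Only one message per recipient device is kept; newer content overwrites.
        self._last[(user_id, device_id)] = content

    def content_to_resend(self, user_id: str, device_id: str) -> Optional[dict]:
        # Called when an encrypted m.dummy arrives from this device.
        return self._last.get((user_id, device_id))
```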