diff --git a/doc/src/SUMMARY.md b/doc/src/SUMMARY.md
deleted file mode 100644
index e913f18b..00000000
--- a/doc/src/SUMMARY.md
+++ /dev/null
@@ -1 +0,0 @@
-[About](readme.md)
diff --git a/doc/book.toml b/docs/book.toml
similarity index 100%
rename from doc/book.toml
rename to docs/book.toml
diff --git a/doc/mermaid-init.js b/docs/mermaid-init.js
similarity index 100%
rename from doc/mermaid-init.js
rename to docs/mermaid-init.js
diff --git a/doc/mermaid.css b/docs/mermaid.css
similarity index 100%
rename from doc/mermaid.css
rename to docs/mermaid.css
diff --git a/doc/mermaid.min.js b/docs/mermaid.min.js
similarity index 100%
rename from doc/mermaid.min.js
rename to docs/mermaid.min.js
diff --git a/docs/src/SUMMARY.md b/docs/src/SUMMARY.md
new file mode 100644
index 00000000..7430d08b
--- /dev/null
+++ b/docs/src/SUMMARY.md
@@ -0,0 +1,2 @@
+[About](readme.md)
+[End-to-end encryption](e2ee.md)
diff --git a/docs/src/e2ee.md b/docs/src/e2ee.md
new file mode 100644
index 00000000..5357085c
--- /dev/null
+++ b/docs/src/e2ee.md
@@ -0,0 +1,99 @@
+# End-to-end encryption
+
+End-to-end encryption is rather complicated. Beyond the
+[bare-bones implementation](https://matrix.org/docs/guides/end-to-end-encryption-implementation-guide)
+and [advanced e2ee features](https://matrix.org/docs/guides/implementing-more-advanced-e-2-ee-features-such-as-cross-signing),
+a lot of miscellaneous small tricks have been added to make the e2ee experience smoother. This file
+acts as a list of said tricks, in no particular order.
+
+## Rotation of megolm sessions
+Megolm sessions are rotated (cleared) after encrypting 100 messages or after one week, whichever happens
+first. Additionally, megolm sessions are rotated if a device leaves the room. If a new device joins
+the room, the megolm session is re-used and sent to that device at a later index.
+
+## Requesting known SSSS secrets
+Upon a new login you can either self-verify and cache SSSS secrets with your recovery passphrase / recovery
+key to get the cross-signing and megolm backup keys, or you can self-verify via emoji and afterwards
+request the secrets from your other devices.
+
+For SSSS secrets we want to cache (self-signing key, user-signing key, megolm backup key) we automatically
+request those secrets from other devices after successful self-verification, if we weren't verified
+before and we don't have them cached.
+
+Additionally, if we still don't have the secrets cached, we try to intelligently guess whether any of
+our other verified devices are online, at most once every 15 minutes. This is triggered on receiving `to_device`
+events from our own user and on getting messages from our own user down `/sync` that weren't sent by this device.
+
+## Starting megolm sessions while typing
+In order to speed up sending of messages in e2ee rooms, megolm sessions are already created and sent
+while a user is typing in the room. While this can in theory result in a megolm session being used to
+encrypt zero messages (a device is removed from the room between typing and sending), in most cases
+this will increase sending performance.
+
+## Auto-reply to foreign key requests
+When sending a megolm session we record to which device and at which index we sent the megolm session. On
+key requests from other users, we automatically forward the megolm session at the index noted, as in
+theory they should have that key anyway. This helps to improve recovery from "unable to decrypt" errors.
+
+## Chunked priority sending of megolm keys
+In the background we record the last activity time of all devices. This is determined by when we
+received the last encrypted `to_device` message from that device. (It could be optimized by also including
+encrypted room events.) Now, when creating a megolm session, we sort the device list and split it into
+chunks of 20.
+We wait for the first chunk to send, and send the remaining chunks in the background.
+This way we make sure that the devices active right now get the key right away, and then,
+prioritized by activity, the next devices get the keys seamlessly in the background.
+
+Because we implemented auto-reply to foreign key requests, other devices can already request the key before
+they have received it, which also ensures high availability in case of a badly sorted list.
+
+## OTK (One-Time Key) upload and failure
+Because libolm can only hold up to 100 OTKs at any time, we must not upload 100 OTKs. If we were to
+do that, then another person might claim an OTK and, before they send you a `to_device` message, you'd
+upload a new OTK to fill up the 100 OTKs again, forgetting the OTK the other person used. So, we try
+to keep the OTKs uploaded at roughly 2/3, so 66 keys.
+
+Additionally, we must make sure that we do not lose any OTKs uploaded, even if the upload request
+failed. So we store the olm account, and thus the OTKs, both before and after requesting. We only
+mark the OTKs as uploaded after the request was successful.
+
+If the upload fails, the non-uploaded OTKs are already stored. Thus, on the next upload attempt,
+we take the non-uploaded OTKs into account when deciding how many to create, and then retry the
+upload.
+
+```mermaid
+graph
+    sync(Sync response says more than half of all OTKs have been used) --> generate(Generate new OTKs, so that we have up to 2/3rd of all full)
+    generate --> store(Store Olm-Account and OTKs in database)
+    store --> upload(Attempt to upload OTKs)
+    upload -- Success --> mark(Mark OTKs as uploaded)
+    mark --> store2(Store Olm-Account and OTKs in database)
+    upload -- Failure --> fail(Don't do anything)
+    fail --> sync
+```
+
+## Auto-recreate corrupted olm sessions
+If we receive an encrypted `to_device` message that we can't decrypt, that means the olm session with
+the remote device got corrupted.
+So, we create a new olm session and send an encrypted `m.dummy` via
+`to_device` messaging to signal the new olm session.
+
+## Replay of sent `to_device` messages
+As olm is a double ratchet, the ratchet state on the receiving and the sending client must stay the same. So,
+a lost `to_device` event could be fatal to the olm session. Thus, we record all sent `to_device` messages
+that failed to send. Before sending the next `to_device` message (and periodically after `/sync`) we
+re-send and empty that queue, to make sure that the `to_device` messages are sent, and thus the olm ratchets stay
+in sync.
+
+```mermaid
+graph
+    trigger(Trigger to send a to_device message) --> queue(Attempt to re-send all existing to_device messages from the queue)
+    queue -- Failure --> add_queue(Add to_device message to queue)
+    queue -- Success --> remove_queue(Remove sent to_device messages from queue)
+    remove_queue --> send(Attempt to actually send the to_device message)
+    send -- Success --> Done
+    send -- Failure --> add_queue
+```
+
+Additionally, when sending an encrypted `to_device` event to a device, we remember that content, one
+message per recipient device. Now, if we receive an encrypted `m.dummy`, this usually indicates that
+the remote device started a new olm session, likely due to corruption. So, we re-send the saved
+content, as it might e.g. contain a megolm key needed to decrypt messages.
diff --git a/doc/src/readme.md b/docs/src/readme.md
similarity index 100%
rename from doc/src/readme.md
rename to docs/src/readme.md
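
Note on the "Replay of sent `to_device` messages" section added in `docs/src/e2ee.md`: the queue flow from its flowchart could be sketched roughly as below. This is only an illustration under stated assumptions, not the SDK's actual code. The class `ToDeviceSender`, its `transport` callable, and the choice of `OSError` as the failure signal are all hypothetical names made up for this sketch.

```python
from collections import deque
from typing import Callable, Deque, Dict

Message = Dict[str, object]


class ToDeviceSender:
    """Hypothetical helper: replays failed to_device messages before new ones."""

    def __init__(self, transport: Callable[[Message], None]) -> None:
        # `transport` is an assumed callable that raises OSError when sending fails.
        self._transport = transport
        self.queue: Deque[Message] = deque()

    def flush(self) -> bool:
        """Re-send queued messages in order; stop at the first failure."""
        while self.queue:
            try:
                self._transport(self.queue[0])
            except OSError:
                return False
            # Only drop a message from the queue once it was actually sent.
            self.queue.popleft()
        return True

    def send(self, message: Message) -> bool:
        """Send `message`, but only after everything queued earlier went out."""
        if not self.flush():
            # Older messages are still pending, so queue behind them to keep order.
            self.queue.append(message)
            return False
        try:
            self._transport(message)
        except OSError:
            self.queue.append(message)
            return False
        return True
```

Calling `flush()` on its own corresponds to the periodic retry after `/sync`. Keeping the queue strictly ordered is a design choice of this sketch, made so that messages reach the remote device in the order they were encrypted.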