chore: Add docs

2021-02-10 13:31:25 +01:00 · 2021-02-10 13:31:25 +01:00 · 15a444c006
parent 9fa36f736b
commit 15a444c006
8 changed files with 101 additions and 1 deletions
--- a/doc/src/SUMMARY.md
+++ b/doc/src/SUMMARY.md
@ -1 +0,0 @@
-[About](readme.md)
--- a/docs/book.toml
+++ b/docs/book.toml
--- a/docs/mermaid-init.js
+++ b/docs/mermaid-init.js
--- a/docs/mermaid.css
+++ b/docs/mermaid.css
--- a/docs/mermaid.min.js
+++ b/docs/mermaid.min.js
--- a/docs/src/SUMMARY.md
+++ b/docs/src/SUMMARY.md
@ -0,0 +1,2 @@
+[About](readme.md)
+[End-to-end encryption](e2ee.md)
--- a/docs/src/e2ee.md
+++ b/docs/src/e2ee.md
@ -0,0 +1,99 @@
+# End-to-end encryption
+
+End-to-end encryption is rather complicated. Beyond the
+[bare-bones implementation](https://matrix.org/docs/guides/end-to-end-encryption-implementation-guide)
+and [advanced e2ee features](https://matrix.org/docs/guides/implementing-more-advanced-e-2-ee-features-such-as-cross-signing)
+a lot of miscellaneous small tricks have been added to make the e2ee experience smoother. This file
+acts as a list of said tricks, in no particular order.
+
+## Rotation of megolm sessions
+Megolm sessions are rotated (cleared) after encrypting 100 messages or after one week, whichever
+happens first. Additionally, megolm sessions are rotated if a device leaves the room. If a new device
+joins the room, the existing megolm session is re-used and sent to that device at a later index.
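The rotation rule can be sketched roughly as follows. This is a minimal illustration in Python, not the SDK's actual code; `OutboundSession` and `needs_rotation` are hypothetical names, and only the thresholds (100 messages, one week) come from the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

MAX_MESSAGES = 100            # rotate after this many encrypted messages
MAX_AGE = timedelta(weeks=1)  # ...or after one week, whichever comes first


@dataclass
class OutboundSession:
    """Hypothetical record of an outbound megolm session."""
    created_at: datetime
    message_count: int = 0


def needs_rotation(session: OutboundSession, now: datetime,
                   device_left: bool = False) -> bool:
    """Return True if the session should be cleared and replaced."""
    if device_left:
        # A device leaving the room always forces a rotation.
        return True
    if session.message_count >= MAX_MESSAGES:
        return True
    return now - session.created_at >= MAX_AGE
```

Note that a joining device does not appear here: in that case the session is kept and only shared at its current index.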
+
+## Requesting known SSSS secrets
+Upon a new login you can either self-verify with your recovery passphrase / recovery key and cache
+the SSSS secrets, which gives you the cross-signing and megolm backup keys, or you can self-verify via
+emoji and afterwards request the secrets from other devices.
+
+For SSSS secrets we want to cache (self-signing key, user-signing key, megolm backup key) we automatically
+request those secrets from other devices after successful self-verification, if we weren't verified
+before and we don't have them cached.
+
+Additionally, if we still don't have the secrets cached, we try to guess whether any other of our own
+verified devices is online, at most once every 15 minutes. This is triggered by receiving `to_device`
+events from ourselves and by getting messages down `/sync` from our own user that weren't sent by
+this device.
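The "at most once every 15 minutes" limit could be enforced with a small throttle like the one below. This is a sketch under assumptions: `SecretRequestThrottle` and its method are hypothetical names, and timestamps are plain seconds supplied by the caller.

```python
from typing import Optional

REQUEST_INTERVAL_SECONDS = 15 * 60  # at most one attempt per 15 minutes


class SecretRequestThrottle:
    """Hypothetical helper that rate-limits secret re-requests."""

    def __init__(self) -> None:
        self._last_attempt: Optional[float] = None

    def should_request(self, now: float) -> bool:
        """Return True (and record the attempt) if enough time has passed."""
        if (self._last_attempt is not None
                and now - self._last_attempt < REQUEST_INTERVAL_SECONDS):
            return False
        self._last_attempt = now
        return True
```

The caller would invoke `should_request` from both triggers described above (an incoming `to_device` event from our own user, or a `/sync` message from our own user that this device did not send).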
+
+## Starting megolm sessions while typing
+To speed up sending messages in e2ee rooms, megolm sessions are already created and sent while a
+user is typing in the room. In theory this can result in a megolm session that encrypts zero messages
+(for example, if a device is removed from the room between typing and sending), but in most cases
+it improves sending performance.
+
+## Auto-reply to foreign key requests
+When sending a megolm session, we record to which device and at which index we sent it. On key
+requests from other users, we automatically forward the megolm session at the recorded index, as in
+theory they should have that key anyway. This helps recovery from "unable to decrypt" errors.
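One way to picture the bookkeeping is a map from (session, user, device) to the index at which the session was shared. The sketch below uses hypothetical names and a plain dictionary; the real storage format is not described in this document.

```python
from typing import Dict, Optional, Tuple

# (session_id, user_id, device_id) -> index at which the session was shared
SharedIndex = Dict[Tuple[str, str, str], int]


def record_share(shared: SharedIndex, session_id: str, user_id: str,
                 device_id: str, index: int) -> None:
    """Remember at which index a device was given the megolm session."""
    shared[(session_id, user_id, device_id)] = index


def index_to_forward(shared: SharedIndex, session_id: str, user_id: str,
                     device_id: str) -> Optional[int]:
    """Return the index to forward on a key request, or None if the
    session was never shared with that device (no auto-reply then)."""
    return shared.get((session_id, user_id, device_id))
```

Forwarding only at the recorded index means a requesting device never receives more history than it was originally given.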
+
+## Chunked priority sending of megolm keys
+In the background we record the last activity time of all devices. This is determined by when we
+last received an encrypted `to_device` message from that device. (It could be optimized by also
+including encrypted room events.) When creating a megolm session, we sort the device list by activity
+and split it into chunks of 20. We wait for the first chunk to be sent and send the remaining chunks
+in the background. This way the devices that are active right now get the key right away, and the
+remaining devices get the keys seamlessly in the background, prioritized by activity.
+
+Because we implemented auto-reply to foreign key requests, other devices can already request the key
+before they have received it, which also ensures high availability in case of a badly sorted list.
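The sort-and-chunk step might look like this. It is a minimal sketch: `chunk_by_activity` is a hypothetical name, and last-activity times are assumed to be numeric timestamps keyed by device ID.

```python
from typing import Dict, List

CHUNK_SIZE = 20  # devices per chunk, as described above


def chunk_by_activity(last_active: Dict[str, float],
                      chunk_size: int = CHUNK_SIZE) -> List[List[str]]:
    """Sort device IDs by most recent activity and split them into chunks."""
    ordered = sorted(last_active, key=lambda dev: last_active[dev], reverse=True)
    return [ordered[i:i + chunk_size] for i in range(0, len(ordered), chunk_size)]
```

The caller would wait for the first returned chunk to be sent and dispatch the remaining chunks in the background.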
+
+## OTK (One-Time Key) upload and failure
+Because libolm can only hold up to 100 OTKs at a time, we must not upload 100 OTKs. If we did,
+another person might claim an OTK and, before they send you a `to_device` message, you would
+upload a new OTK to fill up the 100 OTKs again, forgetting the OTK the other person used. So we try
+to keep the number of uploaded OTKs at roughly 2/3 of that limit, i.e. 66 keys.
+
+Additionally, we must make sure that we do not lose any OTKs uploaded, even if the upload request
+failed. So we store the olm account, and thus the OTKs, both before and after requesting. We only
+mark the OTKs as uploaded after the request was successful.
+
+If the upload fails, we have already stored the non-uploaded OTKs. Thus, the next time we attempt to
+upload, we take the non-uploaded OTKs into account when deciding how many to create, and then retry
+the upload.
+
+```mermaid
+graph
+  sync(Sync response says more than half of all OTKs have been used) --> generate(Generate new OTKs, so that up to 2/3 of the OTK slots are filled)
+  generate --> store(Store Olm-Account and OTKs in database)
+  store --> upload(Attempt to upload OTKs)
+  upload -- Success --> mark(Mark OTKs as uploaded)
+  mark --> store2(Store Olm-Account and OTKs in database)
+  upload -- Failure --> fail(Don't do anything)
+  fail --> sync
+```
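The "how many OTKs to create" step can be illustrated with a small calculation. This is a sketch with hypothetical names; it assumes the caller knows how many OTKs the server still holds and how many were generated locally but never successfully uploaded.

```python
MAX_OTKS = 100                       # limit stated above for libolm
TARGET_UPLOADED = MAX_OTKS * 2 // 3  # roughly 2/3 of the limit, i.e. 66


def otks_to_generate(uploaded_on_server: int, pending_unuploaded: int) -> int:
    """Number of new OTKs to create so that uploaded plus pending keys
    reach the target, never returning a negative count."""
    missing = TARGET_UPLOADED - uploaded_on_server - pending_unuploaded
    return max(0, missing)
```

After a failed upload the pending keys are still stored, so the next attempt generates fewer (or no) new keys and simply retries the upload of the existing ones.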
+
+## Auto-recreate corrupted olm sessions
+If we receive an encrypted `to_device` message that we can't decrypt, that means the olm session with
+the remote device got corrupted. So, we create a new olm session and send an encrypted `m.dummy` via
+`to_device` messaging to signal the new olm session.
+
+## Replay of sent `to_device` messages
+As olm is a double ratchet, the ratchet state on the sending and the receiving client must stay in
+sync. So, a lost `to_device` event could be fatal to the olm session. Thus, we record all `to_device`
+messages that failed to send. Before sending the next `to_device` message (and periodically after
+`/sync`) we empty that queue, to make sure that the `to_device` messages are sent and the olm
+ratchets stay in sync.
+
+```mermaid
+graph
+  trigger(Trigger to send a to_device message) --> queue(Attempt to re-send all existing to_device messages from the queue)
+  queue -- Failure --> add_queue(Add to_device message to queue)
+  queue -- Success --> remove_queue(Remove sent to_device messages from queue)
+  remove_queue --> send(Attempt to actually send the to_device message)
+  send -- Success --> Done
+  send -- Failure --> add_queue
+```
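The flow in the diagram can be sketched as a single function. The names are hypothetical, and `send` stands in for whatever transport call actually delivers a `to_device` message, returning `True` on success.

```python
from typing import Callable, List


def send_with_replay(queue: List[dict], message: dict,
                     send: Callable[[dict], bool]) -> bool:
    """Flush previously failed messages in order, then send `message`.

    If anything fails, `message` is appended to the queue so that it is
    retried later in the original order.
    """
    while queue:
        if not send(queue[0]):
            # Re-sending failed: keep the queue intact and enqueue the new message.
            queue.append(message)
            return False
        queue.pop(0)  # successfully re-sent, drop it from the queue
    if send(message):
        return True
    queue.append(message)
    return False
```

Stopping at the first failed re-send keeps the messages in their original order, which matters because each one advances the olm ratchet.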
+
+Additionally, when sending an encrypted `to_device` event to a device, we remember that content, one
+message per recipient device. Now, if we receive an encrypted `m.dummy`, this usually indicates that
+the remote device started a new olm session, likely due to corruption. So, we re-send the saved
+content, as it might e.g. contain a megolm key needed to decrypt messages.
--- a/docs/src/readme.md
+++ b/docs/src/readme.md