GSoC - Bitrate Estimation and Congestion Control in the Jitsi Kernel (libjitsi). I am implementing a portion of the WebRTC framework in the Jitsi app (meet.jit.si).
I am going to keep things very simple and start from the basics.
Things you should understand first:
What is a frame? Think of a frame as a picture - yeah, just a single selfie, a snapshot.
What is a video? Let's think of a video as a sequence of frames (several snapshots). Imagine taking a selfie of yourself every second while trying to sing your national anthem (capturing the shape of your mouth from start to finish). Okay, one selfie per second isn't good enough, so let's take that selfie 30 times every second. Your hand definitely can't click that fast, so you would rather use the camera's built-in timer (very few people create motion pictures that way nowadays) or, better still, record yourself.
Assuming you recorded for 5 seconds, you have a total of 5 seconds x 30 frames = 150 frames. If you gave a friend those 150 frames and they viewed those 150 pictures over 1 hour, they wouldn't notice it's actually a video; they would likely get bored looking at 150 versions of you and not even notice the difference in the shape of your mouth from picture to picture. But if they replayed those 150 frames in 5 seconds, they would re-create the video from the 150 frames, and it would appear as though they are watching a recording of you actually singing - and the funny shapes your mouth makes would clearly be obvious. Alright, that is the intuition and concept of video.
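The arithmetic above is worth making concrete. Here is a tiny sketch (not part of any real codebase) showing that capturing at 30 frames per second for 5 seconds gives 150 frames, and that playing those same frames back at 30 fps reconstructs the original 5-second clip:

```python
fps = 30          # frames captured (and later played back) per second
duration_s = 5    # seconds of recording

total_frames = fps * duration_s        # how many snapshots you generated
playback_time_s = total_frames / fps   # how long playback takes at 30 fps

print(total_frames)     # 150
print(playback_time_s)  # 5.0
```

The same two numbers, frame rate and frame count, are all the receiver needs to replay the clip at the right speed.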
3 things you should take away are:
1) You GENERATED 150 frames (pictures or snapshots) in 5 seconds.
2) You TRANSFERRED them to your friend (we don't know how - maybe on a flash drive as a 5-second video, or maybe you were transferring the frames immediately while generating them. That is called live streaming, and it is what we are dealing with).
3) Your friend has to PLAY BACK the 150 frames in 5 seconds for it to make sense (for it to appear as though you were really singing - well, maybe lip syncing, since you didn't send any audio. Let's not worry about audio for now, though the same concepts apply).
Those are the basic concepts. A lot of things could go wrong at any of the stages, but let's assume
- that your video camera captures frames (pictures) well, and the tiny chip on your computer or electronic device can convert and manipulate (encode and compress - don't worry about the terms) those images into 0s and 1s. That means stage 1 is fine.
- that your friend's computer can decode what your video camera captured and play it back. Stage 3 is fine.
STAGE 2 is where I come in for GSoC. What happens at stage 2?
When the 150 frames leave your device and are sent out onto the internet, they travel through the network to a "server" that both of you initially connected to. Imagine you sent the frames through an app called Jitsi (you can try it at meet.jit.si - it's free and you don't need an account) or Google Hangouts; you and your friend would both be connected to Jitsi's or Google's server. The Jitsi server in this case serves as the connector between the two of you.
My Google Summer of Code project involves implementing and tightening the nuts and bolts of the code that estimates the number of bits (out of the bits making up the 150 frames) you are sending to the Jitsi server every second (the Jitsi Videobridge + the libjitsi library - don't worry, those are just names). It also involves the code that estimates the number of bits your friend can receive every second (yes, because you use a 4G broadband connection and your friend uses 3G, or because 100 people share your network while only your friend uses his - simply put, network conditions differ on both sides). This portion of code in libjitsi calculates and sends you (the sender) feedback on how congested your friend's bandwidth is, based on the reports your friend sends to the Jitsi server (a very large piece of code, available for free for anyone to use - also called open-source code) about the details of previous frames, e.g. their arrival times (if any arrived at all).
This works because when you send your frames to the server (the videobridge), it keeps a record of the details: the time it received each frame from you and the time it sent that frame on to your friend. With the report your friend sends back to the server about the frames he has received (e.g. the arrival times and the sequence numbers), the server can estimate your friend's bandwidth (the number of bits he can handle per second) and tell you to adjust your sending bitrate (slow down or ramp up your sending speed). The videobridge can also use the estimate to schedule how it forwards the frames it has already received from you. All of this is done to avoid overwhelming the receiver (your friend), who might be receiving other packet streams at the same time, and to stop you from sending extra-high-quality frames when your friend's network conditions only allow low-quality frames (or vice versa). It also helps avoid packet loss - yes, that is what happens when you send more frames than your friend can handle: his memory (buffer) fills up, and the remaining chunks or frames are dropped until he has played back the frames or chunks in the buffer (remember, by "playing back" he is watching the video and therefore emptying the buffer). Finally, this code makes it possible to schedule the data so that your friend can play it back in a way that makes sense.
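To make the "estimate the number of bits per second" idea concrete, here is a deliberately simplified sketch (this is NOT the actual libjitsi/WebRTC code, which uses far more sophisticated filtering and overuse detection) of how a receiver could estimate incoming bitrate from packet sizes and arrival times alone:

```python
from collections import deque

class SimpleBitrateEstimator:
    """Toy receive-side estimator: bits seen in the last `window_ms`
    milliseconds, divided by the window length. The core inputs are the
    same ones the real estimators use: packet sizes and arrival times."""

    def __init__(self, window_ms=1000):
        self.window_ms = window_ms
        self.packets = deque()  # (arrival_time_ms, size_bytes)

    def incoming_packet(self, arrival_time_ms, size_bytes):
        self.packets.append((arrival_time_ms, size_bytes))
        # Evict packets that have fallen out of the observation window.
        while self.packets and arrival_time_ms - self.packets[0][0] > self.window_ms:
            self.packets.popleft()

    def bitrate_bps(self):
        total_bits = sum(size * 8 for _, size in self.packets)
        return total_bits * 1000 // self.window_ms

# 10 packets of 1250 bytes arriving within one second -> 100 000 bits/s
est = SimpleBitrateEstimator()
for t in range(0, 1000, 100):
    est.incoming_packet(t, 1250)
print(est.bitrate_bps())  # 100000
```

The server computes something like this number for your friend's side, then feeds it back to you so you can adjust your sending rate.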
I know at this point you are probably thinking: a 5-second video is 150 frames; the network can send all these frames at once in less than 5 seconds, and they will arrive as a complete video file. Well, not necessarily. In fact, frames are sent across the network using a standard known as RTP (Real-Time Transport Protocol), and usually a single frame is so big that it needs to be broken up and sent as different chunks. So it is necessary to track the different chunks that make up a single frame, and hence we tag labels (RTP headers) onto each chunk. What you need to know about RTP at this point is that a single frame out of the 150 can be split into different chunks - e.g. 1 frame (picture) split into 7 chunks will be sent as 7 RTP packets. Each of the 7 RTP packets will carry: a common unique label (an SSRC) identifying you, the sender; a timestamp from when the packet was generated (helpful for reconstructing the frame, since all chunks belonging to the same frame share the same timestamp, and also usable for playback); and a sequence number showing the order of frames and chunks - for example, if one chunk is lost, we can easily spot the missing sequence number and request the missing chunk. Note: if your 1st frame is split into 7 chunks, the RTP packets get sequence numbers 1 to 7, and the second frame starts from sequence number 8. The RTP header contains other fields too. RTP packets have a brother packet that transmits more detailed data about the RTP packets your friend has received so far. This packet is called an RTCP packet. I will leave the acronyms - RTP and RTCP - as an assignment.
If you have gotten to this point, you now have an intuition of what we are trying to achieve at the Jitsi bridge. So keep reading.
Part of what I am working on to make the above features possible is the remote bitrate estimator. In Jitsi, as in WebRTC, we have the Remote_Bitrate_Estimator_Single_Stream (RBESS) and the Remote_Bitrate_Estimator_Abs_Send_Time (RBEABS - note I shortened it because it is long to type). These two bitrate estimators try to achieve the same thing; however, they use different fields and different methods. In particular, RBESS uses the RTP timestamp (when the packet was generated) to calculate the remote bitrate estimate (RBE), while RBEABS uses the timestamp of when an RTP packet (which could be a chunk) was actually sent. Okay, don't be confused - I've mentioned two timestamps, and you are probably asking how we differentiate them. Well, the timestamp in the RTP packet itself indicates when the chunk or RTP packet was generated. The actual send time travels separately: the sender (you) stamps it onto each packet in a small extra field called the abs-send-time RTP header extension. Don't worry, just know that the sender records and transmits the actual send time separately.
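Whichever timestamp an estimator uses, the key signal it derives is the same: how the gap between arrivals compares with the gap between sends. Here is a simplified illustration (the real Google Congestion Control groups packets into bursts and filters measurement noise before drawing conclusions):

```python
def delay_variation_ms(send_ms, arrival_ms):
    """One-way delay variation between consecutive packets:
    (arrival_i - arrival_{i-1}) - (send_i - send_{i-1}).
    A run of positive values suggests a queue is building up on the
    path (congestion); values near zero mean the path is keeping up."""
    return [
        (arrival_ms[i] - arrival_ms[i - 1]) - (send_ms[i] - send_ms[i - 1])
        for i in range(1, len(send_ms))
    ]

# Packets sent every 20 ms but arriving progressively later:
send    = [0, 20, 40, 60]
arrival = [50, 72, 96, 122]
print(delay_variation_ms(send, arrival))  # [2, 4, 6] -> growing queue delay
```

RBESS approximates the send times from the RTP timestamps, while RBEABS reads them directly from the abs-send-time field; the downstream congestion logic consumes the same kind of delta either way.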
Note: I have only covered the case in which you are the one sending 150 frames as a live stream. Normally, in conversational live video, sending and receiving of video and audio data is done by both you and your friend, so you are both senders and receivers at the same time. Similarly, we can view the Jitsi server (videobridge) as both a receiver and a sender (e.g. you send your RTP packets, the Jitsi server RECEIVES them and SENDS them to your friend, your friend receives them, and vice versa).
To see the app in action, visit meet.jit.si
To see the code I've written, visit www.github.com/jcchuks/libjitsi
To see the original repo - www.github.com/jitsi/libjitsi
To read up on webrtc, visit www.webrtc.org
To get up to speed with the WebRTC implementations, you can read through:
Real-time Communication with WebRTC - a WebRTC codelab by Google. (Go through)
Analysis and Design of the Google Congestion Control for Web Real-time Communication (WebRTC) - Academic Paper (Read)
A Google Congestion Control Algorithm for Real-Time Communication - 2016 (Read)
RTP Extensions for Transport-wide Congestion Control - 2015 (Read)
A Google Congestion Control Algorithm for Real-Time Communication-2015 (Reference)
RTCP message for Receiver Estimated Maximum Bitrate (Reference)
A Google Congestion Control Algorithm for Real-Time Communication on the World Wide Web - 2012 (Reference)
Also RFC 3550 (Reference)
June 24th, 2017. (Already 4 weeks into GSoC)
Please report any errors, corrections or improvements to jcchuks_[at]_ymail_[dot] _com. (To retrieve email, please remove all underscores, convert [at] and [dot] to symbols ) Thanks.