W3C Workshop on Web and Machine Learning

AI-Powered Per-Scene Live Encoding - by Anita Chen (Fraunhofer FOKUS)


Slide 1: AI-Powered Per-Scene Live Encoding. Anita Chen | W3C Workshop on Web and Machine Learning | September 2020. Fraunhofer FOKUS, Institut für Offene Kommunikationssysteme.

Hello everybody, my name is Anita Chen and I work as a project manager at Fraunhofer Fokus in Berlin.

My lightning talk for this workshop will be about using AI methodologies to predict optimal video encoding ladders on the web.

Slide 2: Agenda
1. Per-Title Encoding: basics
2. Web-based AI Solution for Per-Title/Per-Scene Encoding
3. Summary and Outlook

For this talk, I will first cover the basics of per-title encoding.

Next, I will introduce our web-based AI solution for per-title/per-scene encoding.

Lastly, I will recap this presentation as well as discuss next steps in this project.

1. Per-Title Encoding - Basics

For the per-title encoding section, I will provide a brief overview of per-title encoding, its differences from other encoding methods, as well as its advantages and disadvantages.

Slide 4: Per-Title Encoding | what? why?

Apple h264 encoding ladder, applied across all types of content:

Resolution | Bitrate (kb/s)
416x234 | 145
640x360 | 365
512x384 | 560
768x432 | 730
768x432 | 2000
1280x720 | 3000
1280x720 | 4500
1920x1080 | 6000
1920x1080 | 7800

Content ranges from low complexity / high redundancy (e.g. animation, nature documentaries) to high complexity / less redundancy (e.g. action, sport).

In a standard encoding ladder, bitrate/resolution pairs are fixed and the same encoding settings are applied across all types of videos.

For example, with the Apple h264 encoding ladder, a 1080p video would be encoded with 7800 kbit/s.

However, content comes in many forms: animation, nature documentaries, action and sports footage all differ in complexity and redundancy, yet the same encoding ladder is applied to all of them.

As a result, bitrate is either underused or overused, which can drive up storage costs.

With per-title encoding, however, encoding settings are based on the content itself.

With per-scene encoding, encoding settings are adjusted based on different scenes within the content.

Slide 5: Per-Title Encoding | how does it work?

Reference video: 1080p sports video

Encoding | Bitrate (kb/s) | VMAF | PSNR
Static | 7800 | 99 | 42.1
Per-title | 3370 | 94.5 | 37.6
Per-scene | 2170 | 93 | 35.4

So, how does per-title encoding work?

First, the source video file is analyzed for its complexity.

Based on this analysis, several test encodes are produced and their corresponding VMAF values are calculated.

These test encodes consist of different encoding settings.

Then, a convex hull is estimated, so that the resulting encoding ladder consists of bitrate/resolution pairs that lie close to the convex hull.

Finally, production encoding is performed, where a video is encoded based on the resulting ladder.
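To make the first steps concrete, here is a minimal sketch of generating test encodes with ffmpeg and libx264; the source file name and the candidate resolution/bitrate pairs are placeholders, not the actual settings used in our experiments.

```python
import subprocess

SOURCE = "source_1080p.mp4"                          # hypothetical source file
CANDIDATES = [(234, 145), (360, 365), (432, 730),    # (height, kbit/s) test points
              (720, 3000), (1080, 6000)]

def test_encode(height: int, bitrate_kbps: int) -> str:
    """Produce one test encode at the given resolution/bitrate pair."""
    out = f"test_{height}p_{bitrate_kbps}k.mp4"
    subprocess.run([
        "ffmpeg", "-y", "-i", SOURCE,
        "-vf", f"scale=-2:{height}",                 # keep aspect ratio, set height
        "-c:v", "libx264", "-b:v", f"{bitrate_kbps}k",
        "-an", out,                                  # audio is irrelevant for quality tests
    ], check=True)
    return out

test_files = [test_encode(h, b) for h, b in CANDIDATES]
```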

For purposes of comparison, we used a 1080p sports video and encoded it with 3 different methods.

In this context, the benchmarks for quality comparison are VMAF and PSNR, both of which are video quality metrics.

VMAF was developed by Netflix a few years ago in order to capture the perception of video quality in a more accurate fashion.

It is measured on a scale of 0-100, with 100 being perfect quality.
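For illustration, a VMAF score can be computed with ffmpeg's libvmaf filter, assuming an ffmpeg build with libvmaf support; the file names are placeholders, and depending on the ffmpeg version the distorted and reference inputs may need to be swapped or scaled to the same resolution first.

```python
import subprocess

def vmaf_score(distorted: str, reference: str, log: str = "vmaf.json") -> None:
    """Run libvmaf and write per-frame and pooled scores to a JSON log."""
    subprocess.run([
        "ffmpeg", "-i", distorted, "-i", reference,
        "-lavfi", f"libvmaf=log_fmt=json:log_path={log}",
        "-f", "null", "-",
    ], check=True)

vmaf_score("per_title_1080p.mp4", "source_1080p.mp4")
```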

On slide 5, you can see a comparison of bitrates and quality scores between each type of encoding.

With per-title and per-scene encoding, bitrates and bandwidth are reduced by at least 50%, while maintaining the same quality without any perceptual loss.

Additionally, through comparing file sizes, we found that storage and delivery also decreased by at least 50%.

However, a major disadvantage in this method is that a large number of test encodes is required in order to derive a proper/accurate encoding ladder.

2. Web-based AI Solution for Per-Title/Per-Scene Encoding

To overcome the challenges of per-title and per-scene encoding, we developed a few models that can predict video quality based on certain encoding settings.

With this technology, large sets of test encodings are not needed.

Slide 7: AI-Powered PTE & PSE | workflow
• Source video (live & VoD): feed the video source into the distribution workflow
• Complexity Analysis: analyze source information (resolution, framerate), calculate quality, review test encodings for reference
• Machine Learning: select an appropriate regression algorithm based on the complexity analysis to predict an optimized encoding profile
• Encoding & Packaging: encode the video with the optimized profile
• CDN: provide availability for playback streaming
• Video Player: play the optimized stream, return streaming metrics and QoS parameters

As you can see on slides 7-8, we've developed a workflow with integrated machine learning models that can be adapted for per-title, per-scene and live encoding.

First, a source video is fed into the distribution workflow.

The source video's complexity is analyzed (resolution, frame rate, etc.).

A regression algorithm is used to predict the optimized encoding profile, which is then applied to the video for production encoding.
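As a hedged sketch of this prediction step, the snippet below assumes a pre-trained scikit-learn-style regressor; the pickle file, feature layout, and complexity values are invented for illustration, and the target VMAF of 93 is only an example threshold.

```python
import pickle
import numpy as np

with open("vmaf_regressor.pkl", "rb") as f:      # hypothetical pre-trained model
    model = pickle.load(f)

# Candidate 1080p rungs at increasing bitrates (kbit/s).
candidates = [(1920, 1080, b) for b in range(1000, 8000, 250)]
features = np.array([
    # [width, height, bitrate, spatial_info, temporal_info] per candidate
    [w, h, b, 61.2, 14.8] for w, h, b in candidates
])
predicted_vmaf = model.predict(features)

# Keep the cheapest bitrate whose predicted VMAF clears the quality target.
viable = [(b, v) for (_, _, b), v in zip(candidates, predicted_vmaf) if v >= 93]
best = min(viable, key=lambda x: x[0]) if viable else None
print("selected bitrate / predicted VMAF:", best)
```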

The encoded video can then be made available for playback streaming.

In a live streaming scenario, 5-second clips are cut and analyzed in parallel.
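A rough sketch of that live path is shown below, assuming ffmpeg's segment muxer is available; the 5-second segment length comes from the talk, while the file names and the analyze() stub are placeholders for our actual complexity analysis.

```python
import glob
import subprocess
from concurrent.futures import ProcessPoolExecutor

def cut_segments(live_input: str) -> list[str]:
    """Cut the incoming stream into 5-second clips with ffmpeg's segment muxer."""
    subprocess.run([
        "ffmpeg", "-i", live_input,
        "-c", "copy", "-f", "segment", "-segment_time", "5",
        "clip_%04d.ts",
    ], check=True)
    return sorted(glob.glob("clip_*.ts"))

def analyze(clip: str) -> dict:
    """Placeholder per-clip complexity analysis (real feature extraction goes here)."""
    return {"clip": clip, "complexity": 0.0}

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(analyze, cut_segments("live_feed.ts")))
```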

Slide 8: AI-Powered PTE & PSE | usage of Web APIs
• Source video (live & VoD): current: video file upload in the browser; upcoming: video upload via link in the browser
• Complexity Analysis: current: server-side analysis, with analysis data sent to the model via an endpoint; upcoming: client-side analysis
• Machine Learning: current: client-side predictions; upcoming: Model Loader and Web Neural Network APIs for town hall meetings and live streaming scenarios
• Encoding & Packaging: current: server-side production encoding; upcoming: utilize the WebCodecs API for town hall meetings
• CDN, Video Player

Slide 8 presents the current and upcoming components for our end-to-end solution.

The main goal in integrating web APIs is to improve the performance and speed of our end-to-end solution for several use case scenarios.

Our current web solution includes basic browser functionalities and a video upload via the browser.

However, a more 'web-friendly' solution for this step in the workflow involves an automatic video upload via a URL.

Our current complexity analysis is conducted server-side.

All extracted video data is sent to our machine learning models via an endpoint.
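For illustration only, such a request might look like the following; the endpoint URL and JSON fields are assumptions, not our actual API.

```python
import requests

# Extracted complexity features for one video (values are made up).
analysis = {
    "width": 1920, "height": 1080, "framerate": 50,
    "spatial_info": 61.2, "temporal_info": 14.8, "scene_changes": 42,
}

resp = requests.post("https://example.org/api/predict-ladder",
                     json=analysis, timeout=30)
resp.raise_for_status()
print(resp.json())   # e.g. a list of {resolution, bitrate, predicted_vmaf} entries
```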

As Francois stated in his presentation, “Media Processing Hooks Through the Web”, videos are processed and analyzed through frame extraction, which is also implemented for our video analysis component.

However, server-side video analysis slows down the overall end-to-end solution and increases server-side costs.

With that, a client-side analysis would improve the overall time of our solution.

For our machine learning component, the models are pre-trained and loaded in the browser to produce client-side predictions.

However, each model produces several predictions that must be filtered down in order to form a proper optimal encoding ladder.

To overcome this issue, we've developed an encoding ladder API in order to filter through the results, and select the bitrate/resolution pairs that lie close to the convex hull.
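The filtering idea can be sketched as a simple Pareto selection over the predicted points: a rung is kept only if no cheaper rung already reaches the same or higher predicted VMAF, so the surviving pairs hug the convex hull. The values below are invented, and this is a sketch rather than the actual implementation of our encoding ladder API.

```python
predictions = [
    {"resolution": "1920x1080", "bitrate": 6000, "vmaf": 96.1},
    {"resolution": "1920x1080", "bitrate": 4500, "vmaf": 95.8},
    {"resolution": "1280x720",  "bitrate": 3000, "vmaf": 93.2},
    {"resolution": "1280x720",  "bitrate": 2600, "vmaf": 90.0},
    {"resolution": "768x432",   "bitrate": 1400, "vmaf": 84.5},
]

def pareto_ladder(points):
    """Sort by bitrate and keep only points that improve predicted quality."""
    ladder, best_vmaf = [], -1.0
    for p in sorted(points, key=lambda p: p["bitrate"]):
        if p["vmaf"] > best_vmaf:
            ladder.append(p)
            best_vmaf = p["vmaf"]
    return ladder

for rung in pareto_ladder(predictions):
    print(rung)
```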

We are currently looking into the Web Neural Network and Model Loader APIs to increase the performance of our models for use case scenarios such as super resolution for larger company-wide conference calls, our live streaming solution, and ultimately a faster and more automated process for generating predictions.

In the encoding/packaging component, while our production encoding is done server-side, integrating the WebCodecs API for our live streaming townhall meetings can also optimize the overall end-to-end solution.

Slide 9: AI-Powered PTE & PSE | web interface

On slide 9, you’ll find screenshots of our web interface, which follows our previously discussed workflow.

In the middle screenshot, you can see our server-side video analysis taking place.

So, features such as the video metadata, characteristics, classification, scene changes, and spatial/temporal information are extracted and sent as requests to the machine learning endpoint.

Once analyzed, any of our pre-trained models can be selected to determine the optimal encoding ladder, which consists of the resolution, bitrate, and predicted VMAF score.

With this browser-based solution, users can filter down the predictions to certain encoding ladder representations, and perform a production encoding on the selected representations.

Slide 10: AI-Powered PTE & PSE | algorithms
• Feed Forward (framework: Keras): robust enough to support missing input values; requires a large amount of data for better performance
• Stacked Model (framework: Keras): flexible in terms of combining models; each model requires several fine-tuning cycles to perform well together
• XGBoost (framework: SKlearn): does not require as much normalization for training attributes; certain encoding methods can weaken the model performance
• Convolutional Neural Network (framework: Tensorflow): flexible and video-friendly in terms of processing; computationally expensive, training can be slow without a strong GPU

Slide 10 provides a brief overview of the models we've developed.

These models have been extensively trained on over 20 video attributes and thousands of test encodes derived from videos ranging from 360p - 1080p.

As a result, we've developed four working models, of which XGBoost has been the best performing in terms of accuracy.

We also have a convolutional neural network, which is geared towards image & video processing, as well as the feed forward and stacked model.
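As a rough idea of how one of these regressors might be trained, the sketch below uses XGBoost's scikit-learn-style API; the CSV file and its columns are placeholders standing in for the video attributes and measured VMAF scores described above, not our actual training data.

```python
import pandas as pd
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split

data = pd.read_csv("test_encodes.csv")     # hypothetical table of attributes + VMAF
X = data.drop(columns=["vmaf"])            # per-encode video/encoding attributes
y = data["vmaf"]                           # measured quality of each test encode

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = XGBRegressor(n_estimators=400, max_depth=6, learning_rate=0.05)
model.fit(X_train, y_train)
print("R^2 on held-out encodes:", model.score(X_test, y_test))
```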

As mentioned in the 'ML model formats' GitHub issue, standardization across frameworks would be necessary.

What we are currently looking into developing is a serving pipeline that supports all of our models so that separate instances do not have to be built for each framework.

While we tried to build all models under one framework, certain models, such as XGBoost, could not be implemented with Keras/TensorFlow.
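One way such a serving pipeline could look is a thin wrapper that hides the framework behind a single predict() call; the class name, file paths, and feature layout below are hypothetical, a sketch rather than our actual implementation.

```python
import numpy as np

class VmafPredictor:
    """Uniform predict() interface regardless of the underlying framework."""

    def __init__(self, kind: str, path: str):
        if kind == "keras":
            from tensorflow import keras
            self._model = keras.models.load_model(path)
        elif kind == "xgboost":
            from xgboost import XGBRegressor
            self._model = XGBRegressor()
            self._model.load_model(path)
        else:
            raise ValueError(f"unsupported framework: {kind}")

    def predict(self, features: np.ndarray) -> np.ndarray:
        return np.asarray(self._model.predict(features)).ravel()

# Usage (paths and features are placeholders):
# predictor = VmafPredictor("xgboost", "models/xgb_vmaf.json")
# predictor.predict(np.array([[1920, 1080, 4500, 61.2, 14.8]]))
```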

3. Summary & Outlook

In this talk, we've covered the concept of per-title encoding and a web- and AI-based solution that automates the process and saves the time and storage that per-title encoding usually requires.

We've described a conventional static encoding ladder, where the same encoding settings are applied across all videos.

Then, we described the per-title encoding method, which saves bitrate and storage.

Slide 12: Summary & Outlook | recap
Deep Encode:
• No computationally heavy test encodes
• Metadata extraction and AI-based image processing for content analysis
• Deep learning for appropriate encoding ladders

With our AI-based solution, the large number of test encodes is avoided, and bandwidth and storage costs are reduced even further.

Slide 13: Summary & Outlook | what's next?
• Model Loader API: integrate into the solution; live streaming and company-wide conference call use cases; contribute to the API to support Deep Encode on the web
• Model format standardization contribution
• Utilize WebCodecs for the live streaming scenario
• Prediction of other types of video quality
• Exploration of additional video feature detection algorithms

We are currently optimizing the overall workflow in order to have a faster performance time.

For example, we are enhancing our current architecture with the Model Loader API, which can load a custom pre-trained model.

We’d also like to contribute to this API and the model standardization issue in order to further support our use cases.

We are also optimizing our machine learning models in order to minimize the difference between our predicted and production encodes and exploring other types of metrics that can further enhance our models.

Slide 14: Contact
Anita Chen, Project Manager, Future Applications and Media (FAME)
anita.chen@fokus.fraunhofer.de
Fraunhofer FOKUS, Berlin, Germany
FAME Video Development Blog: https://websites.fraunhofer.de/video-dev/

Thank you for watching my presentation and have a nice day!


