Moving Picture Experts Group (MPEG)
Leonardo Chiariglione (2009), Scholarpedia, 4(2):6600. doi:10.4249/scholarpedia.6600, revision #91531
The Moving Picture Experts Group (often abbreviated as MPEG) is a working group of ISO/IEC in charge of development of international standards for compression, decompression, processing, and coded representation of moving pictures, audio, and their combination, in order to satisfy a wide variety of applications.
H. Nyquist and W. R. Bennett laid the foundations of digital signal processing, the former by establishing the conditions for statistical equivalence between time-continuous and sampled signals, and the latter by setting statistical bounds to errors for quantised (so-called Pulse Code Modulation or PCM) signals, i.e. signals converted to a form suitable for handling by digital computing machines.
If analogue signals of primary interest to humans – audio and video – are converted to digital according to Nyquist’s and Bennett’s precepts (a process that will be henceforth called “digitisation”), very high bitrate PCM signals are obtained. Although “high” is a reflection of the technological times (the CD rate of 1.41 Mbit/s for stereo audio signals was exceedingly “high” in the early days of the internet, thus prompting users to adopt the highly efficient MP3 compression format, see later), the 216 Mbit/s of digital television is unmanageable even today in most open environments. This obstacle, along with the advantages to be gained by overcoming it, led to the creation of a new field of study: reduction of the bitrate of digitised audio and video signals, if possible without distortion, otherwise with a controlled distortion.
The first target application was in the speech area, because of the drive started in the 1960s to digitise the telecommunication networks and because telephone speech is by design confined to the frequency band of 0.3 to 3.4 kHz and therefore yields a comparatively low bitrate. Sampling at 8 kHz with 8 bits of precision (companded, i.e. non-linearly quantised) gives a data rate of 64 kbit/s, as enshrined in International Telecommunication Union, Telecommunication Standardisation Sector (ITU-T) Recommendation G.711.
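The PCM bitrates quoted so far follow from multiplying sampling rate, sample precision and channel count. The short sketch below reproduces them. The CD parameters (44.1 kHz, 16 bits, 2 channels) and the digital television sampling structure (13.5 MHz luminance plus two 6.75 MHz chrominance components at 8 bits, as in ITU-R BT.601 4:2:2) are not stated in the text and are assumed here as the commonly cited values.

```python
def pcm_bitrate(sample_rate_hz, bits_per_sample, channels=1):
    """Return the uncompressed PCM bitrate in bit/s."""
    return sample_rate_hz * bits_per_sample * channels

# Telephone speech: 8 kHz sampling, 8 bits per sample (G.711).
telephone = pcm_bitrate(8_000, 8)

# CD audio (assumed parameters): 44.1 kHz, 16 bits, stereo.
cd_stereo = pcm_bitrate(44_100, 16, channels=2)

# Digital television (assumed 4:2:2 sampling): luminance at 13.5 MHz
# plus two chrominance components at 6.75 MHz each, 8 bits per sample.
digital_tv = pcm_bitrate(13_500_000 + 2 * 6_750_000, 8)

print(telephone / 1e3, "kbit/s")
print(cd_stereo / 1e6, "Mbit/s")
print(digital_tv / 1e6, "Mbit/s")
```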
Various algorithms have been employed to compress speech signals. The most straightforward ones, based on DPCM (i.e. differential PCM), were not particularly successful because they could compress only down to 32 kbit/s, generally not enough of a saving to justify adoption of the technology in the network.
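The idea behind DPCM can be illustrated with a minimal sketch: instead of coding each sample, the encoder quantises the difference between the sample and a prediction built from what the decoder has already reconstructed. The first-order predictor, the uniform quantiser and the step size of 4 are illustrative assumptions, not the scheme of any particular Recommendation.

```python
def dpcm_encode(samples, step=4):
    """Quantise the difference between each sample and the running prediction."""
    codes = []
    prediction = 0  # the decoder starts from the same value
    for sample in samples:
        code = round((sample - prediction) / step)
        codes.append(code)
        # Track the decoder's reconstruction so that errors do not accumulate.
        prediction += code * step
    return codes

def dpcm_decode(codes, step=4):
    """Rebuild the signal by accumulating the dequantised differences."""
    reconstructed = []
    prediction = 0
    for code in codes:
        prediction += code * step
        reconstructed.append(prediction)
    return reconstructed

signal = [0, 3, 9, 14, 18, 21, 22, 20]
codes = dpcm_encode(signal)
decoded = dpcm_decode(codes)
```

Because the encoder predicts from the reconstructed signal rather than from the original, the reconstruction error of each sample stays within half a quantiser step.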
Digital video took longer to surface because the bitrate resulting from digitisation was 3 orders of magnitude larger. Still, ITU-T Recommendation H.120 applied DPCM to contiguous video samples within a video frame (hence called “intraframe coding”) of a subsampled version of TV signals for videoconferencing, and achieved further reduction by exploiting correlation between contiguous frames (hence called “interframe coding”). Thus the input bitrate of about 40 Mbit/s could be reduced to 1.5 or 2 Mbit/s. This system, too, was not particularly successful because the bitrate was still too high and the compression/decompression equipment too expensive.
In the 1980s many were working on video and audio coding:
- Nippon Hoso Kyokai (NHK) developed and deployed an innovative hybrid (analogue/digital) HDTV transmission system called MUSE, which led the Europeans to devise their own solution called HD-MAC;
- ITU-T developed a new video compression Recommendation, H.261, that applied intraframe Discrete Cosine Transform (DCT) coding with motion-compensated interframe prediction;
- RAI-Telettra and General Instrument developed and manufactured HDTV codecs at bitrates that were thought to be unachievable until then;
- Philips and RCA developed and manufactured systems for interactive video on compact disc (CD) called, respectively, CD-i and DVI;
- another branch of the ITU-T called CMTT studied a so-called “contribution” (i.e. “between studios”) codec;
- a group of European companies and institutions developed the Digital Audio Broadcasting (DAB) system specifications within the Eureka project EU 147 DAB.
One might have thought that a buoyant competitive market should have been left free to produce its own results.
Instead MPEG was established as a working group of the International Organisation for Standardisation (ISO) with the idea that the only way for digital audio and video to succeed, in a relatively short time, was based on a reference standard without the myriad technological barriers that had been imposed on analogue audio and video. The right time for that standard was toward the end of the 1980s because video and audio compression performance and VLSI implementability were heading for their first intersection sometime in the early 1990s.
Interactive audio and video on CD was thought to be the first business case for the standard that was eventually called MPEG-1. The standard is organised in the following five parts:
- Part 1 Systems
- Part 2 Video
- Part 3 Audio
- Part 4 Conformance testing
- Part 5 Software simulation
Systems (defined in part 1 of the standard) is a packet-based multiplexer that can carry m video streams and n audio streams, all with the same time base. The stream carries timing information so that the receiving device can reconstruct a faithful replica – within the accuracy enabled by the standard – of the information generated at the encoder.
Video (defined in part 2 of the standard) provides a powerful compression technique based on the following assumptions:
- Video is a 3D array (x,y,t) of image samples, referred to as “pixels” where x (<M) represents the horizontal direction of the screen, y (<N) the vertical direction (from top to bottom) and t represents the time.
- Pixels are organised in 8x8 spatial blocks
- Blocks are organised in 2x2 luminance blocks and 2 chrominance blocks called macroblocks
- A block is mapped to sixty-four DCT coefficients
- A macroblock has one motion vector
- Motion vectors and DCT coefficients are Variable Length Coded
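The mapping of an 8x8 block to sixty-four DCT coefficients can be sketched as follows. This is a direct, unoptimised textbook form of the orthonormal two-dimensional DCT-II, written for clarity and not taken from the standard's text; the function name and the flat test block are illustrative.

```python
import math

N = 8  # block size used by MPEG-1 Video

def dct_2d(block):
    """Return the orthonormal 2-D DCT-II of an N x N block (list of lists)."""
    def scale(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

    coeffs = [[0.0] * N for _ in range(N)]
    for u in range(N):          # vertical frequency index
        for v in range(N):      # horizontal frequency index
            total = 0.0
            for y in range(N):
                for x in range(N):
                    total += (block[y][x]
                              * math.cos((2 * y + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * x + 1) * v * math.pi / (2 * N)))
            coeffs[u][v] = scale(u) * scale(v) * total
    return coeffs

# A perfectly flat block has energy only in the DC coefficient.
flat_block = [[128] * N for _ in range(N)]
flat_coeffs = dct_2d(flat_block)
```

For typical picture content most of the energy concentrates in a few low-frequency coefficients, which is what makes the subsequent quantisation and variable length coding effective.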
MPEG-1 Video is a generic algorithm that can work with any parameter set. As this does not give enough guidance to build interoperable devices, MPEG-1 defines a Constrained Parameter Set satisfying the following conditions:
- M ≤768
- N ≤576
- no. of macroblocks/picture ≤396 (352x288/256)
- no. of macroblocks/second ≤9900 (396x25)
- Picture rate ≤30 Hz
- Interpolated pictures ≤2
- Bitrate ≤1856 kbit/s
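The limits above can be checked mechanically. The following sketch tests a picture format against the listed conditions, assuming 16x16-pixel macroblocks (2x2 luminance blocks of 8x8 pixels) and picture dimensions rounded up to whole macroblocks. The function names and the example bitrate are illustrative, and the limit on interpolated pictures is not modelled.

```python
import math

def macroblocks_per_picture(width, height):
    """Number of 16x16 macroblocks needed to cover the picture."""
    return math.ceil(width / 16) * math.ceil(height / 16)

def within_constrained_parameters(width, height, picture_rate_hz, bitrate_kbit_s):
    """Check the Constrained Parameter Set limits listed in the text."""
    mbs = macroblocks_per_picture(width, height)
    return (width <= 768
            and height <= 576
            and mbs <= 396
            and mbs * picture_rate_hz <= 9900
            and picture_rate_hz <= 30
            and bitrate_kbit_s <= 1856)

# 352x288 at 25 Hz sits exactly on the macroblock limits.
cif_25 = within_constrained_parameters(352, 288, 25, 1150)
# 352x240 at 30 Hz trades picture size for picture rate.
sif_30 = within_constrained_parameters(352, 240, 30, 1150)
# Full-resolution 720x576 television exceeds the macroblock budget.
full_tv = within_constrained_parameters(720, 576, 25, 1150)
```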
MPEG-1 Audio (defined in part 3 of the standard) includes three compatible versions called “layers” where
- Layer I, a subband coding scheme, contains the basic mapping of the digital audio input into 32 sub-bands, fixed segmentation to format the data into blocks, a psychoacoustic model to determine the adaptive bit allocation, and quantisation using block companding and formatting;
- Layer II, also a sub-band coding scheme, provides additional coding of bit allocation, scalefactors, samples, different framing;
- Layer III, a hybrid sub-band-DCT coding scheme, introduces increased frequency resolution based on a hybrid filterbank; a nonuniform quantiser, adaptive segmentation and entropy coding of the quantised frequency samples are also utilized.
A “layer n” decoder is capable of decoding bitstreams of lower layers but not of higher layers. A reference MPEG-1 diagram is given in Figure 1.
“MPEG-1 stream decoder” is specified by Part 1, “Video decoder” is specified by Part 2 and “Audio decoder” is specified by Part 3.
Specifically, MPEG-1 standardises the syntax and semantics of the bitstream. In addition, only the decoding process is subject to the standard, while the encoding process and the decoder’s internal data representation are non-normative.
Additionally MPEG-1 has innovated the landscape of standards by providing
- The first integrated audio-visual standard with Systems, Video and Audio specification
- The first audio-visual standard defining the “receiver” and not the “transmitter”
- The first video coding standard independent of video format (NTSC/PAL/SECAM)
- The first standard jointly developed by all industries interested in audio and video
- The first standard developed entirely in software
- The first standard including a software implementation.
As tested in the early 1990s, MPEG-1 Audio achieves transparency at 384 kbit/s (Layer I), 256 kbit/s (Layer II) and 192 kbit/s (Layer III), where “transparency” was defined by MPEG as the condition in which experts (so-called golden ears) are statistically unable to distinguish the original PCM stereo sound, sampled at 48 kHz with 16 bits/sample, from the coded version.
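These figures translate directly into compression ratios relative to the PCM reference signal used in the tests (48 kHz, 16 bits/sample, stereo). A small worked calculation, with variable names chosen for illustration:

```python
# Reference signal: stereo PCM at 48 kHz with 16 bits per sample.
PCM_KBIT_S = 48_000 * 16 * 2 / 1000

# Bitrates at which each layer was found to be transparent.
TRANSPARENT_KBIT_S = {"Layer I": 384, "Layer II": 256, "Layer III": 192}

compression_ratio = {
    layer: PCM_KBIT_S / rate for layer, rate in TRANSPARENT_KBIT_S.items()
}

for layer, ratio in compression_ratio.items():
    print(f"{layer}: {ratio:.0f}:1")
```

Each additional layer buys a higher compression ratio at the price of a more complex decoder.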
Part 4 of the MPEG-1 standard, “Conformance”, provides the means to check that an instance of a decoder and an instance of a bitstream conform to the standard.
Early on, MPEG also saw the benefit of developing a software implementation of the standard. Part 5 of MPEG-1, “Reference Software”, contains a C implementation of encoders and decoders. The encoders are not optimised (in quality or real-time performance), but they generate conforming bitstreams, and the decoders are capable of handling them. Some commercial implementations have reportedly been derived from part 5 of MPEG-1.
MPEG-2 was designed to be the standard enabling the digital transformation of the analogue television system designed half a century before. It comprises the following 10 standards (part 8 was not developed):
- Part 1 Systems
- Part 2 Video
- Part 3 Audio
- Part 4 Conformance testing
- Part 5 Software simulation
- Part 6 System extensions - DSM-CC
- Part 7 Advanced Audio Coding
- Part 8 VOID
- Part 9 System extension RTI
- Part 10 Conformance extension - DSM-CC
- Part 11 IPMP on MPEG-2 Systems
Systems defines an entity called Packetised Elementary Stream (PES). This is a compressed stream combined with system-level information and packetised for use in two types of MPEG-2 Systems streams:
- Program Stream (PS) combines one or more PESs which have a common time base, into a single stream (analogous to MPEG-1 Systems Multiplex). PS is designed for use in relatively error-free environments and is suitable for applications which may involve software processing. Program stream packets may be of variable and relatively great length.
- Transport stream (TS) combines one or more PESs with one or more independent time bases into a single stream. Elementary streams that share a common timebase form a program. TS is designed for use in error-prone environments, such as storage or transmission in lossy or noisy media. TS packets are 188 bytes long.
Video (part 2) is described in terms of
- Coding tools, i.e. particular functions required to achieve defined functionalities. Many MPEG-2 tools are drawn from MPEG-1 Video tools. Indeed if input video is progressive, one can say that MPEG-2 becomes MPEG-1. However there are also new tools, particularly for efficient compression of interlaced video and for different types of scalability (SNR, Spatial).
- Profiles, i.e. groups of tools designed to satisfy major application domains while maximising interoperability between domains.
MPEG-2 Systems and Video were developed jointly with the ITU-T with the acronyms H.222 and H.262, respectively.
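The fixed 188-byte Transport Stream packet size mentioned above makes a stream easy to slice and resynchronise. The sketch below splits a buffer into packets and reads the packet identifier (PID) that tells which elementary stream a packet belongs to. The sync byte value 0x47 and the position of the 13-bit PID come from the MPEG-2 Systems specification rather than from this text, and the function names are illustrative.

```python
TS_PACKET_SIZE = 188   # bytes, as stated in the text
SYNC_BYTE = 0x47       # first byte of every TS packet (from the specification)

def split_ts_packets(data):
    """Yield successive 188-byte packets, verifying each sync byte."""
    if len(data) % TS_PACKET_SIZE:
        raise ValueError("buffer is not a whole number of TS packets")
    for offset in range(0, len(data), TS_PACKET_SIZE):
        packet = data[offset:offset + TS_PACKET_SIZE]
        if packet[0] != SYNC_BYTE:
            raise ValueError(f"lost sync at byte offset {offset}")
        yield packet

def packet_pid(packet):
    """Return the 13-bit packet identifier from header bytes 1 and 2."""
    return ((packet[1] & 0x1F) << 8) | packet[2]

# Two null packets (PID 0x1FFF) built by hand for demonstration.
null_packet = bytes([0x47, 0x1F, 0xFF, 0x10]) + b"\xff" * 184
packets = list(split_ts_packets(null_packet * 2))
pids = [packet_pid(p) for p in packets]
```

A short fixed-size packet with a recognisable first byte is what allows a receiver to recover quickly after errors, in line with the error-prone environments TS was designed for.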
Audio provides a multichannel-compatible extension of MPEG-1/Audio in the sense that it is
- Backward compatible: an MPEG-1/Audio decoder can decode the two channel components of an MPEG-2/Audio bitstream
- Forward compatible: an MPEG-2/Audio decoder can decode an MPEG-1/Audio bitstream, of course by producing a two-channel sound.
The standard also contains technology to extend the stereo compression features of MPEG-1 Audio. Unfortunately the backward compatibility of MPEG-2 Audio with MPEG-1 Audio limits its performance.
To overcome this limitation, MPEG developed part 7, Advanced Audio Coding (AAC), to provide a multichannel solution without the backward compatibility of Part 3. AAC employs a new algorithm to encode multichannel audio with improved performance, which materialises as transparency at 128 kbit/s for a stereo signal. The coding gain is achieved through redundancy removal by means of a high-resolution transform and entropy coding, and irrelevancy removal by using a model of the human auditory system in connection with the coefficient quantisation.
In addition to Conformance and Reference Software (parts 4 and 5, respectively), MPEG-2 also includes part 6 with the title Digital Storage Media Command and Control (DSM-CC) for device-to-device and device-to-network interaction and other standards. Figure 2 illustrates the main components of the standard.
“MPEG-2 stream decoder” is specified by Part 1, “Video decoder” by Part 2, “Audio decoder” by Part 3 and “Interaction” by Part 6.
MPEG-4 started as a standard for very low bitrate audio-visual coding, e.g. 10 kbit/s. Eventually MPEG-4 became that and a rather long list of other digital media technologies, some of which are:
- Scene description
- Video coding
- Audio coding
- 3D graphics coding
- Synthetic audio coding
- Transport interface
- File Formats
- Open Font Format
- Symbolic Music Representation
- 3D Graphics Compression Model
MPEG-4 comprises 25 parts, some of which are still under development:
- Part 1 Systems
- Part 2 Visual
- Part 3 Audio
- Part 4 Conformance testing
- Part 5 Reference Software
- Part 6 Delivery Multimedia Integration Framework
- Part 7 Optimised software for MPEG-4 tools
- Part 8 4 on IP framework
- Part 9 Reference Hardware Description
- Part 10 Advanced Video Coding
- Part 11 Scene Description and Application Engine
- Part 12 ISO Base Media File Format
- Part 13 IPMP Extensions
- Part 14 MP4 File Format
- Part 15 AVC File Format
- Part 16 Animation Framework eXtension (AFX)
- Part 17 Streaming Text Format
- Part 18 Font compression and streaming
- Part 19 Synthesized Texture Stream
- Part 20 Lightweight Application Scene Representation
- Part 21 MPEG-J Extension for rendering
- Part 22 Open Font Format
- Part 23 Symbolic Music Representation
- Part 24 Audio-System interaction
- Part 25 3D Graphics Compression Model
Systems (part 1) provides the architecture of the standard and roughly corresponds to the Systems parts of the MPEG-1 and MPEG-2 standards.
Visual (part 2) contains a large number of video coding tools that are employed in two very popular profiles: Simple Profile (SP) and Advanced Simple Profile (ASP).
In 2001, MPEG teamed with the Video Coding Experts Group of the ITU-T and established a Joint Video Team (JVT), which developed a new-generation video codec called Advanced Video Coding (AVC) as part 10 of MPEG-4. AVC has roughly twice the compression capability of MPEG-2 Video and MPEG-4 Visual. Subsequently AVC was extended with scalability functions, yielding Scalable Video Coding (SVC). Currently AVC is being further extended with Multiview Video Coding (MVC) capabilities.
Audio contains a large set of coding tools through which it is possible to construct several audio and speech coding algorithms:
- MPEG-4 AAC, an extension of MPEG-2 AAC
- Twin Vector Quantisation (VQ)
- Speech coding based on Code Excited Linear Predictive (CELP) coding and on Parametric representation
- Spectral Band Replication (SBR) technology to provide high quality audio at ever reduced bitrate, as in High Efficiency AAC (HE AAC)
- Various forms of audio lossless coding.
Synthetic Audio, called “Structured Audio”, is included in part 3. It provides the means to code sound using structured descriptions that are interpreted by a Structured Audio decoder to perform music and sound-effect synthesis. The Structured Audio Tools are: Structured Audio Orchestra Language (SAOL) providing synthesis methods, Structured Audio Score Language (SASL/MIDI) providing control parameters and Structured Audio Sample Bank Format (SASBF) providing the actual sample data.
In addition to the usual Conformance and Reference Software (parts 4 and 5, respectively), MPEG-4 also includes Part 7 “Optimised software for MPEG-4 tools”, which provides examples of reference software that not only implements the standard correctly but does so in optimised form, and Part 9 “Reference Hardware Description”, where the reference software is written in VHSIC Hardware Description Language (VHDL) for synthesis of VLSI chips.
Part 6 “Delivery Multimedia Integration Framework” (DMIF) provides a standard interface to access various transport mechanisms.
Part 8 “4 on IP framework” complements the generic MPEG-4 RTP payload defined by IETF as RFC 3640.
MPEG-1 and MPEG-2 assume that information in decoded form leaves the decoder as sequences of PCM samples, but the standards are silent on what is done with them. Scene Description (part 11) provides technologies for the new functionality of “composing” different information elements in a “scene”.
The original technology is called Binary Format for MPEG-4 Scenes (BIFS) of which there exists a Java powered version called MPEG-J. A newer technology with similar functionalities is provided by Part 20 “Lightweight Application Scene Representation” (LASeR).
MPEG-4 provides standard solutions for coding of synthetic visual information for 3D graphics. These tools are specified in Part 2 (Face and Body Animation and 3D Mesh Compression), Part 11 (Interpolator Compression) and Part 16, a complete framework called Animation Framework eXtension (AFX) for efficiently coding the shape, texture and animation of interactive synthetic 3D objects. AFX attempts to unify MPEG-4’s tools related to 3D graphics.
An important component of AFX is 3D Mesh Coding to provide efficient encoding of 3-D polygonal meshes with
- Incremental representation: to enable a decoder to reconstruct a number of faces in a mesh proportional to the number of bits in the bit stream that have been processed.
- Error resilience: to enable a decoder to partially recover a mesh when subsets of the bit stream are missing and/or corrupted.
- Level of Detail (LOD) scalability: to enable a decoder to reconstruct a simplified version of the original mesh containing a reduced number of vertices from a subset of the bit stream with the advantage of reducing the rendering time of objects which are distant from the viewer (LOD management) and enabling less powerful rendering engines to render the object at a reduced quality.
AFX introduces as well an advanced animation model for articulated models, a hierarchical representation of urban environments and several modern coding tools for 3D data.
Part 25 “3D Graphics Compression Model” specifies an architectural model able to accommodate third-party eXtensible Markup Language (XML) based description of scene graphs and graphics primitives with (potential) binarisation tools and with MPEG-4 3D Graphics Compression tools.
The ISO Base Media File Format (part 12) is designed to contain timed media information for a presentation in a flexible, extensible format that facilitates interchange, management, editing, and presentation of the media. These may be ‘local’ to the system containing the presentation, or may be via a network or other stream delivery mechanism. Part 14 “MP4 File Format” extends the File Format to cover the needs of MPEG-4 scenes while part 15 “AVC File Format” supports the storage of AVC and MVC bitstreams.
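Files in this family are built from nested, self-describing units called boxes, each starting with a 32-bit big-endian size and a four-character type. This layout is taken from the file format specification rather than from the text above. The sketch below walks the top-level boxes of a byte buffer; the function name and the hand-built sample are illustrative, and nested boxes are not descended into.

```python
import struct

def iter_top_level_boxes(data):
    """Yield (type, size) for each top-level box in a bytes object."""
    offset = 0
    while offset + 8 <= len(data):
        size, box_type = struct.unpack_from(">I4s", data, offset)
        header_size = 8
        if size == 1:
            # A 64-bit size follows the type field.
            (size,) = struct.unpack_from(">Q", data, offset + 8)
            header_size = 16
        elif size == 0:
            # The box extends to the end of the data.
            size = len(data) - offset
        if size < header_size:
            raise ValueError("corrupt box size")
        yield box_type.decode("ascii", errors="replace"), size
        offset += size

# Hand-built sample: a 16-byte 'ftyp' box followed by an empty 'free' box.
sample = (struct.pack(">I4s4sI", 16, b"ftyp", b"isom", 0)
          + struct.pack(">I4s", 8, b"free"))
boxes = list(iter_top_level_boxes(sample))
```

Because every box announces its own length, a reader can skip boxes it does not understand, which is one way the format achieves the extensibility described above.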
Streaming Text Format (part 17 of MPEG-4) defines text streams that are capable of carrying 3rd Generation Partnership Project (3GPP) Timed Text (specified in 3GPP TS 26.245). To transport the text streams, a flexible framing structure is specified that can be adapted to the various transport layers, such as RTP/UDP/IP and MPEG-2 Transport and Program Stream, for use in media such as broadcast and optical discs.
Among the remaining MPEG-4 technologies it is worth mentioning Open Font Format (part 22). MPEG received a request from rights holders to convert the widely adopted OpenType specification to an ISO standard. As is the rule with MPEG standards, the OpenType specification was converted to a Working Draft and then balloted through the ISO-specified process of Committee Draft (CD), Final Committee Draft (FCD) and Final Draft International Standard (FDIS) stages.
The figure below provides a conceptual diagram of the structure of an MPEG-4 decoder with the role played by the main MPEG-4 technologies.
With reference to the figure the parts of the MPEG-4 standard specify the blocks as follows:
- Part 1 specifies “MPEG-4 stream decoder”
- Part 2 specifies “Video decoder”
- Part 3 specifies “Audio decoder”
- Part 6 specifies “Interaction”
- Parts 8, 12, 14 and 15 specify “Transport”
- Parts 11 and 20 specify “Composition decoder” and “Composition”
- Part 16 specifies “3DG decoder”
- Part 17 specifies “Stream text decoder”
- Parts 18 and 22 specify “Font decoder”
- Part 19 specifies “Synthesised texture decoder”
- Part 21 specifies “Rendering”
With MPEG-7, MPEG made a kind of departure from its previous audio and video compression standards because it addressed the issue of “describing features of multimedia content”. MPEG-7 provides the world’s most comprehensive set of audio-visual description tools, namely:
- A set of descriptors (D) that represent features that include the syntax and semantics of the feature representation
- A set of Description Schemes (DS) that specify the structure and semantics of the relationships between D and DS components
- A Description Definition Language (DDL), based on XML Schema with extensions, that specifies DSs and can be used to extend and modify existing DSs
- A textual and binary encoding of Ds
- System tools for multiplexing of descriptors, synchronisation, transmission mechanisms, file formats, etc.
MPEG-7 is organised in 12 parts and is still structured in a way that is reminiscent of the earlier MPEG standards:
- Part 1 Systems
- Part 2 Description Definition Language
- Part 3 Visual
- Part 4 Audio
- Part 5 Multimedia Description Schemes
- Part 6 Reference Software
- Part 7 Conformance
- Part 8 Extraction and Use of MPEG-7 Descriptions
- Part 9 Profiles
- Part 10 Schema definition
- Part 11 Profile schemas
- Part 12 Query Format
Systems (part 1) specifies the means for binarising DDL data, a methodology for carrying descriptions as streams and the means for accessing and synchronously consuming data.
Description Definition Language (part 2) standardises a language to specify Description Schemes and Descriptors derived from XML Schema to express relations, object orientation, composition, partial instantiation, etc.
Visual (part 3) offers a broad range of visual descriptors:
- Grid layout (spatial structure)
- Colour: Colour space, dominant colour, colour layout
- Texture: Homogeneous texture, texture browsing
- Shape: Contour-based shape
- Motion: Motion activity
Audio (part 4) offers a broad range of audio descriptors:
- Audio Description Framework
- Spoken Content DS
- Timbre DS
- Audio Independent Components
- Sound Effects
Part 5 “Multimedia Description Schemes” (MDS) defines elements (Ds and DSs) that are generic (neither purely visual nor purely audio). This is a summary list:
- Basic Elements
- Schema Tools
- Content Description Tools
- Structure Description Tools
- Content Organization Description Tools
- Navigation and Access Description Tools
- User Interaction Description Tools.
Part 12 “Query Format” specifies the interface between a requester and a responder for multimedia content retrieval systems (e.g.: MPEG-7 databases). This enables users to describe their search criteria with a set of precise input parameters and additionally allows users to specify a set of preferred output parameters to depict the returned result sets.
In 1999, well before the Web 2.0 hype, MPEG started a project driven by the vision of a future where every human on Earth is potentially an element of a network involving billions of content providers, value adders, packagers, service providers, resellers, consumers etc. While many technologies were already available, it was clear that making this future real required an infrastructure enabling electronic commerce of digital content.
At the basis of this project, soon called MPEG-21, there are two key concepts:
- Digital Item, a structured digital object with a standard representation, identification and metadata within the MPEG-21 framework and
- User, any entity that interacts in the MPEG-21 environment or makes use of Digital Items.
MPEG-21 is a collection of seventeen standards whose integration enables Users to perform all functions on Digital Items that enable the realisation of the vision described above.
- Part 1 Vision, Technologies and Strategy
- Part 2 Digital Item Declaration
- Part 3 Digital Item Identification and Description
- Part 4 IPMP Components
- Part 5 Rights Expression Language
- Part 6 Rights Data Dictionary
- Part 7 Digital Item Adaptation
- Part 8 Reference Software
- Part 9 File Format
- Part 10 Digital Item Processing
- Part 11 Evaluation Tools for Persistent Association
- Part 12 Test Bed for MPEG-21 Resource Delivery
- Part 13 VOID
- Part 14 Conformance
- Part 15 Event reporting
- Part 16 Binary format
- Part 17 Fragment Identification
- Part 18 Digital Item Streaming
Part 1 Vision, Technologies and Strategy is a Technical Report, and lays down the scope and development plan of the project.
The foundational element of MPEG-21 is the definition of a structure that can flexibly accommodate the many components of a multimedia object. This includes, of course, the resources (media), but also identifiers, metadata, encryption keys, licenses etc. The specification of this structure is provided by Part 2 Digital Item Declaration (DID).
Identification of Digital Items is a key requirement in the digital space where everything must be uniquely and unambiguously identified in order to be managed. In MPEG-21 this function is provided by Part 3 Digital Item Identification (DII), a standard to handle identifiers in Digital Items.
A Digital Item can contain resources or even portions of a Digital Item that are protected. The component technologies that are needed to process those resources (i.e. to make them available in a form that can be processed by a machine) need to be standardised. This is done by Part 4 Intellectual Property Management and Protection (IPMP) Components. IPMP is the MPEG acronym for Digital Rights Management (DRM).
In the digital space, licenses play a role similar to that of licenses in the real world. The difference is that real-world licenses are expressed in natural language and are understood by humans, while digital licenses must be expressed in a form that can be processed by a machine. Part 5 Rights Expression Language (REL) provides the technology to express rights in a rich form that is comparable to the richness of the human language.
The language mentioned above is only capable of expressing the syntax of a rights expression but says nothing of the semantics of the “verbs”, e.g. copy, store, display etc., that are employed by the language (even though the MPEG REL provides the semantics of a few key verbs). A standard semantics for verbs commonly used in the media environment in general is given by Part 6 Rights Data Dictionary (RDD).
When a Digital Item and its resources are transported over the network, it may be necessary to “adapt” them (e.g. reduce their bitrate) to varying conditions. When a Digital Item and its resources reach a device, the resources may need to be “adapted” (e.g. subsampled) to match, for example, the device capabilities. Part 7 Digital Item Adaptation (DIA) specifies the syntax and semantics of the tools that may be used to assist in the adaptation of Digital Items, metadata and resources.
As for most other MPEG standards, MPEG-21 has a reference software implementation. This is provided by Part 8 Reference Software.
A Digital Item is an XML structure that can be moved from one device to another “as is”. However, it may be convenient to use a standard file format because in this case a device knows, by virtue of the definition of the file format itself, where specific Digital Item structures can be found. This is provided by Part 9 File Format.
A Digital Item is a static XML structure that contains all elements necessary to describe the resources contained in it, e.g. description of content, DRM information, etc. However, a Digital Item does not natively provide a way for a Digital Item creator to suggest how a user can interact with the Digital Item. Providing this additional information is the scope of Part 10 Digital Item Processing (DIP).
It is possible to establish associations – called Persistent Association Technologies (PAT) in MPEG-21 – between resources and certain metadata related to the resource using such technologies as “watermarking” and “fingerprinting”. As it is probably not necessary, and certainly premature at this stage, to standardise these association methods, Part 11 Evaluation Tools for Persistent Association provides the means to evaluate the performance of a given PAT to see how well it fulfils the requirements of the intended application. This, however, is a Technical Report, i.e. it is simply a guide for users.
A software test bed has been developed to enable experimentation with different means of resource delivery. The software is provided by Part 12 Test Bed for MPEG-21 Resource Delivery. This, however, is a Technical Report, i.e. it is simply a tool to help users experiment.
Conformance of an implementation is of course needed for MPEG-21 technologies as well. The purpose of Part 14 Conformance is to provide the necessary test methodologies and suites to be used to assess the conformity of a bitstream (typically an XML document) and a decoder (typically a parser) to the relevant MPEG-21 standard.
Certain application domains require a technology that can generate an event every time an action specified in the “Event Report Request” (ERR) contained in a Digital Item is made on a resource. The technology achieving this is specified in Part 15 Event Reporting (ER).
In MPEG-7 Systems MPEG had standardised a technology that allows the lossless conversion of a typically very bulky XML document to a binary format, preserving the ability to efficiently parse the binarised XML format. That technology has now been moved to MPEG-B Part 1 “Binary MPEG format for XML” (BiM). Now MPEG-7 Part 1 Systems and MPEG-21 Part 16 Binary format essentially reference the BiM technology specified in MPEG-B Part 1.
There are cases where it is necessary to identify a specific fragment of a resource as opposed to the entire set of data. Part 17 Fragment Identification (FID) specifies a normative syntax for URI Fragment Identifiers to be used for addressing parts of a resource from a number of Internet Media Types.
While part 9 provides a solution to transport a Digital Item in a file, Digital Items may also be transported over a streaming mechanism (e.g. in broadcasting or over IP networks). Therefore part 18 Digital Item Streaming (DIS) provides the technology to achieve this when the streaming mechanism employed is MPEG-2 Transport Stream and RTP/UDP/IP.
- Media Value Chain Ontology will provide a standard representation of the terms in a vocabulary and their corresponding relationships for use in media value chains. An example is personal and commercial movies that include not only the movie itself but also related information like movie producer, movie owner, rights and limitations to modify the movie, as well as personal notes available to a certain user group. The ontology will initially focus on the areas of Intellectual Property, Authorisation Models, User Role Description, Context Description, and Social Tagging.
As is clear from the above, MPEG has produced many component standards. However, technology integration has been left to implementers. As a result, for example, ATSC uses MPEG-2 Systems and Video but a different Audio from the one specified by MPEG, and DivX uses MPEG-4 Visual, MP3 and AVI. Such decisions are obviously within the remit of implementers, but the approach has shortcomings: it may take a long time to go from an MPEG standard to a product, and gratuitous incompatibilities between implementations, which often trouble end users, could be avoided with more careful choices.
With MPEG-A  MPEG has decided to engage in the area of “standard integration” considering that MPEG has (most of) the technologies needed, the internal expertise to do the integration job and the appropriate industry representation.
An interesting side-effect of the integration effort is that, while doing the integration, MPEG may discover (and actually has discovered) that not all components are there.
MPEG-A is still in full development (several parts are still to be completed). It currently comprises twelve parts.
- Part 1 Purpose for Multimedia Application Formats
- Part 2 Music Player Application Format
- Part 3 Photo Player Application Format
- Part 4 Musical Slide Show Application Format
- Part 5 Media Streaming Application Format
- Part 6 Professional Archival Application Format
- Part 7 Open Access Application Format
- Part 8 Portable Video Application Format
- Part 9 Digital Multimedia Broadcasting Application Format
- Part 10 Video Surveillance Application Format
- Part 11 Video Stereoscopic Application Format
- Part 12 Interactive Music Application Format
Part 1 Purpose for Multimedia Application Formats is a Technical Report, and lays down the scope and development plan of the project.
Part 2 Music Player Application Format has the purpose of enabling users to achieve an augmented experience of their sound resources by providing an “extended MP3 format”. This is achieved by adding more information in the now-ubiquitous MPEG File Format, namely MP3 Audio compression, MPEG-4/MPEG-21 File Format, an ID3 subset as MPEG-7 metadata and JPEG still picture compression.
Part 3 Photo Player Application Format has the purpose of enabling users to achieve an augmented experience of their photo resources by adding more information to the ubiquitous JPEG File Format, namely
- MPEG-7 Visual tools to describe visual properties of the images
- MPEG-7 MDS tools to carry simple generic metadata
- MPEG-7 System tools to support metadata binarisation
- MPEG-4 File Format
- EXIF (EXchangeable Image format)
The Music Player Application Format was designed as a simple format for enhanced MP3 players and the Photo Player Application Format combines JPEG still images with MPEG-7 metadata. Part 4 Musical Slide Show Application Format builds on top of the Music Player and the Photo Player Application Formats and is a superset of these two MAFs.
Part 5 Media Streaming Application Format specifies how to use specific MPEG technologies to build a full-fledged media player for streaming governed content. However, in order to have a complete media streaming set-up, it is necessary to deploy a number of devices: a Content Provider Device containing the Digital Items and the actual resources; a License Provider Device containing the associated licences; an IPMP Tool Provider Device that end user devices can access to get any IPMP Tools needed to make the resources usable; a Domain Management Device that handles sets of devices and users and a Media Streaming Player. The standard specifies the data formats and the protocols exchanged between a Media Streaming Player and the other devices.
The purpose of part 6 Professional Archival Application Format is to provide a standard packaging format for carriage of digital multimedia content, metadata to describe context information related to digital multimedia content stored in the archive, metadata to describe the logical structure of how the digital multimedia content is stored in the archive, identification of processing tools that are applied to the digital multimedia content as well as data protection and integrity tools, data governance tools, and data compression tools.
Part 7 Open Access Application Format defines a format designed for users who own rights to a piece of content and have an interest in releasing it in such a way that other users can freely access it but without making it public domain. The solution is the release of content that is governed in a “light-weight” form. The Open Access Application Format packages different contents into a single container file and provides a mechanism to attach metadata information, by using MPEG-7 and MPEG-21 technologies. The MPEG-21 REL is used to model the intentions of the license. MPEG-21 Event Reporting provides a feedback mechanism, which can notify the author, when a user wants to derive a content or extract an item out of the container file.
Part 8 Portable Video Application Format defines a format for the use of video files on portable devices giving users the possibility to use the content interactively.
Digital Multimedia Broadcasting (DMB) is a specification for the digital transmission of multimedia signals (especially video services) for mobile reception. Part 9 Digital Multimedia Broadcasting Application Format defines a standard file format that can be used to store DMB content and exchange it between DMB terminals. It specifies how to combine the variety of DMB contents with associated information for a presentation in a well-defined format that facilitates interchange, management, editing, and presentation of the DMB contents.
Part 10 Video Surveillance Application Format provides a lightweight wrapper to the video content from the MPEG technologies, video coding, related metadata and file format, suitable for video surveillance.
Part 11 Video Stereoscopic Application Format provides a format for a creator to take and for a service provider to distribute stereoscopic images, enabling users to have more realistic experiences (with or without special glasses) and to store the stereoscopic content for possible redistribution.
Part 12 Interactive Music Application Format defines a format to package interactive music content with audio tracks before mixing, so users can freely control the individual audio tracks. This allows the producer to create several versions (producer mixing 1, producer mixing 2, karaoke, rhythmic, and so on) with just one piece of music, using the metadata structure for mixing information.
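The idea of deriving several versions from one set of unmixed tracks can be sketched as follows. This is a toy illustration under stated assumptions: the track names, the sample values and the representation of a version as one gain per track are all hypothetical and are not the metadata structure defined by the standard.

```python
def mix(tracks, gains):
    """Mix equally long tracks (lists of samples) using one gain per track name."""
    length = len(next(iter(tracks.values())))
    out = [0.0] * length
    for name, samples in tracks.items():
        gain = gains.get(name, 0.0)  # a track missing from the preset is muted
        for i, sample in enumerate(samples):
            out[i] += gain * sample
    return out


# Hypothetical unmixed tracks (three samples each) shipped once in the package.
tracks = {
    "vocals": [0.5, 0.5, -0.5],
    "guitar": [0.25, -0.25, 0.25],
    "drums": [1.0, 0.0, -1.0],
}

# Hypothetical mixing information: each version is just a set of gains.
presets = {
    "producer_mix_1": {"vocals": 1.0, "guitar": 1.0, "drums": 0.5},
    "karaoke": {"vocals": 0.0, "guitar": 1.0, "drums": 0.5},
}

karaoke = mix(tracks, presets["karaoke"])
user_mix = mix(tracks, {"vocals": 1.0, "guitar": 0.0, "drums": 1.0})
print(karaoke)
print(user_mix)
```

Because only the gains differ between versions, adding a version costs a few numbers of metadata rather than another copy of the audio.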
The maturing of multimedia technology is making the need to provide systems-video-audio “packages”, as in previous MPEG standards (up to and including MPEG-7), less compelling. Indeed various products and services currently available in the marketplace freely mix different technologies from the different standards, and MPEG has done the same in its MPEG-A standards. To respond to the continuing need to cope with technological advances with new systems, video and audio standards, MPEG has started three new “containers” for systems, video and audio standards, called MPEG-B, MPEG-C and MPEG-D, respectively. MPEG-B currently contains five parts.
- Part 1 Binary MPEG format for XML
- Part 2 Fragment Request Unit
- Part 3 XML Representation of IPMP-X messages
- Part 4 Codec Configuration Representation
- Part 5 Bitstream Syntax Description Language
Part 1 Binary MPEG format for XML (BiM) provides a standard set of generic technologies to transmit and compress XML documents, addressing a broad spectrum of applications and requirements. It relies on schema knowledge between encoder and decoder in order to reach high compression efficiency, and provides fragmentation mechanisms for ensuring transmission and processing flexibility.
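Why shared schema knowledge helps compression can be seen in a toy sketch. The code below is not the BiM format: the schema, the byte layout (one byte for the element index, one byte for the text length) and the sample document are assumptions made purely for illustration. Since encoder and decoder both know the list of element names, the encoder can send a small index instead of repeating each tag as text.

```python
# Hypothetical schema shared by encoder and decoder: the known element names.
SCHEMA = ["Title", "Creator", "Duration"]


def encode(elements):
    """Encode (name, text) pairs: 1-byte schema index, 1-byte length, UTF-8 text."""
    out = bytearray()
    for name, text in elements:
        data = text.encode("utf-8")
        if len(data) > 255:
            raise ValueError("text too long for this toy format")
        out.append(SCHEMA.index(name))  # the index replaces the textual tag
        out.append(len(data))
        out += data
    return bytes(out)


def decode(blob):
    """Recover the (name, text) pairs using the same shared schema."""
    elements, pos = [], 0
    while pos < len(blob):
        name = SCHEMA[blob[pos]]
        size = blob[pos + 1]
        text = blob[pos + 2 : pos + 2 + size].decode("utf-8")
        elements.append((name, text))
        pos += 2 + size
    return elements


doc = [("Title", "Demo"), ("Creator", "Anon"), ("Duration", "PT5M")]
xml_text = "".join(f"<{name}>{text}</{name}>" for name, text in doc)
blob = encode(doc)
print(len(xml_text.encode("utf-8")), "bytes as XML text")
print(len(blob), "bytes in the toy binary form")
```

The conversion is lossless for documents that conform to the shared schema, which is the property the toy is meant to show.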
Part 2 Fragment Request Unit specifies a technology enabling a terminal to request XML fragments of immediate interest. This significantly reduces processing and storage requirements at the terminal and can enable applications on constrained devices that would not otherwise be possible.
Part 3 XML Representation of IPMP-X Messages provides an XML representation of the IPMP-X messages defined in MPEG-4 part 13 with extensions.
Part 4 Codec Configuration Representation provides a compressed digital representation of a video decoder and of the corresponding bitstream, assuming that the receiving terminal shares a library of video coding tools with the transmitter.
Part 5 Bitstream Syntax Description Language provides a normative grammar to describe, in XML, the high-level syntax of a bitstream. The resulting XML document is called a Bitstream Syntax Description (BSD). The BSD does not replace the original binary format and, in most cases, does not describe the bitstream on a bit-per-bit basis, but rather its high-level structure, e.g., how the bitstream is organized in layers or packets of data. BSD is itself scalable, i.e. it may describe the bitstream at different syntactic layers (e.g., finer or coarser levels of detail), depending on the application.
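A high-level description of this kind can be sketched with a toy packet stream. Everything below is an assumption made for illustration: the packet layout (one byte for a layer id, one byte for the payload length, then the payload), the element and attribute names, and the adaptation step are hypothetical and do not follow the BSDL grammar. The sketch only shows an XML document that records where packets sit in a bitstream without copying their payload, and how a process could then work from that description alone.

```python
import xml.etree.ElementTree as ET


def describe(bitstream: bytes) -> ET.Element:
    """Build an XML description of the packet structure of a toy bitstream."""
    root = ET.Element("Bitstream")
    pos = 0
    while pos < len(bitstream):
        layer, size = bitstream[pos], bitstream[pos + 1]
        # Record only structure: which layer, where the payload is, how long it is.
        ET.SubElement(root, "Packet", layer=str(layer),
                      offset=str(pos + 2), length=str(size))
        pos += 2 + size
    return root


def drop_layers_above(bitstream: bytes, description: ET.Element, max_layer: int) -> bytes:
    """Keep packets with layer <= max_layer, guided only by the description."""
    out = bytearray()
    for packet in description.findall("Packet"):
        if int(packet.get("layer")) <= max_layer:
            start, size = int(packet.get("offset")), int(packet.get("length"))
            out += bitstream[start - 2 : start + size]  # 2-byte header plus payload
    return bytes(out)


# Three toy packets: layer 0 (2 bytes), layer 1 (1 byte), layer 0 (1 byte).
stream = bytes([0, 2, 0xAA, 0xBB, 1, 1, 0xCC, 0, 1, 0xDD])
bsd = describe(stream)
print(ET.tostring(bsd, encoding="unicode"))
base_only = drop_layers_above(stream, bsd, max_layer=0)
print(base_only.hex())
```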
MPEG-C  currently contains four parts.
- Part 1 Accuracy specification for implementation of integer-output IDCT
- Part 2 Fixed point 8x8 DCT/IDCT
- Part 3 Auxiliary Video Data Representation
- Part 4 Video Tool Library
Part 1 Accuracy specification for implementation of integer-output IDCT specifies the IDCT accuracy that is equivalent to or extends the IEEE 1180 standard which has been withdrawn.
Part 2 Fixed-point 8x8 inverse discrete cosine transform and discrete cosine transform specifies a particular fixed-point approximation to the ideal 8x8 IDCT and DCT function, fulfilling the 8x8 IDCT conformance requirements for the MPEG-1, MPEG-2 and MPEG-4 part 2 video coding standards.
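What a fixed-point approximation to the 8x8 IDCT involves, and how its accuracy can be compared with the ideal transform, can be sketched as follows. This is not the approximation specified in Part 2, nor the accuracy test procedure of Part 1: the 13-bit scaling, the single final rounding and the test block are assumptions chosen to keep the example short.

```python
import math

N = 8
SHIFT = 13  # assumed precision for this sketch: basis values are scaled by 2**13


def basis(k: int, n: int) -> float:
    """Orthonormal DCT-II basis value for frequency k at sample n."""
    scale = math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
    return scale * math.cos((2 * n + 1) * k * math.pi / (2 * N))


FLOAT_T = [[basis(k, n) for n in range(N)] for k in range(N)]
FIXED_T = [[round(v * (1 << SHIFT)) for v in row] for row in FLOAT_T]


def idct_2d_float(block):
    """Reference 8x8 inverse DCT in floating point; block[u][v] holds coefficients."""
    return [[sum(FLOAT_T[u][y] * FLOAT_T[v][x] * block[u][v]
                 for u in range(N) for v in range(N))
             for x in range(N)] for y in range(N)]


def idct_2d_fixed(block):
    """Integer-only 8x8 inverse DCT: row pass, column pass, one final rounding shift."""
    # Row pass: results stay scaled by 2**SHIFT to avoid an intermediate rounding.
    rows = [[sum(FIXED_T[v][x] * block[u][v] for v in range(N)) for x in range(N)]
            for u in range(N)]
    half = 1 << (2 * SHIFT - 1)  # rounding offset for the combined shift
    return [[(sum(FIXED_T[u][y] * rows[u][x] for u in range(N)) + half) >> (2 * SHIFT)
             for x in range(N)] for y in range(N)]


def peak_error(block) -> int:
    """Largest absolute difference between fixed-point and rounded reference outputs."""
    ref = idct_2d_float(block)
    fix = idct_2d_fixed(block)
    return max(abs(fix[y][x] - round(ref[y][x])) for y in range(N) for x in range(N))


# Deterministic block of small integer coefficients (the values are arbitrary).
block = [[((u * N + v) * 37) % 101 - 50 for v in range(N)] for u in range(N)]
print("peak error vs. floating point:", peak_error(block))
```

The point of standardising one particular approximation is that every decoder using it produces identical output, so encoder and decoder predictions cannot drift apart because of differing IDCT implementations.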
Part 3 Auxiliary Video Data Representation specifies how auxiliary data such as pixel-related depth or parallax values, are to be represented when encoded by MPEG video standards in the same way as ordinary picture data.
Part 4 Video Tool Library contains a collection of descriptions of video coding tools, called Functional Units, as referenced in MPEG-B Part 4.
MPEG-D, formally ISO/IEC 23003 MPEG Audio Technologies, currently contains three parts.
- Part 1 MPEG Surround
- Part 2 Spatial Audio Object Coding
- Part 3 Unified speech and audio coding
Part 1 MPEG Surround provides an efficient bridge between stereo and multichannel presentations in low-bitrate applications. The MPEG Surround technology supports very efficient parametric coding of multi-channel audio signals, so as to permit transmission of such signals over channels that typically support only the transmission of stereo (or even mono) signals. Moreover, MPEG Surround provides complete backward compatibility with non-multichannel audio systems.
Part 2 Spatial Audio Object Coding represents several audio objects by first combining the object signals into a mono or stereo signal, whilst extracting parameters from the individual object signals based on knowledge of human perception of the sound stage. These parameters are coded as a low bitrate side-channel that the decoder uses to render an audio scene from the stereo or mono down-mix, such that the aspects of the output composition can be decided at the time of decoding.
Part 3 Unified speech and audio coding, a standard still in the early phases of development, aims at defining a single technology that codes speech, music, and speech mixed with music, and that is consistently as good as the best of the state-of-the-art speech coders such as Adaptive Multi Rate – WideBand plus (AMR-WB+) and the state-of-the-art music coders (HE-AAC V2) in the 24 kbit/s stereo to 12 kbit/s mono operating range.
MPEG-E, also called MPEG Multimedia Middleware (M3W), is a complete set of standards defining technologies required in a multimedia device. It is organised in eight parts.
- Part 1 Architecture
- Part 2 Multimedia API
- Part 3 Component Model
- Part 4 Resource and Quality Management
- Part 5 Component Download
- Part 6 Fault Management
- Part 7 System Integrity Management
- Part 8 Reference Software and Conformance
Part 1 Architecture describes the M3W architecture and APIs.
Part 2 Multimedia API specifies access to the functionalities provided by conforming multimedia platforms such as Media Processing Services (including coding, decoding and trans-coding), Media Delivery Services (through files, streams, messages), Digital Rights Management (DRM) Services, Access to data (e.g. media content) and Access to, Edit and Search Metadata.
Part 3 Component Model specifies a technology enabling cost effective software development and an increase in productivity through software reuse and easy software integration.
Part 4 Resource and Quality Management specifies a framework for resource management aiming to optimise and guarantee the Quality of Service that is delivered to the end-user in a situation where resources are constrained.
Part 5 Component Download specifies a download framework enabling controlled download of software components to a device.
Part 6 Fault Management specifies a framework for fault management with the goal to have a dependable/reliable system in the context of faults. These can be introduced due to upgrades and extensions out of the control of the device vendor, or because it is impossible to test all traces and configurations in today’s complex software systems.
Part 7 System Integrity Management specifies a framework for integrity management with the goal to have controlled upgrading and extension, in the sense that there is a reduced chance of breaking the system during an upgrade/extension or to provide the ability to restore a consistent configuration.
Part 8 Reference Software and Conformance is the usual complement as with the other MPEG standards.
MPEG-M, also called MPEG eXtensible Middleware (MXM)  is a standard under development whose purpose is to promote the extended use of digital media content through increased interoperability and accelerated development of components, solutions and applications. This is achieved by specifying
- 1. The MXM architecture
- 2. The MXM components (by reference)
- 3. The MXM components APIs
- 4. The MXM applications API
- 5. The inter-MXM communication protocols
It is organised in three parts.
Part 1 MXM Architecture and Technologies provides the reference architecture and lists the technologies that are included in the middleware.
Part 2 MXM API provides the APIs of the MXM Engines and of the MXM Orchestrator.
Part 3 MXM Reference Software and Conformance provides the MXM reference software, released as Open Source Software with a business-friendly licence.
Ongoing and future activities
In its 20 years of existence MPEG has operated very much like a company churning out new products (standards) for its customers – the multimedia industry – very often by anticipating industry needs based on industry inputs and internal assessments.
These are some of the areas under investigation, at different stages of development (list in alphabetic order).
- In 3D Video (3DV), a shorter time-scale sub-project, new types of audio-visual systems are supported that allow users to view videos of real 3D space from different user viewpoints. 3DV is expected to be possible with advanced 3D displays, where M dense views must be generated from a sparse set of K transmitted views (typically K≤3) with associated depth data. The allowable range of view synthesis will be relatively narrow (20 degrees view angle from leftmost to rightmost view).
- Advanced IPTV Terminal is a standard being developed jointly by MPEG and ITU-T SG16, designed to enhance IPTV services by extending terminal capabilities with advanced features such as content generation, processing, and distribution by a large number of users; global, seamless and transparent use (regardless of geo-location, service provider, network provider and manufacturer); and diversity of user experience through easy download and installation of applications produced by a global community of developers.
- In Free-viewpoinT Video (FTV), a user can set the viewpoint to an almost arbitrary location and direction, which can be static, change abruptly, or vary continuously, within the limits that are given by the available camera setup. In tandem, the audio listening point is changed to track changes in viewpoint.
- Image and Video Signature Tools will be a standard supporting ultra-fast search for and identification of images/videos and their modified/edited versions, including a range of deformations, such as coding artifacts, blurring, colour-to-monochrome conversion, noise and geometric deformations such as scaling, rotation and significant cropping.
- Information Exchange between Virtual Worlds (MPEG-V) will provide a standard framework enabling the interoperability between virtual worlds (i.e. virtual spaces where people can work, interact, play, travel, learn and augment real life) and aspects of the real world (sensors, actuators, social and welfare systems, banking, insurance, travel, real estate and many others).
- The Presentation of Structured Information (PSI) standard will provide the means to present Structured Information, information that can e.g. be represented in XML complying to a given Schema. Presentation of this type of information, e.g. an Electronic Program Guide (EPG) in addition to audio and video is required in most service scenarios. MPEG has native Structured Information types: eXtensible MPEG-4 Textual format (XMT), LASeR, Digital Items, etc. Other forms of Structured Information have been defined by other bodies.
- The Representation of Sensory Experience (RoSE) standard will add “Sensory Effects” to an audio-visual bitstream, leading to more realistic experiences in the consumption of audio-visual content. These will include special effects such as turning on a flashbulb for lightning flash effects and opening/closing window curtains for a sensation of fear, as well as effects such as fragrance, flame and fog produced by devices such as scent devices, flame-throwers, fog generators, and shaking chairs.
- Web, IP and Mobile TV (WIM TV) will be a standard enabling creation and distribution of rich media interactive content through some of the most promising delivery mechanisms thereby bringing the “create once publish anywhere” paradigm one step closer.
Who uses MPEG standards
Many products and services impacting the lives of millions of people are based on MPEG standards. This section mentions the most important ones.
- Video CD is the precursor of the DVD. It uses MPEG-1 Systems, Video and Audio Layer II to store one hour of video on a Compact Disc.
- Digital Audio Broadcasting uses MPEG-1 Audio Layer II to broadcast stereo audio via radio.
- MPEG-1 Audio Layer II is also widely used in digital television set top boxes.
- MPEG-1 Audio Layer III (MP3) is the quasi-universal choice for portable music.
- MPEG-2 Systems (Transport Stream) and MPEG-2 Video are almost universally used for digital television set top boxes.
- MPEG-2 Systems (Program Stream) and MPEG-2 Video are almost universally used for Digital Versatile Disc (DVD).
- MPEG-2 Advanced Audio Coding is used in Japanese digital television set top boxes.
- MPEG-4 Visual (Simple Profile) is used in most mobile handsets.
- MPEG-4 Visual (Advanced Simple Profile) is used to compress video material on Compact Disc.
- MPEG-4 Audio in various versions is used in many products (portable music players, mobile handsets etc.).
- MPEG-4 Advanced Video Coding is being used in a broad range of products (set top boxes, mobile handsets, portable video players etc.).
- MPEG-4 Binary Format for Scene (BIFS) is used in Digital Multimedia Broadcasting (DMB).
- MPEG-4 File Format is used in a variety of application domains, notably to store and exchange video files taken by mobile handsets.
- Elements of MPEG-4 Animation Framework eXtension (AFX) are used in mobile games.
- Lightweight Application Scene Representation (LASeR) is used in mobile handsets.
- Elements of MPEG-7 are used in several commercial applications and referenced by the TV Anytime specifications.
- MPEG-21 Digital Item Declaration (DID) is used in commercial products.
- Several elements of MPEG-21 have been adopted by the Digital Media Project (DMP) for their open source Chillout® Interoperable DRM Platform.
MPEG is an offspring of traditional standardisation but has continuously innovated itself to cope with evolving technology and the inflow of new industries in need of multimedia standards. Some of the innovations are the definition of bitstream syntax and decoder-only standards with the ability to allow industry to compete in encoders, the definition of profiles and levels to increase interoperability between application domains without burdening some of them with unnecessary features, the execution of subjective tests to verify the performance of the audio and video coding standards, and the release of a normative reference software implementation of a decoder and an informative software implementation of an encoder.
MPEG produces standards that are deliberately kept at a generic level so as to widen their scope of use: more industries can share the format while independently adding the elements that are specific to their application fields. This contrasts with the traditional approach of industries defining vertical standards without consideration of horizontal commonalities.
MPEG provides a unique route to convert new technology into standards because of its process of selecting technologies for introduction in new standards entirely on the basis of commonly agreed technical parameters. This has the advantage that MPEG standards are typically the best technical standards in a given field but also the disadvantage that sometimes a significant number of patents may be needed to practice the standards. Patent pools are typically established to solve this problem.
 H. Nyquist, "Certain topics in telegraph transmission theory", Trans. AIEE, vol. 47, pp. 617-644, Apr. 1928
 W. R. Bennett, "Spectra of Quantized Signals", Bell Syst. Tech. J., vol. 27, pp. 446-472, July 1948
 ITU-T Recommendation G.711, Pulse code modulation (PCM) of voice frequencies
 ITU-T Recommendation H.120, Codecs for videoconferencing using primary digital group transmission
 ISO/IEC 11172, Information Technology – Coding of moving pictures and associated audio at up to about 1.5 Mbit/s
 ISO/IEC 13818, Information Technology – Generic coding of moving pictures and associated audio
 ISO/IEC 14496, Information Technology – Coding of audio-visual objects
 IETF Request for Comments 3640, RTP Payload Format for Transport of MPEG-4 Elementary Streams
 ISO/IEC 15938, Information Technology – Multimedia content description interface
 ISO/IEC 21000, Information Technology – Multimedia framework
 ISO/IEC 23000, Information Technology – Multimedia Application Format
 ISO/IEC 23001, Information Technology – MPEG Systems Technologies
 ISO/IEC 23002, Information Technology – MPEG Video Technologies
 ISO/IEC 23004, Information Technology – MPEG Multimedia Middleware (M3W)
 ISO/IEC 23005, Information Technology – Information Exchange with Virtual Worlds
 ISO/IEC 23006, Information Technology – MPEG eXtensible Middleware
- Arkady Pikovsky and Michael Rosenblum (2007) Synchronization. Scholarpedia, 2(12):1459.
- 3DV 3D Video
- 3GPP Third Generation Partnership Project
- AAC Advanced Audio Coding
- AFX Animation Framework eXtension
- AMR-WB+ Adaptive Multi Rate – WideBand plus
- ASP Advanced Simple Profile
- AVC Advanced Video Coding
- BIFS Binary Format for MPEG-4 Scenes
- BiM Binary MPEG format for XML
- BSD Bitstream Syntax Description
- BSDL BSD Language
- CD Committee Draft
- CD Compact Disc
- CELP Code Excited Linear Predictive coding
- DAB Digital Audio Broadcasting
- DCT Discrete Cosine Transform
- DDL Description Definition Language
- DIA Digital Item Adaptation
- DID Digital Item Declaration
- DII Digital Item Identification
- DIP Digital Item Processing
- DIS Digital Item Streaming
- DMB Digital Multimedia Broadcasting
- DMIF Delivery Multimedia Integration Framework
- DMP Digital Media Project
- DPCM Differential PCM
- DRM Digital Rights Management
- DS Description Schemes
- DSM-CC Digital Storage Media Command and Control
- EPG Electronic Program Guide
- ER Event Reporting
- ERR Event Report Request
- EXIF EXchangeable Image Format
- FCD Final Committee Draft
- FDIS Final Draft International Standard
- FID Fragment Identification
- FTV Free-viewpoinT Video
- HE AAC High Efficiency AAC
- IDCT Inverse DCT
- IETF Internet Engineering Task Force
- IPMP Intellectual Property Management and Protection
- IPMP-X IPMP eXtensions
- ISO International Organisation for Standardisation
- ITU International Telecommunication Union
- ITU-T ITU, Telecommunication Standardisation Sector
- JVT Joint Video Team
- LASeR Lightweight Application Scene Representation
- LOD Level of Detail
- M3W MPEG Multimedia Middleware
- MAF Multimedia Application Format
- MDS Multimedia Description Schemes
- MP3 MPEG Audio Layer III
- MPEG Moving Picture Experts Group
- MVC Multiview Video Coding
- PAT Persistent Association Technologies
- PCM Pulse Code Modulation
- PES Packetised Elementary Stream
- PS Program Stream
- PSI Presentation of Structured Information
- RDD Rights Data Dictionary
- REL Rights Expression Language
- RFC Request For Comments
- RoSE Representation of Sensory Experience
- RTP Real-time Transport Protocol
- SAOL Structured Audio Orchestra Language
- SASBF Structured Audio Sample Bank Format
- SASL Structured Audio Score Language
- SBR Spectral Band Replication
- SP Simple Profile
- SVC Scalable Video Coding
- TS Transport Stream
- VHDL VHSIC Hardware Description Language
- WIM TV Web, IP and Mobile TV
- XML eXtensible Markup Language
- XMT eXtensible MPEG-4 Textual format