Video surveillance technology

What are H.264 and H.265?

H.264 is Part 10 of MPEG-4. It is a high-compression digital video codec standard developed by the Joint Video Team (JVT), which was formed jointly by the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG). The standard is usually referred to as H.264/AVC (or AVC/H.264, H.264/MPEG-4 AVC, or MPEG-4/H.264 AVC), a name that explicitly credits the developers on both sides. Key elements of the H.264 standard include the Access Unit Delimiter, SEI (Supplemental Enhancement Information), the Primary Coded Picture and the Redundant Coded Picture, as well as Instantaneous Decoding Refresh (IDR), the Hypothetical Reference Decoder (HRD) and the Hypothetical Stream Scheduler (HSS). H.264 is now gradually being replaced by H.265.

In August 2012, Ericsson introduced the first H.265 codec. Six months later, the International Telecommunication Union (ITU) officially approved the HEVC/H.265 standard. HEVC stands for High Efficiency Video Coding, and it is a considerable improvement over the previous H.264 standard. Huawei holds the largest number of core patents and plays a leading role in this standard. H.265 is designed to transmit higher-quality network video under limited bandwidth, and needs only half of the original bandwidth to play video of the same quality. The H.265 standard also supports both 4K (4096×2160) and 8K (8192×4320) Ultra HD video. The coding architecture of H.265/HEVC is broadly similar to that of H.264/AVC and includes modules such as Intra Prediction, Inter Prediction, Transform, Quantization, Deblocking Filter and Entropy Coding. In the HEVC coding architecture, however, the picture is divided into three basic units: the coding unit (Coding Unit, CU), the prediction unit (Prediction Unit, PU) and the transform unit (Transform Unit, TU).

H.265 is a new video coding standard formulated by ITU-T VCEG after H.264. It builds on the existing H.264 standard, retaining some of the original technologies while improving others. The new techniques improve the trade-off among bit rate, encoding quality, delay and algorithm complexity in order to reach an optimal balance. Thanks to algorithm optimization, H.264 can transmit standard-definition digital video at less than 1 Mbit/s; H.265 can transmit 720P (resolution 1280×720) ordinary high-definition audio and video at 1-2 Mbit/s.
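The bit rates quoted above translate directly into storage requirements. The following is a minimal sketch, assuming a constant bit rate and decimal gigabytes (10^9 bytes); the function name and the "H.264 needs twice the bit rate" factor are illustrative assumptions taken from the half-bandwidth claim in the text, not measured values.

```python
def storage_gb_per_day(bitrate_mbit_s: float, hours: float = 24.0) -> float:
    """Storage in GB (10**9 bytes) for a stream recorded at a constant bit rate."""
    total_bits = bitrate_mbit_s * 1_000_000 * hours * 3600
    return total_bits / 8 / 1_000_000_000


# Upper end of the 1-2 Mbit/s range quoted for 720P over H.265.
h265_gb = storage_gb_per_day(2.0)

# Assumption: the same quality over H.264 needs twice the bit rate.
h264_gb = storage_gb_per_day(2.0 * 2)

print(f"H.265: {h265_gb:.1f} GB/day, H.264: {h264_gb:.1f} GB/day")
```

Under these assumptions, one camera at 2 Mbit/s produces 21.6 GB per day, so halving the bit rate halves the disk space needed for the same retention period.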

What are Session Initiation Protocol, Media Server and Wide Dynamic Technology?

Session Initiation Protocol, SIP

Session Initiation Protocol (SIP): SIP is formulated by the Internet Engineering Task Force (IETF) and is a framework protocol for multi-party multimedia communication. It is a text-based application-layer control protocol, independent of the underlying transport protocol, used to establish, modify and terminate two-party or multi-party multimedia sessions on an IP network.

This protocol is often used in the video networking platforms of Safe City and Xueliang Engineering.
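Because SIP is text-based, a request is simply a block of header lines. The sketch below assembles a bare INVITE request in the general shape defined by the SIP specification; it is not a working client, and every address, tag, branch value and the function name `build_invite` are hypothetical placeholders chosen for illustration.

```python
def build_invite(caller: str, callee: str, local_host: str, call_id: str) -> str:
    """Assemble a minimal SIP INVITE request as plain text (no SDP body)."""
    lines = [
        f"INVITE sip:{callee} SIP/2.0",                      # request line
        f"Via: SIP/2.0/UDP {local_host};branch=z9hG4bK0001",  # hypothetical branch
        "Max-Forwards: 70",
        f"To: <sip:{callee}>",
        f"From: <sip:{caller}>;tag=12345",                   # hypothetical tag
        f"Call-ID: {call_id}",
        "CSeq: 1 INVITE",
        f"Contact: <sip:{caller}>",
        "Content-Length: 0",
    ]
    # Header lines end with CRLF; an empty line terminates the header section.
    return "\r\n".join(lines) + "\r\n\r\n"


message = build_invite(
    caller="client@example.com",      # hypothetical viewing client
    callee="camera01@example.com",    # hypothetical camera
    local_host="192.0.2.10",          # documentation-only IP address
    call_id="abc123@192.0.2.10",
)
print(message)
```

A real session would also carry a media description in the body and be followed by responses and an ACK; the point here is only that the control messages are human-readable text.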

Media Server, MS

The Media Server is mostly used in large-scale video networking projects. It provides real-time media stream forwarding, media storage, historical media information retrieval and on-demand services. The media server receives media data from SIP devices, gateways or other media servers, and forwards the data to one or more SIP clients and media servers according to instructions.

Wide Dynamic Technology, WDR

Wide dynamic technology (WDR): “dynamic” here refers to dynamic range, that is, the range over which a variable quantity can change. For a camera, dynamic range refers to its ability to adapt to the range of illumination in the scene being shot. This ability is quantified and expressed in decibels (dB). For example, the dynamic range of an ordinary CCD camera is 3dB, while a wide dynamic camera can generally reach 80dB, and a good one can reach 100dB. Even so, this is still far worse than the human eye, whose dynamic range can reach 1000dB; the eagle’s vision is more advanced still, at 3.6 times that of the human eye.

So what do terms such as “ultra-wide dynamic” and “super-wide dynamic” mean? In fact, they are marketing labels. Some manufacturers add “super” to distinguish themselves from other manufacturers or to advertise their own wide dynamic performance. In reality, there are only first-generation and second-generation differences. To improve the dynamic range of their cameras, early manufacturers used double-exposure imaging followed by superimposed output: first a fast exposure of the brighter background to get a relatively clear background, then a slow exposure of the object to get a relatively clear object, and finally the two images are superimposed in video memory and output. This approach has two inherent disadvantages. First, the camera output is delayed, and there is serious smearing when shooting fast-moving objects. Second, the sharpness is still insufficient: when the background illumination is very strong and the contrast between object and background is large, it is difficult to obtain a clear image.

Wide dynamic range was especially popular in early analog and digital systems, where it was regarded as an important product selling point. In the AI era, this technology has still not been eliminated.

What are the common graphic (image) formats?

Generally speaking, current graphics (image) formats can be roughly divided into two categories: bitmaps, and what are variously called drawing-type, vector-type or object-oriented graphics (images). The former describe graphics (images) as a dot matrix, and the latter describe graphics (images) mathematically as a composition of geometric elements. The latter express images in a detailed and realistic manner, keep their resolution unchanged after scaling, and are widely used in professional-level graphics (image) work.

Before introducing the graphics (image) format, it is necessary to understand some related technical indicators of graphics (images): resolution, number of colors, and grayscale of graphics.

Resolution: divided into screen resolution and output resolution. The former is expressed as the number of lines per inch; the larger the value, the better the quality of the graphics (image). The latter is the precision of the output device, expressed as the number of pixels per inch.

Number of colors and grayscale: expressed in bits, generally written as 2 to the nth power, where n is the number of bits. A 24-bit graphic (image) can express 16.77 million colors, which is known as true color. Grayscale is expressed in a similar way. Let’s go through the current common graphic file formats one by one, identified by their characteristic file suffix (for example, .bmp): BMP, DIB, PCP, DIF, WMF, GIF, JPG, TIF, EPS, PSD, CDR, IFF, TGA, PCD, MPT.

BMP (bitmap picture): the most commonly used bitmap format on the PC, available in compressed and uncompressed forms. It can express colors from 2-bit to 24-bit, with resolutions from 480×320 to 1024×768. The format is very stable in the Windows environment and is widely used where file size is not a constraint.

DIB (device independent bitmap): The ability to describe images is basically the same as that of BMP, and it can run on a variety of hardware platforms, but the file size is larger.

PCP (PC Paintbrush, file suffix .pcx): a compressed, disk-space-saving PC bitmap format created by ZSoft, which can represent up to 24-bit graphics (images). It once had a certain market share, but its status has gradually declined with the rise of JPEG.

DIF (drawing interchange format, file suffix .dxf): a graphics file format from AutoCAD that stores graphics in ASCII form. It represents dimensions very accurately and can be edited by large software packages such as CorelDraw and 3DS.

WMF (Windows metafile format): the Microsoft Windows metafile, characterized by small file size and stylized patterns. Graphics of this type are relatively crude and can only be edited in Microsoft Office.

GIF (graphics interchange format): A compressed graphics format that can be processed by various graphics processing software on various platforms. The disadvantage is that it can only store up to 256 colors.

JPG (Joint Photographic Experts Group): a graphic format that can compress graphic files very heavily. For the same picture, a file stored in JPG format is 1/10 to 1/20 the size of other types of graphic files, and the color depth can reach 24 bits, so it is widely used on web pages and in online picture libraries.

TIF (tagged image file format): the files are very large, but so is the amount of information stored. The format holds finer levels of detail, which helps reproduce the tone and color of the original. It comes in compressed and uncompressed forms, and the maximum number of supported colors can reach 16M.

EPS (encapsulated PostScript): an ASCII graphic file described in the PostScript language, which can print high-quality graphics (images) on a PostScript graphics printer and can represent up to 32-bit graphics (images). The format is divided into the Photoshop EPS format (Adobe Illustrator EPS) and the standard EPS format; the latter can be further divided into a graphic format and an image format.

PSD (Photoshop standard): the standard file format in Photoshop, optimized specifically for Photoshop.

CDR (CorelDraw): the file format of CorelDraw. In addition, CDX is a graphics (image) file format that can be used by all CorelDraw applications and is a more mature form of the CDR file.

IFF (image file format): used on large-scale, high-end graphics processing platforms such as AMIGA machines; most Hollywood special-effects blockbusters are processed in this IFF format. Its graphic (image) effects, including color and texture, reproduce the original scene realistically. Of course, this format also consumes huge amounts of computer resources such as memory and external storage.

TGA (tagged graphic): a graphic file format developed early on by Truevision for its display cards, with a maximum color depth of 32 bits. VDA, PIX, WIN, BPX, ICB and others are related variants.

 What are “full-duplex” and “half-duplex”, “brightness”, “hue” and “saturation”, and search for pictures by picture?

What are “full duplex” and “half duplex”

Full-duplex: data can be sent and received at the same time. Full-duplex requires separate channels for receiving and sending. It can be used for communication between two stations and in star and ring networks, but not in bus networks.

Half-duplex: data cannot be sent and received at the same time; sending and receiving take turns on a time-shared basis. In half-duplex, the transmitter and receiver can share the same channel. It can be used in local area networks of various topologies and is most common in bus networks. The half-duplex data rate is theoretically half that of full-duplex.

What are “brightness”, “hue” and “saturation”

Any color can be described by brightness, hue and saturation, and any colored light seen by the human eye is the combined effect of these three characteristics. So what do brightness, hue and saturation mean?

Brightness: It is the feeling of brightness caused by light acting on the human eye, which is related to the luminous intensity of the observed object.

Hue: It is the color feeling produced when the human eye sees light of one or more wavelengths. It reflects the class of color and is the basic characteristic that determines color. For example, red and brown refer to hue.

Saturation: refers to the purity of the color, that is, the degree to which white light is mixed in, or the depth of the color.

For colored light of the same hue, the higher the saturation, the more vivid or pure the color. Hue and saturation together are commonly referred to as chroma.

It can be seen that luminance indicates how bright a given colored light is, while chroma indicates the type and depth of the color. In addition, the various colors of light commonly found in nature can be produced by mixing red (R), green (G) and blue (B) in different proportions; conversely, the vast majority of colored light can be decomposed into red, green and blue. This is the most basic principle in colorimetry, the principle of three primary colors (RGB).
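The relationship between an RGB triple and the hue/saturation/brightness description can be illustrated with Python's standard `colorsys` module. This is a minimal sketch: it uses the HSV model, whose "value" component is only a rough stand-in for perceived brightness, and the helper name `describe` is an assumption made for this example.

```python
import colorsys


def describe(r: int, g: int, b: int):
    """Convert 8-bit RGB to (hue in degrees, saturation 0-1, value 0-1)."""
    h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
    return round(h * 360), round(s, 2), round(v, 2)


print(describe(255, 0, 0))      # pure red: fully saturated
print(describe(255, 128, 128))  # red mixed with white: same hue, lower saturation
print(describe(128, 128, 128))  # gray: no hue information, zero saturation
```

Pure red and pink share the same hue but differ in saturation, which matches the description of saturation as the degree to which white light is mixed in.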

Search for pictures by picture

Searching for pictures by picture has become a basic function of intelligent video surveillance systems. It is a specialized search engine system that retrieves relevant image data, within the video surveillance system or on the Internet, by searching on image text or visual features, and is a subdivision of search engines. Users can search by entering keywords that match the image name or content, or by uploading an image or image URL to retrieve similar images.

Broadly speaking, image features include text-based features (such as keywords and annotations) and visual features (such as color, logo, texture and shape). Visual features can be further divided into general visual features and domain-related (locally specific) visual features. The former describe features common to all images, regardless of the specific type or content of the image, and mainly include color, texture and shape; the latter are based on prior knowledge (or assumptions) about the content of the image and are closely tied to specific applications, such as human facial features, vehicle license plates or vehicle characteristics.
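As an illustration of a general visual feature, the sketch below compares two images by their color histograms. It is a simplified example rather than the method used by any particular surveillance product: images are assumed to be plain lists of (R, G, B) tuples, and the function names and the choice of histogram intersection as the similarity measure are assumptions made here.

```python
def color_histogram(pixels, bins_per_channel=4):
    """Normalized RGB histogram of a list of (r, g, b) tuples with 8-bit channels."""
    if not pixels:
        raise ValueError("image has no pixels")
    step = 256 // bins_per_channel
    hist = [0.0] * bins_per_channel ** 3
    for r, g, b in pixels:
        index = ((r // step) * bins_per_channel + (g // step)) * bins_per_channel + (b // step)
        hist[index] += 1
    return [count / len(pixels) for count in hist]


def histogram_intersection(h1, h2):
    """Similarity in [0, 1]: 1.0 for identical color distributions, 0.0 for disjoint ones."""
    return sum(min(a, b) for a, b in zip(h1, h2))


# Tiny hypothetical "images": four pixels each.
red_car = [(250, 10, 10)] * 4
red_van = [(240, 20, 5)] * 3 + [(20, 20, 20)]
blue_car = [(10, 10, 250)] * 4

query = color_histogram(red_car)
print(histogram_intersection(query, color_histogram(red_van)))
print(histogram_intersection(query, color_histogram(blue_car)))
```

A retrieval system would rank database images by such a score; domain-related features such as faces or license plates need dedicated models and are outside the scope of this sketch.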

Searching for images by image is now a basic function of AI applications. By providing a global or local feature, such as a photo of a vehicle, a license plate, a face or a body feature, the user can quickly retrieve matching surveillance records from the video image information database.

What are “black level”, “white level” and signal-to-noise ratio?

What is “Black Level” and “White Level”

Black level: defines the signal level that corresponds to image data 0. Adjusting the black level does not affect the amplification of the signal; it only shifts the signal up or down. If the black level is adjusted up, the image becomes darker; if it is adjusted down, the image becomes brighter. When the black level of the camera is 0, levels below 0V are converted to image data 0, and levels above 0V are converted according to the magnification defined by the gain, up to a maximum value of 255. The black level setting (also called the absolute black level) is the lowest point of black. This lowest point relates to the energy of the electron beam emitted in a CRT picture tube: when the beam energy is below the threshold at which the phosphor (fluorescent substance) starts to emit light, the screen shows the deepest black. The US NTSC color TV system sets the absolute black level at 7.5IRE, meaning that signals below 7.5IRE are displayed as black, while the Japanese TV system sets the absolute black level at 0IRE.

White level: the counterpart of the black level, it defines the signal level that corresponds to image data 255. The difference between the white level and the black level defines the gain from another angle. In quite a few applications the user cannot see a white level adjustment because the white level is fixed in the hardware circuit.
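The mapping from signal level to 8-bit image data described above can be sketched as a shift by the black level followed by scaling by the gain and clamping to 0-255. The function names are illustrative, and the 0.7V full-scale figure used for the default gain is an assumption chosen only to make the example concrete.

```python
def level_to_code(voltage: float, black_level: float = 0.0, gain: float = 255 / 0.7) -> int:
    """Map a signal level (V) to 8-bit image data.

    Levels at or below black_level become 0; levels above it are scaled by
    gain (codes per volt) and clamped to a maximum of 255.
    """
    code = round((voltage - black_level) * gain)
    return max(0, min(255, code))


def white_level(black_level: float, gain: float) -> float:
    """Signal level that maps to image data 255 for a given black level and gain."""
    return black_level + 255 / gain


# Raising the black level shifts the whole signal down, so the same input
# voltage produces a smaller (darker) code; the gain is unchanged.
print(level_to_code(0.35, black_level=0.0))
print(level_to_code(0.35, black_level=0.1))
print(white_level(0.0, 255 / 0.7))
```

With the gain held constant, moving the black level also moves the white level by the same amount, which is why the difference between the two levels is another way of stating the gain.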

What is the signal to noise ratio

Signal-to-noise ratio (S/N, Signal/Noise) originally refers to the ratio between the strength of the maximum undistorted sound signal produced by a sound source and the strength of the noise present at the same time. In other words, it is the ratio of useful signal power (Signal) to noise power (Noise), usually written S/N and expressed in decibels (dB). The same calculation method also applies to image systems.
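The decibel form of this ratio can be computed as follows. This is a minimal sketch using the standard definitions: 10·log10 for a power ratio and 20·log10 for an amplitude (RMS voltage) ratio; the function names are chosen for this example.

```python
import math


def snr_db(signal_power: float, noise_power: float) -> float:
    """Signal-to-noise ratio in dB from signal and noise power."""
    return 10 * math.log10(signal_power / noise_power)


def snr_db_from_amplitudes(signal_rms: float, noise_rms: float) -> float:
    """Signal-to-noise ratio in dB from RMS amplitudes (power goes as amplitude squared)."""
    return 20 * math.log10(signal_rms / noise_rms)


print(snr_db(1000.0, 1.0))                # power ratio of 1000:1
print(snr_db_from_amplitudes(10.0, 1.0))  # amplitude ratio of 10:1
```

A power ratio of 1000:1 corresponds to 30 dB, and an amplitude ratio of 10:1 corresponds to 20 dB, so every additional 10 dB means ten times more signal power relative to the noise.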

For audio equipment, the signal-to-noise ratio is the ratio, in dB, of the maximum undistorted output of a signal to the unavoidable electronic noise. The larger the value, the better. Below 75dB, noise may be audible during silence. In general, the signal-to-noise ratio of a sound card is often unsatisfactory because of the large amount of high-frequency interference inside a computer.

The signal-to-noise ratio of the image captured by a camera and the sharpness of the image are both important indicators of image quality. The image signal-to-noise ratio is the ratio of the magnitude of the video signal to the magnitude of the noise signal. The two are generated at the same time and cannot be separated: the noise is a useless signal that affects the useful signal, yet it cannot be removed from the video signal. Therefore, when choosing a camera, it is enough to make sure that the useful signal is sufficiently larger than the noise, and the ratio of the two is taken as the measure. If the signal-to-noise ratio is high, the picture is clean, with no visible noise interference (which shows up as “snow” in the picture), and is comfortable to watch; if it is low, the picture is full of snow, which degrades normal viewing.

What is “line”, “progressive” and “interlaced”, illuminance/sensitivity and IRE?

“Line”, “progressive” and “interlaced”

In traditional CRT analog TV, the scan of an electron beam in the horizontal direction is called “line”, or “line scan”.

Each frame of the TV is composed of several horizontal scanning lines. The PAL system is 625 lines/frame, and the NTSC system is 525 lines/frame. If all the lines in this frame are continuously completed from top to bottom line by line, or the scanning sequence is 1, 2, 3, …, 525, this scanning method is called progressive scanning.

In fact, one frame of ordinary TV needs to be completed by two scans. The first pass scans only odd-numbered lines, that is, lines 1, 3, 5, …, 525, and the second pass scans only even-numbered lines, that is, lines 2, 4, 6, …, 524. This scanning method is interlaced scanning. A picture containing only odd or even lines is called a “field”. The field containing only odd lines is called “odd field” or “top field”, and the field containing only even lines is called “even field” or “bottom field”. That is, an odd field plus an even field equals one “frame” (one image).
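The split of a frame into two fields can be sketched with list slicing. This is an illustrative model only: a frame is represented as a list of line numbers, and the function names are assumptions made for this example.

```python
def split_fields(frame):
    """Split a frame (list of scan lines) into the odd (top) and even (bottom) fields."""
    top_field = frame[0::2]     # lines 1, 3, 5, ... (1-based numbering)
    bottom_field = frame[1::2]  # lines 2, 4, 6, ...
    return top_field, bottom_field


def weave(top_field, bottom_field):
    """Interleave two fields back into one full frame."""
    frame = []
    for i, line in enumerate(top_field):
        frame.append(line)
        if i < len(bottom_field):
            frame.append(bottom_field[i])
    return frame


ntsc_frame = list(range(1, 526))  # lines 1..525
top, bottom = split_fields(ntsc_frame)
print(top[:3], top[-1], len(top))           # odd field
print(bottom[:3], bottom[-1], len(bottom))  # even field
print(weave(top, bottom) == ntsc_frame)
```

For a 525-line frame the odd field holds 263 lines and the even field 262, and weaving the two fields together reproduces the original frame.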


Illuminance/sensitivity

Illuminance is a quantity that reflects light intensity. Its physical meaning is the luminous flux falling on a unit area. The unit of illuminance is lumens (Lm) per square meter, also called Lux: 1Lux=1Lm/m². Here Lm is the unit of luminous flux, historically defined as the amount of light radiated within a solid angle of 1 steradian by a 1/60 cm² surface of pure platinum at its melting temperature (about 1770 °C).

To get an intuitive feel for illuminance values, consider a worked example. A 100W incandescent lamp has a total luminous flux of about 1200Lm. Assuming that this flux is evenly distributed over a hemisphere, the illuminance at 1m and 5m from the light source can be obtained as follows: the area of a hemisphere with a radius of 1m is 2π×1²=6.28m², so the illuminance at 1m from the light source is 1200Lm/6.28m²=191Lux; similarly, the area of a hemisphere with a radius of 5m is 2π×5²=157m², so the illuminance at 5m from the light source is 1200Lm/157m²=7.64Lux. It can be seen that the illuminance from a point light source obeys the inverse square law.
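The worked example above can be reproduced in a few lines. This sketch makes the same assumption as the text, namely that the luminous flux is spread evenly over a hemisphere; the function name is chosen for illustration.

```python
import math


def hemisphere_illuminance(flux_lm: float, distance_m: float) -> float:
    """Illuminance in Lux when flux_lm is spread evenly over a hemisphere of the given radius."""
    area_m2 = 2 * math.pi * distance_m ** 2
    return flux_lm / area_m2


lux_at_1m = hemisphere_illuminance(1200, 1)
lux_at_5m = hemisphere_illuminance(1200, 5)

print(round(lux_at_1m), round(lux_at_5m, 2))
print(lux_at_1m / lux_at_5m)  # inverse square law: 5 times the distance, 1/25 the illuminance
```

The results round to 191 Lux and 7.64 Lux, and their ratio is 25, which is the inverse square law for a fivefold increase in distance.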

1Lux is approximately the illuminance produced by 1 candle at a distance of 1m. The minimum illuminance (Minimum Illumination) commonly listed in camera specifications means that the camera can obtain a clear image only when the scene illuminance is at least the stated Lux value. The smaller the value, the better, since it indicates a more sensitive CCD. Under the same conditions, a black-and-white camera requires about 10 times less illuminance than a color camera, which still has to process color intensity.

What is IRE

IRE is the abbreviation of Institute of Radio Engineers, and the video signal unit defined by this institution is also called IRE. Today, IRE values are often used to represent different levels of picture brightness. For example, 10IRE is darker than 20IRE, and the brightest level is 100IRE. So what is the difference between setting the absolute black level to 0IRE and to 7.5IRE? Because early monitors had limited performance, areas of the screen with brightness below 7.5IRE could not really show any detail and simply looked black. Setting the black level to 7.5IRE allows some signal components to be removed, which simplifies the circuit structure to a certain extent. Modern monitors perform far better and can display dark details well, so setting the black level to 0IRE reproduces the picture more faithfully.

What are “PAL format” and “NTSC format”, “field” and “frame”?

“PAL” and “NTSC”

Although the question of TV “format” is rarely discussed now, it was a very important concept in the era of analog video surveillance, as basic as whether motor vehicles drive on the left or on the right.

PAL (Phase Alternating Line) is a TV system established in 1965 and is mainly used in China, Hong Kong, the Middle East and Europe. The color bandwidth of this format is 4.43MHz, the audio bandwidth is 6.5MHz, and the picture rate is 25 frames per second.

The NTSC (National Television System Committee) format is a color television broadcasting standard formulated by the National Television System Committee of the United States in 1952. It is used in the United States, Canada, Taiwan, South Korea, the Philippines and other countries and regions. The color bandwidth of this system is 3.58MHz, the audio bandwidth is 6.0MHz, and the picture rate is 30 frames per second.

NTSC runs at 30 frames per second and PAL at 25 frames per second because of the mains frequency. In countries that adopted NTSC, the mains supply is 110V/60Hz, so the TV field frequency is taken directly from the 60Hz AC supply. Because two fields make up one frame, 60 divided by 2 gives 30, which is exactly the TV frame rate. China’s mains supply is 220V/50Hz, so by the same reasoning the frame rate is 25 frames per second.

“Field” and “Frame”

In traditional CRT analog TV, as opposed to the horizontal line scan, scanning in the vertical direction is called “field”, or “field scan”. Each TV frame is produced by scanning the screen twice, with the lines of the second scan filling the gaps left by the first scan. So a TV picture of 25 frames/s is actually 50 fields/s (for NTSC, 30 frames/s and 60 fields/s respectively).
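The relationship between field rate, frame rate and line rate can be checked with simple arithmetic. The sketch below uses the nominal figures given in this document (625 lines at 50 fields/s for PAL, 525 lines at 60 fields/s for NTSC); the dictionary layout and function name are assumptions made for the example.

```python
SYSTEMS = {
    "PAL": {"lines_per_frame": 625, "fields_per_second": 50},
    "NTSC": {"lines_per_frame": 525, "fields_per_second": 60},
}


def scan_rates(system: str):
    """Return (frames per second, lines scanned per second) for an interlaced system."""
    params = SYSTEMS[system]
    frames_per_second = params["fields_per_second"] / 2  # two fields make one frame
    lines_per_second = params["lines_per_frame"] * frames_per_second
    return frames_per_second, lines_per_second


for name in SYSTEMS:
    fps, line_rate = scan_rates(name)
    print(f"{name}: {fps:g} frames/s, {line_rate:g} lines/s")
```

With these nominal values, PAL scans 15625 lines per second and NTSC scans 15750 lines per second, so the two systems have similar line rates despite their different frame rates.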

The idea of “frame” comes from early movies, in which one still image is called a “frame” (Frame). Film runs at 25 frames per second because the persistence of vision of the human eye is just satisfied at that rate. Simply put, the frame rate is the number of pictures transmitted in 1s; it can also be understood as the number of times the graphics processor can refresh per second, and is usually expressed in FPS (Frames Per Second). Each frame is a still image, and displaying frames in rapid succession creates the illusion of motion. The more frames per second (fps), the smoother and more realistic the displayed motion.

When a computer plays a video on a monitor, it just displays a series of full frames, without the TV trick of interleaving fields. So neither video formats nor MPEG compression techniques designed for computer monitors use fields. Traditional analog systems use a CRT monitor (similar to a TV) for monitoring, which involves “fields” and “frames.” The digital system uses LCD or a more advanced display (similar to a computer display) to process images using computer technology, so it only involves “frames”, which is also the difference between a digital monitoring system and an analog monitoring system.

Even in the era of artificial intelligence, “frame” is still a very important concept, and how to extract effective “frames” in continuous pictures is crucial. When extracting features of the same face, license plate, human body, and vehicle, how to avoid repeated extraction and extract the clearest picture lies in the “frame extraction” technology.
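One simple way to pick the clearest picture among several frames of the same target is to score each frame with a sharpness measure and keep the best one. The sketch below is only an illustration of that idea, not the method used by any specific product: frames are assumed to be small grayscale images stored as lists of rows, and the gradient-energy score and function names are assumptions made here.

```python
def sharpness(frame):
    """Gradient energy: sum of squared differences between neighboring pixels."""
    height, width = len(frame), len(frame[0])
    total = 0
    for y in range(height):
        for x in range(width):
            if x + 1 < width:
                total += (frame[y][x + 1] - frame[y][x]) ** 2
            if y + 1 < height:
                total += (frame[y + 1][x] - frame[y][x]) ** 2
    return total


def pick_sharpest(frames):
    """Return the index of the frame with the highest sharpness score."""
    return max(range(len(frames)), key=lambda i: sharpness(frames[i]))


# Hypothetical 2x2 grayscale crops of the same target in consecutive frames.
blurry = [[100, 110], [110, 120]]  # soft edges, small differences
sharp = [[0, 255], [255, 0]]       # strong edges, large differences
print(pick_sharpest([blurry, sharp]))
```

In practice such a score would be combined with tracking, so that only one best frame per face, license plate, person or vehicle is passed on for feature extraction.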