Musings of an engineer: The Quantization of Human Perception

We, as humans, like to think we're special, and so we strive for better technologies. We want our MP3s to be high-quality and our videos to be 1080p high-def. But why? It's because we think (and it's not our fault; our brains trick us into thinking this) that our perception is the be-all and end-all of sensory quality. We think that we see and hear things with infinite resolution. But we don't. In fact, our senses of sight and sound are quantized and discretized, which means there's a limit to the quality we can perceive.

"Quantized and discretized"... what does that mean? Discretization is the idea that a function (or signal, or sight/sound as pertaining to this post) is not perceived continuously; instead, we "sample" it several times a second. It's true that the world around us is constantly in motion, but it has been reported (citation on its way...) that the brain only takes 8-15 visual snapshots of the world per second--think "frames per second"--and then pieces them together to create vision.

Quantization is the idea that a system does not have infinite depth of resolution; instead, the signal gets bucketed into a finite set of levels. Think of this as rounding a number up or down--3.14 becomes 3, 5.67 becomes 6. We do this too--nerve endings are only so sensitive, and can only relay a finite number of "levels" of perception.
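To make the two ideas concrete, here is a minimal sketch that samples a smooth signal a dozen times a second and then rounds each sample into a handful of buckets. The function names (`sample`, `quantize`, `tone`), the 12 frames per second and the 5 levels are all illustrative assumptions, not measurements of anything biological.

```python
import math

def sample(signal, duration_s, rate_hz):
    """Discretize: evaluate a continuous-time function at evenly spaced instants."""
    n = int(duration_s * rate_hz)
    return [signal(k / rate_hz) for k in range(n)]

def quantize(value, levels, lo=-1.0, hi=1.0):
    """Quantize: snap a value in [lo, hi] to the nearest of `levels` evenly spaced buckets."""
    step = (hi - lo) / (levels - 1)
    clipped = min(max(value, lo), hi)  # values outside the range saturate
    return lo + round((clipped - lo) / step) * step

def tone(t):
    """A smooth 1 Hz sine wave standing in for the 'real world' (hypothetical input)."""
    return math.sin(2 * math.pi * t)

# Twelve snapshots per second (assumed rate), each rounded to one of 5 levels (assumed depth).
frames = sample(tone, duration_s=1.0, rate_hz=12)
coarse = [quantize(v, levels=5) for v in frames]
```

Whatever the sine wave does between two snapshots is lost to sampling, and whatever falls between two buckets is lost to rounding. Those are the two separate limits this post is about.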

So let's look at the sense of sight. Light enters our eyes and is absorbed by our retinas. The human retina contains four types of photoreceptor: three types of cone and one type of rod. The rod cells are much more sensitive to light, but can't distinguish color (there are roughly 90 million rods in a human retina). The cones come in three varieties (S, M, and L, for Short, Medium, and Long) and are good at perceiving color in brighter light. (Incidentally, this is the reason why a dark room generally looks "grayscale"--the cones aren't sensitive enough to pick up color, but the rod cells are sensitive enough to at least perceive the low light levels.) The names refer to the wavelengths of light each cone type absorbs most efficiently, not to the size of the cell, and because most of us have three cone types we are "trichromatic". There's also a rare condition called tetrachromacy, found almost exclusively in females, which adds a fourth type of retinal cone cell. Tetrachromats can "see more colors" than trichromats, simply because of the additional absorption wavelength afforded by this mutant retinal cell.

But the point is that there are only about 90 million rods and 4.5 million cones in the retina. Rods are very sensitive and can respond to just a single photon of light, but cones require tens to hundreds of photon absorptions per sampling cycle to be activated. In this sense, our vision is quantized--at the very least by the fact that our retinal cells respond to a discrete number of photons, but more likely by the fact that a cone will output the same signal when hit by 54 photons as it would when hit by 55 photons. Furthermore, a discrete number of retinal receptors means that there's a discrete number of "pixels" we can perceive. And color-wise, there are only so many combinations of S, M and L cones that can be activated by light. (Of course, this is different for tetrachromats, who have a finer color resolution than the rest of us.)

Vision is then both quantized and discretized. We take only a dozen or so frames per second of the world around us, and those snapshots have only a few million gradations of color that we can perceive. Fortunately, the brain hides this from us: it patches the snapshots together into a beautiful animation and "blends" the colors together.

Sound operates similarly. Our brain does the same "snapshot" trick with our ears, and so our hearing is sampled several times a second. With the sense of hearing though, the quantization aspect is much more interesting.

In math, we have a tool called the "Fourier transform", which takes a signal and breaks it down into its component sine waves; the idea is that any periodic waveform can be represented by a (possibly infinite) sum of sine waves of varying frequencies and magnitudes. Most people don't know that our ears do essentially the same thing!
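As a rough illustration of "breaking a signal down into sine waves", the sketch below builds a signal from two known tones and then recovers them with a plain discrete Fourier transform written out by hand. The helper name `dft_magnitudes`, the 64-sample rate and the two test tones (5 Hz and 12 Hz) are assumptions chosen to keep the arithmetic tidy.

```python
import cmath
import math

def dft_magnitudes(samples):
    """Return the amplitude of each frequency bin from 0 up to half the sample count."""
    n = len(samples)
    mags = []
    for k in range(n // 2 + 1):
        # Correlate the signal with a complex sinusoid that completes k cycles.
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(samples))
        mags.append(abs(acc) * 2 / n)  # scale so a sine of amplitude A reads as A
    return mags

RATE = 64  # samples per second; one second of signal, so bin k corresponds to k Hz

# A "chord" of two tones: 5 Hz at amplitude 1.0 and 12 Hz at amplitude 0.5.
signal = [math.sin(2 * math.pi * 5 * i / RATE)
          + 0.5 * math.sin(2 * math.pi * 12 * i / RATE)
          for i in range(RATE)]

mags = dft_magnitudes(signal)
peaks = [k for k, m in enumerate(mags) if m > 0.1]  # bins holding real energy
```

Only the bins for the two tones that went in carry any meaningful energy. Everything else is numerical noise.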

The cochlea is the organ in our inner ear responsible for the actual perception of sound. The eardrum is a membrane that vibrates with pressure waves in the air, and the "hammer, anvil, and stirrup" bones act as a system of levers to magnify that force. The energy from the vibrations of air is then transmitted to our cochlea, which is filled with fluid. The fluid sets roughly 20,000 tiny hairs on the "Organ of Corti" vibrating. The magical part is what happens when these hairs vibrate. Each of these 20,000 little hairs has a different length and thickness, which means that each one resonates at a different frequency. When an individual hair's frequency is excited (i.e., that frequency occurs as part of a sound), the hair vibrates wildly, and these vibrations are picked up by very tiny, very special nerve cells below it. These signals are then sent to the brain and interpreted as sound.

The amazing part, though, is that our ears do essentially what the Fourier transform does, except that they accomplish it mechanically. If you pluck a guitar string, the frequencies traveling through the air include the fundamental as well as integer harmonics of the fundamental frequency. When that hits our ear, only certain groups of hairs will vibrate. There's no infinite magic there though; we can only perceive as many different frequencies as there are hairs in our organ of Corti (so only about 20,000 distinct frequencies). There's also an upper and lower limit to the frequencies we can hear, and these limits are governed by how big the largest hair is and how small the smallest hair is (which is small enough to resonate at around 20 kHz). If you were ever wondering why the typical sample rate of digital music is 44.1 kHz, that's the reason: a sampled signal can only represent frequencies up to half its sample rate (the Nyquist frequency), and anything above that becomes aliased (it shows up as a lower frequency than it actually is). So some brilliant computer scientist realized that the highest frequency we could perceive was roughly 20 kHz, and so made computer audio sample at just a little more than twice that, so that we could never hear the aliasing effects. And for more quantization: there are only a handful of nerve endings attached to these tiny hairs--the number of nerve endings excited by the hair's vibration determines the amplitude of that frequency's presence.
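The aliasing argument can be checked with a few lines of arithmetic. The sketch below folds a tone's frequency around half the sample rate, then confirms that the too-high tone and its lower-frequency alias produce the same samples. The function names and the 30 kHz test tone are illustrative assumptions.

```python
import math

def alias_frequency(freq_hz, sample_rate_hz):
    """Return the apparent frequency of a pure tone after sampling."""
    folded = freq_hz % sample_rate_hz
    nyquist = sample_rate_hz / 2
    # Anything above half the sample rate is mirrored back below it.
    return sample_rate_hz - folded if folded > nyquist else folded

def sampled_tone(freq_hz, sample_rate_hz, count):
    """Take `count` samples of a cosine at the given frequency."""
    return [math.cos(2 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(count)]

RATE = 44100.0  # the standard CD sample rate discussed above

# A 30 kHz tone is above RATE / 2, so it masquerades as a lower frequency.
apparent = alias_frequency(30000.0, RATE)

# Compare the samples of the real tone with those of its alias.
original = sampled_tone(30000.0, RATE, 50)
impostor = sampled_tone(apparent, RATE, 50)
max_gap = max(abs(a - b) for a, b in zip(original, impostor))
```

Here `apparent` comes out at 14.1 kHz, well inside the audible range, and `max_gap` is down at floating-point rounding error: once sampled, the two tones cannot be told apart. A 20 kHz tone, by contrast, sits below half of 44.1 kHz and passes through unchanged.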

And as far as the Fourier series goes: most people think of a sum with finitely many terms as merely an approximation. As you know, it's impossible to have an infinite number of terms in the real world--so of COURSE any Fourier series we actually compute has to be an approximation! Well, realize that the human ear does actually perform the Fourier series, with exactly as many "terms" as there are hairs on the organ of Corti--which is 15,000-20,000. Your brain simply reconstructs those separate signals--it literally breaks down the sound traveling through the air and rebuilds it in your brain. If you're wondering how many terms to include in your audio-related Fourier series, just know that it's pointless to go past 20,000.
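To get a feel for how much each extra term buys you, here is a small sketch that approximates a square wave with 1, 10 and 100 terms of its Fourier series and measures the leftover error. The square wave, the grid size and the term counts are assumptions picked for illustration; they say nothing about the ear itself.

```python
import math

def square_wave(x):
    """The target: +1 for the first half of each period, -1 for the second."""
    return 1.0 if math.sin(x) >= 0 else -1.0

def partial_sum(x, n_terms):
    """First n_terms of the square wave's Fourier series (odd harmonics only)."""
    return (4 / math.pi) * sum(math.sin(k * x) / k
                               for k in range(1, 2 * n_terms, 2))

def rms_error(n_terms, grid=2000):
    """Root-mean-square gap between the target and the truncated series."""
    # Offset the grid by half a step so no point lands exactly on a jump.
    xs = [(i + 0.5) * 2 * math.pi / grid for i in range(grid)]
    sq = sum((square_wave(x) - partial_sum(x, n_terms)) ** 2 for x in xs)
    return math.sqrt(sq / grid)

errors = {n: rms_error(n) for n in (1, 10, 100)}
```

The error keeps falling as terms are added, but each extra term helps less than the one before it, which is the sense in which piling on terms past some point stops being worth it.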

So if you ever thought that your computer monitor or digital music wasn't good enough for you, think again. These devices were engineered to be as good as we can perceive (well, admittedly, the larger TV screens get, the higher resolution they need to display--but my argument still holds for audio!). Trust me, you really don't need your music to have a sample rate any higher than the standard 44.1 kHz. Anything past that is just cork-sniffing.

Wednesday, January 13, 2010
