When, ten or fifteen years ago, the Chinese government decided to invest serious resources in censoring the internet, few believed it could be done. The web just changes too fast, they said. Routing can be switched in a fraction of a second. Fast forward to the present, though, and we can see that the Great Firewall, as China's censorship system has become known, is actually pretty effective in controlling what Chinese citizens can see and say. It's not completely watertight; savvy users have learned to exploit holes as soon as they emerge, before they are plugged by the authorities, but the average person's communications and access are heavily curtailed.
The Great Firewall operates using a combination of automated and manual filtering. It employs cutting-edge technology from China's tech giants, standard computer classification techniques and a great deal of manual labour, with millions employed on a full or part-time basis.
Most subjects are not blocked on a permanent basis; exceptions include the Tiananmen Square protests of 1989, democracy and human rights. Rather, words, phrases and images are blocked temporarily, depending on what's going on at the time. In the weeks leading up to the 19th Congress of the Chinese Communist Party (CCP) last year, any references to that carefully stage-managed event were routinely suppressed. But people chatting on WeChat or QQ - China's social media behemoths - can generally talk about that event now, provided they are not overtly critical.
After the recent CCP announcement that the increasingly authoritarian leader Xi Jinping is to be president for life - another forbidden topic of conversation - use of the Roman letter 'n' was banned after the censors noticed it was being used in the context of 'an indeterminate number of years'. However, 'n' is now OK with the censors.
Dr Johannes Ullrich, dean of research at US security firm SANS Institute, has spent some time prodding the Great Firewall to work out how it functions, using a list of banned words compiled by CitizenLab of Canada. In particular, though, Ullrich has been studying the methods by which 'bad' images are blocked so effectively.
"Latency is one of the most interesting parts, that they are able to do it so quickly," Ullrich told V3. "I expect they have a hybrid approach, with a machine learning part that identifies images as bad then a second part that will identify all copies of the image. In that way, not every image goes to the machine learning; only new images... If an image is absolutely identical then it doesn't need to go through that process."
One of Ullrich's experiments involves manipulating known banned images and characters, adding waviness to see if the Great Firewall still recognises them in their altered state. Generally speaking, it does, leading him to conclude that optical character recognition (OCR) is certainly part of the mix. A simpler method appears to be used to spot images flagged as 'bad' once they become widely distributed: a hash of the image is checked against a library of hashes of known bad images. Evidence for this is that images shared in groups seem to be blocked faster than those sent to an individual.
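The hybrid approach Ullrich describes can be illustrated with a short Python sketch. It is purely hypothetical and not a reconstruction of the Great Firewall: the names `make_filter` and `classify` are invented here, SHA-256 is an arbitrary choice of hash, and the classifier is a stand-in for whatever slow machine learning or OCR step might really be used. Exact copies of a flagged image are caught by a cheap hash lookup, and only unseen images pay for the slow check.

```python
import hashlib
from typing import Callable, Dict, Set, Tuple


def make_filter(
    classify: Callable[[bytes], bool],
) -> Tuple[Callable[[bytes], bool], Dict[str, int]]:
    """Build a two-stage image filter (illustrative only).

    `classify` is a hypothetical stand-in for the slow step, e.g. OCR or
    a machine learning model, that decides whether a new image is 'bad'.
    """
    known_bad: Set[str] = set()  # library of hashes of flagged images
    stats = {"fast_hits": 0, "classifier_calls": 0}

    def is_blocked(image_bytes: bytes) -> bool:
        digest = hashlib.sha256(image_bytes).hexdigest()
        # Fast path: a byte-for-byte identical copy was flagged before.
        if digest in known_bad:
            stats["fast_hits"] += 1
            return True
        # Slow path: only images not seen before reach the classifier.
        stats["classifier_calls"] += 1
        if classify(image_bytes):
            known_bad.add(digest)  # later copies take the fast path
            return True
        return False

    return is_blocked, stats


# Toy classifier: treats any payload containing b"banned" as 'bad'.
is_blocked, stats = make_filter(lambda data: b"banned" in data)
print(is_blocked(b"banned picture"), is_blocked(b"banned picture"), stats)
```

An exact hash like this only matches identical files, so a wavy or cropped copy would fall through to the slow path, which fits the observation that altered images are still recognised but widely shared ones are blocked fastest.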
Deleting headings containing known banned words and retrying also led to the images being blocked. This has led to his assumption that machine learning is in play here.
"They probably use some machine learning algorithms, but they have to have a library of known bad images and then those machine learning algorithms try to detect patterns among those images," Ullrich said. "That's why they're not disrupted by a major change like wiping out the heading. The other option is they don't recognise the entire image. If they run OCR on it they only look at as much of the image they can within whatever time they allot themselves, and once the time is up they stop the recognition and let the image pass if it didn't trip any of the image data at that point."
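The second possibility Ullrich raises, recognition that stops when its time allowance runs out, can also be sketched in Python. This is an assumption-laden illustration rather than anything known about the real system: `scan_with_budget` is a made-up name, each string in `regions` stands in for the OCR output of one part of an image, and the clock is injectable only so the behaviour can be checked deterministically.

```python
import time
from typing import Callable, Iterable, Set


def scan_with_budget(
    regions: Iterable[str],
    banned: Set[str],
    budget_s: float,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Return True if the image should be blocked (illustrative only).

    Regions are checked in order until the time budget is spent. If time
    runs out before a banned term is found, the image is let through.
    """
    deadline = clock() + budget_s
    for text in regions:
        if clock() >= deadline:
            return False  # out of time: stop scanning and let it pass
        if any(term in text for term in banned):
            return True  # a banned term was found within the budget
    return False  # whole image scanned, nothing tripped the filter


# With a generous budget, a banned term in the second region is found.
print(scan_with_budget(["hello", "forbidden phrase"], {"forbidden"}, 5.0))
```

Under this model, banned text placed late in the scan order could slip through when the budget is tight, which is the behaviour Ullrich suggests as an alternative explanation.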