How Computers Compress Text: Huffman Coding and Huffman Trees

  • Published on: 11 September 2017
  • Computers store text (or, at least, English text) as eight bits per character. There are plenty of more efficient ways that could work: so why don't we use them? And how can we fit more text into less space? Let's talk about Huffman coding, Huffman trees, and Will Smith.

    Thanks to the Cambridge Centre for Computing History:

    Thanks to Chris Hanel at Support Class for the graphics:

    Filmed by Tomek:

    And thanks to my proofreading team!

    I'm at
    on Twitter at
    on Facebook at
    and on Snapchat and Instagram as tomscottgo
  • Runtime : 6:31
  • tom scott tomscott the basics computer science huffman coding compression huffman tree lossless compression


  • Tom Scott
    Tom Scott   2 years ago

    This is the last of the three trial Basics videos! This pushed my quick-explanation skills to the limit, but I figure that "slow down the video and replay if necessary" is better than "let people get bored"...

  • traps are gay
    traps are gay   1 days ago

    So good that i find this after i spent hours figuring out huffman trees

  • traps are gay
    traps are gay   1 days ago

    So good that i find this after i spent hours figuring out huffman trees

  • hasan özçifçi
    hasan özçifçi   3 days ago

    great video, but, i did not get how the decoder computer sees the huffman tree?

  • ForboJack
    ForboJack   1 weeks ago

    Tom Scott: Text has to be losslessly compressed! Xerox: Hold my copy machine!!!

  • Joshua Boylan
    Joshua Boylan   1 weeks ago

    So zip files will find blocks of characters that are used in varying frequencies and turn those into bits? Like a word being 1?

  • Commentur The Great
    Commentur The Great   2 weeks ago

    "... or you'll send the wrong worms" duפn Nakk, tnet goכe vסz avvfai!

  • Sham Teal
    Sham Teal   2 weeks ago

    Those old BBC computers are worth their weight in gold, even if they aren't working.

  • stoeger 2
    stoeger 2   4 weeks ago

    Why don't they just use quantum bits, duh

  • Trurl
    Trurl   4 weeks ago

    Surprised we didn't learn about Huffman coding in university. Could have been mentioned right after the sorting algorithms...

  • Arun
    Arun   4 weeks ago

    wow, very well explained, using trees (and their uses) explained properly, text compression.

  • William Brennan
    William Brennan   1 months ago

    Anyone remember the thing like this in Gregor the Overlander? The Tree of Transmission, I think it was called?

  • Mike
    Mike   1 months ago

    A9 07 20 EE FF 60

  • Elliander Eldridge
    Elliander Eldridge   1 months ago

    What's really interesting is that genetics has similar rules. Codons, for example, are made of three nucleotides. Each nucleotide taking the place of a bit. The bits represent amino acids, but the start and stop codons are used in a manner similar to assembly programming. One big difference is that genetics isn't digital. Rather than two binary bit states there are four possible bit states: A, T, C, and G.To some it might seem like binary because the amino acids appear in pairs, but they pair between strands so that the genetic information can be copied. Each codon has a length of 3 bits, and because there are 4 possible states per bit that allows for 64 possible combinations.What's even more interesting is that there are over five times as many proteins as there are genes to code for. To oversimplify things, the epigenome acts as a kind of encryption layer switching genes on and off in order to change the way the instructions run. In other words, our DNA is heavily encrypted, but the same mechanisms also allow for the real-time adaptation to environmental conditions. This approach seems like it's more efficient then what's represented in this video. I'm curious if this could be applied in computer science.

  • dumnor
    dumnor   1 months ago

    1's and 0's or in other words haves and have-nots.

  • Winkel Yin
    Winkel Yin   1 months ago

    How would the space character be encoded in the example of the Huffman tree for wild wild west?

  • Kamil's View
    Kamil's View   1 months ago

    Hi Tom. Electrons do not actually go across the wire - electricity does, but not electrons.

  • Larho Cherqi
    Larho Cherqi   1 months ago

    Someone needs to make a program to do this.

  • Popsii
    Popsii   1 months ago

    In the days of the Sinclair Spectrum entire words were stored using a single byte of memory by having keywords among the ASCII list.

  • abababbb
    abababbb   1 months ago

    what does it mean by sending wrong worms?

  • ceruchi
    ceruchi   1 months ago

    Yo, the YT algorithm is giving me a Tom Scott renaissance right now and it's all so good. Love this guy!

  • pmailkeey
    pmailkeey   1 months ago

    if cpu as clev as H brn & comp. wrds not lets but gram sentcs ......> flnm.z files !

  • Sierra Whisky
    Sierra Whisky   1 months ago

    Just marvelous how you went from Acorn Atom to Acorn Electron and ended up at the BBC Micro, developed by Acorn. Although it's not chronologically right, it is a nice piece of (British) computer history, especially when you realise Acorn has an impressive legacy. Almost every single smartphone anywhere in the world uses technology from ARM. ARM, which was originally an acronym for Acorn RISC machines, derived from Acorn.It brings back good memories. 😉

  • Jack Shears
    Jack Shears   1 months ago

    2 computer science lesson now make sense after 6:30 of of Tom Scott explaining Huffman Code! Thank you

  • only photoshop
    only photoshop   1 months ago

    "or you may send the wrong woms"All comments: That jake waz awfulme an intellectual: plank xswegwoqrunqnrclw oe wrhcqioW HH

  • Andreas Hontzia
    Andreas Hontzia   1 months ago

    The animation of the red rectangle trying to parse the bits is awesome! I laughed really hard.

  • phil anglade
    phil anglade   1 months ago

    0:02 : Why "English" text ? It may work for every text, whatever the language. Or am I missing something ?

  • Rosscoff2000
    Rosscoff2000   1 months ago

    8 bits per character was a relatively modern luxury when i was young. Of course it started out with 5-bit encoding as used on punch paper tapes (no hole =0, hole=1). Giving 31 codes for letters and a serial shift code for numbers.Then we got US ASCII which was 7-bit. Then extended ASCII/ANSI 8-bit, and then of course the modern day 16-bit Unicode.

  • mlg
    mlg   2 months ago

    Spend way too much time on the basic stuff and not enough on the huffman coding and trees and why they actually work. As usual.....