How Computers Compress Text: Huffman Coding and Huffman Trees

  • Published on: 11 September 2017
  • Computers store text (or, at least, English text) as eight bits per character. There are plenty of more efficient ways that could work: so why don't we use them? And how can we fit more text into less space? Let's talk about Huffman coding, Huffman trees, and Will Smith.

    Thanks to the Cambridge Centre for Computing History: http://www.computinghistory.org.uk/

    Thanks to Chris Hanel at Support Class for the graphics: http://supportclass.net

    Filmed by Tomek: https://youtube.com/tomek

    And thanks to my proofreading team!

    I'm at http://tomscott.com
    on Twitter at http://twitter.com/tomscott
    on Facebook at http://facebook.com/tomscott
    and on Snapchat and Instagram as tomscottgo
  • Runtime: 6:31
  • tom scott tomscott the basics computer science huffman coding compression huffman tree lossless compression
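
As a rough illustration of the idea in the description, here is a minimal Python sketch of Huffman coding. It is only a sketch under stated assumptions, not the method shown in the video or the one used by any real compressor: the function names (`build_codes`, `encode`, `decode`) are made up for this example, bits are kept as a string of "0" and "1" characters for readability, and the code table is handed straight to the decoder rather than stored alongside the data.

```python
import heapq
from collections import Counter


def build_codes(text):
    """Build a Huffman tree from character counts and return {char: bitstring}."""
    counts = Counter(text)
    # Heap entries are (weight, tie_breaker, tree). A tree is either a single
    # character (leaf) or a (left, right) tuple. The unique tie_breaker means
    # two trees are never compared directly.
    heap = [(w, i, ch) for i, (ch, w) in enumerate(sorted(counts.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    # Repeatedly merge the two least frequent subtrees into one.
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next_id, (t1, t2)))
        next_id += 1

    codes = {}

    def walk(tree, path):
        if isinstance(tree, tuple):
            walk(tree[0], path + "0")  # left branch adds a 0
            walk(tree[1], path + "1")  # right branch adds a 1
        else:
            # Text with one distinct character still needs a 1-bit code.
            codes[tree] = path or "0"

    if heap:
        walk(heap[0][2], "")
    return codes


def encode(text, codes):
    """Concatenate the code of each character."""
    return "".join(codes[ch] for ch in text)


def decode(bits, codes):
    """Read bits until they match a code, emit that character, start again."""
    reverse = {code: ch for ch, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in reverse:
            out.append(reverse[current])
            current = ""
    return "".join(out)


if __name__ == "__main__":
    message = "abracadabra"
    table = build_codes(message)
    packed = encode(message, table)
    print(table)
    print(len(packed), "bits instead of", 8 * len(message))
    print(decode(packed, table))
```

Decoding needs no separators between characters because every character sits at a leaf of the tree, so no code is the start of another code. A real file format would also have to store the tree or the code table with the compressed data, which this sketch leaves out.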

COMMENTS: 40

  • Tom Scott
    Tom Scott   2 years ago

    This is the last of the three trial Basics videos! This pushed my quick-explanation skills to the limit, but I figure that "slow down the video and replay if necessary" is better than "let people get bored"...

  • Bastiaan W
    Bastiaan W   5 days ago

    I knew this, but I never found a clever way to store the tree.

  • tjampman
    tjampman   1 week ago

    I didn't get it, how did it solve the bit/byte size problem? Am I gonna have to go to Wikipedia?

  • A.H.S.
    A.H.S.   2 weeks ago

    5:52 no one is gangsta until they can mathematically prove they are the best 😎

  • mamertvonn
    mamertvonn   3 weeks ago

    I've always thought that was how zip files work. I can't believe I wasn't too far off the mark.

  • Benedek Horváth
    Benedek Horváth   4 weeks ago

    The text in the thumbnail is not compressed, it is abjadic English.

  • Mr. Privat
    Mr. Privat   4 weeks ago

    Why use only 0 and 1 for selecting the path for the char? Why not use 4 paths for each letter or number: 1) 00, 2) 01, 3) 10, 4) 11?

  • Khanh Liem Pham
    Khanh Liem Pham   1 month ago

    5:55 "the most efficient way..." /ɘ/, not /ɪ/, Tom.

  • Addison Chan
    Addison Chan   1 month ago

    Can you make one on zip files? Like pictures, text files, etc?

  • 64bitrobot
    64bitrobot   1 month ago

    Okay I'm really mad I wanted zip files explained!

  • Mustafa Ozan Alpay
    Mustafa Ozan Alpay   1 month ago

    Well, I have my algorithms final tomorrow and I wanted to have a quick recap, Tom Scott nailing it, again. Thanks!

  • Maskah leo
    Maskah leo   1 month ago

    until Pied Piper was invented followed by Nucleus

  • Spam Filtration
    Spam Filtration   1 month ago

    ayo you better give your editor a raise; this video is bang on

  • legoman7041
    legoman7041   1 month ago

    This was my favorite project in college.

  • DL
    DL   2 months ago

    good ol' graph theory - binary tree. Awesome!

  • Reflexez
    Reflexez   2 months ago

    who reloaded their video @0:45 😂 😂 😂 😂 😂

  • Redo From Start
    Redo From Start   2 months ago

    I’m rewatching loads of Tom Scott videos, and I’m finding a lot of little easter eggs in the worms he’s saying

  • charl
    charl   3 months ago

    If I had a computer take some Huffman-compressed text and interpret it as an array of 8-bit characters, could I compress something multiple times over?

  • MS Thalamus
    MS Thalamus   3 months ago

    Likely already said below, but lazy: the original ASCII code was only 7 bits, i.e. 128 possible values. The other 128 characters were added later as Extended ASCII. This included non-language characters, such as straight lines and 90 degree angles, often used to create "windows" or progress bars within DOS programs, for example. It might seem odd to have a 7 bit code, given that bytes are 8 bits wide, but... once upon a time, byte size varied from machine to machine. There were 6 bit bytes before 7 bit bytes came on the scene, and it wasn't until the IBM System/360 that we finally standardized on the 8 bit byte everyone* knows about today. As far as Huffman coding goes, that was a really clever approach!

  • rawnak
    rawnak   3 months ago

    This is basically how Morse code works

  • Werevampiwolf
    Werevampiwolf   3 months ago

    The fact that you used Wild Wild West is hilarious

  • Piramida Skripsi
    Piramida Skripsi   3 months ago

    Thank you, brother. Best regards to Tom Scott and colinfurze.

  • Coach Hannah
    Coach Hannah   4 months ago

    Wouldn’t a lookup table for maybe 10,000 ‘standard’ words be the most efficient method? Add compression for ‘words’ not on the list.

  • Sachin Nair
    Sachin Nair   4 months ago

    I'm a bit confused by the part where you said each character is basically a path of the tree. If some characters have shorter paths than others, how does the code or computer know where to break it up?

  • Peter Yianilos
    Peter Yianilos   5 months ago

    Just when I think I’ve seen your best work, Tom, along comes this. Excellent!

  • jarod D. vernat
    jarod D. vernat   5 months ago

    So good that i find this after i spent hours figuring out huffman trees

  • hasan özçifçi
    hasan özçifçi   5 months ago

    great video, but, i did not get how the decoder computer sees the huffman tree?

  • ForboJack
    ForboJack   5 months ago

    Tom Scott: Text has to be losslessly compressed! Xerox: Hold my copy machine!!!

  • Joshua B
    Joshua B   5 months ago

    So zip files will find blocks of characters that are used in varying frequencies and turn those into bits? Like a word being 1?

  • Commentur The Great
    Commentur The Great   5 months ago

    "... or you'll send the wrong worms" duפn Nakk, tnet goכe vסz avvfai!