Lab 0: Password cracking
Instructions on how to submit Lab 0: Please download all the required files from the lab0 github repo.

Code: Place your code answers in the template
sol.py
. Please include all code necessary to generate your solution in each of the respective methods. Do not just hard code working answers! 
Text: Place your written answers in the template
questions.txt
Upload all files (sol.py
, questions.txt
, and any supplementary files you used) to the lab0 gradescope assignment.
Gradescope autograder: Your code will be graded with the Gradescope autograder with a timeout of 5 minutes (2a), 10 minutes(2b), 2 seconds (3ae), 20 minutes(4b). Your code should reliably succeed in this timeframe.
There is a STRICT 6.0GB memory limit on Gradescope. This should be sufficient for reasonable solutions, however, if you generate very large dictionaries, sets, or lists, you may exceed this memory limit and the Gradescope tester will fail.
Plagiarism: Gradescope automatically runs a surprisingly effective plagiarismdetection tool on your submissions. Please do not copy code from your fellow students. Refer to the “Collaboration” section of the course info document for details on what types of collaboration are and aren’t allowed in 6.1600. If you are having trouble completing an assignment for whatever reason, please ask the course staff for help. We are often happy to give help and, in many cases, extensions too! We are not happy when we find copied code.
Optional reference reading: The BonehShoup book, Chapter 18.3 is a good place to look if you would like to see a very detailed formal treatment of the ideas covered in this problem set.
Problem 1: Storing passwords
Unixlike operating systems store password hashes in a file called /etc/shadow
.
Answer the following questions:

Typically, only the
root
user on a machine has access to theshadow
file. Theshadow
file only contains passwords hashes – not the passwords themselves. In one sentence, explain why it is important to protect the hashed passwords. 
Look online for information about the
shadow
file format. Why does the format allow the password hashes for each user to be hashed with a different hash function?Learn about the PBKDF2 hash function by reading Chapter 18.4.3 of the BonehShoup book. Then answer the following questions:

What security benefit does iterating a hash function provide?

Say that you are using PBKDF2 to protect passwords for authenticating to a popular web service (e.g., MIT Touchstone). How would you set the iteration count?

If you were instead using PBKDF2 to protect passwords for authenticating to your laptop, would your answer be the same or different as in part (d)? Why?
Problem 2: Cracking passwords
You MAY NOT use any offtheshelf passwordcracking programs or libraries to complete this problem.
In reality, we use hash functions with 256 bits of output, but in this problem we will work with a toy hash function that has a 48bit output.
For this problem, you will need to read through the code at hashall.py and sol.py. It will also be useful to run the grader locally grader.py.
That Python program reads each line from standard input, hashes the resulting string using SHA256, and writes the first 48 bits of the hash to standard output as a hex string.

We ran the following code to hash the secret value
$PASSWORD
using the toy hash function defined inhashall.py
:echo $PASSWORD  python hashall.py
Which outputs a hash.
Each student in 6.1600 has a unique
$PASSWORD
to find. To get the hash for your$PASSWORD
, submit an emptysol.py
to gradescope. And you should see a result like this for problem 2a:2a) test_2a  YOUR HASH TARGET: a33a874eb313
Replace line 31 in
grader.py
with the string hash target from gradescope.target = "FIND ME ON GRADESCOPE" > target = "a33a874eb313"
Write a program to find the value of
$PASSWORD
inproblem_2a()
insol.py
.Hint:
$PASSWORD
is a lowercase English word containing only the lettersaz
. 
If a user chooses a uniformly random string of 20 letters (
az
) as their password, and we hash the password with a standard cryptographic hash function with 256bit output, how many guesses on average will it take to recover their password? 
The file hashes.txt contains a large number of hashed passwords under the toy hash function defined in hashall.py. These hashes are unsalted; we computed them exactly as we computed the hash in part (A). Write a program to find a preimage of one of the hashed passwords.
The file hashes.txt will be included in the root directory of your solution. You may
open("hashes.txt")
insol.py
to compute your answer.Put your code in
problem_2c()
insol.py
. 
How would the cost of the preimagefinding attack change in part (C) if each hashed password were salted with a unique salt?
Problem 3: Collisions
In this problem, we will explore the cost of finding collisions in hash functions. The “Birthday Paradox” analysis we cover here is one of the most fundamental tools for reasoning about the security of cryptosystems, so it’s important to understand.
Consider throwing \(B\) balls into \(N\) bins at random. That is, for each ball, we pick a bin in \(\{1, \dots, N\}\) independently and uniformly at random, and toss the ball into that bin.

For a particular ball \(i\) and bin \(k\), what is the probability, as a function of \(B\) and \(N\), that ball \(i\) falls into bin \(k\)? Place your answer in
problem_3a(B,N)

For particular distinct balls \(i\) and \(j\), and for a particular bin \(k\), what is the probability, as a function of \(B\) and \(N\), that ball \(i\) and ball \(j\) fall into bin \(k\)? Place your answer in
problem_3b(B,N)

How many pairs of distinct balls \((i,j)\) are there in total, as a function of \(B\)? For example, if there were three balls \((B=3)\) there would be a total of three pairs: \((1, 2), (1, 3), (2, 3)\). Place your answer in
problem_3c(B)

Give a nontrivial upper bound, as a function of \(B\) and \(N\) on the probability that any two balls fall into the same bin. In other words, you will compute an expression of the form \(\Pr[\text{two balls in same bin}] \leq \text{???}\). Place your answer in
problem_3d(B,N)
Hint: Use the union bound. That is, if \(B_1\) and \(B_2\) are two bad events, then the probability that either bad event occurs is at most the sum of \(\Pr[B_1]\) and \(\Pr[B_2]\), even if the events are dependent. That is: \(\Pr[B_1 \lor B_2] \leq \Pr[B_1] + \Pr[B_2]\).

Let \(H \colon \{0,1\}^n \to \{0,1\}^n\) be a random function. That is, for all \(x \in \{0,1\}^n\), the value \(H(x)\) is a random \(n\)bit string chosen independent and uniformly at random. Place your answer in
problem_3e(L,n)
Using your answer to part (D), compute give a nontrivial upper bound on the probability that there is a collision among \(H(x_1), H(x_2), \dots, H(x_L)\), where \(x_1, \dots, x_L\) are distinct \(n\)bit strings with \(L \ll 2^n\).
Problem 4: Finding collisions
Let \(H\) be the hash function defined in hashbig.py. This hash function has a 56bit output and is NOT the same hash function as in Problem 2.

If you iterate the function \(H\) on itself:
\(H(x), H(H(x)), H(H(H(x))), \dots\),
eventually you will enter a cycle of repeating values. Explain how to use such a cycle to find a collision in \(H\) (with high probability) while storing \(\ll \sqrt{2^{56}}\) bits.

Write a program to find a collision in \(H\).
Your program should complete in fewer than 15 minutes on a modern laptop.
Place your code in
problem_4b()
insol.py
.Hint: Use your answer from Part (A).
Hint: For your program to run quickly, it should make as few memory accesses as possible. Making many lookups into a gigantic array or hash table will slow you down. In addition, a solution that stores a lookup table of size \(2^{30}\) will almost certainly run out of memory on Gradescope, so you will need to do something more clever.
Hint: For debugging, run your program on a smalloutput hash function first to make sure that it can actually find collisions.
Note: Do not hard code in a single working solution. Submit all the code you used to generate the collision in
problem_4b()
to run on gradescope.