District Court Holds that Running Hash Values on Computer Is A Search:
The case is United States v. Crist, 2008 WL 4682806 (M.D.Pa. October 22, 2008) (Kane, C.J.). It's a child pornography case involving a warrantless search that raises a very interesting and important question of first impression: Is running a hash a Fourth Amendment search? (For background on what a "hash" is and why it matters, see here).
First, the facts. Crist is behind on his rent payments, and his landlord starts to evict him by hiring Sell to remove Crist's belongings and throw them away. Sell comes across Crist's computer, and he hands over the computer to his friend Hipple, who he knows is looking for a computer. Hipple starts to look through the files, and he comes across child pornography: Hipple freaks out and calls the police. The police then conduct a warrantless forensic examination of the computer:
In the forensic examination, Agent Buckwash used the following procedure. First, Agent Buckwash created an “MD5 hash value” of Crist's hard drive. An MD5 hash value is a unique alphanumeric representation of the data, a sort of “fingerprint” or “digital DNA.” When creating the hash value, Agent Buckwash used a “software write protect” in order to ensure that “nothing can be written to that hard drive.” Supp. Tr. 88. Next, he ran a virus scan, during which he identified three relatively innocuous viruses. After that, he created an “image,” or exact copy, of all the data on Crist's hard drive.
Agent Buckwash then opened up the image (not the actual hard drive) in a software program called EnCase, which is the principal tool in the analysis. He explained that EnCase does not access the hard drive in the traditional manner, i.e., through the computer's operating system. Rather, EnCase “reads the hard drive itself.” Supp. Tr. 102. In other words, it reads every file-bit by bit, cluster by cluster-and creates a index of the files contained on the hard drive. EnCase can, therefore, bypass user-defined passwords, “break[ ] down complex file structures for examination,” and recover “deleted” files as long as those files have not been written over. Supp. Tr. 102-03.
Once in EnCase, Agent Buckwash ran a “hash value and signature analysis on all of the files on the hard drive.” Supp. Tr. 89. In doing so, he was able to “fingerprint” each file in the computer. Once he generated hash values of the files, he compared those hash values to the hash values of files that are known or suspected to contain child pornography. Agent Buckwash discovered five videos containing known child pornography. Attachment 5. He discovered 171 videos containing suspected child pornography.
One of the interesting questions here is whether the search that resulted was within the scope of Hipple's private search; different courts have approached this question differently. But for now the most interesting question is whether running the hash was a Fourth Amendment search. The Court concluded that it was, and that the evidence of child pornography discovered had to be suppressed:
The Government argues that no search occurred in running the EnCase program because the agents “didn't look at any files, they simply accessed the computer.” 2d Supp. Tr. 16. The Court rejects this view and finds that the “running of hash values” is a search protected by the Fourth Amendment.
Computers are composed of many compartments, among them a “hard drive,” which in turn is composed of many “platters,” or disks. To derive the hash values of Crist's computer, the Government physically removed the hard drive from the computer, created a duplicate image of the hard drive without physically invading it, and applied the EnCase program to each compartment, disk, file, folder, and bit. 2d Supp. Tr. 18-19. By subjecting the entire computer to a hash value analysis-every file, internet history, picture, and “buddy list” became available for Government review. Such examination constitutes a search.
I think this is generally a correct result: See my article Searches and Seizures in a Digital World, 119 Harv. L. Rev. 531 (2005), for the details. Still, given the lack of analysis here it's somewhat hard to know what to make of the decision. Which stage was the search: creating the duplicate? Running the hash? It's not really clear. I don't think it matters very much to this case, because the agent who got the positive hit on the hashes didn't then get a warrant. Instead, he immediately switched over to the EnCase "gallery view" function to see the images, which seems to me to be undoubtedly a search. Still, it's a really interesting question.
Also, it seems that the Government failed to make the strongest argument that running the hash isn't a search: If the hash is for a known image of child pornography, then running a hash is a direct analog to a drug-sniffing dog in Illinois v. Caballes, 543 U.S. 405 (2005). Although Caballes is cited in the opinion for other reasons, it seems that the government didn't make the Caballes argument.
It's possible that the argument wasn't raised because the agent made a hash of every file instead of running a search just for matches of known images. But I'm not sure that really makes a difference, and whether it does hinges on some interesting questions. Is the creation of the hash a search? Or is running a query that matches the hashes to known hashes and produces a positive hit a search? It might also break down based on how much the government saw of the machine while the hashes were being made: Perhaps the search occurred when the file structure was revealed to the officers (if it was in fact revealed). But if so, I'm not sure that the images themselves should be suppressed as compared to evidence more directly related to the revealing of the file structure.
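To make the two stages concrete, here is a minimal Python sketch. It is not the EnCase procedure described in the opinion; the function names `hash_every_file` and `match_known` and the `known_hashes` set are hypothetical, chosen only to separate creating a hash of every file from querying those hashes against a list of known values.

```python
import hashlib
from pathlib import Path


def hash_every_file(root):
    """Stage 1: read every regular file under root and record its MD5 digest."""
    digests = {}
    for path in Path(root).rglob("*"):
        if path.is_file():
            md5 = hashlib.md5()
            with path.open("rb") as handle:
                # Read in chunks so large files need not fit in memory.
                for chunk in iter(lambda: handle.read(65536), b""):
                    md5.update(chunk)
            digests[str(path)] = md5.hexdigest()
    return digests


def match_known(digests, known_hashes):
    """Stage 2: report only the paths whose digest appears in known_hashes."""
    return sorted(path for path, digest in digests.items() if digest in known_hashes)
```

In this sketch, stage 1 necessarily reads every byte of every file, while stage 2 reveals only which paths, if any, matched the list.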
Either way, this is a fascinating computer crime law issue that gets debated from time to time without any case law; I believe this is the first case on the topic. Ah, more grist for the mill of the forthcoming second edition of my computer crime casebook. Thanks to FourthAmendment.com for the mention of the opinion, and Matt Caplan for the .pdf.
In the most basic analysis, you're physically reading the bits of information which are stored in the physical properties of particular molecules of the machine. It would seem obvious to me that reading that data, in any form or manner, is a "search" of the data contained on the disk.
I have interpreted your comment as a motion for a more detailed post, and I have granted your motion.
This argument strains at the mechanics of how computers work. Both to duplicate the files and to generate the hash, the agent's computer, under his direction, accessed every byte exactly.
Second, a hash used this way is a shortcut for comparing files for an exact match, one that allows the government to avoid archiving porn directly. With high probability (given a sufficiently limited reference pool) this is akin to directly comparing the files. I fail to see how this can be distinguished from looking at the files.
Third, the hash is a representation of what the file contains, just as displaying the image is a representation of what the file contains. The hash happens to be a lossy representation. Say there is an audio recording on the computer. The government compresses (or encodes) the recording to a low-quality mp3 format, then listens. Is this a search? Yes? Okay, now turn down the quality. Repeat. Repeat.
Fourth, many internet search engines work in an identical way. They reduce your keywords and the page text to a set of hashes that they compare against. This is what allows google, for instance, to match your search terms to similar but not identical words.
There is no way to compute this without reading the data. In theory, you could read only every other byte, but would that make a legal difference?
As an analogy to the physical world, imagine a very sensitive chemical sniffer that detects a THC "signature". Since it only detects a signature, it is possible that some non-marijuana plants also trigger the sniffer (with a very small probability). Now, if I walk around your house, with a blindfold on, applying this sniffer to various plants, does that count as a search?
First, anyone still using MD5 should have their head examined -- it's been pretty thoroughly broken. Use SHA256, please!
Note that OS and application software may compute hashes of files or parts of files as part of its normal operation.
I'd argue that the "search" occurred not when the hash values were computed but rather when a set of hashes of files on the computer was intersected with a set of hashes of known child porn.
I assume that there are cases in which law enforcement can seize a sealed container to protect it from being modified or destroyed before they can get a search warrant.
By analogy, computing a set of hashes at the time the digital evidence was seized would be a really good idea as it would increase confidence that the evidence had not been modified in the course of a later search.
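As a rough illustration of that integrity use, the following Python sketch records a digest of a drive image when it is seized and later checks that the image still produces the same digest. The function names are hypothetical, and SHA-256 is used here as an assumption rather than the MD5 value the agent actually computed.

```python
import hashlib


def image_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of the file at path, read in chunks."""
    sha = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            sha.update(chunk)
    return sha.hexdigest()


def unchanged_since_seizure(path, digest_at_seizure):
    """True if the image still hashes to the value recorded at seizure."""
    return image_digest(path) == digest_at_seizure
```

Used this way, the digest is never compared against any list of known files; it only answers whether the copy has changed.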
I don't understand the chain of custody, however. Since I am not a lawyer, it is not surprising.
Isn't this analogous to a third party providing evidence? The landlord gave the computer away to another who gave the hard drive to the police. There certainly was no seizure. Isn't the report of illegal material by the third party sufficient cause for a search?
Why does that matter for their purposes? They aren't using the hash as a means to protect the data, but simply using the fact that the hash of a particular file is unique and repeatable.
No! Cryptographic hash functions like MD5 produce entirely different output for inputs that differ in even a single bit. AFAIK, Google does not use hashes for searching text.
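A small Python sketch of that single-bit point, using the standard `hashlib` module. The sample bytes and helper names are made up for illustration; the sketch flips one bit of the input and counts how many of the 128 output bits change.

```python
import hashlib


def md5_bits(data):
    """Return the 128-bit MD5 digest of data as an integer."""
    return int.from_bytes(hashlib.md5(data).digest(), "big")


def differing_bits(a, b):
    """Count the bit positions at which the MD5 digests of a and b differ."""
    return bin(md5_bits(a) ^ md5_bits(b)).count("1")


original = b"grocery list: milk, eggs, bread"
# Flip the lowest bit of the first byte; everything else is identical.
flipped = bytes([original[0] ^ 0x01]) + original[1:]

print(hashlib.md5(original).hexdigest())
print(hashlib.md5(flipped).hexdigest())
print(differing_bits(original, flipped), "of 128 bits differ")
```

For a well-behaved hash, roughly half of the output bits are expected to change, which is why a digest says nothing about how similar two files are.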
But can anyone come up with an illustrative analogy to hard-copy data searches? If I have a pile of magazines in the desk of my home office, and someone (Does what?) and this leads a police officer to find child porn, is it a search?
Is there anything comparable? Or is this Amend IV new ground?
This was an impossible concept not too long ago. In the future perhaps star trek like transporter technology will be available and by duplicating your home on a holodeck they could search the virtual home, find what they were after, then confirm the existence of the real item of interest in your real home with the holo equivalent of the hash value.
If it looks like a search and sounds like a search then it is probably a search.
If only someone had written an article on these interesting questions!
Note that using my lay judgement, I called it a search, but if we are to use the standard above, then I think it is not a search.
I think it's inapplicable because the officer needed to actually peek inside the disk. In the Caballes case, the dog was able to sniff the marijuana from outside the trunk:
"Dog sniffs marijuana outside closed trunk" is analogous to "Officer detects child porn on disk by sensor readings of the electromagnetic field outside the computer"
"Officer detects child porn on disk by reading data from disk" is analogous to "Dog sniffs marijuana in trunk after officer opens trunk"
Perhaps, but then in United States v. Jacobsen, actually taking the substance and destroying it to test it for drugs was held not a search under the Caballes rationale.
With respect to hash files, the government can identify any file on someone's computer that it has already indexed, whether it be illegal child porn or a communist manifesto.
I suppose the government could argue that conducting an MD5 hash analysis of a hard drive is synonymous with scent detecting dogs because the dogs can be trained to alert to any scent, not just the scent emanating from illegal drugs. However, I don't think the government will be making that argument any time soon.
I actually wouldn't particularly distinguish Caballes from this case unless there is something I am missing about the chain of custody. SCOTUS did not rule that the use of a drug sniffing dog isn't a search, they ruled that it is not an unreasonable/illegal search.
An analogy to the infrared search used as an example of an illegal search would be a police created virus that performed the same hash function done here and then transmits that data to HQ. While not a search in and of itself I would say that operation would perform a data seizure.
As was described above, a MD5 is like a dog bark that signals either contraband, or stuff that looks like contraband on a bit level. This can include, well, anything. With access to an 'incriminating' hash and a hash collision program (google it), you could attach the incriminating hash to any number of files you would have a legitimate interest in keeping private.
This has been true since at least 2005
Everyone else: I have posted a copy of the opinion, via reader Matt Caplan.
Also, the clear message here is to make batch edits to your contraband files.
I did not say that Google used MD5 in searches. I said they used a hash as part of the process. Many hashes exist. What does Google do exactly? I don't know. But in the meantime you should look into other famous hashes: Soundex and Metaphone.
In that hypothetical case, I can see why the question would be whether the subsequent warrantless search by police exceeded the scope of the third-party discovery.
But, under the actual facts of United States v. Crist, as you've presented them, I'm having a hard time understanding the fourth amendment question.
Please tell us in more detail how we got here.
They had a third party who had looked at the computer, saw child porn, and called the cops. Why does this not qualify as "probable cause"? Why didn't they just get a search warrant?
With a good hash function, the probability of a collision is vanishingly small. You can simply treat the hash value as a random number. So, if you have a given 128-bit hash value, the probability that another given picture matches a specific hash is 1/2^128. (Note that this is not the same probability as that of a collision, but both are very small.)
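The two probabilities in that comment can be worked out with exact fractions. This is only a sketch under the comment's own assumption that hash values behave like uniformly random numbers; the function names are hypothetical, and the collision figure is the simple pair-counting upper bound rather than an exact value.

```python
from fractions import Fraction


def chance_of_specific_match(bits):
    """Chance that one given file hashes to one specific value of this width."""
    return Fraction(1, 2 ** bits)


def collision_chance_upper_bound(num_files, bits):
    """Upper bound on the chance that any two of num_files share a hash.

    There are n(n-1)/2 pairs, and each pair collides with chance 2**-bits,
    so the total is at most pairs / 2**bits (capped at 1).
    """
    pairs = num_files * (num_files - 1) // 2
    return min(Fraction(1), Fraction(pairs, 2 ** bits))


print(float(chance_of_specific_match(128)))
print(float(collision_chance_upper_bound(1_000_000, 128)))
```

Both numbers are tiny at 128 bits, but the second grows with the square of the number of files, which is why the two are not the same probability.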
There are actually two questions on the Caballes facts: did the police need a warrant to allow the dog to walk around the outside of the automobile, and once the dog alerted to drugs, did they then lawfully search and seize the auto?
Isn't there an exception when considering motor vehicles which are not (yet) subject to seizure, and thus may be driven away? The citation escapes me, but it was an old bootlegging case, as I recall. So, if officers have PC to believe an automobile, which can leave, contains contraband/evidence/fruits of the crime, they can search the auto without a warrant.
If you mean whether the dog walking around the car constitutes a search, I would say that if the dog can lawfully be where he was and still detect contraband, then a warrant is unnecessary for the dog's presence, just as if an officer, while lawfully present in a location, can then see or sense contraband on or within private property, that would constitute PC for a warrant (unless one of the other exceptions applied).
In the cited case, it seems to me that the police, when shown evidence of child porno on the computer, still needed to obtain a warrant to search the hard drive. Given the circumstances, establishing the probable cause for a warrant would have been simple and straightforward. Why not be safe and get the stupid warrant? It's not brain surgery...
I'd like to add one thing off-post: when constructing a search warrant and supporting affidavit, generally one wants the items to be searched and the items to be searched for to be as broad as possible while still maintaining good-faith compliance with the warrant requirements. In the present case, I would not want to use something like hash values to narrow the files which are examined. In this case, the investigators would then review only the photo files which somehow met the hash value of known pornography (if I understand this process). If the hash values were not considered, then the investigators could reasonably look not only at every single image on the drive, but also check email, IM's, and even word processing files, looking for embedded or attached images. I would want to do this because, say, the suspect may have home-made photos, which don't match a hash value of known porno, but which may contain images which either constitute porno themselves, or even worse, might show something like child abuse. The same for files which are documents: one might find credit card charges for the porn images, or reference to the porn or other crimes in text. There's a line between complying with specificity requirements and fishing expeditions, but intelligent wording of a warrant might allow more of a search, which is usually, from law enforcement's perspective, a good thing.
Second, the agent's friend, yet another step removed from the former owner, finds unlawful material and calls the cops. That is indeed probable cause and should have immunized the police agency that did the analysis. Because the discovery of unlawful material was made by a third party with no prompting from the police (i.e., the third party was not an agent of the state), there should have been further immunity.
Bad decision, IMO. I think it's going to be overturned on appeal.
Wouldn't the testimony of the third party qualify as a source of probable cause independent of the actual search?
So is this an indirect consequence of walking around with a police dog? Or is it a deliberate action?
Drugs emit volatile compounds which float through the air, thru fabric, etc. These emissions become public. Dogs smell them and react. A computer hard drive emits nothing coherent beyond a few hair-thicknesses. Access requires proactive measures.
The difference between smelling drugs and reading computer disks is the difference between letters in your home and a conversation overheard.
Well, sure, if you're using the phrase "hash function" in the wider sense, then you're certainly correct. But the type of hash functions being discussed here, namely crypto hash functions such as MD5, don't map similar inputs to similar outputs.
I couldn't disagree more.
Except that they read it in such a way that they could not inadvertently stumble on things that were not child porn (assuming that the police acted in good faith and followed procedure, etc.). How much of a legal difference does that make? I don't know.
I would also say that this operation is not particularly dissimilar from having an officer examine every file on the drive. There is room, using this method, to both pick up non-contraband (items that match hashes but are nonetheless legal; with a good hash this is extremely unlikely under these circumstances and doesn't even matter that much really) and miss actual contraband (contraband items that aren't already known, or contraband items that have been modified so that they have a new hash value).
I see this operation mostly as a quick filtering, a way to save having to do a bunch of tedious work that would be needed if each file needed to be examined one by one.
Would you count that eyeball examination of each file a search?
The crux of the case is that the court didn't accept a lame "it's not a search, it's a hashing" argument when the main argument of the prosecution (that the computer had been "abandoned" and that the search was thus okay) was shot down. Had the circumstances displayed an intent to abandon, this issue would not have arisen.
There are also chain of custody worries - the computer was outside the defendant's control and thus the presence of child pornography on the computer cannot be presumed to be possessed by the defendant. It's the equivalent of stealing someone's luggage and reporting to the police that you found a brick of marijuana packed inside - the police have no way of determining if that brick was in there when the luggage was stolen. Even if it was wrapped in one of the owner's t-shirts, that's still not evidence that the owner placed said brick inside the luggage. (At this point you'd want to go on the attack, pointing out that the person who discovered the child porn had obtained the computer in, ahem, adversarial fashion...)
I'm not sure why a search warrant was not obtained, however. The entire point of retrieving an image of the hard drive is to allow data operations on the image without harming the original (or vice versa). Once that image had been taken, the police were free to wait for the necessary paperwork to clear - that image wasn't going to deteriorate or leave police possession. So long as they don't go snooping through it before the warrant comes in, they haven't performed a search.
Here is the sentence from Caballes that I use to justify my earlier statement. Perhaps it does not fairly reflect 4th amendment law.
Official conduct that does not “compromise any legitimate interest in privacy” is not a search subject to the Fourth Amendment.
I read this to mean exactly what it says, in that the use of the drug sniffing dog is not a search subject to the 4th amendment, not that it is not in fact a search.
I think the letter vs. overheard conversation is actually a pretty good comparison here.
I would think that making the initial hash value (am I using the term correctly?) merely to establish and ensure the unchanged nature of the original data would not, in and of itself, be a search, as long as the hash values were not then cataloged and compared to known values of contraband. The search was actually the second step (or third), that is, comparing the mirror image of the original drive with known porn, with the intent to establish criminal possession. As long as the data contained within the hard drive was never indexed or looked at, then creating the hash value merely to verify integrity of the data does not seem like a search to me. Doing anything else with the resulting hash values, however, does seem like a search to me.
It looks like someone did in fact write such an article. And he even has your name. I'm really surprised that you didn't know this.
Thanks for the link: I(A) is just what I was looking for.
More to the Caballes question, creating a hash value (as Paul Allen pointed out) requires that a software agent of the law enforcement officer read every single byte of data from the hard drive. This seems like a clear search to me -- it is not detecting odorous emanations from a container, but investigating and operating on every part of the contents. In this way hash computations are analogous to X-ray or thermal images rather than a sniffer, and I am not surprised that the government passed on making the argument.
In the instant case, the police took perfect copies of the defendant's files, handed them to an oracle, and asked it "Are any of these files contraband?" The oracle answered in the affirmative. The oracle in this case is a computer with a list of hash values paired with a list of known child porn images.
The police could not have shown the computer to the oracle and said "is there contraband in there?" They must disassemble the computer, remove the drive, surreptitiously (in the sense of bypassing the existing OS) copy every bit of data on the drive, and then ask the oracle for a sophisticated examination of that copy in order to determine if there's a problem with said data. When I turn off my computer I expect that no one will take it apart both physically and digitally in order to determine if it contains contraband. Fido didn't take the car in Caballes apart. It simply looked (with its nose) at things already on the outside.
Now, one might argue that since the cops don't learn anything from the hash values themselves, then their oracle only tells them when bad things are present. Thus we're nearly back at Caballes. But, since the cops did have to disassemble the computer and bypass the usual roadblocks to getting at Crist's data, they clearly searched it. They searched it by taking it apart, removing the drive, copying the data, and analyzing it. All of these things required bypassing things put in place to keep that data away from prying eyes: the case, the drive itself, the OS, etc., and so they all constitute elements of a search. Even with a witness, to my knowledge, the police cannot disassemble (without a proper warrant) my car (not at or within 100 miles of the border) to pass every piece of it past a dog's nose, bolt-by-bolt, to see if any of it is contraband. Neither can they do so to my computer. If their oracle can determine the existence of child porn by looking at the outside of my computer, more power to them.
Now, as a layman, here's the thing I don't understand: The police have the computer. They have or could have gotten an affidavit from Hipple stating that he saw what he believed to be contraband on the computer. I don't see how they could not have passed this information in front of a judge to get a warrant. Everything they need appears to be there. The machine is secure, and there's no danger of Crist deleting the files or destroying the machine. Why not dot all the Is and cross all the Ts?
Clearly, we owe Orin Kerr a beer.
In an old skool conventional search you go thumbnail to thumbnail looking at images. this is much more invasive of privacy because the agent will see the contents of every file.
the hash search means he will only look at the contents of the file if and when he gets a match to a known illegal file.
but I cannot see how anybody... even the govt lol... could argue it's not a search.
and i cannot understand why a warrant wasn't applied for. this isn't a "street" thing. the frigging thing is sitting on the agent's desk.
Various people are correct to point out that two different files can have identical hashes. The math makes this necessary; files can be arbitrarily large and pictures are often several kilobytes, but the hash is usually only a few bytes and there's no point in using a hash as large as the files you're interested in. Fewer bytes means fewer possible values.
So your grocery list may have the same hash as an illegal picture, just because there aren't enough possible hashes for every file to get a unique one. BUT you can reduce the risk as much as you care to by increasing the size of the hash. A 32-bit hash is on the small side, but can give a less than 1-in-4 billion chance of accidentally mis-matching two files, depending on the algorithm used. I think that's better than fingerprints.
My math assumes random file contents, which is wrong, but good hashes have a way of acting randomish with non-random input, so the basic point stands anyway.
(IANAL — but I am a software engineer)
Two files could map into the same hash, so hashes are not like fingerprints unless you believe two people can have the same set of fingerprints. In other words, except for special cases, hash functions are not injective. As a practical matter it's unlikely any two given files would have the same hash, but it's not impossible. You could have a legal to possess file with the same hash as one that's illegal, but the chances are small.
On the other side, I'm pretty sure that Hipple's statements would be covered by the silver platter doctrine, and the search of the computer's files would be inevitable given that evidence.
It would certainly be possible, however, to create a hash that only looked at every other byte of a file. If we assume that a typical child porn image is 100K, that's still more than enough samples to produce a very low probability of a false hit.
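A sketch of what such an every-other-byte hash could look like in Python. The function name is hypothetical and SHA-256 is an arbitrary choice; the point is only that bytes the hash never reads cannot affect the result.

```python
import hashlib


def every_other_byte_hash(data):
    """Hash only the bytes at even offsets (0, 2, 4, ...) of data."""
    return hashlib.sha256(data[::2]).hexdigest()


# The odd-offset bytes differ, but the sampled bytes (a, c, e) are the same,
# so the two inputs produce the same digest.
print(every_other_byte_hash(b"abcdef") == every_other_byte_hash(b"aXcYeZ"))
```

The trade-off is that two files differing only in the skipped bytes become indistinguishable, so such a hash reads less but also identifies less precisely.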
I agree the hash is definitely a search. but it is less invasive. note: i am not saying this means he shouldn't have gotten a warrant. it's still a search. get a warrant.
it is clearly less invasive because the agent isn't actually looking at what's IN files, certainly not in their screen rendered glory unless and until a hash match is found.
iow, assume you have 10,000 image/video files on your computer.
which is more invasive to your privacy
1) agent looks at each file (either looking at the jpg, etc. or the video file (mpg), etc.) to determine if it's contraband
OR
2) agent applies a formula (which is what a hash is) to each byte in your files and then only if a match to a known illegal file is made does he actually VIEW the contents of the file in their rendered glory.
see the difference?
they are both searches. one is clearly more invasive.
For example, if all drug dogs had the ability to somehow convey to the police the exact contents in an automobile trunk, both illicit and legal items, by sniffing the outside of the trunk, the Court would almost certainly hold that the sniff constituted a search. Of course, this is assuming that the public at large does not commonly use such dogs.
This case is different from Caballes in a number of ways. First, the government can determine what files its program alerts to on the fly. Second, once the government creates its index of MD5 sums from the suspect's hard drive, it will probably retain this index indefinitely. With drug-dogs, the government has a limited amount of time to do what it wants to do.
True, but the probability of an unintentional collision can be made as small as desired. Note that DNA evidence is admissible in court, and that also has some chance of false positives. What is the legal standard that must be applied for searches?
True, but it cannot alert on all files that contain the word "bomb", for example. On the other hand, it could alert on any well-known files. For example, if a well-known PDF of a leaked document was circulating, it could alert on that.
I wasn't discussing the difference between visually checking each file, and visually checking only those files that software matched to known illegal files. I was discussing the difference between matching with a hash and matching without a hash.
Personally, I think the key is on pp.3-4.
The court finds, in the main body of the opinion:
(Emphasis added.)
But in footnote 2, the court elucidates:
(Emphasis added again.)
Putting those two highlighted facts together leads to an unsavory conclusion. Detective Cotton knew that the computer had been reported stolen, but nevertheless informed the AG's office that this was a search with consent of the owner. The clear inference is that Detective Cotton was less than honest.
When you add that little detail to the account, I agree that the evidence should be suppressed. There's a big chain of custody problem.
Read my article I link to: It answers your question. As for whether conduct amounts to a search in a sense not recognized by the Fourth Amendment, I don't really have any interest in that.
I think we should take the Court's statement that drug dogs are sui generis at face value and let Caballes sit out there as an outlier of 4th Amendment jurisprudence because drug dogs are so unique. I'm still pretty sure that the police can not train a dog to smell what I have written on a notebook in my trunk, and I'm also sure that if police wanted to use hash values to search for legal content they could. I think that is what makes hash values different from drug dogs for purposes of searches under the 4th Amendment. The likelihood that they could be used to search for legal content.
I have to admit, I very much enjoy reading judicial opinions that turn on intricacies of computer engineering. As a software person, it's a little like watching a nature documentary about oneself. Reading about familiar subjects spoken of in such an unfamiliar (almost, though I hesitate to use the word here, childlike) way is fascinating. It's admirable how often they seem to get it right, as in this instance.
A few side notes:
- The odds of hash collisions (though they can be architected in the case of MD5) are so astronomical that getting even one hit would seem to be pretty bulletproof probable cause. With 176 matches, they might as well lock you up without even bothering to check the original files.
Like this: if the odds of a collision were 50%, orders and orders of magnitude higher than is reasonable (though it's hard to put an exact number on it), then 1/2^176 gives approx. a 1 in 10^53 chance for all 176 to be false positives. Even if Agent Buckwash ran a check on a suspected pedophile's computer once every nanosecond, we wouldn't expect a false positive of that magnitude for something like 10^36 years.
To put it clumsily, given the current age of the universe to work in (approx. 10^26 nanoseconds), Agent Buckwash wouldn't encounter a single such fluke occurrence unless he could do a whole universe age's worth of one-pedophile-per-nanosecond checks, every single nanosecond. (The odds are roughly the same as quantum fluctuations spontaneously teleporting you, bodily and unharmed, to the surface of Mars.*) Even with a 90% chance of collision, the odds of winning the lottery are significantly better than a false positive. The search required to generate the hashes is the problem, not hash collisions.
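The arithmetic in this comment can be checked with a few lines of Python. This is only a sketch under the comment's own assumptions: a deliberately pessimistic 50% false-match chance per file, independence between the 176 hits, and an assumed universe age of roughly 13.8 billion years. The variable names are illustrative.

```python
from fractions import Fraction

# Assumption from the comment: an absurdly high 50% chance that any one hit is false.
P_FALSE_MATCH = Fraction(1, 2)
MATCHES = 176  # number of hash hits discussed above

# Probability that all 176 hits are false positives, treating them as independent.
p_all_false = P_FALSE_MATCH ** MATCHES

# Expected number of checks before one such fluke (equals 2**176).
expected_checks = 1 / p_all_false

# Age of the universe in nanoseconds, assuming about 13.8 billion years.
SECONDS_PER_YEAR = 365.25 * 24 * 3600
universe_age_ns = 13.8e9 * SECONDS_PER_YEAR * 1e9

print(f"all 176 false: 1 in {float(expected_checks):.2e}")
print(f"universe age: about {universe_age_ns:.2e} ns")
print(f"universe ages needed at one check per ns: "
      f"{float(expected_checks) / universe_age_ns:.2e}")
```

Using `Fraction` keeps the probability exact until it is converted to a float for printing.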
* God, I wish there was a link for this. Trust me.
- The best real-world analogy for what the agent did in this instance would be something like the police sending a hyperintelligent robot into your house to rummage through your things, comparing what it sees to suspicious materials and reporting back. If that's not different from a drug sniffing dog circling a car I'll eat my hat.
If the use of technology or not in the course of an invasion of privacy is the criterion for a search, RoboCop never needs a warrant. (Actually, that sounds about right, come to remember.) See also looking through walls for drugs.
Well, I'm not a lawyer, but I was under the understanding that forcing someone to provide DNA evidence requires a warrant. The issue is not whether this could be good proof, the issue is whether it could be invasive on matters that are not illegal.
Bad, bad software engineer. You've both ignored the birthday paradox and assumed that purely statistical information is a good metric of applied data.
The number of matches in this case makes it clear that the odds are stupidly prohibitive, but the court has to set precedent that would apply for even one match.
I would argue that even your exposure standard would actually call this action a search. The files may not be rendered as images, instead they are exposed and rendered as hashes. Perhaps not very titillating, but still exposed.
Actually it is repeatable, but not unique. All hashing functions will lack the uniqueness criterion you suggest.
What a hash search allows you to do is to generate a short list of files which are probable matches. Just because a particular file matches does not mean that it is a specific file.
Furthermore, there *is* a legitimate argument against using MD5 for this sort of activity. The issue is that one can essentially append data onto arbitrary files to create hashes of the values one wants. So if I had incriminating evidence on my computer, I could also create thousands of files with the same MD5 hash value and thus require a file-by-file check. "This, your honor, is a text file with the same md5 hash as a child pornography movie" isn't going to get very far in court.
Third, the hash is a representation of what the file contains.
Wrong, unless you consider 0 to be a representation of half of all possible files and 1 to be a representation of the other half of all possible files.
Actually it is repeatable, but not unique. All hashing functions will lack the uniqueness criterion you suggest.
It's functionally unique; unless someone has intentionally cooked it, the odds of an incorrect match are much smaller than 1/number of atoms in the universe.
Furthermore, there *is* a legitimate argument against using MD5 for this sort of activity. The issue is that one can essentially append data onto arbitrary files to create hashes of the values one wants.
Um, at best you could create files on your disk that aren't pornography but match files that are -- why would anyone want to do that? You couldn't even frame anyone that way, since conviction would require examination of the actual files.
And even if you did something like that, it's just about the same thing as dumping a few images of child porn into a big folder of ordinary porn; a determined and methodical search would turn up the incriminating material. In fact, it'd be rather more obvious that you were up to something screwy.
But in practice, it's not worth the effort. If you're really worried about your files being accessed by the cops, encrypt them. Hell, encrypt the filenames and dump them someplace boring. If you're not doing that, what's the point of going to the effort of hash spoofing? If you are doing that, then the police aren't finding your files anyway.
Here's an interesting question... is a hash value of a media file considered a "derivative work" under copyright law? ;p
The birthday "paradox" (it's not nearly a paradox, just a surprise to people with poor intuitions about probability) only applies to finding some pair in a DB that matches; it doesn't apply to matching a predetermined value. But even if the birthday problem did apply, the odds of an incorrect match would be astronomical ... less than the chance of an incorrect match due to a misread or other hardware malfunction.
Better yet, use steganography.
A properly configured firewall.
I think that is what makes hash values different from drug dogs for purposes of searches under the 4th Amendment. The likelihood that they could be used to search for legal content.
Um, dogs are quite capable of searching for legal content.
If the police had only one file to check and only one sample of child-pornography, the odds of an innocent file matching with a 128-bit MD5 hash would be 340,282,366,920,938,463,463,374,607,431,768,211,456 to 1 against. Of course, they probably have thousands of each, but still, you'd have about the same chance of getting a single monkey to type out Hamlet.
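The figure quoted here is just 2^128, the number of possible 128-bit hash values. A small Python sketch reproduces it and extends it to the "thousands of each" case. It assumes MD5 outputs behave as if uniformly random and that nobody has deliberately constructed a collision; the counts of 10,000 files and 10,000 known hashes are made-up illustrative numbers.

```python
from fractions import Fraction

MD5_BITS = 128
hash_space = 2 ** MD5_BITS  # number of distinct 128-bit hash values

# One innocent file against one known hash: 1 chance in hash_space.
print(f"{hash_space:,} to 1 against")

# Hypothetical sizes for "thousands of each" (assumptions, not from the case).
files_on_disk = 10_000
known_hashes = 10_000

# Union bound: the chance that ANY innocent file matches ANY known hash
# is at most (number of pairs) / (size of the hash space).
p_any_match = Fraction(files_on_disk * known_hashes, hash_space)
print(f"upper bound on any accidental match: {float(p_any_match):.2e}")
```

Even with a hundred million file/hash pairs to compare, the bound stays many orders of magnitude below anything that could occur by accident.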
Caballes seems exactly on point: no-one has a privacy interest that would be harmed by the police comparing the hashes of his files against a list consisting entirely of the hashes of child-pornography.
On the other hand, all that duplication and hardware copying rigmarole doesn't seem to change the Constitutional situation a bit.
I did note that it's extremely difficult to put even arguable numbers on the odds of hash collisions, and I confess that I didn't have any envelope backs handy for more detailed analysis. :) You're right that it depends greatly on the vagaries of the algorithm and the data itself.
Consideration of the birthday problem was implicit in any assumption of a probability for a collision in this case. The birthday paradox on its own gives a vanishingly small probability for this situation.
If the government has a million hashes, and the defendant has a million files, let's ballpark the birthday effect as the odds of at least two of the two million total hashes being equal. In birthday terms, we've got a calendar 10^38 days long and only 10^6 people at our party... I'm sure your own envelope can put an upper bound of zero to twenty places on the odds. It's so far off the magnitude scale from the actual birthday problem to make 50% or 5% or 0.0005% for a collision laughably, impossibly high.
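The ballpark above can be made concrete. The sketch below is an illustration under stated assumptions (uniformly distributed 128-bit hashes, no deliberately engineered collisions). It computes the exact probability for the classic 23-people, 365-days birthday problem, and the standard upper bound n(n-1)/2 divided by the number of possible values for two million hashes. The function names are hypothetical.

```python
from fractions import Fraction


def exact_collision_probability(n, days):
    """Exact chance that at least two of n uniform draws from `days` values coincide."""
    p_all_distinct = 1.0
    for i in range(n):
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct


def birthday_upper_bound(n, bits):
    """Upper bound on any collision among n uniform `bits`-bit hashes: pairs / space."""
    pairs = Fraction(n * (n - 1), 2)
    return pairs / 2 ** bits


# Classic birthday problem: 23 people, 365 days, a bit over 50%.
print(f"23 people, 365 days: {exact_collision_probability(23, 365):.4f}")

# The scenario above: one million government hashes plus one million files.
bound = birthday_upper_bound(2_000_000, 128)
print(f"2,000,000 hashes, 128 bits: at most {float(bound):.2e}")
```

The bound counts every possible pair, so it overstates the risk for the forensic case, where only file-versus-known-hash pairs matter.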
Yeah, and spy programs that promise to only send anonymous info to beneficial vendors don't violate your privacy, so why make a fuss about them?
The fact is scanning an image of your disk is a search, even if the police promise to do in a way that will only get bad guys.
The real trick is that the police department in this case jumped the gun, didn't do their paperwork to get a warrant before investigating the contents of the drive, and quite possibly misrepresented the chain of custody in ways that would have invalidated the evidence therein anyway.
They tried to get around this by characterizing their electronic search, which compares hashes because that's more time-effective than having a detective search through tens or hundreds of thousands of files, as "not a search". Thus, the "probable cause" generated by the hash hits could justify a warrant-less manual search.
But like one of the other posters said, that's like sending your spy drone in, seeing something, and then entering on the evidence that the spy drone picked up. If you didn't have the necessary authorization to send that drone in there, then it can't itself generate the evidence necessary to justify its own search. Nor did the judge buy that argument.
It's also another case that demonstrates that the dumbest thing you can do is talk to an investigator. If this guy had the sangfroid to shut up and get a lawyer, the compromised evidence chain and lack of warrant for the search would mean he'd now be a free man. But because he made self-incriminating statements, he's still got to worry about beating a rap that the cops should have blown through bad procedure. So if the detectives come to ask you about something, don't talk to them!
you have a right not to have your hard drive searched by govt. agents, whether or not that search causes "harm" to use your term.
it's a search, and it didn't (as far as i can see) meet any warrant exceptions. therefore, it required a warrant.
whether or not your privacy interest would be "harmed" is a given. if it was an unlawful search, that's the proof right there.
You're applying a double standard. Certainly it's possible for two people to have the same set of fingerprints -- and the question isn't even that existential one, but rather whether the fingerprints of two different people can match according to some algorithm. These cases are different only in that there are so many more possible files than there are people. But not so when you restrict it to "files that are known or suspected to contain child pornography" -- then, the odds that the MD5 of some file on your disk matches one of the child porn MD5s is likely to be less than the odds that your fingerprints will be flagged as matching those of some criminal on file (assuming you're not one).
Yup. Just as the police cannot send a search robot into your home that only flags you if it detects a crime, they cannot search your disk with a program that only flags you if it detects a crime. Kind of obvious.
Do you think that the 5th amendment cannot be violated because innocent people can't incriminate themselves and guilty people have no privacy interests?
I wonder how many child-porn owners are going to black out a single random pixel in each of their child porn photos now. Or just apply some imperceptible filter to their videos.
And seriously - salting text files to create MD5 collisions? That seems like entirely too much work to avoid conviction. You're better off stealing pubic hairs from your gym's shower drains and going on a child rape spree.
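The single-pixel idea works because a cryptographic hash changes completely when any byte of the input changes. A minimal Python sketch of that behavior follows. The byte string is an arbitrary stand-in for image data, not a real image file, and the flipped offset is chosen at random for illustration.

```python
import hashlib

# Arbitrary stand-in for the bytes of an image file (an assumption for illustration).
original = bytes(range(256)) * 4

# "Black out one pixel": flip the bits of a single byte.
altered = bytearray(original)
altered[100] ^= 0xFF
altered = bytes(altered)

hash_original = hashlib.md5(original).hexdigest()
hash_altered = hashlib.md5(altered).hexdigest()

print(hash_original)
print(hash_altered)
print("hashes match:", hash_original == hash_altered)
```

A lookup against a list of known hashes would therefore miss the altered copy, even though the two inputs differ in only one byte out of 1,024.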
.
Just the mechanics prevent that. OTOH, whatever information transits the 'net can be copied. See mass collection of international calls, which is not a "search" until accessed in a way that determines content. I don't know if the Jabara case is still good law in that regard.
.
As a thought exercise, the same would be true if the government could somehow possess all the data on all our hard drives. Until it's looked at, there is no search. Cataloging it via MD5 numbers can't be termed a search, I don't think, because MD5 results aren't human-decipherable (it reads like gibberish).
.
There is a genuine issue here about how easily the government can move from "possession" or "observation" to "search," and it doesn't have a simple answer. It used to be that crossing that barrier required something called "suspicion," but when "suspicion" can be bootstrapped from compelled turning over (seizure) of duplicable material (IOW, you still haven't lost anything), the public is more exposed to government intrusion.
You'd have no problem then with a blind cop picking the lock on your house and then taking pictures of all its contents (assuming he puts everything back where he found it)? You're saying that's not a search until a sighted cop looks through those pictures? WTF? That can't be right.
If you want to be really technical, you can't actually see the picture on the hard drive at all. You have to read the data and display it. The reading process involves the unnecessary step of copying it to some other physical medium, it just means you are doing it the hard way.
i've debunked this rubbish before, and interestingly you fall into the same justification (selection bias) for this erroneous device that I mention as the most frequent reason for this error. congrats.
If a hash is used for which it is feasible to find collisions, then there are opportunities for mischief.
From a cryptographic protocol design standpoint: an adversary who chooses what you feed to the hash function + non-collision-resistant hash = game over for the crypto protocol.
it's generally not necessary to analyze further, but here are two hypotheticals:
- the purveyors of child porn could create pairs of innocent files and child porn with the same hash, causing false positives for searches and undermining confidence in the technique.
- a corrupt official could create such a pair, arrange for the innocent half to find its way to his target, then use the hash collision as a pretext for causing the target further trouble.
That's just nonsensical. It's trillions of times easier to obtain someone's DNA and leave it at a crime scene.
I have to think that someone who would make these kinds of arguments just doesn't understand modern hash protocols (such as SHA-512), what they do, and how they work.
How many times easier would it be for the corrupt official to just put the contraband on your hard drive?
Caballes is distinguishable because in that case the incriminating evidence of illegality was freely available outside the container in which it was traveling, and the dog could sniff it without invading the person's "legitimate expectation of privacy." It was key that the dog only alerted on contraband because the dog thus "[did] not expose noncontraband items that otherwise would remain hidden from public view." Here, the police had to physically invade the person's computer and "expose noncontraband items."
Jacobsen is likewise distinguishable because the person in Jacobsen had no "legitimate expectation of privacy" in contraband (i.e. the cocaine that was destroyed during the test and revealed to be cocaine). Here, the owner has a legitimate privacy interest in every single file that isn't contraband and which was hashed along with the contraband.
Few of us ever look at a file. We look at a representation of some aspect of a file, and that representation is produced by another program. That program accesses the computer.
If I'm debugging a program, I might look at the bits in a file through a file dump program. Someone else may use the same file to look at a representation of video pictures. Both of us are using the same file, and using different programs to generate different representations.
Likewise, someone else may use the same file with a hash program to look at a hash ID. So, three of us are all looking at a representation of a file, and each is using a different program to generate that representation. But, we're all looking at the generated representation we choose for our purpose.
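The "three representations of one file" point can be sketched in Python. The bytes below are a hypothetical stand-in for a file (a GIF-style signature followed by zero padding), and each line derives a different view of the same data.

```python
import hashlib

# Hypothetical file contents: a GIF-style header followed by ten zero bytes.
data = b"GIF89a" + bytes(10)

# View 1: what a file dump program shows, the raw bytes as hexadecimal.
hex_view = data.hex(" ")

# View 2: what a program that understands the format reads from the header.
text_view = data[:6].decode("ascii")

# View 3: what a hash program shows, a fixed-length digest of all the bytes.
hash_view = hashlib.md5(data).hexdigest()

print(hex_view)
print(text_view)
print(hash_view)
```

All three views are computed by a program that has to read every byte it uses; only the output shown to the person differs.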
I think that Prof. James Duane and his police officer guest made a very effective case for never talking with any officer.
http://www.regent.edu/admin/media/schlaw/LawPreview/
rcgeek
However, when police are able to find evidence of a crime without such probing, why is this a bad thing? If the police can find evidence of a crime without non-criminal private items being looked through, I say go for it. Crist only has a legitimate privacy interest in the non-criminal files on his computers, not the ones that relate to child pornography- if police can locate child-porn files without perusing legitimate files then what's the problem? If they did it to my or your computer the police would be viewing exactly 0 files- no problem there. Frankly, I have no problem with privacy being violated when it's strictly limited to illegal acts, even without a warrant. My problem is the computer appears to have been illegally physically seized- there was no consent by the owner, nor was there abandonment.
You can't run a search just for matches of known images using a hash. That's the point. Hashes are completely one way. In other words, I can create a hash using a known image, but if I have a hash, I can't know what the image looked like which created it.
Therefore, if you have a listing of hashes for known pornographic images, the only way to compare that hash with the contents of a hard drive is to create a hash of every file, and see which hashes match. Operating systems do not create hashes of all the files on the computer during their normal operation. So the only way to generate one is on purpose after the fact, usually through the forensic examination process (though there are other reasons for generating hashes as well).
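The process described here (hash every file, then see which hashes appear on the known list) can be sketched as follows. This is an illustration, not the forensic tool used in the case: `md5_of_file` and `find_matches` are hypothetical helper names, and the demo runs against a throwaway temporary directory.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of_file(path, chunk_size=65536):
    """Read the whole file in chunks and return its MD5 digest as hex."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_matches(root, known_hashes):
    """Hash EVERY file under root; return the paths whose hash is on the known list."""
    matches = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and md5_of_file(path) in known_hashes:
            matches.append(path)
    return matches


# Demo on a temporary directory with two made-up files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "flagged.bin").write_bytes(b"contents on the known list")
    Path(tmp, "innocent.bin").write_bytes(b"something else entirely")
    known = {hashlib.md5(b"contents on the known list").hexdigest()}
    print([p.name for p in find_matches(tmp, known)])
```

Note that `find_matches` has to open and read every file, including the ones that never match, which is the point the comment is making.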
.
ROTFL. I'd have a huge problem with it. I have a huge problem with the rationale of the Jabara decision, but AFAIK, it's "the law" of the Circuit.
.
As a thought exercise, the instant case indeed aims to draw a line called "search" that is analogous to the government having the pictures, but not be deemed to have searched until they look - and disregarding the circumstances of how the pictures were obtained. By disregarding, I don't mean that the circumstance of obtaining the pictures is irrelevant to the complete decision, it just isn't part of deciding when "search" occurs, given that the government has the images in its possession.
The drug sniffing dog identifies chemicals released into the atmosphere and thus, no longer in control or possession of the owner of the goods being sniffed. These chemicals are left in the environment even after the object that generated them is long gone. The disk, however, is a sealed object and it must be directly manipulated (electronically controlled) in order to extract data from it so that the hash can be performed. No passive sniffing occurs in such a scenario.
"Contraband," of course, is a rather broad term these days. If I have copyrighted material on my system without permission, I have contraband.
A hash simply indicates presence of a particular file (ignoring hash salting because this does not seem to be an area where it would actually be useful) but provides no information on whether that file is actually a violation.
Unlike the child porn issue, the exact same file could be legitimate for one person to possess and not for another.
1) Performing a Virus Scan - that's a search by definition
2) Matching MD5 hashes against a database
Other potential searches:
1) Opening up the computer case
2) Generating the MD5 hashes
3) Duplicating the hard drive
This is all besides the other issue of ownership of the PC, and especially regarding the fact that other people had access to the PC and could have tampered with it. I don't know whether or not the police could get a warrant for searching the PC when the PC was stolen and the porn was reported by someone handling stolen goods. However even with a valid warrant the chain of custody would have been enough to cast doubt as long as the owner had a consistent story disavowing any knowledge (and the police errors in this area just make this entire case a total cock up).
The best we can hope for here is that the guy stops looking at child porn because of the entire situation (and counts his lucky stars), and that the police sort out their act in the future, and don't let the emotion of child porn override due process.
Don't be surprised if one day your computer gets scanned for illegal "hate speech."
However, but for the (exceedingly unusual) filing of the police report on the computer, this case could have come out the other way: the court would have had to directly address the "abandoned property" question it skirted in footnote 8. Without the known police report, an objective officer could have concluded that Hipple lawfully possessed the computer and consented to the search. Game over at that point.
But that's not this case. As noted in footnote 2, the detective who referred the computer to the AG's office for forensic analysis knew it had been reported stolen. Thus, the police were on notice that the computer clearly had not been abandoned by its prior owner and "found" by Hipple. The stolen property report converts a decent argument for the consent exception to the warrant requirement into the category of police conduct one judge I know calls "No! Bad cop!" (using his best naughty doggie corrective voice). At the point the officer became aware the computer might be stolen, he should have realized there was a problem with Hipple's ability to consent. He should have immediately put the computer in the evidence vault for safekeeping and visited a magistrate to obtain a search warrant that permitted forensic examination of the possibly-stolen computer (using Hipple's detailed statement about possible contraband he saw on the machine as the basis for probable cause). But he didn't. And there can be no good faith exception available where the specific officer was on notice of potentially invalid consent prior to the warrantless search.
But I also think the court didn't get the "private search" analysis quite right. If we assume, as the court did, that the private search doctrine applies to Hipple's actions as trespasser to the contents of the computer, then there's an argument the police were entitled to hash, image, and recover AT LEAST the specific videos Hipple had viewed and deleted (and it's trivial to determine which ones they were, based on the deletion records created by the OS combined with Hipple's own statements). Depending on the contents of those videos, there might be enough evidence to sustain a conviction in this case.
I generally think the court's attempt to distinguish cases applying the private search doctrine to floppy disks and other removable media from this search of a HDD -- based on things like the number of physical platters in the box -- just doesn't cut it. That part of the memorandum opinion reminded me of some discussions I had in the past with magistrates who insisted on analogizing computers to file cabinets, and couldn't understand why the agents' warrant application sought permission to take the whole "file cabinet" away rather than just look through it on-site to find selected "papers" relevant to the investigation...
So you'd have no problem with a government-mandated program running in the background on your computers which periodically sends the hash of every file therein to the police to be compared against their DB of known childporn hashes? I bet you're just itching to sign up!
And, btw, I think my analogy is pretty dead on. Beef it up by saying that the blind cop is also deaf and mute and has no sense of touch except for on his camera shutter release finger. Or, hell, say that it's a robot cop. If it only comes in when I'm not there, and I can't tell that it's been there, I'm probably not harmed in the usual sense of the word. But we still feel that our privacy has been invaded by the blind, senseless, deaf mute robot cop with the magic camera. Backing down a bit, we'd be similarly upset at said cop going through our home filing cabinets photocopying everything. How is my computer any different?
The thing I'd like explained to me by a lawyer in the know (hint. hint. Orin.) is given that we pretty much all seem to agree here that a warrant would have been issued on Hipple's testimony for a fully-authorized search of the computer which would have turned up undoubtedly admissible evidence, how does waving the facts under a judge's nose make the search reasonable? Is this where our distrust of the system comes into play?
IANAL, but the first thing that struck me was - the chain of possession. Why, in a crime with such a severe possible sentence, would the court rely solely on a computer with a compromised chain of custody?
yeah, likely the preponderance of evidence - dates of files, browser log, etc - is too complex to be faked convincingly (unless the entire chain were copied