Founder of Archive.org discusses the nonprofit’s plan to archive as much information as possible online, for all the world to share for free
TRANSCRIPT
Speaker 1:You're listening to KALX Berkeley 90.7 FM, and this is Method to the Madness, a show from the public affairs department that celebrates the innovative spirit of the Bay Area. I'm your host, Ali Nisar, and today we have Brewster Kahle, an internet pioneer who is an engineer, entrepreneur, and activist advocating for universal access to knowledge for all through his project, the Internet Archive. Stay with us.
Speaker 2:We're trying to bring universal access to all knowledge. [00:00:30] Can we build the Library of Alexandria, version two? Can you actually take everything ever published, all books, music, video, software, webpages, anything ever meant for public distribution, and, one, preserve it forever, and two, make it available to anybody curious enough to want to have access to it? So that's the basic worldview that we're part of. We're not trying to solve the whole thing, but anything that's missing, we want to try to get to that [00:01:00] goal. That's such an amazing and inspirational vision. I mean, because it's almost impossible to catalog everything, I would think. It's actually quite possible. Yeah, you kind of have that. It's not infinite? It's not, it's not at all. Um, so if you take, oh, I don't know, take the Library of Congress, the largest book or print library in the world by far: about 28 million books.
Speaker 2:The next ones down are things like the British Library, [00:01:30] Harvard, and the New York Public Library, and they're kind of half the size. So the Library of Congress is gigantic. But if you take a book, it's about a megabyte, about a megabyte a book. So if you have 28 million megabytes, it goes mega, giga, tera: 28 terabytes. And 28 terabytes is seven hard drives that you can buy at Best Buy. And that means in one shopping cart, for less than you pay in rent in a month, you can get [00:02:00] all of the disk storage that would store all the words in the Library of Congress.
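The back-of-the-envelope arithmetic above can be sketched in a few lines of Python. The book count and the one-megabyte-per-book figure come from the interview; the 4 TB drive size is an assumption (it is not stated, only implied by "28 terabytes is seven hard drives"), and the function name `storage_estimate` is hypothetical.

```python
import math


def storage_estimate(books, mb_per_book=1.0, drive_tb=4.0):
    """Return (total terabytes, number of drives needed).

    books       -- number of books (28 million is the interview's figure)
    mb_per_book -- text size per book in megabytes (interview: about 1 MB)
    drive_tb    -- capacity of one drive in TB (assumed, not from the interview)
    """
    # Decimal units: mega -> giga -> tera, so 1 TB = 1,000,000 MB.
    total_tb = books * mb_per_book / 1_000_000
    drives = math.ceil(total_tb / drive_tb)
    return total_tb, drives


total_tb, drives = storage_estimate(28_000_000)
print(f"{total_tb:.0f} TB of text, {drives} drives")
```

Under these assumptions the result matches the figures quoted in the interview: 28 terabytes on seven drives. This counts plain text only; page images, audio, and video would need far more space.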
Speaker 1:Okay,
Speaker 2:And if you do out the math on the other things like movies and music and web pages, it's all completely within our grasp. It's just because the technical guys have gotten data storage to be so, so dense. And then the access part is this internet. The idea of getting it to somebody in Kenya or East Oakland is completely possible. So when did you begin the process? [00:02:30] This really crystallized for me back in 1980, to go and try to build the library, but there were a lot of things missing. So I tried to help build some of those pieces, leading up to a system that came before the World Wide Web, called WAIS, to try to get people to come online in an open way, but the web was better. So I jumped onto that and then built a couple of companies along the way to try to get the publishers online.
Speaker 2:Um, but by 1996, [00:03:00] things were going well enough. All the publishers were getting online and the basic infrastructure was moving along, and not just because of me, but because everybody was working together towards building that. I could say, okay, let's build the library. So we started collecting the World Wide Web in this new organization called the Internet Archive, archive.org. We started archiving the World Wide Web, and we built robot crawlers, basically the same things that operate the search engines like Google, to go and visit every [00:03:30] website and download every webpage. And we would basically do every webpage from every website every two months. Then we'd start again and do it again, do it again, do it again. Because that's how long it takes to crawl the web; that's how long we give it. And then, because the web is effectively infinite... You know, that was my question. You know, there are these sites that just play chess with you.
Speaker 2:So, I mean, there are infinite numbers of computer-generated web pages. But yeah, it takes us [00:04:00] about two months to go and gather up what is in a modern search engine. How do you determine which sites you archive, and in what order? We try to do all of them. We bias towards the popular ones. So we try to get something from everybody, and then for the ones that are used a lot, those are the ones we try to go deeper on. But we're talking hundreds of millions of websites. We now have 240 billion [00:04:30] pages. And in 2001 we made the Wayback Machine, so you could go to archive.org and type in a URL, and if we have it, we'll show you all the different versions we have.
Speaker 2:You can start clicking around and seeing the web as it was. So the idea of preserving this amazing thing that we're building, which is this World Wide Web, is quite doable even by a nonprofit. So we started working with the Library of Congress. We worked with a bunch of different national libraries. [00:05:00] We work with about 200 university libraries and state libraries and archives, and they help fund bits and pieces of the Internet Archive on the web collection. It's completely exciting and it's working. We get about half a million people a day using just the Wayback Machine itself, so it's a popular resource out there. But then we thought, okay, well, while this is going along, what else is there to do? So another endangered medium was television, and I've had a love [00:05:30] hate relationship with television, uh, burn television anyway.
Speaker 2:I said, hey, I watched too much growing up, but it is still a very influential, pervasive and persuasive medium. And nobody else in the cultural areas seemed to be doing a good archive of it. So in the year 2000, we hit the record button and started recording 20 channels, 24 hours a day, DVD quality: Russian, Chinese, Japanese, Iraqi, Al Jazeera, BBC, CNN, ABC, Fox, whatever, [00:06:00] all these. And then we just cranked forward. We're now up to around a hundred channels from 35 countries. And we made these available just a few months ago, at least the television news from the United States from the last three years. We wanted everyone to be a Jon Stewart research department before the last election, so that people could go and type in words to try to find out what politicians said before about particular [00:06:30] things, and what pundits said.
Speaker 2:So you could basically go and quote and compare and contrast, the elements that are required for critical thinking. We wanted this to happen, so we made this available publicly, and you can get 30-second snippets of these news programs and watch those. And if you want the whole program, then we print it on a DVD and send it to you, and charge 25 bucks. So the idea is to try to get this ecosystem to work, [00:07:00] so documentaries can come out, and it makes it easier for people to look at television news and critique it, because otherwise it just flows over and these guys can say anything they want and get away with it. So how are we going to basically hold them accountable? Make it so that these materials are referenceable, and do it by URL.
Speaker 2:So you can go and refer into news over a period of time. So it's really fascinating. I wonder, though, if you could explain to us about, um, copyrights. Yeah. [00:07:30] So, you know, you mentioned a lot of big television networks, right? They probably have ownership over the content in some way, so how does that work? Well, everything's copyrighted forever, it seems, or at least, I don't know, I am not a lawyer on this stuff. But we are a library, and libraries have copyrighted materials. That's what libraries are full of. And there are certain things libraries can do with them, for instance, lend them out. So let's take books. We digitize books; [00:08:00] we've digitized a couple million books, and we give them away for free. And they can be downloaded in bulk.
Speaker 2:We think that that's very important. It's kind of a counter to the project by Google with those libraries. And we worked with the University of California, and we had scanning centers in Richmond, California and at UCLA digitizing books. We digitized about 300,000 books from those collections, and they're available for free on the net [00:08:30] for any use at all, because they were old enough to be out of copyright. But for the newer ones, the ones since 1923, a lot of them have rights problems. So we digitize these also, largely based on people donating the books to us, and then we make them available to the blind and dyslexic, because we can. So the blind and dyslexic, if they are registered with the Library of Congress, can have access to now 500,000 [00:09:00] modern books, all the way through Harry Potter or whatever, to be able to have free and easy access to these materials.
Speaker 2:But then we wanted to go further than that. So for the books from the 20th century that we've scanned, and for the 21st century, we would try to buy books and then lend them out one at a time. So there's only one person who can have a copy at a time, using the same controls the publishers use to control the distribution of their in-print works. So [00:09:30] we buy these books and we lend them out, but the publishers aren't selling that many books yet. So we've digitized a lot of these books, let's say from the 20th century, and then we lend them out. You can go to openlibrary.org, which is another site of ours, and then you can click and download a PDF. But it's one of these special weird PDFs from Adobe that melt in your hands after two weeks. Self-destruct? They self-destruct, in that sort of Mission: Impossible kind of way.
Speaker 2:Or you can [00:10:00] check it out and read it on the screen, and while you're reading it on the screen, nobody else can check it out. You check it back in and then somebody else can use it. And if you forget to check it back in, then it's automatically checked back in, so you don't get any library fees, and then somebody else can check it out. And we get, oh, a couple thousand to 3,000 people checking out a book a day, I think, on that order. So you can sign up like a member of the archive? Yeah, you get a library card. It's free. So if you go to openlibrary.org [00:10:30] you can go and borrow books. If you go to archive.org you can borrow TV programs. And the web archive on the Wayback Machine, that's just free and open use.
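The one-copy-at-a-time lending described above can be illustrated with a small Python sketch. This is not the Internet Archive's actual system: the class `LendableBook` and its methods are hypothetical names, and the only details taken from the interview are that one patron holds a copy at a time and that a forgotten loan returns itself after two weeks.

```python
from datetime import datetime, timedelta

LOAN_PERIOD = timedelta(weeks=2)  # the interview mentions a two-week limit


class LendableBook:
    """One digital copy that at most one patron can hold at a time."""

    def __init__(self, title):
        self.title = title
        self.borrower = None  # patron currently holding the copy, if any
        self.due = None       # moment the loan expires on its own

    def _expire(self, now):
        # Automatic check-in: an overdue loan is simply released, no fees.
        if self.borrower is not None and now >= self.due:
            self.borrower = None
            self.due = None

    def check_out(self, patron, now):
        """Return True if the patron got the copy, False if it is on loan."""
        self._expire(now)
        if self.borrower is not None:
            return False
        self.borrower = patron
        self.due = now + LOAN_PERIOD
        return True

    def check_in(self, patron):
        """Early return; only the current borrower can return the copy."""
        if self.borrower == patron:
            self.borrower = None
            self.due = None


book = LendableBook("Example Title")
start = datetime(2013, 1, 1)
print(book.check_out("alice", start))                      # first borrower succeeds
print(book.check_out("bob", start + timedelta(days=1)))    # copy is still on loan
print(book.check_out("bob", start + timedelta(weeks=2)))   # loan expired by itself
```

Passing `now` in explicitly, rather than reading the system clock inside the class, keeps the sketch easy to reason about and to test.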
Speaker 2:Um, is there a legal entity for a library, or is it just kind of... No. You walk like a duck, quack like a duck, you're a library. But actually, there is a particular regulation: to be able to get some bandwidth subsidy, you have to get the state librarian [00:11:00] to go and say that you're a library. And the California State Librarian, Susan Hildreth at the time, said that we were a library, which actually turned out to be very helpful, because at one point the FBI came and wanted, actually demanded, information about a patron and what that patron had done on the Internet Archive. And, well, libraries have a long history of not liking these sorts of requests. But it was done with the Patriot Act, with these national security letters with a gag [00:11:30] order. So basically they said, okay, you're going to have to give us this information and never tell anybody that you've even been asked this question.
Speaker 2:And, well, it turns out that there's no way to say no, or, can we ask a court about this, or anything like that. The only way you can push back, we were advised by the Electronic Frontier Foundation and the [inaudible], was you had to sue the United States government. So we sued the United States government with their help, [00:12:00] and we won. The FBI backed off immediately. They didn't really need that information. And one of the things that was to our advantage was that we were a library. Oh, so because of the state librarian? Yes, who had verified for this particular use that we were a library. But there are no real laws saying what a library is. Pretty much you can tell [00:12:30] when you see one.
Speaker 1:You're listening to KALX Berkeley 90.7 FM, university and listener supported radio. This is Method to the Madness, a show from the public affairs department that explores the innovative spirit of the Bay Area. I'm your host, Ali Nisar. Thanks for joining me. Today we've been speaking with Brewster Kahle, internet legend and founder of the Internet Archive, about open access to information and his project [00:13:00] to catalog the output of humanity. Back to our interview. Yeah. So, I mean, just the scope of the operation in terms of bandwidth and storage: could you ever have dreamed, when you envisioned this in 1980, that these types of, um,
Speaker 2:Oh yeah, it's all very predictable. We're pretty much on path. I mean, there were [00:13:30] these discussions back in, like, 1983 with Richard Feynman, the physicist, and with Stephen Wolfram, who's gone on to make Mathematica and things like that. Yeah. We did out the charts on sort of when would we be able to have it be cheap enough to put all books online, and when would movies, and when would all these other things come online. Yeah, we're pretty much on the path, actually a little slower than we predicted. So actually I would've imagined we'd be here [00:14:00] by now. It certainly is assumed. I mean, if I talk to, you know, younger people, they think, isn't the Library of Congress already online? And I'm like, ah, you know, it's really not. And the internet's still fairly thin in terms of the information that's on it.
Speaker 2:If you really know some subject area, you can look around and there's something on everything, but there's not the depth. So that's the key thing that we've got to do now: fill out the rest of [00:14:30] the best we have to offer. How do we make it so that everything that we'd want is online? So we digitize. If you take the total goal, say, of 10 million books: the Library of Congress is 28 million, but a 10-million-book library is a good solid library. That's the University of California system, or Princeton, or Yale. It's sort of a 10-million-book collection. We're at about 2.5 million, so we're a quarter [00:15:00] of the way there. How many per day? We're doing about a thousand to 1,500 a day. How does it happen? There are scanning centers in 33 libraries in eight countries around the world that are operated by the Internet Archive.
Speaker 2:And these are scanners that were designed and built by some Burning Man guys over in Berkeley. There are two digital cameras that take pictures of each page. We raise and lower a glass to flatten the page to get a good image. [00:15:30] And basically you can digitize a book in about an hour, all told: the cataloging, the QA, and the whole shebang. And then it's put to a computer and it munches on it for about 12 hours. It makes it searchable. It does the optical character recognition. It makes it into PDFs and into the talking books for the blind, on and on, all these different formats. It makes it as available as possible and copies it to another storage computer and a backup computer in two different locations, [00:16:00] in case things go down or things disappear.
Speaker 2:So the idea is to try to give permanent access to this book, and it's now in its digital form. The physical book is not damaged; we don't break the books. We're kind of obsessive about books. We love books. And for the books that don't go back on the library shelves, we actually go and store them, and have done high-density storage in Richmond, California. So we have a [00:16:30] warehouse that now has 600,000 books, and it's growing at a couple thousand a day, of books that are donated from all sorts of places. And we want one copy of every book ever published so that we can digitize it and either put it back on the library shelf or put it away. So every book ever published, I mean, that's not infinite, but it's a huge number. How do you know what the number is?
Speaker 2:Well, the Library of Congress is 28 million. It's probably not that much bigger than that. So maybe, you know, what, 50 million. I'm talking [00:17:00] millions. Western history, or everything? Everything. Yeah. Just go back to Sumerian tablets. I mean, from a computer perspective, it's not that big. And if you take, say, movies: the number of movies that have been made for theatrical release, there are a couple hundred thousand of them, and that's kind of it, because they're expensive to make. And actually about half of them are Indian. So the idea of even doing the whole of movies is [00:17:30] quite doable. Music: well, during the disc era, the 78s, long-playing records and CDs, a few million. And that's kind of the number published. Then there are gigs that people, you know, play in local bars. A lot of them aren't recorded, but we have 100,000 concert recordings from about 5,000 bands.
Speaker 2:There was a tradition started by the Grateful Dead of doing tape trading. So as that moved [00:18:00] to the internet, people started trading online, and so we offered to play host to these materials, as long as nobody got upset, if people wanted it to happen. And we get two or three bands a day saying, yes, we're up for this. And the fans themselves go and put the materials on the Internet Archive site. So on archive.org we've got everything the Grateful Dead has ever done, plus about 5,000 other bands, which makes about a hundred thousand concert recordings. So it's finding [00:18:30] those ways of working with the system such that we're not trying to interrupt commerce; we're just trying to be a library, just a digital one. Yeah, so it sounds like there's a crowdsourcing element; people are uploading a lot of information. Oh, absolutely. Thousands of things a day get uploaded to the Internet Archive, and they're different from what goes up on YouTube. I mean, if you [inaudible] it's sometimes not as easy to find, you know, whatever, but at least they're there for the long term.
Speaker 3:[00:19:00] Okay.
Speaker 1:You're listening to KALX Berkeley 90.7 FM, university and listener supported radio, and this is Method to the Madness, the show from the public affairs department that explores the innovative spirit of the Bay Area. I'm your host, Ali Nisar. Thanks for joining me. Today we've been speaking with Brewster Kahle, internet legend and founder of the Internet Archive. Back to our interview. So my listeners understand the context too: [00:19:30] there's a really tragic story of Aaron Swartz that just happened. And so there is this question of public domain information and what's open. As a leader at the vanguard of this movement, can you explain a little bit about his story? [inaudible]
Speaker 2:What a tragedy. Aaron Swartz was a good friend, and he worked here at the Internet Archive, and was a guiding light. He sort of entered the field when he was 14 years old and helped form Creative Commons. And when we did the Internet [00:20:00] Bookmobile, making free books for people, he was involved and playing a peripheral role in that realm. But he was central to the creating of Creative Commons, when he was kind of 14, 15, 16 years old. And he lived a very public life. He would just publish everything. He sort of lived on the net. I learned what an open source life was like by watching him. So he didn't really have [00:20:30] private journals. He kept it public. And he strove to bring public access to the public domain. And you'd think, of course, you know, if it's public domain, there'd be public access to it.
Speaker 2:And it was like, well, there are some people that aren't that interested in it, and he ran up against them. So he made court records available that were being sold by the government to try to make cost recovery. He made a system to try to make it [00:21:00] such that the court documents that were public domain went onto the Internet Archive. And this was working with some folks at Princeton and Carl Malamud, who lives up in Sebastopol, and the Internet Archive, all working together on this. But he did it so fast, because he was good at writing scripts, that the library that he was downloading them from got noticed by the database provider, which happened to be the government, [00:21:30] and they called the FBI on him. They called the FBI on somebody because he was reading the public domain too fast. But this is what happened.
Speaker 2:And then the FBI found that they didn't have anything that they could hassle this guy with, so there wasn't an ongoing investigation. And then Aaron wanted to bring public access to the Google books that were out of copyright, that were digitized from places like Berkeley [00:22:00] and others. And so he went and freed those. And actually Google, to their credit, didn't complain. But some of the libraries complained to us, because Aaron went and put those books on the Internet Archive, and we pointed back to Google to show where they came from. But they're public domain, and so he was basically just liberating the public domain. And when Aaron started downloading a lot of journal literature from a digital library called JSTOR, [00:22:30] a nonprofit, JSTOR got all upset and told MIT, which is where it was being downloaded:
Speaker 2:There's somebody that's downloading too many articles. And MIT went and chased down Aaron and, I think, made the tragic mistake of calling the cops. And once the cops were involved, they escalated it to the federal government, and the federal government brought in the Secret Service, and [00:23:00] they made a federal case out of some young guy going and downloading too many old journal articles, and not even making them publicly available. Maybe he would have, maybe he wouldn't. But what's the problem? And this went on for a couple of years and, according to the family and his girlfriend, made him so depressed and really dragged him down, and that contributed to his deciding to commit suicide last week. [00:23:30] An absolute tragedy. He was a real star of our community, and the federal government came down on somebody who was trying to do something fundamentally good. And actually it's something that happens all day long, every day. People are downloading masses of things from the Internet Archive and other digital libraries all the time. And for some reason, they thought this kid should be stopped.
Speaker 1:And it's so counterintuitive, when it's public domain information. That's what I think [00:24:00] of, you know, people who are growing up on the internet. [inaudible] Some of the listeners of the show are students; they don't know anything besides having this wonderful tool at their disposal to find all the information they could possibly ever want. But it seems this story highlights the fact that this isn't something we should take for granted. It's something that we should be actively protecting and fighting for.
Speaker 2:Yes, we should be actively protecting the Wikipedias, the Aaron Swartzes, I'd say [00:24:30] the Internet Archive, Carl Malamud's public.resource.org, the people that are trying to build open access models. There's a bunch in the Bay Area. There's the Public Library of Science, which is trying to get around the monopoly of some of these journal publishers that are not allowing new computer research and data mining techniques to be applied. So [00:25:00] there are real problems with what's going on out there. There's a schism, there's a conflict, and the Aaron Swartz suicide, I think, really highlighted that we're not out of the woods, that there are people that want to lock everything down, that want cell phones that you can't go and play with, that want to make it so that you can't go and install any software you want on a computer, that you can't just read anything, or if you do read anything, that they'll know about it. And this type of thing has got [00:25:30] to stop. It really doesn't lead to a world that we want to live in.
Speaker 1:Well, thank you so much for your time today. I really appreciate it. And, you know, as someone who's created an organization that is really dedicated to trying to advance the acquisition of knowledge for the human race, I wanted to ask you: how do you create an organization that endures? Obviously what you're trying to do is create something that goes on
Speaker 2:forever? Um, yes. Well, archive.org and openlibrary.org, [00:26:00] will they go on forever? I hope so. But what happens to libraries is they're burned, historically. That's just what happens. The Library of Congress has already burned once. The Library of Alexandria, of course, is famous for not being around anymore. So design for it. Make copies, and put other copies in other places. So we've already donated, early on, about 10 years ago, a full copy of the web collection to the Library of Alexandria in Egypt. [00:26:30] And there's a partial copy of our collections in Amsterdam. So when there are five or six of these around the world, then I think I can sleep. Because what happens is they burn, and they're burned by governments. Now, it's not a political statement, it's just historically what happens. The new guys don't like the old stuff around. They're sorry about it afterwards, and, you know, 50 or a hundred years on, they tend to want to have it back.
Speaker 2:But often it's too late. But if we had other copies in other places, we could make this work. And this takes real work, real money, real effort. We could use all the help we can get, any volunteers or any effort from the University of California community. We're just over in San Francisco. We'd love to have visits. We'd love to find ways to work with more people. Great. That's a great segue to my last question: if our listeners want to get involved in fighting this good fight, how do they get involved? [00:27:30] Please visit archive.org and openlibrary.org. Take a look. Play around with it. Try uploading some things or downloading some things. If you've got extra books, we want them; we'll preserve one of every different book that we can get ahold of. We only have 600,000, so we probably don't have the books that you've got. We could use volunteer effort, people to do collections, technical people. There are all sorts of mechanisms for getting [00:28:00] involved in the Internet Archive and the open access movement in general. Okay, great. Well, thanks so much for sharing. Thank you very much.
Speaker 1:You've been listening to Method to the Madness on KALX Berkeley 90.7 FM. Thanks for joining us, and thanks to Brewster Kahle. As he mentioned, you can learn more about his organization at archive.org. You can learn more about us at methodtothemadness.org. Thanks for listening, everybody. See you in a couple of weeks.
Speaker 4:[inaudible].