Cathryn Carson & Fernando Perez, Part 2 of 2
Cathryn Carson is an Assoc Prof of History, and the Ops Lead of the Social Sciences D-Lab at UC Berkeley. Fernando Perez is a research scientist at the Henry H. Wheeler Jr. Brain Imaging Center at UC Berkeley. Berkeley Institute for Data Science.
Speaker 1: Spectrum's next.
Speaker 2: Mm MM.
Speaker 3: Uh Huh [inaudible].
Speaker 4: [00:00:30] Welcome to Spectrum, the science and technology show on KALX Berkeley, a biweekly 30-minute program bringing you interviews featuring Bay Area scientists and technologists as well as a calendar of local events.
Speaker 3: [inaudible].
Speaker 1: Hello and good afternoon. My name is Renee Rao and I'll be hosting today's show. This week [00:01:00] on Spectrum we present part two of our two-part series on big data at Cal. The Berkeley Institute for Data Science, BIDS, is only four months old. Two people involved with shaping the institute are Cathryn Carson and Fernando Perez. They are today's guests. Cathryn Carson is an associate professor of history, associate dean of social sciences, and the operational lead of the Social Sciences Data Lab at UC Berkeley. Fernando Perez is a research scientist at the Henry H. Wheeler [00:01:30] Jr. Brain Imaging Center at UC Berkeley. He created the IPython project while he was a graduate student in 2001 and continues to lead the project today. In part two they talk about teaching data science. Brad Swift conducts the interview.
Speaker 5: On the teaching side of things, does data science just fold into the domains and the fields, and some faculty embrace it, others don't? How does the teaching of data science move [00:02:00] forward at an undergraduate level? Yeah, there have been some really interesting institutional experiments in the last year or two here at Berkeley. Thinking about last semester, fall of 2013, there was Stat 157, which was Reproducible Collaborative Data Science, pitched at statistics majors simply because you have to start with a size that can fit in a classroom, [00:02:30] and training students in the practices of scientific collaboration around open source production of software tools. Or look at Josh Bloom's course, Astro 450. It's listed as Special Topics in Astrophysics just because Josh happens to be a professor in the astronomy department, and so you have to list it somewhere. The course is actually called Python for Science,
Speaker 6: [00:03:00] and it's a course that Josh has run for the last, I think this was its fourth iteration. That course is completely interdisciplinary and open to students in any field. The examples and the homework sets do not privilege astronomy in any way. I lecture a fair bit in that course as a guest lecturer, and we see students from all departments participating. This last semester it was packed to the gills. We actually had problems because we couldn't find a room large enough to accommodate everyone. So word of mouth is working, in terms of students finding these [00:03:30] courses?
Speaker 5: It's happening. I wouldn't say it's working, in part because it's very difficult to get visibility across this campus landscape. I am sure there are innovations going on that even the PIs in BIDS aren't aware of, and one of the things we want to do is stimulate more innovation in places like the professional schools, which will be training students who need to be able to use these tools as well. What do they have in mind? Are there [00:04:00] other formats of instruction beyond traditional semester courses? What would intensive training stretched out over a much shorter time look like? What gaps are there in the undergraduate or graduate curriculum that can effectively be filled in that way? The Python bootcamp is another example of this that's been going on for
Speaker 6: about four years. Josh and I teach a bootcamp, also on Python for data science, immediately before the beginning of the fall semester, literally the weekend before. [00:04:30] It's kind of a prerequisite for the semester-long course, but it's three days of intensive hands-on scientific Python: basically programming, data analysis, and computing for three days. We typically try to get a large auditorium and we get 150 to 200 people, a combination of undergrads, grad students, postdocs, folks from LBL, campus faculty, and also a few folks from industry. We always leave a few slots available for people from outside the university to come. That one, (a), has been very popular; [00:05:00] it tends to have very good attendance. And (b), it serves as an on-ramp for the course, because we advertise the semester course during the bootcamp. That one has been fairly successful so far and I think it has worked well.
Speaker 6: We see issues with it too that we would like to address. Three days is probably not enough. And because it's a single environment, it means that we have to have examples that can accommodate everyone, but that means they're not particularly interesting for any one group. I think it would be great to have [00:05:30] things of this nature that might be a little better focused on the life sciences, the social sciences, and the physical sciences, so that the examples are more relevant for a given community, or that may be better targeted at the undergraduate and the graduate level, so that you can tune the requirements or the methodological base a little better to the audience. But so far we've had to kind of bootstrap with what we have.
Speaker 6: There's another interesting course on campus, offered at the I School by Raymond Yee, a lecturer at the I School, called Working with Open Data. [00:06:00] It is very much aimed at folks who are the constituency of the I School, who combine a technical background with the broader interdisciplinary skills that are the hallmark of the I School. They work with openly available data sets on the Internet to create interesting analysis projects out of them, and that's a course that I've seen come up with some very successful and compelling projects at the end of the semester.
Speaker 7: About the teaching and preparation in universities: in [00:06:30] the course of doing interviews on Spectrum, a number of people have said that really the only way to tackle the big issues of science is with an interdisciplinary approach, but that that's not being taught in universities as the way to do science. Is there a way to break that down using data science as a vehicle?
Speaker 5: I can speak about that as a science and technology studies scholar. The practice of interdisciplinarity, what makes it actually work, is one of the [00:07:00] most challenging social questions that can be asked of contemporary science. Add to that the fact that scientists get trained inside this existing institution that we've inherited from, let's roughly say, the Middle Ages, with a set of disciplines that have been in their current form since roughly the late 19th century. That is the interface where I expect major transformations in research universities in the next two to five decades. [00:07:30] We don't yet know what a research institution will look like that does not take disciplines as its zero-order, ground-level approximation to the way to encapsulate truth. But we do see, and I think BIDS, like data science in general, is an example of this, continual pressure to open up the existing disciplines and figure out how to make connections across them. It's [00:08:00] not been particularly easy for Berkeley to do that, in part because of the structure of academic planning at our institution and in part because we have such disciplinary strengths here. But that word keeps coming back: invitation. The invitation for the future for us is to understand what we mean by practicing interdisciplinarity and then figure out how to hack the institution so that it learns how to do it better. [inaudible]
Speaker 8: [inaudible] [00:08:30] You're listening to Spectrum on KALX Berkeley. Cathryn Carson and Fernando Perez are our guests. They're part of the Berkeley Institute for Data Science, or BIDS. [inaudible]
Speaker 6: It seems that data science has an almost unlimited [00:09:00] application. Are you feeling limits? I don't know about limits specifically, because I think in principle almost any discipline can have some of its information, and whatever the concepts and constructs of that discipline are, represented in a way that is amenable to quantitative analysis of some sort. In that regard, probably almost any discipline can have a data science aspect to it. I think it's important not to [00:09:30] over-fetishize it, so that we don't lose sight of the fact that there are other aspects of intellectual work in all disciplines that are still important: that theory still has a role, that model building still has a role, that knowing what questions to ask is still important, that hypotheses still matter. I'm not so sure that it's an issue of drawing arbitrary limits around it, but rather of being knowledgeable and critical users of the tools and the approaches that are offered.
Speaker 6: Because in terms of domain [00:10:00] applications, I was recently at the Strata conference, one of these more industry-oriented big data conferences, which took place a few weeks ago in Silicon Valley, in Santa Clara. One of the best talks that I saw at the conference was an analysis of the poem "If I Told Him," which Gertrude Stein wrote about Picasso after Picasso painted this very famous portrait of her. That poem has a very repetitive rhythmic structure. It has very few words and it's a long poem with a very peculiar linguistic structure. And [00:10:30] this artist, I'm blanking on his name right now, but he's an artist who works at the intersection of digital arts and linguistics, wrote a custom one-off analysis and visualization tool to work on the structure of this poem, to visualize it, to turn it into music.
Speaker 6: It was a beautiful and very interesting talk, and it was kind of the exact opposite of big data: this was tiny data. This was one poem. In fact, during the Q&A they asked him, and he said, well, I've tried to use the tool [00:11:00] on a few other things, and there are a few songs in hip hop that it works well with, but it's almost custom-made for this one poem, right? So this was tiny data, completely non-generalizable, and yet I thought it was a fascinating and beautiful talk. So that's an example that I would never have thought of as data science. Any examples of misapplication?
Speaker 5: I think we can admit that data science is a buzzword that, [00:11:30] exactly through its almost indefinable nature, creates space for people to do methodologically problematic and in many cases also uninteresting work. Just throwing data into an analysis without asking whether this is the right analysis will get you stupid or misleading answers. It's the garbage in, garbage out principle. So yeah, like any intellectual tool in the toolkit, [00:12:00] there are misleading conclusions that can be drawn, and one of the powers that Berkeley brings to this effort in data science is a focus on the methodology, the intelligent development of methodology, along with just building things that look like tools on their own. I think that's going to be the sweet spot for universities, because of the emphasis on rigor and stringency and reasoning, [00:12:30] along with just getting out results that look good and are attractive.
Speaker 7: With data science, are there infrastructure challenges that are worth talking about, either in industry or at an academic institution? Because I know that the computing power now available through Amazon, Google, and organizations like that is enormous, and so industry is sort of giving up the idea of having its own [00:13:00] computational capacity and is using virtual capacity in the cloud. Universities, I would think, are following suit.
Speaker 6: Yes, there is work being done already on campus in that regard. We've had some intersection with those teams. Since last year the university has had a new CIO on campus, Larry Conrad, who's been spearheading an effort to reimagine what the research computing infrastructure for campus should look like, [00:13:30] considering precisely these questions: what is happening in industry, and what models are being used successfully at other institutions to provide larger scales of computational resources across all disciplines, beyond the disciplines that have traditionally been the ones with supercomputers. There's a long history of departments like physics, computational fluid dynamics teams, and quantum chemistry teams that have had either their own clusters or large budgets, and that have access to the supercomputing centers at [00:14:00] the DOE labs and things of that nature. But as we've been saying today, all of a sudden those needs are exploding across all disciplines and the usage patterns are changing. Often the bottleneck is not the amount of raw compute power but the ability to operate over very large data sets, so maybe storage is the issue, or maybe throughput. Biologists often end up buying computers that look really weird
Speaker 6: to many supercomputing centers, because the actual things that they need are skewed in a different way. So there are certainly [00:14:30] challenges in that regard, but we do know that Berkeley is right now in the midst of making a very concerted and serious attempt at taking at least a step forward on this problem.
Speaker 7: A lot of data is derived from personal information. Are there privacy concerns that you have? [inaudible]
Speaker 5: There are, quite definitely, in so many different ways that the input of experts who have thought about questions of consent, of privacy, [00:15:00] of the challenges around keeping de-identified data de-identified when it is possible through analytics to understand what patterns are emerging from them, is going to be so key, especially to working with social data. And so one of the still open questions for all of us working with data that is about people is how to develop the practices that will provide the protections necessary [00:15:30] to avoid the kinds of catastrophic misuses and violations of privacy that many of us do fear will be coming our way, as so much data becomes available so fast, with so many invitations to just make use of it and worry about the consequences later. That's not the responsible way forward. And I would like to see BIDS and Berkeley take on that challenge as part of its very deliberate agenda.
Speaker 8: [00:16:00] Spectrum is a public affairs show on KALX Berkeley. Our guests are Cathryn Carson and Fernando Perez. In the next segment they talk about institutional reactions to BIDS.
Speaker 7: Are there any impediments that you've run into within the BIDS process [00:16:30] of getting up and running? Because it's been going since, uh,
Speaker 5: it's not been going on that long, it's only been since December of 2013. Pretty recent, but I'm sure there's got to be some institutional pushback, or no? It's been incredible, actually, how much support the institution has given. What BIDS is, though, is a laboratory for the kind of collaboration that we're trying to instantiate. You have 13 brilliant co-PIs, each with their own vision, and figuring out where [00:17:00] those intersections lie and how to get the different sets of expertise and investments aligned, that's one of the fascinating challenges in front of BIDS as a laboratory in the small for the process at large that we're trying to do.
Speaker 7: On the tools and programming side, how would you break up which languages are providing what kind of capability? [00:17:30] And are there new languages that are ascendant and other languages that are losing their grip? I'm sort of curious. It's another trivia question that I think might have some interest for people. No, I think there's clearly an ascendance. I think the expansion of the surface of people interested in these problems
Speaker 6: is naturally driving the growth in importance of high-level languages that are immediately usable by domain scientists who are not full-time programmers [00:18:00] or professional programmers. Traditionally a lot of the high-end computing had been done in languages like C, C++, Fortran, and some Java, languages that tend to be more the purview of people who do lots of software development. A lot of that did happen in departments like physics and chemistry and computer science, but not so much in other disciplines. And so we're seeing the rise of open source languages like Python and R that are immediately applicable and easy to use for data analysis, where a few commands [00:18:30] can load a file, compute some statistics on it, and produce a few visualizations, and you can do that in five lines of code, not having to write a hundred or 500 lines of C++.
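[Editor's illustration] The "load a file, compute some statistics, produce a visualization" workflow described here can be sketched in a few lines of Python. This is a minimal sketch, not anything shown on the program: the sample data, the column names, and the helper functions `summarize` and `text_bars` are all hypothetical, and only the standard library is used so the example stays self-contained.

```python
import csv
import io
import statistics

# Hypothetical sample data standing in for a file on disk.
SAMPLE_CSV = """subject,reaction_ms
a,310
b,295
c,342
d,301
e,327
"""

def summarize(csv_text, column):
    """Load CSV text and return (mean, median, stdev) of one numeric column."""
    rows = csv.DictReader(io.StringIO(csv_text))
    values = [float(row[column]) for row in rows]
    return statistics.mean(values), statistics.median(values), statistics.stdev(values)

def text_bars(csv_text, label_col, value_col, scale=10.0):
    """Return one crude text bar per row, as a stand-in for a real plot."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [f"{row[label_col]} {'#' * round(float(row[value_col]) / scale)}" for row in rows]

mean, median, stdev = summarize(SAMPLE_CSV, "reaction_ms")
print(f"mean={mean:.1f} median={median:.1f} stdev={stdev:.1f}")
for line in text_bars(SAMPLE_CSV, "subject", "reaction_ms"):
    print(line)
```

In practice a domain scientist would more likely reach for third-party packages such as pandas and matplotlib, which shorten this further; the standard-library version is used here only to avoid assuming any installed packages.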
Speaker 6: Right. And so languages like that are not new. I think R came out in the late eighties, early nineties; Python came out in 1991. But they're seeing a huge amount of growth in recent years for this reason. There's also a growth of either new tools to extend these languages [00:19:00] or new languages as well. Tools, for example, that connect these languages to databases, or extensions that couple them to databases in better ways, so that people don't have to write only raw SQL, which is the classic language for interacting with databases. A lot of work is being done in that direction, on extensions to couple existing languages to database back ends. And there are some novel languages. For example, there's a team at MIT that about two years ago started [00:19:30] a project for a new language called Julia that is aimed at numerical computing, but it's sort of re-imagining:
Speaker 6: What would you do if you wanted to create a language with the strengths of a language like Python or Ruby or R, but you were doing it today, with the lessons of the last 20 years? One that would be good for numerical computing but easy to use for domain scientists, that would be high-level, that would be interactive, that would feel like a scripting tool, but that would also give you very high performance. [00:20:00] If you had the last 20 years of lessons, the advances in some of the underlying technology, and the improved compiler machinery that we have today, how would you go about that problem? I think the Julia team at MIT is making rapid progress, and it has caught the attention of people in the statistics community and of people in the numerical analysis and algorithms community. Some prominent people have become very interested and have become active participants in its development.
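[Editor's illustration] The earlier remark about coupling a high-level language to a database, so that users do not have to hand-write every SQL statement, can be sketched as follows. This is a toy under stated assumptions: `select_where` is a hypothetical helper invented for illustration, the `samples` table and its columns are made up, and Python's built-in `sqlite3` module with an in-memory database stands in for a real database back end. It is not the design of any particular tool mentioned in the interview.

```python
import sqlite3

def select_where(conn, table, **filters):
    """Hypothetical helper: build and run a parameterized SELECT from keyword filters."""
    # Table and column names cannot be bound as SQL parameters,
    # so this sketch accepts only plain identifier-like names.
    for name in (table, *filters):
        if not name.isidentifier():
            raise ValueError(f"unsafe identifier: {name!r}")
    where = " AND ".join(f"{col} = ?" for col in filters)
    sql = f"SELECT * FROM {table}" + (f" WHERE {where}" if where else "")
    # Filter values are passed as bound parameters, never formatted into the SQL text.
    return conn.execute(sql, tuple(filters.values())).fetchall()

# In-memory database with made-up rows, so the example needs no external setup.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (site TEXT, diameter_m REAL)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?)",
    [("north", 40.0), ("south", 55.5), ("north", 12.5)],
)

rows = select_where(conn, "samples", site="north")
print(rows)
```

The caller writes `select_where(conn, "samples", site="north")` rather than a SQL string; real database toolkits and object-relational mappers go much further than this, but the division of labor is the same.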
Speaker 6: So we're seeing both mature tools like Python and R growing in their strength and their importance. At the latest Strata conference, [00:20:30] for example, there was an analysis of the abstracts submitted that had R and Python in their names versus things like Excel or SQL or Java, and Python and R are clearly dominating that space. But there are also these more novel, research-level languages whose future is still not clear because they're very, very young, but at least they're exploring the frontier of what we will do in the next five or 10 years. And is this an area that's ripe for commercial software creators to develop [00:21:00] a tool that would be specific to data science, in the same way that MATLAB is kind of a generic tool for mathematics? Obviously my answer here is extremely biased, but I think that the window to create a proprietary data science language is closed already.
Speaker 6: I think the community simply would not adopt a new one. There are some existing successful ones, such as MATLAB; IDL, which is smaller than MATLAB but is widely used in the astronomy, astrophysics, [00:21:30] and physics communities; Mathematica, which is a project that came out of the mathematics and physics world and that is very sophisticated and interesting; and Maple, which is also a mathematics language. Those are successful existing proprietary languages. But I think the mood has changed. These are products that came out in the eighties and the nineties, and I think the window for that, as a purely proprietary offer, has closed. I think what we're going to see is the continued growth, and the rise, potentially, of new entrants that are fundamentally [00:22:00] open source, yet that maintain, as I said earlier, a healthy dialogue with industry. For example, in the R world there are companies that build very successful commercial products around R. There is a product called RStudio that is a development environment for analysis in R, and that's a company. There's a company called, I think, Revolution Analytics.
Speaker 6: I think they built some sort of large-scale, high-performance backend version of R. I don't know the details; I don't use it, but I've seen their website. I think they're a large company that builds kind of R for the enterprise. So I think [00:22:30] that's what we're going to see moving forward. At the base, people want the base technology, the base language, to be open source. And for us as universities and for me as a scientist, I think that's a tenet I'm not willing to compromise on, because I do not want a result that I obtain, or a result that I publish, or a tool that I educate my students with, to have a black box that I'm legally prevented from opening, and to tell my student, well, this is a result about nature, but you can't understand how it was achieved because you are legally prevented from opening the box. [00:23:00] I think that is fundamentally unacceptable. But what is, I think, a perfectly sensible way forward is to have these base layers that are open, on top of which domain-specific tools can be created by industry that add value for specific problems, for specific domains, that maybe add performance, whatever. Cathryn Carson and Fernando Perez, thanks very much for coming on Spectrum. Thanks for having us here. Thanks much.
Speaker 8: [inaudible]
Speaker 9: [00:23:30] All of Spectrum's past shows are archived on iTunes University. We've created a simple link for you. The link is tinyurl.com/kalxspectrum.
Speaker 1: Rick Karnofsky and I will present a few of the science and technology events [00:24:00] happening locally over the next two weeks.
Speaker 10: Counter Culture Labs and Sudo Room present "Gravitational Waves: Results and Implications" with BICEP2 collaborator Jamie Tolan at the Sudo Room hackerspace, 2141 Broadway in Oakland, on Sunday, April 27th at 7:00 PM. Recently, scientists from the BICEP2 experiment reported their findings demonstrating [00:24:30] evidence of gravitational waves that may imply cosmic inflation. The BICEP2 experiment is an international collaboration of research and technology from many institutions, including a team at Stanford University, where Jamie Tolan works. Jamie will discuss the results of the BICEP2 experiment and its scientific contribution to current theories that attempt to explain the why, what, and how of our universe. The event will be free.
Speaker 1: On April 30th, UCLA professor [00:25:00] of geography Jared Diamond will give this year's Horace M. Albright Lecture in Conservation. Diamond is best known for his Pulitzer Prize-winning book, Guns, Germs, and Steel. In this lecture he will discuss his newest book, The World Until Yesterday: What We Can Learn from Traditional Societies. The book is about how traditional peoples differ from members of modern industrial societies in their reactions to danger. He will then participate in a question-and-answer session with the audience. Doors open at 6:00 PM. [00:25:30] The event is free and open to the public on a first-come, first-served basis. It will be held Wednesday, April 30th from 7 to 8:30 PM in the International House Auditorium at 2299 Piedmont Avenue, Berkeley.
Speaker 10: The theme of May's Science at the Theater is Science Remix. Join Berkeley Lab scientists at the East Bay Center for the Performing Arts in Richmond, California on May 1st at 7:00 PM. They'll discuss how discovery [00:26:00] happens, help you show what science means to you, and reveal why science can be as personal as you want it to be. Light refreshments will be served, but bring your imagination and participate at this free event.
Speaker 1: A feature of Spectrum is to present news stories about science that we find particularly interesting. Rick Karnofsky joins me in presenting the news.
Speaker 10: Nature News reported on April 13th that a team of scientists from [00:26:30] Caltech has estimated that Mars's atmosphere was probably never thick enough to keep temperatures on the planet's surface above freezing for very long. Edwin Kite, now at Princeton, used data from the Mars Reconnaissance Orbiter to catalog more than 300 craters in an 84,000 square kilometer area near the planet's equator. The sizes of the craters were compared to computer models with varying atmospheres. Denser [00:27:00] atmospheres would have broken up small objects, as they do on Earth, but the high frequency of smaller craters on Mars suggests the upper limit of atmospheric pressure on Mars was only one or two bar. This most likely means that temperatures on Mars have typically been below freezing. The team notes that their findings do allow the possibility of scenarios of Mars having a slightly thicker atmosphere at times, due perhaps to volcanic activity or gases released by large impact events, and these could have [00:27:30] made Mars warmer for decades or centuries at a time, allowing water to flow then.
Speaker 1: Science Daily reports that one of the first social science experiments to rest on big data has been published in PLOS ONE. A trio of investigators from Simon Fraser University analyzed when humans start to experience an age-related decline in cognitive motor skills. The researchers analyzed the digital performances of over 3000 StarCraft 2 players, ages 16 to 44. StarCraft 2 is a ruthless intergalactic computer [00:28:00] game that players often undertake to win serious money. Their performance records, which can be easily accessed, represent thousands of hours' worth of strategic, real-time, cognitive-based moves performed at various skill levels. Using complex statistical modeling, researchers distilled meaning from this colossal compilation of information about how players responded to their opponents and, more importantly, how long they took to react. "After around 24 years of age, players show slowing in a measure of cognitive speed that is known to be important for performance," [00:28:30] explains Joe Thompson, lead author of the study. This cognitive performance decline is present even at higher levels of skill, but there's a silver lining in this earlier-than-expected slippery slope into old age. Thompson says older players, though slower, seem to compensate by employing simpler strategies and using the game's interface more efficiently than younger players, enabling them to retain their skill despite cognitive motor speed losses. These findings, says Thompson, suggest that our cognitive motor capabilities are not stable across our adulthood but are constantly [00:29:00] in flux, and that our day-to-day performance is a result of the constant interplay between change and adaptation.
Speaker 2: [inaudible]
Speaker 11: The music heard during this show was written and produced by Alex Simon. Today's interview was edited by Renee Rao. Thank you for listening to Spectrum. If you have comments about the show, please send them to us via email. Our email [00:29:30] address is spectrum dot firstname.lastname@example.org. Join us in two weeks at this same time. [inaudible].