Cathryn Carson & Fernando Perez, Part 1 of 2
Cathryn Carson is an Associate Professor of History and the Operational Lead of the Social Sciences D-Lab at UC Berkeley. Fernando Perez is a research scientist at the Henry H. Wheeler Jr. Brain Imaging Center at UC Berkeley. Both are helping shape the Berkeley Institute for Data Science.
Speaker 1: Spectrum's next.
Speaker 1: Welcome to Spectrum, the science [00:00:30] and technology show on KALX Berkeley, a biweekly 30-minute program bringing you interviews featuring Bay Area scientists and technologists, as well as a calendar of local events and news.
Speaker 3: Hi, good afternoon. My name is Brad Swift. I'm the host of today's show. This week on Spectrum we present part one of our two-part series on big data at Cal. The Berkeley Institute for Data Science, or BIDS, is only [00:01:00] four months old. Two people involved with shaping the institute are Cathryn Carson and Fernando Perez, and they are our guests. Cathryn Carson is an associate professor of history, associate dean of social sciences, and the operational lead of the Social Sciences D-Lab at UC Berkeley. Fernando Perez is a research scientist at the Henry H. Wheeler Jr. Brain Imaging Center at UC Berkeley. He created the IPython project while a graduate student in 2001 [00:01:30] and continues to lead the project. Here is part one with Cathryn Carson and Fernando Perez. Welcome to Spectrum. Thanks for having us. I wanted to get from both of you a short summary of the work you're doing now, the activity that predates your interest in data science.
Speaker 4: Data science is kind of an ill-defined term, I think, and it's still an open question precisely what it is. But in a certain sense, all of my research has probably been under the umbrella [00:02:00] of what we today call data science since the start. I did my PhD in particle physics, but it was computational particle physics, and I was doing data analysis, in that case of models that were computationally created. So I've been doing this really since I was a graduate student. What has changed over time is the breadth of disciplines that are interested in these kinds of problems, in these kinds of tools, and that have these kinds of questions. In physics, this has been a common way of working for a long time. The deep intersection [00:02:30] between computational tools and large data sets, whether they were created by models or collected experimentally, is something that has a long history in physics.
Speaker 4: How long? The first computers were created to solve differential equations, to plot the trajectories of ballistic missiles. It was one of the very first tasks that computers were created for, so almost since the dawn of computing. And it's really only recently, though, that the size of the data sets has really jumped? Yes, the size has grown very, [00:03:00] very large in the last couple of decades, especially in the last decade. But I think it's important not to get too hung up on the issue of size, because when we talk about data science, I like to define it in the context of data that is large for the traditional framework, tools, and conceptual structure of a given discipline, rather than its raw absolute size. Yes, in physics, for example, we have some of the largest data sets in existence, things like what the LHC creates [00:03:30] for the Higgs boson. Those data sets are absolutely, absurdly large. But in a given discipline, five megabytes of data might be a lot, depending on what it is that you're trying to ask. And so I think it's much, much more important to think of data that has grown larger than what a given discipline was used to manipulating, and that therefore poses interesting challenges for that domain, rather than being completely focused on the raw size of the data.
Speaker 1: I approach this from an angle that's actually complementary to Fernando's, in part because [00:04:00] my job as the interim director of the Social Sciences Data Laboratory is not to do data science but to provide the infrastructure, the setting, for researchers across the social sciences here who are doing that for themselves. And in the social sciences you see a nice exemplification of the challenge of larger sizes of data than were previously used, and new kinds of data as well. The social sciences are starting to pick up, say, on [00:04:30] sensor data that has been placed in environmental settings in order to monitor human behavior. Social scientists can then use that to design tests around it, or to develop ways of interpreting it to answer research questions that were not necessarily anticipated by the folks who put the sensors in place. Or they are accessing data that comes out of human interactions online, which is created for entirely different purposes [00:05:00] but makes it possible for social scientists to understand things about human social networks.
Speaker 1: Those are the challenges of building capacity for disciplines to move into new scales of data sets and new kinds of data sets, and they are ones I've been seeing as I've been building up D-Lab, and that we've jointly been seeing as we've tried to help scope out what the task of the Berkeley Institute for Data Science is going to be. How about the emergence [00:05:30] of data science? Do you have a sense of the timeline, when you started to take note of its feasibility for the social sciences, irrespective of physics, which has a longer history? One of the things that's been driving the conversation in the social sciences is actually the funding regime, in that the existing, beautifully curated data sets that we have from the post-World War II period, survey data principally, administrative data on top of that, [00:06:00] are extremely expensive to produce, curate, and maintain.
Speaker 1: And as the social sciences in the last only five to 10 years have been weighing the portfolio of data sources that are supported by funding agencies. We've been forced to confront the fact that the maintenance of the post World War Two regime of surveying may not be feasible into the future and that we're going to have to be shifting to other kinds of data that are generated [00:06:30] for other purposes and repurposing and reusing it, finding new ways to, to cut it and slice it in order to answer new kinds of questions that weren't also accessible to the old surveys. So one way to approach it is through the infrastructure that's needed to generate the data that we're looking at. Another way is simply to look at the infrastructure on campus. One of the launching impetuses for the social sciences data laboratory was in fact the budget cuts of 2009 [00:07:00] here on campus. When we acknowledged that if we were going to support cutting edge methodologically innovative social science on this campus, that we were going to need to find ways to repurpose existing assets and redirect them towards whatever this new frontier in social science is going to be.
Speaker 5: You are listening to Spectrum on KALX Berkeley. Cathryn Carson and Fernando Perez are our guests. [00:07:30] They are part of the Berkeley Institute for Data Science, known as BIDS.
Speaker 3: Fernando, you sort of gave us a generalized definition of data science. Do you want to give it another go, in case you evoke something else?

Speaker 4: Sure. I want to leave that question slightly unanswered, because I feel that to some extent one of the challenges we're trying to tackle as an intellectual effort at the Berkeley [00:08:00] Institute for Data Science is precisely working out what this field is, right? I don't want to presuppose that we have a final answer to this question, but we do know that we have some elements to frame the question, and I think it's mostly about an intersection. It's an intersection of things that were being done already on their own, but that were often being done in isolation. So it's the intersection of methodological work, by which I mean things like statistical theory, applied mathematics, computer science, [00:08:30] algorithm development, all of the computational and theoretical mathematical machinery that has been done traditionally; the questions arising from domain disciplines that may have models, that may have data sets, that may have sensors, that may have a telescope or a gene-sequencing array, where they have their own theoretical models of their organisms or galaxies or whatever it is, and where that data can be inscribed; and the fact that tools need to be built.
Speaker 4: That data doesn't get analyzed by blackboards. That data gets analyzed by software, but this is software that is deeply woven [00:09:00] into the fabric of these other two spaces, right? It's software that has to be written with knowledge of the questions in the discipline and the domain, and also with knowledge of the methodology, the theory. It's that intersection of this triad of things, concrete representation in computational machinery, abstract ideas and methodologies, and domain questions, that in many ways creates something new, when the work has to be done simultaneously, with enough depth and enough rigor, in all [00:09:30] of these three directions. And precisely that intersection is where the bottleneck is now proving to be, because you can have the ideas, you can have the questions, you can have the data, you can have the theorems, but if you can't put it all together into working, concrete tools that you can use efficiently and with a reasonably rapid turnaround, you will not be able to move forward. You will not be able to answer the questions you want to answer about your given discipline. And so that embodiment of that intersection is, I think, where the challenge is posed. Maybe there is something new called [00:10:00] data science.

Speaker 1: I'd actually like to suggest that the indefinable character of data science is actually not a negative. It's an intersection in a way that we're all still very much struggling to define, and I won't underplay that. But exactly in that it's an intersection, it points to the fact that it's not just an intellectual thing that we're trying to get our heads around. It's a platform for activity, for doing kinds of research that are either enabled or hindered by the [00:10:30] existing institutional and social structures that the research is getting done in. And so if you think of it less as a concept or an intellectual construct and more as a space where people come together, either a physical space or a methodological sharing space, you realize that the indefinableness is a way of inviting people in, rather than drawing clear boundaries around it and saying, we know what this is, it is X and not Y.

Speaker 3: [00:11:00] The Berkeley Institute for Data Science, is that where it comes in, this invitation, this collection of people, and the intersection? That's sort of the goal of it?
Speaker 1: That's what we've been asked to build: not an institute in the traditional sense, with folks inside and outside, but a meeting point and a crossing site for folks across campus. That's [00:11:30] something that's been put in front of us by the two foundations who have invested a significant sum of money in us, the Gordon and Betty Moore Foundation and the Alfred P. Sloan Foundation. And it's also become an inspiring vision for those of us who have been engaged over the last year and a half in envisioning what it might be. It's an attempt to address the doing of data science as an intersectional area within a research university that has existing structures, [00:12:00] silos, and boundaries within it.
Speaker 4: And to some extent you try to deconstruct the silos and leverage the work done by one group, share it with another. The concrete mechanisms are things we're still very much working on, and we will see how it unfolds. There's even a physical element that reflects this idea of being at a crossroads, which is that the university was willing to commit [inaudible] the physical space of one room in the main Doe Library, which is physically [00:12:30] at the center of the university. And that is very important, because it means that it is quite literally at the crossroads. It is one central point that many of us walk by frequently, so it's a space that is inviting in that sense, too: to encounters, to stopping by, to having easy collaboration, rather than being in some far corner of the campus.
Speaker 4: But also, intellectually, the library is traditionally the store of the cultural and scientific memory of an institution. And so building this space in the library is a way of signaling [00:13:00] to our community that it is meant to be a point of encounter. How specifically those encounters will be embodied, and what the concrete mechanisms will be for sharing tools, sharing code, sharing data, having lecture series, having joint projects, we're in the process of imagining all of that, and we're absolutely certain that we'll make some mistakes along the way. But the intent is very much to have something which is by design as openly and as explicitly collaborative as we can make it. And I think [00:13:30] in that sense we are picking up on many of the lessons that Cathryn and her team at the D-Lab have already learned, because the D-Lab has been in operation here in Barrows Hall for about a year and has already done many things in that direction, and I personally see them as things in the spirit of what BIDS is attempting to do at the scale of the entire institution. D-Lab has been blazing that trail for the last year in the context of the social sciences, to the point where its impact has actually spread beyond the social sciences, because so many of the things it was doing [00:14:00] found very thirsty customers for the particular brand of lemonade being sold here at the lab. We hope to take a lot of these lessons and build on them with a broader scope.
Speaker 1: And in the same way, BIDS sits at the center of other existing organizations, entities, and programs on campus which are also deeply engaged in data science. Some of them are research centers; others include the data science master's program in the School of Information, where [00:14:30] there is a strong and deliberate attempt to think through how to intelligently train people to do data science outside the university. So all of these centers of excellence on campus have the potential to get networked in a much more synergistic way with the existence of BIDS, which is not encompassing, by any means, all of the great work in teaching and research around data science getting done on this campus.
Speaker 6: [00:15:00] spectrum is a public affairs show on k a l x Berkeley. Our guests are Cathryn Carson and Fernando Perez. In the next segment they talk about challenges in Berkeley Institute for Data Science Phase
Speaker 2: [inaudible]
Speaker 3: and it seems that that eScience does happen best in teams and multidisciplinary [00:15:30] teams or is that not really the case?
Speaker 1: I think we've been working on that assumption, in part because it seems too much to ask any individual to do all the things at once. At the same time, we do have many specimens of individuals who cross the boundaries of the three areas that Fernando was sketching out: domain-area expertise, hacking skills, and methodological competence. [00:16:00] And it's interesting to think through the intersectional individuals as well. But that said, the default assumption, I think, is going to have to be that teamwork, collaboration, and actually all of the social engineering to make that possible, is going to be necessary for data science to flourish. And again, that's one of the challenges of working in a research university setting, where teamwork is sometimes prized and sometimes deprecated.
Speaker 3: That goes back to the incentives. People building tools don't necessarily get much attention [00:16:30] or prestige from that. How do you defeat that on an institutional level, within the institute or just the community?

Speaker 4: Ask us in five years if we had any success. That's one of the central challenges that we have, and not only here at Berkeley. There's actually an ongoing worldwide conversation happening about this; every few days there's another article where this issue is brought up again and again, and it's rising in volume. The business of creating tools is becoming an increasing [00:17:00] part of the job of people doing science. And so, for example, even young faculty who are on the tenure track are finding themselves pushed against the wall, because they're writing a lot of tools and building a lot of software, having to do it collaboratively, having to engage others, picking up all of these skills, and this is an important, central part of their work.
Speaker 4: But they feel that if their tenure committee is only going to look at their publication record and [00:17:30] 80% of their actual time went into building these things, they are effectively being shortchanged for their effort. And this is a difficult conversation. What are we going to do about it? We have a bunch of ideas. We are going to try many things. I think it's a conversation that has to happen at many levels. Some agencies are beginning, the NSF recently changed the terms of its biosketch requirements for example. And now the section that used to be called relevant publications is called relevant publications and other research outcomes. And in parentheses they explained such as software [00:18:00] projects, et cetera. So this is beginning to change the community that cure rates. For example, large data sets. That's a community that has very similar concerns. It turns out that working on a rich and complex data set may be a Labor that requires years of intensive work and that'd be maybe for a full time endeavor for someone.
Speaker 4: And yet those people may end up actually getting little credit for it because maybe they weren't the ones who did use that data set to answer a specific question. But if they're left in the dust, no one will do that job. Right. And so [00:18:30] we need to acknowledge that these tasks are actually becoming a central part of the intellectual effort of research. And maybe one point that is worth mentioning in this context of incentives and careers is that we as the institution of academic science in a broad sense, are facing the challenge today that these career paths and these kinds of intersectional problems and data science are right now extremely highly valued by industry. [00:19:00] What we're seeing today with this problem is genuinely of a different scale and different enough to merit attention and consideration in its own right. Because what's happening is the people who have this intersection of skills and talents and competencies are extraordinarily well regarded by the industry right now, especially here in the bay area.
Speaker 4: I know the companies that are trying to hire and I know that people were going there and the good ones can effectively name their price if they can name their price to go into contexts that are not [00:19:30] boring. A lot of the problems that industry has right now with data are actually genuinely interesting problems and they often have datasets that we in academia actually have no access to because it turns out that these days the amount of data that is being generated by web activity, by Apps, by personal devices that create an upload data is actually spectacular. And some of those data sets are really rich and complex and material for interesting work. And Industry also has the resources, the computational resources, the backend, the engineering expertise [00:20:00] to do interesting work on those problems. And so we as an academic institution are facing the challenge that we are making it very difficult for these people to find a space at the university. Yet they are critical to the success of modern data driven research and discovery and yet across the street they are being courted by an industry that isn't just offering them money to do boring work. It's actually offering them respect, yes, compensation, but also respect and intellectual space and a community that values their work and that's something [00:20:30] that is genuinely an issue for us to consider.
Speaker 4: Is there a way to cross pollinate between the academic side and industry and work together on building a toolkit? Absolutely. We've had great success in that regard in the last decade with the space that I'm most embedded in, which is the space of open source scientific computing tools in python. We have a licensing model for most of the tools in our space that [00:21:00] is open source but allows for a very easy industry we use and what we find is that that has enabled a very healthy two way dialogue between industry and academia in this context. Yes, industry users, our tools, and they often use them in a proprietary context, but they use them for their own problems and for building their own domain specific products and whatever, but when they want to contribute to the base tool, the base layer if you will, it's much [00:21:30] easier for them.
Speaker 4: They simply make the improvements out in the open or they just donate resources. They donate money. Microsoft research last year made $100,000 donation to the python project, which was strictly a donation. This was not a grant to develop any specific feature. This was a blanket, hey, we use your tools and they help what we build and so we would like to support you and we've had a very productive relationship with them in the past, but it's by, not by no means the only one you're at Berkeley. The amp lab was two co-directors are actually part of the team [00:22:00] that is working on bids, a young story and Mike Franklin, the AMPLab has a very large set of tools for data analytics at scale that is now widely used at Twitter and Facebook and many other places. They have industry oriented conferences around their tools. Now they have an annual conference they run twice per year. Large bootcamps, large fractions of their attendees come from industry because industry is using all of these tools and the am Platt has currently more of its funding [00:22:30] comes from industry than it comes from sources like the NSF. And so I think there are, there are actually very, very clear and unambiguous examples of models where the open source work that is coming out of our research universities can have a highly productive and valuable dialogue with the industry.
Speaker 3: It seems like long term he would have a real uphill battle to create enough competent people with data trained to [00:23:00] quench both industry and academia so that there would be a, a calming of the flow out of academia.
Speaker 4: As we've said a couple of times in our discussions, this is a problem. Uh, it's a very, very challenging set of problems that we've signed up for it, but we feel that it's a problem worth failing on in the sense that we, we know the challenges is, is a steep one. But at the same time, the questions are important enough to be worth making the effort.
Speaker 6: [inaudible] [00:23:30] don't miss part two of this interview in two weeks and on the next edition of spectrum spectrum shows are archived on iTunes university. We've created a simple link for the link is tiny url.com/kalx specter. Now, if you're the science and technology events happen,
Speaker 3: I mean locally over the next two weeks, [00:24:00] enabling a sustainable energy infrastructure is the title of David Color's presentation. On Wednesday, April 9th David Color is the faculty director of [inaudible] for Energy and the chair of computer science at UC Berkeley. He was selected in scientific American top 50 researchers and technology review 10 technologies that will change the world. His research addresses networks of small embedded wireless devices, planetary scale Internet services, parallel computer architecture, [00:24:30] parallel programming languages, and high-performance communications. This event is free and will be held in Satara Dye Hall Beneteau Auditorium. Wednesday, April 9th at noon. Cal Day is April 12th 8:00 AM to 6:00 PM 357 events for details. Go to the website, cal day.berkeley.edu a lunar eclipse Monday April 14th at 11:00 PM [00:25:00] look through astronomical telescopes at the Lawrence Hall of science to observe the first total lunar eclipse for the bay area since 2011 this is for the night owls among us UC students, staff and faculty are admitted.
Speaker 3: Free. General admissions is $10 drought and deluge how applied hydro informatics are becoming standard operating data for all Californians is the title of Joshua Vere's presentation. On Wednesday, [00:25:30] April 16th Joshua veers joined the citrus leadership as the director at UC Merced said in August, 2013 prior to this, Dr Veers has been serving in a research capacity at UC Davis for 10 years since receiving his phd in ecology. This event is free and will be held in Soutar Dye Hall and Beneteau Auditorium Wednesday, April 16th at noon. A feature of spectrum is to present news stories we find interesting here are to. [00:26:00] This story relates to today's interview on big data. On Tuesday, April 1st a workshop titled Big Data Values and governance was held at UC Berkeley. The workshop was hosted by the White House Office of Science and Technology Policy, the UC Berkeley School of Information and the Berkeley Center for law and technology. The day long workshop examined policy and governance questions raised by the use of large and complex data sets and sophisticated analytics to [00:26:30] fuel decision making across all sectors of the economy, academia and government for panels.
Speaker 3: Each an hour and a half long framed the issues of values and governance. A webcast. This workshop will be available from the ice school webpage by today or early next week. That's ice school.berkeley.edu vast gene expression map yields neurological and environmental stress insights. Dan Kraits [00:27:00] writing for the Lawrence Berkeley Lab News Center reports a consortium of scientists led by Susan Cell Knicker of Berkeley's labs. Life Sciences Division has conducted the largest survey yet of how information and code it in an animal genome is processed in different organs, stages of development and environmental conditions. Their findings paint a new picture of how genes function in the nervous system and in response to environmental stress. The scientists [00:27:30] studied the fruit fly, an important model organism in genetics research in all organisms. The information encoded in genomes is transcribed into RNA molecules that are either translated into proteins or utilized to perform functions in the cell. The collection of RNA molecules expressed in a cell is known as its transcriptome, which can be thought of as the readout of the genome.
Speaker 3: While the genome is essentially [00:28:00] the same in every cell in our bodies, the transcriptome is different in each cell type and consistently changing cells in cardiac tissue are radically different from those in the gut or the brain. For example, Ben Brown of Berkeley Labs said, our study indicates that the total information output of an animal transcriptome is heavily weighted by the needs of the developing nervous system. The scientists also discovered a much broader [00:28:30] response to stress than previously recognized exposure to heavy metals like cadmium resulted in the activation of known stress response pathways that prevent damage to DNA and proteins. It also revealed several new genes of completely unknown function.
Speaker 7: You can [inaudible]. Hmm.
Speaker 3: The music or during the show [00:29:00] was [inaudible]
Speaker 5: produced by Alex Simon. Today's interview with [inaudible] Rao about the show. Please send them to us spectrum [00:29:30] dot firstname.lastname@example.org same time. [inaudible].