Dr. Alex Shalek
Sharing data can accelerate discovery, promote collaboration, and improve reproducibility, but it also raises complex questions around consent, privacy, equity, and career development. For life scientists, deciding when and how to share data is no longer just a technical or logistical issue; it's an ethical and strategic decision that can affect everything from grant applications to global research partnerships.
Dr. Alex Shalek and members of his lab strive to engineer technologies and methods achievable by any lab, anywhere in the world, through open-source experimental protocols and computational packages. In this episode, he speaks candidly about the promises and responsibilities of open data, and why simply “sharing everything” isn’t always the right answer. Drawing on his experiences with global research collaborations, Dr. Shalek offers a grounded perspective on what it means to share data responsibly and shares his thoughts on how the open science movement must evolve to support both scientific progress and the people behind it. Whether you're an academic, industry scientist, or early-career researcher, this episode invites you to rethink what “open” should really mean in modern science.
Podcast published May 2024.
The following interview has been edited for clarity and brevity. The views expressed in this interview are those of the individuals involved and do not necessarily reflect the views of STEMCELL Technologies.
The Promise and Challenges of Open Data
Nicole Quinn (NQ): Before we start, I will note that we're specifically not talking about open-access publishing or sharing final polished stories. We're talking about sharing raw data, raw methods, and doing that on openly accessible platforms.
I was fascinated, intrigued, and humbled by Alex Shalek’s insights and his eloquent way of framing some of the opportunities and challenges surrounding open data at the [2023] International Union of Immunological Societies (IUIS) meeting in Cape Town, South Africa, where we discussed open science.
Can you speak to your specific experiences with openly sharing data and how you're currently working within this open data sharing world?
Alex Shalek (AS): One of the things that's so exciting about genomics is the promise that it enables you to explore many different hypotheses. More than thinking about the things that you can do in your own experiment, it's the idea that as you bring together datasets across multiple different experiments, there will be opportunities for reanalysis that will enable emerging insights. It's one of those things where we always think that as we get more and more data, there's going to be real opportunities to take advantage of that to push the boundaries of what we know, to learn new things about how cells work, about how our tissues work, about what drives disease, and incredible opportunities for technology development, whether it's understanding what we need experimentally or building new computational approaches. I'd say that the recognition of the importance of data also underscores some of the complexities associated with it because many people are interested in having access to data, and many people have different levels of access to the data. In many places, when we originally envisioned the studies and obtained consent from participants, particularly if we're thinking about human data, some of the use cases weren't always envisioned in the way in which we might want to use them now.
It opens up lots of questions about how to best do open access data and how to do the kinds of things that we all can see as being potentially incredibly powerful, but at the same time recognizing that, fundamentally, we owe respect to the individuals who partnered with us in these studies by sharing samples and sharing material at a time when things may have been very precarious or very difficult for them personally. So there's this tension between the promise and the responsibilities that we have as investigators, but the responsibility that we also have transitively as people that are getting involved in reanalyzing those samples and making sure that we do as well as we can by those individuals in their wishes.
Industry vs. Academia: Different Perspectives on Open Data
What are the different thoughts and considerations around open data in industry?
NQ: Jason, where are you sitting? You're coming at this from an industry perspective, whereas Alex is working as an academic.
Jason Goldsmith (JG): So I'll say the mercenary answer is that industry loves it if we don't have to provide it. We see this in academic papers all the time. “Hey, I have this new interesting finding, and I went to a dataset of human samples and saw it there.” That, I think, is the gold standard example Alex is getting at of new discoveries from old data. You don't have to go run a new trial. If the consent's all clean, there's a lot there. But by the same token, when private industry becomes involved, there are other concerns. Private industry spends millions of dollars of investor income or public stockholder income, i.e., people's money, to generate these datasets. So they have an interest in reaping the benefits of that. They usually share the data at some point after the trial is out and there's a publication—but it could have a decade-plus embargo on it because they want to mine all that information and be the beneficiaries of it. They actually have a responsibility [to do that] because of how the investment structure works.
So it creates this interesting dichotomy where industry broadly loves public data that's available—taxpayer-paid data—and that data is available for everyone to use. Companies are taxpayers too. But then they have an interest in not sharing their private datasets. I think COVID is an example to the contrary. But even then, the federal government backstopped the financial gain mechanism. They said, "Hey, if you can get a vaccine out or two vaccines or three, we'll buy them all. You just get them approved." And so everyone said, "We'll share data with each other to get these all approved." So I think that's part of the complexity from an industry perspective. If you put in a patent or you submit a paper, you put in a copyright [application], anything like that, it's now freely available. Whereas if you keep it secret, it's a trade secret and not available, then that has a strategic advantage to companies.
What are your perspectives from academia, Alex?
AS: I would say that we're looking at it very similarly. It's just we have different groups of people to whom we're responsible. Some of the things that you do in terms of setting up your cohorts, thinking about consents, working through it, there are different things that you're committing to those communities as part of that process. Academics often don't have the same ability to translate insights into products that can impact people in the way in which you want. As much as I'd love to be the person that develops a drug, brings it to market, and delivers it to an individual to really alleviate suffering, in many places, that's not what the academic model is made to do. It's made to find those targets, to find maybe a lead compound, and then to move it out to industry. I'd say that we have different responsibilities to people and, therefore, we can think about data in different ways.
I'd say that we [academia and industry] have different responsibilities to people and, therefore, we can think about data in different ways.
Dr. Alex Shalek
But I would say that there's also an importance in having open data. This came up in the meeting that we had at IUIS, which is that in many places, when we think about doing a study or repeating a study or setting it up again, A) it's not a very efficient use of money and B) it's not a very efficient use of time. There are tons of hidden variables in the way each of us thinks about structuring our studies and building stuff out, where there can be biases or confounders of which we're not totally aware. It shows up in a lot of places where people think about the gut microbiome in different mouse colonies leading to different results. But I think that sometimes it's nice to have completely different studies structured by different people with different datasets, where you can see glimpses of hypotheses, or ideas that give you that sanity check and tell you you're on the right path and can move forward. So I think that there's an independent value to open data, which is that it gives us the opportunity to “sniff test” or to really fact-check some of the things that come out of what we're doing.
The Complex Ethics and Economics of Open Data Sharing
What makes open data sharing so ethically and economically complex?
JG: I think the difficulty is in aligning those incentives properly. So this absolutely happens where in industry groups, two big companies are working on the same pathway and some of their people decide they should share data with each other and enter into an agreement or a CDA [confidential disclosure agreement] and then share that because they see benefit in all those things you just said. But a little company, like the one I work for, for example, may not want to share that because if they share it with the big company, the big company is going to outspend them and essentially jump eight steps. What the little company wants to do is sell them that data, so to speak. So that's where it gets very tough because all the benefits are there. But if the company spent $10 million of investor money to create the data, they want to see that $10 million back. Otherwise, it doesn't exist anymore.
AS: I’d want to see more than $10 million back. As the son of an IP [intellectual property] attorney, I will sit here and say I totally understand. The biggest question is, what do you cover versus what is trade secret? What do you keep internal? And where do you really get the most bang for your buck out of all the things that you put in? Because that's critical. And we [academia and industry] get to speak from different positions and approach things in different ways. I think we both see the utility in open data. I think that we just have to approach our engagement with it in different ways. So to your point, I might think about being a more proactive generator of open data and open sharing of data. But the thing that you're highlighting is your responsibility to many of the people in your company or fund companies. I would say that when I generate open data, I also have to think about who I partner with because those individuals might have different desires and wishes. [For example,] this comes up when I partner with individuals in the global south who have less access to some of the computational infrastructure, or who don't have the same resources and know-how to move quickly.
Is immediate open access always fair, or should we consider equity in timing and access to data?
AS: Or do we provide protected periods over which they can engage with the data? How do we make sure the researchers working on that actually generate the greatest return on that? And how do we make sure that that return accrues back to the community? As you think through these things, open data is great. But at the same time, we have to think about who owns that data, and that's the people involved in providing the samples that generate the data, but also the individuals involved in generating that data. What are the responsible things to do with it? We could take this approach that says, well, if we took the samples and brought them to Boston, where I'm sitting right now, and ran them and generated data, the benefit of that data would accrue to the communities that donated the samples or that collaborated with us as partners by contributing samples. But that might be a little disingenuous if we're not really engaging those communities and thinking about researchers in those communities and thinking about what local desires and wants are.
It's not fair for me to assume that every single person can be as open with their data and share it as freely, even when we're all working towards something that we agree is a major goal, like figuring out how to fight COVID-19 or tuberculosis or HIV.
Dr. Alex Shalek
So I completely understand what you're saying. We just have a different group of stakeholders. The incentives to open science aren't the same among all those groups. As much as I appreciate the strong push to share, particularly from government organizations that I'm affiliated with, and as much as I believe in sharing, I recognize that I sit in a position of privilege and that I'm able to do that. I have to be careful sometimes not to put some of my students or my trainees at risk because they have careers and they have things that they need. But I can share much more easily and much more freely than many people around the world. It's not fair for me to assume that every single person can be as open with their data and share it as freely, even when we're all working towards something that we agree is a major goal, like figuring out how to fight COVID-19, or tuberculosis, or HIV. So what's really important in understanding open data is understanding who's involved in the data in terms of the donation of the samples, the generation of the data, the analysis of the data, and the responsibilities that go with that entire process.
I'd even say it's different. It's not just academia or industry. It involves really thinking comprehensively about who's implicated, what their desires and wishes are, and whether this is a responsible way to engage.
What’s the difference between reproducibility and scientific responsibility, and why does that distinction matter?
AS: We have to separate out the reproducibility piece from the responsibility piece because I see them as very different things. I can think about how to create reproducibility and what's required to create reproducibility and how open data can contribute to that. I think about that as being different from some of the other pieces that we're talking about. In many places, issues with reproducibility come down to the fact that we very often don't methodically or systematically characterize a lot of the things that go into experiments. We don't necessarily pay attention to them. They're the hidden variables. So they're things that we just take for granted. But we don't realize that other investigators don't take the same things for granted or some of the things that we think of as being unimportant turn out to be important when you look in different contexts. I think that a lot of figuring out the reproducibility piece is finding a better language, a better way of communicating. One of the big things that people talk a lot about when they talk about open data is this idea of creating better standards, better metadata, better references, better ways of annotating things to enable greater consistency and sharing of information.
Why Full Transparency Is So Hard: Publishing Limits and Regulatory Lessons
Are publishing constraints getting in the way of transparent and reproducible science?
AS: I think that one of the things that I really dislike about publishing is how you have to really pack all of your methods and approaches into the smallest place possible with the fewest words possible because you're trying to get [your research] into a print format. I love that there are some journals that try to focus on methodological approaches where you can talk about some of these pieces more freely. But I think we need to do a better job of describing what we've done, how we've done it, why we've done it, and what the assumptions were behind it so that people can have those conversations. If you create a format where people can easily look at that and add to it, particularly as they're amalgamating the information from lots of different studies, there's a tremendous opportunity to find that things aren't reproducible for reasons that are totally explainable. So you could say, "Oh, this isn't reproducible because these people did this in serum-free media whereas these people did this in media with serum." And you would say, "No, it's entirely biologically predictable. The results actually reproduce biological features. They just don't reproduce the same results because they were done in different ways, in ways that make sense."
How can open data standards improve the reproducibility and sharing of scientific methods?
AS: As for reproducibility, I think open data can be important in creating standardization of descriptions of experimental approaches, of metadata, of ways of people going back and forth, of normalizing the idea of having conversations. Obviously, there's this point around trade secrets that Jason brought up, which comes up in some contexts even within academic labs, but you really need greater exchange.
As for reproducibility, I think open data can be important in creating standardization of descriptions of experimental approaches, of metadata, of ways of people going back and forth, of normalizing the idea of having conversations.
Dr. Alex Shalek
JG: I think reproducibility is the area that industry cares the most about in a couple of ways. I think one example that we all know about is clinical testing. If I get a blood count and it's at one hospital or another hospital or even in another country, often that’s on medical devices that go through really rigorous reproducibility [testing]. Obviously, every experiment by a scientist can't be that way. But regulators set pretty strict guidelines about how you have to describe your assay and the level of detail of it, whether that's for one of those clinical devices that becomes a medical device or to measure how much acetaminophen (paracetamol) is in your pill. That is a rigorously reproducible measurement. Industry will take that same technology for ibuprofen and for [the statin drug] Lipitor and anything else.
Maybe that’s a lesson learned from regulators on how to describe assays precisely and what standards are needed for robust reproducibility, or at least the descriptions therein. You're not going to run 100 samples in academia, but you can describe it better than we [industry] do. Is it serum-free media or is it with serum? Some things aren’t written down, like, "Oh, you need to angle the thing at 45 degrees." But in pharma, that's all really regulated, from the temperature of the pipette and the angle of it and what brand. And that's how they get around that [reproducibility issue], because the FDA wants it to work every single time you measure the acetaminophen in your pill. There could be some lessons from there actually that could help academia and publication in general in terms of conveying that information, even if you don't have to do it to the same level.
How can we satisfy both scientific and regulatory standards?
AS: We're talking about things that could be similar or could be different. The idea of mass, it's consistent. We have a scale that we use to measure mass. We can be very reproducible in how we approach it. Temperature is something where we have a scale. I was teaching a class earlier today about temperature scales. It's something where we can be very consistent. I think the problem is that when you think about biology and the kinds of big data we're talking about in genomics, you have lots of things that you cannot control. We engage with different groups of individuals as donors and partners based upon what study we're doing. Those people aren't identical. They don't have identical experiences. They don't have things that would enable you to reestablish the same system. Even if you think about it with mice or pick your favorite model system, it's hard to create something that is exactly identical in every instance so that you can get the same output over and over again in exactly the same way. So what becomes important is to understand what drives variability and to really use that to think about whether or not what you're seeing is consistent.
So I think what you really want to say when you say reproducible is, is the biological mechanism consistent? As opposed to, did we get exactly the same results? The entire way I got into single-cell genomics was because we took cells that were supposed to be identical. We hit them with exactly the same stimulus, and they all responded differently. We tried to take a system that was supposed to be as identical as was humanly possible, [using] postmitotic cells synchronized in a way, hit with a bacterial Armageddon, and they responded differently. Now, those differences in response gave us something to correlate against, which helped us figure out mechanisms that we could then go back and validate. But I think the point is that biology has some variability in it. As a physicist, I'm not going to sit here and tell you it's the quantumness of biology, but there is some variability and we want to understand whether or not that variability is a biological feature or a technical feature. I would separate out the reproducibility piece into technical reproducibility and biological reproducibility and 100% agree with you on the technical piece.
What’s stopping scientists from being more meticulous in documenting key variables in methods sections?
JG: How often in the methods do you see [details about] when samples were collected from donors? How many hours until it was frozen? They'll have some range. Samples were frozen within 85 hours. And you have to ask, "Well, what does that mean? Is it most of them within 24, or 70?"
AS: It's so important. I know from some of the stuff where we were involved in rapid autopsy work during the COVID-19 pandemic. The amount of time from sample collection to processing, or from when the individual unfortunately passed away from COVID to when the sample was collected, the temperatures that it was kept at, and all these things played a critical role in driving things. They're understandable. You can see why they do specific things and why you get specific results, but only in light of having that information. In many places, we aren't careful enough in writing it down. I think we have to create standards around that being an expectation because it leads to much better interpretability. I want us to think about reproducibility versus interpretability. Consistent interpretation, I think, is what we want to go for, particularly as we move to biology [biological reproducibility], as opposed to technical reproducibility, which is a must in order to get to a place where we have biological consistency.
Consistent interpretation, I think, is what we want to go for, particularly as we move to [biological reproducibility], as opposed to technical reproducibility, which is a must in order to get to a place where we have biological consistency.
Dr. Alex Shalek
JG: That's where I think regulators have done a little bit. They’ve said, "Hey, if you're going to submit this, you have to tell us the time you stored it at and the average and the mean." Because of the stakes on, for example, drug potency and manufacturing of sterile things, they have created a playbook of important variables. I think it's not going to translate 100%, but you could learn from it to an extent with what regulators have demanded of pharmaceutical assays just in terms of a playbook of “write it down, please.”
Setting Standards for Open Data
Who in academia is overseeing or running the open databases, and how are they being standardized and communicated?
AS: There is an incentive in industry for repeatability. You want to do no harm to the person who's going to take that acetaminophen or the aspirin. There are lots of repercussions to that not happening. So not only is it the right thing to do, but you wouldn't want to have something that was adverse for all the reasons in the universe. When you think about academia, there's a lot of incentive to do it within your own lab and to produce results that are very consistent and to think about how to do things in a way that provides validation. We like to use multiple different methods, multiple different approaches, multiple different systems; as many pieces of information as we possibly can get. But now, as you think about open data, where you're pulling together resources across labs, that's where I think it is really hard because the question is, who oversees that? Who convinces you that you should be adhering to specific standards? Or gets you to intellectually buy into the value of doing that? Particularly as when funding decreases or stays the same and inflation goes up, it becomes really hard to think about doing extra work and putting extra onus on people.
That’s the rub. A lot of it comes down to people who are big proponents and advocates of the possibilities of big data or people who want to do work that requires big data to push these things forward. It's a lot of convincing and getting people to the table and telling them they have to do this. It's hard to say this is what will incentivize people to adopt a common standard versus adopt best practices because I think everybody wants best practices. They want to do good science. They want to make sure that what they're putting out into the universe is high quality and it's not going to come back to them that something was wrong or they need to retract a paper because nobody ever wants that.
Open Data and Protecting Trainees
How can we balance trainee, PI, and community interests?
AS: It's critical. I think that this idea of sharing and when to share is really important because, as you say, I sit in a position of privilege. I'm a tenured professor. Every single paper is important to me, but it is not the piece that makes my career, that leads to my PhD, that lines me up for a postdoc, or might line me up for a job. I get to think about it slightly differently. In general, I want to get back to something that came up a little bit earlier. Why do we do these things and what motivates us? Personally, I want to help people. I do this job because I want to help people. That means I want to find causes for illnesses. I push toward creating better therapies. I want to mentor and train people. That's why I'm in academia, right? And so I have to think about who I'm trying to do good for and who I'm responsible for.
How can we balance trainee and PI interests?
AS: As we think about open data, early sharing can potentially put some of my trainees at a disadvantage because they don't have the time to do their analysis, write up their paper, and work through things. I think that what becomes critical in sharing early and sharing openly is to set expectations, to have conversations, and to put people in a place where they're comfortable with that and where you basically have created agreements that protect individuals. During COVID-19, we had all these partnerships. We worked with people all around the globe, and we wanted to show for multipl