According to their web site (http://ndar.nih.gov/): “The National Database for Autism Research (NDAR) is an NIH-funded research data repository that aims to accelerate progress in autism spectrum disorders (ASD) research through data sharing, data harmonization, and the reporting of research results. NDAR also serves as a scientific community platform and portal to multiple other research repositories, allowing for aggregation and secondary analysis of data.” This informatics platform was created in 2008 and has since accumulated information on more than 77,000 research participants across a large number of measures, all secured within the Amazon cloud.
Overall, these are lofty goals from the federal government and, if properly pursued, they could advance our knowledge of autism. Like many other data repositories, NDAR faces challenges that stem from handling big data sets: a virtual deluge of research results, with multiple resources streaming into it. The warehouse includes results from disparate sources, among them neuroimaging, genetics, clinical trials, and postmortem studies.
Do we need large data sets? The short answer has to be yes. However, for large data sets to be useful and reveal significant patterns, we have to peel away their different layers or strata and classify our results as either structured or non-structured. One problem is that individual researchers decide what to collect in each study, and the sources are then combined at the general warehouse. Diagnostic criteria, participant ages, comorbidities, and treatments may differ among datasets. Are data collected by different techniques measured at the same level of precision? How do you perform an error analysis across the board? With so many differences, it is difficult to envision how the data could serve as a training set, a test set, or even the basis for a meta-analysis when applied to questions different from those for which they were originally collected. Ongoing efforts by NDAR to develop tools for data definition, standardization, and validation are therefore useful and necessary. I hope that in the future the data will be organized into tables with pre-existing links between them across the multiple sections of the repository. At present you may have to guess, or know beforehand, what data you want to find and where it is located. I would also suggest installing tools based on decision trees and other classification methods that would make it easier to combine results from different studies.
One possible use of NDAR lies in allowing researchers to establish connections between data in the repository and data held privately by the investigator. However, establishing such correlations may lead to intuitive leaps that prove unfounded. There is even a name for this, “apophenia”: the experience of seeing patterns or connections in random or meaningless data. In essence, researchers will often look for patterns where there are none, and what better place to do so than in a big data set? The problem is that when we overfit data, the results fool us into thinking that we know more than we really do. There is a lot of noise in large data sets, and many of our predictions may extend beyond the range of the data.
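The risk of finding patterns in noise can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions, not an analysis of NDAR data: the participant count, the number of measures, and the function names are all hypothetical, and the cutoff of 0.279 is an approximate value of |r| corresponding to p < 0.05 (two-sided) for 50 participants. Every variable is random noise, so any "significant" correlation it reports is spurious by construction.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def count_spurious_hits(n_participants=50, n_measures=1000,
                        r_threshold=0.279, seed=0):
    """Correlate one noise 'outcome' with many noise 'measures'.

    All values are hypothetical: r_threshold is an approximate cutoff
    for p < 0.05 (two-sided) with 50 participants. Returns the number
    of measures exceeding the cutoff and the largest |r| observed.
    """
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_participants)]
    hits = 0
    strongest = 0.0
    for _ in range(n_measures):
        measure = [rng.gauss(0, 1) for _ in range(n_participants)]
        r = abs(pearson_r(measure, outcome))
        strongest = max(strongest, r)
        if r > r_threshold:
            hits += 1
    return hits, strongest


if __name__ == "__main__":
    hits, strongest = count_spurious_hits()
    print(f"{hits} of 1000 pure-noise measures pass the nominal cutoff")
    print(f"strongest spurious correlation: |r| = {strongest:.2f}")
```

With a nominal 5% cutoff, roughly one measure in twenty is expected to pass by chance alone, so the more measures a repository offers, the more such "findings" an uncorrected search will turn up.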
I have had a good amount of experience with NDAR and have been pleased with the people handling my requests. However, I am displeased with the favoritism that their administration shows toward some participants and centers. In essence, the rule that federally funded investigators must deposit their data is enforced in some cases but not in all. There are a select few to whom the regulations do not apply. There is no level playing field.
If you work in a conglomerate of Autism Centers of Excellence (ACE), your published research data (e.g., MRIs) may be deposited at your discretion, not within the time frame that regulations set for other researchers. Furthermore, NDAR officials will collude with the researchers to keep diagnostic information out of the database. Prima facie, the researchers appear to comply with the requirements and their data are there (e.g., MRIs), but those data are useless without diagnostic information. The only way to obtain the data, as indicated by NDAR officials, is to make your own idea/data/research plan available to the researchers (your competitors) whom they are sponsoring and to ask for their collaboration. Even if you are lucky and allowed to collaborate, you remain at the mercy of researchers who are in cahoots with NDAR. This means that you may be kept blind to diagnosis and that, even after you provide them with your data, it may remain unanalyzed for months to years. In the meantime, your own idea may find its way into their “subconscious” and be published preemptively by them.
I have been told by NDAR officials that they proceed in this fashion because competition among the larger autism centers is fierce. However, this competition is even fiercer at the single-investigator level, especially when the federal government decides to play favorites. The vision statement of NDAR (to accelerate progress in autism spectrum disorders [ASD] research through data sharing) should be rewritten: “to help accelerate the research of investigators selected by NDAR and in so doing help cover up any improprieties”.