Date: Monday, November 22, 2010, 12:30 pm, SB 111
Title: Large-scale Scientific Data and Metadata Management
Speaker: Dr. Tanu Malik
Scientific data is massive, complex, highly distributed, and has a large amount of metadata associated with it. Storing, managing, and querying scientific data is therefore a challenge, especially in current industrial-strength database systems. Many of these systems were conceived and developed against the backdrop of commercial data, which is simple to describe and query. In this talk, I will present critical requirements emerging from scientific data and novel systems that address these requirements. The first part of my talk will examine the impact of the rapid growth of scientific repositories on distributed middleware systems. Current middleware systems assume repositories are static. Scientists, however, are increasingly interested in viewing the latest data as part of query results, and often bypass middleware systems. This reduces the efficiency of middleware systems and leads to runaway network costs. I will present Delta, a dynamic data middleware cache system for rapidly growing scientific repositories. Delta includes a decision framework that adaptively decouples data objects: it keeps some data objects at the cache when they are heavily queried, and keeps others at the repository when they are heavily updated. The second part of my talk will focus on managing and querying scientific metadata, especially lineage data. I will contest the architectural model adopted by existing data provenance management systems and will describe our ongoing work in auditing and tracking data provenance in distributed applications.
Tanu Malik is a Research Associate with the Computation Institute (CI) at the University of Chicago and Argonne National Laboratory. Her research interests focus on developing innovative methods, systems, and technologies that improve the performance of database systems and large-scale data management systems. She has worked in the areas of federated and distributed data systems, data replication, data approximation, data provenance, and the semantic Web. She was instrumental in developing SkyServer, the first large-scale scientific database, and later established OpenSkyQuery, a platform for cross-correlating hundreds of astronomy datasets. OpenSkyQuery is actively used by astronomers worldwide. A recurrent theme in her research is to re-examine the core principles of database technology in the light of new requirements emerging from scientific domains. Her research has resulted in innovative database technology for handling large-scale distributed scientific data and metadata. Tanu earned her PhD and MS in 2008 from the Department of Computer Science at Johns Hopkins University. She earned her B.Tech in 1999 from the Department of Civil Engineering at the Indian Institute of Technology, Kanpur. Before joining CI, she was with the Cyber Center and the Indiana Center of Database Systems at Purdue University as a Research Assistant Professor. She is a senior consultant and co-PI on the National Earthquake Engineering Simulation (NEES) grant.