Puchun Niu is a visiting postdoctoral associate in the lab of Miel Hostens, the Robert and Anne Everett associate professor of digital dairy management and data analytics at Cornell University. Puchun's work focuses on collecting experimental data for integration and management, processing the data through a pipeline for quality control and profiling, and creating a comprehensive dataset for storage, analysis and visualization.
We spoke with Puchun about creating a data pipeline to standardize large volumes of data collected across multiple dairy cattle experiments. The experiments are conducted as part of the Accelerating Livestock Innovations for Sustainability (ALIS) program, a transdisciplinary collaboration that uses feed additives, ration balancing, technology and data modeling to create solutions for climate-smart animal agriculture.
Animal agriculture researchers have been conducting experiments for years without a data pipeline. Why do they need one now?
Researchers collect large volumes of data from animal experiments. A typical dairy cattle experiment might include feed chemical analysis, feed intake, milk yield, feed digestibility and energy expenditures—not to mention methane emissions, which are central to the latest ALIS experiments. Managing these diverse measurements, and the metadata that describes them, across multiple experiments poses significant challenges for data organization, analysis and efficient use.
A data pipeline helps streamline the collection, processing and analysis of data. This is crucial in modern agriculture, where the growing use of technology and data-driven decision-making plays an essential role in improving productivity, sustainability and research outcomes.
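To make that concrete, here is a minimal sketch in Python with pandas of what one stage of such a pipeline can look like. The column names, units and quality-control ranges are hypothetical illustrations, not the actual ALIS schema; the idea is simply that each experiment's spreadsheet is mapped onto a shared schema, checked against plausibility ranges, and combined into one dataset.

```python
import pandas as pd

# Hypothetical mapping from one experiment's spreadsheet headers to a shared schema.
COLUMN_MAP = {
    "Cow ID": "animal_id",
    "DMI (kg/d)": "dry_matter_intake_kg",
    "Milk (kg/d)": "milk_yield_kg",
    "CH4 (g/d)": "methane_g",
}

# Illustrative plausibility ranges for quality control; real thresholds would
# come from the experimental protocol.
QC_RANGES = {
    "dry_matter_intake_kg": (5, 40),
    "milk_yield_kg": (0, 80),
    "methane_g": (50, 800),
}

def ingest(path: str, experiment_id: str) -> pd.DataFrame:
    """Read one experiment's spreadsheet and map it onto the shared schema."""
    df = pd.read_excel(path).rename(columns=COLUMN_MAP)
    df["experiment_id"] = experiment_id
    return df[["experiment_id", "animal_id", *QC_RANGES]]

def quality_control(df: pd.DataFrame) -> pd.DataFrame:
    """Flag values outside plausible ranges instead of silently dropping them."""
    for col, (low, high) in QC_RANGES.items():
        df[f"{col}_out_of_range"] = ~df[col].between(low, high)
    return df

def build_dataset(files: dict[str, str]) -> pd.DataFrame:
    """Combine all experiments into one standardized dataset for storage and analysis."""
    frames = [quality_control(ingest(path, exp_id)) for exp_id, path in files.items()]
    return pd.concat(frames, ignore_index=True)
```

Because a pipeline like this only adds flags and a shared schema rather than altering the raw files, the same code can be rerun whenever data from a new experiment arrives.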
What is wrong with storing data in traditional worksheet-based formats, such as spreadsheets?
The spreadsheet approach has several limitations. Data is recorded inconsistently across experiments, and spreadsheets scale poorly to large datasets. Manual entry introduces frequent errors, and the lack of automation makes the process tedious. And without centralized data management, data security suffers.
These challenges highlight the need for a standardized approach to agricultural data organization that is consistent, easy to understand, and readable by both people and computer systems. Such an approach allows seamless integration and analysis across different experiments, making the process more efficient and reliable.
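As an illustration of what "readable by both people and computer systems" can mean in practice, one common convention (shown here as a hypothetical sketch, not necessarily the exact format ALIS uses) is a tidy, long-format table in which every measurement row carries its variable name, value, unit and experiment identifier explicitly:

```python
import pandas as pd

# A typical wide, spreadsheet-style table: one column per measurement,
# with units implied rather than recorded.
wide = pd.DataFrame({
    "animal_id": ["1001", "1002"],
    "milk_yield": [32.5, 28.1],   # kg/day, but only the person who typed it knows that
    "methane": [410, 385],        # g/day
})

# The same data in a standardized long format: each row is one measurement
# with an explicit variable name, value, unit and experiment identifier.
long = wide.melt(id_vars="animal_id", var_name="variable", value_name="value")
long["unit"] = long["variable"].map({"milk_yield": "kg/day", "methane": "g/day"})
long["experiment_id"] = "EXP-2024-01"  # hypothetical identifier

print(long)
```

In this form a person can read each row at a glance, while analysis code can filter, join and aggregate across experiments without guessing at implied units or layouts.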