Data management
===============

Effective data management is crucial for the performance and reliability of any computational suite. This section outlines best practices for handling data within tasks, between tasks, and across the entire suite. Proper data management ensures that tasks run efficiently, data integrity is maintained, and critical outputs are archived and retrievable.

Data within and between tasks
-----------------------------

Most tasks read and write data of various formats and sizes. Tasks need to be designed to handle this input and output data efficiently and reliably. These best practices for data handling have been established at ECMWF to achieve this:

- Ensure input data is available on a fast and reliable filesystem, e.g. in a previous data retrieval task.
- Use fast filesystems for task output, such as temporary or scratch space.
- Consider data formats, data chunking, and data compression when designing a task.
- Clean working directories before you start the task, as a previous run or previous date cycle could have left files behind.
- Store frequently-accessed invariant data (static data) on disk or other highly available data storage and document its expected structure in the task or suite documentation, ideally using a consistent naming convention.
- Treat static data as a dependency of the suite and use some form of version control or snapshots for reproducibility.
- Periodically clean out-of-date data to avoid build-up of data on disk once archived, including task log files.
- Minimise filesystem access by avoiding unnecessary file reads and writes. For example, prefer a single `grib_set` to assign multiple attributes rather than one per attribute.

Retrieving data from remote services, databases, and archives
-------------------------------------------------------------

Data retrieval tasks are common bottlenecks and points of failure in suites and therefore require careful design.
Time-consuming retrievals can slow down suites significantly if not implemented efficiently. They should be kept out of the critical path of the suite unless they are :doc:`time-critical ` themselves. Often this is done by keeping them in a separate repeat loop family to allow them to run in parallel with the main loop. Effective data chunking is key and should be tailored to the data retrieval service used.

Failure-prone retrievals must handle and document errors adequately to ensure either the suite itself or the operator can recover from them. Giving concise instructions to the user or operator for the common retrieval failures in the :ecflow-docs:`manual page ` is essential for these tasks.

Dissemination
-------------

Dissemination tasks have to be on the critical path of the suite to ensure timely delivery of the suite outputs. They should be designed to be as efficient as possible and to minimise the risk of failure. These tasks should have their `defstatus` set by an admin event toggle and switched off by default. This ensures dissemination is only switched on in the operational suite when needed and skipped in testing.

Archiving data
--------------

Ensure the main outputs of a suite are archived properly, i.e. stored on reliable long-term storage rather than expensive, fast file systems. As archival is often a slow, serial process, keep the tasks outside of the critical path of the suite. Similar to data retrieval tasks, they should be implemented on separate repeat loops (the :ref:`lag family `) to allow them to run independently of the main loop.

Once archived, data should be cleaned from disk if it is no longer required for the rest of the suite. In operations it is often useful to keep this data available for some time to recover from failures or to rerun parts of the suite. This should be accommodated by timing the cleaning tasks appropriately.
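
The delayed cleaning described above can be sketched as a small helper that removes files older than a retention period. This is a minimal illustration rather than an established ECMWF tool: the function name ``clean_old_files``, the retention period, and the use of file modification time as the age criterion are all assumptions, and a real task would also need to confirm that the data has been archived before deleting it.

```python
import time
from pathlib import Path


def clean_old_files(root, max_age_days, now=None):
    """Delete regular files under root older than max_age_days.

    Hypothetical helper: age is judged by modification time only.
    Returns the list of removed paths so the task log can record them.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # retention period in seconds
    removed = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path)
    return removed
```

A cleaning task could call, for example, ``clean_old_files("/path/to/suite/output", max_age_days=7)``, where both the path and the seven-day retention are placeholders to be chosen per suite.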

ECMWF archival and dissemination services
-----------------------------------------

At ECMWF, the following services are commonly used for data retrieval, archival, and dissemination. They have their own best practices and guidelines (linked below) which should be followed when using them in a suite:

- the **Meteorological Archival and Retrieval System (MARS)** is the main data archive at ECMWF

  - :ecmwf-confluence:`general user documentation `
  - :ecmwf-confluence:`the MARS retrieve action `
  - :ecmwf-confluence:`guidelines for writing efficient requests in suite scripts `

- the **Fields Database (FDB)** is a faster database for the most common and recent meteorological fields

  - :ecmwf-confluence:`documentation and examples `

- **ECMWF's File Storage system (ECFS)** is a file storage system for large volumes of unstructured data

  - :ecmwf-confluence:`general documentation `

- **ECMWF Production Data Store (ECPDS)** is the main data dissemination service at ECMWF

  - :ecmwf-confluence:`general documentation `
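
Calls to remote services such as those listed above can fail transiently. As a minimal sketch of the error handling recommended for failure-prone retrievals, the wrapper below retries a retrieval a fixed number of times and then fails with a message that points the operator to the manual page. The name ``retrieve_with_retries``, the default attempt count, and the delay are illustrative assumptions; ``retrieve`` stands for whatever callable performs the actual request and is not part of any ECMWF client API.

```python
import time


def retrieve_with_retries(retrieve, attempts=3, delay_seconds=60.0, sleep=time.sleep):
    """Call retrieve() until it succeeds or the attempts run out.

    Hypothetical wrapper: `retrieve` is any zero-argument callable that
    raises an exception on failure and returns the retrieved data otherwise.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return retrieve()
        except Exception as error:  # a real task would catch service-specific errors
            last_error = error
            print(f"retrieval attempt {attempt}/{attempts} failed: {error}")
            if attempt < attempts:
                sleep(delay_seconds)  # wait before the next attempt
    raise RuntimeError(
        f"retrieval failed after {attempts} attempts; "
        "see the task manual page for recovery steps"
    ) from last_error
```

Passing ``sleep`` as a parameter keeps the wrapper easy to test without real waiting. Whether retrying is appropriate at all depends on the service and the failure, so the guidelines linked above take precedence.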