Presentation at the Science Coffee of ESA’s Advanced Concepts Team on 24.11.2023 by Pablo Gómez (ESA/ESAC)
Abstract:
Current and upcoming space science missions will produce petascale data in the coming years. This requires a rethinking of data distribution and processing practices. For example, the Euclid mission will send more than 100 GB of compressed data to Earth every day. Analysing and processing data on this scale requires specialized infrastructure and toolchains. Further, providing users with this data locally is not practical due to bandwidth and storage constraints. Thus, a paradigm shift is required: bringing users' code to the data and providing a computational infrastructure and toolchain around the data. The ESA Datalabs platform is specifically focused on fulfilling this need. It provides a centralized platform with access to data from various missions, including the James Webb Space Telescope, Gaia, and others. Preconfigured environments with the necessary toolchains and standard software tools such as JupyterLab are provided and enable data access with minimal overhead. In addition, the built-in Science Application Store offers a streamlined environment that allows rapid deployment of processing or science exploitation pipelines. In this manner, ESA Datalabs provides an accessible and potent framework for high-performance computing and machine learning applications. While users may upload data, there is no need to download data, which mitigates the bandwidth burden. As the computational load is handled within the computational infrastructure of ESA Datalabs, high scalability is achieved, and resources can be requisitioned as needed. Finally, the platform-centric approach facilitates direct collaboration on code and data. The platform is already available to several hundred users and is regularly showcased in dedicated workshops; interested users may request access online.
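To make the bandwidth argument concrete, the following is a minimal back-of-the-envelope sketch. Only the 100 GB per day figure comes from the abstract; the 100 Mbit/s link speed and the function names are assumptions chosen for illustration, and the result ignores protocol overhead and any growth in volume after processing.

```python
def yearly_volume_tb(gb_per_day: float, days: int = 365) -> float:
    """Accumulated volume in terabytes (decimal units, 1 TB = 1000 GB)."""
    return gb_per_day * days / 1000.0


def transfer_hours(volume_gb: float, bandwidth_mbit_s: float) -> float:
    """Hours needed to move volume_gb over a link of bandwidth_mbit_s."""
    bits = volume_gb * 8e9                       # 1 GB = 8e9 bits
    seconds = bits / (bandwidth_mbit_s * 1e6)    # 1 Mbit/s = 1e6 bit/s
    return seconds / 3600.0


EUCLID_GB_PER_DAY = 100.0     # lower bound quoted in the abstract
ASSUMED_LINK_MBIT_S = 100.0   # hypothetical user connection, not from the talk

print(f"Per year: {yearly_volume_tb(EUCLID_GB_PER_DAY):.1f} TB")
print(f"One day of data: {transfer_hours(EUCLID_GB_PER_DAY, ASSUMED_LINK_MBIT_S):.1f} h to download")
```

Under these assumptions a single user would need more than two hours per day of downlinked data just to keep up, which illustrates why moving the code to the data scales better than moving the data to each user.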
Pablo Gómez - Solving Large-scale Challenges with ESA Datalabs
1. 1
ESA UNCLASSIFIED - For ESA Official Use Only
Solving Large-scale Data Challenges with ESA Datalabs
Pablo Gómez
Data Science Section, SCI-SAS
24/11/2023
ESA ESAC
2. 2
Background – Data Science Section
• Part of the Data Science and Archives Division
• Focused on science data exploitation
• Works across different missions and disciplines
4. 4
“Big” Data – Where we are and what is coming
Euclid First Images
5. 5
“Big” Data – Where we are and what is coming
Gaia Data Release 3
6. 6
Importance of archival data – Hubble Space Telescope
Figure: HST publications by type (categories: Not assigned, Partly Archival, Archival, General Observer)
Source: https://archive.stsci.edu/hst/bibliography/pubstat.html
de Marchi & Merín, presented at EAS 2023
7. 7
“Big” Data – Where we are and what is coming
ESAC Science Data Center
14. 14
A Platform Designed to Boost Research Productivity
SaaS
PaaS
IaaS
System Development
IT Development
Science Development
You can start HERE!
24. 26
Recent Events
• Euclid Consortium meeting June 2023
• 200+ new users
• Stress test
• Lots of feedback
• Focus on user experience
• With ESA missions
• Experimental onboarding of external projects
ideas for new use-cases; UI improvements
25. 27
JWST @ ESA Datalabs: baseline JWST area
JWST area @ ESA Datalabs
• JWST calibration pipeline
• Astroquery (incl. ESA JWST module)
• pyESASky
• JDAVIZ
• astropy
• matplotlib
• ….
Access to JWST NFS volume:
• JWST calibration files
• Example notebooks for eJWST
• Example notebooks from STScI
26. 28
The ESA Space Science Exploitation Platform
High-level messages
Increase Space Science Operations Efficiency
• SCI data made easily available for researchers to work on
• Reusable for fast implementation of Scientific Processing Pipelines
• Reusable for fast implementation of Scientific Analysis and Visualisation Tools
Enable Collaboration and Open Science
• Share complex processing tools and data with your team
• Share your contributions with the community in SCI's AppStore
27. 29
Catalogue of interacting galaxies in HST archives
One example use case of ESA Datalabs
Harnessing the Hubble Space Telescope Archives: A Catalogue of 21,926 Interacting Galaxies
O’Ryan et al. 2023, arXiv:2303.00366
➢ Direct access to data (opening a large FITS file takes a few seconds; 100k cutouts are created on the order of minutes)
➢ 92 million cutouts produced (2.5 TB)
➢ Fine-tuned Zoobot on a sample of mergers from CANDELS & COSMOS
➢ Predicted interacting galaxies in HST archives: 21,926 interacting galaxies found with high confidence (p>0.95)
➢ Other gems: strong lenses, protoplanetary disks
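The selection step on this slide (keep only predictions with p>0.95) and the implied average cutout size can be sketched as follows. This is an illustration under stated assumptions, not the pipeline of O’Ryan et al.: the cutout identifiers, probabilities, and function names are hypothetical, and the probabilities stand in for the output of a classifier such as the fine-tuned Zoobot model.

```python
from typing import Iterable, List, Tuple

# (cutout identifier, predicted probability of being an interacting galaxy)
Prediction = Tuple[str, float]


def select_confident(predictions: Iterable[Prediction],
                     threshold: float = 0.95) -> List[str]:
    """Return the identifiers whose probability is strictly above the threshold."""
    return [cutout_id for cutout_id, p in predictions if p > threshold]


def mean_cutout_size_kb(total_tb: float, n_cutouts: int) -> float:
    """Average size per cutout in kilobytes (decimal units, 1 TB = 1e9 kB)."""
    return total_tb * 1e9 / n_cutouts


# Toy values for illustration only; not taken from the catalogue.
toy_predictions = [
    ("cutout_a", 0.99),
    ("cutout_b", 0.95),   # exactly at the threshold, so not selected
    ("cutout_c", 0.40),
    ("cutout_d", 0.97),
]

selected = select_confident(toy_predictions)
print(selected)

# Figures from the slide: 92 million cutouts occupying 2.5 TB.
print(f"{mean_cutout_size_kb(2.5, 92_000_000):.1f} kB per cutout on average")
```

The second calculation shows that each cutout is small (a few tens of kilobytes), so the cost of such a project is dominated by the number of files and by access to the parent FITS images, which is where working next to the archive helps.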
28. 30
ESA Datalabs for Euclid pilot studies
• Detecting Solar System Objects
• Preserving Low-Surface Brightness
• Detecting Transients
• Cosmology Likelihood for Observables in Euclid
29. 32
Perspective – A typical ML project
1 - Setup
• Tools & Frameworks
• Local folders etc.
• Getting the data
31. 34
Perspective – A typical ML project
1 - Setup
• Tools & Frameworks
• Local folders etc.
• Getting the data
2 - Data Prep
• I/O
• Data Cleaning
• Data Labeling
Gaia Data Release 3
Bing
32. 35
Perspective – A typical ML project
1 - Setup
• Tools & Frameworks
• Local folders etc.
• Getting the data
2 - Data Prep
• I/O
• Data Cleaning
• Data Labeling
3 - Models
• Training
• Inference
• Clustering
• …
35. 38
Datalabs – Quo vadis?
Anomaly Detection (Etsebeth et al. 2023)
• Finding interesting things
• Dealing with the flood
36. 39
Datalabs – Quo vadis?
Anomaly Detection (Etsebeth et al. 2023)
• Finding interesting things
• Dealing with the flood
Learning with Few Labels
• Get a few labels
• Train a semi-supervised model
Different Downstream Tasks
• Roughly sort unlabeled data
• Find other instances
• Incremental improvements
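The "few labels, then roughly sort" idea can be illustrated with a deliberately simple stand-in. The sketch below uses a nearest-centroid rule on toy feature vectors; it is not the semi-supervised model referred to on the slide, and the class names, feature values, and function names are all hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def class_centroids(labelled: List[Tuple[Vector, str]]) -> Dict[str, List[float]]:
    """Average the feature vectors of the few labelled examples per class."""
    groups = defaultdict(list)
    for features, label in labelled:
        groups[label].append(features)
    return {
        label: [sum(column) / len(column) for column in zip(*members)]
        for label, members in groups.items()
    }


def rough_sort(unlabelled: List[Vector],
               centroids: Dict[str, List[float]]) -> List[Tuple[str, float, Vector]]:
    """Assign each item to its nearest centroid and order by distance.

    Items at the end of the list are the least certain and are natural
    candidates for the next round of manual labelling.
    """
    ranked = []
    for features in unlabelled:
        distances = {label: math.dist(features, c) for label, c in centroids.items()}
        best = min(distances, key=distances.get)
        ranked.append((best, distances[best], features))
    return sorted(ranked, key=lambda item: item[1])


# Toy 2-D features; real inputs would be learned image representations.
few_labels = [
    ((0.0, 0.0), "spiral"), ((0.2, 0.0), "spiral"),
    ((5.0, 5.0), "merger"), ((5.0, 5.4), "merger"),
]
unlabelled = [(0.1, 0.1), (4.9, 5.0), (2.5, 2.5)]

ranked = rough_sort(unlabelled, class_centroids(few_labels))
for label, distance, features in ranked:
    print(f"{features} -> {label} (distance {distance:.2f})")
```

In this toy run the point halfway between the two groups ends up last, which mirrors the incremental loop on the slide: label the least certain items, recompute, and repeat.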
37. 40
Datalabs – Quo vadis?
Standardized ML Data Preprocessing