The field of astronomy is rapidly moving away from the traditional notion of a lone astronomer pointing a telescope at a single object in a static sky. Initiatives such as the Sloan Digital Sky Survey have ushered in a collaborative big data era of wide-field sky surveys, in which telescopes collect observations continuously while sweeping across the visible night sky. This method of data collection not only enables very deep imaging of far and faint objects but is also optimal for searching for objects that might be changing or moving. By analyzing the differences in astronomical image data from one night to the next, astronomers can detect "transient" objects, such as variable stars, supernovae, and near-Earth asteroids.

New sky surveys provide a wealth of scientific value for astronomers, but not without technical challenges. Survey data need to be processed automatically and the results distributed immediately to the scientific community to enable rapid follow-up observations, because transient astronomy can be highly time sensitive. Mechanisms for distributing detection alerts need to be robust and reliable to maintain scientific integrity without data loss. Additionally, alerting systems need to scale to a data volume unprecedented in astronomy, as transient detection rates have increased to the point where a single night can exceed all historical data.

A streaming architecture is well suited to automated distribution and processing of transient data in real time as it is being collected. In this talk, we will discuss how Kafka and Avro are being used in wide-field astronomical sky survey pipelines to serialize and distribute transient data, the design choices behind this system, and how this alert stream system has been successfully deployed in production to distribute more than 1 million transient detection alerts per night to the scientific research community.
Building a newsfeed from the Universe: Data streams in astronomy (Maria Patterson, High Alpha) Kafka Summit SF 2019
1. Building A Newsfeed From The Universe: Data Streams In Astronomy
Maria Patterson, PhD
Data Scientist
@OpenSciPinay
3. High Alpha Studio: A venture studio that conceives, launches, and scales enterprise cloud companies.
High Alpha Capital: A highly focused fund that invests in best-in-class enterprise cloud companies from our Studio and around the world.
Maria Patterson @OpenSciPinay
4. A new model for entrepreneurship that unites company creation with venture funding.
5. High Alpha Studio
Since 2015 our studio has launched 16 companies; we are currently launching ~1 new company every other month.
9. Astronomical “Sky Surveys”
Listening to the sky in all directions
The Sloan Digital Sky Survey (SDSS) is an early pioneer of the survey technique, collecting tens of TBs of image data from nearly 1 billion objects.
Image Credit: PS1SC/R.Ratkowski/Sloan Digital Sky Survey
11. Zwicky Transient Facility
• First light November 1, 2017
• 48” telescope at Palomar Observatory
• Image size = 235 x area of Moon
• Images entire Northern sky every 3 nights
• Images Milky Way plane twice a night
• Designed to detect transients (supernovae, gamma-ray bursts, etc.) and moving objects (comets, asteroids)
13. Large Synoptic Survey Telescope
• Under construction, with full operations planned for 2022
• 8.4 m mirror in northern Chile
• 3.2 Gigapixel camera, largest ever
• Images entire Southern sky every few nights
• 20 TB raw data / night for 10 years
• 60 PB survey end, 15 PB catalog database
• All data public and open source code!
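As a rough sanity check on these figures, the nightly rate can be multiplied out over the survey length. This is only a back-of-the-envelope sketch: the number of observing nights per year is an assumption (it is not stated on the slide), and the function name is made up for illustration.

```python
TB_PER_NIGHT = 20       # from the slide: raw data per night
SURVEY_YEARS = 10       # from the slide: survey duration
NIGHTS_PER_YEAR = 300   # ASSUMPTION: not on the slide; leaves room for weather and downtime


def raw_survey_volume_pb(tb_per_night, nights_per_year, years):
    """Total raw data volume in petabytes, using 1 PB = 1000 TB."""
    return tb_per_night * nights_per_year * years / 1000


total_pb = raw_survey_volume_pb(TB_PER_NIGHT, NIGHTS_PER_YEAR, SURVEY_YEARS)
print(f"~{total_pb:.0f} PB of raw data over the survey")
```

Under that assumption the total comes out at the same order as the 60 PB quoted for the end of the survey; observing every single night would give a somewhat larger figure.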
41. How should we package alert data?
42. Traditional Format: XML-based
• Measurements characterizing objects
• Verbose, redundant, and heavy
• Non-standard / non-typed fields
• Meant more for human inspection
• How do we include images?
• How can we better scale?
43. Data formatting: Apache Avro
• Compact, as opposed to XML’s verbosity
• Fast parsing with structured messages
• Easy to characterize with simple JSON schema
• Availability of user-friendly Python modules
  • avro-python3
  • fastavro
• Strictly enforced schemas, but allows evolution
• Allows “postage stamp” cutout files
[Figure: Schema, Data]
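To make the Avro points concrete, here is a minimal sketch of packaging an alert with a JSON-style schema and an optional image cutout stored as raw bytes. It assumes the third-party fastavro package is installed. The schema is a heavily trimmed, hypothetical one written for this example; it is not the real ZTF or LSST alert schema, and the helper names `pack_alerts` and `unpack_alerts` are made up.

```python
import io

# HYPOTHETICAL schema for illustration only (not the real ZTF/LSST schema).
ALERT_SCHEMA = {
    "type": "record",
    "name": "alert",
    "namespace": "example",
    "fields": [
        {"name": "alertId", "type": "long"},
        {"name": "ra", "type": "double"},
        {"name": "dec", "type": "double"},
        {"name": "magnitude", "type": "double"},
        # Optional "postage stamp" cutout carried as raw bytes.
        {"name": "cutoutScience", "type": ["null", "bytes"], "default": None},
    ],
}


def pack_alerts(alerts):
    """Serialize a list of alert dicts into Avro container-file bytes."""
    import fastavro  # third-party: pip install fastavro

    parsed = fastavro.parse_schema(ALERT_SCHEMA)
    buffer = io.BytesIO()
    # The writer embeds the schema in the output, so readers need nothing else.
    fastavro.writer(buffer, parsed, alerts)
    return buffer.getvalue()


def unpack_alerts(payload):
    """Deserialize Avro container-file bytes back into a list of dicts."""
    import fastavro  # third-party: pip install fastavro

    return list(fastavro.reader(io.BytesIO(payload)))
```

Passing a list of dicts that match the schema through `pack_alerts` and then `unpack_alerts` should give back equal records, cutout bytes included. A record that is missing a required field or has a wrongly typed value is rejected at write time, which is the "strictly enforced" part of the slide.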
44. Data formatting: Apache Avro
Including image data “postage stamps”
https://github.com/ZwickyTransientFacility/ztf-avro-alert
https://github.com/lsst-dm/sample-avro-alert
45. How should we distribute alert data?
46. Traditional Method: Event-driven Python module
• Must be connected to get data
• Difficult to filter - uses XML’s XPath
• Not easy to sink to database
• Not scalable to LSST scale
47. Data transport: Apache Kafka
• Scalability - many consumers, in parallel
• Feed astronomical “community brokers”
• Keep database archive in sync
• Maintain all history
• Let consumers “rewind” if disconnected
• Availability of user-friendly Python packages
• Runs in Docker, easy to dev
https://github.com/lsst-dm/alert_stream
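The "maintain all history" and "rewind" points follow from the log-plus-offset idea behind Kafka. The sketch below is a toy in-memory model of that idea, not the Kafka client API: `TopicLog` and `ToyConsumer` are hypothetical names, and a real deployment would use a Kafka client library against a broker.

```python
class TopicLog:
    """Append-only list of messages; the list index plays the role of an offset."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        self._messages.append(message)
        return len(self._messages) - 1  # offset of the new message

    def read(self, offset):
        return self._messages[offset:]


class ToyConsumer:
    """Tracks its own position in the log, independently of other consumers."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_messages=None):
        batch = self.log.read(self.offset)
        if max_messages is not None:
            batch = batch[:max_messages]
        self.offset += len(batch)
        return batch

    def seek(self, offset):
        self.offset = offset


log = TopicLog()
for alert_id in range(3):
    log.append({"alertId": alert_id})

consumer = ToyConsumer(log)
first = consumer.poll(max_messages=2)   # reads alerts 0 and 1, then "disconnects"

for alert_id in range(3, 5):            # alerts keep arriving in the meantime
    log.append({"alertId": alert_id})

missed = consumer.poll()                # resumes at offset 2: alerts 2, 3, 4
consumer.seek(0)
replay = consumer.poll()                # rewinds and re-reads the full history
```

Because the log never discards what a consumer has already read, a second consumer (for example a database archiver) can read the same messages at its own pace without affecting the first.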
48. How do we find objects of interest?
49. Data filtering: Write your own Python
• Allows complex operations / machine learning Python modules
• If True: write to new topic
• If False: drop
• Deployed in separate Docker containers
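The True/False rule on this slide can be sketched as a plain Python predicate plus a small routing helper. This is an illustration under assumptions, not the deployed filter code: the `magnitude` field, the brightness cut, the topic name, and both function names are hypothetical, and the Kafka consume/produce wiring is left out so the logic can run on its own.

```python
def is_bright(alert, magnitude_limit=18.0):
    """Example filter: smaller magnitudes are brighter, so keep alerts below the limit."""
    return alert["magnitude"] < magnitude_limit


def apply_filter(alerts, predicate, output_topic):
    """Yield (topic, alert) pairs for alerts that pass; all others are dropped."""
    for alert in alerts:
        if predicate(alert):
            yield output_topic, alert


incoming = [
    {"alertId": 1, "magnitude": 16.2},
    {"alertId": 2, "magnitude": 20.9},
    {"alertId": 3, "magnitude": 17.5},
]

routed = list(apply_filter(incoming, is_bright, "bright-transients"))
print([alert["alertId"] for _, alert in routed])  # prints [1, 3]
```

Since the predicate is an ordinary function, it could equally call into a machine learning model; running each filter in its own container keeps one user's slow or failing filter from affecting the others.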
57. Putting it all together: ZTF Alert Distribution System (ZADS)
IPAC
• Data processing (difference imaging)
• Archive to databases
• Enrich with supplementary data
• Packaging to Avro format (custom open source code)
59. Putting it all together: ZTF Alert Distribution System (ZADS)
IPAC
• MirrorMaker - Docker (confluentinc/cp-kafka)
• One topic per night
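With one topic per night, a consumer that wants several nights of data needs the list of topic names to subscribe to. The sketch below shows one way that could look. The `prefix_YYYYMMDD` naming pattern and both function names are assumptions made for this example; the slide does not state the actual topic naming scheme.

```python
from datetime import date, timedelta


def nightly_topic(night, prefix="ztf"):
    """Build a topic name for one observing night (HYPOTHETICAL naming pattern)."""
    return f"{prefix}_{night:%Y%m%d}"


def topics_for_range(first_night, last_night, prefix="ztf"):
    """List the topic names for every night from first_night to last_night inclusive."""
    n_nights = (last_night - first_night).days + 1
    return [
        nightly_topic(first_night + timedelta(days=offset), prefix)
        for offset in range(n_nights)
    ]


print(topics_for_range(date(2019, 9, 30), date(2019, 10, 2)))
```

Splitting the stream by night keeps each topic bounded in size and lets old nights be expired or archived as a unit.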
60. Putting it all together: ZTF Alert Distribution System (ZADS)
IPAC, UW
• Three broker cluster
  • confluentinc/cp-kafka Docker image
  • confluentinc/cp-zookeeper Docker image
• MirrorMaker - Docker (confluentinc/cp-kafka)
• One topic per night
64. Putting it all together: ZTF Alert Distribution System (ZADS)
[Diagram labels: IPAC, UW, UW Avro Archive]
• UW cluster 16 partitions
• Custom open source code
• Subscribe to topics per night
73. ZTF Alert Distribution System (ZADS)
• 600k - 1.2 million alerts per night
• Up to ~75 GB / night archived
• ~20 minutes from image taken to available to consumers
• ~4 seconds packaging / transfer time
• https://ztf.uw.edu/alerts/public/
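These figures allow a rough estimate of the average alert size. The sketch below assumes that the largest archive volume corresponds to the busiest night and that 1 GB means 10^9 bytes; neither is stated on the slide, so the result is an order-of-magnitude guide only.

```python
MAX_ALERTS_PER_NIGHT = 1_200_000   # from the slide: upper end of the nightly range
MAX_GB_PER_NIGHT = 75              # from the slide: "up to ~75 GB / night"
BYTES_PER_GB = 10**9               # ASSUMPTION: decimal gigabytes

# ASSUMPTION: the peak volume and the peak alert count refer to the same night.
average_alert_bytes = MAX_GB_PER_NIGHT * BYTES_PER_GB / MAX_ALERTS_PER_NIGHT
print(f"~{average_alert_bytes / 1000:.1f} KB per alert")
```

An average of tens of kilobytes per alert is plausible for a message that carries image cutouts alongside its measurements.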
74. "Design of the LSST Alert Distribution System": https://dmtn-093.lsst.io/
• 189 CCDs, processed in parallel
• 4k x 4k pixels each, 3.2 billion pixels per image
• End of pipeline alert generator
• 10,000 alerts every 39 seconds
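The numbers on this slide can be cross-checked with a little arithmetic. The sketch below reads "4k" as 4096 pixels per side, which is an assumption, and derives the total pixel count and the average alert rate that the distribution system has to sustain.

```python
N_CCDS = 189                # from the slide
CCD_SIDE_PIXELS = 4096      # ASSUMPTION: "4k" read as 4096 pixels per side
ALERTS_PER_BATCH = 10_000   # from the slide
SECONDS_PER_BATCH = 39      # from the slide

pixels_per_image = N_CCDS * CCD_SIDE_PIXELS**2
alerts_per_second = ALERTS_PER_BATCH / SECONDS_PER_BATCH
alerts_per_ccd = ALERTS_PER_BATCH / N_CCDS

print(f"{pixels_per_image / 1e9:.1f} billion pixels per image")
print(f"~{alerts_per_second:.0f} alerts per second on average")
print(f"~{alerts_per_ccd:.0f} alerts per CCD per batch")
```

The pixel total agrees with the 3.2 billion quoted on the slide. The average rate understates the peak, since alerts are generated in bursts as each image finishes processing.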
86. "Design of the LSST Alert Distribution System": https://dmtn-093.lsst.io/
[Diagram components: end of pipeline alert generator; Alert Hub, central node (or cluster); Archive; Filter Cluster; Filter 1, Filter 2, ... Filter N; Filter Submission Service; User Interface]
89. For more information
Large Synoptic Survey Telescope
• lsst.org
• Code repositories: dm.lsst.org/browse
• github.com/lsst-dm repos:
  • alert_stream
  • sample-avro-alert
Zwicky Transient Facility
• ztf.caltech.edu
• dirac.astro.washington.edu
• github.com/ZwickyTransientFacility repos:
  • ztf-avro-alert
  • alert_stream
ALL THE KUDOS! to the ZTF team and LSST DM team (esp. Eric Bellm, John Swinbank, and Simon Krughoff)
Maria Patterson
maria@highalpha.com @OpenSciPinay
See also
• mtpatter.github.io - 3 technical notes on architecture
• github.com/mtpatter/postgres-kafka-demo