This document discusses how big data and data science can be used to attain new heights, likening the work to a magic show. It opens with an overview of Ken Johnston's background and experience in data science, then walks through six keys to a "big" magic show with big data: trying again and again, the tyranny of counting, magic tricks (A/B testing and runtime flags), the venue (big data infrastructure), the foundation (tools for big data), and security (protection, privacy, and fraud). It closes with "the assistant": recruiting, training, and retaining data scientists and data engineers.
O'Reilly Webcast: Organizing the Internet of Things - Actionable Insight Thro... – Boris Adryan
Traditional machine-to-machine (M2M) uses the internet to replace what was previously achieved through a wire. The challenges for IT are not much different to any other implementation of a prescribed business model.
But how are we going to leverage the connectedness of devices in the consumer Internet of Things (IoT) in a world in which every individual may show a different degree of technology adoption? Not everyone has the connected Crock Pot! The challenges are manifold, and while in 2015 we are still arguing about technical standards that hinder communication of things across platforms, the looming challenges of data integration are even more significant.
Even if all devices e.g. in the connected home of the future are going to speak one language, how are we generating actionable insight from the available information according to the users' need? How do we determine the appropriateness of action? An empty fridge might be alarming, but should we inform the user of an impending hunger crisis if the door hasn't been opened in a week, the heating system is set to low, the car is parked at the local airport? Draw your conclusions!
Ontologies organize things and establish their relationship to each other. They can be used for knowledge inference. For example, a car is a means of transport and ultimately an indicator of absence or presence. Some scientific domains are already making extensive use of ontologies to deal with vast amounts of information. The Gene Ontology (GO) has over 40k interlinked terms that describe cell and molecular biology. For every biological entity on that scale, we can ask: Where is it? What is its function? What process is it involved with? Benefitting from substantial government funding (in the range of > $40M from the NIH since 2001), knowledge inference through GO is widely applied in academic and industry research.
In this webcast I aim to introduce the three main branches localization, function and process that we use in GO and demonstrate how they're immediately applicable in the IoT — after all, a cell is just a large, interconnected system. I will further discuss relationship types that we use in the annotation of biological entities, and propose a few that are more appropriate for the IoT. I will contrast this relatively simple system with other ontologies suggested for the IoT. It is not my aim to sell GO as a one-size-fits-all, but talk about how building a large ontology has taught us pragmatism that is quite remote from many purely academic ontology proposals.
Data Science London - Meetup, 28/05/15 – Boris Adryan
Slides from my @ds_ldn talk about Ontologies in the Internet of Things. Note that this is a short version of a talk that I presented earlier this month on O'Reilly Webcasts, still viewable for a while at: http://www.oreilly.com/pub/e/3365
Using big data and implementing Hadoop is a trend that people jump to all too quickly. Instead, understanding the runtime complexity of one's algorithms, reducing that complexity, and managing the process from start to finish in a lean and agile way can yield massive cost savings - or even save your organization.
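As a concrete (and entirely hypothetical) illustration of that argument, the sketch below contrasts a quadratic duplicate check with a linear, hash-set based one; on a single machine the second version often removes the need for a cluster at all. The record layout and field name are assumptions made up for this example, not taken from the talk.

# Hypothetical sketch: reduce algorithmic complexity before reaching for Hadoop.
def duplicate_ids_quadratic(records):
    # O(n^2): compares every record against every other record.
    dupes = set()
    for i, a in enumerate(records):
        for b in records[i + 1:]:
            if a["id"] == b["id"]:
                dupes.add(a["id"])
    return dupes

def duplicate_ids_linear(records):
    # O(n): a single pass with a hash set; often fast enough on one machine.
    seen, dupes = set(), set()
    for r in records:
        if r["id"] in seen:
            dupes.add(r["id"])
        seen.add(r["id"])
    return dupes

if __name__ == "__main__":
    sample = [{"id": i % 1000} for i in range(5000)]
    assert duplicate_ids_quadratic(sample) == duplicate_ids_linear(sample)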
Do you want to know how real users are interacting with your product? Do you want to know which features they don’t use? Would you like to understand how your product works internally under real operational conditions? Then you need telemetry—the instrumentation of your product to record this information and transmit it back to you for analysis. Windows 10 implemented this capability. Today, there are more than 450 million devices running Windows 10 providing constant feedback on its operation. Ken Johnston says Microsoft learned a lot about what they did right for that launch—and what they have to do better going forward. Ken shares the internal architecture of Microsoft Windows Telemetry and how its data cloud is constructed. He describes how Windows implemented privacy controls for users, how they gather and categorize customer feedback, and what they learned by sharing features early with users. All this infrastructure is now an integral part of how Microsoft prioritizes bugs and feature requests. Learn how to add telemetry to your applications to answer your use and quality questions.
Big Data made easy in the era of the Cloud – Demi Ben-Ari
Talking about the ease of use and handling Big Data technologies in the Cloud. Using Google Cloud Platform and Amazon Web Services and all of the tools around it.
Showing the problems and how we can solve them with simple tools.
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at... – Yael Garten
2017 StrataHadoop SJC conference talk. https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/56047
Description:
So, you finally have a data ecosystem with Kafka and Hadoop both deployed and operating correctly at scale. Congratulations. Are you done? Far from it.
As the birthplace of Kafka and an early adopter of Hadoop, LinkedIn has 13 years of combined experience using Kafka and Hadoop at scale to run a data-driven company. Both Kafka and Hadoop are flexible, scalable infrastructure pieces, but using these technologies without a clear idea of what the higher-level data ecosystem should be is perilous. Shirshanka Das and Yael Garten share best practices around data models and formats, choosing the right level of granularity of Kafka topics and Hadoop tables, and moving data efficiently and correctly between Kafka and Hadoop and explore a data abstraction layer, Dali, that can help you to process data seamlessly across Kafka and Hadoop.
Beyond pure technology, Shirshanka and Yael outline the three components of a great data culture and ecosystem and explain how to create maintainable data contracts between data producers and data consumers (like data scientists and data analysts) and how to standardize data effectively in a growing organization to enable (and not slow down) innovation and agility. They then look to the future, envisioning a world where you can successfully deploy a data abstraction of views on Hadoop data, like a data API as a protective and enabling shield. Along the way, Shirshanka and Yael discuss observations on how to enable teams to be good data citizens in producing, consuming, and owning datasets and offer an overview of LinkedIn’s governance model: the tools, process and teams that ensure that its data ecosystem can handle change and sustain #DataScienceHappiness.
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha... – Shirshanka Das
So, you finally have a data ecosystem with Kafka and Hadoop both deployed and operating correctly at scale. Congratulations. Are you done? Far from it.
As the birthplace of Kafka and an early adopter of Hadoop, LinkedIn has 13 years of combined experience using Kafka and Hadoop at scale to run a data-driven company. Both Kafka and Hadoop are flexible, scalable infrastructure pieces, but using these technologies without a clear idea of what the higher-level data ecosystem should be is perilous. Shirshanka Das and Yael Garten share best practices around data models and formats, choosing the right level of granularity of Kafka topics and Hadoop tables, and moving data efficiently and correctly between Kafka and Hadoop and explore a data abstraction layer, Dali, that can help you to process data seamlessly across Kafka and Hadoop.
Beyond pure technology, Shirshanka and Yael outline the three components of a great data culture and ecosystem and explain how to create maintainable data contracts between data producers and data consumers (like data scientists and data analysts) and how to standardize data effectively in a growing organization to enable (and not slow down) innovation and agility. They then look to the future, envisioning a world where you can successfully deploy a data abstraction of views on Hadoop data, like a data API as a protective and enabling shield. Along the way, Shirshanka and Yael discuss observations on how to enable teams to be good data citizens in producing, consuming, and owning datasets and offer an overview of LinkedIn’s governance model: the tools, process and teams that ensure that its data ecosystem can handle change and sustain #datasciencehappiness.
Time Difference: How Tomorrow's Companies Will Outpace Today's – Inside Analysis
The Briefing Room with Mark Madsen and WebAction
Live Webcast Feb. 10, 2015
Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=fa83c6283de99dfb6f38b9d7199cb452
In our increasingly interconnected world, the windows of opportunity for meaningful action are shrinking. Where hours once sufficed, minutes are now the norm. For some transactions, seconds make all the difference, even sub-seconds. Meeting these demands requires a new approach to information architecture, one that embraces the many innovations that are fundamentally changing the data-driven economy.
Register for this episode of The Briefing Room to hear veteran Analyst Mark Madsen of Third Nature as he explains how a confluence of advances are changing the nature of data management. He'll be briefed by Sami Akbay of WebAction, who will showcase his company's real-time data platform, designed from the ground up to meet the challenges of leveraging Big Data in concert with all manner of operational enterprise systems.
Visit InsideAnalysis.com for more information.
A presentation delivered by Mohammed Barakat on the 2nd Jordanian Continuous Improvement Open Day in Amman. The presentation is about Data Science and was delivered on 3rd October 2015.
IT Operation Analytic for security - MiSSconf(sp1) – stelligence
IT Operation Analytic: using anomaly detection and unsupervised machine learning to distinguish normal from abnormal behavior and to enhance the efficiency of SIEM detection and alerting capability.
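The kind of unsupervised anomaly detection described here can be sketched in a few lines with scikit-learn's IsolationForest. The features (events per host per hour and distinct destination ports) and the contamination rate are assumptions invented for illustration, not the speaker's actual SIEM pipeline.

# Minimal anomaly-detection sketch for host telemetry (illustrative only).
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Simulated per-host features: [events_per_hour, distinct_dest_ports]
normal = rng.normal(loc=[200, 12], scale=[30, 3], size=(500, 2))
abnormal = rng.normal(loc=[2000, 150], scale=[100, 20], size=(5, 2))
X = np.vstack([normal, abnormal])

model = IsolationForest(contamination=0.01, random_state=0).fit(X)
labels = model.predict(X)  # +1 = normal, -1 = anomaly
print(f"flagged {(labels == -1).sum()} hosts for SIEM review")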
Performance Quality Metrics for Mobile Web and Mobile Native - Agile Testing ... – Andreas Grabner
Five real-life examples of why mobile web and mobile native apps failed, and which metrics would have shown the problem early on.
Using these metrics along your delivery chain gets you closer to a fully automated deployment pipeline while also making sure performance criteria are met.
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017 – Demi Ben-Ari
Once you start working with distributed Big Data systems, you start discovering a whole bunch of problems you won't find in monolithic systems.
All of a sudden, monitoring all of the components becomes a big data problem itself.
In the talk we'll mention all of the aspects that you should take into consideration when monitoring a distributed system built on tools like Web Services, Apache Spark, Cassandra, MongoDB, and Amazon Web Services.
Beyond the tools, what should you monitor about the actual data that flows in the system?
We'll cover the simplest solution built from your day-to-day open source tools; the surprising thing is that it comes not from an Ops guy.
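One simple way to read "monitor the data, not just the machines" is a freshness and volume check on each feed: compare today's record count against a trailing baseline and alert on large drops. The sketch below is a toy version with assumed feed names and thresholds; in practice the counts would come from your own stores and the alert would go to whatever open source monitoring stack you already run.

# Toy data-volume monitor (assumed feeds and thresholds, for illustration).
def check_feed_volumes(daily_counts, history, drop_threshold=0.5):
    alerts = []
    for feed, count_today in daily_counts.items():
        past = history.get(feed, [])
        baseline = sum(past) / len(past) if past else 0
        if baseline and count_today < drop_threshold * baseline:
            alerts.append(f"{feed}: {count_today} rows vs baseline {baseline:.0f}")
    return alerts

history = {"web_events": [1_000_000, 980_000, 1_020_000]}
today = {"web_events": 400_000}
print(check_feed_volumes(today, history))  # -> volume-drop alert for web_events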
PXL Data Engineering Workshop By Selligent – Jonny Daenen
On 2020-12-09 Laurens Vijnck and Jonny Daenen gave a workshop at PXL.
During this session, we collectively provisioned a streaming ingestion pipeline in mere minutes. The technology stack included Pub/Sub, Dataflow, and BigQuery. Afterwards, students had the opportunity to run interactive queries on their own real-time data to answer a series of business questions. These questions were borrowed from real-life cases that we encountered at Selligent Marketing Cloud.
Google Colab (Free Jupyter Notebooks) and Google Data Studio have proven to be excellent tools to facilitate these kinds of interactive sessions.
Rental Cars and Industrialized Learning to Rank with Sean Downes – Databricks
Data can be viewed as the exhaust of online activity. With the rise of cloud-based data platforms, barriers to data storage and transfer have crumbled. The demand for creative applications and learning from those datasets has accelerated. Rapid acceleration can quickly accrue disorder, and disorderly data design can turn the deepest data lake into an impenetrable swamp.
In this talk, I will discuss the evolution of the data science workflow at Expedia with a special emphasis on Learning to Rank problems. From the heroic early days of ad-hoc Spark exploration to our first production sort model on the cloud, we will explore the process of industrializing the workflow. Layered over our story, I will share some best practices and suggestions on how to keep your data productive, or even pull your organization out of the data swamp.
Data Science in the Real World: Making a Difference – Srinath Perera
We use the terms “Big Data” and “Data Science” for the use of data processing to make sense of the world around us. Spanning many fields, Big Data brings together technologies like distributed systems, machine learning, statistics, and the Internet of Things. It is a multi-billion-dollar industry including use cases like targeted advertising, fraud detection, product recommendations, and market surveys. With new technologies like the Internet of Things (IoT), these use cases are expanding to scenarios like smart cities, smart health, and smart agriculture.
These use cases rely on basic analytics, advanced statistical methods, and predictive technologies like machine learning. However, it is not just about crunching the data. Some use cases, like urban planning, can be slow, leaving enough time to process the data. With use cases like traffic, patient monitoring, and surveillance, however, the value of results degrades much faster with time, and results are needed within milliseconds to seconds. Collecting data from many sources, cleaning it up, processing it on computation clusters, and doing all of this fast is a major challenge.
This talk will discuss motivation behind big data and data science and how it can make a difference. Then it will discuss the challenges, systems, and methodologies for implementing and sustaining a data science pipeline.
Data Scientist has been regarded as the sexiest job of the twenty-first century. As data in every industry keeps growing, the need to organize, explore, analyze, predict, and summarize it is insatiable. Data science is creating new paradigms in data-driven business decisions. As the field emerges from its infancy, a wide range of skill sets is becoming an integral part of being a data scientist. In this talk I will discuss the different data-driven roles and the expertise required to be successful in them. I will highlight some of the unique challenges and rewards of working in a young and dynamic field.
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Milan 2017 - D... – Demi Ben-Ari
Once you start working with distributed Big Data systems, you start discovering a whole bunch of problems you won't find in monolithic systems.
All of a sudden, monitoring all of the components becomes a big data problem itself.
In the talk we'll mention all of the aspects that you should take into consideration when monitoring a distributed system built on tools like Web Services, Apache Spark, Cassandra, MongoDB, and Amazon Web Services.
Beyond the tools, what should you monitor about the actual data that flows in the system?
We'll cover the simplest solution built from your day-to-day open source tools; the surprising thing is that it comes not from an Ops guy.
Demi Ben-Ari - Monitoring Big Data Systems Done "The Simple Way" - Codemotion... – Codemotion
Once you start working with Big Data systems, you discover a whole bunch of problems you won't find in monolithic systems. Monitoring all of the components becomes a big data problem itself. In the talk we'll mention all of the aspects that you should take into consideration when monitoring a distributed system built on tools like web services, Spark, Cassandra, MongoDB, and AWS. Beyond the tools, what should you monitor about the actual data that flows in the system? We'll cover the simplest solution built from your day-to-day open source tools; the surprising thing is that it comes not from an Ops guy.
Why We Need Diversity in Testing - Accenture – TEST Huddle
In this webinar Rasa (Testing capability lead for Denmark) and Matthias (EALA Testing capability lead) will share some of their own experiences of why diversity matters, give insights into how Accenture as a global firm is promoting diversity, and explain how we are in the process of changing our attitudes and processes to make all of this sustainable.
Keys to continuous testing for faster delivery - EuroSTAR webinar – TEST Huddle
Your business needs to deliver faster. To accommodate, Development needs to introduce fewer changes but in a much more frequent cadence. This creates a challenge for test teams to keep up with the rapid pace of change without compromising on quality. Automation is paramount to the success or failure of Continuous Delivery, and Continuous Testing enables early and frequent quality feedback throughout the CI/CD pipeline.
In this webinar, Eran and Ayal will explore how to implement Continuous Testing to ensure high-quality releases in a Continuous Delivery environment, including what to test and when to automate new functionality in order to optimize your efforts.
In this webinar Carsten will explore the role of the tester in a Scrum team. He will examine where the tester plays an important role in Scrum and how you can contribute to a team's performance.
Leveraging Visual Testing with Your Functional Tests – TEST Huddle
Designing and implementing (or selecting) the right automation strategy for functional testing, combined with visual testing, can give your project greater test coverage while improving test scalability.
This talk suggests how we might make sense of the tools landscape of the near future, where the pressure to modernise processes and automate is greatest, and what a new test process supported by tools might look like.
Takeaways:
- We need to take machine learning in testing seriously, but it won’t be taking our jobs just yet
- We don’t need more test automation tools; today we need tools that capture tester knowledge
- Tools that learn and think can't work for testers until we solve the knowledge capture challenge.
View On-Demand Webinar: https://youtu.be/EzyUdJFuzlE
In this session, we’ll write tests and code for solving a real Star Wars problem. And we’ll discuss what we’re doing, refine our specs, as well as see what changes in the design tell us.
View On-Demand Webinar: https://huddle.eurostarsoftwaretesting.com/resource/test-management/tdd-rest-us/
Scaling Agile with LeSS (Large Scale Scrum) – TEST Huddle
In this webinar, Elad will cover the principles that the #LeSS framework has to offer in order to enable big organisations to become agile.
View webinar recording - https://huddle.eurostarsoftwaretesting.com/resource/agile-testing/scaling-agile-less-large-scale-scrum/
Creating Agile Test Strategies for Larger Enterprises – TEST Huddle
Having difficulty creating an agile test strategy for your company? Let Testing Excellence Award winner, Derk-Jan de Grood, show you how it’s done
View webinar recording here - http://huddle.eurostarsoftwaretesting.com/resource/agile-testing/creating-agile-test-strategies-larger-enterprises/
3 key takeaways
- Do you know the meaning of your organisation, system, product?
- Can you deliver the important risks right away?
- How can you communicate about the (process and product) risks you're dealing with?
View Webinar recording: https://huddle.eurostarsoftwaretesting.com/resource/test-management/is-there-a-risk/
Growing a Company Test Community: Roles and Paths for Testers – TEST Huddle
Over the past three years, our company’s test team has grown from three lonesome testers to a community of nine – with more planned. Since we don’t see testers as “click monkeys”, but as valuable and integrated project members who bring a specific skill set to the table, it’s important for us to choose testers well and to train them in various areas so that they can contribute, grow and see their own career path within testing.
To structure our internal tester training program, we have been developing role descriptions, education paths and career options for our testers, which I'd like to share with you in this webinar.
View webinar - https://huddle.eurostarsoftwaretesting.com/resource/webinar/growing-company-test-community-roles-paths-testers/
It’s the same argument again and again. One side says “team members should all be able to do everything, and the programmers should do their testing and all testers should be writing code”. The other side says “No, that can’t possibly work – programmers don’t know how to test, they don’t have the right mindset”. And on and on it goes.
http://huddle.eurostarsoftwaretesting.com/resource/webinar/need-testers-agile-teams/
In this webinar, Dave Haeffner (Elemental Selenium, USA) discusses how to:
- Build an integrated feedback loop to automate test runs and find issues fast
- Setup your own infrastructure or connect to a cloud provider
- Dramatically improve test times with parallelization
https://huddle.eurostarsoftwaretesting.com/resource/webinar/use-selenium-successfully/
Practical Test Strategy Using Heuristics – TEST Huddle
Key Takeaways
- See what makes a good test strategy
- Learn how to make a thorough test strategy
- Identify what the ‘Heuristic Test Strategy Model’ is
- Develop a solid test strategy that fits fast
- Discover how diversification can help you to create a test strategy
Key Takeaways:
- A diagramming method that helps discuss roles
- A one page analysis heuristic for roles
- Why roles matter on projects
https://huddle.eurostarsoftwaretesting.com/resource/people-skills/thinking-through-your-role/
Key Takeaways:
- What will this release contain
- What impact will it have on your test runs
- How can you preserve your existing investment in tests using the Selenium WebDriver APIs, and your even older RC tests
- Looking forward, when will the W3C spec be complete
- What can we expect from Selenium 4
https://huddle.eurostarsoftwaretesting.com/
New Model Testing: A New Test Process and Tool – TEST Huddle
In this webinar, Paul described his experiences of building and using a bot for paired testing and also proposed a new test process suitable for both high-integrity and agile environments. His bot, codenamed System Surveyor, builds a model of the system as you explore, captures test ideas, risks and questions, and generates structured test documentation as a by-product.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... – John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Levelwise PageRank with Loop-Based Dead End Handling Strategy: SHORT REPORT ... – Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation of ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
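For readers unfamiliar with the baseline being compared against, a generic "monolithic" PageRank power iteration looks roughly like the textbook sketch below, including a naive way of handling dead ends by spreading their rank uniformly. This is not the report's CPU/GPU implementation, and the tiny example graph is invented.

# Textbook-style PageRank power iteration (not the report's implementation).
def pagerank(graph, damping=0.85, iters=50):
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            out = graph[v]
            if out:  # distribute rank along outgoing edges
                share = damping * rank[v] / len(out)
                for w in out:
                    new[w] += share
            else:  # dead end: spread rank uniformly over all vertices
                for w in nodes:
                    new[w] += damping * rank[v] / n
        rank = new
    return rank

print(pagerank({"a": ["b"], "b": ["a", "c"], "c": []}))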
As Europe's leading economic powerhouse and the fourth-largest economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like Russia and China, Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to Advanced Persistent Threats (APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
1. Big Data: The Magic to Attain New Heights
Ken Johnston, Principal Data Science Manager
Twitter – @rkjohnston
Blog – http://linkedin.com/in/rkjohnston
Email – kenj@Microsoft.com
LinkedIn - http://linkedin.com/in/rkjohnston
@rkjohnston #DataMagic
2. About Ken
Data Scientist in the Core Data Science Team
Office Live, WebApps, Office Online
Cosmos, AutoPilot, Local, Shopping
Kanban and Data Science series on LinkedIn
EaaSy & MVQ – Everything as a Service & Minimum Viable Quality
Writes books and a blog, and some fiction
16. Six Keys to a “Big” Magic Show
Try, Try, Try Again
The Tyranny of Counting
Magic Tricks (A/B Testing, Runtime Flags)
The Venue (Big Data Infrastructure)
Foundation (Tools for Big Data)
Security (Protection, Privacy, Fraud)
The Assistant (Recruit, Train, & Retain)
“Big Data” Search Trends
@rkjohnston #DataMagic
18. Common Design Patterns
Good paper to read – IDC: Six Patterns of Big Data and Analytics Adoption: The Importance of the Information Architecture
Ingest: from services, IoT, apps; via streams; into storage
Process: build pipelines; reduce, transform, join; pipe out
Analyze: from services, IoT, apps; via streams; into storage
19. Azure Model
Cindy Gross – Technical Fellow: Big Data and Cloud
Twitter: @SQLCindy cindyg@NealAnalytics.com
Ingest
Process
Analyze
23. Prototypical Big Data Platform
Clients 1–3 send data to a Telemetry Front End Service.
A fast pipeline for high-priority data feeds an Alerting DB and Alerting Dashboard.
The Big Data Map/Reduce Cloud feeds a PII Scrubbing Service and Data Extraction Service, which populate Insights DB 1 … Insights DB N and additional reporting dashboards.
Personally Identifiable Information (PII) management is very critical.
Data Driven Quality (DDQ) and big data pipelines will need a cloud platform.
The superfast pipeline typically (not always) bypasses the cloud and is also void of PII.
Big Data & ML Model Orchestration sits across the platform.
@rkjohnston #DataMagic
24. Prototypical Big Data Platform, mapped to Ingest / Process / Analyze
The same diagram as the previous slide (Clients, Telemetry Front End Service, fast alerting pipeline, Big Data Map/Reduce Cloud, PII Scrubbing and Data Extraction Services, Insights DBs, reporting dashboards, Big Data & ML Model Orchestration), now grouped into the Ingest, Process, and Analyze stages.
@rkjohnston #DataMagic
25. User Segmentation Approaches
• Risk Tolerance Model: users segment themselves and opt in for greater risk with a reward in mind
• Profile Based: usage behaviors (new vs. power users), browser type, connection type, device and device OS
@rkjohnston #DataMagic
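A profile-based segmentation rule in the spirit of this slide can be as small as the sketch below. The attribute names and cut-offs are assumptions made up for illustration, not the actual segments used in the talk.

# Hypothetical profile-based user segmentation (illustrative cut-offs).
def segment(user):
    if user.get("opted_into_beta"):
        return "risk-tolerant (self-selected)"
    if user.get("sessions_last_30d", 0) >= 20:
        return "power user"
    if user.get("days_since_signup", 999) <= 14:
        return "new user"
    return "mainstream"

for u in [{"opted_into_beta": True},
          {"sessions_last_30d": 42, "days_since_signup": 400},
          {"sessions_last_30d": 2, "days_since_signup": 5}]:
    print(segment(u))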
26. Balancing Speed and Risk with Rings
Ring 0: Buddy Build
Ring 1: My Team
Ring 2: Company & NDA
Ring 3: External Beta Users
Ring 4: Everyone
The rings run from users whose risk tolerance is highest to those with no desire for risk.
The red line demarks disclosure risk and possible loss of patent rights.
@rkjohnston #DataMagic
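The speaker notes say progression to wider rings is gated on quality measures and telemetry; a toy version of such a gate might look like the sketch below. The metric names and thresholds are invented for illustration.

# Toy telemetry-gated ring promotion (invented gate metrics and thresholds).
RINGS = ["Ring 0: Buddy Build", "Ring 1: My Team", "Ring 2: Company & NDA",
         "Ring 3: External Beta Users", "Ring 4: Everyone"]
GATES = {"crash_free_sessions": 0.995, "p95_latency_ms": 800}

def can_promote(telemetry):
    return (telemetry["crash_free_sessions"] >= GATES["crash_free_sessions"]
            and telemetry["p95_latency_ms"] <= GATES["p95_latency_ms"])

def promote(current_ring, telemetry):
    idx = RINGS.index(current_ring)
    if idx + 1 < len(RINGS) and can_promote(telemetry):
        return RINGS[idx + 1]
    return current_ring  # hold (or roll back) until the measures recover

print(promote("Ring 1: My Team",
              {"crash_free_sessions": 0.998, "p95_latency_ms": 420}))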
29. Office 365 Advanced Threat Protection
A big-data-only solution
Safe Links is powered by cloud Exchange & Bing data
The AI model is powered by data from thousands of companies and attachments
@rkjohnston #DataMagic
33. App Store Data Architecture
Inputs: App Certification and Analysis Pipeline; Store Services Log and Telemetry; Bing Spam and Malware; Windows Services Safety Platform (MSA, SmartScreen, etc.); MMPC/Spynet
Signals: network IPs, file hashes, PhotoDNA, strings, APIs called, user install data, ratings and reviews, purchases, geographic data, account reputation, bad URLs, botnet-infected clients
All landing in Cosmos Storage and Compute
BTW, this was not Big Data
34. NoName was learning basic data science, and NoName and I were spitballing ideas:
“Look at how I did this k-means clustering and found these weird outliers in buying circles, from Dev accounts created the same week and from the same IP address.”
“Check it out, I found this guy's FB page. We have his picture!”
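The exploratory pass described in that exchange, clustering developer accounts and eyeballing the points far from any centroid, can be reconstructed roughly as below. The features and data are fabricated; only the general k-means-then-outliers idea comes from the slide.

# Rough reconstruction of a k-means outlier hunt over developer accounts.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)
# Fabricated features per account: [days_since_signup, purchases_from_same_ip]
legit = rng.normal(loc=[300, 1], scale=[120, 1], size=(300, 2))
ring = rng.normal(loc=[5, 40], scale=[2, 5], size=(8, 2))  # same-week, same-IP buying circle
X = np.vstack([legit, ring])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
suspects = np.argsort(dist)[-10:]  # the ten accounts farthest from their centroid
print("accounts worth a closer look:", sorted(suspects.tolist()))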
35. Fraud Network Identification
XXXDeveloper created 40 different Store developer accounts and hundreds of apps.
Bad Dev 1, Bad Dev 2, … Bad Dev 'N' (including new identities) are linked through shared signals: payment instruments, shared fraudulent payment instruments, app similarity, social networks, 3rd-party app stores, app metadata (URLs, websites), developer watering holes, and identity metadata.
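One common way to turn those shared signals into a fraud network is to link any two accounts that share a payment instrument or other metadata and then treat each connected component as a candidate ring. The sketch below uses networkx with invented accounts and signals; it illustrates the idea, not the production system.

# Candidate fraud rings via shared signals and connected components (toy data).
import networkx as nx

accounts = {
    "BadDev1": {"card:1111", "url:cheap-apps.example"},
    "BadDev2": {"card:1111", "url:other.example"},
    "BadDevN": {"card:2222", "url:cheap-apps.example"},
    "HonestDev": {"card:9999"},
}

G = nx.Graph()
G.add_nodes_from(accounts)
for a in accounts:
    for b in accounts:
        if a < b and accounts[a] & accounts[b]:  # any shared signal links the accounts
            G.add_edge(a, b)

rings = [sorted(c) for c in nx.connected_components(G) if len(c) > 1]
print("candidate fraud rings:", rings)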
41. Code Churn Example 1: a six-week coding milestone
Code churn is cumulative. Imagine this as part of a larger multi-layered project (Layer 1, Layer 2, Layer 3).
• Tightly coupled layers
• Long stabilization phase
• Complicated end-to-end integration
Sim-ship increases risk; the maximum point of instability is at the end of the milestone.
@rkjohnston #DataMagic
42. Code Churn Example 2 (Continuous Deployment)
Layers 1 through N, released on a rapid cadence (weekly or daily); max risk is in Production.
• Risk per release decreases because of more incremental change
• You still must be careful of risk within Production, but…
• Total risk over time can be less with incremental change
@rkjohnston #DataMagic
45. Measures = Test Cases
• We do measures
• What is a post-release test case?
• Automation validates the golden path
• We measure the golden path
• Measures are the same as test cases
• Monitor the golden path
@rkjohnston #DataMagic
46. >1.5*IQR = Outlier = Bug (probably)
• What is a test case? What I expect to happen vs. what does happen
• A test case is binary
• Measures can observe success and failure
• Measures have a history of pass/fail
• When pass or fail rates drift from standard expected rates, we find outliers
• Outliers are often bugs
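The ">1.5*IQR" rule on the slide translates directly into code: compute the interquartile range of a measure's history and flag values outside 1.5 IQR of the quartiles. The pass-rate numbers below are invented for illustration.

# The 1.5*IQR outlier rule applied to a measure's daily pass-rate history.
import statistics

def iqr_outliers(values, k=1.5):
    q1, _, q3 = statistics.quantiles(values, n=4)  # lower and upper quartiles
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

pass_rates = [0.97, 0.98, 0.98, 0.99, 0.97, 0.98, 0.71, 0.98]  # one bad day
print(iqr_outliers(pass_rates))  # -> [0.71], probably a bug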
47. Rings + Speed + Data = Success
• When speed increases the need for telemetry increases
• The rings model provides a buffer
@rkjohnston #DataMagic
51. Generic Service Stack
Stack: Service UX Front Door → Service Auth/Identity → Layer A vCurrent → Layer B vCurrent → Service Layer C (Persistent Data Store)
Production traffic follows the default path.
Front door servers handle logging and access management; UX rendering layers, identity or authentication layers, and persistent data layers sit below.
@rkjohnston #DataMagic
52. Runtime Flags Example 1: Side-by-Side Deployments
The same stack (Service UX Front Door, Service Auth/Identity, Layer A vCurrent, Layer B vCurrent, Service Layer C as the persistent data store), with Layer B vNext deployed side by side.
Runtime flags direct traffic through the stack and are used to test vNext before full release.
Production traffic takes the default path; test or forked traffic takes the runtime-flag path into Layer B vNext.
53. Runtime Flags Example 2: N Test Environments
The same stack (Service UX Front Door, Service Auth/Identity, Layer A vCurrent, Layer B vCurrent, Service Layer C as the persistent data store).
Production traffic takes the default path; test cases and check-in tests take runtime-flag paths into the Layer A and Layer B test paths.
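A runtime flag in these two examples is essentially a per-request switch that decides which version or path of a layer handles the call. A toy routing function might look like the sketch below; the flag names and layer labels are invented for illustration.

# Toy runtime-flag routing for Layer B (invented flag names).
def route_layer_b(request):
    flags = request.get("flags", set())
    if "layerB:test" in flags:
        return "Layer B test path"
    if "layerB:vnext" in flags:
        return "Layer B vNext"
    return "Layer B vCurrent"  # default path for production traffic

print(route_layer_b({"flags": set()}))             # production default
print(route_layer_b({"flags": {"layerB:vnext"}}))  # forked test traffic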
54. Apps as a Service: Facebook
"How Facebook secretly redesigned its iPhone app with your help"
"…a system for creating alternate versions… within the native app."
"The team could then turn on certain new features for a subset of its users, directly…"
"…a system of 'different types of Legos'… and see the results on the server in real time."
From an article on The Verge by Dieter Bohn, September 18, 2013
@rkjohnston #DataMagic
56. Typical Industry Staffing
Data Scientist and Data Engineer roles divide a broad skill set: visualization, machine learning, extract/load/transform, data architecture, operations and monitoring, big data infrastructure & storage, DB administration, statistics, math, programming, modeling, storytelling, and data exploration.
http://www.datasciencecentral.com/profiles/blogs/difference-between-data-engineers-and-data-scientists
57. Blended Role for Agile
A single blended Data Scientist/Data Engineer role covers the same skill set: visualization, machine learning, extract/load/transform, data architecture, operations and monitoring, big data infrastructure & storage, DB administration, statistics, math, programming, modeling, storytelling, and data exploration.
@rkjohnston #DataMagic
58. Process and Culture impact Retention
A Kanban board (Backlog / Doing / Validation / Done) of data science work items, for example: "LDA vs PCA vs A13 before stratified sampling", "MLADS ARPD rehearsal", "Submit abstract to Strata + Hadoop World", "Edge Experiment 1 data processing", "Edge Experiment 2", "Customer sat and post-sales monetization factors analysis", "Install base decay rate estimation using Bayesian model", "Friday review slides for Edge Experiment 1", "Edge Experiment 1 insights analysis", "Top enterprise DSAT list from textual analysis", "Business entity graph with DUNS, domain name, & TaxIDs", "Open source entity graph visualization technology research", "Submit paper to Informs 2016", "ARPD V3 model with FFF", "MLADS ARPD slides draft 1", "Device Lifetime Value (LTV) model 2".
• Kanban for project management
• Balance long- and short-term impact
• Participate in industry papers and reviews
@rkjohnston #DataMagic
59. Trying Again & Again
Advantages and disadvantages of the counting culture
61. The 5 Vs of Big Data
Nine months ago there were only three Vs.
Volume, Variety, Velocity, Verification, Value
Verification – managing data quality and access control at all points
62. Must Count More
Counting more granular
Make it go up and to the right
Is vs. Likely
Business impact is a given
Drives behavior (especially if tied to compensation)
63. MVP in a Nutshell
A quadrant over the space of possible features, with Minimum on one axis and Viable on the other.
Minimum + Viable: good features to test the users' responses.
Minimum but not viable: bad user experience; too minimal a set, or the wrong set of features; will not engage users enough to gain valuable insights.
Viable but not minimum: the product you want to build, but delivering all features will take too long.
Neither: wasted work adding features that do not add critical value for winning and retaining customers.
64. Minimum Viable Model (MVM)
A quadrant over the space of possible data and possible features, with Minimum on one axis and Viable on the other.
Viable: the model should provide enough coverage that it can be used for core insights; if precision is too low, the model can't be trusted for even first-level insights.
Minimum: many models try to include all data and large numbers of attributes, but that slows down innovation; more features can increase complexity without significant improvement in precision and recall.
Minimum + Viable: an ideal MVM uses a modest amount of data, implements a relatively simple initial algorithm, has good precision (we aim for 98% or more) and enough recall to be used for core insights.
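The precision bar mentioned on the slide (98% or more, plus enough recall for core insights) can be turned into a simple ship gate. The recall floor below is an assumption chosen for illustration, not a number from the talk.

# Gate a candidate "minimum viable model" on precision and recall thresholds.
from sklearn.metrics import precision_score, recall_score

def mvm_ready(y_true, y_pred, min_precision=0.98, min_recall=0.30):
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    return p >= min_precision and r >= min_recall, p, r

y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0, 0, 0]  # conservative model: few false positives
ok, p, r = mvm_ready(y_true, y_pred)
print(f"precision={p:.2f} recall={r:.2f} usable_for_core_insights={ok}")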
65. Keep your eye on the target
The goal is not to get a bullseye every time.
The goal is to get the data and learn.
67. Six Keys to a “Big” Magic Show
Try, Try, Try Again
The Tyranny of Counting
Magic Tricks (A/B Testing, Runtime Flags)
The Venue (Big Data Infrastructure)
Foundation (Tools for Big Data)
Security (Protection, Privacy, Fraud)
The Assistant (Recruit, Train, & Retain)
“Big Data” Search Trends
@rkjohnston #DataMagic
68. Big Data: The Magic to Attain New Heights
Ken Johnston, Principal Data Science Manager
Twitter – @rkjohnston
Blog – http://linkedin.com/in/rkjohnston
Email – kenj@Microsoft.com
LinkedIn - http://linkedin.com/in/rkjohnston
@rkjohnston #DataMagic
Editor's Notes
They aren’t afraid to get their hands dirty in the data.
They are uniquely gifted at connecting the dots.
Through data they make original and deep insights.
I have to tell them all the time just how amazing they are.
My son gets this magic kit in a box. Within an hour of playing with it he comes to tell me how he’s going to be a magician and we have to throw a magic show.
I thought I’d use his idea of creating a magic show as a way to talk about the magic of data science.
Presenter guidance:
Share how we think about the data platform in the cloud. Today, we’ll specifically talk about SQL in a VM (briefly), SQL DB, DocumentDB, HBase on HDInsight, and Tables/Blobs. There are lots of other adjacent services such as Redis Cache, Event Hubs, HDInsight, Azure ML, Data Factory, Stream Analytics that will not be addressed in this deck.
Slide talk track:
The top row is Power BI – you’re making decisions based on data
The middle row is ML, Stream Analytics, HDInsight, and Data Factory – processing and making sense of the data
The bottom row is where you ingest and store data -
With Azure, organizations have access to a whole range of services that allow them to use the right tool for the right job when developing applications.
In the cloud, organizations can collect and manage data in the form in which it’s born and store it in the form that best suits an application’s needs.
Clients: Common Library but support multiple OS.
Front End: Telemetry and debug data come through Front End.
PII Scrubbing: Happens at client and again upon ingestion.
Cloud Platform: large scale, many developers, shared structured data. Cloud allows for elastic scaling
APIs and Query Service: Allows access to refined data. Often data is piped to a SQL Server for KPIs and deep analysis
Databases and Reporting Services: Deep analysis is usually done with tools like R Studio and Power Pivot for visualization. Dashboards monitor well known KPIs but are not insights.
Ring 0: Buddy Build – Build may not have been checked in, pass component to buddy developer
Ring 1: My Team – Should pass Unit and check-in tests
Ring 2: Company and NDA – Pushing to these users is based upon quality gates and telemetry measures. Further progression all telemetry based.
Ring 3: External Beta Users – Release based upon telemetry results. Release is metered by % and device models.
Ring 4: Everyone – Product is available for general adoption but may still use metered rollout.
Rings 2-4: leverage rolling deployments (small % at a time) with metrics to stop and roll back
Volume – How much data do you have and how much do you really need
Variety – What Data sources do you have and how can they be combined for more value
Velocity – Speed of data to insight impacts how you use it
Verification – managing data quality and access control at all points
Value – Big Data can be expensive and must produce valuable insights