Tachyon Nexus, the company, has received a lot of media attention since the day it was founded.
Different
=====
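This slide covers the media coverage of Tachyon Nexus.
Fortune magazine
The Wall Street Journal
Worth mentioning: Tachyon Nexus was recently named by US media as one of the ten storage startups most worth watching. It is the only one with an open source system, and the only one with a Chinese founder.
So the question is: why do we need Tachyon at this point in time, in this ecosystem?
Let's look at two hardware trends.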
=======
Why do we want something like Tachyon?
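The first trend:
On one hand, memory bandwidth grows every year, and it grows exponentially. On the other hand, the bandwidth growth of hard disks, the traditional storage device, is slowing down year by year.
This makes memory locality (keeping the needed data in memory) extremely important for low-latency data access.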
===
Here is the reason: on one hand, memory throughput increases exponentially each year; on the other hand, hard disk throughput also increases, but at a much slower rate.
As a consequence, memory locality is the key to achieving interactive response times for big table queries.
This trend is being recognized by more and more people and projects, and it has given rise to …
People have already observed these trends and are creating solutions to exploit memory. Then the question is,
There is still a missing link: a storage layer focused on memory, on top of which different applications can run.
===
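There is still a piece missing at the storage layer: one on top of which a wide range of applications can be built.
To show where the problem is, or in other words the pain point we want to solve, I will use Spark as an example.
Spark is an in-memory data processing framework.
The data is kept as a single copy inside the JVM.
Because there is only one copy, Spark uses lineage so that in-memory data lost to a crash or some other cause can be recovered.
The approach is shown below.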
====
As an example, let's take a look at Spark.
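To make the lineage idea concrete, here is a minimal Python sketch. It is not Spark's API: the `Dataset` class and its `map`, `filter`, `collect`, and `lose_cache` methods are hypothetical names made up for this illustration. Each dataset remembers its parent and the transformation that produced it, so a lost in-memory copy can be recomputed instead of being kept as a second replica.

```python
class Dataset:
    """Toy dataset that remembers how it was derived (its lineage)."""

    def __init__(self, parent=None, transform=None, source=None):
        self.parent = parent        # upstream Dataset, or None for a source
        self.transform = transform  # function applied to the parent's rows
        self.source = source        # zero-argument loader for a source dataset
        self.cache = None           # the single in-memory copy

    def map(self, fn):
        return Dataset(parent=self,
                       transform=lambda rows: [fn(r) for r in rows])

    def filter(self, pred):
        return Dataset(parent=self,
                       transform=lambda rows: [r for r in rows if pred(r)])

    def collect(self):
        # Recompute from lineage only when the cached copy is missing.
        if self.cache is None:
            if self.parent is None:
                self.cache = list(self.source())
            else:
                self.cache = self.transform(self.parent.collect())
        return self.cache

    def lose_cache(self):
        # Simulate losing the in-memory copy, e.g. after a crash.
        self.cache = None


base = Dataset(source=lambda: range(10))
evens_squared = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

first = evens_squared.collect()      # computed and cached
evens_squared.lose_cache()           # the only copy is gone
recovered = evens_squared.collect()  # rebuilt by replaying the lineage
```

The point of the sketch is the trade-off: recovery costs recomputation time, but no extra memory is spent on replicas.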
However, this does not solve every problem. For example, many companies have very complex pipelines, where the output of one job is the input of another.
====
Many companies have very complicated data processing pipelines, where the output of one job is the input of another.
Once the computation layer crashes, possibly because of a bug in the program or a hardware problem on the server, the data in that JVM is lost along with it.
=====
When the computation task crashes for some reason, the in-memory storage also dies, since both are in the same process.
When two tasks want to read the same data, both must load it into memory, which leads to data duplication. And since the data lives in the JVM, this can lead to GC issues.
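A small Python sketch of these two pain points, under simplifying assumptions: `Job`, `SharedStore`, and `load_from_disk` are hypothetical names for this illustration, and a dictionary stands in for process memory. Each job keeps its own cache, so the same data is loaded twice and is lost when the job crashes, while a store that lives outside the jobs holds one copy that survives them.

```python
class Job:
    """Toy compute job that caches loaded data inside its own process."""

    def __init__(self, name):
        self.name = name
        self.cache = {}  # lives and dies with the job

    def read(self, path, loader):
        if path not in self.cache:
            self.cache[path] = loader(path)
        return self.cache[path]

    def crash(self):
        # In-process data disappears together with the process.
        self.cache = {}


class SharedStore:
    """Toy out-of-process store: one copy, independent of any job."""

    def __init__(self):
        self.blocks = {}

    def read(self, path, loader):
        if path not in self.blocks:
            self.blocks[path] = loader(path)
        return self.blocks[path]


loads = []  # records every trip to slow storage


def load_from_disk(path):
    loads.append(path)
    return [1, 2, 3]


# Per-job caches: two jobs mean two loads and two copies of the same data.
job_a, job_b = Job("a"), Job("b")
job_a.read("/data/table", load_from_disk)
job_b.read("/data/table", load_from_disk)
loads_with_job_caches = len(loads)

# A crash wipes job_a's copy, so it has to load again.
job_a.crash()
job_a.read("/data/table", load_from_disk)
loads_after_crash = len(loads)

# Shared store: both readers hit the same single copy.
loads.clear()
store = SharedStore()
store.read("/data/table", load_from_disk)
store.read("/data/table", load_from_disk)
loads_with_shared_store = len(loads)
```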
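So, stepping back, why do these problems occur?
To get reliable in-memory data sharing across jobs, and even across frameworks, you need to solve the problem thoroughly at the storage layer rather than at the computation layer. This answers our opening question: why this ecosystem needs Tachyon.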
=======
Tachyon is designed to address these issues. With Tachyon, you can achieve …
Master: keeps track of metadata and the status of all the workers.
Worker: a single Tachyon worker node manages local space and stores data in the RAM disk.
Standby masters: for fast failover.
…
So, with these features, Tachyon can achieve high throughput.
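The master and worker roles described above can be sketched in a few lines of Python. This is a toy model, not Tachyon's code: the `Master` and `Worker` classes, the heartbeat mechanism, and the timeout value are assumptions made for the illustration, a dictionary stands in for the RAM disk, and standby masters are not modeled.

```python
class Worker:
    """Toy worker: manages local space and holds blocks in memory."""

    def __init__(self, worker_id, capacity_bytes):
        self.worker_id = worker_id
        self.capacity_bytes = capacity_bytes
        self.blocks = {}  # block_id -> bytes; stand-in for files on a RAM disk

    def used_bytes(self):
        return sum(len(data) for data in self.blocks.values())

    def store(self, block_id, data):
        # Refuse the block if it does not fit in the local space.
        if self.used_bytes() + len(data) > self.capacity_bytes:
            return False
        self.blocks[block_id] = data
        return True


class Master:
    """Toy master: tracks block metadata and the status of all workers."""

    def __init__(self, heartbeat_timeout):
        self.heartbeat_timeout = heartbeat_timeout
        self.last_heartbeat = {}   # worker_id -> time of last heartbeat
        self.block_locations = {}  # block_id -> set of worker_ids

    def heartbeat(self, worker, now):
        # A heartbeat refreshes worker status and reports its blocks.
        self.last_heartbeat[worker.worker_id] = now
        for block_id in worker.blocks:
            self.block_locations.setdefault(block_id, set()).add(worker.worker_id)

    def live_workers(self, now):
        return {wid for wid, seen in self.last_heartbeat.items()
                if now - seen <= self.heartbeat_timeout}

    def locate(self, block_id, now):
        # Only return workers that are still considered alive.
        return self.block_locations.get(block_id, set()) & self.live_workers(now)


master = Master(heartbeat_timeout=10)
w1, w2 = Worker("w1", capacity_bytes=4), Worker("w2", capacity_bytes=4)

stored = w1.store("b1", b"abc")         # fits: 3 of 4 bytes
rejected = w1.store("b2", b"toolarge")  # does not fit

master.heartbeat(w1, now=0)
master.heartbeat(w2, now=0)
found_early = master.locate("b1", now=5)

master.heartbeat(w2, now=12)            # w1 stays silent
found_late = master.locate("b1", now=20)
```

Time is passed in explicitly so the example is deterministic; a real system would read a clock.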
Thanks to the many members of the Tachyon community, many of whom are in this room!
Precisely because Tachyon is an open source project, it has attracted many different companies and organizations as users.
Since Tachyon is open source, it is easy for people to start using it.
…
It is interesting to see the different use cases of Tachyon.
Tachyon also supports a variety of under filesystems from different industries. Here are some of the supported under filesystems.
They span use cases such as big data, cloud, HPC, and enterprise, and it is exciting to see Tachyon work with those systems.
One of the primary concerns about Tachyon is that it only manages memory, which is still relatively scarce in many production environments. Tachyon 0.6 introduces tiered storage, which allows users to allocate additional storage devices to Tachyon instead of just memory. Currently memory, SSD, and HDD are the supported storage types. When using tiered storage, the storage types are transparent to the user, and Tachyon simply spills from memory to the next available storage tier instead of directly to the under filesystem. Tiered storage is configuration driven, meaning there is no need to recompile or change the code.
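The spill behavior can be sketched as follows. This is a toy model under stated assumptions, not Tachyon's implementation: `Tier` and `TieredStore` are hypothetical names, capacity is counted in blocks rather than bytes, the oldest block in a full tier is the one that spills (the notes do not say which eviction policy Tachyon uses), and blocks that fall off the last tier simply land in a dictionary standing in for the under filesystem.

```python
from collections import OrderedDict


class Tier:
    """One storage tier; capacity is a block count (must be at least 1)."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.blocks = OrderedDict()  # block_id -> data, oldest first


class TieredStore:
    """Toy tiered store: writes go to the top tier and spill downward."""

    def __init__(self, tiers):
        self.tiers = tiers  # ordered fastest to slowest
        self.under_fs = {}  # stand-in for the under filesystem

    def put(self, block_id, data, level=0):
        if level == len(self.tiers):
            # Past the last tier: hand the block to the under filesystem.
            self.under_fs[block_id] = data
            return
        tier = self.tiers[level]
        if len(tier.blocks) >= tier.capacity:
            # Tier is full: spill its oldest block to the next tier down.
            old_id, old_data = tier.blocks.popitem(last=False)
            self.put(old_id, old_data, level + 1)
        tier.blocks[block_id] = data

    def where(self, block_id):
        for tier in self.tiers:
            if block_id in tier.blocks:
                return tier.name
        return "UNDER_FS" if block_id in self.under_fs else None


# The tier list plays the role of configuration: changing tiers or sizes
# means changing this list, not the store's code.
store = TieredStore([Tier("MEM", 2), Tier("SSD", 2), Tier("HDD", 2)])
for i in range(7):
    store.put(f"b{i}", b"x")

locations = {f"b{i}": store.where(f"b{i}") for i in range(7)}
```

The caller only ever calls `put` and `where`; which device holds a block stays hidden behind the store, which is the transparency the notes describe.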
Other directions include:
Data serving: different ways to serve the data to applications
New hardware: such as RDMA