<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Epsio Blog]]></title><description><![CDATA[Epsio Blog]]></description><link>https://blog.epsiolabs.com</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1753198818764/b80f3a7b-a911-4e57-a583-5fd5982ba00d.png</url><title>Epsio Blog</title><link>https://blog.epsiolabs.com</link></image><generator>RSS for Node</generator><lastBuildDate>Fri, 17 Apr 2026 06:29:41 GMT</lastBuildDate><atom:link href="https://blog.epsiolabs.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Releasing ioflame: an eBPF based I/O tracer and visualizer]]></title><description><![CDATA[We’re happy to release ioflame, an I/O tracing tool that we developed internally to debug high I/O latency and wait time. ioflame attaches to the Linux kernel block level with eBPF probes, tracing all I/O activity. The resulting output of the trace i...]]></description><link>https://blog.epsiolabs.com/releasing-ioflame</link><guid isPermaLink="true">https://blog.epsiolabs.com/releasing-ioflame</guid><category><![CDATA[eBPF]]></category><category><![CDATA[Rust]]></category><category><![CDATA[profiling]]></category><dc:creator><![CDATA[Kobi Grossman]]></dc:creator><pubDate>Thu, 20 Nov 2025 07:51:30 GMT</pubDate><content:encoded><![CDATA[<p>We’re happy to release <a target="_blank" href="https://github.com/Epsio-Labs/ioflame">ioflame</a>, an I/O tracing tool that we developed internally to debug high I/O latency and wait time. ioflame attaches to the Linux kernel block level with eBPF probes, tracing all I/O activity. The resulting output of the trace is a flamegraph organized by filesystem hierarchy, showing the exact amount of bytes read or written by directory.<br />Here’s an example output from the repository, showing reads of some files not cached in the page cache:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762785499411/dda1267d-6a2e-4aa8-b893-033483dbbe43.png" alt class="image--center mx-auto" /></p>
<p>In this post, we’ll explain why we needed to build this tool for Epsio, and do a deep dive into the Linux filesystem and I/O stack to understand why that’s a non-trivial problem.</p>
<h2 id="heading-profiling-is-easy">Profiling is… easy?</h2>
<p>We’re building a <a target="_blank" href="https://www.epsio.io/">Streaming SQL Engine</a>, which is a SQL Engine that works on live changes to input data instead of having all the data up front at query time. Like any database, we care about performance. Thus we spend time profiling benchmark queries, trying to find our current bottlenecks. But what even is a bottleneck? What makes a database fast or slow?</p>
<p>There are 3 main resources that every program uses: CPU, memory, and disk. In broad strokes, performance problems stem from inefficiently using one of these resources. Thus to improve performance, we need two things: a good <em>understanding</em> of how each of these resources works, and good <em>visibility</em> into how we’re using them in practice. When we have an intuition for how something should optimally work, and knowledge of how far we are from that optimum in practice, we can start thinking about improvements we can make to our code. To look at a concrete case, let’s look at the CPU.</p>
<p>Understanding the CPU means understanding multicore parallelism, the standard algorithms for computing what we need with a good <a target="_blank" href="https://en.wikipedia.org/wiki/Time_complexity">time complexity</a>, and a myriad of important low level details which are beyond the focus of this post.<br />From the visibility perspective, we have great tools to see what’s going on in real time. At a high level, we can get basic utilization percentages for each core. We can then dig deeper with flamegraphs from data recorded by a profiler, as well as per CPU performance counters, to get high resolution data on what our CPUs are doing.</p>
<p>For example, we can look at this zoomed in part of a flamegraph of one of our runs:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1763029434973/c6c81ba4-e9d7-4689-8a53-14726e1b07c1.png" alt class="image--center mx-auto" /></p>
<p>There’s some noise here, but we can see that we spend most of our time in this section on writing to <a target="_blank" href="https://rocksdb.org/">RocksDB</a> SSTs, with serialization and allocation also taking a big part of the runtime. This is very useful information! We can start asking questions like: Are we writing more data than we should? Can we avoid serialization? Can we reduce our allocations done in the hot path? etc. So when it comes to the CPU, we have good visibility that enables us to improve our performance.</p>
<p>A database has another crucial resource: disk. We can understand it well, and know, for example, that it is a block based device, that random writes on SSDs are not a great idea, and so on. But what about visibility?</p>
<p>We can get a high level view and see that we’re blocked on I/O via CPU utilization data, and specifically wait%, which is the % of time in which the CPU sits idle and waits for I/O to complete. For example, consider the following graph:</p>
<p><img src="https://cdn-images-1.medium.com/max/1920/1*mEVOBc29LIXQqzKZgs_oiQ.png" alt /></p>
<p>In it we can see a ~60% wait. This means that most of the time, instead of doing something useful, the CPUs are waiting for I/O. We can also look at the disk queue, which is where requests are waiting to be sent to the drive, and see that it’s packed:</p>
<p><img src="https://cdn-images-1.medium.com/max/1920/1*ayljV0jtAstTk5j2WV1BkA.png" alt /></p>
<p>But knowing we wait on I/O is only halfway there. We want to see <em>why</em> we’re doing that much I/O. In the CPU case, a high utilization is a good global indicator, but the flamegraph helped us really understand what’s going on. How can we do that when profiling I/O?<br />That’s where things get complicated, and that’s the issue we set out to solve with ioflame.</p>
<h4 id="heading-io-tracing-a-surprisingly-hard-problem">I/O tracing: a surprisingly hard problem</h4>
<p>Ideally, we’d like a level of observability similar to what <code>perf</code> gives us on the CPU, but instead of hot functions we’d like to find hot directories or files. We want to know which files are read or written the most, how many bytes are read or written from each directory in total, etc., so we can discover hotspots. We first tried known tools, but none of them fit our specific use case, for one of these reasons:</p>
<ul>
<li><p><strong>Too high level:</strong> <code>iostat</code> is incredibly helpful to get the high level picture, for example whether the problem is many reads or many writes. But what are we doing that’s causing the I/O? That’s hard to tell from <code>iostat</code> alone. Other tools like <a target="_blank" href="https://manpages.ubuntu.com/manpages/noble/man8/biosnoop-bpfcc.8.html">biosnoop</a> provided more information, like PID and latency, but no information about the file level.</p>
</li>
<li><p><strong>Can’t distinguish page cache operations from actual I/O:</strong> One can imagine using <code>strace</code> with some wrapper script to trace all the <code>read</code> and <code>write</code> syscalls a program does. This would work in theory, but won’t tell us much about actual I/O done!</p>
<p>  That’s because we won’t know if a read was served from the <a target="_blank" href="https://en.wikipedia.org/wiki/Page_cache">Page cache</a> or the disk, or which of the writes was flushed to the block device.<br />  Since we’d naturally see high wait% when the page cache is close to full, the thrashing in this case would aggravate the problem: a small % of <code>read</code>s can access files whose pages aren’t cached at read time, but cause most of the I/O and thus the high wait%.<br />  This is hard to tell apart if we track cached and uncached reads together.</p>
<p>  This behavior exists in eBPF based tools like <a target="_blank" href="https://manpages.debian.org/unstable/bpfcc-tools/filetop-bpfcc.8.en.html">filetop</a>. Similar to <code>iostat</code>, <code>filetop</code> and similar tools were great for an initial overview of where we might have problems, but weren’t precise enough for our use case.</p>
</li>
<li><p><strong>Not tracking all types of I/O:</strong> File I/O is not necessarily the result of <code>read</code>, <code>write</code> or <code>fsync</code> on a file descriptor. For example, we use <code>mmap</code> buffers as a simple disk-backed channel between operators, to avoid OOMs when large amounts of data are involved. While <a target="_blank" href="https://db.cs.cmu.edu/papers/2022/cidr2022-p13-crotty.pdf">we’re sure</a> we want to phase them out for this use case in the future, for now when we’re debugging high wait% we’d like to know if mmap buffers are the cause.</p>
<p>  The problem is that mmap buffers are almost invisible to observability tools, most of which operate at the filesystem or syscall level. That’s because a read from mmaped memory is a simple pointer dereference. If the read page is not mapped, a page fault will occur, and the kernel page fault handler will issue a read from the filesystem to get the page from disk. All of this happens deep in the kernel, with no direct userspace action we can reasonably hook (see the sketch after this list).</p>
</li>
</ul>
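<p>To make the mmap point above concrete, here’s a tiny illustrative C program (error handling omitted, file name made up): the read below is a plain pointer dereference, so no <code>read</code> syscall is ever issued, yet it can trigger real disk I/O through a page fault.</p>
<pre><code class="lang-c">/* Illustrative only: reading a file through mmap without a single read() */
#include &lt;fcntl.h&gt;
#include &lt;sys/mman.h&gt;
#include &lt;sys/stat.h&gt;
#include &lt;unistd.h&gt;

int main(void)
{
    int fd = open("some_buffer.bin", O_RDONLY);
    struct stat st;
    fstat(fd, &amp;st);
    char *buf = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    /* If the page isn't resident, this dereference faults and the kernel
     * issues the read far below the syscall layer. */
    volatile char first = buf[0];
    (void)first;
    munmap(buf, st.st_size);
    close(fd);
    return 0;
}
</code></pre>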
<p>I should stress that we used some of these tools together to understand what’s going on in high wait% cases, and each tool worked great. But combining them didn’t work for us at scale. In complicated benchmark queries we can have tens of RocksDB directories, each with hundreds of SST files, as well as many mmap buffers. All of these being read, written and deleted at a rapid pace. For these cases the combination of tools and heuristics wasn’t enough.</p>
<p>We wanted a single tool that has low overhead, catches any type of I/O, attributes it to a path, and displays the result in an easy to understand visualization. In short, the experience we’ve gotten used to from CPU profilers.<br />We experimented with ad-hoc <a target="_blank" href="https://bpftrace.org/">bpftrace</a> scripts that used <code>ext4</code> probes, but we wanted a proper tool that worked on any filesystem and environment. So we wrote <a target="_blank" href="https://github.com/Epsio-Labs/ioflame">ioflame</a>.<br />But how does it work? Where do we want to put a probe, and what data should we record in it?</p>
<p>To answer that we’ll need to understand the Linux I/O stack a bit better, which we’ll do by tracing the path a read takes from userspace to the block layer. We won’t cover the write path, but the concepts are similar.</p>
<h4 id="heading-life-of-a-read">Life of a <code>read</code></h4>
<p>Let’s say that userspace issues a <code>read</code> syscall from some file. The read then goes through several layers in the Linux kernel, until it reaches the device driver, descending in the abstraction level from files to disk blocks. The layers are the <em>syscall</em>, <em>VFS</em>, <em>page cache</em>, <em>filesystem</em> code and finally the <em>block layer</em>.<br />Here’s a diagram of the path through the different layers, with the relevant kernel functions in each layer. We’ll explain each layer in detail next:</p>
<p><img src="https://cdn-images-1.medium.com/max/1920/1*XjrscwgHTu_V0-tbf_HauQ.png" alt /></p>
<p>The diagram was created by printing the kernel stack at <code>submit_bio</code> with <code>bpftrace</code> at read time.</p>
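<p>For reference, printing those stacks is a <code>bpftrace</code> one-liner along these lines:</p>
<pre><code class="lang-plaintext">bpftrace -e 'kprobe:submit_bio { @[kstack] = count(); }'
</code></pre>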
<p>Before we get into each layer, let’s understand what we need from our hook.<br />We need it to be at a layer low enough that it captures all I/O and doesn’t run on page cache hits, but we also need it to record high level information about which file was accessed. We’d also like it to be as generic as possible, not relying on specific filesystems or disk types.</p>
<p>Given these requirements, let’s look at the different layers and see what they give us. Note that the same layer can be called in different levels of the stack. For instance ext4 is called both directly by the generic VFS logic, and then in the case of a cache miss by the page cache. So the different layers are:</p>
<ul>
<li><p><strong>Syscall:</strong> The first code path that runs in the kernel as a result of our <code>read</code>.<br />  This isn’t a good place to hook: there are many different syscalls that eventually result in disk I/O, and at this stage we won’t know if the read is from cached data or not.</p>
</li>
<li><p><strong>VFS:</strong> The next layer is <a target="_blank" href="https://docs.kernel.org/filesystems/vfs.html">VFS</a>. This layer sits between the syscall layer and the actual filesystem code. It provides a common abstraction that all filesystems use, so that the syscall layer doesn’t need to know the filesystem specifics. It also provides common helpers that all filesystems can use. For example, <code>generic_file_read_iter</code> above, which connects the filesystem and the page cache logic.</p>
<p>  While this layer is an awesome abstraction in general, for our purposes it’s a bit too high in the call stack. In <code>vfs_read</code> or in the VFS helpers we still don’t know whether the pages are read from the page cache or not.</p>
</li>
<li><p><strong>ext4:</strong> The filesystem specific logic. This could be a great place to hook: for instance, hooking <code>ext4_readahead</code>, which is called right before issuing the block layer request, sounds exactly like what we need. That’s because at this point in time we know that we had a cache miss, since the page cache code called the read logic.</p>
<p>  One problem with hooking this layer is that we would need to have a <strong>per filesystem hook</strong>. Since we want to be as generic as possible, that’s a no go for us. Ideally, we’d like different filesystems to work out of the box with our hook, without adapting per filesystem.</p>
<p>  Another problem is the fact that not all filesystem calls correspond to actual I/O. One example is <a target="_blank" href="https://man7.org/linux/man-pages/man2/truncate.2.html">truncate</a>d files, or the more general concept of “holes” in files. These are parts of the file that are marked by the filesystem to be filled with zeros. There aren’t actual zero bytes stored on the block device itself though, since a metadata marker indicating a hole is enough. Thus a read of such a hole, even though it does go through the filesystem logic, won’t actually trigger any I/O (except for metadata, which is tiny and is probably cached anyway).</p>
<p>  So the filesystem specific code isn’t ideal for a hook. What else have we got?</p>
</li>
<li><p><strong>Page cache:</strong> Part of the complex memory management subsystem. Called by VFS helpers to try and get the read page from the page cache. In case of a miss, it initiates a read from the filesystem. This layer is <strong>close to ideal</strong>, and a <code>read_pages</code> hook was used in our initial implementation.<br />  The only real problem with it is the hole problem mentioned above. At this point, we only know that a page isn’t resident in the cache and needs to be read from disk. But what if this page is part of a file that was just <code>truncate</code>d? We’ll mark the read as actual I/O done, when it isn’t.</p>
<p>  This isn’t that big of a deal, and this approach would work, but this is an issue we would encounter in our usage of mmap buffers, for example. We decided to dig deeper and try to find a hook which doesn’t have this problem. This brings us to the..</p>
</li>
<li><p><strong>Block layer:</strong> The last level before actual device drivers, sending requests from the filesystem to read individual blocks. This sounds like the perfect place to hook. We can put a probe on <code>submit_bio</code>, which by definition sends I/O requests to the block device, so we know we are actually reading from disk. There is one problem though: since we are at a very low level in the stack, it’s hard to <strong>relate I/O to the actual files</strong> it’s done on, which was our point all along!</p>
<p>  In upper layers of the stack, for instance in the page cache miss code, we have the <code>file</code> struct easily available to us. By design, these layers work on files, not blocks, so it’s not surprising that it’s easy to relate a page miss to a file. In the block layer however, all types of I/O are mixed — general filesystem metadata blocks, <a target="_blank" href="https://en.wikipedia.org/wiki/Journaling_file_system">journal</a> records, and actual data blocks which we care about. So it makes sense that a relation to a file, if there is one, would be harder to obtain.</p>
</li>
</ul>
<p>Luckily, extracting file information in the block layer is hard but not impossible, as we’ll see later. But first, what data should we even use to relate the operation to a file? Is there any file information this close to the device driver?</p>
<h4 id="heading-associating-bio-requests-with-files">Associating bio requests with files</h4>
<p>In Linux, each file is associated with a unique <a target="_blank" href="https://en.wikipedia.org/wiki/Inode">inode</a>. The file inode contains the metadata about the file, like its size and permissions. But most importantly it allows us to <em>uniquely</em> identify a file as a tuple of 3 fields of the inode: The inode number <code>i_ino</code>, the block device in which the inode resides <code>s_dev</code>, and the inode generation <code>i_generation</code>.</p>
<p>This means that if we can get a reference to an <code>inode</code> struct in <code>submit_bio</code> somehow, we’d be almost done! We’ll only need to extract the number of bytes read or written and whether it’s a read or a write. So what’s the signature of <code>submit_bio</code>?</p>
<pre><code class="lang-c"><span class="hljs-function"><span class="hljs-keyword">void</span> <span class="hljs-title">submit_bio</span><span class="hljs-params">(struct bio *bio)</span></span>
</code></pre>
<p>Hmm, not very telling. Looking at <code>struct bio</code> members, unsurprisingly there is no <code>inode</code> field we could read, which makes sense given the block layer’s level of abstraction, as we discussed above. There was also no <code>file</code> struct member or anything else that would allow us to get the inode…</p>
<p>But then we remembered something. After all, we’re reading from a file into a <strong>page</strong>. Thus there must be some <code>page</code> struct somewhere in the bio that maybe could have the backing inode. And indeed, we found this in the <a target="_blank" href="https://elixir.bootlin.com/linux/v6.17/source/include/linux/blk_types.h">kernel code</a>:</p>
<pre><code class="lang-c"><span class="hljs-comment">/*
 * main unit of I/O for the block layer and lower layers (ie drivers and
 * stacking drivers)
 */</span>
<span class="hljs-class"><span class="hljs-keyword">struct</span> <span class="hljs-title">bio</span> {</span>
  ...
  <span class="hljs-class"><span class="hljs-keyword">struct</span> <span class="hljs-title">bio_vec</span>  *<span class="hljs-title">bi_io_vec</span>;</span> <span class="hljs-comment">/* the actual vec list */</span>
};
...
<span class="hljs-comment">/**
 * struct bio_vec - a contiguous range of physical memory addresses
 * @bv_page:   First page associated with the address range.
 * @bv_len:    Number of bytes in the address range.
 * @bv_offset: Start of the address range relative to the start of @bv_page.
 * ...
 */</span>
<span class="hljs-class"><span class="hljs-keyword">struct</span> <span class="hljs-title">bio_vec</span> {</span>
 <span class="hljs-class"><span class="hljs-keyword">struct</span> <span class="hljs-title">page</span> *<span class="hljs-title">bv_page</span>;</span>
 <span class="hljs-keyword">unsigned</span> <span class="hljs-keyword">int</span> bv_len;
 <span class="hljs-keyword">unsigned</span> <span class="hljs-keyword">int</span> bv_offset;
};
</code></pre>
<p>So we can easily find the page accessed in the bio, great! Getting from the page to the inode is easy: <code>page</code> has a member of type <code>struct address_space *</code>, which has a pointer to the inode.</p>
<p>Direct I/O is a bit trickier to track, as the read page is a userspace buffer page and not a file-backed page, but we resolved that by hooking <a target="_blank" href="https://elixir.bootlin.com/linux/v6.17/source/fs/iomap/direct-io.c#L68">another</a> bio submission function which has the file inode available in the Direct I/O path.</p>
<p>To distinguish file-backed I/O requests from general metadata I/O that we don’t want to track, we can use the same logic the kernel <a target="_blank" href="https://elixir.bootlin.com/linux/v6.17/source/include/linux/page-flags.h#L733">uses</a> to detect if a page is <em>anonymous</em>, i.e. not backed by a file: if the lower bit of the <code>address_space</code> isn’t set, we know that the page is backed by a file and we can proceed.</p>
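<p>Putting the pieces together, here’s a rough libbpf-style sketch of such a probe, including the per-CPU accounting map described next. To be clear, this is an illustration of the idea rather than ioflame’s actual code, which also handles multi-segment bios, Direct I/O and more:</p>
<pre><code class="lang-c">#include "vmlinux.h"
#include &lt;bpf/bpf_helpers.h&gt;
#include &lt;bpf/bpf_tracing.h&gt;
#include &lt;bpf/bpf_core_read.h&gt;

struct inode_key {
    u64 ino;        /* inode number */
    u32 dev;        /* device of the containing superblock */
    u32 generation; /* inode generation */
};

struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_HASH);
    __uint(max_entries, 65536);
    __type(key, struct inode_key);
    __type(value, u64); /* bytes of real I/O seen for this inode */
} io_by_inode SEC(".maps");

SEC("kprobe/submit_bio")
int BPF_KPROBE(trace_submit_bio, struct bio *bio)
{
    /* First page of the first bio_vec in this request */
    struct bio_vec *bvec = BPF_CORE_READ(bio, bi_io_vec);
    struct page *page = BPF_CORE_READ(bvec, bv_page);

    /* A set low bit in page-&gt;mapping marks an anonymous page */
    struct address_space *mapping = BPF_CORE_READ(page, mapping);
    if (!mapping || ((unsigned long)mapping &amp; 0x1))
        return 0;

    struct inode *inode = BPF_CORE_READ(mapping, host);
    struct inode_key key = {
        .ino = BPF_CORE_READ(inode, i_ino),
        .dev = BPF_CORE_READ(inode, i_sb, s_dev),
        .generation = BPF_CORE_READ(inode, i_generation),
    };

    u64 bytes = BPF_CORE_READ(bio, bi_iter.bi_size);
    u64 *total = bpf_map_lookup_elem(&amp;io_by_inode, &amp;key);
    if (total)
        *total += bytes; /* per-CPU map, so no atomics needed */
    else
        bpf_map_update_elem(&amp;io_by_inode, &amp;key, &amp;bytes, BPF_ANY);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
</code></pre>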
<p>Finally, we write the read or written bytes to an eBPF <a target="_blank" href="https://docs.ebpf.io/linux/map-type/BPF_MAP_TYPE_PERCPU_HASH/">percpu map</a>, keyed by the inode. After reading a 1GB file, the map would look like this:</p>
<p><img src="https://cdn-images-1.medium.com/max/1920/1*2FfjWO6Qt87m2ijdkHYysg.png" alt /></p>
<p>This is great, but something is missing. What we eventually want is paths, not inode numbers. How can we get that?</p>
<h4 id="heading-to-userspace-and-beyond">To userspace and beyond</h4>
<p>An <code>inode</code> struct is a unique handle to a file with metadata and not much more. It doesn’t have the file path, since paths are at a higher level of abstraction. Multiple paths can even point to the same inode via hardlinks, so finding <em>the</em> path from an inode just doesn’t make sense. This means we’ll need to hook some location right after a file is opened by path, which will have both the inode and the path, then store them in a mapping like so:</p>
<p><img src="https://cdn-images-1.medium.com/max/1920/1*Pl68-2l04YknsfUI3yk1FA.png" alt /></p>
<p>In general, resolving the path of a file in eBPF <a target="_blank" href="https://stackoverflow.com/questions/79631612/how-can-i-safely-build-the-full-path-of-a-struct-dentry-in-an-ebpf-lsm-hook">isn’t trivial</a> due to the eBPF verifier (resolving a path is a loop which can’t be proven to terminate), so we opted for the simpler solution of resolving it in userspace with procfs: In kernel space, we’ll hook a file open function, which will give us the inode opened, the PID that opened it, and the fd it got. Then in userspace we’ll read the path from <code>/proc/&lt;pid&gt;/fd</code>.</p>
<p>We decided to hook <a target="_blank" href="https://elixir.bootlin.com/linux/v6.17.7/source/fs/file.c#L644">fd_install</a>, which gets called whenever a new fd is opened by any process. Here we have all we need: a struct <code>file</code> containing the inode, with the pid and fd also available. We send this to userspace via an eBPF <a target="_blank" href="https://docs.ebpf.io/linux/map-type/BPF_MAP_TYPE_RINGBUF/">ringbuffer</a>; userspace then resolves the path as described above and stores it in the mapping.</p>
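<p>The userspace side of that resolution is a simple <code>readlink</code> on procfs. As an illustrative sketch in C (ioflame itself is written in Rust, and the function name here is made up):</p>
<pre><code class="lang-c">#include &lt;stdio.h&gt;
#include &lt;unistd.h&gt;

/* Returns 0 on success, filling `path` with the file's current path */
int resolve_fd_path(pid_t pid, int fd, char *path, size_t size)
{
    char link[64];
    snprintf(link, sizeof(link), "/proc/%d/fd/%d", pid, fd);
    ssize_t n = readlink(link, path, size - 1);
    if (n &lt; 0)
        return -1; /* the process may have exited or closed the fd already */
    path[n] = '\0';
    return 0;
}
</code></pre>
<p>This is inherently racy — the process can exit or close the fd before we resolve it — which is acceptable for a profiling tool.</p>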
<p>In userspace we link the information from both maps, use <a target="_blank" href="https://docs.rs/inferno/latest/inferno/">inferno</a> to produce a flamegraph per directory, and we’re done!</p>
<h4 id="heading-wrapping-up">Wrapping up</h4>
<p>The initial version of ioflame is on <a target="_blank" href="https://github.com/Epsio-Labs/ioflame">Github</a>. With this basic architecture in place, we can imagine other cool features we’d like to add:</p>
<ul>
<li><p>Tracking over time, allowing comparing I/O patterns over time with something like a <a target="_blank" href="https://www.brendangregg.com/blog/2014-11-09/differential-flame-graphs.htm">differential flamegraph</a></p>
</li>
<li><p>Adding a richer UI to enable better filtering</p>
</li>
<li><p>Adding probes to track page cache inserts and evicts, to produce a visualization of page cache residency over time</p>
</li>
</ul>
<p>But all of that is in the future. For now we finally have the I/O visualizer we needed.<br />btw, did we mention we’re building a cool <a target="_blank" href="https://www.epsio.io/">Streaming SQL Engine</a>?</p>
]]></content:encoded></item><item><title><![CDATA[I love UUID, I hate UUID]]></title><description><![CDATA[At Epsio we’re building a streaming SQL engine which natively integrates with databases/warehouses (both as “source”s and “sink”s). Epsio acts as a stream processing “middle man“- receiving data from one (or more) sources, applying transformations (a...]]></description><link>https://blog.epsiolabs.com/i-love-uuid-i-hate-uuid</link><guid isPermaLink="true">https://blog.epsiolabs.com/i-love-uuid-i-hate-uuid</guid><dc:creator><![CDATA[Maor Kern]]></dc:creator><pubDate>Tue, 09 Sep 2025 12:36:37 GMT</pubDate><content:encoded><![CDATA[<p>At Epsio we’re building a streaming SQL engine which natively integrates with databases/warehouses (both as “source”s and “sink”s). Epsio acts as a stream processing “middle man“- receiving data from one (or more) sources, applying transformations (according to the users’ queries), and sinking results into multiple tables in the users’ sink database. As part of this, we make it a priority to optimize database interactions so they run with maximum efficiency and impose minimal overhead on our users’ databases.</p>
<p>To sink data into the users’ databases, Epsio needs to be able to insert and delete rows efficiently. Inserting rows is fairly easy- with most databases we simply do <code>INSERT INTO</code> (or perhaps <code>COPY ..</code> for large amounts of data). Deletions, however, are more complicated. We could perhaps delete by all the fields <code>DELETE FROM pasten WHERE user=nadav AND age=21 AND …</code> - but a few issues would arise:</p>
<ul>
<li><p>It’s crazy inefficient- the users’ database would need to do a lot of work to find that row, given you don’t want a massive index on all the table’s fields</p>
</li>
<li><p>What if there are multiple rows with the same fields, but Epsio only wants to delete one? Some databases give this option, but Postgres for example does not.</p>
</li>
</ul>
<p>Enter primary keys.</p>
<p>Primary keys are a cornerstone for any relational table. They give the ability to have a handle to a <em>specific</em> row, sort of like a pointer. The primary key is usually indexed, which means that given a primary key, the database can jump to its corresponding row very quickly. These keys are often used to perform quick UPDATES/DELETES, and also as foreign keys for other tables to point to them.</p>
<p>There are multiple data types used for primary keys. The main two types are:</p>
<ol>
<li><p>Auto incrementing integers - every new row inserted receives the “next“ number, starting at 0</p>
</li>
<li><p>UUIDs (128 bits) - every row receives a randomly generated UUID.</p>
</li>
</ol>
<p>Auto incrementing keys must be generated by the <strong>server</strong>, since when two clients insert at the same time, the server must be the one to coordinate which client receives which number.</p>
<p>UUIDs, however, can be generated by the <strong>client</strong>. This is because the probability of a UUID collision is astronomically low- so low in fact that most systems rely on them to be absolutely unique.</p>
<p>But why does it even matter if the client is the one creating the identifier? To answer this, take for example an application that allows users to make posts, comment and edit them. If the client is the one creating an identifier, the application could be “optimistic“ about the server successfully receiving the post, and allow the user to edit/comment on the post even before the server has responded with its OK. This is because the client <strong>already</strong> has the identifier of the entity it’s creating, even before the server responded.</p>
<p>At Epsio, we also leverage UUIDs for primary keys in the users’ databases. This is mainly because there are situations such as <code>COPY …</code> in Postgres where you can’t receive back auto incremented keys from rows inserted. UUIDs allow us to generate identifiers before the insert, and later use those identifiers to delete from the database.</p>
<p>So up till now, UUIDs may sound like a gift from the Statistical Lords of the Universe, but there are some serious caveats.</p>
<h2 id="heading-the-problem-with-indexing-uuids-and-specifically-uuidv4">The Problem with Indexing UUIDs (and specifically, UUIDv4)</h2>
<p>The reason primary keys are so efficient for lookups is that they are indexed. This means the database holds some type of data structure, like a B-Tree, with all the UUIDs (with the data perhaps being the physical location of the row). B-Trees work very well when the writes are as condensed as possible; as the writes become more scattered, the B-Tree hits more leaves, and thus more pages need to be brought into cache and more splits occur. Sequential numbers added into a B-Tree are the <strong>best</strong> situation for a B-Tree- you’re always hitting the right-most leaf, filling it up, then splitting, etc. The UUID type (UUIDv4) that is used by default in most DBs / clients though, is the opposite- it’s the <strong>worst</strong> for a B-Tree. The very thing that makes UUIDs so magical- their global uniqueness- also means they’re completely random numbers across a 122-bit spectrum. To put it in perspective, that range is wider than the number of atoms in the observable universe!</p>
<p>This matters because inserts will suffer a serious hit, while lookups may suffer slightly. This means our throughput for insertions will go down, and the database will need to work harder. Not great.</p>
<h2 id="heading-enter-uuidv7">Enter UUIDv7</h2>
<p>UUIDv4 is fairly simple- as stated above, it’s simply a random number in a 122-bit spectrum. UUIDv7 however, is a bit more interesting- its first 48 bits contain the current timestamp (i.e. the timestamp of when the UUID was generated), a few more bits hold the version and variant, and 74 bits store a random number. Its exact layout is as follows:</p>
<pre><code class="lang-plaintext">+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|            unix_ts_ms (48)            | version (4) |   rand_a (12)   | variant (2) |                   rand_b (62)                                 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
</code></pre>
<p>This means that if we generate multiple UUIDv7s at consecutive times (say one millisecond after the other), they’ll all sort one after the other! This solves the index issue, and actually <strong>adds</strong> a very big benefit: it’s now possible to know the insert time of the row <strong>from within the primary key</strong>. This can help debugging, and in internal “chaos“ simulators we’ve used this to track down issues in Epsio.</p>
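<p>As a rough sketch, generating a UUIDv7 matching the layout above takes just a few lines of C (illustrative only — a real implementation should use a cryptographically secure random source and deal with monotonicity within the same millisecond):</p>
<pre><code class="lang-c">#include &lt;stdint.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;time.h&gt;

/* Illustrative UUIDv7 generator: 48-bit ms timestamp + version/variant + random */
void uuidv7(uint8_t out[16])
{
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME, &amp;ts);
    uint64_t ms = (uint64_t)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;

    /* rand_a and rand_b: fill everything with randomness first */
    for (int i = 0; i &lt; 16; i++)
        out[i] = rand() &amp; 0xff; /* use a CSPRNG in real code */

    /* unix_ts_ms: 48 bits, big-endian, so consecutive UUIDs sort together */
    out[0] = ms &gt;&gt; 40; out[1] = ms &gt;&gt; 32; out[2] = ms &gt;&gt; 24;
    out[3] = ms &gt;&gt; 16; out[4] = ms &gt;&gt; 8;  out[5] = ms;

    out[6] = (out[6] &amp; 0x0f) | 0x70; /* version = 7 */
    out[8] = (out[8] &amp; 0x3f) | 0x80; /* variant = 0b10 */
}
</code></pre>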
<p>To give a taste of the performance increase, I made a simple benchmark with my friend Claude to insert 10 million rows with UUIDv4 and then 10 million with UUIDv7. Claude of course proceeded to write code with an abundance of emojis, as one does.</p>
<p><strong>Results:</strong></p>
<div class="hn-table">
<table>
<thead>
<tr>
<td></td><td><strong>UUIDv4</strong></td><td><strong>UUIDv7</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Index Size</td><td>389MB</td><td>301MB</td></tr>
<tr>
<td>Throughput</td><td>183K/s</td><td>266K/s</td></tr>
<tr>
<td>Total Time</td><td>54.62s</td><td>37.53s</td></tr>
</tbody>
</table>
</div><p>Which shows the improvement fairly well- the total index size is 22% smaller with UUIDv7, and the total insert time was 31% shorter.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756828234703/49bfe2db-7713-4fab-9f3b-652f4af943ac.png" alt class="image--center mx-auto" /></p>
<p>That’s why in Epsio… we use UUIDv7s for primary keys!</p>
<p>But still, there are some tradeoffs that I do want to note here:</p>
<ol>
<li><p>UUIDv7 containing the timestamp is not always a good thing. In Epsio, the UUID identifying a row is never exposed to an untrusted user. But if your IDs are being exposed (for example, via REST endpoints) you may be giving a potential attacker information you’d rather them not have (namely, when the row was created).</p>
</li>
<li><p>Less “random”- I debated whether to even write this, since the chances are still astronomically low of a collision. But for all UUIDs generated in the same millisecond, your “random” bits are reduced from 122 to 74. To give a taste for this- if you have 1 million UUIDs generated the same <strong>millisecond</strong>, the chances of a collision are .0000000003%.</p>
</li>
</ol>
]]></content:encoded></item><item><title><![CDATA[Why is stream processing hard?]]></title><description><![CDATA[Before starting Epsio a couple of years ago, I used to spend countless hours learning about creative and innovative ways companies were making data access and querying faster and more efficient. Looking at rising products in that space — whether it’s...]]></description><link>https://blog.epsiolabs.com/why-is-stream-processing-hard</link><guid isPermaLink="true">https://blog.epsiolabs.com/why-is-stream-processing-hard</guid><dc:creator><![CDATA[Gilad Kleinman]]></dc:creator><pubDate>Sun, 10 Aug 2025 15:35:42 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1754840116914/a9aeed5a-4e0a-4024-9a10-0f5fe38b9dcb.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Before starting Epsio a couple of years ago, I used to spend countless hours learning about creative and innovative ways companies were making data access and querying faster and more efficient. Looking at rising products in that space — whether it’s ClickHouse, DuckDB, or others — I was consistently amazed by how much magic and innovation these tools packed into their products while still abstracting away the complexity &amp; while keeping the interface extremely simple.</p>
<p>The streaming space however, told a completely different story. While technical innovation was absolutely booming in this area, it somehow felt like the technology was becoming increasingly complex and costly to use as the underlying theory &amp; tech evolved.</p>
<blockquote>
<p><em>Streaming is a high-maintenance, complex bitch that offers real-time data and scalability, but it’s costly and a pain to manage; batch processing is your reliable, easy-to-handle workhorse, less resource-intensive but slower and less sexy — choose based on whether you want a race car or a minivan.</em></p>
</blockquote>
<p><em>(Ill-Valuable6211 on Reddit, “</em><a target="_blank" href="https://www.reddit.com/r/ExperiencedDevs/comments/183f70f/comment/kaodppu/?utm_source=share&amp;utm_medium=web3x&amp;utm_name=web3xcss&amp;utm_term=1&amp;utm_content=share_button"><em>Tradeoffs between batch processing and streaming</em></a><em>?”)</em></p>
<h4 id="heading-a-death-by-a-thousand-components-amp-configurations"><strong>A Death By A Thousand Components &amp; Configurations</strong></h4>
<p>While in a batch processing world, I could spin up a database and pretty quickly get started, in a stream processing architecture, for me to be able to build a simple feature that counted rows in a PostgreSQL database, I “only” needed:</p>
<ul>
<li><p>Debezium for CDC</p>
</li>
<li><p>Avro registry for schema of events</p>
</li>
<li><p>Kafka for change queue</p>
</li>
<li><p>Flink to run the aggregation</p>
</li>
<li><p>JDBC Sink to write back the results to PostgreSQL</p>
</li>
</ul>
<p>And, of course — digging deep into many (many) Flink / Kafka / Debezium configurations to make all components work together and learn many new concepts like watermarks, checkpoints, etc…</p>
<h3 id="heading-why-is-streaming-inherently-hard"><strong>Why Is Streaming Inherently Hard?</strong></h3>
<p>Fundamentally, I do believe stream processing is inherently harder and more complex than batch processing. This is for 2 main reasons:</p>
<h4 id="heading-1-stream-processors-are-highly-coupled-to-other-data-products"><strong>1 — Stream processors are highly coupled to other data products</strong></h4>
<p>By definition, a stream processor is meant to input data, transform it, and <strong>push</strong> it to downstream data products (DBs/ data stores)—  not to be the final “endpoint” for other systems (backend / BI / end-users) to <strong>pull</strong> data from it.</p>
<p>Unlike a database (/batch processor), which can usually be a full “end-to-end” data solution — ingesting, transforming and serving data to the backend — a stream processor can almost never be implemented without an additional data product (database / batch processor) that enables users to access, index, and search over its output. A stream processor without a database after it is like email without email inboxes — the data flows, but there’s no easy way to view or interact with anything after it’s sent.</p>
<p>This means that unlike the batch processor (which doesn’t need any other component to serve a full end-to-end data product) — the stream processor depends on and must couple and integrate itself with numerous other data products for users to be able to benefit from its value.</p>
<p>If your backend reads/ writes to a specific PostgreSQL instance (in a specific PostgreSQL version), even the best stream processor in the world probably won’t be helpful unless it can properly ingest from and sink data to your specific PostgreSQL instance/ version.</p>
<p>Similarly — if your database has a complex partitioned schema that involves many ENUMs and custom columns — no stream processor in the world would be usable unless it can properly handle the scenario of a new partition being added in your database, or a way to process and transform all of your custom-typed columns and ENUMs.</p>
<p><img src="https://cdn.prod.website-files.com/66e7c0e8297c80a6f19bbd49/685069a0d0cb58afd361c879_0*UW8NrhnbFtI68HPk.png" alt /></p>
<p>(From Jay Kreps' original vision for Kafka — <a target="_blank" href="https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying">https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying</a>)</p>
<p>As live testimony to this — some of the hardest bugs we ever encountered while building Epsio were actually bugs in other data products we integrated with. Since we needed to rely so heavily on the specific format and way each data store does things — the smallest of glitches in a source / destination database could easily translate into a glitch in Epsio.</p>
<h4 id="heading-2-stream-processors-are-batch-processors-plus-more"><strong>2 — Stream processors are batch processors, plus more</strong></h4>
<p>Fundamentally, batch processing is just a “subset” of stream processing. A batch processor only works on the CURRENT data you have. A stream processor must both process the CURRENT data you have, and also “prepare” (/build structures) for any FUTURE data that arrives.</p>
<p>A great example of this difference is how JOIN operations are handled (specifically, hash joins). In batch processing, since the system knows it will never receive additional data, it can optimize the JOIN by building a hash table only for the “smaller” side of the join — say, the left side — and then iterate over the current rows in the larger side (the right) in no particular order, constantly looking up the corresponding rows in the other side’s hash table.</p>
<p>A stream processor in contrast, needs to build not only a hashtable for the small side, but ALSO a hashtable for the larger side of the JOIN — to make sure it can quickly lookup corresponding rows if any change in the future is made to the smaller side of the JOIN. It needs to build all the structures a batch processor needs for the same operation, plus more.</p>
<p>(Some stream processing systems — or similar systems like TimescaleDB’s continuous aggregates — mitigate this overhead by limiting support for processing updates to only one side of the join).</p>
<p><img src="https://cdn.prod.website-files.com/66e7c0e8297c80a6f19bbd49/685069b8bb499962f264e3f6_0*F8SK6fm6SbeuOpX8.png" alt /></p>
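<p>To make the difference concrete, here’s a toy sketch of a symmetric (two-sided) streaming hash join — inserts only, all in memory; a real engine would also handle deletes/retractions and spill state to disk:</p>
<pre><code class="lang-c">#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;

#define BUCKETS 1024

typedef struct row { long key; long val; struct row *next; } row;
typedef struct { row *buckets[BUCKETS]; } table;

static void insert(table *t, long key, long val)
{
    size_t b = (unsigned long)key % BUCKETS;
    row *r = malloc(sizeof(*r));
    r-&gt;key = key; r-&gt;val = val; r-&gt;next = t-&gt;buckets[b];
    t-&gt;buckets[b] = r;
}

/* Every incoming row is remembered on its own side, then probes the other
 * side. A batch hash join would only ever build ONE of these tables. */
static void on_change(table *mine, table *other, long key, long val)
{
    insert(mine, key, val);
    for (row *m = other-&gt;buckets[(unsigned long)key % BUCKETS]; m; m = m-&gt;next)
        if (m-&gt;key == key)
            printf("emit joined row: key=%ld (%ld, %ld)\n", key, val, m-&gt;val);
}

int main(void)
{
    static table left, right;        /* the stream processor keeps BOTH */
    on_change(&amp;left, &amp;right, 1, 10); /* no match yet */
    on_change(&amp;right, &amp;left, 1, 20); /* emits (10, 20) */
    on_change(&amp;left, &amp;right, 1, 30); /* emits (30, 20) */
    return 0;
}
</code></pre>
<p>Note how <code>main</code> must keep <em>both</em> tables alive indefinitely — that’s exactly the extra state a batch join never pays for.</p>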
<p>Even after building these additional structures, stream processors still must “take care” of additional edge cases batch processors don’t need to. Stream processors, for example, must often guarantee that the order in which they receive data matches the order in which they output data. While a batch processor receives a single batch of data and outputs a single batch — a stream processor that constantly receives a stream of changes needs to make sure the output stream of changes correlates to the order of the input stream of changes (imagine how hard that could be when trying to process that stream in parallel!). A row that got INSERTed and then DELETEd must never be DELETEd in downstream systems before it got INSERTed.</p>
<p>As evidence of a stream processor being “a batch processor, plus more”, you can frequently find stream processing systems that also have support for batch queries (e.g., Flink, Materialize, and Epsio soon ;)), while no batch processor supports streaming queries. To build a batch processor — simply take a stream processor and remove some components!</p>
<h3 id="heading-how-were-these-difficulties-overcome-until-today"><strong>How Were These Difficulties Overcome Until Today?</strong></h3>
<p>Historically, the creator of Kafka (Jay Kreps) talked about tackling the difficulties of stream processing by breaking down these huge complexities into many small “bite size” components, thereby making it easier to build each one of them: “By cutting back to a single query type or use case each system is able to bring its scope down into the set of things that are feasible to build”. Specifically, he <a target="_blank" href="https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying">portrayed</a> a world where many “small” open source tools play together to form the complete “data infrastructure layer”:</p>
<blockquote>
<p><em>Data infrastructure could be unbundled into a collection of services and application-facing system apis. You already see this happening to a certain extent in the Java stack:</em></p>
<ul>
<li><p><em><a target="_blank" href="http://zookeeper.apache.org/">Zookeeper</a> handles much of the system co-ordination (perhaps with a bit of help from higher-level abstractions like <a target="_blank" href="http://helix.incubator.apache.org/">Helix</a> or <a target="_blank" href="http://curator.incubator.apache.org/">Curator</a>).</em></p>
</li>
<li><p><em><a target="_blank" href="http://mesos.apache.org/">Mesos</a> and <a target="_blank" href="http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html">YARN</a> do process virtualization and resource management.</em></p>
</li>
<li><p><em>Embedded libraries like <a target="_blank" href="http://lucene.apache.org/">Lucene</a> and <a target="_blank" href="https://code.google.com/p/leveldb">LevelDB</a> do indexing.</em></p>
</li>
<li><p><em><a target="_blank" href="http://netty.io/">Netty</a>, <a target="_blank" href="http://www.eclipse.org/jetty">Jetty</a> and higher-level wrappers like <a target="_blank" href="http://twitter.github.io/finagle">Finagle</a> and <a target="_blank" href="http://rest.li/">rest.li</a> handle remote communication.</em></p>
</li>
<li><p><em><a target="_blank" href="http://avro.apache.org/">Avro</a>, <a target="_blank" href="https://code.google.com/p/protobuf">Protocol Buffers</a>, <a target="_blank" href="http://thrift.apache.org/">Thrift</a>, and <a target="_blank" href="https://github.com/eishay/jvm-serializers/wiki">umpteen zillion</a> other libraries handle serialization.</em></p>
</li>
<li><p><em><a target="_blank" href="http://kafka.apache.org/">Kafka</a> and <a target="_blank" href="http://zookeeper.apache.org/bookkeeper">Bookeeper</a> provide a backing log.</em></p>
</li>
</ul>
<p><em>If you stack these things in a pile and squint a bit, it starts to look a bit like a lego version of distributed data system engineering. You can piece these ingredients together to create a vast array of possible systems. This is clearly not a story relevant to end-users who presumably care primarily more about the API then how it is implemented, but it might be a path towards getting the simplicity of the single system in a more diverse and modular world that continues to evolve. If the implementation time for a distributed system goes from years to weeks because reliable, flexible building blocks emerge, then the pressure to coalesce into a single monolithic system disappears.</em></p>
</blockquote>
<p>Even though this architecture definitely succeeded in improving the “specialization” of specific components and advanced the capabilities of the streaming ecosystem dramatically — it kind of feels like somewhere along the road, the end user that was supposed to be safeguarded from this implementation choice (“This is clearly not a story relevant to end-users who presumably care primarily more about the API then how it is implemented”.) ended up paying the price.</p>
<p>In today’s world — there is nearly no technical barrier for great stream processing. Whether it’s scale, complexity or robustness — nearly any application <strong>can</strong> be built with the tools the market has to offer. The interesting question though is —  <strong>at what cost</strong>?</p>
<p>Even setting aside the sheer number of components users must manage — Debezium, Kafka Connect, Kafka, Flink, Avro Schema Registry, and more — the internal knowledge required to operate these tools effectively is staggering. Whether it’s understanding how Flink parallelizes jobs (does an average user need to understand how PostgreSQL parallelizes queries?), thinking about the subtleties of checkpointing, watermarks, or how to propagate schema changes to transformations — the end user is definitely <strong>not</strong> abstracted away from the internal implementation choice.</p>
<h3 id="heading-a-new-generation-of-stream-processors"><strong>A New Generation Of Stream Processors</strong></h3>
<p>Innovation is dirty. And perhaps this historical unbundling and complexity <strong>must</strong> have taken place in order for us to reach where we are today and to overcome all these difficulties. But similar to the evolution of batch processors — where the “dirty” and “complex” innovation of Oracle must have taken place to allow the rise of the easy to use PostgreSQL, in the past couple of years — <strong>a new generation of stream processors is now rising</strong> that focus their innovation on the end user, not <em>just</em> technical capabilities.</p>
<p>Similar to the movement from Oracle to PostgreSQL — stream processors in this new generation are not necessarily less internally “complex” than the previous ones. They compound all the “internal” innovation the previous generation had, while somehow still abstracting away that innovation from the end user.</p>
<p>In this new generation — <strong>simplicity is the default and complexity a choice.</strong> An end user is able to dive deep into the complexities and configurations of the stream processor if he wishes to — but always has the option to use a default version/configuration that “just works” for 99% of the use cases.</p>
<p>For this new generation of stream processors to be built, a radical change in the underlying technology had to happen. The endless process of stitching and gluing together generic old components was broken — and a new paradigm, a  much more holistic one, had to be built.</p>
<p>New stream processing algorithms, with much stronger end-to-end guarantees (<a target="_blank" href="https://github.com/TimelyDataflow/differential-dataflow/blob/master/differentialdataflow.pdf">differential dataflow</a> &amp; <a target="_blank" href="https://www.vldb.org/pvldb/vol16/p1601-budiu.pdf">DBSP</a> with their internal consistency promises) are used and top-tier replication practices inspired by great replication tools like peerdb, fivetran, etc… are incorporated.</p>
<p>In this new generation of stream processors, each component is designed specifically for its role in the larger system — and with strong “awareness” of all the external systems and integrations it must serve.</p>
<p>The concept of “database transactions,” for example, is enforced in all internal components in order to abstract away ordering and correctness issues from the user. While stream processors in the old generation (Debezium, Flink, etc.) by default lose the bundling of transactions when processing changes (meaning large operations performed atomically on source databases might not appear atomically on sinked databases) — our new stream processor must (and does!) automatically translate a single transaction at the source database to a single “transaction” in the transformation layer, and into a single “transaction” in the sink database.</p>
<p>The transformation layer is aware of the source database it transforms data from so it can build the most optimal query plans (based on schemas, table statistics, etc..), and the replication layer is aware of the transformation layer (so it can automatically start replicating new tables that new transformations require).</p>
<p>Parallelism, fault tolerance, and schema evolution are all abstracted away from the user — and a <strong>single</strong> simple interface is built to control all the stream processor’s mechanisms (ingesting data, transforming data, sinking data).</p>
<p><img src="https://media.licdn.com/dms/image/v2/D4D22AQGD_pDtgJc8Tw/feedshare-shrink_800/B4DZeI1JKZGsAk-/0/1750347323134?e=1757548800&amp;v=beta&amp;t=XaZSHsNBV9UG5PenMTRS-akTWH7RpPMFyZqDpJ5q0S0" alt="diagram" /></p>
<h3 id="heading-moving-forward"><strong>Moving Forward</strong></h3>
<p>Streaming is hard. But as the barrier of adoption in the market shifted from a theoretical one (is reliable stream processing at scale even possible / a real solution with real use cases?) to a usability one (how hard is it for me to use?) — the underlying technologies that power streaming had to and need to change.</p>
<p>Although probably always harder than batch processing, if we’re able to truly make stream processing (almost) as easy as batch processing — the implications for the data world would be staggering. Heavy queries would be easily pre-calculated (and kept up-to-date) using incremental materialized views, data movement between data stores would be a solved issue, dbt model/ETL processes could be made incremental using a streaming engine, and front ends would be much more reactive (e.g. <a target="_blank" href="https://skiplabs.io/">https://skiplabs.io/</a>). The cost of data stacks would drop dramatically, and engineers would be able to focus much more on what they are supposed to — building application logic!</p>
<p>Similar to how Snowflake revolutionized the data warehousing world with its ease of use and scalability — I truly believe as a software engineer that an “easy to implement” yet robust stream processor can break the existing trade-off between batch processing and stream processing (easy to use vs. real-time/performant) — and unlock a new and exciting world in the data ecosystem!</p>
]]></content:encoded></item><item><title><![CDATA[The Bidirectional Stream Processor: Why Pull Beats Push for Crash Recovery]]></title><description><![CDATA[At Epsio — we’re building a streaming SQL engine, focusing our engineering effort towards ease of use / removing the middleware usually associated with stream processing.
As part of that effort — we need to ensure that if and when our stream processo...]]></description><link>https://blog.epsiolabs.com/building-a-recovery-mechanism</link><guid isPermaLink="true">https://blog.epsiolabs.com/building-a-recovery-mechanism</guid><dc:creator><![CDATA[Gilad Kleinman]]></dc:creator><pubDate>Tue, 05 Aug 2025 11:43:15 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1754398448916/373d819a-7ab4-4619-a345-9022a77cdd91.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>At Epsio — we’re building a streaming SQL engine, focusing our engineering effort towards ease of use / removing the middleware usually associated with stream processing.</p>
<p>As part of that effort — we need to ensure that if and when our stream processor crashes, no data duplication occurs. This means that even in the scenario of a disk malfunction, network issues, unexpected compute restart — there shouldn't be duplications in the output stream. Each inputted change should be received once, processed once, and its derived change should also be outputted only once to the component after the stream processor.</p>
<p><img src="https://cdn-images-1.medium.com/max/1440/1*Cgf4s8pNVRxePBt8ItW2kg.png" alt /></p>
<p>As you’ll see in this blog post, certain scenarios can make this especially challenging — particularly when it’s difficult to determine exactly what the stream processor managed to process or output before it crashed. Since our mechanism can’t depend on any local disk or state that might be lost during a crash, we had to get a bit creative in figuring out which messages had already been sent by the stream processor — and which hadn’t.</p>
<h2 id="heading-problematic-scenario-1-connection-to-sink-breaking">Problematic Scenario #1 — Connection to Sink Breaking</h2>
<p>The first (and easier to handle) failure scenario is when the connection to the sink breaks during a batch write or delete operation. This can happen because of a network hiccup, a sink/database restart, and so on.</p>
<p><img src="https://cdn-images-1.medium.com/max/1440/1*hE9K4iiMp2T3UcI-Fg2Mkw.png" alt /></p>
<p>When this failure occurs (e.g., an insert operation receives a timeout or a TCP connection breaks forcefully mid insert), the stream processor is left in the dark: did the sink manage to apply any of the writes before the disconnect? If we retry the operation blindly, we risk duplicating data. But if we skip it, we could lose records altogether.</p>
<p>To handle this, we employ one of two strategies, depending on the capabilities of the target sink:</p>
<h3 id="heading-utilizing-transactions">Utilizing transactions</h3>
<p>Luckily, many of the data sinks we work with have transactions built-in. This makes life much easier because we can apply a batch of changes in bulk, and if our connection breaks in the middle — all is good because the transaction wasn’t committed yet! For example:</p>
<pre><code class="lang-plaintext">BEGIN;
INSERT &lt;change1&gt;
INSERT &lt;change2&gt;
INSERT &lt;change3&gt;
COMMIT;
</code></pre>
<p>If for some reason, the server disconnects between insert2 and insert3 — we know the previous inserts didn’t get committed yet — so we can just restart the entire write operation when we re-establish the connection and commit. Great, right? No need to worry about duplicate writes anymore?</p>
<p>…</p>
<p>…</p>
<p>…</p>
<p>…</p>
<p>…</p>
<p>…</p>
<p>Wrong!</p>
<p>In the world of data processing, we need to be prepared for failure at ANY POINT IN TIME. This includes the specific moment when we run the COMMIT command. If the connection breaks while we perform the COMMIT — how can we know whether the transaction was committed (and the reply confirming the COMMIT just never reached us), or whether we need to re-apply the transaction?</p>
<p>You might think this scenario is fairly rare (how often does a database restart? And how unlucky must one be to have a restart exactly on the commit command) — but this is something that actually happened many times for some of our users (when running many views and writing into many data stores that frequently upgrade/restart due to scaling events).</p>
<p>Surprisingly, some of the most famous stream processors (Flink included!) market themselves as “exactly once” stream processors, but can’t handle this edge case:</p>
<blockquote>
<p>“If a commit fails (for example, due to an intermittent network issue), the entire Flink application fails, restarts according to the user’s restart strategy, and there is another commit attempt. This process is critical because if the commit does not eventually succeed, data loss occurs.” — <a target="_blank" href="https://flink.apache.org/2018/02/28/an-overview-of-end-to-end-exactly-once-processing-in-apache-flink-with-apache-kafka-too/">An Overview of End-to-End Exactly-Once Processing in Apache Flink</a></p>
</blockquote>
<p>Epsio, on the other hand, overcomes this by utilizing a built-in feature many data stores have to check if a specific transaction ID was committed. This means that when the engine needs to decide whether to re-apply the changes to the sink, it can just check if the previous transaction ID was committed or not. For example, in PostgreSQL:</p>
<pre><code class="lang-plaintext">BEGIN;
SELECT txid_current(); -- Save this for later. If COMMIT fails, check whether this txid was committed
INSERT &lt;change1&gt;
INSERT &lt;change2&gt;
INSERT &lt;change3&gt;
COMMIT;
</code></pre>
<p>If the COMMIT command fails, Epsio simply checks if the transaction ID was previously committed or not and decides if it needs to re-apply all the changes again.</p>
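<p>Sketched in Python (with a psycopg2-style connection; <code>sink_table</code> and the helper names are illustrative, not Epsio’s actual code), the idea looks roughly like this:</p>
<pre><code class="lang-python"># Hedged sketch of commit-or-verify delivery (PostgreSQL 10+, which
# provides txid_current() and txid_status(); names are illustrative).
def apply_batch(conn, changes):
    """Apply one batch; returns the txid to verify if COMMIT is cut off."""
    with conn.cursor() as cur:
        cur.execute("SELECT txid_current()")
        txid = cur.fetchone()[0]  # save BEFORE committing
        for key, value in changes:
            cur.execute("INSERT INTO sink_table VALUES (%s, %s)", (key, value))
    conn.commit()  # if the connection dies here, the outcome is unknown
    return txid

def was_committed(conn, txid):
    # Ask the server what happened to the interrupted transaction.
    # (txid_status can also return 'aborted', 'in progress', or NULL
    # if the transaction is too old for the server to remember.)
    with conn.cursor() as cur:
        cur.execute("SELECT txid_status(%s)", (txid,))
        return cur.fetchone()[0] == "committed"

# Recovery: reconnect, then re-apply the batch only if
# was_committed(new_conn, saved_txid) is False.
</code></pre>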
<h3 id="heading-making-writes-idempotent">Making writes idempotent</h3>
<p>Sometimes, transactions aren’t available in the data sink — so Epsio has to “cheat” its way into only delivering changes once.</p>
<p>Instead of trying to guarantee that a change is written only once to the datastore, an alternative approach is to make the write process idempotent: if a particular change has already been written, any subsequent attempt to write it again will be safely ignored.</p>
<p>Epsio handles this by assigning a unique identifier (UUIDv7) to each change before it’s sent to the sink. It then relies on unique indexes (or similar constraints) in the target datastore to ensure that duplicate writes are ignored — typically by using an UPSERT or merge operation.</p>
<p>For example, if a batch of three changes is being written to the sink and the connection drops after the first change is written, retrying the batch will result in only the remaining two changes being applied. The first change, already present, will be ignored.</p>
<p><img src="https://cdn-images-1.medium.com/max/1440/1*wOS5blw3dWkH83kftbaqqw.png" alt /></p>
<h2 id="heading-problematic-scenario-2-engine-crashes-mid-processing">Problematic Scenario #2 — Engine crashes Mid-Processing</h2>
<p>A more complex (and interesting!) scenario arises when the stream processor itself crashes or restarts mid-processing.</p>
<p><img src="https://cdn-images-1.medium.com/max/1440/1*MwrCBgaNZI2oF8clQC4pyA.png" alt /></p>
<p>In this scenario, it’s unclear how the processor should proceed upon recovery. How does it know which changes were already written? How can it keep ensuring no duplicate changes are written to the sink, even in the case of state data loss?</p>
<p>Traditional stream processing systems — <em>ahem, ahem, Flink</em> — handle this by combining end-to-end transactions with periodic state checkpoints. Here’s how it works:</p>
<ol>
<li><p>For each input batch, start a transaction in the sink.</p>
</li>
<li><p>Process, transform and write the batch into the sink (using the transaction opened in the previous stage)</p>
</li>
<li><p>After fully processing the batch, back up the internal state (key-value store, input stream offsets, metadata, etc.).</p>
</li>
<li><p>Commit the transaction after all steps are successful.</p>
</li>
</ol>
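<p>In code, the loop looks roughly like this (a schematic sketch only; the <code>source</code>, <code>sink</code> and <code>state</code> objects are hypothetical, not Flink’s actual API):</p>
<pre><code class="lang-python"># Schematic sketch of transaction-per-checkpoint processing.
# Not Flink's actual API: source, sink and state are hypothetical objects.
def checkpoint(state, offsets):
    """Stand-in: durably back up ALL internal state plus input offsets."""
    state.backup(offsets)

def run(source, sink, state):
    for batch in source.read_batches():
        txn = sink.begin_transaction()       # 1. open a sink transaction
        for record in batch:
            for output in state.process(record):
                txn.write(output)            # 2. write through the transaction
        checkpoint(state, source.offsets())  # 3. back up the internal state
        txn.commit()                         # 4. only now make results visible
</code></pre>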
<p>If a failure occurs mid-processing, a “replica node” (a node sitting in standby) can revert to the last successful checkpoint/backup. Since the transaction isn’t committed while the batch is still being processed, any partially outputted changes from this batch are discarded, and the replica node simply retries processing the entire batch.</p>
<p>The huge (!) downside of this approach is that the “frequency” of your transactions is limited by how fast you can perform checkpoints and back up all your internal state. Consider Confluent’s default checkpoint interval of 1 minute — this overhead means that <strong>achieving exactly-once guarantees in Confluent’s Flink implementation currently requires accepting at least 1 minute of latency in your stream processing pipeline</strong>.</p>
<blockquote>
<p><strong>“Exactly-Once</strong>: If you require exactly-once, the latency is roughly one minute and is dominated by the interval at which Kafka transactions are committed” — <a target="_blank" href="https://docs.confluent.io/cloud/current/flink/concepts/delivery-guarantees.html">Delivery Guarantees and Latency in Confluent Cloud for Apache Flink</a></p>
</blockquote>
<h3 id="heading-how-does-epsio-solve-this">How does Epsio solve this?</h3>
<p>Unlike traditional stream processors that function as a “one-way street,” continuously receiving data from sources and pushing it downstream, Epsio is <strong>bidirectional</strong>. This means it can not only <strong>push data</strong> to the sink but also <strong>pull data</strong> from it when needed.</p>
<p><img src="https://cdn-images-1.medium.com/max/1440/1*laquRlEoI4X8BSWWpxjjOA.png" alt /></p>
<p>This bidirectional approach is crucial because it enables failure recovery in ways that traditional stream processors can’t. While traditional processors simply push data to the sink, Epsio can pull data back from the sink during failure scenarios and compare the current state in the sink with its expected internal state — thereby avoiding outputting changes it already outputted.</p>
<p>For example, consider a highly available Epsio deployment with a “Main” instance and a “Replica”. This system receives a stream of changes and multiplies each input integer by 2. To prepare for failure scenarios, the replica continuously maintains, internally, the “expected” result state of the sink.</p>
<p><img src="https://cdn-images-1.medium.com/max/1440/1*cJ1vRa2YhxJV48FXzdAgKA.png" alt /></p>
<p>When a failure occurs, the replica will pull the current state from the sink, compare it with its expected state, and apply only the differences (i.e., new rows that need to be inserted or deleted).</p>
<p><img src="https://cdn-images-1.medium.com/max/1440/1*fHAhTDc1Ov41Nh8kMuTRIg.png" alt /></p>
<p>After applying these changes, the replica can just continue ingesting more events from the input stream, and never worry about duplicate data delivery being an issue.</p>
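<p>As a toy sketch of that reconciliation step, with both states modeled as plain key-to-row dicts (illustrative only; in reality both sides live in persistent storage and are compared incrementally):</p>
<pre><code class="lang-python"># Toy sketch of sink reconciliation after a failover (illustrative only).
# expected: the replica's internally maintained "expected" sink state.
# actual: the state pulled back from the sink itself.
def reconcile(sink, expected, actual):
    for key, row in expected.items():
        if actual.get(key) != row:
            sink.upsert(key, row)   # missing or stale row in the sink
    for key in actual.keys() - expected.keys():
        sink.delete(key)            # row the sink has but shouldn't
</code></pre>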
<p>Because this approach doesn’t require any external checkpoint/backup, and more importantly decouples sink writes from state backup frequency, it enables users to achieve strong delivery guarantees <strong>without sacrificing streaming latency</strong>.</p>
<h2 id="heading-concluding-thoughts"><strong>Concluding Thoughts</strong></h2>
<p>Historically, stream processors have been built to operate in isolation — generic, self-contained components with minimal awareness of their upstream and downstream counterparts. While this approach has merit, it comes with significant trade-offs.</p>
<p>While there is still much to say about the above “strategy” (how do you deal with multiple instances crashing, how do you deal with a sink dataset too large to maintain efficiently on the instance, etc.) — I hope this article gave you a small glimpse into how stream processors could become significantly easier to use — and more efficient — if they were designed to be more “aware” of the surrounding ecosystem.</p>
]]></content:encoded></item><item><title><![CDATA[How we built a Streaming SQL Engine]]></title><description><![CDATA[So you probably wake up every morning asking yourself three of life’s most pertinent questions- how do I build a streaming SQL engine, what even is a streaming SQL engine, and can our Lord drop tables owned by another user.
I too found myself asking ...]]></description><link>https://blog.epsiolabs.com/how-we-built-a-streaming-sql-engine</link><guid isPermaLink="true">https://blog.epsiolabs.com/how-we-built-a-streaming-sql-engine</guid><dc:creator><![CDATA[Maor Kern]]></dc:creator><pubDate>Tue, 01 Apr 2025 21:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1754224814399/4bbb0ae0-6f3f-4633-aa73-8852a709aae2.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>So you probably wake up every morning asking yourself three of life’s most pertinent questions- how do I build a streaming SQL engine, what even is a streaming SQL engine, and can our Lord drop tables owned by another user.</p>
<p>I too found myself asking these questions, sometimes even dreaming about them- often in the form of various SQL operators pointing and laughing at my incompetence as I beg them to answer me.</p>
<p>And so, a year ago, I (quite bravely if I may say so myself) packed my bags and set off on the long and treacherous journey to find answers to these questions. I went from monk to priest to spaghetti-enthusiast, shocked to find them all concerned with paltry questions such as the Meaning of Life and how to find peace within oneself. But eventually, desolate in the deepest pits of my mind, I happened by a small temple by the name of “Epsio Labs”- a feeling of tremendous revelation came over me, and I walked in.</p>
<p>Friends, today I will share with you the secrets I have found there (despite the numerous NDAs).</p>
<h1 id="heading-what-is-a-streaming-sql-engine"><strong>What is a Streaming SQL Engine?</strong></h1>
<p>A streaming SQL engine keeps queries’ results up to date without ever having to recalculate them, even as the underlying data changes. To explain this, imagine a simple query, such as <code>SELECT count(*) FROM humans</code> . A normal SQL engine (such as Postgres’s, MySQL’s) would need to go over all the different <code>humans</code> every time you ran that query- which could be quite costly and lengthy given our ever changing population count. With a streaming SQL engine, you would define that query once, and the engine would constantly keep the resulting count up to date as new humans were born and the old / sickly ones died off, without ever performing a recalculation of counting all humans in the world.</p>
<h1 id="heading-how-to-build-a-streaming-sql-engine"><strong>How to build a Streaming SQL engine</strong></h1>
<p>For the simple example above, you may have an idea of how a streaming SQL engine could work- first, you would need to do what any normal SQL engine does and calculate the number of humans. Next, every time a human was born you would add one to your result, and every time a human died you would subtract one from your result. Easy, right?<br />Let’s try creating a diagram showing a procedural “query plan” and how we would process a new human being born. We’ll have a series of nodes, one for each operation, and a “final” node which will represent a table with our results. Since we’re a streaming engine and dealing with <em>changes</em>, we’ll represent the messages passed between our nodes as:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:959/1*5aMhkvO4eM58JPFXTEhWjQ.png" alt /></p>
<p>where the key is <em>what</em> we want to change, and the modification is <em>by how much we want to change it</em>. So for example, if we wanted to pass a message to a node telling it “Hey Mr. Node, 1.5 Apples have been added”, it would look like this:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:646/1*LZFOpfqt3sL7wEGY2w_bRw.png" alt /></p>
<p>Each node will be responsible for receiving changes, performing some type of operation, and then outputting changes itself. Another important concept here is that we can add modifications together if they have the same key. So the two changes <code>apple: 1.5</code> and <code>apple: 2</code> are equivalent to <code>apple: 3.5</code>. If the modifications sum to zero, it’s as if nothing at all happened- for example, if we have two changes <code>apple: 3</code> and <code>apple: -3</code>, it’s equivalent to not having streamed any change at all (You can think of it as me giving you three apples, utterly regretting my kindness, then taking away your three apples. For you it would be as if nothing happened at all- aside from your broken pride). To make this more concrete, let’s draw out the nodes for our query (<code>SELECT count(*) FROM humans</code>), and add in the first human, Adam.</p>
<p><img src="https://miro.medium.com/v2/resize:fit:1260/1*zt4KzYosgm0hVwkFGOdPLQ.png" alt /></p>
<p>As we can see, a new human named “Adam” was born. The “Counter” node, perhaps called for its ability to count, holds an internal state of the current count of humans in the world. Whenever it receives a <em>change</em> (an addition or deletion of a human), it updates its internal count, and then outputs relevant changes- in this case, only one change was necessary, telling the next node to add the number one (signifying the total number of humans) once. In this instance, the next node was the “Final Results Table”, an actual honest-to-god relational table (perhaps in Postgres). You can imagine every change translating to a <code>DELETE</code> or <code>INSERT</code> based on the key and modification.</p>
<p>Next, let’s add in Eve:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:1260/1*Et0FR_DFbqsyO1ZHbYMbig.png" alt /></p>
<p>This time, the counter node needed to output two changes- one to cancel out the previous change it outputted, and one to add the new updated value. Essentially, if we were to look at all the changes the Counter node outputted over time, we’d get:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:360/1*ILTxWLJS2WXoclOpluRxog.png" alt /></p>
<p>Since we’re allowed to combine changes with the same key, the above is equivalent to:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:360/1*Vl1lYUw7hUMqKshsfYHv1w.png" alt /></p>
<p>A modification of zero means we can simply remove the change. Therefore, we’re left only with <code>2: +1</code> in the final result table- exactly what we wanted.</p>
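<p>This cancellation rule is small enough to sketch in a few lines of Python (a toy model; the real engine does this against persistent state):</p>
<pre><code class="lang-python"># Consolidate a stream of (key, modification) changes: modifications for
# the same key add up, and keys that net out to zero vanish entirely.
from collections import defaultdict

def consolidate(changes):
    totals = defaultdict(int)
    for key, modification in changes:
        totals[key] += modification
    return {key: mod for key, mod in totals.items() if mod != 0}

# The Counter node's output over time, consolidated:
# consolidate([(1, +1), (1, -1), (2, +1)])  returns  {2: +1}
</code></pre>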
<p>Are you starting to feel that sweet sweet catharsis? Ya, me too.</p>
<h1 id="heading-a-slightly-more-interesting-example"><strong>A slightly more interesting example</strong></h1>
<p>Let’s imagine we’re the devil and have two tables:<br />1. A “Human Table” that contains two columns, a unique identifier and their name.<br />2. An “Evil Table” that maps human ids to whether or not they are evil.</p>
<p><img src="https://miro.medium.com/v2/resize:fit:1260/1*1LF2yHSqNJErO5QdfQbFhA.png" alt /></p>
<p>Now, for obvious reasons, let’s say we wanted a count of evil humans per name:</p>
<pre><code class="lang-plaintext">SELECT humans.name, count(*) FROM humans 
JOIN evil_humans ON humans.id = evil_humans.human_id 
WHERE is_evil IS true 
GROUP BY humans.name
</code></pre>
<p>To be able to create a query plan (a series of nodes) from this query, we’re going to need to introduce a few new types of nodes here.</p>
<h2 id="heading-filter-node"><strong>Filter Node</strong></h2>
<p>A filter node filters a change by its <em>key</em>, regardless of its <em>modification</em>. If a change “passes” the filter, the filter outputs it as is. To really give you the feel of this, let’s diagram a filter node that passes only changes with keys that are equal to the word “cats”.</p>
<p><img src="https://miro.medium.com/v2/resize:fit:1125/1*7J3fvww666tGqp_Qyyeqdw.png" alt /></p>
<p>As we can see, we gave the filter node 3.4 cats and, astonishingly enough, it passed them on without changing a thing. Let’s try passing dogs through the filter node:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:1139/1*jNZQVsBrZD02qZ4qu-JC7A.png" alt /></p>
<p>Whoa! This time the filter did <strong>not</strong> pass anything on. I imagine you get the idea. Moving on.</p>
<h2 id="heading-joins"><strong>Joins</strong></h2>
<p>A Join node is responsible for receiving changes from two nodes, and outputting changes whose “join keys” match. It does this by holding an internal state (all in storage) of all the changes that pass through it from both sides, mapped by their respective join keys. So in our example with</p>
<pre><code class="lang-plaintext">JOIN evil_humans ON humans.id = evil_humans.human_id
</code></pre>
<p>we would create one Join node, with two mappings:<br />On the left side, for <code>id</code> to <code>name</code><br />On the right side, for <code>human_id</code> to <code>is_evil</code></p>
<p>In practice, the mappings would look something like this:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:1260/1*h8nyNBikQdp0qQHSetAaKg.png" alt /></p>
<p>Every time the Join node would receive a change from one of its sides, it would look on the other side for a matching key. If it found one, it would output the combined values. Let’s try drawing out a simple example where the Join node receives a new human named Tommy with id 232, and then a change saying id 232 is not evil.</p>
<p>First- a new human named Tommy enters the world:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:1260/1*_IvDajebrqy4gVIehKGFfw.png" alt /></p>
<p>Ok, we streamed a change that tells the Join node that Tommy (id 232) has been added. The Join node looks in its right mapping for a corresponding change for key 232, and finds none. It therefore outputs nothing; but it does update its <strong>internal mapping</strong> to reflect the fact Tommy has been added- this will help us when we do the following:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:1260/1*OEn-Ca9zMWbSNmxgK68-ug.png" alt /></p>
<p>Here, the Join node received a change from its <strong>right</strong> side telling it “id 232 is not evil”. The Join node then looked at its left mappings, found a corresponding change (232: Tommy- the change we just streamed before) and outputted the combined change.<br />But this is not the end of the Tommy saga- at any moment new changes could come in. Perhaps Tommy could fall down the steps and die. This would result in the Join receiving <code>Tommy, 232: -1</code>, which would then have the Join node outputting <code>(232, Tommy, false): -1</code> — cancelling out the previous change the Join sent. Or perhaps Tommy could change in his evilness- we’ll keep that idea for an example down the line.</p>
<hr />
<p>Side note- you may have noticed we said “join key to change” but don’t actually keep the modification count in the join mapping. In the real world we do, and then multiply the modification counts of both sides to get the outputted modification count.</p>
<hr />
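<p>Here’s a toy sketch of the Join node’s logic, including the modification multiplication from the side note (plain dicts stand in for the storage-backed mappings):</p>
<pre><code class="lang-python"># Toy Join node: one storage-backed mapping per side, keyed by join key.
# Plain dicts stand in for the persistent state described above.
from collections import defaultdict

class JoinNode:
    def __init__(self):
        # side -> join key -> value -> modification count
        self.sides = {
            "left": defaultdict(lambda: defaultdict(int)),
            "right": defaultdict(lambda: defaultdict(int)),
        }

    def on_change(self, side, join_key, value, modification):
        other_side = "right" if side == "left" else "left"
        # Update our own mapping first, so future changes can match us.
        self.sides[side][join_key][value] += modification
        # Probe the other side; per the side note, the outputted
        # modification is the product of both sides' modifications.
        out = []
        for other_value, other_mod in self.sides[other_side][join_key].items():
            left_v, right_v = ((value, other_value) if side == "left"
                               else (other_value, value))
            out.append(((join_key, left_v, right_v), modification * other_mod))
        return out

# join = JoinNode()
# join.on_change("left", 232, "Tommy", +1)  returns []  (no match yet)
# join.on_change("right", 232, False, +1)   returns [((232, "Tommy", False), +1)]
</code></pre>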
<h2 id="heading-group-bys"><strong>Group Bys</strong></h2>
<p>Our “Group By” node is very similar to the counter node we had before (in truth, they are one and the same wearing different hats- but that’s a tale for another time). The Group By node outputs aggregates per buckets- always ensuring that if you were to combine all changes it outputted, you would be left with at most one change per bucket (similar to how we took all the changes over time that came out of the Counter node, and saw that we’re left with only one as others cancel out). It does this by holding an internal mapping (in storage) between each bucket and its aggregated value. So in our example</p>
<pre><code class="lang-plaintext">SELECT humans.name, count(*) ... GROUP BY humans.name
</code></pre>
<p>The Group By node would hold a mapping something like this:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:900/1*RvSq9RJxuwcHmkvv-bdP0w.png" alt /></p>
<p>Let’s try drawing out a simple example showing what would happen if we entered a new name the Group By node has never seen before:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:1260/1*xazi6HkJ1kzkPZNkLDeE4g.png" alt /></p>
<p>Ok, what happened here? The change coming into the Group By node tells the Group By node to add one to Richard. The Group By node looks in its internal mappings, and sees it has no entry for Richard. It adds an entry, and then outputs that the amount of Richards is one (this is the <em>key</em> of the change- <code>(Richard, 1)</code> ). Let’s go ahead and add another two Richards:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:1260/1*E58HXQo0s8zKPOWpW5f-rQ.png" alt /></p>
<p>Slightly more interesting- the Group By node receives another change telling it to add two to Richard. The Group By node updates its internal state, and then outputs two changes- one to <strong>remove</strong> the previous change it outputted, and one to add a change with the new updated count of Richards.</p>
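<p>The Group By node’s “retract, then re-emit” behavior is easy to sketch as well (again a toy model, with a plain dict for the persistent bucket mapping):</p>
<pre><code class="lang-python"># Toy Group By (count) node: keeps a count per bucket and, on every update,
# retracts its previous output before emitting the new one.
from collections import defaultdict

class GroupByCountNode:
    def __init__(self):
        self.counts = defaultdict(int)  # bucket -> count (persistent in reality)

    def on_change(self, bucket, modification):
        out = []
        old = self.counts[bucket]
        if old != 0:
            out.append(((bucket, old), -1))   # cancel the change we sent before
        self.counts[bucket] += modification
        new = self.counts[bucket]
        if new != 0:
            out.append(((bucket, new), +1))   # announce the updated count
        return out

# node = GroupByCountNode()
# node.on_change("Richard", +1)  returns [(("Richard", 1), +1)]
# node.on_change("Richard", +2)  returns [(("Richard", 1), -1), (("Richard", 3), +1)]
</code></pre>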
<h2 id="heading-putting-it-all-together"><strong>Putting it all together</strong></h2>
<p>Back to our original query:</p>
<pre><code class="lang-plaintext">SELECT humans.name, count(*) FROM humans 
JOIN evil_humans ON humans.id = evil_humans.human_id 
WHERE is_evil IS true 
GROUP BY humans.name
</code></pre>
<p>To set the stage for our upcoming example, imagine a young goody-two-shoes coder named Tommy, id 232 (you may remember him long ago from our explanation about the join). Tommy was a super cool dude who regularly downvoted mean people on StackOverflow (evilness=false).</p>
<p>One day, Tommy got kicked in the head by a horse, and as a direct result force pushed to master while deleting the CI. We’ll represent this occurrence with two changes- one to cancel out the old change saying Tommy wasn’t evil <code>(232, false): -1</code> , and one to add in the new change saying he is evil<code>(232, true): +1</code> :</p>
<p><img src="https://miro.medium.com/v2/resize:fit:1800/1*6hZhe3iJC3W-5TaC0ICGsQ.png" alt /></p>
<p>Bonus points if you see a nice optimization we can do here- try playing with the order of the nodes :)</p>
<p>Let’s do a quick breakdown of what we see above- so we outputted the two changes we talked about <code>(232, false): -1</code> and <code>(232, true): +1</code> . The Join node receives it, looks on its other side (ids -&gt; names), finds a name (Tommy), and outputs the inputted changes together with the name “Tommy”. Next, the filter node <code>WHERE is_evil IS true</code> filters out the change <code>(232, false): -1</code> , and, since its evil value is false, only outputs <code>(232, true): +1</code> . The Group By node takes in this change, looks in its mappings, and sees there existed a previous entry for Tommy (while not a very evil name, I have met some mean Tommys in my life). The Group By node therefore sends out one change to remove the old change it sent out with <code>(Tommy, 7): +1</code> (this happened in a previous addition of an evil Tommy). It then sends out another change introducing the new change to the Tommy count.</p>
<h1 id="heading-wait-but-why-go-to-all-the-trouble"><strong>Wait, but why go to all the trouble?</strong></h1>
<p>So, now there are 8 Tommys in the world that are evil, and we didn’t need to rerun our query to calculate this. You may be thinking- well, Mr. Devil, you really didn’t need a streaming SQL engine to do that. If you had a humans table and evilness table, just create indexes on them. You’d still need to go over all the records each time queried, but at least the lookups would be quick. The Group By would still need to do everything from scratch, but at least…<br />So yes, it is possible to optimize queries so they run fairly quickly on the fly- up to a certain point. As more Joins, Group Bys, and (god forbid) WITH RECURSIVEs are added, it becomes more and more complex to optimize queries. And as more Tommys and Timmies and Edwards and Jennies (not to mention Ricardos and Samuels and Jeffries and Bennies) are added to our system, even those optimizations might begin to not be enough (and don’t even get me started on the evils of in-house caching). Streaming SQL engines are, to paraphrase a non-existent man, <em>totally badass,</em> and straight up solve this.</p>
<h1 id="heading-in-conclusion"><strong>In Conclusion</strong></h1>
<p>Feel ready to build a streaming engine? You definitely have the building blocks- but you’re missing a few major concepts here such as how to be consistent (to <em>never</em> output partial results), how to do this all with high throughput (in everlasting tension with latency), and how to interact well with storage (async-io anyone?). Maybe I’ll write another blog post, maybe not.</p>
<p>By the way, as for the Lord dropping a table owned by another user: it apparently comes down to if he uses RDS (on a self-hosted DB he’s a superuser, but on RDS nobody is. Yes, not even the Lord).</p>
<p>Since posting this article, the Elders at Epsio realized there’s no point to keeping their secrets so secret and published their streaming engine for the world to use- check out Epsio’s <em>blazingly fast</em> SQL engine <a target="_blank" href="https://www.epsio.io/">here</a>.</p>
]]></content:encoded></item><item><title><![CDATA[How we made our streaming joins 50% faster]]></title><description><![CDATA[When describing the core SQL (Or relational algebra) operators, the humble join shows up pretty fast. In fact, you would be hard pressed to find an app built on top of an SQL database which doesn’t use a join in one of its queries. For good reason! J...]]></description><link>https://blog.epsiolabs.com/how-we-made-our-streaming-joins-50-faster</link><guid isPermaLink="true">https://blog.epsiolabs.com/how-we-made-our-streaming-joins-50-faster</guid><dc:creator><![CDATA[Kobi Grossman]]></dc:creator><pubDate>Sun, 09 Feb 2025 22:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1753223793914/15058b56-4106-4fd3-b376-d679b27d36d6.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When describing the core SQL (Or relational algebra) operators, the humble <strong>join</strong> shows up pretty fast. In fact, you would be hard pressed to find an app built on top of an SQL database which doesn’t use a join in one of its queries. For good reason! Joining two sets of data based on some predicate is the basis for creating any sort of relation between logical entities, whether it is one-to-one, one-to-many, etc.</p>
<p>At Epsio, we’re building a fast, streaming SQL engine which computes the result of arbitrary SQL queries in real time, <a target="_blank" href="https://www.epsio.io/blog/how-to-create-a-streaming-sql-engine">based on changes</a>. So, it’s no surprise that we’re very motivated to have an optimized implementation of join. In this post we’ll describe a key idea that allowed us to optimize our streaming join in common cases.</p>
<p>But before we dive into optimizations, let’s first describe how the <strong>Symmetric hash join</strong> algorithm, which is commonly used in stream processing, works. We will only discuss inner join in this post for simplicity, even though there is a lot to talk about other join types.</p>
<h2 id="heading-background-symmetric-hash-join"><strong>Background: Symmetric Hash Join</strong></h2>
<p>Let’s say that we are running a bookstore, and there are 2 tables in our sales database: <code>customer</code> and <code>store_sales</code>. Each row in the <code>customer</code> table has a unique <code>id</code> primary key along with a <code>name</code> column. To connect customers with sales, each <code>store_sales</code> row has a <code>buyer_id</code> column pointing to the relevant customer, and a <code>book_name</code> specifying the book name. We can visualize this relation like so:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:1260/1*2wI8NCG9XfDUzZOl_cFe7w.png" alt /></p>
<p>If we want to retrieve all the customers with the book names they bought, our query would look like this:</p>
<pre><code class="lang-vbnet">
SELECT customer.name, store_sales.book_name FROM customer
JOIN ON customer.id = store_sales.buyer_id;
</code></pre>
<p>The result in this example would be a single row: <code>("Adam", "How to pronounce SQL, and other flamewars")</code>.<br />How can we implement this join?</p>
<p>If we have all the data ahead of time, a naive solution could be to first build a hashtable for <code>store_sales</code> keyed by <code>buyer_id</code>. We would then iterate over each row in <code>customer</code>, reading the matching sale. In Python:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Build the store_sales hashtable</span>
<span class="hljs-keyword">for</span> sale <span class="hljs-keyword">in</span> store_sales:
    store_sales_hashtable[sale.buyer_id] = sale.book_name
<span class="hljs-comment"># Probe the store_sales hashtable using customer</span>
<span class="hljs-keyword">for</span> c <span class="hljs-keyword">in</span> customer:
    <span class="hljs-keyword">if</span> c.id <span class="hljs-keyword">in</span> store_sales_hashtable:
        sale = store_sales_hashtable[c.id]
        output_result((c.name, sale.book_name))
</code></pre>
<p>This is the classic <strong>hash join</strong> algorithm.</p>
<p>In a streaming context however, things get more complicated. We don’t have all the data ahead of time, but instead receive live changes. This means that, e.g., whenever a new sale is added, the streaming engine receives a change message containing the new row just added to <code>store_sales</code>. It then needs to understand how this changes the query result, and issue an update to the previous result accordingly. In the case of our join example, it looks like this:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:1260/1*0AWHokyPXivUDrbO_WsP7Q.png" alt /></p>
<p>We received a new sale, with <code>buyer_id=1</code>. So we need to check if we have a customer with <code>id=1</code>, and if so retrieve their name to return the correct result.</p>
<p>This is exactly where our classic hash join fails. We build a hashtable for <code>store_sales</code>, but we have <strong>no state for</strong> <code>customer</code>. In the classic algorithm, we can probe the hashtable using the full data of <code>customer</code> table. Here we don’t have the entire table available, since we are working based on changes and don’t have all the data up front. Thus we won’t be able to find the matching customer for our sale.</p>
<p>If we flip the roles and only build a hashtable for <code>customer</code> instead, we encounter a similar problem. Assume that instead of adding a sale we receive the following change: the name of “Adam” has changed to “Adam, for real” (as it so happens). Our new result would then look like this:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:1260/1*rWog4ObHdeL8DvzzNHM9Dw.png" alt /></p>
<p>But if we only store a hashtable for <code>customer</code>, we will not be able to find the matching <code>store_sales</code> rows when the change is received! Luckily, there is a simple change that solves our problem.</p>
<p>Well, like many other problems in  computer science, this one is also resolved by adding another hashtable. In particular, we maintain two hashtables: one for <code>customer</code> keyed by <code>id</code>, and another for <code>store_sales</code> keyed by <code>buyer_id</code>.</p>
<p>Now whenever a new customer is added, we lookup the <code>store_sales</code> hashtable by the customer id to return the matching sales. We then write the customer to the customers hashtable, to ensure that it is visible for future lookups for when another sale is added.</p>
<p>The other direction is similar: when a new sale is added, we lookup the matching customer in the customer hashtable to see who bought the item. We then write the new sale to the <code>store_sales</code> hashtable.</p>
<p>So  when Adam buys a new book, we lookup the customer hashtable with <code>buyer_id=1</code> and write the new sale:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:1260/1*jsXSogcm-2DfSNq7tOGRvw.png" alt /></p>
<p>We will then return the new result row <code>("Adam", "Dataflow and the art of incremental view maintenance")</code>. This is the essence of the <strong>Symmetric Hash Join</strong> (or <em>SHJ</em> for short) algorithm: on a change to one side of the join, write the change to that side’s hashtable, and then read the other side’s hashtable to find the matching rows. Pretty simple, right?</p>
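<p>In the same spirit as the classic version above, here’s a simplified sketch of SHJ (in-memory and insert-only; the real thing also handles deletions and persistence, and <code>output_result</code> is the same illustrative helper as before):</p>
<pre><code class="lang-python"># Simplified symmetric hash join: one hashtable per side (insert-only).
customer_hashtable = {}      # id -> name
store_sales_hashtable = {}   # buyer_id -> [book_name, ...]

def on_customer_insert(customer_id, name):
    customer_hashtable[customer_id] = name                # write own side
    for book_name in store_sales_hashtable.get(customer_id, []):
        output_result((name, book_name))                  # probe other side

def on_sale_insert(buyer_id, book_name):
    store_sales_hashtable.setdefault(buyer_id, []).append(book_name)
    if buyer_id in customer_hashtable:                    # probe other side
        output_result((customer_hashtable[buyer_id], book_name))
</code></pre>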
<h2 id="heading-the-population-problem"><strong>The Population Problem</strong></h2>
<p>This algorithm is the core join algorithm used in Epsio. There is one key detail we have glossed over though: most of the time, the hashtables for the 2 join sides don’t fit in memory. We need persistence to be able to do the join at reasonable data sizes.</p>
<p>This is a perfect candidate for using a <strong>Key value store</strong>: Our key is the join key, and the value is the row. In particular, we use the rock-solid <a target="_blank" href="https://rocksdb.org/">RocksDB</a> to save the join states, with each state in a separate RocksDB database. As RocksDB plays a big role in optimizing our join implementation, we will describe it in more detail below.</p>
<p>Another important detail, and what brought us to look more closely at join optimization, is the <strong>population</strong> phase. Epsio, like other streaming engines, has an initial data load phase. Before we can respond to live changes to data and update the query result, we must first compute the result for the existing data in the user’s database. We also need to populate our inner states, like the join hashtables, so that we can stream changes to compute updated results. This phase can be a computationally intensive task, since the initial amount of data can get quite large.</p>
<p>We noticed that with many joins in the query, our population time wasn’t meeting our expectations. When thinking about it, this isn’t surprising: during population, we can receive hundreds of thousands of rows as input into one join at a time, with different keys. Following the SHJ algorithm as described above, we will be performing these 2 steps <em>a lot</em>:</p>
<ol>
<li><p>Lookup a key in a RocksDB database. This will be Disk I/O once the join database is large enough.</p>
</li>
<li><p>Write the join key and its corresponding rows to another RocksDB database. The disk I/O likely won’t happen immediately, as RocksDB flushes data to disk in a separate thread, as detailed below.</p>
</li>
</ol>
<p>These 2 steps make the symmetric hash join both a <strong>write intensive</strong> and a <strong>read intensive</strong> operation. No wonder that it’s a big part of our population time! Given the importance of join, we started profiling to see how we can improve its performance.</p>
<h2 id="heading-lets-rock"><strong>Let’s rock!</strong></h2>
<p>To get an accurate picture for where we can improve, we benchmarked the population of a <a target="_blank" href="https://www.tpc.org/tpcds/">TPC-DS</a> query with many joins, <code>query4</code>. Here’s a shorter version of it with the key details:</p>
<pre><code class="lang-csharp">
 <span class="hljs-keyword">select</span> 
       c_customer_id, 
       d_year,
       sum(ss_ext_list_price - ...) year_total,
       <span class="hljs-string">'s'</span> sale_type,
       ...
<span class="hljs-keyword">from</span> customer
   <span class="hljs-keyword">join</span> store_sales <span class="hljs-keyword">on</span> c_customer_sk = ss_customer_sk
   <span class="hljs-keyword">join</span> date_dim <span class="hljs-keyword">on</span> ss_sold_date_sk = d_date_sk
<span class="hljs-keyword">group</span> <span class="hljs-keyword">by</span> c_customer_id ,d_year, ...
union all

<span class="hljs-keyword">select</span> 
      c_customer_id,
      d_year,
      sum(cs_ext_list_price - ...) year_total,
      <span class="hljs-string">'c'</span> sale_type,
      ...
<span class="hljs-keyword">from</span> customer
   <span class="hljs-keyword">join</span> catalog_sales <span class="hljs-keyword">on</span> c_customer_sk = cs_bill_customer_sk
   <span class="hljs-keyword">join</span> date_dim <span class="hljs-keyword">on</span> cs_sold_date_sk = d_date_sk
<span class="hljs-keyword">group</span> <span class="hljs-keyword">by</span> c_customer_id, d_year, ...
union all

<span class="hljs-keyword">select</span> c_customer_id,
       d_year,
       sum(ws_ext_list_price - ...) year_total,
       <span class="hljs-string">'w'</span> sale_type,
       ...
<span class="hljs-keyword">from</span> customer
   <span class="hljs-keyword">join</span> web_sales <span class="hljs-keyword">on</span> c_customer_sk = ws_bill_customer_sk
   <span class="hljs-keyword">join</span> date_dim <span class="hljs-keyword">on</span> ws_sold_date_sk = d_date_sk
<span class="hljs-keyword">group</span> <span class="hljs-keyword">by</span> c_customer_id, d_year, ...
</code></pre>
<p>(There is also another inner self-join later in the query, but we can ignore it for the purpose of our analysis here).</p>
<p>The goal of this query is to compute the yearly total purchases per <code>customer</code>, across different sales channels: <code>store_sales</code>, <code>catalog_sales</code> and <code>web_sales</code>. For this purpose, there is a join between <code>customer</code> and each of the sales tables, along with a join of the <code>date_dim</code> dimension table with every sales table. This gives a total of 6 joins in the core of our query — perfect for analysis!</p>
<p>When looking at profiles for this query, we quickly noticed that RocksDB is a bottleneck. For example, writes to the databases of the different join sides  would take a very long time. We started tackling this by looking into the usual suspect: <a target="_blank" href="https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide">Tuning RocksDB configuration</a>. This worked well and resolved our long running writes, but we were still left with much room for improvement.</p>
<p>Studying the flamegraph more carefully, we found it. The <em>ancient curse.</em> The one that will strike fear into the heart of even the most experienced RocksDB users. The word that was hanging in the air since the start of this post like Chekhov’s merge-sort, the one they have all guessed by now but hoped it wouldn’t show up. The creature lurking in the background threads, in all its glory. <strong>Compaction!</strong></p>
<p><img src="https://cdn.prod.website-files.com/66e7c0e8297c80a6f19bbd49/67a8b9e806da3e192ced5b41_5.webp" alt /></p>
<h2 id="heading-whats-compaction-or-where-did-my-cpu-cycles-go"><strong>What’s compaction, or: Where did my CPU cycles go?</strong></h2>
<p>(If the sentence “We need to compact L0 to reduce our read amplification” makes sense to you, feel free to skip this part)</p>
<p>To understand what’s happening, we first need to take a detour and understand how RocksDB stores data. This is a topic for its own blog post, with <a target="_blank" href="https://artem.krylysov.com/blog/2023/04/19/how-rocksdb-works/">many</a> <a target="_blank" href="https://docs.speedb.io/rocksdb-basics">resources</a> already covering it. We will give a short summary of the read and write flows to see why we need compaction.</p>
<p>Whenever a new (Key, Value) pair of bytes is inserted to a RocksDB database, it is first written to an in memory data structure called a <strong>memtable.</strong> This data structure is usually a <a target="_blank" href="https://en.wikipedia.org/wiki/Skip_list">Skiplist</a>, but all that matters is that it is a sorted list of <code>(K, V)</code> pairs. Once a memtable exceeds its configured capacity, 64MB by default, it is marked as <em>immutable.</em> A new <em>active</em> memtable is created in its place, while the immutable memtable is “sealed” and inserted into a queue of memtables waiting to be flushed to disk. The memtables are flushed in the background to not stall writes.</p>
<p>A memtable is flushed to disk as an <strong>SST</strong> file. The specific format doesn’t matter much here; the main point is that each SST is sorted, and thus has a first and last key in some order. For example, at some point in time the <code>store_sales</code> database can look like this:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:1260/1*0kXX_b7fisPPMb6Lfw3MlQ.png" alt /></p>
<p>Notice that while both SST files are sorted, their key ranges are <strong>overlapping.</strong> Indeed, so far we haven’t described any intentional merging of sorted data. If our keys aren’t arriving in sorted order, which happens most of the time, we will naturally have multiple sorted but overlapping files on disk.</p>
<p>So what does this mean for our read flow? If we want to find a key <code>K</code>, we will have to:</p>
<ol>
<li><p>Do a binary search on the active memtable</p>
</li>
<li><p>If not found, do a binary search on all immutable memtables</p>
</li>
<li><p>If not found, perform a binary search on <strong>every SST file</strong>, loading blocks from disk for search</p>
</li>
</ol>
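<p>As a toy model of this read path (dicts stand in for the memtables’ binary searches, and <code>sst.lookup</code> is a hypothetical block-loading search):</p>
<pre><code class="lang-python"># Hedged sketch of the LSM read path described above (toy, in-memory model;
# real memtables are skiplists searched in O(log n), not dicts).
def get(key, active_memtable, immutable_memtables, sst_files):
    if key in active_memtable:             # 1. active memtable
        return active_memtable[key]
    for memtable in immutable_memtables:   # 2. immutables, newest first
        if key in memtable:
            return memtable[key]
    for sst in sst_files:                  # 3. every SST file; with overlap,
        value = sst.lookup(key)            #    this means lots of disk I/O
        if value is not None:
            return value
    return None
</code></pre>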
<p>Step 3 is where we will suffer from key overlap the most. We have to check every SST file for our key, and the number of SST files can be in the thousands. Lots of I/O, lots of waiting for our KV pair. How can we speed this up?</p>
<p>This is where the <strong>Compaction</strong> background process comes into play, which occurs in any LSM-tree based database and RocksDB in particular. Its main purpose is to merge multiple overlapping SSTs into non-overlapping files (and to delete entries, but that is out of scope for our insert-only population flow).</p>
<p>There are several compaction algorithms, with complicated conditions for what to compact, when and how. This is an entire (interesting!) field of research which we will not get into here. Suffice to say that the compaction process will write new SST files instead of the existing ones, performing a merge sort so that we have a sorted “run” of data:</p>
<p><img src="https://cdn.prod.website-files.com/66e7c0e8297c80a6f19bbd49/67a8c3152afa6eb063fc7e3b_7.png" alt /></p>
<p>Now when we look up a key, we only have to search one SST. Nice!</p>
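<p>A toy model of the merge itself (real compaction streams SST blocks from and to disk, of course):</p>
<pre><code class="lang-python"># Toy model of compaction: merge overlapping sorted SSTs into a single
# sorted, non-overlapping run, keeping only the newest value per key.
import heapq

def compact(ssts):
    # ssts: sorted (key, value) lists, ordered oldest -> newest.
    # Tag entries with the file's age so newer files win on duplicate keys.
    tagged = ([(key, age, value) for key, value in sst]
              for age, sst in enumerate(ssts))
    output, last_key = [], object()
    for key, _, value in heapq.merge(*tagged):
        if key == last_key:
            output[-1] = (key, value)  # newer duplicate overwrites older
        else:
            output.append((key, value))
            last_key = key
    return output  # one sorted, non-overlapping run
</code></pre>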
<p>But there’s a catch. Compaction is an <strong>intensive process</strong> by itself. We read many SST files, perform a merge sort on a large amount of keys, and write possibly many SST files back. In RocksDB, compaction happens in background threads, causing the many <code>BGThread</code> samples we saw above.</p>
<p>There are many strategies to reduce compaction overhead in RocksDB and in general, but you can’t avoid its overhead entirely. Or can you?</p>
<h2 id="heading-disabling-compaction-for-bulk-load"><strong>Disabling compaction for bulk load?</strong></h2>
<p>Because of high compaction overhead, it is <a target="_blank" href="https://rockset.com/blog/optimizing-bulk-load-in-rocksdb/">common advice</a> to disable compaction in bulk load cases like our join during population, and do one large compaction at the end. Sadly, this didn’t work for our use case: empirically, we didn’t see much of an improvement from it. We also can’t avoid compacting for too long; as noted above, the join algorithm is also <strong>read intensive:</strong> we write to one join side database and immediately read from the other. If we don’t compact, our read performance will suffer.</p>
<p>At this point we started looking a bit more into the data for our specific benchmark, TPC-DS <code>query4</code>, in the SF100 (a raw size of ~100GB) version of the dataset. We then noticed something interesting…</p>
<h2 id="heading-the-asymmetry-of-joined-tables"><strong>The asymmetry of joined tables</strong></h2>
<p>As you may remember, as part of the query we had a join of <code>customer</code> with 3 sales tables: <code>store_sales</code>, <code>catalog_sales</code> and <code>web_sales</code>. We checked the row count for each of these tables, and found out these numbers:</p>
<ul>
<li><p><code>customer</code>: 2M rows</p>
</li>
<li><p><code>store_sales</code>: ~<strong>288M</strong> rows</p>
</li>
<li><p><code>catalog_sales</code>: ~<strong>144M</strong> rows</p>
</li>
<li><p><code>web_sales</code>: ~<strong>72M</strong> rows</p>
</li>
</ul>
<p>The <code>customer</code> table is much smaller than the tables it is joined with! Anywhere from 36x to a 144x size difference. This isn’t a TPC-DS specific thing of course. One of the use cases of join is to model one-to-many relations, in some cases one-to-much much more.</p>
<p>This is great news for us! Even though the symmetric hash join algorithm itself is, well, symmetric, our data is not. This means that one side of the join will be write intensive, and the other side will be read intensive. So how can we use this asymmetry to our benefit?</p>
<h2 id="heading-using-the-asymmetry-to-minimize-compaction-overhead"><strong>Using the asymmetry to minimize compaction overhead</strong></h2>
<p>This unlocks many possible optimizations; the one relevant to our compaction problem is: <strong>When one side of the join has finished populating, stop compacting the other side</strong>.</p>
<p>Why does this work? Let’s look at the <code>customer</code> and <code>store_sales</code> join again, with sizes of 2M and 288M rows, respectively. We have far fewer customers than sales, so at some point all of our customer rows have arrived, and compaction has finished for the <code>customer</code> database:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:1260/1*uWLhyJLdi__9NDRljgjBvA.png" alt /></p>
<p>At the same time, new <code>store_sales</code> batches of rows keep arriving, creating more SSTs. Once the last customer has been written to the database, we stop compacting <code>store_sales</code>, causing an overlap with the newer files written:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:1260/1*fZ62NO7l5-asfDVzPZP5GQ.png" alt /></p>
<p>However, this doesn’t matter. From this point on we are only writing to <code>store_sales</code>, and only reading from the already compacted <code>customer</code>. Thus we get to both write and read fast, without paying for unnecessary compaction.</p>
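<p>Control-flow-wise, the optimization is tiny. Here’s a hedged sketch (the <code>db</code> wrappers are hypothetical; in RocksDB proper this maps to <code>DB::SetOptions</code> with <code>disable_auto_compactions</code>):</p>
<pre><code class="lang-python"># Hedged sketch, not Epsio's actual code: the db objects are hypothetical
# wrappers; in RocksDB the underlying call is
# db->SetOptions({{"disable_auto_compactions", "true"}}).
def on_population_progress(left_db, right_db, left_done, right_done):
    # The moment one join side has fully populated (and compacted),
    # stop compacting the other: from here on we only write to it,
    # while all lookups hit the already-compacted finished side.
    if left_done and not right_done:
        right_db.set_options({"disable_auto_compactions": "true"})
    elif right_done and not left_done:
        left_db.set_options({"disable_auto_compactions": "true"})
</code></pre>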
<p>We went on to implement this optimization, and noticed a roughly <strong>1.5x</strong> improvement in population times of join-heavy queries! That’s awesome, but can we do better?</p>
<h2 id="heading-can-we-exploit-the-asymmetry-further"><strong>Can we exploit the asymmetry further?</strong></h2>
<p>The insight regarding join sizes enables other interesting optimizations. For example, in RocksDB the memtable type can be specified. One memtable type that we mentioned earlier is <code>SkipList</code>, which allows concurrent writes but has <code>O(log n)</code> lookup cost. Another is <code>HashSkipList</code>, which, as the name suggests, is a hashtable where each bucket is a skiplist. The <code>HashSkipList</code> type doesn’t support concurrent writes, but has an (amortized) <code>O(1)</code> lookup speed.</p>
<p>Perhaps we could use the fast write yet slightly slower read <code>SkipList</code> for the larger join side, and the slower write but faster read <code>HashSkipList</code> for the smaller join side?</p>
<p>Honestly, we don’t know yet — we still need to verify this. All optimization ideas are good on paper, but aren’t very interesting without careful measurements. But the general idea is the same: there are probably many ways we can use the (reasonable) assumption of join size asymmetry to our advantage.</p>
<h2 id="heading-key-takeaways"><strong>Key Takeaways</strong></h2>
<p>Given the importance of our population performance and the prevalence of join, it was great to see such an improvement from this technique. But beyond this specific case, we’ve learned some lessons which can be applied more broadly:</p>
<ul>
<li><p><strong>Know your data:</strong> This is a big one. There are many known optimization techniques that work well across the board, like reducing excessive memory copies, avoiding serialization costs, etc. These should be implemented, of course — making all cases faster is definitely important!<br />  But treating the underlying data as a black box misses an important piece of the puzzle. The real world isn’t a collection of uniform random variables (thankfully, that would’ve been really boring).<br />  When considering optimizations, treating the data that way can miss interesting ideas. If real world joins tend to behave in a certain way, why not use that?<br />  In fact, TPC-DS and other benchmarks make this line of thought easier: even though every synthetic benchmark dataset is not entirely realistic, they <em>are</em> designed to model the real world. Using them as a baseline gives a good starting point for these sorts of ideas.</p>
</li>
<li><p><strong>Know your stack:</strong> Like in any database system, Epsio has many components, with RocksDB being an important one, especially when considering performance. Even though it’s tempting to go with a catch-all “good” configuration for all streaming operators, this misses many avenues for optimization. Deep knowledge of how the underlying storage engine works (in the case of RocksDB, sharpening your C++ skills along the way) is crucial.<br />  On a more personal note, it is also very fun! From the <code>Node</code> struct representation in the <code>SkipList</code> implementation to the high level compaction algorithms, RocksDB is a true marvel of engineering which we’ve learned a lot from.</p>
</li>
<li><p><strong>Measure, measure, measure:</strong> This is a classic optimization talking point, and rightly so. The more you <strong>actually profile a real workload</strong>, the better you optimize.. the real workload. Discussing possible slowdowns is beneficial, and microbenchmarks can reveal some easy wins. But profiling with good visualization for what’s really going on in the system is king. We’ll elaborate more on our methodology here in a future blog post, so stay tuned.</p>
</li>
</ul>
<p>These points continue helping us improve the performance of Epsio. Want to learn more about how we achieve fast response time in our engine? Check out our <a target="_blank" href="https://docs.epsio.io/">docs</a>.</p>
]]></content:encoded></item><item><title><![CDATA[Epsio's Diff in the Streaming Landscape]]></title><description><![CDATA[We recently published an article that gained a bit of traction about how we built a streaming engine, and though the concepts therein received much adulation and praise, many members of the community expressed confusion about how Epsio is different t...]]></description><link>https://blog.epsiolabs.com/epsios-diff-in-the-streaming-landscape</link><guid isPermaLink="true">https://blog.epsiolabs.com/epsios-diff-in-the-streaming-landscape</guid><dc:creator><![CDATA[Maor Kern]]></dc:creator><pubDate>Fri, 03 Jan 2025 22:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1753223521497/80644c76-3b52-4c01-925b-ed54fc0709c0.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>We recently <a target="_blank" href="https://www.epsio.io/blog/how-to-create-a-streaming-sql-engine">published an article</a> that gained a bit of traction about <em>how</em> we built a streaming engine, and though the concepts therein received much adulation and praise, many members of the community expressed confusion about how Epsio is different than:</p>
<ul>
<li><p>Materialize</p>
</li>
<li><p>Flink</p>
</li>
<li><p>ReadySet</p>
</li>
<li><p>Snowflake Dynamic Tables</p>
</li>
<li><p>BigQuery Incremental Views</p>
</li>
<li><p>Oracle Fast Refresh</p>
</li>
<li><p>MSSQL Indexed Views</p>
</li>
<li><p>Database Triggers</p>
</li>
</ul>
<p>I won’t be going through every similar product, but I’d like to break down the so-called magic bullet that is “incremental views”- what inherent tradeoffs exist, and where (and why) Epsio is positioning itself.</p>
<h4 id="heading-a-briefer-why-is-everyone-so-interested-in-incremental-views"><strong>A briefer- Why is everyone so interested in incremental views?</strong></h4>
<p>Incremental views allow you to define a query once, and have the query’s results stay up to date without ever having to recalculate them, even as the underlying data changes. Imagine those costly queries that make your users wait <strong>multiple seconds</strong> and absolutely eat up your database’s resources (we’ve even been privy to dastardly situations where queries take upwards of 30 minutes). With incremental views, those same queries all of a sudden take milliseconds with almost no compute, all with one simple command- <code>CREATE INCREMENTAL VIEW SELECT ... FROM ...</code>.<br />Here’s a brief explanation from our other article about how this works:</p>
<blockquote>
<p><em>A streaming SQL engine keeps queries’ results up to date without ever having to recalculate them, even as the underlying data changes. To explain this, imagine a simple query, such as SELECT count(*) FROM humans. A normal SQL engine (such as Postgres’s, MySQL’s) would need to constantly go over all the different humans every time you ran that query- which could be quite costly and lengthy given our ever changing population count. With a streaming SQL engine, you would define that query once, and the engine would constantly keep the resulting count up to date as new humans were born and the old ones died off, without ever performing a recalculation of counting all humans in the world. In comprehensive streaming engines this is possible with every sort of query, no matter the complexity.</em></p>
</blockquote>
<p>This can be a huge improvement with very little overhead. Furthermore, in big data environments, this has huge cost implications- every programmer who has worked with BigQuery or Snowflake shudders at the words “Full Table Scan”. Incremental views eliminate them; your costs correlate with your rate of change, <strong>not</strong> the amount of overall data in your system.</p>
<p>So, given incremental views are interesting, let’s talk about a few categories of incremental views, and where Epsio sees itself in comparison with them.</p>
<h4 id="heading-first-category-the-databaseswarehouses-themselves-batch-processors-in-disguise"><strong>First Category — The Databases/Warehouses themselves- Batch Processors in disguise</strong></h4>
<p>Oracle, MSSQL on the DBMS side, BigQuery and Snowflake on the warehouse side.<br />From a UX standpoint, it would have been excellent if you could just define incremental views using these engines instead of needing another solution for them. The above databases/warehouses each attempted in their own way to give you some type of ability to create incremental views without needing to use another service.</p>
<p>The problem is that under the hood, these databases have planning &amp; query engines that were built from the ground up to be <strong>batch processors</strong>. Whenever queried, these databases look at a snapshot of their complete set of data, and calculate a result. They do not deal with a stream of changes (e.g., “Deal with this insert. Deal with this update. Deal with this delete. etc”), but with constant data (e.g., “Deal with this static dataset”). Their queries are heavily optimized to be “short lived” (a 30 minute query is considered incredibly long), and produce <strong>one single result set</strong>. A stream processor’s queries normally live forever, and continuously produce <strong>new results as changes come in</strong>. This affects every single part of how they implement both the planning and execution of a query. The language between the query nodes is different; batch engines pass data between their nodes, and have no way to pass deletions/updates. The optimization tradeoffs are different; batch processors usually need much more compute and the ability to go over large amounts of base data quickly, while stream processors need little compute and lots of storage for intermediary representations of data. Even the flow of data is often different- in batch processors, you’ll normally have the query nodes “pulling” data from the nodes above them, while streaming processors normally work in “push” mode (which means different communication channels altogether).</p>
<p>But the proof is in the pudding. Oracle, MSSQL, and BigQuery have endless constraints on what you can do in their incremental views- the easiest example being that none of them support Outer Joins (Left/Right etc). And it makes sense; when reading through the three-page essay that is the list of constraints on these databases, you get the monumental feeling that they’ve created a mountain of patches on their existing execution engines to support this, and that it had the expected effects on their user facing capabilities. This is by no means a bashing of those databases; they’re absolutely incredible at what they do. Stream processing just isn’t what they do.<br />Snowflake, by the way, is a bit more interesting, with their dynamic tables being released as a preview in June. They’re a bit more vague on what they actually support, and purposefully did not call them incremental views. If they do not support your query, they’ll just do a full refresh behind the scenes (and charge you as such). Playing with them and talking with some companies that tried to use them tells us that they’re valuable as a slightly smarter query refresher, but by no means a streaming solution.</p>
<h4 id="heading-second-category-stream-processors"><strong>Second Category — Stream Processors</strong></h4>
<p>Materialize, ReadySet, Flink, Epsio, etc- all built from the ground up to be stream processors, all very efficient; Materialize and ReadySet were built by pioneers in the academic space.</p>
<p>While on the surface they may look similar, there are a multitude of tradeoffs and questions that arise as you start looking at each one in depth.</p>
<h5 id="heading-1-does-this-stream-processor-give-correct-answers"><strong>1. Does this stream processor give “correct” answers?</strong></h5>
<p>This may seem like a bit of an obvious question, but there are actually a variety of <strong>consistency</strong> <strong>models</strong> in databases. A consistency model is a sort of “contract” the database gives you, with certain guarantees for how writes will be taken into account in reads. The strongest level of consistency guarantees that every write will be taken into account on subsequent reads- all your average SQL databases (Oracle, MySQL, Postgres, etc) give you this. Most stream processors do not- as a matter of fact, none of the above (Materialize, ReadySet, Flink and Epsio) do. What they do guarantee is that at a certain point in time after writing (maybe a few milliseconds, maybe a few seconds) the inserted record will be taken into account.</p>
<p>For most use cases, eventual consistency of 50ms is something you can swallow. What can be a much bigger issue is if you’re not <strong>internally consistent.</strong> Internal consistency means that every answer given was true at some point in time. Some systems, like Flink, are <a target="_blank" href="https://www.scattered-thoughts.net/writing/internal-consistency-in-streaming-systems/">not internally consistent</a>, which is usually a no-go from the get-go for companies. Imagine an alert going out about an event that never happened, or a user going into your dashboard at <em>just the wrong time</em> and seeing their balance is below zero. This is true of other streaming engines as well, and something I highly advise checking before picking a streaming engine.</p>
<h5 id="heading-2-how-much-costrisktime-is-inherent-in-trying-to-integrate-them"><strong>2. How much cost/risk/time is inherent in trying to integrate them?</strong></h5>
<p>To answer this per stream processor, we’ll need to understand their <strong>mode of integration</strong>. Some stream processors are built to be more “standalone”, and some are meant to complement existing systems. If you already have a database/data-warehouse, standalone stream processors mean replacing your current database. <a target="_blank" href="https://materialize.com/">Materialize</a>, for example, is an incredible streaming engine, highly geared towards realtime latency in big data environments. They’re a good example of tending towards more of a standalone engine (and you can see over time how this has become more and more the case), looking to be a <strong>replacement</strong> for existing data warehouses (or ETL processes). A few others I can throw into the category of tending towards standalone engines are <a target="_blank" href="https://www.arroyo.dev/">Arroyo</a> (serverless stream processing) and <a target="_blank" href="https://www.feldera.com/">Feldera</a> (a data warehouse, built upon the uber-cool-bleeding-edge DBSP). They’re an excellent choice if you want to create new solutions (say, a new warehouse), but can often carry too-heavy integration costs if you’re looking to optimize existing queries/materialized views, as you’ll probably need to move your data warehouse to use only them if you don’t want to end up managing two data warehouses (a quote I really like- “replacing your database/data warehouse is like trying to replace a car’s engine while driving”).</p>
<p>For stream processors that are complementing current systems, the question becomes- <strong>what risks does this new platform pose architecturally, and how do I integrate with it</strong>?</p>
<p>Here ReadySet is very interesting. They sit as a proxy between your server and your database in order to offer partial materialization. Sitting as a proxy has both upsides and downsides; ReadySet can intercept queries and automatically reroute them to their corresponding incremental views (that you define in their portal), thereby taking away the need to make any code changes. The downside is that adding a proxy to your database is adding another Single Point of Failure to your entire system. If ReadySet goes down, the connection from your backend to your database goes down with it. The irony is that although ReadySet is very easy to integrate from a code standpoint, it would take a courageous architect to add a proxy without extensive testing; it makes it more difficult to check a specific use case and grow from there.</p>
<h4 id="heading-so-where-is-epsio-in-all-this"><strong>So where is Epsio in all this?</strong></h4>
<p>Our aim is to give a seamless UX within your existing database, while minimizing risk &amp; integration time. Our belief is that while the implementation requires a different solution than a batch processor, the user experience ought to be as though it were available within your very own database.</p>
<p>From an architectural standpoint, Epsio sits “behind” your database as another replica, and writes back to your existing database. This allows your backend to converse with your database without ever knowing about Epsio. If Epsio were to go down, your backend would be unaffected, and even Epsio’s own incremental views would return results, albeit “stale”. From a risk standpoint, this can be crucial for integrating a new technology; architects are often extremely risk-averse when it comes to production environments, and this architecture allows you to try out Epsio on one use case without risking others.</p>
<p>To give a “seamless” UX, you never interact with Epsio directly, instead calling functions that Epsio creates within your existing database (such as “epsio.create_view”, or “epsio.alter_view”). These functions are transaction safe, which can be critical for the workflow of most companies who want to create/alter views within migrations.</p>
<p>Lastly, there’s an inherent tradeoff between speed and cost that stream processors have to deal with. Epsio is built from the ground up to work well with storage, vastly reducing costs compared to in-memory engines. If you want sub-millisecond latency no matter the cost, Epsio may not be the best engine. But if you’re looking for high throughputs at ~50ms latency, Epsio delivers.</p>
<p>...</p>
<p>I think at a bit of a more “meta” level, we believe that incremental view theory has finally reached the point where it’s no longer a question of SQL operator support, or of making it <em>that bit more speedy</em>, but of how well it ties into your current architecture, both in speed of integration and in the cost of the actual product. This is because of incredible work done by the folks who built Noria, Differential Dataflow, DBSP, and countless other streaming methodologies.<br />Epsio is built to bring the best of the world of theory while optimizing for seamless, low time/risk integration and low cost resources. Can’t wait to see you!</p>
]]></content:encoded></item><item><title><![CDATA[Streaming SQL Arithmetic]]></title><description><![CDATA[Math insists that a+b-c=a-c+b, and we take this as an unshakable truth. But in computer science, when floating-point arithmetic enters the picture, even basic rules like this can fall apart. Most developers never notice these quirks in everyday appli...]]></description><link>https://blog.epsiolabs.com/streaming-sql-arithmetic</link><guid isPermaLink="true">https://blog.epsiolabs.com/streaming-sql-arithmetic</guid><dc:creator><![CDATA[Maor Kern]]></dc:creator><pubDate>Mon, 09 Dec 2024 22:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1753199505467/41373e33-f4d8-4e10-b4a3-d89d1b970649.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Math insists that <code>a+b-c=a-c+b</code>, and we take this as an unshakable truth. But in computer science, when floating-point arithmetic enters the picture, even basic rules like this can fall apart. Most developers never notice these quirks in everyday applications — but in the early days of building our Streaming SQL engine, this tiny mathematical inconsistency snowballed into a nightmarish bug. It wasn’t just hard to fix; it was hard to even understand. Let’s dive into the depths of this floating-point horror story together.</p>
<h3 id="heading-how-epsio-stores-result-data"><strong>How Epsio Stores Result Data</strong></h3>
<p>Before we go into the inherent issues with floating arithmetic, let’s talk about how Epsio stores the result data. In Epsio, you define a “view”, which consists of two parts: a name for a result table (this will be a table Epsio creates in <strong>your</strong> database, say a Postgres database), and a query you want Epsio to maintain. Epsio will populate the result table with the answer of the query, and then continuously update it as new changes come in via the replication stream from the database. A streaming SQL engine keeps queries’ results up to date without ever having to recalculate them, even as the underlying data changes.</p>
<p>To make this more understandable, let’s say we had a farm with livestock, and wanted to give the animals on the farm their fair portion of the profits (trying to avoid an Animal Farm here). To know how much total salary we’ve given each animal group, we’d have the query <code>SELECT animal, sum(salary) FROM animals GROUP BY animal</code>. Because the underlying data is always changing (animals being born, asking for raises, etc) we create a simple Epsio view in our Postgres:</p>
<pre><code class="lang-rust">CALL epsio.create_view(<span class="hljs-symbol">'animal_salaries</span>', 
<span class="hljs-symbol">'SELECT</span> animal <span class="hljs-keyword">as</span> group, sum(salary) <span class="hljs-keyword">as</span> total FROM animals GROUP BY animal')
</code></pre>
<p>This would create a table in our Postgres called animal_salaries, which Epsio would continuously update with the ever changing animal salaries.</p>
<p>It might look something like this:</p>
<p><img src="https://cdn.prod.website-files.com/66e7c0e8297c80a6f19bbd49/6759c1722975912e868965bf_67599ebf090e53f3788ea9fd_1_cW_g-TuvMiNtF-3kAd08qA.webp" alt /></p>
<p>You may wonder what that “epsio_id” column is doing there, since we only ever defined two columns in our query. Epsio automatically adds another column to every result table in order to allow for efficient deleting. Internally, Epsio has a sort of “dictionary” mapping the entire row to its corresponding epsio id (this is actually held in RocksDB).</p>
<p>In our case, it would look something like this:</p>
<pre><code class="lang-javascript">Bunnies,<span class="hljs-number">720.32</span>:<span class="hljs-number">999</span>
Pigs,<span class="hljs-number">7049.41</span>:<span class="hljs-number">238</span>
</code></pre>
<p>Let’s say all the Bunnies died (someone nefarious ran <code>DELETE FROM animals WHERE animal='bunny'</code>), and Epsio wanted to remove the bunnies row. It would find the corresponding row in its key mapping (Bunnies,720.32), find the corresponding epsio_id (999), and then run a delete query on Postgres:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">DELETE</span> <span class="hljs-keyword">FROM</span> animal_salaries <span class="hljs-keyword">WHERE</span> epsio_id = <span class="hljs-number">999</span>
</code></pre>
<p>You may notice that salaries are not round numbers. Animals can be a bothersome folk who negotiate on the penny, which means we need to keep our salaries (and therefore the sums) as floating points.</p>
<h3 id="heading-enter-floating-points"><strong>Enter Floating Points</strong></h3>
<p>Floating-point numbers are a way for computers to represent real numbers: numbers that can contain decimal points. Instead of storing numbers with absolute precision, floating points use a sort of “approximation”, which allows them to be both efficient in terms of space and also quick with computations. They’re represented using three numbers: a sign (positive or negative), a “mantissa”, and an exponent. So every floating point is essentially represented as:</p>
<p><img src="https://cdn.prod.website-files.com/66e7c0e8297c80a6f19bbd49/6759a86fe4b171fefae127a0_6759a8673e69f41cbb7b896e_1_8yEUtMItH7hxExWuTnXTrA.webp" alt /></p>
<p>using a nifty calculator (<a target="_blank" href="https://observablehq.com/@benaubin/floating-point">https://observablehq.com/@benaubin/floating-point</a>) we get:</p>
<p><img src="https://cdn.prod.website-files.com/66e7c0e8297c80a6f19bbd49/6759befd3c96fc4408d9ff1f_6759beb20b7fd05316a24cc2_1_53pteravaJLAsiUx7HeEMw.webp" alt /></p>
<p>As we can see, we’re actually not able to represent 2.4 exactly in 32-bit floats! This might not seem to be that big an issue (you might say that 2.399999.. is <em>awfully</em> close) but it can lead to weird things.</p>
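<p>You don’t even need the calculator to see this; one line of Rust, printing more digits than the type can faithfully round-trip, exposes the approximation a 64-bit float actually stores:</p>
<pre><code class="lang-rust">fn main() {
    // 2.4 has no exact binary representation; asking for extra digits
    // reveals the approximation the computer really keeps.
    println!("{:.17}", 2.4f64); // prints 2.39999999999999991
}
</code></pre>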
<h3 id="heading-weird-things-aka-arithmetic-with-floating-points"><strong>Weird Things (aka Arithmetic with Floating Points)</strong></h3>
<p>So if we’re not able to represent numbers exactly, what happens when we add them up?</p>
<p>The question is <em>how</em> computers manage to add up floats that are written in this method. If the exponents were the same, then by the ever-respected laws of math we could just add up the mantissas, and keep the exponent. But if the exponents are <strong>not</strong> the same, we need to <strong>align</strong> one of the float’s exponents to be the same as the others. We do this by both changing the <strong>mantissa</strong> and the <strong>exponent</strong>- which means we get an ever so slightly different approximation. For example, if 3.8 is approximated to 3.79999 with exponent=1 and we aligned the exponent to 2 (and changed the mantissa correspondingly) it may be approximated to 3.800001.</p>
<p>If we do addition like this, <strong>math may not be associative</strong>. Imagine we add 4.6 and 3.8. Let’s say 4.6 has an exponent of 2 and is approximated to 4.60001, and 3.8 has an exponent of 1 and is approximated to 3.79999. As we said before, we need to change one of the floats’ exponents, so let’s change 3.8 from its exponent of 1 to 2, making its approximation 3.800001 instead of 3.79999. If we add these numbers, we may get something very close to the “real” answer, which is 8.4. But if we now <strong>subtract</strong> 4.6 from this answer, and ask our computer “hey mr. computer, is that now equal to 3.8?”, our computer would say <strong>no</strong>.</p>
<p>3.800001 != 3.79999.</p>
<p>Now imagine we remove our original 3.8 (whose approximation is 3.79999) from that. We would get a number very close to zero, but not zero.</p>
<p>So <code>4.6 + 3.8 - 3.8 - 4.6 != 0</code> .</p>
<p>But… <code>4.6 + 3.8 - (3.8 + 4.6)</code> <strong>does</strong> equal zero in floating point land. I’ll let you work out why.</p>
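<p>If you want to watch this happen, here’s a minimal Rust sketch (shown with <code>f32</code>; the exact residue depends on the float width, but the asymmetry is the point):</p>
<pre><code class="lang-rust">fn main() {
    let (a, b): (f32, f32) = (4.6, 3.8);

    // Evaluated strictly left to right: ((a + b) - b) - a.
    let residue = a + b - b - a;
    // Grouped so both sums round to the identical value before subtracting.
    let grouped = (a + b) - (b + a);

    println!("{residue:e}"); // a tiny non-zero number, e.g. -4.7683716e-7
    println!("{grouped:e}"); // exactly 0e0
}
</code></pre>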
<h3 id="heading-why-does-this-matter-in-streaming-sql"><strong>Why does this matter in streaming SQL</strong></h3>
<p>In Epsio’s streaming SQL engine we have a concept of a “change”, which is the basic unit that is passed between query nodes. If you’d like to go into depth on how this works, my other blog post <a target="_blank" href="https://blog.epsio.io/how-to-create-a-streaming-sql-engine-96e23994e0dd">How to Create a Streaming SQL Engine</a> explains it, but I’ll give you the short and relevant version here.</p>
<p>Every change has a <strong>key</strong> and a <strong>modification.</strong></p>
<p>Let’s say a Llama appears out of the blue in our farm, demanding a very fair salary of 3.8$. We would represent it like this:</p>
<p><img src="https://cdn.prod.website-files.com/66e7c0e8297c80a6f19bbd49/6759c0518d581dac788b06ab_6759bf39c74228bdbe7e671b_1_YQ85wrLOXPIkK2QpF1Gozg.webp" alt /></p>
<p>One of our query nodes is called a Group By node, which in our example from the beginning of the blog would be the last query node before our “final” node, mapping rows to their corresponding epsio_ids. The Group By node outputs aggregates per bucket. It does this by holding an internal mapping (in storage) between each bucket and its aggregated value. So in our example we would have:</p>
<p><img src="https://cdn.prod.website-files.com/66e7c0e8297c80a6f19bbd49/6759c0518d581dac788b06a8_6759c02fc938b3395f9e0b3d_1_C7fQVaDDTD1jdzjNcaLubw.webp" alt /></p>
<p>Let’s imagine we indeed passed Llamas +3.8 into the Group By node. It would look something like this:</p>
<p><img src="https://cdn.prod.website-files.com/66e7c0e8297c80a6f19bbd49/6759c0e1c74228bdbe7fe786_6759c066eb4b2ca1e9660f68_1_rhagjiZ5R9QFJKXh3_nf_Q.webp" alt /></p>
<p>The Group By node outputs that there is now one new group (signified by the +1) of Llamas with the value 3.8.</p>
<p>If that Llama were to then have a Llama baby, we would have:</p>
<p><img src="https://cdn.prod.website-files.com/66e7c0e8297c80a6f19bbd49/6759c0e1c74228bdbe7fe783_6759c088a8815ec6cf6d5860_1_nY_Yb2qJ42okJ2MO-7U-Pw.webp" alt /></p>
<p>The Group By node basically tells the node after it,</p>
<p>“Hey, remove the row I gave you before saying (Llamas, 3.8) and create a new row called (Llamas, 8.4)”.</p>
<p>The Group By node always outputs a minus on what it had for the group (if the group previously existed), and a plus for the new value. Now, you might be thinking, what if both Llamas died? Wouldn’t we reach the exact scenario we talked about before, where <code>4.6 + 3.8 - 4.6 - 3.8 != 0</code>? It would be pretty weird if an end user saw that there are no Llamas in his system but Llamas receive $0.000002 of total salary.</p>
<p>The way Epsio solves this is by keeping track of the number of records that passed through. If the record count for a group ever sums to zero (e.g. we had one new record and one deletion of a record), we simply filter the group out (in truth- all our aggregations are “tuples” which can hold many values that we add up together at once, but that’s for another time). This is how we avoid weird situations with groups having 0.00000x values for floats.</p>
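<p>Here’s a stripped-down sketch of that bookkeeping (our real aggregations are tuple-valued and more involved):</p>
<pre><code class="lang-rust">#[derive(Clone, Copy, Default)]
struct SumAgg {
    sum: f32,
    count: i64, // how many records contributed to this group
}

impl SumAgg {
    fn merge(self, other: SumAgg) -&gt; SumAgg {
        SumAgg { sum: self.sum + other.sum, count: self.count + other.count }
    }

    // A group whose count reached zero no longer exists; filter it out
    // instead of emitting a float residue like 0.000002.
    fn emit(self) -&gt; Option&lt;f32&gt; {
        (self.count != 0).then_some(self.sum)
    }
}
</code></pre>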
<p>But you might be beginning to get a feel that something weird is going on here- in a batch processing SQL engine this problem doesn’t exist as acutely, since if there are no Llamas you would never output the group. Because stream processing engines are keeping state over time, <strong>modifications that aren’t perfectly associative</strong> can cause trouble.</p>
<h3 id="heading-finally-the-bug"><strong>(Finally) The Bug</strong></h3>
<p>So what’s the bug? The answer lies in how our internal mappings work. We’ve previously shown our internal mapping as follows:</p>
<p><img src="https://cdn.prod.website-files.com/66e7c0e8297c80a6f19bbd49/6759c0518d581dac788b06a8_6759c02fc938b3395f9e0b3d_1_C7fQVaDDTD1jdzjNcaLubw.webp" alt /></p>
<p>But how do we actually hold this data, and how do we update it?</p>
<p>Our keys and values are held in RocksDB, an embedded key value store. Every time we merge in a new value for a key, instead of just removing the old key and adding a new key with the updated value, RocksDB simply saves a “merge” record. So we can imagine our mappings looking something like this:</p>
<p><img src="https://cdn.prod.website-files.com/66e7c0e8297c80a6f19bbd49/6759c0e1c74228bdbe7fe77f_6759c0cfb83c119eb878563b_1__ra13qBHWjpRI-VY_Z_7ZA.webp" alt /></p>
<p>RocksDB will actually do the “merging” (in our case addition) when it either runs a process called “compaction”, or when you read a key. The thing is, RocksDB offers no guarantees on what order it’ll do the merging in. Furthermore, we have optimizations that hold some of the data in memory and combine it with the result from RocksDB- all this to emphasize that there is an <strong>implicit expectation</strong> in our code that you’re allowed to “merge” records in any order you want.</p>
<p>This all means that we’re constantly at risk of giving off a slightly different answer- just like <code>4.6 + 3.8 - 4.6 - 3.8 != (4.6 + 3.8) - (4.6 + 3.8)</code>. This rarely happens, precisely because most of the time you <strong>are</strong> merging them in the same order, and so getting the same result. But nobody is guaranteeing that for you, and in rare cases you’ll run into trouble. This trouble comes in the form of telling the Final Node, which we talked about in the beginning of the blog, to <strong>delete a row that never existed</strong>. It therefore cannot find the equivalent epsio_id, and proceeds to error out.</p>
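<p>To make the ordering hazard concrete, here’s a sketch folding the same four modification records in two different orders, the way merge records might get folded during different compactions:</p>
<pre><code class="lang-rust">fn main() {
    // One merge order: fold the records strictly left to right.
    let a = ((4.6f32 + 3.8) - 4.6) - 3.8;
    // Another merge order: combine the additions and the removals first.
    let b = (4.6f32 + 3.8) - (4.6 + 3.8);

    // Same set of records, different results.
    println!("{a:e} vs {b:e}"); // e.g. -2.3841858e-7 vs 0e0
}
</code></pre>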
<h3 id="heading-why-this-happened"><strong>Why this happened</strong></h3>
<p>In streaming engines, the concept of a modification must be extremely well defined. To us, a modification was anything that implements Rust’s <code>Add</code> trait and <code>Default</code> trait- meaning both that modifications could be added together, and that they could zero out. But there was a third “trait” (not in the Rust sense) that we didn’t think too much about when building the underlying infrastructure: associativity. While “hacks” such as keeping the count (a very common solution in other SQL engines) can seem to patch the problem, the underlying issues with associativity will come to light in other fashions. For us, the right solution was to transform floats into absolute precision numbers when they arrive at accumulative query nodes. Absolute precision means that we <strong>never</strong> have “approximations” like in floats, which guarantees us perfect associativity. The cost and overhead of absolute precision was negligible in most situations, while engineering around floats was full of pitfalls.</p>
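<p>To sketch what “absolute precision” buys us (an illustration, not Epsio’s actual type): scale every float into a fixed-point integer on arrival, and merge order stops mattering, because integer addition is associative:</p>
<pre><code class="lang-rust">use std::ops::Add;

// Fixed-point value with 9 fractional decimal digits (illustrative).
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
struct Fixed(i128);

const SCALE: f64 = 1e9;

impl Fixed {
    fn from_f64(v: f64) -&gt; Self {
        // Rounding happens exactly once, at the pipeline's entrance.
        Fixed((v * SCALE).round() as i128)
    }
}

impl Add for Fixed {
    type Output = Fixed;
    fn add(self, rhs: Fixed) -&gt; Fixed {
        // Plain integer addition: associative in every merge order.
        Fixed(self.0 + rhs.0)
    }
}

fn main() {
    let (a, b) = (Fixed::from_f64(4.6), Fixed::from_f64(3.8));
    let neg = |x: Fixed| Fixed(-x.0);
    // 4.6 + 3.8 - 4.6 - 3.8 is now exactly zero, whatever the order.
    assert_eq!(a + b + neg(a) + neg(b), Fixed::default());
}
</code></pre>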
]]></content:encoded></item><item><title><![CDATA[Why We Built a Streaming SQL Engine]]></title><description><![CDATA[Following our previous blog post “How we built a Streaming SQL Engine,” we are delighted to share some of the reasoning and thoughts that led us to build Epsio (as we always say — start with how, only later ask why).
Why we built Epsio
For most of ou...]]></description><link>https://blog.epsiolabs.com/why-we-built-a-streaming-sql-engine</link><guid isPermaLink="true">https://blog.epsiolabs.com/why-we-built-a-streaming-sql-engine</guid><dc:creator><![CDATA[Gilad Kleinman]]></dc:creator><pubDate>Wed, 21 Aug 2024 21:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1753199319761/5db66c78-ef72-4b00-8920-e899c954ae0c.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Following our previous blog post “<a target="_blank" href="https://www.epsio.io/blog/how-to-create-a-streaming-sql-engine">How we built a Streaming SQL Engine</a>,” we are delighted to share some of the reasoning and thoughts that led us to build Epsio (as we always say — start with how, only later ask why).</p>
<h4 id="heading-why-we-built-epsio"><strong>Why we built Epsio</strong></h4>
<p>For most of our adult lives, my co-founder Maor and I have been working on developing and designing software products, and scaling them from mere ideas to fully-fledged solutions that cater to the needs of thousands / tens of thousands of users.</p>
<p>Two interesting (though obvious) things almost always happened as time passed while working on these projects:</p>
<ul>
<li><p><strong>The amount of data our product relied on increased</strong>. As more users were using our projects, the more data we had to store.</p>
</li>
<li><p><strong>Our products became more complex</strong>. Over time, management would request additional features and enhancements, resulting in a much more complex backend architecture and more complex database queries.</p>
</li>
</ul>
<p>Although these two trends may seem obvious, they always had severe implications, particularly for the performance of the databases we were using. As a new feature was requested or a new big customer onboarded, we always remember asking ourselves “will the DB handle this?”</p>
<p><img src="https://cdn.prod.website-files.com/66e7c0e8297c80a6f19bbd49/66e7c14d9227d11977422268_6563951bb0c01b3ea3745639_1*ydCuvVIXkI_5HrAxSYLp0g.png" alt /></p>
<h4 id="heading-why-our-queries-became-slow-the-gap-between-the-data-we-store-and-the-data-we-seek"><strong>Why our queries became slow: The gap between the data we store and <em>the data we seek</em></strong></h4>
<p>In a philosophical way, whenever one of those things occurred (more data was stored / queries became more complex), the gap between <em>the data we stored</em> and <em>the data we sought</em> became bigger.</p>
<p>The more data we had, and the more complex a query we ran, the more <em>things</em> (disk reads, calculations, correlations, etc…) our database needed to do to calculate the result based on the data we stored. The more <em>things</em> our database needed to perform, the wider the gap became, and the longer it took the database to provide a result.</p>
<p><img src="https://cdn.prod.website-files.com/66e7c0e8297c80a6f19bbd49/66e7c14d9227d11977422256_6563953a40fccd94b2016355_1*RM0oDe6cu1OXz44fhkdlQw.png" alt /></p>
<p>If, for example, we were building a salary management system with a dashboard that shows the total sum of salaries in our company — and the number of salaries in the company were to suddenly double, our database would need to perform twice the amount of work each time someone opens the dashboard. This would result in our database either using twice the resources it did before or taking twice the time it used to. Both options are, obviously, not ideal.</p>
<p>Even in the scenario where the number of salaries doesn’t grow, it is very likely that over time we would want to add additional or more complex queries to our dashboard (such as changes in salaries over time, distribution graphs, etc.), again — resulting in even more work for our database to perform to calculate what we want to show in the dashboard from the underlying data.</p>
<h4 id="heading-how-did-we-overcome-these-performance-issues-until-today"><strong>How did we overcome these performance issues until today?</strong></h4>
<p>Usually, when facing these performance issues, we would try using all the traditional methods (indexing, optimizing the queries, etc.) that we were taught to “accelerate” our queries. All of them may improve the performance in the short term, but usually these solutions would just be “patches” and would not hold water as data volumes and query complexity continued to grow.</p>
<p>Methods such as adding indexes, increasing the computing power of the database, and rewriting queries would usually indeed shorten query times. However, since the performance would still be correlated to the data volumes and complexity, our queries would continue to slow down as the gap continued to widen. A query that was optimized to run 10 times faster would very quickly become slow again as data grew to 10 times its size.</p>
<p>Alternative methods such as materialized views, caches, or denormalizations that pre-calculated the <em>data we seek</em> instead of just “accelerating” queries would also seem very promising at the beginning, but would quickly prove difficult to implement and maintain when we actually tried using them. Given that the results were pre-calculated, every time the underlying data changed, we would need to recalculate the entire result from scratch to reflect the new data, which would cost us expensive compute power or lead to stale data if we didn’t constantly update the results.</p>
<p><img src="https://cdn.prod.website-files.com/66e7c0e8297c80a6f19bbd49/66e7c14c9227d1197742223f_6563955226645cc2338f944f_0*3WGlaAdzyRWbq_NT.jpeg" alt /></p>
<h4 id="heading-introducing-epsio"><strong>Introducing Epsio</strong></h4>
<p>Epsio solves this problem by never trying to reprocess the entire dataset whenever a result of a query is needed or whenever the underlying data changes. Instead, Epsio pre-calculates results of defined queries and incrementally updates them whenever the underlying data changes, <strong>without ever recalculating the entire dataset</strong>.</p>
<p><img src="https://cdn.prod.website-files.com/66e7c0e8297c80a6f19bbd49/66e7c14d9227d11977422282_6577275ae20b3e0de545d231_Screen%2520Recording%25202023-12-11%2520at%252017.09.25%2520(1).gif" alt /></p>
<p>By pre-calculating complex queries and incrementally maintaining their results, we can achieve the huge benefits of denormalizations and caches, where results are instant, without the need to invalidate or recreate results whenever the underlying data changes. For us, this method of executing complex queries was the only way to effectively bridge the gap between the <em>data we stored</em> and the <em>data we sought</em>, and to “prepare” our database for the specific queries we were about to execute.</p>
<h4 id="heading-how-does-this-actually-work"><strong>How does this actually work?</strong></h4>
<p>To demonstrate how Epsio actually works, imagine the backend system to manage salaries we mentioned before. Imagine that the system contains a table with the salaries of all employees in a company, and a dashboard running a query that calculates the sum of salaries by department. If our imaginary system were to be used in a very large company, our salaries table might easily contain millions of records, meaning that our dashboard would probably take multiple seconds or minutes to load without us optimizing its queries extensively.</p>
<p>To utilize Epsio, we can simply define our query in Epsio. First, Epsio scans the entire base table and saves the initial result of the query to a results table:</p>
<p><img src="https://cdn.prod.website-files.com/66e7c0e8297c80a6f19bbd49/66e7c14d9227d1197742224c_656395af26645cc2338fe612_0*PlG98LFePYyPfy4g.png" alt /></p>
<p>Then, when new data arrives or existing data gets updated or deleted, Epsio performs the minimum calculation needed to update the previous result. In this case, just adding or subtracting the previous count per department:</p>
<p><img src="https://cdn.prod.website-files.com/66e7c0e8297c80a6f19bbd49/66e7c14d9227d1197742226b_656395afe2527b66eb6181d6_0*DlBa_b3Yj9jzoX-l.png" alt /></p>
<p>As the above calculation is very efficient (compared to recalculating the entire query), we are able to take a complex query that originally took seconds / minutes to run and provide instant and always up-to-date results, all while using less compute power and without spending a second on “query optimizations”.</p>
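<p>In code, the per-change work can be sketched roughly like this (a toy model, not Epsio’s internals; the names are illustrative):</p>
<pre><code class="lang-rust">use std::collections::HashMap;

// A change to the salaries table, as delivered by the replication stream.
enum Change {
    Insert { dept: String, salary: i64 },
    Delete { dept: String, salary: i64 },
    Update { dept: String, old: i64, new: i64 },
}

// Maintains `SELECT dept, sum(salary) ... GROUP BY dept` incrementally:
// each change touches exactly one bucket, so the work per change stays
// constant no matter how many rows the underlying table holds.
fn apply(totals: &amp;mut HashMap&lt;String, i64&gt;, change: Change) {
    let (dept, delta) = match change {
        Change::Insert { dept, salary } =&gt; (dept, salary),
        Change::Delete { dept, salary } =&gt; (dept, -salary),
        Change::Update { dept, old, new } =&gt; (dept, new - old),
    };
    *totals.entry(dept).or_insert(0) += delta;
}
</code></pre>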
<p>Although it’s very simple to imagine how the above query could be incrementally updated when new data arrives, apparently (as you can read in the <a target="_blank" href="https://www.epsio.io/blog/how-to-create-a-streaming-sql-engine">previous blog post</a> we published), the same concept and way of thinking could be applied to much more complex queries and SQL operators that contain JOINs, ORDER BYs, DISTINCT, and many more — meaning that almost any complex and slow SQL query that runs frequently in our backends could benefit from this huge performance boost if processed “incrementally.”</p>
<h4 id="heading-why-epsio-is-not-a-new-database"><strong>Why Epsio is (not) a new database</strong></h4>
<p>Although our initial hunch when thinking about the architecture of Epsio was to build a new database, we quickly understood that it is important for our new engine to be “<em>Easy to integrate, Easy to trust, and Easy to use</em>”. When thinking about these three “E”s, we understood that building a new database simply doesn’t address any of those, and probably isn’t the right architecture for us.</p>
<p>Specifically, we wished for an architecture where:</p>
<ul>
<li><p>Zero migrations are required for use.</p>
</li>
<li><p>The engine could be tested on a small use case without affecting anything not related to that use case.</p>
</li>
<li><p>No additional single point of failure should be added. If the engine goes down, our database should still be “queryable”.</p>
</li>
</ul>
<p>To answer all of the above, we decided to implement an architecture where Epsio sits “behind” the existing database, receives the change stream (CDC/replication stream) for the relevant data, and writes back to “result tables” in the original database:</p>
<p><img src="https://cdn.prod.website-files.com/66e7c0e8297c80a6f19bbd49/66e7c14d9227d11977422249_65639811ef0efb399813dc4d_Screenshot%25202023-11-26%2520at%252021.09.44.png" alt /></p>
<p>In this architecture, unlike a new database:</p>
<ul>
<li><p>No migrations are required to use Epsio.</p>
</li>
<li><p>Developers can try Epsio on a small use case without it affecting the rest of the database. Epsio is not in the hot path.</p>
</li>
<li><p>In the (very bad) scenario where the Epsio instance fails, other queries in the database are not affected, and even Epsio’s result tables continue to be accessible, although their results might be stale.</p>
</li>
</ul>
<p>All of the above means we can continue working with the databases we already love and trust, while still enjoying the significant benefits of stream processing — or as the great Hannah Montana used to say: “You get the best of both worlds.”</p>
<h4 id="heading-so-whats-the-catch"><strong>So, what’s the catch?</strong></h4>
<p>Although we truly believe that the “incremental” way of thinking can really change the way we process and query our data for the better, at the end of the day, similar to everything else in the world of databases (or software in general) — everything is a tradeoff.</p>
<p>Some queries that change drastically between runs (ad-hoc queries), or that query data that changes drastically between runs, might be less efficient in Epsio. In the world of tradeoffs you can never solve all queries, but if we can still solve a huge portion of them, we should be happy with the big “incremental” (no pun intended) change it offers: letting developers focus more on building logic and less on optimizing stuff.</p>
<h4 id="heading-concluding-thoughts"><strong>Concluding thoughts</strong></h4>
<p>After spending way too much time optimizing complex queries and trying to scale them as our products scale, we hope that Epsio will serve as a new paradigm to better bridge our applications and our databases — and to ensure that our databases perform only the minimum required operations to meet our complex application needs.</p>
<p>...</p>
<p>If you enjoyed this blog post and wish to try Epsio’s <em>blazingly fast</em> (but actually just cheating) SQL engine, check out our docs <a target="_blank" href="http://docs.epsio.io/">here</a>.</p>
]]></content:encoded></item><item><title><![CDATA[Out Of Memory Shenanigans]]></title><description><![CDATA[At Epsio Labs, we develop an incremental SQL engine — in brief, an SQL engine that operates on changes instead of reprocessing the whole dataset each time a query runs. I’ll leave explaining why that’s useful to another blog post of ours — Why We Bui...]]></description><link>https://blog.epsiolabs.com/out-of-memory-shenanigans</link><guid isPermaLink="true">https://blog.epsiolabs.com/out-of-memory-shenanigans</guid><category><![CDATA[memory-management]]></category><category><![CDATA[Rust]]></category><dc:creator><![CDATA[Gilad Kleinman]]></dc:creator><pubDate>Tue, 06 Aug 2024 21:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1753198526046/3e808bca-3958-4d85-b983-2c2673fb1835.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>At Epsio Labs, we develop an incremental SQL engine — in brief, an SQL engine that operates on changes instead of reprocessing the whole dataset each time a query runs. I’ll leave explaining why that’s useful to another blog post of ours — <a target="_blank" href="https://www.epsio.io/blog/why-we-built-a-streaming-sql-engine">Why We Built a Streaming SQL Engine</a>.</p>
<p>In the early days of Epsio, one of our first ways to stress test our engine was to run the TPC-DS benchmark queries (an industry standard benchmark) with a large data scale (100 TB of records).</p>
<p>We ran one of our newly born tests, and after several minutes, our test machine completely <strong>froze for 10 minutes!</strong><br />Afterwards, we noticed <strong>our engine process got killed</strong> (some would say brutally murdered). Looking at our system metrics and logs, we saw the <strong>system memory was at 99%</strong>, which was much higher than expected.</p>
<p><img src="https://cdn.prod.website-files.com/66e7c0e8297c80a6f19bbd49/66e7c1509798fc06d9a64d67_6596fd6b7ac8606c5d1eaaa2_1*Z7DvCau4A4SsAXw7BcEoUg.png" alt /></p>
<p>Memory Utilization on 16 GB machine, Between 19:06 and 19:17 logs were not sent (thus the awfully straight line)</p>
<p>In this blog post, we will navigate the uncertainty of memory utilization, unveil the mystery of our process’s memory bloat, and discuss how we tackled it.</p>
<h4 id="heading-what-happens-when-the-machine-is-out-of-memory"><strong>What happens when the machine is out of memory?</strong></h4>
<p>Data and memory-intensive programs such as ours often reach high memory usage. If the workload is large enough and no memory regulation is in place, it might cause your system to reach close to 100% memory utilization.</p>
<p>In order to keep our machines running even in such scenarios, Linux gives us two configurable tools:</p>
<ol>
<li><p><strong>Swap memory</strong> — A file or partition on disk that’s used as a supplement to physical memory (RAM). When more RAM is needed than what’s available, inactive pages in memory are moved to the swap space in order to free memory space. We chose to disable swap in our test machine because we consider extreme memory utilization a bug; we wouldn’t want to rely on swap to save us due to its performance impact. Also, Epsio might be deployed on a machine that has swap disabled.</p>
</li>
<li><p><strong>OOM Killer</strong> — The Out of Memory Killer is a process run by the Linux kernel when the system memory is extremely low. It reviews all the running processes and assigns a score to each, based on memory usage and other parameters. It then chooses one or more of them to be killed. [<a target="_blank" href="https://www.kernel.org/doc/gorman/html/understand/understand016.html">1</a>] [<a target="_blank" href="https://lwn.net/Articles/761118/">2</a>]</p>
</li>
</ol>
<p>The culprit of our process’s assassination is of course the OOM Killer, who chose to kill the engine process because of its high memory usage.<br />To ensure that our accusation is true, we can look at the system logs with <em>journalctl</em> (a utility to query various services’ logs):</p>
<p>(You can ignore the nasty command)</p>
<pre><code class="lang-bash">journalctl --list-boots | awk '{ print $1 }' | \
  xargs -I{} journalctl --utc --no-pager -b {} -kqg 'killed process' -o verbose --output-fields=MESSAGE

Tue 2023-02-02 16:21:52.348846 UTC [s=694483b10fab40da9c36d06a1f288df8;i=acc;b=8ee2faf0530244958351de3c743bd18a;m=d991edcc;t=605b8a8153aae;x=affb06d11f0113d9]
MESSAGE=Out of memory: Killed process 9049 (epsio_engine) total-vm:16253052kB, anon-rss:14777084kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:30128kB oom_score_adj:0
</code></pre>
<p>Let’s focus on the output:</p>
<p><code>Out of memory: Killed process 9049 (epsio_engine) ... anon-rss:14777084kB</code></p>
<p><strong>Our process got killed by the OOM Killer</strong> while having <strong>14.7 GB</strong> allocated to it in physical memory (see “anon-rss” value in the output).</p>
<p>So OOM Killer was spawned, which means <strong>our system was out of memory</strong>. When you get that close to the sun, bad things start to happen — my machine was unresponsive via ssh, application logs were not sent, and <strong>everything seemed to be frozen</strong>. It took 10 minutes until my machine was responsive again, which is weird. You’d think the OOM Killer would kill a process in a millisecond.</p>
<p>Apparently, this is a known phenomenon in Linux that is sometimes called “<strong>OOM hangs/livelocks</strong>”. The reason for this phenomenon is a bit complicated, but to be concise, when memory is requested and there’s not enough available memory, the system first tries to reclaim memory pages that are backed by a file on the hard disk (memory-mapped files).<br /><strong>If enough memory is reclaimed, the OOM killer is never spawned</strong>. This can go on and on, causing the system to be slow and unresponsive.<br />Interestingly enough, SSDs’ fast performance makes this phenomenon even worse. [<a target="_blank" href="https://lwn.net/Articles/759658/">3</a>]</p>
<p><img src="https://cdn.prod.website-files.com/66e7c0e8297c80a6f19bbd49/66e7c1509798fc06d9a64d44_6596ff91b0a9ffaedec2e096_0*0Y5omham6rxPbkvw.jpeg" alt /></p>
<p>We definitely never want to reach an unresponsive state because of OOM hangs. We can avoid this by killing our process before the system memory is at full capacity (see <a target="_blank" href="https://github.com/rfjakob/earlyoom">earlyoom</a>). For the simple task of making our stress test machine want to talk to us, we chose to add a <a target="_blank" href="https://manpages.ubuntu.com/manpages/bionic/man7/cgroups.7.html">CGroup</a> in order to limit the process’s memory (the process will be killed when the limit is exceeded).</p>
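<p>For reference, here’s roughly what that looks like against the cgroup v2 filesystem interface (a sketch; the path and numbers are illustrative):</p>
<pre><code class="lang-rust">use std::fs;

// Puts a process under a fresh cgroup (v2) with a hard memory cap, so the
// kernel kills it (and only it) well before the whole machine starves.
fn limit_process_memory(pid: u32, bytes: u64) -&gt; std::io::Result&lt;()&gt; {
    let cg = "/sys/fs/cgroup/epsio_engine"; // illustrative path
    fs::create_dir_all(cg)?;
    fs::write(format!("{cg}/memory.max"), bytes.to_string())?;
    fs::write(format!("{cg}/cgroup.procs"), pid.to_string())?;
    Ok(())
}
</code></pre>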
<p>This mitigation makes our machine happy; however, the engine process will still be abruptly killed because of memory bloat, which takes us to the root cause of the problem: <strong>Why did Epsio need such an extreme amount of memory in that test?</strong></p>
<p><img src="https://cdn.prod.website-files.com/66e7c0e8297c80a6f19bbd49/66e7c1509798fc06d9a64d37_6596ff91f7ddba3d39d191f2_1*lJrL-dXGWFzoHymhrRzePQ.jpeg" alt /></p>
<p>Processes fighting for memory on my ubuntu machine, circa not so long ago</p>
<h4 id="heading-epsio-view-and-its-big-pipeline"><strong>Epsio View And Its Big Pipeline</strong></h4>
<p>The failed test creates an Epsio view for a TPC-DS query. An Epsio view is an incremental view; it shows up-to-date results of the query. When an Epsio view is created, a new Epsio engine process is spawned. The engine first calculates the current results of the query — we call this stage <strong>population</strong>, and its purpose is similar to that of other SQL engines.<br />After population, the engine begins to consume row changes (e.g. inserts, deletes, updates) and update the results accordingly — we call this stage <strong>CDC Streaming</strong> (<a target="_blank" href="https://en.wikipedia.org/wiki/Change_data_capture">Change Data Capture</a>).</p>
<p>The specific test that failed created an Epsio view with a query similar to the one below (this is a simplified version):</p>
<pre><code class="lang-csharp"><span class="hljs-keyword">select</span> i_item_id, avg(cs_quantity) agg1
 <span class="hljs-keyword">from</span> catalog_sales
 <span class="hljs-keyword">join</span> item <span class="hljs-keyword">on</span> cs_item_sk = i_item_sk
 <span class="hljs-keyword">join</span> promotion <span class="hljs-keyword">on</span> cs_promo_sk = <span class="hljs-function">p_promo_sk
 <span class="hljs-title">where</span> (<span class="hljs-params">promotion.p_channel_email = <span class="hljs-string">'N'</span> or promotion.p_channel_event  = <span class="hljs-string">'N'</span></span>)
 <span class="hljs-keyword">group</span> <span class="hljs-keyword">by</span> i_item_id</span>
</code></pre>
<p>Our SQL engine constructs a data pipeline for the query. Each node in the pipeline corresponds to an SQL operator. Here is an illustration:</p>
<p><img src="https://cdn.prod.website-files.com/66e7c0e8297c80a6f19bbd49/670c063266629d51d7d4003f_670c0625acd5dada29800807_Untitled-2024-10-13-2039.png" alt /></p>
<p>Each root node (catalog_sales table, item table, promotion table) is an input node that yields a batch of rows from either the population or CDC stages. If we processed <strong>one batch</strong> of rows at a time and waited for it to reach completion, we would leave CPU cores idle and increase the latency of the next batch. <strong>That’s why we want to stream multiple batches of rows in parallel.</strong></p>
<p>Now, try to think of <strong>all the ways memory might bloat</strong> when we stream these batches of thousands of records:</p>
<ul>
<li><p>We might stream too many batches too fast, which is bad because memory will become scarce.</p>
</li>
<li><p><strong>Each join node might output more rows than the sum of its inputs</strong>. For example, if the first join key in our query is cs_item_sk = i_item_sk, for a batch of 10,000 rows from the left side of the join that have cs_item_sk = 1 and another 10,000 rows from the right side of the join that have i_item_sk = 1, the results would be the cross product of the rows (<strong>100,000,000 rows!</strong>).</p>
</li>
<li><p><strong>A bottleneck node</strong> might cause a “traffic jam” in the pipeline, causing memory to be occupied for more time. This might not seem like a problem, but remember that multiple Epsio views can run simultaneously, and if all of them have lots of batches in memory waiting for the bottleneck to finish, system memory will become scarce.</p>
</li>
</ul>
<p>Those disasters were <strong>already foretold</strong>. To tackle these situations, we:</p>
<ul>
<li><p>Limited the amount of parallel batches.</p>
</li>
<li><p>Because a node like Join might output larger batches than its input, we introduced a <strong>stalling mechanism that will wait for more memory to be available</strong> before streaming batches between nodes (a sketch of this idea appears after the illustration below).</p>
</li>
<li><p>Added a spill-to-disk mechanism that can choose to move stalled batches from memory to disk.</p>
</li>
<li><p>Introduced more optimizations to our SQL query planner in order to have a more efficient pipeline. For example, <strong>because our query’s filter node has predicates (filters) relevant to the promotion table’s columns, we can move it to be before the join node.</strong> That way, we avoid unnecessary joins between rows that would be filtered out later anyway, resulting in less memory usage and better performance. This optimization is known as “Predicate Pushdown” because it pushes the predicates (filters) closer to the data source (table input).</p>
</li>
</ul>
<p><img src="https://cdn.prod.website-files.com/66e7c0e8297c80a6f19bbd49/670c0859bfda26ab4b0fc822_670c084f1a89ee901393fac8_second.png" alt /></p>
<p>Some of these mitigations are good examples of memory regulation. However, even with the mitigations in place, the test caused our process memory to bloat. <strong>The process memory usage was much higher than expected and exceeded our memory regulation limits by far</strong>. At this point, we decided to go more in depth.</p>
<h4 id="heading-malloc-and-its-shenanigans"><strong>Malloc And Its Shenanigans</strong></h4>
<p>To understand our process memory usage, we have to understand what happens when our process <strong>allocates dynamic memory</strong>.</p>
<p>In Rust, dynamic memory allocations use the global memory allocator, which in our case is the malloc allocator implemented in glibc 2.35.<br />So under the hood, our process basically uses <em>malloc</em> and <em>free</em> as any normal C program does.</p>
<h4 id="heading-malloc-being-greedy"><strong>Malloc Being Greedy</strong></h4>
<p>The allocator asks the OS for memory by using a syscall, but it strives to reduce the number of calls for performance reasons. Hence, <strong>when <em>free</em> is called,</strong> <strong>it doesn’t necessarily return the freed memory chunk to the OS</strong>, but instead holds several data structs for the bookkeeping of <strong>free chunks</strong>. On the next malloc call, the allocator looks for a suitable free chunk to return. If found, it will return a pointer to the chunk and mark it as used. Otherwise, it will issue another syscall for memory.</p>
<p><strong>From the OS’ perspective, the user process still occupies the memory in the free chunks.</strong></p>
<p><img src="https://cdn.prod.website-files.com/66e7c0e8297c80a6f19bbd49/66e7c1509798fc06d9a64d40_6597001db0a9ffaedec352fe_0*ypMjtbA-KeY4l4BW.jpeg" alt /></p>
<h4 id="heading-memory-profiling"><strong>Memory Profiling</strong></h4>
<p>Interesting stuff! Now that we understand a bit more about malloc, let’s profile our process memory usage to see how many free chunks there are. We can use the libc <a target="_blank" href="https://man7.org/linux/man-pages/man3/mallinfo.3.html"><em>mallinfo2</em></a> function (<a target="_blank" href="https://codebrowser.dev/glibc/glibc/malloc/malloc.c.html#__libc_mallinfo2">__libc_mallinfo2</a>), which returns a struct containing information about memory allocations.<br />From the <em>mallinfo</em> man page:</p>
<blockquote>
<p><em>The structure fields contain the following information:<br />..<br /><strong>uordblks</strong><br />The total number of bytes used by in-use allocations.</em></p>
<p><em><strong>fordblks</strong><br />The total number of bytes in free blocks.</em></p>
</blockquote>
<p>Let’s create a task that will log mallinfo results every 10 seconds:</p>
<pre><code class="lang-rust"><span class="hljs-keyword">use</span> libc::mallinfo2;
<span class="hljs-keyword">use</span> tokio::time;
<span class="hljs-keyword">use</span> epsio_tracing::info;

<span class="hljs-keyword">pub</span> <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">start_mallinfo_monitor</span></span>() -&gt; JoinHandle&lt;()&gt; {
    tokio::spawn(<span class="hljs-keyword">async</span> <span class="hljs-keyword">move</span> {
        <span class="hljs-keyword">let</span> <span class="hljs-keyword">mut</span> interval = time::interval(Duration::from_millis(<span class="hljs-number">10000</span>));
        interval.set_missed_tick_behavior(time::MissedTickBehavior::Delay);
        <span class="hljs-keyword">loop</span> {
            interval.tick().<span class="hljs-keyword">await</span>;
            <span class="hljs-keyword">unsafe</span> {
                <span class="hljs-keyword">let</span> mallinfo2_struct = mallinfo2();
                info!(
                  message = <span class="hljs-string">"mallinfo2"</span>,
                  mallinfo2 = mallinfo2_struct
              );
            }
        }
    })
}
</code></pre>
<p>When we put <em>uordblks</em> and <em>fordblks</em> on a graph, we get:</p>
<p><img src="https://cdn.prod.website-files.com/66e7c0e8297c80a6f19bbd49/66e7c1509798fc06d9a64d6d_6597dcb72b2e5b942d2841d4_1*ThUNgwlkaOhYmJawBadZdA.png" alt /></p>
<p>So we are not even using most of the occupied memory; ~11GB out of the 15GB are actually free chunks.</p>
<p>To shed more light on the matter, I used the <a target="_blank" href="https://man7.org/linux/man-pages/man3/malloc_info.3.html"><em>malloc_info</em></a> libc function, which prints a huge nasty XML with much more detail:</p>
<pre><code class="lang-php-template"><span class="xml">more arenas
...
<span class="hljs-tag">&lt;/<span class="hljs-name">heap</span>
&lt;!<span class="hljs-attr">--Arena</span> <span class="hljs-attr">number</span> <span class="hljs-attr">60--</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">heap</span> <span class="hljs-attr">nr</span>=<span class="hljs-string">"60"</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">sizes</span>&gt;</span>
  <span class="hljs-tag">&lt;<span class="hljs-name">size</span> <span class="hljs-attr">from</span>=<span class="hljs-string">"33"</span> <span class="hljs-attr">to</span>=<span class="hljs-string">"48"</span> <span class="hljs-attr">total</span>=<span class="hljs-string">"96"</span> <span class="hljs-attr">count</span>=<span class="hljs-string">"2"</span>/&gt;</span>
  <span class="hljs-tag">&lt;<span class="hljs-name">size</span> <span class="hljs-attr">from</span>=<span class="hljs-string">"113"</span> <span class="hljs-attr">to</span>=<span class="hljs-string">"128"</span> <span class="hljs-attr">total</span>=<span class="hljs-string">"128"</span> <span class="hljs-attr">count</span>=<span class="hljs-string">"1"</span>/&gt;</span>
  <span class="hljs-tag">&lt;<span class="hljs-name">size</span> <span class="hljs-attr">from</span>=<span class="hljs-string">"49"</span> <span class="hljs-attr">to</span>=<span class="hljs-string">"49"</span> <span class="hljs-attr">total</span>=<span class="hljs-string">"79625"</span> <span class="hljs-attr">count</span>=<span class="hljs-string">"1625"</span>/&gt;</span>
  ...
  <span class="hljs-tag">&lt;<span class="hljs-name">size</span> <span class="hljs-attr">from</span>=<span class="hljs-string">"224705"</span> <span class="hljs-attr">to</span>=<span class="hljs-string">"224705"</span> <span class="hljs-attr">total</span>=<span class="hljs-string">"224705"</span> <span class="hljs-attr">count</span>=<span class="hljs-string">"1"</span>/&gt;</span>
  <span class="hljs-tag">&lt;<span class="hljs-name">size</span> <span class="hljs-attr">from</span>=<span class="hljs-string">"300689"</span> <span class="hljs-attr">to</span>=<span class="hljs-string">"524241"</span> <span class="hljs-attr">total</span>=<span class="hljs-string">"37369240"</span> <span class="hljs-attr">count</span>=<span class="hljs-string">"88"</span>/&gt;</span>
  <span class="hljs-comment">&lt;!--There are 235 free chunks whose size ranges between 524353 to 67108785
   bytes, and occupy 804350523 bytes in total. --&gt;</span>
  <span class="hljs-tag">&lt;<span class="hljs-name">size</span> <span class="hljs-attr">from</span>=<span class="hljs-string">"524353"</span> <span class="hljs-attr">to</span>=<span class="hljs-string">"67108785"</span> <span class="hljs-attr">total</span>=<span class="hljs-string">"804350523"</span> <span class="hljs-attr">count</span>=<span class="hljs-string">"235"</span>/&gt;</span>
  <span class="hljs-tag">&lt;<span class="hljs-name">unsorted</span> <span class="hljs-attr">from</span>=<span class="hljs-string">"65"</span> <span class="hljs-attr">to</span>=<span class="hljs-string">"65"</span> <span class="hljs-attr">total</span>=<span class="hljs-string">"65"</span> <span class="hljs-attr">count</span>=<span class="hljs-string">"1"</span>/&gt;</span>
  <span class="hljs-comment">&lt;!-- The "top" chunk is not included here --&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-name">sizes</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">total</span> <span class="hljs-attr">type</span>=<span class="hljs-string">"fast"</span> <span class="hljs-attr">count</span>=<span class="hljs-string">"3"</span> <span class="hljs-attr">size</span>=<span class="hljs-string">"224"</span>/&gt;</span>
<span class="hljs-comment">&lt;!--Total count of free chunks and total size 852MB--&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">total</span> <span class="hljs-attr">type</span>=<span class="hljs-string">"rest"</span> <span class="hljs-attr">count</span>=<span class="hljs-string">"8250"</span> <span class="hljs-attr">size</span>=<span class="hljs-string">"852802425"</span>/&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">system</span> <span class="hljs-attr">type</span>=<span class="hljs-string">"current"</span> <span class="hljs-attr">size</span>=<span class="hljs-string">"1173217280"</span>/&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">system</span> <span class="hljs-attr">type</span>=<span class="hljs-string">"max"</span> <span class="hljs-attr">size</span>=<span class="hljs-string">"1220870144"</span>/&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">aspace</span> <span class="hljs-attr">type</span>=<span class="hljs-string">"total"</span> <span class="hljs-attr">size</span>=<span class="hljs-string">"1173217280"</span>/&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">aspace</span> <span class="hljs-attr">type</span>=<span class="hljs-string">"mprotect"</span> <span class="hljs-attr">size</span>=<span class="hljs-string">"1173217280"</span>/&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">aspace</span> <span class="hljs-attr">type</span>=<span class="hljs-string">"subheaps"</span> <span class="hljs-attr">size</span>=<span class="hljs-string">"18"</span>/&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-name">heap</span>&gt;</span><span class="hljs-tag">&lt;<span class="hljs-name">br</span>&gt;</span><span class="hljs-tag">&lt;<span class="hljs-name">heap</span>&gt;</span><span class="hljs-tag">&lt;<span class="hljs-name">br</span>&gt;</span><span class="hljs-tag">&lt;<span class="hljs-name">heap</span> <span class="hljs-attr">nr</span>=<span class="hljs-string">"61"</span>&gt;</span>
...
More arenas</span>
</code></pre>
<p>It turns out <strong>our process has more than 60 arenas</strong> (each corresponding to a heap element in the XML). <strong>But what the hell is an arena?</strong> To explain that, we need to understand how malloc deals with multiple threads.</p>
<h4 id="heading-malloc-and-multi-threading"><strong>Malloc And Multi-Threading</strong></h4>
<p>In order to efficiently and safely handle multi-threaded applications, glibc’s malloc needs to tackle the situation where <strong>two threads might access the same memory region at the same time</strong> (which would cause race conditions). Therefore, the allocator segments the memory into different regions. These regions of memory are called “arenas”.</p>
<p><strong>Each arena structure has a mutex in it</strong>, which is used to control access to that arena. This mutex ensures that two different threads won’t access the same memory at the same time. Contention for this mutex is the reason multiple arenas get created.</p>
<p>The number of arenas is capped at eight times the number of CPU cores by default.</p>
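<p>For reference, glibc exposes this cap as a tunable. Here’s a minimal sketch (the <em>./engine</em> binary name is hypothetical, and this is not something from our original investigation) of capping a child process to two arenas via the <em>MALLOC_ARENA_MAX</em> environment variable documented in <em>mallopt(3)</em>:</p>
<pre><code class="lang-rust">use std::process::Command;

// Hedged sketch: glibc reads MALLOC_ARENA_MAX at startup, so it must be set
// in the environment before the child process begins allocating.
// "./engine" is a hypothetical binary name.
fn main() {
    let status = Command::new("./engine")
        .env("MALLOC_ARENA_MAX", "2") // allow at most 2 arenas
        .status()
        .expect("failed to spawn ./engine");
    println!("child exited with {status}");
}
</code></pre>
<p>Fewer arenas trade lower memory overhead for more mutex contention, which is exactly the tension described above.</p>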
<h4 id="heading-memory-fragmentation"><strong>Memory Fragmentation</strong></h4>
<p>It’s important to note that each arena has a heap (<a target="_blank" href="https://linkthedevil.gitbook.io/all-about-vpp/chapter1">sometimes multiple</a>) — a contiguous memory region holding the memory chunks (used or free). A heap can be grown or shrunk; the main arena’s heap is managed with the <em>brk</em> and <em>sbrk</em> syscalls, while the other arenas’ heaps are created with <em>mmap</em>.</p>
<p>As our process continues to stream row batches through the data pipeline, <strong>many small mallocs and frees</strong> are called, and heaps may get fragmented — gaps of free chunks form between used chunks. Adjacent free chunks are coalesced into bigger free chunks, but <strong>a heap can shrink only by its “top” chunk</strong> — the free chunk at the current end of the heap. Free chunks trapped between allocated chunks may serve future allocations, but they will not be returned to the OS. We lovingly refer to this situation as <strong>memory fragmentation</strong>.</p>
<p><strong>With multiple arenas and heaps, memory fragmentation is even more prevalent</strong>, because each heap gets fragmented on its own: there are fewer chances to coalesce free chunks or to find a suitable free chunk for an allocation.</p>
<p>To illustrate how costly memory fragmentation can be, I wrote a small C program. The program allocates 1,000,000 buffers and then <strong>frees all of them except the last one</strong>:</p>
<pre><code class="lang-cpp"><span class="hljs-meta">#<span class="hljs-meta-keyword">include</span></span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">include</span></span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">include</span></span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">include</span></span>

<span class="hljs-meta">#<span class="hljs-meta-keyword">define</span> NUM_BUFFERS 1000000</span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">define</span> BUFFER_SIZE 1024  <span class="hljs-comment">// 1KB</span></span>
<span class="hljs-comment">// 1000000 buffers * 1KB = ~1GB</span>

<span class="hljs-function"><span class="hljs-keyword">void</span>* <span class="hljs-title">allocate_buffers_then_free</span><span class="hljs-params">()</span> </span>{
    <span class="hljs-comment">// Allocate an array of NUM_BUFFERS buffers</span>
    <span class="hljs-keyword">char</span>* buffers[NUM_BUFFERS];
    <span class="hljs-keyword">int</span> i;

    <span class="hljs-keyword">for</span> (i = <span class="hljs-number">0</span>; i &lt; NUM_BUFFERS; i++) {
        buffers[i] = (<span class="hljs-keyword">char</span>*)<span class="hljs-built_in">malloc</span>(BUFFER_SIZE);

        <span class="hljs-comment">// Check if malloc was successful</span>
        <span class="hljs-keyword">if</span> (buffers[i] == <span class="hljs-literal">NULL</span>) {
            <span class="hljs-built_in">fprintf</span>(<span class="hljs-built_in">stderr</span>, <span class="hljs-string">"Failed to allocate memory for buffer %d\n"</span>, i);
            <span class="hljs-built_in">exit</span>(<span class="hljs-number">1</span>);
        }
        <span class="hljs-comment">// Write to buffer in order to assure it's mapped to physical memory</span>
        <span class="hljs-comment">// This is done because overcommit is enabled see https://www.linuxembedded.fr/2020/01/overcommit-memory-in-linux</span>
        <span class="hljs-built_in">memset</span>(buffers[i], <span class="hljs-number">0</span>, BUFFER_SIZE);
    }
    <span class="hljs-built_in">printf</span>(<span class="hljs-string">"Allocated buffers\n"</span>);

    sleep(<span class="hljs-number">10</span>);

    <span class="hljs-comment">// Freeing all the buffers except the last one</span>
    <span class="hljs-keyword">for</span> (i = <span class="hljs-number">0</span>; i &lt; NUM_BUFFERS - <span class="hljs-number">1</span>; i++) {
         <span class="hljs-built_in">free</span>(buffers[i]);
    }
    <span class="hljs-built_in">printf</span>(<span class="hljs-string">"Finished freeing buffers\n"</span>);
}

<span class="hljs-function"><span class="hljs-keyword">int</span> <span class="hljs-title">main</span><span class="hljs-params">()</span> </span>{
    setbuf(<span class="hljs-built_in">stdout</span>, <span class="hljs-literal">NULL</span>);
    allocate_buffers_then_free();
    <span class="hljs-comment">// Sleep until being killed</span>
    <span class="hljs-keyword">while</span> (<span class="hljs-number">1</span>) {
        sleep(<span class="hljs-number">10</span>);
    }
    <span class="hljs-keyword">return</span> <span class="hljs-number">0</span>;
}
</code></pre>
<p>I compiled (<em>gcc -o memtest memtest.c</em>), ran the program, and watched my machine’s memory with <em>watch free</em>. Here is the output I got:</p>
<pre><code class="lang-cpp"><span class="hljs-comment">// free showed our system had 13GB of free memory</span>
Allocated buffers
<span class="hljs-comment">// system had 12GB of free memory (1GB was allocated for out process)</span>
<span class="hljs-comment">// 10 seconds of sleep...</span>
Finished freeing buffers
<span class="hljs-comment">// system memory still had only 12GB of free memory</span>
<span class="hljs-comment">// Meaning the freed buffer chunks were not freed back to the OS</span>
</code></pre>
<p>What’s even more interesting to see is that <strong>if we change the program to free all the buffers except the first one</strong>, the memory will be freed back to the OS: with only the bottom chunk still in use, all the freed chunks coalesce into the “top” chunk, so the heap can shrink. Another adjustment that achieves the same outcome is increasing the <strong>buffer size to 1MB</strong> and reducing the number of buffers to 1,000. That’s because over a certain allocation size (<a target="_blank" href="https://www.gnu.org/software/libc/manual/html_node/Malloc-Tunable-Parameters.html#index-M_005fMMAP_005fTHRESHOLD">M_MMAP_THRESHOLD</a>), <strong>malloc will use <em>mmap</em></strong> (a syscall that can allocate memory outside of the heap), and mmap’ed buffers are easily freed to the OS with the <em>munmap</em> syscall.</p>
<p>I invite you to try it on your machine (be careful not to trigger an OOM).</p>
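<p>If you’d like to play with the threshold itself, it can also be lowered at runtime. Here’s a minimal Rust sketch against glibc (we declare <em>mallopt</em> directly rather than assuming your libc bindings re-export it; <em>M_MMAP_THRESHOLD</em> is -3 in glibc’s malloc.h):</p>
<pre><code class="lang-rust">use std::os::raw::c_int;

// Hedged sketch: lower glibc's mmap threshold so even 1KB allocations are
// served by mmap and can be returned to the OS individually with munmap.
const M_MMAP_THRESHOLD: c_int = -3; // from glibc's malloc.h

extern "C" {
    fn mallopt(param: c_int, value: c_int) -> c_int;
}

fn main() {
    // Ask glibc to use mmap for any allocation of 1024 bytes or more.
    // Note: setting this also disables glibc's dynamic threshold adjustment.
    let ok = unsafe { mallopt(M_MMAP_THRESHOLD, 1024) };
    assert_eq!(ok, 1, "mallopt returns 0 on failure");
}
</code></pre>
<p>With the threshold at 1KB, the test program’s buffers come straight from <em>mmap</em>, and freeing them returns the memory to the OS immediately — at the cost of a syscall per allocation, which is why the default threshold is much higher.</p>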
<p><strong>Back to our process’s <em>malloc_info</em> XML</strong>… Our process had 64 arenas (the maximum for an 8-core machine), and each one held between 100 and 1,000 MB of free chunks. This strongly implies that malloc being greedy, plus memory fragmentation amplified by our use of multi-threading, is the reason we have so much memory wasted on free chunks.</p>
<h4 id="heading-switching-memory-allocator"><strong>Switching Memory Allocator</strong></h4>
<p>Memory fragmentation is a hard problem that most memory-intensive programs face. To solve it, we could have tuned malloc’s parameters or changed our application’s allocation patterns (which is a great topic for another blog post).<br />Even so, like many other memory-desperate developers, <strong>we started seeking a more suitable allocator</strong>.</p>
<p>In a research paper titled “<a target="_blank" href="https://db.in.tum.de/~durner/papers/study-memory-allocation-adms19.pdf?lang=de">Experimental Study of Memory Allocation for High-Performance Query Processing</a>”, we read about other memory allocators and their pros and cons. The <strong>Jemalloc allocator</strong> was the star of the show. One of its noted strong suits is memory fairness, defined as “returning freed memory back to the OS”.</p>
<p>One of the ways Jemalloc achieves memory fairness is a time-based decay mechanism that releases free memory back to the OS over time (it relies heavily on <em>mmap</em> for its allocations, so unused chunks can be purged or unmapped).</p>
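<p>This decay behavior is tunable through jemalloc’s option string. Here’s a minimal sketch (the binary name is hypothetical, and the exact variable name depends on how jemalloc was built: tikv’s prefixed build reads <em>_RJEM_MALLOC_CONF</em>, while an unprefixed jemalloc reads <em>MALLOC_CONF</em>):</p>
<pre><code class="lang-rust">use std::process::Command;

// Hedged sketch: ask jemalloc to purge dirty pages after 1 second instead
// of the default 10 seconds. "./engine" is a hypothetical binary name.
fn main() {
    let status = Command::new("./engine")
        .env("_RJEM_MALLOC_CONF", "dirty_decay_ms:1000,muzzy_decay_ms:1000")
        .status()
        .expect("failed to spawn ./engine");
    println!("child exited with {status}");
}
</code></pre>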
<p>We switched our global allocator to Jemalloc, which is very easy in Rust:</p>
<pre><code class="lang-rust"><span class="hljs-keyword">use</span> tikv_jemallocator::Jemalloc;

<span class="hljs-meta">#[global_allocator]</span>
<span class="hljs-keyword">static</span> GLOBAL: Jemalloc = Jemalloc;
</code></pre>
<p>We ran the stress test, and surprisingly, it passed. There was no notable performance difference in any of our stress tests.</p>
<p>We used Jemalloc stats (from the great tikv_jemalloc_ctl crate) in order to monitor memory fragmentation:</p>
<p><img src="https://cdn.prod.website-files.com/66e7c0e8297c80a6f19bbd49/66e7c1509798fc06d9a64d6a_6598058b8c07f7538cb37ca3_1*s064-PXF6luKIYBTDVA0-w.png" alt /></p>
<blockquote>
<p><em>resident — Total number of bytes in physically resident data pages mapped by the allocator.</em></p>
<p><em>‍</em>‍<em>allocated — Total number of bytes allocated by the application. (in-use allocations)</em></p>
</blockquote>
<p><em>resident</em> minus <em>allocated</em> should be approximately equal to the free memory held by our process. So <strong>free chunk memory consumption is low</strong>, which is great! There’s a small caveat with the graph — it keeps growing steadily. This is explained by the fact that we hadn’t yet reached the limit where we start stalling our data stream (12 GB).</p>
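<p>For completeness, here’s a minimal sketch of how such a monitor can read those two counters with <em>tikv_jemalloc_ctl</em>, assuming Jemalloc is the global allocator as above (jemalloc caches its statistics, so the epoch must be advanced before each read):</p>
<pre><code class="lang-rust">use tikv_jemalloc_ctl::{epoch, stats};

// Hedged sketch of the monitoring described above.
fn log_fragmentation() {
    // jemalloc caches statistics; advancing the epoch refreshes them.
    epoch::advance().unwrap();
    let allocated = stats::allocated::read().unwrap(); // bytes in live allocations
    let resident = stats::resident::read().unwrap(); // bytes in resident data pages
    let overhead_mb = resident.saturating_sub(allocated) / (1 &lt;&lt; 20);
    println!(
        "allocated: {} MB, resident: {} MB, ~free-chunk overhead: {} MB",
        allocated / (1 &lt;&lt; 20),
        resident / (1 &lt;&lt; 20),
        overhead_mb
    );
}

fn main() {
    log_fragmentation();
}
</code></pre>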
<p><strong>Ultimately, the test passed, and the system memory seemed to be lower by more than half compared to the same test with <em>glibc</em>.</strong></p>
<p><img src="https://cdn.prod.website-files.com/66e7c0e8297c80a6f19bbd49/66e7c1509798fc06d9a64d49_659805b4e99ede73446a6ea3_0*k04HX6wsZWF5qcw0.jpeg" alt /></p>
<p>So we just made our stress test pass and the system will succeed in processing <strong>one</strong> incremental view with billions of records! <strong>But will it pass if a thousand incremental views are created?</strong></p>
<h4 id="heading-multi-process-architecture-memory-implications"><strong>Multi-Process Architecture Memory Implications</strong></h4>
<p>As you probably guessed, we had <strong>a stress test that spawned a thousand incremental views, and it crashed miserably</strong> because of high memory utilization.</p>
<p>At the time, we were using a multi-process architecture; <strong>each incremental view had a dedicated engine process that maintained it —</strong> receiving row changes (<strong>CDC</strong>) and running them through the data pipeline.</p>
<p>Unfortunately, the existence of 1,000 engine processes presented a new hurdle for us. While each process had its own memory regulator limiting its usage to, say, 4GB, nothing bounded the processes’ combined usage: if enough of them approached their limit at once, the machine could run out of memory, resulting in an OOM.</p>
<h4 id="heading-moving-to-single-process-architecture"><strong>Moving To Single-Process Architecture</strong></h4>
<p>It became more and more evident that <strong>an architecture change was needed</strong>.</p>
<p>A process per incremental view provides isolation between them; however, a <strong>single-process architecture holds many benefits for us</strong> in terms of performance and capabilities:</p>
<ul>
<li><p>Memory fragmentation is lower.</p>
</li>
<li><p>Memory utilization does not depend on the number of incremental views.</p>
</li>
<li><p>Memory regulation in our data pipeline is easier. <strong>In a multi-process architecture, different processes are not aware of each other’s resource consumption and needs.</strong></p>
</li>
<li><p>Only one embedded RocksDB instance.</p>
</li>
<li><p>Overall decrease in memory overhead.</p>
</li>
<li><p>Communication between incremental views and with their coordinator is easy and fast.</p>
</li>
</ul>
<p>For these reasons, <strong>we decided to move to a single-process architecture</strong> where all the incremental views are just async tasks maintained by one process. Now, instead of depending on the number of incremental views, our memory utilization depends on the total number of SQL operations.</p>
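<p>In shape (with illustrative names, not our actual code, and assuming a tokio runtime), the new architecture looks something like this:</p>
<pre><code class="lang-rust">// Hedged sketch: every incremental view becomes an async task inside one
// process, sharing a single allocator, RocksDB instance and memory budget.
async fn maintain_view(_view_id: u64) {
    loop {
        // Placeholder: a real task would await CDC row changes on a channel
        // and run them through the view's data pipeline.
        tokio::task::yield_now().await;
    }
}

#[tokio::main]
async fn main() {
    let handles: Vec&lt;_&gt; = (0..1000u64)
        .map(|id| tokio::spawn(maintain_view(id)))
        .collect();
    for handle in handles {
        handle.await.unwrap(); // views run until dropped or shutdown
    }
}
</code></pre>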
<h4 id="heading-is-single-process-architecture-better-for-sql-engines"><strong>Is Single-Process Architecture Better For SQL Engines?</strong></h4>
<p>It’s important to note that single-process architecture isn’t necessarily better than multi-process; as with everything in life, there are tradeoffs.<br />Just to name a few:</p>
<ul>
<li><p>It may be harder to implement; e.g., aborting a query (or an incremental view, in our case) is harder. In a multi-process architecture, it is as easy as killing the process. This is less critical for us since incremental views are seldom aborted.</p>
</li>
<li><p>Different query executions share the same address space, so <strong>a memory bug in one query might affect other queries</strong>. Fortunately, this is of lesser concern for us, thanks to Rust.</p>
</li>
</ul>
<p>PostgreSQL, for example, uses a multi-process architecture for its worker processes (the processes in charge of executing queries).<br />In the <a target="_blank" href="https://www.postgresql.org/message-id/flat/31cc6df9-53fe-3cd9-af5b-ac0d801163f4%40iki.fi">“Let’s make PostgreSQL multi-threaded” mailing-list thread</a>, there’s an excellent comment that portrays the negative impact of single-process architecture:</p>
<blockquote>
<p><em>OOM and local backend memory consumption seems to be one of the main<br />challenges for multi-threadig model:<br />right now some queries can cause high consumption of memory. work_mem is<br />just a hint and real memory consumption can be much higher.<br />Even if it doesn’t cause OOM, still not all of the allocated memory is<br />returned to OS after query completion and increase memory fragmentation.<br />Right now restart of single backend suffering from memory fragmentation<br />eliminates this problem. But it will be impossible for multi-threaded Postgres.</em></p>
</blockquote>
<p>Basically, when a query’s execution is complete, <strong>Postgres can choose to kill the process</strong> if it detects irregular physical memory usage or memory fragmentation. That way, Postgres eliminates both of these problems.<br />Having said that, while a query’s execution time usually ranges from milliseconds to hours, an incremental view’s execution rarely completes — it should be up until the view is dropped by a user or Epsio is shut down. Therefore, this mitigation is also less relevant for us.</p>
<h4 id="heading-conclusion"><strong>Conclusion</strong></h4>
<p>Memory-intensive applications can pose challenges. <strong>It’s crucial to consider the internal behavior of your allocator and to regulate memory usage</strong>, rather than relying solely on the OS to save the situation. Experimenting with different inputs and scenarios can produce different allocation patterns that may or may not cause fragmentation or bloating.</p>
<p>If you want to try Epsio’s <em>blazingly fast</em> (and super memory regulated) SQL engine, <a target="_blank" href="https://www.epsio.io/">check us out here</a>.</p>
<p>References:</p>
<ul>
<li><p><a target="_blank" href="https://www.kernel.org/doc/gorman/html/understand/understand016.html">https://www.kernel.org/doc/gorman/html/understand/understand016.html</a></p>
</li>
<li><p><a target="_blank" href="https://lwn.net/Articles/761118/">https://lwn.net/Articles/761118/</a></p>
</li>
<li><p><a target="_blank" href="https://developers.redhat.com/blog/2014/10/02/understanding-malloc-behavior-using-systemtap-userspace-probes">https://developers.redhat.com/blog/2014/10/02/understanding-malloc-behavior-using-systemtap-userspace-probes</a></p>
</li>
<li><p><a target="_blank" href="https://news.ycombinator.com/item?id=35469632#:~:text=In%20some%20ways%20Rust%20is,to%20reuse%20previously%20allocated%20memory">https://news.ycombinator.com/item?id=35469632#:~:text=In%20some%20ways%20Rust%20is,to%20reuse%20previously%20allocated%20memory</a>.</p>
</li>
<li><p><a target="_blank" href="https://stackoverflow.com/questions/39753265/malloc-is-using-10x-the-amount-of-memory-necessary">https://stackoverflow.com/questions/39753265/malloc-is-using-10x-the-amount-of-memory-necessary</a></p>
</li>
<li><p><a target="_blank" href="https://linkthedevil.gitbook.io/all-about-vpp/chapter1">https://linkthedevil.gitbook.io/all-about-vpp/chapter1</a></p>
</li>
<li><p><a target="_blank" href="https://engineering.linkedin.com/blog/2021/taming-memory-fragmentation-in-venice-with-jemalloc">https://engineering.linkedin.com/blog/2021/taming-memory-fragmentation-in-venice-with-jemalloc</a></p>
</li>
<li><p><a target="_blank" href="https://linkthedevil.gitbook.io/all-about-vpp/chapter1">https://linkthedevil.gitbook.io/all-about-vpp/chapter1</a></p>
</li>
</ul>
]]></content:encoded></item></channel></rss>