InfluxDB 3 & Rust
Welcome to Cloud Native Campus,
a podcast to help you navigate
the vast landscape of the cloud
native ecosystem. We're your hosts.
I'm David Flanagan,
a technology magpie that can't stop
playing with new shiny things.
I'm Laura Santamaria,
a forever learner who is
constantly breaking production.
Do you want a single database to
store high precision,
multidimensional time series data
that supports infinite cardinality?
Well, we're not there yet,
but Paul Dix does share his vision
and roadmap for InfluxDB three.
Do you want to, David, not have to bring up Rust just to get a guest to actually talk about Rust? Now's your chance.
In all seriousness, we not only get to talk about a move from Go to Rust,
but also about observability and
how it's changed over time,
as well as a little bit about
open source licensing changes.
So does less Rust mean Go? No, but really, let's talk Rust.
All right.
Thank you for joining us, Paul.
And for anyone who's not aware,
can you please tell us a little bit
more about you and what you're up to?
Yeah, so I'm Paul Dix.
I'm the co-founder and current
CTO of InfluxData.
We are the company that makes InfluxDB, which is an open source time series database.
It's useful in use cases like
tracking system metrics and
application performance metrics.
Sensor data, that kind of stuff.
My background is as a programmer.
I started the company in 2012 and
we went through Y Combinator,
actually under a different idea.
And then we pivoted to this idea in
the fall of 2013. Awesome. Thank you.
There's more I can go on like at
length, but I'm trying to trying
to be concise. Yeah. All right.
Well, let's talk about your work on InfluxDB. So, over the last, I'm not sure how many, years, you started work on a rewrite of InfluxDB, moving towards using Rust, on a project called IOx, which I believe this year you have now pushed out into production as part of your InfluxDB Cloud offering. So I really wanted to drill into that and understand more: why a rewrite, the change in programming language, re-architecting the system using Apache Arrow? It's a huge task for a company to take on, and I'd love to know more. Yeah.
So I'll just start with the obvious,
which is a rewrite is basically like
the worst possible thing you can do.
Like unless you're just like a
sadist or a glutton for pain.
Um, you should avoid a rewrite at all
costs. Okay, that's out of the way.
Um, so InfluxDB, the original version, is written in Go, and the core architecture of the database is built around what I would call, like, a clever hack. Right.
So it's essentially, it's kind of like two databases in one.
One is basically a time series
database that organizes data on
disk as individual time series,
which are value timestamp pairs
in time ascending order. Right.
So as all the data comes in, it tries
to organize that data on disk in
that format, which is basically
like a very like indexed format.
The other piece of this is an
inverted index that maps metadata
to individual time series.
So the metadata is like a
measurement name,
a tag key value pair, a field name.
So just like, you know,
usually people are familiar with
inverted indexes from document
search where you have a document,
you have an inverted index of the
words that appear in the document.
And then when you search, you find
the words and you find the documents
that have those words, right.
In this case we say, oh tag key value
pair like host equals server A or
region equals us-west, and then you
find the time series that match
that piece of metadata. Right.
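A minimal Rust sketch of the two structures being described here, a per-series list of timestamp/value pairs plus an inverted index from tag key=value pairs to series IDs. The type names and fields are purely illustrative, not InfluxDB's actual code:

```rust
use std::collections::{BTreeMap, HashMap};

// Illustrative types only, not InfluxDB's implementation.
type SeriesId = u64;

// One time series: timestamp/value pairs kept in time-ascending order.
#[derive(Default)]
struct Series {
    points: BTreeMap<i64, f64>, // nanosecond timestamp -> value
}

// Inverted index: tag key=value (e.g. "host=serverA") -> series that carry it.
#[derive(Default)]
struct InvertedIndex {
    postings: HashMap<String, Vec<SeriesId>>,
}

impl InvertedIndex {
    fn add(&mut self, tag: &str, series: SeriesId) {
        self.postings.entry(tag.to_string()).or_default().push(series);
    }

    // "host=serverA" -> all matching series; the query engine then reads only
    // the time range it needs from each of those series.
    fn lookup(&self, tag: &str) -> &[SeriesId] {
        self.postings.get(tag).map(Vec::as_slice).unwrap_or(&[])
    }
}

fn main() {
    let mut index = InvertedIndex::default();
    let mut series: HashMap<SeriesId, Series> = HashMap::new();

    // Ingest one point for series 1, tagged host=serverA.
    index.add("host=serverA", 1);
    series.entry(1).or_default().points.insert(1_700_000_000_000_000_000, 0.42);

    // Needle-in-a-haystack query: find series for a tag, then scan their points.
    for id in index.lookup("host=serverA") {
        if let Some(s) = series.get(id) {
            println!("series {id}: {} points", s.points.len());
        }
    }
}
```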
So what that means is as you're
ingesting all the data, there's a lot
of indexing work that happens. Right?
So it chews through like a lot
of CPUs and stuff like that.
And it reorganizes the data.
Now when a query comes in, if the
query is for one individual series,
generally those queries are very,
very fast, right?
This is a system that's optimized
for that kind of query workload and
for that kind of like needle in a
haystack query specifically on,
you know,
the the individual series and the
time range that you're looking for.
But again, like the problem becomes
as your index, the set of metadata
that you're tracking grows,
commonly referred to as like the
cardinality problem. Right.
The number of unique tag values that occur, the actual number of individual series that occur.
You spend more and more time
indexing the data,
and it just becomes more and more
expensive to ingest it. Right?
You spend more CPUs, you spend
more disk space storing the index.
And like all this other stuff,
it becomes really painful,
particularly as you try to add
like more and more granularity
and visibility, like more
precision into the measurements
that you're taking. Right?
Like, again,
more kind of metadata that you're
capturing around those things.
And then the other side of this
is when you want to do a query.
If you have a query that's going
to touch, you know,
tens of thousands, hundreds of
thousands or even millions of
individual time series, right?
You want to do an aggregation across
all these time series in a region or
whatever, and compute something.
Those queries become
prohibitively expensive.
And oftentimes, like,
you can't even do them because the
engine will just fall over, right?
Just the way you'd have to do it, mapping it onto that index structure, is way, way too expensive.
So basically that's the version of the database written in Go. Version one, version two, same storage engine; we wrote that storage engine from scratch, and it kind of has this architecture. We prototyped the initial version of that storage engine in the fall of 2015, and then we had the first release in early 2016 that had it, and we've iterated on it and added to it over time.
And essentially what we found over
the next whatever number of years,
four years, is that people wanted
to have higher cardinality data.
They wanted to feed in data where
they actually didn't have to
worry about the cardinality or
the ephemerality of the values
that they were feeding in,
and they wanted to do more
analytical queries across it. Right.
And all of this stuff was built around our query language, InfluxQL, which is a query language that looks kind of like SQL. So it's kind of familiar and friendly, but for people who really know SQL, it can be frustrating in unique ways, because it doesn't actually work exactly like SQL. But for some things, it's super easy to create, like, a time series query or whatever.
So again,
like there are multiple problems
we were trying to solve for how do
we solve this cardinality problem?
How do we give people a query
engine that can be useful for
analytical style queries on
larger chunks of data?
And then this other piece, which was: we needed to figure out how to store a massive amount of data in a much more cost-effective way. Right?
InfluxDB one and two have just a
base level of assumption that
you have a, you know,
a locally attached SSD or, you know,
whatever, like an EBS volume,
high performance network volume
with provisioned IOPs and whatever,
and all of your data is stored on
that. And it's super expensive.
And for a lot of our use cases,
right, people could have a
year's worth of data, but 99% of
their queries hit basically the trailing few hours or few days worth of data. Right. But they want this stuff available and accessible, but they don't need it available and accessible, you know,
at the same response times.
And they definitely don't need it,
like stored on expensive NVMe
drives or whatever. Right?
So we needed to figure out a way
to decouple the compute of
ingestion and query from the
actual storage of the data. Right.
And obviously, like all this stuff
is like building up over the
course of like, you know, 2017,
2018 is when I'm noticing this 2019,
it's just becoming more apparent.
And the thing is, also during
this time, there are interesting
things happening out there in
the infrastructure world, right?
The rise of Kubernetes,
for example, wasn't there when we first created InfluxDB.
So this idea of like containerized
applications and like this ephemeral
application stack or ephemeral
compute stack that you layer on.
And then the other thing, the rise
of object storage as basically
like a common storage layer.
Those things happened over the course of that decade, you know, the 2010s.
And I think one company that really took advantage of that, at least the rise of object storage and decoupled compute from storage, was Snowflake, right? They're the first company that I think really commercialized this idea of:
We can create this big data system
that stores data super cheaply
and just layers on compute on
demand to execute queries on it.
Now, Snowflake is obviously designed for a completely different use case than InfluxDB, right? Snowflake is a data warehouse, data at scale, whatever. InfluxDB is about operational data and real time, right? You need to be able to query it within milliseconds of writing it into the database, and you need those queries to generally return sub-second, so that you can build monitoring and alerting systems on it.
You can build real time dashboarding.
So in 2019... actually, the fall of 2018 is when I started picking up Rust. And I thought, again, the first commit on InfluxDB was in 2013, but the basis technology that we built for it we had actually started in the fall of 2012, and Rust was not in a place where I would use it then. I used Go because I thought we would be able to move faster creating the database, so we used Go as the language.
But in 2018 I started picking up Rust and thinking, okay, this is actually interesting. And then the fall of 2019 is when the async/await stuff landed in Rust. And when that landed, I thought, okay, this is probably going to be a serious language for building server side software, where you have to handle network requests and all this other kind of stuff. Right.
That was, I think, for me, the final piece that I was looking for: that Rust had actually arrived at a point where you could use it to build a complicated piece of server software, and you wouldn't have to build everything yourself. Right? There are certainly successful projects in Rust that started before that, right? Like Linkerd did that. But at that point I was like, okay, that's interesting.
So coming into the beginning of 2020,
which is when I kicked off this
project, you know, I just said,
okay, we need a different
database architecture.
This combination of this inverted index plus this time series storage, and the way the entire database engine works, is not going to work for what we want to build, for the requirements we want to meet. Right.
Like we're using like memory
mapped files, which is like again,
you're not going to get that in
object storage.
And it's not great to use in
containers and like all this
other stuff.
So I was like, okay, if we're going to re-architect the entire database, that's basically a rewrite of the database. We could do it in Go. And, you know, there's a bunch of stuff that exists in Go that we would reuse, that obviously wouldn't be rewritten, the language parser and all this other stuff. But basically, I was looking at the project and I thought, this is a rewrite, if we actually try to make these big changes.
And again, in like late 2019,
early 2020, one of the other
things I noticed out there in
the world was Apache Arrow.
Like I'd known about the project
for a little while.
Apache Arrow is like an in-memory columnar format specification.
Um, and I was looking at that and
I was looking at Apache Parquet,
which is a file format for this
kind of structured analytical data.
And I thought, well,
there's I think there's really
something interesting here.
So again, like I wrote like some blog
post in like early 2020 where I said,
like, I thought that the different
pieces of the Apache Arrow project
would become like a way for
companies that are building data
warehousing systems and big data
systems and streaming data systems,
basically like all these like
analytical systems that are working
on observational data of any kind,
right?
Whether it's server monitoring
data or sensor data or whatever,
those standards would become a way
for people to collaborate and build.
You know, common infrastructure
but also proprietary solutions.
And those will be like the touch
points in terms of like how you
exchange data between these systems.
So that was kind of the thesis
in early 2020.
Rust as a language is going to be better because, one, the multithreading support in Rust, the way it handles it, I think is just way, way better, right? Because it's kind of enforced by the compiler. Um, we wanted strict control over how memory is managed and that kind of stuff, which basically means we didn't want a garbage-collected language. Performance was going to be a critical thing. Like, Go is super fast.
Don't get me wrong,
I love it as a language.
It's way easier to learn and work with than Rust, I think, my personal opinion. But I just thought for this kind of software, for a database system that has to perform at scale with high performance, Rust just seems like the logical choice. And if it wasn't Rust, it would be C++. But I think in this day and age it's just a better choice to use Rust. Yeah.
And then over the course of the
next three and a half years,
like initially it was basically me and one other guy within Influx for a couple of months. And then we hired somebody else, Andrew, who's still with us. And basically the three of us treated it almost as a research project for the first six months. Within the company, there was no way I was going to get people to buy off on: oh, Paul wants to rewrite the database, yeah, let's put a bunch of effort into that.
Um, no, it was basically like,
I'm going to do this as a
research project because I think
it's interesting and I'm going
to see where it leads.
And then by November of 2020, uh, we'd had enough of the pieces figured out, right? We were going to use Apache Arrow. We were going to use Parquet as the persistence format. Object storage is where all the data was going to live. Uh, Apache Arrow Flight was going to be the RPC mechanism, which has since evolved into Flight SQL, a new standard they have for essentially doing RPC and SQL queries in these kinds of data systems. And a project called Apache DataFusion, which right now is a subproject of Apache Arrow. DataFusion is a SQL parser, planner, optimizer, and execution engine written in Rust.
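A small, hedged sketch of what that stack looks like from Rust: DataFusion planning and executing plain SQL over Parquet data. The table name, file path, and column names here are made up for illustration, and exact APIs vary between DataFusion releases:

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // DataFusion is the SQL parser/planner/optimizer/execution engine;
    // Parquet is the persistence format the data lives in.
    let ctx = SessionContext::new();

    // "cpu" and the path are illustrative; in a real deployment the Parquet
    // files would typically sit in object storage rather than on local disk.
    ctx.register_parquet("cpu", "data/cpu.parquet", ParquetReadOptions::default())
        .await?;

    // A plain analytical SQL query over the registered table.
    let df = ctx
        .sql("SELECT host, avg(usage) AS avg_usage FROM cpu GROUP BY host")
        .await?;
    df.show().await?;
    Ok(())
}
```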
And we're like, at that point in
the summer of 2020 when we decided
to build around these things,
you know, DataFusion wasn't even close to as far along as it is now.
And we knew, um, we knew that we'd be
like investing significant effort.
Like we actually looked at using
other engines as the core of the
database.
We looked at some C++ engines
just to think like, you know,
we didn't want to write our own,
but we needed something that was
like optimized for our use case
for time series.
And we saw that regardless of what we
picked up, we would end up having to
do a lot of work and we'd almost,
you know, have to take partial
ownership of the project.
And we thought this umbrella of projects under Apache, under the Apache Foundation, under Arrow, were, you know, early, but they were promising.
And if we really put our effort
behind it, it would cause more
people to also start programing
against it and whatever.
So in November of 2020,
I announced, like, hey, working
on a new core of the database.
It's called IOx, because nobody was comfortable with me calling it InfluxDB 3.0. Because again, they're like, you're not going to rewrite the database, so there's no way we're doing this. And I was like, well, let's just see how it goes.
Um, so yeah, at that point I announced it, and still it was like three of us working on it. It was still very, very early stage, but that got, you know, a number of people out there in the world interested in the project, and we hired some great people that joined us, so that by March of 2021, you know, we had a team of, I think, nine people. And we spent years writing a database, which we launched into production earlier this year.
So how long did it take before
you were allowed to call it 3.0?
Before people stopped telling you, you know, you can't rewrite this database?
Uh, I mean, we announced publicly that it was InfluxDB 3.0 on April 26th of this year.
So inside the company, it took you
like 2 or 3 years to get everybody on
board with the idea of no, really,
we did. And we're about to finish it.
I would say, basically, by 2021, like summer or fall of 2021, people within the company were like, okay, we need this new core database engine. Because at that stage, it was obvious what the limitations of the previous engine were.
Right in the beginning,
people were like, Paul,
what are you talking about?
Like, you know, there were some
people who got it intrinsically
and other people were like,
we don't need to do this right now.
And then by again,
like I'd say like the fall of 2021,
everybody in the company was like,
okay, we definitely need this
new database engine.
When's it going to be ready?
And I'm like, guys, you know, we're not baking a quiche here.
Like it's going to take some time.
Um, yeah.
So, uh, yeah. By, I'd say, the spring of 2022, it was like, okay, this is obviously what we need to be doing. And then definitely by the fall of 2022, it's like, okay, we're getting everybody focused on this new database engine and we're going to call it 3.0. And then it just became a question of when we were going to be more public about the fact that it was 3.0. But in my mind it was always InfluxDB 3.0, even though it is a total rewrite and the database architecture is drastically different from the underlying database architecture of one and two. All right.
Nice, I love that. We just ask a question and then set you loose, and then you just go for it. Sorry. Yeah, go for it. No, don't be sorry. That's the good part. It's fine. Well, I don't know if there should be more back and forth. No, no, no, that was absolutely perfect.
And you know, you kind of answered my second question, which is good as well, because when we're thinking through the problem space here, I think there's a lot of context about what happened in the industry, what happened with the early versions of InfluxDB. Why this rewrite was required, why Rust, why all these things, now makes a lot more sense, right? But, you know, let's reemphasize one of those things you talked about with the 2010s.
It's like this was the decade where
containers and cloud took off.
All right.
People were using ephemeral compute,
spinning up VMs on GCP as they're
launching dozens to hundreds to
thousands of containers
orchestrated with Kubernetes.
And all of these have their own
signals.
They have their own logs,
they have their own metrics,
they have their own traces.
Traces are now important because at
the same time of this wild cloud
container orchestration evolution,
people decided to start doing
microservices because the
technology enabled all these new
architectures too.
So I spend a lot of my time,
I'm just going to set context and
not ask questions and then just
let you infer the questions.
But still, I spend a lot of my
time working with companies that
are trying to build out platforms.
You know, they want to make it
easier for their developers to
deploy to production.
And I think one of the
challenges I've seen is that
people really struggle with,
I need a database for logs, I need a database for tracing, I need a database for my metrics.
I need to be able to aggregate
and query and all this stuff,
and they make it really complicated
for what is, in essence,
all the same data structure.
I don't think there's that much difference between a metric, a trace, and a log. It's all a collection of events. The difference between a trace and a metric is that a metric is an aggregation of some raw-level event, and the challenge has always been the cost of storing it at super high dimensionality versus the ease of querying it. Which is why we probably do metrics and terrible versions of histograms and all this other stuff that we now accept as the norm. Right? Yeah.
And I'm going to quote something on your website that you might hate me for, right. But when we talk about IOx and InfluxDB 3, you specifically say infinite cardinality. So the question is: does IOx, or InfluxDB 3, sorry, give us a single store for all of these signals, all of this observability and monitoring data? And can you give us a bit more insight into what that infinite cardinality actually means in practical terms? Yeah. So.
To store the data? Yes, 3.0 can store all of that kind of data, right? Um, because of the fact that cardinality doesn't matter. The problem becomes what happens when you try to query that data, pulling it out. Right. And that's basically the patterns for how you query the data and what people expect, or, I think, why you end up with three different systems for storing each of those kinds of data, right? Traces, metrics, and logs.
In my mind, they're all just like
different views of the same thing.
Ultimately, like if you wanted to,
you could just have traces and
skip the logs and metrics. Right?
You could infer, you could derive everything else from raw traces, because, again, you could have just a blob string field in the trace that has log info. Right. But the problem is, the problems are, what happens at scale, right?
And when you start generating a
ton of this data,
do you end up having to sample it.
And then how do you actually
access the data.
What are the access patterns. Right.
So for right now, you know, the metric access patterns are like, I have this metric and I want to look at it. And the idea behind metrics is, actually, metrics are a summary view of some underlying distribution or some underlying thing that you're tracking.
Generally, metrics are not the
raw high precision view, right?
For example, if you want the
average response time in one minute
intervals to an API call, right,
a specific like API endpoint,
right now the raw view is
individual requests, right.
And you log every bit of detail you would want on that request. Right. What host received the request, what user submitted it, what token they were using, what endpoint, the actual data included in the request, the response time, the response itself. Right. You can get down to just an insane level of precision. But the problem is, to do that at scale is completely unreasonable. It generates more data than you could ever even store, and all this other stuff. So you end up creating these systems to summarize things.
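A tiny Rust sketch of the summarization just described, collapsing raw per-request events into one-minute average response times per endpoint. The field names are illustrative assumptions, not any particular system's schema:

```rust
use std::collections::BTreeMap;

// A raw, high-precision event: one API request (illustrative fields only).
struct Request {
    timestamp_ms: i64,
    endpoint: String,
    response_time_ms: f64,
}

// Summarize raw requests into a metric: average response time per endpoint per
// one-minute bucket. Detail not captured here (user, token, payload, and so on)
// is gone once only the summary is stored.
fn one_minute_averages(requests: &[Request]) -> BTreeMap<(String, i64), f64> {
    let mut sums: BTreeMap<(String, i64), (f64, u64)> = BTreeMap::new();
    for r in requests {
        let minute = r.timestamp_ms / 60_000;
        let e = sums.entry((r.endpoint.clone(), minute)).or_insert((0.0, 0));
        e.0 += r.response_time_ms;
        e.1 += 1;
    }
    sums.into_iter()
        .map(|(key, (sum, count))| (key, sum / count as f64))
        .collect()
}

fn main() {
    let reqs = vec![Request { timestamp_ms: 0, endpoint: "/write".into(), response_time_ms: 12.5 }];
    for ((endpoint, minute), avg) in one_minute_averages(&reqs) {
        println!("{endpoint} minute {minute}: avg {avg} ms");
    }
}
```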
And the problem that people frequently run into is, well, if you didn't think ahead of time about what you needed to summarize, when you go to look at the summaries, your metrics, the answer you need isn't there and the request has already happened. Right. So logs are a way to capture more detail and then kind of try to figure it out after the fact. Right. So the idea with logs is, it's something where you're doing an ad hoc investigation, where you're not continuously looking for some signal that triggers, like, a problem in the system. But ultimately, for storing all that data, you can use the same format. Parquet, for example, as a storage format can store all that kind of data, but querying it effectively and efficiently is difficult.
And that usually requires either
secondary indexing structures or
other ways to organize the data
so that you can actually query
it effectively.
Now, aspirationally, 3.0 wants to be the home for all that kind of data. Basically, for any kind of observational data you can think of. Right.
And like for us, it's not just like,
you know, the server infrastructure
monitoring use cases, but also more
and more sensor data use cases.
And again, with sensors you find
the same kind of thing, right.
People can deploy more sensors
in their environment for the
machines they're tracking or the
environments they're tracking,
but they can also increase the precision of the measurements. Right?
They can increase the sampling
interval.
They can increase the precision in
terms of what gets tracked with each
measurement that gets created. Right.
And that's all the metadata that
you could potentially track
about something with sensors.
You could be, again,
like all the stuff around the
customers or the users.
It could be around the location,
the, you know, the lat long,
like all that other kind of stuff.
So getting there, like we're there in
terms of being able to store it,
we're not there in terms of
being able to query it.
We organize data into large chunks,
and then the query engine just
kind of brute forces those chunks.
So I think that is going to take some time, to get to the stage where we can actually do all those things.
I think, uh,
there's some other stuff around,
like with logs and tracing use
cases where the schemas are very,
very dynamic and they're not
always consistent. Right.
If you try to like, pull out
structured fields from these things,
a lot of times people won't have the
same field types for something that's
named the same thing because they're
in different services or whatever.
So those are all just kind of weird, they're fun, yeah, horrible infrastructure problems that you just end up having to deal with.
Um, so whereas right now, I would say with 3.0, it's better for structured events, right, where you have events that you're tracking and you want high-precision data, where you can slice and dice it any way you choose. Right. So use cases where you'd think that'd be good are, like, if you're doing usage tracking, API audit logging, any type of individual events. Metrics is also a use case, obviously, that InfluxDB is used for, and this engine is useful for as well.
Um, but logs and traces are a bit
trickier because of the again,
like kind of how flexible the
schemas are as people deploy
them in different systems.
And the thing with tracing that's weird is, I think most of the tracing front ends look like you're looking at a metric view or a log view, and it's like, oh, go look at the trace. So basically what you're doing is you're jumping off to look for a trace by a trace ID, right. Which essentially implies what you want is an index that maps an ID to an individual trace. And of course, a time series database isn't really designed to do that. Or, the way our database is structured is not really designed to do that.
Now, there are ways, ideas we have, for being able to layer that in without having to create super expensive secondary indexing structures, but all of that stuff is going to take time. So I think with tracing, what would make it easier is: when you have a trace ID, you basically always have a time range as well. So for a system that stores traces at scale, it would be easier to say, oh, give me this trace ID, but this is the time envelope that it appears in. At least that's what I'm imagining would make it easier. But yeah.
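A hedged illustration of that "trace ID plus time envelope" idea, just the shape of the query rather than any real InfluxDB API; the table and column names are invented:

```rust
// Constraining a trace-ID lookup to a time envelope means the engine only has
// to brute-force the chunks/partitions that overlap that range, instead of
// scanning everything for one ID.
fn trace_lookup_sql(trace_id: &str, start_ns: i64, end_ns: i64) -> String {
    format!(
        "SELECT * FROM spans \
         WHERE trace_id = '{trace_id}' \
           AND time >= {start_ns} AND time < {end_ns} \
         ORDER BY time"
    )
}

fn main() {
    println!("{}", trace_lookup_sql("abc123", 1_700_000_000_000_000_000, 1_700_000_060_000_000_000));
}
```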
Uh, I don't know. Yeah.
Metrics, logs and traces is basically
like the gold standard for how
you do observability right now.
But I don't think.
I don't think that's the end state.
I think it's not ideal,
like the usability isn't ideal.
It can be painful.
Tracing is expensive from a development perspective, in terms of putting it into your code. But it's also expensive in terms of infrastructure and being able to collect all this data. And then you get into figuring out, okay, do we need to do sampling because there's too much? And it's all still too difficult to use, which tells me there's a lot of room for innovation.
But the thing is,
there are a lot of really smart
people trying to fix these problems.
It's just really hard because the
like, the volume of data keeps
increasing and the demands of the
user base also keep increasing.
Right.
Yeah, I used to work at a logging company, back around the same time you were discussing, how do we move this off to other things? Um, and I remember that being part of the discussion, that we were handling petabytes of data, and how do you handle just that much? Um, but I'd find it interesting whether you end up with a lot of legacy systems as well, because that was always the issue with logging: going back and changing it from the string, where who knows what the developer decided to put in there, if they decided to put anything in there at all. And then there's all these different levels of logs, and they're all strings that you have to figure out how to parse.
Then eventually people move to
structured logging,
but that wasn't a thing.
So that's a whole nother can of
worms to open, I imagine.
Yeah, I mean, then you have to either parse the legacy logs into some sort of structure, right? I mean, and again, the structure is really about query performance. Right. Because you could create the structure on the fly. Right? You could just store all the logs and whatever, and basically, at query time, parse each log line into the structure. And again, you could do this on the fly and be like, oh, well, that query didn't work, but I got this error back, so I'll change the way I'm trying to parse that particular line, or whatever.
But the problem is that those queries
are super expensive to run, right?
They cost a lot of CPU,
they cost a lot of network bandwidth.
So really, the thing about structured logs, and even metrics, is essentially about introducing structure and summarizations that make the queries efficient enough to run for whatever the use case is.
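A minimal sketch of the "create the structure on the fly" idea mentioned above, pulling key=value fields out of an unstructured line at query time; the log format is made up, and doing this per line at query time is exactly the CPU cost being described:

```rust
use std::collections::HashMap;

// Parse "key=value" pairs out of an otherwise unstructured log line at query
// time, instead of requiring structured fields at ingest time.
fn parse_on_the_fly(line: &str) -> HashMap<&str, &str> {
    line.split_whitespace()
        .filter_map(|token| token.split_once('='))
        .collect()
}

fn main() {
    let line = "level=error service=auth latency_ms=231 msg=timeout";
    let fields = parse_on_the_fly(line);
    // e.g. keep only logs where level=error, as a query engine might at runtime.
    if fields.get("level") == Some(&"error") {
        println!("matched: {fields:?}");
    }
}
```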
And I think that's also one of the things to keep in mind: is this a case where you're building for a system that does automation, where it needs the query to return in, you know, tens of milliseconds, hundreds of milliseconds? And, you know, with automated systems, generally those queries run all the time, right? Many, many, many times. Right.
Or is it an ad hoc query from a user who's doing some investigation, in which case they're more than willing to wait tens of seconds for a return? And those queries are few and far between, right? You're not doing that many of them. And I think the ad hoc thing is probably easier to solve with that pattern of: let's take these blobs of data, put them into object storage, and spin up compute on the fly to handle those. Um, it's the real-time systems where you have to really think about layering in structure and optimizations, so that you can answer the queries fast enough. All right.
So you mentioned at the start how you had InfluxQL in version one, this SQL-like language that frustrated people because it wasn't the real SQL they were familiar with from working with other databases. With DataFusion, do people get access to more traditional SQL? That's part one of the question. Part two is: with InfluxDB 2, there was a lot of investment into the Flux language, with the messaging around how Flux was purpose-built for time series in a way that SQL wasn't. I'm curious if that has changed. Do we feel now that SQL is the right language for time series, or is there still a future for Flux with InfluxDB 3? Yeah.
So first I'll mention something about InfluxQL. Some people found it frustrating because it was only SQL-like. But a surprising number of people have told us they actually prefer InfluxQL to SQL for writing some basic time series queries. Right? Because it's just super easy to write a thing where you're, you know, getting a summarization by these different things. Whereas in a SQL engine, you might have to deal with windowing functions and partitioning and all this other stuff.
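A rough side-by-side of that ergonomic difference. The InfluxQL line is the classic downsampling pattern; the SQL version is one plausible equivalent, and exact function names such as date_bin depend on the SQL engine:

```rust
// The InfluxQL form: mean usage_idle per host, in 1-minute windows.
const INFLUXQL: &str =
    "SELECT MEAN(usage_idle) FROM cpu WHERE time > now() - 1h GROUP BY time(1m), host";

// One plausible SQL equivalent (function names vary by engine).
const SQL: &str =
    "SELECT date_bin(INTERVAL '1 minute', time) AS minute, host, avg(usage_idle) \
     FROM cpu WHERE time > now() - INTERVAL '1 hour' GROUP BY minute, host";

fn main() {
    println!("InfluxQL:\n{INFLUXQL}\n\nSQL:\n{SQL}");
}
```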
So I think InfluxQL has a place, that it will continue to serve, just because we've gotten that feedback that a number of people actually prefer it to SQL.
So with DataFusion, we have a fully featured SQL engine that supports all of these things: joins, window functions, partitioning, all this complex SQL stuff. And we've been adding in more and more functions to make it more capable of doing time-series-specific style queries.
And we'll continue to do that.
We're certainly not done with that.
Um, so the thing with 3.0 is, again, it's built around this DataFusion query engine, which is a SQL engine. Now, all that's written in Rust; obviously, 2.0 is written in Go. And Flux as a language was something we developed for 2.0, and we developed it for a couple of reasons. One, the users of InfluxDB frequently had requests where they just wanted to express more arbitrary and complex logic in their queries than could be expressed in SQL or InfluxQL.
So we were just like, we essentially need a scripting language paired with a query language, so that people could do more complex things.
And it became like, oh,
you can use this also to connect
to third party systems,
to join with your time series data
on the fly, or to send data out
to third party systems, right?
Be it another database or a third
party API or whatever. Right?
So that was part of it.
Another part of it was, I had a thesis back in, whatever, I mean, originally in the fall of 2014, I was talking about changing the language from InfluxQL to potentially something that looked more like a functional language, which is what Flux is. And I decided at that time to stay with InfluxQL. But I had a theory that, oh, I think for the time series use case, a functional language would be better. It would be more expressive and more powerful for working with time series data. And Flux was our attempt at that. Right. And so we built, obviously, the Flux language, the scripting language, the query engine, the planner, the optimizer, everything from the ground up, which is basically two very large separate projects. Um, and all of that is written in Go. Now, coming to 3.0,
we thought, okay, we need a way to bridge Flux. And we also wanted to see if we could bring over InfluxQL. And with InfluxQL we had the idea:
So what we can probably do is write a
parser in rust to parse the language,
which isn't too hard to create.
That will then translate an SQL
query into a data fusion query plan.
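An illustrative toy of that bridge idea only. The real implementation parses InfluxQL into an AST and builds a DataFusion logical plan; this just shows the shape of the mapping for one narrow query pattern, and the function and names are invented:

```rust
// Toy translation for one narrow pattern:
//   InfluxQL: SELECT mean(<field>) FROM <measurement> GROUP BY time(<window>)
// becomes a SQL string that a DataFusion SessionContext could plan and run.
fn influxql_mean_to_sql(measurement: &str, field: &str, window: &str) -> String {
    format!(
        "SELECT date_bin(INTERVAL '{window}', time) AS time, avg({field}) AS mean \
         FROM {measurement} GROUP BY 1 ORDER BY 1"
    )
}

fn main() {
    println!("{}", influxql_mean_to_sql("cpu", "usage_idle", "1 minute"));
}
```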
So we had one person start on
that in the summer of last year.
And basically now, this year, we have that, and it actually works. And it works really, really well. Right. And it was basically just one person who did a lot of that effort last fall. We added additional people onto the team to build the last bit, which is the API bridge, to represent all of that with the InfluxDB 1 query API.
So we're really happy we were able to bring that over, and it gets all the benefits of that DataFusion query engine, right? So when there are performance optimizations or other things like that, when we pull those in, InfluxQL gets all of that for free.
Now, with Flux, the surface area of it is way, way too large to try and create it again in Rust. Although I would love to do that, it's just way, way too big of a project, and we don't have the time or resources to do it. So what we did was: in our Cloud 2 platform, we had Flux processes that communicated with the old TSM storage engine via this gRPC API.
So in 3.0 we created that gRPC API, and we connected Flux up to it. And what we found, through some production mirroring and actually letting customers test it, was, one, that gRPC API kind of had some edge cases that were poorly specified. So there were weird, unforeseen bugs that would surface in the Flux queries, things that worked on the previous one that don't work on this. But more importantly, the performance of that bridge is not good, right?
Basically, there were queries that would work in the old Flux-to-TSM version in a decent amount of time, and those same queries on the Flux-to-3.0 bridge just timed out.
And again, like one of the things
we were trying to do with 3.0,
one of the reasons we adopted
this new query engine is because
we wanted queries to be super,
super fast, right?
Query performance was a super important thing, and with a lot of Flux queries, we saw there were queries that just wouldn't work in that engine. Because when we created this bridge, because of the way that API works, it's literally built around how data is organized in that TSM storage engine. And the 3.0 engine does not have the same organization. So in order to present the data in that organization, the 3.0 engine has to do a lot of post hoc sorting and filtering, and that sorting basically chews up CPU time. And basically, it wasn't performant.
So right now we have a theory that there would be a way to update the Flux engine so that it uses the 3.0 native API, which is basically Flight SQL, and that the Flux engine could do the work itself to reorder the data the way it expects it to be ordered, and maybe that would be performant enough.
But for the time being,
we're not focused on that.
We're still focused on the core query engine, which means InfluxQL and SQL, and adding capabilities and performance to it. Right now, it's already faster than 1.0 at a number of queries, but there are still some queries that it's definitely not faster with, and we want to improve those situations and spend our time on that for now. And then see later what we can do with Flux.
People in the community have expressed interest in actually self-organizing to do that work. So we've actually created a separate community fork of Flux that we're going to be pointing people to. And that fork will be a place where people can collaborate on this idea. The thing is, we can't do it in our primary branch, because we run this in production, in our cloud environment.
It's just too difficult to try and pull these changes in. We need to give people the ability to iterate on their own without having to go through our production pipeline. So that's the idea: in 3.0, InfluxQL and SQL are native and supported, and with Flux we're still trying to figure out what we do. I mean, I will say a separate thing about Flux,
which is maybe not obvious to people,
or for some people, maybe it is,
but the language is highly
polarizing, right?
It's a new language, and a lot
of people are like, I don't want
to learn your stupid language.
And I get that, um, I did not
really get that six years ago,
but I get it now.
Um, and so basically what we found is that a lot of people just didn't want to pick up the language. They wanted to work with something they already know. And again, with InfluxQL, it's a different language, but the thing is, it feels like an old friend. You know it, so you can pick it up without having to do too many things. But with Flux, it was a serious adoption blocker for a lot of people.
But then on the other side of this, there is a slice of people who took the time to learn Flux, and they absolutely love it, because they can do things in that language that they could not do in SQL or InfluxQL. And I think that is kind of a testament to the reason why we built the language, because we wanted something that enabled arbitrary processing inside the core of the database.
So again, it's one of those things where, depending on who you are and how you look at it, Flux is either a great thing and we need to keep pushing it forward, or it's, why did these guys build this language? This doesn't make any sense. It's tricky. All right.
Awesome. Thank you.
Did you have a question, Laura?
No, no, just random commentary. Yeah.
We're running out of time very fast here, so let's finish up with something a bit different. Um, recently, HashiCorp announced a license change to the BUSL, the Business Source License, and you posted some thoughts on Twitter saying that it gave you a bit of pause for thought on what source available and open source are, or their future. And I'd love for you just to talk about that, and maybe even bring it into the context of what's the future for the license on InfluxDB 3. Yeah.
So, uh, basically, in my mind, open source, the BUSL, the community licenses, all those things, those are not open source. Those are basically the new version of shareware, or commercial freemium software, right? It's commercial software.
And frequently they will offer that software to you to use for free under certain conditions. And if you are one of the people who meet those conditions, maybe you're happy and you'll be able to use it, right? Tons of people continue to use MongoDB, the SSPL version. Tons of people still use Elastic after they changed their license. Redis, Confluent, like, you know, literally every single infrastructure open source creator has changed their license over the last six years. There isn't one I can think of who hasn't. Except for us, maybe.
So, uh, and I totally get the motivations to do that. But in my mind, with open source... I actually don't like copyleft licenses, like GPL or AGPL; I don't consider them to be real open source. To me, open source is really about freedom. Freedom to do what I want with that code. And if you put any sort of restriction on it, which a copyleft license does, then that's, you know, restricting my freedom. To me, open source is about freedom. So.
And again, I think for a company that's producing open source code, you have to be okay with the fact that people are going to do anything with that code, up to and including competing with you. And if you are not okay with that, you should not be putting that code out as open source, because that is what's going to happen. Right?
And generally the best thing for
a company to do that's building
products and stuff like that is
to only put open source code in
something that they wish to have
become a commodity. Right?
So the operating system, the server operating system that your servers run on, you want that to be a commodity. Many times you want the database to be a commodity. You want these core infrastructure components, the ones that are essentially not part of the value that you deliver to your customers, to be commodities, so you don't have to pay a lot extra for those things, right? You want the price of those things to be driven down as low as possible.
But if you are a vendor,
the thing you sell, you don't
want that to be a commodity.
You want to be able to sell it
for the highest possible price.
So the problem that vendors have, the ones that create a project where their primary monetization path is essentially that project plus something or whatever, is that it becomes, well, we're putting all this effort into the open source thing, and a bunch of people are using it for free, and there are a bunch of freeloaders, and there are a bunch of competitors who are taking our stuff, and then they decide to change the license. And the problem I have, well, there are multiple pieces to this.
One is as the creator of a project,
if I want it to get the broadest
possible adoption,
I want more people to use it.
I'm incentivized to have that project
be permissively licensed. Right?
All else being equal, a developer or a user looking at a piece of software, literally, if everything else is the same and the difference is a commercial license or a permissive Apache or MIT license, they're going to pick the permissive one, right? There's no reason not to, because you get a bunch of other stuff. So they choose that.
But the problem now that I think
the HashiCorp thing kind of
highlights is it's just yet
another vendor in a long list of
vendors over the last six years
who've changed their licenses.
And now, basically, it's causing a lot of distrust in the developer community, because they see, oh, here's a new open source project, and they look: is that project by a VC-backed company, right, a VC-backed startup? If it is, they're going to be like, well, it's an open source project, yeah, but I don't believe that it's going to continue to be open source. Which is a totally valid thing to think, given how things have been going over the last six years.
So, you know, previously, again, my thesis was: if you want the broadest possible adoption, put it under a permissive license. That idea is kind of getting damaged by the fact that people keep changing their license from permissive to something commercial, right?
And I don't know if there's a
solution to that.
I do think, separately, that HashiCorp probably made an error here. Like, the thing is, the license change only protects forward commits, right? You can't retroactively change a license. So they could have just as easily put all their development effort into a closed source private fork, and then made the open source piece a downstream, you know, dependency of that closed source fork, and then not told anybody about it, right? Just be like, yeah, we're just going to do this or whatever.
Or, conversely, they could have just said, we're going to donate this thing to a foundation, but all of our developers are going to be working on this closed source thing, right? The effect would have been the same in terms of the end outcome. Right. Because you have this open Terraform fork. But the difference in everybody's perception of HashiCorp as a company would have been dramatically different. Right? The business and commercial effect would have been exactly the same, which is: all their R&D, all their development effort, is going only into the commercial software. Which is fine. That is their right.
If they want to do that,
they should totally do that.
And it's also fine if they want
to change their license.
That is totally permissible, right? Just because you create something open source one time doesn't mean you owe it to the world to continue, for the rest of your life, to put more and more into it. Right.
Um, but I think there were more
graceful ways to handle it that
could have delivered the same
business outcome for them,
at the very least. All right.
Let's extend that question by one sentence and just say: given the fear that people have now, and rightly so, about single-vendor, VC-backed open source projects not always staying open source, does that mean that InfluxDB 3 could be an Apache Foundation project in the future?
Or are you not worried about the
perception or the fear of people
adopting it because it's a VC
backed single vendor project?
So I do not think we will put InfluxDB 3 into a foundation. My goal is to have a permissively licensed project. Um, but the truth is, the problem with foundation projects is the bar is usually too high. Like, InfluxDB doesn't meet the bar for most foundations, right? There aren't multiple companies contributing to InfluxDB, you know, version one, two, or three, right? We're the ones developing it alone, so it doesn't really hit the level of a foundation. Terraform, you certainly could have put into a foundation. I'm certain a number of foundations would have taken it gladly as a project, as a top level project. But InfluxDB doesn't hit that. It's not on the same level in terms of the contribution.
um, there's that, uh, I mean,
I think, I don't know,
like for broadly for the,
for the community and building trust
with people and stuff like that.
I don't know if there's a solution.
Right.
I think the license itself matters
because if it's Apache two or MIT,
then, you know,
people can do whatever they want
with that software for all time
as long as for that point.
But maybe there's a model where
you can commit to, you know,
transitioning into open governance
over some period of time.
I think early on in a project,
open governance is actually more
of a hindrance than.
You know, a benefit because you
want like a small,
tight knit group of people that
are driving the project forward.
Um, but yeah, I don't know.
It's tricky.
There are so many things I want
to say and argue about here,
but no, we're out of time.
Um, I think the one thing I guess I will say, the one little challenge, and I'll just kind of leave it here, I guess: Red Hat tried the "we're going to switch the order of things and make the downstream the open source part", and I would argue that the community really doesn't like that. So I don't know, there are a lot of different ways. I admit that I am a huge fan of the AGPLv3, but maybe that's why I would argue it. But we unfortunately don't have time, and I want to, really badly.
To be continued.
To be continued. That's correct.
Yeah, we could have an entire session on open source licensing. I could talk about that for, like, hours. It would be so much fun. Anyway.
All right. Yeah, well, we don't want to keep you any longer, so feel free to use 30 seconds if you wish, just to tell people where they can learn more about Influx, follow you on Twitter, anything like that. Feel free to shamelessly plug it if you wish.
On Twitter I'm @pauldix. For InfluxDB, you can find us at influxdata.com or influxdb.com. 3.0 we have available as a multi-tenant cloud product and as a dedicated cloud product, and soon open source builds of 3.0 will be available. But again, we're focused on our commercial offerings at the moment, for obvious reasons. But yeah.
All right. Thank you very much.
Thanks for joining us.
If you want to keep up with us,
consider subscribing to the podcast
on your favorite podcast app, or
even go to cloud native compass FM.
And if you want us to talk with
someone specific or cover a
specific topic, reach out to us
on any social media platform.
Until next time, when exploring the cloud native landscape, on three. One, two, three: Don't forget your compass. Don't forget your compass.