Amazon Neptune – Fast, reliable graph database built for the cloud

Is this JanusGraph under the covers? Guessing since Neptune is a nod to Janus.

igravious · 8 years ago

Support for various storage backends:

   - Apache Cassandra®
   - Apache HBase®
   - Google Cloud Bigtable
   - Oracle BerkeleyDB

I don't understand how a database doesn't have its own native store. What exactly does a graph database actually do if it doesn't manage the data fed to it? Same is true for CayleyGraph† https://github.com/cayleygraph/cayley and proabably others.

†Plays well with multiple backend stores:

   - KVs: Bolt, LevelDB
   - NoSQL: MongoDB
   - SQL: PostgreSQL, CockroachDB, MySQL
   - In-memory, ephemeral

rajman187 · 8 years ago

There are two main paradigms here

1) "native" graph db Neo4J is an example of this. This takes advantage of index-free adjacency. Each node knows what other nodes it is connected to and hence traversals are very fast. The issues you run into are when you try to scale. Data that fits onto a single machine is fine and you can replicate your data for fast parallel reads/traversals across disparate regions of a massive graph. However you no longer have the concept of data sharding and distributing the graph as index-free adjacencies don't translate across physical machines. And another drawback is highly connected vertices, you will expend a tremendous amount of resources deleting or mutating a vertex with, say, 10^6 edges. But that vertex is probably a bot so you should delete him anyway.

2) inverted index graphs, non-native graphs, whatever anti-marketing name it might have. These rely on tables of vertices and other tables of edges. Indexes make them fast, not as fast for reads but very fast for writes. And you get distributed databases (Cassandra, for example, a powerful workhorse of a backend with data sharding and replication factor, etc.). But then you have to yet another index to maintain and the overhead can get expensive. This is the model adopted by DataStax, who bought Titan DB (hence the public fork to Janus) and integrated it with some optimisations and enterprise tools (monitoring etc, solr search engine) to sit on top of Cassandra.

Both now have improved integration with things like Spark. Cypher is probably faster than Tinkerpop Gremlin especially with the bolt serialisation introduced in recent versions of neo4j.

So janus is the graph abstraction layer of the second type and so needs somewhere to save these relationships. It all comes down to use case (and marketing) to decide what works best for you.

talove · 8 years ago

Here's one way to look at it: A graph database can be reflected with common, well-understood data structures. You can use a lot of backends to represent those data structures.

Graph database projects are often times just an adapter for doing Graph queries on-top of another store.

At their core, a graph database can be reflected simply with just documents and adjacency lists https://en.wikipedia.org/wiki/Adjacency_list

manigandham · 8 years ago

JanusGraph provides a graph data model on top of an existing storage layer. In this case it's using wide-column key/value systems. It works well, letting each layer do what it's good at while limiting the amount of separate systems needing to be maintained.

riku_iki · 8 years ago

> What exactly does a graph database actually do if it doesn't manage the data fed to it?

It provides tools to run complex queries on graphs, and manages data models and indices to execute them fast.

barakm · 8 years ago

As per the other commenter, Cayley provides a graph data model on top of an existing storage layer.

However, when we get to manage the storage layer (Bolt, Level -- that's being generalized into local-KVs in the next release) we get to build our own indexes for better data management and performance. But there's no reason we can't hand that job off either -- hence supporting multiple (remote) backends. For the local stuff, though, at some point, Bolt is just a very good BTree implementation.

It seems a lot of Amazon services are managed instances of open source applications. For example, commenters are suggesting this may be based on Janus. Elastic load balancers, at least originally, were likely based on haproxy. Etc etc.

Has anyone ever considered the licensing implications of this? How is amazon able to convert an open source product into a proprietary one and then charge for access to it?

Of course you can argue they’re charging for the infrastructure management, not the software itself. But that argument quickly breaks down as Amazon introduces new software, under new names, with a proprietary management interface over an open source core. Try to find the source code; you can’t.

And if you accept the premise that they’re just charging for hosting, then it leads to the question of why an open source project doesn’t reap any benefits from that hosting, or at the very least, from the management interface on top of it.

It seems like a better solution would be something akin to AWS marketplace, where open source projects are available to be hosted, and the maintainers can see some revenue from them.

It seems like unfair rent seeking behavior that amazon is able to slap a management interface on open source software and then charge for it under the guise of “hosting.”

eitland · 8 years ago

> How is amazon able to convert an open source product into a proprietary one and then charge for access to it?

Totally no problem with liberal licensed open source software.

This is also the intended behaviour of such licenses.

Also many of those big bad commercial companies contribute back big time to a number of projects. Why? I guess sometimes because devs want to and also because it makes sense business wise so they don’t have to maintain the code themselves.

chatmasta · 8 years ago

Depends on the license, doesn’t it? I’m not a licensing expert, but my understanding is GPLv2 / copy left licensing means that if you create a derivative product, you need to open source the new code along with the dependencies.

Seems like a management interface is a clear cut derivative product. Where’s the source code?

Or perhaps amazon does consider licensing and only builds on top of, eg Apache licensed projects?

convolvatron · 8 years ago

it seems clearly ok to sell managed services on open source platform x.

the question in amazons case is that since they sell the infrastructure, they can and probably do undercut any competing providers by charging themselves less.

so, it seems monopolistic. otoh their service is good and their customers get at least reasonable prices. so ... ?

SEJeff · 8 years ago

Here is a great article on just this. It is commonly known as the "gpl loophole", which RMS is entirely fine with. If you want to prevent this, you license the software the with Affero GPL, which explicitly forbids this.

http://radar.oreilly.com/2007/07/the-gpl-and-software-as-a-s...

Amazon / Google / etc are not redistributing the software as it is running on their servers in their environment, therefor, there is nothing wrong with the existing licenses.

supergreg · 8 years ago

Don't free and open source licenses apply only during redistribution of the software? Unless it is licensed with the Affero GPL, just connecting to a service does not require its source code to be available. That is assuming Amazon modifies the software. If they don't, then there's nothing to argue.

Are they making money with software they didn't build? Yes, but so are we.

connorelsea · 8 years ago

Time is money. It takes time to manage servers/infra - services like this let people make the choice between spending their time or their money managing infra. The category of managed infra is huge and goes beyond Amazon

chatmasta · 8 years ago

My question is not about why someone would pay for the service. It’s about where amazon got the right to charge for it without open sourcing their derivative work.

To be clear, it’s not the hosting of open source applications I see as the problem, but the closed source management/orchestration software built on top of it.

lmeyerov · 8 years ago

Awesome surprise to see the embargo lifted -- sounds like I can now say the Graphistry team will be doing a follow-up talk at Amazon Re:Invent tomorrow (Thursday) on Amazon Neptune + Graphistry. We've been incorporating this into visual investigation workflows for security, fraud, health records, etc. They've been doing cool bits on the managed graph layer, and were early to graph GPU tech (Blazegraph team members), and our side starts bringing that kind of thinking to visual GPU analytics & workflow automation tech.

If you're in town and into this stuff, ping me at leo [at] graphistry, and would love to catch up Th/F for coffee+drinks. Also here + email, of course!

kendallgclark · 8 years ago

If you actually read the docs, it's not Janus-based, it's based on BlazeGraph, which Amazon reportedly acquihired last year.

jnwatson · 8 years ago

Is that public information? I don't see any press releases about it.

There was no PR. But there are traces, like Amazon acquired the domains, etc. Many former Blazegraph engineers are now Amazon Neptune engineers according to LinkedIn, etc. It was rumored widely in the graph db world fwiw.

rdslw · 8 years ago

Yet another amazon service to lock you in.

And then after two years, when you're no longer startup with 100usd bill, but bigger company, you're completly tied to a jungle of amazon products, and your exit strategy is very very costly.

clever amazon, clever.

derefr · 8 years ago

There is some truth to this, but in a larger sense (on an ecosystem level, rather than from the perspective of an individual company), I can only be happy when AWS enters a new space. It makes that component into table-stakes in the IaaS game, which means every other big player is about to step up with their own offering as well, and the third-party SaaS and open-source self-hosted offerings in the same space all are going to heat up as well.

Consider the evolution of container hosting services: first we had PaaSes like Heroku with proprietary container formats; then we got Docker, but Docker Swarm was nascent and there was no serious Docker Swarm IaaS-cloud offering. But then, very quickly, AWS built ECS; Google responded with Kubernetes; and then Kubernetes became the open standard, made everyone forget about Docker Swarm, and took over (and is even replacing ECS now.)

That's what happens when AWS enters a space. And it's great.

PaulHoule · 8 years ago

It supports RDF/SPARQL which gives you migration options to twenty or so triple stores such as rdflib, Jena, Virtuoso, AllegroGraph, etc.

No lock in at all.

dorfsmay · 8 years ago

Same with object stores, nobody had an offer before S3 proved to be practical for almost everything.

chiefalchemist · 8 years ago

Yes and no.

If you factor in the cost of not taking "the easy way" you'd likely never get past the 100usd phase.

You're point is valid. I'm just suggesting the lens / context isn't as one-sided as you've presented.

Put another way, plenty of startups and VCs would love to have "getting out from under AWS" at the top of their good problems to have list.

lolive · 8 years ago

Just an off-topic comment: i am the maintainer of a visual query builder for SPARQL queries. cf http://datao.net

This tool proposes to design query patterns from a graph data model, via drag n drops. The tool can then compile the patterns as SPARQL, run them on an endpoint and format the results as map/forms/tables/graphs/HTML (via templating)/...

Another service of Datao (http://search.datao.net) proposes a search-engine view of those queries so you can type the textual representation of an object in any public SPARQL endpoint, and the service will list the queries currently available in Datao that can be applied upon this object. You can then run these queries with a click, and get the HTML templating of the query results.

Feel free to have a look at the website, if you find any interest in this tool. ANy feedback is welcome.

PS: Sorry for the poor quality of the videos. I manage this project on my spare time :)

randomor · 8 years ago

Only had experience with Cypher, really liked it. It will be interesting to see how Neo4j responds to this. Regardless of tech specs, the fully-managed Neptune vs a community version on AWS Marketplace seems to give Neptune unfair advantage.

mcphage · 8 years ago

> seems to give Neptune unfair advantage

What do you mean by "unfair" here?

Graphguy · 8 years ago

brianbreslin · 8 years ago

Can someone explain to me in lamens terms what a graph database is?

It's a database that's designed to store relationships between objects instead of just facts. It has efficient methods of following long chains of associations. So think of how you store tree structures in a relational database—there are a lot of different ways of doing it, and they're all frustrating. Storing trees is something graph databases do naturally.

kchoudhu · 8 years ago

I've been playing around with graph databases for a while (I am writing one that turns Postgres into one of my own for kicks[1], [2]) and one of the things that became obvious after using the project in production was that it promotes functional reactive programming in a way that most other database paradigms don't.

Even propagation and node invalidation are awesome for rapid what-if style experimentation, and I am so psyched to see more and more attention being paid to graph computing in general.

[1] https://www.github.com/kchoudhu/openarc [2] https://www.anserinae.net/whats-cooking-openarc-edition.html...

michaelbuckbee · 8 years ago

Trying to get this straight in my mind here.

Is it fair to say that traditional RDBMS/SQL are for storing different "sets" of related information (tables for products, users, orders).

Graph databases are for storing data about the _same_ set of data as it interrelates to itself.

- a User and and all their Friends (who are also users) - a Keyword and all associated Terms (which are also keywords)

Is that right?