Graph Databases
SECOND EDITION
NEW OPPORTUNITIES FOR CONNECTED DATA
Ian Robinson, Jim Webber & Emil Eifrem
Discover how graph databases can help you manage and query highly
connected data. With this practical book, you’ll learn how to design and
implement a graph database that brings the power of graphs to bear
on a broad range of problem domains. Whether you want to speed up
your response to user queries or build a database that can adapt as your
business evolves, this book shows you how to apply the schema-free
graph model to real-world problems.
This second edition includes new code samples and diagrams, using the
latest Neo4j syntax, as well as information on new functionality. Learn
how different organizations are using graph databases to outperform their
competitors. With this book’s data modeling, query, and code examples,
you’ll quickly be able to implement your own solution.
■ Model data with the Cypher query language and property graph model
■ Learn best practices and common pitfalls when modeling with graphs
■ Plan and implement a graph database solution in test-driven fashion
■ Explore real-world examples to learn how and why organizations use a graph database
■ Understand common patterns and components of graph database architecture
■ Use analytical techniques and algorithms to mine graph database information
“Graph analysis is possibly the single most effective competitive differentiator
for organizations pursuing data-driven operations and decisions.”
(Gartner, IT Market Clock for Database Management Systems, 2014)
Ian Robinson works on research and development for future versions of the Neo4j
graph database and previously served as Neo’s Director of Customer Success.
Jim Webber, Neo Technology’s Chief Scientist, is a distributed systems specialist
working on very large-scale graph data technology.
Emil Eifrem is CEO of Neo Technology and co-founder of the open source Neo4j
graph database project.
DATA/DATA SCIENCE
Twitter: @oreillymedia
facebook.com/oreilly
US $39.99 CAN $45.99
ISBN: 978-1-491-93089-2
SECOND EDITION
Graph Databases
Ian Robinson, Jim Webber & Emil Eifrem
Boston
Graph Databases
by Ian Robinson, Jim Webber, and Emil Eifrem
Copyright © 2015 Neo Technology, Inc. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are
also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/
institutional sales department: 800-998-9938 or corporate@oreilly.com.
Editor: Marie Beaugureau
Production Editor: Kristen Brown
Proofreader: Christina Edwards
Indexer: WordCo Indexing Services
Interior Designer: David Futato
Cover Designer: Ellie Volckhausen
Illustrator: Rebecca Demarest
June 2013: First Edition
June 2015: Second Edition
Revision History for the Second Edition
2015-06-09: First Release
See http://oreilly.com/catalog/errata.csp?isbn=9781491930892 for release details.
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Graph Databases, the cover image of a
European octopus, and related trade dress are trademarks of O’Reilly Media, Inc.
While the publisher and the authors have used good faith efforts to ensure that the information and
instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility
for errors or omissions, including without limitation responsibility for damages resulting from the use of
or reliance on this work. Use of the information and instructions contained in this work is at your own
risk. If any code samples or other technology this work contains or describes is subject to open source
licenses or the intellectual property rights of others, it is your responsibility to ensure that your use
thereof complies with such licenses and/or rights.
978-1-491-93089-2
[LSI]
Table of Contents
Foreword
Preface
1. Introduction
What Is a Graph?
A High-Level View of the Graph Space
Graph Databases
Graph Compute Engines
The Power of Graph Databases
Performance
Flexibility
Agility
Summary
2. Options for Storing Connected Data
Relational Databases Lack Relationships
NOSQL Databases Also Lack Relationships
Graph Databases Embrace Relationships
Summary
3. Data Modeling with Graphs
Models and Goals
The Labeled Property Graph Model
Querying Graphs: An Introduction to Cypher
Cypher Philosophy
MATCH
RETURN
Other Cypher Clauses
A Comparison of Relational and Graph Modeling
Relational Modeling in a Systems Management Domain
Graph Modeling in a Systems Management Domain
Testing the Model
Cross-Domain Models
Creating the Shakespeare Graph
Beginning a Query
Declaring Information Patterns to Find
Constraining Matches
Processing Results
Query Chaining
Common Modeling Pitfalls
Email Provenance Problem Domain
A Sensible First Iteration?
Second Time’s the Charm
Evolving the Domain
Identifying Nodes and Relationships
Avoiding Anti-Patterns
Summary
4. Building a Graph Database Application
Data Modeling
Describe the Model in Terms of the Application’s Needs
Nodes for Things, Relationships for Structure
Fine-Grained versus Generic Relationships
Model Facts as Nodes
Represent Complex Value Types as Nodes
Time
Iterative and Incremental Development
Application Architecture
Embedded versus Server
Clustering
Load Balancing
Testing
Test-Driven Data Model Development
Performance Testing
Capacity Planning
Optimization Criteria
Performance
Redundancy
Load
Importing and Bulk Loading Data
Initial Import
Batch Import
Summary
5. Graphs in the Real World
Why Organizations Choose Graph Databases
Common Use Cases
Social
Recommendations
Geo
Master Data Management
Network and Data Center Management
Authorization and Access Control (Communications)
Real-World Examples
Social Recommendations (Professional Social Network)
Authorization and Access Control
Geospatial and Logistics
Summary
6. Graph Database Internals
Native Graph Processing
Native Graph Storage
Programmatic APIs
Kernel API
Core API
Traversal Framework
Nonfunctional Characteristics
Transactions
Recoverability
Availability
Scale
Summary
7. Predictive Analysis with Graph Theory
Depth- and Breadth-First Search
Path-Finding with Dijkstra’s Algorithm
The A* Algorithm
Graph Theory and Predictive Modeling
Triadic Closures
Structural Balance
Local Bridges
Summary
A. NOSQL Overview
Index
Foreword
Graphs Are Eating The World, And There’s No Going Back
In the three years since we first wrote Graph Databases, our industry has witnessed a
fundamental shift in the way in which it views its data assets.
Data, always present in some stratum of innovation, has for several decades delivered
only a fraction of its potential, in large part because the technologies at our disposal
have forced us to treat it as though it were nothing but isolated islands of middling
significance. Graphs and graph databases change this completely.
As vertical after vertical discovers the transformative power of connected data, the
breakaway leaders in these industries are stealing an irreversible march on their
competitors. Graphs are everywhere, they’re eating the world, and there’s no going back.
As I wrote in my foreword to the first edition, this change in perspective started
almost two decades ago, when a precocious web search startup challenged the
dominance of the market leaders (AltaVista, Lycos, Excite, et al.) through its application
of a simple algorithm that made sense of the way in which web documents are
connected.
Today, Google dominates the web search space. In its wake, other industry leaders
have asked themselves: “What if we took the relationships and connections in our
data and reimagined our business along those relationships? What would that look
like?” The answers to these questions are omnipresent in our online lives today in the
form of Facebook, Twitter, and the like.
What was once a specialist and often proprietary means for realizing the opportunities
inherent in connected data is now a commodity technology. In the past three
years the features, usability, and performance of the world’s leading graph database
have matured enormously; awareness and adoption have penetrated far wider, deeper,
and more quickly than we could have hoped; and the inventiveness and irreversible
impact of introducing graph databases into formerly discrete-data-oriented domains
have invigorated and challenged the markets at every turn.
In 2011, we thought the main verticals to adopt graph databases would be software,
financial services, and telecom; and largely we were right. However, what’s been even
more amazing has been the adoption of graph databases outside of those top three
verticals.
We’ve seen industry after industry being eaten by graphs. In each case, the adoption
of graph technology has resulted in better products and more remarkable customer
experiences. Companies such as Pitney Bowes, eBay, and Cisco are deploying the
graph to solve some of their most mission-critical problems, forcing their competition
to catch up or leave the industry. Four of the top ten global retailers today use
Neo4j. Behind them, competitors that have failed to adapt are struggling to keep up.
This ability of graph databases to colonize and radically transform an industry is
nowhere more apparent than in the emerging Internet of Things (IoT), a domain
which might more aptly be called the Internet of Connected Things, because without
the connections, there’s no point to it. When you have a lot of connected things, you
have a graph-based problem.
In recent years, a major telco equipment provider has entered the IoT space with a
product that, embedded inside large telecom networks, sniffs network traffic and
builds a model of all the connected devices on the network. If devices in one category
are all flashing red at the same time, you can easily determine if it’s truly because all of
them are simultaneously failing or if it’s because they’re all connected to a firewall and
power supply that has just gone out. That level of real-time, predictive analysis is
what you can do when taking a connected view of the IoT.
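The reasoning in the preceding paragraph can be sketched as a small graph traversal. The Python example below is an illustration only, not the vendor's product and not Neo4j code: the device names, the `DEPENDS_ON` mapping, and the `shared_failed_dependencies` helper are all hypothetical. It collects the upstream dependencies of each alarming device and reports any failed component that every one of them relies on.

```python
from collections import deque

# Hypothetical network model: each device maps to the components it depends on.
DEPENDS_ON = {
    "router-1": ["firewall-A", "power-1"],
    "router-2": ["firewall-A", "power-1"],
    "router-3": ["firewall-B", "power-2"],
    "firewall-A": ["power-1"],
    "firewall-B": ["power-2"],
    "power-1": [],
    "power-2": [],
}


def upstream(device, graph):
    """Return every component reachable from `device` via dependency edges."""
    seen = set()
    queue = deque(graph.get(device, []))
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.add(node)
            queue.extend(graph.get(node, []))
    return seen


def shared_failed_dependencies(alarming, failed, graph):
    """Return the failed components that every alarming device depends on."""
    if not alarming:
        return set()
    common = set.intersection(*(upstream(d, graph) for d in alarming))
    return common & set(failed)


# Two routers are flashing red, and one power supply is known to be down.
suspects = shared_failed_dependencies(
    alarming=["router-1", "router-2"], failed={"power-1"}, graph=DEPENDS_ON
)
print(sorted(suspects))
```

In this sketch, a non-empty result points to a shared upstream cause, whereas an empty result suggests the devices are failing independently.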
The speed with which such solutions can be developed and put into production is a
result of some significant changes to the underlying graph database technology. In
2013 we introduced Neo4j 2.0, marking a big change in the features, usability, and
performance of the product. Besides a wholly new visualization tool, Neo4j 2.0 came
with an improved data model whose chief features (labels, optional constraints, and
declarative indexes), coupled with numerous improvements to the Cypher query
language, make designing and developing a graph database application easier and more
intuitive than ever before.
Accompanying this maturation of the technology is an amazing growth in community
traction. According to db-engines.com, graph databases have been the fastest-growing
database category since 2013. Big data is the hottest sector in the
tech industry, and graph databases are at the absolute nexus of that growth. Graphs
are indeed eating the world, and there’s no turning back.