Big Data Analytics in LinkedIn
by Danielle Aring & William Merritt

Brief History of LinkedIn
- Launched in 2003 by Reid Hoffman (https://ourstory.linkedin.com/)
- 2005: Introduced first business lines: Jobs and Subscriptions
- 2006: Launched public profiles (achieved profitability/new features)
- 2008: LinkedIn goes GLOBAL! (https://business.linkedin.com/)
- 2012: Site transformation/rapid growth
- 2013: ~225 million members; 27% of LinkedIn subscribers are recruiters
- 2014: Next decade focused on a map of the digital economy

Three Major Data Dimensions @LinkedIn

LinkedIn Challenges for Web-scale OLAP
● Horizontally scalable
  ○ currently over 200 million users
  ○ adding 2 new members per second
● Quick response time to users' queries
● High availability
● High read & write throughput (billions of monthly page views)
● Response time depends heavily on the slowest node, since data is spread across many nodes

Current OLAP Solutions - not suited for high-traffic websites
● What is OLAP? Online Analytical Processing
  ○ Long transactions
  ○ Complex queries
  ○ Mining and analyzing large amounts of data
  ○ Infrequent updates of data
● Traditional for Business Intelligence (e.g. SAP, Oracle)
  ○ retrieve & consolidate partial results across nodes (causing slow responses)
● Distributed (problems with latency, availability and cost)
● Materialized cubes (loading billions of page views puts too high a load on the system)

Avatara: solution for Web-scale Analytics Products
● Provides a fast, scalable OLAP system
  ○ handles small-cube scenarios
  ○ simple grammar for cube construction and query at scale
  ○ sharding of cube dimensions into a key-value model
  ○ leverages a distributed key-value store for low-latency, high-availability access to cubes
  ○ leverages Hadoop for joins
● Two examples of analytics features:
  ○ WVMP - cube sharded by member ID
    ■ Who's viewed my profile? (WVMP)
  ○ WVTJ - cube sharded across jobs
    ■ Who's viewed this job? (WVTJ)

Avatara: solution cont'd
● Sharding (i.e. horizontal scaling)
  ○ divides the data set and distributes the data over multiple servers. Each shard is an independent database, and together the shards make up a single logical database
    ■ sharding on a primary key (turning a big cube into smaller ones)
● Storing a cube's data in one location means a query requires only a single disk fetch
● Offline Batch Engine
  ○ High throughput
  ○ Batch processing (Hadoop jobs)
● Online Query Engine
  ○ low latency, high availability
  ○ key-value paradigm for storing data (Voldemort)
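
The sharding idea above (turning one big cube into many small cubes keyed by a primary key, then serving each with a single key-value lookup) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Avatara's actual code: the event fields, the `build_sharded_cubes` and `query` helpers, and the `InMemoryStore` class are all hypothetical, with the in-memory dict standing in for a distributed store such as Voldemort and the grouping loop standing in for the offline Hadoop batch job.

```python
from collections import defaultdict


def build_sharded_cubes(events, shard_key, dimensions):
    """Offline step (assumed): group raw events into one small cube per shard key.

    Each small cube maps a tuple of dimension values to a view count.
    """
    cubes = defaultdict(lambda: defaultdict(int))
    for event in events:
        key = event[shard_key]
        cell = tuple(event[d] for d in dimensions)
        cubes[key][cell] += 1
    # Convert to plain dicts so each cube can be stored as a single value.
    return {key: dict(cells) for key, cells in cubes.items()}


class InMemoryStore:
    """Hypothetical stand-in for a distributed key-value store."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        # A missing key is treated as an empty cube.
        return self._data.get(key, {})


def query(store, key, dimensions, group_by):
    """Online step (assumed): one get() fetches the cube, then roll up in memory."""
    idx = dimensions.index(group_by)
    totals = defaultdict(int)
    for cell, count in store.get(key).items():
        totals[cell[idx]] += count
    return dict(totals)


# Made-up "Who's viewed my profile?" events, sharded by member ID.
DIMENSIONS = ("week", "viewer_industry")
events = [
    {"member_id": 1, "week": "2013-W01", "viewer_industry": "software"},
    {"member_id": 1, "week": "2013-W01", "viewer_industry": "software"},
    {"member_id": 1, "week": "2013-W02", "viewer_industry": "finance"},
    {"member_id": 2, "week": "2013-W01", "viewer_industry": "finance"},
]

store = InMemoryStore()
for member_id, cube in build_sharded_cubes(events, "member_id", DIMENSIONS).items():
    store.put(member_id, cube)

by_industry = query(store, 1, DIMENSIONS, "viewer_industry")
by_week = query(store, 1, DIMENSIONS, "week")
```

Because every cell for a given member lives under one key, answering a query for that member touches exactly one stored value; the roll-up over the small cube is cheap enough to do at request time.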