
Data Mining with LinkedIn Data Using AJAX call to REST API PDF

44 Pages·2015·2.27 MB·English

Preview Data Mining with LinkedIn Data Using AJAX call to REST API

Big Data Analytics in LinkedIn
by Danielle Aring & William Merritt

Brief History of LinkedIn
- Launched in 2003 by Reid Hoffman (https://ourstory.linkedin.com/)
- 2005: Introduced first business lines: Jobs and Subscriptions
- 2006: Launched public profiles (achieved portability/new features)
- 2008: LinkedIn goes GLOBAL! (https://business.linkedin.com/)
- 2012: Site transformation/rapid growth
- 2013: ~225 million members
- 27% of LinkedIn subscribers are recruiters
- 2014: Next decade focused on map of digital economy

Three Major Data Dimensions @LinkedIn

LinkedIn Challenges for Web-scale OLAP
● Horizontally scalable
  ○ currently 200+ million users
  ○ adding 2 new members per second
● Quick response time to users' queries
● High availability
● High read & write throughput (billions of monthly page views)
● Heavy dependency on the slowest node's response, as data is spread across various nodes

Current OLAP Solutions - not suited for high-traffic websites
● What is OLAP - Online Analytical Processing
  ○ Long transactions
  ○ Complex queries
  ○ Mining and analyzing large amounts of data
  ○ Infrequent updates of data
● Traditional for Business Intelligence (e.g., SAP, Oracle, etc.)
  ○ retrieve & consolidate partial results across nodes (causing slow responses)
● Distributed (problems with latency, availability and cost)
● Materialized Cubes (loading billions of page views - load too high)

Avatara: solution for Web-scale Analytics Products
● Provides fast, scalable OLAP system
  ○ handles small-cube scenarios
  ○ simple grammar for cube construction and query at scale
  ○ sharding of cube dimension into key-value model
  ○ leverages distributed key-value store for low latency
  ○ high-availability access to cubes
  ○ leverages Hadoop for joins
● Two examples of analytics features:
  ○ WVMP - cube sharded by member ID
    ■ Who's viewed my profile? (WVMP)
  ○ WVTJ - cube sharded across jobs
    ■ Who's viewed this job? (WVTJ)

Avatara: solution cont'd
● Sharding (i.e., horizontal scaling)
  ○ divides the data set and distributes the data over multiple servers. Each shard is an independent database, and together the shards make up a single logical database
    ■ sharding on a primary key (turning a big cube into smaller ones)
● Storing a cube's data in one location means a query requires a single disk fetch
● Offline Batch Engine
  ○ High throughput
  ○ Batch processing (Hadoop jobs)
● Online Query Engine
  ○ low latency, high availability
  ○ key-value paradigm for storing data (Voldemort)
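
The idea of sharding one large cube into many small cubes keyed by a primary key can be sketched roughly as follows. This is only an illustration of the idea, not Avatara's actual grammar or API: the event tuples, the dimension names (`industry`, `week`), and the functions `build_cubes` and `query` are all hypothetical, and a plain Python dict stands in for a distributed key-value store such as Voldemort.

```python
from collections import defaultdict

# Hypothetical profile-view events: (viewed_member_id, viewer_industry, week).
EVENTS = [
    (1, "software", "2014-W01"),
    (1, "software", "2014-W01"),
    (1, "recruiting", "2014-W02"),
    (2, "finance", "2014-W01"),
]

# Assumed dimension order of each cube cell.
DIMENSIONS = ("industry", "week")


def build_cubes(events):
    """Offline batch step: build one small cube per member ID (the shard key)."""
    store = defaultdict(lambda: defaultdict(int))
    for member_id, industry, week in events:
        store[member_id][(industry, week)] += 1
    # A plain dict stands in for the key-value store: key = member ID, value = cube.
    return {member_id: dict(cube) for member_id, cube in store.items()}


def query(store, member_id, group_by):
    """Online step: a single key lookup, then aggregation over the small cube."""
    cube = store.get(member_id, {})
    idx = DIMENSIONS.index(group_by)
    totals = defaultdict(int)
    for cell, count in cube.items():
        totals[cell[idx]] += count
    return dict(totals)


store = build_cubes(EVENTS)
by_industry = query(store, 1, "industry")
by_week = query(store, 1, "week")
```

Because each member's cube lives under a single key, a "Who's viewed my profile?" style query touches one entry in the store and aggregates a small amount of data in memory, rather than consolidating partial results across nodes.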

Description:
Current OLAP solutions are not suited for high-traffic websites. Traditional for Business Intelligence (e.g., SAP, Oracle, etc.). Simple grammar for cube construction and query at scale. Retrieving Data Structure.