On December 2, Simplilearn hosted a panel of members of LinkedIn’s big data open source team. The panel included Vasanth Rajamani, Director; Sumitha Chandran, Senior Engineering Manager; and Sunitha Beeram and Abhishek Tiwari, both Engineering Managers. The discussion was moderated by Simplilearn’s Chief Product Officer, Anand Narayanan.

Big Data Drives LinkedIn’s Businesses

LinkedIn has a user base of about 700 million people worldwide. Each user has a profile with data about employment history, education, skills, interests, and related information. More importantly, each user has a network of connections, and the relationships in these networks are the foundation of LinkedIn’s value.  

LinkedIn has developed several AI-powered applications to extract valuable information and insights from the massive amount of data it possesses. These applications include the recommendation engines that suggest “People you may know,” “Companies you may want to follow,” “Jobs you may be interested in,” and “What people are talking about now.”

LinkedIn manages its business with metrics that track application performance and outcomes. For example, the True North metric for the “People You May Know” (PYMK) product, which drives engagement and high-quality connections, is the number of PYMK invitations sent and accepted: the better the recommendations, the more invitations members will send and accept.

LinkedIn relies on A/B testing to make continuous improvements in its products. Every aspect of its products undergoes A/B testing to determine whether a proposed change improves performance.
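As a rough illustration of how such a test might be evaluated (the metric, the numbers, and the helper function below are hypothetical, not LinkedIn’s actual tooling), a two-proportion z-test can compare a metric such as invitation acceptance rate between a control group and a treatment group:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_ztest(success_a, total_a, success_b, total_b):
    """Return the z statistic and two-sided p-value for the difference
    between two conversion rates (e.g., invitation acceptance rates)."""
    p_a = success_a / total_a
    p_b = success_b / total_b
    # Pooled proportion under the null hypothesis of no difference.
    p_pool = (success_a + success_b) / (total_a + total_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / total_a + 1 / total_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical example: acceptance rate of PYMK invitations in control (A)
# versus a proposed ranking change (B).
z, p = two_proportion_ztest(success_a=4_100, total_a=50_000,
                            success_b=4_350, total_b=50_000)
print(f"z = {z:.2f}, p = {p:.4f}")  # ship the change only if the lift is significant
```

In practice, experimentation platforms handle assignment, logging, and statistical analysis at scale, but the underlying comparison takes this form.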

Open Source is the Backbone of Big Data at LinkedIn

To manage its big data flow and processing, LinkedIn chose to build an infrastructure based on open source software. The architecture supports the flow of data from sources to computation engines to use cases, and it includes storage management and load balancing. Figures 1 and 2 show this infrastructure.


Figure 1.  Data Ingestion Infrastructure


Figure 2. Data Usage Infrastructure

The elements from data ingestion to workflow orchestration are all open source. Each open source component has a developer community behind it, working to improve and extend the component for the benefit of all of its users. This is a crucial advantage of open source for companies: it lets them make use of the developer community’s efforts and enthusiasm without bearing the entire cost of development and maintenance.

Gobblin is an open-source project that LinkedIn leads. Gobblin ingests data from a wide variety of sources and converts it into forms compatible with an equally wide range of systems. Users can create open-source connectors into and out of Gobblin for their needs, and in doing so, they add to the open source resources for Gobblin. Figure 3 gives a partial picture of the range of connectors available.


Figure 3.  Gobblin connectors

Using a single tool like Gobblin for data ingestion gives LinkedIn the ability to audit data for compliance, regardless of what format the data arrived in. Auditing is an essential capability for LinkedIn, which has to comply with many different regulatory requirements in different countries and regions.
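To make the connector pattern described above concrete, here is a minimal, hypothetical sketch in Python of a source, a converter, and a writer wired together. Gobblin itself is a Java framework and its actual APIs differ; the class names, file formats, and fields below are illustrative only.

```python
import json
from typing import Dict, Iterable

# Illustrative sketch of the source -> converter -> writer pattern that
# connector-based ingestion frameworks like Gobblin are built around.
# None of these classes come from Gobblin; they are hypothetical.

class JsonLinesSource:
    """Extracts records from a newline-delimited JSON file."""
    def __init__(self, path: str):
        self.path = path

    def extract(self) -> Iterable[Dict]:
        with open(self.path) as f:
            for line in f:
                yield json.loads(line)

class FieldProjectionConverter:
    """Keeps only the fields downstream systems need."""
    def __init__(self, fields):
        self.fields = fields

    def convert(self, record: Dict) -> Dict:
        return {k: record[k] for k in self.fields if k in record}

class CsvWriter:
    """Publishes converted records to a CSV file."""
    def __init__(self, path: str, fields):
        self.path = path
        self.fields = fields

    def write(self, records: Iterable[Dict]) -> None:
        with open(self.path, "w") as out:
            out.write(",".join(self.fields) + "\n")
            for r in records:
                out.write(",".join(str(r.get(f, "")) for f in self.fields) + "\n")

# Wire the pieces together, the way an ingestion job wires a source,
# a converter chain, and a writer through its job configuration.
fields = ["member_id", "event", "timestamp"]
source = JsonLinesSource("events.jsonl")
converter = FieldProjectionConverter(fields)
CsvWriter("events.csv", fields).write(converter.convert(r) for r in source.extract())
```

Separating extraction, conversion, and writing is what lets each new connector compose immediately with every existing source or sink.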

Scalability

LinkedIn’s user base has grown at an exponential rate since the company’s founding. Each new user not only brings in a significant amount of data but also adds substantially to the actual and potential connections in the LinkedIn network. The infrastructure for LinkedIn’s big data operations has to scale rapidly to stay ahead of the growth curve.

A scalable big data ingestion solution has to provide:

  • Rich connector integration
  • Centralized state management
  • Multi-platform and scalability support
  • Extensibility
  • Self-service for internal users of the data
  • Operability

Once the data has been ingested, it needs to be put into a form that compute engines can use. LinkedIn uses several compute engines, including Pig, Hive, and Spark. LinkedIn is in the process of moving all its computation to Spark, with exceptions such as Presto for interactive SQL and TensorFlow for deep learning. Gobblin supports this variety of data uses with its extensible connector set.
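As a small illustration of this kind of Spark computation (the dataset path and column names are hypothetical), the PySpark sketch below aggregates ingested invitation events into a daily acceptance-rate metric:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical batch job over ingested data; paths and columns are placeholders.
spark = SparkSession.builder.appName("invitation-metrics").getOrCreate()

invitations = spark.read.parquet("/data/ingested/pymk_invitations")

daily_acceptance = (
    invitations
    .groupBy(F.to_date("sent_at").alias("day"))
    .agg(
        F.count("*").alias("invitations_sent"),
        F.sum(F.col("accepted").cast("int")).alias("invitations_accepted"),
    )
    .withColumn("acceptance_rate",
                F.col("invitations_accepted") / F.col("invitations_sent"))
)

daily_acceptance.write.mode("overwrite").parquet("/data/metrics/pymk_daily")
```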

Scaling Spark poses a set of challenges, each of which requires its own solution:

Resource management: compute resource demands tend to grow faster than cluster resources do. YARN helps allocate resources dynamically, and cloud computing provides access to additional resources on demand.
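As a sketch of what dynamic allocation on YARN looks like from the application side (the executor counts below are placeholders, not LinkedIn’s settings), Spark’s dynamic allocation can be enabled through configuration:

```python
from pyspark.sql import SparkSession

# Illustrative settings only; min/max executor counts are placeholders.
spark = (
    SparkSession.builder
    .appName("dynamic-allocation-example")
    .master("yarn")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "2")
    .config("spark.dynamicAllocation.maxExecutors", "200")
    # An external shuffle service (or shuffle tracking) is needed so that
    # executors can be released without losing their shuffle data.
    .config("spark.shuffle.service.enabled", "true")
    .getOrCreate()
)
```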

Compute engine: to meet the challenge of rapidly growing compute demand, LinkedIn uses Magnet, its push-based shuffle service for Spark, along with compute optimizations that increase the efficiency of resource use.
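Magnet’s push-based shuffle design was later contributed to Apache Spark and is available as push-based shuffle from Spark 3.2 onward. On a YARN deployment with the external shuffle service, it can be switched on from the client side as sketched below; this shows only the client-side flags, and additional server-side setup is required:

```python
from pyspark.sql import SparkSession

# Sketch only: enables push-based shuffle on a Spark 3.2+ / YARN deployment.
spark = (
    SparkSession.builder
    .appName("push-based-shuffle-example")
    .master("yarn")
    .config("spark.shuffle.service.enabled", "true")
    .config("spark.shuffle.push.enabled", "true")
    .getOrCreate()
)
```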

User productivity: the challenge is to avoid the “support trap,” in which users require more support as their usage grows. Tooling and automation help provide users with increased support without increasing the burden on support staff. An example is Dr. Elephant, a self-service tool that helps users diagnose the efficiency of their applications and improve their performance.


Actionable Career Advice

The panelists offered the audience advice on ways to enter the field of big data. For students and freshers, they recommended joining open source software communities such as the Gobblin and Dr. Elephant communities. By writing Gobblin connectors or contributing to Dr. Elephant (or other OSS projects), you can gain real-world experience and make connections with other community members. You should also attend meetups, conferences, and other events that give you access to professionals and let you grow your network.

Mid-career professionals who are already in related work (for example, database administration) will find it relatively easy to transition into big data. The panelists advised that when planning that transition, you should understand the big data domain and decide at which layer of the stack (applications, mid-tier, core infrastructure, etc.) you would be most comfortable and interested.

A good start on your transition into big data is looking into the resources and courses Simplilearn offers. Simplilearn offers many free resources (articles, ebooks, videos, and more) about big data. When you are ready to move ahead into a big data career, check out Simplilearn’s big data certification courses and programs.

