
· 3 min read
jacktao

1. Dependencies and versions

kind GitHub: https://github.com/kubernetes-sigs/kind

kind website: https://kind.sigs.k8s.io/

version:

kind 0.14.0

docker 20.10.17

node v16.0.0

Note:

  1. Ensure that both the front end and the back end compile properly.

  2. Ensure that the component versions match the dependency versions listed above.

  3. kind simulates cluster nodes with Docker containers. After the host machine restarts, the containers change and the scheduler no longer works, so the cluster has to be recreated.

2. Install Docker

(1) Installation

sudo yum install -y yum-utils device-mapper-persistent-data lvm2

sudo yum-config-manager --add-repo https://mirrors.aliyun.com/docker-ce/linux/centos/docker-ce.repo

sudo sed -i 's+download.docker.com+mirrors.aliyun.com/docker-ce+' /etc/yum.repos.d/docker-ce.repo
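The sed command above rewrites the upstream Docker repository URL to the Aliyun mirror inside the repo file. Its effect can be previewed on a sample line (the baseurl shown here is illustrative):

```shell
# Preview the substitution that the repo-file edit performs
echo "baseurl=https://download.docker.com/linux/centos/7/x86_64/stable" \
  | sed 's+download.docker.com+mirrors.aliyun.com/docker-ce+'
# prints baseurl=https://mirrors.aliyun.com/docker-ce/linux/centos/7/x86_64/stable
```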

sudo yum makecache fast

sudo yum -y install docker-ce

systemctl start docker

systemctl enable docker

(2) Configure registry mirrors

vi /etc/docker/daemon.json

{
  "registry-mirrors": ["http://hub-mirror.c.163.com"],
  "insecure-registries": ["https://registry.mydomain.com", "http://hub-mirror.c.163.com"]
}
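A malformed daemon.json prevents the Docker daemon from starting, so it is worth validating the syntax before restarting Docker. A minimal sketch, validating a copy in /tmp and assuming python3 is available:

```shell
# Write the mirror configuration to a scratch file and validate it as JSON
cat > /tmp/daemon.json <<'EOF'
{
  "registry-mirrors": ["http://hub-mirror.c.163.com"],
  "insecure-registries": ["https://registry.mydomain.com", "http://hub-mirror.c.163.com"]
}
EOF
python3 -m json.tool /tmp/daemon.json > /dev/null && echo "daemon.json is valid JSON"
```

After copying the validated file to /etc/docker/daemon.json, restart Docker (`systemctl restart docker`) for the mirrors to take effect.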

3. Install kind

(1) Manually download the kind binary from the releases page:

https://github.com/kubernetes-sigs/kind/releases

(2) Install the kind binary

chmod +x ./kind-linux-amd64

mv ./kind-linux-amd64 /usr/bin/kind

4. Install JDK and Maven

(1) Refer to a general installation tutorial to install the following components:

JDK 1.8

Maven 3.5+

5. Install Node.js

(1) Version

node v16.0.0

(2) Install nvm

# Optional: set a proxy if raw.githubusercontent.com is unreachable
export http_proxy=http://10.0.0.150:7890

export https_proxy=http://10.0.0.150:7890

curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.1/install.sh | bash

export NVM_DIR="$HOME/.nvm"

[ -s "$NVM_DIR/nvm.sh" ] && \. "$NVM_DIR/nvm.sh" # This loads nvm

[ -s "$NVM_DIR/bash_completion" ] && \. "$NVM_DIR/bash_completion" # This loads nvm bash_completion

(3) Install Node.js

nvm ls-remote

nvm install v16.0.0

(4) Configure npm

npm config set registry https://registry.npmmirror.com

npm config set sass_binary_site https://registry.npmmirror.com/binary.html?path=node-sass/

(5) Compile the front end

npm install -g yarn

# Install dependencies
yarn

# Build the front end
yarn build

6. Compile Linkis

# 1. When compiling for the first time, execute the following command first

./mvnw -N install

# 2. make the linkis distribution package

# - Option 1: make the linkis distribution package only

./mvnw clean install -Dmaven.javadoc.skip=true -Dmaven.test.skip=true

# - Option 2: make the linkis distribution package and docker image

./mvnw clean install -Pdocker -Dmaven.javadoc.skip=true -Dmaven.test.skip=true

# - Option 3: linkis distribution package and docker image (included web)

./mvnw clean install -Pdocker -Dmaven.javadoc.skip=true -Dmaven.test.skip=true -Dlinkis.build.web=true

7. Create the cluster

dos2unix ./linkis-dist/helm/scripts/*.sh

./linkis-dist/helm/scripts/create-test-kind.sh
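dos2unix strips the Windows carriage returns that would otherwise corrupt the shell scripts. If dos2unix is not installed, `tr` achieves the same, demonstrated here on a scratch file:

```shell
# A script saved with CRLF line endings misbehaves under bash
# (carriage returns corrupt commands and arguments);
# stripping the '\r' characters fixes it
printf 'echo hello\r\n' > /tmp/crlf.sh
tr -d '\r' < /tmp/crlf.sh > /tmp/unix.sh
bash /tmp/unix.sh
# prints hello
```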

8. Install the Helm charts

./linkis-dist/helm/scripts/install-charts.sh linkis linkis-demo

9. Visit the Linkis web page

kubectl port-forward -n linkis  --address=0.0.0.0 service/linkis-demo-web 8087:8087

http://10.0.2.101:8087 (replace 10.0.2.101 with your host's IP address)

10. Test using the Linkis client

# Replace the pod name with the actual publicservice pod name in your cluster
kubectl -n linkis exec -it linkis-demo-ps-publicservice-77d7685d9-f59ht -- bash
./linkis-cli -engineType shell-1 -codeType shell -code "echo \"hello\" " -submitUser hadoop -proxyUser hadoop

11. Install kubectl

cat <<EOF > /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://mirrors.aliyun.com/kubernetes/yum/repos/kubernetes-el7-x86_64/
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://mirrors.aliyun.com/kubernetes/yum/doc/yum-key.gpg https://mirrors.aliyun.com/kubernetes/yum/doc/rpm-package-key.gpg
EOF

yum install -y --nogpgcheck kubectl

kubectl config view  
kubectl config get-contexts  
kubectl cluster-info  

· 9 min read
BeaconTown

1 Summary

As you know, continuous integration consists of many operations, such as pulling code, running tests, logging in to remote servers, publishing to third-party services, and so on. GitHub calls these operations Actions. Many operations are similar across projects and can be shared. GitHub noticed this and came up with a great idea: let developers write each operation as an independent script file and store it in a code repository, so that other developers can reference it. If you need an action, you don't have to write a complex script yourself; you can directly reference an action written by others. The whole continuous integration process becomes a combination of actions. This is the most distinctive part of GitHub Actions.

GitHub provides an Actions Marketplace for developers. We can find the action we want in this marketplace and configure it into the repository's workflow to automate operations. Of course, the marketplace is limited, and in some cases we can't find an action that meets our needs. Later in this blog, I will also show how to write a GitHub Action yourself.

2 Some terms

2.1 What is continuous integration

In short, it is an automated program. For example, every time a front-end programmer pushes code to a GitHub repository, GitHub automatically creates a virtual machine (macOS / Windows / Linux) to execute one or more instructions (determined by us), for example:

npm install
npm run build

2.2 What is YAML

The way we integrate GitHub Actions is to create a .github/workflows directory containing a *.yml file; this YAML file is where we configure GitHub Actions. YAML is a very simple scripting language. Users who are not familiar with YAML can refer to the official YAML documentation.

3 Start writing the first Workflow

3.1 How to customize the name of Workflow

GitHub displays the name of the Workflow on the action page of the repository. If we omit name, GitHub will set it as the Workflow file path relative to the repository root directory.

name: Say Hello

3.2 How to customize the trigger event of Workflow

There are many events, for example, the user submits a pull request to the repository, the user submits an issue to the repository, or the user closes an issue, etc. We hope that when some events occur, the Workflow will be automatically executed, which requires the definition of trigger events. The following is an example of a custom trigger event:

name: Say Hello
on: pull_request

The above code can trigger workflow when the user submits a pull request. For multiple events, we enclose them in square brackets, for example:

name: Say Hello
on: [push, pull_request]

Of course, we hope that the triggering event can be more specific, such as triggering Workflow when a pull request is closed or reopened:

name: Say Hello
on:
  pull_request:
    types: [reopened, closed]

For more trigger events, please refer to the official documentation.

3.3 How to define a job

A Workflow is composed of one or more jobs, which means that a continuous integration run can complete multiple tasks. Here is an example:

name: Say Hello
on: pull_request
jobs:
  my_first_job:
    name: My first job
  my_second_job:
    name: My second job

Each job must have an ID associated with it. Above, my_first_job and my_second_job are the job IDs.

3.4 How to specify the running environment of a job

Specify the operating system the job runs on. The operating systems available in a Workflow are:

  • Windows
  • macOS
  • Linux

The following is an example of a specified running environment:

# Limited by space, the previous code is omitted
jobs:
  my_first_job:
    name: My first job
    runs-on: macos-10.15

3.5 The use of step

Each job is composed of multiple steps, which are executed from top to bottom. A step can run commands (such as Linux shell commands) and actions.

The following is an example of outputting "Hello World":

# Limited by space, the previous code is omitted
jobs:
  my_first_job:
    name: My first job
    runs-on: macos-10.15
    steps:
      - name: Print a greeting
        # Define the environment variables of the step
        env:
          FIRST_WORD: Hello
          SECOND_WORD: World
        # Run instruction: output the environment variables
        run: |
          echo $FIRST_WORD $SECOND_WORD

Next is the use of actions. An action is essentially a command. GitHub officially provides some default actions, and we can use them directly to reduce the amount of Workflow code in the repository. The most common action is Checkout, which clones the repository's latest code into the Workflow workspace.

# Limited by space, the previous code is omitted
steps:
  - name: Check out git repository
    uses: actions/checkout@v2

Some actions require additional parameters to be passed in. Generally, with is used to set the parameter value:

# Limited by space, the previous code is omitted
steps:
  - name: Check out git repository
    uses: actions/checkout@v2
  - name: Set up Node.js
    uses: actions/setup-node@v2.2.0
    with:
      node-version: 14

4 How to write your own action

4.1 Configuration of action.yml

When we can't find the action we want in the GitHub Actions Marketplace, we can write an action that meets our needs ourselves. To create a custom action, create a new "actions" directory under the ".github" directory, and then create a directory named after the custom action inside it. Each action needs a configuration file: action.yml. The runs section of action.yml specifies how the action starts. There are three startup methods: Node.js script, Docker image, and composite script. The common parameters of action.yml are described below:

  • name: The name of the action
  • description: A short description of what the action does
  • inputs: The parameters the action accepts
  • outputs: The variables the action outputs
  • runs: The startup mode

The following is a configuration example of action.yml

name: "example action"
description: "This is an example action"

inputs:
  param1:
    description: "The first param of this action"
    required: true # Required parameters must be set to true
  param2:
    description: "The second param of this action"
    required: true

outputs:
  out1:
    description: "The outputs of this action"

runs:
  using: node16
  main: dist/index.js
  post: dist/index.js

Setting runs.using to node16 or node12 specifies starting via a Node.js script, with the file referenced by main as the entry file. It is started in much the same way as running node main.js directly, so dependencies are not installed from package.json. During development, we usually use a bundler to package the dependencies into a single JS file and use that file as the entry point. runs.post can specify cleanup work, which runs at the end of the Workflow.

4.2 Using Docker Image

If Docker is used, we need to modify the runs in action.yml to:

runs:
  using: docker
  image: Dockerfile

runs.image specifies the Dockerfile used to build the image, here the Dockerfile in the project root directory. In the Dockerfile, specify the startup script with ENTRYPOINT or CMD. For example, define an action that runs a Python script:

FROM python:3

RUN pip install --no-cache-dir requests

COPY . .

CMD [ "python", "/main.py"]

Here we can see the advantage of using Docker: you can customize the running environment and use other programming languages.

5 GitHub Action project practice

In this section, I will describe how to write your own GitHub Action with a specific example.

Problem

Assume that there are many issues to be processed in our GitHub repository, and each pull request submitted by a user may be associated with an issue. Manually closing the issue after merging each pull request is quite cumbersome.

Solution

This is where Workflow comes in handy. We can listen for the closed event of a pull request and determine whether it was closed by merging or without merging. If it was merged, the associated issue is closed.

But there is still a problem: how do we obtain the associated issue? We can ask users to mention the associated issue in the description when submitting the pull request, such as #345, and then extract the issue number 345. How do we implement this? We can write a GitHub Action ourselves. To keep the program concise, I use Docker to start the GitHub Action here. First, prepare action.yml:

# The name of the GitHub Action
name: "Auto_close_associate_issue"
# The description of the action
description: "Auto close an issue which is associated with a PR."

# Define parameters to be input
inputs:
  # The name of the first param is prbody
  prbody:
    # The description of the param
    description: "The body of the PR to search for related issues"
    # Required param
    required: true

outputs:
  # The name of the output param
  issueNumber:
    description: "The issue number"

runs:
  # Use a Docker image
  using: "docker"
  image: "Dockerfile"

The next step is to write the script file; here I use Node.js. The idea of the script is: first obtain the variable value from the environment, extract the issue number, and then output it to the environment. The script (named main.js) is as follows:

// Get environment variables. All parameters passed to GitHub Action are capitalized and the prefix INPUT_ is required, which is specified by GitHub
let body = process.env['INPUT_PRBODY'];
// Extract the number of issue by regular expression
let pattern = /#\d+/;
let issueNumber = body.match(pattern)[0].replace('#', '');
// Output the issue number to the environment
console.log(`::set-output name=issueNumber::${issueNumber}`);
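The same extraction can be sanity-checked locally with a shell one-liner (the PR body "Fixes #345" below is a made-up example):

```shell
# Extract the first issue number referenced in a PR body
body="Fixes #345, also touches docs"
issue=$(echo "$body" | grep -oE '#[0-9]+' | head -n1 | tr -d '#')
echo "$issue"
# prints 345
```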

Next is the image file of Docker (the file name is Dockerfile):

FROM node:10.15

COPY . .

CMD [ "node", "/main.js"]

Finally, action.yml, Dockerfile and main.js are placed under the directory .github/actions/Auto_close_associate_issue, and the action is complete.

The last step is to write Workflow. The configuration of Workflow is described in detail in Start Writing the First Workflow, so I won't repeat it here. The specific configuration is as follows:

name: Auto close issue when PR is merged

on:
  pull_request_target:
    types: [ closed ]

jobs:
  close-issue:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2

      - name: "Auto issue closer"
        uses: ./.github/actions/Auto_close_associate_issue/
        id: Closer
        with:
          prbody: ${{ github.event.pull_request.body }}

      - name: Close Issue
        uses: peter-evans/close-issue@v2
        if: ${{ github.event.pull_request.merged }}
        with:
          issue-number: ${{ steps.Closer.outputs.issueNumber }}
          comment: The associated PR has been merged, this issue is automatically closed, you can reopen it if necessary.
        env:
          Github_Token: ${{ secrets.GITHUB_TOKEN }}
          PRNUM: ${{ github.event.pull_request.number }}

· 2 min read
Casion

This article explains how to download the non-default engine plugin packages for each version.

Considering the size of the release package and actual plugin usage, the binary release published by Linkis only contains some common engines: hive/spark/python/shell. Other useful engines have corresponding modules in the project code, such as flink/io_file/pipeline/sqoop (the set may differ between versions). To make them easier to use, these engines are compiled from the release branch code of each Linkis version (https://github.com/apache/linkis) for you to choose from and use.

| Linkis version | Engines included | Engine material package download link |
| --- | --- | --- |
| 1.4.0 | jdbc, pipeline, io_file, flink, openlookeng, sqoop, presto, elasticsearch, trino, impala | 1.4.0-engineconn-plugin.tar |
| 1.3.2 | jdbc, pipeline, io_file, flink, openlookeng, sqoop, presto, elasticsearch, trino, seatunnel | 1.3.2-engineconn-plugin.tar |
| 1.3.1 | jdbc, pipeline, io_file, flink, openlookeng, sqoop, presto, elasticsearch, trino, seatunnel | 1.3.1-engineconn-plugin.tar |
| 1.3.0 | jdbc, pipeline, io_file, flink, openlookeng, sqoop, presto, elasticsearch | 1.3.0-engineconn-plugin.tar |
| 1.2.0 | jdbc, pipeline, flink, openlookeng, sqoop, presto, elasticsearch | 1.2.0-engineconn-plugin.tar |
| 1.1.3 | jdbc, pipeline, flink, openlookeng, sqoop | 1.1.3-engineconn-plugin.tar |
| 1.1.2 | jdbc, pipeline, flink, openlookeng, sqoop | 1.1.2-engineconn-plugin.tar |
| 1.1.1 | jdbc, pipeline, flink, openlookeng | 1.1.1-engineconn-plugin.tar |
| 1.1.0 | jdbc, pipeline, flink | 1.1.0-engineconn-plugin.tar |
| 1.0.3 | jdbc, pipeline, flink | 1.0.3-engineconn-plugin.tar |

Engine types

| Engine name | Supported component version (default dependency version) | Linkis version requirement | Included in release package by default | Description |
| --- | --- | --- | --- | --- |
| Spark | Apache 2.0.0~2.4.7, CDH >= 5.4.0 (default Apache Spark 2.4.3) | >=1.0.3 | Yes | Spark EngineConn, supports SQL, Scala, PySpark and R code |
| Hive | Apache >= 1.0.0, CDH >= 5.4.0 (default Apache Hive 2.3.3) | >=1.0.3 | Yes | Hive EngineConn, supports HiveQL code |
| Python | Python >= 2.6 (default Python2*) | >=1.0.3 | Yes | Python EngineConn, supports Python code |
| Shell | Bash >= 2.0 | >=1.0.3 | Yes | Shell EngineConn, supports Bash shell code |
| JDBC | MySQL >= 5.0, Hive >= 1.2.1 (default Hive-jdbc 2.3.4) | >=1.0.3 | No | JDBC EngineConn, already supports MySQL, Oracle, KingBase, PostgreSQL, SqlServer, DB2, Greenplum, DM, Doris, ClickHouse, TiDB, Starrocks, GaussDB and OceanBase; can be quickly extended to support other engines with a JDBC driver package, such as SQLite |
| Flink | Flink >= 1.12.2 (default Apache Flink 1.12.2) | >=1.0.2 | No | Flink EngineConn, supports FlinkSQL code, and also supports starting a new Yarn application in the form of a Flink jar |
| Pipeline | - | >=1.0.2 | No | Pipeline EngineConn, supports file import and export |
| openLooKeng | openLooKeng >= 1.5.0 (default openLooKeng 1.5.0) | >=1.1.1 | No | openLooKeng EngineConn, supports querying the data virtualization engine openLooKeng with SQL |
| Sqoop | Sqoop >= 1.4.6 (default Apache Sqoop 1.4.6) | >=1.1.2 | No | Sqoop EngineConn, supports the data migration tool Sqoop |
| Presto | Presto >= 0.180 | >=1.2.0 | No | Presto EngineConn, supports Presto SQL code |
| ElasticSearch | ElasticSearch >= 6.0 | >=1.2.0 | No | ElasticSearch EngineConn, supports SQL and DSL code |
| Trino | Trino >= 371 | >=1.3.1 | No | Trino EngineConn, supports Trino SQL code |
| Seatunnel | Seatunnel >= 2.1.2 | >=1.3.1 | No | Seatunnel EngineConn, supports Seatunnel SQL code |

Engine installation guide

After downloading the engine material package, extract it:

tar -xvf 1.0.3-engineconn-plugin.tar
cd 1.0.3-engineconn-plugin
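To see which engines a material package contains without extracting it, list the archive's top-level directories. This is sketched below against a dummy archive with the same layout; substitute the real tar file in practice:

```shell
# Build a stand-in archive mimicking the material-package layout
# (replace /tmp/plugin.tar with the real 1.0.3-engineconn-plugin.tar)
mkdir -p /tmp/pkg/1.0.3-engineconn-plugin/jdbc /tmp/pkg/1.0.3-engineconn-plugin/flink
tar -cf /tmp/plugin.tar -C /tmp/pkg 1.0.3-engineconn-plugin
# List the engine directories inside the archive without extracting it
tar -tf /tmp/plugin.tar | awk -F/ '$2 != "" {print $2}' | sort -u
```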

Copy the engine material packages you need to the Linkis engine plugin directory, and then refresh the engine materials.

For the detailed process, refer to Installing the EngineConnPlugin Engine.

· 3 min read
Peacewong

Overview

openLooKeng is an "out of the box" engine that supports in-situ analysis of any data, anywhere, including geographically remote data sources. It provides a global view of all data through a SQL 2003 interface. openLooKeng features high availability, auto-scaling, built-in caching and indexing support, providing the reliability needed for enterprise workloads.

openLooKeng is used to support data exploration, ad hoc query and batch processing with near real-time latency of 100+ milliseconds to minutes without moving data. openLooKeng also supports hierarchical deployment, enabling geographically remote openLooKeng clusters to participate in the same query. With its cross-region query plan optimization capabilities, queries involving remote data can achieve near "local" performance. Linkis implements the openLooKeng engine to enable Linkis to have the ability to virtualize data and support the submission of cross-source heterogeneous queries, cross-domain and cross-DC query tasks. As a computing middleware, Linkis can connect more low-level computing and storage components by using openLooKeng's connector based on the connectivity capability of Linkis' EngineConn.

Development implementation

The openLooKeng EC implementation extends Linkis' EngineConn Plugin (ECP). Because the openLooKeng service supports queries from multiple users through its client, the EC is implemented as a multi-user concurrent engine: tasks submitted by multiple users can run in one EC process at the same time, which greatly improves EC resource reuse and reduces waste. The class diagram is as follows:

【Missing picture】

Specifically, openLooKengEngineConnExecutor inherits from ConcurrentComputationExecutor, supports multi-user multi-task concurrency, and supports connecting to multiple different openLooKeng clusters.

Architecture

Architecture diagram: image

The task flow diagram is as follows: image

The capabilities based on Linkis and openLooKeng are as follows:

    1. Based on the connection capability of the Linkis computing middleware layer, upper-layer application tools can quickly connect to openLooKeng, submit tasks, and obtain logs, progress, and results.
    2. Based on the public service capability of Linkis, custom variable substitution, UDF management, etc. can be applied to openLooKeng SQL.
    3. Based on the context capability of Linkis, openLooKeng results can be passed to downstream ECs such as Spark and Hive for further queries.
    4. Based on Linkis' resource management and multi-tenancy capabilities, tasks can be isolated by tenant when using openLooKeng resources.
    5. Based on openLooKeng's connector capability, upper-layer application tools can submit cross-source heterogeneous, cross-domain and cross-DC queries and get results back within seconds.

Follow-up plans

In the future, the two communities will continue to cooperate and plan to launch the following functions:

  1. Linkis supports openLooKeng on Yarn mode.
  2. Linkis completes resource management and control for openLooKeng: tasks can be queued by Linkis and submitted only when resources are sufficient.
  3. Based on openLooKeng's mixed computing ability, optimize the Linkis Orchestrator capability to support mixed computing between ECs.