1.1 Main work of Linux operation and maintenance

1, What is Linux operation and maintenance?

*
Operation and maintenance (O&M) means maintaining the network, software, and hardware that a large organization has already built, in order to get services online and keep them running normally.

*
Maintaining a system while it runs draws on networking, systems, database, development, security, and monitoring technologies.

*
There are many kinds of operation and maintenance, such as DBA, website, virtualization, monitoring, and game operation and maintenance.

Operation and maintenance classification:

1) Operation and maintenance development: develops the tools and platforms used by application operation and maintenance.

2) Application operation and maintenance: brings services online, maintains them, and troubleshoots faults, using the tools built by operation and maintenance development.

3) System operation and maintenance: provides the infrastructure that application operation and maintenance relies on, such as systems, networks, monitoring, and hardware.

2, Common tasks in basic operation and maintenance

*
Service monitoring: developing and applying the monitoring platform, and making sure service monitoring is accurate, real-time, and comprehensive.

*
Service failure management: designing failure plans for a service, executing those plans automatically, and feeding fault summaries back into product and system design to improve product stability.

*
Service capacity management: measuring a service's capacity, and planning its data center build-out, expansion, and migration.

*
Service performance optimization: improving performance and response speed from every angle, including network, operating system, application, and client optimization, to improve the user experience.

*
Service global traffic scheduling: distributing the traffic that reaches a service across data centers according to their capacity and service status.

*
Service security: access security, attack prevention, permission control, and so on.

*
Automated service release and deployment: developing and using deployment platforms and tools so that services are released safely and efficiently.

*
Service cluster management: managing the servers a service runs on, including large-scale clusters.

*
Service cost optimization: reducing the resources a service consumes as far as possible, and so reducing its operating cost.

*
Database management (DBA): designing, developing, and managing high-performance database clusters so that database services are more stable, more efficient, and easier to manage.

*
Platform development: developing and managing a Docker-like e-commerce platform, and the technology for connecting services to it.
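The global traffic scheduling item above can be illustrated with a minimal Python sketch. It assumes each data center reports a capacity figure and a health flag; the `DataCenter` class, its fields, and the `split_traffic` function are hypothetical names used only for illustration, and a real scheduler would also consider latency, cost, and per-site limits.

```python
from dataclasses import dataclass


@dataclass
class DataCenter:
    name: str
    capacity: int   # assumed unit: requests per second the site can absorb
    healthy: bool   # assumed to come from an external health check


def split_traffic(total_rps: int, sites: list[DataCenter]) -> dict[str, int]:
    """Share total_rps across healthy sites in proportion to their capacity.

    This sketch does not cap a site at its capacity; it only sets the ratio.
    """
    live = [s for s in sites if s.healthy]
    pool = sum(s.capacity for s in live)
    if pool == 0:
        raise RuntimeError("no healthy capacity available")
    shares = {s.name: total_rps * s.capacity // pool for s in live}
    # Integer division can leave a remainder; assign it to the largest site.
    biggest = max(live, key=lambda s: s.capacity)
    shares[biggest.name] += total_rps - sum(shares.values())
    return shares
```

Unhealthy sites receive no traffic at all here, which matches the idea of scheduling by both capacity and service status.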

1.2 Development stages of Linux operation and maintenance work

1, Manual management stage

1) Business scale

*
Business traffic is low, the number of servers is small, and system complexity is low.

*
For day-to-day operations, engineers mostly log in to servers by hand, each working on their own.

*
Everyone has their own way of working, and the necessary operating standards and process mechanisms are missing; for example, service directory layouts vary widely.

2) Operating duties

*
With few staff, the early operation and maintenance team works mainly on data center construction, basic network construction, server procurement, and server installation and delivery.

*
It is rarely involved in changing, monitoring, or managing online services.

*
At this stage the team is essentially an infrastructure role: providing a simple, usable network and system environment is enough.

2, Tool batch operation stage

1) Business scale

*
As the number of servers and the complexity of the system grow, fully manual operation can no longer keep up with the rapid growth of the business.

*
Operation and maintenance engineers therefore gradually start using batch operation tools, with different scripts for different types of operation.

*
Efficiency improves, but it soon hits a bottleneck, and the quality of operations improves little.

*
Many process rules are introduced: for example, a review mechanism, observing the first server for 10 minutes after going online before continuing, and observing for at least 20 minutes after an upgrade completes.

*
These rules are enforced mainly by people. In practice they are often not followed properly, and they end up reducing efficiency instead.
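A batch operation tool of the kind described in this stage can be sketched in a few lines of Python. This is only an illustration, not a tool from the original article: `ssh_run` and `run_on_hosts` are hypothetical names, the sketch assumes a local `ssh` client with key-based login to every host, and the host names in the usage comment are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor
import subprocess


def ssh_run(host: str, command: str) -> tuple[int, str]:
    """Run command on host through the local ssh client; return (exit code, output)."""
    proc = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", host, command],  # BatchMode: never prompt for a password
        capture_output=True, text=True, timeout=60,
    )
    return proc.returncode, proc.stdout + proc.stderr


def run_on_hosts(hosts, command, runner=ssh_run, workers=10):
    """Run the same command on every host in parallel; return {host: (code, output)}."""
    def task(host):
        try:
            return host, runner(host, command)
        except Exception as exc:  # one unreachable host should not abort the whole batch
            return host, (-1, str(exc))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(task, hosts))


# Example (placeholder host names):
# results = run_on_hosts(["web1", "web2"], "uptime")
```

The `runner` parameter makes the transport replaceable, so the batching logic can be exercised without real servers.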

2) Operating duties

*
At this stage the team also takes on some server monitoring, and is responsible for operating layer 4/7 services that are independent of business logic, such as LVS and Nginx.

*
Service changes are still mostly made by hand, one server at a time, or with a few simple batch scripts.

*
Monitoring focuses on server status and resource usage, with little monitoring of application state. It mostly relies on open source systems such as Nagios and Cacti.
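The server-level checks that monitoring focuses on at this stage look roughly like the following Python sketch. It is not how Nagios or Cacti are implemented; `check_host` and its thresholds are hypothetical, and `os.getloadavg` is only available on Unix-like systems.

```python
import os
import shutil


def check_host(disk_path="/", disk_limit=0.90, load_per_cpu_limit=1.5):
    """Return a list of warning strings for the local machine (empty if all is well)."""
    warnings = []

    usage = shutil.disk_usage(disk_path)
    used_ratio = usage.used / usage.total
    if used_ratio > disk_limit:
        warnings.append(f"disk {disk_path} at {used_ratio:.0%}")

    load1, _, _ = os.getloadavg()   # 1-minute load average, Unix only
    cpus = os.cpu_count() or 1
    if load1 / cpus > load_per_cpu_limit:
        warnings.append(f"load {load1:.2f} on {cpus} CPUs")

    return warnings
```

A check like this says nothing about whether the application itself is serving requests correctly, which is exactly the gap noted above.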

3, Platform management phase

1) Business scale

*
At this stage the team starts building an operation and maintenance platform that carries the standards and processes, freeing up people and improving quality.

*
Service change actions are abstracted into defined operation methods, and the service directory layout, the way services are run, and similar matters are standardized.

*
The platform constrains the operation process: for example, the 10-minute observation of the first server mentioned above, and the requirement that a program's control interface must include start, stop, reload, and so on.

*
The platform enforces pause checkpoints: after the first server has been operated on, the engineer must fill in the corresponding check items before the remaining deployment steps can continue.
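The first-server observation and forced checkpoint described above can be sketched as follows. This is a minimal illustration under stated assumptions: `staged_rollout`, `deploy`, and `confirm` are hypothetical names, the caller supplies the actual deployment and confirmation logic, and the 600-second default simply mirrors the 10-minute observation rule.

```python
import time


def staged_rollout(servers, deploy, confirm, observe_seconds=600, sleep=time.sleep):
    """Deploy to the first server, observe, require confirmation, then do the rest.

    deploy(server) performs the change on one server.
    confirm(server) returns True once the engineer has completed the check items.
    """
    if not servers:
        return []
    first, rest = servers[0], servers[1:]

    deploy(first)
    sleep(observe_seconds)        # observation window for the first server
    if not confirm(first):        # forced pause checkpoint
        raise RuntimeError(f"checkpoint failed on {first}; rollout stopped")

    for server in rest:
        deploy(server)
    return [first, *rest]
```

Because the checkpoint lives in code rather than in a written rule, it cannot be skipped, which is the point of moving the process onto a platform.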

2) Operating duties

*
As business scale and complexity keep growing, the team gradually splits into application operation and maintenance and system operation and maintenance.

*
Application operation and maintenance starts to take over online services, and gradually takes on service monitoring, data backup, and service changes.

*
As their understanding of the services deepens, application operation and maintenance engineers become able to carry out some simple service optimizations.

*
Meanwhile, to cope with the large number of service changes every day, the team also starts writing operation and maintenance tools that make batch changes to specific services convenient.

*
As the business grows, infrastructure failures caused by insufficient capacity planning or weak risk resistance become more frequent, forcing the team to put more effort into multi-data-center disaster recovery and plan management.

4, System self-scheduling phase

1) Work environment

*
With more services, more complex service relationships, and a variety of operation and maintenance platforms, simply turning batch operations into platform operations is no longer adequate.

*
Service changes need a higher level of abstraction: each server is abstracted as a container, and the scheduling system places services on suitable servers according to resource usage.

*
Integration with the surrounding operation and maintenance systems, such as monitoring, logging, and backup, is automated.

*
The self-scheduling system scales capacity dynamically according to how the service is running, and handles common service failures automatically.

*
The work of operation and maintenance engineers also moves forward into the product design stage, where they help developers adapt services so they can be connected to the self-scheduling system.
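The idea of placing services on servers according to resource usage can be shown with a small Python sketch. It is a simplified illustration, not the algorithm of any particular scheduler: the `Node` class and `place` function are hypothetical, and the strategy used here (pick the fitting node with the most free CPU) is just one simple choice.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    cpu_free: float   # free CPU cores
    mem_free: float   # free memory, in GB
    services: list = field(default_factory=list)


def place(service: str, cpu: float, mem: float, nodes: list[Node]) -> Node:
    """Place the service on the node that fits and has the most free CPU."""
    candidates = [n for n in nodes if n.cpu_free >= cpu and n.mem_free >= mem]
    if not candidates:
        raise RuntimeError(f"no node can host {service}")
    target = max(candidates, key=lambda n: n.cpu_free)
    # Reserve the resources so later placements see the updated usage.
    target.cpu_free -= cpu
    target.mem_free -= mem
    target.services.append(service)
    return target
```

Choosing the emptiest node spreads load evenly; a scheduler that wants to keep whole machines free would instead pick the fullest node that still fits.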

2) Operating duties

*
Once the business reaches a certain scale, open source monitoring systems can no longer meet business needs in either performance or features;

*
With massive numbers of service changes and complex service relationships, the old approach of manual records and tool-driven changes can no longer meet business needs in efficiency or accuracy;

*
Security incidents large and small keep occurring, forcing the team to put more effort into security defense.

*
Gradually, the operation and maintenance team settles into 5 major job categories (listed in section 1.3), each of which needs specialists.

*
System operation and maintenance now focuses on building and operating infrastructure, providing a stable, efficient network environment, and delivering servers and other resources to application operation and maintenance engineers.

*
Application operation and maintenance focuses on the running state and efficiency of services. Database operation and maintenance is a specialization of application operation and maintenance, focusing on automation, performance optimization, and security defense in the database domain.

*
Operation and maintenance development and operation and maintenance security provide platforms and tools that further improve engineers' efficiency and make business services more stable, efficient, and secure.

1.3 Classification of Linux operation and maintenance work

1, Application operation and maintenance (SRE):

*
Application operation and maintenance is responsible for changes to online services, service status monitoring, service disaster recovery and data backup, routine service inspections, and emergency fault handling.

*
Its responsibilities are: design review, service management, resource management, routine inspection, plan management, and data backup.

2, System operation and maintenance (SYS):

*
Responsible for building the IDC, network, CDN, and basic services (LVS, NTP, DNS);

*
Responsible for asset management, server selection, delivery and maintenance, network construction, LVS load balancing, and SNAT setup.

3, Operation and maintenance development

*
Develops the operation and maintenance tools and platforms used by application operation and maintenance.

*
Main platforms: ticketing system, CMDB, monitoring system, ELK log system, CI/CD, LDAP, FAQ, training system, OpenStack platform.

4, Database operation and maintenance (DBA):

*
Database operation and maintenance is responsible for designing data storage schemes, database tables, and indexes, and for SQL optimization;

*
It also handles database changes, monitoring, backups, high availability design, and so on. The detailed work is as follows:

*
Design review, capacity planning, data backup and disaster recovery, database monitoring, database security, database high availability, and performance optimization.

*
Automation system construction: operation and maintenance development, the operation and maintenance platform, the monitoring system, and the automated deployment system.

5, Operation and maintenance security (SEC):

*
Responsible for network security, and for hardening system and business security.

*
Conducts regular security scans and penetration tests, develops security tools and systems, and handles security incidents.

*
The work is as follows: building the security system, security training, risk assessment, security construction, security compliance, and emergency response.

1.4 Everyday software and skills for Linux operation and maintenance

1, Platforms and tools used by operation and maintenance engineers

*
Web servers: apache, tomcat, nginx

*
Monitoring: prometheus, zabbix, openfalcon, nagios, cacti

*
Automated deployment: ansible, saltstack, puppet

*
Load balancing: keepalived, lvs, haproxy, nginx

*
Backup tools: rsync, wget

*
Troubleshooting: netstat, top, tcpdump, last

*
Containers: docker, k8s, docker-compose, swarm

*
Security: kerberos, selinux, acl, iptables

*
Virtualization: openstack, xen, kvm

2, Skills for operation and maintenance engineers

*
A solid grounding in computer fundamentals, including computer architecture, operating systems, and networking;

*
Broad knowledge of operating systems, networks, security, storage, CDN, databases, and so on, with an understanding of the underlying principles;

*
Programming ability: everything from small operation and maintenance tools to large systems and platforms requires good programming skills;

*
Data analysis ability: able to organize and analyze system operating data, find problems, and work out solutions;

*
Rich system knowledge, including system tools, typical system architectures, and common platform choices;
