Another Record of My Server Getting Hacked

After a morning with my girlfriend, I noticed on the subway that my server was down. It returned 502 at first, so I assumed a DDoS, or that Tencent Cloud had suspended my service again. Only when I got to the career-guidance lecture did I find out it was someone mining crypto on my box again. Running in the background was a packed binary that decrypts some SHA file.

The ssh password had been changed, yet sshd was left running; I have no idea what the attacker's script was thinking. Since I hardened redis and docker last time, an intrusion through those paths is unlikely, so my guess is ssh brute-forcing.

Looking at the history: CPU at 100%, and this time, unlike last time, the disk was also reported full. `find / -type f -size +10G` pointed at /proc/kcore, but a quick hexdump showed it is a paper tiger: it occupies no real space and is merely a mapping of memory. Linux sizes it from a few parameters, and with address-space reuse it can appear as large as 128T.

Because the ssh password had been changed, and a background process kept re-applying the change, resetting the password via Huawei Cloud's reboot-and-reset feature got me nowhere. What saved the day this time was bt-panel: I simply put my own public key into /root/.ssh/authorized_keys.

After ssh-ing in, I deleted the crontab with crontab -e, but it came back automatically. Eventually I found that /root/.tmp00/bash was the culprit: ps -ef | grep .tmp00 | grep -v grep | awk '{print $2}' | xargs kill -9 (note that you must not delete bash itself, or the system crashes; presumably exactly what the script author intended).

reference: http://www.dashen.tech/2019/05/11/%E4%B8%80%E7%A7%8D%E8%AF%A1%E5%BC%82%E7%9A%84Linux%E7%A3%81%E7%9B%98%E7%A9%BA%E9%97%B4%E8%A2%AB%E5%8D%A0%E6%BB%A1%E9%97%AE%E9%A2%98/

/proc/kcore provides an image of the machine's entire memory. Unlike vmcore, it is a runtime memory image; correspondingly, the kernel provides a kcore_list structure, similar to vmcore's but simpler. Compare the two:
struct kcore_list {
    struct kcore_list *next;
    unsigned long addr;
    size_t size;
};
struct vmcore {
    struct list_head list;
    unsigned long long paddr;
    unsigned long long size;
    loff_t offset;
};
As you can see, vmcore is more complex, and indeed its operations and usage environment are more complex, involving the kexec and kdump mechanisms, which is perhaps why it uses the kernel's ubiquitous list_head. kcore's structure, by contrast, is very simple: its only purpose is to traverse the whole of memory, with no need for lookup or deletion, so it chains itself through its own next field, saving one pointer's worth of space.
At system initialization, mem_init adds both the whole of physical memory and the vmalloc dynamic area to kcore_list, so the list holds at least two elements: physical memory and vmalloc dynamic memory. "Physical memory" here means the linearly (one-to-one) mapped memory, though it need not be: you are free to implement your own mapping in place of the linear one. The kernel's defaults (highmem, VMALLOC_START, the linear mapping, the high-memory mapping and so on) are merely policies layered on one lower-level mechanism, the kernel's memory mapping; on the constraints that mechanism sets you can build many policies, of which separating the linear mapping from high memory is only one:
void __init mem_init(void)
{
    ...
    kclist_add(&kcore_mem, __va(0), max_low_pfn << PAGE_SHIFT);
    kclist_add(&kcore_vmalloc, (void *)VMALLOC_START, VMALLOC_END - VMALLOC_START);
    ...
}

void kclist_add(struct kcore_list *new, void *addr, size_t size)
{
    new->addr = (unsigned long)addr;
    new->size = size;
    write_lock(&kclist_lock);
    new->next = kclist;
    kclist = new;
    write_unlock(&kclist_lock);
}
Next, getting the size of the kcore file. The file does not really occupy that much space; the size is that of the "abstract" entity the kernel exposes, here the image of all memory:
static size_t get_kcore_size(int *nphdr, size_t *elf_buflen)
{
    size_t try, size;
    struct kcore_list *m;

    *nphdr = 1;
    size = 0;
    /* The result is the largest (address + length) over all regions,
       per the kernel's address-space mapping. */
    for (m = kclist; m; m = m->next) {
        try = kc_vaddr_to_offset((size_t)m->addr + m->size);
        if (try > size)
            size = try;
        *nphdr = *nphdr + 1;
    }
    *elf_buflen = sizeof(struct elfhdr) +
        ... /* elf_buflen is the length of the extra ELF header */
    *elf_buflen = PAGE_ALIGN(*elf_buflen);
    return size + *elf_buflen; /* total = memory image size + extra header length */
}
procfs is a file system, and a file system needs a file_operations structure to implement its operations; in procfs, however, every file can have its own operation callbacks. That is, procfs is first a file system, and on top of that meaning it is another mechanism: an outlet through which the kernel exports information. Its files are never real files, yet they do present a file interface. For example, when you run ls -l on /proc/kcore it reports a "size", but no such space is actually occupied; the number is simply what get_kcore_size above computes.

In procfs every file is a proc_dir_entry; that is what procfs really wants to express, the layer it puts on top of the standard file system. The proc_fops field of that structure is the file_operations of the file it represents, so every file under procfs can carry its own file_operations instead of the whole file system sharing a single one, just as in ext2/ext3 and the other real file systems. From an OO point of view, procfs inherits from the VFS and implements its own features on that base (every concrete file system inherits and implements the abstract VFS class; procfs only looks special here because this article is about one of its files). It is like the seqfile described in earlier posts: seqfile is purpose-built to give procfs a serialized read interface rather than being an independent facility, and it can be used inside procfs's file_operations or elsewhere. Next let's look at read_kcore, the read callback of /proc/kcore's proc_fops, i.e. its file_operations:
static ssize_t read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
{
    ssize_t acc = 0;
    size_t size, tsz;
    size_t elf_buflen;
    int nphdr;
    unsigned long start;

    read_lock(&kclist_lock);
    proc_root_kcore->size = size = get_kcore_size(&nphdr, &elf_buflen);
    if (buflen == 0 || *fpos >= size) {
        read_unlock(&kclist_lock);
        return 0;
    }
    if (buflen > size - *fpos)
        buflen = size - *fpos;
    ... /* prepend the ELF header to the data being read out */
    /* Offset-to-virtual-address conversion: for the one-to-one mapping,
       the default case, this just adds PAGE_OFFSET; other conversions
       can be supplied. */
    start = kc_offset_to_vaddr(*fpos - elf_buflen);
    if ((tsz = (PAGE_SIZE - (start & ~PAGE_MASK))) > buflen)
        tsz = buflen;
    while (buflen) {
        struct kcore_list *m;

        read_lock(&kclist_lock);
        /* Find the kcore_list this address belongs to. */
        for (m = kclist; m; m = m->next) {
            if (start >= m->addr && start < (m->addr + m->size))
                break;
        }
        read_unlock(&kclist_lock);
        ... /* error handling when no region is found */
        } else if ((start >= VMALLOC_START) && (start < VMALLOC_END)) {
            /* The caller is reading the vmalloc area's memory image:
               walk the vm_struct list of the vmalloc space and copy
               the data out of each region. */
            char *elf_buf;
            struct vm_struct *m;
            unsigned long curstart = start;
            unsigned long cursize = tsz;

            elf_buf = kmalloc(tsz, GFP_KERNEL);
            if (!elf_buf)
                return -ENOMEM;
            memset(elf_buf, 0, tsz);
            read_lock(&vmlist_lock);
            for (m = vmlist; m && cursize; m = m->next) {
                unsigned long vmstart;
                unsigned long vmsize;
                unsigned long msize = m->size - PAGE_SIZE;

                ... /* bounds checks */
                vmstart = (curstart < (unsigned long)m->addr ?
                    (unsigned long)m->addr : curstart);
                if (((unsigned long)m->addr + msize) > (curstart + cursize))
                    vmsize = curstart + cursize - vmstart;
                else
                    vmsize = (unsigned long)m->addr + msize - vmstart;
                ... /* update the bookkeeping */
                memcpy(elf_buf + (vmstart - start), (char *)vmstart, vmsize);
            }
            read_unlock(&vmlist_lock);
            if (copy_to_user(buffer, elf_buf, tsz)) /* copy the memory data to user space */
                ... /* error return */
            kfree(elf_buf);
        } else {
            /* The remaining case reads physical memory. Strictly it depends
               on the architecture; on x86 with a flatmem kernel this is a
               read of physical memory. */
            if (kern_addr_valid(start)) {
                unsigned long n;
                n = copy_to_user(buffer, (char *)start, tsz);
                ... /* error handling */
            }
            ... /* update the offset and pointer bookkeeping */
        }
    }
    return acc;
}
Once the read function completes, the whole of memory has been read out; save it somewhere and you have a snapshot of memory as it was running. This information can be used for debugging, but debugging modules is not so simple. Although kcore can dump all of memory, for debugging that is not enough: it only tells us what things currently are, not what they should be. To know what they should be you need the original copy. Fortunately, Linux's one-to-one physical mapping simplifies the problem: the kernel vmlinuz, or the vmlinux used for debugging, is itself an ELF file; a kernel compiled with -g carries plenty of debug information; the ELF linker script records where symbols load, plus the code and data sections and the other ELF elements; and the one-to-one mapping keeps the linker script simple and lets the loaded kernel map to trivially computed virtual addresses, just an address plus an offset. But the simplicity stops there. Consider loadable kernel modules (LKM): in the implementation of the sys_init_module system call you find that modules, code and data included, are mapped into the vmalloc dynamic area, so the section load addresses written in a module's ELF file become useless under the kernel's mapping policy. Even debugging an image exported from /proc/kcore against the module's original copy leaves much of the debug information mismatched. How to debug modules thus became a big problem, one the kernel developers have been working to solve…

[Algorithm] NP/NPC/NP-Hard

The slides we use in class come from the University of Waterloo; the homework comes from Berkeley. Very hard, but also great fun.

Since the slides come from Waterloo, they naturally include some boasting about Canadians, for example about Karp, who posed the 21 NPC problems.

On NP, NPC and NP-Hard problems.

You can lay them out on a number line.

Not perfectly precise, but fine for intuition.

Now let's go over NPC.

MIT's open course 6.046 uses a mind map like this one.

$(x_{1}\wedge x_{2}\wedge \neg x_{3})\vee(x_{4}\wedge x_{5}\wedge x_{6})$

Open HPC

OpenHPC is a Linux Foundation collaborative project whose mission is to integrate HPC-centric components into a fully functional reference HPC software stack.

[Image: from Twitter (HPC Now!)]
[Image: OpenHPC]

Operating Systems: Pintos Project 3 Walkthrough

Recap

As usual, the code is open-sourced at http://victoryang00.xyz:5012/victoryang/pintos-team-20 .

This one is very hard to implement. If you are taking the course for credit, start designing early: a good design mind map is extremely useful during implementation. Work out when you need synchronization locks and what the relationships between the three tables are. When debugging, always work from that mind map; at the very least you won't fail everything.

The mind map I drew.

In short, the feature to implement this time is disk-backed virtual memory. Opening the project, you'll find the vm folder is empty, so you need to create the files yourself. These are the files I created.

You need to add to the Makefile:

…and add your new files to Makefile.kernel.

The miserable starting point

At the start, Pintos has almost no support for virtual memory, only per-process address-space separation and userprog's load. Now look at the tools available:

1. A swap partition, for loading & saving paged-out virtual-memory pages.

2. bitmap, a data structure a bit like a two-bit hash, used to find out whether the swap partition has free space.

3. hashtable, used to implement O(1) lookup between page, frame and memory location, which greatly improves performance. (A linked list would work here too, only lookup becomes O(n).)

4. Already-implemented lock acquire & lock release; in other words, you don't have to manage synchronization yourself, just call them as a black-box API.

5. A Project 2 (syscall) that already passes everything.

6. Debug tooling: pintos-gdb. There is a handier Docker-for-pintos online, though it seems to support macOS only. The rough debug loop this time is: run, read the error; if you can't find the error, step through with gdb, watch what gets called before a particular output to pin down the location (usually a lock problem in kernel mode), then work out what should run at that point and adjust the ordering. Getting one test to pass usually unblocks a batch. If you have spent too much time debugging, redo the mind map instead.

P.S. Make sure you really understand these.

Mindset & what needs to be done

Implement demand paging for loading program binaries (pure demand paging).

Implement demand paging for stack pages (you first need to allocate a frame for the first stack page, making this a special case of the above). (Finishing this gets you every point except 24 test cases.)

stack

A coffee with a big giant of NUS

The greatness of a big giant

I would say that no one at a great university with a decent research background is a mediocre person. For someone who received Singapore's NRF fellowship right after finishing his Ph.D. (roughly equivalent to the QianRen program in China), he is simply great. A little about his CV.

He graduated from IIT Mumbai and still has close research ties with IIT Kanpur; I have been to both cities. I would say most people at the IITs are super intelligent, but for lack of money they tend to focus on theory and mathematical proof. He is no exception, and neither was the guy I met at CRVF-2019.

What's his strength? I think he thinks really fast and goes straight to the essence. For the SAT-solver part, given the pseudo-code, he could quickly come up with useful test cases to check usability. I think a great man should be equipped with insight, and he obviously has it. For the EEsolver, a new and fast solver for finding false-negative bugs in programs, he insists on testing the uniformity of the benchmarks. Proof is not just the benchmarks. To find the algorithm inside, we should

National CTF Invitational, Onsite Finals

Another competition we coasted through. I had planned to just sleep: we would take third prize and the 1000 RMB either way. Still, we put in a bit of effort and finished 19th. Apart from Tsinghua and PKU, everyone showed up, including Fudan's Whitzard (Baize) and SJTU's 23333sj under Wang Yijun, so we made a little bit of a name for ourselves. (Last year and the year before, thanks to q7, we placed around first.)

Right, the problem set was one pwn and three web challenges. The environments are open-sourced at http://victoryang00.xyz:5012/victoryang/My_CTF_respo

The current state of Edge Computing

I always have an eye on what's going on in edge computing because I'm an IoT fan. Honestly, I started my CS major with IoT projects, though they were very dumb. (The one listed is not even my first dumb project, hhhhh.)

I had planned to do something at SHIFT, and congratulations to Prof. Yang on his recent publication, Multi-tier computing networks for intelligent IoT. But S3L responded to me first, so I'm a security guy now.

I've been mulling over the idea for a while; the problems involved map closely onto the key state-of-the-art problems:

1. Computing power: the data-processing equipment is no longer a rack server; how do we ensure performance still meets the requirements?

2. Power consumption: it cannot be so large that ordinary mains power struggles to supply it, and high power consumption also means a lot of heat.

3. Stability: deployment in the field makes on-site maintenance dramatically harder, so better stability also means lower maintenance cost. This includes harsh environments on the user side, such as high temperature, humidity, corrosive gases, etc.

4. Cost: only when cost covers demand can we deploy widely and meet customer demand; if the cost cannot compete with network + data center, it is meaningless.

Moore's law has hit a bottleneck, and it is harder and harder to serve both general-purpose and domain-specific optimization at once. At this point, hardware coprocessors that bake common AI algorithms directly into edge-computing silicon become the key to high performance at low power. A key power threshold is 6W TDP: if a chip's power stays under 6W, a heat sink alone suffices and no fan is needed. The absence of a fan means not only less noise but also stability and maintainability that cannot be hurt by fan failure. Among edge-side chips, Horizon, with its self-developed BPU computer architecture, has found a new balance among these requirements: its 4 TOPS of equivalent compute matches a top GPU of two years ago, while typical power draw is only 2W. Not only is no fan needed, the whole unit can be sealed in a metal case, avoiding the dust and corrosion that extra vents invite.

When it comes to computing power, there is a widespread misunderstanding in the industry: peak compute is taken as the main index for measuring an AI chip. What we really need is effective compute and the algorithmic performance it delivers. That should be measured along four dimensions: peak compute per watt and peak compute per dollar (determined by chip architecture, front- and back-end design and process node), the effective utilization of peak compute (determined by algorithm and chip architecture), and the ratio of effective compute to AI performance, mainly speed and accuracy (determined by the algorithm). ResNet used to dominate the industry, but today a smaller, more carefully designed model such as MobileNet reaches the same accuracy and speed with 1/10 of the compute. These ingeniously designed algorithms, however, pose huge challenges to the compute architecture, often slashing the effective utilization of traditionally designed hardware; measured by final AI performance, the trade can even be a net loss. Horizon's biggest strength is predicting how the key algorithms of important application scenarios will evolve and folding those computational characteristics into the architecture design in advance, so that after one or two years of development the AI processor still fits the then-current mainstream algorithms. Compared with other typical AI processors, Horizon's therefore tracks the evolution of algorithms while maintaining a rather high effective utilization, genuinely converting algorithmic innovation into benefit. Horizon also optimizes the compiler's instruction sequencing; after optimization the effective utilization of peak compute improves by 85%.
This raises the chip's processing speed by 2.5x, or cuts power consumption to 40% for the same workload. Another feature of the Horizon BPU is that it integrates well with on-site sensors. Video demands huge bandwidth: 1080p @ 30fps runs about 1.5 Gbit/s from camera to chip. The BPU handles video input, on-site target detection, tracking and recognition at the same time, so all necessary work finishes on site. Both the Journey series for intelligent driving and the Sunrise series for the intelligent IoT cope easily with the scene's bandwidth and processing load. More importantly, common AI computations finish within 30ms, which gradually turns latency-critical applications into reality: automatic driving, and recognition of lane lines, pedestrians, vehicles and obstacles, where excessive or unpredictable latency causes accidents.

With the Sunrise BPU, however, AI computation completes within a predictable latency, which makes developing automatic driving much more convenient. Since it was first proposed, edge computing has been held back by compute performance, strict sensor constraints and power consumption, and its adoption has been slow. Horizon's BPU chips, by striking a new balance between function and performance, can effectively help edge-computing applications deploy to the field more easily, letting all kinds of IoT applications serve everyone more effectively.

credit: https://www.zhihu.com/question/274787680