最近写的几篇文章

以士绅阶层变动论中美贸易战是否是“清军入关”
On whether the China–U.S. trade war is "the Qing army's entry through the Pass", viewed through changes in the scholar-gentry class

摘要:在历史学研究中,拓扑思维和历史周期率是恒久不变的两种手段,不同时间的不同无非是局部的不同,而简单的比对中总是有很多谬误。就比如最近甚嚣尘上的“入关学”,即把当今的中国比作皇太极领导下的大清,把美国比作当时的大明。这本质上是一种否认两者统治思路和统治目的之区别的片面推断,与蓬佩奥所提的冷战2.0(将当今中国与上世纪50年代的苏联相提并论)如出一辙。我提出一种以士绅阶层变动为导向的视角来讨论中华民族统治的本质,从而驳斥“入关学”。

关键词:入关 士绅 拓扑比对 中美格局

士绅阶层广泛存在于中国社会。在古代,中国人没有所谓的家国情怀,而是认同“杨家寨”、“李家帮”所代表的部族,在这个部族中相对自给自足。为何历史上任何时代都有如此多的“汉奸”?从经济账上来看,只是因为部族中的个体效忠于部族而非家国。《天朝的崩溃》中提及有人提供粮食给英军,日伪满政府中有大把满人组成的“伪军”,本质上都是“琦善卖国说”所指的现象。从这个意义上来讲,中华民族从来都是部族的个体意志大于集体主义的趋利避害思维。明清以来,“皇权不下乡”与“天高皇帝远”规范了部族领袖作为承上启下的“传话筒”,百姓只要相信此类人,利益瓜分更有效率,也能自得其乐;对朝代更替,其实并无想法。
士绅其实指的就是士族和乡绅的结合体。前者是当地的名门望族,后者是通过科举选拔制度留守在当地的读书人,这确保了智囊和财富累积的人才供给,他们的普遍想法决定了中国的思潮走向。这就是精英专政。科举制度、联姻及“轻商”保证了此类人的流动性。这点与林语堂的《吾国与吾民》中后续推演的家国情怀相互佐证。
直至近代,为何孙中山十数次的革命没有成果,却在辛亥革命前后引发了全国性的响应?本质上是保路运动让士绅认清:推翻现有政府,原有既得利益集团的利益便能被士绅瓜分。蒋介石下令北伐就能成功,不是因为他是黄埔军校的校长,而是他了解江南财阀的需求;如果他忤逆士绅想抗日的决心,也会被历史所左右。抗日花了十数年,而毛泽东为何能在短短几年内解放全中国?他推翻了现有的士绅阶级(土豪),建立了新的、以共产党为领导的工农阶级(无产阶级)所形成的士绅。现在的士绅便是有学识的商人及中国共产党党员。天下以农民起义得政权的人物不多,也就刘邦、朱元璋、李自成、洪秀全及毛润之,如果取得不了士绅的信任,也就数十年光景。
“入关”,在柯娇燕的语境中是指中央在对边境控制的过程中仅扮演一个财富提供商,由江浙的财阀、当地的士绅、女真等少数民族在边境形成转口贸易市场,本质上是中央撒币来维护周边少数民族的稳定。可中央无钱时,养好的肥肉都纷纷起义,这就是皇太极。这不是一种殖民,而是维系统治的一种手段,人靠利益联结在了一起。可蛮夷还是蛮夷。
“入关”本身是一个正常的市场行为,如果中央财政拨不出钱就由我来管理。再重建政治制度并建立中央银行重新分配民族之间的关系。满人确实利用了五族之间的克制关系建立了一个稳定的政权,同时边境问题也基本被解决。五胡乱华之后,中华民族的思维是如果你认同中华民族的文化就是中国人,而不是靠血统,这无疑让满人汉化,虽然在传统儒家的观念里,这是夷狄乱了华夏,可也是一种反相同化的过程。
通过拓扑思维来连接清军与当今之中国、明朝与美国,把中国“取代”美国成为世界上最强大的国家视作即将发生的历史事件,表面似是相同,但本质完全不同。
表面上,美国是当今世界文明中心和全球事务主导者,这种支配地位正是当年的大明所拥有的。从文明来看,大明对周边实行的是王道:同为世界中心,郑和对当时认知下的全世界的访问,带去的是和平与繁荣。明成祖希望的是来自“天朝上国”的恩泽,一个超级强国对弱国没有倚强凌弱,没有征服与殖民,却行王道,厚往薄来,此之谓文明。而当今的美国带来的却是杀戮与灾难。1989年美国出兵巴拿马,为的是巴拿马运河的经营权,总统直接入狱40年,他只是拒签1977年巴美签订的、1999年所有美军从巴拿马撤走的续签条款。除此之外,朝鲜、越南、伊拉克、阿富汗、叙利亚,不知道有多少国家的人民被美国搞得生灵涂炭、颠沛流离。
文明不是一蹴而就的,两百年的建国历史和盎格鲁-撒克逊人的剥削文化,并不能让美国在科技、军力、财力迅速强大的过程中看到、思考这么多,美国的精英只能从短暂的历史中寻找答案。
同时,中国不是满清。从民族上,当今的中国更像是汉族专政的大明,无论是在对边境的事务上,还是国际事务上。所以从历史周期论讲,中华民族伟大复兴更像是“反清复明”。入关前的清是夷狄,非文明社会,是东北亚的渔猎民族,他们的社会形态是松散的部落,政体上和军事上没有像样的制度和架构,就是原始的奴隶制。明朝初期,他们被朝鲜人奴役,明太祖心软,才放他们在外兴安岭繁衍存续。
从本意上来讲,人民解放军在对待不共戴天的日本战俘和内战中俘获的国军战俘时的人道主义精神,着实值得尊敬。上世纪中叶援助非洲,非洲兄弟报之以琼瑶——把中华人民共和国抬进了联合国。亚投行的初心虽然是去产能、供给侧及后来地缘政治的考量,却也验证了文明的“你好我好大家好”的思维,与郑和下西洋如出一辙。反观朝鲜战场美军如何对待战俘?水刑。罄竹难书、无所不用其极。
中国的士绅阶级经过了五四的启蒙、文革的苦难、《河殇》的转折。至今,一个强大统一而又全面崛起的中国,正在面对新的“华夷之辨”。「夷狄之人貪而好利,被髮左衽,人面獸心。」将中国的士绅阶级比作夷狄,着实是华夷错位。入关以后,扬州被屠城、湖广填四川;和平演变之后,南斯拉夫再无往昔之荣光。从美国对中国在经济和话语权上的打压来看,很有可能是先把你打趴下再继续贸易,和边民想收到中央的恩惠一样,是一种强盗思维。
从中美贸易战的博弈过程来看,美国似是国力大势已去的清军,好比明成祖穿越到努尔哈赤的时代,把清军弄个底朝天。如若将中国、美国与明、清的关系易位,是否能满足拓扑比较理论呢?其实还是不全面的,我认为美国做得还不如清军。
回到士绅阶级的论述,清军是如何占领士绅的呢?清军入关时有歌谣写道:“俺汉人,百敌一,都是有剩;为什么,寡胜众,反易天常?只缘我,不晓得,种族主义;为他人,杀同胞,丧尽天良。”表面上是中国人自己人打自己人,或者映射到现在,就是以公知为首的人提倡全盘西化,又有一帮人在走自己的中国特色社会主义道路。在入关以前,皇太极已经争取到了边境少数民族的士绅阶级,从而通过利益绑定渗透到边民士绅和江南财阀。彼时在新的“天下”概念下,汉人已占其治下的多数。按“剧场国家”理论,皇太极同时是蒙古大汗、清朝皇帝以及满族首领。中华士绅认同的只是他清朝皇帝的身份,同时又对蒙古人、旗人有所忌惮。这已然形成了一个正统王朝的必然条件:通过控制士绅来控制汉人王国。
而美国想在中美贸易战中对中国进行制裁,居然打出了“反共不反中”的口号,言下之意是士绅阶级与中国是割裂开的。美国的执政者以蛮夷之姿,竟连中国的统治本质都不了解,甚至连皇太极都不如。
至于历史周期论,还是有很多门道的。历史上任何掌权的人都是有头有脑的人,虽然有很大局限性,譬如川普政府之于空一格公,再如我认为当今的中国在重演明的历史。从蓬佩奥对新冷战的定义,我感觉就是他们能从自己的历史中找到的比对样本太少,绝不像中国历史乃至世界历史这本近乎图灵完备的课本那样丰富。诚然,现代有更好的统计工具去做各个指标的对应,可中国仍由士绅阶级所决定。明清的历史沙盘纵使推演上千遍,也不及抓住这个本质重要,这也是让如今的士绅看清国际局势、向历史有所借鉴的最好座右铭。

中国内外交困近代史统治者心态变化的缩影
——荣国府的流水账
The Dust of Chinese Modern History

摘要:历史是一个任人打扮的小姑娘,无论现代人如何对历史总结及汇总,总是少不了谬误,以及对谬误的翻案。我不是一个考据的学者,如绝大多数人一样,以史为鉴,唯取一瓢饮。既然中国近代史是中国统治集团内部危机和外部势力的实力博弈过程,明清又有着不同的“超我”、“本我”以及“自我”,我想以一个荣国府的视角,窥探明清直至民国的统治者心态及对外部敌对势力的心态变化。

关键词:内外交困 统治者 心态变化 荣国府

在贾府的决策者中,一直存在着三种思想,对应弗洛伊德所提出的“超我”、“本我”以及“自我”。“超我”便是:贾夫人死后,贾府中再没有在朝廷中有与之对应荣华富贵的爵位,贾母如何延续荣国府的往日荣光,如何继续给予如刘姥姥等下层人高高在上的目光。活在牢笼中的人是可怜的,容易被无限的财富蒙蔽了双眼,殊不知“白茫茫一片真干净”已在明天,半奴半主的王熙凤还在用往日的威严压迫家中奴隶。“自我”便是实际管理者王熙凤采取用月钱放高利贷的手段为贾府的庞大开支续命,可贾雨村、贾政等最多只能混上五品官,并不能罩住这个勾当。贾探春尝试过用改革的方法治理,被王夫人五十九回的“还得请一个人打理”给打破了,后来就没了后文,探春也出嫁了,再后来探春就薨了。
索隐派红楼梦学者,如北大校长蔡元培先生,认为《风月宝鉴》是一部政治隐喻史。蔡元培先生推断书中多次提到的红代表明朝政权,而曹雪芹批阅十载也是在悼红轩,似乎有悼明之意。书中第一回说到空空道人改《石头记》为《情僧录》。古人常言清风明月,比如雍正年间著名的文字狱“清风不识字,何故乱翻书”,虽然这本身无关政治,但可以看出当时的文人或许会用一些隐晦的词表达自己对时事的看法,所以“风月”有可能影射清明两个政权。书中男子都是满人,女子都是汉人。我国古代哲学常以阴阳二字说明一切对待的事物,男女分别对应阳阴。《石头记》第三十一回,湘云说,天是阳,地就是阴;雄为阳,雌为阴;翠缕道,主子为阳,奴才为阴。蔡元培认为,由于本书满人为主,汉人为奴才,那么宝玉所说“我见到女子便觉清爽,见到男人就觉浊臭逼人”,是在隐晦地表达作者的复明之意。
如此,《红楼梦》与清朝的关系有所确立。清朝统治者眼里,似也存在这三种思想。“超我”对应清朝的治理方向与水平:想要以后金之身正统上位,获得当朝士大夫的赏识,做的却是污蔑明朝成就的勾当;想要以收复台湾复汉官之威仪,以义和团战斗来复天朝上国之威仪,出来混,总是要还的。“本我”之下,台湾的收复真的是郑成功运气好:面对荷兰的船坚炮利,当时的舰队只能说旗鼓相当,而非完全碾压,而是占据了天时地利人和。郑成功后人没能发挥他祖宗的才能,被康熙打败了。早在清代中前期,国库还算富庶的情况下,统治者就知道了西方的科技,只是没能领略这是跨代的差距,可以说是被短暂的成功蒙蔽了眼睛。清朝要巩固的是五族的统一,实话说在嘉庆年间就有农民起义能攻进北京城。清朝的文字狱是最猖獗的,这也是统治者脊背发凉的寒意。
1 贾府的选择及“剧场国家”理论
几乎所有贾府的管理者在府上都是千人千面的。贾母虽然是家中的最终拍板人,可在王熙凤面前就会露出少有的怜爱;在刘姥姥二进大观园的当口,没有体现出高高在上的姿态,虽然认知水平和底层有冲突,可还是喜迎喜送。贾母曾经同样爱惜过贾探春以及李纨等人,她们在凤姐面前确实不值一提。凤姐在贾府的丫鬟和其他管理者面前还是体现出一种威严的感觉,同时剥削让她们活得无法喘气,有点像今日日本的“社畜”。王熙凤是在长时间掌握贾府月钱分发实权之后,才敢放贷给缺钱的人,把对外人的求情转化为可以量化成钱的交易,徇私枉法;高价购买伪劣产品,或者将贪污的钱记在流水账上,分别填补过去的开销。好似当今的公司股价一天天地下跌,可是懂的人早就在最高价甩卖走人了。如果有什么能占便宜的活,马上截取下来,成为新的谋钱手段。
清朝人不单靠皇权而统治汉人,“剧场国家”理论认为满清的皇帝在不同种族人面前都在扮演其对应民族的最高统帅。我们典型的观念里,或者说汉人的史书上,清代的崛起是北方游牧民族征服南方农耕民族的过程,其本质和五胡乱华、蒙古建立元朝并无差别,这是夷狄乱了华夏。而这本质上也是不对的。从满洲人的角度上看:我从边疆朝贡贸易不能赚到中央财政拨过来的钱,那就和关内的人合伙打劫,一起骗中央财政;可中央再也拨不出钱的时候,等到边民没有那么有能力的时候就趁火打劫。夺取政权以后,清朝皇帝也是千人千面,他们似乎在关外就想清楚了这个问题。无论国家、家族抑或公司都需要一个想象共同体来支撑,可以是共同的梦想,可以设计假想敌,也可以对不同人采取不同的洗脑策略。美国当今干的勾当是竖切人群,把人分成各类少数族裔,从而让每个人相信对政府的维权可以持续,而每个人都成不了气候。
“剧场国家”是在顺延明朝对中国统治的基础上,完成对满人、藏人、新疆少数族裔等的统一管理。在更远的边疆,比如当今的西伯利亚贝加尔湖,管理没有那么强有力,但汉人及满人的精神还是在的,只是被沙俄帝国主义流放犯人的无耻打败了。在边疆以外,清朝采取了与明朝相似的边境贸易策略,从而维护松散的繁荣假象。在观看了同好游历海参崴、西伯利亚、蒙古国等地之后,我感觉清朝显然对其没有有效的控制,连教育都没能在时代的长河中对当地人产生些许影响。蒙古国的思维完全是沙俄思维,美国的影响也有。在新疆等地,他们的民族向心力没有那么深;在游历土耳其后,我发觉突厥语的典籍和汉文化的完全不一致,突厥人的后裔可以说是表面兄弟,他们有他们完整的一套话语体系。我认为“剧场国家”对边疆的控制仅限于此,更多的是对汉人的威慑,对其他几族的管控可以说几乎没有。
贾府后续的逐渐荒凉,我认为并不全是人的慢慢离去,而是自上而下大家觉得,维护这样一个贾府还很昌盛的“剧场”太无聊了:早就失去实权的府上,再能敛财也不过是公家的钱,最多在最后撑足面子办一场贾府的寿宴。慈禧能在最后还想着给自己庆生,“剧场国家”的维护过程与面子无异。
2 商品经济的涌入与早期的资本阶级萌芽
王熙凤的高利贷的存在,对应晚清兴盛的钱庄。可问题是这些钱庄做不了现代银行薅羊毛的手段:印钞票。钱庄是不是一个资本主义的集散地呢?是的。可是钱庄的敛财能力还不够,或者说当时商品经济的发达水平还不足以孕育一个合格的资本主义市场,从而更好地投机。王熙凤的行为,更像是初出茅庐的投机者,借父母的银子玩一玩。商品经济的萌芽实际上在明末清初的时候就有,《红楼梦》中的一种碗具就是进口自英吉利。其实当时的货物交换就已然有了,不过只在顶级富豪家才有。
我暑假实习的公司的竞争对手是一家做市商 Optiver,他们在做宣传的时候就说,自己的传统可以追溯到荷兰大航海时代对船上货物的担保,可见商业文明之发达。晚清各大银行搜刮了大清的铸币税,随意地在上海发钱投机,说实话清晚期就是受到了资本的降维打击;同时也对当时的钱庄有所影响,钱庄可以与各大银行相互借钱还钱。说到晚清商人的投机,许多人想到的就是罪恶的资本主义世界里贪婪的商人形象,最容易与经济泡沫、金融海啸联系起来,直观上感觉投机是百害而无一利的。许多人认为投机就是参与一场赌博或者骗局:少数人暴富而多数人破产,所有投机者都试图通过不劳而获暴富,而损失让别人承担,因此堪称是资本主义罪恶性的集中体现。
投机的目的是为了获取超额收益,也就是超过市场整体平均的收益。完美的市场是达到均衡状态的市场,所有卖方都能将出售的商品以均衡价格卖出,而所有买方都能将需求的商品以均衡价格买入,价格完美地由供求决定。在这样的市场上投机收益的期望就是市场平均的收益,这样的市场被称为有效市场(Efficient Market)。有效市场是市场的最佳形态,买卖双方的利益均达到了最大化,资源配置达到了帕累托最优。上世纪二十年代的上海就是这样一个投机的地方。
明清两代的统治者对金融的管控从开国起就很强,北京对江浙财阀的限制永远是从开国时的紧到慢慢松的过程。这种生意单靠脑子就能玩得很转,只是投机到一定程度,也会如凤姐一样被一锅端。这种完全靠吸百姓血而生的勾当,势必会被当局所处置。
3 五胡乱华与满洲人的关系
明代对抗外族侵袭的方法就是边境贸易,让利一份给边民从而换取和平。中央在对边境控制的过程中仅扮演一个财富提供商,由江浙的财阀、当地的士绅、女真等少数民族在边境形成转口贸易市场,本质上是中央撒币来维护周边少数民族的稳定。可中央无钱时,养好的肥肉都纷纷起义,这就是皇太极。这不是一种殖民,而是维系统治的一种手段,人靠利益联结在了一起。可蛮夷还是蛮夷,大明在面对两者的联合时是没办法翻身的。
清朝治下的旗人和满洲人是完全不一样的概念。旗人指的是跟随大清入关的所有人,其中既有满洲八旗,也有汉军八旗(即汉人),还有蒙古八旗。“旗人”在政治、经济、社会各方面享有诸多特权。比如,旗人世代不必从事劳动生产,其生活来源全部由国家承担;旗人比民人享有更多的机会做官。旗人只是对应特权阶级,我们可以很清楚地看到,曹家是汉人,也是汉军正白旗人,在文化上变成了满人。《东华录》中有载,雍正十八年六月,世祖遗诏下令满人学习汉俗;又有康熙十五年十月议政王大臣等上奏,八旗子弟由于武备危机,奏请将旗下子弟考试生员、举人、进士暂令停止。也就是说,清帝为了笼络汉人巩固统治,已经下令清廷贵族学习汉族文化。只是到了辛亥革命前后,革命党为了强调民族区别,才把“旗人”和“满洲人”这两个概念混同起来。
近代史中,统治者从来都是想着如何继承前朝的正统性,从来没有“崖山之后无中华”或者“清军入关之后无中华”的臆想。统治者从来不会不承认元清两代,至少不会放弃他们所代表的疆域;没有实力的时候至少先卖个关子待定,等被对方完全控制的时候再承认也不迟。这种中庸思想在非洲抑或北美都是无法想象的。
4 荣国府的边界以及近代史的边界
荣国府的边界,说的是荣国府的疆域范围,包括大观园等,但荣国府的势力范围并不仅仅局限于此。可以说,荣国府的边界,就是其利用其权力范围可以做到的自救所产生的边界。它的财富染指当地的平民,它的权力被隔壁宁国府的王夫人、邢夫人觊觎。但从荣国府的人来说,我觉得每个人都被压得死死的,无法翻身。每个人形形色色,能够博眼球的也就无私或者手段阴险,而两者的动态平衡从来都是刀光剑影,没有人能改变现状。从大明郑和的航海时代起,中国生产的财富以及白银,以朝贡贸易的形式对外输出,然而这一切三世而终。中国只有走出去才能避免内卷,近代史的缺失就在于没有走出去。内部的人不断内卷,从而消磨光了国库;清朝的海禁从罗马教廷礼仪之争开始,只留一口通商。近代史的边界就是大清的疆域。
荣国府对底层老百姓有公司对下属式的威信,却没有士绅对乡民的威信,最大的原因是他们的利益没有挂钩:荣国府与手下及外部百姓的收入是此消彼长的。可以说荣国府代表的形象就是大清代言人,可是他们用惯了奴婢文化,使惯了尊卑思想,让下层人民群起反抗。
大清的读书人被文字狱给害了,本可以从三纲五常的戒律中退出来的,可是却一步一步地陷了进去。明代科学之繁盛,常微分方程的解法比牛顿、莱布尼茨早了百余年,甚至可以怀疑是否是明代徐光启把中国的科学通过来华传教士带回了欧洲;明晚期的程朱理学、船山学派在清朝都没了传承之人,取而代之的是桐城派这种为自己谋利益的地方学派。这些被清帝毁于一旦。可以说在文化上,清帝只是另一个合格的北魏孝文帝,而非使张骞出使、开拓贸易的明君。
半封建半殖民地的中国大地上,受到最多恩惠的是留美留日的一群人,这是中国所能冲出黎明的一群人。同时期的还有下南洋的华人,以及在香港、澳门开过眼界的人。他们拓展了中国人的智识边界。怎奈中国人赔给日本的钱,早就让日本人获得了这份智识。
5 如果“超我”能实现
如何自救?1840年开始,中国史的疆域从对口岸的掌握,到对沿海的全部缺失;白银外流,国库空虚,走上了渐衰的国运,内部官员相互竞争剩余的价值,直至义和团的“白茫茫一片真干净”。由于乾隆、嘉庆皇帝盲目以天朝自居,对外开放贸易的限制逐渐变大,导致与世界隔绝,错失发展的良机。如果乾隆皇帝答应对外贸易、开放口岸,清朝会有多少好处,每年可以带来多少税收与国库收入。如果没有闭关锁国,在正常的情况下可以一分为二地看:一是成功地进入帝国主义,就像日本一样,发展自身的军事实力,不惧怕列强的侵扰;二是外因过于强大导致清王朝内部崩溃。可问题是,持这种观点的人在儒家文化圈里占少数,以致整个国家在面对日本的甲午战争时,只有李鸿章这一个部门能迎击。
而我认为荣国府的颓势是无法通过开放来解决的:就算开放已经实现,贾母死了,王熙凤也只能与其他人内耗。这是一个无解的“超我”盘,大概只有与民同被割韭菜吧,毕竟更强大的权贵在很多时候能高人一等。
6 参考文献
《荣国府的经济账》
《尼加拉:十九世纪巴厘剧场国家》
《红楼梦与中国文化论稿》(胡文彬)
Wikipedia. Chinese Modern History. Wikipedia.org 2020

一个概率论Bound问题

昨晚和以前实习的同学讨论一个上界的问题,如果在未来博士的过程中也能有这样的氛围就好了。

主要就是一道概率论题

已知 \(A \sim \mathrm{Binom}(n, p)\),\(B \sim \mathrm{Binom}\!\left(\frac{A(A-1)}{2},\, q\right)\),求 \(H(B)\),即 B 的 entropy。

这里的难点是如何求参数本身服从二项分布的二项分布(复合分布)。直观上感觉后者的熵值是前者的 \(\log(\log(\cdot))\) 这种量级,可对 \(A\) 的展开太过繁琐。敲在 Mathematica 当中可以是 \(P(B=i) = \text{Sum}[P(B=i|A=j)\,P(A=j),\{j,0,n\}]\),暂时我只想到这种按全概率公式展开的解法。
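按这个全概率公式可以直接数值枚举来验证。下面是一个假设性的 Python 草稿(函数名与实现均为我自拟,仅作小规模 sanity check,不是闭式解):

```python
from math import comb, log2

def entropy_of_B(n, p, q):
    """H(B)(单位:bit),其中 A ~ Binom(n, p),B | A ~ Binom(A(A-1)/2, q)。
    按全概率公式直接枚举 P(B = i)。"""
    # P(A = j)
    pA = [comb(n, j) * p**j * (1 - p)**(n - j) for j in range(n + 1)]
    m_max = n * (n - 1) // 2
    pB = [0.0] * (m_max + 1)
    for j, pa in enumerate(pA):
        m = j * (j - 1) // 2          # 给定 A = j 时,B 的试验次数
        for i in range(m + 1):
            pB[i] += pa * comb(m, i) * q**i * (1 - q)**(m - i)
    return -sum(x * log2(x) for x in pB if x > 0)

# A = 2 必然发生时,B ~ Bernoulli(0.5),熵应为 1 bit
assert abs(entropy_of_B(2, 1.0, 0.5) - 1.0) < 1e-9
```

这个枚举是 \(O(n^3)\) 量级的,只能做数值验证,得不到解析的 bound。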

这个问题我翻了翻网络课前两节推荐的信息论书,上面有类似的关于二项分布的相关性质,可是唯一提到的也就是在 \(p\) 上做文章:fix \(n\),\(H(A)\) 的 max 在 \(p\) 取 \(\frac12\) 时取到。然而没啥卵用。

概率论与图论背后的算法

算这个是为了做一个算法去 recover 一个 ER random graph,given 每次只能 query graph 的一小部分里面有没有 edge 的存在。

这个 random graph 很有名,很多概率图都是基于此。也是 TCS 求 lower bound 的一种方式,很多人梦寐以求的方向。
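一个最小的示意(setup 为我自拟):生成一个 \(G(n,p)\) 随机图,再用“每次 query 一对顶点之间有没有 edge”的 oracle 把它恢复出来。每对只问一次时恢复是平凡的,有意思的 lower bound 在于能否用少于 \(\binom{n}{2}\) 次 query 做到:

```python
import random

def er_graph(n, p, rng):
    """G(n, p):n*(n-1)/2 条可能的边,每条以概率 p 独立出现。"""
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if rng.random() < p}

def recover_by_pair_queries(n, query):
    """oracle 每次只回答一对顶点之间是否有边,逐对询问即可恢复整张图。"""
    return {(i, j) for i in range(n) for j in range(i + 1, n) if query(i, j)}

rng = random.Random(0)
g = er_graph(20, 0.3, rng)
recovered = recover_by_pair_queries(20, lambda i, j: (i, j) in g)
assert recovered == g  # 每对问一次,恢复必然精确
```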

谈体系结构的进步对网络的影响

最近量子位又发了一篇体系结构的进步:TCAM,所谓的三态内容寻址存储器。可以说,从图灵机的角度来说,上层建筑下的基层还有很多没有解决。从现在那么多 Startup 在真正做业务 Oriented 的数据库及网络链路优化来看,体系结构还有很多可以探索的部分。同时,新的架构是否真的安全?如 TPU 的数据通路是否有没被侦测到的部分可以被攻击。多数的攻击来自于软硬结合,汇集了多少工程师的智慧结晶。

交换机的简化结构


这是一个去掉 2 个要素的冯诺伊曼体系结构图。交换机的 Outbound 和 Inbound 的 Throughput 是显见的 bottleneck,除此之外还有延时,这就需要主存储器性能或者包传输协议的革新。

三态内容寻址存储器(TCAM)

我记得当年写 VB 的时候有个 slider 的参数,是用一个三进制数来表示不动、向上滑和向下滑。而这种 0、1、-1 的三进制在苏联当年的计算设备上有所尝试,可惜最终失败了。0.5 或许是更好的一种表示中间态或者亚稳态的编码方式,可以用于模糊匹配,或者 Not Set。

CAM 本质上是一种用硬件做数据查找的方法,读写数据的速度与 RAM 相同,查找时能相对模糊地匹配到数据。
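TCAM 的“模糊匹配”可以用 0/1/x(don't care)的三态 pattern 来理解。下面是一个假设性的 Python 草稿(条目与规则名均为我自拟),模拟 TCAM 按优先级返回第一条命中规则的行为:

```python
def tcam_match(key_bits, entries):
    """entries:(pattern, value) 列表,pattern 由 '0'、'1'、'x'(don't care)组成。
    像 TCAM 一样按条目顺序(优先级)返回第一条匹配的 value。"""
    for pattern, value in entries:
        if all(p in ('x', k) for p, k in zip(pattern, key_bits)):
            return value
    return None

table = [
    ("1010xxxx", "rule-A"),   # 匹配所有以 1010 开头的 key
    ("10xxxxxx", "rule-B"),
    ("xxxxxxxx", "default"),  # 兜底规则
]
assert tcam_match("10100001", table) == "rule-A"
assert tcam_match("10110001", table) == "rule-B"
assert tcam_match("01110001", table) == "default"
```

这也正是路由表最长前缀匹配在硬件里的常见实现思路。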

这时 ARP 等协议用报头或者 CRC 来验证数据正确性就起到了很大的作用:不管怎样,先以最快的速度把数据发出去,对不对等到了再做检验,思路是一样的。(有点像高频交易架构的 gateway。)

Reference

  1. Constant-time Alteration Ternary CAM with Scalable In-Memory Architecture
  2. 三态内容寻址存储器(TCAM)工作原理

[Network] 网络链路层路由算法总结

Routing protocols

  • Routing Information Protocol (RIP)
    • Algorithm: Distance Vector
  • Open Shortest Path First (OSPF)
    • Algorithm: Link State
  • Border Gateway Protocol (BGP)
    • Another type of vector routing: widely used as the inter-AS protocol
    • demo
    • Problems with integrating with intra-domain routing
      • static method: all unknown IPs send to D.
      • Entry translation: high cost
      • interior BGP (iBGP)
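RIP 这类 Distance Vector 协议的核心就是 Bellman-Ford 式的迭代:每个节点反复用邻居的距离向量更新自己的。下面是一个假设性的 Python 草稿(拓扑为我自拟的简化模型):

```python
def distance_vector(nodes, links, rounds=None):
    """Bellman-Ford 式的 distance-vector 迭代(RIP 的简化模型)。
    links:(u, v) -> cost,按无向边处理;返回 dist[u][v]。"""
    INF = float("inf")
    dist = {u: {v: (0 if u == v else INF) for v in nodes} for u in nodes}
    neigh = {u: [] for u in nodes}
    for (u, v), c in links.items():
        neigh[u].append((v, c))
        neigh[v].append((u, c))
    for _ in range(rounds or len(nodes)):   # 每一轮,每个节点合并邻居的向量
        for u in nodes:
            for v, c in neigh[u]:
                for d in nodes:
                    dist[u][d] = min(dist[u][d], c + dist[v][d])
    return dist

nodes = ["A", "B", "C", "D"]
links = {("A", "B"): 1, ("B", "C"): 2, ("C", "D"): 1, ("A", "D"): 10}
d = distance_vector(nodes, links)
assert d["A"]["D"] == 4  # A-B-C-D 优于 cost 为 10 的直连
```

真实的 RIP 还要处理 count-to-infinity(16 跳上限)等问题,这里略去。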

ARP

之前打 CTF 的时候搞过这个协议,主要是在 MAC 层对网卡做欺骗,把自己伪装成 destination,就能抓取同一路由器下其他设备的包。

要规避这种攻击,就得做 IP-MAC 绑定。

[Network] 几个Mac层的协议

我校网络老师不太 Care 概率模型下的网络分析,只 Care 实现。(但是我最近正好在学概率统计,权当一道作业题复习。)

ALOHA协议

主要思路就是让所有想发的站都直接发;如果发生碰撞,就随机掷骰子决定重发时间,再碰撞则随机范围翻倍后再掷骰子。


可问题是,非常容易冲突。

如果我们做一道概率题

  1. 帧时T:发送一个标准长的帧所需的时间
  2. 吞吐率S:在一个帧时T内发送成功的平均帧数(0<S<1,S=1时信道利用率100%)
  3. 运载负载G:一个帧时T内所有通信站总共发送的帧平均值(包括原发和重发帧)(G≥S,G=S表示无冲突)
  4. P0:一帧发送成功(未发生冲突)的概率,发送成功的分组在已发送分组的总数中所占的比例;公式:S = G*P0

一帧开始发送的前后各一个帧时内,只要有其他帧发送就会碰撞,故冲突危险期为 2T,这段时间内的帧平均值为 2G;一个帧时 T 内生成 k 个帧的概率服从泊松分布。

由泊松分布可知:

\(\operatorname{Pr}[\mathrm{k}]=\frac{\mathrm{G}^{\mathrm{k}} \mathrm{e}^{-\mathrm{G}} }{ \mathrm{k} !}\)

\(P(\text{success in 2T})=Pr(0)\times Pr(0)=e^{-2G}\)

带入S= G*P0 得
\(\mathrm{S}=\mathrm{Ge}^{-2 \mathrm{G}}\)

最高信道利用率是18.4%

Time-slotted ALOHA

  1. 分隙ALOHA是把时间分成时隙(时间片),时隙的长度对应一帧的传输时间
  2. 新帧的产生是随机的,但分隙ALOHA不允许随机发送,凡帧的发送必须在时隙的起点
  3. 冲突只发生在时隙的起点,冲突发生时只浪费一个时隙,一旦某个站占用时隙并发送成功,则在该时隙内不会出现冲突

显然此时冲突危险期缩短为一个时隙,\(P(\text{success in } T)=\Pr(0)=e^{-G}\),\(S=G e^{-G}\)。
最高信道利用率是36.8%
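两个最高利用率可以数值验证(示意草稿):\(Ge^{-2G}\) 在 \(G=\frac12\) 处取最大值 \(\frac{1}{2e}\approx 18.4\%\),\(Ge^{-G}\) 在 \(G=1\) 处取最大值 \(\frac{1}{e}\approx 36.8\%\)。

```python
from math import exp

def pure_aloha(G):    return G * exp(-2 * G)   # S = G e^{-2G}
def slotted_aloha(G): return G * exp(-G)       # S = G e^{-G}

Gs = [i / 1000 for i in range(1, 3000)]
s_pure = max(pure_aloha(g) for g in Gs)     # 峰值在 G = 1/2
s_slot = max(slotted_aloha(g) for g in Gs)  # 峰值在 G = 1
assert abs(s_pure - 1 / (2 * exp(1))) < 1e-4   # ≈ 0.184
assert abs(s_slot - 1 / exp(1)) < 1e-4         # ≈ 0.368
```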

CSMA/CD

一种带有冲突检测的载波监听多路访问,可以检测Mac传输的冲突。

主要流程是

  • 先监听信道(carrier sense),信道上没有其他包再发包。
  • 边发边做碰撞检测(collision detection),如果碰撞就广播拥塞信号,然后掷骰子(指数退避)后重新发包。
  • 16 次尝试均失败后报告 timeout。
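其中“掷骰子”的部分就是二进制指数退避,可以草拟如下(以太网语义的简化示意,参数为我自拟):

```python
import random

def backoff_slots(attempt, rng, cap=10):
    """第 attempt 次碰撞后,在 [0, 2^min(attempt, cap) - 1] 个时隙里
    均匀随机选一个等待时间(以太网把指数封顶在 10)。"""
    k = min(attempt, cap)
    return rng.randrange(2 ** k)

rng = random.Random(42)
waits = [backoff_slots(n, rng) for n in range(1, 17)]  # 第 16 次仍失败则放弃
assert all(0 <= w < 2 ** 10 for w in waits)
assert max(backoff_slots(3, random.Random(s)) for s in range(200)) <= 7
```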

CSMA/CA

WLAN 中实现不了 CSMA/CD,主要原因是存在 hidden terminal 和 exposed terminal 的情况。一种不太完美的解决方法是 RTS/CTS,可这比较容易被攻击。

组里的几个邀请

上周、上上上周和上上上上周,我们迎来了组里的几场讲座。一场是新加坡国立做 Software Analysis 的教授来访;一场是用 DTMC 来解释音频模型的项目,作者 Xiaoning Du 现在在 Sydney Tech;另一场的作者是 Hongxu Chen,在交大读的研究生,NTU 读的本科和博士。

我对前者的直观感受就是,可能这个方向的博士不是很难拿吧。DeepStellar 当时还是热点,现在有点凉,但证明能干一些事情了,她趁热打铁拿了好几篇 adversarial 和 benign 相关的文章。我对后者的感受:很强,我到博士也不一定有其水平的一半。

RNN to DTMC


就是一坨统计堆出来的可解释性。
进一步抽象


这里用的公式\(\operatorname{Dist}\left(\hat{s}, \hat{s}^{\prime}\right)=\Sigma_{d=1}^{k}\left|I^{d}(\hat{s})-I^{d}\left(\hat{s}^{\prime}\right)\right|\)
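这套抽象大致是:把 RNN 的隐状态向量按每一维的区间离散化成格点(\(I^d\) 即第 d 维的区间编号),再统计抽象状态之间的转移频率得到 DTMC。下面是我按自己理解写的假设性草稿,并非论文原实现:

```python
from collections import Counter

def abstract_state(h, step=0.5):
    """把隐状态向量映射到格点;I^d 取第 d 维的区间编号(对论文抽象的个人理解)。"""
    return tuple(int(x // step) for x in h)

def l1_dist(s, s2):
    """即上面的 Dist 公式:各维区间编号之差的绝对值求和。"""
    return sum(abs(a - b) for a, b in zip(s, s2))

def build_dtmc(trace):
    """统计抽象状态间的转移次数,归一化成转移概率。"""
    counts = Counter(zip(trace, trace[1:]))
    out = Counter(s for s, _ in counts.elements())
    return {(s, t): c / out[s] for (s, t), c in counts.items()}

trace = [abstract_state(h) for h in [(0.1, 0.2), (0.6, 0.2), (0.6, 0.9), (0.1, 0.2)]]
dtmc = build_dtmc(trace)
assert l1_dist((0, 0), (1, 1)) == 2
assert abs(sum(p for (s, _), p in dtmc.items() if s == trace[0]) - 1.0) < 1e-9
```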

最终的结果

最后用统计做了一堆相似性的bound “证明”,就说RNN聚类抽象出来的state 和DTMC 一一对应。想法很新颖,但其实回头想没什么内涵。

MUZZ

这里 Thread-aware 就很厉害。

相当于独自开个领域


可我事后问了他 scalable 的问题,他的回答(大致是一串原话关键词):go channel 不行;java lock 有几个坑;lock threading 不能很准;java 分析;fuzzing oracle 异常;jlang;方舟编译器;scala native 不靠谱;动态 gc llvm 不会做;jvm 抽象等级;z 一致性;印度 chenhao 学术界;llvm uiuc;爷;工业界 fuzzing ok2。

他还有几篇

坚定了我去美国找个工位的目标,加油~

On database of Go and Kubernetes and Rust

A few days into the long vacation in China, I found some stuff that was good to play with. The incentive was to figure out a fast framework for my ugly blog, which turned out to be nonsense. But the process of figuring them out was so interesting.

My demands

My need is to have gitment embedded and working so that trash comments will be filtered. That could be handwritten by myself, but I don't currently have time to do systematic research on JavaScript; applying the latest wheel is enough for me. Besides, I'm in great need of a good Markdown writing experience, so Gatsby, Hexo and Hugo are my choices.

Rust

First, I consulted some blogs written in Rust, which took a great amount of effort. https://github.com/ramsayleung/blog was a fantastic one. I found it utilizes Diesel to map database tables to Rust structs like:

table! {
    post (id) {
        id -> Int4,
        title -> Varchar,
        subtitle -> Varchar,
        raw_content -> Text,
        rendered_content -> Text,
        create_time -> Timestamp,
        modify_time -> Timestamp,
        post_type -> Int4,
        hit_time -> Int4,
        published -> Bool,
        slug_url -> Varchar,
        enable_comment -> Bool,
        tag -> Jsonb,
    }
}

table! {
    user (id) {
        id -> Int4,
        username -> Varchar,
        hashed_password -> Varchar,
        create_time -> Timestamp,
        modify_time -> Timestamp,
        email -> Varchar,
        avatar_url -> Nullable<Varchar>,
    }
}

table! {
    visitor_log (id) {
        id -> Int4,
        ip -> Inet,
        access_time -> Timestamp,
        user_id -> Int4,
    }
}

allow_tables_to_appear_in_same_query!(
    post,
    user,
    visitor_log,
);

I also consulted our school's database, which I guessed (correctly) to be PostgreSQL. For a static blogging website, the database approach seems laggy and out of date. I eventually found out that even though the database API is written in Rust, calling PostgreSQL is not that fast, within tens of milliseconds per foreign-key lookup. The web part is pure JS, not WASM; the Rust part only accounts for the request logic against the backend, and it is still JS that gets data from the database, which is sad.

Rust is still not a frontend-ready language, although it claims to be a fast and high-throughput language in terms of dealing with data. They do have https://github.com/SASUKE40/yew-starter for WASM, but it is still glued together with JavaScript, so why not just JavaScript?

Nearly all the data that crosses an API boundary in these languages is stored via some kind of mapping, for example JSON.
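The same kind of table-to-struct mapping Diesel does can be sketched in any language; here is a hypothetical Python round trip (names are mine, mirroring a subset of the `post` table above):

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class Post:
    """Hypothetical mirror of a subset of the Diesel `post` table above."""
    id: int
    title: str
    published: bool

def to_json(p: Post) -> str:
    return json.dumps(asdict(p))          # struct -> mapping -> JSON text

def from_json(s: str) -> Post:
    return Post(**json.loads(s))          # JSON text -> mapping -> struct

p = Post(id=1, title="hello", published=True)
assert from_json(to_json(p)) == p         # lossless round trip
```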

HUGO

Hugo is written in Go. Earlier, I had some experience dealing with time-serialized (LSM) HPC data using Go APIs. Go is really an out-of-the-box language, so you don't care much about memory leakage and semaphore stuff in multithreaded programs. Because many companies use the language, there are a bunch of resources and communities for CRUD and business code, from databases to HTTP servers, from JSON to YAML; Hugo is just another part of it. I gained much information from the blog https://draveness.me/few-words-time-management/.

Gatsby

A React implementation, React components required. I'm not so familiar with JavaScript and only had one project with LEAFERX, a nice guy. I eventually turned back to PHP, using WordPress.

Why Rust is not ready and Go is ready.

In the choice of a blog framework, I talked about how Rust right now is porting everything out of its good & safe core logic. The scheme of Rust's Diesel is just clumsy. Rust is not ready for high-throughput web programs unless it gets better packages for native web deployment. Go is ready because it has its own coroutines; C++2a is catching up with it later on, with frameworks like Drogon. Go is also winning over Java developers by offering close-to-C++ speed with simple code.

http package of go

Go's net/http package wraps both the HTTP client and server implementations. In order to support better scalability, it introduces the net/http.RoundTripper and net/http.Handler interfaces: net/http.RoundTripper is the interface for sending an HTTP request, where the caller passes the request as an argument and gets a response back, and net/http.Handler is mainly used by the HTTP server to respond to client requests.

scheduler of go

Signal-based preemptive scheduler - 1.14 ~ now

  1. Enables signal-based, truly preemptive scheduling.
    Garbage collection triggers preemptive scheduling when it scans a goroutine's stack.
  2. Before that, not enough preemption points existed to cover the full range of edge cases.
    static void schedule(G *gp) {
    schedlock();
    if(gp != nil) {
        gp->m = nil;
        uint32 v = runtime·xadd(&runtime·sched.atomic, -1<<mcpuShift);
        if(atomic_mcpu(v) > maxgomaxprocs)
            runtime·throw("negative mcpu in scheduler");
        switch(gp->status){
        case Grunning:
            gp->status = Grunnable;
            gput(gp);
            break;
        case ...:
        }
    } else {
        ...
    }
    gp = nextgandunlock();
    gp->status = Grunning;
    m->curg = gp;
    gp->m = m;
    runtime·gogo(&gp->sched, 0);
    }
    

How overlay network is written in go.

Overlay networking is not actually a new technology; it is a computer network built on top of another network, a form of network virtualization that has been facilitated by the evolution of cloud virtualization in recent years.

In practice, we typically use Virtual Extensible LAN (VxLAN) to set up an overlay network. In the following diagram, two physical machines can reach each other over a Layer-3 IP network.
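The encapsulation itself is simple: VxLAN prepends an 8-byte header (flags plus a 24-bit VNI, per RFC 7348) to the inner Ethernet frame and ships it inside a UDP datagram. A minimal sketch of just the header packing (my own illustration, not tied to any library):

```python
import struct

def vxlan_header(vni: int) -> bytes:
    """8-byte VXLAN header (RFC 7348): flags byte 0x08 marks a valid VNI,
    then reserved bits, the 24-bit VNI, and a final reserved byte."""
    assert 0 <= vni < 2 ** 24
    flags = 0x08 << 24               # I flag set; low 24 reserved bits zero
    return struct.pack("!II", flags, vni << 8)

hdr = vxlan_header(vni=42)
assert len(hdr) == 8
assert hdr[0] == 0x08 and hdr[4:7] == (42).to_bytes(3, "big")
```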

Reference

  1. https://draveness.me/whys-the-design-overlay-network/
  2. Kubernetes 源码剖析

[Computer Architecture] Sniper Intro

Info

The code is available at http://victoryang00.xyz:5012/victoryang/sniper_test.

The raw result is

admin@ubuntu_1604:~/sniper/test/lab0$ make
../../run-sniper -c ./config-lab0.cfg -- ./toy-lab0
[SNIPER] Warning: Unable to use physical addresses for shared memory simulation.
[SNIPER] Start
[SNIPER] --------------------------------------------------------------------------------
[SNIPER] Sniper using SIFT/trace-driven frontend
[SNIPER] Running full application in DETAILED mode
[SNIPER] --------------------------------------------------------------------------------
[SNIPER] Enabling performance models
[SNIPER] Setting instrumentation mode to DETAILED
[RECORD-TRACE] Using the Pin frontend (sift/recorder)
User program begins
<toy-lab0.c, clflush, 21> clflush to be run
[[email protected], iterate, 311] CLFLUSH instruction executed
<toy-lab0.c, clflush, 21> clflush to be run
[[email protected], iterate, 311] CLFLUSH instruction executed
<toy-lab0.c, clflush, 21> clflush to be run
[[email protected], iterate, 311] CLFLUSH instruction executed
<toy-lab0.c, clflush, 21> clflush to be run
[[email protected], iterate, 311] CLFLUSH instruction executed
User program ends
[TRACE:0] -- DONE --
[SNIPER] Disabling performance models
[SNIPER] Leaving ROI after 2.83 seconds
[SNIPER] Simulated 0.0M instructions, 0.1M cycles, 0.36 IPC
[SNIPER] Simulation speed 11.9 KIPS (11.9 KIPS / target core - 84229.9ns/instr)
[SNIPER] Setting instrumentation mode to FAST_FORWARD
[SNIPER] End
[SNIPER] Elapsed time: 3.06 seconds


Optional: Run '../../tools/cpistack.py' in this directory to generate cpi-stack output for this run
Optional: Run '../../tools/mcpat.py' in this directory to generate power output for this run
Optional: Run '../../tools/dumpstats.py' in this directory to view detailed statistics for this run
Optional: Run '../../tools/gen_topology.py' in this directory to view the system topology for this run

The modified code is http://victoryang00.xyz:5012/victoryang/sniper_test/blob/master/common/performance_model/performance_model.cc#L310

if(ins->opcode==4542892)
      fprintf(stderr, "[[email protected], %s, %d] clflush to be run\n",  __func__, __LINE__);

Introduction

The Sniper simulator allows one to perform timing simulations of both multi-program workloads and multi-threaded, shared-memory applications with tens to hundreds of cores. The maintainers are researchers at NUS, Cambridge, Intel, and Ghent University.

Cache implementation

We have the cfg for the cache, so I consulted the dispatch process in the source code.

# Configuration file for the Sniper simulator

# This file is organized into sections defined in [] brackets as in [section].
# Sections may be hierarchical with sub-sections split by the '/' character as
# in [section/sub_section].
#
# values can be "strings" , numbers, or true/false, existing values
# should indicate the type

# This section controls various high-level simulation parameters.
[general]
magic = false # Enable performance simulation straight away (false), or wait for Roi{Begin,End} magic instruction (true)
roi_script = false # Allow ROI to be set by a script, and ignore Roi{Begin,End} magic instructions
inst_mode_init = cache_only
inst_mode_roi = detailed
inst_mode_end = fast_forward
inst_mode_output = true
syntax = intel # Disassembly syntax (intel, att or xed)
issue_memops_at_functional = false # Issue memory operations to the memory hierarchy as they are executed functionally (Pin front-end only)
num_host_cores = 0 # Number of host cores to use (approximately). 0 = autodetect based on available cores and cpu mask. -1 = no limit (oversubscribe)
enable_signals = false
enable_smc_support = false # Support self-modifying code
enable_pinplay = false # Run with a pinball instead of an application (requires a Pin kit with PinPlay support)
enable_syscall_emulation = true # Emulate system calls, cpuid, rdtsc, etc. (disable when replaying Pinballs)
suppress_stdout = false # Suppress the application's output to stdout
suppress_stderr = false # Suppress the application's output to stderr

# Total number of cores in the simulation
total_cores = 64

enable_icache_modeling = false

# This section is used to fine-tune the logging information. The logging may
# be disabled for performance runs or enabled for debugging.
[log]
enabled = false
stack_trace = false
disabled_modules = ""
enabled_modules = ""
mutex_trace = false
pin_codecache_trace = false
circular_log = false

[progress_trace]
enabled = false
interval = 5000
filename = ""

[clock_skew_minimization]
scheme = barrier
report = false

[clock_skew_minimization/barrier]
quantum = 100                         # Synchronize after every quantum (ns)

# This section describes parameters for the core model
[perf_model/core]
frequency = 1        # In GHz
type = oneipc        # Valid models are oneipc, interval, rob
logical_cpus = 1     # Number of SMT threads per core

[perf_model/core/interval_timer]
#dispatch_width = 4
#window_size = 96
issue_contention = true
num_outstanding_loadstores = 8
memory_dependency_granularity = 8 # In bytes
lll_dependency_granularity = 64 # In bytes. Model the MSHR for overlapping misses by adding additional dependencies on long-latency loads using cache-line granularity
lll_cutoff = 30
issue_memops_at_dispatch = false # Issue memory operations to the cache hierarchy at dispatch (true) or at fetch (false)

# This section describes the number of cycles for
# various arithmetic instructions.
[perf_model/core/static_instruction_costs]
add=1
sub=1
mul=3
div=18
fadd=3
fsub=3
fmul=5
fdiv=6
generic=1
jmp=1
string=1
branch=1
dynamic_misc=1
recv=1
sync=0
spawn=0
tlb_miss=0
mem_access=0
delay=0
unknown=0

[perf_model/branch_predictor]
type=one_bit
mispredict_penalty=14 # A guess based on Penryn pipeline depth
size=1024

[perf_model/tlb]
# Penalty of a page walk (in cycles)
penalty = 0
# Page walk is done by separate hardware in parallel to other core activity (true),
# or by the core itself using a serializing instruction (false, e.g. microcode or OS)
penalty_parallel = true

[perf_model/itlb]
size = 0              # Number of I-TLB entries
associativity = 1     # I-TLB associativity

[perf_model/dtlb]
size = 0              # Number of D-TLB entries
associativity = 1     # D-TLB associativity

[perf_model/stlb]
size = 0              # Number of second-level TLB entries
associativity = 1     # S-TLB associativity

[perf_model/l1_icache]
perfect = false
passthrough = false
coherent = true
cache_block_size = 64
cache_size = 32 # in KB
associativity = 4
address_hash = mask
replacement_policy = lru
data_access_time = 3
tags_access_time = 1
perf_model_type = parallel
writeback_time = 0    # Extra time required to write back data to a higher cache level
dvfs_domain = core    # Clock domain: core or global
shared_cores = 1      # Number of cores sharing this cache
next_level_read_bandwidth = 0 # Read bandwidth to next-level cache, in bits/cycle, 0 = infinite
prefetcher = none

[perf_model/l1_dcache]
perfect = false
passthrough = false
cache_block_size = 64
cache_size = 32 # in KB
associativity = 4
address_hash = mask
replacement_policy = lru
data_access_time = 3
tags_access_time = 1
perf_model_type = parallel
writeback_time = 0    # Extra time required to write back data to a higher cache level
dvfs_domain = core    # Clock domain: core or global
shared_cores = 1      # Number of cores sharing this cache
outstanding_misses = 0
next_level_read_bandwidth = 0 # Read bandwidth to next-level cache, in bits/cycle, 0 = infinite
prefetcher = none

[perf_model/l2_cache]
perfect = false
passthrough = false
cache_block_size = 64 # in bytes
cache_size = 512 # in KB
associativity = 8
address_hash = mask
replacement_policy = lru
data_access_time = 9
tags_access_time = 3  # This is just a guess for Penryn
perf_model_type = parallel
writeback_time = 0    # Extra time required to write back data to a higher cache level
dvfs_domain = core    # Clock domain: core or global
shared_cores = 1      # Number of cores sharing this cache
prefetcher = none     # Prefetcher type
next_level_read_bandwidth = 0 # Read bandwidth to next-level cache, in bits/cycle, 0 = infinite

[perf_model/l3_cache]
perfect = false
passthrough = false

[perf_model/l4_cache]
perfect = false
passthrough = false

[perf_model/llc]
evict_buffers = 8

[perf_model/fast_forward]
model = oneipc        # Performance model during fast-forward (none, oneipc)

[perf_model/fast_forward/oneipc]
interval = 100000     # Barrier quantum in fast-forward, in ns
include_memory_latency = false # Increment time by memory latency
include_branch_misprediction = false # Increment time on branch misprediction

[core]
spin_loop_detection = false

[core/light_cache]
num = 0

[core/cheetah]
enabled = false
min_size_bits = 10
max_size_bits_local = 30
max_size_bits_global = 36

[core/hook_periodic_ins]
ins_per_core = 10000  # After how many instructions should each core increment the global HPI counter
ins_global = 1000000  # Aggregate number of instructions between HOOK_PERIODIC_INS callbacks

[caching_protocol]
type = parametric_dram_directory_msi
variant = mesi                            # msi, mesi or mesif

[perf_model/dram_directory]
total_entries = 16384
associativity = 16
max_hw_sharers = 64                       # number of sharers supported in hardware (ignored if directory_type = full_map)
directory_type = full_map                 # Supported (full_map, limited_no_broadcast, limitless)
home_lookup_param = 6                     # Granularity at which the directory is stripped across different cores
directory_cache_access_time = 10          # Tag directory lookup time (in cycles)
locations = dram                          # dram: at each DRAM controller, llc: at master cache locations, interleaved: every N cores (see below)
interleaving = 1                          # N when locations=interleaved

[perf_model/dram_directory/limitless]
software_trap_penalty = 200               # number of cycles added to clock when trapping into software (pulled number from Chaiken papers, which explores 25-150 cycle penalties)

[perf_model/dram]
type = constant                           # DRAM performance model type: "constant" or a "normal" distribution
latency = 100                             # In nanoseconds
per_controller_bandwidth = 5              # In GB/s
num_controllers = -1                      # Total Bandwidth = per_controller_bandwidth * num_controllers
controllers_interleaving = 0              # If num_controllers == -1, place a DRAM controller every N cores
controller_positions = ""
direct_access = false                     # Access DRAM controller directly from last-level cache (only when there is a single LLC)

[perf_model/dram/normal]
standard_deviation = 0                    # The standard deviation, in nanoseconds, of the normal distribution

[perf_model/dram/cache]
enabled = false

[perf_model/dram/queue_model]
enabled = true
type = history_list

[perf_model/nuca]
enabled = false

[perf_model/sync]
reschedule_cost = 0 # In nanoseconds

# This describes the various models used for the different networks on the core
[network]
# Valid Networks :
# 1) magic
# 2) emesh_hop_counter, emesh_hop_by_hop
# 3) bus
memory_model_1 = emesh_hop_counter
system_model = magic
collect_traffic_matrix = false

[network/emesh_hop_counter]
link_bandwidth = 64 # In bits/cycles
hop_latency = 2

[network/emesh_hop_by_hop]
link_bandwidth = 64   # In bits/cycle
hop_latency = 2       # In cycles
concentration = 1     # Number of cores per network stop
dimensions = 2        # Dimensions (1 for line/ring, 2 for 2-D mesh/torus)
wrap_around = false   # Use wrap-around links (false for line/mesh, true for ring/torus)
size = ""             # ":"-separated list of size for each dimension, default = auto

[network/emesh_hop_by_hop/queue_model]
enabled = true
type = history_list
[network/emesh_hop_by_hop/broadcast_tree]
enabled = false

[network/bus]
ignore_local_traffic = true # Do not count traffic between core and directory on the same tile

[network/bus/queue_model]
type=contention

[queue_model/basic]
moving_avg_enabled = true
moving_avg_window_size = 1024
moving_avg_type = arithmetic_mean

[queue_model/history_list]
# Uses the analytical model (if enabled) to calculate delay if cannot be calculated using the history list
max_list_size = 100
analytical_model_enabled = true

[queue_model/windowed_mg1]
window_size = 1000        # In ns. A few times the barrier quantum should be a good choice

[dvfs]
type = simple
transition_latency = 0 # In nanoseconds

[dvfs/simple]
cores_per_socket = 1

[bbv]
sampling = 0 # Defines N to skip X samples with X uniformely distributed between 0..2*N, so on average 1/N samples

[loop_tracer]
#base_address = 0 # Start address in hex (without 0x)
iter_start = 0
iter_count = 36

[osemu]
pthread_replace = false   # Emulate pthread_{mutex|cond|barrier} functions (false: user-space code is simulated, SYS_futex is emulated)
nprocs = 0                # Overwrite emulated get_nprocs() call (default: return simulated number of cores)
clock_replace = true      # Whether to replace gettimeofday() and friends to return simulated time rather than host wall time
time_start = 1337000000   # Simulator startup time ("time zero") for emulated gettimeofday()

[traceinput]
enabled = false
address_randomization = false # Randomize upper address bits on a per-application basis to avoid cache set contention when running multiple copies of the same trace
stop_with_first_app = true    # Simulation ends when first application ends (else: when last application ends)
restart_apps = false          # When stop_with_first_app=false, whether to restart applications until the longest-running app completes for the first time
mirror_output = false
trace_prefix = ""             # Disable trace file prefixes (for trace and response fifos) by default
num_runs = 1                  # Add 1 for warmup, etc

[scheduler]
type = pinned

[scheduler/pinned]
quantum = 1000000         # Scheduler quantum (round-robin for active threads on each core), in nanoseconds
core_mask = 1             # Mask of cores on which threads can be scheduled (default: 1, all cores)
interleaving = 1          # Interleaving of round-robin initial assignment (e.g. 2 => 0,2,4,6,1,3,5,7)

[scheduler/roaming]
quantum = 1000000         # Scheduler quantum (round-robin for active threads on each core), in nanoseconds
core_mask = 1             # Mask of cores on which threads can be scheduled (default: 1, all cores)

[scheduler/static]
core_mask = 1             # Mask of cores on which threads can be scheduled (default: 1, all cores)

[scheduler/big_small]
quantum = 1000000         # Scheduler quantum, in nanoseconds
debug = false

[hooks]
numscripts = 0

[fault_injection]
type = none
injector = none

[routine_tracer]
type = none

[instruction_tracer]
type = none

[sampling]
enabled = false

Cache source code evaluation

Files related to cache in Sniper
config folder

gainestown.cfg contains the configuration of the L3 cache and, through nesting, includes the nehalem.cfg file.

The nehalem.cfg file contains the configuration of L2 cache and L1 cache.

Sniper uses the gainestown.cfg file by default.
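The nesting works through an #include directive at the top of the config file. A sketch of what the chain looks like (the key names below are illustrative; check the shipped files for the exact keys and values):

```ini
# gainestown.cfg (sketch): pull in the nehalem core + L1/L2 settings, then add L3
#include nehalem

[perf_model/l3_cache]      # illustrative keys; the real file defines the full set
cache_size = 8192          # in KB
shared_cores = 4           # L3 shared by 4 cores
```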

Basic type aliases used in the signatures below (defined in a fixed-types header in Sniper's source):

typedef int64_t SInt64;  
typedef int32_t SInt32;  
typedef int16_t SInt16;  
typedef int8_t  SInt8;  
typedef UInt8 Byte;  
typedef UInt8 Boolean;  
typedef uintptr_t IntPtr;  
extern UInt64 PC;  

\sniper\commoncore\memory_subsystem contains the definitions and concrete implementations of the memory subsystem in Sniper.

  • parametric_dram_directory_msi\cache_cntlr.cc

This file determines whether the current access hits or misses the cache: on a hit it accesses the cache (covering both read and write-back), and on a miss it inserts the line into the cache.

HitWhere::where_t  
CacheCntlr::processMemOpFromCore(Core::lock_signal_t lock_signal,  
Core::mem_op_t mem_op_type,IntPtr ca_address, UInt32 offset,  
Byte* data_buf, UInt32 data_length,bool modeled,bool count);  
/* Accepts a load or store request from the core, determines whether the access hits or misses, and dispatches to the matching handler. */  
SharedCacheBlockInfo* CacheCntlr::insertCacheBlock(IntPtr address,  
CacheState::cstate_t cstate, Byte* data_buf,  
core_id_t requester, ShmemPerfModel::Thread_t thread_num);  
/* Called by the previous method on a cache miss; its main job is to find a cache block to replace. */  
void CacheCntlr::accessCache(  
Core::mem_op_t mem_op_type, IntPtr ca_address, UInt32 offset,  
Byte* data_buf, UInt32 data_length, bool update_replacement);  
/* The hit-path operation, also called by the processMemOpFromCore method. It covers the two main functions: reading the cache and writing it back. */  
  • cache\cache.cc & cache.h

Each actual cache is defined as an object of the Cache class, such as the L1-icache. It contains the basic information about the cache, including size, type, and associativity, plus accessor operations for that information. The Cache class also includes two methods to access and insert into the cache, accessSingleLine and insertSingleLine, both of which are called from CacheCntlr.

/* cache attributes */  
// Cache counters  
UInt64 m_num_accesses;  
UInt64 m_num_hits;  
// Generic Cache Info  
cache_t m_cache_type;  
CacheSet** m_sets;  
CacheSetInfo* m_set_info;  
/* cache constructor */  
Cache(String name,String cfgname,core_id_t core_id,UInt32 num_sets,  
UInt32 associativity, UInt32 cache_block_size,  
String replacement_policy, cache_t cache_type,  
hash_t hash = CacheBase::HASH_MASK,  
FaultInjector *fault_injector = NULL,  
AddressHomeLookup *ahl = NULL);  
/* accessSingleLine: When a cache hit occurs, the Cache controller calls the accessCache method, which in turn calls this method in the cache class. This method reads and writes to the cache. */    
CacheBlockInfo* accessSingleLine(IntPtr addr,  
access_t access_type, Byte* buff, UInt32 bytes,  
SubsecondTime now, bool update_replacement);  
/* insertSingleLine: When the cache misses, the cache controller calls the insertCacheBlock method, which in turn calls this method in the Cache class. */  
void insertSingleLine(IntPtr addr, Byte* fill_buff,  
bool* eviction, IntPtr* evict_addr,  
CacheBlockInfo* evict_block_info, Byte* evict_buff,  
SubsecondTime now, CacheCntlr *cntlr = NULL);  
  • .\cache\cache_base.h

The CacheBase class holds basic information about the cache, such as associativity and cache size, along with some type definitions, such as the replacement policy, which needs to be extended if a new replacement algorithm is added.

enum ReplacementPolicy  
{  
ROUND_ROBIN = 0,LRU,LRU_QBS,  
NRU,MRU,NMRU,PLRU,  
SRRIP,SHCT_SRRIP,  
SRRIP_QBS,RANDOM,  
NUM_REPLACEMENT_POLICIES,SHCT_LRU  
}; // the enum to extend when adding a replacement policy
  • cache\cache_set.cc & cache_set.h

A cache set is the group of cache lines in one set; the number of lines per set equals the associativity. The replacement algorithm selects an appropriate cache line within the set to evict. Each set is defined as an object of the CacheSet class, which contains the lower-level cache operations: accessSingleLine calls the read_line and write_line methods, and insertCacheBlock calls insert.

/* cache hit, used for data reading */  
void read_line(UInt32 line_index, UInt32 offset, Byte *out_buff,  
UInt32 bytes, bool update_replacement);  
/*  cache hit, used for data writing back */  
void write_line(UInt32 line_index, UInt32 offset, Byte *in_buff,  
UInt32 bytes, bool update_replacement);  
/*  cache miss: apply the replacement algorithm to choose the line to evict  */  
void insert(CacheBlockInfo* cache_block_info, Byte* fill_buff,  
bool* eviction, CacheBlockInfo* evict_block_info,  
Byte* evict_buff, CacheCntlr *cntlr = NULL);  

In addition to the access methods of CacheSet, the following methods need to be changed if you want to add your own replacement algorithm.

/* Create corresponding cache_set objects depending on the replacement algorithm. */  
CacheSet* CacheSet::createCacheSet(String cfgname, core_id_t  
core_id,String replacement_policy,  
CacheBase::cache_t cache_type,  
UInt32 associativity, UInt32 blocksize,  
CacheSetInfo* set_info);  
/* Create corresponding cachesetinfo objects according to the replacement algorithm. */  
CacheSetInfo* CacheSet::createCacheSetInfo(String name,  
String cfgname, core_id_t core_id,  
String replacement_policy, UInt32 associativity);  
/* Determine the replacement-policy type from the policy's name string. */  
CacheBase::ReplacementPolicy  
CacheSet::parsePolicyType(String policy); 
  • .\cache\cache_block_info.cc & cache_block_info.h

Each cache line has an object of class CacheBlockInfo that holds additional information about the line, such as the tag bits and used bits. If your replacement algorithm needs extra per-line information, consider adding it here or in the CacheSet layer above.

IntPtr m_tag;  
CacheState::cstate_t m_cstate;  
UInt64 m_owner;  
BitsUsedType m_used;  
UInt8 m_options;  
// large enough to hold a bitfield for all available option_t's  
  • .\cache\cache_set_lru.cc & cache_set_lru.h

The LRU algorithm that ships with Sniper. Both files build on the CacheSet base class and implement its getReplacementIndex and updateReplacementIndex methods. The former selects the cache line to evict when a replacement is needed, according to the policy; the latter performs whatever bookkeeping the policy requires when a cache line is accessed (read, write-back, or insert), such as updating the LRU access record.

clflush

clflush is often executed by attackers carrying out Spectre-style attacks. It invalidates, from every level of the cache hierarchy in the cache coherence domain, the cache line that contains the linear address specified by the memory operand. If that cache line contains modified data at any level of the hierarchy, that data is written back to memory. The source operand is a byte memory location. The semantics are defined below:

CLFLUSH: Flush Cache Line

| Opcode / Instruction | Op/En | 64-bit Mode | Compat/Leg Mode | Description |
| --- | --- | --- | --- | --- |
| NP 0F AE /7 CLFLUSH m8 | M | Valid | Valid | Flushes cache line containing m8. |

Instruction Operand Encoding

| Op/En | Operand 1 | Operand 2 | Operand 3 | Operand 4 |
| --- | --- | --- | --- | --- |
| M | ModRM:r/m (w) | NA | NA | NA |

CLFLUSH operation is the same in non-64-bit modes and 64-bit modes.

Operation

Flush_Cache_Line(SRC);

Intel C/C++ Compiler Intrinsic Equivalents

void _mm_clflush(void const *p)

Protected Mode Exceptions

#GP(0): For an illegal memory operand effective address in the CS, DS, ES, FS or GS segments.
#SS(0): For an illegal address in the SS segment.
#PF(fault-code): For a page fault.
#UD: If CPUID.01H:EDX.CLFSH[bit 19] = 0, or if the LOCK prefix is used.

Real-Address Mode Exceptions

#GP: If any part of the operand lies outside the effective address space from 0 to FFFFH.
#UD: If CPUID.01H:EDX.CLFSH[bit 19] = 0, or if the LOCK prefix is used.

The Process of finding the execution point

It is hard to find the execution point with static analysis alone, so the natural approach is gdb, via run-sniper's --gdb option. However, SIFT requires thread synchronization, and gdb makes it hard to keep all threads synchronized, so I gave up on that. On a first pass through the code I found the function void PerformanceModel::queueInstruction(DynamicInstruction *ins), which pushes instructions into a queue; they are then simulated by the iterator:

void PerformanceModel::iterate()
{
   while (m_instruction_queue.size() > 0)
   {
      // While the functional thread is waiting because of clock skew minimization, wait here as well
      #ifdef ENABLE_PERF_MODEL_OWN_THREAD
      while(m_hold)
         sched_yield();
      #endif
      DynamicInstruction *ins = m_instruction_queue.front();
      LOG_ASSERT_ERROR(!ins->instruction->isIdle(), "Idle instructions should not make it here!");
      if (!m_fastforward && m_enabled){
         handleInstruction(ins);
      }
      delete ins;
      m_instruction_queue.pop();
   }
   synchronize();
}

Instructions are dispatched there. So I first inserted a printf("sb"); to test whether it interleaves with the instruction stream. It does, but there were far too many "sb"s, around a thousand both before the program begins and after it exits. My guess is that the emulator first runs init.S for OS boot plus some loaded C runtime, so the clflush instructions have to be picked out from all the other instructions.

So the problem became how to identify the four clflushes among everything else. At first I didn't find an identifier like the one in my RISC-V simulator code. Then I found that ins carries a lot of identifying fields:

class DynamicInstruction {
private:
    // Private constructor: alloc() should be used
    DynamicInstruction(Instruction *ins, IntPtr _eip) {
        instruction = ins;
        eip = _eip;
        branch_info.is_branch = false;
        num_memory = 0;
    }

public:
    struct BranchInfo {
        bool is_branch;
        bool taken;
        IntPtr target;
    };
    struct MemoryInfo {
        bool executed; // For CMOV: true if executed
        Operand::Direction dir;
        IntPtr addr;
        UInt32 size;
        UInt32 num_misses;
        SubsecondTime latency;
        HitWhere::where_t hit_where;
    };
    static const UInt8 MAX_MEMORY = 2;
    Instruction *instruction;
    IntPtr eip; // Can be physical address, so different from instruction->getAddress() which is always virtual
    BranchInfo branch_info;
    UInt8 num_memory;
    MemoryInfo memory_info[MAX_MEMORY];

    static Allocator *createAllocator();

    ~DynamicInstruction();

    static DynamicInstruction *alloc(Allocator *alloc, Instruction *ins, IntPtr eip) {
        void *ptr = alloc->alloc(sizeof(DynamicInstruction));
        DynamicInstruction *i = new(ptr) DynamicInstruction(ins, eip);
        return i;
    }

    static void operator delete(void *ptr) { Allocator::dealloc(ptr); }

    SubsecondTime getCost(Core *core);

    bool isBranch() const { return branch_info.is_branch; }

    bool isMemory() const { return num_memory > 0; }

    void addMemory(bool e, SubsecondTime l, IntPtr a, UInt32 s, Operand::Direction dir, UInt32 num_misses,
                   HitWhere::where_t hit_where) {
        LOG_ASSERT_ERROR(num_memory < MAX_MEMORY, "Got more than MAX_MEMORY(%d) memory operands", MAX_MEMORY);
        memory_info[num_memory].dir = dir;
        memory_info[num_memory].executed = e;
        memory_info[num_memory].latency = l;
        memory_info[num_memory].addr = a;
        memory_info[num_memory].size = s;
        memory_info[num_memory].num_misses = num_misses;
        memory_info[num_memory].hit_where = hit_where;
        num_memory++;
    }

    void addBranch(bool taken, IntPtr target) {
        branch_info.is_branch = true;
        branch_info.taken = taken;
        branch_info.target = target;
    }

    SubsecondTime getBranchCost(Core *core, bool *p_is_mispredict = NULL);

    void accessMemory(Core *core);
};

We first tried the Instruction size as an identifier, but other instructions besides clflush also have size 9. Then we took memory_info's addr, but it changes on every run. Finally we found a good identifier: the opcode, which is unique per instruction within an ISA, and whose code sections differ cleanly.