From 080ea4c0ff0d4e6b5ce86f664fa6645c1cb02bf0 Mon Sep 17 00:00:00 2001 From: Steven Baltakatei Sandoval Date: Tue, 14 Feb 2023 13:27:57 +0000 Subject: [PATCH] feat(doc/up/bkshuf/):Add TeXmacs documentation - Note: Compiled PDF uploaded to: https://reboil.com/res/2023/txt/20230214T1327+00..bkshuf_article.pdf --- doc/unitproc/bkshuf/article.tm | 374 +++++++++++++++++++++++++++++++++ 1 file changed, 374 insertions(+) create mode 100644 doc/unitproc/bkshuf/article.tm diff --git a/doc/unitproc/bkshuf/article.tm b/doc/unitproc/bkshuf/article.tm new file mode 100644 index 0000000..0030c11 --- /dev/null +++ b/doc/unitproc/bkshuf/article.tm @@ -0,0 +1,374 @@ + + + + +<\body> + |>>|> + + + + is a -like utility designed to output randomly + sized groups of lines with a group size distribution centered around some + characteristic value. + + + + The author desires to create a -like utility named + to mix line lists in order to produce output line lists with + the following somewhat conflicting properties: + + <\description> + An output line's position should not + contain information about its input line position. + + Two neighboring lines in the + input stream should have a high probability of remaining neighbors in the + output stream. + + + The objective is to improve the value of a short random scan of a small + fraction of a potentially large input list; output that demonstrates ROP as + well as some degree of PIR may achieve this objective. In contrast, the + utility provides PIR but no ROP: a line's neighbor in the + output of is equally likely to be any other random line from + the input.\ + + In other words, output produced by should group together + sequential segments of the input lines in order to partially preserve + relationships that may exist between sequential files. For example, this + could be done by jumping to a random position in the input lines, consuming + (i.e. reading, outputting, and marking a line not to be read again) some + amount of sequential lines, then repeating the process until every line is + consumed. The amount of sequential lines to read between jumps affects how + well the above desired properties are satisfied. + + The objective of is not to completely prevent the + possibility of reassembling the input given the output. Additionally, a + valuable property desired of is output which demonstrates + sufficiently high PIR compared to ROP such that only a short (compared to + the logarithm of the input list size) sequential scan of the output list + from a random starting position is required before a jump to a new group is + is encountered; this would permit the overal contents of very large input + line lists to be sampled. + + + + + + <\eqnarray*> + ||>>|>>||>>|>>||>>|||>>|||>>|>>>||>>|,0>|)>>||>>>|,0>>||>>>> + + + + + <\enumerate-numeric> + Acquire and count input lines (via or positional + arguments). + + Calculate line count >> . + + Calculate target group size . + + Select random unconsumed input line and consume it to + output. + + Consume the next sequential line with probability + >>. Otherwise if some input lines remain + unconsumed, go to step . Otherwise, exit. + + + + + + + The simultaneous presence of ROP and PIR properties in the output depends + upon the amount of sequential lines that are read before + jumps to a new random position in the input list. This amount is the + , ; it is the \Ptarget\Q since + represents the average of a distribution of group sizes that may be + selected, not a single group size. In this analysis, the total number of + lines in the input list is >>. For small input line + counts, (e.g. >\10>) the target group size + should be nearly one (e.g. 1>) since group sizes any larger + than this would have almost no PIR (e.g. a group size of for + >=10> would be 80% identical to the input). For + modest line input counts (e.g. >\100>), the + target group size may be allowed to be larger, such as a tenth of the input + line count (e.g. 10>); this would provide some PIR + (approximately permutations between the approximately + >|s>\\10> groups) + while each line in groups around size would have a low + probability of not being next to its neighbor ( of the 10 lines + would retain the same two neighbors while the two ends would retain one + each). For very large input line counts (e.g. + >\1000000>), + however, breaking up and randomizing the input into ten groups of + 000> offers very little PIR; the benefit + of the very high ROP is also lost since sequential scanning of tens of + thousands of lines is required before a random jump to a new group may be + encountered; therefore, the target group size should be a much smaller + fraction of >>, (e.g. 20>) while still + increasing. The relationship between a desireable target group size + and the input line count >> is non-linear. + The author believes a reasonable approach is to scale the group size to the + logarithm of input line count. + + > shows an example plot of + >|)>> that is tuned to achieve a target + group size of >=10|)>=25> for an + input list length of >=10> lines.\ + + <\big-figure||gr-frame|>|gr-geometry||gr-grid||1>|gr-grid-old||1>|gr-edit-grid-aspect|||>|gr-edit-grid||1>|gr-edit-grid-old||1>|gr-grid-aspect-props|||>|gr-grid-aspect|||>|magnify|1.18920711463847|gr-auto-crop|false|>|>|>|>|>|>|>||||>||>||>||>||>||>|>|>|>|>|>|>|>|>|>|||||>>> + A plot of a possible function that relates target + group size and input lines >> that provide + some ROP and PIR. The function is tuned to achieve + >=10|)>=25>.\ + + + The following is a set of equations that are used to derive a definition + for >|)>> that satisfies the plot in + >.\ + + <\eqnarray*> + || + >|)>=>|)>| + >>>|>||>>>|>|| + ,0>|)>=,0>|)>| + >>>|>>||,0>>>|||>|>||+1>>|=25>||\+1>>|||\+1>>||)>>|||)>+1>>||)>-1|x>>||>>||||)>-1|x>>>>||||)>-1>|x>>>|>|||)>-1|x>>>|>|)>>|||)>-1|x>|)>\>|)>| + >|)>+1>>||| + |ln ,0>|)>>|)>\|)>-1|)>\>|)>| + >|)>+1>>|>|)>>||,0>|)>-1|)>\>|)>|ln + ,0>|)>>|)>+1>>|>|)>>||,0>|)>-1|,0>|)>|]>>|)>\>|)>|]>+1>>>> + + + Equation defines a quadratic equation for the + linear range and the logarithmic domain . is + defined in terms of >> via a domain transformation + defined by >. The result is + > which defines + >|)>> as a function of + >> and parameters + ,0>|)>> and ,0>>. + The parameters define how quickly or slowly the quadratic equation grows. + In other words, if a user wishes for a 000000> + line input to be split into groups each containing, on average, + lines, then they should plug in ,0>=1000000> + and ,0>|)>=25> into > as is done in >. This equation can then be used to calculate target group + sizes as a function of other input line counts + >> besides >=1000000>. + For example, plugging >=500> into > yields > which specifies a target group size of + 6>. + + <\eqnarray*> + >|)>>||,0>|)>-1|,0>|)>|]>>|)>\>|)>|]>+1>>|>|)>>||000000|)>|]>>|)>\>|)>|]>+1>>|||>|>=500|)>>||000000|)>|]>>|)>\|]>+1>>|>=500|)>>||>>> + + + \; + + + + A method may employ to decide when read the next sequential + unconsumed input line is to simply do so with probability + >> such that the expected value of the average group + size trends towards . + + <\eqnarray*> + >>||>|)>>>|>>||>>>|||>>=>>>>|||>>>>|>>||>>|>-1>||>>|>>||>|)>>>>>|>>||>|)>>>>>|||>|>>|)>>||,0>|)>-1|,0>|)>|]>>|)>\>|)>|]>+1|]>>>>>> + + + + + Another method may employ to decide when to read the next + sequential unconsumed input line is to use an inverse gaussian + distribution. This may be done by generating from the distribution a float + sampled from the inverse gaussian with range 0 to infinity with mean + > whenever a new random position in the input list is selected; + the float is rounded to the nearest integer.<\footnote> + See "Generating Random Variates Using + Transformations with Multiple Roots". 1976. + . + Then, after consuming an input line, this integer is + decremented by one and another sequential line is consumed provided the + integer does not become less than or equal to zero. The inverse gaussian + distribution requires specifying the mean > and the shape + parameter >; a higher > results in a + greater spread. An upper bound may also be specified since the distribution + has none except for that imposed by its programming implementation. + + The result of using an inverse gaussian distribution is an output with + potentially much more regular group sizes than using the previously + mentioned expected value method. However, the implementation of the inverse + gaussian sampling operation described by (Michael, 1976) requires several + exponent calculations and a square root calculation in addition to various + multiplication and division operations. If sufficient processing power is + available, this may not necessarily be an issue. + + + + Regardless of whether group sizes are determined by the expected value + method or using variates of an inverse gaussian distribution, mimicking the + property of all input lines being present in the output, + albeit rearranged, results in a side effect: the first output lines are + more likely to contain groups with uninterrupted sequence runs (high ROP) + while groups in the last output lines are almost certain to contain + sequence jumps within a group (less ROP). The reason for this is that + , when it encounters an input line that has already been + consumed, will skip to the next available input line. The decision could be + made to skip to a new random line, but, it is simpler to simply read the + next available input line. The author's original intention of sampling only + a short initial portion of the output is compatible with the behavior that + ROP is preserved mostly at the beginning of the output. + + + + <\big-table||||>|>||>|.>>>>>> + A table listing versions of . + + + <\description> + Initial implementation in + with and + and tested on . Saved + to the author's repository<\footnote> + See commit at + . + . + + + + +> + +<\references> + <\collection> + > + > + > + > + > + > + > + > + > + > + > + > + > + > + > + > + > + > + > + > + > + > + > + > + > + > + > + > + + + +<\auxiliary> + <\collection> + <\associate|figure> + |1>|> + A plot of a possible function that relates target group size + |s> and input lines + |l|in>>> that + provide some ROP and PIR. The function is tuned to achieve + |s|in>>=10|)>=25>. + |> + + <\associate|table> + |1>|> + \; + |> + + <\associate|toc> + |math-font-series||1Summary> + |.>>>>|> + + + |math-font-series||2Objective> + |.>>>>|> + + + |math-font-series||3Design> + |.>>>>|> + + + |3.1Definitions + |.>>>>|> + > + + |3.2Process + |.>>>>|> + > + + |3.3Parameter analysis + |.>>>>|> + > + + |3.3.1Target group size + calculation |.>>>>|> + > + + |3.3.2Jump from expected value + |.>>>>|> + > + + |3.3.3Jump from random variate + of inverse gaussian distribution |.>>>>|> + > + + |3.3.4Output structure + |.>>>>|> + > + + |math-font-series||4Version + History> |.>>>>|> + + + + \ No newline at end of file -- 2.30.2