<\body> |>>||> is a -like utility designed to output randomly sized groups of lines with a group size distribution centered around some characteristic value. The author desires to create a -like utility named to mix line lists in order to produce output line lists with the following somewhat conflicting properties: <\description> An output line's position should not contain information about its input line position. Two neighboring lines in the input stream should have a high probability of remaining neighbors in the output stream. The objective is to improve the value of a short random scan of a small fraction of a potentially large input list; output that demonstrates ROP as well as some degree of PIR may achieve this objective. In contrast, the utility provides PIR but no ROP: a line's neighbor in the output of is equally likely to be any other random line from the input.\ In other words, output produced by should group together sequential segments of the input lines in order to partially preserve relationships that may exist between sequential files. For example, this could be done by jumping to a random position in the input lines, consuming (i.e. reading, outputting, and marking a line not to be read again) some amount of sequential lines, then repeating the process until every line is consumed. The amount of sequential lines to read between jumps affects how well the above desired properties are satisfied. The objective of is not to completely prevent the possibility of reassembling the input given the output. Additionally, a valuable property desired of is output which demonstrates sufficiently high PIR compared to ROP such that only a short (compared to the logarithm of the input list size) sequential scan of the output list from a random starting position is required before a jump to a new group is is encountered; this would permit the overal contents of very large input line lists to be sampled. <\eqnarray*> ||>>|>>||>>|>>||>>|||>>|||>>|>>>||>>|,0>|)>>||>>>|,0>>||>>>> <\enumerate-numeric> Acquire and count input lines (via or positional arguments). Calculate line count >> . Calculate target group size . Select random unconsumed input line and consume it to output. Consume the next sequential line with probability >>. Otherwise if some input lines remain unconsumed, go to step . Otherwise, exit. The simultaneous presence of ROP and PIR properties in the output depends upon the amount of sequential lines that are read before jumps to a new random position in the input list. This amount is the , ; it is the \Ptarget\Q since represents the average of a distribution of group sizes that may be selected, not a single group size. In this analysis, the total number of lines in the input list is >>. For small input line counts, (e.g. >\10>) the target group size should be nearly one (e.g. 1>) since group sizes any larger than this would have almost no PIR (e.g. a group size of for >=10> would be 80% identical to the input). For modest line input counts (e.g. >\100>), the target group size may be allowed to be larger, such as a tenth of the input line count (e.g. 10>); this would provide some PIR (approximately permutations between the approximately >|s>\\10> groups) while each line in groups around size would have a low probability of not being next to its neighbor ( of the 10 lines would retain the same two neighbors while the two ends would retain one each). For very large input line counts (e.g. >\1000000>), however, breaking up and randomizing the input into ten groups of 000> offers very little PIR; the benefit of the very high ROP is also lost since sequential scanning of tens of thousands of lines is required before a random jump to a new group may be encountered; therefore, the target group size should be a much smaller fraction of >>, (e.g. 20>) while still increasing. The relationship between a desireable target group size and the input line count >> is non-linear. The author believes a reasonable approach is to scale the group size to the logarithm of input line count. > shows an example plot of >|)>> that is tuned to achieve a target group size of >=10|)>=25> for an input list length of >=10> lines.\ <\big-figure||gr-frame|>|gr-geometry||gr-grid||1>|gr-grid-old||1>|gr-edit-grid-aspect|||>|gr-edit-grid||1>|gr-edit-grid-old||1>|gr-grid-aspect-props|||>|gr-grid-aspect|||>|magnify|1.18920711463847|gr-auto-crop|false|>|>|>|>|>|>|>||||>||>||>||>||>||>|>|>|>|>|>|>|>|>|>|||||>>> A plot of a possible function that relates target group size and input lines >> that provide some ROP and PIR. The function is tuned to achieve >=10|)>=25>.\ The following is a set of equations that are used to derive a definition for >|)>> that satisfies the plot in >.\ <\eqnarray*> || >|)>=>|)>| >>>|>||>>>|>|| ,0>|)>=,0>|)>| >>>|>>||,0>>>|||>|>||+1>>|=25>||\+1>>|||\+1>>||)>>|||)>+1>>||)>-1|x>>||>>||||)>-1|x>>>>||||)>-1>|x>>>|>|||)>-1|x>>>|>|)>>|||)>-1|x>|)>\>|)>| >|)>+1>>||| |ln ,0>|)>>|)>\|)>-1|)>\>|)>| >|)>+1>>|>|)>>||,0>|)>-1|)>\>|)>|ln ,0>|)>>|)>+1>>|>|)>>||,0>|)>-1|,0>|)>|]>>|)>\>|)>|]>+1>>>> Equation defines a quadratic equation for the linear range and the logarithmic domain . is defined in terms of >> via a domain transformation defined by >. The result is > which defines >|)>> as a function of >> and parameters ,0>|)>> and ,0>>. The parameters define how quickly or slowly the quadratic equation grows. In other words, if a user wishes for a 000000> line input to be split into groups each containing, on average, lines, then they should plug in ,0>=1000000> and ,0>|)>=25> into > as is done in >. This equation can then be used to calculate target group sizes as a function of other input line counts >> besides >=1000000>. For example, plugging >=500> into > yields > which specifies a target group size of 6>. <\eqnarray*> >|)>>||,0>|)>-1|,0>|)>|]>>|)>\>|)>|]>+1>>|>|)>>||

000000|)>|]>>|)>\>|)>|]>+1>>|||>|>=500|)>>||

000000|)>|]>>|)>\|]>+1>>|>=500|)>>||>>> \; A method may employ to decide when read the next sequential unconsumed input line is to simply do so with probability >> such that the expected value of the average group size trends towards . <\eqnarray*> >>||>|)>>>|>>||>>>|||>>=>>>>|||>>>>|>>||>>|>-1>||>>|>>||>|)>>>>|>>||>|)>>>>|||>|>>|)>>||,0>|)>-1|,0>|)>|]>>|)>\>|)>|]>+1|]>

>>>> Another method may employ to decide when to read the next sequential unconsumed input line is to use an inverse gaussian distribution. This may be done by generating from the distribution a float sampled from the inverse gaussian with range 0 to infinity with mean > whenever a new random position in the input list is selected; the float is rounded to the nearest integer.<\footnote> See "Generating Random Variates Using Transformations with Multiple Roots". 1976. . Then, after consuming an input line, this integer is decremented by one and another sequential line is consumed provided the integer does not become less than or equal to zero. The inverse gaussian distribution requires specifying the mean > and the shape parameter >; a higher > results in a greater spread. An upper bound may also be specified since the distribution has none except for that imposed by its programming implementation. The result of using an inverse gaussian distribution is an output with potentially much more regular group sizes than using the previously mentioned expected value method. However, the implementation of the inverse gaussian sampling operation described by (Michael, 1976) requires several exponent calculations and a square root calculation in addition to various multiplication and division operations. If sufficient processing power is available, this may not necessarily be an issue. Regardless of whether group sizes are determined by the expected value method or using variates of an inverse gaussian distribution, mimicking the property of all input lines being present in the output, albeit rearranged, results in a side effect: the first output lines are more likely to contain groups with uninterrupted sequence runs (high ROP) while groups in the last output lines are almost certain to contain sequence jumps within a group (less ROP). The reason for this is that , when it encounters an input line that has already been consumed, will skip to the next available input line. The decision could be made to skip to a new random line, but, it is simpler to simply read the next available input line. The author's original intention of sampling only a short initial portion of the output is compatible with the behavior that ROP is preserved mostly at the beginning of the output. <\big-table|||