Workflow Type: Galaxy
V 16 - including updated diff-tool
Associated Tutorial
This workflow is part of the tutorial Text-Mining Differences in Chinese Newspaper Articles, available in the GTN.
Features
- Includes a Galaxy Workflow Report
- Uses Galaxy Workflow Comments
Thanks to...
Workflow Author(s): Daniela Schneider
Tutorial Author(s): Daniela Schneider
Tutorial Contributor(s): Björn Grüning, Daniela Schneider, Saskia Hiltemann, Teresa Müller
Funder(s): German Competence Center Cloud Technologies for Data Management and Processing (de.KCD)
Inputs
ID | Name | Description | Type |
---|---|---|---|
Input censored text | #main/Input censored text | Upload the censored text containing replacement characters like ‘×’. | |
Input uncensored text | #main/Input uncensored text | Upload the uncensored text without replacement characters. | |
Steps
ID | Name | Description |
---|---|---|
2 | Preprocessing of censored text | This step uses regular expressions to delete all whitespace (\s) and to place each remaining character on its own line (\1\n). The result is a cleaned, reformatted text with one character per line. Tool: toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_replace_in_line/9.5+galaxy0 |
3 | Preprocessing of uncensored text | This step uses regular expressions to delete all whitespace (\s) and to place each remaining character on its own line (\1\n). The result is a cleaned, reformatted text with one character per line. Tool: toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_replace_in_line/9.5+galaxy0 |
4 | Comparison with diff - user version | The diff tool compares the two cleaned texts. This version creates an HTML file that colour-codes the differences between the texts as additions (green) or deletions (red). Tool: toolshed.g2.bx.psu.edu/repos/bgruening/diff/diff/3.10+galaxy1 |
5 | Comparison with diff - computer version | The diff tool compares the two cleaned texts. This version of the output (raw output) is used for the further steps of the analysis. It is less intuitive for users, so the other diff step (user version) provides a more visual HTML output. Tool: toolshed.g2.bx.psu.edu/repos/bgruening/diff/diff/3.10+galaxy1 |
6 | Extracting only censored passages | This step selects all lines from the diff file that contain the censorship symbol ×. The condition "ord(c1) == 215" selects a line when the character in column c1, which holds the censored text, is ×. The symbol × is ambiguous, so the condition uses the decimal Unicode code point of the character (215) for clarity. This step does not show an output. If the filtered symbol has no counterpart in the second text, the resulting file lacks the columns needed to compute the following steps. Users would not see this, but it would cause a technical error. The Compute step covers this case and makes sure all necessary columns exist. It shows the output for both steps (Extracting and Compute) correctly. Add another Unicode code point here if you want to select a different character, for example '□' or '△'. You can look up the respective code, for example, on this website: https://www.mauvecloud.net/charsets/CharCodeFinder.html Paste the character you want to filter into the "input" window and select "Decimal Character Codes" as the output. For the symbol ×, this gives 215. Tool: Filter1 |
7 | Compute | This step unifies the formatting and adds any missing columns in case the lines extracted in the previous step come up empty in the second text. This ensures the proper number of columns so that the next steps run smoothly. Tool: toolshed.g2.bx.psu.edu/repos/devteam/column_maker/Add_a_column1/2.1 |
8 | Cut | This step selects only column 9, which contains the uncensored characters from text two. The result is a single column with rows of Chinese characters. This allows the word cloud in a following step to scale characters by frequency, meaning that characters that appear more often appear bigger, which makes the results evident at first sight. Tool: Cut1 |
9 | Datamash | This step counts how often each character appeared in the preceding table. Tool: toolshed.g2.bx.psu.edu/repos/iuc/datamash_ops/datamash_ops/1.8+galaxy0 |
10 | Generate a word cloud | This step shows which characters were censored in the first text. The bigger a character is drawn, the more often it appeared in the text. Tool: toolshed.g2.bx.psu.edu/repos/bgruening/wordcloud/wordcloud/1.9.4+galaxy1 |
11 | Sort | Sorts the counted characters from most frequent to least frequent. Tool: sort1 |
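
The logic of the steps above can be illustrated outside Galaxy with a minimal Python sketch. It is not the workflow's implementation: it swaps the Galaxy diff, Filter, Cut, Datamash and Sort tools for Python's standard `difflib`, `re` and `collections` modules, and the function names (`one_char_per_line`, `censored_characters`, `count_and_sort`) and the sample sentences are hypothetical, chosen only for illustration. It also assumes that each run of censorship symbols replaces the same number of characters in the uncensored text.

```python
import difflib
import re
from collections import Counter

CENSOR_MARK = "\u00d7"  # the symbol ×; ord(CENSOR_MARK) == 215


def one_char_per_line(text):
    """Steps 2 and 3: drop all whitespace and keep one character per entry."""
    return list(re.sub(r"\s", "", text))


def censored_characters(censored, uncensored, mark=CENSOR_MARK):
    """Steps 4 to 8 (sketch): align both texts and collect what the mark hides."""
    left = one_char_per_line(censored)
    right = one_char_per_line(uncensored)
    # difflib stands in for the Galaxy diff tool here (an assumption of this sketch).
    matcher = difflib.SequenceMatcher(None, left, right, autojunk=False)
    hidden = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "replace":
            continue
        # Pair characters position by position; zip stops at the shorter side,
        # so runs of unequal length are only partially paired.
        for old, new in zip(left[i1:i2], right[j1:j2]):
            if ord(old) == ord(mark):  # same test as "ord(c1) == 215"
                hidden.append(new)
    return hidden


def count_and_sort(characters):
    """Steps 9 and 11: count each character, most frequent first."""
    return Counter(characters).most_common()


if __name__ == "__main__":
    censored_text = "我×你， 他×她，\n××"      # hypothetical sample input
    uncensored_text = "我爱你， 他爱她，\n自由"  # hypothetical sample input
    hidden = censored_characters(censored_text, uncensored_text)
    print(count_and_sort(hidden))  # [('爱', 2), ('自', 1), ('由', 1)]
```

To look for a different replacement character, pass it as `mark`; `ord("□")` or `ord("△")` returns the decimal code that the Filter condition would need.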
Outputs
ID | Name | Description | Type |
---|---|---|---|
output_csv | #main/output_csv | n/a | |
output_graphic | #main/output_graphic | n/a | |
Created: 2nd Jun 2025 at 11:01